Given a benchmark corpus, the evaluation criteria could be:
- Facts extracted: the number of relevant facts extracted from the corpus
- Interpretations: the percentage of correct interpretations made, based on those facts
- Correct predictions: the percentage of correct extrapolations / interpolations / predictions made, based on the above
The ground truth could be a JSON file per example.
(If the solution you want to benchmark uses a graph DB, you could compare these aspects with an LLM as judge.)
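A minimal sketch of what scoring one example against such a ground-truth file could look like, assuming a JSON schema with "facts", "interpretations", and "predictions" keys (the schema and the fuzzy string matching are placeholder assumptions; an LLM as judge would replace the `matches` helper):

```python
import json
from difflib import SequenceMatcher

def matches(a: str, b: str, threshold: float = 0.8) -> bool:
    """Crude string-similarity match; a real harness might use an LLM as judge instead."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def score_example(ground_truth_path: str, extracted_facts: list[str],
                  interpretations: list[str], predictions: list[str]) -> dict:
    """Score one benchmark example against its per-example ground-truth JSON."""
    with open(ground_truth_path) as f:
        gt = json.load(f)  # assumed keys: "facts", "interpretations", "predictions"

    def recall(produced: list[str], expected: list[str]) -> float:
        # Fraction of expected items that some produced item matches.
        if not expected:
            return 1.0
        hits = sum(any(matches(p, e) for p in produced) for e in expected)
        return hits / len(expected)

    return {
        "facts_extracted": recall(extracted_facts, gt["facts"]),
        "interpretations": recall(interpretations, gt["interpretations"]),
        "predictions": recall(predictions, gt["predictions"]),
    }
```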
---
The actual writing is more a matter of formal/business/academic style, which I find less relevant for a benchmark.
However, I would find it crucial to run a "reverse RAG" over the generated report to ensure each claim has a source. [0]
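A rough sketch of what such a "reverse RAG" pass could look like; the sentence splitting and token-overlap retrieval below are crude stand-ins for proper claim extraction, embedding search, and an LLM judge deciding whether a passage actually supports a claim:

```python
import re

def reverse_rag_check(report: str, corpus: list[str], min_overlap: float = 0.5) -> list[dict]:
    """Split a generated report into claims and flag those with no supporting source."""
    # Naive sentence splitting; a real pipeline would extract atomic claims instead.
    claims = [s.strip() for s in re.split(r"(?<=[.!?])\s+", report) if s.strip()]
    results = []
    for claim in claims:
        claim_tokens = set(re.findall(r"\w+", claim.lower()))
        best_score, best_source = 0.0, None
        for i, passage in enumerate(corpus):
            passage_tokens = set(re.findall(r"\w+", passage.lower()))
            # Token overlap as a placeholder for embedding similarity.
            score = len(claim_tokens & passage_tokens) / max(len(claim_tokens), 1)
            if score > best_score:
                best_score, best_source = score, i
        results.append({
            "claim": claim,
            "source": best_source if best_score >= min_overlap else None,
            "supported": best_score >= min_overlap,
        })
    return results
```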
[0] https://venturebeat.com/ai/mayo-clinic-secret-weapon-against...