Given a benchmark corpus, the evaluation criteria could be:
- Facts extracted: the number of relevant facts extracted from the corpus
- Interpretations: the percentage of correct interpretations made, based on those facts
- Correct predictions: the percentage of correct extrapolations / interpolations / predictions made, based on the above
The ground truth could be a JSON file per example.
(If the solution you want to benchmark uses a graph DB, you could compare these aspects with an LLM as judge.)
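A minimal sketch of what scoring one example against such a ground-truth file could look like, assuming a JSON schema with "facts", "interpretations", and "predictions" keys (the schema and the fuzzy string matching are placeholder assumptions; an LLM as judge would replace the `matches` helper):

```python
import json
from difflib import SequenceMatcher

def matches(a: str, b: str, threshold: float = 0.8) -> bool:
    """Crude string-similarity match; a real harness might use an LLM as judge instead."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def score_example(ground_truth_path: str, extracted_facts: list[str],
                  interpretations: list[str], predictions: list[str]) -> dict:
    """Score one benchmark example against its per-example ground-truth JSON."""
    with open(ground_truth_path) as f:
        gt = json.load(f)  # assumed keys: "facts", "interpretations", "predictions"

    def recall(produced: list[str], expected: list[str]) -> float:
        # Fraction of expected items that some produced item matches.
        if not expected:
            return 1.0
        hits = sum(any(matches(p, e) for p in produced) for e in expected)
        return hits / len(expected)

    return {
        "facts_extracted": recall(extracted_facts, gt["facts"]),
        "interpretations": recall(interpretations, gt["interpretations"]),
        "predictions": recall(predictions, gt["predictions"]),
    }
```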
---
The actual writing is more a matter of formal/business/academic style, which I find less relevant for a benchmark.
However, I would find it crucial to run a "reverse RAG" over the generated report to ensure each claim has a source. [0]
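A rough sketch of what such a "reverse RAG" pass could look like; the sentence splitting and token-overlap retrieval below are crude stand-ins for proper claim extraction, embedding search, and an LLM judge deciding whether a passage actually supports a claim:

```python
import re

def reverse_rag_check(report: str, corpus: list[str], min_overlap: float = 0.5) -> list[dict]:
    """Split a generated report into claims and flag those with no supporting source."""
    # Naive sentence splitting; a real pipeline would extract atomic claims instead.
    claims = [s.strip() for s in re.split(r"(?<=[.!?])\s+", report) if s.strip()]
    results = []
    for claim in claims:
        claim_tokens = set(re.findall(r"\w+", claim.lower()))
        best_score, best_source = 0.0, None
        for i, passage in enumerate(corpus):
            passage_tokens = set(re.findall(r"\w+", passage.lower()))
            # Token overlap as a placeholder for embedding similarity.
            score = len(claim_tokens & passage_tokens) / max(len(claim_tokens), 1)
            if score > best_score:
                best_score, best_source = score, i
        results.append({
            "claim": claim,
            "source": best_source if best_score >= min_overlap else None,
            "supported": best_score >= min_overlap,
        })
    return results
```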
[0] https://venturebeat.com/ai/mayo-clinic-secret-weapon-against...