
That's not what I meant by interaction. The evaluator had to ask the models to perform tasks that the evaluator came up with on their own. Otherwise there are just too many ways the information could have leaked.

OpenAI's models aren't immune to this either, so take any so-called evaluation metrics with a huge grain of salt. This also highlights the difficulty of properly evaluating LLMs: any metric, once established, can become a memorization target for LLMs and lose its meaning.


