The study focuses on evaluating GPT-4o, Claude 3.5, and Llama3-8B, but it would benefit from testing across more architectures (e.g., Mixtral, DeepSeek, Gemini). This would help demonstrate how well ACD generalizes.
What do you think that would help show? 4o and Llama are quite different; reportedly, the 4-series is a large MoE, whereas Llama is famously a dense model.
Testing across more architectures would help clarify whether ACD uncovers failures tied to model scale, training data, or architectural differences such as MoE versus dense designs.
Showing whether the failures are model-specific quirks or generalize across LLMs would support claims about ACD's robustness and its usefulness for broad AI evaluation.