I'm also curious what results we would get if SWE came up with a new set of 500 problems to run all these models against, to guard against overfitting.