The test has many reasoning, coding, and instruction-following questions, which I expected o1 to excel at. I do not have an interpretation for such poor results on our test; I was just sharing them as a data point so people can make up their own minds. My best guess at this point is that o1 is optimized for a very specific and narrow use case, similar to what you suggest.