I've been working on something like this too, for quite a while! Though I'm trying to get a non-quadratic-attention LLM (or SLM) up and running.
And anyway, I think the most important thing is dataset quality. Dumping in whatever dataset you find on Hugging Face is a recipe for mediocrity, so I'm also spending a lot of time on that.
I've seen Opus do some incredibly token-costly things before too. In fact, after most sessions I ask it which tools it used often, which could be simplified or made less verbose, and which could be combined into one. So for each project I usually end up with a few little scripts that do in one go what it would normally do in multiple tool calls.
For example: one bad habit Opus had was re-running the test suite with a different `| grep` suffix each time. So it would often re-run a 5+ minute test suite just to grep the output a bit differently.
The solution was to wire up a little script that runs the test suite, saves the output to a file, and then tells it where that file is and to NOT re-run the suite just so it can grep the output differently. This saved me a bunch of time & tokens.
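For anyone who wants to try the same trick, a wrapper along these lines could look roughly like the sketch below. It is only an illustration under assumptions: the function names `run_and_cache` and `grep_log` and the log path are hypothetical, and the test command is whatever your project uses.

```python
import re
import subprocess
from pathlib import Path


def run_and_cache(cmd, log_path):
    """Run the test command once and save combined stdout/stderr to log_path.

    Returns the command's exit code. `cmd` is a list such as
    ["pytest", "-q"] (assumed; substitute your own test command).
    """
    log_path = Path(log_path)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("w", encoding="utf-8") as log:
        # stderr is merged into stdout so the log reads like the terminal output
        result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode


def grep_log(log_path, pattern):
    """Return the lines of the saved log that match the regex `pattern`.

    This only reads the file; it never re-runs the test suite.
    """
    regex = re.compile(pattern)
    with Path(log_path).open(encoding="utf-8", errors="replace") as log:
        return [line.rstrip("\n") for line in log if regex.search(line)]
```

The idea is to call `run_and_cache` once per code change, then point the model at the log file and have it use `grep_log` (or plain `grep` on the file) as many times as it likes.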
Fable 5 on medium is amazing. It's handling everything I throw at it.
I had _one_ instance where for some obscure reason it decided to fall back to Opus 4.8 and Opus IMMEDIATELY fucked it up and implemented a super obvious feature in a slightly-wrong way.
Same here. Claude isn't perfect. It still makes a lot of mistakes. But whenever I try GPT-5.5 it's ten times worse, and Claude just has to clean up GPT's mess.