Played around with the code to implement a little bit of SIMD. Was able to squeeze out a decent improvement: ~250 fps avg, ~140 low, ~333 high (on an M4). Looks pretty straightforward to add threading as well. Cool stuff! Could be a way to bring more GPU work back down to the CPU.
Did you exhaust the five-hour usage limit already? As I understand it, the "additional usage" refers to anything beyond the standard five-hour usage limit.
According to the providers I keep track of, Cumulus is typically pretty price-competitive, with two exceptions: MiniMax, where DeepInfra and Together are much cheaper, and GLM-5, where DeepInfra and z.AI's own hosting are much cheaper.
(Also, technically Qwen3 8B on Novita takes first place, but barely.)
Can we get context length / output length docs? You mention a "Max tokens (chat)" of 128k, but it's unclear what that means. Also, your docs page looks out of date compared to your playground page.
Also, a piece of feedback: it kind of sucks to have GLM/MiniMax/Kimi on separate API endpoints. I assume that's a game you play to get lower routing latency for popular models, but from a consumer perspective it's not great.