Overall there are a ton of these logit-based guidance systems. The reason they don't get much traction is that the SOTA models sit behind REST APIs that don't expose this kind of fine-grained, logit-level access.
Those models perform so much better that people generally settle for just re-requesting until they get the correct format (and with GPT-4 that ends up being a fairly rare occurrence, in my experience).
Thanks for bringing clownfish and relm to my attention! AFAIK other libraries loop over the entire vocabulary at every step of generation. We, on the other hand, build an index at initialization by looping over the vocabulary once; generation is then just as fast as standard generation.
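Concretely, the trick is to pay the vocabulary scan once up front: for every state of the automaton compiled from the target regex, precompute which tokens can be consumed without leaving the language. Here's a minimal sketch of that idea in Python; `dfa.states`, `dfa.step`, and the `{token_id: token_str}` vocab dict are illustrative stand-ins, not the library's actual API:

    from collections import defaultdict

    def build_index(dfa, vocab):
        # For each DFA state, collect the token ids whose characters the
        # automaton can consume without hitting a dead end. This single
        # pass over the vocabulary happens once, at initialization.
        index = defaultdict(set)
        for token_id, token_str in vocab.items():
            for state in dfa.states:
                s = state
                for char in token_str:
                    s = dfa.step(s, char)  # None means a dead transition
                    if s is None:
                        break
                if s is not None:
                    index[state].add(token_id)
        return index

    # At decoding time, masking is just a dictionary lookup, so constrained
    # generation costs about the same as unconstrained generation:
    #   allowed = index[current_state]
    #   disallowed = [t for t in range(vocab_size) if t not in allowed]
    #   logits[disallowed] = float("-inf")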
torch-grammar generates a mask per PDA stack; we don't try to compute all the possible stacks. I'm sure there's something smarter that could be done here, and you've probably figured it out (though IIRC regular languages don't have the arbitrarily recursive stack problem you run into with context-free languages?). Anyway, in practice we spend a few milliseconds on the first few requests building caches, and after that we just apply masks straight from the caches.
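For what it's worth, that caching scheme can be sketched in a few lines. This is a hedged illustration rather than torch-grammar's actual code: `token_ok` (a check that simulates the PDA on a token's characters starting from a given stack) and the tuple stack representation are hypothetical stand-ins:

    import torch

    # Cache keyed by the parser stack, frozen as a tuple. The first request
    # that sees a given stack pays a full vocabulary scan; every later
    # request with the same stack reuses the stored mask.
    _mask_cache = {}

    def stack_mask(stack, vocab_size, token_ok):
        key = tuple(stack)
        if key not in _mask_cache:
            allowed = [token_ok(key, t) for t in range(vocab_size)]
            _mask_cache[key] = torch.tensor(allowed, dtype=torch.bool)
        return _mask_cache[key]

    # Applying it at a decoding step:
    #   mask = stack_mask(parser.stack, logits.shape[-1], token_ok)
    #   logits = logits.masked_fill(~mask, float("-inf"))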
https://github.com/1rgs/jsonformer
or
https://github.com/newhouseb/clownfish
or
https://github.com/mkuchnik/relm
or
https://github.com/ggerganov/llama.cpp/pull/1773
or
https://github.com/Shopify/torch-grammar