Does anyone know an efficient way to "embed" models like this? I'm currently working on a Tamagotchi-style RPI toy and I use GPT-2 to generate answers in the chat. I wrote a simple API that returns responses from a server. If I could embed my model, it would save me having to run a server.
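For context, the "simple API" in question could be sketched with nothing but the standard library. This is a hypothetical reconstruction, not the poster's actual code: the route, port, JSON schema, and the `generate()` stub (which would call GPT-2 / aitextgen on the server) are all assumptions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def generate(prompt):
    """Stand-in for the server-side GPT-2 call (e.g. via aitextgen)."""
    return "placeholder reply to: " + prompt

class ChatHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body ({"prompt": ...}) sent by the toy.
        length = int(self.headers.get("Content-Length", 0))
        prompt = json.loads(self.rfile.read(length))["prompt"]
        body = json.dumps({"text": generate(prompt)}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# On the server, something like:
# HTTPServer(("0.0.0.0", 8000), ChatHandler).serve_forever()
```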
The hard part of embedding is that the smallest 124M GPT-2 model itself is huge at 500MB, which would be unreasonable for performance/storage on the user end (and quantization/tracing can't save that much space).
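A quick back-of-envelope check of those numbers (the 4-bytes-per-float32-weight and 1-byte-per-int8-weight figures are standard, but treating the whole checkpoint as weights is a simplifying assumption):

```python
# Size arithmetic for the 124M-parameter GPT-2 checkpoint.
params = 124_000_000
fp32_mb = params * 4 / 1e6   # 4 bytes per float32 weight
int8_mb = params * 1 / 1e6   # 1 byte per weight after int8 quantization

print(round(fp32_mb))  # 496 -> roughly the ~500MB checkpoint
print(round(int8_mb))  # 124 -> even a best-case 4x quantization leaves >100MB
```

Which is the point above: quantization helps, but the result is still far too large to ship comfortably on the user end.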
That's why I'm looking into smaller models, which has been difficult, but releasing aitextgen was a necessary first step.
The size of the model you need to get good enough generation with something like GPT-2 is going to be pretty impractical on a raspberry pi.
You might be able to fit a 3-layer distilled GPT-2 in RAM (not quite sure what the latest RPIs have in terms of RAM, 4GB?), but the latency is going to be pretty horrible (multiple seconds).
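A rough sanity check on that "multiple seconds" claim. Every figure here is an assumption (the ~2 FLOPs/parameter/token rule of thumb, an optimistic sustained throughput for a Pi 4, and a short reply length), so treat it as a sketch, not a benchmark:

```python
# Hedged latency arithmetic for full-size GPT-2 on a Raspberry Pi.
params = 124_000_000
flops_per_token = 2 * params   # ~2 FLOPs per parameter per generated token

pi_gflops = 5e9                # optimistic sustained rate for a Pi 4 (assumed)
tokens = 40                    # a short chat reply (assumed)

seconds = tokens * flops_per_token / pi_gflops
print(round(seconds, 1))  # ~2.0s even under these optimistic assumptions
```

Real inference would be slower still (memory bandwidth, Python overhead), and a 3-layer distilled model would cut this roughly proportionally, so the estimate is consistent with "multiple seconds" per reply.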
Why not put it on a server and just use an API to communicate and get the results? Then the embedded code that interfaces with the API is much smaller, and the server can be as big as you need.
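The client side of that setup really is tiny. A minimal sketch, assuming a hypothetical endpoint and `{"prompt": ...}` → `{"text": ...}` JSON schema (none of this is the poster's actual API):

```python
import json
import urllib.request

# Hypothetical endpoint; the hostname, port, and path are assumptions.
API_URL = "http://example.local:8000/generate"

def build_request(prompt, url=API_URL):
    """Build a POST request carrying the chat prompt as JSON."""
    payload = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )

def ask_server(prompt, url=API_URL, timeout=10):
    """Send the prompt to the server and return the generated reply text."""
    with urllib.request.urlopen(build_request(prompt, url), timeout=timeout) as resp:
        return json.loads(resp.read())["text"]
```

This is all the toy itself would need to ship: no model weights, no torch, just the standard library.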