Isn't the inference cost of running these models at scale challenging? Currently it feels like small LLMs (1B-4B) are able to perform well for simpler agentic workflows. There are definitely some constraints, but surely it's much easier than paying for big cloud clusters to run these tasks. I believe it distributes the cost more uniformly.
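To be concrete, the sketch below is the kind of simple workflow I have in mind: a small instruction-tuned model deciding which tool to call. The model name and tool list are just illustrative, and it assumes the Hugging Face transformers library rather than any particular on-device runtime.

```python
# Minimal sketch: a ~1.5B instruction-tuned model handling a single
# "agentic" step (tool routing) entirely on local hardware.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-1.5B-Instruct",  # illustrative small model, swap in your own
)

# A trivial routing step: ask the model to pick a tool for a user request.
prompt = (
    "You can call one of these tools: search(query), calculator(expression).\n"
    "User request: what is 17% of 2,340?\n"
    "Respond with a single tool call."
)

out = generator(prompt, max_new_tokens=64, do_sample=False)
print(out[0]["generated_text"])
```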
It is very likely that you consume less power running a 1B LLM on an Nvidia supercluster than you do trying to download and run the same model on a smartphone. I don't think people understand just how fast the server hardware is compared to what is in their pocket.
We'll see companies push for tiny on-device models as a novelty, but even the best of those aren't very good. I firmly believe that GPUs are going to stay relevant even as models scale down, since they're still the fastest and most power-efficient solution.
I would love to talk more about it and understand your take. Is there a way I can reach out to you? I have been reading up on Web3 and forming my own opinion on its merits and demerits.
I believe edge computing is one of the game-changing technologies of the decade. I am not sure whether it falls under the purview of Web3 or not. For example, one of the libraries I implemented was along the lines of training ML models on user devices instead of in the cloud, preserving privacy while still allowing personalization (roughly the idea sketched below).
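To give a flavor of what I mean by on-device training, here is a minimal, hypothetical sketch in PyTorch. The model, data, and hyperparameters are placeholders and not the actual library; the point is just that the whole training loop runs on the user's device, so raw data never leaves it.

```python
import torch
from torch import nn

class TinyPersonalizer(nn.Module):
    """A small personalization head cheap enough to train on-device."""
    def __init__(self, n_features: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x):
        return self.net(x)

def train_on_device(local_data: torch.Tensor, local_labels: torch.Tensor):
    """Train only on data stored locally; only the updated weights
    (or nothing at all) would ever need to leave the device."""
    model = TinyPersonalizer(local_data.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(20):  # a few quick local epochs
        opt.zero_grad()
        loss = loss_fn(model(local_data).squeeze(-1), local_labels)
        loss.backward()
        opt.step()
    return model.state_dict()  # keep local, or aggregate federated-style

# Synthetic "user interaction" data standing in for real on-device logs.
x = torch.randn(64, 16)
y = (x[:, 0] > 0).float()
weights = train_on_device(x, y)
```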
That's what I am worried about: from its description the intent looks good, but being aligned to only one industry or vertical defeats the purpose of ubiquity. People won't trust it if all they see is a volatile currency.