and shows how LLM technology has a lot more to offer than "ChatGPT". The real takeaway is that by training LLMs on real training data (even with a "less powerful" model) you can get an error rate more than 10x lower than with the "zero-shot" approach of asking ChatGPT to answer a question for you the same way Mickey Mouse asked the broom to clean up for him in Fantasia. The "few-shot" approach of supplying a few examples in the attention window was a little better, but not much.
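To make that comparison concrete, here is a minimal evaluation sketch, assuming a labeled eval set and three hypothetical predictors (zero_shot_predict, few_shot_predict, fine_tuned_predict) standing in for prompting a general model, prompting with a few in-context examples, and a smaller model trained on real data:

```python
# Toy error-rate comparison for the three approaches described above.
# zero_shot_predict, few_shot_predict, and fine_tuned_predict are
# hypothetical callables, not real APIs.
from typing import Callable, List, Tuple

def error_rate(predict: Callable[[str], str],
               eval_set: List[Tuple[str, str]]) -> float:
    """Fraction of (text, label) pairs the approach gets wrong."""
    wrong = sum(1 for text, label in eval_set if predict(text) != label)
    return wrong / len(eval_set)

def compare(approaches: dict, eval_set: List[Tuple[str, str]]) -> None:
    for name, predict in approaches.items():
        print(f"{name:>12}: {error_rate(predict, eval_set):.2%} error")

# compare({"zero-shot": zero_shot_predict,
#          "few-shot": few_shot_predict,
#          "fine-tuned": fine_tuned_predict},
#         eval_set)
```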
The problem isn't something that will go away with a more powerful model because the problem has a lot to do with the intrinsic fuzziness of language.
People who are waiting for an exponentially more expensive ChatGPT-5 to save them will be pushing a bubble around under a rug endlessly, while the grinds who formulate well-defined problems and make training sets actually cross the finish line.
Remember that Moore's Law is over in the sense that transistors are not getting cheaper generation after generation; that is why the NVIDIA 40xx series is such a disappointment to most people. LLMs have some possibility of getting cheaper from a software perspective as we understand how they work, and hardware can be better optimized to make the most of those transistors, but the driving force of the semiconductor revolution is spent unless people find some entirely different way to build chips.
But... people really want to be like Mickey in Fantasia and hope the grinds are going to make magic for them.
If you look back just two years, the grinds built those specialized models for QA, NER, sentiment, classification, etc., and all that deep investment was rug-pulled by GPT-3 and then GPT-4.
You say that training datasets will win, but this is where OpenAI currently has a big leg up: everyone is dumping tons of real data into them, while the LocalLLM crowd is using GPT-4 to try to keep up.
> Remember that Moore's Law is over in the sense that transistors are not getting cheaper generation after generation; that is why the NVIDIA 40xx series is such a disappointment to most people.
I am unconvinced by the idea of trying to redefine Moore's Law to be about MSRP. The NVIDIA H100 has twice the FLOPS of the A100 on a smaller die. That's Moore's Law, full stop. When NVIDIA has useful competition in the AI space, they'll be forced to cut prices, as has reliably been the case for every semiconductor vendor for the last 60 years.
Agreed. We need Intel to get the software side of ARC together.
Or we need something like the unified RAM of Apple Silicon. Apple has accidentally stumbled into being the most competitive way to run LLMs with their 192GB Mac Studio.
Moore's law is irrelevant. Large language models are going to leave the digital paradigm behind altogether.
Neural nets don't need fully precise digital computing. Especially with quantization we're seeing that losing a bit of precision in the weights isn't impactful. Now that we're serving huge foundation models with static weights there's an enormous incentive to develop analog hardware to run them.
Mark my words, this will lead to a renaissance in analog computing, and in the future we will be shocked at the enormous waste of having run huge models on digital chips.
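For the quantization point above, here is a minimal numpy sketch of a symmetric int8 round-trip on a random weight matrix; the matrix size and weight scale are illustrative, not taken from any particular model:

```python
import numpy as np

# Symmetric per-tensor int8 quantization of a random "weight matrix",
# to show how little signal is lost when precision is reduced.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(4096, 4096)).astype(np.float32)

scale = np.abs(w).max() / 127.0               # map the largest weight to +/-127
w_int8 = np.round(w / scale).astype(np.int8)  # 8-bit storage
w_back = w_int8.astype(np.float32) * scale    # dequantize

rel_error = np.abs(w - w_back).mean() / np.abs(w).mean()
print(f"mean relative error after int8 round-trip: {rel_error:.4f}")
```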
Just think, how many multiplications per second is the light refracting through your window right now clocking? More or less than is required to run ChatGPT, do you think? If only the crystals were configured correctly and the patterns of light coming through could be interpreted...
> Just think, how many multiplications per second is the light refracting through your window right now clocking?
It really depends where you draw the lines, because you could also say that one single transistor in my electrical CPU is doing a kerjillion calculations for all of the atoms and electrons involved.
Fresh approaches to AI hardware are emerging, like the Groq Chip, which utilizes software-defined memory and networking without caches. To simplify reasoning about the chip, Groq makes it synchronous so the compiler can orchestrate data flows between memory and compute and design network flows between chips. Every run becomes deterministic, removing the need for benchmarking models since execution time can be precisely calculated during compilation. With these innovations, Groq achieved a state-of-the-art speed of 240 tokens/s on 70B LLaMA.
Fascinating stuff - a synchronous distributed system allows treating 1000 chips as one, knowing exactly when data will arrive cycle-for-cycle and which network paths are open. The compiler can balance loads. No more nondeterminism or complexity in optimizing performance (high compute utilization). A few basic operations suffice, with the compiler handling optimization, instead of 100 kernel variants of CONV for all shapes. Of course, it integrates with PyTorch and other frameworks.
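As a toy illustration of why that determinism matters (this is not Groq's actual compiler or instruction set; the ops, cycle counts, and clock rate are assumed): if every op has a fixed cycle cost and there are no caches or dynamic arbitration, latency is simple arithmetic at compile time.

```python
# Toy static schedule: with a synchronous, cache-free design, each op's
# cycle cost is fixed, so end-to-end latency is known before anything runs.
# All numbers below are made up for illustration.
CLOCK_HZ = 900e6

schedule = [
    ("load_weights",       12_000),
    ("matmul_block_0",     48_000),
    ("matmul_block_1",     48_000),
    ("softmax",             6_000),
    ("store_activations",   9_000),
]

total_cycles = sum(cycles for _, cycles in schedule)
print(f"total cycles: {total_cycles}")
print(f"predicted latency: {total_cycles / CLOCK_HZ * 1e6:.1f} us")
```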
In addition to what the other commenter said about Moore's Law, innovations like Flash Attention, which reduced memory usage by over 10x, and FlashAttention-2, which made huge leaps in compute efficiency, show there is still a lot of room to improve the models and inference algorithms themselves. Even without more compute, we likely haven’t scratched the surface of efficient transformers.
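For context on where that memory goes, a hedged PyTorch sketch: naive attention materializes the full L-by-L score matrix per head, while torch.nn.functional.scaled_dot_product_attention can dispatch to a FlashAttention-style fused kernel on supported hardware. Shapes here are illustrative.

```python
import torch
import torch.nn.functional as F

# Naive attention materializes a (B, H, L, L) score matrix, so memory grows
# quadratically with sequence length L; fused kernels avoid storing it.
def naive_attention(q, k, v):
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

B, H, L, D = 1, 16, 2048, 64
q, k, v = (torch.randn(B, H, L, D) for _ in range(3))

out_naive = naive_attention(q, k, v)
out_fused = F.scaled_dot_product_attention(q, k, v)  # fused path when available

print(torch.allclose(out_naive, out_fused, atol=1e-4))
print(f"score matrix alone: {B * H * L * L * 4 / 2**30:.2f} GiB in fp32")
```

On hardware where the fused kernel is available, that score matrix is never written out in full, which is where savings on the order the comment describes come from.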
You have to create a prompt/function that, for a wide set of inputs, generates a token sequence that will perpetually expand in a manner that corresponds to an externally observed truth.
Way too often it feels like you have to shove a universal decoding sequence into a prompt.
“Talk your steps, list your clues, etc.”
Just trying to luck into a prompt that keeps decompressing the model, generating tokens such that each next token is true.*
Basically - LLMs don’t reason, they regurgitate. If they have the right training data, and the right prompt, they can decompress the training data into something that can be validated as true.
——-
* Also, this has to be done in a limited context window; there is no long-term memory, and there is no real underlying model of thought.
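A minimal sketch of that kind of "talk your steps, list your clues" scaffolding; the exact wording and the ANSWER: marker convention are assumptions, not a standard recipe:

```python
# Sketch of "talk your steps, list your clues" prompt scaffolding. The model
# is asked to expand its reasoning before committing, and a final marker line
# is parsed out. The phrasing and ANSWER: convention are just one example.

def build_prompt(question: str) -> str:
    return (
        "Answer the question below.\n"
        "First list the relevant clues, then talk through your steps, and "
        "finish with a single line of the form 'ANSWER: <answer>'.\n\n"
        f"Question: {question}\n"
    )

def extract_answer(completion: str) -> str | None:
    for line in reversed(completion.splitlines()):
        if line.strip().upper().startswith("ANSWER:"):
            return line.split(":", 1)[1].strip()
    return None  # the model never committed to a final answer

# completion = some_llm(build_prompt("Who wrote The Mythical Man-Month?"))
# print(extract_answer(completion))
```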
It's too early to say who is winning/will win, of course. But so far the UI and its accessibility have made a huge difference in how different gen AI models are being used.
For example, I struggle to see DALL-E winning over Firefly if Firefly is integrated into a very rich environment while DALL-E is basically a prompt-only UI (even though DALL-E 3 is the better model, IMO).
Only Big Tech (Microsoft, Google, Facebook) can crawl the web at scale because they own the major content companies, and they severely throttle the competition's crawlers, and sometimes outright block them. I'm not saying it's impossible to get around, but it is certainly very difficult, and you could be thrown in prison for violating the CFAA.
I'm not sure if training on a vast amount of content is really necessary, in the sense that linguistic competence and knowledge can probably be separated to some extent. That is, the "ChatGPT" paradigm leads to systems that just confabulate and "make shit up", and making something radically more accurate means going to something retrieval-based or knowledge-graph-based.
In that case you might be able to get linguistic competence with a much smaller model that you end up training with a smaller, cleaner, and probably partially synthetic data set.
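A minimal sketch of what that retrieval-based direction could look like, using scikit-learn TF-IDF as a stand-in retriever; the tiny corpus and the idea of handing the result to a smaller model are illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Stand-in retriever: rank a small corpus against the question, then hand
# only the top passages to the (smaller) model so it answers from retrieved
# text instead of confabulating.
corpus = [
    "The Mac Studio tops out at 192GB of unified memory.",
    "Flash Attention reduces the memory footprint of attention.",
    "Groq reports 240 tokens/s on a 70B LLaMA model.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)

def retrieve(question: str, k: int = 2) -> list[str]:
    q_vec = vectorizer.transform([question])
    scores = cosine_similarity(q_vec, doc_vectors)[0]
    return [corpus[i] for i in scores.argsort()[::-1][:k]]

def grounded_prompt(question: str) -> str:
    context = "\n".join(retrieve(question))
    return ("Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\n")

print(grounded_prompt("How much unified memory does the Mac Studio support?"))
```

Swapping TF-IDF for an embedding index doesn't change the shape of the idea: the model only has to be competent with language, while the facts come from the retrieved context.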
Yep, quality over quantity. The difference between 99.9% accurate and 99.999% accurate can be ridiculously valuable in so many real world applications where people would apply LLMs.
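Back-of-the-envelope, with an illustrative request volume:

```python
# Expected failures at the two accuracy levels above; the request volume
# is illustrative.
requests = 1_000_000
for accuracy in (0.999, 0.99999):
    failures = requests * (1 - accuracy)
    print(f"{accuracy:.3%} accurate -> ~{failures:,.0f} failures per {requests:,} requests")
```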
The improvements seem to be leveling off already. GPT-4 isn't really worth the extra price to me. It's not that much better.
What I would really want, though, is an uncensored LLM. OpenAI is basically unusable now; most of its replies are like "I'm only a dumb AI and my lawyers don't want me to answer your question". Yes, I work in cyber. But it's pretty insane now.
I haven't played with the self-hosted LLMs at all yet, but back when Stable Diffusion was brand new I had a ton of fun creating images that lawyers wouldn't want you to create. ("Abraham Lincoln and Donald Trump riding a battle elephant." It's just so much funnier with living people!) I imagine that Llama-2 and friends offer a similar experience.