The problem is performance:
- if you have GPUs with > 330GB VRAM, it'll run fast
- otherwise, you'll run from RAM or NVMe, but very slowly - generating one token every few minutes or so (depending on RAM size / NVMe speed)
The future might be brighter: fp8 already exists and halves the RAM requirements (although it's still very hard to get it running), and there is ongoing research on fp4. Even that would still require 84GB of VRAM to run...
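For a rough sense of where those numbers come from, here's a back-of-the-envelope sketch (assuming the ~176B-parameter BLOOM model and counting weights only, not activations or framework overhead):

    # Weight-only VRAM estimate for a ~176B-parameter model such as BLOOM.
    # Activations, KV cache and framework overhead come on top of this.
    PARAMS = 176e9

    for name, bytes_per_param in [("fp16/bf16", 2), ("fp8", 1), ("fp4", 0.5)]:
        gib = PARAMS * bytes_per_param / 2**30
        print(f"{name:>9}: ~{gib:.0f} GiB of weights")

    # fp16/bf16: ~328 GiB  (the ">330GB VRAM" case)
    # fp8:       ~164 GiB
    # fp4:       ~ 82 GiB  (roughly the 84GB mentioned above)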
> It is remarkable that such large multi-lingual model is openly available for everybody.
Am I the only one thinking that this remark is an insight into societal failure? The model has been trained on freely available content from around the globe; anyone who has published on the Web has contributed.
Yet the wisdom gained from our collective knowledge is assumed to be withheld from us. Since the original remark was one of surprise, the author's (and our) assumption is that trained models are expected to be kept from us.
I think it’s similar to how search engines keep their ranking formulas secret, and you can’t run your own off a copy of their index.
Yet we all contributed to it by publishing (and by feeding it, for instance by following Google's requirements for microdata). But we don't own any of it.
The main difference with a search engine is that a search engine ultimately links back to you. So the user, interested in more or wanting to know where it comes from, ends up on your website.
The same is not true for these AI tools. The output could have been contributed by you, someone else, or everyone, or a combination of those, but it'll never be clear who actually contributed and there will be no credit to anyone besides the author(s) of the models.
How much money do we spend contributing to the training set?
Those insights, comments, articles, code examples, etc. are free to use because we published them on sites that don't own the content but earn from it. If they owned it, they would be responsible for hate speech.
So our cost of producing the training set is negligible.
If it fits in system memory, is it still faster on GPU than CPU? Does that involve swapping out one layer at a time? Otherwise I'm very curious how it handles the PCIe latency.
Enough system memory to fit 84GB isn't all that expensive...
Yes, the connection between system memory and the GPU isn’t fast enough to keep the compute units fed with data to process. Generally PCIe latency isn’t as much of a problem as bandwidth.
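To put rough numbers on the bandwidth point (assuming PCIe 4.0 x16 at ~32 GB/s and that each generated token needs essentially a full pass over the weights; real offloading setups overlap transfer and compute, so treat this as an illustration only):

    # Streaming ~352 GB of fp16 weights from system RAM over PCIe, per token:
    weights_gb = 352         # ~176B params at 2 bytes each
    pcie_gb_s  = 32          # PCIe 4.0 x16, theoretical peak

    transfer_time = weights_gb / pcie_gb_s   # ~11 seconds per token
    print(f"~{transfer_time:.0f} s/token spent just moving weights")

    # PCIe latency is on the order of microseconds per transfer -- dwarfed by
    # the seconds of bulk transfer time. NVMe is slower still (a few GB/s),
    # which is how you end up at minutes per token when offloading to disk.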
Honestly even if it were to take a few minutes per response, that's likely sufficient for many use cases. I'd get value out of that if it allowed bypassing a paywall. I'm curious how these models end up being monetized/supported financially, as they sound expensive to run at scale.
The required disk space seems the biggest barrier for local.
I also wonder how OpenAI etc. provide access to these for free. Reminds me of the adage from when Facebook rose to popularity: "if something is free, 'you' are the product". Perhaps it's to gather lots more conversational training data for fine-tuning.
> As part of our commitment to safe and responsible AI, we review conversations to improve our systems and to ensure the content complies with our policies and safety requirements.
>> Will you use my conversations for training?
> Yes. Your conversations may be reviewed by our AI trainers to improve our systems.
He means multiple GPUs in parallel with a combined VRAM of that size. So around 4x NVIDIA A100 80GB, which you can get for around $8.40/hour in the cloud, or 7x NVIDIA A6000 or A40 48GB for around $5.50/hour.
So not exactly cheap or easy yet for the everyday user, but I believe the models will become smaller and more affordable to run. These are just the "first" big research models, focused on demonstrating some usefulness; after that, more focus can go into size and speed optimizations. There are multiple methods and a lot of research into making them smaller: distillation, conversion to lower precision, pruning the less useful weights, sparsification. Some achieve around 40% size reduction and 60% speed improvement with minimal accuracy loss; others achieve 90% sparsity. So there is hope of running them, or similar models, on a single but powerful computer.
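For a flavour of what two of those techniques look like in practice, here's a minimal PyTorch sketch on a toy model (using the stock torch.quantization and torch.nn.utils.prune utilities; production pipelines for models this size are far more involved):

    import torch
    import torch.nn as nn
    import torch.nn.utils.prune as prune

    # Toy stand-in for a transformer block's feed-forward layers.
    model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

    # 1) Post-training dynamic quantization: Linear weights stored as int8,
    #    roughly 4x smaller than fp32 (2x smaller than fp16).
    quantized = torch.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )

    # 2) Unstructured magnitude pruning: zero out 90% of the smallest weights.
    for module in model:
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=0.9)

    # Note: sparse weights still need a sparse-aware runtime to turn the
    # zeros into actual memory and speed savings.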
You'd basically need a rack-mount server full of NVIDIA H100 cards (80GB of VRAM, $40 thousand US dollars each). So... good luck with that?
On the relatively cheap end, used NVIDIA Tesla cards are kinda cheap: 24GB ones with architectures from a few years ago go for ~$200. That's still nearly $3000 worth of cards, not counting the rest of the computer. This isn't really something you can run at home without having a whole "operation" going on.
Down that far, I start to wonder if trinary circuits might become useful again.
fp4 with a 1-3-0 layout would mean 27 values if the first bit were interpreted as binary. But -- and an engineer should check me on this, because to me a transistor is a distant abstraction -- I think you could double that to 54 values if you were clever with the sign bit and arithmetic circuitry. Maybe push it to 42 if only some of my intuition is wrong.
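If I'm reading the counting right (and I may be misreading "1-3-0" -- taking it as 1 sign digit, 3 ternary magnitude digits, 0 mantissa digits), the arithmetic would simply be:

    # 3 ternary digits give 3^3 distinct magnitudes
    magnitudes = 3 ** 3               # 27
    # a separate binary sign bit doubles that
    with_sign = 2 * magnitudes        # 54
    print(magnitudes, with_sign)      # 27 54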
"At that time [1955], transistors were not yet available, but it was clear that the machine should not use vacuum tubes. Tubes have a short lifespan, and tube-based machines were idle most of the time because they were always being repaired. A tube machine worked at best for several hours, then it was necessary to look for another malfunction. Yuli Izrailevich Gutenmakher built the LEM-1 machine on ferrite-diode elements. The thought occurred to me that since there are no transistors, then you can try to make a computer on these elements. Sobolev, whom everyone respected very much, arranged for me to go on an internship with Gutenmacher. I studied everything in detail. Since I am a radio engineer by education, I immediately saw that not everything should be done the way they did it. The first thing I noticed is that they use a pair of cores for each bit, one working and one compensating. And an idea came to my mind: what if we make the compensation core do work, as well? Then each cell becomes three-state. Consequently, the number of cores in Setun was seven times less than in LEM-1."
But why? There's nothing special about having 4 storage elements. If you want 54 values then 6 bits are going to be just as effective as 4 trits, and easier to implement in every way.
> Politics and consumer capitalism are motivated to identify and target stupid people...
My bet is on a variation of this. To some extent, we all target people to advance our goals, be it to get them hooked on our product, to get people to rally behind an idea or a policy we want, or perhaps to get ourselves elected to a public office.
Entities with more resources naturally invest more in this and have more advanced tools to get people to do what they want. Most likely emotions work much better than smartness and objective truth when targeting large groups of people, so that's what we get.
I think it's nothing new, but recent research advances and the ease of reaching out to people personally these days have made using people to accomplish your goals probably the most powerful tool on the planet. Why build weapons or wage wars if you could just make people do what you want of their own free will, and sing your praises along the way?
Highly debated argument though; to me it's like saying your CPU doesn't understand HTML, and your browser is running on a CPU, hence it can't understand HTML either. Scott Aaronson explained it nicely too: https://scottaaronson.com/democritus/lec4.html#:~:text=Searl... . Even the Wikipedia page mentions many reasonable counter-arguments.
One point missed by the article is visibility: even with SMS-2FA, I at least know when my password is being used by someone else (modulo SIM-based attacks). For example, if my password manager gets hacked and a password leaks. I think the overall conclusion is still right: it's a rather minor concern, and let's come up with a proper solution and not waste the developers' good will on this one.
Another way to think about it is comparing to how children learn. First, children spend an inordinate amount of time just trying to make sense of the words they hear. Once they develop their language models, adults can explain new concepts to them using language. What'd be really exciting is being able to explain a new concept to GPT-n in words, and have it draw conclusions from it. Few-shot learning is a tiny step in that direction.
Children don't spend inordinate amounts of time learning words. In fact, past the first months, children often learn words from hearing them a single time.
I have a 4, 6, and 8 year old, and each of them is still learning words. Yeah, they don't spend 80% of each day learning words, but building up their vocabulary legit takes a looong time.
Oh, absolutely. I'm 31 and I'm still learning words!
But I don't think I've ever spent time to learn a particular word - it's almost always enough to hear it in context once, and maybe get a chance to actually use it yourself once or twice, and you'll probably remember it for life.
If it's a word for a more complex concept (e.g. some mathematical construct), you may well need more time to actually understand the meaning, and you may also pretty easily forget the meaning in time, but you'll likely not forget the word itself.
"But I don't think I've ever spent time to learn a particular word - it's almost always enough to hear it in context once, and maybe get a chance to actually use it yourself once or twice, and you'll probably remember it for life."
I'd strongly bet against this. If it were true, SAT and similar vocabulary tests would be trivial to anybody who has taken high school English, and I think it is not the case that most people perceive the SAT to be trivial.
That's of course correct. Perhaps GPT-3 can do that too? I don't have access to it, but I wonder if it can be taught new words using few-shot learning.
In fact, even GPT-2 gets close to that. Here's what I just got on Huggingface's Write With Transformer: Prompt: "Word dfjgasdjf means happiness. What is dfjgasdjf?" GPT-2: "dfjgasdjf is a very special word that you can use to express happiness, love or joy."
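For anyone who wants to reproduce this locally, here's a minimal sketch using Hugging Face's transformers library (output is sampled, so you'll get a different completion each run):

    from transformers import pipeline

    # GPT-2 is small enough to run on a laptop CPU.
    generator = pipeline("text-generation", model="gpt2")

    prompt = "Word dfjgasdjf means happiness. What is dfjgasdjf?"
    out = generator(prompt, max_new_tokens=30, do_sample=True)
    print(out[0]["generated_text"])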
What takes time is all the learning a child needs to go through before they can be taught new words on the spot.
Awesome work, thanks for sharing! For those trying to replicate it, could you please share some insights on which steps to train the model worked the best for you? I see 3 different train.py invocations in your colab - for how long did you end up running each of them?
Do you have any insights on whether getting an L1-A got more difficult with the current administration, and if/how the kind of documents/evidence needed for it changed recently?
I'm asking because I just got my L1-A renewal rejected (after having it successfully renewed 2 years ago), on the grounds of insufficient proof of my position being managerial. My I-94 expired and I don't have another visa, so I had to leave the US – and it's awful for my startup! Timing couldn't be worse.
I'm applying for a new L1-A, so any insights on my question above, or any advice on how to maximize my chance of getting it as quickly as possible, would be super welcome!
Non-blanket L-1s are just really tough and probably the "best" example of irrational and unfair decision-making by USCIS. In short, extensive documentation of the structure/organization of the U.S. and foreign companies and of the employees managed and to be managed needs to be provided along with DETAILED descriptions of current and future managerial job duties. But L-1s are just tough and were tough even under the prior administration.
An L-1A - if you can show management of people now and in the U.S. - is much easier than an L-1B - unless the L-1B involves advanced scientific research and development.
I had an L1-B extension request inside the US and they came back with a request for more information. Honestly it looked like they were about to refuse my visa extension. I told the immigration lawyers that I'd be back in London over Christmas and they changed their tune. Applying through London, albeit with an L1 company blanket doc, took all of 30 minutes starting with review of the application and finishing with an interview where the person interviewing me obviously knew very little about my industry.
There appear to be wide discrepancies within the system that can be arbitraged.
Actually you can, it even works without GPU, here's a guide on running BLOOM (the open-source GPT-3 competitor of similar size) locally: https://towardsdatascience.com/run-bloom-the-largest-open-ac...