GPT-4 is not the same product. I know it seems like it because of the way they position 3.5 and 4 on the same page, but they are really quite separate things. When I signed up for ChatGPT Plus I didn't even bother using 3.5 because I knew it would be inferior; I still have only used it a handful of times. GPT-4 is so much farther ahead that using 3.5 is just a waste of time.
Would you mind sharing some threads where you thought ChatGPT was useful? These discussions always feel like I’m living on a different planet with a different implementation of large language models than others who claim they’re great. The problems I run into seem to stem from the fundamental nature of this class of products.
Context: I had a bunch of photos and videos I wanted to share with a colleague without uploading them to any cloud. I asked GPT-4 to write me a trivial single-page gallery that doesn't look like crap, feeding it the output of `ls -l` on the media directory. It got it right on the first try; I copy-pasted the result and uploaded the whole bundle to a personal server. It took maybe 15 minutes from the idea first occurring to me to a private link I could share.
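To give a sense of how small that task is, here is a rough Python sketch of the equivalent job (GPT-4 actually emitted the HTML page directly from the `ls -l` listing I pasted in; the extension lists and styling below are arbitrary placeholders, not what it produced):

```python
#!/usr/bin/env python3
# Rough sketch: pipe `ls -l` of the media directory in, get a single gallery.html out.
import html
import sys

IMAGE_EXTS = (".jpg", ".jpeg", ".png", ".gif", ".webp")
VIDEO_EXTS = (".mp4", ".webm", ".mov")

def cells_from_ls(lines):
    for line in lines:
        parts = line.split(maxsplit=8)   # `ls -l` puts the filename in the 9th field
        if len(parts) < 9:
            continue                     # skip the "total ..." line and blanks
        name = html.escape(parts[8].strip())
        lower = name.lower()
        if lower.endswith(IMAGE_EXTS):
            yield f'<figure><img src="{name}" loading="lazy"><figcaption>{name}</figcaption></figure>'
        elif lower.endswith(VIDEO_EXTS):
            yield f'<figure><video src="{name}" controls preload="metadata"></video><figcaption>{name}</figcaption></figure>'

items = "\n".join(cells_from_ls(sys.stdin))
print(f"""<!doctype html>
<meta charset="utf-8">
<title>Gallery</title>
<style>
  body {{ margin: 2rem; background: #111; color: #eee; font-family: sans-serif; }}
  .grid {{ display: grid; grid-template-columns: repeat(auto-fill, minmax(280px, 1fr)); gap: 1rem; }}
  figure {{ margin: 0; }}
  img, video {{ width: 100%; border-radius: 6px; }}
</style>
<div class="grid">
{items}
</div>""")
```

Usage would be something like `ls -l media/ | python3 gallery.py > media/gallery.html`.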
I have plenty more of those touching C++, Emacs Lisp, Python, generating vCard and iCalendar files out of blobs of hastily retyped or copy-pasted text, etc. The common thread: one-off, ad-hoc requests, usually underspecified. GPT-4 is quite good at being a fully generic tool for one-off jobs. That is something that never existed before, except in the form of delegating a task to another human.
I use ChatGPT for all sorts of things: looking into visas for countries, coding, reverse-engineering companies from their job descriptions, brainstorming, etc.
It saves a lot of time and gives way more value than what you pay for it.
Based on my research, GPT-3.5 is likely significantly smaller than 70B parameters, so it would make sense that it's cheaper to run. My guess is that OpenAI significantly overtrained GPT-3.5 to get as small a model as possible and optimize for inference. Also, Nvidia chips are far more efficient at inference than an M1 Max, and OpenAI has the additional advantage of batching API calls, which leads to better hardware utilization. I don't have definitive proof that they're not dumping, but economies of scale and optimization seem like better explanations to me.
I also do not have proof of anything here, but can't it be both?
They have lots of money now and the market lead. They want to keep the lead and some extra electricity and hardware costs are surely worth it for them, if it keeps the competition from getting traction.
The reason cement is a major contributor to CO2 emissions is because of how much cement we produce. I don't know the lifetime or effectiveness of this catalyst, but typically you only need a tiny amount of catalyst to start a reaction and the catalyst material can be used over and over for a long time.
We could not do that inadvertently. To block even 1% of light using Starlink sized satellites (~30 m^2 with solar panels deployed) would require tens of billions of satellites. We could do it on purpose with huge rotating solar reflectors, and it should honestly be considered as a real option, but we couldn't and wouldn't do so just by launching communication satellites.
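Rough numbers behind that claim, as a back-of-envelope sketch (it ignores orbital geometry and just divides shaded area; the 30 m^2 per satellite figure is the assumption from above):

```python
import math

R_EARTH_M = 6.371e6    # mean Earth radius in metres
SAT_AREA_M2 = 30.0     # assumed Starlink cross-section with panels deployed

earth_disc = math.pi * R_EARTH_M ** 2          # cross-sectional disc Earth presents to the Sun
sats_needed = 0.01 * earth_disc / SAT_AREA_M2  # satellites needed to shadow 1% of it
print(f"{sats_needed:.1e} satellites")         # ~4e10, i.e. tens of billions
```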
The quality difference is substantial. I don't care if it's wasteful to use something that has many uses for a supposedly narrow task (and I don't see translation as a particularly narrow task any more than I see writing as one). I would gladly waste untold trillions of floating-point operations for a 1% increase in translation quality, and from my experiments the improvement is much larger than 1%.

Regardless of how wasteful the compute is, it's actually cheaper in dollar terms. Using GPT-3.5 to translate Korean to English would cost about $11 per million words, based on the average characters per token of the small sample of text I gave it. DeepL (the best translation service I could find) costs $25 per million characters, or for my sample text, about $64 per million words. At $11 per million words I can have GPT-3.5 perform multiple translation passes and use its own judgment to pick the best translation and STILL save money compared to DeepL.
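To make the comparison concrete, here is that arithmetic as a sketch (the characters-per-word figure is back-derived from my ~$64/million-words number for this particular Korean sample, so none of this is official pricing math):

```python
# All numbers are sample-dependent estimates, not official pricing.
gpt35_per_million_words = 11.0        # derived from observed characters per token on a Korean sample
deepl_per_million_chars = 25.0        # DeepL API price per million characters
korean_chars_per_word = 64.0 / 25.0   # implied by the ~$64/million-words figure (~2.6 chars/word)

deepl_per_million_words = deepl_per_million_chars * korean_chars_per_word
max_passes = deepl_per_million_words / gpt35_per_million_words
print(f"DeepL: ~${deepl_per_million_words:.0f}/M words; "
      f"GPT-3.5 can do ~{max_passes:.1f} translation passes for the same money")
```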
It's worse on English and a lot of other common languages (see Appendix C of the paper). It does better on less common languages like Latvian or Tajik, though.
Which implies Whisper just hadn't focused on those languages? It seems disingenuous to claim the error rate has halved when it's worse in the apex language.
There will always be overhead, but that doesn't mean it will always be a huge amount of overhead. I believe the state of the art is 97% efficiency (https://www.osti.gov/biblio/1495980), which is better than a lot of wired chargers. Real-world systems will be less efficient, and it may be too expensive, but a maglev system would be even more expensive.
"This is within the expected range of the thermal limits of the envisaged mechanical design of the coils and also indicates a minimum of 98% coil-to-coil power transfer efficiency."
Key here is coil-to-coil, NOT total charging efficiency, let alone "better than wired." Worse, it was a stationary setup with a distance of 5 inches between coils, so adding asphalt, protection for the car's coil, etc. is just going to make it worse.
I would bet money against that. Replicating GPT-4 pre-training with current hardware would cost about $40-50M in compute. Compute will continue to decrease in cost and algorithmic improvements may allow for more efficient training, but probably not by three orders of magnitude in a few years. I think there will be plenty of open-source models that claim GPT-4 quality, and some of them will be close, but they will be models that cost millions of dollars in compute to train (probably funded by some corporate benefactor, possibly by crowdsourcing). You will probably be able to fine-tune and run inference on fairly cheap hardware, but you can't cheat scale. It's going to take a major innovation to move away from the expensive-base-model paradigm.
I did my own calculations based on plotting loss on benchmarks against models with known parameter counts and training data, as well as a quote from Sam Altman saying that GPT-4 would not use very many more parameters than GPT-3. Based on this, I estimated that GPT-4 probably used about 250B parameters, and since I had an estimate for the total compute I was able to estimate that the training data was about 15T tokens. 250B parameters times 15T tokens times 6 (https://medium.com/@dzmitrybahdanau/the-flops-calculus-of-la...) means the compute was about 2.25×10^25 FLOPs. I estimated that A100s cost about $1/hr and can process about 5.4×10^17 FLOPs per hour at 50% efficiency. Therefore, the compute works out to (2.25×10^25)/(5.4×10^17) ≈ 4.2×10^7 A100-hours, or about $40 million.
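The same arithmetic as a sketch, so the assumptions are explicit (every input here is an estimate, not a known value):

```python
params = 250e9                 # estimated GPT-4 parameter count
tokens = 15e12                 # estimated training tokens
train_flops = 6 * params * tokens      # ~6 FLOPs per parameter per token -> ~2.25e25 FLOPs

a100_flops_per_hour = 5.4e17   # ~150 TFLOPS sustained (~50% utilization) * 3600 s
a100_dollars_per_hour = 1.0

gpu_hours = train_flops / a100_flops_per_hour                           # ~4.2e7 A100-hours
print(f"~${gpu_hours * a100_dollars_per_hour / 1e6:.0f}M in compute")   # ~$40M ballpark
```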
Interestingly, my own calculations lined up pretty well with this calculation, although they approached the problem from a different direction (a leak by Morgan Stanley about how many GPUs OpenAI used to train GPT-4 as well as an estimate of how long it was trained):
https://colab.research.google.com/drive/1O99z9b1I5O66bT78r9S...
1. We don't know the number of parameters; it could be 175B, 250B, or 400B. OK, let's stick with 250B.
2. Training data: GPT-3 was trained on 300B tokens. It already used most of the high-quality data available on the internet, but let's say they somehow managed to find and prepare three times as much high-quality data for GPT-4. This means GPT-4 was trained on about 1T tokens.
3. 5.4e+17 FLOPs/hour works out to 150 TFLOPS, which is about half of the A100's BFLOAT16 theoretical peak, so that sounds reasonable.
4. $1/A100/hr is reasonable.
OK, so we need to divide your cost estimate by a factor of 15: the total cost to train GPT-4 comes out to around $2.7M.
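Same back-of-envelope as the parent's sketch, with the 1T-token assumption from point 2 swapped in (all the numbers are still guesses):

```python
params = 250e9
tokens = 1e12                              # point 2: ~1T tokens instead of 15T
gpu_hours = 6 * params * tokens / 5.4e17   # ~2.8e6 A100-hours at ~150 TFLOPS sustained
print(f"~${gpu_hours * 1.0 / 1e6:.1f}M")   # ~$2.8M, i.e. the ~$40M estimate divided by ~15
```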
Regarding Altman's statement about "more than $100M to train GPT-4": I'm pretty sure he was talking about the total cost to develop GPT-4, which includes a lot of experimentation and exploration, many training runs, and many other administrative costs that are not relevant to a single training run reproducing the existing results. Just salaries alone: ~200 people working on GPT-4 for, say, half a year at $400k/year comes to 0.5 * 400k * 200 = $40M.
Especially if you consider that as compute costs decrease, the ability of the scale players to collect and process ever-larger datasets grows.
If you extrapolate that relation, you eventually reach a point where the biggest player can collect and process the most information and produce an ever-evolving model to maintain that lead.
Better hope its creators have your best interests at heart.
They do update the model in the background, although I'm not sure how often or how much. To avoid issues with this practice they offer gpt-4-0314, which the documentation describes like this:
"Snapshot of gpt-4 from March 14th 2023. Unlike gpt-4, this model will not receive updates, and will only be supported for a three month period ending on June 14th 2023."
Unfortunately this experiment is using the frozen snapshot model gpt-4-0314 instead of the unfrozen gpt-4 or gpt-4-32k models, so any differences are literally 100% noise. This would be a somewhat interesting experiment if someone were to use an unfrozen model, though. I do appreciate that the author captioned the images with the exact model used for generation, so this bug could be caught quickly.
Author here: these images are using `gpt-4`, but I'm recording the specific model that OpenAI serves for each result. As the incremental updates come out, that will change (without requiring me to change anything).
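Concretely, something along these lines (a sketch with the 2023-era openai Python client; the response object reports which snapshot actually served the request, e.g. "gpt-4-0314", even when you request the floating `gpt-4` alias, and the prompt here is a placeholder):

```python
import openai  # assumes openai.api_key is configured

response = openai.ChatCompletion.create(
    model="gpt-4",  # floating alias; OpenAI decides which snapshot serves it
    messages=[{"role": "user", "content": "placeholder prompt"}],
)
served_by = response["model"]   # e.g. "gpt-4-0314"; record this alongside each result
print(served_by)
```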