OLMo: Accelerating the Science of Language Models [pdf] (allenai.org)
141 points by chuckhend on Feb 2, 2024 | hide | past | favorite | 52 comments


"We intend to follow up on this release with another one soon that includes the following:

...

Weights & Biases logs for our training runs."

That's amazing. I've never seen that before in a paper of this quality. Or, any paper at all.


Weights & Biases for OLMo 7B are now out: https://wandb.ai/ai2-llm/OLMo-7B/reports/OLMo-7B--Vmlldzo2Nz...


I think Hugging Face and Facebook have both offered this level of detail in the past? Still great though.


EleutherAI as well.


It's more common than you think. I did the same for one of my research papers.


It's very interesting that they went to the effort of doing complete end-to-end runs on both NVidia and AMD hardware.

A pity they didn't release the speed of training, but the software is now there for someone else (not under benchmark embargo) to do that.


They detail the energy used, and therefore the estimated carbon emissions, which is interesting. When I estimate the raw electricity cost at 7-20 cents per kWh (US commercial rates), we are only talking about $16-50k for electricity. That seems pretty small! Is my math wrong?

Is there any information on how much the computing costs were for renting the clusters?

Is the barrier to entry for a 7B model only a couple $100K?

EDIT: https://news.ycombinator.com/item?id=39223467#39224534

Perhaps only $85K total
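The comment's range can be reproduced with simple arithmetic. A minimal sketch, where the ~240 MWh energy figure is an assumption back-derived from the $16-50k range above, not a number taken from the paper:

```python
# Back-of-the-envelope electricity cost for a training run.
# ASSUMPTION: ~240 MWh total energy, inferred from the $16-50k range
# in the comment; check the paper's reported energy use for the real figure.
def training_electricity_cost(energy_mwh: float, rate_usd_per_kwh: float) -> float:
    """Raw electricity cost in USD: convert MWh to kWh, multiply by the rate."""
    return energy_mwh * 1000 * rate_usd_per_kwh

low = training_electricity_cost(240, 0.07)   # low-end US commercial rate
high = training_electricity_cost(240, 0.20)  # high-end rate
print(f"${low:,.0f} to ${high:,.0f}")        # $16,800 to $48,000
```

So the quoted range corresponds to roughly a couple hundred MWh of energy, consistent with the $85K total estimate once compute rental is included.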


Facing unbearable heat, Qatar has begun to air-condition the outdoors:

https://www.washingtonpost.com/graphics/2019/world/climate-e...

I feel like we're only trying to optimize what we measure. No such measurements happen for other industries. How much electricity does Las Vegas use for its extravagant displays of lights, water shows, and so on?


Despite the typical complaints about "X new thing harming the environment!!!", LLMs are about as environmentally friendly as it gets. They:

1. Consume a minor amount of electricity (data centers are only ~2% of US electricity use, and AI is currently maybe only 5-10% of that). It's trivial compared to, say, metal smelting.

2. Consume water for cooling.

That's it. There is zero direct pollution generated from AI, and even the water use is very minor compared to, say, farming, and can be improved via more water-efficient cooling tech.

The main concern is the scaling speed. As LLMs scale up 10x, 100x, 1000x, those previously very minor electricity costs can quickly become grid-impacting within a decade.
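A toy projection of that concern, using the rough shares from the comment (2% of US electricity for data centers, 5-10% of that for AI) as assumptions:

```python
# Rough scaling sketch. The 2% and 5-10% shares are the comment's own
# estimates, not measured figures; the 7.5% midpoint is an assumption.
datacenter_share = 0.02   # data centers as a fraction of US electricity use
ai_share_of_dc = 0.075    # midpoint of the 5-10% guess for AI's slice
ai_share = datacenter_share * ai_share_of_dc   # ~0.15% of the grid today

for scale in (1, 10, 100, 1000):
    # A naive linear extrapolation; past ~100x it exceeds plausibility,
    # which is exactly why scaling at constant efficiency can't continue.
    print(f"{scale:>4}x AI workload -> {ai_share * scale:.2%} of US electricity")
```

Even at these hand-wavy numbers, 100x lands in the tens-of-percent range, which is where "grid-impacting" stops being hypothetical.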


I can't buy this kind of argument anymore. How about the external effect of AI steering the entire semiconductor industry to increase GPU/NPU capacity?


This kind of argument is actually totally valid. But only if you subscribe to the current meta of widely accepted handwaving.

Externalities are never a part of capitalist math. Non-trivial consequences can never hurt if one never looks further than their own nose.


> we are only talking about $16-50k for electricity, that seems pretty small

I suppose this depends greatly on how you view the utility of LLMs. In a capitalist sense, sure—there's great utility here persuading VCs to part with their coins and jobs to be replaced with correspondingly larger profit margins. But the opportunity cost of not solving major problems most of humanity can agree on seems nearly incalculably large. Not that capitalists give a shit.


This is a process of exploring new technology. Research is expensive and probably doesn't always yield immediate returns, but when it does, you get outsized returns.

Imagine how non-obvious the first machines must have seemed at the start of the industrial revolution. You only have to feed a man and he can work, but a machine requires iron, oil, water, fuel, engineers, operators. The up-front cost of exploring early digging machines must have been absurd. And I'm sure some people at the time thought: "Wow, we could be spending this money on bread for the poor instead."

Aren't you glad we didn't?


If we had spent the money on bread for the poor instead, we wouldn't be facing an existential threat created by our lack of understanding of the consequences of our actions, and our collective inability to respond to that effectively.


Consider the world before industry...

You really want to have 10 kids and have 50% or more of them die before age 10? You want a world before penicillin and antibiotics? No computers? No travel. Women married off at 15 and immediately pregnant. Most of the world in absolute poverty, destroyed by a single bad season. Mass famines, plagues, tribal warfare sweeping over your village. No clean water and soap. Malnutrition.

These are just non problems for huge portions of the planet now.


What if investing in AI tech like LLMs eventually allows knowledge workers to be more productive with fewer resources, and therefore ultimately frees up more people to focus on the so-called major problems?

Maybe we can invest more human hours in speeding up the path to zero emissions and energy abundance, or re-planting deserts, or cleaning up forever chemicals / microplastics, or helping at-risk kids, etc etc.


Not disagreeing, but as a sidenote: many of the underlying human issues may be semi-orthogonal to the level of technology going forward. We already have enough resources for the poor and hungry and homeless. It's behavioural issues we don't know how to fix. How to bootstrap a crackhead into a bank teller, so to speak.

I hope the bottom 10% rung of a Dyson-sphere society doesn't just look like hungry homeless people, but on a space station.


> It's behavioural issues we don't know how to fix

Or how to tax labor no more heavily than capital.

Or how to view quality education and healthcare for children, and keeping their parents out of survival mode, as a much better investment for everyone than funding the adventures of overly war-happy presidents.

I am enthusiastically agreeing with you. Behavioral changes at the top and bottom of society are most of the problem - not tech.


I live in Japan, and they have implemented what you are requesting. I've recently been to a relative's house here where they live off government handouts despite having jobs. The government pays the woman for having children; as you can imagine, this is a perverse incentive. She and her four-ish children have jobs. Despite collectively having more than $100k a year to work with, they live in essentially a dirty crack house with a toilet that hasn't worked for years. They fight over money and emotionally blackmail family to get $1-10k at a time, and never pay it back.

Being poor like this is not a money problem. It's a behaviour problem.

You can't fix this by giving them money. They just spend it on alcohol and cigarettes.

Having interacted with them, I know they aren't obviously stupid, and they are educated. My wife attended the same strict Japanese school. It was very high quality compared to an average American school; they made it up through calculus as high schoolers. She still remembers Riemann sums 15 years later.

Your current perception of the world isn't quite right. It sounds like you've got this magic fix in your head, but in reality it just wouldn't work. You're ignoring the thing you profess to actually care about: the people.


> Behavioral changes at the top and bottom of society are most of the problem

(Added emphasis)

I agree with everything you say, lots of irresponsible people and culture. But that isn't the whole story.

The wealthy and asset owners also tilt the economy toward themselves and away from labor and the less wealthy in many ways.

Poor outcomes for young individuals do have strong correlations, with strong causal support, to low-income districts with poor health and education resources, poor safety, and poverty-level parents. That is a circular problem created by treating the education, health, and safety of children as a "local" issue instead of what it obviously is: a national issue.

Also, housing is a problem for many working people, while the rich magnify the problem by using the limited availability of real estate as a financial instrument to park money, making profitable returns based on exclusivity and on productive economic growth elsewhere, which drives further investment in land even when the land is underutilized.

This is due to the perverse incentive of taxing total land-plus-development value instead of just the land. (Development on land should be encouraged, not taxed. Other development and property isn't "wealth"-taxed. The underlying land, by contrast, is limited, so taxing those who make it unavailable to others is a community-neutral bargain, and makes underutilization of land unprofitable.)
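A toy illustration of that incentive argument, with made-up assessed values and rates chosen only to show the direction of the incentive, not to model any real tax code:

```python
# Two owners of identical lots; one builds housing, one leaves the lot idle.
LAND = 500_000        # assessed land value (assumed)
BUILDING = 1_000_000  # assessed value of improvements (assumed)

def property_tax(land, improvements, rate=0.02):
    # Conventional property tax: building more raises your bill.
    return (land + improvements) * rate

def land_value_tax(land, improvements, rate=0.04):
    # Land-value tax: improvements are untaxed, so building isn't penalized.
    return land * rate

print(property_tax(LAND, BUILDING))    # builder pays 30,000.0
print(property_tax(LAND, 0))           # idle lot pays only 10,000.0
print(land_value_tax(LAND, BUILDING))  # builder pays 20,000.0
print(land_value_tax(LAND, 0))         # idle owner pays the same 20,000.0
```

Under the conventional scheme the idle speculator pays a third of what the builder pays; under the land-only scheme holding land idle carries the full cost.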

This goes on and on: regulatory capture; personal loans against assets that give wealthy owners liquidity events funding lavish lifestyles without any associated taxes; taxes on labor that rise beyond the tax rates on capital and corporations; etc.

The rich and asset-owning classes use government policy to actively tilt things their way, throughout society, on the backs of those whose primary "asset" is their labor value.



Who will be the first to do a useful Instruct-trained variant?

It's a pity the Mistral 7B Instruct 0.2 dataset isn't available, because I've found it much higher quality than any of the finetunes around, and I suspect we'll have to rely on the same groups doing finetunes for this model too.


Nous just released their full instruction tuning dataset, so I dunno why someone with enough compute couldn’t do this.


And Capybara be lookin' fiiine for tuning too. Seriously, though, you're right. These are some of the highest quality generative datasets in existence, and I'm surprised more isn't being done with them.


The Nous finetunes of Mistral benchmark well but in practice seem worse than the original Mistral versions IMHO.

Of course we don't know how to measure this so respect to them for the benchmark performance.


I'm sorry, I don't understand the exact contribution here. There are many tutorials on how to train a language model. If it's a repository of SOTA techniques for training, it will be outdated in at most 3 months, and anyway the ground shifts under you in this field, so you might as well read arXiv all day if your intention is to keep up with SOTA.


It looks like this team gave us everything we need to reproduce their models: the actual artifacts. As far as I can tell, they share the data and every step along the way to the final model, not just a description of what they did.


Researchers don't read tutorials; they cross-check each other's work. You need details to do that.


wdym by cross-check each other's work? Surely just reporting the final loss is good enough if that's the intention. The final end goal is lower loss anyway, so it's not even a bad metric.


Pretty cool that it runs on and and Nvidia


Not sure if you’re downvoted for the typo: “and” instead of “AMD”?


Yes I meant AMD.


Feels like there must be 40 or so distinct open-source LLMs now. What gives? We need some more new text-to-image models too... :(


If you read around, training a 7B model costs on the order of $85,000; the Stable Diffusion 1.4 release cost around $600,000 to train.

You don't see a lot of 70B or larger models being released for the same reason; it's expensive.

We should just be grateful for what we're getting right now: basically, people are spending hundreds of thousands of dollars on training and giving the results away for free. Hugging Face is hosting them for free. Ollama is hosting them for free. People are writing free inference engines (e.g. llama.cpp) and giving them away.

Don't complain. We've got it pretty damn good right now.


> If you read around, training a 7B model costs on the order of $85,000; the Stable Diffusion 1.4 release cost around $600,000 to train.

That seems remarkably cheap actually and likely getting cheaper fairly quickly with improvements in training efficiencies I’d imagine.


On the other hand, the systems are trained on “free” data so it kinda should be public property by default.

Claiming it's fair use to suck up the entire web and paywall the derived result is an absurd argument.

We all created the lifeblood of LLM and we’re entitled to the product.


You've just described Google, which derives most of its ad revenue from ads it places on the search engine that's crawling the public web. It has always been thus: derivative products that provide a meaningful transformation of the input are a wholly separate piece of copyright.


No, this is very different. Google will link you to the NYT; you read there and see ads. If GPT eats the web and paywalls it, they are 100% free-riding.

Now, I also think the Google model is proven at this point to be a bad model, since the web is 90% ads and SEO dogshit. They strip-mined the value; it took them a while, but it's nearly decimated.


The value of ChatGPT isn't that it regurgitates the NYT. The value is that it will read the NYT and the Washington Post and Fox News and The Guardian and everything else for you and synthesise a new view from it all that represents the viewpoint you ask for.

That's completely different to Google and completely different to anything done before. It's as transformative as a human expert news analyst giving you a new perspective on a story.


> We all created the lifeblood of LLM and we’re entitled to the product.

Sounds so nice, yet there are going to be objections; the NYT, for example, doesn't think we should all be entitled to the product.


Of course, and that is partially my point: if OpenAI et al. want to make the argument that anything online is fair game, then they should release the weights. If not, they have no leg to stand on.


Whether that's true or not, the fact remains that a lot of people are spending real money in astonishingly large amounts and not asking for anything in return.

Seriously, complaining they haven't spent enough money, or didn't spend $600k making you exactly the model you wanted, is…

Let’s just say, ungracious.

Got some cake for my birthday, but it wasn’t the chocolate deluxe cream cake I wanted.

…just remember, the cake is pretty good, and it’s free. :)

Over time the cost of training models will come down and bigger open models will turn up, eventually.


> If you read around, training a 7B model costs on the order of $85,000; the Stable Diffusion 1.4 release cost around $600,000 to train.

SD 1.x is a ~1B-parameter model, so it's interesting that it cost so much more than a 7B LLM.


Yes, the size is different, but training a diffusion model and training a language model are really different, like how RL models can be small but take a long time to train as well.


Does ollama actually host the models, or is it a set of aliases to Hugging Face? And is it llama.cpp under the hood?

Trying to figure out how thick this layer is.


The training datasets are also available, which sets them apart a bit IMO.

https://huggingface.co/datasets/allenai/dolma


Open source means I have documentation to reproduce the same results. This is only true of TinyLlama and this model. The other models (Llama, Mistral) are free to use but not open source.


The Pythia models have all the training data, code, and configurations available.


Languages, sizes, and degrees of open-ness.


There's some more commentary on their open-ness in this blog too: https://www.interconnects.ai/p/olmo


That post also very helpfully links to another paper they published alongside the OLMo paper just on the dataset.

Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research

https://arxiv.org/abs/2402.00159


There are few that are >1B params, competitive, and "open source" in the sense that the necessary ingredients to re-train are available. Models like Llama and thus its descendants (including Mistral's public models) have weights available but not the training data.



