
At 8x86B, this looks like the largest open model yet by far. Would be interesting to hear how many tokens it's been trained on. That's especially important for higher-parameter models, in order to utilize all those parameters efficiently.
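For a rough sense of scale, Chinchilla-style scaling suggests on the order of ~20 training tokens per parameter for compute-optimal training (and it's debatable whether total or active parameters is the right denominator for an MoE). A quick back-of-envelope using the figures discussed further down this thread, which aren't officially confirmed:

```python
# Rough Chinchilla-style estimate of a compute-optimal token budget.
# Assumes ~20 tokens per parameter (Hoffmann et al., 2022); the parameter
# counts are the figures discussed in this thread, not official numbers.
TOKENS_PER_PARAM = 20

for label, params in [("total (314B)", 314e9), ("active (~86B)", 86e9)]:
    tokens = TOKENS_PER_PARAM * params
    print(f"{label}: ~{tokens / 1e12:.1f}T tokens would be compute-optimal")
```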


Considering how poor it is compared to other models, it really emphasises how important fine tuning is. Models with MUCH smaller parameter counts are outperforming it in many metrics.


"it really emphasises how important fine tuning is"

Or rather the quality of the training data?


We don't know since no one is releasing their data.

Calling these models open source is like calling a binary open source because you can download it.

Which in this day and age isn't far from where we're at.


A big distinction is that you can build on top of (fine-tune) these released models just as well as if they had released the pre-training data.


You can also build on top of binaries if you use gotos and machine code.


This seems intentionally obtuse. What you say is true, but it is very obvious that this is much more of a pain than if you had the source code. On the other hand, fine tuning is just as easy, regardless of whether you have the original training data.


If you don't know the original training data's statistical distribution, then catastrophic forgetting is guaranteed with any extra training.
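The usual mitigation when the original data isn't available is to mix some generic "replay" data into the fine-tuning batches as a stand-in for the original distribution. A minimal sketch of that idea; the datasets and the 0.3 mixing ratio are arbitrary placeholders, not a recommendation:

```python
import random

# Sketch: interleave task-specific fine-tuning examples with generic "replay"
# text so the model keeps seeing something close to its original distribution.
# Since the true pre-training mix is unknown, the replay set is only a proxy.
def mixed_batches(task_examples, replay_examples, replay_ratio=0.3, batch_size=8):
    while True:
        batch = []
        for _ in range(batch_size):
            pool = replay_examples if random.random() < replay_ratio else task_examples
            batch.append(random.choice(pool))
        yield batch

task_data = ["<task-specific example>"] * 100      # your fine-tuning set
replay_data = ["<generic web/corpus text>"] * 100  # proxy for the pre-training data
print(next(mixed_batches(task_data, replay_data)))
```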


I was going to ask for a reference to support this (although please provide one if handy), but the search term “catastrophic forgetting” is a great entry into the literature. Thanks.


One could also disassemble an executable and build on top of it. Not for the faint of heart and probably illegal, but possible unless it was deliberately obfuscated. Compared to that, it is impossible with state-of-the-art methods to systematically extract the training data from an LLM. Fragments, yes, but not all of it.


You can do better - generate synthetic data covering all topics. And to make it less prone to hallucination, use RAG or web search for reference material. The Phi-1.5 model was trained on 300B synthetic tokens generated with ChatGPT and it showed a 5x bump in efficiency, punching well above its weight.

Synthetic data can be more diverse if you sample carefully with seeded concepts, and it can be more complex than average web text. You can even diff against a garden variety Mistral or LLaMA and only collect knowledge and skills they don't already have. I call this approach "Machine Study", where AI makes its own training data by studying its corpus and learning from other models.
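A minimal sketch of the seeded-concepts idea: sample a couple of seed topics per prompt so the synthetic data covers the space more evenly than free-form generation would. The seed list is obviously illustrative, and `generate()` is a stand-in for whatever model or API you'd actually call:

```python
import random

# Illustrative seed concepts; a real pipeline would use thousands,
# often mined from a taxonomy or an existing corpus.
SEEDS = ["photosynthesis", "hash tables", "supply and demand",
         "plate tectonics", "Bayes' theorem", "TCP handshakes"]

def make_prompt(k=2):
    concepts = random.sample(SEEDS, k)
    return (f"Write a short, textbook-style explanation connecting "
            f"{' and '.join(concepts)}, including one worked example.")

def generate(prompt):
    # Stand-in for an actual LLM call (hosted API, local model, etc.).
    raise NotImplementedError

prompts = [make_prompt() for _ in range(5)]
print("\n".join(prompts))
```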


Or shell scripts


You can fine-tune without the pre-training data too.

Mistral models are one example, they never released pre training data and there are many fine tunes.
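And in practice most of those fine-tunes are parameter-efficient ones over the released weights. A rough sketch of what that looks like with the Hugging Face peft library; the model name and hyperparameters are just examples, not a recipe:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Load the released (open-weight) base model; no pre-training data required.
base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Attach small trainable LoRA adapters; the base weights stay frozen,
# which also limits how much of the original behaviour you can overwrite.
config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # example choice of projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the total
# ...then train on your own instruction/chat data with any standard trainer.
```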


We are in agreement -- that's exactly what I am saying :)


We should just call them open-weight models at this point.


FWIW the Grok repo uses the term "open weights".


Exactly. Training data is key for any trained AI system. And I strongly suspect that every single company active in these modern AI systems is still struggling with how to tune their training data. It's easy to get something out of it, but it's hard to control what you get out of it.


> We don't know since no one is releasing their data.

Is anyone else just assuming at this point that virtually everyone is using the pirated materials in The Pile like Books3?


I think it's really, really clear that the majority of the data used to train all of these things was used without permission.


The model is open weight, which is less useful than open source, but more useful than fully proprietary (akin to the executable binaries you compare it to).


How about "weights available" as similar to the "source available" moniker?


weights available or model available, but yes.


Their data is the Twitter corpus, which is public. Or do you want a dump of their database for free too?


Twitter tweet data in itself is both highly idiosyncratic and short by design, which alone is not conducive to training an LLM.


Saying "It's just the twitter public corpus." is like saying "Here's the Linux Kernel, makefiles not included."


Or even "here's the Linux Kernel makefiles, no sources included, enjoy".


[flagged]


Requiring a company to publish their production database for free is delusional. I haven't mentioned Musk anywhere in my comment; you must be obsessed with him.


It's fascinating you doubled down on your own straw man and still have the nerve to call others delusional.

You missed my point, which I wasn't very clear about, so my mistake. Although it doesn't seem like you're interested in understanding anyway.


That's a subtle dig at the fact that they have all of Twitter as a training corpus to use, but we don't know how they weight tweets, and we know they're not going to be weighted evenly.


I'm sure just like in X's algorithms, @elon tweets are weighted heavily.


The X algorithm is also open source, so you can verify before commenting.


The X algorithm GitHub project hasn't been updated in 8 months:

https://github.com/twitter/the-algorithm

So clearly they aren't running it in production.

Also they didn't open source the list of people who are being artificially boosted e.g. Elon.


Just because they open sourced it doesn't mean that's actually what they're running in production, though.


It's not like he needs boosting, he was one of Twitter's top followed accounts long before he bought them. He's pretty good at getting attention.


And yet it’s not enough to curb the desire to tip the scales.

https://arstechnica.com/tech-policy/2023/02/report-musk-had-...


No idea about the current state, but the open sourcing did show they were favoring Elon:

https://mashable.com/article/twitter-releases-algorithm-show...

And personally I never used Twitter much, but I certainly did not follow Elon Musk when I did - yet I had to see lots of his posts in my feed. Surely just a coincidence.


> they were favoring elon

No, and that's not what the article says either. They were just tracking how well his tweets were doing versus others. They were not favoring Elon.


"They were just tracking how well his tweets were doing versus others. "

Yeah, and adjusting it so he comes out on top. That was Musk's demand, as the other article (linked inside) shows, after a Biden tweet performed better than his:

https://mashable.com/article/elon-musk-super-bowl-joe-biden-...

They officially boost people who pay a little bit. Elon paid a lot.

And the released source is clearly not the production source, and never was in this shape - otherwise why sue someone who open sourced it?

"But, the release of this source code also comes days after Twitter forced Github to take down other parts of Twitter's source code that was allegedly posted by a former employee without the company's permission. So, clearly, there's still plenty of Twitter that Musk still doesn't want us to see."

Also, you probably missed that:

"Zoë Schiffer of Platformer reported that Twitter actually removed part of the source code that affected the reach of Musk's and other user's tweets before releasing the algorithm to the public."

Which is consistent with quite a few other statements, including from Twitter itself, and with the fact that the source has not been updated in 8 months.

See also this HN comment and discussion about it:

https://news.ycombinator.com/item?id=35391854

"But the underlying policies and models are almost entirely missing (there are a couple valuable components in [1]). Without those, we can't evaluate the behavior and possible effects of "the algorithm.""


It's not too hard to believe it is a coincidence when the most followed person on a platform shows up in your feed, especially if you follow tech accounts.


Did you not read the article linked in the comment you're replying to?


Sounds a bit far-fetched.

So changes in power users' stats would also result in audience balancing?

Most likely the code was used for analytics and for tracking balance; Elon was a pain in the ass and asked to have custom analytics for his account, and devs eventually added him as an audience to be able to get analytics about him easily. A bit dirty, but it works.

Most likely the balancing code is somewhere else and it affects only Republicans/Democrats.


> I'm sure just like in X's algorithms, @elon tweets are weighted heavily.

Are you sure or is it the literal opposite and you’re just speculating?


Aren't they usually built on most of the same training data?


Or even how much it was trained on this dataset, i.e. the amount of FLOPs.


I would say it emphasises that training a good model takes more than throwing random data and compute at it.


Current metrics are a poor way to measure the usefulness of LLMs.


No, it emphasizes the importance of training smaller models for longer, like the "overtrained" Mistral models.


Show the proof? Does it include IFT?


It's actually not the largest. https://huggingface.co/google/switch-c-2048 is 1.6T parameters.


but is switch c even usable? iirc the training set was nowhere near enough for a model of that size to be coherent in a conversation


It’s not 8x86B. Total number of parameters is 314B.

Perhaps it’s 8x39B to fit on a single 8xA100 (40GB) server?


They all do this marketing bull.

Mixtral has an 8x7B model but it's actually 46.7B, not 56B params.

Kinda similar to how 4K displays are 3840 pixels wide, not true 4K which would be 4096. Marketing people called it 4K, not engineers.
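Coming back to the Mixtral number: the 46.7B falls out of the architecture, because only the feed-forward experts are duplicated 8x while attention, embeddings and the router are shared. A rough count from the published Mixtral 8x7B config (standard SwiGLU/GQA layout, ignoring norms and other small terms):

```python
# Back-of-envelope parameter count for Mixtral 8x7B (approximate).
layers, hidden, ffn, vocab = 32, 4096, 14336, 32000
heads, kv_heads, head_dim = 32, 8, 128
experts, active = 8, 2

attn = layers * (2 * hidden * heads * head_dim        # q_proj + o_proj
                 + 2 * hidden * kv_heads * head_dim)  # k_proj + v_proj (GQA)
expert_ffn = 3 * hidden * ffn                         # SwiGLU: w1, w2, w3
embeddings = 2 * vocab * hidden                       # input + output embeddings
router = layers * hidden * experts

total = attn + layers * experts * expert_ffn + embeddings + router
active_params = attn + layers * active * expert_ffn + embeddings + router
print(f"total  = {total / 1e9:.1f}B")         # ~46.7B, not 8 * 7 = 56B
print(f"active = {active_params / 1e9:.1f}B per token (2 of 8 experts)")
```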


I've always thought of 4K as "4x FullHD". In that way it makes sense.


Bleh no, K means thousand.

For a long time we specified displays by their vertical dimension -- 480p, 720p, 1080p.

Then the marketing guys came along and decided that the horizontal dimension sounds bigger. If we stuck with the less-bullshitty way of doing things and kept comparisons 1:1, we'd call 3840x2160 displays 2160p or "2K" displays, but instead, the marketing people decided that we're going to change things to horizontal and called 3840x2160 "4K".


TV and Digital Cinema have different standards, because of course they do


It's 2x Full HD though


2x in a single direction, 4x the number of pixels


Oh yeah... What I meant is, 1920x2 = 3840 ~~ 4000


Active parameters are 86B, so wouldn't that be the size of the two largest experts (though they may all be the same size) plus the weights of the selector?
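If you take the numbers in this thread at face value (314B total, ~86B active, 2 of 8 experts routed per token) and naively assume active = shared + 2 experts, you can solve for the split; this is pure speculation about the actual layout:

```python
# Naive back-of-envelope from the figures mentioned in this thread:
#   total  = shared + 8 * expert
#   active = shared + 2 * expert
# Assumes the 314B/86B numbers and 2-of-8 routing; ignores everything else.
total, active, n_experts, n_active = 314e9, 86e9, 8, 2

expert = (total - active) / (n_experts - n_active)
shared = total - n_experts * expert
print(f"per expert = {expert / 1e9:.0f}B, "
      f"shared (attention/embeddings/router) = {shared / 1e9:.0f}B")
# Roughly 38B per expert and ~10B shared, under these assumptions.
```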


Most likely it's a MoE of Grok-0 which would be 8x33B + 50B for the router.



