At 8x86B, this looks like the largest open model yet by far. Would be interesting to hear how many tokens it's been trained on; that's especially important for higher-param models in order to utilize all those parameters efficiently.
Considering how poor it is compared to other models, it really emphasises how important fine tuning is. Models with MUCH smaller parameter counts are outperforming it in many metrics.
This seems intentionally obtuse. What you say is true, but it is very obvious that this is much more of a pain than if you had the source code. On the other hand, fine tuning is just as easy, regardless of whether you have the original training data.
I was going to ask for a reference to support this (although please provide one if handy), but the search term “catastrophic forgetting” is a great entry into the literature. Thanks.
One could also disassemble an executable and build on top of it. Not for the faint of heart and probably illegal, but possible unless it was deliberately obfuscated. Compared to that, it is impossible with state-of-the-art methods to systematically extract the training data from an LLM. Fragments yes, but not all of it.
You can do better - generate synthetic data covering all topics. And to make it less prone to hallucination, use RAG or web search for reference material. The Phi-1.5 model was trained on 300B synthetic tokens generated with ChatGPT, and it showed a 5x bump in efficiency, punching well above its weight.
Synthetic data can be more diverse if you sample carefully with seeded concepts, and it can be more complex than average web text. You can even diff against a garden-variety Mistral or LLaMA and only collect knowledge and skills they don't already have. I call this approach "Machine Study", where AI makes its own training data by studying its corpus and learning from other models.
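To make the seeded-concepts idea concrete, here's a minimal sketch of that kind of pipeline. It's not anyone's actual setup: the endpoint URL, model name, topic list, and retrieve_reference() stub are all placeholders, and it assumes any OpenAI-compatible chat completions API as the teacher.

    import json
    import random
    import requests

    # Placeholder endpoint/model for an OpenAI-compatible chat API (assumption, not a real deployment).
    API_URL = "http://localhost:8000/v1/chat/completions"
    MODEL = "teacher-model"

    SEED_CONCEPTS = ["photosynthesis", "TCP congestion control", "double-entry bookkeeping",
                     "Bayes' theorem", "supply and demand", "garbage collection"]

    def retrieve_reference(topic):
        # Stub: in practice this would hit a search engine or a RAG index
        # and return a few grounding passages, so the teacher hallucinates less.
        return f"(reference passages about {topic} would go here)"

    def generate_example(topic_a, topic_b):
        reference = retrieve_reference(topic_a)
        prompt = (
            f"Using the reference material below, write a short lesson that connects "
            f"'{topic_a}' with '{topic_b}', followed by two Q&A pairs.\n\n"
            f"Reference:\n{reference}"
        )
        resp = requests.post(API_URL, json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.9,
        })
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]

    with open("synthetic.jsonl", "w") as out:
        for _ in range(1000):
            # Sampling *pairs* of seed concepts pushes the generator toward
            # combinations that are rare in ordinary web text.
            a, b = random.sample(SEED_CONCEPTS, 2)
            text = generate_example(a, b)
            out.write(json.dumps({"topics": [a, b], "text": text}) + "\n")

The filtering step described above (diffing against a baseline Mistral or LLaMA and keeping only what it gets wrong) would slot in right before writing each record.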
Exactly. Training data is key for any trained AI system. And I strongly suspect that every single company active in these modern AI systems is still struggling with how to tune their training data. It's easy to get something out of it, but it's hard to control what you get out of it.
Requiring a company to publish their production database for free is delusional. I haven't mentioned Musk anywhere in my comment; you must be obsessed with him.
That's a subtle dig at the fact that they have all of Twitter as a training corpus to use, but we don't know how they weight tweets. We do know they're not going to be weighted evenly.
And personally I never used Twitter much, but I certainly did not follow Elon Musk when I did - yet I had to see lots of his posts in my feed. Surely just a coincidence.
"They were just tracking how well his tweets were doing versus others. "
Yeah, and adjusting it so he comes out on top. That was Musk's demand after a Biden tweet performed better than his, as the other article linked inside shows:
They officially boost people who pay a little bit. Elon paid a lot.
And the released source is clearly not the production source and was never in this shape - otherwise, why sue someone who open-sourced it?
"But, the release of this source code also comes days after Twitter forced Github to take down other parts of Twitter's source code that was allegedly posted by a former employee without the company's permission. So, clearly, there's still plenty of Twitter that Musk still doesn't want us to see."
Also, you probably missed that:
"Zoë Schiffer of Platformer reported that Twitter actually removed part of the source code that affected the reach of Musk's and other user's tweets before releasing the algorithm to the public."
Which is consistent with quite a few other statements, including from Twitter itself, and with the fact that the source has not been updated in 8 months.
"But the underlying policies and models are almost entirely missing (there are a couple valuable components in [1]). Without those, we can't evaluate the behavior and possible effects of "the algorithm.""
It's not too hard to believe it is a coincidence when the most followed person on a platform shows up in your feed, especially if you follow tech accounts.
So changes in power users' stats would also result in audience balancing?
Most likely the code was used for analytics and for tracking balance; Elon was a pain in the ass and asked for custom analytics for his account, so the devs eventually added him as an audience to be able to pull analytics about him easily. A bit dirty, but it works.
Most likely the balancing code is somewhere else, and it affects only the Republican / Democrat audiences.
For a long time we specified displays by their vertical dimension -- 480p, 720p, 1080p.
Then the marketing guys came along and decided that the horizontal dimension sounds bigger. If we had stuck with the less-bullshitty way of doing things and kept comparisons 1:1, we'd call 3840x2160 displays 2160p or "2K" displays. Instead, the marketing people switched to the horizontal dimension and called 3840x2160 "4K".