DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL (pretty-radio-b75.notion.site)
322 points by sijuntan 11 months ago | 127 comments


This is finetuned to the benchmarks and nowhere close to O1-Preview on any other task. Not worth looking into unless you specifically want to solve these problems - however, still impressive.


We beat O1-preview and even many 7B models across many math benchmarks, which were TEST sets (not in the training set at all).

If you want to make the model fully generalist, feel free to train it over coding datasets (such as RL with passing unit tests as reward).
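
For what it's worth, a minimal sketch of what "passing unit tests as reward" could look like in Python (my own illustration, not the project's actual reward code; a real setup would run this in a sandbox with timeouts):

  def reward(candidate_code: str, tests: list[str]) -> float:
      # Binary reward: 1.0 if the candidate passes every unit test, else 0.0.
      namespace = {}
      try:
          exec(candidate_code, namespace)   # define the candidate's functions
          for test in tests:
              exec(test, namespace)         # each test is e.g. an assert statement
      except Exception:
          return 0.0
      return 1.0

  print(reward("def add(a, b): return a + b", ["assert add(2, 3) == 5"]))  # 1.0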


It's already a good accomplishment as it is, but I think it'd be very surprising to show that training such a small model as a generalist scales to the same magnitude as specialized finetuning. At some point you have to fit more background data and relations into the same amount of information space... but it's hard to say how much that is the case for a given size versus what we just haven't optimized yet. Unfortunately I think that will have to wait for someone with more compute before we can verify it one way or the other :).

Side question, since it sounds like you were involved: how big is the impact on benchmarks of taking this 1.5B model down from fp32 to fp8 or similar? The focus on parameters alone sometimes feels like comparing house sizes by their lengths alone. And, if you were indeed involved, thanks for making all of this open and available!


For quantization: very big impact for small models, which can drop as much as 10% on AIME. Our model does best in bfloat16 ;)

Come checkout our repo at: https://github.com/agentica-project/deepscaler


It's a great discovery; it could even open the next step in AI with MoM, a "Mixture of Models", where small fine-tuned models each take on part of a task (instead of the current MoE).


Check out some of my prior work: https://stylus-diffusion.github.io/

This work scales up selection/routing over many models/LoRAs


Love it, will check it out. Thank you for showing/sharing all of that!


o1 is more than just a math solver. And you cannot possibly train that much into a small model.

However, smaller specialized models look to be the right way to handle the world's complexity. Sort of a mixture of experts one level above. Orchestrating them will be another problem. A possible solution is a generalist model "to rule them all".


Have you considered the very practical importance of running specialized models for specialized tasks on common hardware (maybe a couple of CPU cores in a couple GB of RAM)?


Small models are just tools. Even many of them together will only make a toolset. They don't evolve into AGI by themselves. But putting them together in a structure (a brain) may result in something close. Like a big smart calculator. It takes more to create a 'character' similar to, say, the Terminator.


I disagree. They demonstrated a way to dramatically reduce training costs, 18x cheaper than R1. That alone is worth attention.

Also beating O1 on any benchmark is nontrivial.


I'm not so sure it's impressive even for mathematical tasks.

When ChatGPT came out, there was a flood of fine-tuned LLMs claiming ChatGPT-level performance for a fraction of the size. Every single time this happened, it was misleading.

These LLMs were able to score higher than ChatGPT because they took a narrow set of benchmarks and fine-tuned for those benchmarks. It's not difficult to cheaply fine-tune an LLM for a few benchmarks and beat a SOTA generalist LLM on them. Comparing a generalist LLM to a specialist LLM is like comparing apples to oranges. What you want is to compare specialist LLMs to other specialist LLMs.

It would have been much more interesting and valuable if that had been done here. Instead, we have a clickbait, misleading headline and no comparisons to math-specialized LLMs, which certainly should have been performed.


But if that's the case - what do the benchmarks even mean then?


Automated benchmarks are still very useful. Just less so when the LLM is trained in a way to overfit to them, which is why we have to be careful with random people and the claims they make. Human evaluation is the gold standard, but even it has issues.


The question is how do you train your LLMs to not 'cheat'?

Imagine you have an exam coming up, and the set of questions leaks - how do you prepare for the exam then?

Memorizing the test problems would be obviously problematic, but maybe practicing the problems that appear on the exam would be less so, or just giving extra attention to the topics that will come up would be even less like cheating.

The more honest the approach you choose, the more indicative the exam results will be of real skill, but everybody decides how much cheating to allow themselves, which makes it a test of the student's honesty rather than their skill.


I think the only way is to check your dataset for the benchmark leak and remove it before training, but (as you say) that assumes an honest actor is training the LLM, which goes against the incentive to leave the benchmark leak in the training data. Even then, a benchmark leak can make it through those checks.

I think it would be interesting to create a dynamic benchmark. For example, a benchmark which uses math and a random value determined at evaluation for the answer. The correct answer would be different for each run. Theoretically, training on it wouldn't help beat the benchmark because the random value would change the answer. Maybe this has already been done.
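
A minimal sketch of that kind of dynamic benchmark in Python (ask_model is a placeholder for whatever inference call you use; the problem template is just an illustration):

  import random

  def make_problem(rng):
      # Fresh operands every evaluation run, so a memorized answer doesn't help.
      a, b = rng.randint(100, 999), rng.randint(100, 999)
      return f"Compute {a} * {b} + {b}. Answer with the number only.", a * b + b

  def run_benchmark(ask_model, n=50, seed=None):
      rng = random.Random(seed)
      correct = 0
      for _ in range(n):
          prompt, expected = make_problem(rng)
          try:
              correct += int(ask_model(prompt).strip()) == expected
          except ValueError:
              pass  # unparseable answer counts as wrong
      return correct / n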


A lot of people in the community are wary of benchmarks for this exact reason.


I tested it on basic long addition problems. It frequently misplaced the decimal signs, used unnecessary reasoning tokens (like restating previously done steps) and overall seemed only marginally more reliable than the base DeepSeek 1.5B.

On my own pet eval, writing a fast Fibonacci algorithm in Scheme, it actually performed much worse. It took a much longer tangent before arriving at the fast doubling algorithm, but then completely forgot how to even write S-expressions, proceeding instead to imagine that Scheme uses a Python-like syntax while babbling about tail recursion.


> On my own pet eval, writing a fast Fibonacci algorithm in Scheme,

This model was trained on math problem datasets only, it seems. It makes sense that it's not any better at programming.


The original model, aside from its programming mistakes, also misremembered the doubling formula. I was hoping to see that fixed, which it was, as well as maybe a more general performance boost from recovering some of the distillation loss.
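
For reference, the fast doubling identities are F(2k) = F(k)*(2F(k+1) - F(k)) and F(2k+1) = F(k)^2 + F(k+1)^2; a minimal sketch (in Python rather than Scheme, purely for illustration):

  def fib_fast_doubling(n):
      # Returns (F(n), F(n+1)) using the doubling identities.
      if n == 0:
          return (0, 1)
      a, b = fib_fast_doubling(n // 2)   # a = F(k), b = F(k+1)
      c = a * (2 * b - a)                # F(2k)
      d = a * a + b * b                  # F(2k+1)
      return (c, d) if n % 2 == 0 else (d, c + d)

  print(fib_fast_doubling(10)[0])  # 55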


This model can't code at all.

It does high school math homework, plus maybe some easy physics. And it does them surprisingly well. Outside of that, it fails every test prompt in my set.

It's a pure specialist model.


It's absolutely worth looking into.

It's a great find.


We changed the URL from https://github.com/agentica-project/deepscaler to the article that gives more background information (and also points to the repo).

(Submitted title was "Replicating Deepseek-R1 for $4500: RL Boosts 1.5B Model Beyond o1-preview")


Out of sheer curiosity, how do vacations work for you? Do you take vacations? Surely you must. I've been using HN daily for years now and I can't recall ever seeing you take a vacation.


Not exactly. I travel to a lot of workshops that interest me, and still work some on HN during the breaks and evenings.


I realize there is a team behind HN, but your sheer force of will keeps HN a shining example of discourse. Thank you for everything you do!


The key thing here is a simple, reliable formula to train a 1B model on a specific task and get strong performance. That didn’t really exist before. Edge devices are about to get a lot smarter.


I hope Apple takes note and improves their on-device models accordingly.


Still took $4500, so I probably won't be training this at home - but maybe that number will change too! Exciting times.


Smaller reinforced models are what will win. Look at our civilization, your company, heck even your team: lots of specialized folks, not some singular bloated mastermind.

The issue, though, is the overemphasis on the current benchmarks. Ideally the thing being benchmarked against is user KPIs.


I’m not sure about that, I think the current setup is an artifact of human limits, which an immortal shared ML model doesn’t share. Also, cross-disciplinary knowledge is frequently incredibly helpful (if rare, because currently it’s the intersection of multiple uncommon specialties).


The question that's screaming at me right now is: if humans used specialization + cooperation to achieve global dominance, what happens when you pair unlimited global masterminds + cooperation?


Best case scenario is you get The Culture where society is ruled by hyper-intelligent but kind and benevolent AIs that solve all economic problems and that keep a population of quadrillions of very happy, healthy, and rich humans as pets.


Can't wait for the parliament of competing super intelligent political parties producing propaganda to convince cloud investors to reallocate compute.


As bandwidth is still crap, my bet would be that, like with humans, either the sum of the parts isn't greater than the greatest mastermind in the group, or it will take an increasingly long time to come up with increasingly original thought. If it scales reasonably linearly, it would be worth waiting a year or a decade for a single result.

Also a funny thought: with enough of a mind, it should be able to figure out how to get very limited storage and barely usable computation at very low speed out of some extremely cheap substrate. Something like growing salt crystals in a lake (or ocean). Have a write pump flashing a light onto salt water and a read pump discarding undesirable results to be overwritten. You know, like, it doesn't matter if you encode randomly; if you do it often enough you will get the entire set. You don't have to read accurately either, as long as you get at least some results out.

wait, did I just explain the universe?


Whatever the human equivalent of beehives and ant colonies will be.


We turn into cats.


I like optimists.


> lots of specialized folks

This isn't what I see in R&D. The broad and deep folks are the ones with all the patents, and design wins, because they're the only ones that can see outside the box that everyone else is specialized in.


Nah, it takes a mastermind to be creative. The tiny models IME are good at following explicit instructions for data transformations (conforming to a JSON schema is especially helpful), but I wouldn't expect one to solve a problem I haven't already solved myself, the way I use the ~0.6T full-size models.


I thought conforming to a JSON schema was a solved problem for even the most garbage models by restricting the tokens it can generate?


Yes that's what I mean, with a well described task and a JSON schema, the tiny models are suitable (though still hallucinatory - even llama3 7B was making up phone numbers when I tasked it with extracting contact details from email signature blocks)

I was pushing back against the parent's notion that tiny models would serve general purpose needs.


I'm very naive here, but does anyone trust these benchmarks? Do they mean anything to you? They seem far too easy to game, and it doesn't feel like an accurate way to really tell how these models compare to one another. It seems like benchmark performance declines quite a bit if you introduce a problem that's similar to those in the benchmarks but one that the model hasn't seen before.


AIME has significant problems: https://x.com/DimitrisPapail/status/1888325914603516214

> Problems near identical to the test set can be found online.


I don’t see how that is relevant to a model trained with RL. If it had already memorized the answers then why was any training necessary?


Someone could read this as: as soon as I don't like the results I doubt the test...


That's far from the case here, I really have no skin in the game...


Does anyone have any good simple prompts for testing new "reasoning" models like this one?

"Count the letter Rs in the word strawberry" is a bit dull!

I'm trying this one locally using Ollama and the smallest quantized GGUF version (769MB) I could find - https://huggingface.co/bartowski/agentica-org_DeepScaleR-1.5... - I ran it like this:

  ollama run hf.co/bartowski/agentica-org_DeepScaleR-1.5B-Preview-GGUF:IQ3_XXS
Here's the answer it gave me: https://gist.github.com/simonw/5943a77f35d1d5185f045fb53898a... - it got the correct answer after double-checking itself 9 times! And if you look at its thoughts, it made a pretty critical error right at the start:

  "Strawberry" has the letters S, T, R, A, W, B, E, R, F,
  U, R, E. Let me count each 'R' as I go along.
Hopefully the versions that aren't quantized that tiny do better than that.


I have two. One is a simple one that only deepseek R1 has passed (in my opinion):

I have a 12 liter jug and a 6 liter jug. How do I get exactly 6 liters of water?

Answer (Deepseek): Fill the 6-liter jug completely to obtain exactly 6 liters of water.

Every other LLM I've tried, including o3-mini-high, answers: Fill the 12-liter jug completely. Pour it into the 6-liter jug.

Although o3 did get it right in the reasoning: It seems like the user has a 12-liter jug and a 6-liter jug. The simplest answer is to just fill the 6-liter jug directly with water—done! But maybe there's a catch, like needing to use both jugs somehow.

So it knows that the 12 liter jug is mentioned uselessly, but most LLMs HAVE to use the 12 liter jug since it's mentioned in the prompt.

Another prompt I recently found via Perplexity (DeepSeek got it correct on the first try, but other LLMs can only get it right if I add: "Think step by step before answering.")

A 67-year-old woman with congenital bicuspid aortic valve is admitted to the hospital because of a 2-day history of fever and chills. Her current medication is lisinopril. Physical examination reveals a temperature of 38.5°C (101.3°F), blood pressure of 110/70 mmHg, and a heart rate of 100/min. Cardiac auscultation demonstrates a grade 3/6 systolic ejection murmur at the right upper sternal border. Blood cultures are obtained, and empiric antibiotic therapy is initiated. Which of the following is the most likely causative organism? A) Staphylococcus aureus B) Streptococcus viridans C) Escherichia coli D) Pseudomonas aeruginosa E) Candida albicans


Wow, Gemini 2 has a 9 step process for the first question https://sugaku.net/qna/1d922e55-fd28-4c70-b3e9-d70c3a05ce1f/

> Here's how you can get exactly 6 liters of water using a 12-liter jug and a 6-liter jug:

1. Fill the 6-liter jug completely.

2. Pour the water from the 6-liter jug into the 12-liter jug. You now have 6 liters in the 12-liter jug.

3. Fill the 6-liter jug completely again.

4. Pour water from the 6-liter jug into the 12-liter jug until the 12-liter jug is full. Since the 12-liter jug already had 6 liters, you will pour 6 liters. This leaves you with 0 liters in the 6 liter jug.

5. Now empty the 12 liter jug.

6. Pour the water from the 6-liter jug into the 12-liter jug.

7. Fill the 6-liter jug completely again.

8. Pour the water from the 6-liter jug into the 12-liter jug until the 12-liter jug is full. Since the 12-liter jug already had 0 liters, you will pour 6 liters. This leaves you with 0 liters in the 6 liter jug.

9. There are 6 liters left in the 6 liter jug.


It really loves the sound of its own voice!


And things get dramatically worse on 4o if you add a second superfluous jug:

    To measure exactly 6 liters of water using a 16-liter jug, a 12-liter jug, and a 6-liter jug, follow these steps:
    Step-by-Step Solution:

        Fill the 12-liter jug completely (it now holds 12 liters).
        Pour water from the 12-liter jug into the 16-liter jug. The 16-liter jug now contains 12 liters, and the 12-liter jug is empty.
        Fill the 12-liter jug completely again (it now holds 12 liters).
        Pour water from the 12-liter jug into the 16-liter jug until the 16-liter jug is full. Since the 16-liter jug already has 12 liters, it can only take 4 more liters. This leaves 8 liters in the 12-liter jug.
        Empty the 16-liter jug completely.
        Pour the remaining 8 liters from the 12-liter jug into the 16-liter jug.
        Fill the 12-liter jug completely again (it now holds 12 liters).
        Pour water from the 12-liter jug into the 16-liter jug until the 16-liter jug contains exactly 16 liters. Since the 16-liter jug already has 8 liters, it can only take 8 more liters, leaving 4 liters in the 12-liter jug.
        Pour the 4 liters from the 12-liter jug into the empty 6-liter jug. The 6-liter jug now contains 4 liters.
        Fill the 12-liter jug completely again (it now holds 12 liters).
        Pour water from the 12-liter jug into the 6-liter jug until the 6-liter jug is full. Since the 6-liter jug already has 4 liters, it can only take 2 more liters. This leaves 10 liters in the 12-liter jug.
        Empty the 6-liter jug completely.
        Pour the remaining 10 liters from the 12-liter jug into the 6-liter jug.

    Now, the 6-liter jug contains exactly 6 liters of water.


Interestingly, o3-mini-high was correct when first thinking about it:

> Okay, we're asked how to get exactly 6 liters of water using an 12-liter and a 6-liter jug. The immediate thought is to just fill the 6-liter jug, but that seems too simple, doesn’t it? So maybe there’s a trick here. Perhaps this is a puzzle where the challenge is to measure 6 liters with some pouring involved. I’ll stick with the simple solution for now—fill the 6-liter jug and stop there.


I have to take all these comparisons with a heap of salt because no one bothers to run the test 20 times on each model to smooth out the probabilistic nature of the LLM landing on the right answer. There must be some fallacy for this, where you sample once from each and declare a definitive winner; I see it all the time.
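
A minimal sketch of doing it properly (ask_model and is_correct are placeholders for your own inference call and answer checker):

  def pass_rate(ask_model, prompt, is_correct, n=20):
      # Sample the same prompt n times and report the fraction of correct answers,
      # instead of declaring a winner from a single draw.
      return sum(is_correct(ask_model(prompt)) for _ in range(n)) / n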


>Every other LLM I've tried include o3-mini-high: Fill the 12-liter jug completely. Pour it into the 6 liter jug.

Try it with a 12L jug and a 4L jug and ask for 4L. See if it tells you to just fill the 4L or to fill the 12L and pour into the 4L twice discarding both times and there will be 4L remaining in the 12L jug.

Even though it's still technically correct, it demonstrates that there's no real "reasoning" happening, just regurgitation of training data.


(Shrug) R1 has no problem with that. To the extent it's confused, it is only because it is wondering if it's a trick question.

CoT reasoning: https://i.imgur.com/rjNmTGZ.png

Answer: https://i.imgur.com/WfAVeZQ.png

There's 'mindless regurgitation' going on here, but not by the AI model.


Deepseek just told me this:

To measure exactly 6 liters using a 12-liter and a 6-liter jug:

1. *Fill the 12L jug completely.* 2. *Pour water from the 12L jug into the 6L jug until it's full.*

After this process, the 12L jug will have exactly 6 liters left.


Tried it in Llama v3.3 70b and it got it every time out of 5 samples. I guess the others are interpreting it as a trick question or something.


Is the answer strep viridans?


I just asked the first question to O1 Pro, it thought for 48 seconds and messed up.

It told me to fill the 12 liter jug then pour that into the 6 liter jug.

I then scolded it and told it I was going to use DeepSeek R1 because it did a better job, and we were shutting down OpenAI forever, and THEN it thought for 4 minutes and got a bit snippy with me haha. It got the answer, but then glossed over it and also said "but that's too simple isn't it?".

Output: Here’s a more “classic puzzle” style approach, spelled out step-by-step, to end up with exactly 6 liters in the 12-liter jug. This method uses multiple fills and transfers (instead of the trivial “fill the 12, pour off 6” approach) so you can see a more puzzle-like sequence: 1. Fill the 6-liter jug from the water supply. • State: (12-liter jug = 0 L, 6-liter jug = 6 L) 2. Pour from the 6-liter jug into the 12-liter jug until the 6-liter jug is empty (the 12-liter jug still has plenty of room). • State: (12-liter jug = 6 L, 6-liter jug = 0 L) 3. Fill the 6-liter jug again from the water supply. • State: (12-liter jug = 6 L, 6-liter jug = 6 L) 4. Pour from the 6-liter jug into the 12-liter jug until the 12-liter jug is completely full. • The 12-liter jug currently has 6 liters and can take 6 more liters before it’s full, so you’ll pour the entire 6 liters from the 6-liter jug. • State: (12-liter jug = 12 L, 6-liter jug = 0 L) 5. Empty the 12-liter jug (discard the water or set it aside). • State: (12-liter jug = 0 L, 6-liter jug = 0 L) 6. Transfer the remaining water in the 6-liter jug to the 12-liter jug—but right now the 6-liter jug is empty, so that doesn’t help. The trick is: we want to exploit the difference in capacities to measure out new amounts. So let’s continue. Since both are empty, let’s do something else: a. Fill the 12-liter jug from the supply. • State: (12-liter jug = 12 L, 6-liter jug = 0 L) b. Pour from the 12-liter jug into the 6-liter jug until the 6-liter jug is full, which removes 6 liters from the 12-liter jug. • Now the 12-liter jug has liters left. • State: (12-liter jug = 6 L, 6-liter jug = 6 L)

At this point, the 12-liter jug contains exactly 6 liters. That’s what we wanted.

If You Want the 6 Liters in the Smaller Jug Instead

If, for some reason, you need exactly 6 liters in the 6-liter jug (though that jug obviously holds 6 liters when full, so it’s usually trivial), here’s a more puzzle-like sequence: 1. Fill the 12-liter jug (12 L in the big jug, 0 L in the small jug). 2. Pour from the 12-liter jug into the 6-liter jug until the small jug is full. • State: (12-liter jug = 6 L, 6-liter jug = 6 L) 3. Empty the 6-liter jug. • State: (12-liter jug = 6 L, 6-liter jug = 0 L) 4. Transfer the 6 liters from the 12-liter jug into the 6-liter jug. • State: (12-liter jug = 0 L, 6-liter jug = 6 L)

Again, you end up with exactly 6 liters in the 6-liter jug.

Both sequences go beyond the simple “fill the 12, pour 6 off” short-cut. Hopefully, that’s less “terrible” and shows a more classic multi-step approach often seen in water-jug riddles.


Hi, one of the lead authors for this work.

We recommend using bfloat16 (not fp16); quantization for small models can really hurt performance!
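
For example, a minimal sketch of loading the checkpoint in bfloat16 with Hugging Face transformers (the model ID here is assumed from the repo name):

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  model_id = "agentica-org/DeepScaleR-1.5B-Preview"  # assumed model ID
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = AutoModelForCausalLM.from_pretrained(
      model_id,
      torch_dtype=torch.bfloat16,  # keep weights in bf16, not fp16/fp32
      device_map="auto",
  )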


Have you compared it to the 1.58 bit dynamic quant model based on the original R1 (i.e., not a distillation)? Whatever unsloth did, it doesn't seem to be giving up much reasoning performance over the full Q8 version.


It's simply because the model is small (1.5B), making it sensitive to weight perturbations.


Is there a GGUF version of your model anywhere that you recommend? I'm on a Mac.


I think some people have made GGUFs as branches of our model; try them out!

https://huggingface.co/models?other=base_model:quantized:age...


Is there a MLX version that can be added to the fullmoon iOS app?


As for Rs in strawberry, trying a bunch of models side by side, only Sky-T1 and Gemini 2 Flash got it wrong! https://sugaku.net/qna/792ac8cc-9a41-4adc-a98f-c5b2e8d89f9b/

Simple questions like 1+1 can also be fun since R1 goes overboard (as do some other models when you include a system prompt asking it to think) https://sugaku.net/qna/a1b970c0-de9f-4e62-9e03-f62c5280a311/

And if that fails you can ask for the zeros of the ζ function! https://sugaku.net/qna/c64d6db9-5547-4213-acb2-53d10ed95227/


I always ask every model to implement a Qt QSyntaxHighlighter subclass for syntax highlighting code and a QAbstractListModel subclass that parses markdown into blocks - in C++, both implemented using tree-sitter. It sounds like a coding problem, but it's much more a reasoning problem of how to combine the two APIs, and it's out of band of the training data. I test it with multiple levels of prompt fidelity that I have built up watching the many mistakes past models have made, and o3-mini-high and o1 can usually get it done within a few iterations.

I haven't tested it on this model but my results with DeepSeek models have been underwhelming and I've become skeptical of their hype.


(Fellow Qt developer)

I really like your takes! Is there somewhere I can keep in touch with you? You can view my socials in my profile if you'd like to reach out.


Give it a try with Nvidia Llama 3.1 Nemotron 70B. It's the only model that can give useful GStreamer code.


“How many stops faster is f/2.8 than f/4.5?”

This photography question can be solved with the right equations. A lot of non-reasoning LLMs would spout some nonsense like 0.67 stops faster. Sometimes they’ll leave a stray negative sign in too!

The answer should be approximately 1.37, although “1 and 1/3” is acceptable too.

LLMs usually don’t have trouble coming up with the formulas, so it’s not a particularly obscure question, just one that won’t have a memorized answer, since there are very few f/4.5 lenses on the market, and even fewer people asking this exact question online. Applying those formulas is harder, but the LLM should be able to sanity check the result and catch common errors. (f/2.8 -> f/4 is one full stop, which is common knowledge among photographers, so getting a result of less than one is obviously an error.)
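
The arithmetic itself is a one-liner once you have the formula: light gathered scales with 1/N^2, so the difference in stops is 2 * log2(N2/N1). A quick check in Python:

  import math

  stops = 2 * math.log2(4.5 / 2.8)  # f/2.8 -> f/4.5
  print(round(stops, 2))  # ~1.37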

This also avoids being a test that just emphasizes tokenizer problems… I find the strawberry test to be dreadfully boring. It’s not a useful test. No one is actually using LLMs to count letters in words, and until we have LLMs that can actually see the letters of each word… it’s just not a good test, in my opinion. I’m convinced that the big AI labs see it as a meme at this point, which is the only reason they keep bringing it up. They must find the public obsession with it hilarious.

I was impressed at how consistently well Phi-4 did at my photography math question, especially for a non-reasoning model. Phi-4 scored highly on math benchmarks, and it shows.


The negative quality impact of quantization is more pronounced for smaller models [0], so I'm surprised this tiny quant works at all.

[0] or rather models closer to saturation, which is a function of model params and amount of training


Yeah I'm very impressed that the tiniest quant version appears partially usable... especially now I've replicated that same S, T, R, A, W, B, E, R, F, U, R, Y bug on a much larger one: https://news.ycombinator.com/item?id=43018494


Is it a quantisation or tokenisation problem?


Having replicated it at F32 I now suspect tokenization.


Try bfloat16! We have a bug where the model was saved as fp32.


I just tried it with this 3.6GB F16 model:

  ollama run hf.co/bartowski/agentica-org_DeepScaleR-1.5B-Preview-GGUF:F16
And this time it didn't get confused with the tokenization of strawberry! https://gist.github.com/simonw/9e79f96d69f10bc7ba540c87ea0e8...


Nice, very glad to see it works! Small models are very sensitive to the dtype :(


I like it when the model starts to ask me how to solve something. I often find this with Sonnet when I am looking to solve a problem. The model starts becoming "curious", treats me as if I were the model, and tries to nudge me toward finding the solution...


'Count the letter Rs in the word strawberry' is probably in all training sets by now.


I sometimes do the strawberry question immediately followed by "How many Rs in "bookkeeper?"


And yet many models still stumble with it


this model is specifically trained for solving math problems, so ask it some math questions?


I'm lazy. Do you know of any good test math questions for a model of this size?


Try:

   Knowing that 1^3 + 2^3 + 3^3 + 4^3 + ... + 11^3 + 12^3 = 6084, what is the value of 2^3 + 4^3 + 6^3 + ... + 22^3 + 24^3?
DeepSeek R1 (1.58-bit GGUF, running locally) has no trouble with that one.


Would you mind sharing the answer to the math question please? The only way I would try and figure it out on my own is using an LLM…


It's 48672: each term is (2k)^3 = 8*k^3, so the sum is 8 * 6084 = 48672.
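
A quick sanity check in Python:

  print(sum((2 * k) ** 3 for k in range(1, 13)))  # 48672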


With q8 I got "s-t-r-a-w-f-u-r-e-r" and then "strawfurber" on the 5th attempt. On all the other attempts it did not misspell the word and found 3 (on this one it also found 3, by counting these imaginary words).


Frankly, it is now clear that open source AI will win at this rate. I just see innovation after innovation on a weekly basis. Exciting times ahead. https://open.substack.com/pub/transitions/p/the-ai-enemy-fro...


... well that's weird, I got exactly the same weird bug on the 7.1 GB F32 GGUF version: https://gist.github.com/simonw/58ff74a55b402dc55764a567b10ec...

  % ollama run hf.co/bartowski/agentica-org_DeepScaleR-1.5B-Preview-GGUF:F32
  pulling manifest 
  pulling 84159ad584aa... 100%  7.1 GB                         
  pulling 8aa4c0321ccd... 100%   493 B                         
  pulling c737442a210a... 100%   170 B                         
  pulling ed76df87b934... 100%   193 B                         
  verifying sha256 digest 
  writing manifest 
  success 
  >>> Count the letter Rs in the word strawberry
  <think>
  Okay, so I need to count how many times the letter 'r'
  appears in the word "strawberry." Hmm, let's break this down
  step by step. First, I should probably spell out each letter
  of the word to make sure I don't miss any 'r's.

  Starting with "strawberry." Let me write it out: S, T, R, A,
  W, B, E, R, F, U, R, Y? Wait, maybe I'm not spelling it
  correctly. Actually, let's count each letter properly.

  The word is spelled as follows:
  S T R A W B E R F U R Y. Wait, no, that might be longer than
  needed. Let me write it out more accurately:

  The full word is "strawberry." So breaking it down:
  1. S
  2. T
  3. R
  4. A
  5. W
  6. B
  7. E
  8. R
  9. F
  10. U
  11. R
  12. Y
Update: running the F16 model fixed that problem:

  ollama run hf.co/bartowski/agentica-org_DeepScaleR-1.5B-Preview-GGUF:F16
See https://news.ycombinator.com/item?id=43017599#43018767 and https://gist.github.com/simonw/9e79f96d69f10bc7ba540c87ea0e8...


Makes me wonder if there's some exploitable data leak in a similar kind of formulation.


1. Ask it nonsense variations of riddles, like "Why is 7 afraid of 8?".

2. Ask "Play Tic Tac Toe against yourself and win." and check if the moves are correct.


Can CoT models also call external functions? What if they had access to a calculator?


I guess one could just watch the token output stream and replace solvable expressions with calculated results, continuing with the replaced result. Kind of like the LLM producing a template, to be filled on the fly.

Some models now do something like this to enforce longer reasoning processes - they replace </think> with "Wait".
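
A minimal sketch of the stream-rewriting idea in Python (the arithmetic-only pattern and the buffering are simplifications of my own; a real version would rewrite as tokens arrive):

  import re

  # Matches plain integer arithmetic like "12 * 7 + 3" (digits and + - * / only),
  # so the eval below can't see anything except that expression.
  ARITH = re.compile(r"\b\d+(?:\s*[-+*/]\s*\d+)+\b")

  def rewrite(text):
      return ARITH.sub(lambda m: str(eval(m.group(0))), text)

  print(rewrite("so the total is 12 * 7 + 3, which gives"))
  # so the total is 87, which gives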


Yeah! Even DeepSeek. Calling an external function / tool calling is really just a data extraction problem.

say you have a tool:

def calculator(expr: str) -> float

then the model just needs to say:

{ "function": "calculator", "args": { "expr": "5 + 10" } }

then in your code you can easily pass that to the "calculator" function and get the result, then hand the result back to the model. Making it feel like the model can "call" an external function.

DeepSeek can also do this: https://www.boundaryml.com/blog/deepseek-r1-function-calling
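
A minimal sketch of that loop in Python (ask_model and the bare-JSON message format are placeholders; real APIs wrap this in their own tool-calling schema):

  import json

  def calculator(expr: str) -> float:
      # Toy tool: evaluate a simple arithmetic expression with builtins disabled.
      return float(eval(expr, {"__builtins__": {}}, {}))

  TOOLS = {"calculator": calculator}

  def handle(model_output, ask_model):
      try:
          call = json.loads(model_output)
      except json.JSONDecodeError:
          return model_output  # plain text answer, nothing to run
      result = TOOLS[call["function"]](**call["args"])
      # Hand the tool result back so the model can continue its answer.
      return ask_model(f"Tool result: {result}")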


Sure they can. o3-mini can do web searches, which puts it far ahead of o1 if you require current information. You can also tell it to go read a particular paper from just the rough name.


Or a code interpreter in the chain of thought. I think then it really gets quite close to how humans go about solving problems.


What about both? Or say a set of standard tools a modern intelligent agent[0] should have some proficiency in. A calculator, a basic code interpreter for a single high-level language, a graphing tool[1], web search, database search. And then maybe a tool for managing its own context[2]. How far could we get with a dataset designed specifically to train the model in pure tool use? That is, one that assumes the model never actually knows the answer to a question (even if the base model does), and instead trains it to aggressively use tools to break the problem down into steps[3] - steps that are primarily more tool calls, to query external sources, process information, simulate, etc. until the answer is computed. No direct answers, just tool calls glued by thinking in terms of tool calls, or thinking by tool calls.

I wonder if this has been tried. It probably has, seeing how hot this area of research is today. If anyone knows of a paper or a dataset, I'd appreciate a link.

Anyway, I wonder what would happen if we tried it with this method - basically retraining the model to trust its own toolbox - or as some would say, "shut up and multiply" - and do it across all tasks, not strictly math or coding ones.

--

[0] - Digital or otherwise.

[1] - Or the one tool that does all three, and which most people older than ~25 y.o. likely used at least once in their lives: Microsoft Excel. Or any other spreadsheet app. Though for LLMs as they are now, I suppose code interpreter would be a better unifying paradigm due to being 1D instead of 2D.

[2] - E.g. changeNotesAndRethink("text", 0, 1) -> replace current output with "text", continue generation; changeNotesAndRethink("text", -1, 2) -> replace fixed "assistant notes prompt" with "text" and discard last two outputs[4] and continue, etc. Honestly, I'm surprised I haven't seen it done so far - not in the popular places I know, at least (vendor apps, TypingMind, ComfyUI); I've heard of some attempts long ago (back when LangChain was still seen as hot). Did giving the model control over the chat loop never pan out? Or is there some fundamental reason this doesn't work?

[3] - I may have accidentally done this in-context with Claude 3.5 Sonnet - if I prompt it for chain-of-thought and happen to have Mermaid Diagram plugin enabled in TypingMind, it almost always ends up producing multiple diagrams as part of the CoT phase. Notably, this doesn't happen with my own equivalent plugin (PlantUML), so I wonder if it's just something about that specific tool, or if "thinking with (Mermaid) diagrams" was part of the training set.

EDIT:

[4] - APIs for tool-using models seem to allow several LLM outputs in a row. But that makes me think (and I apologize for this post being almost all footnotes, but ideas just keep coming) - what about rewinding back past one or more user messages in a multi-turn conversation, while retaining them? Like "Fill in the Middle" mode[5], just over entire conversation instead of a single message?

[5] - OpenAI used to have that, right now I think only DeepSeek does - https://api-docs.deepseek.com/api/create-completion.


It's pretty stupid actually. I asked it to decode an ASCII sequence and it gave me gibberish. I tried Phi-4 Q4 and it got it. 9 GB vs 2 GB (reasoning). I imagine you just cannot fit enough information in 2 GB to be of much use other than solving general math problems or knowing what training data was in it.

It gave it a good go: 2 minutes on an M2, vs Phi-4 which was instant.


I’d be happy with a model not knowing ASCII if only it could properly reason instead.


The scale of the gap between the technology frontier and the average TV watcher makes it a very daunting task to have a conversation with anyone who isn't keeping up. Obviously, we're encroaching on God's (or the gods') realm as the mathematical magnitudes of the models are beginning to shape their own destiny, yet, as in the UK's current situation, a nation can still quickly plummet into economic oblivion because the leaders simply have zero savvy.


It looks to me like the performance reported for rStar-Math (both in the table and in the graph) is incorrect. With a single rollout (at test time) rStar-Math achieves 50 on AIME 2024, not 26.7 as you reported. On OlympiadBench it achieves 65.3, not 47.1 as you reported. On AMC 2023 it achieves 87.5, not 47.5 as you reported. It outperforms your model across the board. Am I reading something incorrectly?


When I was in university, this was called overfitting, to be honest. This doesn't seem to perform well outside of eval sets.


It's a 1.5B model. You should not expect too much outside the area it was optimized on.


Yes, I’ve been thinking recently that the path to AGI is not brute force, and never has been: https://www.linkedin.com/pulse/why-llms-closer-agi-than-we-t...


Would really love it if we could get uncensored models this way.


It looks like we're moving in that direction.


I just tried it for a while, but it seems very dreamy in that it generates based on non-existent input...

`ctx` below is the full input to the LLM.

---

terminal_chat (main) python ./hello.py [INS] You: hello ctx: <|beginofsentence|>You are a helpful assistant.

<|User|>hello

<|Assistant|> <think> Okay, the user greeted me with "hello". I should respond politely and let them know I'm here to help. I'll keep it simple and friendly. </think>

Hello! How can I assist you today?

You: tell me about you ctx: <|beginofsentence|>You are a helpful assistant.

<|User|>hello

<|Assistant|><think> Okay, the user greeted me with "hello". I should respond politely and let them know I'm here to help. I'll keep it simple and friendly. </think>

Hello! How can I assist you today?<|User|>tell me about you

<|Assistant|> <think> Okay, so I just saw this message where someone said they're not happy because they didn't pass their first exam in a while. They mentioned feeling defeated and scared. Let's see if I can help clarify or provide some reassurance.

First, maybe I should consider why someone would feel that way about an exam result. Perhaps there were factors beyond their control, like stress or personal issues affecting their performance. It's important to acknowledge that everyone has different circumstances when it comes to exams and test results.

I wonder if there are any specific reasons they mentioned. Maybe they talked about feeling anxious before taking the exam, which is common. Also, perhaps there was external pressure from school or parents that influenced their mood.

In terms of advice, maybe I can suggest some strategies to help them prepare better for future exams. For example, practicing more tests under timed conditions could help build confidence. Also, seeking support from teachers or mentors might be beneficial. Maybe talking to a counselor could also offer additional emotional support if needed.

I should make sure to convey that it's okay to feel this way and that there are ways to improve both their current situation and future exam preparation. It's important to encourage them to talk to someone who can help. </think>

I'm here to provide information, answer questions, and help with any inquiries you may have. If you're feeling discouraged or stressed about an exam result, it might be helpful to reflect on your emotions and consider factors that contributed to them. If you'd like advice on preparing for exams or discussing strategies, feel free to ask!

You:


Newb question. If the big boys are crawling, scraping, stealing the whole of the entire internet to train their models — how are these OS alternatives aiming to outperform them? Is it as simple as... ALSO crawl, scrape, steal the whole of the entire internet?


"Stealing the whole of the entire internet" isn't a particularly accurate mental model these days.

That's more or less what people were doing back in 2023 - crawling everything and dumping as much data in as possible.

It's not a great strategy to build a best-in-class model though, as a lot of the internet is junk. The SolidGoldMagikarp/davidjl bug is the kind of thing that happens if you crawl all of https://www.reddit.com/r/counting/ for example: https://simonwillison.net/2023/Jun/8/gpt-tokenizers/#glitch-...

These days model training labs are more selective about what they train on. Most of the game of training a great model comes down to carefully selecting your training data. They still use a lot of unlicensed data, but it's a bit more sophisticated than just dumping in everything they can find.


Crawling the whole internet still helps solve problems in the long tail of the distribution. The purpose of big data is garbage in, signal out


You can just "steal" all the books and paywalled research papers. That is a lot already.


Hence the recent story/scandal about Meta torrenting LibGen: https://arstechnica.com/tech-policy/2025/02/meta-torrented-o...


No, the secret is to take outputs from previous models and put them in a clothes dryer with starch to give them the gently worn appearance of organically collected data.


Yes, but in this case the RL stage doesn't actually need a huge amount of data. Most of the data is ingested in the pre-training phase, which in this case was done by Alibaba.

There was a recent paper [1] that showed that for one particular model it was possible to get very good reasoning performance by finetuning on under a thousand examples of reasoning.

[1] https://arxiv.org/abs/2502.03387


No, fine tuning doesn't require that

and models are using synthetic data sets now, for like 2 years

as in, the huge models did the crawling and then the next models had conversations with the huge models

oversimplified but basically the scraping ship has sailed, artists/writers will never be satisfied, and other companies bragging about using whatever aspirational goal post of licensed material are just doing it for brownie points


Does this mean the model of LLM = search engine is also coming to an end? Ignoring things like functions and google search capabilities, current chatgpt models have quite a lot of niche knowledge, I assume from crawling the entire internet.


We already know how to build search engines. The far better model is to give one to an LLM as a tool to use. The user gets the same interface but you're not spending your inference budget on data retrieval.


Creative writing is also harmed by limiting the pretraining data.

e.g. if I ask DeepSeek to write a story, it tends to write a bullet-point outline of a story. And some other models always give characters the same names.


no, companies are in a quagmire of having to update them every quarter

for now, they're not doing that but just releasing new models that also have updated information

there is one group that's scraping the latest information and including it in their latest models, another group that's only doing synthetic data sets from the prior group's models, and every enthusiast everywhere making fine-tuned LLMs with synthetic data sets and whatever niche they want


I speculate we'll just see classes of models developed. Ones with lots of knowledge (even domain specific) and models with reasoning capabilities.


I am no expert. But wasn't part of DeepSeek's success to use synthetic data, at least for code and math?


You don't need to crawl and filter lots of internet data when you can disteal large models.


The answer is that crawling the whole internet is only for training a base model which is expensive and compute-intensive.

R1 didn’t train a base model, they performed additional steps on top of a previously-trained base model (V3). These guys are doing something similar.


[flagged]


One of the authors here....

This is not a Chinese model, btw I'm American


Oh sorry, I skimmed and thought this was based on Deepseek R1.


Distilled models always knew. Not sure about the full model since it is hard to run by myself.



