DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL (pretty-radio-b75.notion.site)
322 points by sijuntan 11 months ago | 127 comments


This is finetuned to the benchmarks and nowhere close to O1-Preview on any other task. Not worth looking into unless you specifically want to solve these problems - however, still impressive.


We beat O1-preview and even many 7B models across many math benchmarks, which were TEST sets (not in the training set at all).

If you want to make the model fully generalist, feel free to train it over coding datasets (such as RL with passing unit tests as reward).
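
For what it's worth, a minimal sketch of what "passing unit tests as reward" could look like in Python (my own illustration, not the project's actual reward code; a real setup would run this in a sandbox with timeouts):

  def reward(candidate_code: str, tests: list[str]) -> float:
      # Binary reward: 1.0 if the candidate passes every unit test, else 0.0.
      namespace = {}
      try:
          exec(candidate_code, namespace)   # define the candidate's functions
          for test in tests:
              exec(test, namespace)         # each test is e.g. an assert statement
      except Exception:
          return 0.0
      return 1.0

  print(reward("def add(a, b): return a + b", ["assert add(2, 3) == 5"]))  # 1.0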


It's already a good accomplishment as it is, but I think it'd be very surprising to show that training such a small model as a generalist scales to the same magnitude as specialized finetuning. At some point you have to fit more background data and relations into the same amount of information space... but it's hard to say how much that is the case for a given size versus what we just haven't optimized yet. Unfortunately I think that will have to wait for someone with more compute before we can verify it one way or the other :).

Side question, since it sounds like you were involved: how big is the impact on benchmarks of taking this 1.5B model down from fp32 to fp8 or similar? The focus on parameters alone sometimes feels like comparing house sizes by their lengths alone. And, if you were indeed involved, thanks for making all of this open and available!


For quantization: very big impact for small models, which can drop as much as 10% on AIME. Our model does best in bfloat16 ;)

Come checkout our repo at: https://github.com/agentica-project/deepscaler


It's a great discovery; it could even open the next step in AI with MoM, a "Mixture of Models", where small fine-tuned models each take on part of a task (instead of the current MoE).


Check out some of my prior work: https://stylus-diffusion.github.io/

This work scales up selection/routing over many models/LoRAs


Love it, will check it out. Thank you for showing/sharing all of that!


o1 is more than just a math solver. And you cannot possibly train that much into a small model.

However, smaller specialized models look to be the right way to handle the world's complexity. Sort of a mixture of experts one level above. Orchestrating them will be another problem. A possible solution is a generalist model "to rule them all".


Have you considered the very practical importance of running specialized models for specialized tasks on common hardware (maybe a couple of CPU cores in a couple GB of RAM)?


Small models are just tools. Even many of them together will only make a toolset. They don't evolve into AGI by themselves. But putting them together in a structure (a brain) may result in something close. Like a big smart calculator. It takes more to create a 'character' similar to, say, the Terminator.


I disagree. They demonstrated a way to dramatically reduce training costs, 18x cheaper than R1. That alone is worth attention.

Also beating O1 on any benchmark is nontrivial.


I'm not so sure it's impressive even for mathematical tasks.

When ChatGPT came out, there was a flood of fine-tuned LLMs claiming ChatGPT-level performance for a fraction of the size. Every single time this happened, it was misleading.

These LLMs were able to score higher than ChatGPT because they took a narrow set of benchmarks and fine-tuned for those benchmarks. It's not difficult to cheaply fine-tune an LLM for a few benchmarks and beat a SOTA generalist LLM on them. Comparing a generalist LLM to a specialist LLM is like comparing apples to oranges. What you want is to compare specialist LLMs to other specialist LLMs.

It would have been much more interesting and valuable if that had been done here. Instead, we have a clickbait, misleading headline and no comparisons to math-specialized LLMs, which certainly should have been performed.


But if that's the case - what do the benchmarks even mean then?


Automated benchmarks are still very useful. Just less so when the LLM is trained in a way to overfit to them, which is why we have to be careful with random people and the claims they make. Human evaluation is the gold standard, but even it has issues.


The question is how do you train your LLMs to not 'cheat'?

Imagine you have an exam coming up, and the set of questions leaks - how do you prepare for the exam then?

Memorizing the test problems would be obviously problematic, but maybe practicing the problems that appear on the exam would be less so, or just giving extra attention to the topics that will come up would be even less like cheating.

The more honest the approach you choose, the more indicative the exam results will be of real skill, but everybody decides how much cheating to allow themselves, which makes it a test of the student's honesty rather than their skill.


I think the only way is to check your dataset for the benchmark leak and remove it before training, but (as you say) that assumes an honest actor is training the LLM, which goes against the incentive to leave the benchmark leak in the training data. Even then, a benchmark leak can make it through those checks.

I think it would be interesting to create a dynamic benchmark. For example, a benchmark which uses math and a random value determined at evaluation for the answer. The correct answer would be different for each run. Theoretically, training on it wouldn't help beat the benchmark because the random value would change the answer. Maybe this has already been done.
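
A minimal sketch of that kind of dynamic benchmark in Python (ask_model is a placeholder for whatever inference call you use; the problem template is just an illustration):

  import random

  def make_problem(rng):
      # Fresh operands every evaluation run, so a memorized answer doesn't help.
      a, b = rng.randint(100, 999), rng.randint(100, 999)
      return f"Compute {a} * {b} + {b}. Answer with the number only.", a * b + b

  def run_benchmark(ask_model, n=50, seed=None):
      rng = random.Random(seed)
      correct = 0
      for _ in range(n):
          prompt, expected = make_problem(rng)
          try:
              correct += int(ask_model(prompt).strip()) == expected
          except ValueError:
              pass  # unparseable answer counts as wrong
      return correct / n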


A lot of people in the community are wary of benchmarks for this exact reason.


I tested it on basic long addition problems. It frequently misplaced the decimal signs, used unnecessary reasoning tokens (like restating previously done steps) and overall seemed only marginally more reliable than the base DeepSeek 1.5B.

On my own pet eval, writing a fast Fibonacci algorithm in Scheme, it actually performed much worse. It took a much longer tangent before arriving at the fast doubling algorithm, but then completely forgot how to even write S-expressions, proceeding instead to imagine that Scheme uses a Python-like syntax while babbling about tail recursion.


> On my own pet eval, writing a fast Fibonacci algorithm in Scheme,

This model was trained on math problem datasets only, it seems. It makes sense that it's not any better at programming.


The original model, aside from its programming mistakes, also misremembered the doubling formula. I was hoping to see that fixed, which it was, as well as maybe a more general performance boost from recovering some of the distillation loss.
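
For reference, the fast doubling identities are F(2k) = F(k)*(2F(k+1) - F(k)) and F(2k+1) = F(k)^2 + F(k+1)^2; a minimal sketch (in Python rather than Scheme, purely for illustration):

  def fib_fast_doubling(n):
      # Returns (F(n), F(n+1)) using the doubling identities.
      if n == 0:
          return (0, 1)
      a, b = fib_fast_doubling(n // 2)   # a = F(k), b = F(k+1)
      c = a * (2 * b - a)                # F(2k)
      d = a * a + b * b                  # F(2k+1)
      return (c, d) if n % 2 == 0 else (d, c + d)

  print(fib_fast_doubling(10)[0])  # 55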


This model can't code at all.

It does high school math homework, plus maybe some easy physics. And it does them surprisingly well. Outside of that, it fails every test prompt in my set.

It's a pure specialist model.


It's absolutely worth looking into.

It's a great find.


We changed the URL from https://github.com/agentica-project/deepscaler to the article that gives more background information (and also points to the repo).

(Submitted title was "Replicating Deepseek-R1 for $4500: RL Boosts 1.5B Model Beyond o1-preview")


Out of sheer curiosity, how do vacations work for you? Do you take vacations? Surely you must. I've been using HN daily for years now and I can't recall ever seeing you take a vacation.


Not exactly. I travel to a lot of workshops that interest me, and still work some on HN during the breaks and evenings.


I realize there is a team behind HN, but your sheer force of will keeps HN a shining example of discourse. Thank you for everything you do!


The key thing here is a simple, reliable formula to train a 1B model on a specific task and get strong performance. That didn’t really exist before. Edge devices are about to get a lot smarter.


I hope Apple takes note and improves their on-device models accordingly.


Still took $4500, so I probably won't be training this at home - but maybe that number will change too! Exciting times.


Smaller reinforced models are what will win. Look at our civilization, your company, heck even your team: lots of specialized folks, not some singular bloated mastermind.

The issue, though, is the overemphasis on the current benchmarks. Ideally the thing being benchmarked against is user KPIs.


I’m not sure about that, I think the current setup is an artifact of human limits, which an immortal shared ML model doesn’t share. Also, cross-disciplinary knowledge is frequently incredibly helpful (if rare, because currently it’s the intersection of multiple uncommon specialties).


The question that's screaming at me right now is: if humans used specialization + cooperation to achieve global dominance, what happens when you pair unlimited global masterminds + cooperation?


Best case scenario is you get The Culture where society is ruled by hyper-intelligent but kind and benevolent AIs that solve all economic problems and that keep a population of quadrillions of very happy, healthy, and rich humans as pets.


Can't wait for the parliament of competing super intelligent political parties producing propaganda to convince cloud investors to reallocate compute.


As bandwidth is still crap, my bet would be that, like with humans, either the sum of the parts isn't greater than the greatest mastermind in the group, or it will take an increasingly long time to come up with increasingly original thought. If it scales reasonably linearly, it would be worth waiting a year or a decade for a single result.

Also a funny thought: with enough of a mind, it should be able to figure out how to get very limited storage and barely usable computation at very low speed out of some extremely cheap substrate. Something like growing salt crystals in a lake (or ocean). Have a write pump flashing a light onto salt water and a read pump discarding undesirable results to be overwritten. You know, like, it doesn't matter if you encode randomly; if you do it often enough you will get the entire set. You don't have to read accurately either, as long as you get at least some results out.

wait, did I just explain the universe?


Whatever the human equivalent of beehives and ant colonies will be.


We turn into cats.


I like optimists.


> lots of specialized folks

This isn't what I see in R&D. The broad and deep folks are the ones with all the patents, and design wins, because they're the only ones that can see outside the box that everyone else is specialized in.


Nah, it takes a mastermind to be creative. The tiny models IME are good at following explicit instructions for data transformations (conforming to a JSON schema is especially helpful), but I wouldn't expect one to solve a problem I haven't already solved myself, the way I use the ~0.6T full-size models.


I thought conforming to a JSON schema was a solved problem for even the most garbage models by restricting the tokens it can generate?


Yes that's what I mean, with a well described task and a JSON schema, the tiny models are suitable (though still hallucinatory - even llama3 7B was making up phone numbers when I tasked it with extracting contact details from email signature blocks)

I was pushing back against the parent's notion that tiny models would serve general purpose needs.


I'm very naive here, but does anyone trust these benchmarks? Do they mean anything to you? They seem far too easy to game, and it doesn't feel like an accurate way to really tell how these models compare to one another. It seems like benchmark performance declines quite a bit if you introduce a problem that's similar to those in the benchmarks but one that the model hasn't seen before.


AIME has significant problems: https://x.com/DimitrisPapail/status/1888325914603516214

> Problems near identical to the test set can be found online.


I don’t see how that is relevant to a model trained with RL. If it had already memorized the answers then why was any training necessary?


Someone could read this as: as soon as I don't like the results I doubt the test...


That's far from the case here, I really have no skin in the game...


Does anyone have any good simple prompts for testing new "reasoning" models like this one?

"Count the letter Rs in the word strawberry" is a bit dull!

I'm trying this one locally using Ollama and the smallest quantized GGUF version (769MB) I could find - https://huggingface.co/bartowski/agentica-org_DeepScaleR-1.5... - I ran it like this:

  ollama run hf.co/bartowski/agentica-org_DeepScaleR-1.5B-Preview-GGUF:IQ3_XXS
Here's the answer it gave me: https://gist.github.com/simonw/5943a77f35d1d5185f045fb53898a... - it got the correct answer after double-checking itself 9 times! And if you look at its thoughts, it made a pretty critical error right at the start:

  "Strawberry" has the letters S, T, R, A, W, B, E, R, F,
  U, R, E. Let me count each 'R' as I go along.
Hopefully the versions that aren't quantized that tiny do better than that.


I have two. One is a simple one that only deepseek R1 has passed (in my opinion):

I have a 12 liter jug and a 6 liter jug. How do I get exactly 6 liters of water?

Answer (Deepseek): Fill the 6-liter jug completely to obtain exactly 6 liters of water.

Every other LLM I've tried, including o3-mini-high, answers: Fill the 12-liter jug completely. Pour it into the 6-liter jug.

Although o3 did get it right in the reasoning: It seems like the user has a 12-liter jug and a 6-liter jug. The simplest answer is to just fill the 6-liter jug directly with water—done! But maybe there's a catch, like needing to use both jugs somehow.

So it knows that the 12 liter jug is mentioned uselessly, but most LLMs HAVE to use the 12 liter jug since it's mentioned in the prompt.

Another prompt I recently found via Perplexity (DeepSeek got it correct on the first try, but other LLMs can only get it right if I add: "Think step by step before answering.")

A 67-year-old woman with congenital bicuspid aortic valve is admitted to the hospital because of a 2-day history of fever and chills. Her current medication is lisinopril. Physical examination reveals a temperature of 38.5°C (101.3°F), blood pressure of 110/70 mmHg, and a heart rate of 100/min. Cardiac auscultation demonstrates a grade 3/6 systolic ejection murmur at the right upper sternal border. Blood cultures are obtained, and empiric antibiotic therapy is initiated. Which of the following is the most likely causative organism? A) Staphylococcus aureus B) Streptococcus viridans C) Escherichia coli D) Pseudomonas aeruginosa E) Candida albicans


Wow, Gemini 2 has a 9 step process for the first question https://sugaku.net/qna/1d922e55-fd28-4c70-b3e9-d70c3a05ce1f/

> Here's how you can get exactly 6 liters of water using a 12-liter jug and a 6-liter jug:

1. Fill the 6-liter jug completely.

2. Pour the water from the 6-liter jug into the 12-liter jug. You now have 6 liters in the 12-liter jug.

3. Fill the 6-liter jug completely again.

4. Pour water from the 6-liter jug into the 12-liter jug until the 12-liter jug is full. Since the 12-liter jug already had 6 liters, you will pour 6 liters. This leaves you with 0 liters in the 6 liter jug.

5. Now empty the 12 liter jug.

6. Pour the water from the 6-liter jug into the 12-liter jug.

7. Fill the 6-liter jug completely again.

8. Pour the water from the 6-liter jug into the 12-liter jug until the 12-liter jug is full. Since the 12-liter jug already had 0 liters, you will pour 6 liters. This leaves you with 0 liters in the 6 liter jug.

9. There are 6 liters left in the 6 liter jug.


It really loves the sound of its own voice!


And things get dramatically worse on 4o if you add a second superfluous jug:

    To measure exactly 6 liters of water using a 16-liter jug, a 12-liter jug, and a 6-liter jug, follow these steps:
    Step-by-Step Solution:

        Fill the 12-liter jug completely (it now holds 12 liters).
        Pour water from the 12-liter jug into the 16-liter jug. The 16-liter jug now contains 12 liters, and the 12-liter jug is empty.
        Fill the 12-liter jug completely again (it now holds 12 liters).
        Pour water from the 12-liter jug into the 16-liter jug until the 16-liter jug is full. Since the 16-liter jug already has 12 liters, it can only take 4 more liters. This leaves 8 liters in the 12-liter jug.
        Empty the 16-liter jug completely.
        Pour the remaining 8 liters from the 12-liter jug into the 16-liter jug.
        Fill the 12-liter jug completely again (it now holds 12 liters).
        Pour water from the 12-liter jug into the 16-liter jug until the 16-liter jug contains exactly 16 liters. Since the 16-liter jug already has 8 liters, it can only take 8 more liters, leaving 4 liters in the 12-liter jug.
        Pour the 4 liters from the 12-liter jug into the empty 6-liter jug. The 6-liter jug now contains 4 liters.
        Fill the 12-liter jug completely again (it now holds 12 liters).
        Pour water from the 12-liter jug into the 6-liter jug until the 6-liter jug is full. Since the 6-liter jug already has 4 liters, it can only take 2 more liters. This leaves 10 liters in the 12-liter jug.
        Empty the 6-liter jug completely.
        Pour the remaining 10 liters from the 12-liter jug into the 6-liter jug.

    Now, the 6-liter jug contains exactly 6 liters of water.


Interestingly, o3-mini-high was correct when first thinking about it:

> Okay, we're asked how to get exactly 6 liters of water using an 12-liter and a 6-liter jug. The immediate thought is to just fill the 6-liter jug, but that seems too simple, doesn’t it? So maybe there’s a trick here. Perhaps this is a puzzle where the challenge is to measure 6 liters with some pouring involved. I’ll stick with the simple solution for now—fill the 6-liter jug and stop there.


I have to take all these comparisons with a heap of salt because no one bothers to run the test 20 times on each model to smooth out the probabilistic nature of the LLM landing on the right answer. There must be some fallacy for this, where you sample once from each and declare a definitive winner; I see it all the time.
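
A minimal sketch of doing it properly (ask_model and is_correct are placeholders for your own inference call and answer checker):

  def pass_rate(ask_model, prompt, is_correct, n=20):
      # Sample the same prompt n times and report the fraction of correct answers,
      # instead of declaring a winner from a single draw.
      return sum(is_correct(ask_model(prompt)) for _ in range(n)) / n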


>Every other LLM I've tried include o3-mini-high: Fill the 12-liter jug completely. Pour it into the 6 liter jug.

Try it with a 12L jug and a 4L jug and ask for 4L. See if it tells you to just fill the 4L or to fill the 12L and pour into the 4L twice discarding both times and there will be 4L remaining in the 12L jug.

Even though it's still technically correct, it demonstrates that there's no real "reasoning" happening, just regurgitation of training data.


(Shrug) R1 has no problem with that. To the extent it's confused, it is only because it is wondering if it's a trick question.

CoT reasoning: https://i.imgur.com/rjNmTGZ.png

Answer: https://i.imgur.com/WfAVeZQ.png

There's 'mindless regurgitation' going on here, but not by the AI model.


Deepseek just told me this:

To measure exactly 6 liters using a 12-liter and a 6-liter jug:

1. *Fill the 12L jug completely.* 2. *Pour water from the 12L jug into the 6L jug until it's full.*

After this process, the 12L jug will have exactly 6 liters left.


Tried it in Llama v3.3 70b and it got it every time out of 5 samples. I guess the others are interpreting it as a trick question or something.


Is the answer strep viridans?


I just asked the first question to O1 Pro, it thought for 48 seconds and messed up.

It told me to fill the 12 liter jug then pour that into the 6 liter jug.

I then scolded it and told it I was going to use DeepSeek R1 because it did a better job, and we were shutting down OpenAI forever, and THEN it thought for 4 minutes and got a bit snippy with me haha. It got the answer, but then glossed over it and also said "but that's too simple isn't it?".

Output: Here’s a more “classic puzzle” style approach, spelled out step-by-step, to end up with exactly 6 liters in the 12-liter jug. This method uses multiple fills and transfers (instead of the trivial “fill the 12, pour off 6” approach) so you can see a more puzzle-like sequence: 1. Fill the 6-liter jug from the water supply. • State: (12-liter jug = 0 L, 6-liter jug = 6 L) 2. Pour from the 6-liter jug into the 12-liter jug until the 6-liter jug is empty (the 12-liter jug still has plenty of room). • State: (12-liter jug = 6 L, 6-liter jug = 0 L) 3. Fill the 6-liter jug again from the water supply. • State: (12-liter jug = 6 L, 6-liter jug = 6 L) 4. Pour from the 6-liter jug into the 12-liter jug until the 12-liter jug is completely full. • The 12-liter jug currently has 6 liters and can take 6 more liters before it’s full, so you’ll pour the entire 6 liters from the 6-liter jug. • State: (12-liter jug = 12 L, 6-liter jug = 0 L) 5. Empty the 12-liter jug (discard the water or set it aside). • State: (12-liter jug = 0 L, 6-liter jug = 0 L) 6. Transfer the remaining water in the 6-liter jug to the 12-liter jug—but right now the 6-liter jug is empty, so that doesn’t help. The trick is: we want to exploit the difference in capacities to measure out new amounts. So let’s continue. Since both are empty, let’s do something else: a. Fill the 12-liter jug from the supply. • State: (12-liter jug = 12 L, 6-liter jug = 0 L) b. Pour from the 12-liter jug into the 6-liter jug until the 6-liter jug is full, which removes 6 liters from the 12-liter jug. • Now the 12-liter jug has liters left. • State: (12-liter jug = 6 L, 6-liter jug = 6 L)

At this point, the 12-liter jug contains exactly 6 liters. That’s what we wanted.

If You Want the 6 Liters in the Smaller Jug Instead

If, for some reason, you need exactly 6 liters in the 6-liter jug (though that jug obviously holds 6 liters when full, so it’s usually trivial), here’s a more puzzle-like sequence: 1. Fill the 12-liter jug (12 L in the big jug, 0 L in the small jug). 2. Pour from the 12-liter jug into the 6-liter jug until the small jug is full. • State: (12-liter jug = 6 L, 6-liter jug = 6 L) 3. Empty the 6-liter jug. • State: (12-liter jug = 6 L, 6-liter jug = 0 L) 4. Transfer the 6 liters from the 12-liter jug into the 6-liter jug. • State: (12-liter jug = 0 L, 6-liter jug = 6 L)

Again, you end up with exactly 6 liters in the 6-liter jug.

Both sequences go beyond the simple “fill the 12, pour 6 off” short-cut. Hopefully, that’s less “terrible” and shows a more classic multi-step approach often seen in water-jug riddles.


Hi, one of the lead authors for this work.

We recommend using bfloat16 (not fp16); quantization for small models can really hurt performance!
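
For example, a minimal sketch of loading the checkpoint in bfloat16 with Hugging Face transformers (the model ID here is assumed from the repo name):

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  model_id = "agentica-org/DeepScaleR-1.5B-Preview"  # assumed model ID
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = AutoModelForCausalLM.from_pretrained(
      model_id,
      torch_dtype=torch.bfloat16,  # keep weights in bf16, not fp16/fp32
      device_map="auto",
  )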


Have you compared it to the 1.58 bit dynamic quant model based on the original R1 (i.e., not a distillation)? Whatever unsloth did, it doesn't seem to be giving up much reasoning performance over the full Q8 version.


It's simply because the model is small (1.5B), making it sensitive to weight perturbations.


Is there a GGUF version of your model anywhere that you recommend? I'm on a Mac.


I think some people have made GGUFs as branches of our model; try them out!

https://huggingface.co/models?other=base_model:quantized:age...


Is there a MLX version that can be added to the fullmoon iOS app?


As for Rs in strawberry, trying a bunch of models side by side, only Sky-T1 and Gemini 2 Flash got it wrong! https://sugaku.net/qna/792ac8cc-9a41-4adc-a98f-c5b2e8d89f9b/

Simple questions like 1+1 can also be fun since R1 goes overboard (as do some other models when you include a system prompt asking it to think) https://sugaku.net/qna/a1b970c0-de9f-4e62-9e03-f62c5280a311/

And if that fails you can ask for the zeros of the ζ function! https://sugaku.net/qna/c64d6db9-5547-4213-acb2-53d10ed95227/


I always ask every model to implement a Qt QSyntaxHighlighter subclass for syntax highlighting code and a QAbstractListModel subclass that parses markdown into blocks - in C++, both implemented using tree-sitter. It sounds like a coding problem, but it's much more a reasoning problem of how to combine the two APIs, and it's out of band of the training data. I test it with multiple levels of prompt fidelity that I have built up watching the many mistakes past models have made, and o3-mini-high and o1 can usually get it done within a few iterations.

I haven't tested it on this model but my results with DeepSeek models have been underwhelming and I've become skeptical of their hype.


(Fellow Qt developer)

I really like your takes! Is there somewhere I can keep in touch with you? You can view my socials in my profile if you'd like to reach out.


Give it a try with Nvidia Llama 3.1 Nemotron 70B. It's the only model that can give useful GStreamer code.


“How many stops faster is f/2.8 than f/4.5?”

This photography question can be solved with the right equations. A lot of non-reasoning LLMs would spout some nonsense like 0.67 stops faster. Sometimes they’ll leave a stray negative sign in too!

The answer should be approximately 1.37, although “1 and 1/3” is acceptable too.

LLMs usually don’t have trouble coming up with the formulas, so it’s not a particularly obscure question, just one that won’t have a memorized answer, since there are very few f/4.5 lenses on the market, and even fewer people asking this exact question online. Applying those formulas is harder, but the LLM should be able to sanity check the result and catch common errors. (f/2.8 -> f/4 is one full stop, which is common knowledge among photographers, so getting a result of less than one is obviously an error.)
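
The arithmetic itself is a one-liner once you have the formula: light gathered scales with 1/N^2, so the difference in stops is 2 * log2(N2/N1). A quick check in Python:

  import math

  stops = 2 * math.log2(4.5 / 2.8)  # f/2.8 -> f/4.5
  print(round(stops, 2))  # ~1.37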

This also avoids being a test that just emphasizes tokenizer problems… I find the strawberry test to be dreadfully boring. It’s not a useful test. No one is actually using LLMs to count letters in words, and until we have LLMs that can actually see the letters of each word… it’s just not a good test, in my opinion. I’m convinced that the big AI labs see it as a meme at this point, which is the only reason they keep bringing it up. They must find the public obsession with it hilarious.

I was impressed at how consistently well Phi-4 did at my photography math question, especially for a non-reasoning model. Phi-4 scored highly on math benchmarks, and it shows.


The negative quality impact of quantization is more pronounced for smaller models [0], so I'm surprised this tiny quant works at all.

[0] or rather models closer to saturation, which is a function of model params and amount of training


Yeah I'm very impressed that the tiniest quant version appears partially usable... especially now I've replicated that same S, T, R, A, W, B, E, R, F, U, R, Y bug on a much larger one: https://news.ycombinator.com/item?id=43018494


Is it a quantisation or tokenisation problem?


Having replicated it at F32 I now suspect tokenization.


Try bfloat16! We have a bug where the model was saved as fp32.


I just tried it with this 3.6GB F16 model:

  ollama run hf.co/bartowski/agentica-org_DeepScaleR-1.5B-Preview-GGUF:F16
And this time it didn't get confused with the tokenization of strawberry! https://gist.github.com/simonw/9e79f96d69f10bc7ba540c87ea0e8...


Nice, very glad to see it works! Small models are very sensitive to the dtype :(


I like it when the model starts to ask me how to solve something. I often find this with Sonnet when I am looking to solve a problem. The model starts becoming "curious", treats me as if I were the model, and tries to nudge me toward finding the solution...


'Count the letter Rs in the word strawberry' is probably in all training sets by now.


I sometimes do the strawberry question immediately followed by "How many Rs in "bookkeeper?"


And yet many models still stumble with it


this model is specifically trained for solving math problems, so ask it some math questions?


I'm lazy. Do you know of any good test math questions for a model of this size?


Try:

   Knowing that 1^3 + 2^3 + 3^3 + 4^3 + ... + 11^3 + 12^3 = 6084, what is the value of 2^3 + 4^3 + 6^3 + ... + 22^3 + 24^3?
DeepSeek R1 (1.58-bit GGUF, running locally) has no trouble with that one.


Would you mind sharing the answer to the math question please? The only way I would try and figure it out on my own is using an LLM…


It's 48672: each term is (2k)^3 = 8*k^3, so the sum is 8 * 6084 = 48672.
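
A quick sanity check in Python:

  print(sum((2 * k) ** 3 for k in range(1, 13)))  # 48672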


With q8 I got "s-t-r-a-w-f-u-r-e-r" and then "strawfurber" on the 5th attempt. On all the other attempts it did not misspell the word and found 3 (on this one it also found 3, by counting these imaginary words).


Frankly, it is now clear that open source AI will win at this rate. I just see innovation after innovation on a weekly basis. Exciting times ahead. https://open.substack.com/pub/transitions/p/the-ai-enemy-fro...


... well that's weird, I got exactly the same weird bug on the 7.1 GB F32 GGUF version: https://gist.github.com/simonw/58ff74a55b402dc55764a567b10ec...

  % ollama run hf.co/bartowski/agentica-org_DeepScaleR-1.5B-Preview-GGUF:F32
  pulling manifest 
  pulling 84159ad584aa... 100%  7.1 GB                         
  pulling 8aa4c0321ccd... 100%   493 B                         
  pulling c737442a210a... 100%   170 B                         
  pulling ed76df87b934... 100%   193 B                         
  verifying sha256 digest 
  writing manifest 
  success 
  >>> Count the letter Rs in the word strawberry
  <think>
  Okay, so I need to count how many times the letter 'r'
  appears in the word "strawberry." Hmm, let's break this down
  step by step. First, I should probably spell out each letter
  of the word to make sure I don't miss any 'r's.

  Starting with "strawberry." Let me write it out: S, T, R, A,
  W, B, E, R, F, U, R, Y? Wait, maybe I'm not spelling it
  correctly. Actually, let's count each letter properly.

  The word is spelled as follows:
  S T R A W B E R F U R Y. Wait, no, that might be longer than
  needed. Let me write it out more accurately:

  The full word is "strawberry." So breaking it down:
  1. S
  2. T
  3. R
  4. A
  5. W
  6. B
  7. E
  8. R
  9. F
  10. U
  11. R
  12. Y
Update: running the F16 model fixed that problem:

  ollama run hf.co/bartowski/agentica-org_DeepScaleR-1.5B-Preview-GGUF:F16
See https://news.ycombinator.com/item?id=43017599#43018767 and https://gist.github.com/simonw/9e79f96d69f10bc7ba540c87ea0e8...


Makes me wonder if there's some exploitable data leak in a similar kind of formulation.


1. Ask it nonsense variations of riddles, like "Why is 7 afraid of 8?".

2. Ask "Play Tic Tac Toe against yourself and win." and check if the moves are correct.


Can CoT models also call external functions? What if they had access to a calculator?


I guess one could just watch the token output stream and replace solvable expressions with calculated results, continuing with the replaced result. Kind of like the LLM producing a template, to be filled on the fly.

Some models now do something like this to enforce longer reasoning processes - they replace </think> with "Wait".
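
A minimal sketch of the stream-rewriting idea in Python (the arithmetic-only pattern and the buffering are simplifications of my own; a real version would rewrite as tokens arrive):

  import re

  # Matches plain integer arithmetic like "12 * 7 + 3" (digits and + - * / only),
  # so the eval below can't see anything except that expression.
  ARITH = re.compile(r"\b\d+(?:\s*[-+*/]\s*\d+)+\b")

  def rewrite(text):
      return ARITH.sub(lambda m: str(eval(m.group(0))), text)

  print(rewrite("so the total is 12 * 7 + 3, which gives"))
  # so the total is 87, which gives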


Yeah! Even DeepSeek. Calling an external function / tool calling is really just a data extraction problem.

say you have a tool:

def calculator(expr: str) -> float

then the model just needs to say:

{ "function": "calculator", "args": { "expr": "5 + 10" } }

then in your code you can easily pass that to the "calculator" function and get the result, then hand the result back to the model. Making it feel like the model can "call" an external function.

DeepSeek can also do this: https://www.boundaryml.com/blog/deepseek-r1-function-calling
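
A minimal sketch of that loop in Python (ask_model and the bare-JSON message format are placeholders; real APIs wrap this in their own tool-calling schema):

  import json

  def calculator(expr: str) -> float:
      # Toy tool: evaluate a simple arithmetic expression with builtins disabled.
      return float(eval(expr, {"__builtins__": {}}, {}))

  TOOLS = {"calculator": calculator}

  def handle(model_output, ask_model):
      try:
          call = json.loads(model_output)
      except json.JSONDecodeError:
          return model_output  # plain text answer, nothing to run
      result = TOOLS[call["function"]](**call["args"])
      # Hand the tool result back so the model can continue its answer.
      return ask_model(f"Tool result: {result}")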


Sure they can. o3-mini can do web searches, which puts it far ahead of o1 if you require current information. You can also tell it to go read a particular paper from just the rough name.


Or a code interpreter in the chain of thought. I think then it really gets quite close to how humans go about solving problems.


What about both? Or say a set of standard tools a modern intelligent agent[0] should have some proficiency in. A calculator, a basic code interpreter for a single high-level language, a graphing tool[1], web search, database search. And then maybe a tool for managing its own context[2]. How far could we get with a dataset designed specifically to train the model in pure tool use? That is, one that assumes the model never actually knows the answer to a question (even if the base model does), and instead trains it to aggressively use tools to break the problem down into steps[3] - steps that are primarily more tool calls, to query external sources, process information, simulate, etc. until the answer is computed. No direct answers, just tool calls glued by thinking in terms of tool calls, or thinking by tool calls.

I wonder if this has been tried. It probably has, seeing how hot this area of research is today. If anyone knows of a paper or a dataset, I'd appreciate a link.

Anyway, I wonder what would happen if we tried it with this method - basically retraining the model to trust its own toolbox - or as some would say, "shut up and multiply" - and do it across all tasks, not strictly math or coding ones.

--

[0] - Digital or otherwise.

[1] - Or the one tool that does all three, and which most people older than ~25 y.o. likely used at least once in their lives: Microsoft Excel. Or any other spreadsheet app. Though for LLMs as they are now, I suppose code interpreter would be a better unifying paradigm due to being 1D instead of 2D.

[2] - E.g. changeNotesAndRethink("text", 0, 1) -> replace current output with "text", continue generation; changeNotesAndRethink("text", -1, 2) -> replace fixed "assistant notes prompt" with "text" and discard last two outputs[4] and continue, etc. Honestly, I'm surprised I haven't seen it done so far - not in the popular places I know, at least (vendor apps, TypingMind, ComfyUI); I've heard of some attempts long ago (back when LangChain was still seen as hot). Did giving the model control over the chat loop never pan out? Or is there some fundamental reason this doesn't work?

[3] - I may have accidentally done this in-context with Claude 3.5 Sonnet - if I prompt it for chain-of-thought and happen to have Mermaid Diagram plugin enabled in TypingMind, it almost always ends up producing multiple diagrams as part of the CoT phase. Notably, this doesn't happen with my own equivalent plugin (PlantUML), so I wonder if it's just something about that specific tool, or if "thinking with (Mermaid) diagrams" was part of the training set.

EDIT:

[4] - APIs for tool-using models seem to allow several LLM outputs in a row. But that makes me think (and I apologize for this post being almost all footnotes, but ideas just keep coming) - what about rewinding back past one or more user messages in a multi-turn conversation, while retaining them? Like "Fill in the Middle" mode[5], just over entire conversation instead of a single message?

[5] - OpenAI used to have that, right now I think only DeepSeek does - https://api-docs.deepseek.com/api/create-completion.


It's pretty stupid actually. I asked it to decode an ASCII sequence and it gave me gibberish. I tried Phi-4 Q4 and it got it. 9 GB vs 2 GB (reasoning). I imagine you just cannot fit enough information in 2 GB to be of much use other than solving general math problems or knowing what training data was in it.

It gave it a good go: 2 minutes on an M2, vs Phi-4 which was instant.


I’d be happy with a model not knowing ASCII if only it could properly reason instead.


The scale of the gap between the technology frontier and the average TV watcher makes it a very daunting task to have a conversation with anyone who isn't keeping up. Obviously, we're encroaching on God's (or the gods') realm as the mathematical magnitudes of the models are beginning to shape their own destiny, yet, as in the UK's current situation, a nation can still quickly plummet into economic oblivion because the leaders simply have zero savvy.


It looks to me like the performance reported for rStar-Math (both in the table and in the graph) is incorrect. With a single rollout (at test time) rStar-Math achieves 50 on AIME 2024, not 26.7 as you reported. On OlympiadBench it achieves 65.3, not 47.1 as you reported. On AMC 2023 it achieves 87.5, not 47.5 as you reported. It outperforms your model across the board. Am I reading something incorrectly?


When I was in university, this was called overfitting, to be honest. This doesn't seem to perform well outside of eval sets.


It's a 1.5B model. You should not expect too much outside the area it was optimized on.


Yes, I’ve been thinking recently that the path to AGI is not brute force, and never has been: https://www.linkedin.com/pulse/why-llms-closer-agi-than-we-t...


Would really love it if we could get uncensored models this way.


It looks like we're moving in that direction.


I just tried it for a while, but it seems very dreamy in that it generates based on non-existent input...

`ctx` below is the full input to the LLM.

---

terminal_chat (main) python ./hello.py [INS] You: hello ctx: <|beginofsentence|>You are a helpful assistant.

<|User|>hello

<|Assistant|> <think> Okay, the user greeted me with "hello". I should respond politely and let them know I'm here to help. I'll keep it simple and friendly. </think>

Hello! How can I assist you today?

You: tell me about you ctx: <|beginofsentence|>You are a helpful assistant.

<|User|>hello

<|Assistant|><think> Okay, the user greeted me with "hello". I should respond politely and let them know I'm here to help. I'll keep it simple and friendly. </think>

Hello! How can I assist you today?<|User|>tell me about you

<|Assistant|> <think> Okay, so I just saw this message where someone said they're not happy because they didn't pass their first exam in a while. They mentioned feeling defeated and scared. Let's see if I can help clarify or provide some reassurance.

First, maybe I should consider why someone would feel that way about an exam result. Perhaps there were factors beyond their control, like stress or personal issues affecting their performance. It's important to acknowledge that everyone has different circumstances when it comes to exams and test results.

I wonder if there are any specific reasons they mentioned. Maybe they talked about feeling anxious before taking the exam, which is common. Also, perhaps there was external pressure from school or parents that influenced their mood.

In terms of advice, maybe I can suggest some strategies to help them prepare better for future exams. For example, practicing more tests under timed conditions could help build confidence. Also, seeking support from teachers or mentors might be beneficial. Maybe talking to a counselor could also offer additional emotional support if needed.

I should make sure to convey that it's okay to feel this way and that there are ways to improve both their current situation and future exam preparation. It's important to encourage them to talk to someone who can help. </think>

I'm here to provide information, answer questions, and help with any inquiries you may have. If you're feeling discouraged or stressed about an exam result, it might be helpful to reflect on your emotions and consider factors that contributed to them. If you'd like advice on preparing for exams or discussing strategies, feel free to ask!

You:


Newb question. If the big boys are crawling, scraping, stealing the whole of the entire internet to train their models — how are these OS alternatives aiming to outperform them? Is it as simple as... ALSO crawl, scrape, steal the whole of the entire internet?


"Stealing the whole of the entire internet" isn't a particularly accurate mental model these days.

That's more or less what people were doing back in 2023 - crawling everything and dumping as much data in as possible.

It's not a great strategy to build a best-in-class model though, as a lot of the internet is junk. The SolidGoldMagikarp/davidjl bug is the kind of thing that happens if you crawl all of https://www.reddit.com/r/counting/ for example: https://simonwillison.net/2023/Jun/8/gpt-tokenizers/#glitch-...

These days model training labs are more selective about what they train on. Most of the game of training a great model comes down to carefully selecting your training data. They still use a lot of unlicensed data, but it's a bit more sophisticated than just dumping in everything they can find.


Crawling the whole internet still helps solve problems in the long tail of the distribution. The purpose of big data is garbage in, signal out


You can just "steal" all the books and paywalled research papers. That is a lot already.


Hence the recent story/scandal about Meta torrenting LibGen: https://arstechnica.com/tech-policy/2025/02/meta-torrented-o...


No, the secret is to take outputs from previous models and put them in a clothes dryer with starch to give them the gently worn appearance of organically collected data.


Yes, but in this case the RL stage doesn't actually need a huge amount of data. Most of the data is ingested in the pre-training phase, which in this case was done by Alibaba.

There was a recent paper [1] that showed that for one particular model it was possible to get very good reasoning performance by finetuning on under a thousand examples of reasoning.

[1] https://arxiv.org/abs/2502.03387


No, fine tuning doesn't require that

and models are using synthetic data sets now, for like 2 years

as in, the huge models did the crawling and then the next models had conversations with the huge models

oversimplified but basically the scraping ship has sailed, artists/writers will never be satisfied, and other companies bragging about using whatever aspirational goal post of licensed material are just doing it for brownie points


Does this mean the model of LLM = search engine is also coming to an end? Ignoring things like functions and google search capabilities, current chatgpt models have quite a lot of niche knowledge, I assume from crawling the entire internet.


We already know how to build search engines. The far better model is to give one to an LLM as a tool to use. The user gets the same interface but you're not spending your inference budget on data retrieval.


Creative writing is also harmed by limiting the pretraining data.

e.g. if I ask DeepSeek to write a story, it tends to write a bullet-point outline of a story. And some other models always give characters the same names.


no, companies are in a quagmire of having to update them every quarter

for now, they're not doing that but just releasing new models that also have updated information

there is one group that's scraping the latest information and including it in their latest models, another group that's only doing synthetic data sets from the prior group's models, and every enthusiast everywhere making fine-tuned LLMs with synthetic data sets and whatever niche they want


I speculate we'll just see classes of models developed. Ones with lots of knowledge (even domain specific) and models with reasoning capabilities.


I am no expert. But wasn't part of DeepSeek's success to use synthetic data, at least for code and math?


You don't need to crawl and filter lots of internet data when you can disteal large models.


The answer is that crawling the whole internet is only for training a base model which is expensive and compute-intensive.

R1 didn’t train a base model, they performed additional steps on top of a previously-trained base model (V3). These guys are doing something similar.


[flagged]


One of the authors here....

This is not a Chinese model, btw I'm American


Oh sorry, I skimmed and thought this was based on Deepseek R1.


Distilled models always knew. Not sure about the full model since it is hard to run by myself.



