SethTro's comments | Hacker News

I wrote Problem 371, https://projecteuler.net/problem=371 , as a high school student in 2012!

I'm so happy to have spent twenty years of my life learning math and solving problems on Project Euler and elsewhere.


This is one of my favourite problems; I still remember that it has a very real edge case, even though I solved it more than 10 years ago. Thank you for the problem!


I'm guessing if you only calculate based on the digits, the probability is going to be slightly different than the real one, because you only have a finite number of plates you can choose from.


I'm glad you enjoyed! It was a real game I played when driving around.


Is your real name also Seth? This is wholesome and hilarious


Yes, the other name in the problem is my sister's name :)


Sounds like the birthday paradox problem. Is it?


Nearly, but not to 8 digits of precision.


Can you reuse a plate with 500?


The wording seems to strongly imply no; you need two separate plates with 500 on them.


That sounds like a combinatorial problem... letters from AAA to ZZZ, numbers from 000 to 999.

That means one factor in the total number of possible car plates is 26^3.

We want to find pairs (x, y) such that x + y = 1000. That means the number of such pairs would be sum([1 for x in range(1000) for y in range(1000) if x + y == 1000])/2, since there is a symmetry.

But wait, we need the expected number of plates he needs to see for a win. So maybe we need to borrow something from statistics (Poisson/chi-squared distribution) or queueing theory...?

Edit: ah I saw the solution, it is a Markov chain.
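
Before spotting the Markov chain, a quick Monte Carlo run is a decent sanity check. A rough Python sketch, assuming plate numbers are uniform over 000-999 and that a second 500 plate is needed to pair with the first (the function name and trial count are just illustrative):

  import random

  def expected_plates(trials=100_000):
      """Rough Monte Carlo estimate of the expected number of plates seen
      before two of them sum to 1000."""
      total = 0
      for _ in range(trials):
          seen = set()
          saw_500 = False
          count = 0
          while True:
              count += 1
              n = random.randrange(1000)          # this plate's number, 000-999
              if n == 500:
                  if saw_500:
                      break                       # a second 500 completes the pair
                  saw_500 = True
              elif n != 0 and (1000 - n) in seen:
                  break                           # complement plate already seen
              else:
                  seen.add(n)                     # 000 never pairs, but storing it is harmless
          total += count
      return total / trials

  print(expected_plates())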


Interesting: I ask a license plate question (when will California run out of plates in its current serialization format, based on a couple of plates observed in two different years). It's a much simpler question, though (just linear extrapolation).


The article doesn't seem to mention the price, which is $4,000. That makes it comparable to a 5090, but with 128 GB of unified LPDDR5X vs the 5090's 32 GB of GDDR7.


They're in a different ballpark in memory bandwidth. The right comparison is the Ryzen AI Max 395 with 128 GB of LPDDR5X-8000, which can be bought for around $1800 / 1750€.


$4,000 is actually extremely competitive. Even for an at-home enthusiast setup this price is not out of reach. I was expecting something far higher. That said, Nvidia's MSRPs have been something of a pipe dream recently, so we'll see what pricing and availability actually look like when it's released. Curious also to see how they might scale together.


A warning to any home consumer throwing money at hardware for AI (fair enough if you have other use cases)...

Things are changing rapidly, and there is a not-insignificant chance that it'll seem like a big waste of money within 12 months.


For this form factor it will likely be ~2 years until the next one, based on the Vera CPU and whatever GPU. The 50W CPU will probably improve power efficiency.

If SOCAMM2 is used it will still probably top out somewhere around 512-768 GB/s of bandwidth, unless LPDDR6X / LPDDR7X or SOCAMM2 is that much better; SOCAMM on the DGX Station is just 384 GB/s with LPDDR5X.

The form factor will be neutered for the near future, but will probably retain the highest compute for its size.

The only way there will be a difference is if Intel or AMD put their foot on the gas; that gives them maybe 2-3 years, and unless they already have something cooking for the 2 years after that, it isn't going to happen.


Software-driven changes could occur too! Maybe the next model will beat the pants off this one with far inferior hardware. Or maybe it'll be so amazing with higher-bandwidth hardware that anyone running at less than 500 GB/s will be left feeling foolish.

Maybe a company is working on something totally different in secret that we can't even imagine. The amount of £ being thrown into this space at the moment is enormous.


Based on what data? I'm not denying the possibility but this seems like baseless FUD. We haven't even seen what folks have done with this hardware yet.


If you compare DGX Spark with Ryzen AI Max 395, do you still think that $4000 for the NVidia device is very competitive?

To me it seems like you're paying more than twice the price mostly for CUDA compatibility.


A 5090 is $2000.


MSRP, but try getting your hands on one without a bulk order and/or camping out in a tent all weekend. I have seen people in my area buying pre-built machines, as they often cost less than trying to buy an individual card.


It’s not that hard to come across MSRP 5090s these days. It took me about a week before I found one. But if you don’t want to put any effort or waiting into it, you can buy one of the overpriced OC models right now for $2500.


But then you have to put it in a $1500 PC (with 128 GB of DRAM).

Still, a PC with a 5090 will give in many cases a much better bang for the buck, except when limited by the slower speed of the main memory.

The greater bandwidth available when accessing the entire 128 GB memory is the only advantage of NVIDIA DGX, while a cheaper PC with discrete GPU has a faster GPU, a faster CPU and a faster local GPU memory.


And about 1/4 the memory bandwidth, which is what matters for inference.


More precisely, the RTX 5090 has a memory bandwidth of 1792 GB/s, while the DGX Spark only has 273 GB/s, which is about 1/6.5.

For inference, the DGX Spark does not look like a good choice, as there are cheaper alternatives with better performance.


My understanding is that the Jetson Thor is just as good a platform, and likely more readily available.

Then there's the Mac Studio, which outdoes them in all respects except FP8 and FP4 support. As someone on Reddit put it: https://old.reddit.com/r/LocalLLaMA/comments/1n0xoji/why_can...


I've been thinking the same… I have a Jetson Thor, and the only difference I can imagine is the capability to connect two DGX Sparks together… but then I'd rather go for an RTX Pro 6000 instead of buying two DGX Spark units, because I prefer the higher memory bandwidth, more CUDA cores, tensor cores, and RT cores over 256 GB of memory for my use case.


The Jetson Thor seems to be quite different. The Thor whitepaper lists 8 TFLOP/s of FP32 compute, where the DGX Spark seems to be closer to 30 TFLOP/s. Also 48 SMs on the Spark vs 20 on the Jetson.

The DGX seems vastly more capable.


Well, that's disappointing, since the Mac Studio with 128GB is $3,499. If Apple happens to launch a Mac Mini with 128GB of RAM, it would eat the Nvidia Spark's lunch every day.


Only if it runs CUDA; MLX / Metal isn't comparable as an ecosystem.

People who keep pushing for Apple gear tend to forget that Apple has decided that what the industry considers industry standards, proprietary or not, aren't made available on their hardware.

Even if Metal is actually a cool API to program for.


It depends what you're doing. I can get valuable work done with the subset of Torch supported on MPS and I'm grateful for the speed and RAM of modern Mac systems. JAX support is worse but hopefully both continue to develop.


CUDA is equally proprietary and not an industry standard though, unless you were thinking of Vulkan/OpenCL, which doesn't bring much in this situation.


Yes, it is an industry standard; there is even a technical term for it.

It is called a de facto standard, which you can check in your favourite dictionary.


CUDA isn't the industry standard? What is then?


Agreed. I also wonder why they chose to test against a Mac Studio with only 64GB instead of 128GB.


Hi, author here. I crowd-sourced the devices for benchmarking from my friends. It just happened that one of my friends has this device.


FYI you should have used llama.cpp to do the benchmarks. It performs almost 20x faster than ollama for the gpt-oss-120b model. Here are some sample results on my Spark:

  ggml_cuda_init: found 1 CUDA devices:
    Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
  | model                          |       size |     params | backend    | ngl | n_ubatch | fa |            test |                  t/s |
  | ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
  | gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |     2048 |  1 |          pp4096 |       3564.31 ± 9.91 |
  | gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |     2048 |  1 |            tg32 |         53.93 ± 1.71 |
  | gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |          pp4096 |      1792.32 ± 34.74 |
  | gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |            tg32 |         38.54 ± 3.10 |


Is this the full weight model or quantized version? The GGUFs distributed on Hugging Face labeled as MXFP4 quantization have layers that are quantized to int8 (q8_0) instead of bf16 as suggested by OpenAI.

Example looking at blk.0.attn_k.weight, it's q8_0 amongst other layers:

https://huggingface.co/ggml-org/gpt-oss-20b-GGUF/tree/main?s...

Example looking at the same weight on Ollama is BF16:

https://ollama.com/library/gpt-oss:20b/blobs/e7b273f96360


I see! Do you know what's causing the slowdown for ollama? They should be using the same backend..


Dude, ggerganov is the creator of llama.cpp. Kind of a legend. And of course he is right, you should've used llama.cpp.

Or you can just ask the ollama people about the ollama problems. Ollama is (or was) just a Go wrapper around llama.cpp.


Was. They've been diverging.


Now this looks much more interesting! Is the top one input tokens and the second one output tokens?

So 38.54 t/s on 120B? Have you tested filling the context too?


Yes, I provided detailed numbers here: https://github.com/ggml-org/llama.cpp/discussions/16578


Makes sense you have one of the boxes. What's your take on it? [Respecting any NDAs/etc/etc of course]


Curious to how this compares to running on a Mac.


TTFT on a Mac is terrible and only gets worse as the context grows; that's why many are selling their M3 Ultra 512GB.


So so many… eBay search shows only 15 results, 6 of them being ads for new systems…

https://www.ebay.com/sch/i.html?_nkw=mac+studio+m3+ultra+512...


Just don't try to run NCCL.


Wouldn't you be able to test nccl if you had 2 of these?


What kind of NCCL testing are you thinking about? Always curious what’s hardest to validate in people’s setups.


Not with Mac Studio(s), but yes: multi-host NCCL over RoCE with two DGX Sparks, or over PCIe with one.


I don't see any benchmarks on your github.

How long does it take for you to test up to 10^9? 10^12?


So, at 12 threads, I can do 10^12 in about 3 hrs and 45 mins, but it obviously takes over the whole system. I think it could be optimized even further if the concept were taken further in the right hands or on better hardware. I should add some benchmarks for sure; will do so soon.


Is that for [0, 10^12] or for 10^18 + 10^12? Is the 3.75 hrs total core hours or system time?

I wrote https://github.com/sethtroisi/goldbach in the last four hours and I think it's 10-100x faster than your code

  time ./goldbach -t8 -K 1024 -e 100''000''000''000
  real 0m44.133s
  user 3m32.193s
  sys 0m0.092s

  time ./goldbach -t12 -K 1500 -e 1''000''000''000''000
    267468893942 = 267468891139 + 2803 (407)
    926868341768 = 926868338701 + 3067 (437)
    599533546358 = 599533542901 + 3457 (481)
  real 2m3.909s
  user 16m0.729s
  sys 0m37.960s

I know of at least another 4x I could improve it but I've been nerd sniped enough for the night.

I wouldn't trust it 100%, but it finds all terms in https://oeis.org/A025018 and https://oeis.org/A025019 and I can verify against https://sweet.ua.pt/tos/goldbach.html


> Goldbach states: every even number ≥ 4 is a sum of two primes. The naive check for an even n tries many primes p and hopes that n − p is prime. Our idea is simpler: fix a small set of primes Q = {q1, . . . , qK} (the “gear”), and for each even n only test p = n − q with q ∈ Q

I don't see how your idea is different from the naive check. As far as I can tell you are basically saying do the naive check but only up to p > 250-300?


The idea is that a fixed-gear approach means that instead of exhaustively checking against everything, a small subset of primes (k=300) is actually sufficient for effectively complete coverage, and this holds true at large slices in the quadrillions, or even higher, or all the way through an entire cycle of evens up to 1 trillion. It saves a massive amount of compute and appears to be just as effective as naive checks. Like orders of magnitude faster.

As far as I know, no one has tested this method or written an algorithm precisely like this, and then determined that k=300 is the sweet spot for the prime set. Complexity isn't required for improvements.


Think of it like this: “naive 300“ would try and check if n minus any of the first 300 primes lands on another prime. For big n, it falls apart fast as prime gaps explode, and you start missing left and right. But here I am doing a small, constant set of primes, call it the “gear”… and for each even n, I check n – q for every q in that gear. Not just as a casual test, but as a full-blown, multi-threaded, checkpointed sweep with audited logs.
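
To make the loop concrete, here is a minimal Python sketch of the idea as described above (not the actual implementation; sympy, K, and the range are only illustrative):

  from sympy import isprime, prime

  K = 300
  GEAR = [prime(i) for i in range(1, K + 1)]       # fixed "gear": the first K primes

  def gear_witness(n):
      """Return a gear prime q with n - q prime, or None if the gear misses
      (a miss would then need a wider, naive-style search)."""
      for q in GEAR:
          if q >= n:
              break
          if isprime(n - q):
              return q
      return None

  # sweep a small range of even numbers using only the fixed gear
  for n in range(4, 100_000, 2):
      if gear_witness(n) is None:
          print("gear miss at", n)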


That's the magic part, they aren't


Totally agree with the title. I discovered Fenwick trees as part of solving a Project Euler problem; then, from the problem forum, I found out someone had invented them in the '90s and other people had imagined them much earlier.
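
For anyone who hasn't run into them: a Fenwick (binary indexed) tree gives point updates and prefix sums in O(log n) with only a few lines of code. A minimal Python sketch, purely illustrative:

  class Fenwick:
      """Fenwick / binary indexed tree: point update and prefix sum, both O(log n)."""
      def __init__(self, n):
          self.tree = [0] * (n + 1)                # 1-based internally

      def add(self, i, delta):
          i += 1
          while i < len(self.tree):
              self.tree[i] += delta
              i += i & -i                          # next node whose range covers i

      def prefix_sum(self, i):                     # sum of elements 0..i
          i += 1
          total = 0
          while i > 0:
              total += self.tree[i]
              i -= i & -i                          # drop the lowest set bit
          return total

  f = Fenwick(10)
  f.add(3, 5)
  f.add(7, 2)
  print(f.prefix_sum(7))                           # -> 7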


The board likely kept him until most of the bad PR had subsided, so the new CEO could have a friendlier reception.


This has 2 of the 3 features that were keeping me on ESP32 (float support, faster clock), plus more PIO. For projects that need WiFi, and can tolerate the random interrupts, I'll stick with ESP32.


> Phind-70B is significantly faster than GPT-4 Turbo ... We're able to achieve this by running NVIDIA's TensorRT-LLM library on H100 GPUs


As someone who has used Nvidia Triton Inference Server for years, it's really interesting to see people publicly disclosing their use of TensorRT-LLM (almost certainly in conjunction with Triton).

Up until TensorRT-LLM, Triton had been kind of an in-group secret amongst high-scale inference providers. Now you can readily find announcements, press releases, etc. of Triton (TensorRT-LLM) usage from the likes of Mistral, Phind, Cloudflare, Amazon, etc.


Being accessible is huge.

I still see posts of people running ollama on H100s or whatever, and that's just because it's so easy to set up.


How many H100 GPUs does it take to serve 1 Phind-70B model? Are they serving it with bf16, or int8, or lower quants?


This video [1] shows someone running a 4-bit quant in 48 GB of VRAM. I suspect you need 4x that to run at full fp16 precision, or approximately 3 H100s.

[1] https://www.youtube.com/watch?v=dJ69gY0qRbg


Yeah, 4-bit would take 35 GB at least; 16-bit would be 140 GB. I'm more interested in how Phind is serving it, but I guess that's their trade secret.
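
For reference, those figures are just weights-only back-of-the-envelope math (real serving adds KV cache and runtime overhead on top):

  # rough weight-only footprint for a 70B-parameter model
  params = 70e9
  for name, bytes_per_param in [("fp16/bf16", 2.0), ("int8", 1.0), ("4-bit", 0.5)]:
      print(f"{name:>9}: ~{params * bytes_per_param / 1e9:.0f} GB")
  # -> fp16/bf16: ~140 GB, int8: ~70 GB, 4-bit: ~35 GB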


I've had a ton of success converting simple drawings to fabric with Ink/Stitch on a Brother PE770. So glad we have one in our makerspace. Things like this eye https://i.imgur.com/0egzvbc.jpeg or names for shirts.

