
The best bang for the buck on VRAM is a maxed out Mac Studio.


No one who claims this ever posts a benchmark

Prompt eval is slow, inference for large models at high context is slow, training is limited and slow.

It's better than not having anything, but we got rid of our M1 Max 192GBs after about a year.


> No one who claims this ever posts a benchmark

I have a Mac with a lot of RAM for running models. I haven’t done it in a month because I can tell that it’s not only slow, but the output also doesn’t come close to what I can get from the latest from Claude or ChatGPT.

It’s actually amazing that I can run LLMs locally and get the quality of output that they give me, but it’s just a different level of experience than the state of the art.

I’m becoming convinced that the people who sing the praises of running locally are just operating differently. For them, slow and lower quality output aren’t a problem because they’re having fun doing it themselves. When I want to get work done, the hosted frontier models are barely fast enough and have hit or miss quality for me, so stepping down to the locally hosted options is even more frustrating.


I'm hoping to see some smaller MoE models released this year, trained with more recent recipes (higher quality data, much longer pretraining). Mixtral 8x7B was impressive when it came out, but the exact same architecture could be a lot more powerful today, and would run quite fast on Apple Silicon.


What will you pay me for the benchmarks, for the professional knowledge and analysis?

I can post benchmarks for these Mac machines and clusters of Studios and M4 Mac Minis (see my other HN posts last month; the largest Mac cluster I can benchmark for you has 4 TB of ultrafast unified memory and around 9216 M4 cores).


I mean, I can't pay you anything, but that sounds interesting as hell. Are there any interesting use cases for massive amounts of memory outside of training?


> No one who claims this ever posts a benchmark

I meant to explain why no one ever posts a benchmark: it's expensive as hell to do a professional benchmark against accepted standards. It's several days' work, very expensive rental of several pieces of $10K hardware, etc. You don't often hand that over for free. With my benchmark results some companies can save millions if they take my advice.

>any interesting use cases for massive amounts of memory outside of training?

Dozens, hundreds. Almost anything you use databases, CPUs, GPUs or TPUs for. 90% of computing is done on the wrong hardware, not just datacenter hardware.

The interesting use case we discussed here on HN last week was running the full DeepSeek-R1 LLM locally on machines with 778 GB of fast DRAM. I benchmarked hundreds of tokens per second on a cluster of M4 Mac minis or a cluster of M2 Ultra Mac Studios, where others reported 0.015 or 6 tokens per second on single machines.

I just heard of a Brazilian man who built a 256 Mac Mini cluster at double the cost that I would. He leaves $600K of value on the table because he won't reverse engineer the instruction set, rewrite his software or even call Apple to negotiate a lower price.

HN votes me down for commenting that I, a supercomputer builder for 43 years, can build better, cheaper, faster, lower-power supercomputers from Mac Minis and FPGAs than from any Nvidia, AMD or Intel state-of-the-art hardware. It even beats the fastest supercomputer of the moment or the Cerebras Wafer Scale Engine 3 (on energy, coding cost and performance per watt per dollar).

I design and build wafer-scale, 2-million-core reconfigurable supercomputers for $30K apiece that cost $150-$300 million to mass produce. That's why I know how to benchmark M2 Ultra and M4 Macs: they are the second best chips at the moment that we need to compete against.

As a consulting job I do benchmarks or build your on-prem hardware or datacenter. This job consists mainly of teaching the customer's programming staff how to write massively parallel software, or of convincing the CEO to buy on-prem hardware instead of renting cloud hardware. OP at Fly.io should have hired me; then he wouldn't have needed to write his blog post.

I replied to your comment in hope of someone hiring me when they read this.


Interesting! Fingers crossed someone who's looking for your skillset finds your post.

What is your process for turning Mac minis into a cluster? Is there any special hardware involved? And if you can get 100x the tokens/s of others on comparable hardware, what do you do differently: hardware, software, something else?


>What is your process to turn Mac minis into a cluster

1) Apply science. Benchmark everything until you understand whether it's memory-bound, I/O-bound or compute-bound [1].

2) Rewrite software from scratch in a parallel form with message passing.

3) Reverse engineer the native instruction sets of the CPU, GPU and ANE or TPU. Same for NVIDIA (don't use CUDA).

No special hardware is needed, but adding FPGAs to optimize the network between machines might help.

So you analyse the software and hardware, then restructure them through reprogramming, rewiring and adaptive compilers. Then you benchmark again, find what hardware runs the algorithm fastest for fewer dollars and less energy, and weigh that against the extra cost of reprogramming.

[1] https://en.wikipedia.org/wiki/Roofline_model
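
As a minimal illustration of step 1, here's a toy roofline check in Python; the peak FLOPS and bandwidth numbers are placeholders you'd replace with measured values for your own machine:

    # Minimal roofline-model check: is a kernel memory-bound or compute-bound?
    # The peak numbers below are placeholders; measure your own with micro-benchmarks.

    PEAK_FLOPS = 4.0e12        # assumed peak compute, FLOP/s
    PEAK_BANDWIDTH = 120e9     # assumed peak memory bandwidth, bytes/s

    def attainable_flops(arithmetic_intensity):
        # Roofline: performance is capped by compute or by bandwidth * intensity.
        return min(PEAK_FLOPS, PEAK_BANDWIDTH * arithmetic_intensity)

    def classify(flops, bytes_moved):
        ai = flops / bytes_moved                 # FLOPs per byte of DRAM traffic
        ridge = PEAK_FLOPS / PEAK_BANDWIDTH      # intensity where the two roofs meet
        bound = "compute-bound" if ai >= ridge else "memory-bound"
        return ai, attainable_flops(ai), bound

    # Example: a GEMM-like kernel doing 2*N^3 FLOPs over 3*N^2*4 bytes, N = 4096
    N = 4096
    ai, roof, bound = classify(2 * N**3, 3 * N**2 * 4)
    print(f"intensity={ai:.1f} FLOP/byte, roofline={roof/1e12:.2f} TFLOP/s, {bound}")

Only once you know which roof you're sitting under does it make sense to decide whether to buy faster memory, more compute, or a better network.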


I discussed all the points you ask about in my HN postings last month, but never in enough detail, so you would have to ask me to be specific, and that's when people hire me.

As you can see from this comment thread, most people, especially programmers, lack the knowledge that we computer scientists, parallel programmers and chip or hardware designers have.

>What is your process

Science. To measure is to know, my prof always said.

To answer your questions in detail, email me.

You first need to be specific. The problem is not how to turn Mac minis into a cluster, with or without custom hardware (I do both), on code X or Y. Nor is it how to optimize software or rewrite it from scratch (which is often cheaper).

First find the problem. In this case the problem is finding the lowest OPEX and CAPEX for the stated compute load versus changing the compute load. In a simulation, or even a cruder spreadsheet calculation, it becomes clear that the energy cost dominates the hardware choice: it trumps the cost of programming, the cost of off-the-shelf hardware and the difference if you add custom hardware. M4s are lower power, lower OPEX and lower CAPEX, especially if you rewrite your (Nvidia GPU) software. The problem is the ignorance of the managers and their employee programmers.

You can repurpose the 2 x 10 Gbps USB-C, the 10 Gbps Ethernet and the three 32 Gbps PCIe (Thunderbolt) ports, but you have to use better drivers. You need to weigh whether doubling the 960 Gbps, 16 GB unified memory for 2 x $400 is faster than 2 Tbps memory at 1.23 times the cost, whether 3 x 4 x 32 Gbps PCIe 4.0 or 3 x 120 Gbps unidirectional links is better for this particular algorithm, and what changes if using all of the 10 CPU cores, the 10 x 400 GPU cores and the 16 Neural Engine cores (at 38 trillion 16-bit OPS) works better than just the CUDA cores.

Usually the answer is: rewrite the algorithm and use an adaptive compiler, and then a cluster of smaller 'sweet spot' off-the-shelf hardware will outperform the fanciest high-end hardware if the network is balanced. This varies at runtime, so you'll only know if you know how to code.

As Alan Kay said and Steve Jobs quoted: if you're serious about software you should make your own hardware. If you can't, you can approximate that hardware with commodity components if that turns out to be cheaper. I estimate that for $42K of labour I can save you a few hundred $K.
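
To make the kind of spreadsheet calculation I mean concrete, here is a toy sketch in Python; every price, wattage and throughput in it is a placeholder you'd replace with your own quotes and measured power draw:

    # Toy OPEX/CAPEX comparison of two ways to run the same workload.
    # All prices, wattages and throughputs are placeholders, not measurements.

    def total_cost(capex, watts, tokens_per_s, years=3, kwh_price=0.30):
        hours = years * 365 * 24
        opex = watts / 1000 * hours * kwh_price      # energy cost over the period
        return capex + opex, tokens_per_s

    options = {
        "1x M2 Ultra Studio 192GB": total_cost(capex=6500, watts=150, tokens_per_s=10),
        "10x M4 Mac mini cluster":  total_cost(capex=10 * 599, watts=10 * 40, tokens_per_s=25),
    }

    for name, (cost, tps) in options.items():
        print(f"{name:28s} ${cost:>9,.0f} over 3y, {cost / tps:,.0f} $ per (token/s)")

Once you plug in real numbers, the energy term usually dominates, which is the point I made above.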


Sounds interesting, but I don’t see any HN submissions on your profile last month. Are you referring to comments you made?


>Are you referring to comments you made?

Yes. Several pages of comments about M4 clusters, wafer scale integrations and a few about DeepSeek.

https://news.ycombinator.com/threads?id=morphle (a few pages; press "More").

https://news.ycombinator.com/item?id=42799072


> it's expensive as hell to do a professional benchmark against accepted standards. It's several days' work, very expensive rental of several pieces of $10K hardware, etc.

When people casually ask for benchmarks in comments, they’re not looking for in-depth comparisons across all of the alternatives.

They just want to see “Running Model X with quantization Y I get Z tokens per second”.
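
For example, a quick-and-dirty way to get that number on a Mac (a sketch against Ollama's local HTTP API; the model name is just a placeholder) would be something like:

    # Rough tokens/s measurement against a local Ollama server.
    # Assumes `ollama serve` is running and the model has already been pulled.
    import requests

    MODEL = "llama3.1:8b-instruct-q4_K_M"   # placeholder model/quantization

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": MODEL,
              "prompt": "Explain the roofline model briefly.",
              "stream": False},
        timeout=600,
    ).json()

    # Ollama reports generated token count and generation time in nanoseconds.
    tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)
    print(f"{MODEL}: {tps:.1f} tokens/s")

That single number, plus the machine spec, is all most commenters are asking for.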

> That's why I know how to benchmark M2 Ultra and M4 Macs: they are the second best chips at the moment that we need to compete against.

Macs are great for being able to fit models into RAM within a budget and run them locally, but I don’t understand how you’re concluding that a Mac is the “second best option” to your $30K machine unless you’re deliberately excluding all of the systems that hobbyists commonly build for under $30K which greatly outperform Mac hardware.


>They just want to see “Running Model X with quantization Y I get Z tokens per second”.

Influencers on YouTube will give them that [1], but it's meaningless. If a benchmark is not part of an in-depth comparison then it doesn't mean anything and can't inform you about what hardware will run this software best.

These shallow benchmarks influencers post on YouTube and Twitter are not just meaningless, they also take days to browse through. And they are influencers; they are meant to influence you, and are therefore not honest or reliable.

[1] https://www.youtube.com/watch?v=GBR6pHZ68Ho

>but I don’t understand how you’re concluding that a Mac is the “second best option” to your $30K machine

I conclude that if you can't afford to develop custom chips, then in certain cases a cluster of M4 Mac Minis will be the fastest, cheapest option. Cerebras wafers or NVIDIA GPUs have always been too expensive compared to custom chips or Mac Mini clusters, independent of the specific software workload.

I also meant to say that a cluster of $599 Mac Minis will outperform a $6500 M2 Ultra Mac Studio with 192GB at half the price, with higher performance and more DRAM, but only if you utilize the M4 Mac Mini's aggregated ~100 Gbps of networking.
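
To put a number on "aggregated 100 Gbps networking", here is the back-of-the-envelope sum I'm using (assuming every port can be driven at full rate simultaneously, which is optimistic and needs the better drivers mentioned above):

    # Back-of-the-envelope aggregate link bandwidth for one M4 Mac mini.
    usb_c       = 2 * 10   # Gbps, two USB-C ports
    ethernet    = 1 * 10   # Gbps, built-to-order 10 GbE option
    thunderbolt = 3 * 32   # Gbps, PCIe payload per Thunderbolt port (figures as above)
    print(f"aggregate: {usb_c + ethernet + thunderbolt} Gbps")  # roughly 'about 100 Gbps'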


a million buckeroos


Absolutely! I have been playing with Ollama on a MacBook Pro with 192 GiB RAM and it is able to run most models, whereas my 3090 runs out of RAM.


Do you mean 128GB? I'm not aware of any variant of the MacBook Pro with that much RAM.


192GB is available for the M2 Mac Studio.


I was curious "how bad is it?" and it seems $5500-ish https://www.ebay.com/sch/i.html?_nkw=192gb+studio&_sop=15


$6500 depending on VAT. But 10-12 M4 Mac minis with 100 Gbps networking give you triple the cores and 160 GB with 2.5 times the memory bandwidth, if the sharding of the NN layers is done right.
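
As a rough illustration of what "sharding the NN layers" means here, a toy Python sketch (node count, layer count and sizes are made-up placeholders, not measurements):

    # Toy pipeline sharding: assign contiguous decoder layers to cluster nodes
    # so that each node's share of the weights fits in its unified memory.
    NUM_LAYERS = 61                          # e.g. a large model's decoder layers
    LAYER_GB = 6.0                           # assumed quantized weight size per layer
    NODES = [f"mini-{i}" for i in range(10)] # ten M4 Mac minis
    NODE_MEM_GB = 64                         # assumed usable unified memory per node

    shards, start = {}, 0
    for i, node in enumerate(NODES):
        # spread the remainder over the first nodes so shards stay balanced
        count = NUM_LAYERS // len(NODES) + (1 if i < NUM_LAYERS % len(NODES) else 0)
        shards[node] = list(range(start, start + count))
        start += count

    for node, layers in shards.items():
        size = len(layers) * LAYER_GB
        assert size <= NODE_MEM_GB, f"{node}: shard does not fit in unified memory"
        print(f"{node}: layers {layers[0]}-{layers[-1]} ({size:.0f} GB)")

The real work is then making the activations flow between shards fast enough that the links, not the compute, stay the bottleneck.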


$6500!! You may as well buy 5x 3090s for $1000 each for 120GB of VRAM and spend the extra $1500 on the sundries.

Like, I'm sure Nvidia is aware of Apple's "unified memory" as an alternative to their cards and yet...they aren't offering >24GB consumer cards yet, so clearly they don't feel threatened.

Don't get me wrong, I've always disliked Apple as a company, but the M series chips are brilliant, I'm writing this on one right now. But people seem to think that Apple will be able to get the same perf increases yoy when they're really stretching process limits by dumping everything onto the same die like that - where do they go from here?

That said, Nvidia is using HBM, so it does make me wonder why they aren't also doing memory-on-package with HBM; I think SK Hynix et al. were looking at making this possible.

I'm glad we're headed in the direction of 3D silicon though; it always seemed like we may as well scale in z. I imagine they can stack silicon/cooling/silicon/cooling, etc. I'm sure they can use lithography to create cooling dies to sandwich between everything else, then just pass connections/coolant through those.


Hoping that M4 Ultra Mac Pros will bump this again.



