Those ratios seem way off if you're referring to the M1 Max and not the base M1. If we use Geekbench CPU performance, the Ryzen 9 7945HX (which is from 2023) is around 12% faster single-core and 32% faster multi-core than the M1 Max (which is from 2021). If you look at the 2024 M4 Max, it's substantially faster than the Ryzen and Intel chips you mentioned.
153 GB/s is not bad at all for a base model; the Nvidia DGX Spark has only 273 GB/s memory bandwidth despite being billed as a desktop "AI supercomputer".
Models like Qwen 3 30B-A3B and GPT-OSS 20B, both quite decent, should be able to run at 30+ tokens/sec at typical (4-bit) quantizations.
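A quick back-of-envelope check on that: token generation is mostly memory-bandwidth-bound, so the ceiling is roughly bandwidth divided by the bytes read per token, and for MoE models only the active parameters get read each token. A minimal sketch, assuming ~3B active params for Qwen 3 30B-A3B, ~3.6B for GPT-OSS 20B, and ~0.55 bytes per weight for a typical 4-bit GGUF quant (all rough figures on my part, not measurements):

    # Decode-speed ceiling: tokens/sec ~= memory bandwidth / bytes read per token.
    # For a MoE model, bytes per token is set by the *active* parameters only.
    def decode_ceiling_tok_s(bandwidth_gb_s, active_params_billion, bytes_per_weight=0.55):
        bytes_per_token = active_params_billion * 1e9 * bytes_per_weight
        return bandwidth_gb_s * 1e9 / bytes_per_token

    for name, active in [("Qwen 3 30B-A3B", 3.0), ("GPT-OSS 20B", 3.6)]:
        print(f"{name}: ~{decode_ceiling_tok_s(153, active):.0f} tok/s ceiling at 153 GB/s")

Real-world numbers land well below that ceiling once attention, KV cache reads, and framework overhead are factored in, but it leaves plenty of headroom above 30 tok/s.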
Even at 1.8x the base memory bandwidth and 4x the memory capacity, Nvidia spent a lot of time talking about how you can pair two DGXs together with the 200G NIC to be able to slowly run quantized versions of the models everyone was actually interested in.
Neither product actually qualifies for the task IMO, and that doesn't change just because two companies advertised them as such instead of just one. The absolute highest end Apple Silicon variants tend to be a bit more reasonable, but the price advantage goes out the window too.
It's colleges that they have been clamping down on, as they were bringing in absolutely massive numbers of mostly Indian students who were coming mainly to work in low-end jobs and get out of India rather than to legitimately study.
The number of graduate students being allowed in hasn't changed significantly, and undergraduate university students are also continuing to be brought in at rates similar to pre-pandemic times.
Mistral models are largely along the lines of what you were asking for. However, Grok (any version) absolutely is not a “don’t say gay” model; it talks about sexuality of all forms quite openly and fairly, and it is happy to produce creative content of any level of explicitness about these topics. It’s the least censored unmodified model I’ve encountered on any topic. People dismiss Grok as a Nazi model based on Musk’s politics without using it themselves.
In general, I agree. However, many older cars were small, light, simple, and raw - characteristics that have largely disappeared from modern cars. Automatic transmissions from the mid-90s and earlier generally sucked, though good old manual transmissions are not much different from good modern ones.
As an example, I owned a W126 S class from the late 80s, and it was fun in its own unique way; no modern car replicates its experience. It had somewhat heavy and very feedback-rich steering and Porsche-like firm, tactile pedal feel, while having a super supple ride over the most awful roads, with SUV-like ground clearance and tremendous suspension travel. The car was also super simple and reliable; my 300SE had nearly 400k km on its all-original powertrain when I sold it, it never let me down, and it weighed less than a modern A class or CLA. While not as safe as modern cars, it was exceptionally safe for its era and comparable to normal cars of the early 2000s in crash structure safety.
The W140 (I used to own one too) had a much better powertrain, but it lost the raw, tactile, scrappy nature of its predecessor, and it couldn't handle super awful potholed roads as well as the W126 either. There are no modern cars that combine the rich, raw, tactile control feel and super supple ride the W126 had.
Look at cars like the BMW E30, or Mercedes-Benz 190E (W201), or the superbly engineered workhorses that the W123 and W124 were. There are no modern cars that replicate the genuinely delightful driving experience of those.
Oh yes, preach the gospel of the W126. I had a 1986 300SD for a while, and I’d own one again in a heartbeat. I’ve never felt safer, or cooler, driving a car. You had a gasser, which I bet was faster than mine, but the sound of that diesel spinning up the turbo was something else.
I agree about the W123 as well. I’ve owned half a dozen of those. For a couple generations there it seemed that Mercedes had cars just about solved.
My daily is a W126 with the OM603. It's getting harder every day to find parts (when I need them, which is infrequent) but it's worth the hassle because like you say there is nothing modern that has the same combination of feel and ride quality. Or visibility! I can parallel park this car (long wheelbase too) in tiny spots easier than a modern compact because you can actually see.
I've got a W140 with the M120 and a W123 with the OM616 and a 4-speed too, and while they have their charms (especially the W123), nothing tops the W126. It truly was not just the finest production sedan Mercedes ever made, but the finest ever made by anyone. (Other contenders being the W100, the W140, and the Lexus LS.)
> However, many older cars were small, light, simple, and raw - characteristics that have largely disappeared from modern cars.
I feel parent's point still stands.
Sure, you won't be able to go to a random Ford dealership and go home with a small, light, and simple car, but there are plenty of modern cars accessible with a modicum of effort. Even buying something new abroad and bringing it back home will probably be less hassle than restoring an old car.
I wonder if buying a kit car would still be simpler, for still better results.
Aside from the Mazda MX-5 (which isn’t the most practical car), almost all small, simple, and light cars made today are econoboxes. They’re not designed to have the rich control feel, balanced and satisfying handling near the limits, responsiveness, material quality, suspension sophistication, etc. of, say, German luxury compact cars of the 1980s (BMW E30 or M-B W201). Even cars like 90s Hondas, while front-wheel drive and built to a much lower price point, had rich control feel, liveliness, and agility that modern cars don’t offer.
Modern luxury cars from essentially all brands around the world have become huge, heavy, numb, and over-complicated. They’re much faster and quieter than, say, the old Benzes and BMWs of the 80s, but they don’t have the fun raw feel, small size, light weight, tossability, and simplicity of the old cars.
A BMW E30 or M-B W201 weighs somewhere between a Mazda MX-5 and a Subaru BRZ, but is far more practical than either for passengers and cargo despite being around the same width and only slightly longer.
The only modern cars with similar size and weight are some European market compact cars and econoboxes like the Mitsubishi Mirage, Nissan Micra, and Chevy Spark (which are also disappearing from North America). For steering feel, handling, general raw and connected driving feel, powertrain responsiveness, and interior quality, these modern economy cars can’t compete. Some of the European market specific B-segment cars come closest to those older compact luxury cars, but they still don’t match them for the qualities I described.
Kit cars generally suck from a practical perspective compared to well-engineered 80s/90s cars, and they aren’t a realistic option either.
> They’re not designed to have the rich control feel, balanced and satisfying handling near the limits, responsiveness, material quality, suspension sophistication, etc.
Sounds to me like you're looking for a Lotus or a 911 at budget prices. I agree with you that's pretty far from the "small, light, simple" vehicle, and it's fully in the hobby realm.
If you're that deep into cars, I'd say more power to you, and spending ungodly amounts of money, time, and effort on vintage cars is probably a pleasure as well.
That’s the thing - old German compact luxury sedans from the 80s had the control feel, balance, and light weight you get from a Porsche, while also being practical family cars. There’s nothing like that made today. They were also decently safe and comfortable and reliable and generally just good.
Also the bigger ones like the W126, while not as light and agile as a Porsche or Lotus, still had similar control feel, very comfortable and spacious interiors, and could glide over the worst, most broken and potholed roads better than any modern car I’ve driven. They’re also much simpler than any modern luxury car, with much less to break, and they just keep going and going as long as you take basic care of them. From personal experience, a much younger used W220 or W221 S class needs far more maintenance and repair than an old W126.
The more powerful but still reliable engines and nicer transmissions of the late W140 or W220 would be nice to have in a W126 though. My problem with the newer S classes is the complexity and fragility of the rest of the car.
Of course, these are 40 year old cars and need more care and maintenance than a new car, but they’re not too bad either as long as you get a good example of the car. They’re pretty reliable once sorted, and can last a very long time and very high mileage as long as they’re at least somewhat cared for.
While cloud models are of course faster and smarter, I've been pretty happy running Qwen 3 Coder 30B-A3B on my M4 Max MacBook Pro. It has been a pretty good coding assistant for me with Aider, and it's also great for throwing code at and asking questions. For coding specifically, it feels roughly on par with SOTA models from mid-late 2024.
At small contexts with llama.cpp on my M4 Max, I get 90+ tokens/sec generation and 800+ tokens/sec prompt processing. Even at large contexts like 50k tokens, I still get fairly usable speeds (22 tok/s generation).
Privacy, both personal and for corporate data protection, is a major reason. Unlimited usage, allowing offline use, supporting open source, not worrying about a good model being taken down/discontinued or changed, and the freedom to use uncensored models or model fine-tunes are other benefits (though this OpenAI model is super-censored - “safe”).
I don’t have much experience with local vision models, but for text questions the latest local models are quite good. I’ve been using Qwen 3 Coder 30B-A3B a lot to analyze code locally and it has been great. While not as good as the latest big cloud models, it’s roughly on par with SOTA cloud models from late last year in my usage. I also run Qwen 3 235B-A22B 2507 Instruct on my home server, and it’s great, roughly on par with Claude 4 Sonnet in my usage (but slow of course running on my DDR4-equipped server with no GPU).
Add big law to the list as well. There are at least a few firms here that I am just personally aware of running their models locally. In reality, I bet there are way more.
A ton of EMR systems are cloud-hosted these days. There’s already patient data for probably a billion humans in the various hyperscalers.
Totally understand that approaches vary, but beyond EMR there’s work to augment radiologists with computer vision for better diagnoses, and all sorts of other cloudy things.
It’s here. It’s growing. Perhaps in your jurisdiction it’s prohibited? If so I wonder for how long.
In the US, HIPAA requires that health care providers complete a Business Associate Agreement with any other orgs that receive PHI in the course of doing business [1]. It basically says they understand HIPAA privacy protections and will work to fulfill the contracting provider's obligations regarding notification of breaches and deletion. Obviously any EMR service will include this by default.
Most orgs charge a huge premium for this. OpenAI offers it directly [2]. Some EMR providers are offering it as an add-on [3], but last I heard, it's wicked expensive.
I'm pretty sure the LLM services of the big general-purpose cloud providers do (I know for sure that Amazon Bedrock is a HIPAA Eligible Service, meaning it is covered within their standard Business Associate Addendum [their name for the Business Associate Agreement as part of an AWS contract].)
Sorry to edit snipe you; I realized I hadn't checked in a while so I did a search and updated my comment. It appears OpenAI, Google, and Anthropic also offer BAAs for certain LLM services.
In the US, it would be unthinkable for a hospital to send patient data to something like ChatGPT or any other public service.
It might be possible with certain specific regions/environments of Azure, though, because iirc they have a few that support government confidentiality requirements, and some that tout HIPAA compliance as well. Not sure about the details of those.
Possibly stupid question, but does this apply to things like M365 too? Because just like with Inference providers, the only thing keeping them from reading/abusing your data is a pinky promise contract.
Basically, isn't your data as safe/unsafe in a sharepoint folder as it is sending it to a paid inference provider?
I do think devs are one of the genuine use cases for local models going forward. No price hikes or random caps dropped in the middle of the night, and in many instances I think local agentic coding is going to be faster than the cloud. It’s a great use case.
I am extremely cynical about this entire development, but even I think that I will eventually have to run stuff locally; I've done some of the reading already (and I am quite interested in the text to speech models).
(Worth noting that "run it locally" is already Canva/Affinity's approach for Affinity Photo. Instead of a cloud-based model like Photoshop, their optional AI tools run using a local model you can download. Which I feel is the only responsible solution.)
I totally agree. My only problem is that local models running on my old Mac mini are much slower than, for example, Gemini 2.5 Flash. I have my Emacs set up so I can switch between a local model and one of the much faster commercial models.
Someone else responded to you about working for a financial organization and not using public APIs - another great use case.
These being mixture-of-experts (MoE) models should help. The 20B model only has 3.6B params active at any one time, so minus a bit of overhead the speed should be like running a 3.6B model (while still requiring the RAM of a 20B model).
Here's the ollama version (4.6bit quant, I think?) run with --verbose
    total duration:       21.193519667s
    load duration:        94.88375ms
    prompt eval count:    77 token(s)
    prompt eval duration: 1.482405875s
    prompt eval rate:     51.94 tokens/s
    eval count:           308 token(s)
    eval duration:        19.615023208s
    eval rate:            15.70 tokens/s
15 tokens/s is pretty decent for a low-end MacBook Air (M2, 24 GB of RAM). Yes, it's not the ~250 tokens/s of 2.5 Flash, but for my use case anything above 10 tokens/sec is good enough.
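To put the MoE point in numbers (the parameter counts and bytes-per-weight figure below are rough assumptions on my part, not measurements): the whole ~21B of weights has to sit in RAM, but each generated token only reads the ~3.6B active parameters, and that read volume is what sets decode speed.

    # MoE trade-off for GPT-OSS 20B: RAM footprint scales with total parameters,
    # data read per generated token scales with the *active* parameters.
    total_params = 21e9      # all experts, resident in RAM (assumed)
    active_params = 3.6e9    # parameters actually read per token (assumed)
    bytes_per_weight = 0.58  # ~4.6-bit quant including scales (assumed)

    print(f"weights resident in RAM: ~{total_params * bytes_per_weight / 1e9:.1f} GB")
    print(f"weights read per token:  ~{active_params * bytes_per_weight / 1e9:.1f} GB")
    # ~12 GB resident vs ~2 GB streamed per token: 20B-class RAM requirements,
    # but decode speed much closer to that of a ~3.6B dense model.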
On my M4 Max MacBook Pro, with MLX, I get around 70-100 tokens/sec for Qwen 3 30B-A3B (depending on context size), and around 40-50 tokens/sec for Qwen 3 14B. Of course they’re not as good as the latest big models (open or closed), but they’re still pretty decent for STEM tasks, and reasonably fast for me.
I have 128 GB of RAM on my laptop, and regularly run multiple VMs, several heavy applications, and many browser tabs alongside LLMs like Qwen 3 30B-A3B.
Of course there’s room for hardware to get better, but the Apple M4 Max is a pretty good platform for running local LLMs performantly on a laptop.
You should use flash attention with KV cache quantization. I routinely use Qwen 3 14B with the full 128k context and it fits in under 24 GB VRAM. On my Pixel 8, I've successfully used Qwen 3 4B with 8K context (again with flash attention and KV cache quantization).
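In case it helps anyone size this: the KV cache grows linearly with context, and at fp16 the 128K cache alone would be on the order of the whole card, which is why quantizing it (plus flash attention not materializing the full attention matrix) makes the difference. A back-of-envelope sketch, assuming Qwen 3 14B uses 40 layers with 8 KV heads of dimension 128 and rough quant sizes (all figures my assumptions):

    # Rough KV-cache sizing for Qwen 3 14B at full 128K context.
    layers, kv_heads, head_dim = 40, 8, 128   # assumed architecture
    ctx = 131072                              # 128K tokens
    bytes_fp16, bytes_q8 = 2.0, 34 / 32       # per-element cost; q8_0 = 34 bytes per 32 values

    elems = 2 * layers * kv_heads * head_dim * ctx   # K and V, every layer, every token
    print(f"KV cache at fp16: ~{elems * bytes_fp16 / 2**30:.1f} GiB")   # ~20 GiB
    print(f"KV cache at q8_0: ~{elems * bytes_q8 / 2**30:.1f} GiB")     # ~10.6 GiB

    weights = 14.8e9 * 0.57 / 2**30   # ~4.5-bit quant of the 14B weights (assumed)
    print(f"weights at ~4-bit: ~{weights:.1f} GiB")
    # Quantized cache plus quantized weights comes to roughly 18-19 GiB,
    # which is how the full 128K context fits under 24 GB.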
https://browser.geekbench.com/processors/amd-ryzen-9-7945hx
https://browser.geekbench.com/processors/intel-core-ultra-7-...
https://browser.geekbench.com/macs/macbook-pro-16-inch-2021-...
https://browser.geekbench.com/macs/macbook-pro-16-inch-2024-...