I've been building a memory safe language that transpiles to Zig with a Go-like runtime that can run interpreted (no GC) or compiled - high-level that feels like Ruby but with incremental typing like TypeScript.
The Zig team between 0.16 and this has really made me glad I chose Zig as the target instead of Rust - which probably would've been a lot easier to target (since it's already memory safe).
I believed it had the best build system design and was the best transpilation target, and I really believe that 6 months later.
The main reason I wanted no GC is because I think aliasing is the root of all evil, and I want a language with zero global complexity (but doesn't require a PhD to use).
Working on something kinda similar. No GC, Python feel, managed memory, performance approaching C. It's here: https://blorp-lang.org if you want to compare approaches.
Yeah, concurrency in blorp doesn't allow shared mutable references, so deadlocks aren't really a concern. Otherwise it's meant to be simple-ish -- virtual threads, channels, no async/await. Pure functions allow safe parallelism naturally, so that's fairly straightforward, though the API is still incomplete, for example the "Parallel" section here: https://blorp-lang.org/docs/lists/. It's still under heavy development (working on it right now).
1) I want to minimize global complexity, which by definition maximizes local reasoning.
2) I want to make the vast majority of bugs simply unrepresentable - taking it past Rust, and even past Pony - WHILE allowing shared mutable memory, but without requiring a PhD to use.
The goal is in EASY mode, it's barely harder to use than Ruby or Python (just the occasional pedant compiler error that has automatic options to fix itself most of the time). You don't even have to supply types or compile. It has a REPL, etc.
When you bump to DEFAULT mode and then to STRICT mode, all the annotation is automatic - your code just might look "ugly" if you like having no types anywhere etc.
But DEFAULT & STRICT mode give people and LLMs everything they need to know to understand the effects of an individual function.
I have some similar goals. Have you considered leaning more into inference than gradual typing? One pattern I like is allowing the compiler to develop a more complex mental model, but keeping it straightforward for users -- you can do that with inference, ownership, purity, effect types, etc. What I actually think is really tantalizing is using tooling to fill in some of those gaps -- for instance, the editor could know types, required capabilities etc, without the user ever needing to type anything, but when the user needs it, they can find it, query it, test against it.
Cool - it sounds pretty similar. It's interesting that it looks so different. I'll have to investigate more.
WRT to inference, yes. I infer everything in EASY mode.
And the compiler give the user autofix via choice when a type is ambiguous (I don't default to huge union types - I assume no one wants to do that and make them choose a type - may potentially allow AutoUnion to allow that).
I couldn't tell if you're using affine ownership, but I assume so if you don't have a GC. If they try to create an alias - they get a use after move error, and the compiler tells them they need to either COPY (auto-fix) or create a RefCount (usually auto-fix) - they pick.
The trope about external consultants is that your VP brings them in to review the company, and they talk to everybody and write a report on how to improve the business, and the report says exactly what you've been telling your VP but they've been ignoring you.
They are paid to justify decisions executives have already made. It's often referred to as due diligence, but in practice these reports mostly just allow executives to tell the board it wasn't their fault if it goes wrong.
It is. I definitely agree that strings in Zig can be tedious, but the upside is that if you need it, you can build a string library that does everything you want it to do, in the way you want.
For comparison, while Rust offers a very rich string library, it's also very strict about what you can/cannot do with strings, so if your use case falls outside of that you're out of luck. With Zig, you can pretty easily roll your own and make it do what you want. (and when Zig is post 1.0, I imagine there will be some very nice pre-made string libraries by the community etc.)
How much that makes it into enterprise pricing is TBD, since none of the hyper scalers are making money yet of selling AI inference.
Almost all businesses are ahead of the gun. For most of their use cases, AI is either not yet good enough on its own, or good enough but too expensive.
No one wants to get left behind, so everyone's trying to get onto it now, even though it's not ready for what most enterprises want to do with it.
It's easy for them to look at a small startup without billions of lines of legacy business logic debt and see them having success and wonder why they can't have just as much - or more - why they're bigger so they should have better and more success, right???
Wrong...
But when it gets ~99% cheaper for local inference over the next 4 years, at the same time the price per watt improve 4x -> a lot of those cases will start to pencil out.
Going from Opus 4.5 to 4.7 secretly required 6x more compute to run. 4.8 is apparently 30% more on top. I haven't seen any optimizations lately aside from distillation. Nobody's optimizing, they're just scaling up.
I retired a few years ago, but I still write a fair bit of code. I was using Copilot's code completion before I retired, but coding agents hadn't come around yet. I've been wanting to try them, but I kept putting it off, and now the price increases make it hard to justify.
So I just started trying CodeWhale (https://github.com/Hmbown/CodeWhale) with DeepSeek V4. I expected to be impressed by the abilities (which still require plenty of oversight). I didn't expect to be completely shocked by how cheep it is. After most of a week of using it 4-8 hours a day, which would amount to a full week of coding in many jobs after you account for non-coding activities, I'm about to hit $3 in total usage. So we're talking $10-20 per month for single-agent use by a full time software developer? And I'm sure some of my usage is waste as I'm still getting my head around things like compaction. If I take a break for a few weeks, I pay nothing because there is no subscription.
If DeepSeek and Xiaomi MiMo stay within a few months of the US-based models in terms of capabilities and US companies don't figure out how to drastically cut prices, I can't see how China hasn't already won. Protectionism would be one reason, but that might be ceding 50-90% of the total addressable market, and bring us closer to moving knowledge work out of the US the same way we did with manufacturing because it's too expensive in the US.
Holy F.. $3 .. once I'm done with my base cursor allocation, each nontrivial question costs $5 . And yes, I'm now switching to a mix of codex and ds4pro
Do you mean the marginal cost by the producer, or the cost on the consumer? I can't see the price of electricity falling much, and the demand curve is apparently exponential if the hype is to be believed.
DeepSeep V4 Pro is 99% cheaper than similarly performing models were 2 years ago (if such a model even existed).
Computing has always been about how to wring out more efficiency. The ENIAC was 150,000 watts, with 3 phase 240 volt power, and cost about $500,000.
My day to day laptop (a year old) is 35 watts, with 1 phase 20 volt power, and cost $1,000, so that's 99.98% less power consumption, 99.8% cheaper, and it has about 10 orders of magnitude more computing power, all on a time span of 80 years.
It died before AI came around and today's coding agents are somewhere upwards of twice as competent as whatever the state of the art of automatic coding was in 2020. 8I
A good chunk of that was one-time gains from shifting GPU and memory architectures to better match what LLMs need at scale as well as some algorithmic improvements. Most of the low-hanging architecture optimization has already been harvested. We'll certainly have more algorithmic gains but the consensus is they'll generally be smaller and less frequent.
There's always a chance we'll have some dramatic gains far larger than DeepSeek's optimizations a year ago, but it hasn't happened again yet at even that scale. It would be nice but I certainly wouldn't count on it.
I don't see how this is even remotely true. Unless there's some super breakthrough into a fundamentally different architecture, there's not really a path to a 50% reduction in price, much less a 99% reduction.
And yet 90% drops for the same level of quality every 18 months have happened like clockwork...
And the technology already exists on the algorithmic front TODAY to lock in another 10x gain -> when, typically, algorithmic gains only account for ~30% of that drop and the other ~70% comes from better data (often synthetic) and knowledge distilation from frontier models.
The technology already exists now on the algorithmic front for the next 10x drop between everyone adopting DeepSeek's MLA, MoE (mostly already done), Medusa (a better version of Google's speculative decoding), Kimi's Attn Residuals, and Mimo's Sliding Window Attn, and (possibly) Microsoft's 1.58b (this may be a nothing burger).
Historic trends, every 18 months, performance for the same level of quality has gone down 90%.
Historically, algorithmic gains are only ~30% of the pie, but there's enough out there to get to 10x, with just what's available already. The other ~70% of the pie is better training data (often synthetic) and distilling frontier knowledge. There's no sign we are tapped out on that front.
Additionally, GRAM (from ~10 days ago) is likely to be a 5-10x on its own (if not substantially more for smaller models). It's unlikely within 4 years LeCun's JEPA ideas and similar ideas like GRAM applied to LLMs have ZERO impact. The preliminary results are absolutely astounding (5000x better reasoning - this is not peanuts).
Further, that's not even counting that cost per watt is still dropping ~2x every 2 years on its own on the hardware front.
If you look at the "cost" of inference. People think it's electricity - but it's currently almost ~80% hardware amortization. The memory shortage is not going to last, nor are Nvidia's ~80-90% margins.
The human brain is still 8-10 orders of magnitude more efficient than the best LLMs of today. With ~1/10th of global capex riding on AI, if you don't think they're going to knock of 2 orders of magnitude more, when it's this obvious and easy... I don't know what to tell you...
Sure, it might take 6 years instead of 4. My crystal ball isn't perfect.
Sure, the price will come down a lot, even if we can argue about the timeline.
I think what will also happen, once we get past this current CEO AI FOMO mania, is that companies will start to look at AI spending more rationally like any other company expense, and will revert to more rational decision making.
Even if the cost comes down considerably over the next few years, that's plenty of time for companies to look at their financial results and question why AI expenditure isn't resulting in increase in revenue and/or profitability.
Additionally, on the context front -> all the labs are aware that for many tasks you can get 10x+ increases in output quality by feeding better context.
> The actual cost is going to drop 99% in ~4 years.
We have little visibility into current frontier model costs at mass scale. As a broad historical trend, tech costs tend to fall over longer time periods but your claim far exceeds Moore's Law rates in its heyday - and that heyday is long gone.
In 2021 TSMC announced it was increasing it's price per gate for new nodes for the first time in its history. In the past five years cutting edge nodes have delivered ~8-15% real-world performance gains on average at costs at least 10-20% more than the last node. If you're positing a string of unprecedented efficiency breakthroughs in LLM algorithms - such extraordinary claims require extraordinary evidence.
Prices have been very obviously trending up, not down. Even open weights models are becoming more expensive with every release. Computer hardware is ballooning in price.
There has never yet been a new model which actually improved over the previous ones. They suck just as much, and in the same ways, as the models of 3 years ago.
Grab a 5090 and run Qwen 3.6 35b on it (6 parameter seems to work best for me).
Then buy $10 (or $2, if you're cheap, and they take PayPal) of DeepSeek credits.
Whilst you're at it spring for a Claude subscription too and GPT.
Switch models between Qwen, DeepSeek Flash, DeepSeek Pro, and you can meet 99% of your code generation needs.
Hop over to Opus 4.7 (or 4.8, but I haven't really used it yet) and GPT-5.5 when doing very complex architecture/design or troubleshooting something where DeepSeek Pro is getting stuck.
It is ridiculous how cheap this stuff is now. It's affordable at third world prices.
I just tested this on a bug fixing benchmark I'm working on.
It did not perform as well as I expected. Qwen2.5-Coder-3B (2 years old) outperformed it by a wide range -> fixing ~50% of bugs whereas this model only fixed ~12%.
Granted, it's not a coder specific model, but given its benchmark performance to Gemma models, and that it's two years newer, and that it's an MoE with 8B total params, I expected it to be more competitive.
I personally find any model smaller than something like Qwen 3.6 35B-A3B (8-bit quantization, about 49GB memory usage when loaded into llama.cpp) to be too "stupid" for reliable use.
I would much rather not run the model on my local laptop hardware and offload that to some system sitting under my desk in my home office, accessible via VPN, than take the risk of using an unreliable and flaky tool for the convenience of having it on the same hardware on my lap.
I pay very little attention to 8 billion or whatever (or even much smaller) models these days and I don't feel like I'm missing much.
Yes, Qwen 3.6 MoE is hitting like 80-90tk/s on Strix halo. On R9700 I had like 170t/s. It was not possible to keep up. But MoE is circling very often. I switch then to dense model and have 20-30t/s but it is able to solve quite a lot of tasks.
Absolutely. Difference in Q6 vs Q8 is not as immediately noticeable, but if I test by starting from a blank slate context and giving it the same complicated task with Q4 vs a Q8 GGUF file loaded, the difference is apparent. The Q4 will struggle or do 'stupid' things with even simple bash or python. Q4 might not be as noticeable for conversational purely text one on one interaction with an LLM, but when you dig deeper into something that's more esoteric in a training dataset than a chat conversation, absolutely a big gap there.
I think some of the folks in the local llm social media communities are using them for things like company-hosted customer service chat bots, or purely english text writing stuff where Q4 will probably not cause a problem. For more discrete technical work I stick pretty much exclusively to Q8.
I have not spent a lot of time running FP16 'full precision' versions of some things, but as the other commenter says, it's not much difference. There's a really wide array of benchmarks and tests from a lot of third parties unrelated to the trainer of the AI models that shows at most a two percent difference in score and capability between BF16 and Q8.
Q8 quant is very minimal fall off in terms of KLD against the lab 16 bit. If you have the memory for BF16 KV-cache (which is usually easier to stomach) then the Q8 is very close. But even Q8 quant model with Q8 KV-cache is very close.
Smaller quants for the model start to fall off but more importantly, smaller KV-cache quants fall off much faster so avoid less than Q8 there.
It’s not a general rule, and depends highly on the model and the quantisation used. Don’t guess, Unsloth sometimes publish graphs in their tutorials showing the error rate vs file size… sometimes Q4 is great, other times I go for Q6
That's not all that surprising, IMO. From what I understand, LiquidAI is focusing pretty narrowly on building models that operate as the "agentic core" of a larger system.
If I were going to use this model, I'd be looking to use it more as is the primary chat interface of a larger system, and having it orchestrate & delegate tasks to other places via tool calls. It's not quite as exciting on the surface as a local "do it all" model, but it does enable some pretty neat use-cases, IMO.
I'm imagining a local agent that is super low latency, works entirely offline, and capable of queuing up complex tasks for larger/smarter cloud agents which execute them asynchronously.
I just did the same. Absolutely awful. I assume OpenCode's heavy context is a problem, and it's probably better to use Liquid's own OpenCode alternative for this.
I will test it when it's accessible via OpenRouter, but the previous LFM2 model (lfm-2-24b-a2b) didn't do well on my tests, it got only 1/20 questions/tasks right, way below Gemma 31B or Qwen 35b-a3b (those get like 10/20 right)
Some of the coding-specific fine-tunes were really impressive boosts. Qwen2.5-3B-Instruct is also available [0] -- if it's not too much to ask, I'd be curious how more general models stack up in your benchmark?
It's almost as if Postgres isn't perfect, and one size shoe doesn't fit all.
Some people want some of the benefits you get from SQLite.
SQLite is obviously not perfect, but it's an incredible piece of software, and people regularly find good ways to make use of an excellent pieces of software.
Yes, agreed on SQLite/Postgres. But I'm going to benchmark RocksDB next and see what the performance characteristics are. I suspect the LSM tree storage engine of RocksDB might perform better since agents are so write heavy when running highly concurrent workloads. After all, you are streaming LLM tokens into disk and fanning them out to subscribed clients.
Thank you for sharing a benchmark pretty much exactly like the parent comment is planning to do.
Also thanks for the incidental exposure to a DB I'd never heard of before... with a browser-based demo CozoDB may be a good way to start experimenting with Datalog.
The Zig team between 0.16 and this has really made me glad I chose Zig as the target instead of Rust - which probably would've been a lot easier to target (since it's already memory safe).
I believed it had the best build system design and was the best transpilation target, and I really believe that 6 months later.
The main reason I wanted no GC is because I think aliasing is the root of all evil, and I want a language with zero global complexity (but doesn't require a PhD to use).
reply