
> buy

good luck


How reliable is this uptime? And why is it so different from GH's official status numbers?

Their headline figure is a bit exaggerated: it's derived from the official status numbers, but aggregates across all GH services.

Imagine you run 365 services, and each goes down 1 day a year.

If those all happen on the same day, this would report you as having 99.7% uptime.

If instead each service goes down 1 day per year but on different days, this would report you as having 0% uptime.

Despite the same actual downtime for any given service.
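
A quick sketch of that arithmetic, assuming the tracker counts a day as down if any service had an incident that day (a toy model, not the site's actual code):

    # Toy model: 365 services, each down exactly 1 day per year.
    DAYS = 365

    # Scenario A: every service fails on the same single day.
    bad_days = {0}
    print(f"{1 - len(bad_days) / DAYS:.1%}")  # 99.7% uptime

    # Scenario B: each service fails on its own distinct day.
    bad_days = set(range(DAYS))
    print(f"{1 - len(bad_days) / DAYS:.1%}")  # 0.0% uptime

    # Either way, each individual service was up 364 of 365 days.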

The truth is somewhere in the middle: GitHub has run degraded for a significant amount of time.

But I don't think it's fair to take an incident like this one[1], where 5% of requests were incorrectly denied authorisation, and count it the same as you would the whole of GitHub being down.

[1] https://www.githubstatus.com/incidents/02z04m335tvv


Yeah, it's a hard problem to accurately tell people a reliability number.

Rachel famously wrote about this in "Your nines are not my nines"[0].

The truth, though, is that some systems depend on others. Actions being down means you can't merge code or release; but, you know, git operations being unavailable has the same effect. It's meaningless to separate the two.

So it depends on the framing.

[0]: https://rachelbythebay.com/w/2019/07/15/giant/


1. This one counts downtime from any service, so if anything is down or degraded they count it as 100% down, which is harsh.

2. GitHub is doing some classic big-org sneaky things where they don't count degraded service fully. So if GitHub Actions is partially down for most people in a way that makes you say "GitHub is down", there's a good chance that Microsoft doesn't count that, or only counts it partially.


> GitHub is doing some classic big-org sneaky things where they don't count degraded service fully.

An even worse example is Travis CI. For more than a year, their CI jobs have sometimes gotten stuck or not started for days, and, surprise surprise, it never shows up on their status page[1]: always green. We would switch to something else entirely if not for the unique offering of PowerPC and SystemZ servers/runners. Apart from that, it's the worst CI service I've used so far.

[1] https://www.traviscistatus.com/history


> How reliable is this uptime?

It seems to be quoting incident reports for the duration of each outage, so there is accountability, in that you can verify the details of what they are counting.

> And why is it so different from GH's official status numbers?

Maybe this is counting any period with any service showing any level of issue as a complete failure, while the official numbers are cherry-picking a bit (only counting core services? not counting significant performance issues, which the other count does, because things were technically working, just v…e…r…y … s…l…o…w…l…y) or averaging values (so 75% of services running at a given time looks only ¼ as bad in their figures). Or the two sets of calculations could be done at different granularities, …

In other words: lies, damned lies, and statistics!

The only way to know is to know how both are calculated in detail, and that information might not be readily available.


There is a link to the repo where you can verify the code and see their process explained.

I've been a Zed user for almost 6 months. I've encountered many bugs, which I reported or which had already been reported. They're still there. Meanwhile, every single update shipped a feature or bugfix for "AI agents".

Not sure how 1.0 ships with that massive pile of bugs, but AI agents are first-class citizens in this editor, and developer experience is not a priority.

Funny thing is, I uninstalled Zed right before the 1.0 release. Kinda relieved I didn't miss anything.


I have a few lightweight apps using the DeepSeek API, and it's funny how the initial credit I topped up for R1 is still there. Nothing makes a user happier than getting more for less. cc: Anthropic with its fancy token-wasting Claude Code "features"

Not like OpenAI, where the credits just expire.

HN, is this true?


I did a quick benchmark & compared it with Qwen3.5: https://github.com/ArmanJR/PrismML-Bonsai-vs-Qwen3.5-Benchma...

In my results, accuracy-wise, Ternary-Bonsai-8B is on par with Qwen3.5-4B. But in accuracy per byte, Bonsai is the clear winner:

=> Ternary-Bonsai-1.7B achieved 65.1% from 462 MiB, beating Qwen3.5-0.8B by 12 points while being ~5% smaller on disk.

=> Ternary-Bonsai-4B is the accuracy-per-byte winner above 1 GiB: 83.0% from only 1.1 GiB, within 2 points of Qwen3.5-4B at 40% of the weight size.

They show strong promise on edge devices and where disk space is limited. I think this lab is worth watching.


While it seems even with 4.7 we will never see the quality of the early 4.6 days, some dude is posting 'AGI arrived!!!' on Instagram and LinkedIn.


I recall a Qwen exec posted a public poll on Twitter asking which Qwen3.6 model people wanted to see open-sourced, and the 27B variant was by far the most popular choice. Not sure why they ignored it lol.


The 27B model is dense. Releasing a dense model first would be terrible marketing, whereas 35A3B is a lot smarter and more quick-witted by comparison!


Each has its pros and cons. Dense models of equivalent total size obviously run slower, all else being equal. However, 35A3B is absolutely not 'a lot smarter'; in fact, if you set aside the slower inference rates, Qwen3.5 27B is arguably more intelligent and reliable. I use both regularly on a Strix Halo system. Just see the comparison table here: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF . The problem you have to acknowledge if running locally (especially for coding tasks) is that your primary bottleneck quickly becomes prompt processing (NOT token generation), and there the differences between dense and MoE are variable and usually negligible.


Could you explain why prompt processing is the bottleneck, please? I've seen this behavior but I don't understand why.


You should be able to save a lot on prefill by stashing KV-cache shared prefixes (since KV-cache for plain transformers is an append-only structure) to near-line bulk storage and fetching them in as needed. Not sure why local AI engines don't do this already since it's a natural extension of session save/restore and what's usually called prompt caching.
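
A minimal sketch of the idea, with a hypothetical on-disk store keyed by a hash of the prompt's token prefix (the names and storage layout here are assumptions for illustration, not any engine's real API):

    import hashlib, os, pickle

    CACHE_DIR = "kv_cache"  # stands in for near-line bulk storage
    os.makedirs(CACHE_DIR, exist_ok=True)

    def prefix_key(tokens):
        # KV cache is append-only, so any token prefix identifies a reusable state
        return hashlib.sha256(repr(tokens).encode()).hexdigest()

    def save_prefix(tokens, kv_state):
        with open(os.path.join(CACHE_DIR, prefix_key(tokens)), "wb") as f:
            pickle.dump(kv_state, f)

    def load_longest_prefix(tokens):
        # Walk back from the full prompt to the longest stashed prefix
        for n in range(len(tokens), 0, -1):
            path = os.path.join(CACHE_DIR, prefix_key(tokens[:n]))
            if os.path.exists(path):
                with open(path, "rb") as f:
                    return n, pickle.load(f)
        return 0, None

Prefill then only has to run over the tokens past position n, instead of the whole prompt.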


If I understand you correctly, this is essentially what vLLM does with their paged cache. If I've misunderstood, I apologize.


Paged Attention is more of a low-level building block, aimed initially at avoiding duplication of shared KV-cache prefixes in large-batch inference. But you're right that it's quite related. The llama.cpp folks are still thinking about it, per a recent discussion from that project: https://github.com/ggml-org/llama.cpp/discussions/21961
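
For a rough picture of what that building block does, here's a toy copy-on-write block table (purely illustrative, not vLLM's actual structures): each sequence maps its KV positions to physical blocks, so forked continuations share the prefix blocks instead of duplicating them.

    class BlockTable:
        def __init__(self):
            self.next_id = 0
            self.refcount = {}  # physical block id -> number of sequences using it
            self.seqs = {}      # sequence name -> list of physical block ids

        def append_block(self, name):
            # Allocate a fresh physical block for this sequence's next KV chunk
            b = self.next_id
            self.next_id += 1
            self.refcount[b] = 1
            self.seqs.setdefault(name, []).append(b)

        def fork(self, parent, child):
            # Child reuses the parent's blocks; only refcounts change, no copies
            self.seqs[child] = list(self.seqs[parent])
            for b in self.seqs[child]:
                self.refcount[b] += 1

So N sampled continuations of one large prompt pay for the prompt's KV blocks only once.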


I was hoping this would be the model to replace our Qwen3.5-27B, but the difference is marginal. Too risky; I'll pass and wait for the release of a dense version.


"…whereas 35A3B is a lot smarter…"

Must. Parse. Is this a 35 billion parameter model that needs only 3 billion parameters to be active? (Trying to keep up with this stuff.)

EDIT: A later comment seems to clarify:

"It's a MoE model and the A3B stands for 3 Billion active parameters…"


That makes no sense. If you were just going to release the "more hype-able because it's quicker" model, then why have a poll?


What? 35B-A3B is not nearly as smart as 27B.


One interesting thing about Qwen3 is that, looking at the benchmarks, the 35B-A3B models seem to be only a bit worse than the dense 27B ones. This is very different from Gemma 4, where the 26B-A4B model is much worse than the 31B on several benchmarks (e.g. Codeforces, HLE).


> This is very different from Gemma 4, where the 26B-A4B model is much worse than the 31B on several benchmarks (e.g. Codeforces, HLE).

Wouldn't you totally expect that, since 26A4B is lower on both total and active params? The more sensible comparison would pit Qwen 27B against Gemma 31B and Gemma 26A4B against Qwen 35A3B.


They're comparing Qwen's moe vs dense (smaller difference) against Gemma's moe vs dense (bigger difference). Your proposed alternative misses the point.


Gemma's dense is bigger than its moe's total parameters. You could totally expect the moe to do terribly by comparison.


Yeah, the 27B feels like something completely different. If you use it on long-context tasks, it performs WAY better than 35B-A3B.


I've been telling analysts/investors for a long time that dense architectures aren't "worse" than sparse MoEs and to continue to anticipate the see-saw of releases on those two sub-architectures. Glad to continuously be vindicated on this one.

For those who don't believe me: go take a look at the logprobs of a MoE model and a dense model, and let me know if you notice anything. Researchers sure did.


Dense is (much) worse in terms of training budget. At inference time, dense is somewhat more intelligent per bit of VRAM, but much slower, so for a given compute budget it's still usually worse in terms of intelligence-per-dollar even ignoring training cost. If you're willing to spend more you're typically better off training and running a larger sparse model rather than training and running a dense one.

Dense is nice for local model users because they only need to serve a single user and VRAM is expensive. For the people training and serving the models, though, dense is really tough to justify. You'll see small dense models released to capitalize on marketing hype from local model fans but that's about it. No one will ever train another big dense model: Llama 3.1 405B was the last of its kind.


Do you want to take bets on this? I'm willing to bet 500 USD that an open-access dense model of at least 300B parameters is released by some lab within 3 years.


MoE isn't inherently better, but I do think it's still an underexplored space. When your sparse model can do 5 runs on the same prompt in the time a dense model takes to generate one response, all sorts of interesting possibilities open up.


Yes.


Based on how the 3.5 releases were scheduled, my optimistic take is that they distill the small models from the 397B one, and it is much faster to distill a sparse A3B model. Hopefully the other variants will be released in the coming days.


Probably coming next


I'm guessing 3.5-27B would beat 3.6-35B. MoE is a bad idea: for the same VRAM, the 27B would leave a lot more room for context, and the quality of work directly depends on context size, not just the "B" number.
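
Back-of-envelope for the VRAM point (the layer/head counts and quant sizes here are made-up but plausible assumptions, not the real model configs):

    GIB = 1024**3

    def kv_bytes_per_token(layers=48, kv_heads=8, head_dim=128, bytes_per=2):
        # K and V per layer, fp16 cache
        return 2 * layers * kv_heads * head_dim * bytes_per

    vram = 32 * GIB
    weights = 27e9 * 0.5  # ~4-bit quant: roughly 0.5 bytes per parameter
    context_tokens = (vram - weights) / kv_bytes_per_token()
    print(int(context_tokens))  # how many tokens of KV cache still fit

Swap in a bigger weight file and the affordable context shrinks accordingly.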


MoE is not a bad idea for local inference if you have fast storage to offload to, and this is quickly becoming feasible with PCIe 5.0 interconnect.


MoE is excellent for unified-memory inference hardware like the DGX Spark, Apple's Mac Studio, etc. A large memory pool means you can have quite a few B's, and the smaller experts keep those tokens flowing fast.
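
Rough intuition for the speed claim, as a sketch (the bandwidth and quantization numbers are assumptions): decode speed is bounded by memory bandwidth divided by the bytes of weights read per token, which for a MoE is dominated by the active parameters.

    bandwidth = 256e9      # bytes/sec; a plausible unified-memory figure
    bytes_per_param = 0.5  # ~4-bit quantization

    for name, active in [("35B-A3B", 3e9), ("27B dense", 27e9)]:
        print(name, round(bandwidth / (active * bytes_per_param)), "tok/s upper bound")
    # ~171 vs ~19: the MoE reads far fewer weights per generated token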


Kinda ironic that you can clearly see signs of Claude: the table walls are misaligned in the readme doc.


Parenthesized, comma-separated lists with no "and" are an even stronger tell. Claude loves those.


I also use those extensively; they just flow better, especially if you have an "and" in the surrounding sentence.


> Kinda ironic that you can clearly see signs of Claude: the table walls are misaligned in the readme doc.

This one is such a gigantic clusterfuck... They're mimicking ASCII tables using Unicode chars of varying display width, and, at times, there's also an off-by-one error. But the model (not Claude, but the model underneath it) is perfectly capable of generating plain ASCII tables.
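
Sketch of the underlying issue: box-drawing and CJK characters don't all occupy one terminal column, so padding computed from code-point counts puts the walls in the wrong place.

    import unicodedata

    def columns(s):
        # Simplified wcwidth: Fullwidth/Wide chars take 2 terminal columns;
        # box-drawing chars are "ambiguous" width and vary by terminal
        return sum(2 if unicodedata.east_asian_width(c) in ("F", "W") else 1 for c in s)

    row = "│ 合計 │"
    print(len(row), columns(row))  # 6 code points, 8 display columns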

P.S.: I saw the future... The year is 2037 and Unicode tables are still not properly aligned.


I mean, just reading the readme content, it's pretty obvious it's Claude.


> This project is early and experimental. Core concepts are settled, but expect rough edges. Local mode: relatively stable - Hub-based workflows: ~80% verified - Kubernetes runtime: early with known rough edges

I guess Gastown is a better choice for now? Idk, I don't feel good about "relatively stable".


imagine thinking gas town is a better choice over _literally anything else_

