Why not use both? I just built a pipeline for document data extraction that uses PaddleOCR, then Gemini 3 to check and fix errors. It gets close to 99.9% on extraction from financial statements, finally on par with humans.
I did the opposite: Tesseract to get bboxes, words, and chars, and then Mistral on the clips with some reasonable reflow to preserve geometry. Paddle wasn't working on my local machine (until I found RapidOCR). Surya was also very good, but because you can't really tweak any knobs, when it failed it just kinda failed. Overall: Surya > Rapid w/ Paddle > DocTr > Tesseract, though the latter gave me the most granularity when I needed it.
Edit: Gemini 2.0 was good enough for VLM cleanup, and now 2.5 or above with structured output makes reconstruction even easier.
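The "reasonable reflow to preserve geometry" step can be sketched in a few lines. This is a hypothetical illustration (the `reflow` function and its tolerance are mine, not from the original pipeline), assuming word-level boxes in the shape Tesseract's TSV output provides:

```python
# Hypothetical sketch of the "reflow" step: given word-level boxes
# (as Tesseract's TSV output would provide), group words into lines
# by vertical proximity and order them left-to-right, so the text
# sent to the VLM preserves the page geometry.

def reflow(words, line_tol=10):
    """words: list of (text, left, top, width, height) tuples."""
    lines = []  # each entry: (top, [(left, text), ...])
    for text, left, top, w, h in sorted(words, key=lambda t: t[2]):
        for line in lines:
            if abs(line[0] - top) <= line_tol:
                line[1].append((left, text))
                break
        else:
            lines.append((top, [(left, text)]))
    return "\n".join(
        " ".join(t for _, t in sorted(line[1])) for line in sorted(lines)
    )

boxes = [
    ("Total", 10, 100, 50, 12), ("$1,234", 200, 102, 60, 12),
    ("Revenue", 10, 60, 70, 12), ("$9,876", 200, 61, 60, 12),
]
print(reflow(boxes))
# Revenue $9,876
# Total $1,234
```

Real pages need something smarter (column detection, vertical-overlap ratios instead of a fixed tolerance), but even this crude grouping keeps label/value pairs on the same line for the LLM.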
Setting aside safety for a moment, consider just hygiene: BART is shockingly dirty. Which suggests mismanagement, above and beyond just a lack of deterrence of criminality.
As for safety -- firing squads are probably not in the cards, but would jailing the violent be too much to hope for?
No. It is pretty typical for anything gov to be pretty bad. Most don't work there due to how bureaucratic it is rather than the comp. This is what my friends who work in gov say, at least.
There is a strong correlation between hiring low end people and being or becoming ever more bureaucratic. Bureaucracy like everything else is there for a reason.
BART is a government organization and all California government employee pay is public. You can see that BART has about 40 software engineers and they earn about 70% of the market rate:
It seems to me that they are overworked & underpaid and are doing a good job given the circumstances.
NIMBYs have blocked BART in Silicon Valley. BART doesn't reach Menlo Park, Palo Alto, Stanford, Mountain View, Sunnyvale, Los Altos, Santa Clara, or Cupertino. A few years ago, it finally reached San Jose.
A separate train (CalTrain) goes from SF through Silicon Valley. Last year they switched to electric trains which are faster and run more frequently. The SF CalTrain station is inconvenient (20-mins walk from downtown, under a highway), but they are working to extend CalTrain to the central SF station: https://en.wikipedia.org/wiki/Salesforce_Transit_Center#Futu... .
So Silicon Valley transit is getting better, slowly.
BART barely goes into Silicon Valley. Fremont was the closest stop up until 2017. Now it gets to North San Jose. Even if it were funded, any further extension wouldn't be complete for over a decade.
I'll bite: Silicon Valley isn't known for good infrastructure, we are just able to roll back changes very easily. The cost of getting software wrong for BART is far higher than if my div is padded incorrectly.
Not as big a deal when Q8 quantization is already considered overkill and cuts size down to 50% (with a flat 2x speed boost and no additional compute overhead, mind you), and the more common Q4_K_M is more like 30%. Definitely interesting if it can be added on top of existing quantization, but K-quants already use different precision levels for different layers depending on general perplexity impact, which is similar to this entropy metric they use, e.g. Q6 using a mix of 4-bit and 8-bit. And that's not even considering calibrated imatrix, which does something conceptually similar to FFT to compress even further.
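The size figures above can be sanity-checked with back-of-the-envelope arithmetic. The bits-per-weight values below are rough community estimates for llama.cpp quant formats (including the overhead of block scales), not exact numbers:

```python
# Rough effective bits/weight for common llama.cpp quant formats,
# compared against an fp16 baseline. Values are approximate community
# figures including scale overhead, not exact format specs.

FP16_BITS = 16.0
approx_bits = {
    "Q8_0":   8.5,
    "Q6_K":   6.6,
    "Q4_K_M": 4.8,
}

def size_gb(n_params_billion, bits):
    # billions of params * bits/weight -> gigabytes (1 byte = 8 bits)
    return n_params_billion * bits / 8

for name, bits in approx_bits.items():
    pct = bits / FP16_BITS * 100
    print(f"{name}: ~{pct:.0f}% of fp16; 70B model ~= {size_gb(70, bits):.0f} GB")
```

This is where the "Q8 is ~50%, Q4_K_M is ~30%" figures in the comment come from: 8.5/16 and 4.8/16 of the fp16 footprint.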
I do? I spend a ton of time post-training models for creative tasks.
The effects of model quantization are usually quantified in terms of performance on benchmaxxed tasks with strong logit probabilities, temp 0, and a "right" answer the model has to pick. Or, even worse, they're measured on metrics that don't map to anything except themselves, like perplexity (https://arxiv.org/pdf/2407.09141)
I agree Q8 is strong, but I also think the effects of quantization are consistently underappreciated. People often talk about how these models perform while fundamentally using 10+ variants of a single model, each with a distinct performance profile.
If you're trying to snarkily refer to the article on Dynamic Quants 2.0 and how carefully developed they were: they're comparing their quants against the methodology 99.99% of quants out there use.
The problem is not that people are making quants "haphazardly", it's that people keep parroting that various quants are "practically lossless" when they actually have absolutely no clue how lossy they are given how application specific the concept is for something as multidimensional as an LLM.
The moment anyone tries a little harder to quantify how lossy they are, we repeatedly find that the answer is "not by any reasonable definition of lossless". Even their example where Q4 is <1% away on MMLU 5-shot is probably massively helped by a calibration dataset that maps to MMLU-style tasks really well, just like constantly using WikiText massively helps models that were trained on... tons of text from Wikipedia.
So unless you're doing your own calibrated quantization with your own dataset (which is not impossible, but also nowhere near common), even their "non-haphazard" method could have a noticeable impact on performance.
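Why the calibration set matters can be shown with a toy example. This is not imatrix itself, just a minimal round-to-nearest quantizer whose scale is fit on a "calibration" sample; error stays small on data that looks like the calibration set and blows up on data that doesn't:

```python
# Toy illustration of calibration-set dependence: fit a 4-bit
# quantization scale on one distribution, then measure error on
# in-distribution vs out-of-distribution values.

def quantize(xs, scale, bits=4):
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit
    # round to the nearest representable level, clamping to the range
    return [max(-qmax - 1, min(qmax, round(x / scale))) * scale for x in xs]

def mse(xs, ys):
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

calib = [x / 100 for x in range(-100, 101)]   # "calibration" values in [-1, 1]
scale = max(abs(x) for x in calib) / 7        # scale fit on calibration data

in_dist  = [0.3, -0.7, 0.95]                  # looks like the calibration set
out_dist = [2.0, -3.5, 5.0]                   # larger than anything calibrated

print(mse(in_dist, quantize(in_dist, scale)))    # small rounding error
print(mse(out_dist, quantize(out_dist, scale)))  # clipping -> large error
```

Real calibration (imatrix, GPTQ-style) weights the error per activation channel rather than clipping like this, but the failure mode is the same: quality measured on data resembling the calibration set flatters the quant.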
You are saying that people are using quantized models haphazardly and talking about them haphazardly. I'll grant it's not the exact same thing as making them haphazardly, but I think you took the point.
The terms shouldn't be used here. They aren't helpful. You are either getting good results or you are not. It shouldn't be treated differently from further training on dataset d. The weights changed - how much better or worse at task Y did it just get?
The term is perfectly fine to use here because choosing a quantization strategy to deploy already has enough variables:
- quality for your specific application
- time to first token
- inter-token latency
- memory usage (varies even for a given bits per weight)
- generation of hardware required to run
Of those, the hardest to measure is consistently "quality for your specific application".
It's so hard to measure robustly that many will take significantly worse performance on the other fronts just to not have to try to measure it... which is how you end up with full precision deployments of a 405b parameter model: https://openrouter.ai/meta-llama/llama-3.1-405b-instruct/pro...
When people are paying multiples more for compute to side-step a problem, language and technology that allows you to erase it from the equation is valid.
And when you consider that the usual final step in the pipeline is that a sampler goes ham on the probabilities and just picks some random nonsense, the tolerance for lossy compression is fairly high.
In fact, there's a funny occurrence where Q4 models on occasion perform better than their fp16 counterparts on benchmarks run with top_k=1, since the outputs are slightly more random and they can less deterministically blunder past the local maximum into a more correct solution.
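The mechanism is easy to see in miniature. With top_k=1 (greedy decoding) the sampler always takes the argmax, so a tiny logit perturbation from quantization can permanently flip which token wins; with temperature sampling the same perturbation barely shifts the distribution. The logit values below are made up for illustration:

```python
# Greedy decoding (top_k=1) vs sampling, with a small quantization-style
# perturbation on the logits. Logit values are invented for illustration.
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def greedy(logits):
    return max(range(len(logits)), key=lambda i: logits[i])

fp16_logits = [2.00, 1.98, 0.50]   # token 0 barely wins
q4_logits   = [1.97, 2.01, 0.50]   # quantization noise flips the order

print(greedy(fp16_logits))  # 0
print(greedy(q4_logits))    # 1 -- a different "deterministic" answer

# Under sampling, both distributions give tokens 0 and 1 roughly equal
# probability, so the perturbation hardly changes behavior:
print(softmax(fp16_logits)[:2])  # both close to ~0.45
print(softmax(q4_logits)[:2])
```

So when the top two logits are nearly tied, greedy decoding amplifies quantization noise into a hard behavioral difference, which is how a Q4 model can stumble onto the "right" token the fp16 model deterministically misses.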
"strict" means something. People, including yourself, only care if there is a practical difference in performance. "this is lossless and that isn't lossless" is a completely useless statement in this realm. In many domains lossy compression is either not tolerated, not legal or not practical.
This paper is basically statistical mechanics with a quantum veneer. Two major issues:
1. Scale: They're simulating just 13 qubits with QuTiP and making grand claims about quantum thermodynamics. The computational complexity they're glossing over here is astronomical. Anyone who's actually worked with quantum systems knows you can't just handwave away the scaling problems.
2. Measurement Problem: Their whole argument about instantaneous vs time-averaged measurements is just repackaging the quantum measurement problem without actually solving anything. They're doing the same philosophical shell game that every "breakthrough" quantum paper does by moving around where they put the observer and pretending they've discovered something profound.
1. The main underpinning of this article is the analytical theory they come up with independent of their simulation. The fact that it explains a few qubits well is exactly why this is interesting. If you were to scale up their model (a spin-1/2 Ising model), you would effectively get a classical magnet, which is obviously well described by classical thermodynamics. It's in the limit of small systems that quantum mechanics makes thermodynamics tricky.
2. Their time averaging is just to remove fluctuations in the state, not to avoid the measurement problem. They're looking at time averages of the density matrix, which still yields a quantum object that will collapse upon measurement. And as their mathematical model points out, this holds for arbitrary time-averaging windows; the bounds just change accordingly, since smaller averaging windows allow for larger fluctuations. There's nothing being swept under the rug here.
As long as they are isolated, their state is a superposition of all possible states and evolves deterministically, with the amplitude of each of these "sub-states" evolving perfectly deterministically. If you want to perform a measurement, you choose a possible decomposition of the superposition state and measure along that axis, and you'll get one of the values along that axis with a probability equal to the squared modulus of that value's (complex) amplitude.
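The rule described above (the Born rule) fits in a few lines; here's a minimal sketch for a single qubit in an equal superposition:

```python
# Born rule for a single qubit: measurement probabilities are the
# squared moduli of the complex amplitudes in the chosen basis.
import math

# State (|0> + i|1>) / sqrt(2): equal-weight superposition,
# with a relative phase of i on |1>.
state = [complex(1, 0) / math.sqrt(2), complex(0, 1) / math.sqrt(2)]

probs = [abs(a) ** 2 for a in state]
print(probs)  # ~[0.5, 0.5]: either outcome is equally likely

# Probabilities over any full decomposition must sum to 1
# (the state is normalized):
assert abs(sum(probs) - 1) < 1e-12
```

Note that the phase `i` on the second amplitude doesn't affect these probabilities at all; it only matters if you measure in a different basis, which is exactly the "choose a decomposition" step in the comment.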
What a nice comment!! This has been a big failing of my mental model. I always believed if I was smart enough I should understand things without effort. Still trying to unlearn this....
Unfortunately you must look closely at the details to deeply understand how something works. Even when I already have a decent mental heuristic about how an algorithm works, I get a much richer understanding by calculating the output of an algorithm by hand.
At least for me, I don't really understand something until I can see all of the moving parts and figure out how they work together. Until then, I just see a black box that does surprising things when poked.
It's also important to learn how to "teach yourself".
Understanding transformers will be really hard if you don't understand basic fully connected feedforward networks (multilayer perceptrons). And learning those is a bit challenging if you don't understand a single unit perceptron.
Transformers have the additional challenge of somewhat odd terminology. Keys, queries, and values kinda make sense coming from the traditional information-retrieval literature, but they're more of a metaphor in the attention mechanism. "Attention" and other mentalistic/anthropomorphic terminology can also easily mislead intuitions.
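One way to deflate the metaphor is to write the attention step out; it's just three projections fed through a softmax-weighted average. A minimal single-head sketch (plain lists, no batching or masking, toy numbers):

```python
# Minimal scaled dot-product attention over plain Python lists,
# to show Q/K/V are nothing more exotic than a weighted average.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Q, K, V: lists of vectors (one per token)."""
    d = len(K[0])
    out = []
    for q in Q:
        # similarity of this query to every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # softmax-weighted average of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# One query that matches the first key far more strongly than the second,
# so the output lands very close to the first value vector:
Q = [[1.0, 0.0]]
K = [[10.0, 0.0], [0.0, 10.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention(Q, K, V))  # roughly [1.0, 2.0]
```

In a real transformer Q, K, and V are learned linear projections of the same token embeddings, but the "retrieval" is only ever this soft mixing; nothing is looked up discretely.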
Getting a good "learning path" is usually a teacher's main task, but you can learn to figure those by yourself by trying to find some part of the thing you can get a grasp of.
Most complicated seeming things (especially in tech) aren't really that complicated "to get". You just have to know a lot of stuff that the thing builds on.
99% perspiration, 1% inspiration, as the adage goes... and I completely agree.
The frustration for the curious is that there is more than you can ever learn. You encounter something new and exciting, but then you realize that to really get to the spot where you can contribute will take at least a year or six, and that will require dropping other priorities.