
Compression is AI. It’s never going to be “all” figured out.


Another way of saying it is that compression is understanding.


Lossy compression is, I feel compelled to add.


Actually both! Arithmetic coding works over any kind of predictor.
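
To make that concrete, here's a minimal sketch of the idea (all the names are made up for illustration, not from any real codec): an arithmetic coder spends about -log2(p) bits on a symbol its predictor assigned probability p, so anything that outputs probabilities plugs straight in. This just tallies that ideal code length under a simple adaptive predictor:

    import math
    from collections import Counter

    class Order0Predictor:
        """Adaptive order-0 model: Laplace-smoothed counts of symbols seen so far."""
        def __init__(self, alphabet_size=256):
            self.counts = Counter()
            self.total = 0
            self.alphabet_size = alphabet_size

        def prob(self, symbol):
            # Smoothing keeps unseen symbols from getting probability zero.
            return (self.counts[symbol] + 1) / (self.total + self.alphabet_size)

        def update(self, symbol):
            self.counts[symbol] += 1
            self.total += 1

    def ideal_code_length_bits(data, predictor):
        bits = 0.0
        for symbol in data:
            bits += -math.log2(predictor.prob(symbol))  # arithmetic-coding cost
            predictor.update(symbol)  # the decoder makes the same update, so it's lossless
        return bits

    text = b"the quick brown fox jumps over the lazy dog " * 100
    bits = ideal_code_length_bits(text, Order0Predictor())
    print(f"{len(text)} bytes -> ~{bits / 8:.0f} bytes")

Swap in a smarter predictor (up to and including a neural language model) and the same coder emits fewer bits, which is exactly the compression-is-AI connection.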


End credits are just text. So it should be possible to put them through OCR and save only the text, positions, and fonts. And the text itself could be compressed with a dictionary.


Credits also contain logos/symbols (near the end), and often have stylistic flairs as well. Video compression is based on making predictions and then adding information (per Shannon's definition) for the deltas from those predictions. The pattern of credits statically sliding at a consistent rate is exactly the sort of prediction codecs are optimized for; for instance, the same algorithms will save space by predicting repeated pixel patterns during a slow camera pan.
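
As a toy illustration of that prediction-plus-residual structure (nothing here is a real codec's API; scrolling is assumed to be a whole number of pixels per frame): predict each frame by shifting the previous one, and only the small prediction error needs bits.

    import numpy as np

    def predict_scroll(prev_frame, dy):
        """Predict the next frame by scrolling prev_frame up by dy rows."""
        pred = np.zeros_like(prev_frame)
        pred[:-dy] = prev_frame[dy:]  # content moves up; the bottom rows are unknown
        return pred

    rng = np.random.default_rng(0)
    height, width, dy = 240, 320, 2

    frame0 = rng.integers(0, 256, (height, width), dtype=np.uint8)
    frame1 = np.zeros_like(frame0)
    frame1[:-dy] = frame0[dy:]                                        # scrolled content
    frame1[-dy:] = rng.integers(0, 256, (dy, width), dtype=np.uint8)  # newly revealed rows

    residual = frame1.astype(int) - predict_scroll(frame0, dy).astype(int)
    print("nonzero residual pixels:", np.count_nonzero(residual), "of", frame1.size)

Only the dy newly revealed rows differ from the prediction, so almost nothing is left to encode.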

Still, I've often thought it would be nice if text were a more first-class citizen within video codecs. I think it's more a toolchain/workflow problem than a shortcoming in video compression technology as such. Whoever is mastering a Blu-Ray or prepping a Hollywood film for Netflix is usually not the same person cutting and assembling the original content. For innumerable reasons (access to raw sources, low return on time spent, chicken-egg playback compatibility), it just doesn't make sense to (for instance) extract the burned-in stylized subtitles and bake them into the codec as text+font data, as opposed to just merging them into the film as pixels and calling it a day.

Fun fact: nearly every Pixar Blu-Ray is split into multiple forking playback paths for different languages, such that if you watch it in French, any scenes with diegetic text (newspapers, signs on buildings) are re-rendered in French. Obviously that's hugely inefficient; yet at 50GB, there's storage to spare, so why not? The end result is a nice touch and a seamless experience.


Text with video is difficult to do correctly for a few different reasons. Just rendering text well is a complicated task that's often done poorly. Allowing arbitrary text styling adds even more complexity. However, for the sake of accessibility (and/or regulations) you need some level of styling ability.

That's all before you get to complexity like text synced to the video/audio content, or handling multiple simultaneous speakers. And that in turn is separate from the workflow/tooling issues you mentioned.

The MPEG-4 spec kind of punted on text and supports only fairly basic timed-text subtitles. Text essentially gets a timestamp for when it appears and a duration. There's minimal ability to style the text, and there are limits on the availability of fonts, though it does allow Unicode, so most languages are covered. It's possible to do tricks where you style words at timestamps to get a karaoke effect or to identify speakers, but that's all on the creation side and is very tricky.

The Matroska spec has much more robust support for text, but it amounts to preserving the original subtitle/text encoding in the file and letting the player software figure out what to do with that particular format, then displaying it as an overlay on the video.

It's unfortunate that text doesn't get more first-class love from multimedia specs. There's a lot that could be done: titles and credits, as you mention, but also better integration of descriptive or reference text, or hyperlinkable anchors.


MPEG-4 (taken as the whole body of standards, not as the two particular video codecs) actually has provisions for text content, vector video layers, and even rudimentary 3D objects. On the other hand, I'm almost sure there are no practical implementations of any of that.


Oh, and that's only the beginning. The MPEG-4 standard also includes some pretty wacky kitchen-sink features like animated human faces and bodies (defined in MPEG-4 part 2 as "FBA objects"), and an XML format for representing musical notation (MPEG-4 part 23, SMR).


Don't forget Java bytecode tracks!


Scene releases often had compression settings optimized for credits (a low keyframe rate, black-and-white, aggressive motion compensation, etc.).


The text, positions and fonts could very well take up more space than the compressed video. And then with fonts, you have licensing issues as well.


Recognizing text and using it to increase compression ratios is possible. I believe that's what this 1974 paper is about:

https://www.semanticscholar.org/paper/A-Means-for-Achieving-...


True, but end credits take very little space compared to the rest of the movie.


x264 is kinda absurdly good at compressing screencasts; even a nearly lossless 1440p screencast averages only about 1 Mbit/s. The only artifacts I can see are due to 4:2:0 chroma subsampling (i.e. color bleed on single-pixel borders and such), but that has nothing to do with the encoder, and it would almost certainly not happen in 4:4:4, which essentially nothing supports as far as distribution goes.
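
If you want to see where that 4:2:0 bleed comes from, here's a rough numpy sketch of my own (real codecs use better resampling filters, but the 2x2 averaging is the essence):

    import numpy as np

    def subsample_420(chroma):
        """Average 2x2 blocks (the '4:2:0' part), then repeat back to full size."""
        h, w = chroma.shape
        small = chroma.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
        return small.repeat(2, axis=0).repeat(2, axis=1)

    # A single-pixel-wide vertical line of strong chroma on a neutral background.
    cb = np.zeros((8, 8))
    cb[:, 3] = 100.0

    out = subsample_420(cb)
    print(out[0])  # the line's chroma is smeared across two columns at half strength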


Why not use deep learning to recognize actor face patterns in scenes and build entire movies from AI models?


I'm not super strong on theory, but if I'm not mistaken, doesn't Kolmogorov complexity (https://en.wikipedia.org/wiki/Kolmogorov_complexity) say we can't even know if it is all figured out?

The way I understand it is that one way to compress a document would be to store a computer program and, at the decompression stage, interpret the program so that running it outputs the original data.
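
A toy version of that, with the "program" just being a Python expression (eval of untrusted input is unsafe, so this is purely illustrative):

    # The archive is source code; decompression is running it.
    data = "ab" * 1_000_000               # 2,000,000 characters
    program = '"ab" * 1_000_000'          # a 16-character "archive"

    assert eval(program) == data
    print(len(program), "characters of program reproduce", len(data), "characters of data")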

So suppose you have a program of some size that produces the correct output, and you want to know whether a smaller program can do the same. You examine one of the possible smaller programs and observe that it has been running for a long time. Is it ever going to halt and produce the desired output, or will it run forever? To answer that (in general), you have to solve the halting problem.

(This applies to lossless compression, but maybe the idea could be extended to lossy as well.)


I really ain't a theorist either, but:

If you are looking at Kolmogorov complexity, you're right: we can't ever know. But Kolmogorov complexity is about single points in the space of possible outputs. It basically says "there might be possible outputs that look random but are actually produced by a very short encoding". One example would be the digits of pi.

But if you look at the overall statistics of possible output streams, and at their averages, there is a lower bound for compression on average. As soon as the bit length of the compressed stream matches the entropy of the uncompressed stream, you've reached maximum compression. Individual streams may not conform to those statistics, but the average over all streams will.
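
For example, you can compute that floor directly and compare it with what an off-the-shelf compressor manages (zlib here is just a convenient stand-in for any lossless coder):

    import math, random, zlib
    from collections import Counter

    random.seed(0)
    # An i.i.d. source: 90% 'A', 10% 'B'. Entropy is about 0.469 bits/symbol.
    data = bytes(random.choices([65, 66], weights=[0.9, 0.1], k=100_000))

    n = len(data)
    entropy = -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

    print(f"entropy floor: {entropy * n / 8:.0f} bytes")
    print(f"zlib output:   {len(zlib.compress(data, 9))} bytes")

No matter how clever the compressor, on average it can't land under that first number; the best it can do is get close.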

However, we are still somewhat far from that matched-entropy equilibrium in video compression. And even then, improvements can be made, not in compression ratio but in the time, operations, and energy needed for encoding and decoding.



