
I may be misunderstanding the question, but that should just be decompressing the gzip stream and recompressing with something better like zstd (while saving the gzip options so it can be recompressed back); it won't avoid decompressing and recompressing gzip, though.
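A minimal sketch of what I mean, assuming the python-zstandard package (the gzip metadata you'd actually need to record to rebuild a byte-identical .gz is more involved than shown):

  import gzip
  import zstandard as zstd  # pip install zstandard

  def gzip_to_zstd(gz_path, zst_path, level=19):
      # Decompress the original gzip stream to raw bytes...
      with gzip.open(gz_path, "rb") as f:
          raw = f.read()
      # ...and recompress with zstd for a better ratio and faster decompression.
      with open(zst_path, "wb") as out:
          out.write(zstd.ZstdCompressor(level=level).compress(raw))
      # To restore the original .gz later you would also need to save the gzip
      # options (compression level, mtime, header fields, etc.).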


Does refactoring mean moving things around, for people? Why not use your IDE for this? It already handles fixing imports (or use find-replace), and it's faster and deterministic.


Not always, but since OP mentioned it was "deleting and rewriting" files - that's how the CLI agents usually "move" files.

And sure, you can use an IDE, but that's harder to do if you live in vibe land. (We really need to understand that for some things, we have perfectly fine non-AI answers, but that's not the world as it is right now. Mechanical refactors, move + import fixes, autocomplete - all of those do not require an LLM. We're not great at drawing that line yet)


Not necessarily -- in the case I posted about, we first abstracted some common functionality to internal libs, and then further abstracted that functionality into a number of packages (so they could be used by other clients).

So it was part simplification (dedupe+consolidate), and part moving files around.


Do you know about other security issues? If it's only about curl | sh, it really isn't a problem: if the same website showed you a hash to check the file, the hash would be compromised at the same time as the file, and with a package manager you still end up executing code from the author, who is free to download and execute anything else. Most package managers don't add security.


Why should we expect companies to be able to reuse the correct token if they can't coordinate on using a single domain in the first place?


Your assumption that they use more than one domain by accident due to a lack of coördination is not correct. Separating, e.g. your product email from your mailing list email from your corporate email has a number of benefits.

Anyway, I already mentioned a solid incentive for them to use the correct token. Go back and read my earlier comment.


It is correct at least in some cases. https://news.ycombinator.com/item?id=45190323


Nvidia Parakeet and Canary are better and faster; here is a leaderboard: https://huggingface.co/spaces/hf-audio/open_asr_leaderboard


> Nvidia parakeet and canary are better and faster

Is that based on your own experience using those and also Whisper, comparing them side-by-side? Or is that based just on those benchmark results?


Yes for Parakeet, but only comparing benchmark results for Canary. Whisper also has severe hallucinations on silence and noise; WhisperX helps a lot by adding voice activity detection, i.e. a model that detects when someone is speaking, to filter the input before running Whisper. https://github.com/m-bain/whisperX
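To illustrate the idea with a toy sketch (WhisperX uses an actual VAD model, not an energy threshold; this heuristic is only here to show what "filter the input before running Whisper" means):

  import numpy as np

  def drop_silence(audio, sr=16000, frame_ms=30, threshold=0.01):
      # Keep only frames whose RMS energy exceeds a threshold, as a crude
      # stand-in for voice activity detection.
      frame = int(sr * frame_ms / 1000)
      kept = [audio[i:i + frame]
              for i in range(0, len(audio) - frame + 1, frame)
              if np.sqrt(np.mean(audio[i:i + frame] ** 2)) > threshold]
      return np.concatenate(kept) if kept else np.zeros(0, dtype=audio.dtype)

  # transcribe(drop_silence(audio)) instead of transcribe(audio), so the model
  # never sees long stretches of silence/noise it could hallucinate on.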


Parakeet isn’t more accurate than whisper large


No, you never compute individual pixels because you never need to; it's always faster to do it in bulk (vectorization, memory access...). So over an area you read the same number of pixels as input (or a little bit more with padding), and the blur will only significantly increase the compute.
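A rough numpy sketch of that point (a separable box blur standing in for a Gaussian, sizes invented, tile assumed away from the image border): producing a blurred tile only needs the same input region plus a small padding border, so the extra cost is mostly arithmetic, not extra pixel fetches.

  import numpy as np

  def blur_tile(image, x, y, tile=64, radius=4):
      # To produce a tile x tile output, read a (tile + 2*radius)^2 input
      # region: the same pixels as the unblurred case plus a padding border.
      pad = radius
      src = image[y - pad:y + tile + pad, x - pad:x + tile + pad]
      k = np.ones(2 * radius + 1) / (2 * radius + 1)  # separable box kernel
      rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, src)
      out = np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)
      return out  # tile x tile result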


You misunderstood: this is not about computing individual pixels, but about selectively rerendering only the graphical elements that have changed, and in turn figuring out the total area of change. This propagates through the entire stack to let the GPU scanout hardware know which tiles have changed, and allows partial panel self-refresh updates (depending on hardware).

Rendering is still done in bulk for the changed areas, avoiding re-rendering expensive elements that haven't changed (e.g., transformed video buffers, deeply layered effects, expensive shaders). It's a fundamental part of most UI frameworks.


Are windowed GUIs still doing diffed screen updates? I would have assumed that GPUs make this kind of thing very unrewarding to implement as an optimisation. I'd imagine every window is being redrawn every frame as a 2D billboard with textures and shaders.

The Gaussian blur and lensing effects would still slow things down by needing to fetch pixels from the render target to compute the fragment, vs painting opaque pixels.


The usual mechanism is to mark widgets that changed dirty, accumulate the bounding boxes of such dirty areas, take the next swapchain buffer and get its invalid regions, iterate through the widget tree and render anything that intersects with the bounding box or invalid regions, and submit the buffer + the dirty areas to the display server/driver.
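A compressed sketch of that flow, with hypothetical names (Python just for illustration):

  from dataclasses import dataclass

  @dataclass
  class Rect:
      x: int
      y: int
      w: int
      h: int

      def union(self, o):
          x1, y1 = min(self.x, o.x), min(self.y, o.y)
          x2 = max(self.x + self.w, o.x + o.w)
          y2 = max(self.y + self.h, o.y + o.h)
          return Rect(x1, y1, x2 - x1, y2 - y1)

      def intersects(self, o):
          return (self.x < o.x + o.w and o.x < self.x + self.w and
                  self.y < o.y + o.h and o.y < self.y + self.h)

  def render_frame(widgets, buffer_invalid_regions):
      # 1. Accumulate the bounding boxes of widgets marked dirty this frame.
      damage = None
      for w in widgets:
          if w.dirty:
              damage = w.bounds if damage is None else damage.union(w.bounds)
      # 2. Cull: only widgets intersecting the damage or the buffer's
      #    invalid regions get re-rendered.
      regions = ([damage] if damage else []) + list(buffer_invalid_regions)
      to_render = [w for w in widgets
                   if any(w.bounds.intersects(r) for r in regions)]
      # 3. The caller renders `to_render` into the swapchain buffer and
      #    submits the buffer together with `damage` to the display server.
      return to_render, damage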

And yeah, having a render step depend on the output of a previous non-trivial render step is Bad™.


I was under the impression that for GPU accelerated GUIs, all windows are rendered to a render target. It might be that windows underneath have gone to sleep and aren't updating, but they would have their last state rendered to a texture. This permits things like roll-over previews and layered effects to have a more trivial overhead.

Software renderers typically do the optimisation you're suggesting to reduce on memory and CPU consumption, and this was a bigger deal back in the day when they were the only option. I think some VNC-like protocols benefit from this kind of lazy rendering, but the actual VNC protocol just diffs the entire frame.

On the GPU, the penalty for uploading textures via the bus negates the benefit, and the memory and processing burden is minimal relative to AAA games, which push trillions of pixel computations and use GBs of compressed textures. GPUs are built more like signal processors and have quite large bus sizes, with memory arranged to make adjacent pixels more local to each other. Their very nature makes the graphics demands of a 2D GUI negligible.


> I was under the impression that for GPU accelerated GUIs, all windows are rendered to a render target.

Each window renders to one or more buffers that it submits to the display server, which will then be either software or hardware composited ("software" here referring to using the GPU to render a single output buffer vs. having the GPU scanout hardware stitch the final image together from all the source buffers directly).

Note that in the iPhone cases, the glass blur is mostly an internal widget rendered by the app, what is emitted to the display server/hardware is opaque.

> It might be that windows underneath have gone to sleep and aren't updating,

The problem with blur is when content underneath does update, it requires the blur to also update, and rendering of it cannot start until the content underneath completed rendering.

> Software renderers typically do the optimisation you're suggesting to reduce on memory and CPU consumption,

I am solely speaking about GPU-accelerated rendering, where this optimization is critical for power efficiency. It's also required to propagate all the way down to the actual scanout hardware.

It also applies to CPU rendering (and gpu-accelerated rendering still CPU renders many assets), but that's not what we're talking about here.

> I think some VNC-like protocols benefit from this kind of lazy rendering, but the actual VNC protocol just diffs the entire frame.

Most modern, non-VNC remote desktop protocols use h264 video encoding. Damage is still propagated all the way through so that the client knows which areas changed.

The frames are not "diffed" except by the h264 encoder on the server side, which may or may not be using damage as input. The client has priority for optimization here.

> Their very nature makes the kinds of graphics demands of a 2D GUI very negligible.

An iPhone 16 Pro Max at 120 fps is sending 13.5 Gb/s to the display, and the internal memory requirements are much higher. This is expensive.
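Back-of-the-envelope for where a figure like that comes from (the panel resolution and 10-bit color depth are my assumptions):

  width, height = 1320, 2868       # iPhone 16 Pro Max panel, assumed
  fps, bits_per_pixel = 120, 30    # 10 bits per RGB channel, assumed
  print(width * height * fps * bits_per_pixel / 1e9)  # ~13.6 Gb/s of raw pixel data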

Not rendering a texture and being able to pass it off to scanout hardware so that the render units can stay off is the difference between a laptop giving you a ~5 hour battery life and a 15-20+ hour battery life.

The GPU could texture your socks off, but you're paying a tax every microsecond your GPU's render units are active, which matter when you're battery powered or thermally constrained. This is why display servers and GUI toolkits go through lengths to not render anything.


> Note that in the iPhone cases, the glass blur is mostly an internal widget rendered by the app, what is emitted to the display server/hardware is opaque.

This sounds wild to me, so I'm just going to ask. Do you work on these kind of optimisations for a modern OS? If so, just ignore my ponderings and I'll just accept what you're saying here.

I honestly couldn't imagine this kind of compositing not happening completely on the GPU or requiring any back and forth between the CPU and GPU. That is, the windowing system creates a display list, and that display list is dispatched to the GPU along with any assets it requires (icons, font etc.). I'd also imagine this is the same as how the browser renders.

As for optimisations, if the display list is the same for a particular render target (e.g., window, widget, subsection, or entire screen), there's no reason to rerender it. There's no reason to even rebuild the display list for an application that is asleep or backgrounded. Tile-based culling and selective update of the screen buffer^ can also happen at the GPU level. Though hierarchical culling at the CPU level would be trivial and low-cost.

This is not my wheelhouse, so perhaps I'm missing something crucial here.

^ Edit: It does look like the Apple silicon GPUs do use tile-based deferred rendering.

https://developer.apple.com/documentation/metal/tailor-your-...


> Do you work on these kind of optimisations for a modern OS

I have worked on modern display servers and application interfaces, and as such also dealt with (but not written) a fair amount of client application render code, generally optimizing for power consumption and latency.

> I honestly couldn't imagine this kind of compositing not happening completely on the GPU or requiring any back and forth between the CPU and GPU.

Well the CPU is always involved, and even responsible for rendering certain assets that are not viable to render on the GPU - fonts in particular are usually rendered by the CPU, with occasional attempts at GPU rendering being made such as servo's pathfinder - but for simplicity let's talk about only widgets rendered by the GPU.

In most cases[^1], a window renders to a single "render target" (texture/buffer) and hands off a flat, opaque buffer to the display server, which will either try to display it directly (direct scanout) or composite it with other windows. In this context, the display server's main purpose is to have exclusive control over the display hardware; apart from compositing multiple windows together, it is not involved in the render process.

The application itself when rendering will normally walk its widget tree, accumulate damage and cull widgets to create some form of a render list. Depending on the toolkit and graphics API in question, you'll ultimately end up submitting an amount of GPU work (e.g., a render pass) to render your window buffer (an IOSurface or DMA-BUF), and you then send the buffer to the display server one way or another. The window buffer will become ready later once the asynchronous render tasks complete, and the display server will wait on the relevant fences (on the CPU side, which is responsible for most GPU scheduling tasks) before starting any render task that would texture that buffer or before attempting scanout from the buffer.

The problem with blur is that you have a render task that depends on the full completion of all the prior render tasks, as its shader must read the output buffer[^2] after it has been rendered to an intermediate state. Additionally, other render steps depend on that render task, as they have to be overlaid on top of the blur widget, and only after those complete is the buffer ready for the display server. That's a pipeline stall, and because it's on top of the primary content it's holding up every frame from that app, and due to the blur operation itself an update that before only affected one tile now affects several.

Reading your own output is something you avoid like the plague, and blur is that. If you're used to web/network development, think of it like blocking network roundtrips.

... well this turned out to be a wall of text ...

---

^1: The more advanced case of hardware compositing is where you send a small number of buffers to the display server, e.g. a video buffer, an underlay with some background and an overlay with some controls, and have the display server configure hardware planes using those buffers such that the window is stitched together as the signal is about to be sent to the display. This is not the general case, as planes are limited in count and capability, they cannot perform any effects other than basic transforms and blending, and scanout hardware is very picky about what it can use as input.

^2: One could implement this instead by creating a different render list for just the area the blur needs to sample instead, in the hopes that this will render much faster and avoid waiting on completion of the primary buffer, but that would be an app specific optimization with a lot of limitations that may end up being much slower in many scenarios.


> and the display server will wait on the relevant fences (on the CPU side, which is responsible for most GPU scheduling tasks) before starting any render task that would texture that buffer or before attempting scanout from the buffer.

Given the fetching and compute performance per watt of modern GPUs, I'm still surprised that the power saving from reducing overdraw is anything but negligible, and certainly if you're talking about pipeline stalls, having pixel data shuttle over the bus between the GPU and CPU seems like a much bigger deal?

> ^2: One could implement this instead by creating a different render list for just the area the blur needs to sample instead, in the hopes that this will render much faster and avoid waiting on completion of the primary buffer, but that would be an app specific optimization with a lot of limitations that may end up being much slower in many scenarios.

It looks like Apple Silicon avoids the overdraw problem with TBDR, and the tile system would efficiently manage the dependency chain right back to the desktop background if needed. So if a browser is maximised over a bunch of other windows, only a portion of a portion of render targets are being sampled, with no intermediate CPU rendering.

To me, the flex by Apple here is that they can do this efficiently, because their rendering system is likely fully GPU-driven and resource-efficient in a way that other typical display servers and GPUs can't match. For this to work on Linux or Windows, a complete refactoring of the display servers would be required, and it would only benefit GPUs that do tile-based deferred rendering, which seems to be nil outside of Apple Silicon (and the older PowerVR chips Apple used before it).


>you never compute individual pixels because you never need to

Pixel shaders are looking at this laughing at you. PS_OUTPUT is a single pixel whether you want it or not. PS wavefronts are usually very small, so you're still going to be doing a lot of sampling.


This seems to be exclusive to Safari, I can't get it to work in Chrome either (and didn't know about the feature before right now, the discoverability is terrible).


It’s system-wide. Chrome explicitly chose not to support it.


It's that Safari implemented support for it. Firefox added their own version and chose to give it a different affordance.


> You keep using that word, I do not think it means what you think it means.

- Inigo Montoya

This is no affordance. There's nothing in the design of either browser that suggests you can obtain a link preview by those actions; you just have to be told what action to take beforehand.


What part of long-pressing a link is discoverable to you?


That's very true and that's what's segmenting the market, but I don't understand why you're saying the 5090 supports only a 12B model when it can go up to 50-60B (a bit less than 64B, to leave room for inference), since it supports FP4 as well.


It's for comparison using raw, non-optimized models. Both can do much better when you optimize for inference.

The information is in the ratio of these numbers, which stays the same.


Ok then just to clarify: you can fit 4x larger models on the Spark vs 5090, not 17x.


@nabla9 has tried to tell you that for the DGX Spark you can also use optimized models; this means the Spark can also be used for inference with bigger models, such as those exceeding 200B parameters.

Please compare the same things: carrots VS carrots, not apples VS eggs.


I don't understand what's not optimized on the 5090. If we were comparing with Apple chips or AMD Strix Halo, yes, you would have very different hardware + software support, no FP4, etc., but here everything is CUDA, Blackwell vs Blackwell, with the same FP4 structured sparsity, so I don't get how it would be honest to compare a quantized FP4 model on the Spark with an unoptimized FP16 model on a 5090?


To me, it sounds like they are saying the Spark can use an unoptimized FP16 model with 200B parameters. However, I don't really know.


You can't. The Spark has 128GB VRAM; the highest you can go in FP16 is 64B — and that's with no space for context.

200B is probably a rough estimate of Q4 + some space for context.

The Spark has 4x the VRAM of a 5090. That's all you need to know from a "how big can it go" perspective.


from the NVidia DGX Spark datasheet:

  With 128 GB of unified system memory, developers can experiment, fine-tune, or inference models of up to 200B parameters. Plus, NVIDIA ConnectX™ networking can connect two NVIDIA DGX Spark supercomputers to enable inference on models up to 405B parameters.


The datasheet isn't telling you the quantization (intentionally). Model weights at FP16 are roughly 2GB per billion params. A 200B model at FP16 would take 400GB just to load the weights; a single DGX Spark has 128GB. Even two networked together couldn't do it at FP16.

You can do it, if you quantize to FP4 — and Nvidia's special variant of FP4, NVFP4, isn't too bad (and it's optimized on Blackwell). Some models are even trained at FP4 these days, like the gpt-oss models. But gigabytes are gigabytes, and you can't squeeze 400GB of FP16 weights into only 128GB (or 256GB) of space.

The datasheet is telling you the truth: you can fit a 200B model. But it's not saying you can do that at FP16 — because you can't. You can only do it at FP4.
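The arithmetic spelled out (bytes per parameter are approximate and ignore FP4 block-scale overhead, KV cache, and activations):

  bytes_per_param = {"FP16": 2.0, "FP8": 1.0, "NVFP4": 0.5}
  params_billion = 200
  for fmt, b in bytes_per_param.items():
      print(fmt, params_billion * b, "GB")  # FP16: 400 GB, FP8: 200 GB, NVFP4: 100 GB
  # Only the ~100 GB NVFP4 weights fit in the Spark's 128 GB, with room left for context.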


I never claimed the 200B model was FP16.

If the 200B model was at FP16, marketing could've turned around and claimed the DGX Spark could handle a 400B model (with an 8-bit quant) or a 800B model at some 4-bit quant.

Why would marketing leave such low-hanging fruit on the tree?

They wouldn't.


You and nabla9 are the ones comparing apples and eggs. 4x more RAM means 4x larger models when everything else is held the same to make a fair comparison.


The economics don't make sense: each video is stored roughly once (+ replication etc., but let's say O(1)) yet viewed n times, so upscaling on the fly server-side is way too costly, and client-side upscaling currently isn't good enough.


Are you considering that the video needs to be stored for potentially decades?

Also shorts seem to be increasing exponentially... but Youtube viewership is not. So compute wouldn't need to increase as fast as storage.

I obviously don't know the numbers. Just saying that it could be a good reason why Youtube is doing this AI upscaling. I really don't see why otherwise. There's no improvement in image quality, quite the contrary.


> Then, once that is perfected, they will offer famous content creators the chance to sell their "image" to other creators, so less popular underpaid creators can record videos and change their appearance to those of famous ones, making each content creator a brand to be sold.

I'm frightened by how realistic this sounds.

