Nice! I peeked at the code and thought I’d share a few tips for improving the low frame rate:
Base64 encoding the JPEG bytes inflates the payload by roughly a third (~33%) and burns CPU cycles on both the client and the server. It's unnecessary: the WebSocket protocol can carry binary payloads, so messages don't need to be text.
Consider dropping the lossy JPEG compression as well, i.e. just send the raw RGB bytes over the network. On the server side you can then simply call Image.frombuffer(…) — see the sketch below.
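Roughly what I mean (an untested server-side sketch; the `websockets` library and the 640×480 frame size are just assumptions to make it concrete — the client would send the raw RGB buffer as a single binary WebSocket message):

```python
# Sketch: receive raw RGB frames as binary WebSocket messages and rebuild a
# PIL image directly, with no base64 decode and no JPEG decode.
import asyncio
import websockets
from PIL import Image

WIDTH, HEIGHT = 640, 480  # assumed capture resolution; must match the client

async def handle(ws, path=None):
    async for payload in ws:          # binary messages arrive as `bytes`
        if isinstance(payload, bytes):
            # Reinterpret the raw buffer as an image (length must be W*H*3).
            frame = Image.frombuffer("RGB", (WIDTH, HEIGHT), payload)
            # ... hand `frame` to the diffusion pipeline ...

async def main():
    async with websockets.serve(handle, "0.0.0.0", 8765):
        await asyncio.Future()  # run forever

asyncio.run(main())
```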
StreamDiffusion can achieve high frame rates because of extensive batching in its pipeline. You're not benefiting from that here, as the client only sends one frame at a time and then waits for a response. See this example for an idea of how to queue input frames and consume them in batches https://github.com/cumulo-autumn/StreamDiffusion/blob/main/e... .
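The gist of the queue-and-batch idea, sketched generically rather than copied from that example (the `pipeline` call and BATCH_SIZE are placeholders for whatever batched img2img callable and batch size you end up using):

```python
# Producer/consumer sketch: accept frames continuously, run them in batches.
import asyncio

BATCH_SIZE = 4
frame_queue: asyncio.Queue = asyncio.Queue(maxsize=16)

async def receive_frames(ws):
    # Producer: keep accepting frames without waiting for results.
    async for payload in ws:
        if frame_queue.full():
            frame_queue.get_nowait()  # drop the oldest frame rather than stall
        frame_queue.put_nowait(payload)

async def consume_frames(pipeline, on_result):
    # Consumer: drain whatever has accumulated (up to BATCH_SIZE) and run it
    # through the pipeline as a single batch.
    while True:
        batch = [await frame_queue.get()]
        while len(batch) < BATCH_SIZE and not frame_queue.empty():
            batch.append(frame_queue.get_nowait())
        outputs = pipeline(batch)        # hypothetical batched img2img call
        for out in outputs:
            await on_result(out)
```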
Alternatively you could take a look at the SDXL Turbo and Lightning models. They are very fast at img2img but are limited to 512² or 1024² pixels respectively, which is a bit lower than the resolution you appear to be aiming for here; still, they can run locally in real time on a high-end consumer-grade GPU. For reference, I have some code demonstrating this here: https://github.com/GradientSurfer/Draw2Img/tree/main
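For concreteness, the SDXL Turbo route with the diffusers library looks roughly like this (the input path is a placeholder and the step/strength values are starting points from the diffusers docs, not tuned for your setup):

```python
# Minimal SDXL Turbo img2img sketch using diffusers; assumes a CUDA GPU.
# For Turbo img2img, num_inference_steps * strength should be >= 1,
# hence steps=2 and strength=0.5 here.
import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image

pipe = AutoPipelineForImage2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

init_image = load_image("camera_frame.png").resize((512, 512))  # placeholder input

result = pipe(
    prompt="an oil painting portrait",
    image=init_image,
    num_inference_steps=2,
    strength=0.5,
    guidance_scale=0.0,   # Turbo is trained without classifier-free guidance
).images[0]

result.save("painted_frame.png")
```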
Ok, but I wonder if it really needs to be real time like this? Wouldn't it make more sense to have some kind of button: somebody makes a pose, takes a picture, the picture is run through some kind of transformation and comes back as a painting that stays there until someone takes another picture? Wouldn't the illusion of art be better that way? (It would not be a "mirror" anymore though.)
IMO, embrace the low frame rate. Better energy use if it's AI-based in the first place (cheaper, kinder to the environment... more cycles for fancier effects). THEN pair that with an e-ink display (also good energy use). I'd go for an old-school charcoal style, but color e-ink is a thing. E-ink with some additional effects could resemble physical paper more than an LCD/LED would, and so would be less obnoxious in the darkness of night... AND this could also double as a low-key security camera.
yeah yeah yeah, do all these things, and afterwards look at 2D interpolation methods that don't require AI for your in-betweens. There's some really fast kernel math that can lerp from one blob to another at 8 billion fps.
I think you’re getting downvoted because “yeah yeah yeah” is normally a sign that someone is sarcastically dismissing an idea, but the rest of your comment suggests you’re not at all - lerp is a great idea!
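For anyone curious, a rough sketch of what that cheap interpolation could look like (plain numpy cross-fade between consecutive generated frames, assuming uint8 RGB arrays of the same shape; not true motion interpolation, just blending):

```python
# Linearly blend between the last two generated frames to synthesize
# in-between frames on the CPU, no model involved.
import numpy as np

def lerp_frames(prev: np.ndarray, curr: np.ndarray, steps: int):
    """Yield `steps` intermediate frames blending prev -> curr."""
    prev_f = prev.astype(np.float32)
    curr_f = curr.astype(np.float32)
    for i in range(1, steps + 1):
        t = i / (steps + 1)
        blended = (1.0 - t) * prev_f + t * curr_f
        yield blended.astype(np.uint8)

# e.g. if the model produces ~2 fps, three in-betweens per pair gives ~8 fps
# of on-screen motion, at the cost of a visible cross-fade between poses.
```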