venkat_2811's comments

venkat_2811 · 2026-03-08T07:21:45 1772954505

AI agents failing silently, or just lying, is a big problem.

Steadwing and openalerts save a lot of headaches for sure!

Congrats on the launch!

venkat_2811 · 2026-01-18T19:27:22 1768764442

With so many improvements in LLM inference kernels, inter-GPU comms are becoming the bottleneck. Introducing my project YALI - Yet Another Low-Latency Implementation.

It is a custom CUDA kernel library that provides ultra-low-latency primitives for inter-GPU comms collectives. It achieves 80-85% Speed-of-Light (SoL) software efficiency on p2p all_reduce_sum over NVLink on 2x A100 GPUs.

It outperforms NVIDIA NCCL by 2.4x, with over 50x more stable tail latency.
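For anyone unfamiliar with the collective, here is a minimal host-side sketch of what a two-rank all_reduce_sum computes. It is plain C++, not CUDA and not YALI's code: the two vectors stand in for the two GPUs' buffers, and the function name `all_reduce_sum_2rank` is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustration only: rank0 and rank1 model the buffers held by two GPUs.
// After the call, both ranks hold the element-wise sum of the two inputs.
void all_reduce_sum_2rank(std::vector<float>& rank0, std::vector<float>& rank1) {
    assert(rank0.size() == rank1.size());
    for (std::size_t i = 0; i < rank0.size(); ++i) {
        // On real hardware each rank would read its peer's element over the
        // interconnect; here that read is just a second vector access.
        const float sum = rank0[i] + rank1[i];
        rank0[i] = sum;
        rank1[i] = sum;
    }
}
```

The semantics are trivial; the hard part on real GPUs is moving the peer's data with low and predictable latency.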

https://venkat-systems.bearblog.dev/yali-vs-nvidia-nccl/

venkat_2811 · 2026-01-16T15:26:21 1768577181

100% OSS, MIT License. YALI - Yet Another Low-Latency Implementation. It achieves 80-85% Speed-of-Light (SoL) software efficiency by using ultra-low-latency primitives for the p2p all_reduce_sum comms collective, a very important operation in multi-GPU LLM training and inference.

venkat_2811 · 2026-01-16T15:30:23 1768577423

Wisdom from CPU land translates well to GPUs. Static scheduling, prefetching, 3-stage double-buffering, pre-allocation, and memory ordering in a custom CUDA kernel help outperform NVIDIA NCCL. Experimental integration in vllm.rs shows ~20% prefill and ~10% decode latency improvements (TTFT and TPOT).
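To make the pattern concrete, here is a rough single-threaded C++ sketch of a statically scheduled, double-buffered reduce with prefetching and pre-allocated staging buffers. It is not the YALI kernel: the name `pipelined_reduce` and the chunking scheme are assumptions for illustration. Because everything runs sequentially on the host, the copies do not actually overlap with the compute, and memory ordering is not modeled at all.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Sums `local` and `peer` chunk by chunk into `out`, staging peer data through
// two pre-allocated buffers. The copy of chunk c + 1 is issued before chunk c
// is reduced, which is where an asynchronous copy would overlap with compute.
void pipelined_reduce(const std::vector<float>& local,
                      const std::vector<float>& peer,
                      std::vector<float>& out,
                      std::size_t chunk) {
    assert(chunk > 0 && local.size() == peer.size() && local.size() % chunk == 0);
    out.resize(local.size());
    const std::size_t n_chunks = local.size() / chunk;
    if (n_chunks == 0) return;

    // Pre-allocation: both staging buffers are sized once, before the loop.
    std::array<std::vector<float>, 2> stage{std::vector<float>(chunk),
                                            std::vector<float>(chunk)};

    // Prologue: fetch chunk 0 so the steady-state loop always has data ready.
    std::copy_n(peer.data(), chunk, stage[0].data());

    for (std::size_t c = 0; c < n_chunks; ++c) {
        // Static schedule: the slot is a pure function of the chunk index.
        const std::size_t cur = c % 2;
        const std::size_t nxt = (c + 1) % 2;

        // Stage 1 (prefetch): bring the next chunk into the idle buffer.
        if (c + 1 < n_chunks) {
            std::copy_n(peer.data() + (c + 1) * chunk, chunk, stage[nxt].data());
        }

        // Stages 2 and 3 (reduce, store): consume the current buffer.
        for (std::size_t i = 0; i < chunk; ++i) {
            out[c * chunk + i] = local[c * chunk + i] + stage[cur][i];
        }
    }
}
```

In a real kernel the prefetch would be an asynchronous load from the peer GPU, and the consumer would need proper memory ordering before reading the staged data; neither is shown here.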