
torch-nvenc-compress: GPU NVENC silicon as a PCIe bandwidth multiplier — PCA + pure-ctypes Video Codec SDK wrapper. Parallel-path overlap measured at 67% of theoretical max on a real GEMM + encode workload. [P]


I've been working on the consumer-multi-GPU PCIe bottleneck — Nvidia removed NVLink from the 4090/5090, and splitting a 70B model across two consumer cards drops you to ~30 GB/s over PCIe peer-to-peer.

Spent the last few months building a Python library that uses the GPU's otherwise-idle NVENC/NVDEC silicon to compress activations and KV cache on the fly, then ships the small bitstream across the same wire.

Repo: https://github.com/shootthesound/torch-nvenc-compress (Apache 2.0)

Prior art (this isn't novel as an idea)

  • LLM.265 — "Video Codecs are Secretly Tensor Codecs" (late 2025). The closest direct precedent: same insight applied to LLM weights, activations, KV cache.
  • KVFetcher (April 2026). KV compression for remote prefix fetching.
  • CodecFlow (April 2026). Codec motion-vector metadata for KV refresh during prefill.

The "video codec on tensors" idea was already in the literature when I started. What's added in this work:

  1. PCA + rank-truncation as preprocessing. Activations and KV in their standard basis are noise-like (~4× compression floor, basically the Gaussian-noise limit). The PCA basis reveals a heavy-tailed channel covariance that the codec can actually exploit. The basis is per-layer, computed offline, ships with the model LoRA-style (~32 MB for FLUX.2 Klein 9B's 8 double-blocks at K=500).
  2. Parallel-path / dual-lane architectural reframe. NVENC and NVDEC are physically separate hardware units from the SM cluster and the PCIe controller. With CUDA-stream pipelining, the codec time hides behind compute and transfer of other tensors. Compression ratio becomes effective-bandwidth multiplier rather than just a smaller payload.
  3. Pure-ctypes Direct Video Codec SDK wrapper (DirectBackend) — kills the FFmpeg subprocess overhead. Zero-copy from torch CUDA tensors, 8-deep async output ring per NVENC engine, optional CUDA stream binding via nvEncSetIOCudaStreams, MultiEngineDirectBackend across all 3 NVENC engines on the 5090.
  4. Three documented null findings — sparse residual, AV1 NVENC on Blackwell, channel reordering. So nobody else has to rerun the dead ends.
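The PCA + rank-truncation step (point 1) can be illustrated with a minimal numpy sketch. This is not the repo's API, just the underlying idea: compute a per-layer orthonormal basis offline from captured activations, then project into it at runtime so the codec sees the heavy-tailed channel spectrum instead of noise. Function names and the synthetic data are mine, and torch/CUDA specifics are omitted.

```python
import numpy as np

def build_pca_basis(acts: np.ndarray, k: int) -> np.ndarray:
    """Offline step: top-k PCA basis for one layer.

    acts: (n_samples, n_channels) captured activations.
    Returns an (n_channels, k) orthonormal basis spanning the
    dominant channel-covariance directions.
    """
    centered = acts - acts.mean(axis=0, keepdims=True)
    # SVD of the data matrix yields the covariance eigenvectors
    # without forming the covariance explicitly.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T

def rotate(acts, basis):
    """Online step: project into the PCA basis (rank truncation)."""
    return acts @ basis

def unrotate(coeffs, basis):
    """Decoder side: map truncated coefficients back."""
    return coeffs @ basis.T

rng = np.random.default_rng(0)
# Synthetic heavy-tailed channel spectrum: a few dominant directions.
scales = 1.0 / (1.0 + np.arange(64)) ** 1.5
acts = rng.normal(size=(4096, 64)) * scales
basis = build_pca_basis(acts, k=8)
recon = unrotate(rotate(acts, basis), basis)
rel_err = np.linalg.norm(acts - recon) / np.linalg.norm(acts)
print(f"rank-8 relative reconstruction error: {rel_err:.3f}")
```

With a heavy-tailed spectrum like this, 8 of 64 components retain nearly all the energy, which is the property the codec exploits after rotation.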

Measured results (RTX 5090, real workloads)

  • Compression ratios: 6.1× lossless on diffusion (FLUX.2 Klein 9B mid-block), 2.7× lossless on LLM KV cache (Mistral 7B v0.3). LOO-validated across 1,735 diffusion captures and 6 LLM prompts. (FLUX.2 Klein 9B was the internal research target; the public PoC repo uses FLUX.1-schnell since it's Apache 2.0 and freely downloadable. Numbers reproduce qualitatively on schnell — heavy-tailed PCA spectrum, similar Pareto.)
  • Codec speed: DirectBackend 0.243 ms/frame encode, 0.435 ms/frame decode at 256×256 YUV444 QP=18 on real PCA-rotated FLUX activations. MultiEngineDirectBackend across the 5090's 3 NVENC engines: 0.180 ms/frame encode, 0.262 ms/frame decode. ~7.9× over an FFmpeg subprocess baseline.
  • Parallel-path overlap empirically measured: 30×4096² fp16 GEMM on CUDA stream A + 64-frame DirectBackend encode on stream B (encoder bound to stream B via nvEncSetIOCudaStreams). Serialized wall-clock 40.1 ms; parallel wall-clock 26.0 ms; theoretical max overlap floor 20.9 ms. 1.34× speedup over serialized = 67% of theoretical max overlap realized. This is the load-bearing measurement for the architectural claim that NVENC silicon runs concurrently with SM compute.
  • Slow-wire wins, end-to-end: measured 3.13× wall-clock speedup at 100 Mbps residential broadband, 5.29× at 50 Mbps (real codec round-trip + simulated wire). 1.69× dual-lane on simulated 1 Gbit ethernet.
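The slow-wire results follow from a simple pipelining model: once encode/decode hide behind the transfer of the previous chunk, steady-state cost is the max of the stages rather than their sum, so on a wire-bound link the speedup approaches the compression ratio. A back-of-envelope sketch with illustrative numbers (payload size and frame count are mine; the per-frame codec times are the DirectBackend figures above):

```python
def transfer_time_ms(nbytes: int, gbps: float) -> float:
    """Wall-clock (ms) to move nbytes over a gbps gigabit/s link."""
    return nbytes * 8 / (gbps * 1e9) * 1e3

def dual_lane_ms(nbytes, ratio, link_gbps, encode_ms, decode_ms):
    """Pipelined dual-lane: NVENC/NVDEC work overlaps the wire, so
    the steady-state cost is the slowest stage, not the sum."""
    wire = transfer_time_ms(nbytes / ratio, link_gbps)
    return max(wire, encode_ms, decode_ms)

payload = 32 * 1024 * 1024                    # 32 MiB activation tensor
baseline = transfer_time_ms(payload, 1.0)     # raw over 1 Gbit
piped = dual_lane_ms(payload, ratio=6.1, link_gbps=1.0,
                     encode_ms=0.243 * 64,    # 64 frames at 0.243 ms
                     decode_ms=0.435 * 64)
print(f"raw: {baseline:.1f} ms, dual-lane: {piped:.1f} ms, "
      f"speedup: {baseline / piped:.2f}x")
```

On a 1 Gbit wire the transfer stage still dominates, so the model predicts a speedup close to the 6.1× ratio; on faster links the codec stages become the bottleneck and the multiplier shrinks, consistent with the slow-wire wins being the largest.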

What is not measured end-to-end (projections from the above)

Multi-GPU PCIe peer-to-peer activation transfer recovering ~180 GB/s effective bandwidth — codec primitive is ready and benchmarked, but the cross-GPU PCIe peer-to-peer wiring is pending. (This is where I need community help, as my validation rig only has one desktop GPU and you need two on the same motherboard to test this).

Real two-machine ethernet split-model inference — wire-simulation PoC measures real codec time + simulated wire, but isn't a true two-machine deployment yet. (I have a 4090 laptop incoming next week to physically validate this networked leg).

Long-context KV-spill end-to-end tok/s on a real model decode loop — compression ratio is measured, but the actual N tok/s → 3N tok/s benchmark on e.g. 32B + 64K context isn't in the repo yet. The math implies it; the benchmark hasn't been written.
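The shape of that missing KV-spill benchmark is roughly the following: keep a hot window of KV blocks on-GPU and evict older blocks through the codec to a spill store. Everything here is hypothetical scaffolding (`StubCodec` is an identity stand-in for the NVENC round-trip, not the repo's interface), shown only to make the intended integration point concrete.

```python
from collections import deque

class StubCodec:
    """Stand-in for the NVENC round-trip; identity transform here.
    A real codec would shrink the block ~2.7x (measured KV ratio)."""
    def compress(self, block):
        return block

    def decompress(self, blob):
        return blob

def spill_decode_loop(n_steps, window, codec, kv_store):
    """Sketch: sliding window of hot KV blocks on-GPU; evicted
    blocks go through the codec into host/remote spill storage."""
    hot = deque(maxlen=window)  # on-GPU KV blocks
    for step in range(n_steps):
        if len(hot) == window:
            # Evict the oldest block through the codec before the
            # deque drops it on the next append.
            kv_store.append(codec.compress(hot[0]))
        hot.append(f"kv[{step}]")
    return hot, kv_store

hot, spilled = spill_decode_loop(n_steps=10, window=4,
                                 codec=StubCodec(), kv_store=[])
print(len(hot), len(spilled))
```

A real benchmark would replace the stub with DirectBackend and measure tok/s with and without the spill path on a long-context decode.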

Where I'd value help

  • Anyone with a dual-4090 / dual-5090 / two-machine-with-PCIe-P2P rig who'd want to run the cross-GPU peer-to-peer benchmark when I write it. Would shrink the "75%" gap meaningfully.
  • Anyone running long-context KV-spill workloads who'd want to wire DirectBackend into their decode loop for the end-to-end tok/s measurement. I'd write the integration with you.
  • Cross-vendor coverage — AMD VCN and Intel QSV/Arc paths are completely open. Same architectural claim, different SDK surface.

What's in the repo

19 numbered runnable PoCs, every measured number reproducible. Honest status table at the top of the README. PCA basis builder + per-channel quantize + YUV pack/unpack + codec wrappers are all separable, so you can swap pieces.
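To show how separable the pieces are, here is a minimal sketch of what a per-channel quantize step looks like in isolation: mapping float activations to uint8 with per-channel scale/offset so every plane handed to the codec spans the full 0-255 range. Again a hypothetical illustration in numpy, not the repo's implementation.

```python
import numpy as np

def quantize_per_channel(x: np.ndarray):
    """float -> uint8 per channel, with per-channel scale/offset
    so each plane uses the full 0-255 range."""
    lo = x.min(axis=0, keepdims=True)
    hi = x.max(axis=0, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / 255.0, 1.0)
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, lo, scale

def dequantize(q, lo, scale):
    """Inverse map back to float; error is bounded by scale/2."""
    return q.astype(np.float32) * scale + lo

rng = np.random.default_rng(1)
x = rng.normal(size=(1024, 16)).astype(np.float32)
q, lo, scale = quantize_per_channel(x)
x_hat = dequantize(q, lo, scale)
max_err = np.abs(x - x_hat).max()
print(f"uint8 round-trip max abs error: {max_err:.4f}")
```

Because each stage has this kind of narrow array-in/array-out contract, swapping the quantizer, the packer, or the codec backend independently is straightforward.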

Built solo around full-time caregiving — technical feedback, criticism, or pointers to related work I missed are genuinely appreciated.

submitted by /u/shootthesound

Tagged with

#GPU#NVENC#PCIe#compression#PCA#activations#KV cache#CUDA#tensor