Quantization and Fast Inference (MEAP) - How much performance are you actually getting from quantization in production? [D]
Hi all, Stjepan from Manning here. The mods said it's fine if I post this here. I wanted to share a new MEAP (early access) release we think will land well with people here: *Quantization and Fast Inference* by Kalyan Aranganathan: https://www.manning.com/books/quantization-and-fast-inference

A lot of ML deployment discussions still revolve around model quality first and infrastructure second. Then the bill shows up. Or latency becomes unacceptable. Or the model that worked fine on A100s suddenly needs to run somewhere much smaller.

This book focuses on the practical side of making models cheaper and faster without rebuilding them from scratch. It starts with quantization fundamentals and works its way through PTQ, QAT, runtime packaging, and deployment trade-offs that matter once you're dealing with production constraints rather than benchmarks.

What I liked about the manuscript is that it doesn't stop at "here's INT8." It gets into the annoying details people usually learn the hard way: activation outliers in LLMs, KV cache pressure, fake quantization workflows, straight-through estimators, and why some sub-8-bit formats behave very differently once you leave the paper and hit actual inference workloads.

There's also a solid balance between theory and implementation. The derivations are there if you care about the math, but the book keeps returning to operational questions like memory bandwidth, latency, and deployment cost.

Since this is a MEAP release, the book is still being developed chapter by chapter, and readers get access to the manuscript as it evolves. We've found that ML books especially benefit from that process, because readers often push authors toward clearer explanations and more relevant examples while the book is still in progress.

We've got 5 free ebook copies for the first 5 people who comment with their experience using quantization in production or research.
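For anyone who hasn't touched quantization before, here's roughly what the fundamentals the book opens with look like in code. This is a minimal hypothetical sketch of symmetric INT8 quantization, not an excerpt from the book:

```python
# Symmetric INT8 post-training quantization (PTQ) in miniature:
# pick one scale from the max absolute value, round floats into
# [-127, 127], then dequantize to see the rounding error.

def quantize_int8(values):
    """Map floats to INT8 with a single symmetric scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map the integers back to floats; the gap vs. the originals
    is the quantization error the book spends chapters managing."""
    return [qi * scale for qi in q]

weights = [0.42, -1.27, 0.003, 0.9, -0.5]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# "Fake quantization" (used in QAT) is exactly this round trip kept
# in float: the forward pass sees `restored`, while the
# straight-through estimator backpropagates as if rounding were
# the identity function.
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(q, round(max_err, 4))
```

The pain points the book digs into start where this toy breaks down: one activation outlier inflates `max_abs`, the scale blows up, and everything else collapses toward zero.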
Success stories, failed experiments, weird edge cases: all fair game.

If you'd rather grab it directly, we also put together a 50% discount code for the subreddit: MLKALYANARANGAN50RE

Curious what people here think the current pain point is with quantization workflows. Accuracy collapse? Tooling fragmentation? Hardware-specific behavior? Something else entirely?

I'll stick around for discussion, and I'm happy to bring the author in for questions if there's interest.

Cheers, Stjepan