Quantization and Fast Inference (MEAP) - How much performance are you actually getting from quantization in production? [D]
Hi all, Stjepan from Manning here. The mods said it's fine if I post this here. I wanted to share a new MEAP (early access) release we think will land well with people here: *Quantization and Fast Inference* by Kalyan Aranganathan: https://www.manning.com/books/quantization-and-fast-inference

A lot of ML deployment discussions still revolve around model quality first and infrastructure second. Then the bill shows up. Or latency becomes unacceptable. Or the model that worked fine on A100s suddenly needs to run somewhere much smaller.

This book focuses on the practical side of making models cheaper and faster without rebuilding them from scratch. It starts with quantization fundamentals and works its way through PTQ, QAT, runtime packaging, and deployment trade-offs that matter once you're dealing with production constraints rather than benchmarks.

What I liked about the manuscript is that it doesn't stop at "here's INT8." It gets into the annoying details people usually learn the hard way: activation outliers in LLMs, KV cache pressure, fake quantization workflows, straight-through estimators, and why some sub-8-bit formats behave very differently once you leave the paper and hit actual inference workloads.

There's also a solid balance between theory and implementation. The derivations are there if you care about the math, but the book keeps returning to operational questions like memory bandwidth, latency, and deployment cost.

Since this is a MEAP release, the book is still being developed chapter by chapter, and readers get access to the manuscript as it evolves. We've found that ML books especially benefit from that process, because readers often push authors toward clearer explanations and more relevant examples while the book is still in progress.

We've got 5 free ebook copies for the first 5 people who comment with their experience using quantization in production or research.
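For anyone who hasn't touched quantization before, here's roughly what the fundamentals the book opens with look like in code. This is a minimal hypothetical sketch of symmetric INT8 quantization, not an excerpt from the book:

```python
# Symmetric INT8 post-training quantization (PTQ) in miniature:
# pick one scale from the max absolute value, round floats into
# [-127, 127], then dequantize to see the rounding error.

def quantize_int8(values):
    """Map floats to INT8 with a single symmetric scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map the integers back to floats; the gap vs. the originals
    is the quantization error the book spends chapters managing."""
    return [qi * scale for qi in q]

weights = [0.42, -1.27, 0.003, 0.9, -0.5]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# "Fake quantization" (used in QAT) is exactly this round trip kept
# in float: the forward pass sees `restored`, while the
# straight-through estimator backpropagates as if rounding were
# the identity function.
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(q, round(max_err, 4))
```

The pain points the book digs into start where this toy breaks down: one activation outlier inflates `max_abs`, the scale blows up, and everything else collapses toward zero.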
Success stories, failed experiments, weird edge cases: all fair game.

If you'd rather grab it directly, we also put together a 50% discount code for the subreddit: MLKALYANARANGAN50RE

Curious what people here think the current pain point is with quantization workflows. Accuracy collapse? Tooling fragmentation? Hardware-specific behavior? Something else entirely?

I'll stick around for discussion, and I'm happy to bring the author in for questions if there's interest.

Cheers, Stjepan