I compiled LLM inference pricing across 7 providers — the caching numbers are surprising (spreadsheet included) [R]
Our take
The recent Reddit post detailing a spreadsheet that compares LLM inference pricing across seven providers (OpenRouter, Together AI, Fireworks, Groq, DeepSeek, and others) highlights how complex the AI landscape has become. It’s a welcome effort to bring order to what can feel like a chaotic marketplace, and it reinforces the point that cost-optimization strategies need to evolve alongside rapid advances in model capabilities. As we’ve explored in pieces like DeepSWE: new benchmark looking at how well today's frontier models can actually write code, assessing a model isn't solely about headline benchmarks; it's about understanding the full cost picture, especially as real-world applications become more sophisticated. The author’s emphasis on caching policies is particularly astute. Caching is often overlooked, yet it can dramatically change overall expenses for applications that repeatedly send the same context, such as agents, RAG pipelines, and multi-turn conversational interfaces.
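To put rough numbers on why caching matters for agent loops and chat interfaces, here is a minimal sketch. It assumes each turn re-sends the full conversation history, that everything sent on earlier turns is served from the provider's prefix cache, and that only newly appended tokens are billed at the uncached rate. The prices and the function name `conversation_input_cost` are hypothetical placeholders, not figures from the spreadsheet, and real providers differ in cache granularity, minimum prefix length, and cache lifetime.

```python
# Hypothetical per-million-token input prices (placeholders, not real quotes).
MISS_PRICE_PER_M = 0.50   # uncached input
HIT_PRICE_PER_M = 0.05    # cached input (assumed 10x cheaper)


def conversation_input_cost(turn_tokens, miss_price, hit_price, caching=True):
    """Input cost in dollars of a multi-turn session that re-sends the whole history.

    turn_tokens[i] is the number of new tokens appended on turn i
    (user message plus previous reply, simplified to one number).
    Assumes every previously sent token is a cache hit when caching=True.
    """
    total = 0.0
    history = 0  # tokens already sent on earlier turns
    for new in turn_tokens:
        if caching:
            # Old prefix billed at the cached rate, new tokens at the uncached rate.
            cost = history * hit_price + new * miss_price
        else:
            # Without caching, the whole prompt is billed at the uncached rate.
            cost = (history + new) * miss_price
        total += cost / 1_000_000  # prices are per million tokens
        history += new
    return total


if __name__ == "__main__":
    turns = [2000] * 20  # 20 turns, 2,000 new tokens each
    with_cache = conversation_input_cost(turns, MISS_PRICE_PER_M, HIT_PRICE_PER_M)
    without = conversation_input_cost(turns, MISS_PRICE_PER_M, HIT_PRICE_PER_M, caching=False)
    print(f"with caching:    ${with_cache:.4f}")
    print(f"without caching: ${without:.4f}")
```

With these placeholder numbers the session costs $0.0390 with caching versus $0.2100 without. Because the history grows every turn, about 90% of the 420,000 input tokens in this example are re-sent prefix, so the cache-hit price, not the headline price, dominates the bill as sessions get longer.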
The spreadsheet’s focus on caching, and the surprisingly large variance in cached input pricing, underscores a key shift in how we should evaluate LLM providers. Previously, the focus was almost exclusively on per-token cost, but this analysis shows that what matters is the *total* cost of a workflow, factoring in how much of the input is served from cache. The observation that model availability and context windows aren’t always consistent across providers further complicates the selection process. This resonates with broader discussions around data architecture and reactivity, where consistent data access and transformations, as discussed in Article: Beyond CLEAN and MVP: Architecting an Offline-first Reactive Data Layer in Android, are crucial for operational efficiency. That this information is so hard to find in one place reinforces the need for tools and resources that aggregate and analyze it, so users can make informed decisions.
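One way to operationalize "total cost of a workflow" is to blend the cached and uncached input prices using an assumed cache-hit rate, then rank providers on the result instead of on the headline (cache-miss) price. This is a minimal sketch: the `Pricing` record, the provider names, and all prices are hypothetical placeholders rather than data from the spreadsheet, and the model ignores any cache-write fees or cache-expiry effects a real provider might have.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Per-million-token prices for one provider/model pair (hypothetical values)."""
    name: str
    input_miss: float  # uncached input
    input_hit: float   # cached input
    output: float


def workflow_cost(p: Pricing, input_tokens: int, output_tokens: int, hit_rate: float) -> float:
    """Dollar cost of a workload, assuming hit_rate of input tokens are cache hits."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    blended_input = (1 - hit_rate) * p.input_miss + hit_rate * p.input_hit
    return (input_tokens * blended_input + output_tokens * p.output) / 1_000_000


def rank(providers, input_tokens, output_tokens, hit_rate):
    """Return (name, cost) pairs sorted from cheapest to most expensive."""
    costs = [(p.name, workflow_cost(p, input_tokens, output_tokens, hit_rate)) for p in providers]
    return sorted(costs, key=lambda item: item[1])


def break_even_hit_rate(a: Pricing, b: Pricing):
    """Hit rate in [0, 1] where a and b have equal blended *input* price, else None.

    Ignores output pricing, so it decides the overall ranking only when
    output prices are equal.
    """
    da = a.input_miss - a.input_hit  # drop in a's input price per unit of hit rate
    db = b.input_miss - b.input_hit
    if da == db:
        return None  # parallel price lines never cross
    h = (b.input_miss - a.input_miss) / (db - da)
    return h if 0.0 <= h <= 1.0 else None


if __name__ == "__main__":
    a = Pricing("provider_a", input_miss=0.40, input_hit=0.20, output=1.60)  # cheap headline, shallow cache discount
    b = Pricing("provider_b", input_miss=0.60, input_hit=0.03, output=1.60)  # pricier headline, deep cache discount
    for hit_rate in (0.0, 0.5, 0.9):
        print(hit_rate, rank([a, b], 50_000_000, 2_000_000, hit_rate))
    print("break-even hit rate:", break_even_hit_rate(a, b))
```

With these placeholder prices, `provider_a` is cheaper at 0% and 50% hit rates, but `provider_b` wins at 90% (about $7.55 versus $14.20), and the ranking flips near a 54% hit rate. That is the kind of reversal the spreadsheet points at: the headline price only wins when little of the input is reused.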
This isn't merely a matter of saving a few dollars on inference; it’s a signal of a maturing AI ecosystem. Early adopters often focused on the novelty and potential of LLMs, but as these models are integrated into production systems, operational efficiency and cost management become paramount. The author’s acknowledgement of missing data points (real throughput, cold-start times, quantization details, and network costs) is a clear roadmap for future analysis. The lack of standardized reporting across providers makes direct comparison difficult and highlights the need for greater transparency and consistency in pricing models. While the conversation around understanding language model behavior, as presented in Presentation: Rules for Understanding Language Models, often centers on model outputs, this spreadsheet shines a light on the infrastructure required to use these models effectively.
Ultimately, this spreadsheet serves as a valuable starting point for a more nuanced understanding of LLM pricing and performance. It's a practical demonstration of how a little data aggregation and analysis can reveal significant insights. As LLMs become increasingly ubiquitous, the ability to optimize inference costs will be a critical differentiator, separating those who can sustainably leverage AI from those who are simply experimenting with it. The question now is: will other data scientists and engineers build upon this foundation, creating more robust and automated tools for evaluating and comparing LLM providers, or will this remain a largely manual, spreadsheet-driven process?
I've been comparing GPU/LLM providers for a side project and ended up with way too many browser tabs and spreadsheets, so I decided to pull the public pricing data into one sheet and compare it side by side. A quick disclaimer: this is not benchmark data. I didn't run latency tests or throughput measurements. Everything comes from public pricing pages and APIs (OpenRouter, DeepSeek, Together AI, Fireworks, Groq, etc.). The spreadsheet currently tracks:
The thing that surprised me most was caching. For example, when looking at DeepSeek V4 Pro pricing across providers, cached input costs vary dramatically. In some cases a cache hit is tens of times cheaper than a cache miss. That made me realize that if you're running:
...the "headline" token price can be a lot less important than the caching policy. A few other interesting things I noticed:
A few things I haven't figured out how to compare yet:
I'm curious how others evaluate providers. When you're choosing between OpenRouter, Together, Fireworks, Groq, DeepSeek, etc., what metrics actually matter to you beyond token pricing? Am I missing any important data points that should be included in a v2?