From Machine Learning

I compiled LLM inference pricing across 7 providers: the caching numbers are surprising (spreadsheet included) [R]

Our take

LLM inference pricing is harder to compare than headline rates suggest. The post's author compiled a spreadsheet covering seven providers, including OpenRouter, Together AI, Fireworks, Groq, and DeepSeek, and found large variation in cached input costs: in some cases a cache hit is tens of times cheaper than a cache miss. The analysis draws on public pricing pages and APIs rather than latency or throughput benchmarks, and it suggests that caching policy can matter more than headline token pricing for workloads such as agents and RAG pipelines. The sheet compares input/output token pricing, context windows, cached input pricing, and supported models.
The recent Reddit post shares a spreadsheet comparing LLM inference pricing across seven providers, including OpenRouter, Together AI, Fireworks, Groq, and DeepSeek. It is a welcome effort to bring order to a marketplace where pricing is scattered across many pages, and it reinforces the point that cost optimization strategies need to evolve alongside model capabilities. As we've explored in pieces like DeepSWE: new benchmark looking at how well today's frontier models can actually write code, assessing a model isn't solely about headline benchmarks; it also means understanding the full cost picture as real-world applications grow more sophisticated. The author's emphasis on caching policies is particularly astute: they are often overlooked, yet they can dramatically change overall expenses for agents, RAG pipelines, and multi-turn conversational interfaces.

The spreadsheet’s focus on caching, and the surprisingly large variance in cached input pricing, underscores a key shift in how we should evaluate LLM providers. Previously, the focus was almost exclusively on the per-token cost, but this analysis demonstrates that it’s the *total* cost of a workflow, factoring in caching efficiency, that truly matters. The observation that model availability and context windows aren’t always consistent across providers further complicates the selection process. This resonates with broader discussions around data architecture and reactivity, where maintaining consistent data access and transformations—as discussed in Article: Beyond CLEAN and MVP: Architecting an Offline-first Reactive Data Layer in Android—is crucial for operational efficiency. The difficulty in finding this information centralized reinforces the need for tools and resources that aggregate and analyze this data, empowering users to make informed decisions.

This isn't merely a matter of saving a few dollars on inference costs; it’s a signal of the maturing AI ecosystem. Early adopters often focused on the novelty and potential of LLMs, but now, as these models become integrated into production systems, operational efficiency and cost management are paramount. The author's acknowledgement of missing data points – real throughput, cold-start times, quantization details, and network costs – is a clear roadmap for future analysis. The lack of standardized reporting across providers makes direct comparison challenging, highlighting a need for greater transparency and consistency in pricing models. While the conversation around understanding language model behavior, as presented in Presentation: Rules for Understanding Language Models, often centers on model outputs, this spreadsheet shines a light on the critical infrastructural underpinnings required to effectively leverage these models.

Ultimately, this spreadsheet serves as a valuable starting point for a more nuanced understanding of LLM pricing and performance. It's a practical demonstration of how a little data aggregation and analysis can reveal significant insights. As LLMs become increasingly ubiquitous, the ability to optimize inference costs will be a critical differentiator, separating those who can sustainably leverage AI from those who are simply experimenting with it. The question now is: will other data scientists and engineers build upon this foundation, creating more robust and automated tools for evaluating and comparing LLM providers, or will this remain a largely manual, spreadsheet-driven process?

I've been comparing GPU/LLM providers for a side project and ended up with way too many browser tabs and spreadsheets.

So I decided to pull the public pricing data into one sheet and compare it side by side.

A quick disclaimer: this is not benchmark data. I didn't run latency tests or throughput measurements. Everything comes from public pricing pages and APIs (OpenRouter, DeepSeek, Together AI, Fireworks, Groq, etc.).

The spreadsheet currently tracks:

  • Input/output token pricing
  • Context windows
  • Cached input pricing (where available)
  • Supported models
  • Provider-specific pricing differences

The thing that surprised me most was caching.

For example, when looking at DeepSeek V4 Pro pricing across providers, cached input costs vary dramatically. In some cases a cache hit is tens of times cheaper than a cache miss.

That made me realize that if you're running:

  • Agents with large system prompts
  • RAG pipelines with reusable context
  • Multi-turn conversations
  • Repeated prompt templates

...the "headline" token price can be a lot less important than the caching policy.
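To make that concrete, here is a minimal sketch of how a cache hit rate changes the blended input price. The two providers and all prices below are made up for illustration (they are not taken from the spreadsheet or from any pricing page), and `effective_input_price` is a hypothetical helper name.

```python
def effective_input_price(miss_price, hit_price, hit_rate):
    """Blended USD price per million input tokens for a given cache hit rate (0 to 1)."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_price + (1.0 - hit_rate) * miss_price


# Hypothetical prices in USD per million input tokens (illustrative only).
provider_a = {"miss": 1.00, "hit": 0.05}  # higher headline price, cache hit 20x cheaper
provider_b = {"miss": 0.80, "hit": 0.80}  # lower headline price, no cache discount

for rate in (0.0, 0.5, 0.9):
    a = effective_input_price(provider_a["miss"], provider_a["hit"], rate)
    b = effective_input_price(provider_b["miss"], provider_b["hit"], rate)
    cheaper = "A" if a < b else "B"
    print(f"hit rate {rate:.0%}: A=${a:.3f}/M  B=${b:.3f}/M  cheaper: {cheaper}")
```

With these invented numbers, provider A is more expensive when nothing is cached, but at a 90% hit rate its blended input price drops to roughly $0.145 per million tokens while provider B stays at $0.80.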

A few other interesting things I noticed:

  • The same model can cost several times more on one provider than on another.
  • Some providers expose caching clearly, while others barely document it.
  • Model availability and context windows aren't always consistent across providers.
  • It's surprisingly hard to find all of this information in one place.
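One way to compare the same model across providers is to price a whole workload instead of a single token rate. The sketch below is an assumption about how that could be done, not the author's method: the `Offer` fields, the provider names, and every price are hypothetical, and a provider with no documented cache price is treated as charging the full input price for cached tokens.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Offer:
    provider: str
    input_price: float                           # USD per million uncached input tokens
    output_price: float                          # USD per million output tokens
    cached_input_price: Optional[float] = None   # None = caching not documented


def workload_cost(offer, input_tokens, output_tokens, cached_fraction):
    """Total USD cost of a workload where cached_fraction of the input hits the cache."""
    # Assumption: with no documented cache price, cached tokens cost the full input price.
    if offer.cached_input_price is None:
        cached_price = offer.input_price
    else:
        cached_price = offer.cached_input_price
    cached = input_tokens * cached_fraction
    uncached = input_tokens - cached
    total = (uncached * offer.input_price
             + cached * cached_price
             + output_tokens * offer.output_price)
    return total / 1_000_000


# Hypothetical offers for the same model (illustrative numbers only).
offers = [
    Offer("provider-x", input_price=0.50, output_price=1.50, cached_input_price=0.05),
    Offer("provider-y", input_price=0.40, output_price=1.20),
    Offer("provider-z", input_price=1.00, output_price=3.00, cached_input_price=0.50),
]

for fraction in (0.0, 0.8):
    ranked = sorted(offers, key=lambda o: workload_cost(o, 10_000_000, 1_000_000, fraction))
    print(f"cached fraction {fraction:.0%}:")
    for o in ranked:
        cost = workload_cost(o, 10_000_000, 1_000_000, fraction)
        print(f"  {o.provider}: ${cost:.2f}")
```

In this made-up example the cheapest provider changes once most of the input is cached, which is the same effect described above: the ranking depends on the workload, not only on the listed token price.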

A few things I haven't figured out how to compare yet:

  • Real throughput (tokens/sec)
  • Cold-start / queue times
  • Whether providers are serving FP16, FP8, quantized variants, etc.
  • Egress/network costs
  • Reliability/uptime
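For the throughput item, one possible approach is to time a streamed response and derive time-to-first-token and tokens per second. This is only a sketch under assumptions: `measure_stream` and `fake_stream` are hypothetical names, the stream is simulated with `time.sleep` and does not call any provider, and a real measurement would need each provider's own streaming client and token counts.

```python
import time
from typing import Callable, Dict, Iterable


def measure_stream(chunks: Iterable[int],
                   clock: Callable[[], float] = time.perf_counter) -> Dict[str, float]:
    """Time a stream whose items are the token counts of each received chunk."""
    start = clock()
    first = None
    total = 0
    for n_tokens in chunks:
        now = clock()
        if first is None:
            first = now  # arrival time of the first chunk
        total += n_tokens
    end = clock()
    if first is None:
        raise ValueError("stream produced no chunks")
    elapsed = end - start
    return {
        "time_to_first_token_s": first - start,
        "tokens": total,
        "tokens_per_s": total / elapsed if elapsed > 0 else float("inf"),
    }


def fake_stream(n_chunks=5, tokens_per_chunk=4, delay_s=0.01):
    """Simulated stream standing in for a provider response (not a real API call)."""
    for _ in range(n_chunks):
        time.sleep(delay_s)
        yield tokens_per_chunk


stats = measure_stream(fake_stream())
print(stats)
```

A single run like this says little on its own; repeating it at different times of day and with different prompt sizes would also give a rough view of queueing and cold-start behaviour.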

I'm curious how others evaluate providers.

When you're choosing between OpenRouter, Together, Fireworks, Groq, DeepSeek, etc., what metrics actually matter to you beyond token pricing?

https://preview.redd.it/4vj50mvhu79h1.png?width=1615&format=png&auto=webp&s=6c6c084927f83bfdadb5ed8e4378f520a1da6766

Am I missing any important data points that should be included in a v2?

submitted by /u/Technomadlyf