From Benchmark Theater to Real Performance: A Case for Goodput

Community Article Published February 22, 2026

How many times have we all been dazzled by impressive LLM inference throughput numbers, only to be disappointed once we learn the specific benchmark conditions and uncover the real performance details?


Before we make the case for Goodput, it’s important to understand the limitations of the two most common LLM inference metrics: throughput and latency.

Throughput

Throughput is typically measured in tokens generated per second (sometimes including input tokens processed per second). Throughput matters because it tells you:

  • How cost‑efficient your GPU/TPU cluster is
  • Whether hardware resources are being fully utilized

Throughput is often broken into two components:

1. Prefill Tokens per Second

How fast the model processes input tokens.

  • Dominated by the dense forward pass
  • Highly compute‑intensive

2. Decode Tokens per Second

How fast the model generates output tokens.

  • Dominated by autoregressive decoding over the KV cache
  • Highly memory‑bandwidth (HBM) intensive
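The prefill/decode split above can be made concrete with simple arithmetic over per-request timings. The numbers below are illustrative assumptions, not measurements from any particular system:

```python
# Hypothetical timings for a single request; illustrative numbers only.
prompt_tokens = 2048
output_tokens = 256
prefill_time_s = 0.40   # time to process the full prompt (compute-bound)
decode_time_s = 5.12    # time to generate all output tokens (memory-bandwidth-bound)

prefill_tps = prompt_tokens / prefill_time_s   # prefill tokens/s
decode_tps = output_tokens / decode_time_s     # decode tokens/s
print(f"prefill: {prefill_tps:.0f} tok/s, decode: {decode_tps:.0f} tok/s")
```

Note how the two rates differ by orders of magnitude: a single headline "tokens/second" figure hides which phase it was measured on.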

Latency

Latency measures the end‑to‑end time for an LLM to return a meaningful portion of its response. It is per‑request and usually measured in milliseconds or seconds. Latency has three phases:

1. Prefill Latency

Time to:

  • Tokenize the prompt
  • Run the transformer forward pass over all input tokens
  • Initialize a KV cache

Prefill latency typically grows quadratically with prompt length.
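The quadratic growth comes from attention: the QKᵀ and attention-weighted V products each scale with the square of the sequence length. A rough, attention-only FLOP count (the `d_model` value and per-layer formula are simplifying assumptions) shows that doubling the prompt roughly quadruples that cost:

```python
# Rough illustration of quadratic attention cost during prefill.
# The MLP part of the layer grows only linearly with tokens, so real
# end-to-end scaling sits somewhere between linear and quadratic.
def attention_flops(n_tokens, d_model=4096):
    # QK^T and attention-weighted V each cost ~n^2 * d multiply-adds per layer.
    return 2 * n_tokens**2 * d_model

for n in (1024, 2048, 4096):
    print(f"{n:5d} tokens -> {attention_flops(n)/1e9:.1f} GFLOPs/layer (attention only)")
```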

2. First Token Latency (TTFT)

  • Time from request submission to receiving the first generated token.
  • TTFT = Prefill + Scheduling + Decode setup

This is the latency users feel the most and is critical for chat UX.
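TTFT is easy to measure from any streaming interface: it is simply the wall-clock time between submitting the request and the first token arriving. The helper and the simulated stream below are illustrative assumptions, not part of any specific inference server's API:

```python
import time

def measure_ttft(stream):
    """Time from request submission to receiving the first generated token.

    `stream` is any iterator that yields tokens as they are produced,
    e.g. a streaming API response.
    """
    start = time.perf_counter()
    first = next(stream)  # blocks through prefill + scheduling + decode setup
    ttft = time.perf_counter() - start
    return first, ttft

# Simulated stream standing in for a real inference backend.
def fake_stream():
    time.sleep(0.05)  # pretend prefill + scheduling takes 50 ms
    yield "Hello"
    yield " world"

token, ttft = measure_ttft(fake_stream())
print(f"first token {token!r} after {ttft * 1000:.0f} ms")
```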

3. Steady‑State Decode Latency

Time to generate each subsequent token:

  • Attend over existing KV cache
  • Perform a single forward pass
  • Execute sampling/logit processing

Influenced by model size, model architecture, hardware speed, and batching.

Throughput vs. Latency

A common misconception:

If latency is low, my system must have high throughput—they’re the same.

But they are not:

  • Latency = speed per request
  • Throughput = total capacity per second
  • Throughput increases with batch size
  • Latency also increases with batch size
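A toy latency model makes the tension visible: each decode step takes longer as the batch grows, yet the batch amortizes that step across more requests, so aggregate throughput still rises. The cost constants below are illustrative assumptions:

```python
# Toy model: per-step time grows with batch size because each forward
# pass does more work; base_ms and per_req_ms are made-up constants.
def step_time_ms(batch_size, base_ms=20.0, per_req_ms=2.0):
    return base_ms + per_req_ms * batch_size

for batch in (1, 8, 32):
    t = step_time_ms(batch)
    throughput = batch / (t / 1000)  # tokens/s summed across the batch
    print(f"batch={batch:3d}  per-token latency={t:.0f} ms  throughput={throughput:.0f} tok/s")
```

Throughput climbs with batch size while every individual request waits longer per token, which is exactly why the two metrics cannot be optimized independently.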

Enterprises often pursue high throughput to reduce cost per query, but:

Throughput ≠ User Experience

High-throughput optimizations can hurt UX because:

  • Large batches introduce latency spikes
  • First token is delayed
  • Priority scheduling can starve interactive requests
  • Systems may favor bulk workloads over responsiveness

Similarly:

Latency ≠ User Experience

Latency affects responsiveness, but UX also depends on consistency, smoothness, and how the interaction feels.

There’s a need for a throughput metric that’s not just high on paper, but truly useful, meaningful, and aligned with real workloads' service-level objectives (SLOs). That’s Goodput — performance that reflects what users actually experience, not just what hardware counters report.

Goodput

Optimizing for raw throughput and delivering great tokens/second might look impressive in benchmark theater, but not in real-world applications. In Dell Enterprise Hub (https://dell.hf.co) we optimize for good throughput: Goodput.

We define Goodput as the number of served requests per second (throughput) that a system can reach while meeting the required SLOs for a specific use case or scenario.
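This definition translates directly into code: count only the requests that met every SLO, and divide by the measurement window. The field names and SLO thresholds below are illustrative assumptions, not Dell Enterprise Hub's actual values:

```python
from dataclasses import dataclass

@dataclass
class RequestResult:
    ttft_ms: float         # time to first token
    e2e_latency_ms: float  # end-to-end request latency

def goodput(results, window_s, ttft_slo_ms=500.0, e2e_slo_ms=10_000.0):
    """Served requests/s that met ALL SLOs during the measurement window."""
    ok = sum(1 for r in results
             if r.ttft_ms <= ttft_slo_ms and r.e2e_latency_ms <= e2e_slo_ms)
    return ok / window_s

results = [
    RequestResult(ttft_ms=320, e2e_latency_ms=8200),   # meets both SLOs
    RequestResult(ttft_ms=950, e2e_latency_ms=7600),   # TTFT SLO violated
    RequestResult(ttft_ms=210, e2e_latency_ms=12400),  # e2e SLO violated
    RequestResult(ttft_ms=450, e2e_latency_ms=9900),   # meets both SLOs
]
# 4 requests served, but only 2 count toward goodput.
print(f"goodput: {goodput(results, window_s=2.0):.1f} req/s")
```

Raw throughput here would be 2.0 req/s; goodput is half that, because SLO-violating requests don't count.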

Performance optimizations are defined in the context of scenarios to deliver meaningful LLM inference performance. Performance that is acceptable in one scenario might not be acceptable in another, so optimizations are unique to the scenario, the model, and the platform it is deployed on. Optimizations for different scenarios are therefore different.


Scenario definitions

To give users a starting point for their deployments, Dell Enterprise Hub offers optimized configurations for three different Goodput scenarios:

1. Balanced:

This scenario is ideal for applications that require a balance between context length and concurrency, while keeping compute resources at an intermediate level. It is suitable for a wide range of applications and offers a good starting point for further optimization.

2. High concurrency:

This scenario is designed for applications with a high number of concurrent requests, but with a lower context length requirement. It optimizes for served requests per second in exchange for higher resource utilization.

3. Long context:

This scenario is tailored for applications that require a long context length, but with a lower concurrency requirement. It optimizes for context length in exchange for lower throughput and higher resource utilization.

Platform-specific SLO definitions

Scenario SLOs are defined based on the capabilities of each hardware platform—for example, an r760xa‑nvidia‑l40s will have different SLOs than a xe8640‑nvidia‑h100 for the same workload. To simplify selection, this guide lists the SLOs for every scenario across all Dell Enterprise Hub platforms, grouped by GPU type, so you can easily compare options and choose the scenario that best fits your application’s requirements.

Here’s a brief explanation of what each SLO represents:

  • Max model context: This SLO defines the maximum context length that the model can handle while meeting the requirements of the scenario. It is measured in number of tokens.
  • Virtual users: This SLO defines the number of concurrent users sending requests that the deployment should handle.
  • Input tokens: The range of input tokens that are used to simulate requests during the benchmarking runs. For each request, we sample a random number of input tokens within this range.
  • Output tokens: The range of output tokens that are used during the benchmarking runs. For each request, we sample a random number of output tokens within this range.
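The input/output token SLOs drive how benchmark requests are generated: for each request, a token count is drawn uniformly from the configured range. A minimal sketch (the specific ranges below are illustrative, not Dell Enterprise Hub's published values):

```python
import random

def sample_request_shape(input_range=(200, 2000), output_range=(50, 500), seed=None):
    """Draw one benchmark request's shape from the SLO-defined ranges.

    Returns (input_tokens, output_tokens), each sampled uniformly
    from its inclusive range, as described in the SLO definitions.
    """
    rng = random.Random(seed)
    return rng.randint(*input_range), rng.randint(*output_range)

n_in, n_out = sample_request_shape(seed=0)
print(f"request: {n_in} input tokens, {n_out} output tokens")
```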

Optimized deployment configurations

Dell Enterprise Hub provides optimized deployment configurations tailored to each Dell platform and aligned with predefined goodput scenarios. After defining SLOs for a given model and hardware setup, we benchmark multiple inference container configurations and select the one that delivers the highest throughput while meeting all requirements.

Get your Goodput Scenario optimized containers from Dell Enterprise Hub (https://dell.hf.co) and learn more about Goodput Scenarios at https://dell.huggingface.co/docs/optimized-deployments/goodput-scenarios
