Stop benchmarking inference providers
Benchmarking through inference providers isn’t benchmarking your model.
Transformers defines the model; it should define your evals too. Here’s how to take advantage of the HF Hub and open-source libraries to run reliable benchmarks on more than a million models.
Let's say you just created a great benchmark and want to know how popular models perform on it. The easiest way? Start spinning up OpenRouter or HF inference providers like crazy. But by doing this, you don't actually benchmark the model; you benchmark the provider. The model behind the API can be quantized or prompted differently; it can even be a completely different model!
The source of truth for a model's definition is transformers, so it is only natural that evaluation should run against this definition.
However, benchmarking is compute intensive, which is the main reason inference providers are so handy. Fortunately, HF Jobs gives us on-demand compute tailored to our needs, and a single uv script is enough to spin up a server and benchmark our model.
By the end of the post, you will have a one liner to benchmark any model defined in transformers:
hf jobs uv run script.py \
--flavor l4x1 \
--secrets HF_TOKEN \
-e TRANSFORMERS_SERVE_API_KEY="1234"
The script
Dependencies
UV scripts allow us to specify dependencies in metadata at the top of the file. We will use inspect-ai as our eval harness, the OpenAI client to send requests to the server, and, of course, transformers and kernels to serve the model.
# /// script
# dependencies = [
# "inspect-ai", # AI evaluation framework
# "huggingface-hub", # Model repository access
# "transformers[serving]",# Transformers with serving capabilities
# "openai>=2.26.0", # OpenAI API compatibility
# "kernels", # Custom kernel implementations
# ]
# ///
Spin up the server
To keep everything in one file, we will start the server as a subprocess. Here we choose our continuous batching parameters and which attention implementation to use; these should be tailored to the model and hardware you are running on.
import subprocess
# Build the command
serve_cmd = ["transformers", "serve"]
serve_cmd.append("--continuous-batching")
serve_cmd.extend(["--cb-block-size", "256"])
serve_cmd.extend(["--attn-implementation", "flash-attn2"])
server_process = subprocess.Popen(serve_cmd)
wait_for_server_up(server_process, timeout=600)
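`wait_for_server_up` is a helper you need to define yourself. A minimal sketch, assuming the server exposes the OpenAI-compatible `/v1/models` route on the default port 8000 (the endpoint name is an assumption; adjust it to whatever your server actually serves):

```python
import time
import urllib.error
import urllib.request

def wait_for_server_up(process, timeout=600, url="http://localhost:8000/v1/models"):
    """Block until the server answers HTTP requests or the timeout expires.

    Assumes an OpenAI-compatible /v1/models endpoint; any HTTP response
    (even an auth error) counts as "up", since the server is reachable.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if process.poll() is not None:
            # The subprocess exited before becoming ready
            raise RuntimeError("transformers serve exited before becoming ready")
        try:
            with urllib.request.urlopen(url, timeout=5):
                return  # got a response: server is up
        except urllib.error.HTTPError:
            return  # server responded with an error status: still up
        except OSError:
            time.sleep(2)  # not reachable yet, retry
    raise TimeoutError(f"server not reachable within {timeout}s")
```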
Once the server is up, we can start bombarding it with requests. Transformers serve is OpenAI API-compatible, meaning it's plug-and-play with almost every modern evaluation harness.
Here we use GPQA Diamond as the benchmark: a standard knowledge and reasoning benchmark that is defined as an official benchmark on the Hugging Face Hub.
from inspect_ai import eval
model = "Qwen/Qwen3-8B"
eval(
"hf/Idavidrein/gpqa/diamond",
model=f"openai-api/transformers-serve/{model}",
log_dir="./logs",
model_base_url="http://localhost:8000/v1",
display="plain", # to better read the hf job logs
limit=100, # limiting to 100 samples
model_args={"stream": False},
max_connections=100,
max_tokens=2048,
)
Publishing your results
Inspect-ai lets us bundle our logs and display them nicely in a Hugging Face Space.
from inspect_ai.log import bundle_log_dir
bundle_space = "{my_user}/bundle"
bundle_log_dir(
"./logs",
output_dir=f"hf/{bundle_space}",
overwrite=True
)
Putting it all together
We can now run our script using HF Jobs. We use the script defined in the transformers examples, specifying the model, the output space, and the hardware to run on (the flavor).
We also need to pass `--secrets HF_TOKEN` to be able to publish our results to the Hub, and `-e TRANSFORMERS_SERVE_API_KEY="1234"`, which is required by inspect-ai.
hf jobs uv run script.py \
--flavor l4x1 \
--secrets HF_TOKEN \
-e TRANSFORMERS_SERVE_API_KEY="1234"
And THAT'S IT! The benchmark will run and the results will be uploaded to the Hub.
Inspect-ai has a large library of benchmarks you can use, and it's plug and play: if you want to evaluate HLE, AIME, or SWE-bench, simply change the benchmark in the `eval` function.
If you are satisfied with the results, you can open a PR on the repo of the model you benchmarked so that it appears on the community leaderboards. The results will then show up in the GPQA leaderboard on the Hub!
- dataset:
    id: Idavidrein/gpqa
    task_id: gpqa_diamond
  value: 0.412
  source:
    url: https://huggingface.co/spaces/{bundle-space}
    name: Eval traces
  user: user
  notes: "Pass@4"
Deep Dive: