Stop benchmarking inference providers
Benchmarking through inference providers isn’t benchmarking your model.
Transformers defines the model; it should define your evals too. Here’s how to take advantage of the HF Hub and open-source libraries to run reliable benchmarks on more than a million models.
Let's say you just created a great benchmark and want to know how popular models perform on it. The easiest way? Start spinning up OpenRouter or HF inference providers like crazy. But by doing this, you don't actually benchmark the model; you benchmark the provider. The model behind the API can be quantized or prompted differently; it can even be a completely different model!
The source of truth for a model's definition is transformers, so it is only natural that evaluation should run against this definition.
However, benchmarking is compute intensive, which is the main reason inference providers are so handy. Fortunately, HF Jobs gives us on-demand compute tailored to our needs, and a single uv script is enough to spin up a server and benchmark our model.
By the end of the post, you will have a one liner to benchmark any model defined in transformers:
hf jobs uv run script.py \
--flavor l4x1 \
--secrets HF_TOKEN \
-e TRANSFORMERS_SERVE_API_KEY="1234"
The script
Dependencies
UV scripts allow us to specify dependencies in metadata at the top of the file. We will use inspect-ai as our eval harness, the OpenAI client to send requests to the server, and, of course, transformers and kernels to serve the model.
# /// script
# dependencies = [
# "inspect-ai", # AI evaluation framework
# "huggingface-hub", # Model repository access
# "transformers[serving]",# Transformers with serving capabilities
# "openai>=2.26.0", # OpenAI API compatibility
# "kernels", # Custom kernel implementations
# ]
# ///
Spin up the server
To keep everything in one file, we will start the server as a subprocess. Here we choose our continuous batching parameters and which attention implementation to use; these should be tailored to the model and hardware you are running on.
import subprocess
# Build the command
serve_cmd = ["transformers", "serve"]
serve_cmd.append("--continuous-batching")
serve_cmd.extend(["--cb-block-size", "256"])
serve_cmd.extend(["--attn-implementation", "flash-attn2"])
server_process = subprocess.Popen(serve_cmd)
wait_for_server_up(server_process, timeout=600)
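`wait_for_server_up` is a helper you need to define yourself. A minimal sketch, assuming the server exposes the OpenAI-compatible `/v1/models` route on the default port 8000 (the endpoint name is an assumption; adjust it to whatever your server actually serves):

```python
import time
import urllib.error
import urllib.request

def wait_for_server_up(process, timeout=600, url="http://localhost:8000/v1/models"):
    """Block until the server answers HTTP requests or the timeout expires.

    Assumes an OpenAI-compatible /v1/models endpoint; any HTTP response
    (even an auth error) counts as "up", since the server is reachable.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if process.poll() is not None:
            # The subprocess exited before becoming ready
            raise RuntimeError("transformers serve exited before becoming ready")
        try:
            with urllib.request.urlopen(url, timeout=5):
                return  # got a response: server is up
        except urllib.error.HTTPError:
            return  # server responded with an error status: still up
        except OSError:
            time.sleep(2)  # not reachable yet, retry
    raise TimeoutError(f"server not reachable within {timeout}s")
```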
Once the server is up, we can start bombarding it with requests. Transformers serve is OpenAI API-compatible, meaning it's plug-and-play with almost every modern evaluation harness.
Here we use GPQA Diamond as the benchmark: a standard knowledge and reasoning benchmark that is defined as an official benchmark on the Hugging Face Hub.
from inspect_ai import eval
model = "Qwen/Qwen3-8B"
eval(
"hf/Idavidrein/gpqa/diamond",
model=f"openai-api/transformers-serve/{model}",
log_dir="./logs",
model_base_url="http://localhost:8000/v1",
display="plain", # to better read the hf job logs
limit=100, # limiting to 100 samples
model_args={"stream": False},
max_connections=100,
max_tokens=2048,
)
Publishing your results
Inspect-ai lets us bundle our logs and display them nicely in a Hugging Face Space.
from inspect_ai.log import bundle_log_dir
bundle_space = "{my_user}/bundle"
bundle_log_dir(
"./logs",
output_dir=f"hf/{bundle_space}",
overwrite=True
)
Putting it all together
We can now run our script using HF Jobs. We use the script defined in the transformers examples, specifying the model, the output space, and the hardware to run on (the flavor).
We also need to pass `--secrets HF_TOKEN` to be able to publish our results to the Hub, and `-e TRANSFORMERS_SERVE_API_KEY="1234"`, which is required by inspect-ai.
hf jobs uv run script.py \
--flavor l4x1 \
--secrets HF_TOKEN \
-e TRANSFORMERS_SERVE_API_KEY="1234"
And THAT'S IT! The benchmark will run and the results will be uploaded to the Hub.
Inspect-ai has a large library of benchmarks you can use, and it's plug and play: if you want to evaluate HLE, AIME, or SWE-bench, simply change the benchmark in the `eval` function.
If you are satisfied with the results, you can open a PR on the repo of the model you benchmarked so that it appears on the community leaderboards. The results will then show up in the GPQA leaderboard on the Hub!
- dataset:
    id: Idavidrein/gpqa
    task_id: gpqa_diamond
  value: 0.412
  source:
    url: https://huggingface.co/spaces/{bundle-space}
    name: Eval traces
  user: user
  notes: "Pass@4"
Deep Dive: