# [performance] Benchmarking Large Models on Hugging Face Free CPU: How Fast Can TPS Go?
This Hugging Face Space is dedicated to exploring the performance of larger models on Hugging Face's free CPU tier, with a particular focus on achieving the highest possible throughput, measured in tokens per second (TPS). Experiment notes are listed newest first.
## Experiment Notes 2
Model: `gemma-3n-E2B-it-UD-IQ2_M.gguf`
Build Configuration: same as Experiment 1
Run Command: same as Experiment 1
Result:

```
prompt eval time =  4492.84 ms /  22 tokens (  204.22 ms per token,  4.90 tokens per second)
       eval time = 41601.69 ms / 167 tokens (  249.11 ms per token,  4.01 tokens per second)
      total time = 46094.53 ms / 189 tokens
```
Ability: completely useless
## Experiment Notes 1
Inference Engine: llama.cpp
Model: `Qwen3-30B-A3B-Instruct-2507-UD-TQ1_0.gguf`
Build Configuration:

```shell
cmake -B build \
    -DGGML_NATIVE=ON \
    -DGGML_AVX512_VNNI=ON \
    -DGGML_AVX512_VBMI=ON \
    -DGGML_AVX512=ON \
    -DGGML_AVX512_BF16=ON \
    -DGGML_UNROLL=ON \
    -DGGML_USE_K_QUANTS=ON \
    -DGGML_LTO=ON \
    -DCURL_INCLUDE_DIR=/usr/include/x86_64-linux-gnu \
    -DCURL_LIBRARY=/usr/lib/x86_64-linux-gnu/libcurl.so
```

(followed by the usual `cmake --build build --config Release -j` compile step)
Run Command:

```shell
./llama-server -m model.gguf \
    --port 8000 \
    --host 0.0.0.0 \
    --threads 2 \
    --ctx-size 4096 \
    --mlock \
    --jinja
```
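Once the server is up, it can be exercised over the OpenAI-compatible HTTP API that `llama-server` exposes at `/v1/chat/completions`. A minimal stdlib-only Python sketch, assuming the server is reachable at `http://localhost:8000` (the host/port from the run command above); the prompt and `max_tokens` value are illustrative:

```python
import json
import urllib.request

def build_chat_request(prompt, base_url="http://localhost:8000", max_tokens=200):
    """Build a POST request for llama-server's OpenAI-compatible chat endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Usage (requires the server launched by the run command above):
# with urllib.request.urlopen(build_chat_request("Explain GGUF in one sentence.")) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```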
Result:

```
prompt eval time =  4191.78 ms /  14 tokens (  299.41 ms per token,  3.34 tokens per second)
       eval time = 42453.29 ms / 162 tokens (  262.06 ms per token,  3.82 tokens per second)
      total time = 46645.06 ms / 176 tokens
```
Ability: Good
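The TPS figures in the logs above follow directly from the raw token counts and elapsed milliseconds that llama.cpp reports. A quick sanity check of the generation-phase (`eval time`) numbers from both experiments:

```python
def tokens_per_second(n_tokens, total_ms):
    """Throughput (TPS) from a token count and elapsed wall-clock milliseconds."""
    return n_tokens / (total_ms / 1000.0)

# Experiment 2 (gemma-3n-E2B-it): eval time = 41601.69 ms / 167 tokens
print(round(tokens_per_second(167, 41601.69), 2))  # → 4.01

# Experiment 1 (Qwen3-30B-A3B): eval time = 42453.29 ms / 162 tokens
print(round(tokens_per_second(162, 42453.29), 2))  # → 3.82
```

Both match the logged values, so the reported per-token latencies and TPS are internally consistent.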