# [performance] Benchmarking Large Models on Hugging Face Free CPU: How Fast Can TPS Go?
This Hugging Face Space is dedicated to exploring the performance of larger models on Hugging Face's free CPU tier, with a particular focus on achieving the highest possible throughput, measured in tokens per second (TPS). Experiment notes are listed newest first.
## Experiment Notes 2
Model: `gemma-3n-E2B-it-UD-IQ2_M.gguf`
Build Configuration: same as Experiment 1
Run Command: same as Experiment 1
Result:

```
prompt eval time =  4492.84 ms /  22 tokens (  204.22 ms per token,  4.90 tokens per second)
       eval time = 41601.69 ms / 167 tokens (  249.11 ms per token,  4.01 tokens per second)
      total time = 46094.53 ms / 189 tokens
```
Ability: completely useless
## Experiment Notes 1
Inference Engine: llama.cpp
Model: `Qwen3-30B-A3B-Instruct-2507-UD-TQ1_0.gguf`
Build Configuration:

```shell
cmake -B build \
    -DGGML_NATIVE=ON \
    -DGGML_AVX512_VNNI=ON \
    -DGGML_AVX512_VBMI=ON \
    -DGGML_AVX512=ON \
    -DGGML_AVX512_BF16=ON \
    -DGGML_UNROLL=ON \
    -DGGML_USE_K_QUANTS=ON \
    -DGGML_LTO=ON \
    -DCURL_INCLUDE_DIR=/usr/include/x86_64-linux-gnu \
    -DCURL_LIBRARY=/usr/lib/x86_64-linux-gnu/libcurl.so
```

(followed by the usual `cmake --build build --config Release -j` compile step)
Run Command:

```shell
./llama-server -m model.gguf \
    --port 8000 \
    --host 0.0.0.0 \
    --threads 2 \
    --ctx-size 4096 \
    --mlock \
    --jinja
```
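Once the server is up, it can be exercised over the OpenAI-compatible HTTP API that `llama-server` exposes at `/v1/chat/completions`. A minimal stdlib-only Python sketch, assuming the server is reachable at `http://localhost:8000` (the host/port from the run command above); the prompt and `max_tokens` value are illustrative:

```python
import json
import urllib.request

def build_chat_request(prompt, base_url="http://localhost:8000", max_tokens=200):
    """Build a POST request for llama-server's OpenAI-compatible chat endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Usage (requires the server launched by the run command above):
# with urllib.request.urlopen(build_chat_request("Explain GGUF in one sentence.")) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```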
Result:

```
prompt eval time =  4191.78 ms /  14 tokens (  299.41 ms per token,  3.34 tokens per second)
       eval time = 42453.29 ms / 162 tokens (  262.06 ms per token,  3.82 tokens per second)
      total time = 46645.06 ms / 176 tokens
```
Ability: Good
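The TPS figures in the logs above follow directly from the raw token counts and elapsed milliseconds that llama.cpp reports. A quick sanity check of the generation-phase (`eval time`) numbers from both experiments:

```python
def tokens_per_second(n_tokens, total_ms):
    """Throughput (TPS) from a token count and elapsed wall-clock milliseconds."""
    return n_tokens / (total_ms / 1000.0)

# Experiment 2 (gemma-3n-E2B-it): eval time = 41601.69 ms / 167 tokens
print(round(tokens_per_second(167, 41601.69), 2))  # → 4.01

# Experiment 1 (Qwen3-30B-A3B): eval time = 42453.29 ms / 162 tokens
print(round(tokens_per_second(162, 42453.29), 2))  # → 3.82
```

Both match the logged values, so the reported per-token latencies and TPS are internally consistent.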