# Step-3.5-Flash-REAP-128B-A11B GGUF
GGUF quantizations of `lkevincc0/Step-3.5-Flash-REAP-128B-A11B`, a REAP-pruned variant of `stepfun-ai/Step-3.5-Flash`.
## Available quantizations
| Quantization | File size | Files |
|---|---|---|
| Q5_K_M | 80 GB | 5 split parts |
| Q4_K_M | 68 GB | 4 split parts |
## About the model
- Architecture: Step3p5ForCausalLM (Sparse MoE)
- Original parameters: 196B total, 11B active per token (288 experts, top-8 routing)
- REAP-pruned: 128B total, 11B active per token (173 experts, top-8 routing — 40% expert pruning)
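The pruning figures above can be sanity-checked with a little arithmetic (an illustrative snippet, not part of any toolchain):

```python
# REAP pruning arithmetic, using the expert counts from the card above.
original_experts = 288
remaining_experts = 173
pruned_fraction = 1 - remaining_experts / original_experts
print(f"{pruned_fraction:.1%} of experts pruned")  # prints "39.9% of experts pruned"

# Active parameters stay at 11B because top-8 routing is unchanged:
# pruning shrinks the expert pool (total parameters), not per-token compute.
```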
## How to run
```shell
# Basic inference
llama-cli -m Step-3.5-Flash-REAP-128B-A11B-Q5_K_M.gguf-00001-of-00005.gguf \
  -c 16384 -b 2048 -ub 2048 -fa on --temp 1.0 \
  -p "What's your name?"
```

```shell
# Server mode
llama-server -m Step-3.5-Flash-REAP-128B-A11B-Q5_K_M.gguf-00001-of-00005.gguf \
  -c 16384 -b 2048 -ub 2048 -fa on -ngl 99
```
Note: for split GGUFs, point llama.cpp at the first part — it finds the rest automatically.
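The split parts follow llama.cpp's `-NNNNN-of-NNNNN.gguf` suffix scheme, visible in the filenames above. A small helper (the function name is ours, for illustration) enumerates the filenames that must all be present in the same directory:

```python
def split_part_names(prefix: str, n_parts: int) -> list[str]:
    """Expected filenames for a sharded GGUF (llama.cpp split naming scheme)."""
    return [f"{prefix}-{i:05d}-of-{n_parts:05d}.gguf" for i in range(1, n_parts + 1)]

# Q5_K_M ships as 5 parts; only the first is passed on the command line,
# but llama.cpp expects all five alongside it.
parts = split_part_names("Step-3.5-Flash-REAP-128B-A11B-Q5_K_M.gguf", 5)
print(parts[0])  # prints "Step-3.5-Flash-REAP-128B-A11B-Q5_K_M.gguf-00001-of-00005.gguf"
```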
## Quantization details
Converted from the original safetensors to a bf16 GGUF with `convert_hf_to_gguf.py`, then quantized with `llama-quantize`, both from llama.cpp at commit `39bf692af`.
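The pipeline above, sketched as commands (paths and output filenames are illustrative; `--outtype`/`--outfile` and the `Q5_K_M` type string are standard llama.cpp options):

```shell
# 1. Convert the safetensors checkpoint to a bf16 GGUF.
python convert_hf_to_gguf.py /path/to/Step-3.5-Flash-REAP-128B-A11B \
  --outtype bf16 --outfile model-bf16.gguf

# 2. Quantize the bf16 GGUF down to the target type (Q5_K_M or Q4_K_M).
llama-quantize model-bf16.gguf model-Q5_K_M.gguf Q5_K_M
```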