Step-3.5-Flash-REAP-128B-A11B GGUF

GGUF quantizations of lkevincc0/Step-3.5-Flash-REAP-128B-A11B, a REAP-pruned variant of stepfun-ai/Step-3.5-Flash.

Available quantizations

Quantization   File size   Split parts
Q5_K_M         80 GB       5
Q4_K_M         68 GB       4

About the model

  • Architecture: Step3p5ForCausalLM (Sparse MoE)
  • Original parameters: 196B total, 11B active per token (288 experts, top-8 routing)
  • REAP-pruned: 128B total, 11B active per token (173 experts, top-8 routing — 40% expert pruning)
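The pruning figure above can be sanity-checked directly from the expert counts (288 experts reduced to 173):

```shell
# Compute the expert-pruning ratio: (288 - 173) / 288
awk 'BEGIN { printf "%.1f%%\n", (288 - 173) / 288 * 100 }'
# prints 39.9% — i.e. roughly the 40% expert pruning stated above
```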

How to run

# Basic inference
llama-cli -m Step-3.5-Flash-REAP-128B-A11B-Q5_K_M.gguf-00001-of-00005.gguf \
  -c 16384 -b 2048 -ub 2048 -fa on --temp 1.0 \
  -p "What's your name?"

# Server mode
llama-server -m Step-3.5-Flash-REAP-128B-A11B-Q5_K_M.gguf-00001-of-00005.gguf \
  -c 16384 -b 2048 -ub 2048 -fa on -ngl 99

Note: for split GGUFs, point llama.cpp at the first part; it locates the remaining parts automatically.
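The split parts follow llama.cpp's sharded-GGUF naming convention, a `-%05d-of-%05d.gguf` suffix appended to the base filename. A small sketch that lists the expected part names, using the Q5_K_M entry from the table above:

```shell
# List the expected filenames of a sharded GGUF (llama.cpp split naming).
# Base name and part count match the Q5_K_M entry in this repo.
base="Step-3.5-Flash-REAP-128B-A11B-Q5_K_M.gguf"
parts=5
for i in $(seq 1 "$parts"); do
  printf '%s-%05d-of-%05d.gguf\n' "$base" "$i" "$parts"
done
```

If you prefer a single file, llama.cpp also ships a llama-gguf-split tool whose --merge option recombines the parts into one GGUF.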

Quantization details

Converted from the original safetensors to bf16 GGUF using convert_hf_to_gguf.py, then quantized with llama-quantize.

Quantized using llama.cpp at commit 39bf692af.
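As a sketch, the pipeline described above looks roughly like this; the input path and output filenames are placeholders, not the exact commands used for this repo:

```shell
# Convert the HF safetensors checkpoint to a bf16 GGUF, then quantize.
python convert_hf_to_gguf.py /path/to/Step-3.5-Flash-REAP-128B-A11B \
  --outtype bf16 --outfile model-bf16.gguf

# Quantize to Q5_K_M (repeat with Q4_K_M for the smaller variant).
llama-quantize model-bf16.gguf model-Q5_K_M.gguf Q5_K_M
```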

GGUF metadata: architecture step35, 121B parameters.