Step-3.5-Flash-REAP-128B-A11B GGUF

GGUF quantizations of lkevincc0/Step-3.5-Flash-REAP-128B-A11B, a REAP-pruned variant of stepfun-ai/Step-3.5-Flash.

Available quantizations

Quantization   File size   Split parts
Q5_K_M         80 GB       5
Q4_K_M         68 GB       4

About the model

  • Architecture: Step3p5ForCausalLM (Sparse MoE)
  • Original parameters: 196B total, 11B active per token (288 experts, top-8 routing)
  • REAP-pruned: 128B total, 11B active per token (173 experts, top-8 routing — 40% expert pruning)
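The pruning figure above can be sanity-checked directly from the expert counts (288 experts reduced to 173):

```shell
# Compute the expert-pruning ratio: (288 - 173) / 288
awk 'BEGIN { printf "%.1f%%\n", (288 - 173) / 288 * 100 }'
# prints 39.9% — i.e. roughly the 40% expert pruning stated above
```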

How to run

# Basic inference
llama-cli -m Step-3.5-Flash-REAP-128B-A11B-Q5_K_M.gguf-00001-of-00005.gguf \
  -c 16384 -b 2048 -ub 2048 -fa on --temp 1.0 \
  -p "What's your name?"

# Server mode
llama-server -m Step-3.5-Flash-REAP-128B-A11B-Q5_K_M.gguf-00001-of-00005.gguf \
  -c 16384 -b 2048 -ub 2048 -fa on -ngl 99

Note: for split GGUFs, point llama.cpp at the first part; it locates the remaining parts automatically.
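The split parts follow llama.cpp's sharded-GGUF naming convention, a `-%05d-of-%05d.gguf` suffix appended to the base filename. A small sketch that lists the expected part names, using the Q5_K_M entry from the table above:

```shell
# List the expected filenames of a sharded GGUF (llama.cpp split naming).
# Base name and part count match the Q5_K_M entry in this repo.
base="Step-3.5-Flash-REAP-128B-A11B-Q5_K_M.gguf"
parts=5
for i in $(seq 1 "$parts"); do
  printf '%s-%05d-of-%05d.gguf\n' "$base" "$i" "$parts"
done
```

If you prefer a single file, llama.cpp also ships a llama-gguf-split tool whose --merge option recombines the parts into one GGUF.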

Quantization details

Converted from the original safetensors to bf16 GGUF using convert_hf_to_gguf.py, then quantized with llama-quantize.

Quantized using llama.cpp at commit 39bf692af.
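As a sketch, the pipeline described above looks roughly like this; the input path and output filenames are placeholders, not the exact commands used for this repo:

```shell
# Convert the HF safetensors checkpoint to a bf16 GGUF, then quantize.
python convert_hf_to_gguf.py /path/to/Step-3.5-Flash-REAP-128B-A11B \
  --outtype bf16 --outfile model-bf16.gguf

# Quantize to Q5_K_M (repeat with Q4_K_M for the smaller variant).
llama-quantize model-bf16.gguf model-Q5_K_M.gguf Q5_K_M
```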

GGUF metadata: architecture step35, 121B parameters.