gemmacut-spectral
Self-hostable vLLM bundle for the GemmaCut SpectralQuant Phase 2 codebook KV-cache path with Eagle3 speculative decoding.
This is an experimental reproducibility release, not a production-ready model. It has passed short kernel equivalence checks, a graph-mode semantic smoke test, and a short synthetic serving benchmark on an NVIDIA RTX PRO 6000 Blackwell Server Edition. Run your own correctness, load, and KV-cache-footprint validation before serving real users.
What This Includes
- artifacts/spectral_sidecar_chat_v2.pt: SpectralQuant calibration sidecar.
- scripts/setup_repro_from_hf.sh: one-command setup for a new machine.
- scripts/serve_phase2_eagle.sh: OpenAI-compatible vLLM server launcher.
- scripts/bench_tokens_sec_phase2_eagle.sh: smoke/benchmark runner.
- scripts/build_docker_image.sh: builds a no-weights runtime image.
- docker/: Dockerfile and entrypoint for the no-weights runtime image.
- scripts/test_triton_codebook_match.py: isolated kernel equivalence harness.
- scripts/measure_kv_cache_compression.py: live KV-cache measurement helper.
- results/: selected validation outputs.
- manifest.json: exact tested versions and checksums.
The actual vLLM implementation lives here:
https://github.com/bluecopa/vllm-spectral.git
branch: spectral-codebook-docker
commit: 008dd7f87fb9de185e536ad30b4d524024ed9b9f
Requirements
- Linux host with Docker and NVIDIA Container Toolkit.
- Large NVIDIA GPU. Tested on a 98 GB Blackwell card and on a single 80 GB H100 for a 4K RULER needle smoke. Start with the defaults MAX_MODEL_LEN=512, MAX_NUM_SEQS=2, GPU_MEMORY_UTILIZATION=0.8, then scale after validation.
- git.
- Hugging Face CLI: pip install -U huggingface_hub or equivalent.
- Hugging Face access for the base and drafter models if your account requires it.
If model downloads need authentication:
hf auth login
export HF_TOKEN=...
Do not bake tokens into Docker images or committed files.
No-Weights Docker Image
This is the simplest hosting path if you are willing to build an image. The image bakes in:
vLLM Spectral fork at 008dd7f87fb9de185e536ad30b4d524024ed9b9f
GemmaCut launcher entrypoint
Spectral sidecar artifacts/spectral_sidecar_chat_v2.pt
git/cmake/ninja build tools for inspection and follow-up work
It does not bake in model weights. Intel/gemma-4-31B-it-int4-AutoRound and RedHatAI/gemma-4-31B-it-speculator.eagle3 are downloaded at runtime into the mounted Hugging Face cache.
Build:
hf download satya007/gemmacut-spectral \
.dockerignore \
docker/Dockerfile \
docker/entrypoint.sh \
docker/download_sidecar.py \
scripts/build_docker_image.sh \
--local-dir ./gemmacut-spectral-image
cd ./gemmacut-spectral-image
chmod +x ./scripts/build_docker_image.sh
IMAGE=gemmacut-spectral:008dd7f87 ./scripts/build_docker_image.sh
Smoke:
mkdir -p "$PWD/hf-cache" "$PWD/results"
docker run --rm --gpus all --ipc=host \
-e HF_TOKEN \
-v "$PWD/hf-cache:/root/.cache/huggingface" \
-v "$PWD/results:/workspace/results_bench" \
gemmacut-spectral:008dd7f87 smoke
Serve:
docker run --rm --gpus all --ipc=host \
-p 8000:8000 \
-e HF_TOKEN \
-e MAX_MODEL_LEN=512 \
-e MAX_NUM_BATCHED_TOKENS=512 \
-e MAX_NUM_SEQS=2 \
-e GPU_MEMORY_UTILIZATION=0.8 \
-v "$PWD/hf-cache:/root/.cache/huggingface" \
gemmacut-spectral:008dd7f87 serve
Optional: build without the sidecar and mount it yourself.
IMAGE=gemmacut-spectral:008dd7f87-nosidecar \
./scripts/build_docker_image.sh --build-arg INCLUDE_SIDECAR=0
docker run --rm --gpus all --ipc=host \
-p 8000:8000 \
-e HF_TOKEN \
-e SPECTRAL_SIDECAR=/workspace/spectral_sidecar_chat_v2.pt \
-v "$PWD/hf-cache:/root/.cache/huggingface" \
-v "$PWD/spectral_sidecar_chat_v2.pt:/workspace/spectral_sidecar_chat_v2.pt:ro" \
gemmacut-spectral:008dd7f87-nosidecar serve
One-Command Setup
Pick a host directory. The setup script creates this layout:
$HOST_ROOT/vllm-spectral
$HOST_ROOT/gemmacut
$HOST_ROOT/gemmacut/results_it/spectral_sidecar_chat_v2.pt
$HOST_ROOT/.cache/huggingface
Run:
export HOST_ROOT=$PWD/gemmacut-spectral-host
hf download satya007/gemmacut-spectral \
scripts/setup_repro_from_hf.sh \
--local-dir /tmp/gemmacut-spectral-bootstrap
chmod +x /tmp/gemmacut-spectral-bootstrap/scripts/setup_repro_from_hf.sh
/tmp/gemmacut-spectral-bootstrap/scripts/setup_repro_from_hf.sh
The setup script:
- clones the tested vLLM branch over HTTPS,
- checks out 008dd7f87fb9de185e536ad30b4d524024ed9b9f,
- downloads this repo's sidecar and helper scripts,
- verifies the sidecar SHA256,
- writes everything under $HOST_ROOT.
Smoke Test
This starts Docker, launches vLLM, sends two short prompts, verifies the answers, and exits:
cd "$HOST_ROOT/gemmacut"
HOST_ROOT="$HOST_ROOT" \
SPECTRAL_CUDA_GRAPH=1 \
RUN_SMOKE=1 \
SMOKE_ONLY=1 \
NUM_SPEC_TOKENS=3 \
./bench_tokens_sec_phase2_eagle.sh
Expected semantic output:
What is 2+2? Answer with just the number. => 4
Paris is the capital of which country? Answer with one word. => France
SMOKE_PROMPTS_OK
Start The Server
cd "$HOST_ROOT/gemmacut"
HOST_ROOT="$HOST_ROOT" \
HF_TOKEN="${HF_TOKEN:-}" \
PORT=8000 \
HOST_PORT=8000 \
SERVED_MODEL_NAME=gemmacut-spectral \
MAX_MODEL_LEN=512 \
MAX_NUM_BATCHED_TOKENS=512 \
MAX_NUM_SEQS=2 \
GPU_MEMORY_UTILIZATION=0.8 \
NUM_SPEC_TOKENS=3 \
SPECTRAL_CUDA_GRAPH=1 \
./serve_phase2_eagle.sh
The first run may take time while vLLM downloads:
Intel/gemma-4-31B-it-int4-AutoRound
RedHatAI/gemma-4-31B-it-speculator.eagle3
Query it:
curl http://localhost:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "gemmacut-spectral",
"messages": [
{"role": "user", "content": "What is 2+2? Answer with just the number."}
],
"max_tokens": 16,
"temperature": 0
}'
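The same request can be issued from Python. The sketch below targets the OpenAI-compatible endpoint with only the standard library; `build_chat_payload` and `ask` are illustrative local helpers, not part of this bundle:

```python
# Minimal illustrative client for the OpenAI-compatible endpoint.
# build_chat_payload and ask are hypothetical helpers, not bundle code.
import json
import urllib.request

def build_chat_payload(model, prompt, max_tokens=16, temperature=0):
    """Build the JSON body for /v1/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def ask(base_url, model, prompt):
    """POST the payload and return the first choice's message content."""
    body = json.dumps(build_chat_payload(model, prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        out = json.load(resp)
    return out["choices"][0]["message"]["content"]

payload = build_chat_payload(
    "gemmacut-spectral", "What is 2+2? Answer with just the number."
)
print(json.dumps(payload, indent=2))
# ask("http://localhost:8000", "gemmacut-spectral", ...) once the server is up.
```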
Runtime Defaults
The launch scripts enable the tested path:
SPECTRAL_TRITON_COMPRESS=1
SPECTRAL_TRITON_DEQUANT=1
SPECTRAL_CUDA_GRAPH=1
SPECTRAL_VERIFY=0
ENABLE_SPECTRAL=1
ENABLE_EAGLE=1
NUM_SPEC_TOKENS=3
DISABLE_HYBRID_KV_CACHE_MANAGER=0
kv_cache_dtype=fp8_e4m3
For constrained smoke tests, set ENABLE_EAGLE=0 to skip loading the Eagle3 drafter while keeping the SpectralQuant base path enabled. The normal full package uses ENABLE_EAGLE=1.
For environment isolation in the no-weights Docker image, set ENABLE_SPECTRAL=0 ENABLE_EAGLE=0 to serve the base Intel/gemma-4-31B-it-int4-AutoRound model with fp8 KV cache and no SpectralQuant flags. That mode is not GemmaCut Spectral; it is only a diagnostic for checking whether the base model, Docker image, HF cache, and GPU fit before enabling Spectral. The host setup scripts always enable Spectral; they support ENABLE_EAGLE=0 only for constrained Spectral smokes.
DISABLE_HYBRID_KV_CACHE_MANAGER=0 uses the default vLLM hybrid KV cache manager. Commit 008dd7f87fb9de185e536ad30b4d524024ed9b9f teaches that path to account for Spectral's nonuniform per-layer page sizes with group-local block pools. Set DISABLE_HYBRID_KV_CACHE_MANAGER=1 only as a fallback/bisect mode.
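The group-local block pools mentioned above can be pictured with a small sketch. This is illustrative only, assuming one independent free list per layer group with a per-group page size; the class and method names are hypothetical, not vLLM APIs:

```python
# Illustrative sketch of group-local KV-cache block pools: each layer
# group has its own page size, so blocks are accounted per group rather
# than through one uniform pool. Names are hypothetical, not vLLM APIs.
from math import ceil

class GroupBlockPool:
    """Free-list allocator for one layer group's fixed-size pages."""
    def __init__(self, page_bytes, budget_bytes):
        self.page_bytes = page_bytes
        self.free = list(range(budget_bytes // page_bytes))

    def allocate(self, n_blocks):
        if n_blocks > len(self.free):
            raise MemoryError("group pool exhausted")
        return [self.free.pop() for _ in range(n_blocks)]

class HybridKVCacheManager:
    """Routes block allocations to the pool for each layer group."""
    def __init__(self, page_bytes_per_group, budget_bytes):
        self.pools = {
            group: GroupBlockPool(page_bytes, budget_bytes)
            for group, page_bytes in page_bytes_per_group.items()
        }

    def allocate_for_tokens(self, num_tokens, tokens_per_block):
        blocks_needed = ceil(num_tokens / tokens_per_block)
        return {
            group: pool.allocate(blocks_needed)
            for group, pool in self.pools.items()
        }

# Two groups with nonuniform page sizes, each with a 1 MiB budget.
mgr = HybridKVCacheManager(
    {"full_precision": 4096, "spectral_compressed": 1024},
    budget_bytes=1 << 20,
)
alloc = mgr.allocate_for_tokens(num_tokens=48, tokens_per_block=16)
```

The point of the structure is that a group with smaller compressed pages gets proportionally more blocks out of the same byte budget, instead of being forced onto the largest group's page size.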
Set HF_HUB_OFFLINE=1 only after the base model and drafter are already cached under $HOST_ROOT/.cache/huggingface.
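Before flipping that switch, you can check that both repos are present in the cache. This is an illustrative sketch assuming the standard `models--Org--Name` hub-cache layout:

```shell
# Only set HF_HUB_OFFLINE=1 if both model repos are already cached.
# Assumes the standard hub cache layout (models--Org--Name directories).
HF_HUB="${HOST_ROOT:-/nonexistent}/.cache/huggingface/hub"
MISSING=0
for repo in \
    models--Intel--gemma-4-31B-it-int4-AutoRound \
    models--RedHatAI--gemma-4-31B-it-speculator.eagle3
do
    if [ -d "$HF_HUB/$repo" ]; then
        echo "cached:  $repo"
    else
        echo "missing: $repo"
        MISSING=1
    fi
done
if [ "$MISSING" -eq 0 ]; then
    export HF_HUB_OFFLINE=1
else
    echo "not going offline: cache incomplete"
fi
```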
Validation Snapshot
Graph-mode benchmark:
results/tokens_sec_phase2_eagle_cg1_3spec_128x64_16p_20260413_021712/
completed=16
failed=0
output_throughput=57.94 tok/s
total_token_throughput=173.83 tok/s
Eagle acceptance rate=34.66%
Semantic smoke:
results/phase2_eagle_cg1_semantic_smoke_20260413_022223/
SMOKE_PROMPTS_OK
H100 grouped-KV 4K RULER candidate pilot:
results/candidate_grouped_kv_4k_niah_single_1_500_20260413_073312/
MAX_MODEL_LEN=8192
MAX_NUM_BATCHED_TOKENS=4096
MAX_NUM_SEQS=1
GPU_MEMORY_UTILIZATION=0.9
SPECTRAL_CUDA_GRAPH=1
DISABLE_HYBRID_KV_CACHE_MANAGER=0
GPU KV cache size: 192,192 tokens
Maximum concurrency for 8,192 tokens per request: 23.46x
RULER niah_single_1 4K: 500/500 exact matches, score=100.0, nulls=0/500
mean_latency_s=4.4667
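The reported concurrency figure follows directly from the cache size; a quick cross-check of the arithmetic:

```python
# Cross-check of the reported H100 pilot numbers: maximum concurrency is
# KV-cache capacity in tokens divided by the per-request token budget.
kv_cache_tokens = 192_192          # "GPU KV cache size: 192,192 tokens"
max_model_len = 8_192              # MAX_MODEL_LEN=8192

max_concurrency = kv_cache_tokens / max_model_len
print(f"{max_concurrency:.2f}x")   # matches the reported 23.46x
```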
L4 isolation smoke:
NVIDIA L4, 23,034 MiB
Docker image build: passed
HF model prefetch after retry: passed
Base INT4 + fp8 KV, no Spectral, no Eagle: passed two-prompt smoke
Base INT4 + fp8 KV cache capacity: 2,304 tokens at MAX_MODEL_LEN=512, GPU_MEMORY_UTILIZATION=0.9
SpectralQuant, no Eagle, SPECTRAL_CUDA_GRAPH=0: failed to reserve KV cache; available KV cache memory -0.74 GiB
Full SpectralQuant + Eagle3: failed while loading the Eagle3 drafter with CUDA OOM
Known Limitations
- Not production-ready yet.
- Full public RULER and production traffic have not been run yet; the H100 result above is a 4K single-task candidate pilot, not a full eval suite or baseline comparison.
- A single 24 GB L4 is not enough for the current full GemmaCut Spectral package. The base INT4 model with fp8 KV cache can start there, but current Spectral no-Eagle overhead still leaves no KV-cache memory, and full Spectral + Eagle3 OOMs while loading the drafter.
- Compression is currently lower than the SpectralQuant paper target because this branch uses a 6-bit semantic / 4-bit tail byte-and-nibble storage format rather than tighter paper-style bit packing.
- The launcher uses a source-overlay workflow: it copies the checked-out vLLM branch into /tmp inside the container and links native extensions from the Docker image.
- If a Hugging Face sidecar download fails with a transient connection error, rerun scripts/setup_repro_from_hf.sh; it uses a host-local download cache and retries the large sidecar file.
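The bit-packing gap behind the compression bullet above can be made concrete. This is an illustrative estimate only, assuming "byte-and-nibble" means each 6-bit semantic code is padded to a full byte while 4-bit tail codes pack two per byte:

```python
# Illustrative estimate of the storage gap between byte-and-nibble
# layout and tight bit packing, for one 6-bit semantic code plus one
# 4-bit tail code. The layout interpretation is an assumption, not a
# description of the actual on-GPU format.
semantic_bits, tail_bits = 6, 4

# Byte-and-nibble: 6-bit value padded to a byte, 4-bit value in a nibble.
stored_bits = 8 + 4

# Paper-style tight packing: no padding.
packed_bits = semantic_bits + tail_bits

overhead = stored_bits / packed_bits
print(f"{stored_bits} vs {packed_bits} bits -> {overhead:.0%} of tight size")
```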
Checksums
artifacts/spectral_sidecar_chat_v2.pt
sha256: e47a36c13467cbedf720e7f782b976df3dcda2d989c727113a8315008661a3e4
size: 505,231,357 bytes
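A quick way to check the sidecar after download (`verify_sha256` is a local helper defined here, not one of the bundle's scripts):

```shell
# Verify a downloaded file against an expected SHA-256 digest.
# verify_sha256 is a local helper, not one of the bundle's scripts.
verify_sha256() {
    expected="$1"; file="$2"
    if [ ! -f "$file" ]; then
        echo "missing: $file"
        return 1
    fi
    actual=$(sha256sum "$file" | awk '{print $1}')
    if [ "$actual" = "$expected" ]; then
        echo "sha256 OK: $file"
    else
        echo "sha256 MISMATCH: $file"
        return 1
    fi
}

verify_sha256 \
    e47a36c13467cbedf720e7f782b976df3dcda2d989c727113a8315008661a3e4 \
    artifacts/spectral_sidecar_chat_v2.pt \
    || echo "run this from the bundle root after downloading the sidecar"
```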
License
This bundle follows the upstream Gemma 4 license: apache-2.0.
Confirmed upstream metadata:
- google/gemma-4-31B-it: apache-2.0
- Intel/gemma-4-31B-it-int4-AutoRound: quantized from google/gemma-4-31B-it and says to follow the original model license
- RedHatAI/gemma-4-31B-it-speculator.eagle3: apache-2.0
Use remains subject to the upstream model cards and terms.
Model tree for satya007/gemmacut-spectral
Base model
google/gemma-4-31B-it