gemmacut-spectral

Self-hostable vLLM bundle for the GemmaCut SpectralQuant Phase 2 codebook KV-cache path with Eagle3 speculative decoding.

This is an experimental reproducibility release, not a production-ready model. It has passed short kernel equivalence checks, a graph-mode semantic smoke test, and a short synthetic serving benchmark on an NVIDIA RTX PRO 6000 Blackwell Server Edition. Run your own correctness, load, and KV-cache-footprint validation before serving real users.

What This Includes

  • artifacts/spectral_sidecar_chat_v2.pt: SpectralQuant calibration sidecar.
  • scripts/setup_repro_from_hf.sh: one-command setup for a new machine.
  • scripts/serve_phase2_eagle.sh: OpenAI-compatible vLLM server launcher.
  • scripts/bench_tokens_sec_phase2_eagle.sh: smoke/benchmark runner.
  • scripts/build_docker_image.sh: builds a no-weights runtime image.
  • docker/: Dockerfile and entrypoint for the no-weights runtime image.
  • scripts/test_triton_codebook_match.py: isolated kernel equivalence harness.
  • scripts/measure_kv_cache_compression.py: live KV-cache measurement helper.
  • results/: selected validation outputs.
  • manifest.json: exact tested versions and checksums.

The actual vLLM implementation lives here:

https://github.com/bluecopa/vllm-spectral.git
branch: spectral-codebook-docker
commit: 008dd7f87fb9de185e536ad30b4d524024ed9b9f

Requirements

  • Linux host with Docker and NVIDIA Container Toolkit.
  • Large NVIDIA GPU. Tested on a 98 GB Blackwell card and on a single 80 GB H100 for a 4K RULER needle smoke. Start with the default MAX_MODEL_LEN=512, MAX_NUM_SEQS=2, GPU_MEMORY_UTILIZATION=0.8, then scale after validation.
  • git.
  • Hugging Face CLI: pip install -U huggingface_hub or equivalent.
  • Hugging Face access for the base and drafter models if your account requires it.

If model downloads need authentication:

hf auth login
export HF_TOKEN=...

Do not bake tokens into Docker images or committed files.

No-Weights Docker Image

This is the simplest hosting path if you are willing to build an image. The image bakes in:

  • the vLLM Spectral fork at 008dd7f87fb9de185e536ad30b4d524024ed9b9f
  • the GemmaCut launcher entrypoint
  • the Spectral sidecar artifacts/spectral_sidecar_chat_v2.pt
  • git/cmake/ninja build tools for inspection and follow-up work

It does not bake in model weights. Intel/gemma-4-31B-it-int4-AutoRound and RedHatAI/gemma-4-31B-it-speculator.eagle3 are downloaded at runtime into the mounted Hugging Face cache.

Build:

hf download satya007/gemmacut-spectral \
  .dockerignore \
  docker/Dockerfile \
  docker/entrypoint.sh \
  docker/download_sidecar.py \
  scripts/build_docker_image.sh \
  --local-dir ./gemmacut-spectral-image

cd ./gemmacut-spectral-image
chmod +x ./scripts/build_docker_image.sh
IMAGE=gemmacut-spectral:008dd7f87 ./scripts/build_docker_image.sh

Smoke:

mkdir -p "$PWD/hf-cache" "$PWD/results"

docker run --rm --gpus all --ipc=host \
  -e HF_TOKEN \
  -v "$PWD/hf-cache:/root/.cache/huggingface" \
  -v "$PWD/results:/workspace/results_bench" \
  gemmacut-spectral:008dd7f87 smoke

Serve:

docker run --rm --gpus all --ipc=host \
  -p 8000:8000 \
  -e HF_TOKEN \
  -e MAX_MODEL_LEN=512 \
  -e MAX_NUM_BATCHED_TOKENS=512 \
  -e MAX_NUM_SEQS=2 \
  -e GPU_MEMORY_UTILIZATION=0.8 \
  -v "$PWD/hf-cache:/root/.cache/huggingface" \
  gemmacut-spectral:008dd7f87 serve

Optional: build without the sidecar and mount it yourself.

IMAGE=gemmacut-spectral:008dd7f87-nosidecar \
  ./scripts/build_docker_image.sh --build-arg INCLUDE_SIDECAR=0

docker run --rm --gpus all --ipc=host \
  -p 8000:8000 \
  -e HF_TOKEN \
  -e SPECTRAL_SIDECAR=/workspace/spectral_sidecar_chat_v2.pt \
  -v "$PWD/hf-cache:/root/.cache/huggingface" \
  -v "$PWD/spectral_sidecar_chat_v2.pt:/workspace/spectral_sidecar_chat_v2.pt:ro" \
  gemmacut-spectral:008dd7f87-nosidecar serve

One-Command Setup

Pick a host directory. The setup script creates this layout:

$HOST_ROOT/vllm-spectral
$HOST_ROOT/gemmacut
$HOST_ROOT/gemmacut/results_it/spectral_sidecar_chat_v2.pt
$HOST_ROOT/.cache/huggingface

Run:

export HOST_ROOT=$PWD/gemmacut-spectral-host

hf download satya007/gemmacut-spectral \
  scripts/setup_repro_from_hf.sh \
  --local-dir /tmp/gemmacut-spectral-bootstrap

chmod +x /tmp/gemmacut-spectral-bootstrap/scripts/setup_repro_from_hf.sh
/tmp/gemmacut-spectral-bootstrap/scripts/setup_repro_from_hf.sh

The setup script:

  • clones the tested vLLM branch over HTTPS,
  • checks out 008dd7f87fb9de185e536ad30b4d524024ed9b9f,
  • downloads this repo's sidecar and helper scripts,
  • verifies the sidecar SHA256,
  • writes everything under $HOST_ROOT.

Smoke Test

This starts Docker, launches vLLM, sends two short prompts, checks the answers, and exits:

cd "$HOST_ROOT/gemmacut"

HOST_ROOT="$HOST_ROOT" \
SPECTRAL_CUDA_GRAPH=1 \
RUN_SMOKE=1 \
SMOKE_ONLY=1 \
NUM_SPEC_TOKENS=3 \
./bench_tokens_sec_phase2_eagle.sh

Expected semantic output:

What is 2+2? Answer with just the number. => 4
Paris is the capital of which country? Answer with one word. => France
SMOKE_PROMPTS_OK

Start The Server

cd "$HOST_ROOT/gemmacut"

HOST_ROOT="$HOST_ROOT" \
HF_TOKEN="${HF_TOKEN:-}" \
PORT=8000 \
HOST_PORT=8000 \
SERVED_MODEL_NAME=gemmacut-spectral \
MAX_MODEL_LEN=512 \
MAX_NUM_BATCHED_TOKENS=512 \
MAX_NUM_SEQS=2 \
GPU_MEMORY_UTILIZATION=0.8 \
NUM_SPEC_TOKENS=3 \
SPECTRAL_CUDA_GRAPH=1 \
./serve_phase2_eagle.sh

The first run may take time while vLLM downloads:

  • Intel/gemma-4-31B-it-int4-AutoRound
  • RedHatAI/gemma-4-31B-it-speculator.eagle3

Query it:

curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gemmacut-spectral",
    "messages": [
      {"role": "user", "content": "What is 2+2? Answer with just the number."}
    ],
    "max_tokens": 16,
    "temperature": 0
  }'
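The same query can be issued from Python. This is a minimal client sketch against the OpenAI-compatible endpoint started above; the helper names (`build_chat_payload`, `extract_answer`) are ours, not part of the repo, and the request itself is commented out since it needs a running server.

```python
import json
import urllib.request

def build_chat_payload(prompt: str, model: str = "gemmacut-spectral",
                       max_tokens: int = 16, temperature: float = 0.0) -> dict:
    """Build the JSON body for a /v1/chat/completions request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def extract_answer(response: dict) -> str:
    """Pull the assistant text out of an OpenAI-compatible response."""
    return response["choices"][0]["message"]["content"].strip()

payload = build_chat_payload("What is 2+2? Answer with just the number.")
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# with urllib.request.urlopen(req) as resp:
#     print(extract_answer(json.load(resp)))
```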

Runtime Defaults

The launch scripts enable the tested path:

SPECTRAL_TRITON_COMPRESS=1
SPECTRAL_TRITON_DEQUANT=1
SPECTRAL_CUDA_GRAPH=1
SPECTRAL_VERIFY=0
ENABLE_SPECTRAL=1
ENABLE_EAGLE=1
NUM_SPEC_TOKENS=3
DISABLE_HYBRID_KV_CACHE_MANAGER=0
kv_cache_dtype=fp8_e4m3

For constrained smoke tests, set ENABLE_EAGLE=0 to skip loading the Eagle3 drafter while keeping the SpectralQuant base path enabled. The normal full package uses ENABLE_EAGLE=1.

For environment isolation in the no-weights Docker image, set ENABLE_SPECTRAL=0 ENABLE_EAGLE=0 to serve the base Intel/gemma-4-31B-it-int4-AutoRound model with fp8 KV cache and no SpectralQuant flags. That mode is not GemmaCut Spectral; it is only a diagnostic for checking whether the base model, Docker image, HF cache, and GPU fit before enabling Spectral. The host setup scripts always enable Spectral; they support ENABLE_EAGLE=0 only for constrained Spectral smokes.

DISABLE_HYBRID_KV_CACHE_MANAGER=0 uses the default vLLM hybrid KV cache manager. Commit 008dd7f87fb9de185e536ad30b4d524024ed9b9f teaches that path to account for Spectral's nonuniform per-layer page sizes with group-local block pools. Set DISABLE_HYBRID_KV_CACHE_MANAGER=1 only as a fallback/bisect mode.

Set HF_HUB_OFFLINE=1 only after the base model and drafter are already cached under $HOST_ROOT/.cache/huggingface.
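A quick way to decide whether offline mode is safe is to check that both repos already have entries in the hub cache. This sketch assumes the standard `models--org--name` cache directory layout; the helper names are ours and presence of a directory does not guarantee every shard finished downloading.

```python
import os

def hf_cache_dir(repo_id: str) -> str:
    """Map a repo id to its hub cache directory name (models--org--name)."""
    return "models--" + repo_id.replace("/", "--")

def ready_for_offline(cache_root: str, repo_ids) -> bool:
    """True only if every required repo already has a cache entry."""
    hub = os.path.join(cache_root, "hub")
    return all(os.path.isdir(os.path.join(hub, hf_cache_dir(r))) for r in repo_ids)

repos = [
    "Intel/gemma-4-31B-it-int4-AutoRound",
    "RedHatAI/gemma-4-31B-it-speculator.eagle3",
]
if ready_for_offline(os.path.expanduser("~/.cache/huggingface"), repos):
    os.environ["HF_HUB_OFFLINE"] = "1"
```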

Validation Snapshot

Graph-mode benchmark:

results/tokens_sec_phase2_eagle_cg1_3spec_128x64_16p_20260413_021712/
completed=16
failed=0
output_throughput=57.94 tok/s
total_token_throughput=173.83 tok/s
Eagle acceptance rate=34.66%
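The metrics above are simple ratios over the benchmark window. The token counts below are hypothetical placeholders, not values from the benchmark logs; only the formulas are the point.

```python
def throughputs(output_tokens: int, total_tokens: int, elapsed_s: float):
    """Output and total token throughput in tok/s over one benchmark window."""
    return output_tokens / elapsed_s, total_tokens / elapsed_s

def acceptance_rate(accepted: int, proposed: int) -> float:
    """Percentage of speculative draft tokens the verifier accepted."""
    return 100.0 * accepted / proposed

# Hypothetical counts for illustration:
out_tps, total_tps = throughputs(output_tokens=1024, total_tokens=3072, elapsed_s=17.7)
eagle_accept = acceptance_rate(accepted=347, proposed=1001)
```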

Semantic smoke:

results/phase2_eagle_cg1_semantic_smoke_20260413_022223/
SMOKE_PROMPTS_OK

H100 grouped-KV 4K RULER candidate pilot:

results/candidate_grouped_kv_4k_niah_single_1_500_20260413_073312/
MAX_MODEL_LEN=8192
MAX_NUM_BATCHED_TOKENS=4096
MAX_NUM_SEQS=1
GPU_MEMORY_UTILIZATION=0.9
SPECTRAL_CUDA_GRAPH=1
DISABLE_HYBRID_KV_CACHE_MANAGER=0
GPU KV cache size: 192,192 tokens
Maximum concurrency for 8,192 tokens per request: 23.46x
RULER niah_single_1 4K: 500/500 exact matches, score=100.0, nulls=0/500
mean_latency_s=4.4667

L4 isolation smoke:

NVIDIA L4, 23,034 MiB
Docker image build: passed
HF model prefetch after retry: passed
Base INT4 + fp8 KV, no Spectral, no Eagle: passed two-prompt smoke
Base INT4 + fp8 KV cache capacity: 2,304 tokens at MAX_MODEL_LEN=512, GPU_MEMORY_UTILIZATION=0.9
SpectralQuant, no Eagle, SPECTRAL_CUDA_GRAPH=0: failed to reserve KV cache; available KV cache memory -0.74 GiB
Full SpectralQuant + Eagle3: failed while loading the Eagle3 drafter with CUDA OOM
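The negative KV budget on the L4 follows from simple arithmetic: the utilization fraction of total VRAM minus everything reserved before the cache. Only the 23,034 MiB total comes from the log; the reserved figure below is a hypothetical stand-in for weights, activations, and Spectral overhead combined.

```python
def available_kv_gib(total_mib: float, util: float, reserved_gib: float) -> float:
    """Memory left for KV cache after non-cache allocations.

    reserved_gib bundles model weights, activations, and Spectral overhead;
    the split between them is hypothetical here.
    """
    budget_gib = total_mib / 1024 * util
    return budget_gib - reserved_gib

# With ~21 GiB reserved (hypothetical), the L4 budget goes negative,
# consistent with the "failed to reserve KV cache" outcome:
print(round(available_kv_gib(23_034, 0.9, 21.0), 2))
```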

Known Limitations

  • Not production-ready yet.
  • Full public RULER and production traffic have not been run yet; the H100 result above is a 4K single-task candidate pilot, not a full eval suite or baseline comparison.
  • A single 24 GB L4 is not enough for the current full GemmaCut Spectral package. The base INT4 model with fp8 KV cache can start there, but current Spectral no-Eagle overhead still leaves no KV-cache memory, and full Spectral + Eagle3 OOMs while loading the drafter.
  • Compression is lower than the SpectralQuant paper target because this branch stores 6-bit semantic codes in full bytes and 4-bit tail codes in nibbles rather than using tighter paper-style bit packing.
  • The launcher uses a source-overlay workflow: it copies the checked-out vLLM branch into /tmp inside the container and links native extensions from the Docker image.
  • If a Hugging Face sidecar download fails with a transient connection error, rerun scripts/setup_repro_from_hf.sh; it uses a host-local download cache and retries the large sidecar file.
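The compression gap from byte-and-nibble storage can be illustrated with bits-per-element arithmetic. The 50/50 semantic/tail split below is a hypothetical example, not the model's actual channel split:

```python
def bits_per_elem_byte_nibble(frac_semantic: float) -> float:
    """Storage cost when 6-bit codes take a full byte and 4-bit codes a nibble."""
    return frac_semantic * 8 + (1 - frac_semantic) * 4

def bits_per_elem_tight(frac_semantic: float) -> float:
    """Cost under paper-style tight packing (exactly 6 and 4 bits)."""
    return frac_semantic * 6 + (1 - frac_semantic) * 4

# Hypothetical 50/50 split between semantic and tail channels:
loose = bits_per_elem_byte_nibble(0.5)  # 6.0 bits/elem
tight = bits_per_elem_tight(0.5)        # 5.0 bits/elem
ratio_vs_fp16 = 16 / loose              # compression vs a 16-bit KV cache
```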

Checksums

artifacts/spectral_sidecar_chat_v2.pt
sha256: e47a36c13467cbedf720e7f782b976df3dcda2d989c727113a8315008661a3e4
size: 505,231,357 bytes
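If you fetch the sidecar by hand, you can verify it with a streaming SHA256 so the ~500 MB file never sits fully in memory. The helper name is ours; the expected digest is the one listed above.

```python
import hashlib

EXPECTED = "e47a36c13467cbedf720e7f782b976df3dcda2d989c727113a8315008661a3e4"

def sha256_of(path: str, chunk: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while blob := f.read(chunk):
            h.update(blob)
    return h.hexdigest()

# assert sha256_of("artifacts/spectral_sidecar_chat_v2.pt") == EXPECTED
```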

License

This bundle follows the upstream Gemma 4 license: apache-2.0.

Confirmed upstream metadata:

  • google/gemma-4-31B-it: apache-2.0
  • Intel/gemma-4-31B-it-int4-AutoRound: quantized from google/gemma-4-31B-it and says to follow the original model license
  • RedHatAI/gemma-4-31B-it-speculator.eagle3: apache-2.0

Use remains subject to the upstream model cards and terms.
