Precog-24B-v1 — Quantized (compressed-tensors for vLLM)
This repository provides quantized runtime builds of
TheDrummer/Precog-24B-v1, repackaged for vLLM using the compressed-tensors format.
TL;DR
- Quantized W4A16 (INT4 weights / 16-bit activations) for vLLM via --quantization compressed-tensors.
- Calibration: 512 chat samples at 2048 max sequence length, drawn from neuralmagic/LLM_compression_calibration.
- Weight-only AWQ (group size 128, symmetric INT4), targeting Linear layers; lm_head left in high precision.
Revisions & Branches
The main branch is a landing page (model card + links). Runnable artifacts live in the per-quant branches.
- main — placeholder / landing page
- W4A16 — 4-bit weights / 16-bit activations (compressed-tensors)
- W8A16 — 8-bit weights / 16-bit activations (compressed-tensors)
Quick links
- main: https://huggingface.co/TheHouseOfTheDude/Precog-24B-v1_Compressed-Tensors/tree/main
- W4A16: https://huggingface.co/TheHouseOfTheDude/Precog-24B-v1_Compressed-Tensors/tree/W4A16
- W8A16: https://huggingface.co/TheHouseOfTheDude/Precog-24B-v1_Compressed-Tensors/tree/W8A16
Repository contents (per revision)
- Sharded quantized weights (*.safetensors) + index (model.safetensors.index.json)
- config.json with compressed-tensors metadata (weight_format, quantization, quantization_config, etc.)
- Tokenizer artifacts (tokenizer.json, tokenizer.model, merges/vocab as applicable)
- Optional: chat_template.jinja (inherits the parent finetune's chat style)
Exact file lists may differ between branches — see Files and versions for each revision.
Quantization & calibration details (from the attached script)
All settings below are extracted from the provided quantization script.
Method & scheme
- Flow: llmcompressor oneshot pipeline with an AWQModifier.
- Targets: ["Linear"] (weight-only quantization).
- Ignored layers: ["lm_head"] kept in higher precision.
- Weights (W4A16 branch): INT4 (num_bits=4, type="int", symmetric=True) using the group strategy with group_size=128 (Marlin-friendly).
- Weights (W8A16 branch): INT8 (num_bits=8, type="int", symmetric=True) using the group strategy with group_size=128 (most likely the BitBLAS kernel on Ampere).
- Activations: not quantized (A16 at runtime; FP16/BF16).
- Recipe object: QuantizationScheme + QuantizationArgs embedded in an AWQ modifier.
- Save: save_compressed=True so vLLM can load the compressed-tensors layout directly.
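The group-wise symmetric INT4 scheme above can be illustrated with a small NumPy sketch. This is illustrative only: the actual quantization is performed by llmcompressor's AWQ flow, which additionally applies activation-aware scaling before rounding.

```python
import numpy as np

def quantize_group_symmetric_int4(w, group_size=128):
    """Symmetric INT4 quantization of a 1-D weight slice, one scale per group.

    Mirrors the scheme described above (num_bits=4, symmetric, group_size=128):
    each group of 128 weights shares one scale; quantized values lie in [-8, 7].
    """
    assert w.size % group_size == 0
    groups = w.reshape(-1, group_size)
    # Symmetric: scale chosen so the max |w| in each group maps to 7 (= 2**(4-1) - 1).
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)   # avoid divide-by-zero on all-zero groups
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    dq = (q * scales).reshape(w.shape)            # dequantized approximation of w
    return q, scales, dq

w = np.random.default_rng(0).normal(size=256).astype(np.float32)
q, scales, dq = quantize_group_symmetric_int4(w)
print(q.min() >= -8 and q.max() <= 7)   # True: values fit in INT4
print(float(np.max(np.abs(w - dq))))    # per-element error bounded by half a scale
```

The per-group scale is why group_size matters: smaller groups track local weight magnitudes more closely at the cost of more scale metadata.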
Calibration dataset & preprocessing
- Dataset: neuralmagic/LLM_compression_calibration, split "train".
- Sample size: NUM_CALIBRATION_SAMPLES = 512 (random subset, seed=42).
- Sequence length: MAX_SEQUENCE_LENGTH = 2048 (truncation, no padding, add_special_tokens=False).
- Chat rendering: each sample's messages list is rendered with tokenizer.apply_chat_template(..., tokenize=False) to reflect real chat formatting.
- Batch processing: preprocessing and tokenization are done in batches with multiprocess mapping.
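The sampling and truncation settings above can be sketched in plain Python. This is a hypothetical stand-in: the real script operates on the 🤗 dataset and the model tokenizer, while here token IDs are faked as integer lists.

```python
import random

NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048
SEED = 42

# Stand-in for the calibration dataset: each "sample" is the token-ID list of a
# pre-rendered chat string (faked here as integer ranges of varying length).
dataset = [list(range(100 + i % 3000)) for i in range(10_000)]

# Random subset with a fixed seed, as in the script (seed=42).
rng = random.Random(SEED)
subset = rng.sample(dataset, NUM_CALIBRATION_SAMPLES)

# Truncate to the calibration sequence length; no padding is added.
calib = [ids[:MAX_SEQUENCE_LENGTH] for ids in subset]

print(len(calib))                       # 512
print(max(len(ids) for ids in calib))   # at most 2048
```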
One-shot compression call
oneshot(..., max_seq_length=2048, num_calibration_samples=512, tokenizer=tokenizer)on the preprocessed dataset.
These choices aim to preserve long-form dialog behavior by calibrating on chat-templated text at 2048 tokens, with group-wise symmetric INT4 quantization for stable, high-throughput serving.
Context length
- Calibration context: up to 2048 tokens per sample (per the script).
- Model context window: inherited from TheDrummer/Precog-24B-v1. Quantization does not alter rope/positional embeddings; it only changes weight representation.
Quickstart — vLLM (compressed-tensors)
Install vLLM (latest recommended):
pip install vllm
Serve (adjust to your hardware):
CUDA_VISIBLE_DEVICES=0,1,2,3 \
vllm serve TheHouseOfTheDude/Precog-24B-v1_Compressed-Tensors \
--quantization compressed-tensors \
--tensor-parallel-size 4 \
--max-model-len 2048 \
--gpu-memory-utilization 0.70 \
--dtype bfloat16
Example Chat Completions request:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "TheHouseOfTheDude/Precog-24B-v1_Compressed-Tensors",
"messages": [
{"role":"system","content":"You are Precog — helpful, precise, and safe."},
{"role":"user","content":"List three strategies to reduce KV-cache memory growth at long context."}
],
"max_tokens": 512,
"temperature": 0.7,
"top_p": 0.95
}'
Note: compressed-tensors is a vLLM runtime format. Loading directly with vanilla 🤗 Transformers is not supported.
For Transformers, use a compatible export (e.g., GPTQ/AWQ for Transformers) or the full-precision parent model.
Prompting / chat template
This package follows the parent finetune’s chat conventions. If a chat_template.jinja file is present, libraries that support apply_chat_template will automatically format messages.
Guidelines:
- Keep a concise system message (style, constraints, tone).
- Structure user prompts clearly; enumerate steps for multi-part tasks.
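As a purely illustrative stand-in (the real formatting, role markers, and special tokens are defined by the model's chat_template.jinja, not by this code), a chat template reduces a messages list to a single prompt string:

```python
def render_chat(messages):
    """Illustrative stand-in for tokenizer.apply_chat_template(..., tokenize=False).

    Shows only the shape of the transform: messages in, one prompt string out,
    ending with a generation prompt for the assistant's reply.
    """
    parts = [f"<|{m['role']}|>\n{m['content']}" for m in messages]
    parts.append("<|assistant|>\n")  # generation prompt for the reply
    return "\n".join(parts)

messages = [
    {"role": "system", "content": "You are a concise, precise assistant."},
    {"role": "user", "content": "Summarize the branch layout of this repo."},
]
prompt = render_chat(messages)
print(prompt.startswith("<|system|>"))  # True
```

Libraries that support apply_chat_template perform this step for you whenever a chat template ships with the tokenizer.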
Intended use & notes
- General instruction-following, long-form drafting, and summarization
- RAG/agent pipelines (pair with a retriever/tool layer)
Always review the parent/base model license and evaluate on your domain before production use.
Lineage
- Finetuned parent: https://huggingface.co/TheDrummer/Precog-24B-v1
- This repo: Quantized child of the finetune (compressed-tensors for vLLM)
Hardware tips
- 24B models benefit from multi-GPU tensor parallel for throughput.
- Long context is KV-cache heavy; tune --max-model-len and batch size.
- Prefer BF16 on GPUs with native support; otherwise FP16.
- Enable P2P/NVLink where available; consider CUDA Graphs if stable.
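For sizing the KV cache, a back-of-the-envelope estimate helps. The dimensions below are assumed Mistral-Small-style values (40 layers, 8 KV heads, head dim 128, BF16 cache); check the model's config.json for the actual numbers.

```python
def kv_cache_bytes(seq_len, num_layers=40, num_kv_heads=8, head_dim=128, dtype_bytes=2):
    """Rough per-sequence KV-cache size: 2 tensors (K and V) per layer,
    each of shape [num_kv_heads, seq_len, head_dim] at dtype_bytes per element."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * seq_len

per_token = kv_cache_bytes(1)
print(per_token // 1024, "KiB per token")            # 160 KiB per token
print(kv_cache_bytes(2048) // 2**20, "MiB at 2048")  # 320 MiB per sequence
```

This is why --max-model-len and concurrent batch size dominate memory headroom: the cache grows linearly in both.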
Changelog
- v1 (current) — Initial compressed-tensors W4A16 quantization of TheDrummer/Precog-24B-v1 with 512-sample / 2048-token AWQ calibration; vLLM-ready packaging.
Model tree for TheHouseOfTheDude/Precog-24B-v1_Compressed-Tensors
Base model
mistralai/Mistral-Small-3.1-24B-Base-2503