Precog-24B-v1 — Quantized (compressed-tensors for vLLM)

This repository provides quantized runtime builds of
TheDrummer/Precog-24B-v1, repackaged for vLLM using the compressed-tensors format.

TL;DR

  • Quantized W4A16 (INT4 weights / 16-bit activations), with a W8A16 variant on its own branch, for vLLM via --quantization compressed-tensors.
  • Calibration: 512 chat samples, 2048 max sequence length, from neuralmagic/LLM_compression_calibration.
  • Weight-only AWQ (group size 128, symmetric INT4), targeting Linear layers; lm_head left high-precision.

Revisions & Branches

The main branch is a landing page (model card + links). Runnable artifacts live in per-quant branches.

  • main — placeholder / landing page
  • W4A16 — 4-bit weights / 16-bit activations (compressed-tensors)
  • W8A16 — 8-bit weights / 16-bit activations (compressed-tensors)

Repository contents (per revision)

  • Sharded quantized weights (*.safetensors) + index (model.safetensors.index.json)
  • config.json with compressed-tensors metadata (weight_format, quantization, quantization_config, etc.)
  • Tokenizer artifacts (tokenizer.json, tokenizer.model, merges/vocab as applicable)
  • Optional: chat_template.jinja (inherits the parent finetune’s chat style)

Exact file lists may differ between branches — see Files and versions for each revision.
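As a sketch, the quantization metadata in config.json can be inspected like this. The JSON below is a trimmed, hypothetical example of the fields listed above, not the actual file contents; the config.json on each branch is the source of truth.

```python
import json

# Hypothetical, trimmed example of the quantization_config that llmcompressor
# writes into config.json; field values mirror the scheme described in this card.
config_text = """
{
  "quantization_config": {
    "quant_method": "compressed-tensors",
    "format": "pack-quantized",
    "ignore": ["lm_head"],
    "config_groups": {
      "group_0": {
        "targets": ["Linear"],
        "weights": {
          "num_bits": 4,
          "type": "int",
          "symmetric": true,
          "strategy": "group",
          "group_size": 128
        }
      }
    }
  }
}
"""

qcfg = json.loads(config_text)["quantization_config"]
weights = qcfg["config_groups"]["group_0"]["weights"]
```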


Quantization & calibration details (from the quantization script)

All settings below are taken directly from the quantization script used to produce these builds.

Method & scheme

  • Flow: llmcompressor oneshot pipeline with an AWQModifier.
  • Targets: ["Linear"] (weight-only quantization).
  • Ignored layers: ["lm_head"] kept in higher precision.
  • Weights (W4A16 branch): INT4 (num_bits=4, type="int", symmetric=True) using group strategy with group_size=128 (Marlin-friendly).
  • Weights (W8A16 branch): INT8 (num_bits=8, type="int", symmetric=True) using group strategy with group_size=128 (most likely dispatched to a BitBLAS kernel on Ampere).
  • Activations: not quantized (A16 at runtime; FP16/BF16).
  • Recipe object: QuantizationScheme + QuantizationArgs embedded in an AWQ modifier.
  • Save: save_compressed=True so vLLM can load the compressed-tensors layout directly.
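To make the weight scheme concrete, here is a NumPy sketch of group-wise symmetric INT4 quantization with group_size=128. It is illustrative only and independent of llmcompressor; real AWQ additionally applies activation-aware scaling before this rounding step.

```python
import numpy as np

def quantize_symmetric_int4(w, group_size=128):
    """Group-wise symmetric INT4: one scale per contiguous group of weights."""
    groups = w.reshape(-1, group_size)
    # Symmetric scheme: map the max magnitude in each group onto the INT4 edge (7).
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    scales = np.maximum(scales, 1e-12)  # guard against all-zero groups
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    """Reconstruct approximate FP32 weights from INT4 codes and per-group scales."""
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, scales = quantize_symmetric_int4(w)   # 1024 weights -> 8 groups of 128
w_hat = dequantize(q, scales)
```

The per-element reconstruction error is bounded by half a scale step per group, which is why group-wise scales outperform a single per-tensor scale.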

Calibration dataset & preprocessing

  • Dataset: neuralmagic/LLM_compression_calibration, split "train".
  • Sample size: NUM_CALIBRATION_SAMPLES = 512 (random subset, seed=42).
  • Sequence length: MAX_SEQUENCE_LENGTH = 2048 (truncate, no padding, add_special_tokens=False).
  • Chat rendering: each sample’s messages list is rendered with tokenizer.apply_chat_template(..., tokenize=False) to reflect real chat formatting.
  • Batch processing: preprocessing and tokenization done in batches with multi-proc mapping.
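The sampling and truncation settings above can be sketched in pure Python with the dataset and tokenizer stubbed out. The real script uses the Hugging Face datasets library and the parent tokenizer's apply_chat_template; everything below is a stand-in for illustration.

```python
import random

NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

def render_chat(messages):
    """Stand-in for tokenizer.apply_chat_template(..., tokenize=False)."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)

def tokenize(text, max_len=MAX_SEQUENCE_LENGTH):
    """Toy whitespace tokenizer: truncate to max_len, no padding, no special tokens."""
    return text.split()[:max_len]

# Stand-in for the "train" split of neuralmagic/LLM_compression_calibration.
dataset = [{"messages": [{"role": "user", "content": f"calibration sample {i}"}]}
           for i in range(10_000)]

subset = random.Random(42).sample(dataset, NUM_CALIBRATION_SAMPLES)  # seed=42
calibration = [tokenize(render_chat(ex["messages"])) for ex in subset]
```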

One-shot compression call

  • oneshot(..., max_seq_length=2048, num_calibration_samples=512, tokenizer=tokenizer) on the preprocessed dataset.

These choices aim to preserve long-form dialog behavior by calibrating on chat-templated text at 2048 tokens, with group-wise symmetric INT4 quantization for stable, high-throughput serving.


Context length

  • Calibration context: up to 2048 tokens per sample (per the script).
  • Model context window: inherited from TheDrummer/Precog-24B-v1. Quantization does not alter rope/positional embeddings; it only changes weight representation.

Quickstart — vLLM (compressed-tensors)

Install vLLM (latest recommended):

pip install vllm

Serve (adjust to your hardware):

CUDA_VISIBLE_DEVICES=0,1,2,3 \
vllm serve TheHouseOfTheDude/Precog-24B-v1_Compressed-Tensors \
  --quantization compressed-tensors \
  --tensor-parallel-size 4 \
  --max-model-len 2048 \
  --gpu-memory-utilization 0.70 \
  --dtype bfloat16

Example Chat Completions request:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TheHouseOfTheDude/Precog-24B-v1_Compressed-Tensors",
    "messages": [
      {"role":"system","content":"You are Precog — helpful, precise, and safe."},
      {"role":"user","content":"List three strategies to reduce KV-cache memory growth at long context."}
    ],
    "max_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.95
  }'
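The same request can be issued from Python using only the standard library. This sketch assumes the vLLM server started above is listening on the default port:

```python
import json
import urllib.request

API_URL = "http://localhost:8000/v1/chat/completions"  # default vLLM serve endpoint
MODEL_ID = "TheHouseOfTheDude/Precog-24B-v1_Compressed-Tensors"

def build_payload(user_msg, system_msg="You are Precog: helpful, precise, and safe."):
    """Assemble an OpenAI-style chat-completions body mirroring the curl example."""
    return {
        "model": MODEL_ID,
        "messages": [
            {"role": "system", "content": system_msg},
            {"role": "user", "content": user_msg},
        ],
        "max_tokens": 512,
        "temperature": 0.7,
        "top_p": 0.95,
    }

def chat(user_msg):
    """POST to the running vLLM server and return the assistant reply text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(user_msg)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```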

Note: compressed-tensors is a vLLM runtime format. Loading directly with vanilla 🤗 Transformers is not supported.
For Transformers, use a compatible export (e.g., GPTQ/AWQ for Transformers) or the full-precision parent model.


Prompting / chat template

This package follows the parent finetune’s chat conventions. If a chat_template.jinja file is present, libraries that support apply_chat_template will automatically format messages.

Guidelines:

  • Keep a concise system message (style, constraints, tone).
  • Structure user prompts clearly; enumerate steps for multi-part tasks.

Intended use & notes

  • General instruction-following, long-form drafting, and summarization
  • RAG/agent pipelines (pair with a retriever/tool layer)

Always review the parent/base model license and evaluate on your domain before production use.



Hardware tips

  • 24B models benefit from multi-GPU tensor parallel for throughput.
  • Long context is KV-cache heavy — tune --max-model-len and batch size.
  • Prefer BF16 on GPUs with native support; otherwise FP16.
  • Enable P2P/NVLink where available; consider CUDA Graphs if stable.
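A back-of-envelope KV-cache estimate helps when tuning --max-model-len and batch size. The layer and head counts below are hypothetical placeholders; substitute the real values from the model's config.json.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, dtype_bytes=2):
    """Total KV-cache size: K and V tensors per layer, per token, per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * dtype_bytes

# Hypothetical shapes: 40 layers, 8 KV heads (GQA), head_dim 128, BF16 cache.
gib = kv_cache_bytes(40, 8, 128, seq_len=2048, batch_size=8) / 2**30
print(f"{gib:.2f} GiB")  # 2.50 GiB for this hypothetical configuration
```

Doubling --max-model-len or batch size doubles this figure, which is why long-context serving is cache-bound well before it is weight-bound.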

Changelog

  • v1 (current) — Initial compressed-tensors W4A16 quantization of TheDrummer/Precog-24B-v1 with 512-sample / 2048-token AWQ calibration; vLLM-ready packaging.