Llama_3.x_70b_Hexagon_Purple_V3 — Quantized (compressed-tensors for vLLM, Llama-3.3 finetune)

This repository provides quantized runtime builds of
Nexesenex/Llama_3.x_70b_Hexagon_Purple_V3 (a Llama-3.3-70B finetune), repackaged for vLLM using the compressed-tensors format.

TL;DR

  • Quantized to W8A16 (INT8 weights, 16-bit BF16/FP16 activations) for vLLM via --quantization compressed-tensors.
  • Three branches (different group sizes): W8A16_GS32, W8A16_GS64, W8A16_GS128.
  • Same calibration recipe as our recent cards: 512 chat samples, 2048 max sequence length, dataset neuralmagic/LLM_compression_calibration (messages rendered with the model’s chat template).
  • Weight-only AWQ; lm_head kept high-precision; exported with save_compressed=True.

Revisions & Branches

The main branch is a landing page (model card + links). Runnable artifacts live in per-quant branches.

  • main — placeholder / landing page
  • W8A16_GS32 — INT8 weights, group size 32 (highest fidelity; more scales/metadata)
  • W8A16_GS64 — INT8 weights, group size 64 (balanced)
  • W8A16_GS128 — INT8 weights, group size 128 (lightest scales; fastest/leanest)

What’s inside (per revision)

  • Sharded quantized weights (*.safetensors) + index (model.safetensors.index.json)
  • config.json with compressed-tensors metadata (weight_format, quantization, quantization_config, etc.)
  • Tokenizer artifacts (tokenizer.json, tokenizer.model, merges/vocab as applicable)
  • Optional: chat_template.jinja (inherits the finetune’s chat style)

Exact file lists may differ between branches — see Files and versions for each revision.
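To inspect a branch’s exact contents programmatically, standard huggingface_hub usage works (a sketch; the revision name is one of the branches listed above):

from huggingface_hub import list_repo_files

files = list_repo_files(
    "TheHouseOfTheDude/Llama_3.x_70b_Hexagon_Purple_V3_Compressed-Tensors",
    revision="W8A16_GS32",  # or W8A16_GS64 / W8A16_GS128
)
print("\n".join(sorted(files)))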


Quantization & calibration details (same script/recipe family as our prior cards)

Method / flow

  • llmcompressor oneshot pipeline with an AWQModifier (weight-only quantization).

Targets / exclusions

  • Quantize Linear layers; ignore lm_head (kept high-precision).

Weights / grouping

  • INT8 (num_bits=8, type="int", symmetric=True)
  • Strategy: "group" with group size ∈ {32, 64, 128} depending on branch
  • Activations not quantized (runtime A16: BF16/FP16)
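
Putting the method, targets, and grouping together, the recipe looks roughly like the sketch below. Field names follow the bullets above; the exact AWQModifier signature can vary across llmcompressor versions, so treat this as illustrative:

from llmcompressor.modifiers.awq import AWQModifier

GROUP_SIZE = 32  # 32 / 64 / 128, depending on the target branch

recipe = AWQModifier(
    ignore=["lm_head"],  # lm_head stays high-precision
    config_groups={
        "group_0": {
            "targets": ["Linear"],  # quantize Linear layers only
            "weights": {
                "num_bits": 8,       # INT8 weights
                "type": "int",
                "symmetric": True,
                "strategy": "group",
                "group_size": GROUP_SIZE,
            },
            # no "input_activations" entry: activations stay BF16/FP16 (A16)
        }
    },
)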

Calibration dataset & preprocessing

  • Dataset: neuralmagic/LLM_compression_calibration, split train
  • NUM_CALIBRATION_SAMPLES = 512 (random subset with fixed seed)
  • MAX_SEQUENCE_LENGTH = 2048
  • Each example’s messages list is rendered via tokenizer.apply_chat_template(..., tokenize=False), then tokenized with:
    • max_length=2048, truncation=True, padding=False, add_special_tokens=False
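
In code, that preprocessing amounts to the sketch below (the seed value is an assumption; the card only states that the seed was fixed):

from datasets import load_dataset
from transformers import AutoTokenizer

MODEL_ID = "Nexesenex/Llama_3.x_70b_Hexagon_Purple_V3"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.shuffle(seed=42).select(range(512))  # 512 samples; seed 42 is illustrative

def preprocess(example):
    # Render the messages list with the model's chat template, then tokenize.
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(
        text,
        max_length=2048,
        truncation=True,
        padding=False,
        add_special_tokens=False,
    )

ds = ds.map(preprocess, remove_columns=ds.column_names)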

Compression call

  • oneshot(..., max_seq_length=2048, num_calibration_samples=512, tokenizer=tokenizer) on the preprocessed dataset
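
With the recipe and dataset sketches above, the call is roughly:

from llmcompressor import oneshot
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)

oneshot(
    model=model,
    dataset=ds,  # the preprocessed calibration set from above
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    tokenizer=tokenizer,
)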

Export for vLLM

  • Saved with save_compressed=True so vLLM loads the compressed-tensors runtime layout directly
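
The export itself is a single save call (the directory name is illustrative):

SAVE_DIR = "Llama_3.x_70b_Hexagon_Purple_V3-W8A16-GS32"  # illustrative path

# llmcompressor extends save_pretrained with the save_compressed flag.
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)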

Why group size matters in AWQ (W8A16)

  • What it is: Group size controls how many weights share a set of quantization scales. Smaller groups ⇒ more scale sets; larger groups ⇒ fewer scale sets (a worked count follows this list).
  • Accuracy vs. speed/VRAM trade-off:
    • GS32 (smallest groups): Highest fidelity (more fine-grained scaling, typically best perplexity / task scores), at the cost of larger scale metadata, more bandwidth, and slightly lower throughput.
    • GS64 (middle ground): Balanced accuracy and performance; good default if you haven’t profiled yet.
    • GS128 (largest groups): Leanest/fastest (fewer scales to fetch), with typically slightly higher quantization error; often preferred when maximizing TPS or minimizing memory.
  • When to pick which:
    • Sensitive reasoning/RP quality → GS32
    • General assistant workloads → GS64
    • Throughput-critical serving → GS128
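
The worked count below makes the metadata overhead concrete for a single 8192×8192 Linear layer (the shape is illustrative; grouping along the input dimension and FP16 scales are assumed):

# Back-of-the-envelope scale count for one Linear layer, per group size.
rows, cols = 8192, 8192  # illustrative 70B-class projection shape

for gs in (32, 64, 128):
    n_scales = rows * (cols // gs)       # one scale per group of `gs` weights
    overhead_mib = n_scales * 2 / 2**20  # FP16 scales: 2 bytes each
    print(f"GS{gs}: {n_scales:,} scales ≈ {overhead_mib:.1f} MiB")

# GS32:  2,097,152 scales ≈ 4.0 MiB
# GS64:  1,048,576 scales ≈ 2.0 MiB
# GS128:   524,288 scales ≈ 1.0 MiB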

Context length

  • Calibration context: up to 2048 tokens per sample (as above).
  • Model context window: inherited from the Llama-3.3-70B finetune (Nexesenex/Llama_3.x_70b_Hexagon_Purple_V3). Quantization changes only the numeric representation of the weights; it does not alter RoPE/position encodings.

Quickstart — vLLM (compressed-tensors)

Install vLLM (recent version recommended):

pip install vllm

Serve (adjust to your hardware; select a quant branch with --revision, since main carries no weights):

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
vllm serve TheHouseOfTheDude/Llama_3.x_70b_Hexagon_Purple_V3_Compressed-Tensors \
  --revision W8A16_GS64 \
  --quantization compressed-tensors \
  --tensor-parallel-size 8 \
  --max-model-len 2048 \
  --gpu-memory-utilization 0.70 \
  --dtype bfloat16
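
For offline (batch) use, the same checkpoint loads through vLLM’s Python API; this sketch assumes a recent vLLM and picks one of the branches above via revision:

from vllm import LLM, SamplingParams

llm = LLM(
    model="TheHouseOfTheDude/Llama_3.x_70b_Hexagon_Purple_V3_Compressed-Tensors",
    revision="W8A16_GS64",  # pick a quant branch; main carries no weights
    quantization="compressed-tensors",
    tensor_parallel_size=8,
    max_model_len=2048,
    dtype="bfloat16",
)

params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=512)
outputs = llm.chat(
    [{"role": "user", "content": "Say hello in one sentence."}],
    params,
)
print(outputs[0].outputs[0].text)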

Example Chat Completions request:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TheHouseOfTheDude/Llama_3.x_70b_Hexagon_Purple_V3_Compressed-Tensors",
    "messages": [
      {"role":"system","content":"You are Hexagon Purple — helpful, precise, and safe."},
      {"role":"user","content":"Draft a character-driven opening scene in under 250 words."}
    ],
    "max_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.95
  }'

Note: compressed-tensors is a vLLM runtime format. Loading directly with vanilla 🤗 Transformers is not supported.
For Transformers, use a compatible quant (e.g., GPTQ/AWQ export) or the full-precision finetune.


Prompting / chat template

This package follows the finetuned parent’s chat conventions. If a chat_template.jinja is present, libraries that support apply_chat_template will automatically format messages.
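
For example, the tokenizer (unlike the quantized weights) loads fine in 🤗 Transformers, so prompts can be rendered like this (branch name as listed above):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "TheHouseOfTheDude/Llama_3.x_70b_Hexagon_Purple_V3_Compressed-Tensors",
    revision="W8A16_GS64",  # every quant branch ships the tokenizer artifacts
)

messages = [
    {"role": "system", "content": "You are Hexagon Purple — helpful, precise, and safe."},
    {"role": "user", "content": "Draft a character-driven opening scene in under 250 words."},
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)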

Guidelines:

  • Keep the system message concise (behavior, tone, safety constraints).
  • Provide clear user instructions; for multi-step tasks, list steps explicitly.

Intended use & safety

This quantization:

  • Preserves the parent model’s behavior and content tendencies (small numeric deviations are inherent to quantization).
  • Changes only how the weights are stored, for efficient inference.

Apply appropriate content filters / policies for your deployment context.


Lineage

  • Finetune parent: Nexesenex/Llama_3.x_70b_Hexagon_Purple_V3 (a Llama-3.3-70B finetune)
  • This repository: compressed-tensors W8A16 quantizations of that finetune (see Revisions & Branches above)

Hardware tips

  • 70B-class models benefit from multi-GPU tensor parallel for throughput.
  • Long contexts are KV-cache heavy — tune --max-model-len and batch size (see the sizing sketch below).
  • Prefer BF16 on GPUs with native support; otherwise FP16.
  • Enable P2P/NVLink when available; consider CUDA Graphs if stable.
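
As a rough sizing aid: per-sequence KV-cache memory is 2 × layers × kv_heads × head_dim × dtype_bytes × tokens. The architecture numbers below (80 layers, 8 KV heads via GQA, head dim 128) are standard for Llama-3.x-70B but worth verifying against the checkpoint’s config.json:

# Back-of-the-envelope KV-cache size per sequence (BF16 cache assumed).
layers, kv_heads, head_dim, dtype_bytes = 80, 8, 128, 2  # Llama-3.x-70B-class

def kv_cache_gib(tokens: int) -> float:
    # Factor of 2 accounts for the separate K and V tensors.
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens / 2**30

for n in (2048, 8192, 32768):
    print(f"{n:6d} tokens ≈ {kv_cache_gib(n):.2f} GiB per sequence")

# 2048 tokens ≈ 0.62 GiB; 8192 ≈ 2.50 GiB; 32768 ≈ 10.00 GiB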

Changelog

  • V3 (current) — Initial compressed-tensors W8A16 quantization with 512-sample / 2048-token AWQ calibration; branches W8A16_GS32 / W8A16_GS64 / W8A16_GS128 published; vLLM-ready packaging.