# Cydonia-24B-v4.3-W4A16-GPTQ

A W4A16 GPTQ quantization of TheDrummer/Cydonia-24B-v4.3, made with llm-compressor.
## Quantization Details
- Method: GPTQ (W4A16 — 4-bit weights, 16-bit activations)
- Tool: llm-compressor (the vLLM project's official quantization library)
- Group size: 128
- Calibration: 512 samples from ultrachat_200k at 2048-token sequence length
- Model size: ~14 GB (vs. ~48 GB for the FP16 original)
- Marlin kernel compatible: Yes (auto-detected by vLLM)
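
The ~14 GB figure is consistent with rough arithmetic. A back-of-envelope sketch, with assumed figures (roughly 24e9 quantizable weights, one fp16 scale and one packed 4-bit zero point per 128-weight group); the actual GPTQ pack format may differ in detail:

```python
# Back-of-envelope size estimate for W4A16 with group size 128.
# Assumptions (not the packer's exact layout): ~24e9 quantizable weights,
# an fp16 scale and a packed 4-bit zero point per 128-weight group.

def w4a16_size_bytes(n_params: float, group_size: int = 128) -> float:
    weights = n_params * 4 / 8        # 4-bit packed weights
    groups = n_params / group_size
    scales = groups * 2               # fp16 scale per group
    zeros = groups * 4 / 8            # packed 4-bit zero point per group
    return weights + scales + zeros

print(f"~{w4a16_size_bytes(24e9) / 1e9:.1f} GB of quantized linear weights")
```

Tensors left in 16-bit (embeddings, lm_head, layer norms) account for most of the remaining distance to the ~14 GB checkpoint size.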
## Quality Metrics
| Metric | Value | Threshold |
|---|---|---|
| Perplexity | 3.78 | < 8.0 |
| Repetition rate | 0/50 (0%) | < 10% |
| Unique token ratio | 0.794 | > 0.6 |
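
On the perplexity row: perplexity is the exponential of the mean per-token negative log-likelihood over the eval text, so lower is better. A minimal sketch with made-up token losses, chosen so the result lands near the table's 3.78:

```python
import math

# Illustrative per-token negative log-likelihoods (made-up values,
# not the actual eval losses).
token_nlls = [1.2, 1.5, 1.1, 1.4, 1.45]

# Perplexity = exp(mean NLL).
ppl = math.exp(sum(token_nlls) / len(token_nlls))
print(round(ppl, 2))  # 3.78
```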
## Usage

```shell
# vLLM (recommended; uses Marlin kernels for fast inference)
vllm serve Irvollo/Cydonia-24B-v4.3-W4A16-GPTQ --dtype auto --max-model-len 65536 --gpu-memory-utilization 0.95

# With priority scheduling, prefix caching, chunked prefill, and an FP8 KV cache
vllm serve Irvollo/Cydonia-24B-v4.3-W4A16-GPTQ --dtype auto --max-model-len 65536 --scheduling-policy priority --enable-prefix-caching --enable-chunked-prefill --kv-cache-dtype fp8_e4m3
```
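
Once the server is up, any OpenAI-compatible client can talk to it. A minimal standard-library sketch; the localhost:8000 endpoint is vLLM's default, and the `priority` request field is a vLLM extension that is only honored with `--scheduling-policy priority`:

```python
# Build an OpenAI-compatible chat request for the vLLM server above.
# Assumptions: default http://localhost:8000 endpoint; "priority" is a
# vLLM-specific field, not part of the standard OpenAI schema.

def build_payload(prompt: str, priority: int = 0) -> dict:
    return {
        "model": "Irvollo/Cydonia-24B-v4.3-W4A16-GPTQ",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "temperature": 0.8,
        "priority": priority,  # vLLM extension
    }

payload = build_payload("Describe the scene in one sentence.")

# To send against a running server:
# import json, urllib.request
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```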
## Original Model
Cydonia v4.3 by TheDrummer — a Mistral Small 3.1 24B fine-tune optimized for roleplay and creative writing.
## Hardware Requirements
- Minimum: 16 GB VRAM (with short context)
- Recommended: 24 GB VRAM (RTX 4090, A100) for 64k context support
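
The 24 GB recommendation for 64k context follows from KV-cache arithmetic. A sketch assuming the commonly cited Mistral Small shape (40 layers, 8 KV heads, head dim 128); confirm against the model's config.json:

```python
# KV-cache size estimate. Layer/head figures are assumptions based on
# the published Mistral Small 24B architecture, not read from the config.

def kv_cache_gb(tokens: int, layers: int = 40, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    # 2 tensors (K and V) per layer, per KV head, per head dim, per token
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1e9

print(f"fp16 KV cache at 64k tokens: {kv_cache_gb(65536):.1f} GB")
print(f"fp8  KV cache at 64k tokens: {kv_cache_gb(65536, bytes_per_elem=1):.1f} GB")
```

Under these assumptions, ~14 GB of weights plus ~10.7 GB of fp16 KV cache overflows a 24 GB card, which is why the `--kv-cache-dtype fp8_e4m3` flag in the usage section matters for full 64k context.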
## Model Tree for tacodevs/Cydonia-24B-v4.3-W4A16-GPTQ

- Base model: mistralai/Mistral-Small-3.1-24B-Base-2503
- Fine-tune: TheDrummer/Cydonia-24B-v4.3