# Cydonia-24B-v4.3-W4A16-GPTQ

A W4A16 GPTQ quantization of TheDrummer/Cydonia-24B-v4.3, produced with llm-compressor.

## Quantization Details

- Method: GPTQ (W4A16: 4-bit weights, 16-bit activations)
- Tool: llm-compressor (vLLM's official compression library)
- Group size: 128
- Calibration: 512 samples from ultrachat_200k at 2048 sequence length
- Model size: ~14 GB (vs. ~48 GB in FP16)
- Marlin kernel compatible: yes (auto-detected by vLLM)
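The ~14 GB figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes the embedding and lm_head tensors stay in 16-bit (llm-compressor recipes typically leave `lm_head` unquantized); the 16-bit parameter count used here is an illustrative guess, not a value read from the checkpoint.

```python
# Rough W4A16 size estimate. TOTAL and HEAD_16BIT are illustrative
# assumptions, not values read from the checkpoint.
TOTAL = 24e9        # total parameters
HEAD_16BIT = 0.7e9  # embeddings + lm_head left at 16-bit (assumption)
GROUP = 128         # GPTQ group size

quantized = TOTAL - HEAD_16BIT
weights_gb = quantized * 0.5 / 1e9       # 4-bit weights: 0.5 bytes each
scales_gb = quantized / GROUP * 2 / 1e9  # one fp16 scale per group of 128
head_gb = HEAD_16BIT * 2 / 1e9           # 16-bit tensors: 2 bytes each

total_gb = weights_gb + scales_gb + head_gb  # roughly 13-14 GB
fp16_gb = TOTAL * 2 / 1e9                    # ~48 GB unquantized baseline
```

Group scales add only a few percent of overhead at group size 128, which is why the quantized checkpoint lands close to the naive 4-bit estimate.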

## Quality Metrics

| Metric | Value | Threshold |
|---|---|---|
| Perplexity | 3.78 | < 8.0 |
| Repetition rate | 0/50 (0%) | < 10% |
| Unique token ratio | 0.794 | > 0.6 |
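For context, perplexity is the exponential of the mean per-token negative log-likelihood over the evaluation text; lower is better. A toy illustration with made-up NLL values:

```python
import math

# Made-up per-token negative log-likelihoods, for illustration only.
nlls = [1.2, 1.5, 1.3, 1.4]
ppl = math.exp(sum(nlls) / len(nlls))  # exp(1.35), roughly 3.86
```

A perplexity of 3.78 on the calibration-style data means the quantized model assigns each token an average probability of about 1/3.78, close to what the FP16 original would score.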

## Usage

```bash
# vLLM (recommended: uses Marlin kernels for fast inference)
vllm serve Irvollo/Cydonia-24B-v4.3-W4A16-GPTQ \
  --dtype auto \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.95

# With priority scheduling and FP8 KV cache
vllm serve Irvollo/Cydonia-24B-v4.3-W4A16-GPTQ \
  --dtype auto \
  --max-model-len 65536 \
  --scheduling-policy priority \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --kv-cache-dtype fp8_e4m3
```

## Original Model

Cydonia v4.3 by TheDrummer is a Mistral Small 3.1 24B fine-tune optimized for roleplay and creative writing.

## Hardware Requirements

- Minimum: 16 GB VRAM (with short context)
- Recommended: 24 GB VRAM (e.g. RTX 4090, A100) for the full 64k context
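The 24 GB recommendation follows from KV-cache growth at long context. The layer and head counts below are assumptions for a Mistral-Small-class 24B model (check the repo's config.json for the real values), but they show why `--kv-cache-dtype fp8_e4m3` matters at 64k:

```python
# KV-cache sizing sketch. LAYERS / KV_HEADS / HEAD_DIM are assumed values
# for a Mistral-Small-class 24B model, not read from this checkpoint.
LAYERS, KV_HEADS, HEAD_DIM = 40, 8, 128
CTX = 65536

def kv_cache_gb(bytes_per_elem: int) -> float:
    # K and V tensors per layer, each kv_heads * head_dim elements per token
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_elem
    return per_token * CTX / 1e9

fp16_gb = kv_cache_gb(2)  # on top of ~14 GB of weights
fp8_gb = kv_cache_gb(1)   # halved with --kv-cache-dtype fp8_e4m3
```

Under these assumptions the full-context FP16 cache is around 10-11 GB, which overflows a 24 GB card once weights and activation workspace are loaded; the FP8 cache roughly halves that.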