# Cydonia-24B-v4.3-W4A16-GPTQ

A W4A16 GPTQ quantization of TheDrummer/Cydonia-24B-v4.3, produced with llm-compressor.

## Quantization Details

- Method: GPTQ (W4A16: 4-bit weights, 16-bit activations)
- Tool: llm-compressor (vLLM's official compression library)
- Group size: 128
- Calibration: 512 samples from ultrachat_200k at 2048 sequence length
- Model size: ~14 GB (vs. ~48 GB in FP16)
- Marlin kernel compatible: yes (auto-detected by vLLM)
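The ~14 GB figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes the embedding and lm_head tensors stay in 16-bit (llm-compressor recipes typically leave `lm_head` unquantized); the 16-bit parameter count used here is an illustrative guess, not a value read from the checkpoint.

```python
# Rough W4A16 size estimate. TOTAL and HEAD_16BIT are illustrative
# assumptions, not values read from the checkpoint.
TOTAL = 24e9        # total parameters
HEAD_16BIT = 0.7e9  # embeddings + lm_head left at 16-bit (assumption)
GROUP = 128         # GPTQ group size

quantized = TOTAL - HEAD_16BIT
weights_gb = quantized * 0.5 / 1e9       # 4-bit weights: 0.5 bytes each
scales_gb = quantized / GROUP * 2 / 1e9  # one fp16 scale per group of 128
head_gb = HEAD_16BIT * 2 / 1e9           # 16-bit tensors: 2 bytes each

total_gb = weights_gb + scales_gb + head_gb  # roughly 13-14 GB
fp16_gb = TOTAL * 2 / 1e9                    # ~48 GB unquantized baseline
```

Group scales add only a few percent of overhead at group size 128, which is why the quantized checkpoint lands close to the naive 4-bit estimate.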

## Quality Metrics

| Metric | Value | Threshold |
|---|---|---|
| Perplexity | 3.78 | < 8.0 |
| Repetition rate | 0/50 (0%) | < 10% |
| Unique token ratio | 0.794 | > 0.6 |
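For context, perplexity is the exponential of the mean per-token negative log-likelihood over the evaluation text; lower is better. A toy illustration with made-up NLL values:

```python
import math

# Made-up per-token negative log-likelihoods, for illustration only.
nlls = [1.2, 1.5, 1.3, 1.4]
ppl = math.exp(sum(nlls) / len(nlls))  # exp(1.35), roughly 3.86
```

A perplexity of 3.78 on the calibration-style data means the quantized model assigns each token an average probability of about 1/3.78, close to what the FP16 original would score.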

## Usage

```bash
# vLLM (recommended: uses Marlin kernels for fast inference)
vllm serve Irvollo/Cydonia-24B-v4.3-W4A16-GPTQ \
  --dtype auto \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.95

# With priority scheduling and FP8 KV cache
vllm serve Irvollo/Cydonia-24B-v4.3-W4A16-GPTQ \
  --dtype auto \
  --max-model-len 65536 \
  --scheduling-policy priority \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --kv-cache-dtype fp8_e4m3
```

## Original Model

Cydonia v4.3 by TheDrummer is a Mistral Small 3.1 24B fine-tune optimized for roleplay and creative writing.

## Hardware Requirements

- Minimum: 16 GB VRAM (with short context)
- Recommended: 24 GB VRAM (e.g. RTX 4090, A100) for the full 64k context
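The 24 GB recommendation follows from KV-cache growth at long context. The layer and head counts below are assumptions for a Mistral-Small-class 24B model (check the repo's config.json for the real values), but they show why `--kv-cache-dtype fp8_e4m3` matters at 64k:

```python
# KV-cache sizing sketch. LAYERS / KV_HEADS / HEAD_DIM are assumed values
# for a Mistral-Small-class 24B model, not read from this checkpoint.
LAYERS, KV_HEADS, HEAD_DIM = 40, 8, 128
CTX = 65536

def kv_cache_gb(bytes_per_elem: int) -> float:
    # K and V tensors per layer, each kv_heads * head_dim elements per token
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_elem
    return per_token * CTX / 1e9

fp16_gb = kv_cache_gb(2)  # on top of ~14 GB of weights
fp8_gb = kv_cache_gb(1)   # halved with --kv-cache-dtype fp8_e4m3
```

Under these assumptions the full-context FP16 cache is around 10-11 GB, which overflows a 24 GB card once weights and activation workspace are loaded; the FP8 cache roughly halves that.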