# Model Card for ealexeev/TheDrummer-Valkyrie-49B-v2.1-NVFP4
This is an NVFP4 quantization of TheDrummer/Valkyrie-49B-v2.1.
## Quantization Details

Quantized with the script from https://github.com/ealexeev/llm-quantization.

Calibration dataset size: 1024 samples

Calibration data:
- HuggingFaceH4/ultrachat_200k
- allenai/c4_en
- mrcedric98/fiction_books_v8
These were shuffled and mixed at a ratio of 3:2:3.
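A minimal sketch of how a 3:2:3 split over 1024 samples works out per source (the allocation helper below is illustrative, not the actual script's logic):

```python
def mix_counts(total, weights):
    """Split `total` samples across sources proportionally to integer `weights`."""
    denom = sum(weights.values())
    counts = {name: total * w // denom for name, w in weights.items()}
    # Hand any remainder from integer division to the heaviest source.
    remainder = total - sum(counts.values())
    counts[max(weights, key=weights.get)] += remainder
    return counts

weights = {
    "HuggingFaceH4/ultrachat_200k": 3,
    "allenai/c4_en": 2,
    "mrcedric98/fiction_books_v8": 3,
}
print(mix_counts(1024, weights))  # 384 / 256 / 384 samples
```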
## Procedure

```bash
python ./quantize_nvfp4.py --model TheDrummer/Valkyrie-49B-v2.1 --output ./TheDrummer/Valkyrie-49B-v2.1 --size 1024 --seed 42 --ultra_chat 3 --c4_en 2 --fiction_v8 3
```
The vLLM docs note that NVFP4 quantization needs very few calibration samples. I ran quants at 128, 256, 512, and 1024 samples; this 1024-sample version hit the sweet spot on the evals below.
## Quantization Evals
| Metric | Base Model (BF16) | NVFP4 (Quantized) | Delta |
|---|---|---|---|
| ARC Challenge (Logic/Reasoning) | 0.596 | 0.582 | -2.300% |
| IFEval (Strict Instruction Following) | 0.724 | 0.717 | -1.000% |
| HellaSwag (Flow/Common Sense) | 0.633 | 0.645 | +1.900% |
| Lambada (Perplexity) | 2.986 | 2.955 | -1.000% |
| WikiText (Perplexity) | 8.704 | 8.2185 | -5.600% |
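The Delta column is the relative change of the quantized score versus the BF16 base, which can be rechecked from the raw numbers in the table:

```python
def rel_delta(base, quant):
    """Relative change of the quantized metric vs. the BF16 base, in percent."""
    return (quant - base) / base * 100

# Values taken from the eval table above.
print(round(rel_delta(0.596, 0.582), 1))   # ARC Challenge: -2.3
print(round(rel_delta(8.704, 8.2185), 1))  # WikiText perplexity: -5.6
```

Note that for the perplexity rows a negative delta is an improvement (lower perplexity is better).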
## Bias, Risks, and Limitations

This is already a creative fine-tune, and it was quantized with that use case in mind. It probably won't pass any leet-coder challenges.
## How To Use

```bash
# --tensor-parallel-size 1: run on a single GPU
# --gpu-memory-utilization 0.8: otherwise vLLM claims nearly all VRAM for the KV cache
vllm serve ealexeev/TheDrummer-Valkyrie-49B-v2.1-NVFP4 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.8
```
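Once the server is up, it exposes an OpenAI-compatible API. A sketch of a chat-completion request body (the prompt and sampling settings are placeholders, not recommendations from the card):

```python
import json

# Request body for vLLM's OpenAI-compatible /v1/chat/completions endpoint.
payload = {
    "model": "ealexeev/TheDrummer-Valkyrie-49B-v2.1-NVFP4",
    "messages": [
        {"role": "system", "content": "You are a creative writing assistant."},
        {"role": "user", "content": "Open a scene aboard a storm-tossed airship."},
    ],
    "temperature": 0.8,
}
body = json.dumps(payload)
# POST `body` to http://localhost:8000/v1/chat/completions (vLLM's default port).
```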
## Model Tree

- Base model: nvidia/Llama-3_3-Nemotron-Super-49B-v1_5
- Fine-tuned as: TheDrummer/Valkyrie-49B-v2.1
- Quantized here as: ealexeev/TheDrummer-Valkyrie-49B-v2.1-NVFP4