---
license: apache-2.0
base_model:
- FINAL-Bench/Darwin-35B-A3B-Opus
tags:
- llama-cpp
- gguf
- quantized
- Q8_0
- merge
- evolutionary-merge
- darwin
- darwin-v5
- reasoning
- qwen3.5
- qwen
- moe
- mixture-of-experts
- claude-opus
- distillation
- multilingual
- gpqa
- open-source
- apache-2.0
- layer-wise-merge
- coding-agent
- tool-calling
- long-context
- 262k-context
language:
- en
- zh
- ko
- ja
- de
- fr
- es
- ru
- ar
- multilingual
pipeline_tag: text-generation
library_name: gguf
quantized_by: VIDRAFT
model-index:
- name: Darwin-35B-A3B-Opus-Q8_0-GGUF
  results:
  - task:
      type: text-generation
      name: Graduate-Level Reasoning
    dataset:
      type: Idavidrein/gpqa
      name: GPQA Diamond
      config: gpqa_diamond
      split: train
    metrics:
    - type: accuracy
      value: 90
      name: Accuracy
      verified: false
  - task:
      type: text-generation
      name: Multilingual Knowledge
    dataset:
      type: openai/MMMLU
      name: MMMLU
    metrics:
    - type: accuracy
      value: 85
      name: Accuracy
      verified: false
---
# Darwin-35B-A3B-Opus-Q8_0-GGUF
Q8_0 GGUF of Darwin-35B-A3B-Opus | ~37GB (3 shards) | GPQA Diamond 90.0% | Near-lossless quality | MoE 35B (3B active) | 201 Languages | 262K Context | Apache 2.0
## About This Quantization
Q8_0 GGUF of FINAL-Bench/Darwin-35B-A3B-Opus.
| | Original (BF16) | This Model (Q8_0 GGUF) |
|---|---|---|
| Format | SafeTensors | GGUF |
| Size | 65.5 GB | ~37 GB (3 shards) |
| Quality | Baseline | Near-lossless (~99.9% of BF16) |
| VRAM Required | 65+ GB | ~37 GB |
| Runs on | H100, A100 80GB | A100 40GB, Mac 64GB, 2x RTX 4090 |
| Framework | Transformers, vLLM, SGLang | llama.cpp, Ollama, LM Studio |
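The "near-lossless" claim follows from how Q8_0 stores weights. Below is a minimal Python sketch of Q8_0-style block quantization, assuming the standard llama.cpp layout (blocks of 32 weights, one scale per block plus int8 values); it is an illustration, not the actual conversion code:

```python
# Illustrative sketch of Q8_0-style block quantization (assumption:
# weights are grouped in blocks of 32, each stored as one scale + 32 int8s).

def quantize_q8_0(weights, block_size=32):
    """Quantize a flat list of floats into (scale, int8 values) blocks."""
    blocks = []
    for i in range(0, len(weights), block_size):
        block = weights[i:i + block_size]
        amax = max(abs(w) for w in block)
        scale = amax / 127.0 if amax > 0 else 1.0
        q = [max(-127, min(127, round(w / scale))) for w in block]
        blocks.append((scale, q))
    return blocks

def dequantize_q8_0(blocks):
    """Reconstruct approximate floats from (scale, int8 values) blocks."""
    return [scale * v for scale, q in blocks for v in q]

weights = [0.013 * ((-1) ** i) * (i % 7) for i in range(64)]
restored = dequantize_q8_0(quantize_q8_0(weights))
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

The round-trip error is bounded by half a quantization step per block, which is why Q8_0 tracks BF16 so closely.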
## Files

| File | Size | Description |
|---|---|---|
| darwin-35b-a3b-opus-q8_0-00001-of-00003.gguf | ~13.6 GB | Shard 1 of 3 |
| darwin-35b-a3b-opus-q8_0-00002-of-00003.gguf | ~12.5 GB | Shard 2 of 3 |
| darwin-35b-a3b-opus-q8_0-00003-of-00003.gguf | ~10.7 GB | Shard 3 of 3 |
| Total | ~36.8 GB | All 3 shards required |
Download all 3 shard files. llama.cpp and Ollama will automatically load them together.
## Hardware Requirements
| Setup | Memory | Status |
|---|---|---|
| NVIDIA A100 40GB | 40 GB VRAM | Fits |
| NVIDIA A100 80GB | 80 GB VRAM | Comfortable |
| NVIDIA H100 93GB | 93 GB VRAM | Comfortable |
| 2x RTX 4090 (24GB each) | 48 GB VRAM | With tensor parallel |
| Mac Studio M2/M3 Ultra 64GB | 64 GB Unified | Fits |
| Mac M3 Max 48GB | 48 GB Unified | Fits |
| Single RTX 4090 24GB | 24 GB VRAM | Insufficient (use Q4_K_M) |
Because Darwin is a MoE model, only ~3B parameters are active per token, so inference is fast despite the 37 GB model size.
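A quick back-of-envelope check of the size and active-parameter figures, assuming the standard llama.cpp Q8_0 layout of 34 bytes per 32 weights (32 int8 values plus a 16-bit scale):

```python
# Back-of-envelope sketch of the ~37 GB figure and MoE active fraction.
# The 34-bytes-per-32-weights layout is an assumption about Q8_0 storage;
# GB here means 10^9 bytes, and embedding/overhead tensors are ignored.

total_params = 35e9          # 35B total parameters
active_params = 3e9          # ~3B routed per token (MoE)
bytes_per_weight = 34 / 32   # 1.0625 bytes/weight for Q8_0 blocks

size_gb = total_params * bytes_per_weight / 1e9   # ≈ 37.2 GB
active_fraction = active_params / total_params    # ≈ 8.6% of weights per token
```

This lands within rounding distance of the ~36.8 GB shard total reported above.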
## Usage

### llama.cpp (CLI)

```bash
llama-cli \
  --hf-repo FINAL-Bench/Darwin-35B-A3B-Opus-Q8-GGUF \
  --hf-file darwin-35b-a3b-opus-q8_0-00001-of-00003.gguf \
  -p "The meaning to life and the universe is" \
  -n 512 -ngl 99
```

### llama.cpp (Server)

```bash
llama-server \
  --hf-repo FINAL-Bench/Darwin-35B-A3B-Opus-Q8-GGUF \
  --hf-file darwin-35b-a3b-opus-q8_0-00001-of-00003.gguf \
  -c 32768 -ngl 99
```

### Ollama

```bash
echo 'FROM ./darwin-35b-a3b-opus-q8_0-00001-of-00003.gguf' > Modelfile
ollama create darwin-opus -f Modelfile
ollama run darwin-opus
```

### LM Studio

- Download all 3 `.gguf` shard files
- Place them in the same folder
- Open LM Studio and load the first shard
- LM Studio auto-detects and loads all shards

### MoE Expert Offload (Limited VRAM)

```bash
llama-cli \
  --hf-repo FINAL-Bench/Darwin-35B-A3B-Opus-Q8-GGUF \
  --hf-file darwin-35b-a3b-opus-q8_0-00001-of-00003.gguf \
  -ot ".ffn_.*_exps.=CPU" \
  -ngl 99 -c 32768
```
## Benchmark Results (Original Model)
Q8_0 preserves near-identical performance to BF16.
### GPQA Diamond (198 Questions, Graduate-Level Reasoning)
| Model | Accuracy |
|---|---|
| Darwin-35B-A3B-Opus | 90.0% |
| Mother (Jackrong Claude 4.6 Opus Distilled) | 85.0% |
| Father (Qwen3.5-35B-A3B Official) | 84.2% |
### MMMLU (Multilingual Knowledge, 29 Languages)
| Model | Accuracy |
|---|---|
| Darwin-35B-A3B-Opus | 85.0% |
| Father (Qwen3.5-35B-A3B Official) | 85.2% |
## How Darwin Was Created
Darwin-35B-A3B-Opus was created using Darwin V5, a diagnostic-guided evolutionary merge engine built on mergekit.
Both parent models share the identical Qwen3.5-35B-A3B architecture. The Mother is a LoRA SFT on the same base, not a different architecture.
Darwin V5 adds three phases over standard mergekit evolve:
- Pre-merge parent profiling (40 layers x 256 experts: activation frequency, routing entropy, probe cosine distance)
- Evolution with diagnostic-informed initial population and constrained search space
- Post-merge child validation (layer-by-layer comparison against both parents)
Key diagnostic finding: Mother had 50-65% dead experts (activation < 5%) from text-only LoRA SFT. Darwin compensated by reducing Mother density and using Father's living experts to fill inactive slots.
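The compensation step can be pictured as a per-slot source choice. Below is a hypothetical sketch: the 5% activation threshold comes from the text above, but the selection logic and data layout are illustrative assumptions, not Darwin V5's actual code.

```python
# Hypothetical sketch of dead-expert compensation: expert slots whose
# activation frequency fell below 5% in the Mother are treated as dead
# and filled from the Father instead. Illustrative only.

DEAD_THRESHOLD = 0.05  # activation frequency below this = "dead" (from the text)

def select_expert_sources(mother_activation, father_activation):
    """Return 'mother' or 'father' for each expert slot in one layer."""
    sources = []
    for m_act, f_act in zip(mother_activation, father_activation):
        if m_act < DEAD_THRESHOLD and f_act >= DEAD_THRESHOLD:
            sources.append("father")   # fill a dead Mother slot from Father
        else:
            sources.append("mother")   # keep the (preferred) Mother expert
    return sources

# Toy activation frequencies for 4 expert slots:
mother = [0.30, 0.01, 0.00, 0.12]
father = [0.25, 0.20, 0.18, 0.02]
sources = select_expert_sources(mother, father)
```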
Merge configuration:

```text
# Method: DARE-TIES via mergekit
L0-L37: t=0.5988 (Mother 60%), router from Mother
L38:    t=0.9000 (Mother 90%), reasoning core (peak probe cosine distance)
L39:    t=0.5336 (Father 47%), router from Father (output routing)
```
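Each t above can be read as the interpolation weight placed on the Mother for that layer. A minimal sketch of the resulting layer-wise blend follows; real DARE-TIES additionally drops and rescales task-vector deltas before merging, which is omitted here for clarity.

```python
# Minimal sketch of layer-wise interpolation using the per-layer t values
# above (t = weight on Mother). The DARE-TIES delta-dropping/rescaling step
# is intentionally omitted; this shows only the layer-wise blend.

LAYER_T = {**{i: 0.5988 for i in range(38)},  # L0-L37
           38: 0.9000,                        # L38: reasoning core
           39: 0.5336}                        # L39: output routing

def merge_layer(layer_idx, mother_weights, father_weights):
    """Blend one layer's weights: t * mother + (1 - t) * father."""
    t = LAYER_T[layer_idx]
    return [t * m + (1.0 - t) * f for m, f in zip(mother_weights, father_weights)]

# Toy 2-element "layer" showing the L38 blend (90% Mother):
merged = merge_layer(38, [1.0, 0.0], [0.0, 1.0])
```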
For full technical details, diagnostics, and health check results, see the original model card.
## Other Quantizations
| Quantization | Size | Quality | Use Case |
|---|---|---|---|
| Q8_0 (this) | ~37 GB | Near-lossless | Maximum quality |
| Q4_K_M (coming soon) | ~20 GB | Good | RTX 4090, Mac 32GB |
## Model Specifications

| Specification | Value |
|---|---|
| Base Model | FINAL-Bench/Darwin-35B-A3B-Opus |
| Architecture | Qwen3.5 MoE (Gated DeltaNet + MoE) |
| Total Parameters | 35B |
| Active Parameters | 3B per forward pass |
| Experts | 256 (8 routed + 1 shared active) |
| Context Length | 262,144 native |
| Languages | 201 |
| Quantization | Q8_0 (8-bit integer) |
| GGUF Shards | 3 files |
| License | Apache 2.0 |
| Quantized by | VIDRAFT via llama.cpp |
## Acknowledgements
- Korean Government – GPU Support Program research grant
- Qwen Team – Qwen3.5-35B-A3B base architecture
- Jackrong – Claude 4.6 Opus Reasoning Distilled model
- mergekit – merge backend infrastructure
- llama.cpp – GGUF conversion and quantization
## Citation

```bibtex
@misc{vidraft_darwin_35b_opus_gguf,
  title        = {Darwin-35B-A3B-Opus-Q8_0-GGUF},
  author       = {VIDRAFT},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/FINAL-Bench/Darwin-35B-A3B-Opus-Q8-GGUF}}
}
```