Instructions to use FINAL-Bench/Darwin-36B-Opus with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use FINAL-Bench/Darwin-36B-Opus with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="FINAL-Bench/Darwin-36B-Opus")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("FINAL-Bench/Darwin-36B-Opus")
model = AutoModelForCausalLM.from_pretrained("FINAL-Bench/Darwin-36B-Opus")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use FINAL-Bench/Darwin-36B-Opus with vLLM:
Install from pip and serve the model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "FINAL-Bench/Darwin-36B-Opus"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "FINAL-Bench/Darwin-36B-Opus",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

Use Docker
```shell
docker model run hf.co/FINAL-Bench/Darwin-36B-Opus
```
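Since the vLLM server speaks the OpenAI-compatible API, the curl call above can also be made from Python with only the standard library. A minimal sketch (the helper name `build_chat_request` is ours, not part of any library; it assumes the server from above is running on `localhost:8000`):

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str) -> tuple[str, bytes]:
    """Build the URL and JSON body for an OpenAI-compatible chat completion."""
    url = "http://localhost:8000/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, json.dumps(payload).encode("utf-8")


if __name__ == "__main__":
    url, body = build_chat_request(
        "FINAL-Bench/Darwin-36B-Opus", "What is the capital of France?"
    )
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    # Send the request and print the assistant's reply
    with urllib.request.urlopen(req) as resp:
        reply = json.loads(resp.read())
    print(reply["choices"][0]["message"]["content"])
```

The same snippet works against the SGLang server below by changing the port to 30000.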
- SGLang
How to use FINAL-Bench/Darwin-36B-Opus with SGLang:
Install from pip and serve the model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "FINAL-Bench/Darwin-36B-Opus" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "FINAL-Bench/Darwin-36B-Opus",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "FINAL-Bench/Darwin-36B-Opus" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "FINAL-Bench/Darwin-36B-Opus",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

- Docker Model Runner
How to use FINAL-Bench/Darwin-36B-Opus with Docker Model Runner:
```shell
docker model run hf.co/FINAL-Bench/Darwin-36B-Opus
```
APEX Quant Request + Real-World Performance
Love this model.
@mudler please consider this model for an APEX quant.
It has denser reasoning, feels more fluid, and just performs better than the Opus 4.7 finetune I just tested.
Since they both share the same MoE base model, speeds should be very similar if APEX-quantized.
Visual comparison (only half of the Opus 4.7 finetune's internal reasoning is shown in this image):
Darwin-36B-Opus vs. Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled
Hardware / Config
| Spec | Value |
|---|---|
| OS | Linux 7.0.1-1-cachyos-rt-bore-lto |
| GPU | RTX 4080 Max-Q 12 GB @ 60 W TDP |
| CPU | Intel Ultra 9 185H |
| RAM | 32GB LPDDR5x |
| Backend | ik_llama.cpp (main) |
| Context | 65k max |
| Darwin Quant | bartowski IQ4_XS imatrix |
| 4.7 Quant | APEX-I-Compact |
Performance Eval
(Tested comparable sizes: one quant that fits in VRAM, and the one closest to the IQ4_XS size)
| Model | Quant | Size (on disk) | Prefill (16k) | Generate |
|---|---|---|---|---|
| Darwin-36B-Opus | bartowski IQ4_XS | ~17.5 GB | 293 tps | 51.0 tps |
| Qwen-4.7 Fine-Tune | APEX-I-Compact | 16.1 GB | 313 tps | 46.6 tps |
| Qwen-4.7 Fine-Tune | APEX-I-Nano | 10.8 GB | 1047 tps | 67.2 tps |
Reasoning Samples (Physics Task)
Darwin-36B-Opus thinking trace (~800 tokens):
Physics: WKB approximation for V(x) = αx⁴. The quantization condition is ∮p(x)dx=(n+1/2)2πℏ. For V(x)=αx⁴, the turning points are at x=±(E/α)^(1/4). The integral gives Eₙ = [derivation]... Let me be more precise.
The WKB quantization: ∫₋ₐᵃ √(2m(E−αx⁴)) dx = (n+1/2)πℏ where a=(E/α)^(1/4).
Using substitution x=a·sin^(1/2)θ... Actually, let me use the standard result. The integral ∫₋ₐᵃ √(a²−x⁴) dx relates to the Beta function.
The result: Eₙ = C·ℏ^(2/3)α^(1/3)(n+1/2)^(2/3)/(2m)^(1/3) where C is a numerical constant.
Qwen-4.7 Fine-Tune thinking trace (~1200 tokens):
The user wants me to solve 4 specific tasks and then write a Python script that saves the solutions to a markdown file. Let me work through each task:
1. Physics: WKB approximation for V(x) = αx⁴, derive E_n
2. Code: parallel_map_with_retry (asyncio) + fix fetch_with_timeout bug
3. Language: Attention analogy in Spanish, translate to Japanese
4. Meta: One genuine limitation and how it affects Task 1
Let me solve each one properly, then write the Python script.
Task 1: WKB for V(x) = αx⁴
The WKB quantization condition:
∮ p(x) dx = (n + 1/2) h
where p(x) = √(2m(E - V(x)))
For V(x) = αx⁴, the turning points are at x = ±(E/α)^(1/4)
The integral:
∮ p(x) dx = 2 ∫_{-a}^{a} √(2m(E - αx⁴)) dx = 4 ∫_0^a √(2m(E - αx⁴)) dx
where a = (E/α)^(1/4)
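For context on Task 2, a minimal sketch of the kind of `parallel_map_with_retry` helper the prompt asks for. The function name comes from the traces above, but the exact signature and retry policy are our assumptions, since the full benchmark prompt is not shown:

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def parallel_map_with_retry(
    fn: Callable[[T], Awaitable[R]],
    items: list[T],
    max_retries: int = 3,
    concurrency: int = 8,
) -> list[R]:
    """Apply async `fn` to each item concurrently, retrying transient failures."""
    sem = asyncio.Semaphore(concurrency)  # cap in-flight tasks

    async def worker(item: T) -> R:
        async with sem:
            for attempt in range(max_retries):
                try:
                    return await fn(item)
                except Exception:
                    if attempt == max_retries - 1:
                        raise  # out of retries: propagate the error
                    # Exponential backoff before the next attempt
                    await asyncio.sleep(0.1 * 2 ** attempt)

    # gather preserves input order in its results
    return await asyncio.gather(*(worker(i) for i in items))
```

A timeout bug of the `fetch_with_timeout` sort typically comes down to wrapping the await in `asyncio.wait_for` (or `asyncio.timeout`) rather than passing a timeout the callee ignores.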
Output Quality (Final Answers)
- Both models produced the identical final physics derivation: Eₙ ∝ (n+1/2)^(4/3)
- Both produced identical async code with retry logic + bug fix
- Both produced identical Spanish→Japanese attention analogy
- Both acknowledged quantization-induced numerical instability
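For reference, the (n+1/2)^(4/3) scaling both models land on follows directly from the WKB condition quoted in the traces; a standard textbook derivation, sketched here up to the numerical constant:

```latex
% WKB for V(x) = \alpha x^4, turning points at a = (E/\alpha)^{1/4}
\oint p\,dx = 2\int_{-a}^{a}\sqrt{2m\left(E-\alpha x^4\right)}\,dx
            = \left(n+\tfrac{1}{2}\right) h
% Substitute x = a u to pull all E-dependence out front:
2\sqrt{2mE}\,\left(\frac{E}{\alpha}\right)^{1/4}
    \int_{-1}^{1}\sqrt{1-u^4}\,du \;\propto\; E^{3/4}
% Hence E^{3/4} \propto \left(n+\tfrac{1}{2}\right)\hbar, giving
E_n \;\propto\; \left(\frac{\hbar^4\,\alpha}{m^2}\right)^{1/3}
    \left(n+\tfrac{1}{2}\right)^{4/3}
```

(The remaining ∫₋₁¹ √(1−u⁴) du is the Beta-function integral the Darwin trace mentions; it only affects the constant, not the exponent.)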
Key Observation
Similar raw TPS, but Darwin uses ~33% fewer tokens to reach the same answer. The result: lower end-to-end latency, less scrolling, and a denser thinking trace.
The "thinking density" difference is the real win — Darwin's concise <think> traces reduce cognitive load more than raw TPS gains from aggressive quantization.
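As a back-of-envelope check, plugging the trace lengths and generate speeds from the tables above into a quick calculation (these are the thread's own numbers; real latency also includes prefill time):

```python
# Trace lengths (tokens) and generate speeds (tps) from the benchmark above
darwin_tokens, darwin_tps = 800, 51.0
qwen_tokens, qwen_tps = 1200, 46.6

darwin_s = darwin_tokens / darwin_tps  # time spent generating the thinking trace
qwen_s = qwen_tokens / qwen_tps

print(f"Darwin: {darwin_s:.1f} s, Qwen finetune: {qwen_s:.1f} s")
print(f"Token savings: {1 - darwin_tokens / qwen_tokens:.0%}")
```

So despite the slightly higher TPS being only ~10%, the shorter trace cuts roughly ten seconds off the thinking phase alone.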
Model Links
- Darwin-36B-Opus: https://huggingface.co/FINAL-Bench/Darwin-36B-Opus
- Darwin GGUF (bartowski): https://huggingface.co/bartowski/FINAL-Bench_Darwin-36B-Opus-GGUF
- License: Apache 2.0
Thank you @el4 for the detailed benchmark and the kind words! 🙏
You're right about the denser reasoning — Darwin-36B-Opus inherits Claude Opus reasoning patterns through our Darwin V7 evolutionary merge, which tends to produce more compact thinking traces compared to standard fine-tunes.
@mudler an APEX quant would be wonderful — happy to coordinate if needed.
In the meantime, our team is also working on:
- NVFP4 native quantization (Blackwell-optimized)
- FP8 build for vLLM serving
Stay tuned, and feel free to ping us with any feedback!
This model is friggin fantastic. Would love to try a version with some of those new techniques.
Edit: It's hands down the best ~35b class MoE I've tried so far. Good job.
> This model is friggin fantastic. Would love to try a version with some of those new techniques.
Appreciate that, m. V8 is in the oven — NEG-Gate baked into the token generation loop, GPQA Greedy@Q20 jumped 55→70%. Drops as soon as we finish cleaning the eval harness.
Stay tuned.
I've lost this message in the notifications somehow - working on the APEX quant !
> I've lost this message in the notifications somehow - working on the APEX quant !
Sick! Thanks!