Instructions to use FINAL-Bench/Darwin-36B-Opus with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use FINAL-Bench/Darwin-36B-Opus with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="FINAL-Bench/Darwin-36B-Opus")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("FINAL-Bench/Darwin-36B-Opus")
model = AutoModelForCausalLM.from_pretrained("FINAL-Bench/Darwin-36B-Opus")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use FINAL-Bench/Darwin-36B-Opus with vLLM:
Install from pip and serve the model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "FINAL-Bench/Darwin-36B-Opus"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "FINAL-Bench/Darwin-36B-Opus",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

Use Docker
```shell
docker model run hf.co/FINAL-Bench/Darwin-36B-Opus
```
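Since the vLLM server speaks the OpenAI-compatible API, the curl call above can also be made from Python with only the standard library. A minimal sketch (the helper name `build_chat_request` is ours, not part of any library; it assumes the server from above is running on `localhost:8000`):

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str) -> tuple[str, bytes]:
    """Build the URL and JSON body for an OpenAI-compatible chat completion."""
    url = "http://localhost:8000/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, json.dumps(payload).encode("utf-8")


if __name__ == "__main__":
    url, body = build_chat_request(
        "FINAL-Bench/Darwin-36B-Opus", "What is the capital of France?"
    )
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    # Send the request and print the assistant's reply
    with urllib.request.urlopen(req) as resp:
        reply = json.loads(resp.read())
    print(reply["choices"][0]["message"]["content"])
```

The same snippet works against the SGLang server below by changing the port to 30000.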
- SGLang
How to use FINAL-Bench/Darwin-36B-Opus with SGLang:
Install from pip and serve the model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "FINAL-Bench/Darwin-36B-Opus" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "FINAL-Bench/Darwin-36B-Opus",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "FINAL-Bench/Darwin-36B-Opus" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "FINAL-Bench/Darwin-36B-Opus",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

- Docker Model Runner
How to use FINAL-Bench/Darwin-36B-Opus with Docker Model Runner:
```shell
docker model run hf.co/FINAL-Bench/Darwin-36B-Opus
```
APEX Quant Request + Real-World Performance
Love this model.
@mudler please consider this model for an APEX quant.
It has denser reasoning, feels more fluid, and just performs better than the Opus 4.7 finetune I just tested.
Since they both share the same MoE base model, speeds should be very similar if APEX-quantized.
Visual comparison (only half of the Opus 4.7 finetune's internal reasoning is shown in this image):
Darwin-36B-Opus vs. Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled
Hardware / Config
| Spec | Value |
|---|---|
| OS | Linux 7.0.1-1-cachyos-rt-bore-lto |
| GPU | RTX 4080 Max-Q 12 GB @ 60 W TDP |
| CPU | Intel Ultra 9 185H |
| RAM | 32GB LPDDR5x |
| Backend | ik_llama.cpp (main) |
| Context | 65k max |
| Darwin Quant | bartowski IQ4_XS imatrix |
| 4.7 Quant | APEX-I-Compact |
Performance Eval
(Tested comparable sizes: one quant that fits in VRAM, and the one closest to the IQ4_XS size)
| Model | Quant | Size (on disk) | Prefill (16k) | Generate |
|---|---|---|---|---|
| Darwin-36B-Opus | bartowski IQ4_XS | ~17.5 GB | 293 tps | 51.0 tps |
| Qwen-4.7 Fine-Tune | APEX-I-Compact | 16.1 GB | 313 tps | 46.6 tps |
| Qwen-4.7 Fine-Tune | APEX-I-Nano | 10.8 GB | 1047 tps | 67.2 tps |
Reasoning Samples (Physics Task)
Darwin-36B-Opus thinking trace (~800 tokens):
Physics: WKB approximation for V(x) = αx⁴. The quantization condition is ∮p(x)dx=(n+1/2)2πℏ. For V(x)=αx⁴, the turning points are at x=±(E/α)^(1/4). The integral gives Eₙ = [derivation]... Let me be more precise.
The WKB quantization: ∫₋ₐᵃ √(2m(E−αx⁴)) dx = (n+1/2)πℏ where a=(E/α)^(1/4).
Using substitution x=a·sin^(1/2)θ... Actually, let me use the standard result. The integral ∫₋ₐᵃ √(a²−x⁴) dx relates to the Beta function.
The result: Eₙ = C·ℏ^(2/3)α^(1/3)(n+1/2)^(2/3)/(2m)^(1/3) where C is a numerical constant.
Qwen-4.7 Fine-Tune thinking trace (~1200 tokens):
The user wants me to solve 4 specific tasks and then write a Python script that saves the solutions to a markdown file. Let me work through each task:
1. Physics: WKB approximation for V(x) = αx⁴, derive E_n
2. Code: parallel_map_with_retry (asyncio) + fix fetch_with_timeout bug
3. Language: Attention analogy in Spanish, translate to Japanese
4. Meta: One genuine limitation and how it affects Task 1
Let me solve each one properly, then write the Python script.
Task 1: WKB for V(x) = αx⁴
The WKB quantization condition:
∮ p(x) dx = (n + 1/2) h
where p(x) = √(2m(E - V(x)))
For V(x) = αx⁴, the turning points are at x = ±(E/α)^(1/4)
The integral:
∮ p(x) dx = 2 ∫_{-a}^{a} √(2m(E - αx⁴)) dx = 4 ∫_0^a √(2m(E - αx⁴)) dx
where a = (E/α)^(1/4)
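For context on Task 2, a minimal sketch of the kind of `parallel_map_with_retry` helper the prompt asks for. The function name comes from the traces above, but the exact signature and retry policy are our assumptions, since the full benchmark prompt is not shown:

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def parallel_map_with_retry(
    fn: Callable[[T], Awaitable[R]],
    items: list[T],
    max_retries: int = 3,
    concurrency: int = 8,
) -> list[R]:
    """Apply async `fn` to each item concurrently, retrying transient failures."""
    sem = asyncio.Semaphore(concurrency)  # cap in-flight tasks

    async def worker(item: T) -> R:
        async with sem:
            for attempt in range(max_retries):
                try:
                    return await fn(item)
                except Exception:
                    if attempt == max_retries - 1:
                        raise  # out of retries: propagate the error
                    # Exponential backoff before the next attempt
                    await asyncio.sleep(0.1 * 2 ** attempt)

    # gather preserves input order in its results
    return await asyncio.gather(*(worker(i) for i in items))
```

A timeout bug of the `fetch_with_timeout` sort typically comes down to wrapping the await in `asyncio.wait_for` (or `asyncio.timeout`) rather than passing a timeout the callee ignores.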
Output Quality (Final Answers)
- Both models produced the identical final physics derivation: Eₙ ∝ (n+1/2)^(4/3)
- Both produced identical async code with retry logic + bug fix
- Both produced identical Spanish→Japanese attention analogy
- Both acknowledged quantization-induced numerical instability
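For reference, the (n+1/2)^(4/3) scaling both models land on follows directly from the WKB condition quoted in the traces; a standard textbook derivation, sketched here up to the numerical constant:

```latex
% WKB for V(x) = \alpha x^4, turning points at a = (E/\alpha)^{1/4}
\oint p\,dx = 2\int_{-a}^{a}\sqrt{2m\left(E-\alpha x^4\right)}\,dx
            = \left(n+\tfrac{1}{2}\right) h
% Substitute x = a u to pull all E-dependence out front:
2\sqrt{2mE}\,\left(\frac{E}{\alpha}\right)^{1/4}
    \int_{-1}^{1}\sqrt{1-u^4}\,du \;\propto\; E^{3/4}
% Hence E^{3/4} \propto \left(n+\tfrac{1}{2}\right)\hbar, giving
E_n \;\propto\; \left(\frac{\hbar^4\,\alpha}{m^2}\right)^{1/3}
    \left(n+\tfrac{1}{2}\right)^{4/3}
```

(The remaining ∫₋₁¹ √(1−u⁴) du is the Beta-function integral the Darwin trace mentions; it only affects the constant, not the exponent.)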
Key Observation
Similar raw TPS, but Darwin uses ~33% fewer tokens to reach the same answer. The result: lower end-to-end latency, less scrolling, and a denser thinking trace.
The "thinking density" difference is the real win — Darwin's concise <think> traces reduce cognitive load more than raw TPS gains from aggressive quantization.
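As a back-of-envelope check, plugging the trace lengths and generate speeds from the tables above into a quick calculation (these are the thread's own numbers; real latency also includes prefill time):

```python
# Trace lengths (tokens) and generate speeds (tps) from the benchmark above
darwin_tokens, darwin_tps = 800, 51.0
qwen_tokens, qwen_tps = 1200, 46.6

darwin_s = darwin_tokens / darwin_tps  # time spent generating the thinking trace
qwen_s = qwen_tokens / qwen_tps

print(f"Darwin: {darwin_s:.1f} s, Qwen finetune: {qwen_s:.1f} s")
print(f"Token savings: {1 - darwin_tokens / qwen_tokens:.0%}")
```

So despite the slightly higher TPS being only ~10%, the shorter trace cuts roughly ten seconds off the thinking phase alone.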
Model Links
- Darwin-36B-Opus: https://huggingface.co/FINAL-Bench/Darwin-36B-Opus
- Darwin GGUF (bartowski): https://huggingface.co/bartowski/FINAL-Bench_Darwin-36B-Opus-GGUF
- License: Apache 2.0
Thank you @el4 for the detailed benchmark and the kind words! 🙏
You're right about the denser reasoning — Darwin-36B-Opus inherits Claude Opus reasoning patterns through our Darwin V7 evolutionary merge, which tends to produce more compact thinking traces compared to standard fine-tunes.
@mudler an APEX quant would be wonderful — happy to coordinate if needed.
In the meantime, our team is also working on:
- NVFP4 native quantization (Blackwell-optimized)
- FP8 build for vLLM serving
Stay tuned, and feel free to ping us with any feedback!
This model is friggin fantastic. Would love to try a version with some of those new techniques.
Edit: It's hands down the best ~35b class MoE I've tried so far. Good job.
> This model is friggin fantastic. Would love to try a version with some of those new techniques.
Appreciate that, m. V8 is in the oven — NEG-Gate baked into the token generation loop, GPQA Greedy@Q20 jumped 55→70%. Drops as soon as we finish cleaning the eval harness.
Stay tuned.
I've lost this message in the notifications somehow - working on the APEX quant !
> I've lost this message in the notifications somehow - working on the APEX quant !
Sick! Thanks!