Instructions for using FINAL-Bench/Darwin-28B-Opus with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use FINAL-Bench/Darwin-28B-Opus with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="FINAL-Bench/Darwin-28B-Opus")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)
```

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("FINAL-Bench/Darwin-28B-Opus")
model = AutoModelForImageTextToText.from_pretrained("FINAL-Bench/Darwin-28B-Opus")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use FINAL-Bench/Darwin-28B-Opus with vLLM:
Install from pip and serve the model
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "FINAL-Bench/Darwin-28B-Opus"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "FINAL-Bench/Darwin-28B-Opus",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
Use Docker
```bash
docker model run hf.co/FINAL-Bench/Darwin-28B-Opus
```
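The curl call above can also be made from Python. Below is a minimal sketch using the `openai` client against the OpenAI-compatible endpoint exposed by the pip-served vLLM instance above on port 8000; the `api_key` value is just a placeholder, since vLLM does not require one by default.

```python
# Minimal sketch: query the locally served model through the OpenAI-compatible API.
# Assumes `vllm serve "FINAL-Bench/Darwin-28B-Opus"` from above is running on port 8000.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",                      # placeholder; no key is required by default
)

response = client.chat.completions.create(
    model="FINAL-Bench/Darwin-28B-Opus",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
)
print(response.choices[0].message.content)
```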
- SGLang
How to use FINAL-Bench/Darwin-28B-Opus with SGLang:
Install from pip and serve the model
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "FINAL-Bench/Darwin-28B-Opus" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "FINAL-Bench/Darwin-28B-Opus",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
Use Docker images
```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "FINAL-Bench/Darwin-28B-Opus" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "FINAL-Bench/Darwin-28B-Opus",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
- Docker Model Runner
How to use FINAL-Bench/Darwin-28B-Opus with Docker Model Runner:
```bash
docker model run hf.co/FINAL-Bench/Darwin-28B-Opus
```
3-stage adaptive evaluation comparison with Qwen3.6-27B?
It would be useful to compare the 3-stage adaptive evaluation results with those for Qwen3.6-27B under the same protocol.
Thank you for the suggestion. We'd like to note two points.
First, Darwin V7 transparently reports both the standard greedy result (74.7%) and the 2-Pass result (86.9%) side by side, so the community can assess each stage independently.
Second, a fair comparison requires equal disclosure from all sides. To our knowledge, Qwen3.6-27B's reported GPQA Diamond score (87.8%) does not come with detailed evaluation conditions such as temperature, sampling strategy, or number of attempts. We believe standardized and transparent evaluation protocols would benefit everyone, and we are happy to participate in any such effort.
So I believe they're using the protocol in the Qwen3 paper.
https://github.com/QwenLM/Qwen3/blob/main/Qwen3_Technical_Report.pdf
With one caveat, all of their tabulated results are consistent with that. The caveat is that they misidentified Qwen3-235B-A22B-Thinking-2507 results as Qwen3-235B-A22B in the results table on https://huggingface.co/Qwen/Qwen3.5-122B-A10B; if that's correct, Qwen3.5 and Qwen3.6 results can be chained back, via comparisons with other Qwen model scores, to those in the paper. (The assumption is that they evaluate their own models consistently within a given table, which appears to hold.)
Qwen3 Technical Report, eval notes:
https://github.com/QwenLM/Qwen3/blob/main/Qwen3_Technical_Report.pdf
4.6 Post-training Evaluation
For GPQA-Diamond, we sample 10 times for each query and report the averaged accuracy.
For all Qwen3 models in the thinking mode, we utilize a sampling temperature of 0.6, a top-p value of 0.95, and a top-k value of 20.
GPQA-Diamond, Thinking:

| Model | Score |
|---|---|
| Qwen3-235B-A22B | 71.1 |
| Qwen3-30B-A3B | 65.8 |
| Qwen3-32B | 68.4 |
| Qwen3-14B | 64.0 |
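For concreteness, here is a minimal sketch of what the sampling protocol quoted above (10 samples per query at temperature 0.6, top-p 0.95, top-k 20, averaged accuracy) could look like when reproduced with vLLM. The `questions` structure and `is_correct` grader are hypothetical stand-ins for a GPQA-Diamond loader and answer checker, and the output budget is an assumption not stated in the excerpt.

```python
# Sketch of the quoted protocol: 10 samples per query at temperature 0.6,
# top-p 0.95, top-k 20, with accuracy averaged over samples and queries.
from vllm import LLM, SamplingParams

llm = LLM(model="FINAL-Bench/Darwin-28B-Opus")
params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    n=10,              # 10 samples per query, as quoted above
    max_tokens=32768,  # output budget is an assumption; not specified in the excerpt
)

def averaged_accuracy(questions):
    # `questions` is a hypothetical list of {"prompt": ..., "answer": ...} dicts;
    # `is_correct` is a hypothetical answer checker for GPQA-Diamond.
    per_query_acc = []
    for q in questions:
        out = llm.generate([q["prompt"]], params)[0]
        correct = sum(is_correct(o.text, q["answer"]) for o in out.outputs)
        per_query_acc.append(correct / len(out.outputs))  # average over the 10 samples
    return sum(per_query_acc) / len(per_query_acc)        # average over queries
```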
Chain of GPQA-Diamond scores:
- https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507
  - Qwen3-30B-A3B-Thinking-2507: 73.4
  - Qwen3-30B-A3B-Thinking: 65.8
- https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507
  - Qwen3-235B-A22B: 71.1
  - Qwen3-235B-A22B-Thinking-2507: 81.1 [IFeval 87.8, MMLU-Pro 84.4]
- https://huggingface.co/Qwen/Qwen3.5-122B-A10B
  - Qwen3-235B-A22B: 81.1 [typo??? Thinking-2507? also IFeval 87.8, MMLU-Pro 84.4]
  - Qwen3.5-122B-A10B: 86.6
  - Qwen3.5-27B: 85.5