Instructions for using FINAL-Bench/Darwin-28B-Opus with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use FINAL-Bench/Darwin-28B-Opus with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="FINAL-Bench/Darwin-28B-Opus")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)
```

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("FINAL-Bench/Darwin-28B-Opus")
model = AutoModelForImageTextToText.from_pretrained("FINAL-Bench/Darwin-28B-Opus")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use FINAL-Bench/Darwin-28B-Opus with vLLM:
Install from pip and serve the model
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "FINAL-Bench/Darwin-28B-Opus"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "FINAL-Bench/Darwin-28B-Opus",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
Use Docker
```bash
docker model run hf.co/FINAL-Bench/Darwin-28B-Opus
```
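The curl call above can also be made from Python. Below is a minimal sketch using the `openai` client against the OpenAI-compatible endpoint exposed by the pip-served vLLM instance above on port 8000; the `api_key` value is just a placeholder, since vLLM does not require one by default.

```python
# Minimal sketch: query the locally served model through the OpenAI-compatible API.
# Assumes `vllm serve "FINAL-Bench/Darwin-28B-Opus"` from above is running on port 8000.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",                      # placeholder; no key is required by default
)

response = client.chat.completions.create(
    model="FINAL-Bench/Darwin-28B-Opus",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
)
print(response.choices[0].message.content)
```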
- SGLang
How to use FINAL-Bench/Darwin-28B-Opus with SGLang:
Install from pip and serve the model
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "FINAL-Bench/Darwin-28B-Opus" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "FINAL-Bench/Darwin-28B-Opus",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
Use Docker images
```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "FINAL-Bench/Darwin-28B-Opus" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "FINAL-Bench/Darwin-28B-Opus",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
- Docker Model Runner
How to use FINAL-Bench/Darwin-28B-Opus with Docker Model Runner:
```bash
docker model run hf.co/FINAL-Bench/Darwin-28B-Opus
```
3-stage adaptive evaluation comparison with Qwen3.6-27B?
It would be useful to compare the 3-stage adaptive evaluation results with those for Qwen3.6-27B under the same protocol.
Thank you for the suggestion. We'd like to note two points.
First, Darwin V7 transparently reports both the standard greedy result (74.7%) and the 2-Pass result (86.9%) side by side, so the community can assess each stage independently.
Second, a fair comparison requires equal disclosure from all sides. To our knowledge, Qwen3.6-27B's reported GPQA Diamond score (87.8%) does not come with detailed evaluation conditions such as temperature, sampling strategy, or number of attempts. We believe standardized and transparent evaluation protocols would benefit everyone, and we are happy to participate in any such effort.
So I believe they're using the protocol in the Qwen3 paper.
https://github.com/QwenLM/Qwen3/blob/main/Qwen3_Technical_Report.pdf
With one caveat, all of their tabulated results are consistent with that. The caveat is that they misidentified Qwen3-235B-A22B-Thinking-2507 results as Qwen3-235B-A22B in the results table on https://huggingface.co/Qwen/Qwen3.5-122B-A10B; if that's correct, Qwen3.5 and Qwen3.6 results can be chained back, via comparisons with other Qwen model scores, to those in the paper. (The assumption is that they evaluate their own models consistently within a given table, which appears to hold.)
Qwen3 Technical Report, eval notes:
https://github.com/QwenLM/Qwen3/blob/main/Qwen3_Technical_Report.pdf
4.6 Post-training Evaluation
For GPQA-Diamond, we sample 10 times for each query and report the averaged accuracy.
For all Qwen3 models in the thinking mode, we utilize a sampling temperature of 0.6, a top-p value of 0.95, and a top-k value of 20.
GPQA-Diamond, Thinking:

| Model | Score |
|---|---|
| Qwen3-235B-A22B | 71.1 |
| Qwen3-30B-A3B | 65.8 |
| Qwen3-32B | 68.4 |
| Qwen3-14B | 64.0 |
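For concreteness, here is a minimal sketch of what the sampling protocol quoted above (10 samples per query at temperature 0.6, top-p 0.95, top-k 20, averaged accuracy) could look like when reproduced with vLLM. The `questions` structure and `is_correct` grader are hypothetical stand-ins for a GPQA-Diamond loader and answer checker, and the output budget is an assumption not stated in the excerpt.

```python
# Sketch of the quoted protocol: 10 samples per query at temperature 0.6,
# top-p 0.95, top-k 20, with accuracy averaged over samples and queries.
from vllm import LLM, SamplingParams

llm = LLM(model="FINAL-Bench/Darwin-28B-Opus")
params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    n=10,              # 10 samples per query, as quoted above
    max_tokens=32768,  # output budget is an assumption; not specified in the excerpt
)

def averaged_accuracy(questions):
    # `questions` is a hypothetical list of {"prompt": ..., "answer": ...} dicts;
    # `is_correct` is a hypothetical answer checker for GPQA-Diamond.
    per_query_acc = []
    for q in questions:
        out = llm.generate([q["prompt"]], params)[0]
        correct = sum(is_correct(o.text, q["answer"]) for o in out.outputs)
        per_query_acc.append(correct / len(out.outputs))  # average over the 10 samples
    return sum(per_query_acc) / len(per_query_acc)        # average over queries
```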
Chain of GPQA-Diamond scores:
- https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507
  - Qwen3-30B-A3B-Thinking-2507: 73.4
  - Qwen3-30B-A3B-Thinking: 65.8
- https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507
  - Qwen3-235B-A22B: 71.1
  - Qwen3-235B-A22B-Thinking-2507: 81.1 [IFeval 87.8, MMLU-Pro 84.4]
- https://huggingface.co/Qwen/Qwen3.5-122B-A10B
  - Qwen3-235B-A22B: 81.1 [typo??? Thinking-2507? also IFeval 87.8, MMLU-Pro 84.4]
  - Qwen3.5-122B-A10B: 86.6
  - Qwen3.5-27B: 85.5