OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling

Overview

We introduce OPT-Engine, an extensible benchmark framework for optimization problems with controllable complexity and configurable templates. OPT-Engine spans ten canonical operations research problem classes, systematically scaling in complexity, and thus provides a structured testbed for automated problem formulation and solving across OR complexity levels. OPT-Engine facilitates rigorous, reproducible studies of how problem complexity affects model performance, offering a more granular look at LLM formulation and solving capabilities.

Updates

Inference

We recommend using the T.I.R prompt template defined in benchmark_gurobi_prompts.py. Replace the {question} placeholder with any natural-language OR question.

Quick start

Below is a simple example for model inference:

from transformers import AutoTokenizer
from benchmark_prompt_utils import benchmark_gurobi_prompts
from utils import extract_code_block, extract_obj
from vllm import SamplingParams, LLM
from langchain_core.prompts import PromptTemplate
import subprocess

# Load model and parameters
model = LLM("chenyitian-shanshu/Qwen3-SIRL-4B",            
            tensor_parallel_size=1,
            trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("chenyitian-shanshu/Qwen3-SIRL-4B")
sampling_params = SamplingParams(
            n=1,
            temperature=0.5,
            top_p=0.95,
            max_tokens=8192,
            repetition_penalty=1.02
        )

# Load question. Here is just an example. Users can replace this with datasets they want to test
question = "An industrial tire company delivers large tires for equipment to remote engineering sites either by cargo planes or ultrawide trucks. Each cargo plane can transport 10 tires per trip and costs $1000. Each ultrawide truck can transport 6 tires per trip and costs $700. The company needs to transport at least 200 tires and has available $22000. Because most remote sites don't have proper airports, the number of plane trips cannot exceed the number of ultrawide truck trips. How many trips of each should be done to minimize the total number of trips?"
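This toy instance is small enough to verify by exhaustive enumeration, independent of the LLM pipeline. The sketch below is a sanity check only (the search bound of 40 trips is a hand-picked assumption that safely covers the $22000 budget, since even the cheaper truck trips cap out below 32):

```python
# Brute-force check of the example question above: minimize total trips
# subject to the tire demand, the budget, and the plane <= truck constraint.
best = None
for planes in range(41):
    for trucks in range(41):
        if (10 * planes + 6 * trucks >= 200                 # at least 200 tires
                and 1000 * planes + 700 * trucks <= 22000   # budget limit
                and planes <= trucks):                      # airport restriction
            trips = planes + trucks
            if best is None or trips < best[0]:
                best = (trips, planes, trucks)
print(best)  # → (26, 11, 15): 26 total trips is optimal for this instance
```

An inference run that formulates this question correctly should therefore report an optimal objective of 26 trips (alternative optima such as 12 plane / 14 truck trips also attain it).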

# Load prompt template
TIR_prompt_user = PromptTemplate.from_template(benchmark_gurobi_prompts['zeroshot_q2mc_en'])
prompt = [{"role": "user",
           "content": TIR_prompt_user.format(Question=question).strip()}]

# Generate Response
text = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
response = model.generate(text, sampling_params)
response_text = response[0].outputs[0].text
code_snippet = extract_code_block(response_text, 'gurobi')
result = subprocess.run(['python3', '-c', code_snippet], capture_output=True, text=True, timeout=100)
obj = extract_obj(result.stdout, 'gurobi')
print(response_text)
print('optimal value is', obj)
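The helpers extract_code_block and extract_obj come from the repository's utils module. If you want to run the pipeline without them, a minimal stand-in for the code-extraction step could look like the following. This is a hypothetical fallback, not the repo's implementation; it assumes the model's response contains a markdown-fenced code block and simply returns the last such block:

```python
import re

def extract_code_block_fallback(text: str, lang: str = 'python') -> str:
    """Hypothetical stand-in for utils.extract_code_block: return the
    last ```lang ... ``` fenced block in a response, or '' if none."""
    blocks = re.findall(r"```%s\n(.*?)```" % re.escape(lang), text, re.DOTALL)
    return blocks[-1].strip() if blocks else ''

demo = "Some reasoning...\n```python\nprint('hello')\n```\nmore text"
print(extract_code_block_fallback(demo))  # prints: print('hello')
```

Note that the quick start tags blocks with 'gurobi' rather than 'python'; match whatever fence label the prompt template instructs the model to emit.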

Test Dataset

We evaluate our trained model on multiple datasets, including MAMO, IndustryOR, OptMATH, and the C^3-Bench dataset (the OPT-Engine datasets) from https://github.com/Cardinal-Operations/OPTEngine.

Citation

If you find OPT-Engine (https://github.com/Cardinal-Operations/OPTEngine) useful or relevant to your research, please consider citing our paper:

@article{chen2026opt,
  title={OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling},
  author={Chen, Yitian and Cheng, Cheng and Sun, Yinan and Ling, Zi and Ge, Dongdong},
  journal={arXiv preprint arXiv:2601.19924},
  year={2026}
}