Feng Luo committed on
Commit 060c6fd · 1 Parent(s): d0b5a8b

update inference
README.md CHANGED
@@ -1,60 +1,82 @@
  ---
- library_name: transformers
- license: other
- base_model: feng0929/Bespoke_r1_qww_long_short_trigger_pro_with_rj_4_Instruct-7b_sft_lr1e-05_epoch3_bs32_gpu4_0506
- tags:
- - llama-factory
- - full
- - generated_from_trainer
- model-index:
- - name: qwen2.5_7b-based_model-AutoL2S_qwen2.5_7b_sft_rj4_pure_long_pure_short-train-K-16-alpha-5-k-2
-   results: []
  ---
-
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
- # qwen2.5_7b-based_model-AutoL2S_qwen2.5_7b_sft_rj4_pure_long_pure_short-train-K-16-alpha-5-k-2
-
- This model is a fine-tuned version of [feng0929/Bespoke_r1_qww_long_short_trigger_pro_with_rj_4_Instruct-7b_sft_lr1e-05_epoch3_bs32_gpu4_0506](https://huggingface.co/feng0929/Bespoke_r1_qww_long_short_trigger_pro_with_rj_4_Instruct-7b_sft_lr1e-05_epoch3_bs32_gpu4_0506) on the AutoL2S_qwen2.5_7b_sft_rj4_pure_long_pure_short-train-K-16-alpha-5-k-2 dataset.
-
- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
- ## Training and evaluation data
-
- More information needed
-
- ## Training procedure
-
- ### Training hyperparameters
-
- The following hyperparameters were used during training:
- - learning_rate: 1e-07
- - train_batch_size: 1
- - eval_batch_size: 1
- - seed: 42
- - distributed_type: multi-GPU
- - num_devices: 4
- - total_train_batch_size: 4
- - total_eval_batch_size: 4
- - optimizer: Use adamw_torch with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- - lr_scheduler_type: cosine
- - lr_scheduler_warmup_ratio: 0.1
- - num_epochs: 1.0
-
- ### Training results
-
- ### Framework versions
-
- - Transformers 4.46.1
- - Pytorch 2.9.0+cu128
- - Datasets 3.1.0
- - Tokenizers 0.20.3
  ---
+ license: apache-2.0
+ base_model:
+ - Qwen/Qwen2.5-7B-Instruct
  ---
+ # AutoL2S-Plus-7B
+
+ This is the official model repository for AutoL2S-Plus-7B, a model fine-tuned for efficient reasoning, based on [amandaa/AutoL2S-7b](https://huggingface.co/amandaa/AutoL2S-7b/tree/main).
 
+ ## 💡 Overview
+
+ **AutoL2S** is a two-stage framework designed to improve reasoning efficiency. It follows a two-phase training pipeline: Supervised Fine-Tuning (SFT) followed by off-policy Reinforcement Learning (RL).
+
+ - **Stage 1: Long–Short Concatenated Distillation**
+   In this stage, long and short chains of thought (CoT) are paired and trained jointly, using a special `<EASY>` token to enable automatic switching between CoT modes. The resulting SFT model is released as [amandaa/AutoL2S-7b](https://huggingface.co/amandaa/AutoL2S-7b/tree/main).
+
+ - **Stage 2: Off-Policy RL with Length-Aware Objective**
+   In the second stage, we further refine reasoning efficiency through an RL objective that balances accuracy and length: the model is rewarded for generating shorter reasoning paths while maintaining correctness. Because the length objective is non-differentiable, we apply a PPO-style clipped loss and compute per-sample advantages from the long- and short-form outputs of the SFT-based AutoL2S model, which serves as the reference policy.
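The length-aware, PPO-style objective described in Stage 2 can be sketched in a few lines. Everything below is an illustrative assumption, not the released training configuration: the reward shaping, the 0.5 length-penalty scale, and the clip range `eps=0.2` are made up for the example. It only shows how a shorter correct answer earns a higher reward, and how the resulting per-sample advantage enters a clipped surrogate.

```python
def length_aware_reward(correct: bool, n_tokens: int, n_ref_long: int) -> float:
    """Illustrative reward: 1.0 for a correct answer, discounted as the
    response approaches the reference long-CoT length; 0.0 when incorrect."""
    if not correct:
        return 0.0
    return 1.0 - 0.5 * min(n_tokens / n_ref_long, 1.0)


def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO-style clipped objective for one sample: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)


# Short and long outputs of the reference (SFT) policy, both correct:
r_short = length_aware_reward(True, n_tokens=200, n_ref_long=2000)   # 0.95
r_long = length_aware_reward(True, n_tokens=2000, n_ref_long=2000)   # 0.50
baseline = (r_short + r_long) / 2
adv_short = r_short - baseline  # positive: the shorter correct answer is preferred
print(clipped_surrogate(ratio=1.5, advantage=adv_short))
```

With a positive advantage and a ratio above `1 + eps`, the surrogate is clipped at `1.2 * adv_short`, which is what bounds the policy update.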
 
+ This repository contains:
+
+ - Model weights
+ - Configuration files
+ - Necessary scripts in the `examples/` directory
+
+ ---
+ ## 🧩 Dependencies
+
+ We recommend using the model with [vLLM](https://github.com/vllm-project/vllm). The code has been tested with:
+
+ ```
+ vLLM == 0.6.2
+ ```
+
+ ---
+ ## 🚀 How to Use
+
+ Run the inference example:
+
+ ```bash
+ cd examples
+ python inference.py
+ ```
+
+ Alternatively, download `examples/prefixLLM.py` from this repository, put it in your working directory, and run:
+ ```python
+ from vllm import SamplingParams
+ from prefixLLM import PrefixLLM
+
+ SYSTEM_PROMPT = "You are a helpful and harmless assistant. You should think step-by-step and put your final answer within \\boxed{{}}."
+
+ llm = PrefixLLM(model="amandaa/AutoL2S-Plus-7b")
+ max_tokens, temp = 32768, 0.7
+ sampling_params = SamplingParams(max_tokens=max_tokens, temperature=temp)
+
+ question = "Convert the point $(0,3)$ in rectangular coordinates to polar coordinates. Enter your answer in the form $(r,\\theta),$ where $r > 0$ and $0 \\le \\theta < 2 \\pi.$"
+ messages = [
+     {"role": "system", "content": SYSTEM_PROMPT},
+     {"role": "user", "content": question}
+ ]
+ responses = llm.chat(messages=messages, sampling_params=sampling_params, use_tqdm=True)
+
+ print(responses[0].outputs[0].text)
+ ```
+ ---
+
+ ## 🔍 Citation
+
+ If you use this model in your work, please consider citing:
+
+ ```bibtex
+ @article{luo2025autol2s,
+   title={AutoL2S: Auto Long-Short Reasoning for Efficient Large Language Models},
+   author={Luo, Feng and Chuang, Yu-Neng and Wang, Guanchu and Le, Hoang Anh Duy and Zhong, Shaochen and Liu, Hongyi and Yuan, Jiayi and Sui, Yang and Braverman, Vladimir and Chaudhary, Vipin and others},
+   journal={arXiv preprint arXiv:2505.22662},
+   year={2025}
+ }
+ ```
examples/__pycache__/prefixLLM.cpython-310.pyc ADDED
Binary file (3.99 kB)
examples/inference.py ADDED
@@ -0,0 +1,18 @@
+ from vllm import SamplingParams
+ from prefixLLM import PrefixLLM
+
+ SYSTEM_PROMPT = "You are a helpful and harmless assistant. You should think step-by-step and put your final answer within \\boxed{{}}."
+
+ if __name__ == "__main__":
+     llm = PrefixLLM(model="amandaa/AutoL2S-Plus-7b")
+     max_tokens, temp = 32768, 0.7
+     sampling_params = SamplingParams(max_tokens=max_tokens, temperature=temp)
+
+     question = "Melissa works as a pet groomer. This week, she has 8 dogs that need to be bathed, 5 cats that need their nails clipped, 3 birds that need their wings trimmed, and 12 horses that need to be brushed. If she splits the grooming jobs evenly over the days, how many animals will she groom each day of the week?"
+     messages = [
+         {"role": "system", "content": SYSTEM_PROMPT},
+         {"role": "user", "content": question}
+     ]
+     responses = llm.chat(messages=messages, sampling_params=sampling_params, use_tqdm=True)
+
+     print(responses[0].outputs[0].text)
examples/prefixLLM.py ADDED
@@ -0,0 +1,150 @@
+ import copy  # needed for deepcopy of multi-modal payloads below
+ import re
+ from typing import Any, Dict, List, Optional, Sequence, Union
+
+ from vllm import LLM, SamplingParams
+ from vllm.entrypoints.chat_utils import (
+     ChatCompletionMessageParam,
+     apply_hf_chat_template,
+     apply_mistral_chat_template,
+     parse_chat_messages,
+ )
+ from vllm.inputs import PromptInputs, TextPrompt, TokensPrompt
+ from vllm.lora.request import LoRARequest
+ from vllm.outputs import RequestOutput
+ from vllm.transformers_utils.tokenizer import MistralTokenizer
+ from vllm.utils import is_list_of
+
+
+ _TAIL_WS_RE = re.compile(r"(?:\r?\n|\s)+$")
+
+
+ def needs_newline(text: str) -> bool:
+     """Return True when *text* does NOT already end with whitespace/newline."""
+     return _TAIL_WS_RE.search(text[-8:]) is None  # inspect last few chars
+
+
+ def add_prefix(prompt: str, prefix: str, eos_token: str) -> str:
+     """Insert *prefix* before the first generated token.
+
+     Keeps the EOS token at the very end if the template already appended it.
+     """
+     if prompt.endswith(eos_token):
+         return prompt[:-len(eos_token)] + prefix + eos_token
+     return prompt + prefix
+
+
+ class PrefixLLM(LLM):
+     """vLLM LLM subclass that conditionally prepends *trigger_word*."""
+
+     def route_chat(
+         self,
+         messages: Union[
+             List[ChatCompletionMessageParam],
+             List[List[ChatCompletionMessageParam]],
+         ],
+         sampling_params_route: Optional[Union[SamplingParams,
+                                               List[SamplingParams]]] = None,
+         sampling_params_force_think: Optional[Union[SamplingParams,
+                                                     List[SamplingParams]]] = None,
+         use_tqdm: bool = True,
+         lora_request: Optional[LoRARequest] = None,
+         chat_template: Optional[str] = None,
+         add_generation_prompt: bool = True,
+         tools: Optional[List[Dict[str, Any]]] = None,
+         *,
+         trigger_word: Optional[str] = None,
+     ) -> List[RequestOutput]:
+         """Drop-in replacement for `LLM.chat` with one extra keyword:
+
+         Parameters
+         ----------
+         trigger_word : str | None, default None
+             The prefix to inject. If ``None`` → no prefix injection.
+         """
+
+         tokenizer = self.get_tokenizer()
+         model_config = self.llm_engine.get_model_config()
+         eos_token = tokenizer.eos_token
+
+         orig_prompts: List[Union[TokensPrompt, TextPrompt]] = []
+         pref_prompts: List[Union[TokensPrompt, TextPrompt]] = []
+         mm_payloads: List[Optional[Dict[str, Any]]] = []
+
+         list_of_messages: List[List[ChatCompletionMessageParam]]
+
+         # Handle multi and single conversations
+         if is_list_of(messages, list):
+             # messages is List[List[...]]
+             list_of_messages = messages
+         else:
+             # messages is List[...]
+             list_of_messages = [messages]
+
+         prompts: List[Union[TokensPrompt, TextPrompt]] = []
+
+         for msgs in list_of_messages:
+             # ---- render chat template exactly once ----
+             if isinstance(tokenizer, MistralTokenizer):
+                 prompt_data: Union[str, List[int]] = apply_mistral_chat_template(
+                     tokenizer,
+                     messages=msgs,
+                     chat_template=chat_template,
+                     add_generation_prompt=add_generation_prompt,
+                     tools=tools,
+                 )
+                 mm_data = None  # mistral util returns already embedded image tokens
+             else:
+                 conversation, mm_data = parse_chat_messages(msgs, model_config, tokenizer)
+                 prompt_data = apply_hf_chat_template(
+                     tokenizer,
+                     conversation=conversation,
+                     chat_template=chat_template,
+                     add_generation_prompt=add_generation_prompt,
+                     tools=tools,
+                 )
+
+             if is_list_of(prompt_data, int):
+                 raise NotImplementedError
+             else:
+                 orig_prompt = TextPrompt(prompt=prompt_data)
+
+             if trigger_word is None:
+                 raise ValueError("trigger_word must be provided when using force_think logic")
+
+             need_nl = needs_newline(prompt_data)
+             prefix = trigger_word + ("\n" if need_nl else "")
+             pref_txt = add_prefix(prompt_data, prefix, eos_token)
+             pref_prompt = TextPrompt(prompt=pref_txt)
+
+             if mm_data is not None:
+                 orig_prompt["multi_modal_data"] = mm_data
+                 pref_prompt["multi_modal_data"] = copy.deepcopy(mm_data)
+
+             orig_prompts.append(orig_prompt)
+             pref_prompts.append(pref_prompt)
+
+         results = self.generate(
+             orig_prompts,
+             sampling_params=sampling_params_route,
+             use_tqdm=use_tqdm,
+             lora_request=lora_request,
+         )
+
+         # Outputs whose first 100 chars contain the long-CoT marker are redone
+         # with the injected prefix.
+         need_force = [i for i, out in enumerate(results) if "<specialLong>" in out.outputs[0].text[:100]]
+
+         if len(need_force) == 0:
+             return results  # early exit, nothing to redo
+
+         prompts_force = [pref_prompts[i] for i in need_force]
+
+         results_force = self.generate(
+             prompts_force,
+             sampling_params=sampling_params_force_think,
+             use_tqdm=use_tqdm,
+             lora_request=lora_request,
+         )
+
+         for idx, new_out in zip(need_force, results_force):
+             results[idx] = new_out
+
+         return results
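The routing logic in `prefixLLM.py` can be exercised without vLLM or a GPU. The sketch below restates its two pure helpers (`needs_newline`, `add_prefix`) and the `<specialLong>` detection as standalone functions; the `<eos>` and `<think>` strings are placeholder tokens for illustration, not the model's actual vocabulary.

```python
import re

_TAIL_WS_RE = re.compile(r"(?:\r?\n|\s)+$")


def needs_newline(text: str) -> bool:
    # True when the rendered prompt does not already end in whitespace/newline
    return _TAIL_WS_RE.search(text[-8:]) is None


def add_prefix(prompt: str, prefix: str, eos_token: str) -> str:
    # Inject `prefix` before a trailing EOS token, if the template appended one
    if prompt.endswith(eos_token):
        return prompt[:-len(eos_token)] + prefix + eos_token
    return prompt + prefix


def needs_force(first_output: str) -> bool:
    # route_chat regenerates any sample whose first 100 chars contain the marker
    return "<specialLong>" in first_output[:100]


print(needs_newline("User: hi"))                          # True: no trailing newline
print(add_prefix("prompt<eos>", "<think>\n", "<eos>"))    # prefix lands before <eos>
print(needs_force("<specialLong> Let me think..."))       # True: regenerate this one
```

The EOS-preserving injection matters because some chat templates end the rendered prompt with an EOS token: appending the trigger after it would place the prefix outside the sequence the model actually continues.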