Feng Luo committed on
Commit 060c6fd · 1 Parent(s): d0b5a8b

update inference
README.md CHANGED
@@ -1,60 +1,82 @@
  ---
- library_name: transformers
- license: other
- base_model: feng0929/Bespoke_r1_qww_long_short_trigger_pro_with_rj_4_Instruct-7b_sft_lr1e-05_epoch3_bs32_gpu4_0506
- tags:
- - llama-factory
- - full
- - generated_from_trainer
- model-index:
- - name: qwen2.5_7b-based_model-AutoL2S_qwen2.5_7b_sft_rj4_pure_long_pure_short-train-K-16-alpha-5-k-2
-   results: []
  ---
-
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
- # qwen2.5_7b-based_model-AutoL2S_qwen2.5_7b_sft_rj4_pure_long_pure_short-train-K-16-alpha-5-k-2
-
- This model is a fine-tuned version of [feng0929/Bespoke_r1_qww_long_short_trigger_pro_with_rj_4_Instruct-7b_sft_lr1e-05_epoch3_bs32_gpu4_0506](https://huggingface.co/feng0929/Bespoke_r1_qww_long_short_trigger_pro_with_rj_4_Instruct-7b_sft_lr1e-05_epoch3_bs32_gpu4_0506) on the AutoL2S_qwen2.5_7b_sft_rj4_pure_long_pure_short-train-K-16-alpha-5-k-2 dataset.
-
- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
- ## Training and evaluation data
-
- More information needed
-
- ## Training procedure
-
- ### Training hyperparameters
-
- The following hyperparameters were used during training:
- - learning_rate: 1e-07
- - train_batch_size: 1
- - eval_batch_size: 1
- - seed: 42
- - distributed_type: multi-GPU
- - num_devices: 4
- - total_train_batch_size: 4
- - total_eval_batch_size: 4
- - optimizer: Use adamw_torch with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- - lr_scheduler_type: cosine
- - lr_scheduler_warmup_ratio: 0.1
- - num_epochs: 1.0
-
- ### Training results
-
- ### Framework versions
-
- - Transformers 4.46.1
- - Pytorch 2.9.0+cu128
- - Datasets 3.1.0
- - Tokenizers 0.20.3
  ---
+ license: apache-2.0
+ base_model:
+ - Qwen/Qwen2.5-7B-Instruct
  ---
+ # AutoL2S-Plus-7B
+
+ This is the official model repository for AutoL2S-Plus-7B, a model fine-tuned for efficient reasoning, based on [amandaa/AutoL2S-7b](https://huggingface.co/amandaa/AutoL2S-7b/tree/main).
 
+ ## 💡 Overview
+
+ **AutoL2S** is a two-stage framework designed to improve reasoning efficiency. It follows a two-phase training pipeline: Supervised Fine-Tuning (SFT) followed by off-policy Reinforcement Learning (RL).
+
+ - **Stage 1: Long–Short Concatenated Distillation**
+   In this stage, long and short chains of thought (CoT) are paired and trained jointly, using a special `<EASY>` token to enable automatic switching between CoT modes. The resulting SFT model is released as [amandaa/AutoL2S-7b](https://huggingface.co/amandaa/AutoL2S-7b/tree/main).
+
+ - **Stage 2: Off-Policy RL with Length-Aware Objective**
+   In the second stage, we further refine reasoning efficiency through an RL objective that balances accuracy and length: the model is rewarded for generating shorter reasoning paths while maintaining correctness. Because the length objective is non-differentiable, we apply a PPO-style clipped loss and compute per-sample advantages from the long- and short-form outputs of the SFT-based AutoL2S model, which serves as the reference policy.
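The length-aware, PPO-style objective described in Stage 2 can be sketched in a few lines. Everything below is an illustrative assumption, not the released training configuration: the reward shaping, the 0.5 length-penalty scale, and the clip range `eps=0.2` are made up for the example. It only shows how a shorter correct answer earns a higher reward, and how the resulting per-sample advantage enters a clipped surrogate.

```python
def length_aware_reward(correct: bool, n_tokens: int, n_ref_long: int) -> float:
    """Illustrative reward: 1.0 for a correct answer, discounted as the
    response approaches the reference long-CoT length; 0.0 when incorrect."""
    if not correct:
        return 0.0
    return 1.0 - 0.5 * min(n_tokens / n_ref_long, 1.0)


def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO-style clipped objective for one sample: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)


# Short and long outputs of the reference (SFT) policy, both correct:
r_short = length_aware_reward(True, n_tokens=200, n_ref_long=2000)   # 0.95
r_long = length_aware_reward(True, n_tokens=2000, n_ref_long=2000)   # 0.50
baseline = (r_short + r_long) / 2
adv_short = r_short - baseline  # positive: the shorter correct answer is preferred
print(clipped_surrogate(ratio=1.5, advantage=adv_short))
```

With a positive advantage and a ratio above `1 + eps`, the surrogate is clipped at `1.2 * adv_short`, which is what bounds the policy update.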
 
+ This repository contains:
+
+ - Model weights
+ - Configuration files
+ - Necessary scripts in the `examples/` directory
+
+ ---
+ ## 🧩 Dependencies
+
+ We recommend using the model with [vLLM](https://github.com/vllm-project/vllm). The code has been tested with:
+
+ ```
+ vLLM == 0.6.2
+ ```
+
+ ---
+ ## 🚀 How to Use
+
+ Run the inference example:
+
+ ```bash
+ cd examples
+ python inference.py
+ ```
+
+ Alternatively, download `examples/prefixLLM.py` from this repository, put it in your working directory, and run:
+ ```python
+ from vllm import SamplingParams
+ from prefixLLM import PrefixLLM
+
+ SYSTEM_PROMPT = "You are a helpful and harmless assistant. You should think step-by-step and put your final answer within \\boxed{{}}."
+
+ llm = PrefixLLM(model="amandaa/AutoL2S-Plus-7b")
+ max_tokens, temp = 32768, 0.7
+ sampling_params = SamplingParams(max_tokens=max_tokens, temperature=temp)
+
+ question = "Convert the point $(0,3)$ in rectangular coordinates to polar coordinates. Enter your answer in the form $(r,\\theta),$ where $r > 0$ and $0 \\le \\theta < 2 \\pi.$"
+ messages = [
+     {"role": "system", "content": SYSTEM_PROMPT},
+     {"role": "user", "content": question}
+ ]
+ responses = llm.chat(messages=messages, sampling_params=sampling_params, use_tqdm=True)
+
+ print(responses[0].outputs[0].text)
+ ```
+ ---
+
+ ## 🔍 Citation
+
+ If you use this model in your work, please consider citing:
+
+ ```bibtex
+ @article{luo2025autol2s,
+   title={AutoL2S: Auto Long-Short Reasoning for Efficient Large Language Models},
+   author={Luo, Feng and Chuang, Yu-Neng and Wang, Guanchu and Le, Hoang Anh Duy and Zhong, Shaochen and Liu, Hongyi and Yuan, Jiayi and Sui, Yang and Braverman, Vladimir and Chaudhary, Vipin and others},
+   journal={arXiv preprint arXiv:2505.22662},
+   year={2025}
+ }
+ ```
examples/__pycache__/prefixLLM.cpython-310.pyc ADDED
Binary file (3.99 kB)
examples/inference.py ADDED
@@ -0,0 +1,18 @@
+ from vllm import SamplingParams
+ from prefixLLM import PrefixLLM
+
+ SYSTEM_PROMPT = "You are a helpful and harmless assistant. You should think step-by-step and put your final answer within \\boxed{{}}."
+
+ if __name__ == "__main__":
+     llm = PrefixLLM(model="amandaa/AutoL2S-Plus-7b")
+     max_tokens, temp = 32768, 0.7
+     sampling_params = SamplingParams(max_tokens=max_tokens, temperature=temp)
+
+     question = "Melissa works as a pet groomer. This week, she has 8 dogs that need to be bathed, 5 cats that need their nails clipped, 3 birds that need their wings trimmed, and 12 horses that need to be brushed. If she splits the grooming jobs evenly over the days, how many animals will she groom each day of the week?"
+     messages = [
+         {"role": "system", "content": SYSTEM_PROMPT},
+         {"role": "user", "content": question}
+     ]
+     responses = llm.chat(messages=messages, sampling_params=sampling_params, use_tqdm=True)
+
+     print(responses[0].outputs[0].text)
examples/prefixLLM.py ADDED
@@ -0,0 +1,150 @@
+ import copy  # needed for deepcopy of multi-modal payloads below
+ import re
+ from typing import Any, Dict, List, Optional, Sequence, Union
+
+ from vllm import LLM, SamplingParams
+ from vllm.entrypoints.chat_utils import (
+     ChatCompletionMessageParam,
+     apply_hf_chat_template,
+     apply_mistral_chat_template,
+     parse_chat_messages,
+ )
+ from vllm.inputs import PromptInputs, TextPrompt, TokensPrompt
+ from vllm.lora.request import LoRARequest
+ from vllm.outputs import RequestOutput
+ from vllm.transformers_utils.tokenizer import MistralTokenizer
+ from vllm.utils import is_list_of
+
+
+ _TAIL_WS_RE = re.compile(r"(?:\r?\n|\s)+$")
+
+
+ def needs_newline(text: str) -> bool:
+     """Return True when *text* does NOT already end with whitespace/newline."""
+     return _TAIL_WS_RE.search(text[-8:]) is None  # inspect last few chars
+
+
+ def add_prefix(prompt: str, prefix: str, eos_token: str) -> str:
+     """Insert *prefix* before the first generated token.
+
+     Keeps the EOS token at the very end if the template already appended it.
+     """
+     if prompt.endswith(eos_token):
+         return prompt[:-len(eos_token)] + prefix + eos_token
+     return prompt + prefix
+
+
+ class PrefixLLM(LLM):
+     """vLLM LLM subclass that conditionally prepends *trigger_word*."""
+
+     def route_chat(
+         self,
+         messages: Union[
+             List[ChatCompletionMessageParam],
+             List[List[ChatCompletionMessageParam]],
+         ],
+         sampling_params_route: Optional[Union[SamplingParams,
+                                               List[SamplingParams]]] = None,
+         sampling_params_force_think: Optional[Union[SamplingParams,
+                                                     List[SamplingParams]]] = None,
+         use_tqdm: bool = True,
+         lora_request: Optional[LoRARequest] = None,
+         chat_template: Optional[str] = None,
+         add_generation_prompt: bool = True,
+         tools: Optional[List[Dict[str, Any]]] = None,
+         *,
+         trigger_word: Optional[str] = None,
+     ) -> List[RequestOutput]:
+         """Drop-in replacement for `LLM.chat` with one extra keyword:
+
+         Parameters
+         ----------
+         trigger_word : str | None, default None
+             The prefix to inject. If ``None`` → no prefix injection.
+         """
+
+         tokenizer = self.get_tokenizer()
+         model_config = self.llm_engine.get_model_config()
+         eos_token = tokenizer.eos_token
+
+         orig_prompts: List[Union[TokensPrompt, TextPrompt]] = []
+         pref_prompts: List[Union[TokensPrompt, TextPrompt]] = []
+         mm_payloads: List[Optional[Dict[str, Any]]] = []
+
+         list_of_messages: List[List[ChatCompletionMessageParam]]
+
+         # Handle multi and single conversations
+         if is_list_of(messages, list):
+             # messages is List[List[...]]
+             list_of_messages = messages
+         else:
+             # messages is List[...]
+             list_of_messages = [messages]
+
+         prompts: List[Union[TokensPrompt, TextPrompt]] = []
+
+         for msgs in list_of_messages:
+             # ---- render chat template exactly once ----
+             if isinstance(tokenizer, MistralTokenizer):
+                 prompt_data: Union[str, List[int]] = apply_mistral_chat_template(
+                     tokenizer,
+                     messages=msgs,
+                     chat_template=chat_template,
+                     add_generation_prompt=add_generation_prompt,
+                     tools=tools,
+                 )
+                 mm_data = None  # mistral util returns already embedded image tokens
+             else:
+                 conversation, mm_data = parse_chat_messages(msgs, model_config, tokenizer)
+                 prompt_data = apply_hf_chat_template(
+                     tokenizer,
+                     conversation=conversation,
+                     chat_template=chat_template,
+                     add_generation_prompt=add_generation_prompt,
+                     tools=tools,
+                 )
+
+             if is_list_of(prompt_data, int):
+                 raise NotImplementedError
+             else:
+                 orig_prompt = TextPrompt(prompt=prompt_data)
+
+             if trigger_word is None:
+                 raise ValueError("trigger_word must be provided when using force_think logic")
+
+             need_nl = needs_newline(prompt_data)
+             prefix = trigger_word + ("\n" if need_nl else "")
+             pref_txt = add_prefix(prompt_data, prefix, eos_token)
+             pref_prompt = TextPrompt(prompt=pref_txt)
+
+             if mm_data is not None:
+                 orig_prompt["multi_modal_data"] = mm_data
+                 pref_prompt["multi_modal_data"] = copy.deepcopy(mm_data)
+
+             orig_prompts.append(orig_prompt)
+             pref_prompts.append(pref_prompt)
+
+         results = self.generate(
+             orig_prompts,
+             sampling_params=sampling_params_route,
+             use_tqdm=use_tqdm,
+             lora_request=lora_request,
+         )
+
+         # Outputs whose first 100 chars contain the long-CoT marker are redone
+         # with the injected prefix.
+         need_force = [i for i, out in enumerate(results) if "<specialLong>" in out.outputs[0].text[:100]]
+
+         if len(need_force) == 0:
+             return results  # early exit, nothing to redo
+
+         prompts_force = [pref_prompts[i] for i in need_force]
+
+         results_force = self.generate(
+             prompts_force,
+             sampling_params=sampling_params_force_think,
+             use_tqdm=use_tqdm,
+             lora_request=lora_request,
+         )
+
+         for idx, new_out in zip(need_force, results_force):
+             results[idx] = new_out
+
+         return results
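The routing logic in `prefixLLM.py` can be exercised without vLLM or a GPU. The sketch below restates its two pure helpers (`needs_newline`, `add_prefix`) and the `<specialLong>` detection as standalone functions; the `<eos>` and `<think>` strings are placeholder tokens for illustration, not the model's actual vocabulary.

```python
import re

_TAIL_WS_RE = re.compile(r"(?:\r?\n|\s)+$")


def needs_newline(text: str) -> bool:
    # True when the rendered prompt does not already end in whitespace/newline
    return _TAIL_WS_RE.search(text[-8:]) is None


def add_prefix(prompt: str, prefix: str, eos_token: str) -> str:
    # Inject `prefix` before a trailing EOS token, if the template appended one
    if prompt.endswith(eos_token):
        return prompt[:-len(eos_token)] + prefix + eos_token
    return prompt + prefix


def needs_force(first_output: str) -> bool:
    # route_chat regenerates any sample whose first 100 chars contain the marker
    return "<specialLong>" in first_output[:100]


print(needs_newline("User: hi"))                          # True: no trailing newline
print(add_prefix("prompt<eos>", "<think>\n", "<eos>"))    # prefix lands before <eos>
print(needs_force("<specialLong> Let me think..."))       # True: regenerate this one
```

The EOS-preserving injection matters because some chat templates end the rendered prompt with an EOS token: appending the trigger after it would place the prefix outside the sequence the model actually continues.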