SearchLM: RLHF-Trained Search Query Generator

This model is a fine-tuned version of Qwen/Qwen2.5-3B-Instruct, trained with Group Relative Policy Optimization (GRPO) and verifiable rewards to generate effective Boolean search queries.

Model Description

SearchLM uses Reinforcement Learning with Verifiable Rewards (RLVR) to train language models to generate effective boolean search queries for information retrieval tasks. The model is optimized using real search evaluation metrics (NDCG and MRR) as rewards.

  • Base Model: Qwen/Qwen2.5-3B-Instruct
  • Training Method: Group Relative Policy Optimization (GRPO)
  • Checkpoint: final
  • Reward Function: Weighted combination of NDCG and MRR from actual search results
  • Datasets: NFCorpus and SciFact (MTEB)
  • Task: Boolean search query generation

Training Details

Training Configuration

  • Learning Rate: 1e-06
  • Epochs: 3
  • Batch Size: 1 (colocate) / 8 (server)
  • Gradient Accumulation: 16 (colocate) / 4 (server)
  • Precision: bf16
  • Gradient Checkpointing: True
  • Max New Tokens: 2048
  • Num Generations: 2

Reward Function

  • NDCG Weight: 0.6
  • MRR Weight: 0.4
  • Evaluation K: 100
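The reward used during GRPO training can be sketched as a simple weighted sum of the two retrieval metrics listed above. The function below is a minimal illustration of that combination (the actual training code may normalize or clip differently):

```python
def compute_reward(ndcg: float, mrr: float,
                   ndcg_weight: float = 0.6, mrr_weight: float = 0.4) -> float:
    """Weighted combination of NDCG and MRR used as the verifiable reward.

    Both inputs are assumed to already be in [0, 1], as returned by a
    standard IR evaluator, so the reward is also in [0, 1].
    """
    return ndcg_weight * ndcg + mrr_weight * mrr
```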

Evaluation Results

Performance comparison between the base model and RLHF-trained model across SciFact and NFCorpus datasets. Results show mean ± standard deviation across 3 evaluation runs.

SciFact Dataset

| Model | NDCG@10 | NDCG@100 | MRR | MAP | Precision@10 | Recall@10 |
|---|---|---|---|---|---|---|
| Base (Qwen2.5-3B-Instruct) | 0.1644 ± 0.0204 | 0.1718 ± 0.0156 | 0.1523 ± 0.0183 | 0.1502 ± 0.0162 | 0.0413 ± 0.0069 | 0.2016 ± 0.0279 |
| RLHF (searchlm-qwen2.5-3b-rlhf) | 0.6512 ± 0.0040 | 0.6696 ± 0.0032 | 0.6092 ± 0.0045 | 0.6009 ± 0.0041 | 0.0870 ± 0.0005 | 0.7839 ± 0.0025 |
| Improvement | +296.0% | +289.7% | +300.1% | +300.1% | +110.6% | +288.8% |

NFCorpus Dataset

| Model | NDCG@10 | NDCG@100 | MRR | MAP | Precision@10 | Recall@10 |
|---|---|---|---|---|---|---|
| Base (Qwen2.5-3B-Instruct) | 0.3345 ± 0.0076 | 0.3355 ± 0.0080 | 0.3197 ± 0.0081 | 0.2834 ± 0.0035 | 0.2093 ± 0.0102 | 0.0853 ± 0.0008 |
| RLHF (searchlm-qwen2.5-3b-rlhf) | 0.5502 ± 0.0031 | 0.5448 ± 0.0017 | 0.5338 ± 0.0050 | 0.4114 ± 0.0021 | 0.2982 ± 0.0024 | 0.1577 ± 0.0003 |
| Improvement | +64.5% | +62.4% | +66.9% | +45.2% | +42.5% | +84.9% |

Usage

Prerequisites

uv add transformers torch vllm trl[vllm] datasets omegaconf

Basic Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Supreeth/searchlm-qwen2.5-3b-rlhf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# System prompt for query generation
system_prompt = """You are a search expert. Given a user's information need, generate an effective boolean search query using AND, OR, NOT operators and parentheses for grouping. The query should be precise and retrieve relevant documents.

Guidelines:
- Use AND to require multiple terms
- Use OR for synonyms or alternatives  
- Use NOT to exclude irrelevant terms
- Use parentheses for grouping complex logic
- Keep queries focused and not overly complex

Format your response with the query inside <query></query> tags."""

# User query
user_query = "What are the latest treatments for Type 2 diabetes?"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_query}
]

# Generate the Boolean search query
inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True,
)
# do_sample=True is required for temperature to take effect
outputs = model.generate(inputs, max_new_tokens=1024, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)

print(response)

Using with vLLM (Recommended)

from vllm import LLM, SamplingParams

model_name = "Supreeth/searchlm-qwen2.5-3b-rlhf"
llm = LLM(model=model_name)

system_prompt = """You are a search expert. Given a user's information need, generate an effective boolean search query..."""

prompts = [
    f"<|im_start|>system\n{system_prompt}<|im_end|>\n<|im_start|>user\nWhat are the latest treatments for Type 2 diabetes?<|im_end|>\n<|im_start|>assistant\n"
]

sampling_params = SamplingParams(temperature=0.7, max_tokens=1024)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)

Evaluation

The model is evaluated on standard information retrieval datasets (NFCorpus and SciFact) using the following metrics:

  • NDCG@10, NDCG@100: Normalized Discounted Cumulative Gain
  • MRR: Mean Reciprocal Rank
  • Precision@10: Precision at top 10 results
  • Recall@10: Recall at top 10 results
  • MAP: Mean Average Precision
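For readers unfamiliar with these metrics, MRR and NDCG@k can be computed from a ranked list of binary relevance labels as follows. This is a simplified sketch assuming 0/1 relevance; the reported numbers come from standard IR evaluation tooling, which also handles graded relevance:

```python
import math

def mrr(relevances: list[int]) -> float:
    """Reciprocal rank of the first relevant result in a ranked list."""
    for rank, rel in enumerate(relevances, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(relevances: list[int], k: int) -> float:
    """NDCG@k: DCG of the ranking divided by the DCG of the ideal ranking."""
    dcg = sum(rel / math.log2(rank + 1)
              for rank, rel in enumerate(relevances[:k], start=1))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(ideal[:k], start=1))
    return dcg / idcg if idcg > 0 else 0.0
```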

Training Data

The model was trained on:

  • NFCorpus: Medical information retrieval dataset
  • SciFact: Scientific fact-checking dataset

Both datasets are from the MTEB (Massive Text Embedding Benchmark) collection.

Limitations and Bias

  • The model is specifically trained for scientific and medical domains (NFCorpus and SciFact)
  • Performance may vary on other domains
  • Boolean query syntax is optimized for full-text search engines (e.g., Tantivy)
  • Generated queries may need domain-specific tuning for production use

Citation

If you use this model, please cite:

@misc{searchlm2025,
  author = {Supreeth Rao},
  title = {SearchLM: Reinforcement Learning with Verifiable Rewards for Search Query Generation},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Supreeth/searchlm-qwen2.5-3b-rlhf}}
}

License

MIT License
