SearchLM: RLHF-Trained Search Query Generator

This model is a fine-tuned version of Qwen/Qwen2.5-3B-Instruct, trained with Group Relative Policy Optimization (GRPO) and verifiable rewards to generate effective Boolean search queries.

Model Description

SearchLM uses Reinforcement Learning with Verifiable Rewards (RLVR) to train language models to generate effective boolean search queries for information retrieval tasks. The model is optimized using real search evaluation metrics (NDCG and MRR) as rewards.

  • Base Model: Qwen/Qwen2.5-3B-Instruct
  • Training Method: Group Relative Policy Optimization (GRPO)
  • Checkpoint: final
  • Reward Function: Weighted combination of NDCG and MRR from actual search results
  • Datasets: NFCorpus and SciFact (MTEB)
  • Task: Boolean search query generation

Training Details

Training Configuration

  • Learning Rate: 1e-06
  • Epochs: 3
  • Batch Size: 1 (colocate) / 8 (server)
  • Gradient Accumulation: 16 (colocate) / 4 (server)
  • Precision: bf16
  • Gradient Checkpointing: True
  • Max New Tokens: 2048
  • Num Generations: 2

Reward Function

  • NDCG Weight: 0.6
  • MRR Weight: 0.4
  • Evaluation K: 100
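The reward used during GRPO training can be sketched as a simple weighted sum of the two retrieval metrics listed above. The function below is a minimal illustration of that combination (the actual training code may normalize or clip differently):

```python
def compute_reward(ndcg: float, mrr: float,
                   ndcg_weight: float = 0.6, mrr_weight: float = 0.4) -> float:
    """Weighted combination of NDCG and MRR used as the verifiable reward.

    Both inputs are assumed to already be in [0, 1], as returned by a
    standard IR evaluator, so the reward is also in [0, 1].
    """
    return ndcg_weight * ndcg + mrr_weight * mrr
```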

Evaluation Results

Performance comparison between the base model and RLHF-trained model across SciFact and NFCorpus datasets. Results show mean ± standard deviation across 3 evaluation runs.

SciFact Dataset

| Model | NDCG@10 | NDCG@100 | MRR | MAP | Precision@10 | Recall@10 |
|---|---|---|---|---|---|---|
| Base (Qwen2.5-3B-Instruct) | 0.1644 ± 0.0204 | 0.1718 ± 0.0156 | 0.1523 ± 0.0183 | 0.1502 ± 0.0162 | 0.0413 ± 0.0069 | 0.2016 ± 0.0279 |
| RLHF (searchlm-qwen2.5-3b-rlhf) | 0.6512 ± 0.0040 | 0.6696 ± 0.0032 | 0.6092 ± 0.0045 | 0.6009 ± 0.0041 | 0.0870 ± 0.0005 | 0.7839 ± 0.0025 |
| Improvement | +296.0% | +289.7% | +300.1% | +300.1% | +110.6% | +288.8% |

NFCorpus Dataset

| Model | NDCG@10 | NDCG@100 | MRR | MAP | Precision@10 | Recall@10 |
|---|---|---|---|---|---|---|
| Base (Qwen2.5-3B-Instruct) | 0.3345 ± 0.0076 | 0.3355 ± 0.0080 | 0.3197 ± 0.0081 | 0.2834 ± 0.0035 | 0.2093 ± 0.0102 | 0.0853 ± 0.0008 |
| RLHF (searchlm-qwen2.5-3b-rlhf) | 0.5502 ± 0.0031 | 0.5448 ± 0.0017 | 0.5338 ± 0.0050 | 0.4114 ± 0.0021 | 0.2982 ± 0.0024 | 0.1577 ± 0.0003 |
| Improvement | +64.5% | +62.4% | +66.9% | +45.2% | +42.5% | +84.9% |

Usage

Prerequisites

uv add transformers torch vllm trl[vllm] datasets omegaconf

Basic Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Supreeth/searchlm-qwen2.5-3b-rlhf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# System prompt for query generation
system_prompt = """You are a search expert. Given a user's information need, generate an effective boolean search query using AND, OR, NOT operators and parentheses for grouping. The query should be precise and retrieve relevant documents.

Guidelines:
- Use AND to require multiple terms
- Use OR for synonyms or alternatives  
- Use NOT to exclude irrelevant terms
- Use parentheses for grouping complex logic
- Keep queries focused and not overly complex

Format your response with the query inside <query></query> tags."""

# User query
user_query = "What are the latest treatments for Type 2 diabetes?"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_query}
]

# Generate the Boolean search query
inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True,
)
# do_sample=True is required for temperature to take effect
outputs = model.generate(inputs, max_new_tokens=1024, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)

print(response)

Using with vLLM (Recommended)

from vllm import LLM, SamplingParams

model_name = "Supreeth/searchlm-qwen2.5-3b-rlhf"
llm = LLM(model=model_name)

system_prompt = """You are a search expert. Given a user's information need, generate an effective boolean search query..."""

prompts = [
    f"<|im_start|>system\n{system_prompt}<|im_end|>\n<|im_start|>user\nWhat are the latest treatments for Type 2 diabetes?<|im_end|>\n<|im_start|>assistant\n"
]

sampling_params = SamplingParams(temperature=0.7, max_tokens=1024)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)

Evaluation

The model is evaluated on standard information retrieval datasets (NFCorpus and SciFact) using the following metrics:

  • NDCG@10, NDCG@100: Normalized Discounted Cumulative Gain
  • MRR: Mean Reciprocal Rank
  • Precision@10: Precision at top 10 results
  • Recall@10: Recall at top 10 results
  • MAP: Mean Average Precision
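For readers unfamiliar with these metrics, MRR and NDCG@k can be computed from a ranked list of binary relevance labels as follows. This is a simplified sketch assuming 0/1 relevance; the reported numbers come from standard IR evaluation tooling, which also handles graded relevance:

```python
import math

def mrr(relevances: list[int]) -> float:
    """Reciprocal rank of the first relevant result in a ranked list."""
    for rank, rel in enumerate(relevances, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(relevances: list[int], k: int) -> float:
    """NDCG@k: DCG of the ranking divided by the DCG of the ideal ranking."""
    dcg = sum(rel / math.log2(rank + 1)
              for rank, rel in enumerate(relevances[:k], start=1))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(ideal[:k], start=1))
    return dcg / idcg if idcg > 0 else 0.0
```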

Training Data

The model was trained on:

  • NFCorpus: Medical information retrieval dataset
  • SciFact: Scientific fact-checking dataset

Both datasets are from the MTEB (Massive Text Embedding Benchmark) collection.

Limitations and Bias

  • The model is specifically trained for scientific and medical domains (NFCorpus and SciFact)
  • Performance may vary on other domains
  • Boolean query syntax is optimized for full-text search engines (e.g., Tantivy)
  • Generated queries may need domain-specific tuning for production use

Citation

If you use this model, please cite:

@misc{searchlm2025,
  author = {Supreeth Rao},
  title = {SearchLM: Reinforcement Learning with Verifiable Rewards for Search Query Generation},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Supreeth/searchlm-qwen2.5-3b-rlhf}}
}

License

MIT License
