---
license: mit
base_model: Qwen/Qwen2.5-1.5B-Instruct
language:
- en
- uz
- ru
- kk
- kaa
tags:
- queryshield
- prompt-optimization
- multilingual
- instruction-tuning
- lora
- qlora
- qwen2.5
- uzbek
- karakalpak
- kazakh
- central-asia
- fine-tuned
pipeline_tag: text-generation
datasets:
- nickoo004/queryshield-multilingual
---

# QueryShield — Multilingual Prompt Optimizer

**QueryShield-1.5B** is a fine-tuned version of [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) trained to rewrite raw, messy user queries into detailed, structured instruction prompts for downstream LLMs — across 5 languages and 30 professional domains.

> Given a raw user question → outputs an expert-level optimized prompt telling a downstream LLM *how* to answer it.

---

## What it does

Most LLMs perform significantly better when given structured, detailed prompts rather than raw user input. QueryShield sits **between the user and the LLM** — it takes the raw query and rewrites it into a high-quality instruction prompt automatically.

```
| User: "menga diabetni boshqarish uchun ovqat rejimi ayting" |
| ↓ QueryShield |
| Optimized: "As a Medical Expert, the user is asking in Uzbek about dietary |
| management for diabetes with high blood sugar. Provide a structured |
| 3-tier response covering: diabetes basics, dietary assessment, and |
| an actionable meal plan. Respond entirely in Uzbek. Avoid jargon..." |
| ↓ Downstream LLM |
| Final answer in Uzbek ✅ |
| ``` |
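
In code, the flow above is just two calls: QueryShield produces the instruction prompt, and any instruction-tuned chat model answers it. A minimal sketch of the chaining, where the downstream model id is an arbitrary example and `optimize_prompt` is the helper defined in the Quick Start below:

```python
# Two-stage pipeline sketch: QueryShield -> downstream LLM.
# `optimize_prompt` is defined in the Quick Start section below;
# the downstream model id is an illustrative choice, not a requirement.
from transformers import pipeline

downstream = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-7B-Instruct",  # any instruction-tuned chat model
    device_map="auto",
)

def answer(raw_query, in_lang, out_lang, role):
    # Stage 1: rewrite the raw query into a structured instruction prompt.
    optimized = optimize_prompt(raw_query, in_lang, out_lang, role)
    # Stage 2: have the downstream model answer the optimized prompt.
    result = downstream([{"role": "user", "content": optimized}], max_new_tokens=1024)
    return result[0]["generated_text"][-1]["content"]
```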

---

## Model Details

| Property | Value |
|---|---|
| **Base model** | Qwen/Qwen2.5-1.5B-Instruct |
| **Training data** | [QueryShield Multilingual Dataset](https://huggingface.co/datasets/nickoo004/queryshield-multilingual) |
| **Training rows** | 19,530 |
| **Epochs** | 3 |
| **Train loss** | 0.88 → 0.47 |
| **Eval loss** | 0.967 (best checkpoint) |
| **GPU** | NVIDIA RTX 3090 24GB |
| **Training time** | ~3.7 hours |
| **Parameters** | 1.5B total / 147M trainable (8.7%) |
| **Live demo** | [▶ Kaggle Notebook](https://www.kaggle.com/code/nursultankoshekbaev/queryshield-1-5b) |

---

## Languages

| Language | Code | Support |
|---|---|---|
| English | `en` | ✅ Full |
| Uzbek | `uz` | ✅ Full |
| Russian | `ru` | ✅ Full |
| Kazakh | `kk` | ✅ Full |
| Karakalpak | `kaa` | ✅ Good |

**Cross-lingual** scenarios are supported: the user can write in one language and request output in another (e.g., Uzbek input → Russian output).
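
For example, the diabetes query from the diagram above can request a Russian answer instead, using the `optimize_prompt` helper defined in the Quick Start below:

```python
# Cross-lingual call: Uzbek question, optimized prompt targets a Russian answer.
# optimize_prompt is defined in the Quick Start section below.
result = optimize_prompt(
    user_question="menga diabetni boshqarish uchun ovqat rejimi ayting",
    input_language="Uzbek",
    output_language="Russian",
    role="Medical Expert",
)
```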

---

## Quick Start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "nickoo004/queryshield-1.5b"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# System template with QueryShield's three control fields:
# input language, response language, and expert role.
SYSTEM = (
    "You are QueryShield, a multilingual prompt optimizer. "
    "Given a raw user question, rewrite it into a detailed instruction "
    "prompt for a downstream LLM expert. "
    "User language: {in_lang}. Response language: {out_lang}. "
    "Expert role: {role}."
)

def optimize_prompt(user_question, input_language, output_language, role):
    messages = [
        {"role": "system", "content": SYSTEM.format(
            in_lang=input_language,
            out_lang=output_language,
            role=role,
        )},
        {"role": "user", "content": user_question},
    ]
    # Render with the Qwen chat template, leaving the assistant turn open.
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=512,
            temperature=0.7,
            do_sample=True,
            repetition_penalty=1.1,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Drop the prompt tokens and decode only the generated optimized prompt.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


# Example 1 — Uzbek monolingual
# ("tell me the best meal plan for managing diabetes")
result = optimize_prompt(
    user_question="menga diabetni boshqarish uchun eng yaxshi ovqatlanish rejimini ayting",
    input_language="Uzbek",
    output_language="Uzbek",
    role="Medical Expert",
)
print(result)

# Example 2 — Cross-lingual: Kazakh -> Uzbek
# ("the soil quality on my farm is poor, what should I do?")
result = optimize_prompt(
    user_question="менің фермамда топырақ сапасы нашар, не істеуім керек?",
    input_language="Kazakh",
    output_language="Uzbek",
    role="Agricultural Scientist",
)
print(result)
```
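
The bfloat16 load above needs roughly 3 GB of VRAM; for even smaller GPUs, a 4-bit quantized load is an option. A sketch assuming the `bitsandbytes` package is installed (an optional variant, not how the card's examples were produced):

```python
# Optional: 4-bit quantized load for low-VRAM GPUs (requires the
# bitsandbytes package); everything else in the Quick Start is unchanged.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "nickoo004/queryshield-1.5b",
    quantization_config=bnb_config,
    device_map="auto",
)
```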

---

## Live Demo

**[▶ Run on Kaggle](https://www.kaggle.com/code/nursultankoshekbaev/queryshield-1-5b)** — no setup needed, free GPU included.

The notebook tests all 7 cases: one monolingual example each for English, Uzbek, Russian, Kazakh, and Karakalpak, plus 2 cross-lingual pairs.

---

## Supported Domains (30 total)

| Domain | Expert Role |
|---|---|
| Software Engineering | Senior Software Engineer |
| Healthcare & Medicine | Medical Expert |
| Finance & Banking | Financial Analyst |
| Legal & Law | Legal Advisor |
| Data Science & AI | Data Scientist |
| Cybersecurity | Cybersecurity Specialist |
| Aviation & Aerospace | Aerospace Engineer |
| Agriculture | Agricultural Scientist |
| Education & Teaching | Experienced Educator |
| Automotive | Automotive Engineer |
| Pharmaceuticals | Pharmaceutical Researcher |
| Manufacturing | Manufacturing Expert |
| Civil / Mechanical / Electrical Engineering | Domain Engineer |
| Business & Marketing | Business Strategist |
| Creative Writing | Professional Writer |
| … and 15 more | … |

---

## Training Details

### Dataset
- **Source:** [nickoo004/queryshield-multilingual](https://huggingface.co/datasets/nickoo004/queryshield-multilingual) (loadable as shown below)
- **19,530 rows** across 5 languages and 30 domains
- Generated by DeepSeek, Gemini, and Qwen2.5-14B
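
Since the dataset is public, the training rows can be inspected directly. A quick look with the `datasets` library, assuming the default `train` split:

```python
# Peek at the training data (assumes the default "train" split).
from datasets import load_dataset

ds = load_dataset("nickoo004/queryshield-multilingual", split="train")
print(ds)     # row count and column names
print(ds[0])  # one raw-query -> optimized-prompt pair
```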

### Loss Curve
```
Epoch 1.0 -> train: 1.023 | eval: 0.997
Epoch 2.5 -> train: 0.731 | eval: 0.967   <- best checkpoint
```
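
The `lora` / `qlora` tags and the 147M trainable-parameter figure indicate adapter-based fine-tuning, but the exact training configuration is not published on this card. Purely for orientation, a typical QLoRA setup with PEFT is sketched below; every hyperparameter is an assumption, with the rank chosen only so the trainable count lands near the reported 147M:

```python
# Illustrative QLoRA-style setup with PEFT. All hyperparameters are
# assumptions for orientation, NOT the configuration used to train
# QueryShield-1.5B.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)

lora = LoraConfig(
    r=128,           # assumed rank; r=128 over the modules below gives
    lora_alpha=256,  # roughly the 147M trainable parameters reported above
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # compare with the table in Model Details
```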

---

## Limitations

- Karakalpak is a low-resource language with limited training data, so its support is functional but may be less consistent than for the other four languages
- The `optimized_prompt` output is always an English instruction, regardless of the input language; this is by design: the target response language is specified inside the prompt itself (e.g., "Respond entirely in Uzbek")
- Best results come on domains covered in the training data; novel domains may produce generic prompts
- Not suitable for optimizing harmful, illegal, or unethical queries

---

## Citation

```bibtex
@misc{queryshield_1_5b_2026,
  author = {nickoo004},
  title = {QueryShield-1.5B: Multilingual Prompt Optimizer},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/nickoo004/queryshield-1.5b}
}
```

---

## License

This model is released under the **MIT License**.
Base model license: [Apache 2.0](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct/blob/main/LICENSE)