---
license: mit
base_model: Qwen/Qwen2.5-1.5B-Instruct
language:
  - en
  - uz
  - ru
  - kk
  - kaa
tags:
  - queryshield
  - prompt-optimization
  - multilingual
  - instruction-tuning
  - lora
  - qlora
  - qwen2.5
  - uzbek
  - karakalpak
  - kazakh
  - central-asia
  - fine-tuned
pipeline_tag: text-generation
datasets:
  - nickoo004/queryshield-multilingual
---

# QueryShield — Multilingual Prompt Optimizer

**QueryShield-1.5B** is a fine-tuned version of [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) trained to rewrite raw, messy user queries into detailed, structured instruction prompts for downstream LLMs — across 5 languages and 30 professional domains.

> Given a raw user question → outputs an expert-level optimized prompt telling a downstream LLM *how* to answer it.

---

## What it does

Most LLMs perform significantly better when given structured, detailed prompts rather than raw user input. QueryShield sits **between the user and the LLM** — it takes the raw query and rewrites it into a high-quality instruction prompt automatically.

```
User: "menga diabetni boshqarish uchun ovqat rejimi ayting"
         ↓  QueryShield
Optimized: "As a Medical Expert, the user is asking in Uzbek about dietary
            management for diabetes with high blood sugar. Provide a structured
            3-tier response covering: diabetes basics, dietary assessment, and
            an actionable meal plan. Respond entirely in Uzbek. Avoid jargon..."
         ↓  Downstream LLM
Final answer in Uzbek ✅
```

---

## Model Details

| Property | Value |
|---|---|
| **Base model** | Qwen/Qwen2.5-1.5B-Instruct |
| **Training data** | [QueryShield Multilingual Dataset](https://huggingface.co/datasets/nickoo004/queryshield-multilingual) |
| **Training rows** | 19,530 |
| **Epochs** | 3 |
| **Train loss** | 0.88 → 0.47 |
| **Eval loss** | 0.967 (best checkpoint) |
| **GPU** | NVIDIA RTX 3090 24GB |
| **Training time** | ~3.7 hours |
| **Parameters** | 1.5B total / 147M trainable (8.7%) |
| **Live demo** | [▶ Kaggle Notebook](https://www.kaggle.com/code/nursultankoshekbaev/queryshield-1-5b) |
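
The trainable-parameter split above points to a QLoRA setup: a 4-bit quantized base with LoRA adapters, matching the `lora`/`qlora` tags. Below is a minimal sketch of such a configuration using `peft` and `bitsandbytes`; the rank, alpha, and target modules are illustrative assumptions (r=128 over all attention and MLP projections lands near the 147M trainable figure), not the published recipe.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# QLoRA: keep the frozen base weights in 4-bit NF4, compute in bf16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Illustrative LoRA hyperparameters; not the published configuration.
lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # reports trainable vs. total parameters
```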

---

## Languages

| Language | Code | Support |
|---|---|---|
| English | `en` | ✅ Full |
| Uzbek | `uz` | ✅ Full |
| Russian | `ru` | ✅ Full |
| Kazakh | `kk` | ✅ Full |
| Karakalpak | `kaa` | ✅ Good |

**Cross-lingual** scenarios are supported — the user can write in one language and request output in another (e.g., Uzbek input → Russian output); see Example 2 in the Quick Start below.

---

## Quick Start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "nickoo004/queryshield-1.5b"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

SYSTEM = (
    "You are QueryShield, a multilingual prompt optimizer. "
    "Given a raw user question, rewrite it into a detailed instruction "
    "prompt for a downstream LLM expert. "
    "User language: {in_lang}. Response language: {out_lang}. "
    "Expert role: {role}."
)

def optimize_prompt(user_question, input_language, output_language, role):
    messages = [
        {"role": "system", "content": SYSTEM.format(
            in_lang=input_language,
            out_lang=output_language,
            role=role,
        )},
        {"role": "user", "content": user_question},
    ]
    # Render the chat template and leave the assistant turn open for generation.
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=512,
            temperature=0.7,
            do_sample=True,
            repetition_penalty=1.1,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Strip the prompt tokens so only the newly generated text is returned.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


# Example 1 — Uzbek monolingual
# ("Tell me the best diet plan for managing diabetes")
result = optimize_prompt(
    user_question="menga diabetni boshqarish uchun eng yaxshi ovqatlanish rejimini ayting",
    input_language="Uzbek",
    output_language="Uzbek",
    role="Medical Expert",
)
print(result)

# Example 2 — Cross-lingual: Kazakh -> Uzbek
# ("The soil quality on my farm is poor, what should I do?")
result = optimize_prompt(
    user_question="менің фермамда топырақ сапасы нашар, не істеуім керек?",
    input_language="Kazakh",
    output_language="Uzbek",
    role="Agricultural Scientist",
)
print(result)
```
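
To complete the user → QueryShield → downstream-LLM pipeline shown in "What it does", pass the optimized prompt to any instruction-tuned chat model. A minimal sketch, reusing `optimize_prompt` from above; the downstream model id here is an arbitrary example, not a requirement.

```python
# Stage 2: answer the optimized prompt with a (typically larger) downstream model.
downstream_id = "Qwen/Qwen2.5-7B-Instruct"  # arbitrary example choice
ds_tokenizer = AutoTokenizer.from_pretrained(downstream_id)
ds_model = AutoModelForCausalLM.from_pretrained(
    downstream_id, torch_dtype=torch.bfloat16, device_map="auto"
)

optimized = optimize_prompt(
    user_question="menga diabetni boshqarish uchun ovqat rejimi ayting",
    input_language="Uzbek",
    output_language="Uzbek",
    role="Medical Expert",
)

messages = [{"role": "user", "content": optimized}]
text = ds_tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = ds_tokenizer(text, return_tensors="pt").to(ds_model.device)
answer = ds_model.generate(**inputs, max_new_tokens=1024)
print(ds_tokenizer.decode(
    answer[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
))
```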

---

## Live Demo

**[▶ Run on Kaggle](https://www.kaggle.com/code/nursultankoshekbaev/queryshield-1-5b)** — no setup needed, free GPU included.

The notebook runs all 7 test cases: five monolingual (English, Uzbek, Russian, Kazakh, Karakalpak) plus two cross-lingual pairs.

---

## Supported Domains (30 total)

| Domain | Expert Role |
|---|---|
| Software Engineering | Senior Software Engineer |
| Healthcare & Medicine | Medical Expert |
| Finance & Banking | Financial Analyst |
| Legal & Law | Legal Advisor |
| Data Science & AI | Data Scientist |
| Cybersecurity | Cybersecurity Specialist |
| Aviation & Aerospace | Aerospace Engineer |
| Agriculture | Agricultural Scientist |
| Education & Teaching | Experienced Educator |
| Automotive | Automotive Engineer |
| Pharmaceuticals | Pharmaceutical Researcher |
| Manufacturing | Manufacturing Expert |
| Civil / Mechanical / Electrical Engineering | Domain Engineer |
| Business & Marketing | Business Strategist |
| Creative Writing | Professional Writer |
| … and 15 more | … |

---

## Training Details

### Dataset
- **Source:** [nickoo004/queryshield-multilingual](https://huggingface.co/datasets/nickoo004/queryshield-multilingual)
- **19,530 rows** across 5 languages and 30 domains
- Generated by DeepSeek, Gemini, and Qwen2.5-14B
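
For a quick look at the data itself, the dataset can be loaded straight from the Hub with the `datasets` library; a minimal sketch (split and column names are whatever the `print` reports, not assumed here):

```python
from datasets import load_dataset

# Pull the training data from the Hub and show its splits, columns, and row counts.
ds = load_dataset("nickoo004/queryshield-multilingual")
print(ds)
```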

### Loss Curve
```
Epoch 1.0  ->  train: 1.023  |  eval: 0.997
Epoch 2.5  ->  train: 0.731  |  eval: 0.967  <- best checkpoint
```

---

## Limitations

- Karakalpak support is functional but may be less consistent than other languages due to limited training data for this low-resource language
- `optimized_prompt` output is always structured as an English instruction — this is by design
- Best results on domains covered in training data; novel domains may produce generic prompts
- Not suitable for harmful, illegal, or unethical query optimization

---

## Citation

```bibtex
@misc{queryshield_1_5b_2026,
  author    = {nickoo004},
  title     = {QueryShield-1.5B: Multilingual Prompt Optimizer},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/nickoo004/queryshield-1.5b}
}
```

---

## License

This model is released under the **MIT License**.
Base model license: [Apache 2.0](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct/blob/main/LICENSE)