# Azure Advisor Qwen2.5-0.5B (GRPO)

Fine-tuned **Qwen/Qwen2.5-0.5B-Instruct** with SFT followed by Group Relative Policy Optimization (GRPO) to generate Azure Advisor-style recommendations.
## Model Description

This model builds on the SFT checkpoint with additional reward-based training using GRPO (rejection sampling + iterative SFT). It generates structured recommendations across 5 Azure Advisor categories:

- **Cost** - Cost optimization recommendations
- **Security** - Security posture improvements
- **Performance** - Performance optimization suggestions
- **OperationalExcellence** - Operational best practices
- **HighAvailability** - Reliability and availability improvements
## Training Pipeline

### Phase 1: SFT (Supervised Fine-Tuning)

- 200 training steps on 348 examples
- Loss: 1.76 -> 0.029
- Baseline reward: 0.80/10 -> 3.72/10

### Phase 2: GRPO (Group Relative Policy Optimization)

- 3 iterations of rejection sampling + iterative SFT
- 25 training prompts per iteration, 4 samples per prompt
- Top-2 high-reward samples kept per prompt
- 30 SFT steps per iteration
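The rejection-sampling loop above can be sketched in a few lines. This is a toy, self-contained version: `reward` is a stand-in for the five weighted reward functions described below, and completions are faked strings rather than real `model.generate()` calls.

```python
import random

random.seed(0)

def reward(sample: str) -> float:
    # Toy stand-in: the real pipeline scores each sample with the
    # 5 weighted reward functions (max 10.0 total).
    return len(set(sample)) / 4.0

def rejection_sampling_iteration(prompts, n_samples=4, top_k=2):
    """One GRPO iteration: draw n_samples completions per prompt,
    keep the top_k by reward, and return them as new SFT data."""
    sft_data = []
    for prompt in prompts:
        # In the real pipeline these come from model.generate();
        # fake completions keep the sketch runnable.
        samples = ["".join([prompt] + random.choices("abcdef", k=8))
                   for _ in range(n_samples)]
        scored = sorted(samples, key=reward, reverse=True)
        sft_data.extend(scored[:top_k])  # top-2 high-reward samples kept
    return sft_data

data = rejection_sampling_iteration(["prompt 1: ", "prompt 2: "])
print(len(data))  # 2 prompts x top-2 = 4 new SFT examples
```

Each iteration then runs 30 SFT steps on the collected high-reward samples before generating again, which is why the loss and reward tables below show one row per iteration.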
| Iteration | Training Loss | Generation Avg Reward |
|-----------|---------------|-----------------------|
| 1         | 0.069         | 4.19                  |
| 2         | 0.048         | 4.30                  |
| 3         | 0.040         | 4.44                  |
### Hill Climbing Verification

- Pre-SFT baseline: 0.80/10
- Post-SFT: 3.43/10 (+2.63)
- GRPO iteration losses: 0.069 -> 0.048 -> 0.040 (decreasing)
- GRPO generation rewards: 4.19 -> 4.30 -> 4.44 (increasing)

**Hill climbing confirmed:** losses decrease and generation rewards increase monotonically across iterations.
## Training Configuration

| Parameter           | Value                           |
|---------------------|---------------------------------|
| Base Model          | Qwen/Qwen2.5-0.5B-Instruct      |
| Method              | SFT + GRPO (rejection sampling) |
| LoRA Rank / Alpha   | 16 / 32                         |
| Quantization        | 4-bit QLoRA (NF4)               |
| GRPO Iterations     | 3                               |
| Samples per Prompt  | 4                               |
| Top-K Selection     | 2                               |
| Steps per Iteration | 30                              |
| Learning Rate       | 5e-5                            |
| Hardware            | NVIDIA RTX 3090 (24 GB)         |
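The quantization and LoRA rows in the table map to standard `peft` / `bitsandbytes` config objects. This is a sketch, not the exact training config: `target_modules` and `lora_dropout` are assumptions, since the card does not list them.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization for QLoRA, per the table above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# LoRA rank 16 / alpha 32 as listed in the table
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    lora_dropout=0.05,  # assumption
    task_type="CAUSAL_LM",
)
```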
## 5 Reward Functions (max 10.0 total)

| Function             | Weight | Description                                                                  |
|----------------------|--------|------------------------------------------------------------------------------|
| Format Compliance    | 1.5    | Correct XML tags (`<ANALYSIS>`, `<RECOMMENDATIONS>`, `<SUMMARY>`) and valid JSON |
| Category Correctness | 2.0    | Valid Advisor categories (Cost, Security, Performance, etc.)                 |
| Grounding Quality    | 2.0    | Claims supported by input evidence; no hallucinated resource IDs             |
| Actionability        | 2.0    | Concrete, feasible next steps with specific Azure actions                    |
| Completeness         | 2.5    | Coverage of all issues with proper recommendation schema fields              |
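As an illustration of how a reward like Format Compliance might be scored, here is a minimal sketch that splits the 1.5 weight evenly across the three XML tags plus a valid-JSON check. The scoring split is an assumption; the actual reward implementation is not published in this card.

```python
import json
import re

def format_compliance_reward(text: str, weight: float = 1.5) -> float:
    """Toy format-compliance reward: checks the three XML tags and
    that the <RECOMMENDATIONS> block contains valid JSON."""
    tags = ["ANALYSIS", "RECOMMENDATIONS", "SUMMARY"]
    per_check = weight / (len(tags) + 1)  # one extra share for valid JSON
    score = 0.0
    for tag in tags:
        if f"<{tag}>" in text and f"</{tag}>" in text:
            score += per_check
    m = re.search(r"<RECOMMENDATIONS>(.*?)</RECOMMENDATIONS>", text, re.S)
    if m:
        try:
            json.loads(m.group(1))
            score += per_check
        except json.JSONDecodeError:
            pass
    return score

good = "<ANALYSIS>ok</ANALYSIS><RECOMMENDATIONS>[]</RECOMMENDATIONS><SUMMARY>s</SUMMARY>"
print(format_compliance_reward(good))  # 1.5 (all checks pass)
```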
## Usage

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "thegovind/azure-advisor-qwen25-0.5b-grpo")
tokenizer = AutoTokenizer.from_pretrained("thegovind/azure-advisor-qwen25-0.5b-grpo")

messages = [
    {"role": "system", "content": "You are an Azure Advisor assistant. Analyze the workload and provide recommendations."},
    {"role": "user", "content": "Analyze this Azure workload and provide recommendations..."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    # do_sample=True is required for temperature to take effect
    output = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
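Since the model emits its recommendations inside a `<RECOMMENDATIONS>` tag as JSON, downstream code typically parses that block out of the decoded text. A minimal helper (the `category` field shown is illustrative; see the reward table for the schema expectations):

```python
import json
import re

def extract_recommendations(generated: str):
    """Pull the JSON payload out of the <RECOMMENDATIONS> block,
    or return None if the block is missing or malformed."""
    m = re.search(r"<RECOMMENDATIONS>(.*?)</RECOMMENDATIONS>", generated, re.S)
    if not m:
        return None
    try:
        return json.loads(m.group(1))
    except json.JSONDecodeError:
        return None

sample = '<RECOMMENDATIONS>[{"category": "Cost"}]</RECOMMENDATIONS>'
print(extract_recommendations(sample))  # [{'category': 'Cost'}]
```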
## W&B Training Dashboard

## Related Resources