# Azure Advisor Qwen2.5-0.5B (GRPO)

Fine-tuned **Qwen/Qwen2.5-0.5B-Instruct** with SFT followed by Group Relative Policy Optimization (GRPO) to generate Azure Advisor-style recommendations.
## Model Description

This model builds on the SFT checkpoint with additional reward-based training using GRPO (rejection sampling + iterative SFT). It generates structured recommendations across 5 Azure Advisor categories:

- **Cost** - Cost optimization recommendations
- **Security** - Security posture improvements
- **Performance** - Performance optimization suggestions
- **OperationalExcellence** - Operational best practices
- **HighAvailability** - Reliability and availability improvements
## Training Pipeline

### Phase 1: SFT (Supervised Fine-Tuning)

- 200 training steps on 348 examples
- Loss: 1.76 -> 0.029
- Baseline reward: 0.80/10 -> 3.72/10

### Phase 2: GRPO (Group Relative Policy Optimization)

- 3 iterations of rejection sampling + iterative SFT
- 25 training prompts per iteration, 4 samples per prompt
- Top-2 high-reward samples kept per prompt
- 30 SFT steps per iteration
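The rejection-sampling loop above can be sketched in a few lines. This is a toy, self-contained version: `reward` is a stand-in for the five weighted reward functions described below, and completions are faked strings rather than real `model.generate()` calls.

```python
import random

random.seed(0)

def reward(sample: str) -> float:
    # Toy stand-in: the real pipeline scores each sample with the
    # 5 weighted reward functions (max 10.0 total).
    return len(set(sample)) / 4.0

def rejection_sampling_iteration(prompts, n_samples=4, top_k=2):
    """One GRPO iteration: draw n_samples completions per prompt,
    keep the top_k by reward, and return them as new SFT data."""
    sft_data = []
    for prompt in prompts:
        # In the real pipeline these come from model.generate();
        # fake completions keep the sketch runnable.
        samples = ["".join([prompt] + random.choices("abcdef", k=8))
                   for _ in range(n_samples)]
        scored = sorted(samples, key=reward, reverse=True)
        sft_data.extend(scored[:top_k])  # top-2 high-reward samples kept
    return sft_data

data = rejection_sampling_iteration(["prompt 1: ", "prompt 2: "])
print(len(data))  # 2 prompts x top-2 = 4 new SFT examples
```

Each iteration then runs 30 SFT steps on the collected high-reward samples before generating again, which is why the loss and reward tables below show one row per iteration.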
| Iteration | Training Loss | Generation Avg Reward |
|-----------|---------------|-----------------------|
| 1         | 0.069         | 4.19                  |
| 2         | 0.048         | 4.30                  |
| 3         | 0.040         | 4.44                  |
### Hill Climbing Verification

- Pre-SFT baseline: 0.80/10
- Post-SFT: 3.43/10 (+2.63)
- GRPO iteration losses: 0.069 -> 0.048 -> 0.040 (decreasing)
- GRPO generation rewards: 4.19 -> 4.30 -> 4.44 (increasing)

**Hill climbing confirmed:** losses decrease and generation rewards increase monotonically across iterations.
## Training Configuration

| Parameter           | Value                           |
|---------------------|---------------------------------|
| Base Model          | Qwen/Qwen2.5-0.5B-Instruct      |
| Method              | SFT + GRPO (rejection sampling) |
| LoRA Rank / Alpha   | 16 / 32                         |
| Quantization        | 4-bit QLoRA (NF4)               |
| GRPO Iterations     | 3                               |
| Samples per Prompt  | 4                               |
| Top-K Selection     | 2                               |
| Steps per Iteration | 30                              |
| Learning Rate       | 5e-5                            |
| Hardware            | NVIDIA RTX 3090 (24 GB)         |
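The quantization and LoRA rows in the table map to standard `peft` / `bitsandbytes` config objects. This is a sketch, not the exact training config: `target_modules` and `lora_dropout` are assumptions, since the card does not list them.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization for QLoRA, per the table above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# LoRA rank 16 / alpha 32 as listed in the table
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    lora_dropout=0.05,  # assumption
    task_type="CAUSAL_LM",
)
```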
## 5 Reward Functions (max 10.0 total)

| Function             | Weight | Description                                                                  |
|----------------------|--------|------------------------------------------------------------------------------|
| Format Compliance    | 1.5    | Correct XML tags (`<ANALYSIS>`, `<RECOMMENDATIONS>`, `<SUMMARY>`) and valid JSON |
| Category Correctness | 2.0    | Valid Advisor categories (Cost, Security, Performance, etc.)                 |
| Grounding Quality    | 2.0    | Claims supported by input evidence; no hallucinated resource IDs             |
| Actionability        | 2.0    | Concrete, feasible next steps with specific Azure actions                    |
| Completeness         | 2.5    | Coverage of all issues with proper recommendation schema fields              |
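As an illustration of how a reward like Format Compliance might be scored, here is a minimal sketch that splits the 1.5 weight evenly across the three XML tags plus a valid-JSON check. The scoring split is an assumption; the actual reward implementation is not published in this card.

```python
import json
import re

def format_compliance_reward(text: str, weight: float = 1.5) -> float:
    """Toy format-compliance reward: checks the three XML tags and
    that the <RECOMMENDATIONS> block contains valid JSON."""
    tags = ["ANALYSIS", "RECOMMENDATIONS", "SUMMARY"]
    per_check = weight / (len(tags) + 1)  # one extra share for valid JSON
    score = 0.0
    for tag in tags:
        if f"<{tag}>" in text and f"</{tag}>" in text:
            score += per_check
    m = re.search(r"<RECOMMENDATIONS>(.*?)</RECOMMENDATIONS>", text, re.S)
    if m:
        try:
            json.loads(m.group(1))
            score += per_check
        except json.JSONDecodeError:
            pass
    return score

good = "<ANALYSIS>ok</ANALYSIS><RECOMMENDATIONS>[]</RECOMMENDATIONS><SUMMARY>s</SUMMARY>"
print(format_compliance_reward(good))  # 1.5 (all checks pass)
```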
## Usage

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "thegovind/azure-advisor-qwen25-0.5b-grpo")
tokenizer = AutoTokenizer.from_pretrained("thegovind/azure-advisor-qwen25-0.5b-grpo")

messages = [
    {"role": "system", "content": "You are an Azure Advisor assistant. Analyze the workload and provide recommendations."},
    {"role": "user", "content": "Analyze this Azure workload and provide recommendations..."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    # do_sample=True is required for temperature to take effect
    output = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
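Since the model emits its recommendations inside a `<RECOMMENDATIONS>` tag as JSON, downstream code typically parses that block out of the decoded text. A minimal helper (the `category` field shown is illustrative; see the reward table for the schema expectations):

```python
import json
import re

def extract_recommendations(generated: str):
    """Pull the JSON payload out of the <RECOMMENDATIONS> block,
    or return None if the block is missing or malformed."""
    m = re.search(r"<RECOMMENDATIONS>(.*?)</RECOMMENDATIONS>", generated, re.S)
    if not m:
        return None
    try:
        return json.loads(m.group(1))
    except json.JSONDecodeError:
        return None

sample = '<RECOMMENDATIONS>[{"category": "Cost"}]</RECOMMENDATIONS>'
print(extract_recommendations(sample))  # [{'category': 'Cost'}]
```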
## W&B Training Dashboard

## Related Resources