Azure Advisor Qwen2.5-0.5B (GRPO)

Fine-tuned Qwen/Qwen2.5-0.5B-Instruct with SFT followed by Group Relative Policy Optimization (GRPO) to generate Azure Advisor-style recommendations.

Model Description

This model builds on the SFT checkpoint, adding reward-based training via GRPO-style rejection sampling with iterative SFT. It generates structured recommendations across five Azure Advisor categories:

  • Cost - Cost optimization recommendations
  • Security - Security posture improvements
  • Performance - Performance optimization suggestions
  • OperationalExcellence - Operational best practices
  • HighAvailability - Reliability and availability improvements

Training Pipeline

Phase 1: SFT (Supervised Fine-Tuning)

  • 200 training steps on 348 examples
  • Loss: 1.76 -> 0.029
  • Baseline reward: 0.80/10 -> 3.72/10

Phase 2: GRPO (Group Relative Policy Optimization)

  • 3 iterations of rejection sampling + iterative SFT
  • 25 training prompts per iteration, 4 samples per prompt
  • Top-2 high-reward samples kept per prompt
  • 30 SFT steps per iteration

  Iteration   Training Loss   Generation Avg Reward
  1           0.069           4.19
  2           0.048           4.30
  3           0.040           4.44
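The rejection-sampling step in each iteration can be sketched in plain Python. `generate_samples` and `score_reward` are hypothetical stand-ins for the model's sampler and the weighted reward functions; the selection logic (4 samples per prompt, keep top 2) mirrors the configuration listed above.

```python
def select_high_reward_samples(prompts, generate_samples, score_reward,
                               samples_per_prompt=4, top_k=2):
    """One rejection-sampling pass: sample completions, keep the top-k by reward."""
    kept = []
    for prompt in prompts:
        candidates = generate_samples(prompt, n=samples_per_prompt)
        scored = sorted(candidates, key=score_reward, reverse=True)
        # The top-k completions become new SFT targets for this prompt
        kept.extend((prompt, completion) for completion in scored[:top_k])
    return kept
```

Each iteration then runs 30 SFT steps on the kept (prompt, completion) pairs before sampling again.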

Hill Climbing Verification

Pre-SFT baseline:  0.80/10
Post-SFT:          3.43/10  (+2.63)
GRPO iter losses:  0.069 -> 0.048 -> 0.040 (decreasing)
GRPO gen rewards:  4.19 -> 4.30 -> 4.44 (increasing)
HILL CLIMBING CONFIRMED
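The hill-climbing claim reduces to two monotonicity checks over the logged metrics; a minimal sketch using the values from the table above:

```python
def is_hill_climbing(losses, rewards):
    """Losses should strictly decrease and rewards strictly increase per iteration."""
    losses_down = all(a > b for a, b in zip(losses, losses[1:]))
    rewards_up = all(a < b for a, b in zip(rewards, rewards[1:]))
    return losses_down and rewards_up

# Per-iteration metrics logged during GRPO
print(is_hill_climbing([0.069, 0.048, 0.040], [4.19, 4.30, 4.44]))  # True
```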

Training Configuration

  Parameter            Value
  Base Model           Qwen/Qwen2.5-0.5B-Instruct
  Method               SFT + GRPO (rejection sampling)
  LoRA Rank / Alpha    16 / 32
  Quantization         4-bit QLoRA (NF4)
  GRPO Iterations      3
  Samples per Prompt   4
  Top-K Selection      2
  Steps per Iteration  30
  Learning Rate        5e-5
  Hardware             NVIDIA RTX 3090 (24GB)
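The table above corresponds roughly to the following setup. This is a reconstruction from the listed hyperparameters, not the original training script; the `target_modules` choice is a common one for Qwen2-family models and is an assumption.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization (QLoRA), as in the table above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# LoRA rank 16, alpha 32, as in the table above
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed, not from the card
    task_type="CAUSAL_LM",
)
```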

5 Reward Functions (max 10.0 total)

  Function              Weight   Description
  Format Compliance     1.5      Correct XML tags (<ANALYSIS>, <RECOMMENDATIONS>, <SUMMARY>) and valid JSON
  Category Correctness  2.0      Valid Advisor categories (Cost, Security, Performance, etc.)
  Grounding Quality     2.0      Claims supported by input evidence; no hallucinated resource IDs
  Actionability         2.0      Concrete, feasible next steps with specific Azure actions
  Completeness          2.5      Coverage of all issues with proper recommendation schema fields
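As an illustration of how the weights combine, here is a minimal, hypothetical scorer for the first two functions (format compliance and category correctness). The actual reward functions are not published with this card, so the specific checks below are assumptions.

```python
import json
import re

WEIGHTS = {"format": 1.5, "category": 2.0, "grounding": 2.0,
           "actionability": 2.0, "completeness": 2.5}  # sums to 10.0
VALID_CATEGORIES = {"Cost", "Security", "Performance",
                    "OperationalExcellence", "HighAvailability"}

def score_format_and_category(text):
    """Partial reward: XML tags + valid JSON, plus valid Advisor categories."""
    score = 0.0
    tags_ok = all(f"<{t}>" in text and f"</{t}>" in text
                  for t in ("ANALYSIS", "RECOMMENDATIONS", "SUMMARY"))
    match = re.search(r"<RECOMMENDATIONS>(.*?)</RECOMMENDATIONS>", text, re.DOTALL)
    try:
        recs = json.loads(match.group(1)) if match else None
    except json.JSONDecodeError:
        recs = None
    if tags_ok and recs is not None:
        score += WEIGHTS["format"]
    if recs and all(r.get("category") in VALID_CATEGORIES for r in recs):
        score += WEIGHTS["category"]
    return score
```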

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the base model in fp16 and attach the GRPO-trained LoRA adapter
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "thegovind/azure-advisor-qwen25-0.5b-grpo")
tokenizer = AutoTokenizer.from_pretrained("thegovind/azure-advisor-qwen25-0.5b-grpo")

messages = [
    {"role": "system", "content": "You are an Azure Advisor assistant. Analyze the workload and provide recommendations."},
    {"role": "user", "content": "Analyze this Azure workload and provide recommendations..."}
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,   # required for temperature to take effect
        temperature=0.7,
    )
# Decode only the newly generated tokens, not the prompt
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
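The model is trained to emit <ANALYSIS>, <RECOMMENDATIONS> (JSON), and <SUMMARY> sections. A small helper to pull those apart from the decoded text; the tag names come from the reward table above, and error handling is deliberately minimal:

```python
import json
import re

def parse_advisor_output(text):
    """Extract the three tagged sections; RECOMMENDATIONS is parsed as JSON."""
    sections = {}
    for tag in ("ANALYSIS", "RECOMMENDATIONS", "SUMMARY"):
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        sections[tag] = m.group(1).strip() if m else None
    if sections["RECOMMENDATIONS"]:
        try:
            sections["RECOMMENDATIONS"] = json.loads(sections["RECOMMENDATIONS"])
        except json.JSONDecodeError:
            pass  # leave as raw text if the JSON is malformed
    return sections
```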

W&B Training Dashboard

Related Resources
