Prompt Quality Analyzer

A LoRA-finetuned model that evaluates prompt quality across 8 criteria.

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
import json

# Path to the LoRA adapter (its config records the base model it was trained on)
model_path = "YOUR_USERNAME/prompt-quality-analyzer"

# Base model the adapter was trained on
base_model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    low_cpu_mem_usage=True
)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, model_path)
model.eval()

# Analyze a prompt
prompt = "Classify text into categories"

instruction = """You are a prompt quality analyzer. Analyze the given prompt and extract quality scores for each criterion.

Criteria (score 1-10):
1. clarity_and_specificity: Are instructions clear and specific?
2. context_sufficiency: Is sufficient background/context provided?
3. examples_provided: Are concrete examples included?
4. output_format_specification: Is the expected output format clearly defined?
5. edge_case_handling: Are edge cases and exceptions addressed?
6. tone_and_style_guidance: Is communication style/tone specified?
7. constraint_definition: Are limits and boundaries clearly set?
8. relevance_of_examples: Do examples match the domain/task?

Respond with ONLY a JSON object containing the scores."""

# Format with TinyLlama's Zephyr-style chat template
# (turns end with </s>; <|end|> is not a TinyLlama token)
formatted = f"""<|system|>
{instruction}</s>
<|user|>
Analyze this prompt:

{prompt}</s>
<|assistant|>
"""

inputs = tokenizer(formatted, return_tensors="pt", max_length=2048, truncation=True)

# Generate
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=400,
        temperature=0.3,
        do_sample=True,
        top_p=0.9,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

# Decode only new tokens
input_length = inputs['input_ids'].shape[1]
new_tokens = outputs[0][input_length:]
response = tokenizer.decode(new_tokens, skip_special_tokens=True)

# Parse JSON (the model occasionally emits extra text; fail gracefully)
try:
    result = json.loads(response)
    print(json.dumps(result, indent=2))
except json.JSONDecodeError:
    print("Could not parse model output as JSON:\n", response)

What It Does

Scores prompts on 8 dimensions (1-10 scale):

| Criterion | Description |
|---|---|
| Clarity & Specificity | Are instructions clear and unambiguous? |
| Context Sufficiency | Is enough background provided? |
| Examples Provided | Are there concrete examples? |
| Output Format | Is the expected format specified? |
| Edge Case Handling | Are exceptions addressed? |
| Tone & Style | Is communication style defined? |
| Constraint Definition | Are limits clearly set? |
| Example Relevance | Do examples match the task? |

Example Output

{
  "scores": {
    "clarity_and_specificity": 8,
    "context_sufficiency": 7,
    "examples_provided": 6,
    "output_format_specification": 9,
    "edge_case_handling": 5,
    "tone_and_style_guidance": 6,
    "constraint_definition": 7,
    "relevance_of_examples": 8
  },
  "overall_score": 7.0,
  "quality_tier": "good"
}
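In this example the overall score is the mean of the eight criterion scores. A sketch of reproducing the summary fields client-side; the tier thresholds below are illustrative assumptions, not documented model behavior:

```python
def summarize(scores):
    """Average the criterion scores and map the mean to an assumed quality tier."""
    overall = round(sum(scores.values()) / len(scores), 1)
    # Hypothetical tier boundaries -- adjust to your own rubric
    if overall >= 8:
        tier = "excellent"
    elif overall >= 6:
        tier = "good"
    elif overall >= 4:
        tier = "fair"
    else:
        tier = "poor"
    return {"overall_score": overall, "quality_tier": tier}
```

Applied to the scores above, the mean is 56 / 8 = 7.0, which the assumed thresholds place in the "good" tier, matching the example output.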

Model Details

  • Base Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0 (1.1B parameters)
  • Method: LoRA fine-tuning
  • LoRA Rank: 8
  • Training Data: 500 synthetic prompt examples
  • Training Time: ~10-15 minutes on CPU
  • Chat Template: Uses <|system|>, <|user|>, <|assistant|> format

Performance

  • Success Rate: 100% with robust JSON parsing
  • MAE: ~1.54 overall
  • Accuracy (±1 point): ~61%
  • Accuracy (±2 points): ~77%
  • Parsing: Multi-strategy approach (direct JSON, fix common issues, regex extraction)
  • Improvement: parse success rose from 29/75 (39%) to 75/75 (100%) with the multi-strategy parser
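The exact parsing code is not published here; below is a sketch of such a multi-strategy fallback, assuming the failure modes named above (stray text around the JSON, trailing commas):

```python
import json
import re

def parse_model_json(text):
    """Try progressively more forgiving strategies to recover a JSON object."""
    # 1. Direct parse
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # 2. Fix a common issue: trailing commas before } or ]
    cleaned = re.sub(r",\s*([}\]])", r"\1", text)
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        pass
    # 3. Regex-extract the outermost {...} block and retry
    match = re.search(r"\{.*\}", cleaned, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    return None
```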

Limitations

  • Optimized for English prompts
  • Works best with prompts <500 tokens
  • May require JSON parsing helpers for consistent results
  • Small model - accuracy improves with larger base models

Use Cases

  • Prompt Engineering - Optimize prompts before deployment
  • Quality Assurance - Evaluate prompt libraries
  • Learning - Understand what makes effective prompts
  • Automation - Batch analyze multiple prompts

Training

Trained on synthetic dataset with quality annotations across 8 criteria. Uses Parameter-Efficient Fine-Tuning (PEFT) with LoRA adapters.

Citation

@misc{extract-prompt-quality-criteria,
  author = {Gregoire Cattan},
  title = {Extract Prompt Quality Criteria},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/center-of-excellence/extract-prompt-quality-criteria}
}