Prompt Quality Analyzer

A LoRA-finetuned model that evaluates prompt quality across 8 criteria.

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
import json

# Path to the LoRA adapter (its config records the base model it was trained on)
model_path = "YOUR_USERNAME/prompt-quality-analyzer"

# Base model the adapter was trained on
base_model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    low_cpu_mem_usage=True
)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, model_path)
model.eval()

# Analyze a prompt
prompt = "Classify text into categories"

instruction = """You are a prompt quality analyzer. Analyze the given prompt and extract quality scores for each criterion.

Criteria (score 1-10):
1. clarity_and_specificity: Are instructions clear and specific?
2. context_sufficiency: Is sufficient background/context provided?
3. examples_provided: Are concrete examples included?
4. output_format_specification: Is the expected output format clearly defined?
5. edge_case_handling: Are edge cases and exceptions addressed?
6. tone_and_style_guidance: Is communication style/tone specified?
7. constraint_definition: Are limits and boundaries clearly set?
8. relevance_of_examples: Do examples match the domain/task?

Respond with ONLY a JSON object containing the scores."""

# Format with TinyLlama's Zephyr-style chat template
# (turns end with </s>; <|end|> is not a TinyLlama token)
formatted = f"""<|system|>
{instruction}</s>
<|user|>
Analyze this prompt:

{prompt}</s>
<|assistant|>
"""

inputs = tokenizer(formatted, return_tensors="pt", max_length=2048, truncation=True)

# Generate
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=400,
        temperature=0.3,
        do_sample=True,
        top_p=0.9,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

# Decode only new tokens
input_length = inputs['input_ids'].shape[1]
new_tokens = outputs[0][input_length:]
response = tokenizer.decode(new_tokens, skip_special_tokens=True)

# Parse JSON (the model occasionally emits extra text; fail gracefully)
try:
    result = json.loads(response)
    print(json.dumps(result, indent=2))
except json.JSONDecodeError:
    print("Could not parse model output as JSON:\n", response)

What It Does

Scores prompts on 8 dimensions (1-10 scale):

| Criterion | Description |
|---|---|
| Clarity & Specificity | Are instructions clear and unambiguous? |
| Context Sufficiency | Is enough background provided? |
| Examples Provided | Are there concrete examples? |
| Output Format | Is the expected format specified? |
| Edge Case Handling | Are exceptions addressed? |
| Tone & Style | Is communication style defined? |
| Constraint Definition | Are limits clearly set? |
| Example Relevance | Do examples match the task? |

Example Output

{
  "scores": {
    "clarity_and_specificity": 8,
    "context_sufficiency": 7,
    "examples_provided": 6,
    "output_format_specification": 9,
    "edge_case_handling": 5,
    "tone_and_style_guidance": 6,
    "constraint_definition": 7,
    "relevance_of_examples": 8
  },
  "overall_score": 7.0,
  "quality_tier": "good"
}
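In this example the overall score is the mean of the eight criterion scores. A sketch of reproducing the summary fields client-side; the tier thresholds below are illustrative assumptions, not documented model behavior:

```python
def summarize(scores):
    """Average the criterion scores and map the mean to an assumed quality tier."""
    overall = round(sum(scores.values()) / len(scores), 1)
    # Hypothetical tier boundaries -- adjust to your own rubric
    if overall >= 8:
        tier = "excellent"
    elif overall >= 6:
        tier = "good"
    elif overall >= 4:
        tier = "fair"
    else:
        tier = "poor"
    return {"overall_score": overall, "quality_tier": tier}
```

Applied to the scores above, the mean is 56 / 8 = 7.0, which the assumed thresholds place in the "good" tier, matching the example output.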

Model Details

  • Base Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0 (1.1B parameters)
  • Method: LoRA fine-tuning
  • LoRA Rank: 8
  • Training Data: 500 synthetic prompt examples
  • Training Time: ~10-15 minutes on CPU
  • Chat Template: Uses <|system|>, <|user|>, <|assistant|> format

Performance

  • Success Rate: 100% with robust JSON parsing
  • MAE: ~1.54 overall
  • Accuracy (±1 point): ~61%
  • Accuracy (±2 points): ~77%
  • Parsing: Multi-strategy approach (direct JSON, fix common issues, regex extraction)
  • Improvement: parse success rose from 29/75 (39%) to 75/75 (100%) with the multi-strategy parser
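The exact parsing code is not published here; below is a sketch of such a multi-strategy fallback, assuming the failure modes named above (stray text around the JSON, trailing commas):

```python
import json
import re

def parse_model_json(text):
    """Try progressively more forgiving strategies to recover a JSON object."""
    # 1. Direct parse
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # 2. Fix a common issue: trailing commas before } or ]
    cleaned = re.sub(r",\s*([}\]])", r"\1", text)
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        pass
    # 3. Regex-extract the outermost {...} block and retry
    match = re.search(r"\{.*\}", cleaned, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    return None
```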

Limitations

  • Optimized for English prompts
  • Works best with prompts <500 tokens
  • May require JSON parsing helpers for consistent results
  • Small model - accuracy improves with larger base models

Use Cases

  • Prompt Engineering - Optimize prompts before deployment
  • Quality Assurance - Evaluate prompt libraries
  • Learning - Understand what makes effective prompts
  • Automation - Batch analyze multiple prompts

Training

Trained on synthetic dataset with quality annotations across 8 criteria. Uses Parameter-Efficient Fine-Tuning (PEFT) with LoRA adapters.

Citation

@misc{extract-prompt-quality-criteria,
  author = {Gregoire Cattan},
  title = {Extract Prompt Quality Criteria},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/center-of-excellence/extract-prompt-quality-criteria}
}