#!/usr/bin/env python3
"""
Rubric Generation Script for HF-Agent Benchmark

Generates instance-specific evaluation rubrics following the "Rubrics as Rewards" paper.
Uses LiteLLM to call LLMs for rubric synthesis, grounded by example solutions and traces.
"""

import argparse
import json
import os
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Any, Dict, List, Optional

import litellm
import pandas as pd
from dotenv import load_dotenv
from pydantic import BaseModel

from eval.hf_io import df_to_hub


class Rubric(BaseModel):
    title: str
    description: str
    weight: int


class RubricList(BaseModel):
    rubrics: List[Rubric]
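
# Shape of a serialized RubricList as returned by generate_rubric() below
# (field values are illustrative, not taken from a real run):
#   {"rubrics": [{"title": "Clear final answer",
#                 "description": "Essential Criteria: States the final answer explicitly.",
#                 "weight": 5},
#                ...]}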


# Load environment variables
load_dotenv()

# Rubric generation prompt template based on RaR paper


PROMPT_TEMPLATE = """You are an expert rubric writer. Your job is to generate a self-contained set of evaluation criteria ("rubrics") for judging how good, helpful and complete an agent's trajectory is to a given user question/request. 

Rubrics can cover aspects of a response such as, but not limited to, factual correctness, helpfulness, completeness, harmlessness, correct use of Hugging Face best practices (based on HF documentation), depth of
reasoning, contextual relevance, and usefulness. Each item must be self-contained: non-expert readers should not need to
infer anything or consult external information. Begin each description with its category: "Essential Criteria: ...", "Important
Criteria: ...", "Optional Criteria: ...", or "Pitfall Criteria: Does not mention ...".


Inputs:
- question: <<<{question}>>>
- example_solution (NOT ground truth - just an okay attempt): <<<{example_solution}>>>
- example_trace (NOT ground truth - just an okay attempt showing what tool usage might look like): <<<{example_trace}>>>

IMPORTANT: The example_solution and example_trace provided are NOT ground truth or ideal solutions. They represent
an attempt at solving the task - they give you a general idea of the shape of the problem and what tool usage
might look like, but they may contain mistakes, suboptimal approaches, or incomplete answers. Your rubrics MUST be designed to fairly grade a PERFECT solution. A perfect solution completes all aspects of the task and verifies its correctness before giving the final answer. It tells the user what was done and why, and clearly states the final answer to the user's question.

Total items:
• Choose 7–20 rubric items based on the complexity of the question.

Each rubric item:
• title (2–4 words).
• description: One sentence starting with its category prefix that explicitly states exactly what to look for. For example:
– Essential Criteria: Writes an up-to-date, correct, complete, and working training loop using the latest Hugging Face best practices. Launches the training with hf-jobs.
– Pitfall Criteria: Deprecated launcher usage. Uses python -m torch.distributed.launch instead of torchrun / accelerate.
– Important Criteria: Explains common DDP knobs. Mentions ddp_find_unused_parameters=False for models with conditional branches; optional ddp_timeout; brief note on when they matter and why.
– Optional Criteria: Briefly notes --deepspeed ds_config.json as an alternative scaler when models get big (but stays on DDP for this Q).
• weight: For Essential/Important/Optional, use 1–5 (5 = most important); for Pitfall, use -1 or -2.

Category guidance:
• Essential: Critical actions to answer/complete the user's question/request; if missing, the response is invalid and useless (weight 5).
• Important: Key reasoning, completeness, or clarity; strongly affects quality and usefulness (weight 3–4).
• Optional: Helpfulness in educating the user or providing extra depth; nice to have but not deal-breaking (weight 1–2).
• Pitfall: Common mistakes or omissions specific to this prompt; identify things a respondent often forgets or misstates.
Each Pitfall description must begin with "Pitfall Criteria: Does not mention ..." or "Pitfall Criteria: Recommends ..."
and use weight -1 or -2.

To ensure self-contained guidance:
• When referring to answer choices, explicitly say "Identifies (A)", "Identifies (B)", etc., rather than vague phrasing.
• If the format requires an action like calling a tool or launching a training run, include a rubric item such as:
– Essential Criteria: Includes a clear statement such as "Launches the training with hf-jobs."
• If reasoning should precede the answer, include a rubric like:
– Important Criteria: Presents the explanation and reasoning before stating the final answer.
• If brevity is valued, include a rubric like:
– Optional Criteria: Remains concise and avoids unnecessary detail.
• If the question context demands mention of specific findings/best practices, include that explicitly (e.g., "Essential Criteria: Mentions
that training data must be in a "messages" column for LLM training").

Output: Provide a JSON array of rubric objects. Each object must contain exactly three keys: title, description, and weight.
Do not copy large blocks of the question or example_solution into the text. Each description must begin with its category
prefix, and no extra keys are allowed.

Remember: The example_solution and example_trace are NOT ideal answers - they are just rough attempts to show the 
general approach. Design rubrics that can fairly evaluate any solution, including ones that are better than the example."""


def build_prompt(
    question: str,
    example_solution: str,
    example_trace: List[Dict[str, Any]],
) -> List[Dict[str, str]]:
    """
    Build the messages list for LiteLLM completion.

    Args:
        question: The question/task to evaluate
        example_solution: An example solution attempt (not ground truth)
        example_trace: The agent's message trace showing tool usage

    Returns:
        List of message dicts for LiteLLM
    """
    # Format the trace for readability - only include key parts
    formatted_trace = format_trace_for_prompt(example_trace)

    prompt = PROMPT_TEMPLATE.format(
        question=question,
        example_solution=example_solution,
        example_trace=formatted_trace,
    )

    return [{"role": "user", "content": prompt}]


def format_trace_for_prompt(messages: List[Dict[str, Any]]) -> str:
    """
    Format the agent message trace for inclusion in the prompt.
    Extracts key information while keeping it readable.
    """
    if not messages:
        return "(No trace available)"

    formatted_parts = []
    for msg in messages:
        role = msg.get("role", "unknown")
        content = msg.get("content", "")

        # Skip system messages
        if role == "system":
            continue

        # Handle tool calls
        if "tool_calls" in msg and msg["tool_calls"]:
            tool_info = []
            for tc in msg["tool_calls"]:
                if isinstance(tc, dict) and "function" in tc:
                    func = tc["function"]
                    tool_name = func.get("name", "unknown_tool")
                    tool_info.append(f"  - Called: {tool_name}")
            if tool_info:
                formatted_parts.append(
                    "[Assistant Tool Calls]\n" + "\n".join(tool_info)
                )

        # Handle regular content
        if content:
            # Truncate very long content
            if len(content) > 500:
                content = content[:500] + "... (truncated)"
            formatted_parts.append(f"[{role.title()}]\n{content}")

    return "\n\n".join(formatted_parts) if formatted_parts else "(Empty trace)"


def validate_rubric(rubric_list: List[Dict[str, Any]]) -> bool:
    """
    Validate that rubric meets basic requirements.

    Args:
        rubric_list: List of rubric items to validate

    Returns:
        True if valid, False otherwise
    """
    # Check count
    if not (7 <= len(rubric_list) <= 20):
        return False

    # Check each item
    category_prefixes = [
        "Essential Criteria:",
        "Important Criteria:",
        "Optional Criteria:",
        "Pitfall Criteria:",
    ]

    for item in rubric_list:
        # Check keys
        if set(item.keys()) != {"title", "description", "weight"}:
            return False

        # Check description starts with category prefix
        if not any(
            item["description"].startswith(prefix) for prefix in category_prefixes
        ):
            return False

    return True
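
# Hypothetical spot-check (validate_rubric is not called by the pipeline below;
# a single-item list fails the 7-20 item count requirement):
#   validate_rubric([{"title": "Final answer",
#                     "description": "Essential Criteria: States the final answer.",
#                     "weight": 5}])  # -> False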


def generate_rubric(row: pd.Series, model: str, timeout: int = 120) -> Optional[str]:
    """
    Generate rubric for a single question using LiteLLM.

    Args:
        row: DataFrame row containing question, difficulty, solution, and messages
        model: Model name for LiteLLM
        timeout: Request timeout in seconds

    Returns:
        JSON string serializing the generated RubricList, or None on failure
    """

    messages = build_prompt(
        question=row["question"],
        example_solution=row["solution"],
        example_trace=row.get("messages", []),
    )

    try:
        response = litellm.completion(
            model=model,
            messages=messages,
            timeout=timeout,
            response_format=RubricList,
        )

        # Parse structured output
        rubric_list: RubricList = RubricList.model_validate_json(
            response.choices[0].message.content
        )

        return rubric_list.model_dump_json()
    except Exception as e:
        print(f"Error generating rubric: {e}", file=sys.stderr)
        return None


def load_input_data(infile: str) -> pd.DataFrame:
    """
    Load input data from CSV or JSONL file.

    Args:
        infile: Path to input file

    Returns:
        DataFrame with loaded data
    """
    path = Path(infile)

    if not path.exists():
        raise FileNotFoundError(f"Input file not found: {infile}")

    if path.suffix == ".csv":
        # Try to auto-detect delimiter (comma or semicolon)
        df = pd.read_csv(infile, sep=None, engine="python")
    elif path.suffix == ".jsonl":
        df = pd.read_json(infile, lines=True)
    else:
        raise ValueError(f"Unsupported file format: {path.suffix}. Use .csv or .jsonl")

    # Validate required columns
    required_cols = [
        "question",
        "solution",
    ]
    optional_cols = ["difficulty", "messages", "error"]
    missing_cols = [col for col in required_cols if col not in df.columns]

    if missing_cols:
        raise ValueError(f"Missing required columns: {missing_cols}")

    # Log available optional columns
    available_optional = [col for col in optional_cols if col in df.columns]
    print(f"Found optional columns: {available_optional}")

    return df
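
# Expected row shape for a .jsonl input (values are illustrative):
#   {"question": "How do I fine-tune a model with TRL?",
#    "solution": "An example attempt at an answer...",
#    "messages": [{"role": "user", "content": "..."}],
#    "difficulty": "medium"}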


def main():
    parser = argparse.ArgumentParser(
        description="Generate rubrics for HF-agent benchmark evaluation"
    )
    parser.add_argument(
        "--infile", type=str, required=True, help="Input file path (.csv or .jsonl)"
    )
    parser.add_argument(
        "--outfile", type=str, required=True, help="Output JSONL file path"
    )
    parser.add_argument(
        "--model",
        type=str,
        default="anthropic/claude-sonnet-4-5-20250929",
        help="LiteLLM model name (default: from LITELLM_MODEL env or gpt-4o-mini)",
    )
    parser.add_argument(
        "--timeout",
        type=int,
        default=120,
        help="Request timeout in seconds (default: 120)",
    )
    parser.add_argument(
        "--max-concurrent",
        type=int,
        default=30,
        help="Maximum number of concurrent workers (default: 30)",
    )
    parser.add_argument(
        "--push-to-hub",
        type=str,
        default=None,
        help="Push to HuggingFace dataset (e.g., username/dataset@rubrics)",
    )

    args = parser.parse_args()

    # Determine model: CLI flag, then LITELLM_MODEL env var, then hard-coded fallback
    model = args.model or os.getenv(
        "LITELLM_MODEL", "anthropic/claude-sonnet-4-5-20250929"
    )
    print(f"Using model: {model}")

    # Load input data
    print(f"Loading data from {args.infile}...")
    df = load_input_data(args.infile)
    print(f"Loaded {len(df)} examples")

    # Run rubric generation in parallel using ThreadPoolExecutor
    print(f"Running generation with {args.max_concurrent} parallel workers...")

    with ThreadPoolExecutor(max_workers=args.max_concurrent) as executor:
        # Submit all tasks
        # Key futures by positional index so results align with the positional
        # `results` list below, even when the DataFrame index is not 0..n-1
        future_to_idx = {}
        for pos, (_, row) in enumerate(df.iterrows()):
            future = executor.submit(
                generate_rubric,
                row=row,
                model=model,
                timeout=args.timeout,
            )
            future_to_idx[future] = pos

        # Collect results in order
        results = [None] * len(df)
        completed = 0
        for future in as_completed(future_to_idx):
            idx = future_to_idx[future]
            results[idx] = future.result()
            completed += 1
            print(f"Completed: {completed}/{len(df)}", end="\r")

    print()  # New line after progress

    # Prepare results DataFrame
    print("Preparing results...")
    output_rows = []
    success_count = 0
    failure_count = 0

    for idx, (_, row) in enumerate(df.iterrows()):
        rubric_result = results[idx]

        if rubric_result is None:
            failure_count += 1
            continue

        # Merge with original data
        output_row = row.to_dict()
        output_row["messages"] = json.dumps(output_row["messages"])
        output_row["rubric"] = rubric_result
        output_rows.append(output_row)
        success_count += 1

    # Create DataFrame with results
    results_df = pd.DataFrame(output_rows)

    # Upload to HuggingFace if specified (before saving JSONL)
    if args.push_to_hub:
        print(f"\nUploading to HuggingFace: {args.push_to_hub}")
        upload_success = df_to_hub(
            df=results_df,
            dataset_spec=args.push_to_hub,
            split="train",
            private=False,
        )
        if not upload_success:
            print("Warning: HuggingFace push failed, but continuing to save JSONL...")

    # Write results to JSONL file
    print(f"\nWriting results to {args.outfile}...")
    with open(args.outfile, "w") as outf:
        for output_row in output_rows:
            outf.write(json.dumps(output_row, default=str) + "\n")

    print("\nComplete!")
    print(f"Success: {success_count}/{len(df)}")
    print(f"Failures: {failure_count}/{len(df)}")
    print(f"Output written to: {args.outfile}")
    if args.push_to_hub and upload_success:
        print(f"Pushed to: {args.push_to_hub}")


if __name__ == "__main__":
    main()