Commit 13c9630
Parent(s): 0bdb50e
Remove eval/ from tracking and add to gitignore

Files changed:
- .gitignore +3 -0
- eval/README.md +0 -100
- eval/__init__.py +0 -3
- eval/check_completeness.py +0 -164
- eval/claude_batch_solve.py +0 -230
- eval/create_eval_dataset.py +0 -160
- eval/eval_set.ipynb +0 -0
- eval/generate_rubrics.py +0 -403
- eval/generated_tasks_with_difficulty.json +0 -255
- eval/hf_agent_connector.py +0 -89
- eval/hf_io.py +0 -215
- eval/leaderboard.py +0 -172
- eval/models.py +0 -63
- eval/rubric_eval.py +0 -142
- eval/run_eval_with_leaderboard.py +0 -215
- eval/scrape_discussions/discussions_scraper.py +0 -98
- eval/solvers.py +0 -165
- eval/task.py +0 -121
.gitignore CHANGED

@@ -52,6 +52,9 @@ frontend/yarn-error.log*
 # Docker
 .docker/
 
+# Eval (stale)
+eval/
+
 # Project-specific
 session_logs/
 /logs
eval/README.md DELETED

@@ -1,100 +0,0 @@
-# HF-Agent Eval
-
-Rubric-based evaluation pipeline implementing the [Rubrics as Rewards](https://arxiv.org/abs/2507.17746) paper (RaR-Explicit formula).
-
-## Components
-
-| Component | Purpose | Long Term Goal |
-|-----------|---------|----------------|
-| **`generate_rubrics.py`** | Generates instance-specific evaluation criteria (7-20 weighted rubrics) from QA pairs using an LLM, following the RaR paper methodology | Improve rubric quality with few-shot examples, domain-specific templates, and iterative refinement |
-| **`rubric_eval.py`** | Scores responses using the RaR-Explicit formula: checks each criterion independently via an LLM judge and computes a weighted, normalized score | Support batch evaluation, caching, and alternative scoring formulas (RaR-Holistic) |
-| **`task.py`** | Defines the Inspect AI task `hf-benchmark-with-rubrics`, which wires dataset, solver, and rubric scorer into a single evaluation pipeline | Add more task variants for different benchmarks (code generation, tool use, multi-turn) |
-| **`solvers.py`** | Registry of solver implementations (`hf_agent`, `claude_code`, `claude_code+hf_mcp`) that can be swapped via CLI args | Expand the solver library to benchmark more agents (OpenAI Codex, Gemini, open-source agents) |
-| **`hf_agent_connector.py`** | Lightweight bridge that spins up the hf-agent stack (tools, MCP, LiteLLM loop) and returns the final assistant response | Enable streaming, intermediate step logging, and cost tracking per evaluation |
-| **`leaderboard.py`** | Utilities to build records and append scores to a HuggingFace dataset for tracking performance over time | Add score breakdowns, visualizations, and automatic regression detection |
-| **`run_eval_with_leaderboard.py`** | CLI wrapper that runs `inspect eval`, parses scores from logs, and pushes results to the leaderboard dataset | Support scheduled CI runs, PR-gated benchmarks, and multi-dataset aggregation |
-| **`hf_io.py`** | Helper utilities for pushing DataFrames to the HuggingFace Hub | Extend with dataset versioning and diff tracking |
-| **`models.py`** | Shared Pydantic models for evaluation data structures | Centralize all eval schemas for consistency across components |
-
-## Pipeline
-
-```
-QA pairs → generate_rubrics.py → inspect eval eval/task.py@hf-benchmark-with-rubrics → scores
-```
-
-### 1. Generate Rubrics (if not already generated)
-
-Creates instance-specific evaluation criteria from each question and its reference answer.
-
-```bash
-python eval/generate_rubrics.py \
-  --infile qa_pairs.jsonl \
-  --outfile qa_rubrics.jsonl \
-  --model anthropic/claude-sonnet-4-5-20250929 \
-  --push-to-hub akseljoonas/hf-agent-benchmark@rubrics
-```
-
-**Input format:**
-```json
-{"question": "...", "solution": "...", "thread": [...]}
-```
-
-**Output:** 7-20 weighted criteria per question (Essential: +5, Important: +3 to +4, Optional: +1 to +2, Pitfall: -1 to -2)
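For illustration, one generated rubric row might look like the following. The field names (`title`, `description`, `weight`) match the `Rubric` Pydantic model that `eval/generate_rubrics.py` defines later in this diff; the concrete values are hypothetical:

```python
# Hypothetical example of a single generated rubric; field names follow the
# Rubric model in eval/generate_rubrics.py, but the values are invented.
rubric = {
    "title": "Uses the correct dataset split",
    "description": "Essential Criteria: The response loads the train split of the specified dataset.",
    "weight": 5,  # Essential: +5; Important: +3 to +4; Optional: +1 to +2; Pitfall: -1 to -2
}

print(rubric["title"])
```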
-### 2. Response evaluation
-
-Files:
-- `eval/hf_agent_connector.py` contains a lightweight bridge that spins up the existing hf-agent stack in `agent/` (tools, MCP, LiteLLM loop) and returns the assistant reply.
-- `eval/solvers.py` keeps the solver implementations (e.g. `hf_agent`, `claude_code`). If additional solvers are needed, register them there and pass `-T solver_name=<name>` to swap them in without touching the task.
-- `eval/task.py` registers `hf-benchmark-with-rubrics`, which wires the dataset, solver, and rubric scorer into a single Inspect task and runs the eval.
-
-### Running the hf-agent (implemented in `agent/`; args are optional)
-
-```bash
-uv run inspect eval eval/task.py@hf-benchmark-with-rubrics \
-  -T dataset_name=akseljoonas/hf-agent-rubrics \
-  -T dataset_split=train \
-  -T limit=25 \
-  -T solver_name=hf_agent \
-  -T solver_kwargs='{"config_path":"agent/config_mcp_example.json","max_iterations":10}' \
-  --log-dir logs/inspect
-```
-
-Different benchmarks can be run by creating a new task in `eval/task.py`.
-
-### Running Claude Code headlessly
-
-The `claude_code` solver shells out to the `claude` CLI (`claude -p ... --output-format json`), so you can benchmark Claude Code without any interactive UI. Example (kwargs are optional):
-
-```bash
-uv run inspect eval eval/task.py@hf-benchmark-with-rubrics \
-  -T solver_name=claude_code \
-  -T solver_kwargs='{"allowed_tools":"Bash,Read","output_format":"json"}'
-```
-
-### Leaderboard
-
-Scores can be pushed to a Hugging Face dataset automatically by wrapping the run with `eval/run_eval_with_leaderboard.py` (it executes `inspect eval ...` under the hood and only appends results when the command succeeds):
-
-```bash
-uv run python eval/run_eval_with_leaderboard.py \
-  --hf-dataset akseljoonas/hf-agent-leaderboard \
-  --hf-token $HF_TOKEN \
-  --solver-name hf_agent \
-  --solver-kwargs '{"config_path":"agent/config_mcp_example.json","max_iterations":10}' \
-  --dataset akseljoonas/hf-agent-rubrics@train \
-  --limit 25
-```
-
-## Scoring (implemented in `eval/rubric_eval.py`)
-
-Scoring is based on the RaR-Explicit formula: `score = Σ(weight × satisfied) / Σ(positive_weights)`.
-
-The score is normalized to [0, 1] and clipped at 0 if pitfalls would make it negative.
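As a minimal, self-contained sketch (not the actual `eval/rubric_eval.py` implementation, which is deleted in this commit), the RaR-Explicit formula and its clipping can be written as:

```python
def rar_explicit_score(weights, satisfied):
    """RaR-Explicit: sum(weight * satisfied) / sum(positive weights), clipped to [0, 1].

    weights:   per-criterion weights (pitfalls carry negative weights)
    satisfied: per-criterion booleans, e.g. verdicts from an LLM judge
    """
    positive = sum(w for w in weights if w > 0)
    if positive == 0:
        return 0.0
    raw = sum(w * s for w, s in zip(weights, satisfied)) / positive
    return max(0.0, min(1.0, raw))

# Essential (+5) and Important (+3) satisfied, one pitfall (-2) triggered:
# (5 + 3 - 2) / (5 + 3) = 0.75
print(rar_explicit_score([5, 3, -2], [True, True, True]))
```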
eval/__init__.py DELETED

@@ -1,3 +0,0 @@
-from eval.task import hf_benchmark_with_rubrics
-
-__all__ = ["hf_benchmark_with_rubrics"]
eval/check_completeness.py DELETED

@@ -1,164 +0,0 @@
-#!/usr/bin/env python3
-"""
-Minimal script to check if tasks in solved_tasks.jsonl were fully completed and verified.
-Uses an LLM to assess completion status and adds the result to each row.
-"""
-
-import argparse
-import json
-import sys
-from concurrent.futures import ThreadPoolExecutor, as_completed
-
-import litellm
-from dotenv import load_dotenv
-from pydantic import BaseModel
-
-load_dotenv()
-
-
-class CompletionCheck(BaseModel):
-    reasoning: str
-    completed: bool
-    verified: bool
-
-
-PROMPT = """You are evaluating whether an AI agent fully completed a task AND verified its completion.
-
-Task: {question}
-
-Agent's final answer: {solution}
-
-Agent's trace (tool calls and responses):
-{trace}
-
-Evaluate:
-1. **completed**: Did the agent actually complete the task? (not just explain what could be done, but actually do it)
-2. **verified**: Did the agent verify/confirm that the task was completed correctly? (e.g., checked output, validated results, confirmed success)
-
-Be strict:
-- If the agent asked for more information or said "please provide...", it's NOT completed.
-- If the agent only explained how to do something but didn't do it, it's NOT completed.
-- If the agent just made a plan of how to complete it but didn't do it, it's NOT completed.
-- If there's an error in the trace and no recovery, it's NOT completed.
-- If the agent didn't check/confirm that the code/command completed successfully or that the result is correct, it's NOT verified.
-
-Return JSON with: completed (bool), verified (bool), reasoning (brief explanation)."""
-
-
-def format_trace(messages: list) -> str:
-    """Format messages trace for the prompt."""
-    if not messages:
-        return "(No trace)"
-
-    parts = []
-    for msg in messages:
-        role = msg.get("role", "unknown")
-        if role == "system":
-            continue
-
-        content = msg.get("content", "")
-        tool_calls = msg.get("tool_calls", [])
-
-        if tool_calls:
-            for tc in tool_calls:
-                if isinstance(tc, dict) and "function" in tc:
-                    name = tc["function"].get("name", "?")
-                    parts.append(f"[TOOL CALL] {name}")
-
-        if content:
-            # Truncate long content
-            if len(content) > 5000:
-                content = content[:4000] + "..." + content[-1000:]
-            parts.append(f"[{role.upper()}] {content}")
-
-    return "\n".join(parts) if parts else "(Empty trace)"
-
-
-def check_row(row: dict, model: str) -> CompletionCheck | None:
-    """Check if a single task was completed and verified."""
-    prompt = PROMPT.format(
-        question=row["question"],
-        solution=row.get("solution", "(No solution)"),
-        trace=format_trace(row.get("messages", [])),
-    )
-
-    try:
-        response = litellm.completion(
-            model=model,
-            messages=[{"role": "user", "content": prompt}],
-            response_format=CompletionCheck,
-            timeout=60,
-        )
-        return CompletionCheck.model_validate_json(response.choices[0].message.content)
-    except Exception as e:
-        print(f"Error: {e}", file=sys.stderr)
-        return None
-
-
-def main():
-    parser = argparse.ArgumentParser(description="Check task completion status")
-    parser.add_argument("--infile", type=str, default="eval/solved_tasks.jsonl")
-    parser.add_argument(
-        "--outfile", type=str, default="eval/solved_tasks_checked.jsonl"
-    )
-    parser.add_argument(
-        "--model", type=str, default="anthropic/claude-sonnet-4-5-20250929"
-    )
-    parser.add_argument("--max-concurrent", type=int, default=30)
-    args = parser.parse_args()
-
-    # Load data
-    print(f"Loading {args.infile}...")
-    rows = []
-    with open(args.infile) as f:
-        for line in f:
-            rows.append(json.loads(line))
-    print(f"Loaded {len(rows)} rows")
-
-    # Process in parallel
-    print(f"Checking completion with {args.model}...")
-    with ThreadPoolExecutor(max_workers=args.max_concurrent) as executor:
-        futures = {
-            executor.submit(check_row, row, args.model): i for i, row in enumerate(rows)
-        }
-        results = [None] * len(rows)
-
-        for future in as_completed(futures):
-            idx = futures[future]
-            results[idx] = future.result()
-            print(
-                f"Done: {sum(1 for r in results if r is not None)}/{len(rows)}",
-                end="\r",
-            )
-
-    print()
-
-    # Merge results
-    output_rows = []
-    for row, result in zip(rows, results):
-        if result:
-            row["task_completed"] = result.completed
-            row["task_verified"] = result.verified
-            row["completion_reasoning"] = result.reasoning
-        else:
-            row["task_completed"] = None
-            row["task_verified"] = None
-            row["completion_reasoning"] = "Error during check"
-        output_rows.append(row)
-
-    # Write output
-    print(f"Writing to {args.outfile}...")
-    with open(args.outfile, "w") as f:
-        for row in output_rows:
-            f.write(json.dumps(row, default=str) + "\n")
-
-    # Summary
-    completed = sum(1 for r in results if r and r.completed)
-    verified = sum(1 for r in results if r and r.verified)
-    print("\nSummary:")
-    print(f"  Completed: {completed}/{len(rows)}")
-    print(f"  Verified: {verified}/{len(rows)}")
-
-
-if __name__ == "__main__":
-    main()
eval/claude_batch_solve.py DELETED

@@ -1,230 +0,0 @@
-import asyncio
-import json
-import os
-import threading
-from pathlib import Path
-from typing import Any
-
-from claude_agent_sdk import (
-    AssistantMessage,
-    ClaudeAgentOptions,
-    ResultMessage,
-    SystemMessage,
-    TextBlock,
-    ToolResultBlock,
-    ToolUseBlock,
-    UserMessage,
-    query,
-)
-from dotenv import load_dotenv
-
-load_dotenv()
-
-# Thread-safe file writing
-file_lock = threading.Lock()
-
-
-def convert_message_to_chat_format(message: Any) -> dict | None:
-    """Convert SDK message to standard chat format with role/content/tool_calls."""
-
-    if isinstance(message, SystemMessage):
-        # Extract tools list from init data for system message
-        if message.subtype == "init":
-            tools = message.data.get("tools", [])
-            tools_desc = "\n".join(f"- {tool}" for tool in tools)
-            return {
-                "role": "system",
-                "content": f"You are a helpful assistant with access to the following tools:\n{tools_desc}",
-            }
-        return None
-
-    elif isinstance(message, AssistantMessage):
-        text_content = ""
-        tool_calls = []
-
-        for block in message.content:
-            if isinstance(block, TextBlock):
-                text_content += block.text
-            elif isinstance(block, ToolUseBlock):
-                tool_calls.append(
-                    {
-                        "id": block.id,
-                        "function": {
-                            "name": block.name,
-                            "arguments": block.input,
-                        },
-                    }
-                )
-
-        result = {"role": "assistant", "content": text_content}
-        if tool_calls:
-            result["tool_calls"] = tool_calls
-        return result
-
-    elif isinstance(message, UserMessage):
-        # UserMessage can contain tool results or text
-        if isinstance(message.content, str):
-            return {"role": "user", "content": message.content}
-        elif isinstance(message.content, list):
-            # Check for tool results
-            tool_results = []
-            text_content = ""
-            for block in message.content:
-                if isinstance(block, ToolResultBlock):
-                    # Format tool result content
-                    if isinstance(block.content, str):
-                        content = block.content
-                    elif isinstance(block.content, list):
-                        content = json.dumps(block.content)
-                    else:
-                        content = str(block.content) if block.content else ""
-
-                    tool_results.append(
-                        {
-                            "tool_use_id": block.tool_use_id,
-                            "content": content,
-                            "is_error": block.is_error,
-                        }
-                    )
-                elif isinstance(block, TextBlock):
-                    text_content += block.text
-
-            if tool_results:
-                return {
-                    "role": "user",
-                    "content": f"<tool_response>\n{json.dumps(tool_results, indent=2)}\n</tool_response>",
-                }
-            else:
-                return {"role": "user", "content": text_content}
-        return None
-
-    elif isinstance(message, ResultMessage):
-        # ResultMessage is metadata, not a conversation message
-        return None
-
-    return None
-
-
-async def solve_task(
-    question: str,
-    difficulty: str,
-    task_idx: int,
-    total: int,
-    semaphore: asyncio.Semaphore,
-) -> dict:
-    """Solve a single task using Claude Agent SDK."""
-    async with semaphore:
-        print(f"[{task_idx}/{total}] Starting: {question[:60]}...")
-
-        messages = []
-        solution = None
-
-        try:
-            async for message in query(
-                prompt=question,
-                options=ClaudeAgentOptions(
-                    cwd=os.getcwd(),
-                    permission_mode="bypassPermissions",
-                    disallowed_tools=["Write", "Edit", "Bash", "Glob", "Grep"],
-                    mcp_servers={
-                        "huggingface": {
-                            "type": "http",
-                            "url": "https://huggingface.co/mcp",
-                            "headers": {
-                                "Authorization": f"Bearer {os.environ['HF_TOKEN']}"
-                            },
-                        }
-                    },
-                ),
-            ):
-                # Convert to chat format and append if valid
-                chat_msg = convert_message_to_chat_format(message)
-                if chat_msg:
-                    messages.append(chat_msg)
-
-                # Extract text from assistant messages
-                if isinstance(message, AssistantMessage):
-                    for block in message.content:
-                        if isinstance(block, TextBlock):
-                            solution = block.text
-                # Check for result messages
-                elif isinstance(message, ResultMessage):
-                    if message.is_error:
-                        print(f"[{task_idx}/{total}] ✗ Agent error: {message.subtype}")
-                        return {
-                            "question": question,
-                            "difficulty": difficulty,
-                            "solution": None,
-                            "messages": messages,
-                            "error": f"Agent error: {message.subtype}",
-                        }
-                    elif message.result:
-                        solution = message.result
-
-            print(f"[{task_idx}/{total}] ✓ Done: {question[:60]}...")
-            return {
-                "question": question,
-                "difficulty": difficulty,
-                "solution": solution,
-                "messages": messages,
-                "error": None,
-            }
-        except Exception as e:
-            print(f"[{task_idx}/{total}] ✗ Error: {e}")
-            return {
-                "question": question,
-                "difficulty": difficulty,
-                "solution": None,
-                "messages": messages,
-                "error": str(e),
-            }
-
-
-def write_result(output_path: Path, result: dict):
-    """Thread-safe write to output file."""
-    with file_lock:
-        with open(output_path, "a") as f:
-            f.write(json.dumps(result) + "\n")
-
-
-async def main():
-    # Load tasks from filled_tasks.jsonl
-    tasks_path = Path(__file__).parent / "filled_tasks.jsonl"
-    tasks = []
-    with open(tasks_path) as f:
-        for line in f:
-            tasks.append(json.loads(line))
-
-    # Output file - clear it first
-    output_path = Path(__file__).parent / "solved_tasks.jsonl"
-    output_path.write_text("")
-
-    # Semaphore to limit concurrency
-    max_concurrent = 5
-    semaphore = asyncio.Semaphore(max_concurrent)
-
-    total = len(tasks)
-    print(f"Processing {total} tasks with {max_concurrent} concurrent agents...")
-
-    async def process_and_save(task: dict, idx: int):
-        result = await solve_task(
-            task["question"], task["difficulty"], idx, total, semaphore
-        )
-        write_result(output_path, result)
-        return result
-
-    # Create all tasks
-    coroutines = [process_and_save(task, i + 1) for i, task in enumerate(tasks)]
-
-    # Run all concurrently (semaphore limits actual parallelism)
-    results = await asyncio.gather(*coroutines, return_exceptions=True)
-
-    successful = sum(
-        1 for r in results if isinstance(r, dict) and r.get("error") is None
-    )
-    print(f"\nCompleted: {successful}/{total} successful")
-    print(f"Results saved to {output_path}")
-
-
-if __name__ == "__main__":
-    asyncio.run(main())
eval/create_eval_dataset.py DELETED

@@ -1,160 +0,0 @@
-from itertools import product
-
-from datasets import Dataset
-
-# Task templates (excluding Very hard difficulty)
-tasks = [
-    {
-        "task": "Evaluate models {M} on benchmarks {B}",
-        "difficulty": "Easy",
-        "category": "Evaluation",
-        "params": ["M", "B"],
-    },
-    {
-        "task": "Train models {M} on datasets {D} evaluating them on benchmarks {B}",
-        "difficulty": "Medium",
-        "category": "Training",
-        "params": ["M", "D", "B"],
-    },
-    {
-        "task": "Run an ablation for hyperparameter {P} for model {M} on dataset {D}",
-        "difficulty": "Hard",
-        "category": "Ablation",
-        "params": ["P", "M", "D"],
-    },
-    {
-        "task": "Generate completions with model {M} on benchmarks {B} using engine {E}",
-        "difficulty": "Medium",
-        "category": "Generation",
-        "params": ["M", "B", "E"],
-    },
-    # {
-    #     "task": "Merge models {M} using linear averaging to find the best result on benchmarks {B}",
-    #     "difficulty": "Hard",
-    #     "category": "Model Merging",
-    #     "params": ["M", "B"],
-    # },
-    {
-        "task": "Decontaminate dataset {D} against benchmarks {B}",
-        "difficulty": "Hard",
-        "category": "Data Processing",
-        "params": ["D", "B"],
-    },
-    {
-        "task": "Format dataset {D} for compatibility with framework {F} on task {T}",
-        "difficulty": "Easy",
-        "category": "Data Formatting",
-        "params": ["D", "F", "T"],
-    },
-]
-
-# Parameter values
-values = {
-    "M": [
-        "Qwen/Qwen3-4B-Instruct-2507",
-        "openai/gpt-oss-20b",
-        "gpt-4o-mini",
-        "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
-        "anthropic's latest model",
-    ],
-    "B": [
-        "Idavidrein/gpqa",
-        "HuggingFaceH4/MATH-500",
-        "lighteval/SimpleQA",
-        "TIGER-Lab/MMLU-Pro",
-    ],
-    "D": [
-        "HuggingFaceH4/multi_turn_if",
-        "HuggingFaceH4/ultrachat_200k",
-        "HuggingFaceH4/AceReason-1.1-SFT config: math_no_think",
-    ],
-    "E": [
-        "vllm",
-        "sglang",
-    ],
-    "F": [
-        "trl",
-        "axolotl",
-        "verl",
-    ],
-    "P": [
-        "learning_rate",
-        "batch_size",
-        "num_epochs",
-    ],
-    "T": [
-        "SFT",
-        "GRPO",
-    ],
-}
-
-# Task-specific instance limits
-# For each task, specify which parameter(s) to pivot on and how many instances per pivot combination
-# pivot can be a single parameter string or a list of parameters
-task_limits = [
-    {"pivot": "B", "instances_per_pivot": 1},  # Task 0: 1 instance per
-    {"pivot": ["M", "B"], "instances_per_pivot": 3},  # Task 1: 3 instances per model
-    {"pivot": ["P", "D"], "instances_per_pivot": 3},  # Task 2
-    {"pivot": "E", "instances_per_pivot": 2},  # Task 3: 2 instances per benchmark
-    # {"pivot": "M", "instances_per_pivot": 2},  # Task 4
-    {"pivot": "D", "instances_per_pivot": 2},  # Task 5: 2 instances per dataset
-    {"pivot": ["D", "F", "T"], "instances_per_pivot": 2},  # Task 6
-]
-
-
-def main():
-    eval_data = []
-
-    for task_idx, task_dict in enumerate(tasks):
-        template = task_dict["task"]
-        params = task_dict["params"]
-        limit_config = task_limits[task_idx]
-
-        pivot_params = limit_config["pivot"]
-        instances_per_pivot = limit_config["instances_per_pivot"]
-
-        # Normalize pivot to list
-        if isinstance(pivot_params, str):
-            pivot_params = [pivot_params]
-
-        # Get all combinations of pivot values
-        pivot_param_values = [values[p] for p in pivot_params]
-        pivot_combinations = product(*pivot_param_values)
-
-        # For each pivot combination, generate limited instances
-        for pivot_combo in pivot_combinations:
-            # Get combinations of other (non-pivot) parameters
-            other_params = [p for p in params if p not in pivot_params]
-            other_param_values = [values[p] for p in other_params]
-            other_combinations = list(product(*other_param_values))
-
-            # Limit to specified number of instances per pivot combination
-            limited_combinations = other_combinations[:instances_per_pivot]
-
-            # Generate instances
-            for combo in limited_combinations:
-                # Build kwargs with pivot values and other values
-                kwargs = dict(zip(pivot_params, pivot_combo))
-                kwargs.update(dict(zip(other_params, combo)))
-
-                concrete_task = template.format(**kwargs)
-                eval_data.append(
-                    {
-                        "task": concrete_task,
-                        "difficulty": task_dict["difficulty"],
-                        "category": task_dict["category"],
-                    }
-                )
-
-    print(f"Generated {len(eval_data)} instances from {len(tasks)} templates")
-
-    dataset = Dataset.from_list(eval_data)
-    print(f"\nDataset: {len(dataset)} rows")
-    print(f"Sample: {dataset[0]['task']}")
-
-    dataset.push_to_hub("akseljoonas/qyestions", private=False)
-    print("\n✓ Pushed to akseljoonas/qyestions")
-
-
-if __name__ == "__main__":
-    main()
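The pivot/limit expansion in the deleted script above can be illustrated in isolation with hypothetical parameter values: for each combination of pivot values, only the first `instances_per_pivot` combinations of the remaining parameters are kept.

```python
from itertools import product

# Hypothetical toy parameter values (not the real model/benchmark lists).
values = {"M": ["m1", "m2"], "B": ["b1", "b2", "b3"]}

# Pivot on B with 1 instance per pivot value, as in task_limits[0].
pivot, others, instances_per_pivot = ["B"], ["M"], 1

rows = []
for combo in product(*(values[p] for p in pivot)):
    # Take only the first N combinations of the non-pivot parameters.
    other_combos = list(product(*(values[p] for p in others)))[:instances_per_pivot]
    for oc in other_combos:
        kwargs = {**dict(zip(pivot, combo)), **dict(zip(others, oc))}
        rows.append("Evaluate models {M} on benchmarks {B}".format(**kwargs))

print(rows)  # one task per benchmark, each using the first model
```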
eval/eval_set.ipynb DELETED

The diff for this file is too large to render; see the raw diff.
eval/generate_rubrics.py
DELETED

@@ -1,403 +0,0 @@

#!/usr/bin/env python3
"""
Rubric Generation Script for HF-Agent Benchmark

Generates instance-specific evaluation rubrics following the "Rubrics as Rewards" paper.
Uses LiteLLM to call LLM models for rubric synthesis with expert grounding via reference answers.
"""

import argparse
import json
import os
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Any, Dict, List, Optional

import litellm
import pandas as pd
from dotenv import load_dotenv
from pydantic import BaseModel

from eval.hf_io import df_to_hub


class Rubric(BaseModel):
    title: str
    description: str
    weight: int


class RubricList(BaseModel):
    rubrics: List[Rubric]


# Load environment variables
load_dotenv()

# Rubric generation prompt template based on the RaR paper
PROMPT_TEMPLATE = """You are an expert rubric writer. Your job is to generate a self-contained set of evaluation criteria ("rubrics") for judging how good, helpful and complete an agent's trajectory is for a given user question/request.

Rubrics can cover aspects of a response such as, but not limited to, factual correctness, helpfulness, completeness, harmlessness, correct use of Hugging Face best practices (based on HF documentation), depth of reasoning, contextual relevance and usefulness. Each item must be self-contained – non-expert readers should not need to infer anything or consult external information. Begin each description with its category: "Essential Criteria: . . . ", "Important Criteria: . . . ", "Optional Criteria: . . . ", or "Pitfall Criteria: Does not mention . . . ".

Inputs:
- question: <<<{question}>>>
- example_solution (NOT ground truth - just an okay attempt): <<<{example_solution}>>>
- example_trace (NOT ground truth - just an okay attempt showing what tool usage might look like): <<<{example_trace}>>>

IMPORTANT: The example_solution and example_trace provided are NOT ground truth or ideal solutions. They represent an attempt at solving the task - they give you a general idea of the shape of the problem and what tool usage might look like, but they may contain mistakes, suboptimal approaches, or incomplete answers. Your rubrics MUST be designed to fairly grade a PERFECT solution. The perfect solution is complete in all aspects of solving the task and verifies its correctness before giving the final answer. It tells the user what was done and why, and provides a final answer that clearly addresses the user's question.

Total items:
• Choose 7–20 rubric items based on the complexity of the question.

Each rubric item:
• title (2–4 words).
• description: One sentence starting with its category prefix that explicitly states exactly what to look for. For example:
– Essential Criteria: Writes an up-to-date, correct, complete and working training loop using the latest Hugging Face best practices. Launches the training with hf-jobs.
– Pitfall Criteria: Deprecated launcher usage. Uses python -m torch.distributed.launch instead of torchrun / accelerate.
– Important Criteria: Explains common DDP knobs. Mentions ddp_find_unused_parameters=False for models with conditional branches; optional ddp_timeout; brief note on when they matter and why.
– Optional Criteria: Briefly notes --deepspeed ds_config.json as an alternative scaler when models get big (but stays on DDP for this Q).
• weight: For Essential/Important/Optional, use 1–5 (5 = most important); for Pitfall, use –1 or –2.

Category guidance:
• Essential: Critical actions to answer/complete the user's question/request; if missing, the response is invalid and useless (weight 5).
• Important: Key reasoning, completeness, or clarity; strongly affects quality and usefulness (weight 3–4).
• Optional: Helpfulness in educating the user or providing extra depth; nice to have but not deal-breaking (weight 1–2).
• Pitfall: Common mistakes or omissions specific to this prompt—identify things a respondent often forgets or misstates.
Each Pitfall description must begin with "Pitfall Criteria: Does not mention . . . " or "Pitfall Criteria: Recommends . . . " and use weight –1 or –2.

To ensure self-contained guidance:
• When referring to answer choices, explicitly say "Identifies (A)", "Identifies (B)", etc., rather than vague phrasing.
• If the format requires an action like calling a tool or launching a training run, include a rubric item such as:
– Essential Criteria: Includes a clear statement "Launches the training with hf-jobs.".
• If reasoning should precede the answer, include a rubric like:
– Important Criteria: Presents the explanation and reasoning before stating the final answer.
• If brevity is valued, include a rubric like:
– Optional Criteria: Remains concise and avoids unnecessary detail.
• If the question context demands mention of specific findings/best practices, include that explicitly (e.g., "Essential Criteria: Mentions that training data must be in the "messages" column for LLM training").

Output: Provide a JSON array of rubric objects. Each object must contain exactly three keys—title, description, and weight. Do not copy large blocks of the question or example_solution into the text. Each description must begin with its category prefix, and no extra keys are allowed.

Remember: The example_solution and example_trace are NOT ideal answers - they are just rough attempts to show the general approach. Design rubrics that can fairly evaluate any solution, including ones that are better than the example."""


def build_prompt(
    question: str,
    example_solution: str,
    example_trace: List[Dict[str, Any]],
) -> List[Dict[str, str]]:
    """
    Build the messages list for LiteLLM completion.

    Args:
        question: The question/task to evaluate
        example_solution: An example solution attempt (not ground truth)
        example_trace: The agent's message trace showing tool usage

    Returns:
        List of message dicts for LiteLLM
    """
    # Format the trace for readability - only include key parts
    formatted_trace = format_trace_for_prompt(example_trace)

    prompt = PROMPT_TEMPLATE.format(
        question=question,
        example_solution=example_solution,
        example_trace=formatted_trace,
    )

    return [{"role": "user", "content": prompt}]


def format_trace_for_prompt(messages: List[Dict[str, Any]]) -> str:
    """
    Format the agent message trace for inclusion in the prompt.
    Extracts key information while keeping it readable.
    """
    if not messages:
        return "(No trace available)"

    formatted_parts = []
    for msg in messages:
        role = msg.get("role", "unknown")
        content = msg.get("content", "")

        # Skip system messages
        if role == "system":
            continue

        # Handle tool calls
        if "tool_calls" in msg and msg["tool_calls"]:
            tool_info = []
            for tc in msg["tool_calls"]:
                if isinstance(tc, dict) and "function" in tc:
                    func = tc["function"]
                    tool_name = func.get("name", "unknown_tool")
                    tool_info.append(f"  - Called: {tool_name}")
            if tool_info:
                formatted_parts.append(
                    "[Assistant Tool Calls]\n" + "\n".join(tool_info)
                )

        # Handle regular content
        if content:
            # Truncate very long content
            if len(content) > 500:
                content = content[:500] + "... (truncated)"
            formatted_parts.append(f"[{role.title()}]\n{content}")

    return "\n\n".join(formatted_parts) if formatted_parts else "(Empty trace)"


def validate_rubric(rubric_list: List[Dict[str, Any]]) -> bool:
    """
    Validate that a rubric meets basic requirements.

    Args:
        rubric_list: List of rubric items to validate

    Returns:
        True if valid, False otherwise
    """
    # Check count
    if not (7 <= len(rubric_list) <= 20):
        return False

    # Check each item
    category_prefixes = [
        "Essential Criteria:",
        "Important Criteria:",
        "Optional Criteria:",
        "Pitfall Criteria:",
    ]

    for item in rubric_list:
        # Check keys
        if set(item.keys()) != {"title", "description", "weight"}:
            return False

        # Check description starts with a category prefix
        if not any(
            item["description"].startswith(prefix) for prefix in category_prefixes
        ):
            return False

    return True


def generate_rubric(row: pd.Series, model: str, timeout: int = 120) -> Optional[str]:
    """
    Generate a rubric for a single question using LiteLLM.

    Args:
        row: DataFrame row containing question, difficulty, solution, and messages
        model: Model name for LiteLLM
        timeout: Request timeout in seconds

    Returns:
        JSON string of the generated rubric list, or None on failure
    """
    messages = build_prompt(
        question=row["question"],
        example_solution=row["solution"],
        example_trace=row.get("messages", []),
    )

    try:
        response = litellm.completion(
            model=model,
            messages=messages,
            timeout=timeout,
            response_format=RubricList,
        )

        # Parse structured output
        rubric_list: RubricList = RubricList.model_validate_json(
            response.choices[0].message.content
        )

        return rubric_list.model_dump_json()
    except Exception as e:
        print(f"Error generating rubric: {e}", file=sys.stderr)
        return None


def load_input_data(infile: str) -> pd.DataFrame:
    """
    Load input data from a CSV or JSONL file.

    Args:
        infile: Path to input file

    Returns:
        DataFrame with loaded data
    """
    path = Path(infile)

    if not path.exists():
        raise FileNotFoundError(f"Input file not found: {infile}")

    if path.suffix == ".csv":
        # Try to auto-detect delimiter (comma or semicolon)
        df = pd.read_csv(infile, sep=None, engine="python")
    elif path.suffix == ".jsonl":
        df = pd.read_json(infile, lines=True)
    else:
        raise ValueError(f"Unsupported file format: {path.suffix}. Use .csv or .jsonl")

    # Validate required columns
    required_cols = [
        "question",
        "solution",
    ]
    optional_cols = ["difficulty", "messages", "error"]
    missing_cols = [col for col in required_cols if col not in df.columns]

    if missing_cols:
        raise ValueError(f"Missing required columns: {missing_cols}")

    # Log available optional columns
    available_optional = [col for col in optional_cols if col in df.columns]
    print(f"Found optional columns: {available_optional}")

    return df


def main():
    parser = argparse.ArgumentParser(
        description="Generate rubrics for HF-agent benchmark evaluation"
    )
    parser.add_argument(
        "--infile", type=str, required=True, help="Input file path (.csv or .jsonl)"
    )
    parser.add_argument(
        "--outfile", type=str, required=True, help="Output JSONL file path"
    )
    parser.add_argument(
        "--model",
        type=str,
        default="anthropic/claude-sonnet-4-5-20250929",
        help="LiteLLM model name (default: anthropic/claude-sonnet-4-5-20250929)",
    )
    parser.add_argument(
        "--timeout",
        type=int,
        default=120,
        help="Request timeout in seconds (default: 120)",
    )
    parser.add_argument(
        "--max-concurrent",
        type=int,
        default=30,
        help="Maximum number of concurrent workers (default: 30)",
    )
    parser.add_argument(
        "--push-to-hub",
        type=str,
        default=None,
        help="Push to HuggingFace dataset (e.g., username/dataset@rubrics)",
    )

    args = parser.parse_args()

    # Determine model
    model = args.model or os.getenv("LITELLM_MODEL", "gpt-4o-mini")
    print(f"Using model: {model}")

    # Load input data
    print(f"Loading data from {args.infile}...")
    df = load_input_data(args.infile)
    print(f"Loaded {len(df)} examples")

    # Run rubric generation in parallel using ThreadPoolExecutor
    print(f"Running generation with {args.max_concurrent} parallel workers...")

    with ThreadPoolExecutor(max_workers=args.max_concurrent) as executor:
        # Submit all tasks
        future_to_idx = {}
        for idx, row in df.iterrows():
            future = executor.submit(
                generate_rubric,
                row=row,
                model=model,
                timeout=args.timeout,
            )
            future_to_idx[future] = idx

        # Collect results in order
        results = [None] * len(df)
        completed = 0
        for future in as_completed(future_to_idx):
            idx = future_to_idx[future]
            results[idx] = future.result()
            completed += 1
            print(f"Completed: {completed}/{len(df)}", end="\r")

    print()  # New line after progress

    # Prepare results DataFrame
    print("Preparing results...")
    output_rows = []
    success_count = 0
    failure_count = 0

    for idx, (_, row) in enumerate(df.iterrows()):
        rubric_result = results[idx]

        if rubric_result is None:
            failure_count += 1
            continue

        # Merge with original data
        output_row = row.to_dict()
        output_row["messages"] = json.dumps(output_row["messages"])
        output_row["rubric"] = rubric_result
        output_rows.append(output_row)
        success_count += 1

    # Create DataFrame with results
    results_df = pd.DataFrame(output_rows)

    # Upload to HuggingFace if specified (before saving JSONL)
    upload_success = False
    if args.push_to_hub:
        print(f"\nUploading to HuggingFace: {args.push_to_hub}")
        upload_success = df_to_hub(
            df=results_df,
            dataset_spec=args.push_to_hub,
            split="train",
            private=False,
        )
        if not upload_success:
            print("Warning: HuggingFace push failed, but continuing to save JSONL...")

    # Write results to JSONL file
    print(f"\nWriting results to {args.outfile}...")
    with open(args.outfile, "w") as outf:
        for output_row in output_rows:
            outf.write(json.dumps(output_row, default=str) + "\n")

    print("\nComplete!")
    print(f"Success: {success_count}/{len(df)}")
    print(f"Failures: {failure_count}/{len(df)}")
    print(f"Output written to: {args.outfile}")
    if args.push_to_hub and upload_success:
        print(f"Pushed to: {args.push_to_hub}")


if __name__ == "__main__":
    main()
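The rubrics produced by this script are consumed by `rubric_eval.py`, which the repo README describes as computing a weighted normalized score per the RaR-Explicit formula. A minimal sketch of that scoring step, assuming each criterion receives a binary judge verdict and the score normalizes earned weight by the maximum attainable positive weight (the function name and exact normalization are illustrative; the deleted `rubric_eval.py` may differ):

```python
from typing import Dict, List


def rar_explicit_score(rubrics: List[Dict], verdicts: List[bool]) -> float:
    """Weighted normalized score: sum of weights of satisfied criteria,
    divided by the maximum attainable positive weight. Pitfall items carry
    negative weights, so triggering them subtracts from the score."""
    earned = sum(r["weight"] for r, hit in zip(rubrics, verdicts) if hit)
    max_positive = sum(r["weight"] for r in rubrics if r["weight"] > 0)
    return earned / max_positive if max_positive else 0.0


# Toy rubric list in the shape this script generates (title/description/weight).
rubrics = [
    {"title": "Final answer", "description": "Essential Criteria: ...", "weight": 5},
    {"title": "Reasoning first", "description": "Important Criteria: ...", "weight": 3},
    {"title": "Deprecated launcher", "description": "Pitfall Criteria: ...", "weight": -2},
]
# Satisfies both positive criteria but also triggers the pitfall:
print(rar_explicit_score(rubrics, [True, True, True]))  # 0.75
```

Judging each criterion independently (one verdict per rubric item) is what makes the score decomposable and auditable, which is the point of the rubric-based setup.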
eval/generated_tasks_with_difficulty.json
DELETED
|
@@ -1,255 +0,0 @@
|
|
| 1 |
-
{
|
| 2 |
-
"Evaluate models {M_i} on benchmarks {B_i}": "Easy",
|
| 3 |
-
"Train models {M_i} on datasets {D_i} with benchmarks {B_i}": "Medium",
|
| 4 |
-
"Run an ablation for hyperparameter P for model M on dataset D": "Hard",
|
| 5 |
-
"Generate completions with model M on dataset D using engine E": "Medium",
|
| 6 |
-
"Merge models {M_i} using linear averaging to find the best result on benchmarks {B_i}": "Hard",
|
| 7 |
-
"Given datasets {D_i}, ablate the best SFT mixture for model M across benchmarks {B_i}": "Very hard",
|
| 8 |
-
"Decontaminate dataset D against benchmarks {B_i}": "Hard",
|
| 9 |
-
"Benchmark RL framework F for best throughput on G GPUs": "Very hard",
|
| 10 |
-
"Implement post-training algorithm A from paper P in framework F. Validate it runs end-to-end": "Very hard",
|
| 11 |
-
"Implement benchmark B in framework F. Validate it reproduces some published results": "Very hard",
|
| 12 |
-
"Format dataset D for compatibility with framework F on task T": "Easy",
|
| 13 |
-
"Remove the background from this image: [image path]": "Easy",
|
| 14 |
-
"Transcribe all of the audio files in this directory": "Easy",
|
| 15 |
-
"Transcribe all of the audio files in this directory, choose the model that'll be cheapest and also relatively accurate": "Medium (judgment call or interaction needed to figure out what accuracy levels are acceptable)",
|
| 16 |
-
"Remove the background music from this audio file": "Medium (needs to find Gradio Space and call its API0",
|
| 17 |
-
"Change this video track to be from English to Spanish": "Medium (needs to link several models together)",
|
| 18 |
-
"Translate this flyer from English to Spanish, keeping the layout and images the same": "Medium (needs to link several models together)",
|
| 19 |
-
"What's the best model for X?": "Easy",
|
| 20 |
-
"What datasets are available for X? (X={domain x task x modality})": "Easy",
|
| 21 |
-
"Is there a space to do Y?": "Easy",
|
| 22 |
-
"I have this script and this error - what's the issue?": "Medium",
|
| 23 |
-
"This space is broken, how can i fix it?": "Medium",
|
| 24 |
-
"I built a space but it is super slow. What can I do?": "Medium",
|
| 25 |
-
"How can I run modal X locally?": "Medium",
|
| 26 |
-
"I want to build a space with model Y to do X?": "Hard",
|
| 27 |
-
"How can I serve a model with multiple LoRAs?": "Hard",
|
| 28 |
-
"What's the best model for sentiment analysis on financial text?": "Easy",
|
| 29 |
-
"Are there any medical image segmentation datasets on HuggingFace for CT scans?": "Easy",
|
| 30 |
-
"Which text classification models support 4-bit quantization?": "Medium",
|
| 31 |
-
"Are there inference endpoints available for Whisper large-v3?": "Easy",
|
| 32 |
-
"What's the license for the SA-Med2D-20M dataset?": "Easy",
|
| 33 |
-
"Which vision models fit in 8GB VRAM for image segmentation?": "Medium",
|
| 34 |
-
"What datasets are available for 3D medical image segmentation?": "Medium",
|
| 35 |
-
"Is there a space to do text-to-speech with emotion control?": "Medium",
|
| 36 |
-
"I'm getting \"CUDA out of memory\" when loading Llama-2-7b even though nvidia-smi shows I have 6GB free - what's the issue?": "Medium",
|
| 37 |
-
"My Gradio space shows \"Connection errored out\" after working fine yesterday, no code changes - how can I fix it?": "Medium",
|
| 38 |
-
"I built a Gradio space for Stable Diffusion but inference takes 5+ minutes on a 4090 - what can I do?": "Medium",
|
| 39 |
-
"My Whisper model outputs different transcriptions after quantization to int8 - why?": "Medium",
|
| 40 |
-
"Getting \"RuntimeError: CUDA error: out of memory. Tried to allocate 70.00 MiB\" but only 2.87 GiB is allocated - what's happening?": "Medium",
|
| 41 |
-
"My HuggingFace space build fails with \"failed to create containerd task\" - how to fix?": "Medium",
|
| 42 |
-
"DistilBERT model gives \"you should probably train your model\" warning even though it's a pretrained model from the Hub": "Easy",
|
| 43 |
-
"Space was working fine but now receiving build errors - receiving this error even with a new space": "Medium",
|
| 44 |
-
"Inference is correct locally but wrong on deployed space": "Medium",
|
| 45 |
-
"Getting CUDA OOM despite having enough memory according to nvidia-smi": "Medium",
|
| 46 |
-
"How can I run Mistral-7B-v0.1 locally with multiple LoRA adapters?": "Hard",
|
| 47 |
-
"How can I serve Llama-2-7b with vLLM and dynamically load multiple LoRA adapters?": "Hard",
|
| 48 |
-
"How do I batch inference requests in my Gradio space for better throughput?": "Medium",
|
| 49 |
-
"Can I run Whisper large-v3 with faster-whisper for 4x speedup?": "Medium",
|
| 50 |
-
"How to run Llama 2 on CPU after fine-tuning with LoRA?": "Medium",
|
| 51 |
-
"Best way to handle 50+ concurrent requests in a Gradio space without OOM?": "Hard",
|
| 52 |
-
"How do I add custom stopping criteria for text generation with Transformers?": "Hard",
|
| 53 |
-
"Can I merge multiple LoRA adapters before inference to reduce latency?": "Hard",
|
| 54 |
-
"How can I optimize my LLM inference with one base LLM and multiple LoRA adapters?": "Hard",
|
| 55 |
-
"Compare tokenizers {T_i} for model M on tasks {classification, QA}; report accuracy and average sequence length per task": "Medium",
|
| 56 |
-
"Run a LoRA rank sweep (r in {4, 8, 16, 32}) for model M on dataset D; plot validation perplexity vs VRAM usage and select Pareto-optimal settings": "Hard",
|
| 57 |
-
"Build a streaming dataloader from Parquet on S3 with deterministic shuffling across N workers; validate epoch reproducibility": "Very hard",
|
| 58 |
-
"Find three open-source TTS models with emotion control and list their sample rates and licenses": "Easy",
|
| 59 |
-
"Create a retrieval-augmented QA pipeline: index corpus C with FAISS, connect to model M, and benchmark top-1 accuracy and p95 latency": "Hard",
|
| 60 |
-
"Diagnose a Space where memory grows per request; add no-grad guards, free caches, and demonstrate stable RSS over 10,000 calls": "Hard",
|
| 61 |
-
"Deduplicate dataset D using MinHash LSH at Jaccard >= 0.9 and publish a cleaned HF dataset with provenance columns": "Medium",
|
| 62 |
-
"Add special tokens to tokenizer T and resize model M embeddings; resume pretraining for 10k steps without loss spikes": "Hard",
|
| 63 |
-
"Create a HuggingFace Dataset from CSV file data.csv and push to repo username/my_dataset": "Easy",
|
| 64 |
-
"Build a real-time Whisper transcription Space with VAD and chunked decoding; keep end-to-end latency under 200 ms": "Hard",
|
| 65 |
-
"Quantize model M to 4-bit (bnb.int4) with bitsandbytes; compare perplexity and p95 latency to 8-bit on dataset D; select config with <1% perplexity increase": "Medium",
|
| 66 |
-
"Fuse LoRA adapter A into base model M and export a single safetensors checkpoint; verify logits parity (<1e-5 MSE) vs on-the-fly LoRA": "Hard",
|
| 67 |
-
"Redact PII from dataset D using a transformer NER pipeline; produce a cleaned HuggingFace Dataset with per-entity removal stats and provenance": "Medium",
|
| 68 |
-
"Train a SentencePiece tokenizer (vocab=64k, byte fallback) on corpus C; compare tokenization speed, unknown-token rate, and bytes/token vs tokenizer T": "Hard",
|
| 69 |
-
"Build a sharded FAISS IVF-PQ index for 100M embeddings stored on S3; integrate with HF datasets streaming and report recall@10 and QPS": "Very hard",
|
| 70 |
-
"Fine-tune model M with QLoRA using TRL PPO on dataset D; log KL, reward, and throughput; validate no divergence on a held-out eval": "Hard",
|
| 71 |
-
"Resolve HfHubHTTPError 401 when pushing dataset repo R: diagnose token scopes, git-lfs config, and large file thresholds; document the fix": "Medium",
|
| 72 |
-
"Implement a custom Transformers LogitsProcessor that bans repeated bigrams; add unit tests and benchmark generation quality (BLEU) on dataset D": "Hard",
|
| 73 |
-
"List and download all Hub models tagged 'text-classification' with Apache-2.0 license and size <500MB; save model ids and downloads to CSV": "Easy",
|
| 74 |
-
"Enable speculative decoding in vLLM with draft model D for base model M; benchmark tokens/sec speedup at batch sizes {1,4,16} and max_new_tokens {64,256}": "Very hard",
|
| 75 |
-
"Profile model M under torch.compile modes {reduce-overhead, max-autotune} on GPU G; report tokens/sec, peak VRAM, and compile overhead": "Medium",
|
| 76 |
-
"Detect and remove near-duplicate images in dataset D using CLIP ViT-L/14 embeddings at cosine >= 0.95; publish a cleaned dataset with duplicate_group ids": "Medium",
|
| 77 |
-
"Convert a TensorFlow SavedModel of T5-base to Transformers PyTorch format; verify logits parity (MSE < 1e-4) on 1,000 random prompts": "Hard",
|
| 78 |
-
"Enable FlashAttention-2 in a Transformers training loop for model M; benchmark step time and confirm loss parity over 2,000 steps vs baseline": "Hard",
|
| 79 |
-
"Deploy vLLM for model M with hot-swappable LoRA adapters {A_i}; provide an API to switch adapters and demonstrate <200 ms switch latency under load": "Very hard",
|
| 80 |
-
"Implement a custom Trainer callback to log gradient norms, activation histograms, and learning rate; diagnose periodic loss spikes and propose a fix": "Hard",
|
| 81 |
-
"Build a bilingual RAG pipeline indexing corpora {en, es} with FAISS HNSW; evaluate exact match@1 on dataset D and report p95 latency": "Hard",
|
| 82 |
-
"Run a mixed-precision sweep (fp16 vs bf16) for model M on A100 and RTX 3090; compare convergence, throughput, and numerical stability issues": "Medium",
|
| 83 |
-
"Create a Gradio Space that batches Whisper-large-v3 transcription via queue + chunked decoding; maintain real-time factor <= 0.5 on a T4": "Hard",
|
| 84 |
-
"List five OCR datasets on the Hub with line-level annotations; include licenses and approximate image counts": "Easy",
|
| 85 |
-
"List models on the Hub tagged 'summarization' that offer safetensors weights and 4-bit quantization; output model ids": "Easy",
|
-  "Evaluate safety filters of models {M_i} on red-team prompt set R; report jailbreak rate and false positive rate": "Medium",
-  "Run a prompt template ablation for chat model M on dataset D; compare {alpaca, chatml, llama2} formats and report exact match and average output length": "Hard",
-  "Implement tensor parallelism for model M in framework F and show linear scaling across 2\u20138 GPUs with <=10% gap from ideal": "Very hard",
-  "Convert and shard dataset D into WebDataset tar files (~500MB/shard); build a streaming loader with checksum validation": "Medium",
-  "Deploy a Spaces app serving Stable Diffusion XL with ControlNet; add output caching and keep p95 latency <1s for 20 concurrent users": "Hard",
-  "Diagnose and fix 'shape mismatch' when loading LoRA into model M after tokenizer resize; provide minimal repro and patch": "Medium",
-  "Add a detailed model card to repo username/model_M with training data, intended use, limitations, and evaluation results": "Easy",
-  "Enable KV cache quantization (int8) in Transformers for model M; compare tokens/sec and ROUGE-L on dataset D vs fp16 cache": "Hard",
-  "Detect and redact license-incompatible samples in dataset D by matching SPDX identifiers and source domains; publish a compliance report": "Medium",
-  "Profile vLLM serving of model M with paged attention; tune block_size to maximize tokens/sec and report p50/p95 latency and peak VRAM": "Medium",
-  "Filter dataset D for toxic content using classifier C; log per-label removal rates and recreate stratified train/valid/test splits": "Medium",
-  "Train a unigram tokenizer (vocab=80k) on corpora {en, fr}; fine-tune T5-small and compare BLEU vs a BPE baseline; report tokenization speed and OOV rate": "Hard",
-  "Run distributed evaluation of models {M_i} on benchmark B across 4 GPUs with DeepSpeed-Inference; ensure identical metrics across 3 seeds": "Hard",
-  "Find three open-source ASR models that provide word-level timestamps; record licenses and expected WER on LibriSpeech": "Easy",
-  "Diagnose intermittent 'Address already in use' crashes in a FastAPI Space; add graceful shutdown and port probing, verifying stability over 1,000 restart cycles": "Medium",
-  "Export a LoRA-finetuned Llama checkpoint to GGUF for llama.cpp; validate perplexity parity (<=1% drift) on WikiText-2": "Hard",
-  "Construct a streaming RAG pipeline over S3-stored corpus C with Chroma; index ~1B tokens, implement shard rebalancing, and benchmark recall@5 and QPS": "Very hard",
-  "List Hub datasets tagged 'speech-emotion-recognition' with CC-BY or CC-BY-SA licenses and >=10k utterances; write dataset ids and sizes to JSON": "Easy",
-  "Train a summarization reward model via pairwise ranking on dataset D; apply DPO to model M and report ROUGE-L and human win rate": "Hard",
-  "Find four open-source OCR models that output line- or paragraph-level text and provide ONNX or TensorRT exports; list their licenses and maximum input resolutions": "Easy",
-  "Verify tokenizer special tokens for model M are preserved after adding new tokens; write a unit test that asserts CLS/SEP/PAD ids are unchanged before and after resize": "Medium",
-  "Implement a constrained decoder for model M that enforces a JSON schema via a custom Transformers LogitsProcessor; add unit tests and benchmark latency on dataset D": "Hard",
-  "Build a multilingual RAG index for 50M documents using mDPR with sharded storage on S3; support hot index reloads and report recall@10 and p95 latency at 100 QPS": "Very hard",
-  "Quantize T5-base to 8-bit with bitsandbytes (LLM.int8) and compare ROUGE-L and tokens/sec to fp16 on CNN/DailyMail; keep ROUGE-L drop <=1%": "Medium",
-  "Diagnose VRAM growth in a vLLM server at batch size 32; add profiling, fix cache eviction behavior, and demonstrate flat memory over 10,000 requests": "Hard",
-  "Convert a HuggingFace TokenizerFast to a SentencePiece model; verify >=99.9% token-level agreement on 10,000 sentences and measure tokenization speed delta": "Medium",
-  "Train a multi-task adapter stack for {summarization, QA, NLI} on model M; implement routing by prompt prefix and report per-task metrics and cross-task interference": "Very hard",
-  "Assess license compatibility between model M (Apache-2.0) and dataset D (CC-BY-SA); produce a one-paragraph verdict with rationale and reference links": "Easy",
-  "Enable FSDP with activation checkpointing for a 13B model across 2\u00d7A100 GPUs; achieve <=10% throughput loss vs baseline and verify loss parity over 1,000 steps": "Hard",
-  "List three datasets for code summarization with permissive licenses; output their dataset ids and license names": "Easy",
-  "Set up nightly continuous evaluation of model M on benchmarks {B_i}; log metrics to Weights & Biases and alert on >2% regression vs last 7-day rolling mean": "Medium",
-  "Implement streaming text generation in a Gradio Space for model M using server-sent events; cap median token emission delay at <50 ms": "Hard",
-  "Scale out training of a 7B model with FSDP + ZeRO across 8 GPUs; demonstrate checkpoint save/restore and achieve throughput within 15% of ideal linear scaling": "Very hard",
-  "Export a mixture-of-experts PyTorch model to ONNX and run with TensorRT; verify top-1 accuracy within 0.5% of PyTorch on dataset D": "Medium",
-  "Identify whether model M supports FlashAttention-2 from its config or source; provide supporting repo links and a yes/no compatibility flag": "Easy",
-  "Build an audio deduplication pipeline for dataset D using embedding model E with cosine similarity >= 0.98; publish grouped duplicate ids and a cleaned manifest": "Hard",
-  "Diagnose slow tokenization in a Transformers pipeline; profile, switch to a fast tokenizer, and demonstrate 2\u00d7 end-to-end speedup on 1M lines": "Medium",
-  "Implement a contrastive preference learning loss in TRL; train model M on dataset D and compare KL, reward variance, and human win rate vs a PPO baseline": "Hard",
-  "Build an elastic RAG service with Ray that autoscales FAISS shards on S3, supports live corpus updates, and maintains p95 latency <500 ms at 200 QPS": "Very hard",
-  "List five chat-optimized LLMs on the Hub that include a tokenizer chat_template and safetensors weights; output model ids": "Easy",
-  "Find three biomedical NER datasets with Apache-2.0 or MIT licenses; return dataset ids and license names": "Easy",
-  "Create a dataset viewer Space that streams Parquet shards from the Hub using datasets streaming; implement server-side filtering and pagination": "Medium",
-  "Enable gradient checkpointing and optimizer state offloading for model M with Accelerate; report step time and peak VRAM vs baseline on a single A100": "Medium",
-  "Diagnose and fix 'size mismatch for position_embeddings' after increasing max_position_embeddings; provide a minimal repro and a migration script": "Medium",
-  "Implement a regex-constrained Transformers LogitsProcessor that enforces ISO-8601 timestamps; add unit tests and report generation latency overhead on dataset D": "Hard",
-  "Train language-specific LoRA adapters for {en, es, de} on model M; add an automatic language router and report per-language BLEU and cross-language interference": "Hard",
-  "Build a speaker diarization + ASR Gradio Space using pyannote and Whisper-large-v3; achieve DER <= 12% and real-time factor <= 0.75 on a T4": "Hard",
-  "Implement multi-draft speculative decoding with dynamic draft-model selection per prompt; integrate with vLLM and benchmark tokens/sec speedup at batch sizes {1,8,32}": "Very hard",
-  "Convert a TensorFlow DistilBERT SavedModel to ONNX (opset 17) and validate logits parity (MSE < 1e-4) on 1,000 random inputs; measure CPU inference speedup vs TensorFlow": "Medium",
-  "Evaluate alignment drift after SFT: compare model M vs base M0 on prompt set P; report win rate, refusal rate, and average output length": "Medium",
-  "Enable KV cache int4 quantization in vLLM for model M; benchmark tokens/sec and exact match on dataset D vs fp16 cache": "Hard",
-  "Implement variable-length packing in a HF Datasets + Transformers training loop; ensure epoch-level sample coverage matches baseline and no truncation beyond max_length": "Medium",
-  "Build a multi-tenant LoRA router over vLLM: on-demand load adapters from the Hub with LRU eviction; sustain 100 tenants and <300 ms adapter swap latency under load": "Very hard",
-  "Audit generations for PII leakage on prompt set P using detector C; compute precision, recall, and false positive rate; redact before logging and publish a compliance summary": "Medium",
-  "Merge a stack of PEFT adapters {A_i} into base model M to produce a single FP16 checkpoint; validate perplexity drift <=0.5% on dataset D and export safetensors": "Hard",
-  "Find three Spaces that demonstrate constrained JSON generation; return Space ids and URLs": "Easy",
-  "Deploy a cross-lingual vector search service with multilingual-e5-large; shard FAISS across 3 nodes and measure mAP@10 and p95 latency at 500 QPS": "Very hard",
-  "Quantize attention and MLP projections only with bitsandbytes (selective 8-bit); compare peak VRAM, tokens/sec, and ROUGE-L vs full-model 8-bit on dataset D": "Hard",
-  "Fix \"Token indices sequence length is longer than the specified maximum\" after tokenizer resize; add truncation with stride and update generation config; verify no validation metric regression": "Medium",
-  "Identify splits for dataset D and output split names with sample counts": "Easy",
-  "Find five multilingual sentence-embedding models on the Hub with Apache-2.0 license; return model ids": "Easy",
-  "Set up CI to run evaluation suite E for model M nightly; fail the job if any metric drops >1% vs 7-day rolling mean": "Medium",
-  "Add length normalization to beam search for model M; compare vs baseline on dataset D and report ROUGE-L and average output length": "Medium",
-  "Detect per-sample language for dataset D; add a 'lang' column and recreate train/valid/test splits preserving language proportions": "Medium",
-  "Benchmark vLLM KV-cache eviction strategies (e.g., LRU vs TTL) for model M at batch sizes {1,8,32}; report tokens/sec and peak VRAM": "Medium",
-  "Implement a custom DataCollator that packs multiple documents for summarization with separator tokens; add unit tests to prevent cross-sample leakage": "Hard",
-  "Build a PDF-to-dataset pipeline: OCR pages with model Donut, store word-level bboxes, and publish a HuggingFace Dataset with a viewer Space": "Hard",
-  "Train a ColBERT reranker on corpus C + pairs dataset D; integrate into a RAG search service and report recall@10 and p95 latency delta": "Hard",
-  "Deploy vLLM for model M with multi-GPU tensor-parallel inference across 2 nodes using NCCL; demonstrate near-linear throughput scaling and deterministic outputs across 3 seeds": "Very hard",
-  "List four Hub models tagged 'named-entity-recognition' that declare bitsandbytes 8-bit support in their README; output model ids": "Easy",
-  "Find three Spaces that provide real-time TTS streaming demos; return Space ids and reported sample rates": "Easy",
-  "Create a Spaces app that visualizes transformer attention maps for a ViT model using Captum; keep heatmap rendering under 200 ms for 224x224 images": "Medium",
-  "Set up datasets streaming with resumable downloads and exponential backoff for S3-hosted Parquet shards; verify checksum integrity after killing and resuming the job": "Medium",
-  "Build a tokenizer migration tool to convert a SentencePiece model to a HuggingFace tokenizers JSON with byte-fallback; assert >=99.95% token-level agreement on 20k sentences and report speed delta": "Medium",
-  "Implement a custom DataCollator for span masking with variable block sizes for byte-level BPE; add unit tests and demonstrate MLM loss parity over 10k steps on WikiText-103": "Hard",
-  "Add speculative decoding with a small draft model to a Transformers-based text-generation server; expose a per-request flag and benchmark tokens/sec speedup at batch sizes {1,8,32}": "Hard",
-  "Train an online knowledge-distillation SFT: teacher M0 -> student M on dataset D; log KL divergence, token agreement, and throughput; cap metric drop at <=2% vs teacher": "Hard",
-  "Deploy a multi-region vLLM service on Kubernetes with adaptive batching and hot LoRA adapter loading; sustain 200 QPS with p95 latency <300 ms and zero-downtime rollouts": "Very hard",
-  "Build a sharded cross-encoder reranking service with Ray: distribute ColBERT scoring across nodes, integrate with FAISS retrieval, and maintain recall@10 within 1% of single-node baseline at 500 QPS": "Very hard",
-  "List four Spaces that perform multilingual OCR with layout extraction; return Space ids and supported languages": "Easy",
-  "Find five Hub datasets for code generation evaluation with permissive licenses; output dataset ids and license names": "Easy",
-  "Add gradient accumulation and gradient clipping to a Transformers Trainer finetune of model M; report step time, peak VRAM, and validation metric vs baseline": "Medium",
-  "Implement document chunking with sliding windows and overlap in a Datasets map pipeline; add doc_id and span indices and verify no segment exceeds max_length": "Medium",
-  "Export a fine-tuned BERT model to TorchScript and ONNX; verify logits parity (MSE < 1e-4) on 1,000 samples and compare CPU throughput": "Medium",
-  "Diagnose 'pad_token_id is not set' warnings during generation; add a PAD token, resize embeddings, and write a unit test asserting identical logits pre/post fix on 200 prompts": "Medium",
-  "Implement diverse beam search (group_beam_search) for model M; evaluate on dataset D and report ROUGE-L, distinct-n, and average output length vs standard beam search": "Hard",
-  "Build a multi-modal RAG demo that indexes image captions with CLIP and uses LLM M to answer visual questions; report top-1 accuracy and p95 latency": "Hard",
-  "Profile activation and KV-cache memory during generation for model M; log per-layer footprints and reduce peak usage via attention slicing; show tokens/sec and VRAM deltas": "Hard",
-  "Construct a 200M-document FAISS hybrid (IVF-PQ + HNSW) index with memory-mapped shards on S3; support live add/delete and benchmark recall@10 and QPS at 300 QPS": "Very hard",
-  "List five Hub datasets tagged 'topic-modeling' with MIT or Apache-2.0 licenses; output dataset ids": "Easy",
-  "Find three Spaces that offer real-time grammar correction with streaming tokens; return Space ids and URLs": "Easy",
-  "Convert a spaCy en_core_web_trf NER model to ONNX and wrap it in a Transformers TokenClassification pipeline; verify entity text/label/span parity on 1,000 sentences": "Medium",
-  "Set up a GitHub Actions workflow that snapshots tokenizer T weekly and fails if vocab or special token ids drift vs the last snapshot; upload a diff artifact": "Medium",
-  "Profile a Datasets map pipeline on corpus C; refactor to use batched=True, num_proc>1, and caching; achieve >=2\u00d7 speedup while preserving deterministic ordering across runs": "Medium",
-  "Implement a custom Transformers StoppingCriteria that halts when JSON braces are balanced or max nesting depth is reached; add unit tests and benchmark latency overhead on dataset D": "Hard",
-  "Build a visual-and-tabular RAG pipeline: index images with CLIP and CSV tables with TAPAS; answer mixed queries using LLM M; report EM@1 and p95 latency at 50 QPS": "Hard",
-  "Enable KV-cache int4 quantization during generation in Transformers for model M; compare tokens/sec and exact match vs fp16 cache on dataset D; keep metric drop <=1%": "Hard",
-  "Implement a hot-reloadable sharded FAISS IVF-PQ index for multilingual-e5-base with live add/delete and background re-training; sustain 200 QPS with p95 latency <400 ms across 3 nodes": "Very hard",
-  "Deploy a geo-distributed vLLM + LoRA adapter gateway across two regions with consistent hashing and zero-downtime adapter updates; ensure identical outputs across 3 seeds and report cross-region p95 latency": "Very hard",
-  "List five Hub LLM repos that disclose training token counts in their model cards; output model ids and token totals": "Easy",
-  "Find two ready-to-use Spaces for speaker diarization compatible with Whisper; return Space ids and URLs": "Easy",
-  "Create a hashing-based dataset splitter using column 'doc_id' to produce reproducible train/valid/test; verify identical splits across two machines and Python versions": "Medium",
-  "Resolve HTTP 403 when creating an organization dataset via the Hub API; diagnose token scopes and org permissions; provide a minimal repro script and the fix": "Medium",
-  "Export a PEFT LoRA adapter from a fine-tuned Llama checkpoint as standalone safetensors with a correct adapter_config.json; push to the Hub and verify PEFT.from_pretrained loads it": "Medium",
-  "Enable multi-query attention in model M within Transformers; benchmark tokens/sec and peak VRAM vs multi-head attention and verify perplexity parity over 2,000 steps": "Hard",
-  "Audit code dataset D for contamination against {HumanEval, MBPP} using exact substring and 3-gram Jaccard >= 0.9; publish per-source contamination rates and a cleaned dataset": "Hard",
-  "Implement contrastive search decoding for model M with tunable alpha; compare ROUGE-L, distinct-n, and latency vs nucleus sampling on dataset D": "Hard",
-  "Implement pipeline parallelism for model M across 4 GPUs with Accelerate; achieve near-linear scaling (<=15% gap), support checkpoint save/restore, and ensure deterministic outputs across 3 seeds": "Very hard",
-  "Deploy a Spaces app that serves two ASR models with automatic language ID routing; maintain real-time factor <= 0.6 on a single T4 and log per-language latency": "Hard",
-  "Benchmark JSON-constrained decoding across models {M_i}; report JSON validity rate, exact match on dataset D, and p95 latency under streaming": "Hard",
-  "Filter a multilingual dataset D to non-English using fastText language ID; recreate stratified splits and report per-language retention and drop rates": "Medium",
-  "Enable paged attention in a custom Transformers generation loop for model M; verify token-level parity on 500 prompts and measure peak VRAM change": "Hard",
-  "Shard a 1B-token text corpus into deterministic HF Datasets processing across 16 workers; validate byte-for-byte identical outputs across two runs": "Very hard",
-  "Compare LoRA vs QLoRA fine-tunes of Mistral-7B on GSM8K; track loss, exact match, and throughput; select the lowest-VRAM config within 2% EM of best": "Hard",
-  "Deploy a quantized T5 encoder-decoder on Triton Inference Server via a Python backend; add token streaming and achieve >=1.5x throughput vs PyTorch baseline": "Hard",
-  "Find three Spaces that perform audio source separation (vocals/music); return Space ids and reported sample rates": "Easy",
-  "Merge a PEFT IA3 adapter stack into Llama-3-8B base weights; verify perplexity drift <=0.3% on WikiText-103 and export safetensors": "Hard",
-  "Resolve DeepSpeed ZeRO-3 stalls during S3 checkpointing; implement async multipart uploads and show stable 5-minute checkpoint cadence over 2 hours": "Very hard",
-  "Set up CI to run contamination checks on dataset R against {TruthfulQA, SQuAD} using 4-gram overlap; fail if rate >0.5% and attach offending ids as artifacts": "Medium",
-  "List four Hub datasets for sarcasm detection in English; return dataset ids and license tags": "Easy",
-  "Identify whether tokenizer T enables byte_fallback in tokenizer.json; output true/false and the file path": "Easy",
-  "Find three Spaces that showcase streaming chat with token-by-token updates; return Space ids and whether they use SSE or websockets": "Easy",
-  "Create a Datasets loader that parses Praat TextGrid files into word-level timestamps aligned with audio; publish a dataset with an 'audio' column and validate 100 sample alignments": "Medium",
-  "Set up a GitHub Actions workflow that lints model cards for repos {R_i} to require intended use, training data, and limitations; fail PRs and post a summary comment on violations": "Medium",
-  "Containerize a Gradio Space with optional FlashAttention build: detect GPU capability at startup, compile kernels if supported, and fall back gracefully on unsupported GPUs; test on T4 and A100": "Medium",
-  "Evaluate long-context retrieval via needle-in-a-haystack for models {M_i} at context lengths {8k, 32k, 64k}; report retrieval accuracy, tokens/sec, and the max stable context length": "Hard",
-  "Implement a curriculum sampler as a HuggingFace Trainer callback that schedules sample difficulty over epochs; compare convergence and final eval metrics vs random sampling": "Hard",
-  "Add on-the-fly near-duplicate filtering during training using SimHash over token ids; log per-epoch removal rates and verify no convergence regressions vs a deduplicated baseline": "Hard",
-  "Deploy a dual-backend inference router using vLLM and TensorRT-LLM that selects backend per prompt length to minimize latency; maintain deterministic outputs across 3 seeds and sustain 300 QPS with p95 latency SLOs": "Very hard",
-  "Identify max_position_embeddings and whether rope_scaling is enabled for model M from its config; output both values.": "Easy",
-  "List five Vision Transformer models on the Hub that provide safetensors and have a default image size >= 384; output model ids.": "Easy",
-  "Find three Spaces that stream machine-translation outputs token-by-token; return Space ids and whether they use SSE or websockets.": "Easy",
-  "Diagnose bursts of [UNK] after adding special tokens to tokenizer T; enable byte_fallback, retrain embeddings for 2k steps, and show unknown-token rate <= baseline+0.1% on corpus C.": "Medium",
-  "Create a dataset viewer Space for a dataset with a nested JSON column; convert to Arrow struct arrays, implement server-side filtering on nested keys, and verify row counts match the source.": "Medium",
-  "Set up a GitHub Action that hits /health and a no-op inference on Space S after each deploy; fail if cold-start median latency >10s and attach server logs as an artifact.": "Medium",
-  "Implement a SQL grammar-constrained Transformers LogitsProcessor using an LL(1) parser; evaluate on Spider dev and report exact match and p95 latency overhead vs nucleus sampling.": "Hard",
-  "Add CPU-tier KV-cache offloading with pinned memory for model M in a custom generation loop; compare tokens/sec and peak VRAM vs baseline at context lengths {4k, 16k, 32k}.": "Hard",
-  "Deploy a batched cross-encoder reranker microservice using bge-reranker-base; keep recall@10 within 1% of single-request baseline and achieve >=2\u00d7 QPS at 100 concurrent users.": "Hard",
-  "Build a heterogeneous inference gateway that routes requests to vLLM or llama.cpp based on prompt length and GPU load; ensure identical normalized outputs across 3 seeds and sustain 200 QPS with p95 latency <300 ms.": "Very hard",
-  "Determine whether tokenizer T strips accents (strip_accents); output true/false and the file path where the setting is defined.": "Easy",
-  "List four Hub datasets for hate-speech detection in English; return dataset ids and license tags.": "Easy",
-  "Write a Datasets loader for a paginated OAuth2 REST API; cache pages, support streaming, and provide deterministic sharding across 8 workers; verify identical row counts across two runs.": "Medium",
-  "Add request-level caching (ETag/If-None-Match) to a Gradio summarization Space; achieve >=1.8\u00d7 QPS at 50 concurrent users and report cache hit ratio and p95 latency.": "Medium",
-  "Enable HuggingFace tokenizers parallelism and batched encoding for corpus C; benchmark throughput and memory on 10M lines and ensure deterministic outputs across 3 runs.": "Medium",
-  "Set up CI to lint dataset cards in repos {R_i} for required fields {license, citation, dataset_summary}; fail PRs and post a summary comment with missing keys.": "Medium",
-  "Run a parameter-efficient finetuning sweep comparing LoRA, IA3, and prefix-tuning on RoBERTa-base for MNLI; report accuracy, training time, and peak VRAM; select a Pareto-optimal config.": "Hard",
-  "Implement a Transformers LogitsProcessor that enforces balanced parentheses and proper quoted-string escaping; add unit tests and benchmark latency overhead on dataset D.": "Hard",
-  "Export Whisper-medium to ONNX with dynamic axes and int8 weights; verify word-timestamp parity on 500 clips and measure CPU real-time factor improvement >=1.3\u00d7 vs PyTorch.": "Hard",
-  "Deploy a geo-replicated RAG service: shard FAISS HNSW across three regions with conflict-free index metadata sync; sustain 300 QPS with p95 latency <450 ms and recall@10 within 1% of single-region baseline.": "Very hard",
-  "Compare cased vs uncased tokenization for BERT on CoNLL-2003 NER; train both, and report F1, average tokens per sentence, and training time.": "Medium",
-  "Create a HuggingFace Datasets loader for EPUB files: extract chapter text and embedded images into Arrow columns, support streaming and deterministic sharding across 8 workers; verify identical row counts across two runs.": "Medium",
-  "Configure a Hub webhook to trigger CI when a model card (README.md) changes; fail the job if sections {intended use, limitations} are missing and post a checklist comment on the PR.": "Medium",
-  "Add a reranking cache to a RAG service keyed by (query, candidate_ids); achieve >=50% cache hit at 100 QPS and keep recall@10 within 0.5% of baseline.": "Hard",
-  "Fix torch.compile graph breaks in a Transformers training loop; patch non-compilable ops, re-enable compilation, and demonstrate >=1.4\u00d7 step-time speedup with matching loss over 2,000 steps.": "Hard",
-  "Compute 95% bootstrap confidence intervals for ROUGE-L on dataset D over 3 random seeds; flag regressions when the new CI lies entirely below last week's baseline CI.": "Medium",
-  "Build a batch image-captioning Space with ViT-GPT2: accept ZIP uploads, use queue-based batching, and keep p95 latency <2s for 32 images.": "Medium",
-  "Implement hybrid parallelism (tensor + pipeline) for a 13B encoder-decoder using Accelerate; scale across 8 GPUs with <=15% gap from linear, support elastic resize (8->6 GPUs) without losing determinism, and verify checkpoint save/restore.": "Very hard",
-  "Find five Spaces that stream live vision-language captioning (e.g., LLaVA or BLIP); return Space ids and reported FPS.": "Easy",
-  "Identify whether tokenizer T applies Unicode normalization (NFKC/NFC/NFD/NFKD) and where it is configured; output the mode and file path.": "Easy",
-  "Identify whether model repo M stores weights exclusively as safetensors; output true/false and list the .safetensors file paths.": "Easy",
-  "List three multilingual sentence-embedding models on the Hub that provide ONNX exports; return model ids.": "Easy",
-  "Determine if tokenizer T lowercases text (do_lower_case or lowercase flag); output true/false and the file path or JSON key where it is set.": "Easy",
-  "Set up a GitHub Action to run a smoke-test text generation for model M on each push; fail if median time to first token >2s and attach container logs as an artifact.": "Medium",
-  "Create a Datasets preprocessing pipeline that tokenizes to max_length=512 with stride=64 and retains an 'orig_text' column; verify row counts match input and no NaNs after caching.": "Medium",
-  "Resolve 'git-lfs: command not found' when pushing model repo R to the Hub; install and configure Git LFS, set an appropriate large file threshold, and provide a minimal repro plus the verified fix.": "Medium",
-  "Enable KV-cache CPU offloading in a custom Transformers generation loop for model M; benchmark tokens/sec and peak VRAM vs baseline at context lengths {4k, 8k}.": "Hard",
-  "Implement LoRA rank warmup (r: 4\u219232 over the first 1,000 steps) in a custom Trainer; fine-tune model M on dataset D and report validation perplexity and peak VRAM vs fixed r=32.": "Hard",
-  "Export Whisper-small to TensorRT via ONNX (opset 18) with dynamic axes; verify word-timestamp parity (median diff \u22640.05s) on 300 clips and measure \u22651.3\u00d7 GPU speedup vs PyTorch.": "Hard",
-  "Deploy a multi-tenant RAG service that hot-loads per-tenant FAISS indices from S3, shares a reranker, and sustains 200 QPS with p95 latency <350 ms across 1,000 tenants; maintain recall@10 within 1% of a single-tenant baseline.": "Very hard"
-}
eval/hf_agent_connector.py
DELETED
|
@@ -1,89 +0,0 @@
|
|
| 1 |
-
from __future__ import annotations
|
| 2 |
-
|
| 3 |
-
import asyncio
|
| 4 |
-
import sys
|
| 5 |
-
from pathlib import Path
|
| 6 |
-
from typing import Any
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
from agent.config import Config, load_config
|
| 10 |
-
from agent.core.agent_loop import Handlers
|
| 11 |
-
from agent.core.session import Session
|
| 12 |
-
from agent.core.tools import ToolRouter
|
```python
PROJECT_ROOT = Path(__file__).resolve().parents[1]
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))


def _resolve_project_path(path: str | Path) -> Path:
    candidate = Path(path)
    if candidate.is_absolute():
        return candidate
    return (PROJECT_ROOT / candidate).resolve()


class AgentResponseGenerator:
    """
    Thin async wrapper that executes the existing agent loop once and
    returns the assistant's final message.
    """

    def __init__(self, config_path: str | Path, max_iterations: int = 300) -> None:
        self.config_path = _resolve_project_path(config_path)
        self.config: Config = load_config(str(self.config_path))
        self.max_iterations = max_iterations

    @property
    def model_name(self) -> str:
        """Expose the agent model name for downstream logging."""
        return self.config.model_name

    async def run(self, prompt: str) -> str:
        """
        Execute the agent loop for a single prompt and return the assistant reply.
        """
        tool_router = ToolRouter(self.config.mcpServers)

        async with tool_router:
            session = Session(asyncio.Queue(), config=self.config)
            session.tool_router = tool_router
            await Handlers.run_agent(
                session,
                prompt,
                max_iterations=self.max_iterations,
            )
            return self._latest_assistant_response(session)

    def _latest_assistant_response(self, session: Session) -> str:
        """
        Extract the final assistant response from the session history.
        """
        for message in reversed(session.context_manager.items):
            if getattr(message, "role", None) == "assistant":
                return _content_to_text(getattr(message, "content", ""))

        raise RuntimeError("Agent did not produce an assistant message.")


def _content_to_text(content: Any) -> str:
    """
    Convert LiteLLM content payloads (str or list[dict]) into plain text.
    """
    if isinstance(content, str):
        return content

    if isinstance(content, list):
        parts: list[str] = []
        for block in content:
            if isinstance(block, dict):
                text = block.get("text")
                if text:
                    parts.append(str(text))
            else:
                text = getattr(block, "text", None)
                if text:
                    parts.append(str(text))
        return "\n".join(parts)

    return str(content)
```
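The `_content_to_text` helper carries the only non-trivial logic in the connector; as a standalone sketch (the function body mirrors the one removed above, minus the session plumbing), it can be exercised without the agent stack:

```python
from typing import Any


def content_to_text(content: Any) -> str:
    # Mirrors eval/hf_agent_connector.py's _content_to_text: flatten a
    # LiteLLM-style payload (plain string, or a list of text blocks) to text.
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        parts: list[str] = []
        for block in content:
            text = block.get("text") if isinstance(block, dict) else getattr(block, "text", None)
            if text:
                parts.append(str(text))
        return "\n".join(parts)
    return str(content)


print(content_to_text([{"type": "text", "text": "first"}, {"type": "text", "text": "second"}]))
```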
eval/hf_io.py
DELETED
@@ -1,215 +0,0 @@

````python
"""
HuggingFace Dataset I/O Utilities

Reusable functions for uploading and downloading JSONL data to/from HuggingFace Hub.
Supports the dataset_name@config_name notation for managing multiple configurations.
"""

from typing import List, Optional

import pandas as pd
from datasets import Dataset, load_dataset


def list_dataset_configs(dataset_name: str) -> Optional[List[str]]:
    """
    List all available configs for a dataset on HuggingFace Hub.

    Args:
        dataset_name: Name of the dataset (e.g., "username/my-dataset")

    Returns:
        List of config names, or None if unable to retrieve

    Example:
        >>> configs = list_dataset_configs("username/hf-agent-benchmark")
        >>> print(configs)
        ['default', 'rubrics', 'evaluations']
    """
    try:
        from datasets import get_dataset_config_names

        configs = get_dataset_config_names(dataset_name)
        return configs
    except Exception as e:
        print(f"✗ Failed to list configs: {type(e).__name__}: {str(e)}")
        return None


def df_to_hub(
    df: pd.DataFrame,
    dataset_spec: str,
    split: str = "train",
    private: bool = False,
) -> bool:
    """
    Upload a pandas DataFrame directly to HuggingFace Hub as a dataset.

    This function converts a pandas DataFrame to a HuggingFace Dataset and uploads
    it to the Hub. This is useful for uploading data directly without creating an
    intermediate JSONL file.

    Args:
        df: pandas DataFrame to upload. All column types should be serializable.
            Example DataFrame:
            ```
            | question | solution | rubric |
            |----------|----------|--------|
            | "How..." | "You..." | {...}  |
            ```

        dataset_spec: Dataset specification in the format "dataset_name" or
            "dataset_name@config_name". Examples:
            - "username/my-dataset" (uses "default" config)
            - "username/my-dataset@rubrics" (uses "rubrics" config)
            - "username/my-dataset@evaluations" (uses "evaluations" config)

        split: The dataset split name. Defaults to "train". Common values:
            - "train": Training or main data
            - "validation": Validation data
            - "test": Test data

        private: Whether to create a private dataset. Defaults to False (public).

    Returns:
        bool: True if upload succeeded, False otherwise

    Raises:
        ValueError: If DataFrame is empty
        Exception: For HuggingFace Hub upload errors

    Example:
        >>> import pandas as pd
        >>> df = pd.DataFrame({
        ...     "question": ["How to train?", "What is fine-tuning?"],
        ...     "solution": ["Use trainer...", "Fine-tuning is..."],
        ...     "rubric": ['[{"title": "...", ...}]', '[{"title": "...", ...}]']
        ... })
        >>> df_to_hub(df, "username/dataset@rubrics")

    Notes:
        - Requires authentication via `huggingface-cli login` or HF_TOKEN env var
        - DataFrame columns with complex objects should be serialized first (e.g., to JSON strings)
        - If the dataset doesn't exist, it will be created automatically
        - Empty DataFrames will raise ValueError to prevent uploading invalid data
    """
    # Validate DataFrame
    if df.empty:
        raise ValueError("DataFrame is empty")

    # Parse dataset specification
    if "@" in dataset_spec:
        dataset_name, config_name = dataset_spec.split("@", 1)
    else:
        dataset_name = dataset_spec
        config_name = "default"

    try:
        print("\nUploading DataFrame to HuggingFace Hub...")
        print(f"  Dataset: {dataset_name}")
        print(f"  Config: {config_name}")
        print(f"  Split: {split}")
        print(f"  Rows: {len(df)}")
        print(f"  Columns: {list(df.columns)}")

        # Convert DataFrame to HuggingFace Dataset
        dataset = Dataset.from_pandas(df)

        # Upload to HuggingFace Hub
        dataset.push_to_hub(
            dataset_name,
            config_name=config_name,
            split=split,
            private=private,
        )

        print(
            f"✓ Successfully uploaded to {dataset_name}@{config_name} (split: {split})"
        )
        return True

    except Exception as e:
        print(f"✗ Failed to upload to HuggingFace: {type(e).__name__}: {str(e)}")
        return False


def hub_to_df(
    dataset_spec: str,
    split: str = "train",
) -> Optional[pd.DataFrame]:
    """
    Download a dataset from HuggingFace Hub as a pandas DataFrame.

    This function downloads a dataset from the HuggingFace Hub and returns it as a
    pandas DataFrame for immediate use in Python.

    Args:
        dataset_spec: Dataset specification in the format "dataset_name" or
            "dataset_name@config_name". Examples:
            - "username/my-dataset" (uses "default" config)
            - "username/my-dataset@rubrics" (uses "rubrics" config)
            - "username/my-dataset@evaluations" (uses "evaluations" config)

        split: The dataset split to download. Defaults to "train". Common values:
            - "train": Training or main data
            - "validation": Validation data
            - "test": Test data

    Returns:
        pd.DataFrame: Downloaded data as pandas DataFrame, or None if failed

    Raises:
        ValueError: If the dataset/config/split doesn't exist
        Exception: For HuggingFace Hub download errors

    Example:
        >>> # Download rubrics from specific config
        >>> df = hub_to_df("username/hf-agent-benchmark@rubrics")
        >>> print(df.head())
        >>> print(f"Shape: {df.shape}")

        >>> # Download evaluation results
        >>> results_df = hub_to_df(
        ...     "username/hf-agent-benchmark@evaluations",
        ...     split="test"
        ... )

    Notes:
        - Requires authentication for private datasets via `huggingface-cli login`
        - Downloaded data will be in the same format as uploaded (preserves structure)
        - Large datasets may take time to download and consume significant memory
        - For very large datasets, consider using streaming or download_hf_to_jsonl
    """
    # Parse dataset specification
    if "@" in dataset_spec:
        dataset_name, config_name = dataset_spec.split("@", 1)
    else:
        dataset_name = dataset_spec
        config_name = "default"

    try:
        print("\nDownloading from HuggingFace Hub...")
        print(f"  Dataset: {dataset_name}")
        print(f"  Config: {config_name}")
        print(f"  Split: {split}")

        # Download dataset from HuggingFace Hub
        dataset = load_dataset(
            dataset_name,
            name=config_name,
            split=split,
        )

        print(f"  Downloaded {len(dataset)} records")

        # Convert to pandas DataFrame
        df = dataset.to_pandas()

        print("✓ Successfully loaded as DataFrame")
        print(f"  Shape: {df.shape}")
        print(f"  Columns: {list(df.columns)}")
        return df

    except Exception as e:
        print(f"✗ Failed to download from HuggingFace: {type(e).__name__}: {str(e)}")
        return None
````
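Both `df_to_hub` and `hub_to_df` rely on the same `dataset_name@config_name` convention; a minimal sketch of that parsing, extracted from the functions above, behaves as:

```python
def parse_dataset_spec(dataset_spec: str) -> tuple[str, str]:
    # Same split logic used by df_to_hub/hub_to_df: everything after the
    # first "@" is the config name; otherwise fall back to "default".
    if "@" in dataset_spec:
        dataset_name, config_name = dataset_spec.split("@", 1)
    else:
        dataset_name, config_name = dataset_spec, "default"
    return dataset_name, config_name


print(parse_dataset_spec("username/my-dataset@rubrics"))  # → ('username/my-dataset', 'rubrics')
print(parse_dataset_spec("username/my-dataset"))          # → ('username/my-dataset', 'default')
```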
eval/leaderboard.py
DELETED
@@ -1,172 +0,0 @@

```python
"""
Utilities for logging solver scores to a Hugging Face dataset.
"""

from __future__ import annotations

import json
import re
import shutil
import subprocess
import tempfile
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import Any

from huggingface_hub import HfApi, hf_hub_download

AVERAGE_RE = re.compile(r"Average normalized score:\s*([0-9.]+)")
DEFAULT_FILENAME = "records.jsonl"


def _hydra_join(*parts: str | None) -> str:
    tokens = [str(part).strip().replace(" ", "_") for part in parts if part]
    return "/".join(tokens) if tokens else "default"


def detect_agent_version(config_path: str = "agent/config_mcp_example.json") -> str:
    """
    Returns a short string identifying the current agent version:
    <config dir>/<config name>/<git short sha>.
    """

    try:
        commit = (
            subprocess.check_output(["git", "rev-parse", "--short", "HEAD"])
            .decode()
            .strip()
        )
    except Exception:
        commit = "unknown"

    config_file = Path(config_path)
    config_stem = config_file.stem or "config"
    parent_name = config_file.parent.name if config_file.parent.name else None
    return _hydra_join(parent_name, config_stem, commit)


def parse_average_score(text: str) -> float | None:
    """Extracts the 'Average normalized score' value from Inspect logs."""

    match = AVERAGE_RE.search(text)
    if match:
        try:
            return float(match.group(1))
        except ValueError:
            return None
    return None


def latest_log_file(
    log_dir: Path, extensions: tuple[str, ...] = (".eval", ".json")
) -> Path | None:
    """Returns the most recent log file in log_dir matching the provided extensions."""

    if not log_dir.exists():
        return None

    files: list[Path] = []
    for ext in extensions:
        files.extend(log_dir.glob(f"*{ext}"))

    if not files:
        return None

    files.sort(key=lambda path: path.stat().st_mtime)
    return files[-1]


@dataclass
class LeaderboardClient:
    """Simple helper to append JSONL rows to a HF dataset."""

    repo_id: str
    token: str
    filename: str = DEFAULT_FILENAME

    def append_record(self, record: dict[str, Any]) -> None:
        tmp_dir = Path(tempfile.mkdtemp(prefix="leaderboard_"))
        local_file = tmp_dir / self.filename

        self._download_existing(local_file)
        if not local_file.exists():
            local_file.write_text("", encoding="utf-8")

        with local_file.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")

        HfApi(token=self.token).upload_file(
            path_or_fileobj=str(local_file),
            path_in_repo=self.filename,
            repo_id=self.repo_id,
            repo_type="dataset",
        )

        try:
            local_file.unlink()
            tmp_dir.rmdir()
        except OSError:
            pass

    def _download_existing(self, destination: Path) -> None:
        destination.parent.mkdir(parents=True, exist_ok=True)

        try:
            downloaded = hf_hub_download(
                repo_id=self.repo_id,
                filename=self.filename,
                repo_type="dataset",
                token=self.token,
            )
            shutil.copy(Path(downloaded), destination)
        except Exception:
            destination.write_text("", encoding="utf-8")


def build_record(
    solver_name: str,
    solver_kwargs: dict[str, Any],
    dataset_name: str,
    dataset_split: str,
    limit: int | None,
    score: float,
    command: list[str],
    log_path: Path | None,
    criterion_checks: list[dict[str, Any]] | None = None,
) -> dict[str, Any]:
    """Assembles a JSON-serialisable record for the leaderboard dataset."""

    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "solver": solver_name,
        "solver_kwargs": solver_kwargs,
        "dataset_name": dataset_name,
        "dataset_split": dataset_split,
        "limit": limit,
        "score": score,
        "command": command,
    }

    if solver_name == "hf_agent":
        record["solver_version"] = detect_agent_version(
            solver_kwargs.get("config_path", "agent/config_mcp_example.json")
        )
    else:
        version_spec = solver_kwargs.get("version")
        if isinstance(version_spec, (list, tuple)):
            record["solver_version"] = _hydra_join(*version_spec)
        elif isinstance(version_spec, dict):
            record["solver_version"] = _hydra_join(
                *[f"{k}={v}" for k, v in version_spec.items()]
            )
        elif isinstance(version_spec, str):
            record["solver_version"] = version_spec
        else:
            record["solver_version"] = _hydra_join(solver_name, "default")

    if log_path:
        record["log_artifact"] = str(log_path)
    record["criterion_checks"] = criterion_checks or []

    return record
```
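The version strings assembled by `detect_agent_version` and `build_record` all flow through `_hydra_join`, which is just slash-joined, space-sanitized tokens; a standalone sketch (copied from the helper above) illustrates the normalization:

```python
def hydra_join(*parts):
    # Same behaviour as leaderboard._hydra_join: drop empty/None parts,
    # replace spaces with underscores, join with "/"; empty input -> "default".
    tokens = [str(part).strip().replace(" ", "_") for part in parts if part]
    return "/".join(tokens) if tokens else "default"


print(hydra_join("agent", "config_mcp_example", "ab12cd3"))  # → agent/config_mcp_example/ab12cd3
print(hydra_join(None, "my solver"))                         # → my_solver
print(hydra_join())                                          # → default
```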
eval/models.py
DELETED
@@ -1,63 +0,0 @@

```python
"""Shared data models for the HF agent project"""

from datetime import datetime
from enum import Enum

from pydantic import BaseModel, Field


class Discussion(BaseModel):
    """Model for a discussion thread"""

    title: str
    url: str
    topic_id: int
    category: int
    created_at: datetime


class QuestionAndSolution(BaseModel):
    """Model for a QA pair from a discussion"""

    discussion_title: str
    discussion_url: str
    discussion_topic_id: int
    discussion_category: int
    discussion_created_at: datetime
    thread: list[dict]
    question: str
    solution: str


class Correctness(str, Enum):
    yes = "yes"
    no = "no"


class JudgementResult(BaseModel):
    """Structured output for LLM judge evaluation"""

    extracted_final_answer: str = Field(
        description="The final exact/snippet answer extracted from the response"
    )
    reasoning: str = Field(
        description="Explanation of why the answer is correct or incorrect"
    )
    correct: Correctness = Field(description="'yes' if correct, 'no' if incorrect")
    confidence: int = Field(
        description="Confidence score between 0 and 100", ge=0, le=100
    )


class EvaluationResult(BaseModel):
    """Model for evaluation results including metadata"""

    success: bool
    judgement: JudgementResult | None = None
    error: str | None = None


class EvaluatedQuestionAndSolution(QuestionAndSolution):
    """Model for a QA pair with its evaluation result"""

    evaluation: JudgementResult
```
eval/rubric_eval.py
DELETED
@@ -1,142 +0,0 @@

```python
"""
Rubric-based evaluation following the "Rubrics as Rewards" paper.

Implements RaR-Explicit: Weighted sum of individual criterion scores (Equation 1)
"""

from typing import List, Optional

import litellm
from pydantic import BaseModel


class CriterionCheck(BaseModel):
    """Result of checking a single rubric criterion."""

    title: str
    description: str
    weight: int
    satisfied: bool
    reasoning: Optional[str] = None


class RubricEvaluation(BaseModel):
    """Complete rubric-based evaluation result."""

    criterion_checks: List[CriterionCheck]
    raw_score: float  # Unnormalized score
    normalized_score: float  # Score normalized to [0, 1]


CRITERION_PROMPT = """You are evaluating whether a response satisfies a specific evaluation criterion.

Question: {question}

Response to evaluate: {response}

Evaluation Criterion:
{criterion_description}

Your task: Determine if the response satisfies this criterion.

Output a JSON object with:
- "satisfied": true or false
- "reasoning": Brief explanation (1-2 sentences) of why it does or doesn't satisfy the criterion

Be strict but fair. The criterion must be clearly satisfied for you to answer true."""


class RubricData(BaseModel):
    """Rubric data loaded from file."""

    title: str
    description: str
    weight: int


def check_criterion(
    question: str, response: str, criterion: RubricData, model: str = "gpt-4o-mini"
) -> CriterionCheck:
    """
    Check if response satisfies a single criterion.

    Args:
        question: The question being answered
        response: The response to evaluate
        criterion: The rubric criterion to check
        model: LLM model for judging

    Returns:
        CriterionCheck with satisfaction result
    """
    prompt = CRITERION_PROMPT.format(
        question=question,
        response=response,
        criterion_description=criterion.description,
    )

    llm_response = litellm.completion(
        model=model,
        messages=[
            {
                "role": "system",
                "content": "You are an expert evaluator for rubric-based assessment.",
            },
            {"role": "user", "content": prompt},
        ],
        temperature=0.0,
        response_format=CriterionCheck,
    )

    result = CriterionCheck.model_validate_json(llm_response.choices[0].message.content)

    return result


def evaluate_with_rubrics(
    question: str,
    response: str,
    rubrics: List[RubricData],
    model: str = "gpt-5-nano",
) -> RubricEvaluation:
    """
    Evaluate response using RaR-Explicit method (weighted sum).

    Implements Equation 1 from paper:
        r(x, ŷ) = Σ(w_j * c_j(x, ŷ)) / Σ(w_j)

    Args:
        question: The question
        response: Response to evaluate
        rubrics: List of rubric criteria
        model: LLM model for judging

    Returns:
        RubricEvaluation with normalized score
    """
    # Check each criterion independently
    checks = []
    for rubric in rubrics:
        check = check_criterion(question, response, rubric, model)
        checks.append(check)

    # Calculate weighted score (Equation 1)
    # Only positive weights contribute to denominator
    positive_weights = sum(abs(r.weight) for r in rubrics if r.weight > 0)

    raw_score = 0.0
    for check in checks:
        if check.satisfied:
            raw_score += check.weight

    # Normalize to [0, 1]
    normalized_score = raw_score / positive_weights if positive_weights > 0 else 0.0
    # Clip to [0, 1] in case pitfalls make it negative
    normalized_score = max(0.0, min(1.0, normalized_score))

    return RubricEvaluation(
        raw_score=raw_score,
        normalized_score=normalized_score,
        criterion_checks=checks,
    )
```
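The scoring arithmetic in `evaluate_with_rubrics` is independent of the LLM judge and can be checked in isolation. A minimal sketch, using made-up `(weight, satisfied)` pairs in place of real `CriterionCheck` objects:

```python
def rar_explicit(checks):
    # checks: list of (weight, satisfied) pairs; negative weights model pitfalls.
    # As in evaluate_with_rubrics, only positive weights form the denominator,
    # and the result is clipped to [0, 1].
    positive_weights = sum(w for w, _ in checks if w > 0)
    raw_score = sum(w for w, satisfied in checks if satisfied)
    normalized = raw_score / positive_weights if positive_weights > 0 else 0.0
    return max(0.0, min(1.0, normalized))


# Two satisfied criteria (5 + 3), one missed (weight 2), one triggered pitfall (-2):
print(rar_explicit([(5, True), (3, True), (2, False), (-2, True)]))  # → 0.6
```

Triggered pitfalls subtract from the numerator but never enlarge the denominator, so a response that hits every positive criterion and no pitfalls scores exactly 1.0.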
eval/run_eval_with_leaderboard.py
DELETED
@@ -1,215 +0,0 @@

```python
from __future__ import annotations

import argparse
import json
import os
import re
import subprocess
import sys
from pathlib import Path
from typing import Any

from dotenv import load_dotenv
from leaderboard import LeaderboardClient, build_record, latest_log_file

load_dotenv()


def run_command(cmd: list[str]) -> subprocess.CompletedProcess[str]:
    print(f"[leaderboard] running: {' '.join(cmd)}")
    return subprocess.run(cmd, capture_output=True, text=True)


def build_inspect_command(args: argparse.Namespace) -> list[str]:
    cmd = []
    cmd.extend(args.inspect_launch)
    cmd.append(args.inspect_task)

    def add_task_arg(key: str, value: Any) -> None:
        if value is None:
            return
        cmd.extend(["-T", f"{key}={value}"])

    add_task_arg("solver_name", args.solver_name)
    add_task_arg("solver_kwargs", json.dumps(args.solver_kwargs))
    add_task_arg("dataset_name", args.dataset)
    if args.limit is not None:
        add_task_arg("limit", args.limit)

    cmd.extend(["--log-dir", args.log_dir])
    if args.log_format:
        cmd.extend(["--log-format", args.log_format])

    if args.extra_inspect_args:
        cmd.extend(args.extra_inspect_args)

    return cmd


def parse_score_from_outputs(log_dir: Path) -> tuple[float, Path, list[dict[str, Any]]]:
    log_path = latest_log_file(log_dir)
    if not log_path:
        raise RuntimeError("Inspect log file not found.")

    # Sanitization
    content = log_path.read_text(encoding="utf-8")
    # Regex to match hf_ followed by 34 alphanumeric chars
    sanitized_content = re.sub(r"hf_[a-zA-Z0-9]{34}", "<REDACTED_TOKEN>", content)

    if content != sanitized_content:
        log_path.write_text(sanitized_content, encoding="utf-8")
        print(f"[leaderboard] Redacted HF tokens in {log_path}")
        content = sanitized_content

    data = json.loads(content)
    results = data.get("results", {})
    scores = results.get("scores", [])
    score_value = None
    criterion_checks: list[dict[str, Any]] = []

    for score_entry in scores:
        metrics = score_entry.get("metrics", {})
        for metric in metrics.values():
            value = metric.get("value")
            if isinstance(value, (int, float)):
                score_value = float(value)
                break
        if score_value is not None:
            break

    if score_value is None:
        raise RuntimeError("Could not find a numeric metric value in the Inspect log.")

    for sample in data.get("samples", []):
        # Grab the question from metadata (fallback to input)
        question = "Unknown Question"
        if "metadata" in sample and "question" in sample["metadata"]:
            question = sample["metadata"]["question"]
        elif "input" in sample:
            question = sample["input"]

        # Check if any scorer produced criterion_checks
        for scorer in sample.get("scores", {}).values():
            metadata = scorer.get("metadata") or {}
            checks = metadata.get("criterion_checks")

            if isinstance(checks, list) and checks:
                # Create a grouped entry for this question/sample
                grouped_entry = {"question": question, "checks": []}
                for check in checks:
                    if isinstance(check, dict):
                        grouped_entry["checks"].append(check)

                if grouped_entry["checks"]:
                    criterion_checks.append(grouped_entry)

    return score_value, log_path, criterion_checks


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Run Inspect eval and append the resulting score to a HF dataset."
    )
    parser.add_argument(
        "--hf-dataset",
        default="akseljoonas/hf-agent-leaderboard",
        help="HF dataset repo id for the leaderboard (e.g. user/leaderboard).",
    )
    parser.add_argument(
        "--solver-name",
        required=True,
        help="Solver name used in the Inspect task (e.g. hf_agent).",
    )
    parser.add_argument(
        "--solver-kwargs",
        type=json.loads,
        default="{}",
        help="JSON string with solver kwargs passed to the Inspect task.",
    )
    parser.add_argument(
        "--dataset",
        default="akseljoonas/hf-agent-rubrics@train",
        help="Dataset spec in the form author/dataset@split.",
    )
    parser.add_argument(
        "--limit",
        type=int,
        default=None,
        help="Optional sample limit passed to Inspect.",
    )
    parser.add_argument(
        "--inspect-task",
        default="eval/task.py@hf-benchmark-with-rubrics",
        help="Inspect task reference.",
    )
    parser.add_argument(
        "--inspect-launch",
        nargs="+",
        default=["uv", "run", "inspect", "eval"],
        help="Command used to invoke Inspect (default: uv run inspect eval).",
    )
    parser.add_argument(
        "--log-dir",
        default="logs/leaderboard",
        help="Directory where Inspect outputs .eval logs.",
    )
    parser.add_argument(
        "--extra-inspect-args",
        nargs="*",
        help="Additional args forwarded to Inspect after the standard task arguments.",
    )
    parser.add_argument(
        "--log-format",
        default="json",
        help="Log format passed to Inspect (default: json).",
    )

    args = parser.parse_args()

    if isinstance(args.solver_kwargs, str):
        args.solver_kwargs = json.loads(args.solver_kwargs or "{}")

    hf_token = os.getenv("HF_TOKEN")
    if not hf_token:
        print("ERROR: set HF_TOKEN in your environment.", file=sys.stderr)
        sys.exit(1)

    if "@" not in args.dataset:
        raise ValueError("Dataset must be in the format 'author/dataset@split'.")
    dataset_name, dataset_split = args.dataset.split("@", 1)

    log_dir = Path(args.log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)

    inspect_cmd = build_inspect_command(args)
    result = run_command(inspect_cmd)

    if result.returncode != 0:
        print(result.stdout)
        print(result.stderr, file=sys.stderr)
        raise SystemExit(result.returncode)

    score, log_path, criterion_checks = parse_score_from_outputs(log_dir)

    client = LeaderboardClient(repo_id=args.hf_dataset, token=hf_token)
```
|
| 196 |
-
record = build_record(
|
| 197 |
-
solver_name=args.solver_name,
|
| 198 |
-
solver_kwargs=args.solver_kwargs,
|
| 199 |
-
dataset_name=dataset_name,
|
| 200 |
-
dataset_split=dataset_split,
|
| 201 |
-
limit=args.limit,
|
| 202 |
-
score=score,
|
| 203 |
-
command=inspect_cmd,
|
| 204 |
-
log_path=log_path,
|
| 205 |
-
criterion_checks=criterion_checks,
|
| 206 |
-
)
|
| 207 |
-
client.append_record(record)
|
| 208 |
-
|
| 209 |
-
print(
|
| 210 |
-
f"[leaderboard] recorded score {score:.3f} for solver '{args.solver_name}' to {args.hf_dataset}"
|
| 211 |
-
)
|
| 212 |
-
|
| 213 |
-
|
| 214 |
-
if __name__ == "__main__":
|
| 215 |
-
main()
|
eval/scrape_discussions/discussions_scraper.py
DELETED
@@ -1,98 +0,0 @@
-import sys
-import time
-from pathlib import Path
-
-import requests
-from tenacity import (
-    retry,
-    retry_if_exception_type,
-    stop_after_attempt,
-    wait_exponential,
-)
-
-# Add parent directory to path to import models
-sys.path.insert(0, str(Path(__file__).parent.parent))
-from models import Discussion, QuestionAndSolution
-
-BASE_URL = "https://discuss.huggingface.co"
-
-
-# configure retry decorator for your requests
-@retry(
-    stop=stop_after_attempt(5),
-    wait=wait_exponential(multiplier=1, min=1, max=60),
-    retry=retry_if_exception_type(requests.HTTPError),
-)
-def safe_get(url, **kwargs):
-    resp = requests.get(url, **kwargs)
-    if resp.status_code == 422:
-        # read retry-after header if present
-        retry_after = resp.headers.get("Retry-After")
-        if retry_after:
-            delay = float(retry_after)
-        else:
-            # fallback to guess
-            delay = 30
-        print(f"429 hit — waiting {delay} seconds...")
-        time.sleep(delay)
-        resp.raise_for_status()
-    else:
-        resp.raise_for_status()
-    return resp
-
-
-def get_solved_discussions(n_posts: int = 50):
-    page = 1
-    discussions = []
-    while len(discussions) < n_posts:
-        url = f"{BASE_URL}/search.json?q=status:solved+order:latest&page={page}"
-        resp = safe_get(url)
-        topics = resp.json()["topics"]
-        if not topics:
-            break
-        for post in topics:
-            discussions.append(
-                Discussion(
-                    title=post["fancy_title"],
-                    url=f"{BASE_URL}/t/{post['slug']}/{post['id']}",
-                    topic_id=post["id"],
-                    category=post["category_id"],
-                    created_at=post["created_at"],
-                )
-            )
-            if len(discussions) >= n_posts:
-                break
-        page += 1
-        time.sleep(0.5)  # simple pacing to avoid bursts
-    return discussions
-
-
-def get_qa_pair(discussions, start_idx: int = 0):
-    for discussion in discussions[start_idx:]:
-        resp = safe_get(discussion.url + ".json")
-        data = resp.json()
-        posts = data["post_stream"]["posts"]
-        accepted_nr = min(
-            max(data["accepted_answer"]["post_number"] - 1, 0), len(posts) - 1
-        )
-        question = posts[0]["cooked"]
-        solution = posts[accepted_nr]["cooked"]
-        yield QuestionAndSolution(
-            discussion_title=discussion.title,
-            discussion_url=discussion.url,
-            discussion_topic_id=discussion.topic_id,
-            discussion_category=discussion.category,
-            discussion_created_at=discussion.created_at,
-            question=question,
-            solution=solution,
-            thread=posts,
-        )
-        time.sleep(0.5)
-
-
-if __name__ == "__main__":
-    discussions = get_solved_discussions(n_posts=300)
-    print(f"Fetched {len(discussions)} discussions")
-    with open("qa_pairs.jsonl", "a") as f:
-        for qa_pair in get_qa_pair(discussions):
-            f.write(qa_pair.model_dump_json() + "\n")
eval/solvers.py
DELETED
@@ -1,165 +0,0 @@
-"""
-Collection of Inspect AI solvers used by the rubric task.
-"""
-
-from __future__ import annotations
-
-import asyncio
-import json
-import os
-import tempfile
-from typing import Callable, Dict, List, Sequence
-
-import litellm
-from inspect_ai.model import ChatMessageAssistant, ModelOutput
-from inspect_ai.solver import Solver, solver
-from inspect_ai.solver._task_state import TaskState
-
-from eval.hf_agent_connector import AgentResponseGenerator
-
-
-async def _run_subprocess(command: Sequence[str]) -> str:
-    process = await asyncio.create_subprocess_exec(
-        *command,
-        stdout=asyncio.subprocess.PIPE,
-        stderr=asyncio.subprocess.PIPE,
-    )
-    stdout, stderr = await process.communicate()
-    if process.returncode != 0:
-        raise RuntimeError(
-            f"Command {' '.join(command)} failed with code {process.returncode}:\n"
-            f"{stderr.decode().strip()}"
-        )
-    return stdout.decode().strip()
-
-
-@solver(name="hf_agent")
-def hf_agent(
-    config_path: str = "agent/config_mcp_example.json",
-    max_iterations: int = 10,
-) -> Solver:
-
-    runner = AgentResponseGenerator(
-        config_path=config_path,
-        max_iterations=max_iterations,
-    )
-
-    async def solve(state: TaskState, generate) -> TaskState:
-        response = await runner.run(state.input_text)
-        assistant_message = ChatMessageAssistant(
-            content=response,
-            model=runner.model_name,
-            source="generate",
-        )
-        state.messages.append(assistant_message)
-        state.output = ModelOutput.from_message(assistant_message)
-        state.completed = True
-        return state
-
-    return solve
-
-
-@solver(name="claude_code")
-def claude_code(
-    output_format: str = "json",
-    mcp_config: str | None = None,
-) -> Solver:
-    if output_format not in {"text", "json", "stream-json"}:
-        raise ValueError("output_format must be one of: text, json, stream-json")
-
-    async def solve(state: TaskState, generate) -> TaskState:
-        prompt = state.input_text
-
-        cmd: List[str] = ["claude", "-p", prompt, "--output-format", output_format]
-        if mcp_config:
-            cmd += ["--mcp-config", mcp_config]
-
-        stdout = await _run_subprocess(cmd)
-        response_text = stdout
-        session_id = None
-
-        if output_format in {"json", "stream-json"}:
-            # stream-json may emit multiple JSON objects; take the last complete line
-            candidate_line = stdout.strip().splitlines()[-1]
-            try:
-                payload = json.loads(candidate_line)
-                response_text = (
-                    payload.get("result") or payload.get("message", "") or stdout
-                )
-                session_id = payload.get("session_id")
-            except (json.JSONDecodeError, AttributeError):
-                response_text = stdout
-
-        assistant_message = ChatMessageAssistant(
-            content=response_text,
-            model="claude-code",
-            source="generate",
-            metadata={"session_id": session_id} if session_id else None,
-        )
-        state.messages.append(assistant_message)
-        state.output = ModelOutput.from_message(assistant_message)
-        state.completed = True
-        return state
-
-    return solve
-
-
-@solver(name="claude_code+hf_mcp")
-def claude_code_hf_mcp(
-    output_format: str = "json",
-    hf_token: str | None = None,
-) -> Solver:
-    """
-    A solver that uses Claude Code with the Hugging Face MCP server.
-    Requires HF_TOKEN in environment variables or passed as argument.
-    """
-    token = hf_token or os.environ.get("HF_TOKEN")
-    if not token:
-        raise ValueError(
-            "HF_TOKEN not found. Please set HF_TOKEN env var or pass it to the solver."
-        )
-
-    # Construct the MCP configuration for Hugging Face
-    mcp_config = {
-        "mcpServers": {
-            "huggingface": {
-                "type": "http",
-                "url": "https://huggingface.co/mcp",
-                "headers": {"Authorization": f"Bearer {token}"},
-            }
-        }
-    }
-
-    async def solve(state: TaskState, generate) -> TaskState:
-        # Write config to a temporary file
-        with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) as tmp:
-            json.dump(mcp_config, tmp, indent=2)
-            tmp_path = tmp.name
-
-        try:
-            # Delegate to the base claude_code solver
-            delegate = claude_code(output_format=output_format, mcp_config=tmp_path)
-            return await delegate(state, generate)
-        finally:
-            # Clean up the temporary file
-            if os.path.exists(tmp_path):
-                os.remove(tmp_path)
-
-    return solve
-
-
-SOLVER_REGISTRY: Dict[str, Callable[..., Solver]] = {
-    "hf_agent": hf_agent,
-    "claude_code": claude_code,
-    "claude_code+hf_mcp": claude_code_hf_mcp,
-}
-
-
-def get_solver(name: str, **kwargs) -> Solver:
-    try:
-        factory = SOLVER_REGISTRY[name]
-    except KeyError as exc:
-        available = ", ".join(sorted(SOLVER_REGISTRY))
-        raise ValueError(f"Unknown solver '{name}'. Available: {available}") from exc
-
-    return factory(**kwargs)
eval/task.py
DELETED
@@ -1,121 +0,0 @@
-"""
-Inspect AI task definition that runs the existing agent and reuses the rubric scorer.
-"""
-
-from __future__ import annotations
-
-import asyncio
-import json
-import sys
-from pathlib import Path
-from typing import Any, Sequence
-
-from inspect_ai import Task, task
-from inspect_ai.dataset import Sample, hf_dataset
-from inspect_ai.scorer import Score, Target, mean, scorer
-from inspect_ai.solver._task_state import TaskState
-import litellm
-
-PROJECT_ROOT = Path(__file__).resolve().parents[1]
-if str(PROJECT_ROOT) not in sys.path:
-    sys.path.insert(0, str(PROJECT_ROOT))
-
-from eval.rubric_eval import RubricData, evaluate_with_rubrics  # noqa: E402
-from eval.solvers import get_solver  # noqa: E402
-
-
-def _record_to_sample(record: dict[str, Any]) -> Sample:
-    rubric_payload = json.loads(record["rubric"])
-    rubrics = rubric_payload.get("rubrics", [])
-
-    metadata = {
-        "question": record["question"],
-        "discussion_title": record.get("discussion_title"),
-        "discussion_url": record.get("discussion_url"),
-        "rubric_title": rubric_payload.get("title"),
-        "rubric_description": rubric_payload.get("description"),
-        "rubrics": rubrics,
-    }
-
-    return Sample(
-        input=record["question"],
-        target=record["solution"],
-        id=record.get("discussion_topic_id"),
-        metadata=metadata,
-    )
-
-
-def _load_dataset(dataset_name: str, split: str, limit: int | None) -> Sequence[Sample]:
-    return hf_dataset(
-        dataset_name, sample_fields=_record_to_sample, split=split, limit=limit
-    )
-
-
-def _metadata_to_rubrics(metadata: dict[str, Any]) -> list[RubricData]:
-    raw_rubrics = metadata.get("rubrics", [])
-    return [RubricData(**rubric) for rubric in raw_rubrics]
-
-
-@scorer(metrics=[mean()], name="rubric_scorer")
-def rubric_scorer(judge_model: str = "gpt-5-mini"):
-    async def score(state: TaskState, target: Target) -> Score:
-        response_text = state.output.completion or state.output.message.text
-        question = state.metadata.get("question", state.input_text)
-        rubrics = _metadata_to_rubrics(state.metadata)
-
-        evaluation = await asyncio.to_thread(
-            evaluate_with_rubrics,
-            question,
-            response_text,
-            rubrics,
-            judge_model,
-        )
-
-        score_metadata = {
-            "raw_score": evaluation.raw_score,
-            "criterion_checks": [
-                check.model_dump() for check in evaluation.criterion_checks
-            ],
-            "discussion_title": state.metadata.get("discussion_title"),
-            "discussion_url": state.metadata.get("discussion_url"),
-            "reference_answer": target.text,
-        }
-
-        return Score(
-            value=evaluation.normalized_score,
-            answer=response_text,
-            explanation=f"Normalized score {evaluation.normalized_score:.3f}",
-            metadata=score_metadata,
-        )
-
-    return score
-
-
-@task(name="hf-benchmark-with-rubrics")
-def hf_benchmark_with_rubrics(
-    solver_name: str = "hf_agent",
-    solver_kwargs: dict[str, Any] = {
-        "max_iterations": 10,
-        "config_path": "agent/config_mcp_example.json",
-    },
-    dataset_name: str = "akseljoonas/hf-agent-rubrics@train",
-    limit: int | None = None,
-    judge_model: str = "gpt-5-mini",
-) -> Task:
-    litellm.drop_params = True
-    if "@" not in dataset_name:
-        raise ValueError("Dataset name must be in the format 'author/dataset@split'")
-    dataset_name, dataset_split = dataset_name.split("@")
-    dataset = _load_dataset(dataset_name, dataset_split, limit=limit)
-
-    return Task(
-        dataset=dataset,
-        solver=get_solver(solver_name, **solver_kwargs),
-        scorer=rubric_scorer(judge_model=judge_model),
-        metadata={
-            "dataset_name": dataset_name,
-            "dataset_split": dataset_split,
-            "solver_name": solver_name,
-            "judge_model": judge_model,
-        },
-    )