akseljoonas (HF Staff) committed
Commit 235ace7 · 1 Parent(s): de136d0

updated eval

eval/README.md CHANGED
@@ -5,10 +5,10 @@ Rubric-based evaluation pipeline implementing [Rubrics as Rewards](https://arxiv
  ## Pipeline
 
  ```
- QA pairs → generate_rubrics.py → evaluate.py → scores
  ```
 
- ### 1. Generate Rubrics
 
  Creates instance-specific evaluation criteria from question + reference answer.
 
@@ -27,50 +27,47 @@ python eval/generate_rubrics.py \
 
  **Output:** 7-20 weighted criteria per question (Essential: +5, Important: +3-4, Optional: +1-2, Pitfall: -1 to -2)
 
- ### 2. Evaluate Responses
 
- Scores responses using generated rubrics via LLM-as-judge.
 
- ```python
- from evaluate import evaluate_dataset_with_rubrics
-
- evaluate_dataset_with_rubrics(
-     input_file="responses.jsonl",
-     rubric_file="qa_rubrics.jsonl",
-     ground_truth_file="qa_pairs.jsonl",
-     output_file="results.jsonl",
-     model="gpt-4o-mini",
-     push_to_hub="akseljoonas/hf-agent-benchmark@evaluations"
- )
  ```
 
- **Output:** Normalized score [0, 1] + per-criterion satisfaction + reasoning
-
- ## HuggingFace Integration
-
- Both scripts upload DataFrames before saving JSONL:
 
- ```python
- from hf_dataset_io import df_to_hub, hub_to_df
 
- # Upload
- df_to_hub(df, "username/dataset@config", split="train")
 
- # Download
- df = hub_to_df("username/dataset@config", split="train")
  ```
 
- Use `@config` notation to organize: `@rubrics`, `@evaluations`, `@ground-truth`
-
- ## Key Parameters
-
- - **--max-concurrent**: Parallel workers (default: 30 for rubrics, 10 for eval)
- - **--push-to-hub**: Auto-upload to HF Hub (e.g., `user/dataset@rubrics`)
- - **--model**: LiteLLM model string
- - **split**: `train` for rubrics, `test` for evaluations
 
- ## Scoring
 
- RaR-Explicit: `score = Σ(weight × satisfied) / Σ(positive_weights)`
 
- Normalized to [0, 1], clipped if pitfalls make it negative.
 
  ## Pipeline
 
  ```
+ QA pairs → generate_rubrics.py → `eval/task.py@hf-benchmark-with-rubrics` → scores
  ```
 
+ ### 1. Generate Rubrics (if not already generated)
 
  Creates instance-specific evaluation criteria from question + reference answer.
 
 
  **Output:** 7-20 weighted criteria per question (Essential: +5, Important: +3-4, Optional: +1-2, Pitfall: -1 to -2)
 
+ ### 2. Evaluate Responses (Inspect)
 
+ Load your rubric dataset, run a solver, and score with `rubric_scorer` using `inspect-ai`.
 
+ Files:
+ - `eval/hf_agent_connector.py` contains a lightweight bridge that spins up
+   the existing hf-agent stack in `agent/` (tools, MCP, LiteLLM loop) and returns the assistant reply.
+ - `eval/solvers.py` keeps the solver implementations (e.g. `hf_agent_solver`,
+   `claude_code`). If additional solvers are needed, register them there and pass
+   `-T solver_name=<name>` to swap them in without touching the task.
+ - `eval/task.py` registers `hf-benchmark-with-rubrics`, which wires
+   the dataset, solver, and rubric scorer into a single Inspect task and does the eval.
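The list above is the extension point for new solvers: register a factory in `eval/solvers.py`, add it to `SOLVER_REGISTRY`, and select it with `-T solver_name=<name>`. A minimal sketch of such a registration, assuming a hypothetical `echo_solver` that is not part of this commit:

```python
# Hypothetical addition to eval/solvers.py: a trivial solver that echoes the prompt,
# built the same way as the hf_agent_solver and claude_code factories in that module.
from inspect_ai.model import ChatMessageAssistant, ModelOutput
from inspect_ai.solver import Solver, solver
from inspect_ai.solver._task_state import TaskState


@solver(name="echo_solver")
def echo_solver(prefix: str = "ECHO: ") -> Solver:
    async def solve(state: TaskState, generate) -> TaskState:
        # Produce the assistant turn without calling any model.
        message = ChatMessageAssistant(content=prefix + state.input_text, source="generate")
        state.messages.append(message)
        state.output = ModelOutput.from_message(message)
        state.completed = True
        return state

    return solve


# Add it to SOLVER_REGISTRY so -T solver_name=echo_solver resolves to this factory.
SOLVER_REGISTRY["echo_solver"] = echo_solver
```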
+ ### Running the hf-agent (implemented in `agent/`; the `-T` args are optional)
+ ```bash
+ uv run inspect eval eval/task.py@hf-benchmark-with-rubrics \
+   -T dataset_name=akseljoonas/hf-agent-rubrics@train \
+   -T limit=25 \
+   -T solver_name=hf_agent_solver \
+   -T solver_kwargs='{"config_path":"agent/config_mcp_example.json","max_iterations":10}' \
+   --log-dir logs/inspect
  ```
 
+ Different benchmarks can be used by adding and running a new task in `eval/task.py`.
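Such a task would reuse the loader, solver registry, and scorer already defined in `eval/task.py`; a minimal sketch, assuming a hypothetical dataset and task name that are not part of this commit:

```python
# Hypothetical extra task added to eval/task.py, reusing _load_dataset, get_solver
# and rubric_scorer from that module.
@task(name="my-other-benchmark")
def my_other_benchmark(
    solver_name: str = "hf_agent_solver",
    judge_model: str = "gpt-4o-mini",
) -> Task:
    dataset = _load_dataset("username/other-rubrics", "train", limit=None)  # placeholder dataset
    return Task(
        dataset=dataset,
        solver=get_solver(solver_name),
        scorer=rubric_scorer(judge_model=judge_model),
    )
```

It would then run with `uv run inspect eval eval/task.py@my-other-benchmark`.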
+ ### Running Claude Code headlessly
 
+ The `claude_code` solver shells out to the `claude` CLI (`claude -p ... --output-format json`)
+ so you can benchmark Claude Code without any interactive UI. Example (the kwargs are optional):
 
+ ```bash
+ uv run inspect eval eval/task.py@hf-benchmark-with-rubrics \
+   -T solver_name=claude_code \
+   -T solver_kwargs='{"output_format":"json"}'
  ```
+ ## Scoring (implemented in `eval/rubric_eval.py`)
 
+ Scoring follows the RaR-Explicit formula: `score = Σ(weight × satisfied) / Σ(positive_weights)`.
 
+ The score is normalized to [0, 1] and clipped at 0 if pitfall penalties would make it negative.
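For illustration, a minimal sketch of that computation (the real implementation is `evaluate_with_rubrics` in `eval/rubric_eval.py`; the names here are illustrative):

```python
# Illustrative RaR-Explicit scoring: weighted sum of satisfied criteria,
# normalized by the total positive weight and clipped at 0.
def rar_explicit_score(checks: list[tuple[int, bool]]) -> float:
    """checks: (weight, satisfied) pairs; negative weights are pitfalls."""
    raw = sum(weight for weight, satisfied in checks if satisfied)
    positive_total = sum(weight for weight, _ in checks if weight > 0)
    if positive_total == 0:
        return 0.0
    return max(0.0, min(1.0, raw / positive_total))


# Example: two essentials satisfied (+5, +5), one optional missed (+2), one pitfall hit (-1)
# -> raw = 5 + 5 - 1 = 9, positive_total = 12, score = 0.75
print(rar_explicit_score([(5, True), (5, True), (2, False), (-1, True)]))
```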
eval/__init__.py ADDED
@@ -0,0 +1,3 @@
+ from eval.task import hf_benchmark_with_rubrics
+
+ __all__ = ["hf_benchmark_with_rubrics"]
eval/generate_rubrics.py CHANGED
@@ -17,9 +17,10 @@ from typing import Any, Dict, List
  import litellm
  import pandas as pd
  from dotenv import load_dotenv
- from hf_dataset_io import df_to_hub
  from pydantic import BaseModel
 
 
  class Rubric(BaseModel):
      title: str
 
  import litellm
  import pandas as pd
  from dotenv import load_dotenv
  from pydantic import BaseModel
 
+ from eval.hf_io import df_to_hub
+
 
  class Rubric(BaseModel):
      title: str
eval/hf_agent_connector.py ADDED
@@ -0,0 +1,88 @@
+ from __future__ import annotations
+
+ import asyncio
+ import sys
+ from pathlib import Path
+ from typing import Any
+
+ from agent.config import Config, load_config
+ from agent.core.agent_loop import Handlers
+ from agent.core.session import Session
+ from agent.core.tools import ToolRouter
+
+ PROJECT_ROOT = Path(__file__).resolve().parents[1]
+ if str(PROJECT_ROOT) not in sys.path:
+     sys.path.insert(0, str(PROJECT_ROOT))
+
+
+ def _resolve_project_path(path: str | Path) -> Path:
+     candidate = Path(path)
+     if candidate.is_absolute():
+         return candidate
+     return (PROJECT_ROOT / candidate).resolve()
+
+
+ class AgentResponseGenerator:
+     """
+     Thin async wrapper that executes the existing agent loop once and
+     returns the assistant's final message.
+     """
+
+     def __init__(self, config_path: str | Path, max_iterations: int = 10) -> None:
+         self.config_path = _resolve_project_path(config_path)
+         self.config: Config = load_config(str(self.config_path))
+         self.max_iterations = max_iterations
+
+     @property
+     def model_name(self) -> str:
+         """Expose the agent model name for downstream logging."""
+         return self.config.model_name
+
+     async def run(self, prompt: str) -> str:
+         """
+         Execute the agent loop for a single prompt and return the assistant reply.
+         """
+         tool_router = ToolRouter(self.config.mcpServers)
+
+         async with tool_router:
+             session = Session(asyncio.Queue(), config=self.config)
+             session.tool_router = tool_router
+             await Handlers.run_agent(
+                 session,
+                 prompt,
+                 max_iterations=self.max_iterations,
+             )
+             return self._latest_assistant_response(session)
+
+     def _latest_assistant_response(self, session: Session) -> str:
+         """
+         Extract the final assistant response from the session history.
+         """
+         for message in reversed(session.context_manager.items):
+             if getattr(message, "role", None) == "assistant":
+                 return _content_to_text(getattr(message, "content", ""))
+
+         raise RuntimeError("Agent did not produce an assistant message.")
+
+
+ def _content_to_text(content: Any) -> str:
+     """
+     Convert LiteLLM content payloads (str or list[dict]) into plain text.
+     """
+     if isinstance(content, str):
+         return content
+
+     if isinstance(content, list):
+         parts: list[str] = []
+         for block in content:
+             if isinstance(block, dict):
+                 text = block.get("text")
+                 if text:
+                     parts.append(str(text))
+             else:
+                 text = getattr(block, "text", None)
+                 if text:
+                     parts.append(str(text))
+         return "\n".join(parts)
+
+     return str(content)
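For orientation, a minimal sketch of driving the connector directly, outside Inspect (the prompt and config path are illustrative; in the pipeline `hf_agent_solver` makes this call for every sample):

```python
import asyncio

from eval.hf_agent_connector import AgentResponseGenerator

# One-off run of the agent loop against the example config; hf_agent_solver in
# eval/solvers.py does the same thing per benchmark sample.
runner = AgentResponseGenerator("agent/config_mcp_example.json", max_iterations=10)
reply = asyncio.run(runner.run("How do I push a dataset to the Hub?"))
print(runner.model_name, reply)
```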
eval/{hf_dataset_io.py → hf_io.py} RENAMED
@@ -5,245 +5,12 @@ Reusable functions for uploading and downloading JSONL data to/from HuggingFace
  Supports the dataset_name@config_name notation for managing multiple configurations.
  """
 
- import json
- from pathlib import Path
- from typing import Dict, List, Optional, Union
 
  import pandas as pd
  from datasets import Dataset, load_dataset
 
 
- def upload_jsonl_to_hf(
-     jsonl_file: Union[str, Path],
-     dataset_spec: str,
-     split: str = "train",
-     private: bool = False,
- ) -> bool:
-     """
-     Upload a JSONL file to HuggingFace Hub as a dataset.
-
-     This function reads a JSONL file where each line is a complete JSON object,
-     converts it to a HuggingFace Dataset, and uploads it to the Hub.
-
-     Args:
-         jsonl_file: Path to the JSONL file to upload. Each line should be a valid
-             JSON object. Example format:
-             ```
-             {"question": "How to...", "solution": "...", "rubric": "[...]"}
-             {"question": "What is...", "solution": "...", "rubric": "[...]"}
-             ```
-
-         dataset_spec: Dataset specification in the format "dataset_name" or
-             "dataset_name@config_name". Examples:
-             - "username/my-dataset" (uses "default" config)
-             - "username/my-dataset@rubrics" (uses "rubrics" config)
-             - "username/my-dataset@evaluations" (uses "evaluations" config)
-
-             Multiple configs allow you to store different data types in the same
-             dataset repository (e.g., raw data, rubrics, evaluation results).
-
-         split: The dataset split name. Defaults to "train". Common values:
-             - "train": Training or main data
-             - "validation": Validation data
-             - "test": Test data
-
-         private: Whether to create a private dataset. Defaults to False (public).
-
-     Returns:
-         bool: True if upload succeeded, False otherwise
-
-     Raises:
-         FileNotFoundError: If the JSONL file doesn't exist
-         ValueError: If the JSONL file is empty or contains invalid JSON
-         Exception: For HuggingFace Hub upload errors
-
-     Example:
-         >>> # Upload rubrics with custom config
-         >>> upload_jsonl_to_hf(
-         ...     "qa_rubrics.jsonl",
-         ...     "username/hf-agent-benchmark@rubrics",
-         ...     split="train"
-         ... )
-
-         >>> # Upload evaluation results with different config
-         >>> upload_jsonl_to_hf(
-         ...     "evaluation_results.jsonl",
-         ...     "username/hf-agent-benchmark@evaluations",
-         ...     split="test"
-         ... )
-
-     Notes:
-         - Requires authentication via `huggingface-cli login` or HF_TOKEN env var
-         - If the dataset doesn't exist, it will be created automatically
-         - If it exists, the specified config/split will be updated
-         - Empty files will raise ValueError to prevent uploading invalid data
-     """
-     jsonl_path = Path(jsonl_file)
-
-     # Validate file exists
-     if not jsonl_path.exists():
-         raise FileNotFoundError(f"JSONL file not found: {jsonl_file}")
-
-     # Parse dataset specification
-     if "@" in dataset_spec:
-         dataset_name, config_name = dataset_spec.split("@", 1)
-     else:
-         dataset_name = dataset_spec
-         config_name = "default"
-
-     try:
-         print(f"\nUploading {jsonl_path.name} to HuggingFace Hub...")
-         print(f" Dataset: {dataset_name}")
-         print(f" Config: {config_name}")
-         print(f" Split: {split}")
-
-         # Load JSONL file
-         records = []
-         with open(jsonl_path, "r") as f:
-             for line_num, line in enumerate(f, start=1):
-                 line = line.strip()
-                 if line:  # Skip empty lines
-                     try:
-                         records.append(json.loads(line))
-                     except json.JSONDecodeError as e:
-                         raise ValueError(f"Invalid JSON on line {line_num}: {e}") from e
-
-         if not records:
-             raise ValueError("JSONL file is empty or contains no valid records")
-
-         print(f" Loaded {len(records)} records from JSONL")
-
-         # Create HuggingFace Dataset
-         dataset = Dataset.from_list(records)
-
-         # Upload to HuggingFace Hub
-         dataset.push_to_hub(
-             dataset_name,
-             config_name=config_name,
-             split=split,
-             private=private,
-         )
-
-         print(
-             f"✓ Successfully uploaded to {dataset_name}@{config_name} (split: {split})"
-         )
-         return True
-
-     except Exception as e:
-         print(f"✗ Failed to upload to HuggingFace: {type(e).__name__}: {str(e)}")
-         print(f" JSONL file preserved at: {jsonl_path}")
-         return False
-
-
- def download_hf_to_jsonl(
-     dataset_spec: str,
-     output_file: Union[str, Path],
-     split: str = "train",
-     overwrite: bool = False,
- ) -> bool:
-     """
-     Download a dataset from HuggingFace Hub and save as JSONL.
-
-     This function downloads a dataset from the HuggingFace Hub and saves it as a
-     JSONL file where each line is a complete JSON object.
-
-     Args:
-         dataset_spec: Dataset specification in the format "dataset_name" or
-             "dataset_name@config_name". Examples:
-             - "username/my-dataset" (uses "default" config)
-             - "username/my-dataset@rubrics" (uses "rubrics" config)
-             - "username/my-dataset@evaluations" (uses "evaluations" config)
-
-         output_file: Path where the JSONL file will be saved. Will create parent
-             directories if they don't exist. Example: "data/downloaded_rubrics.jsonl"
-
-         split: The dataset split to download. Defaults to "train". Common values:
-             - "train": Training or main data
-             - "validation": Validation data
-             - "test": Test data
-             - "all": Download all splits (creates one JSONL with all data)
-
-         overwrite: Whether to overwrite existing file. Defaults to False.
-
-     Returns:
-         bool: True if download succeeded, False otherwise
-
-     Raises:
-         FileExistsError: If output file exists and overwrite=False
-         ValueError: If the dataset/config/split doesn't exist
-         Exception: For HuggingFace Hub download errors
-
-     Example:
-         >>> # Download rubrics from specific config
-         >>> download_hf_to_jsonl(
-         ...     "username/hf-agent-benchmark@rubrics",
-         ...     "local_rubrics.jsonl",
-         ...     split="train"
-         ... )
-
-         >>> # Download evaluation results
-         >>> download_hf_to_jsonl(
-         ...     "username/hf-agent-benchmark@evaluations",
-         ...     "local_evaluations.jsonl",
-         ...     split="test",
-         ...     overwrite=True
-         ... )
-
-     Notes:
-         - Requires authentication for private datasets via `huggingface-cli login`
-         - Downloaded data will be in the same format as uploaded (preserves structure)
-         - Each line in the output JSONL is a complete, valid JSON object
-         - Large datasets may take time to download
-     """
-     output_path = Path(output_file)
-
-     # Check if file exists
-     if output_path.exists() and not overwrite:
-         raise FileExistsError(
-             f"Output file already exists: {output_file}. "
-             "Use overwrite=True to replace it."
-         )
-
-     # Parse dataset specification
-     if "@" in dataset_spec:
-         dataset_name, config_name = dataset_spec.split("@", 1)
-     else:
-         dataset_name = dataset_spec
-         config_name = "default"
-
-     try:
-         print("\nDownloading from HuggingFace Hub...")
-         print(f" Dataset: {dataset_name}")
-         print(f" Config: {config_name}")
-         print(f" Split: {split}")
-
-         # Download dataset from HuggingFace Hub
-         dataset = load_dataset(
-             dataset_name,
-             name=config_name,
-             split=split,
-         )
-
-         print(f" Downloaded {len(dataset)} records")
-
-         # Create parent directories if needed
-         output_path.parent.mkdir(parents=True, exist_ok=True)
-
-         # Write to JSONL
-         with open(output_path, "w") as f:
-             for record in dataset:
-                 # Convert record to JSON and write as line
-                 f.write(json.dumps(record) + "\n")
-
-         print(f"✓ Successfully saved to {output_path}")
-         print(f" Total records: {len(dataset)}")
-         return True
-
-     except Exception as e:
-         print(f"✗ Failed to download from HuggingFace: {type(e).__name__}: {str(e)}")
-         return False
-
-
  def list_dataset_configs(dataset_name: str) -> Optional[List[str]]:
      """
      List all available configs for a dataset on HuggingFace Hub.
@@ -269,60 +36,6 @@ def list_dataset_configs(dataset_name: str) -> Optional[List[str]]:
      return None
 
 
- def get_dataset_info(dataset_spec: str, split: str = "train") -> Optional[Dict]:
-     """
-     Get information about a dataset on HuggingFace Hub.
-
-     Args:
-         dataset_spec: Dataset specification ("dataset_name" or "dataset_name@config")
-         split: The split to get info for (default: "train")
-
-     Returns:
-         Dictionary with dataset info, or None if unable to retrieve
-
-     Example:
-         >>> info = get_dataset_info("username/hf-agent-benchmark@rubrics")
-         >>> print(f"Records: {info['num_rows']}")
-         >>> print(f"Columns: {info['column_names']}")
-     """
-     # Parse dataset specification
-     if "@" in dataset_spec:
-         dataset_name, config_name = dataset_spec.split("@", 1)
-     else:
-         dataset_name = dataset_spec
-         config_name = "default"
-
-     try:
-         # Load just to get info (streaming mode for efficiency)
-         dataset = load_dataset(
-             dataset_name,
-             name=config_name,
-             split=split,
-             streaming=True,
-         )
-
-         # Get basic info
-         info = {
-             "dataset_name": dataset_name,
-             "config_name": config_name,
-             "split": split,
-             "features": str(dataset.features),
-             "column_names": dataset.column_names
-             if hasattr(dataset, "column_names")
-             else None,
-         }
-
-         # Try to get row count (only works for non-streaming)
-         dataset_full = load_dataset(dataset_name, name=config_name, split=split)
-         info["num_rows"] = len(dataset_full)
-
-         return info
-
-     except Exception as e:
-         print(f"✗ Failed to get dataset info: {type(e).__name__}: {str(e)}")
-         return None
-
-
  def df_to_hub(
      df: pd.DataFrame,
      dataset_spec: str,
@@ -500,18 +213,3 @@ def hub_to_df(
      except Exception as e:
          print(f"✗ Failed to download from HuggingFace: {type(e).__name__}: {str(e)}")
          return None
-
-
- if __name__ == "__main__":
-     # Example usage
-     print("HuggingFace Dataset I/O Utilities")
-     print("=" * 60)
-     print("\nExample: Upload rubrics")
-     print(' upload_jsonl_to_hf("qa_rubrics.jsonl", "username/dataset@rubrics")')
-     print("\nExample: Download evaluations")
-     print(' download_hf_to_jsonl("username/dataset@evaluations", "local.jsonl")')
-     print("\nExample: List configs")
-     print(' list_dataset_configs("username/dataset")')
-     print("\nExample: Get dataset info")
-     print(' get_dataset_info("username/dataset@rubrics")')
-     print("=" * 60)
 
  Supports the dataset_name@config_name notation for managing multiple configurations.
  """
 
+ from typing import List, Optional
 
  import pandas as pd
  from datasets import Dataset, load_dataset
 
 
  def list_dataset_configs(dataset_name: str) -> Optional[List[str]]:
      """
      List all available configs for a dataset on HuggingFace Hub.
 
      return None
 
 
  def df_to_hub(
      df: pd.DataFrame,
      dataset_spec: str,
 
      except Exception as e:
          print(f"✗ Failed to download from HuggingFace: {type(e).__name__}: {str(e)}")
          return None
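The helpers that survive the rename, `df_to_hub` and `hub_to_df`, keep the `dataset@config` spec notation; a minimal usage sketch (the dataset spec is illustrative), matching the example the old README carried:

```python
import pandas as pd

from eval.hf_io import df_to_hub, hub_to_df

# Upload a DataFrame to one config of a dataset repo, then read it back.
df = pd.DataFrame([{"question": "How to...", "solution": "..."}])
df_to_hub(df, "username/dataset@rubrics", split="train")
rubrics_df = hub_to_df("username/dataset@rubrics", split="train")
```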
eval/{evaluate.py → rubric_eval.py} RENAMED
@@ -4,13 +4,9 @@ Rubric-based evaluation following the "Rubrics as Rewards" paper.
  Implements RaR-Explicit: Weighted sum of individual criterion scores (Equation 1)
  """
 
- import json
- from concurrent.futures import ThreadPoolExecutor, as_completed
- from typing import Dict, List, Optional
 
  import litellm
- import pandas as pd
- from hf_dataset_io import df_to_hub
  from pydantic import BaseModel
 
 
@@ -32,17 +28,6 @@ class RubricEvaluation(BaseModel):
      normalized_score: float  # Score normalized to [0, 1]
 
 
- class EvaluatedResponse(BaseModel):
-     """Complete evaluated response with rubric scores."""
-
-     discussion_title: str
-     discussion_url: str
-     question: str
-     response: str
-     reference_answer: str
-     evaluation: RubricEvaluation
-
-
  CRITERION_PROMPT = """You are evaluating whether a response satisfies a specific evaluation criterion.
 
  Question: {question}
@@ -69,32 +54,6 @@ class RubricData(BaseModel):
      weight: int
 
 
- def load_rubrics_from_file(rubric_file: str) -> Dict[str, List[RubricData]]:
-     """
-     Load rubrics from JSONL file and index by question.
-
-     Args:
-         rubric_file: Path to rubric JSONL file
-
-     Returns:
-         Dictionary mapping questions to their rubrics
-     """
-     rubrics_by_question = {}
-
-     with open(rubric_file, "r") as f:
-         for line in f:
-             entry = json.loads(line)
-             question = entry["question"]
-
-             # Parse rubric JSON string
-             rubric_data = json.loads(entry["rubric"])
-             rubrics = [RubricData(**r) for r in rubric_data["rubrics"]]
-
-             rubrics_by_question[question] = rubrics
-
-     return rubrics_by_question
-
-
  def check_criterion(
      question: str, response: str, criterion: RubricData, model: str = "gpt-4o-mini"
  ) -> CriterionCheck:
@@ -137,7 +96,6 @@ def check_criterion(
  def evaluate_with_rubrics(
      question: str,
      response: str,
-     reference_answer: str,
      rubrics: List[RubricData],
      model: str = "gpt-4o-mini",
  ) -> RubricEvaluation:
@@ -182,176 +140,3 @@ def evaluate_with_rubrics(
          normalized_score=normalized_score,
          criterion_checks=checks,
      )
-
-
- def evaluate_dataset_with_rubrics(
-     input_file: str,
-     rubric_file: str,
-     ground_truth_file: str,
-     output_file: str = "rubric_evaluation_results.jsonl",
-     model: str = "gpt-4o-mini",
-     max_concurrent: int = 10,
-     limit: Optional[int] = None,
-     push_to_hub: Optional[str] = None,
- ) -> None:
-     """
-     Evaluate all responses using rubric-based assessment.
-
-     Args:
-         input_file: Path to JSONL with responses to evaluate
-         rubric_file: Path to JSONL with rubrics (output from generate_rubrics.py)
-         ground_truth_file: Path to JSONL with ground truth answers
-         output_file: Path to output JSONL file
-         model: LLM model for judging
-         max_concurrent: Maximum concurrent evaluations
-         limit: Optional limit on number of examples
-         push_to_hub: Optional HuggingFace dataset spec (e.g., username/dataset@evaluations)
-     """
-     # Load data
-     print(f"Loading responses from {input_file}...")
-     with open(input_file, "r") as f:
-         responses = [json.loads(line) for line in f]
-
-     print(f"Loading rubrics from {rubric_file}...")
-     rubrics_by_question = load_rubrics_from_file(rubric_file)
-
-     print(f"Loading ground truth from {ground_truth_file}...")
-     with open(ground_truth_file, "r") as f:
-         ground_truths = [json.loads(line) for line in f]
-
-     if limit:
-         responses = responses[:limit]
-         ground_truths = ground_truths[:limit]
-
-     print(f"Loaded {len(responses)} responses to evaluate")
-     print(f"Judge model: {model}")
-
-     # Match responses with rubrics and ground truth
-     evaluation_tasks = []
-     for response_data, gt_data in zip(responses, ground_truths):
-         question = gt_data["question"]
-
-         # Find rubrics for this question
-         rubrics = rubrics_by_question.get(question)
-         if not rubrics:
-             print(f"Warning: No rubrics found for question: {question[:50]}...")
-             continue
-
-         evaluation_tasks.append(
-             {
-                 "question": question,
-                 "response": response_data["solution"],
-                 "reference_answer": gt_data["solution"],
-                 "rubrics": rubrics,
-                 "metadata": {
-                     "discussion_title": response_data.get("discussion_title", ""),
-                     "discussion_url": response_data.get("discussion_url", ""),
-                 },
-             }
-         )
-
-     print(
-         f"Running {len(evaluation_tasks)} evaluations with {max_concurrent} parallel workers..."
-     )
-
-     # Run evaluations in parallel
-     results = []
-     with ThreadPoolExecutor(max_workers=max_concurrent) as executor:
-         # Submit all tasks
-         future_to_idx = {}
-         for idx, task in enumerate(evaluation_tasks):
-             future = executor.submit(
-                 evaluate_with_rubrics,
-                 question=task["question"],
-                 response=task["response"],
-                 reference_answer=task["reference_answer"],
-                 rubrics=task["rubrics"],
-                 model=model,
-             )
-             future_to_idx[future] = idx
-
-         # Collect results in order
-         results = [None] * len(evaluation_tasks)
-         completed = 0
-         for future in as_completed(future_to_idx):
-             idx = future_to_idx[future]
-             results[idx] = future.result()
-             completed += 1
-             print(f"Completed: {completed}/{len(evaluation_tasks)}", end="\r")
-
-     print()  # New line after progress
-
-     # Combine results with metadata
-     output_data = []
-     total_score = 0.0
-
-     for task, evaluation in zip(evaluation_tasks, results):
-         evaluated_response = EvaluatedResponse(
-             discussion_title=task["metadata"]["discussion_title"],
-             discussion_url=task["metadata"]["discussion_url"],
-             question=task["question"],
-             response=task["response"],
-             reference_answer=task["reference_answer"],
-             evaluation=evaluation,
-         )
-         output_data.append(evaluated_response)
-         total_score += evaluation.normalized_score
-
-     # Convert to DataFrame for HuggingFace upload
-     results_df = pd.DataFrame([entry.model_dump() for entry in output_data])
-
-     # Upload to HuggingFace if specified (before saving JSONL)
-     if push_to_hub:
-         print(f"\nUploading to HuggingFace: {push_to_hub}")
-         upload_success = df_to_hub(
-             df=results_df,
-             dataset_spec=push_to_hub,
-             split="test",
-             private=False,
-         )
-         if not upload_success:
-             print("Warning: HuggingFace upload failed, but continuing to save JSONL...")
-
-     # Write results to JSONL file
-     print(f"\nWriting results to {output_file}...")
-     with open(output_file, "w") as f:
-         for entry in output_data:
-             f.write(entry.model_dump_json() + "\n")
-
-     # Print summary
-     avg_score = total_score / len(output_data) if output_data else 0.0
-
-     print("\n" + "=" * 60)
-     print("RUBRIC-BASED EVALUATION SUMMARY")
-     print("=" * 60)
-     print(f"Total examples: {len(output_data)}")
-     print(f"Judge model: {model}")
-     print(f"Average normalized score: {avg_score:.3f}")
-     print(f"Average percentage: {avg_score * 100:.1f}%")
-
-     # Per-criterion statistics
-     total_satisfied = sum(
-         sum(1 for check in eval.evaluation.criterion_checks if check.satisfied)
-         for eval in output_data
-     )
-     total_criteria = sum(len(eval.evaluation.criterion_checks) for eval in output_data)
-     satisfaction_rate = total_satisfied / total_criteria if total_criteria > 0 else 0.0
-     print(f"Criteria satisfaction rate: {satisfaction_rate * 100:.1f}%")
-
-     if push_to_hub and upload_success:
-         print(f"Pushed to: {push_to_hub}")
-
-     print("=" * 60)
-
-
- if __name__ == "__main__":
-     evaluate_dataset_with_rubrics(
-         input_file="eval/qa_pairs_accepted.jsonl",
-         rubric_file="eval/qa_rubrics.jsonl",
-         ground_truth_file="eval/qa_pairs_accepted.jsonl",
-         output_file="rubric_evaluation.jsonl",
-         model="gpt-4o-mini",
-         max_concurrent=10,
-         limit=30,  # Set to None to evaluate all
-         push_to_hub="akseljoonas/hf-agent-benchmark@ground-truth",  # Set to "username/dataset@evaluations" to upload
-     )
4
  Implements RaR-Explicit: Weighted sum of individual criterion scores (Equation 1)
5
  """
6
 
7
+ from typing import List, Optional
 
 
8
 
9
  import litellm
 
 
10
  from pydantic import BaseModel
11
 
12
 
 
28
  normalized_score: float # Score normalized to [0, 1]
29
 
30
 
 
 
 
 
 
 
 
 
 
 
 
31
  CRITERION_PROMPT = """You are evaluating whether a response satisfies a specific evaluation criterion.
32
 
33
  Question: {question}
 
54
  weight: int
55
 
56
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
57
  def check_criterion(
58
  question: str, response: str, criterion: RubricData, model: str = "gpt-4o-mini"
59
  ) -> CriterionCheck:
 
96
  def evaluate_with_rubrics(
97
  question: str,
98
  response: str,
 
99
  rubrics: List[RubricData],
100
  model: str = "gpt-4o-mini",
101
  ) -> RubricEvaluation:
 
140
  normalized_score=normalized_score,
141
  criterion_checks=checks,
142
  )
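After the rename, `evaluate_with_rubrics` no longer takes a `reference_answer`; a minimal sketch of calling it directly, with the rubric JSON parsed the same way `eval/task.py` does (the record content is illustrative):

```python
import json

from eval.rubric_eval import RubricData, evaluate_with_rubrics

# One row of the rubric dataset: the "rubric" field is a JSON string with a
# "rubrics" list of weighted criteria (shown empty here for brevity).
record = {"question": "How do I ...?", "rubric": '{"rubrics": []}'}
rubrics = [RubricData(**r) for r in json.loads(record["rubric"])["rubrics"]]

evaluation = evaluate_with_rubrics(
    question=record["question"],
    response="<model response to grade>",
    rubrics=rubrics,
    model="gpt-4o-mini",
)
print(evaluation.normalized_score)
```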
eval/solvers.py ADDED
@@ -0,0 +1,116 @@
+ """
+ Collection of Inspect AI solvers used by the rubric task.
+ """
+
+ from __future__ import annotations
+
+ import asyncio
+ import json
+ from typing import Callable, Dict, List, Sequence
+
+ from inspect_ai.model import ChatMessageAssistant, ModelOutput
+ from inspect_ai.solver import Solver, solver
+ from inspect_ai.solver._task_state import TaskState
+
+ from eval.hf_agent_connector import AgentResponseGenerator
+
+
+ async def _run_subprocess(command: Sequence[str]) -> str:
+     process = await asyncio.create_subprocess_exec(
+         *command,
+         stdout=asyncio.subprocess.PIPE,
+         stderr=asyncio.subprocess.PIPE,
+     )
+     stdout, stderr = await process.communicate()
+     if process.returncode != 0:
+         raise RuntimeError(
+             f"Command {' '.join(command)} failed with code {process.returncode}:\n"
+             f"{stderr.decode().strip()}"
+         )
+     return stdout.decode().strip()
+
+
+ @solver(name="hf_agent_solver")
+ def hf_agent_solver(
+     config_path: str = "agent/config_mcp_example.json",
+     max_iterations: int = 10,
+ ) -> Solver:
+     runner = AgentResponseGenerator(
+         config_path=config_path,
+         max_iterations=max_iterations,
+     )
+
+     async def solve(state: TaskState, generate) -> TaskState:
+         response = await runner.run(state.input_text)
+         assistant_message = ChatMessageAssistant(
+             content=response,
+             model=runner.model_name,
+             source="generate",
+         )
+         state.messages.append(assistant_message)
+         state.output = ModelOutput.from_message(assistant_message)
+         state.completed = True
+         return state
+
+     return solve
+
+
+ @solver(name="claude_code")
+ def claude_code(
+     output_format: str = "json",
+     mcp_config: str | None = None,
+ ) -> Solver:
+     if output_format not in {"text", "json", "stream-json"}:
+         raise ValueError("output_format must be one of: text, json, stream-json")
+
+     async def solve(state: TaskState, generate) -> TaskState:
+         prompt = state.input_text
+
+         cmd: List[str] = ["claude", "-p", prompt, "--output-format", output_format]
+         if mcp_config:
+             cmd += ["--mcp-config", mcp_config]
+
+         stdout = await _run_subprocess(cmd)
+         response_text = stdout
+         session_id = None
+
+         if output_format in {"json", "stream-json"}:
+             # stream-json may emit multiple JSON objects; take the last complete line
+             candidate_line = stdout.strip().splitlines()[-1]
+             try:
+                 payload = json.loads(candidate_line)
+                 response_text = (
+                     payload.get("result") or payload.get("message", "") or stdout
+                 )
+                 session_id = payload.get("session_id")
+             except (json.JSONDecodeError, AttributeError):
+                 response_text = stdout
+
+         assistant_message = ChatMessageAssistant(
+             content=response_text,
+             model="claude-code",
+             source="generate",
+             metadata={"session_id": session_id} if session_id else None,
+         )
+         state.messages.append(assistant_message)
+         state.output = ModelOutput.from_message(assistant_message)
+         state.completed = True
+         return state
+
+     return solve
+
+
+ SOLVER_REGISTRY: Dict[str, Callable[..., Solver]] = {
+     "hf_agent_solver": hf_agent_solver,
+     "claude_code": claude_code,
+ }
+
+
+ def get_solver(name: str, **kwargs) -> Solver:
+     try:
+         factory = SOLVER_REGISTRY[name]
+     except KeyError as exc:
+         available = ", ".join(sorted(SOLVER_REGISTRY))
+         raise ValueError(f"Unknown solver '{name}'. Available: {available}") from exc
+
+     return factory(**kwargs)
@@ -0,0 +1,119 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Inspect AI task definition that runs the existing agent and reuses the rubric scorer.
3
+ """
4
+
5
+ from __future__ import annotations
6
+
7
+ import asyncio
8
+ import json
9
+ import sys
10
+ from pathlib import Path
11
+ from typing import Any, Sequence
12
+
13
+ from inspect_ai import Task, task
14
+ from inspect_ai.dataset import Sample, hf_dataset
15
+ from inspect_ai.scorer import Score, Target, mean, scorer
16
+ from inspect_ai.solver._task_state import TaskState
17
+
18
+ PROJECT_ROOT = Path(__file__).resolve().parents[1]
19
+ if str(PROJECT_ROOT) not in sys.path:
20
+ sys.path.insert(0, str(PROJECT_ROOT))
21
+
22
+ from eval.rubric_eval import RubricData, evaluate_with_rubrics # noqa: E402
23
+ from eval.solvers import get_solver # noqa: E402
24
+
25
+
26
+ def _record_to_sample(record: dict[str, Any]) -> Sample:
27
+ rubric_payload = json.loads(record["rubric"])
28
+ rubrics = rubric_payload.get("rubrics", [])
29
+
30
+ metadata = {
31
+ "question": record["question"],
32
+ "discussion_title": record.get("discussion_title"),
33
+ "discussion_url": record.get("discussion_url"),
34
+ "rubric_title": rubric_payload.get("title"),
35
+ "rubric_description": rubric_payload.get("description"),
36
+ "rubrics": rubrics,
37
+ }
38
+
39
+ return Sample(
40
+ input=record["question"],
41
+ target=record["solution"],
42
+ id=record.get("discussion_topic_id"),
43
+ metadata=metadata,
44
+ )
45
+
46
+
47
+ def _load_dataset(dataset_name: str, split: str, limit: int | None) -> Sequence[Sample]:
48
+ return hf_dataset(
49
+ dataset_name, sample_fields=_record_to_sample, split=split, limit=limit
50
+ )
51
+
52
+
53
+ def _metadata_to_rubrics(metadata: dict[str, Any]) -> list[RubricData]:
54
+ raw_rubrics = metadata.get("rubrics", [])
55
+ return [RubricData(**rubric) for rubric in raw_rubrics]
56
+
57
+
58
+ @scorer(metrics=[mean()], name="rubric_scorer")
59
+ def rubric_scorer(judge_model: str = "gpt-4o-mini"):
60
+ async def score(state: TaskState, target: Target) -> Score:
61
+ response_text = state.output.completion or state.output.message.text
62
+ question = state.metadata.get("question", state.input_text)
63
+ rubrics = _metadata_to_rubrics(state.metadata)
64
+
65
+ evaluation = await asyncio.to_thread(
66
+ evaluate_with_rubrics,
67
+ question,
68
+ response_text,
69
+ rubrics,
70
+ judge_model,
71
+ )
72
+
73
+ score_metadata = {
74
+ "raw_score": evaluation.raw_score,
75
+ "criterion_checks": [
76
+ check.model_dump() for check in evaluation.criterion_checks
77
+ ],
78
+ "discussion_title": state.metadata.get("discussion_title"),
79
+ "discussion_url": state.metadata.get("discussion_url"),
80
+ "reference_answer": target.text,
81
+ }
82
+
83
+ return Score(
84
+ value=evaluation.normalized_score,
85
+ answer=response_text,
86
+ explanation=f"Normalized score {evaluation.normalized_score:.3f}",
87
+ metadata=score_metadata,
88
+ )
89
+
90
+ return score
91
+
92
+
93
+ @task(name="hf-benchmark-with-rubrics")
94
+ def hf_benchmark_with_rubrics(
95
+ solver_name: str = "hf_agent_solver",
96
+ solver_kwargs: dict[str, Any] = {
97
+ "max_iterations": 10,
98
+ "config_path": "agent/config_mcp_example.json",
99
+ },
100
+ dataset_name: str = "akseljoonas/hf-agent-rubrics@train",
101
+ limit: int | None = None,
102
+ judge_model: str = "gpt-4o-mini",
103
+ ) -> Task:
104
+ if "@" not in dataset_name:
105
+ raise ValueError("Dataset name must be in the format 'author/dataset@split'")
106
+ dataset_name, dataset_split = dataset_name.split("@")
107
+ dataset = _load_dataset(dataset_name, dataset_split, limit=limit)
108
+
109
+ return Task(
110
+ dataset=dataset,
111
+ solver=get_solver(solver_name, **solver_kwargs),
112
+ scorer=rubric_scorer(judge_model=judge_model),
113
+ metadata={
114
+ "dataset_name": dataset_name,
115
+ "dataset_split": dataset_split,
116
+ "solver_name": solver_name,
117
+ "judge_model": judge_model,
118
+ },
119
+ )