Upload RPC-Bench Space

- .gitignore +4 -0
- README.md +57 -7
- app.py +274 -0
- build_seed_leaderboard.py +156 -0
- constants.py +62 -0
- eval.py +264 -0
- leaderboard_seed.csv +29 -0
- requirements.txt +7 -0
.gitignore
ADDED
@@ -0,0 +1,4 @@
+index.html
+__pycache__/
+.cache/
+*.pyc
README.md
CHANGED
@@ -1,13 +1,63 @@
 ---
-title:
-emoji:
-colorFrom:
-colorTo:
+title: RPC-Bench Leaderboard
+emoji: 📊
+colorFrom: indigo
+colorTo: purple
 sdk: gradio
-sdk_version:
-python_version: '3.13'
+sdk_version: 4.44.1
 app_file: app.py
 pinned: false
+license: mit
 ---
 
-
+<p align="center">
+  🌐 <a href="https://rpc-bench.github.io/" target="_blank">Project Page</a> •
+  💻 <a href="https://github.com/RPC-Bench/PRC-Bench" target="_blank">GitHub</a> •
+  📖 <a href="https://arxiv.org/abs/2601.14289" target="_blank">Paper</a> •
+  🤗 <a href="https://huggingface.co" target="_blank">Hugging Face</a> •
+  🧭 <a href="https://community.modelscope.cn/" target="_blank">ModelScope</a>
+</p>
+
+# RPC-Bench Leaderboard
+
+RPC-Bench is a benchmark for research paper comprehension. This Space provides two functions:
+
+- a public leaderboard for published submissions
+- a submission entry for uploading new evaluation files
+
+## Expected repository layout
+
+The Space is designed to work with a separate submission dataset repository.
+
+```text
+space/
+├── app.py
+├── constants.py
+├── eval.py
+├── requirements.txt
+└── benchmark/
+    ├── dev.json
+    └── test.json
+```
+
+If `benchmark/dev.json` and `benchmark/test.json` are not bundled in the Space repo, set `RPC_BENCH_GOLD_DIR` (or the per-split overrides `RPC_BENCH_GOLD_DEV` / `RPC_BENCH_GOLD_TEST`) through Space secrets / variables.
+
+The static leaderboard seed is stored in `leaderboard_seed.csv`. `index.html` is only used locally to generate that CSV and should not be uploaded to the Space repository.
+
+## Submission format
+
+Uploaded files should be JSONL with one answer per line:
+
+```json
+{"id":"...", "part_idx":1, "question":"...", "gen_answer":"...", "category":"..."}
+```
+
+## Required environment variables
+
+- `HF_TOKEN`: token for cloning and pushing the submission repository
+- `SUBMISSION_REPO_ID`: dataset repo used to store leaderboard results
+- `RPC_BENCH_GOLD_DIR`: optional directory containing `dev.json` and `test.json`
+- `OPENAI_API_KEY`: optional; required if you want the Space to run LLM-based judging inline
+- `OPENAI_BASE_URL`: optional, for OpenAI-compatible endpoints
+
+The Space can still accept uploads when the judge variables are missing, but evaluation will be marked as pending.
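A submission file must list answers in the same paper and `qa_pairs` order as the gold split, since `eval.py` zips the two sequences. A minimal sketch of building such a file (hypothetical paths; `generate` is a placeholder for your model call, and the gold split is assumed to be JSON-lines, which is how `eval.py` reads it):

```python
# Sketch: build a submission JSONL from a gold split file.
# Hypothetical paths; generate() stands in for model inference.
import json

def generate(question: str) -> str:
    return "..."  # your model call here

with open("benchmark/dev.json", encoding="utf-8") as fin, \
     open("my_model_dev.jsonl", "w", encoding="utf-8") as fout:
    for line in fin:
        if not line.strip():
            continue
        paper = json.loads(line)
        # part_idx is 1-based within each paper, matching eval.py's ordering check
        for idx, qa in enumerate(paper.get("qa_pairs", []), start=1):
            fout.write(json.dumps({
                "id": paper["id"],
                "part_idx": idx,
                "question": qa["question"],
                "gen_answer": generate(qa["question"]),
                "category": qa["category"],
            }, ensure_ascii=False) + "\n")
```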
app.py
ADDED
@@ -0,0 +1,274 @@
+from __future__ import annotations
+
+import json
+import os
+import shutil
+import tempfile
+import traceback
+from datetime import datetime, timezone
+from pathlib import Path
+from typing import List
+
+import gradio as gr
+import pandas as pd
+from huggingface_hub import Repository
+
+from constants import (
+    ALL_COLUMNS,
+    CITATION,
+    EXTERNAL_LINKS,
+    GOLD_PATHS,
+    HF_TOKEN,
+    INTRODUCTION,
+    MODEL_COLUMNS,
+    SCORE_COLUMNS,
+    SEED_LEADERBOARD_PATH,
+    SPACE_SUBTITLE,
+    SPACE_TITLE,
+    SUBMISSION_CSV_PATH,
+    SUBMISSION_REPO_ID,
+    SUBMISSION_REPO_TYPE,
+    SUBMIT_GUIDANCE,
+)
+from eval import evaluate_submission
+
+
+def _empty_leaderboard() -> pd.DataFrame:
+    return pd.DataFrame(columns=ALL_COLUMNS)
+
+
+def _normalize_leaderboard_df(df: pd.DataFrame) -> pd.DataFrame:
+    for col in SCORE_COLUMNS:
+        if col in df.columns:
+            df[col] = pd.to_numeric(df[col], errors="coerce")
+    return df
+
+
+def _seed_leaderboard() -> pd.DataFrame:
+    if not SEED_LEADERBOARD_PATH.exists():
+        return _empty_leaderboard()
+
+    df = pd.read_csv(SEED_LEADERBOARD_PATH)
+    for col in ALL_COLUMNS:
+        if col not in df.columns:
+            df[col] = ""
+    return _normalize_leaderboard_df(df[ALL_COLUMNS])
+
+
+def _clone_submission_repo() -> tuple[Repository | None, Path]:
+    if not SUBMISSION_REPO_ID:
+        return None, Path(".")
+
+    local_dir = Path(tempfile.mkdtemp(prefix="rpc_bench_submission_"))
+    repo = Repository(
+        local_dir=str(local_dir),
+        clone_from=SUBMISSION_REPO_ID,
+        repo_type=SUBMISSION_REPO_TYPE,
+        use_auth_token=HF_TOKEN,
+    )
+    repo.git_pull()
+    return repo, local_dir
+
+
+def _load_leaderboard() -> pd.DataFrame:
+    try:
+        seed_df = _seed_leaderboard()
+        repo, local_dir = _clone_submission_repo()
+        if repo is None:
+            return seed_df.sort_values(by=["Info"], ascending=False, na_position="last")
+
+        csv_path = local_dir / SUBMISSION_CSV_PATH
+        if not csv_path.exists():
+            return seed_df.sort_values(by=["Info"], ascending=False, na_position="last")
+
+        df = pd.read_csv(csv_path)
+        for col in ALL_COLUMNS:
+            if col not in df.columns:
+                df[col] = ""
+        merged = pd.concat([seed_df, _normalize_leaderboard_df(df[ALL_COLUMNS])], ignore_index=True)
+        return merged.sort_values(by=["Info"], ascending=False, na_position="last")
+    except Exception:
+        print(traceback.format_exc())
+        return _seed_leaderboard().sort_values(by=["Info"], ascending=False, na_position="last")
+
+
+def _validate_submission_file(file_path: str) -> tuple[bool, str, List[dict]]:
+    path = Path(file_path)
+    if not path.exists():
+        return False, "Uploaded file does not exist.", []
+    if path.suffix.lower() not in {".jsonl", ".json"}:
+        return False, "Submission file must be JSONL or JSON.", []
+
+    rows: List[dict] = []
+    try:
+        if path.suffix.lower() == ".json":
+            loaded = json.loads(path.read_text(encoding="utf-8"))
+            if not isinstance(loaded, list):
+                return False, "JSON submissions must be a list of records.", []
+            rows = loaded
+        else:
+            with path.open("r", encoding="utf-8") as f:
+                for line in f:
+                    line = line.strip()
+                    if not line:
+                        continue
+                    rows.append(json.loads(line))
+    except Exception as exc:
+        return False, f"Failed to parse submission file: {exc}", []
+
+    required = {"id", "part_idx", "question", "gen_answer", "category"}
+    for idx, row in enumerate(rows, start=1):
+        missing = required - set(row.keys())
+        if missing:
+            return False, f"Row {idx} is missing fields: {sorted(missing)}", []
+    return True, "Submission format is valid.", rows
+
+
+def _append_submission_record(local_dir: Path, leaderboard: pd.DataFrame, row: dict) -> pd.DataFrame:
+    csv_path = local_dir / SUBMISSION_CSV_PATH
+    merged = pd.concat([leaderboard, pd.DataFrame([row])], ignore_index=True)
+    merged = merged.reindex(columns=ALL_COLUMNS)
+    merged.to_csv(csv_path, index=False)
+    return merged
+
+
+def submit_prediction(
+    input_file,
+    model_name: str,
+    organization: str,
+    revision: str,
+    model_link: str,
+    input_config: str,
+    split: str,
+):
+    if input_file is None:
+        return "Error: please upload a prediction file.", gr.update(value=_load_leaderboard())
+
+    path = input_file if isinstance(input_file, str) else getattr(input_file, "name", None)
+    if not path:
+        return "Error: could not access the uploaded file.", gr.update(value=_load_leaderboard())
+
+    ok, message, _ = _validate_submission_file(path)
+    if not ok:
+        return f"Error: {message}", gr.update(value=_load_leaderboard())
+
+    try:
+        repo, local_dir = _clone_submission_repo()
+        leaderboard = _load_leaderboard()
+
+        now = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
+        display_name = revision.strip() or model_name.strip()
+        if model_link.strip() and "](" not in display_name:
+            display_name = f"[{display_name}]({model_link.strip()})"
+
+        status = "pending"
+        score_row = {k: "" for k in SCORE_COLUMNS}
+        split_path = GOLD_PATHS.get(split.lower())
+
+        if os.environ.get("OPENAI_API_KEY") and split_path and split_path.exists():
+            eval_dir = local_dir / ".eval" if repo is not None else Path(tempfile.mkdtemp(prefix="rpc_bench_eval_"))
+            try:
+                score_row = evaluate_submission(split_path, path, eval_dir)
+                status = "scored"
+            except Exception:
+                print(traceback.format_exc())
+                status = "uploaded, evaluation failed"
+        else:
+            status = "uploaded, evaluation pending"
+
+        record = {
+            "Model": display_name,
+            "Organization": organization.strip(),
+            "Input Config": input_config.strip().upper(),
+            "Date": now,
+            "Status": status,
+            **{k: score_row.get(k, "") for k in SCORE_COLUMNS},
+        }
+
+        if repo is None:
+            return (
+                "Submission accepted, but no submission repository is configured. "
+                "Set `SUBMISSION_REPO_ID` to enable persistent leaderboard updates.",
+                gr.update(value=_load_leaderboard()),
+            )
+
+        submissions_dir = local_dir / "submissions"
+        submissions_dir.mkdir(parents=True, exist_ok=True)
+        stored_name = f"{datetime.now(timezone.utc).strftime('%Y%m%d_%H%M%S')}_{Path(path).name}"
+        shutil.copy2(path, submissions_dir / stored_name)
+
+        updated_leaderboard = _append_submission_record(local_dir, leaderboard, record)
+        repo.push_to_hub()
+
+        return f"OK: {message}. Status: {status}", gr.update(value=updated_leaderboard)
+    except Exception as exc:
+        print(traceback.format_exc())
+        return f"Error: {exc}", gr.update(value=_load_leaderboard())
+
+
+def refresh_leaderboard():
+    return gr.update(value=_load_leaderboard())
+
+
+with gr.Blocks(title=SPACE_TITLE) as demo:
+    gr.Markdown(EXTERNAL_LINKS)
+    gr.Markdown(f"# {SPACE_TITLE}")
+    gr.Markdown(SPACE_SUBTITLE)
+    gr.Markdown(INTRODUCTION)
+
+    with gr.Tabs():
+        with gr.TabItem("🏅 Leaderboard"):
+            with gr.Row():
+                refresh_btn = gr.Button("Refresh")
+            leaderboard = gr.Dataframe(
+                value=_load_leaderboard(),
+                headers=ALL_COLUMNS,
+                datatype=["markdown", "str", "str", "str", "str", "number", "number", "number", "number", "number"],
+                interactive=False,
+                wrap=True,
+            )
+            refresh_btn.click(fn=refresh_leaderboard, inputs=None, outputs=leaderboard)
+
+        with gr.TabItem("📝 Submit"):
+            gr.Markdown(SUBMIT_GUIDANCE)
+            with gr.Row():
+                with gr.Column():
+                    model_name = gr.Textbox(label="Model name", placeholder="Your model name")
+                    organization = gr.Textbox(label="Organization", placeholder="Your lab, company, or team name")
+                    revision = gr.Textbox(label="Revision name", placeholder="Optional revision label")
+                with gr.Column():
+                    model_link = gr.Textbox(label="Model link", placeholder="https://huggingface.co/...")
+                    input_config = gr.Dropdown(
+                        choices=["TEXT", "VISUAL"],
+                        value="TEXT",
+                        label="Input config",
+                        interactive=True,
+                    )
+                    split = gr.Dropdown(
+                        choices=["test", "dev"],
+                        value="test",
+                        label="Evaluation split",
+                        interactive=True,
+                    )
+
+            input_file = gr.File(label="Upload prediction file", file_count="single", type="filepath")
+            submit_btn = gr.Button("Submit and evaluate")
+            submit_result = gr.Markdown()
+
+            submit_btn.click(
+                fn=submit_prediction,
+                inputs=[input_file, model_name, organization, revision, model_link, input_config, split],
+                outputs=[submit_result, leaderboard],
+            )
+
+        with gr.TabItem("ℹ️ About"):
+            gr.Markdown("## Citation")
+            gr.Code(CITATION, language="bibtex")
+
+            gr.Markdown(
+                "If you want inline evaluation, configure `OPENAI_API_KEY` and `OPENAI_BASE_URL` in the Space secrets."
+            )
+
+
+if __name__ == "__main__":
+    demo.launch()
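The validator above can be exercised locally without any Hub or judge configuration, since `_validate_submission_file` only checks parseability and required fields. A quick sketch (hypothetical file name; importing `app` builds the UI but does not launch it):

```python
# Sketch: smoke-test the submission validator (hypothetical file name).
import json
from app import _validate_submission_file

with open("sample.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps({
        "id": "paper-001", "part_idx": 1, "question": "Q?",
        "gen_answer": "A.", "category": "Claim_Verification",
    }) + "\n")

ok, message, rows = _validate_submission_file("sample.jsonl")
print(ok, message, len(rows))  # True Submission format is valid. 1
```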
build_seed_leaderboard.py
ADDED
@@ -0,0 +1,156 @@
+from __future__ import annotations
+
+import csv
+import re
+from html.parser import HTMLParser
+from pathlib import Path
+
+
+SPACE_DIR = Path(__file__).resolve().parent
+INDEX_HTML = SPACE_DIR / "index.html"
+OUTPUT_CSV = SPACE_DIR / "leaderboard_seed.csv"
+
+
+class ResultsTableParser(HTMLParser):
+    def __init__(self) -> None:
+        super().__init__()
+        self.in_results_table = False
+        self.in_tbody = False
+        self.in_tr = False
+        self.in_td = False
+        self.in_a = False
+        self.in_p = False
+        self.current_href = ""
+        self.current_cell_parts: list[str] = []
+        self.current_row: list[dict] = []
+        self.rows: list[list[dict]] = []
+
+    def handle_starttag(self, tag, attrs):
+        attrs_dict = dict(attrs)
+        if tag == "table" and attrs_dict.get("id") == "results":
+            self.in_results_table = True
+        elif self.in_results_table and tag == "tbody":
+            self.in_tbody = True
+        elif self.in_tbody and tag == "tr":
+            self.in_tr = True
+            self.current_row = []
+        elif self.in_tr and tag == "td":
+            self.in_td = True
+            self.current_cell_parts = []
+            self.current_href = ""
+        elif self.in_td and tag == "a":
+            self.in_a = True
+            self.current_href = attrs_dict.get("href", "").strip()
+        elif self.in_td and tag == "p":
+            self.in_p = True
+        elif self.in_td and tag == "br":
+            self.current_cell_parts.append(" ")
+
+    def handle_endtag(self, tag):
+        if tag == "table" and self.in_results_table:
+            self.in_results_table = False
+        elif tag == "tbody" and self.in_tbody:
+            self.in_tbody = False
+        elif tag == "tr" and self.in_tr:
+            self.in_tr = False
+            if self.current_row:
+                self.rows.append(self.current_row)
+        elif tag == "td" and self.in_td:
+            text = re.sub(r"\s+", " ", "".join(self.current_cell_parts)).strip()
+            self.current_row.append({"text": text, "href": self.current_href})
+            self.in_td = False
+            self.in_a = False
+            self.in_p = False
+            self.current_cell_parts = []
+            self.current_href = ""
+        elif tag == "a":
+            self.in_a = False
+        elif tag == "p":
+            self.in_p = False
+
+    def handle_data(self, data):
+        if self.in_td:
+            self.current_cell_parts.append(data)
+
+
+def parse_rows() -> list[dict]:
+    parser = ResultsTableParser()
+    parser.feed(INDEX_HTML.read_text(encoding="utf-8"))
+
+    records = []
+    for row in parser.rows:
+        # A full row has 9 cells: rank, model, input config, date, then the
+        # five score columns ending with Info at index 8.
+        if len(row) < 9:
+            continue
+
+        model_cell = row[1]
+        model_text = model_cell["text"]
+        parts = [part.strip() for part in model_text.split(" ") if part.strip()]
+        organization = ""
+
+        # The parser preserves the model and organization in a single cell text.
+        # Organization appears after the model title because of the nested <p>.
+        # We recover it by subtracting the anchor text prefix from the cell text.
+        model_name = model_text
+        if model_cell["href"]:
+            anchor_name = model_text.split(" ")[0]
+            if model_text.startswith(anchor_name):
+                model_name = anchor_name
+                organization = model_text[len(anchor_name):].strip()
+
+        # For names with spaces, use the full text before the trailing organization line.
+        if not organization and len(parts) >= 2:
+            organization = parts[-1]
+
+        if organization and model_name.endswith(organization):
+            model_name = model_name[: -len(organization)].strip()
+
+        if not model_name:
+            model_name = model_text
+
+        if model_cell["href"]:
+            model_md = f"[{model_name}]({model_cell['href']})"
+        else:
+            model_md = model_name
+
+        record = {
+            "Model": model_md,
+            "Organization": organization,
+            "Input Config": row[2]["text"].upper(),
+            "Date": row[3]["text"],
+            "Status": "published",
+            "Conciseness": row[4]["text"],
+            "Correctness": row[5]["text"],
+            "Completeness": row[6]["text"],
+            "F1-like": row[7]["text"],
+            "Info": row[8]["text"],
+        }
+        records.append(record)
+
+    return records
+
+
+def main() -> None:
+    rows = parse_rows()
+    with OUTPUT_CSV.open("w", encoding="utf-8", newline="") as f:
+        writer = csv.DictWriter(
+            f,
+            fieldnames=[
+                "Model",
+                "Organization",
+                "Input Config",
+                "Date",
+                "Status",
+                "Conciseness",
+                "Correctness",
+                "Completeness",
+                "F1-like",
+                "Info",
+            ],
+        )
+        writer.writeheader()
+        writer.writerows(rows)
+    print(f"Wrote {len(rows)} rows to {OUTPUT_CSV}")
+
+
+if __name__ == "__main__":
+    main()
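Because the parser only reacts inside `<table id="results">`, it can be smoke-tested on a minimal snippet before being run against the real `index.html`. A sketch with made-up cell values:

```python
# Sketch: feed a minimal results table through ResultsTableParser
# (made-up values; a full row has 9 cells, rank through Info).
from build_seed_leaderboard import ResultsTableParser

html = """
<table id="results"><tbody><tr>
  <td>1</td>
  <td><a href="https://example.org">ModelX</a>
      <p>OrgY</p></td>
  <td>text</td><td>2025-1-1</td>
  <td>50.0</td><td>60.0</td><td>55.0</td><td>57.4</td><td>28.7</td>
</tr></tbody></table>
"""

parser = ResultsTableParser()
parser.feed(html)
print(parser.rows[0][1])  # {'text': 'ModelX OrgY', 'href': 'https://example.org'}
```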
constants.py
ADDED
@@ -0,0 +1,62 @@
+from __future__ import annotations
+
+import os
+from pathlib import Path
+
+
+SPACE_ROOT = Path(__file__).resolve().parent
+DEFAULT_OUTPUT_DIR = SPACE_ROOT / ".cache"
+SEED_LEADERBOARD_PATH = SPACE_ROOT / "leaderboard_seed.csv"
+
+SPACE_TITLE = "RPC-Bench Leaderboard"
+SPACE_SUBTITLE = "Leaderboard and submission entry for RPC-Bench."
+
+SUBMISSION_REPO_ID = os.environ.get("SUBMISSION_REPO_ID", "").strip()
+SUBMISSION_REPO_TYPE = "dataset"
+SUBMISSION_CSV_PATH = os.environ.get("SUBMISSION_CSV_PATH", "leaderboard.csv").strip()
+
+HF_TOKEN = os.environ.get("HF_TOKEN", "").strip() or None
+
+GOLD_DIR = Path(os.environ.get("RPC_BENCH_GOLD_DIR", SPACE_ROOT / "benchmark"))
+GOLD_PATHS = {
+    "dev": Path(os.environ.get("RPC_BENCH_GOLD_DEV", GOLD_DIR / "dev.json")),
+    "test": Path(os.environ.get("RPC_BENCH_GOLD_TEST", GOLD_DIR / "test.json")),
+}
+
+MODEL_COLUMNS = ["Model", "Organization", "Input Config", "Date", "Status"]
+SCORE_COLUMNS = [
+    "Conciseness",
+    "Correctness",
+    "Completeness",
+    "F1-like",
+    "Info",
+]
+ALL_COLUMNS = MODEL_COLUMNS + SCORE_COLUMNS
+
+EXTERNAL_LINKS = """
+<p align="center">
+  🌐 <a href="https://rpc-bench.github.io/" target="_blank">Project Page</a> •
+  💻 <a href="https://github.com/RPC-Bench/PRC-Bench" target="_blank">GitHub</a> •
+  📖 <a href="https://arxiv.org/abs/2601.14289" target="_blank">Paper</a> •
+  🤗 <a href="https://huggingface.co" target="_blank">Hugging Face</a> •
+  🧭 <a href="https://community.modelscope.cn/" target="_blank">ModelScope</a>
+</p>
+"""
+
+INTRODUCTION = (
+    "RPC-Bench Leaderboard provides a compact interface for browsing published results "
+    "and uploading new submissions for evaluation."
+)
+
+SUBMIT_GUIDANCE = (
+    "Upload a JSONL prediction file with fields `id`, `part_idx`, `question`, "
+    "`gen_answer`, and `category`. The Space will validate the format, optionally "
+    "run the judge, and then write the result into the submission repository."
+)
+
+CITATION = r"""@article{chen2026rpc,
+  title={RPC-Bench: A Fine-grained Benchmark for Research Paper Comprehension},
+  author={Chen, Yelin and Zhang, Fanjin and Sun, Suping and Pang, Yunhe and Wang, Yuanchun and Song, Jian and Li, Xiaoyan and Hou, Lei and Zhao, Shu and Tang, Jie and others},
+  journal={arXiv preprint arXiv:2601.14289},
+  year={2026}
+}"""
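Gold-path resolution in `constants.py` is layered: a per-split variable wins over `RPC_BENCH_GOLD_DIR`, which wins over the bundled `benchmark/` folder. A sketch of the effective lookup (hypothetical paths; run in a fresh interpreter, since the env is read at import time):

```python
# Sketch: layered gold-path resolution (hypothetical paths).
import os

os.environ["RPC_BENCH_GOLD_DIR"] = "/data/gold"                # directory-level override
os.environ["RPC_BENCH_GOLD_TEST"] = "/data/gold/v2/test.json"  # split-level override

import constants  # must come after the env vars are set

print(constants.GOLD_PATHS["dev"])   # /data/gold/dev.json
print(constants.GOLD_PATHS["test"])  # /data/gold/v2/test.json
```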
eval.py
ADDED
@@ -0,0 +1,264 @@
+from __future__ import annotations
+
+import json
+import os
+import re
+import time
+from pathlib import Path
+from typing import Dict, List, Tuple
+
+from openai import OpenAI
+
+
+DEFAULT_GPT_MODEL = os.environ.get("RPC_BENCH_GPT_MODEL", "gpt-5-2025-08-07")
+DEFAULT_GEMINI_MODEL = os.environ.get("RPC_BENCH_GEMINI_MODEL", "gemini-2.5-pro")
+OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
+OPENAI_BASE_URL = os.environ.get("OPENAI_BASE_URL", "")
+
+
+def _client() -> OpenAI:
+    return OpenAI(api_key=OPENAI_API_KEY, base_url=OPENAI_BASE_URL or None)
+
+
+def _extract_json(text: str) -> str:
+    text = text.strip()
+    if "```json" in text:
+        match = re.search(r"```json(.*?)```", text, re.DOTALL)
+        if match:
+            return match.group(1).strip()
+    if "```" in text:
+        match = re.search(r"```(.*?)```", text, re.DOTALL)
+        if match:
+            return match.group(1).strip()
+    return text
+
+
+def _load_jsonl(path: str | Path) -> List[Dict]:
+    rows: List[Dict] = []
+    with open(path, "r", encoding="utf-8") as f:
+        for line in f:
+            line = line.strip()
+            if not line:
+                continue
+            rows.append(json.loads(line))
+    return rows
+
+
+def _judge(messages: List[Dict], model: str) -> str:
+    client = _client()
+    response = client.chat.completions.create(
+        model=model,
+        messages=messages,
+        stream=False,
+    )
+    return response.choices[0].message.content or ""
+
+
+def _score_prompt(title: str, abstract: str, question: str, reference_answer: str, predicted_answer: str) -> List[Dict]:
+    system_prompt = (
+        "You are a strict paper-answer judge. Return JSON only. "
+        "Score the prediction on three dimensions: Conciseness, Correctness, Completeness. "
+        "Each dimension must contain a numeric rating in [1, 5] and a short reason."
+    )
+    user_prompt = (
+        f"Title: {title}\n"
+        f"Abstract: {abstract}\n"
+        f"Question: {question}\n"
+        f"Reference answer: {reference_answer}\n"
+        f"Predicted answer: {predicted_answer}\n"
+        "Return JSON only with keys Conciseness, Correctness, Completeness."
+    )
+    return [{"role": "system", "content": system_prompt}, {"role": "user", "content": user_prompt}]
+
+
+def _normalize_rating_block(content: Dict) -> Dict:
+    result = {}
+    for key in ("Conciseness", "Correctness", "Completeness"):
+        value = content.get(key, {})
+        if isinstance(value, dict):
+            rating = float(value.get("rating", 0.0))
+            reason = value.get("reason", "")
+        else:
+            rating = float(value)
+            reason = ""
+        result[key] = {"rating": rating, "reason": reason}
+    return result
+
+
+def paper_qa_score(
+    file_path: str | Path,
+    eval_path: str | Path,
+    out_path: str | Path,
+    judge_model: str = "gpt",
+) -> None:
+    gold_items = _load_jsonl(file_path)
+    pred_items = _load_jsonl(eval_path)
+
+    paper_dict: Dict[str, Dict[str, str]] = {}
+    qa_items: List[Dict] = []
+    for paper in gold_items:
+        paper_dict[paper["id"]] = {
+            "title": paper.get("title", ""),
+            "abstract": paper.get("abstract", ""),
+        }
+        for idx, qa in enumerate(paper.get("qa_pairs", []), start=1):
+            qa_items.append(
+                {
+                    "id": paper["id"],
+                    "part_idx": idx,
+                    "question": qa["question"],
+                    "answer": qa["answer"],
+                    "category": qa["category"],
+                }
+            )
+
+    os.makedirs(Path(out_path).parent, exist_ok=True)
+    if len(qa_items) != len(pred_items):
+        raise ValueError(f"Prediction count mismatch: expected {len(qa_items)}, got {len(pred_items)}")
+
+    model_name = DEFAULT_GPT_MODEL if judge_model == "gpt" else DEFAULT_GEMINI_MODEL
+    for gold, pred in zip(qa_items, pred_items):
+        if gold["id"] != pred["id"] or gold["part_idx"] != pred["part_idx"]:
+            raise ValueError(f"Submission order mismatch at {gold['id']} / {gold['part_idx']}")
+
+        if gold["category"] == "Claim_Verification":
+            # Claim-verification items are scored by exact match in
+            # get_verification_score, so no judge call is made here.
+            score_block = []
+        else:
+            messages = _score_prompt(
+                paper_dict[gold["id"]]["title"],
+                paper_dict[gold["id"]]["abstract"],
+                gold["question"],
+                gold["answer"],
+                pred["gen_answer"],
+            )
+            raw = _judge(messages, model_name)
+            score_block = _normalize_rating_block(json.loads(_extract_json(raw)))
+            time.sleep(float(os.environ.get("RPC_BENCH_JUDGE_SLEEP", "0")))
+
+        with open(out_path, "a", encoding="utf-8") as fw:
+            fw.write(
+                json.dumps(
+                    {
+                        "id": gold["id"],
+                        "part_idx": gold["part_idx"],
+                        "question": gold["question"],
+                        "reference_answer": gold["answer"],
+                        "predicted_answer": pred["gen_answer"],
+                        "category": gold["category"],
+                        "score": score_block,
+                    },
+                    ensure_ascii=False,
+                )
+                + "\n"
+            )
+
+
+def get_llm_score(eval_path: str | Path) -> Tuple[Dict[str, float], Dict[str, Tuple[float, float, float]]]:
+    category_dict: Dict[str, Dict[str, float]] = {}
+    sum_c1 = sum_c2 = sum_c3 = 0.0
+    count = 0
+
+    with open(eval_path, "r", encoding="utf-8") as f:
+        for line in f:
+            line = line.strip()
+            if not line:
+                continue
+            item = json.loads(line)
+            category = item["category"]
+            if category == "Claim_Verification":
+                continue
+
+            if category not in category_dict:
+                category_dict[category] = {"Conciseness": 0.0, "Correctness": 0.0, "Completeness": 0.0, "count": 0.0}
+
+            content = item.get("score", {})
+            c1 = float(content.get("Conciseness", {}).get("rating", 0.0))
+            c2 = float(content.get("Correctness", {}).get("rating", 0.0))
+            c3 = float(content.get("Completeness", {}).get("rating", 0.0))
+
+            category_dict[category]["Conciseness"] += c1
+            category_dict[category]["Correctness"] += c2
+            category_dict[category]["Completeness"] += c3
+            category_dict[category]["count"] += 1
+
+            sum_c1 += c1
+            sum_c2 += c2
+            sum_c3 += c3
+            count += 1
+
+    result: Dict[str, Tuple[float, float, float]] = {}
+    for category, values in category_dict.items():
+        denom = max(values["count"], 1.0)
+        result[category] = (
+            values["Conciseness"] / denom,
+            values["Correctness"] / denom,
+            values["Completeness"] / denom,
+        )
+
+    total_scores = {
+        "Conciseness": sum_c1 / max(count, 1),
+        "Correctness": sum_c2 / max(count, 1),
+        "Completeness": sum_c3 / max(count, 1),
+    }
+    return total_scores, result
+
+
+def calculate_acc(pred: List[str], gold: List[str]) -> float:
+    if not pred:
+        return 0.0
+    return sum(1 for p, g in zip(pred, gold) if p == g) / len(pred)
+
+
+def get_verification_score(gold_path: str | Path, eval_path: str | Path) -> float:
+    gold_answers: List[str] = []
+    pred_answers: List[str] = []
+
+    for paper in _load_jsonl(gold_path):
+        for qa in paper.get("qa_pairs", []):
+            if qa.get("category") == "Claim_Verification":
+                gold_answers.append(str(qa.get("answer", "")).strip())
+
+    for item in _load_jsonl(eval_path):
+        if item.get("category") == "Claim_Verification":
+            pred_answers.append(str(item.get("gen_answer", "")).strip())
+
+    if len(gold_answers) != len(pred_answers):
+        raise ValueError(
+            f"Claim verification count mismatch: expected {len(gold_answers)}, got {len(pred_answers)}"
+        )
+
+    # Answers that are not a literal "True"/"False" count as wrong: they are
+    # mapped to the opposite of the gold label before scoring.
+    normalized_pred: List[str] = []
+    for gold, pred in zip(gold_answers, pred_answers):
+        if pred not in {"True", "False"}:
+            normalized_pred.append("False" if gold == "True" else "True")
+        else:
+            normalized_pred.append(pred)
+
+    return calculate_acc(normalized_pred, gold_answers[: len(normalized_pred)])
+
+
+def evaluate_submission(gold_path: str | Path, pred_path: str | Path, out_dir: str | Path, judge_model: str = "gpt") -> Dict[str, float]:
+    out_dir = Path(out_dir)
+    out_dir.mkdir(parents=True, exist_ok=True)
+
+    judged_path = out_dir / f"{Path(pred_path).stem}_{judge_model}_judge.jsonl"
+    if judged_path.exists():
+        judged_path.unlink()
+
+    paper_qa_score(gold_path, pred_path, judged_path, judge_model=judge_model)
+    llm_total, _ = get_llm_score(judged_path)
+    claim_acc = get_verification_score(gold_path, pred_path)
+
+    f1_like = (
+        2 * llm_total["Correctness"] * llm_total["Completeness"]
+        / (llm_total["Correctness"] + llm_total["Completeness"] + 1e-8)
+    )
+    # Mean ratings are in [1, 5]; multiplying by 20 rescales them to 0-100.
+    # Info = Conciseness rating * F1-like rating * 4 is already on the 0-100
+    # scale (the raw product tops out at 25), so it is not rescaled again.
+    info = llm_total["Conciseness"] * f1_like * 4
+    return {
+        "Conciseness": round(llm_total["Conciseness"] * 20, 4),
+        "Correctness": round(llm_total["Correctness"] * 20, 4),
+        "Completeness": round(llm_total["Completeness"] * 20, 4),
+        "F1-like": round(f1_like * 20, 4),
+        "Info": round(info, 4),
+        "Claim Accuracy": round(claim_acc * 100, 4),
+    }
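On the leaderboard scale, each judged dimension is the mean 1-5 rating times 20, `F1-like` is the harmonic mean of Correctness and Completeness (again times 20), and `Info` is the Conciseness rating times the F1-like rating times 4. The published GPT-5 TEXT seed row reproduces under these formulas:

```python
# Sanity check: recover the 1-5 ratings from the published GPT-5 TEXT
# columns and recompute F1-like and Info.
conc, corr, comp = 54.93 / 20, 69.10 / 20, 67.33 / 20

f1_like = 2 * corr * comp / (corr + comp + 1e-8)
info = conc * f1_like * 4  # already on the 0-100 scale

print(round(f1_like * 20, 2))  # 68.2  (published: 68.20)
print(round(info, 2))          # 37.46 (published: 37.46)
```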
leaderboard_seed.csv
ADDED
@@ -0,0 +1,29 @@
+Model,Organization,Input Config,Date,Status,Conciseness,Correctness,Completeness,F1-like,Info
+[GPT-5](https://openai.com/index/introducing-gpt-5/),OpenAI,TEXT,2025-8-7,published,54.93,69.10,67.33,68.20,37.46
+[GPT-5.2](https://openai.com/index/introducing-gpt-5-2/),OpenAI,TEXT,2025-12-11,published,53.81,66.84,64.03,65.40,35.19
+[GPT-5](https://openai.com/index/introducing-gpt-5/),OpenAI,VISUAL,2025-8-7,published,61.47,58.90,55.34,57.07,35.08
+[Gemini-2.5-Pro](https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/),Google,TEXT,2025-3-25,published,54.87,62.65,59.03,60.79,33.35
+[Gemini-3-Pro](https://blog.google/products-and-platforms/products/gemini/gemini-3/),Google,TEXT,2025-11-18,published,52.81,62.69,60.28,61.46,32.46
+[DeepSeek-V3.2](https://api-docs.deepseek.com/news/news251201),DeepSeek-AI,TEXT,2025-12-1,published,56.31,58.73,55.19,56.91,32.04
+[GPT-5.2](https://openai.com/index/introducing-gpt-5-2/),OpenAI,VISUAL,2025-12-11,published,56.43,56.75,52.82,54.72,30.88
+[DeepSeek-V3.1](https://api-docs.deepseek.com/news/news250821),DeepSeek-AI,TEXT,2025-8-21,published,54.76,57.85,54.85,56.31,30.84
+[GLM-4.6V](https://github.com/zai-org/GLM-V),Z.ai,VISUAL,2025-12-8,published,64.55,47.32,43.43,45.29,29.23
+[GLM-4.7](https://z.ai/blog/glm-4.7),Z.ai,TEXT,2025-12-22,published,54.34,54.36,51.75,53.02,28.81
+[GLM-4.5V](https://github.com/zai-org/GLM-V),Z.ai,VISUAL,2025-8-11,published,59.44,48.79,43.62,46.06,27.38
+[gemini-3-pro](https://blog.google/products-and-platforms/products/gemini/gemini-3/),Google,VISUAL,2025-11-18,published,50.22,56.06,52.69,54.32,27.28
+[GLM-4.5](https://z.ai/blog/glm-4.5),Z.ai,TEXT,2025-7-28,published,43.41,58.95,59.54,59.24,25.72
+[gemini-2.5-pro](https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/),Google,VISUAL,2025-3-25,published,51.71,48.39,45.59,46.95,24.28
+[Claude-Sonnet-4](https://www.anthropic.com/news/claude-4),Anthropic,TEXT,2025-5-23,published,41.37,58.53,58.44,58.48,24.19
+[Qwen3](https://github.com/QwenLM/Qwen3),Alibaba,TEXT,2025-7-21,published,41.44,55.88,56.64,56.26,23.31
+[Claude-Sonnet-4.5](https://www.anthropic.com/news/claude-sonnet-4-5),Anthropic,TEXT,2025-9-30,published,31.02,64.31,64.97,64.64,20.05
+[Claude-Sonnet-4.5](https://www.anthropic.com/news/claude-sonnet-4-5),Anthropic,VISUAL,2025-9-30,published,31.95,55.35,54.45,54.89,17.54
+[Claude-Sonnet-4](https://www.anthropic.com/news/claude-4),Anthropic,VISUAL,2025-5-23,published,31.63,54.16,53.32,53.74,16.99
+[HippoRAG2](https://github.com/ianliuwd/HippoRAG2),The Ohio State University,TEXT,2025-6-19,published,45.77,33.13,27.88,30.28,13.86
+[MemoRAG](https://github.com/qhjqhj00/MemoRAG),Peking University & Hong Kong Polytechnic University,TEXT,2025-4-9,published,51.31,24.19,19.10,21.35,10.96
+[VdocRAG](https://vdocrag.github.io/),NTT Corporation & Tohoku University,VISUAL,2025-4-14,published,61.54,21.17,13.88,16.77,10.32
+[VisRAG](https://github.com/OpenBMB/VisRAG),Tsinghua University & ModelBest Inc.,VISUAL,2025-3-2,published,39.90,26.24,23.63,24.87,9.92
+[Raptor](https://github.com/parthsarthi03/raptor),Stanford University,TEXT,2024-1-31,published,36.47,25.28,20.82,22.84,8.33
+[Monkey](https://github.com/Yuliang-Liu/Monkey),Huazhong University of Science and Technology,VISUAL,2024-8-26,published,54.61,17.08,11.27,13.58,7.41
+[Docopilot](https://github.com/OpenGVLab/Docopilot),Shanghai AI Laboratory,VISUAL,2025-7-19,published,39.31,18.31,17.12,17.69,6.96
+[Qwen3](https://github.com/QwenLM/Qwen3),Alibaba,VISUAL,2025-7-21,published,22.64,20.17,20.14,20.16,4.56
+[DocOwl2](https://github.com/X-PLUG/mPLUG-DocOwl),Alibaba,VISUAL,2024-9-9,published,50.19,11.75,6.66,8.50,4.27
ADDED
|
@@ -0,0 +1,7 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
gradio>=4.44.1
|
| 2 |
+
huggingface-hub>=0.23.0
|
| 3 |
+
pandas>=2.0.0
|
| 4 |
+
numpy>=1.24.0
|
| 5 |
+
openai>=1.40.0
|
| 6 |
+
tqdm>=4.66.0
|
| 7 |
+
python-dateutil>=2.8.2
|