Commit 1bf5b23 by Ziyang Wang (0 parents): initial Space
- .gitignore +18 -0
- README.md +76 -0
- SUBMISSION_FORMAT.md +77 -0
- app.py +331 -0
- auth.py +28 -0
- evaluator.py +121 -0
- ledger.py +180 -0
- requirements.txt +3 -0
- tests/fixtures/all_a_submission.json +1 -0
- tests/fixtures/oracle_submission.json +1 -0
- tests/test_evaluator.py +52 -0
.gitignore
ADDED
@@ -0,0 +1,18 @@
# Private answer key — pulled at boot from the private dataset, never committed
annotations_private.json

# Paper-baseline seed records — destined for the public DATASET repo, not the Space
seeds/

# Local dev
.venv/
__pycache__/
*.pyc

# Local snapshot caches
.cache/

# Editor / OS
.DS_Store
.idea/
.vscode/
README.md
ADDED
@@ -0,0 +1,76 @@
---
title: EgoMemReason Leaderboard
emoji: 🧠
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: cc-by-nc-4.0
hf_oauth: true
hf_oauth_scopes:
  - openid
  - profile
---

# EgoMemReason — Leaderboard Space

Live leaderboard for the **EgoMemReason** benchmark: 500 multiple-choice questions over week-long egocentric video, evaluating entity / event / behavior memory.

- 📄 Paper: *coming soon*
- 💻 Reference eval scripts: <https://github.com/Ted412/EgoMemReason>
- 📦 Public questions: <https://huggingface.co/datasets/Ted412/EgoMemReason>
- 🎬 Source frames: <https://egolife-ai.github.io/>

## Operator notes

This Space lives at `Ted412/EgoMemReason` and writes one JSON record per submission to the public dataset `Ted412/EgoMemReason-Leaderboard`. The held-out answer key lives in a separate **private** dataset `Ted412/EgoMemReason-Private` and is pulled at boot via `snapshot_download(token=HF_TOKEN)`.

### Required Space secret

| Name | Value | Scope |
|---|---|---|
| `HF_TOKEN` | Fine-grained HF token | Write on `Ted412/EgoMemReason-Leaderboard` + Read on `Ted412/EgoMemReason-Private` |

Create at <https://huggingface.co/settings/tokens> → fine-grained → grant only those two repos.

### Local development

```bash
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Copy the private answer key into cwd (skips the snapshot_download path).
cp ../EgoMemReason-EvalAI.archived/annotations/annotations_private.json .

# Run, optionally faking a user.
DEBUG_USER=alice python app.py
# → http://127.0.0.1:7860
```

Tests:

```bash
python -m pytest tests/ -q
```

### Architecture

```
EgoMemReason-Space  (this Space, public)
├── app.py                    Gradio UI (Leaderboard / Submit / Manage / About)
├── evaluator.py              pure scoring — port of the old EvalAI main.py
├── ledger.py                 HF I/O: pulls private annotations at boot; writes
│                             one JSON record per submission to the public dataset
├── auth.py                   resolves the HF username from gr.OAuthProfile
└── annotations_private.json  pulled at boot from the private dataset

Ted412/EgoMemReason-Private  (HF dataset, private)
└── annotations_private.json  500 Qs WITH correct_answer

Ted412/EgoMemReason-Leaderboard  (HF dataset, public)
└── submissions/
    └── <uuid>.json           one immutable record per submission
                              (only is_selected flips on a re-write)
```
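
To make the `submissions/<uuid>.json` layout concrete, here is a minimal sketch of how one ledger record might be assembled. The helper `make_record` is hypothetical; the field names are the ones `app.py` reads and writes in this commit, but the exact schema, the UUID formatting, and the timestamp format are assumptions.

```python
import json
import uuid
from datetime import datetime, timezone


def make_record(hf_user_id, team_name, method_name, metrics, **extra):
    """Build one ledger record (illustrative field set, not the exact schema)."""
    return {
        "submission_id": uuid.uuid4().hex,  # assumption: hex form of a random UUID
        "hf_user_id": hf_user_id,
        "team_name": team_name,
        "method_name": method_name,
        "metrics": metrics,
        "is_selected": False,  # the only field expected to change later
        "submitted_at_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        **extra,  # optional metadata such as model_size or project_url
    }


record = make_record("alice", "Team A", "Baseline", {"Overall": 40.0}, model_size="8B")
path_in_repo = f"submissions/{record['submission_id']}.json"
payload = json.dumps(record, indent=2).encode("utf-8")
print(path_in_repo, len(payload), "bytes")
```

A payload like this could then be written to the public dataset with `HfApi.upload_file` (`path_or_fileobj`, `path_in_repo`, `repo_id`, `repo_type="dataset"`); whether the Space does exactly that is not visible in this part of the diff.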
SUBMISSION_FORMAT.md
ADDED
@@ -0,0 +1,77 @@
# Submission Format

A submission is a single JSON file (`.json`) containing a top-level array of 500 prediction objects — one per question.

## Schema

```json
[
  {"example_id": 1, "predicted_answer": "A"},
  {"example_id": 2, "predicted_answer": "C"},
  {"example_id": 500, "predicted_answer": "B"}
]
```

**Required keys (per object):**
- `example_id` — integer in `[1, 500]`, matching `example_id` in `annotations_public.json`.
- `predicted_answer` — single uppercase letter that appears in that question's `options` dict.

**Important:** questions have **between 4 and 10 options**. The valid answer letters for any given question are exactly the keys of its `options` dict. Most are A-F; Event Ordering questions can extend to A-J. A letter outside the question's option set is rejected.

**Optional keys (ignored, but useful for your own debugging):** `raw_response`, `confidence`, `tokens`, etc.

## Rules

1. Top-level must be a JSON array (not an object).
2. The submission must cover **exactly 500 unique `example_id`s**, one per question.
3. Duplicate `example_id`s are rejected.
4. Letters must be uppercase (whitespace is trimmed).
5. File extension must be `.json`.
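
The rules above can be checked locally before uploading. The following is a minimal sketch, not the Space's evaluator: `check_submission` and the `options_by_id` mapping (from `example_id` to the set of that question's option letters, which you would build from `annotations_public.json`) are hypothetical names, and a two-question toy benchmark stands in for the real 500 questions.

```python
import json


def check_submission(preds, options_by_id):
    """Return a list of problems; an empty list means the submission looks valid.

    `options_by_id` is a hypothetical mapping {example_id: set of option letters}.
    """
    if not isinstance(preds, list):
        return ["top level must be a JSON array"]
    problems, seen = [], set()
    for i, item in enumerate(preds):
        if not isinstance(item, dict):
            problems.append(f"item {i}: not an object")
            continue
        eid, ans = item.get("example_id"), item.get("predicted_answer")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(eid, bool) or not isinstance(eid, int):
            problems.append(f"item {i}: example_id must be an integer")
            continue
        if eid in seen:
            problems.append(f"duplicate example_id: {eid}")
        seen.add(eid)
        if eid not in options_by_id:
            problems.append(f"unknown example_id: {eid}")
            continue
        # Whitespace is trimmed, but the letter itself must already be uppercase.
        if not isinstance(ans, str) or ans.strip() not in options_by_id[eid]:
            problems.append(f"example_id {eid}: invalid answer {ans!r}")
    missing = set(options_by_id) - seen
    if missing:
        problems.append(f"missing {len(missing)} example_id(s)")
    return problems


# Toy two-question benchmark standing in for the real question set.
toy_options = {1: {"A", "B", "C", "D"}, 2: {"A", "B", "C", "D", "E", "F"}}
good = json.loads('[{"example_id": 1, "predicted_answer": "A"},'
                  ' {"example_id": 2, "predicted_answer": "F"}]')
bad = json.loads('[{"example_id": 1, "predicted_answer": "E"},'
                 ' {"example_id": 1, "predicted_answer": "A"}]')
print(check_submission(good, toy_options))  # []
for problem in check_submission(bad, toy_options):
    print(problem)
```

Collecting every problem instead of stopping at the first one means a single run shows everything that needs fixing, which matters when submissions are rate-limited.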

## Converting from existing eval-script output

The reference inference scripts in the [EgoMemReason GitHub repo](https://github.com/Ted412/EgoMemReason) write a list of records with a `pred` field. A short script to convert:

```python
import json

with open("results_my_model.json") as f:  # list of records with a `pred` field
    src = json.load(f)
sub = [{"example_id": r["example_id"], "predicted_answer": r["pred"]} for r in src]
with open("submission.json", "w") as f:
    json.dump(sub, f)
```

## How submissions are scored

Accuracy (%) for each of the six `query_type` splits:

- Cumulative State Tracking (100 Qs)
- Temporal Counting (100 Qs)
- Event Ordering (100 Qs)
- Event Linking (100 Qs)
- Spatial Preference (50 Qs)
- Activity Pattern (50 Qs)

plus **Overall** accuracy on all 500. All seven values appear on the leaderboard; ranking is by Overall descending.
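
As a worked example of the arithmetic: because the splits differ in size, Overall is the pooled accuracy over all 500 questions, which is generally not the same number as the unweighted mean of the six split accuracies. The per-split correct counts below are made up purely for illustration.

```python
# Split sizes come from the list above; the correct counts are hypothetical.
split_sizes = {
    "Cumulative State Tracking": 100,
    "Temporal Counting": 100,
    "Event Ordering": 100,
    "Event Linking": 100,
    "Spatial Preference": 50,
    "Activity Pattern": 50,
}
correct = {
    "Cumulative State Tracking": 40,
    "Temporal Counting": 30,
    "Event Ordering": 20,
    "Event Linking": 50,
    "Spatial Preference": 25,
    "Activity Pattern": 35,
}

# Per-split accuracy in percent, rounded to two decimals.
per_split = {name: round(100.0 * correct[name] / n, 2) for name, n in split_sizes.items()}

# Overall pools every question: total correct / total questions.
overall = round(100.0 * sum(correct.values()) / sum(split_sizes.values()), 2)

# For contrast only: the unweighted mean of the six split accuracies.
unweighted_mean = round(sum(per_split.values()) / len(per_split), 2)

print(overall)          # 40.0
print(unweighted_mean)  # 43.33
```

Here 200 of 500 answers are correct, so Overall is 40.0, while the two 50-question splits pull the unweighted mean up to 43.33.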

## Submission limits

- **5 submissions per HF user per 24-hour window.**
- The 24-hour window is rolling, not midnight-aligned.
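
A rolling window means the count looks back exactly 24 hours from the moment of each new attempt. The sketch below illustrates that rule under stated assumptions: the helper names are hypothetical, timestamps are timezone-aware UTC datetimes, and a submission exactly 24 hours old is treated as outside the window (the real boundary handling is not specified here).

```python
from datetime import datetime, timedelta, timezone

MAX_PER_WINDOW = 5
WINDOW = timedelta(hours=24)


def submissions_in_window(timestamps, now):
    """Count submissions made strictly within the 24 hours before `now`."""
    cutoff = now - WINDOW
    return sum(1 for t in timestamps if t > cutoff)


def may_submit(timestamps, now):
    """True if one more submission would stay within the cap."""
    return submissions_in_window(timestamps, now) < MAX_PER_WINDOW


now = datetime(2026, 1, 2, 12, 0, tzinfo=timezone.utc)
# Five earlier submissions, made 1, 5, 10, 20 and 23 hours ago.
history = [now - timedelta(hours=h) for h in (1, 5, 10, 20, 23)]

print(may_submit(history, now))    # False: all five fall inside the window
later = now + timedelta(hours=2)
print(may_submit(history, later))  # True: the oldest one is now 25 hours old
```

Note that the slot frees up two hours later, regardless of where midnight falls.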

## Selected submission

Submit as many times as you like under the cap. In the **Manage my submissions** tab you can mark **one** of your past submissions as your *selected* entry. The default leaderboard view shows only each team's selected entry; the "Show all submissions" toggle reveals all.
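
The selection rule can be sketched as a pure function over submission records. This is an illustration, not the Space's ledger code: `select_entry` and `default_view` are hypothetical names, selection is keyed on the HF user (an assumption), and the choice of exceptions is also an assumption.

```python
def select_entry(records, user, submission_id):
    """Return new records in which `submission_id` is the user's only selected entry."""
    by_id = {r["submission_id"]: r for r in records}
    if submission_id not in by_id:
        raise ValueError(f"unknown submission_id: {submission_id}")
    if by_id[submission_id]["hf_user_id"] != user:
        raise PermissionError("you can only select your own submissions")
    # Flip is_selected on this user's records only; other users are untouched.
    return [
        {**r, "is_selected": r["submission_id"] == submission_id}
        if r["hf_user_id"] == user else dict(r)
        for r in records
    ]


def default_view(records):
    """Selected entries only, ranked by Overall descending."""
    chosen = [r for r in records if r.get("is_selected")]
    return sorted(chosen, key=lambda r: r["metrics"]["Overall"], reverse=True)


records = [
    {"submission_id": "s1", "hf_user_id": "alice", "is_selected": True, "metrics": {"Overall": 30.0}},
    {"submission_id": "s2", "hf_user_id": "alice", "is_selected": False, "metrics": {"Overall": 35.0}},
    {"submission_id": "s3", "hf_user_id": "bob", "is_selected": True, "metrics": {"Overall": 33.0}},
]
updated = select_entry(records, "alice", "s2")
print([r["submission_id"] for r in default_view(updated)])  # ['s2', 's3']
```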

## Required metadata fields

When you submit you must fill in:

| Field | Required | Notes |
|---|---|---|
| `team_name` | yes | Team or affiliation |
| `method_name` | yes | Short title displayed on the leaderboard |
| `uses_external_data` | yes (yes/no) | Did you train / finetune on anything beyond EgoLife? |
| `uses_video_frames` | yes | one of `frames-only` · `video-only` · `frames+audio` · `captions-only` · `other` |
| `model_size` | no | e.g. `8B`, `32B`, `API` |
| `method_description` | no | Free-form description |
| `project_url` | no | Project page |
| `publication_url` | no | arXiv / OpenReview link |
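
If you script your submissions, the table above translates into a small pre-flight check. This is only a sketch: `metadata_problems` is a hypothetical helper, and it assumes the yes/no field is passed as the string `"yes"` or `"no"`.

```python
ALLOWED_MODALITIES = {"frames-only", "video-only", "frames+audio", "captions-only", "other"}
REQUIRED = ("team_name", "method_name", "uses_external_data", "uses_video_frames")


def metadata_problems(meta):
    """Return a list of problems with a metadata dict; empty means it looks complete."""
    problems = [f"{key} is required" for key in REQUIRED if meta.get(key) in (None, "")]
    ext = meta.get("uses_external_data")
    if ext not in (None, "") and ext not in ("yes", "no"):
        problems.append("uses_external_data must be 'yes' or 'no'")
    modality = meta.get("uses_video_frames")
    if modality not in (None, "") and modality not in ALLOWED_MODALITIES:
        problems.append("uses_video_frames must be one of: " + ", ".join(sorted(ALLOWED_MODALITIES)))
    return problems


ok = {"team_name": "Team A", "method_name": "Baseline",
      "uses_external_data": "no", "uses_video_frames": "frames-only"}
broken = {"team_name": "", "method_name": "Baseline",
          "uses_external_data": "maybe", "uses_video_frames": "frames"}
print(metadata_problems(ok))  # []
for problem in metadata_problems(broken):
    print(problem)
```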
app.py
ADDED
@@ -0,0 +1,331 @@
"""EgoMemReason leaderboard — Gradio Space app.

Tabs:
  - Leaderboard   public, auto-refresh, toggle selected-only / show-all
  - Submit        HF login required; JSON upload + metadata form
  - Manage        toggle is_selected on your own past submissions
  - About         paper description + citation
"""

import os

import gradio as gr
import pandas as pd

import auth
import evaluator
import ledger


# ---------------------------------------------------------------------------
# Boot: pull annotations_private.json from the private dataset repo.
# ---------------------------------------------------------------------------

try:
    ledger.ensure_private_annotations()
except RuntimeError as e:
    # In local dev without HF_TOKEN, allow the app to come up with a clear banner.
    BOOT_ERROR = str(e)
else:
    BOOT_ERROR = None


LEADERBOARD_COLUMNS = [
    "Rank",
    "Method",
    "Team",
    "Overall",
    "Cumul. State",
    "Temp. Count",
    "Event Order",
    "Event Link",
    "Spatial Pref.",
    "Activity Pat.",
    "Model size",
    "Ext. data",
    "Modality",
    "Date (UTC)",
    "Links",
]


def _row_from_submission(sub, rank):
    m = sub["metrics"]
    links = []
    if sub.get("project_url"):
        links.append(f"[project]({sub['project_url']})")
    if sub.get("publication_url"):
        links.append(f"[paper]({sub['publication_url']})")
    return [
        rank,
        sub["method_name"],
        sub["team_name"],
        m["Overall"],
        m["Cumulative State Tracking"],
        m["Temporal Counting"],
        m["Event Ordering"],
        m["Event Linking"],
        m["Spatial Preference"],
        m["Activity Pattern"],
        sub.get("model_size") or "—",
        "yes" if sub.get("uses_external_data") else "no",
        sub.get("uses_video_frames") or "—",
        sub.get("submitted_at_utc", "")[:10],
        " · ".join(links) or "—",
    ]


def load_leaderboard(show_all):
    subs = ledger.list_submissions()
    if not show_all:
        subs = [s for s in subs if s.get("is_selected")]
    subs = sorted(subs, key=lambda s: s["metrics"]["Overall"], reverse=True)
    rows = [_row_from_submission(s, i + 1) for i, s in enumerate(subs)]
    return pd.DataFrame(rows, columns=LEADERBOARD_COLUMNS)


# ---------------------------------------------------------------------------
# Submit
# ---------------------------------------------------------------------------

def handle_submission(file, team_name, method_name, model_size, uses_external,
                      uses_frames, method_description, project_url,
                      publication_url, profile: gr.OAuthProfile | None):
    if BOOT_ERROR:
        # The boot banner says submissions are disabled; enforce that here.
        return f"**Error:** submissions are disabled ({BOOT_ERROR})."
    user = auth.resolve_user(profile)
    if user is None:
        return "**Error:** sign in with Hugging Face first (button at the top of the page)."
    if not team_name or not method_name:
        return "**Error:** `team_name` and `method_name` are required."
    if uses_external not in ("yes", "no"):
        return "**Error:** answer `Uses external data?` (yes/no)."
    if not uses_frames:
        return "**Error:** pick a video input modality."
    if file is None:
        return "**Error:** upload a `.json` submission file."

    recent = ledger.count_recent(user, hours=24)
    if recent >= 5:
        return (f"**Rate limit:** you have **{recent}** submissions in the last 24 h "
                "(max 5). Try again later.")

    try:
        metrics = evaluator.score_submission(file.name)
    except ValueError as e:
        return f"**Validation error:**\n```\n{e}\n```"
    except Exception as e:
        return f"**Internal error scoring submission:** `{type(e).__name__}: {e}`"

    try:
        sid = ledger.append_submission(
            hf_user_id=user,
            team_name=team_name,
            method_name=method_name,
            model_size=model_size,
            uses_external_data=(uses_external == "yes"),
            uses_video_frames=uses_frames,
            method_description=method_description,
            project_url=project_url,
            publication_url=publication_url,
            metrics=metrics,
        )
    except Exception as e:
        return (f"**Scored, but failed to persist to ledger:** `{type(e).__name__}: {e}`\n\n"
                f"Your metrics were:\n```\n{metrics}\n```")

    rows = "\n".join(f"| {k} | **{v:.2f}** |" for k, v in metrics.items())
    return (
        f"✅ **Submission logged.** `submission_id = {sid}`\n\n"
        f"| Metric | Score (%) |\n|---|---|\n{rows}\n\n"
        "Go to **Manage my submissions** to mark this as your official entry."
    )


# ---------------------------------------------------------------------------
# Manage
# ---------------------------------------------------------------------------

MANAGE_COLUMNS = ["submission_id", "method_name", "Overall", "is_selected", "submitted_at_utc"]


def load_my_submissions(profile: gr.OAuthProfile | None):
    user = auth.resolve_user(profile)
    if user is None:
        return pd.DataFrame(columns=MANAGE_COLUMNS)
    rows = []
    for sub in ledger.list_submissions():
        if sub.get("hf_user_id") != user:
            continue
        rows.append([
            sub["submission_id"],
            sub["method_name"],
            sub["metrics"]["Overall"],
            sub.get("is_selected", False),
            sub.get("submitted_at_utc", ""),
        ])
    rows.sort(key=lambda r: r[4], reverse=True)
    return pd.DataFrame(rows, columns=MANAGE_COLUMNS)


def set_my_selected(submission_id, profile: gr.OAuthProfile | None):
    user = auth.resolve_user(profile)
    if user is None:
        return "**Error:** sign in first.", load_my_submissions(profile)
    if not submission_id or not submission_id.strip():
        return "**Error:** paste a submission_id.", load_my_submissions(profile)
    try:
        ledger.set_selected(submission_id.strip(), user)
    except (ValueError, PermissionError) as e:
        return f"**Error:** {e}", load_my_submissions(profile)
    return f"✅ `{submission_id.strip()}` is now your selected entry.", load_my_submissions(profile)


# ---------------------------------------------------------------------------
# About
# ---------------------------------------------------------------------------

ABOUT_MD = """\
## EgoMemReason

**A Memory-driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding.**

EgoMemReason is a 500-question multiple-choice benchmark over week-long egocentric
videos (built on [EgoLife](https://egolife-ai.github.io/)). Models must answer
questions whose evidence is sparsely distributed across hours or days, exercising
three memory types:

- **Entity memory** — Cumulative State Tracking, Temporal Counting
- **Event memory** — Event Ordering, Event Linking
- **Behavior memory** — Spatial Preference Inference, Activity Pattern Inference

500 Qs · avg. 5.1 evidence segments / Q · avg. 25.9 h memory backtracking. The
strongest model in the paper reaches **39.6% Overall**.

### Resources

- 📄 Paper: *coming soon*
- 💻 Code & reference eval scripts: <https://github.com/Ted412/EgoMemReason>
- 📦 Public questions (no answers): <https://huggingface.co/datasets/Ted412/EgoMemReason>
- 🎬 EgoLife video frames: <https://egolife-ai.github.io/>

### Submission

Upload a JSON file with 500 entries:

```json
[
  {"example_id": 1, "predicted_answer": "A"},
  ...
]
```

Questions have 4-10 options (letters A-J) — `predicted_answer` must be a letter
that appears in that question's `options` dict. See
[SUBMISSION_FORMAT.md](https://github.com/Ted412/EgoMemReason/blob/main/SUBMISSION_FORMAT.md)
for the full spec.

### License

- **Annotations** (this Space + the public dataset): CC BY-NC 4.0.
- **Video frames**: governed by the [EgoLife data license](https://egolife-ai.github.io/) — you must accept their terms separately.

### Citation

```bibtex
@article{wang2026egomemreason,
  title   = {EgoMemReason: A Memory-driven Reasoning Benchmark for
             Long-Horizon Egocentric Video Understanding},
  author  = {Wang, Ziyang and Zhang, Yue and Yu, Shoubin and Zhang, Ce and
             Zhao, Zengqi and Yoon, Jaehong and Lee, Hyunji and
             Bertasius, Gedas and Bansal, Mohit},
  year    = {2026},
  journal = {arXiv preprint}
}
```
"""


# ---------------------------------------------------------------------------
# UI
# ---------------------------------------------------------------------------

with gr.Blocks(title="EgoMemReason Leaderboard", theme=gr.themes.Soft()) as demo:
    gr.Markdown("# 🧠 EgoMemReason — Leaderboard")
    gr.Markdown(
        "*Memory-driven reasoning over week-long egocentric video. 500 MCQs · "
        "Entity / Event / Behavior memory.*"
    )
    if BOOT_ERROR:
        gr.Markdown(f"⚠️ **Boot warning:** {BOOT_ERROR}\n\nSubmissions are disabled.")

    login_btn = gr.LoginButton()

    with gr.Tab("Leaderboard"):
        with gr.Row():
            show_all = gr.Checkbox(
                value=False,
                label="Show all submissions (not just each team's selected entry)",
            )
            refresh_btn = gr.Button("Refresh", size="sm")
        leaderboard_df = gr.Dataframe(
            value=load_leaderboard(False),
            headers=LEADERBOARD_COLUMNS,
            interactive=False,
            wrap=True,
        )
        show_all.change(load_leaderboard, inputs=[show_all], outputs=[leaderboard_df])
        refresh_btn.click(load_leaderboard, inputs=[show_all], outputs=[leaderboard_df])

    with gr.Tab("Submit"):
        gr.Markdown("**Sign in with Hugging Face (button above) before submitting.** "
                    "Limit: 5 submissions per HF user per 24 h.")
        with gr.Row():
            team_name = gr.Textbox(label="Team name *", max_lines=1)
            method_name = gr.Textbox(label="Method name *", max_lines=1)
        with gr.Row():
            model_size = gr.Textbox(label="Model size (e.g. 8B, 32B, API)", max_lines=1)
            uses_external = gr.Radio(
                ["yes", "no"], label="Uses training data beyond EgoLife? *",
            )
        uses_frames = gr.Radio(
            ["frames-only", "video-only", "frames+audio", "captions-only", "other"],
            label="Video input modality *",
        )
        method_description = gr.Textbox(label="Method description", lines=3)
        with gr.Row():
            project_url = gr.Textbox(label="Project URL", max_lines=1)
            publication_url = gr.Textbox(label="Publication URL (arXiv/OpenReview)", max_lines=1)
        submission_file = gr.File(label="submission.json", file_types=[".json"])
        submit_btn = gr.Button("Score & log", variant="primary")
        result_md = gr.Markdown()
        submit_btn.click(
            handle_submission,
            inputs=[submission_file, team_name, method_name, model_size,
                    uses_external, uses_frames, method_description,
                    project_url, publication_url],
            outputs=[result_md],
        )

    with gr.Tab("Manage my submissions"):
        gr.Markdown(
            "Toggle which of your past submissions is the official **selected** entry. "
            "Only your own submissions appear here. "
            "Only one entry per HF user can be selected at a time."
        )
        my_subs = gr.Dataframe(
            value=pd.DataFrame(columns=MANAGE_COLUMNS),
            headers=MANAGE_COLUMNS,
            interactive=False,
            wrap=True,
        )
        selected_id = gr.Textbox(label="submission_id to mark as selected", max_lines=1)
        select_btn = gr.Button("Mark as my selected entry")
        manage_msg = gr.Markdown()
        demo.load(load_my_submissions, outputs=[my_subs])
        select_btn.click(set_my_selected, inputs=[selected_id], outputs=[manage_msg, my_subs])

    with gr.Tab("About"):
        gr.Markdown(ABOUT_MD)


if __name__ == "__main__":
    demo.queue().launch()
auth.py
ADDED
@@ -0,0 +1,28 @@
"""Resolve the current HF user from Gradio's OAuthProfile.

`gr.LoginButton()` populates `gr.OAuthProfile` for every callback that declares
it as a parameter. We add a `DEBUG_USER` escape hatch for local development,
gated on the SPACE_ID env var so it can never fire in production.
"""

import os


def is_production():
    """True when running inside the HF Space sandbox (vs local dev)."""
    return os.environ.get("SPACE_ID") is not None


def resolve_user(profile):
    """Returns the HF username of the requesting user, or None if not logged in.

    `profile` is the `gr.OAuthProfile | None` Gradio passes to callbacks that
    declare it. In local dev, set DEBUG_USER=alice to pretend to be `alice`.
    """
    if not is_production():
        debug = os.environ.get("DEBUG_USER")
        if debug:
            return debug
    if profile is None:
        return None
    return profile.username
evaluator.py
ADDED
@@ -0,0 +1,121 @@
"""Scoring logic for EgoMemReason.

Pure stdlib — no Gradio, no HF imports. Returns a flat metrics dict.
Raises ValueError with per-example messages on validation failure.
"""

import json
from collections import defaultdict

# Order matches the leaderboard column order.
QUERY_TYPES = [
    "Cumulative State Tracking",
    "Temporal Counting",
    "Event Ordering",
    "Event Linking",
    "Spatial Preference",
    "Activity Pattern",
]


def _load(path):
    with open(path) as f:
        return json.load(f)


def _build_gt(ann):
    """Returns {example_id: (correct_letter, query_type, valid_option_letters)}.

    Questions have 4-10 options (letters up to J), so the valid answer set
    is per-question, not a fixed A-D.
    """
    samples = ann["samples"] if isinstance(ann, dict) else ann
    gt = {}
    for s in samples:
        eid = s["example_id"]
        opts = {str(k).strip().upper() for k in s["options"].keys()}
        gt[eid] = (s["correct_answer"].strip().upper(), s["query_type"], opts)
    return gt


def _validate(preds, gt):
    if not isinstance(preds, list):
        raise ValueError("Submission must be a JSON list of objects.")

    errors = []
    seen = set()
    for i, item in enumerate(preds):
        if not isinstance(item, dict):
            errors.append(f"item {i}: not a JSON object")
            continue
        eid = item.get("example_id")
        ans = item.get("predicted_answer")
        if isinstance(eid, bool) or not isinstance(eid, int):  # bool is an int subclass; JSON true must not pass as 1
            errors.append(f"item {i}: 'example_id' must be an int, got {type(eid).__name__}")
            continue
        if eid in seen:
            errors.append(f"duplicate example_id: {eid}")
        seen.add(eid)
        if eid not in gt:
            errors.append(f"unknown example_id: {eid}")
            continue
        valid = gt[eid][2]
        if not isinstance(ans, str) or ans.strip().upper() not in valid:
            errors.append(
                f"example_id {eid}: 'predicted_answer' must be one of "
                f"{'/'.join(sorted(valid))}, got {ans!r}"
            )

    missing = set(gt) - seen
    if missing:
        errors.append(
            f"missing {len(missing)} example_ids (e.g. {sorted(missing)[:5]}); "
            f"submission must cover all {len(gt)} questions"
        )

    if errors:
        msg = "Submission validation failed:\n  - " + "\n  - ".join(errors[:20])
        if len(errors) > 20:
            msg += f"\n  - ... and {len(errors) - 20} more error(s)"
        raise ValueError(msg)


def _score(preds, gt):
    correct_total = 0
    count_by_qt = defaultdict(int)
    correct_by_qt = defaultdict(int)

    for _eid, (_gt_ans, qt, _opts) in gt.items():
        count_by_qt[qt] += 1

    for item in preds:
        eid = item["example_id"]
        ans = item["predicted_answer"].strip().upper()
        gt_ans, qt, _opts = gt[eid]
        if ans == gt_ans:
            correct_total += 1
            correct_by_qt[qt] += 1

    metrics = {}
    for qt in QUERY_TYPES:
        n = count_by_qt.get(qt, 0)
        metrics[qt] = round(100.0 * correct_by_qt[qt] / n, 2) if n else 0.0
    metrics["Overall"] = round(100.0 * correct_total / len(gt), 2)
    return metrics


def score_submission(submission_path, annotation_path="annotations_private.json"):
    """Returns {"Cumulative State Tracking": ..., ..., "Overall": ...} as percentages."""
    gt = _build_gt(_load(annotation_path))
    preds = _load(submission_path)
    _validate(preds, gt)
    return _score(preds, gt)


if __name__ == "__main__":
    import argparse, pprint
    p = argparse.ArgumentParser()
    p.add_argument("--annotation", default="annotations_private.json")
    p.add_argument("--submission", required=True)
    args = p.parse_args()
    pprint.pp(score_submission(args.submission, args.annotation))
ledger.py
ADDED
|
@@ -0,0 +1,180 @@
"""HF I/O for the EgoMemReason leaderboard.

Two repos:
- PUBLIC_DATASET Ted412/EgoMemReason-Leaderboard (one JSON per submission)
- PRIVATE_DATASET Ted412/EgoMemReason-Private (annotations_private.json)

Boot path: ensure_private_annotations() downloads the private annotations file
on app start so evaluator.score_submission() can read it from cwd.
"""

import functools
import io
import json
import os
import time
import uuid
from datetime import datetime, timedelta, timezone

from huggingface_hub import HfApi, snapshot_download

# Defaults for this challenge; override via env vars in dev.
PUBLIC_DATASET = os.environ.get("EGOMEM_PUBLIC_DATASET", "Ted412/EgoMemReason-Leaderboard")
PRIVATE_DATASET = os.environ.get("EGOMEM_PRIVATE_DATASET", "Ted412/EgoMemReason-Private")
ANNOTATIONS_FILENAME = "annotations_private.json"

HF_TOKEN = os.environ.get("HF_TOKEN")  # write scope on PUBLIC_DATASET; read scope on PRIVATE_DATASET


def ensure_private_annotations(dest_path=ANNOTATIONS_FILENAME):
    """Download annotations_private.json from the private dataset on app boot.

    Only called once per Space restart. If the file is already present (local
    dev case where you've copied it manually), do nothing.
    """
    if os.path.exists(dest_path):
        return dest_path
    if not HF_TOKEN:
        raise RuntimeError(
            "HF_TOKEN env var not set; cannot pull private annotations from "
            f"{PRIVATE_DATASET}. Either set HF_TOKEN or place {dest_path} in cwd."
        )
    local_dir = snapshot_download(
        repo_id=PRIVATE_DATASET,
        repo_type="dataset",
        token=HF_TOKEN,
        allow_patterns=[ANNOTATIONS_FILENAME],
    )
    src = os.path.join(local_dir, ANNOTATIONS_FILENAME)
    if not os.path.exists(src):
        raise RuntimeError(
            f"{ANNOTATIONS_FILENAME} not found in {PRIVATE_DATASET}. "
            "Upload it via the HF Files UI of the private dataset repo."
        )
    # Symlink rather than copy: snapshot_download already cached it.
    # dest_path did not resolve above, so a link found here is dangling
    # (for example after the cache was purged) and must be replaced.
    if os.path.islink(dest_path):
        os.remove(dest_path)
    os.symlink(src, dest_path)
    return dest_path


def _now_iso():
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


@functools.lru_cache(maxsize=1)
def _cached_submissions(cache_bucket):
    """Pulls all submission JSON files. Bucket is int(time/60) so cache rolls every minute."""
    del cache_bucket  # only here to invalidate the cache
    try:
        local_dir = snapshot_download(
            repo_id=PUBLIC_DATASET,
            repo_type="dataset",
            token=HF_TOKEN,  # not strictly required for public read but avoids rate-limiting
            allow_patterns=["submissions/*.json"],
        )
    except Exception:
        return []
    folder = os.path.join(local_dir, "submissions")
    if not os.path.isdir(folder):
        return []
    out = []
    for fn in os.listdir(folder):
        if not fn.endswith(".json"):
            continue
        try:
            with open(os.path.join(folder, fn), encoding="utf-8") as f:
                out.append(json.load(f))
        except Exception:
            # Skip unreadable or malformed records instead of failing the board.
            continue
    return out


def list_submissions():
    return _cached_submissions(int(time.time() / 60))


def _invalidate_cache():
    _cached_submissions.cache_clear()


def count_recent(hf_user_id, hours=24):
    """Count this user's submissions made within the last `hours` hours."""
    cutoff = datetime.now(timezone.utc) - timedelta(hours=hours)
    n = 0
    for sub in list_submissions():
        if sub.get("hf_user_id") != hf_user_id:
            continue
        # `or ""` also covers a record whose timestamp is stored as null.
        ts = sub.get("submitted_at_utc") or ""
        try:
            t = datetime.fromisoformat(ts.rstrip("Z")).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
        if t >= cutoff:
            n += 1
    return n


def _upload_record(record):
    payload = json.dumps(record, indent=2).encode("utf-8")
    HfApi().upload_file(
        path_or_fileobj=io.BytesIO(payload),
        path_in_repo=f"submissions/{record['submission_id']}.json",
        repo_id=PUBLIC_DATASET,
        repo_type="dataset",
        token=HF_TOKEN,
        commit_message=f"submission {record['submission_id'][:8]} from {record['hf_user_id']}",
    )


def append_submission(*, hf_user_id, team_name, method_name, model_size,
                      uses_external_data, uses_video_frames, method_description,
                      project_url, publication_url, metrics):
    if not HF_TOKEN:
        raise RuntimeError("HF_TOKEN not set; cannot persist submission.")
    sid = str(uuid.uuid4())
    record = {
        "submission_id": sid,
        "submitted_at_utc": _now_iso(),
        "hf_user_id": hf_user_id,
        "team_name": team_name,
        "method_name": method_name,
        "model_size": model_size or "",
        "uses_external_data": bool(uses_external_data),
        "uses_video_frames": uses_video_frames,
        "method_description": method_description or "",
        "project_url": project_url or "",
        "publication_url": publication_url or "",
        "is_selected": False,
        "metrics": metrics,
    }
    _upload_record(record)
    _invalidate_cache()
    return sid


def set_selected(submission_id, requesting_user):
    """Mark `submission_id` as the requesting_user's selected entry.

    Enforces one-selected-per-user. Raises PermissionError if the submission
    does not belong to requesting_user.
    """
    # Take one snapshot so both passes below see the same records, even if
    # the one-minute cache bucket rolls over in between.
    subs = list_submissions()
    target = None
    for sub in subs:
        if sub["submission_id"] == submission_id:
            target = sub
            break
    if target is None:
        raise ValueError(f"submission_id not found: {submission_id}")
    if target["hf_user_id"] != requesting_user:
        raise PermissionError("You can only modify your own submissions.")

    # Un-select any other submission this user previously selected.
    for sub in subs:
        if (sub["hf_user_id"] == requesting_user
                and sub.get("is_selected")
                and sub["submission_id"] != submission_id):
            sub["is_selected"] = False
            _upload_record(sub)

    target["is_selected"] = True
    _upload_record(target)
    _invalidate_cache()
|
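`_cached_submissions` above gets a one-minute expiry out of `functools.lru_cache` by passing a time bucket as the only argument. The following is a minimal, self-contained sketch of that pattern, with hypothetical names (`fetch`, `get`, `calls`) and explicit timestamps in place of `time.time()` so the behaviour is deterministic:

```python
import functools

calls = []  # records every real (uncached) fetch


@functools.lru_cache(maxsize=1)
def fetch(bucket):
    # `bucket` only serves as the cache key; the body ignores its value.
    calls.append(bucket)
    return f"payload #{len(calls)}"


def get(now_seconds, ttl=60):
    # Every call inside the same `ttl`-second window maps to the same bucket.
    return fetch(int(now_seconds // ttl))


first = get(0)        # bucket 0: real fetch
second = get(59)      # bucket 0 again: served from the cache
third = get(60)       # bucket 1: cache miss, real fetch
fetch.cache_clear()   # explicit invalidation, e.g. right after a write
fourth = get(61)      # bucket 1, but the cache was cleared: real fetch
print(first, second, third, fourth, calls)
```

With `maxsize=1` only the newest bucket is kept, so stale results are evicted as soon as the window changes. One consequence of this design is that a failed fetch that returns an empty list is also cached until the bucket rolls over or `cache_clear()` is called.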
requirements.txt
ADDED
|
@@ -0,0 +1,3 @@
gradio==4.44.0
huggingface_hub==0.25.0
pandas==2.2.3
|
tests/fixtures/all_a_submission.json
ADDED
|
@@ -0,0 +1 @@
[{"example_id": 1, "predicted_answer": "A"}, {"example_id": 2, "predicted_answer": "A"}, {"example_id": 3, "predicted_answer": "A"}, {"example_id": 4, "predicted_answer": "A"}, {"example_id": 5, "predicted_answer": "A"}, {"example_id": 6, "predicted_answer": "A"}, {"example_id": 7, "predicted_answer": "A"}, {"example_id": 8, "predicted_answer": "A"}, {"example_id": 9, "predicted_answer": "A"}, {"example_id": 10, "predicted_answer": "A"}, {"example_id": 11, "predicted_answer": "A"}, {"example_id": 12, "predicted_answer": "A"}, {"example_id": 13, "predicted_answer": "A"}, {"example_id": 14, "predicted_answer": "A"}, {"example_id": 15, "predicted_answer": "A"}, {"example_id": 16, "predicted_answer": "A"}, {"example_id": 17, "predicted_answer": "A"}, {"example_id": 18, "predicted_answer": "A"}, {"example_id": 19, "predicted_answer": "A"}, {"example_id": 20, "predicted_answer": "A"}, {"example_id": 21, "predicted_answer": "A"}, {"example_id": 22, "predicted_answer": "A"}, {"example_id": 23, "predicted_answer": "A"}, {"example_id": 24, "predicted_answer": "A"}, {"example_id": 25, "predicted_answer": "A"}, {"example_id": 26, "predicted_answer": "A"}, {"example_id": 27, "predicted_answer": "A"}, {"example_id": 28, "predicted_answer": "A"}, {"example_id": 29, "predicted_answer": "A"}, {"example_id": 30, "predicted_answer": "A"}, {"example_id": 31, "predicted_answer": "A"}, {"example_id": 32, "predicted_answer": "A"}, {"example_id": 33, "predicted_answer": "A"}, {"example_id": 34, "predicted_answer": "A"}, {"example_id": 35, "predicted_answer": "A"}, {"example_id": 36, "predicted_answer": "A"}, {"example_id": 37, "predicted_answer": "A"}, {"example_id": 38, "predicted_answer": "A"}, {"example_id": 39, "predicted_answer": "A"}, {"example_id": 40, "predicted_answer": "A"}, {"example_id": 41, "predicted_answer": "A"}, {"example_id": 42, "predicted_answer": "A"}, {"example_id": 43, "predicted_answer": "A"}, {"example_id": 44, "predicted_answer": "A"}, {"example_id": 45, 
"predicted_answer": "A"}, {"example_id": 46, "predicted_answer": "A"}, {"example_id": 47, "predicted_answer": "A"}, {"example_id": 48, "predicted_answer": "A"}, {"example_id": 49, "predicted_answer": "A"}, {"example_id": 50, "predicted_answer": "A"}, {"example_id": 51, "predicted_answer": "A"}, {"example_id": 52, "predicted_answer": "A"}, {"example_id": 53, "predicted_answer": "A"}, {"example_id": 54, "predicted_answer": "A"}, {"example_id": 55, "predicted_answer": "A"}, {"example_id": 56, "predicted_answer": "A"}, {"example_id": 57, "predicted_answer": "A"}, {"example_id": 58, "predicted_answer": "A"}, {"example_id": 59, "predicted_answer": "A"}, {"example_id": 60, "predicted_answer": "A"}, {"example_id": 61, "predicted_answer": "A"}, {"example_id": 62, "predicted_answer": "A"}, {"example_id": 63, "predicted_answer": "A"}, {"example_id": 64, "predicted_answer": "A"}, {"example_id": 65, "predicted_answer": "A"}, {"example_id": 66, "predicted_answer": "A"}, {"example_id": 67, "predicted_answer": "A"}, {"example_id": 68, "predicted_answer": "A"}, {"example_id": 69, "predicted_answer": "A"}, {"example_id": 70, "predicted_answer": "A"}, {"example_id": 71, "predicted_answer": "A"}, {"example_id": 72, "predicted_answer": "A"}, {"example_id": 73, "predicted_answer": "A"}, {"example_id": 74, "predicted_answer": "A"}, {"example_id": 75, "predicted_answer": "A"}, {"example_id": 76, "predicted_answer": "A"}, {"example_id": 77, "predicted_answer": "A"}, {"example_id": 78, "predicted_answer": "A"}, {"example_id": 79, "predicted_answer": "A"}, {"example_id": 80, "predicted_answer": "A"}, {"example_id": 81, "predicted_answer": "A"}, {"example_id": 82, "predicted_answer": "A"}, {"example_id": 83, "predicted_answer": "A"}, {"example_id": 84, "predicted_answer": "A"}, {"example_id": 85, "predicted_answer": "A"}, {"example_id": 86, "predicted_answer": "A"}, {"example_id": 87, "predicted_answer": "A"}, {"example_id": 88, "predicted_answer": "A"}, {"example_id": 89, "predicted_answer": 
"A"}, {"example_id": 90, "predicted_answer": "A"}, {"example_id": 91, "predicted_answer": "A"}, {"example_id": 92, "predicted_answer": "A"}, {"example_id": 93, "predicted_answer": "A"}, {"example_id": 94, "predicted_answer": "A"}, {"example_id": 95, "predicted_answer": "A"}, {"example_id": 96, "predicted_answer": "A"}, {"example_id": 97, "predicted_answer": "A"}, {"example_id": 98, "predicted_answer": "A"}, {"example_id": 99, "predicted_answer": "A"}, {"example_id": 100, "predicted_answer": "A"}, {"example_id": 101, "predicted_answer": "A"}, {"example_id": 102, "predicted_answer": "A"}, {"example_id": 103, "predicted_answer": "A"}, {"example_id": 104, "predicted_answer": "A"}, {"example_id": 105, "predicted_answer": "A"}, {"example_id": 106, "predicted_answer": "A"}, {"example_id": 107, "predicted_answer": "A"}, {"example_id": 108, "predicted_answer": "A"}, {"example_id": 109, "predicted_answer": "A"}, {"example_id": 110, "predicted_answer": "A"}, {"example_id": 111, "predicted_answer": "A"}, {"example_id": 112, "predicted_answer": "A"}, {"example_id": 113, "predicted_answer": "A"}, {"example_id": 114, "predicted_answer": "A"}, {"example_id": 115, "predicted_answer": "A"}, {"example_id": 116, "predicted_answer": "A"}, {"example_id": 117, "predicted_answer": "A"}, {"example_id": 118, "predicted_answer": "A"}, {"example_id": 119, "predicted_answer": "A"}, {"example_id": 120, "predicted_answer": "A"}, {"example_id": 121, "predicted_answer": "A"}, {"example_id": 122, "predicted_answer": "A"}, {"example_id": 123, "predicted_answer": "A"}, {"example_id": 124, "predicted_answer": "A"}, {"example_id": 125, "predicted_answer": "A"}, {"example_id": 126, "predicted_answer": "A"}, {"example_id": 127, "predicted_answer": "A"}, {"example_id": 128, "predicted_answer": "A"}, {"example_id": 129, "predicted_answer": "A"}, {"example_id": 130, "predicted_answer": "A"}, {"example_id": 131, "predicted_answer": "A"}, {"example_id": 132, "predicted_answer": "A"}, {"example_id": 133, 
"predicted_answer": "A"}, {"example_id": 134, "predicted_answer": "A"}, {"example_id": 135, "predicted_answer": "A"}, {"example_id": 136, "predicted_answer": "A"}, {"example_id": 137, "predicted_answer": "A"}, {"example_id": 138, "predicted_answer": "A"}, {"example_id": 139, "predicted_answer": "A"}, {"example_id": 140, "predicted_answer": "A"}, {"example_id": 141, "predicted_answer": "A"}, {"example_id": 142, "predicted_answer": "A"}, {"example_id": 143, "predicted_answer": "A"}, {"example_id": 144, "predicted_answer": "A"}, {"example_id": 145, "predicted_answer": "A"}, {"example_id": 146, "predicted_answer": "A"}, {"example_id": 147, "predicted_answer": "A"}, {"example_id": 148, "predicted_answer": "A"}, {"example_id": 149, "predicted_answer": "A"}, {"example_id": 150, "predicted_answer": "A"}, {"example_id": 151, "predicted_answer": "A"}, {"example_id": 152, "predicted_answer": "A"}, {"example_id": 153, "predicted_answer": "A"}, {"example_id": 154, "predicted_answer": "A"}, {"example_id": 155, "predicted_answer": "A"}, {"example_id": 156, "predicted_answer": "A"}, {"example_id": 157, "predicted_answer": "A"}, {"example_id": 158, "predicted_answer": "A"}, {"example_id": 159, "predicted_answer": "A"}, {"example_id": 160, "predicted_answer": "A"}, {"example_id": 161, "predicted_answer": "A"}, {"example_id": 162, "predicted_answer": "A"}, {"example_id": 163, "predicted_answer": "A"}, {"example_id": 164, "predicted_answer": "A"}, {"example_id": 165, "predicted_answer": "A"}, {"example_id": 166, "predicted_answer": "A"}, {"example_id": 167, "predicted_answer": "A"}, {"example_id": 168, "predicted_answer": "A"}, {"example_id": 169, "predicted_answer": "A"}, {"example_id": 170, "predicted_answer": "A"}, {"example_id": 171, "predicted_answer": "A"}, {"example_id": 172, "predicted_answer": "A"}, {"example_id": 173, "predicted_answer": "A"}, {"example_id": 174, "predicted_answer": "A"}, {"example_id": 175, "predicted_answer": "A"}, {"example_id": 176, "predicted_answer": 
"A"}, {"example_id": 177, "predicted_answer": "A"}, {"example_id": 178, "predicted_answer": "A"}, {"example_id": 179, "predicted_answer": "A"}, {"example_id": 180, "predicted_answer": "A"}, {"example_id": 181, "predicted_answer": "A"}, {"example_id": 182, "predicted_answer": "A"}, {"example_id": 183, "predicted_answer": "A"}, {"example_id": 184, "predicted_answer": "A"}, {"example_id": 185, "predicted_answer": "A"}, {"example_id": 186, "predicted_answer": "A"}, {"example_id": 187, "predicted_answer": "A"}, {"example_id": 188, "predicted_answer": "A"}, {"example_id": 189, "predicted_answer": "A"}, {"example_id": 190, "predicted_answer": "A"}, {"example_id": 191, "predicted_answer": "A"}, {"example_id": 192, "predicted_answer": "A"}, {"example_id": 193, "predicted_answer": "A"}, {"example_id": 194, "predicted_answer": "A"}, {"example_id": 195, "predicted_answer": "A"}, {"example_id": 196, "predicted_answer": "A"}, {"example_id": 197, "predicted_answer": "A"}, {"example_id": 198, "predicted_answer": "A"}, {"example_id": 199, "predicted_answer": "A"}, {"example_id": 200, "predicted_answer": "A"}, {"example_id": 201, "predicted_answer": "A"}, {"example_id": 202, "predicted_answer": "A"}, {"example_id": 203, "predicted_answer": "A"}, {"example_id": 204, "predicted_answer": "A"}, {"example_id": 205, "predicted_answer": "A"}, {"example_id": 206, "predicted_answer": "A"}, {"example_id": 207, "predicted_answer": "A"}, {"example_id": 208, "predicted_answer": "A"}, {"example_id": 209, "predicted_answer": "A"}, {"example_id": 210, "predicted_answer": "A"}, {"example_id": 211, "predicted_answer": "A"}, {"example_id": 212, "predicted_answer": "A"}, {"example_id": 213, "predicted_answer": "A"}, {"example_id": 214, "predicted_answer": "A"}, {"example_id": 215, "predicted_answer": "A"}, {"example_id": 216, "predicted_answer": "A"}, {"example_id": 217, "predicted_answer": "A"}, {"example_id": 218, "predicted_answer": "A"}, {"example_id": 219, "predicted_answer": "A"}, {"example_id": 
220, "predicted_answer": "A"}, {"example_id": 221, "predicted_answer": "A"}, {"example_id": 222, "predicted_answer": "A"}, {"example_id": 223, "predicted_answer": "A"}, {"example_id": 224, "predicted_answer": "A"}, {"example_id": 225, "predicted_answer": "A"}, {"example_id": 226, "predicted_answer": "A"}, {"example_id": 227, "predicted_answer": "A"}, {"example_id": 228, "predicted_answer": "A"}, {"example_id": 229, "predicted_answer": "A"}, {"example_id": 230, "predicted_answer": "A"}, {"example_id": 231, "predicted_answer": "A"}, {"example_id": 232, "predicted_answer": "A"}, {"example_id": 233, "predicted_answer": "A"}, {"example_id": 234, "predicted_answer": "A"}, {"example_id": 235, "predicted_answer": "A"}, {"example_id": 236, "predicted_answer": "A"}, {"example_id": 237, "predicted_answer": "A"}, {"example_id": 238, "predicted_answer": "A"}, {"example_id": 239, "predicted_answer": "A"}, {"example_id": 240, "predicted_answer": "A"}, {"example_id": 241, "predicted_answer": "A"}, {"example_id": 242, "predicted_answer": "A"}, {"example_id": 243, "predicted_answer": "A"}, {"example_id": 244, "predicted_answer": "A"}, {"example_id": 245, "predicted_answer": "A"}, {"example_id": 246, "predicted_answer": "A"}, {"example_id": 247, "predicted_answer": "A"}, {"example_id": 248, "predicted_answer": "A"}, {"example_id": 249, "predicted_answer": "A"}, {"example_id": 250, "predicted_answer": "A"}, {"example_id": 251, "predicted_answer": "A"}, {"example_id": 252, "predicted_answer": "A"}, {"example_id": 253, "predicted_answer": "A"}, {"example_id": 254, "predicted_answer": "A"}, {"example_id": 255, "predicted_answer": "A"}, {"example_id": 256, "predicted_answer": "A"}, {"example_id": 257, "predicted_answer": "A"}, {"example_id": 258, "predicted_answer": "A"}, {"example_id": 259, "predicted_answer": "A"}, {"example_id": 260, "predicted_answer": "A"}, {"example_id": 261, "predicted_answer": "A"}, {"example_id": 262, "predicted_answer": "A"}, {"example_id": 263, 
"predicted_answer": "A"}, {"example_id": 264, "predicted_answer": "A"}, {"example_id": 265, "predicted_answer": "A"}, {"example_id": 266, "predicted_answer": "A"}, {"example_id": 267, "predicted_answer": "A"}, {"example_id": 268, "predicted_answer": "A"}, {"example_id": 269, "predicted_answer": "A"}, {"example_id": 270, "predicted_answer": "A"}, {"example_id": 271, "predicted_answer": "A"}, {"example_id": 272, "predicted_answer": "A"}, {"example_id": 273, "predicted_answer": "A"}, {"example_id": 274, "predicted_answer": "A"}, {"example_id": 275, "predicted_answer": "A"}, {"example_id": 276, "predicted_answer": "A"}, {"example_id": 277, "predicted_answer": "A"}, {"example_id": 278, "predicted_answer": "A"}, {"example_id": 279, "predicted_answer": "A"}, {"example_id": 280, "predicted_answer": "A"}, {"example_id": 281, "predicted_answer": "A"}, {"example_id": 282, "predicted_answer": "A"}, {"example_id": 283, "predicted_answer": "A"}, {"example_id": 284, "predicted_answer": "A"}, {"example_id": 285, "predicted_answer": "A"}, {"example_id": 286, "predicted_answer": "A"}, {"example_id": 287, "predicted_answer": "A"}, {"example_id": 288, "predicted_answer": "A"}, {"example_id": 289, "predicted_answer": "A"}, {"example_id": 290, "predicted_answer": "A"}, {"example_id": 291, "predicted_answer": "A"}, {"example_id": 292, "predicted_answer": "A"}, {"example_id": 293, "predicted_answer": "A"}, {"example_id": 294, "predicted_answer": "A"}, {"example_id": 295, "predicted_answer": "A"}, {"example_id": 296, "predicted_answer": "A"}, {"example_id": 297, "predicted_answer": "A"}, {"example_id": 298, "predicted_answer": "A"}, {"example_id": 299, "predicted_answer": "A"}, {"example_id": 300, "predicted_answer": "A"}, {"example_id": 301, "predicted_answer": "A"}, {"example_id": 302, "predicted_answer": "A"}, {"example_id": 303, "predicted_answer": "A"}, {"example_id": 304, "predicted_answer": "A"}, {"example_id": 305, "predicted_answer": "A"}, {"example_id": 306, "predicted_answer": 
"A"}, {"example_id": 307, "predicted_answer": "A"}, {"example_id": 308, "predicted_answer": "A"}, {"example_id": 309, "predicted_answer": "A"}, {"example_id": 310, "predicted_answer": "A"}, {"example_id": 311, "predicted_answer": "A"}, {"example_id": 312, "predicted_answer": "A"}, {"example_id": 313, "predicted_answer": "A"}, {"example_id": 314, "predicted_answer": "A"}, {"example_id": 315, "predicted_answer": "A"}, {"example_id": 316, "predicted_answer": "A"}, {"example_id": 317, "predicted_answer": "A"}, {"example_id": 318, "predicted_answer": "A"}, {"example_id": 319, "predicted_answer": "A"}, {"example_id": 320, "predicted_answer": "A"}, {"example_id": 321, "predicted_answer": "A"}, {"example_id": 322, "predicted_answer": "A"}, {"example_id": 323, "predicted_answer": "A"}, {"example_id": 324, "predicted_answer": "A"}, {"example_id": 325, "predicted_answer": "A"}, {"example_id": 326, "predicted_answer": "A"}, {"example_id": 327, "predicted_answer": "A"}, {"example_id": 328, "predicted_answer": "A"}, {"example_id": 329, "predicted_answer": "A"}, {"example_id": 330, "predicted_answer": "A"}, {"example_id": 331, "predicted_answer": "A"}, {"example_id": 332, "predicted_answer": "A"}, {"example_id": 333, "predicted_answer": "A"}, {"example_id": 334, "predicted_answer": "A"}, {"example_id": 335, "predicted_answer": "A"}, {"example_id": 336, "predicted_answer": "A"}, {"example_id": 337, "predicted_answer": "A"}, {"example_id": 338, "predicted_answer": "A"}, {"example_id": 339, "predicted_answer": "A"}, {"example_id": 340, "predicted_answer": "A"}, {"example_id": 341, "predicted_answer": "A"}, {"example_id": 342, "predicted_answer": "A"}, {"example_id": 343, "predicted_answer": "A"}, {"example_id": 344, "predicted_answer": "A"}, {"example_id": 345, "predicted_answer": "A"}, {"example_id": 346, "predicted_answer": "A"}, {"example_id": 347, "predicted_answer": "A"}, {"example_id": 348, "predicted_answer": "A"}, {"example_id": 349, "predicted_answer": "A"}, {"example_id": 
350, "predicted_answer": "A"}, {"example_id": 351, "predicted_answer": "A"}, {"example_id": 352, "predicted_answer": "A"}, {"example_id": 353, "predicted_answer": "A"}, {"example_id": 354, "predicted_answer": "A"}, {"example_id": 355, "predicted_answer": "A"}, {"example_id": 356, "predicted_answer": "A"}, {"example_id": 357, "predicted_answer": "A"}, {"example_id": 358, "predicted_answer": "A"}, {"example_id": 359, "predicted_answer": "A"}, {"example_id": 360, "predicted_answer": "A"}, {"example_id": 361, "predicted_answer": "A"}, {"example_id": 362, "predicted_answer": "A"}, {"example_id": 363, "predicted_answer": "A"}, {"example_id": 364, "predicted_answer": "A"}, {"example_id": 365, "predicted_answer": "A"}, {"example_id": 366, "predicted_answer": "A"}, {"example_id": 367, "predicted_answer": "A"}, {"example_id": 368, "predicted_answer": "A"}, {"example_id": 369, "predicted_answer": "A"}, {"example_id": 370, "predicted_answer": "A"}, {"example_id": 371, "predicted_answer": "A"}, {"example_id": 372, "predicted_answer": "A"}, {"example_id": 373, "predicted_answer": "A"}, {"example_id": 374, "predicted_answer": "A"}, {"example_id": 375, "predicted_answer": "A"}, {"example_id": 376, "predicted_answer": "A"}, {"example_id": 377, "predicted_answer": "A"}, {"example_id": 378, "predicted_answer": "A"}, {"example_id": 379, "predicted_answer": "A"}, {"example_id": 380, "predicted_answer": "A"}, {"example_id": 381, "predicted_answer": "A"}, {"example_id": 382, "predicted_answer": "A"}, {"example_id": 383, "predicted_answer": "A"}, {"example_id": 384, "predicted_answer": "A"}, {"example_id": 385, "predicted_answer": "A"}, {"example_id": 386, "predicted_answer": "A"}, {"example_id": 387, "predicted_answer": "A"}, {"example_id": 388, "predicted_answer": "A"}, {"example_id": 389, "predicted_answer": "A"}, {"example_id": 390, "predicted_answer": "A"}, {"example_id": 391, "predicted_answer": "A"}, {"example_id": 392, "predicted_answer": "A"}, {"example_id": 393, 
"predicted_answer": "A"}, {"example_id": 394, "predicted_answer": "A"}, {"example_id": 395, "predicted_answer": "A"}, {"example_id": 396, "predicted_answer": "A"}, {"example_id": 397, "predicted_answer": "A"}, {"example_id": 398, "predicted_answer": "A"}, {"example_id": 399, "predicted_answer": "A"}, {"example_id": 400, "predicted_answer": "A"}, {"example_id": 401, "predicted_answer": "A"}, {"example_id": 402, "predicted_answer": "A"}, {"example_id": 403, "predicted_answer": "A"}, {"example_id": 404, "predicted_answer": "A"}, {"example_id": 405, "predicted_answer": "A"}, {"example_id": 406, "predicted_answer": "A"}, {"example_id": 407, "predicted_answer": "A"}, {"example_id": 408, "predicted_answer": "A"}, {"example_id": 409, "predicted_answer": "A"}, {"example_id": 410, "predicted_answer": "A"}, {"example_id": 411, "predicted_answer": "A"}, {"example_id": 412, "predicted_answer": "A"}, {"example_id": 413, "predicted_answer": "A"}, {"example_id": 414, "predicted_answer": "A"}, {"example_id": 415, "predicted_answer": "A"}, {"example_id": 416, "predicted_answer": "A"}, {"example_id": 417, "predicted_answer": "A"}, {"example_id": 418, "predicted_answer": "A"}, {"example_id": 419, "predicted_answer": "A"}, {"example_id": 420, "predicted_answer": "A"}, {"example_id": 421, "predicted_answer": "A"}, {"example_id": 422, "predicted_answer": "A"}, {"example_id": 423, "predicted_answer": "A"}, {"example_id": 424, "predicted_answer": "A"}, {"example_id": 425, "predicted_answer": "A"}, {"example_id": 426, "predicted_answer": "A"}, {"example_id": 427, "predicted_answer": "A"}, {"example_id": 428, "predicted_answer": "A"}, {"example_id": 429, "predicted_answer": "A"}, {"example_id": 430, "predicted_answer": "A"}, {"example_id": 431, "predicted_answer": "A"}, {"example_id": 432, "predicted_answer": "A"}, {"example_id": 433, "predicted_answer": "A"}, {"example_id": 434, "predicted_answer": "A"}, {"example_id": 435, "predicted_answer": "A"}, {"example_id": 436, "predicted_answer": 
"A"}, {"example_id": 437, "predicted_answer": "A"}, {"example_id": 438, "predicted_answer": "A"}, {"example_id": 439, "predicted_answer": "A"}, {"example_id": 440, "predicted_answer": "A"}, {"example_id": 441, "predicted_answer": "A"}, {"example_id": 442, "predicted_answer": "A"}, {"example_id": 443, "predicted_answer": "A"}, {"example_id": 444, "predicted_answer": "A"}, {"example_id": 445, "predicted_answer": "A"}, {"example_id": 446, "predicted_answer": "A"}, {"example_id": 447, "predicted_answer": "A"}, {"example_id": 448, "predicted_answer": "A"}, {"example_id": 449, "predicted_answer": "A"}, {"example_id": 450, "predicted_answer": "A"}, {"example_id": 451, "predicted_answer": "A"}, {"example_id": 452, "predicted_answer": "A"}, {"example_id": 453, "predicted_answer": "A"}, {"example_id": 454, "predicted_answer": "A"}, {"example_id": 455, "predicted_answer": "A"}, {"example_id": 456, "predicted_answer": "A"}, {"example_id": 457, "predicted_answer": "A"}, {"example_id": 458, "predicted_answer": "A"}, {"example_id": 459, "predicted_answer": "A"}, {"example_id": 460, "predicted_answer": "A"}, {"example_id": 461, "predicted_answer": "A"}, {"example_id": 462, "predicted_answer": "A"}, {"example_id": 463, "predicted_answer": "A"}, {"example_id": 464, "predicted_answer": "A"}, {"example_id": 465, "predicted_answer": "A"}, {"example_id": 466, "predicted_answer": "A"}, {"example_id": 467, "predicted_answer": "A"}, {"example_id": 468, "predicted_answer": "A"}, {"example_id": 469, "predicted_answer": "A"}, {"example_id": 470, "predicted_answer": "A"}, {"example_id": 471, "predicted_answer": "A"}, {"example_id": 472, "predicted_answer": "A"}, {"example_id": 473, "predicted_answer": "A"}, {"example_id": 474, "predicted_answer": "A"}, {"example_id": 475, "predicted_answer": "A"}, {"example_id": 476, "predicted_answer": "A"}, {"example_id": 477, "predicted_answer": "A"}, {"example_id": 478, "predicted_answer": "A"}, {"example_id": 479, "predicted_answer": "A"}, {"example_id": 
480, "predicted_answer": "A"}, {"example_id": 481, "predicted_answer": "A"}, {"example_id": 482, "predicted_answer": "A"}, {"example_id": 483, "predicted_answer": "A"}, {"example_id": 484, "predicted_answer": "A"}, {"example_id": 485, "predicted_answer": "A"}, {"example_id": 486, "predicted_answer": "A"}, {"example_id": 487, "predicted_answer": "A"}, {"example_id": 488, "predicted_answer": "A"}, {"example_id": 489, "predicted_answer": "A"}, {"example_id": 490, "predicted_answer": "A"}, {"example_id": 491, "predicted_answer": "A"}, {"example_id": 492, "predicted_answer": "A"}, {"example_id": 493, "predicted_answer": "A"}, {"example_id": 494, "predicted_answer": "A"}, {"example_id": 495, "predicted_answer": "A"}, {"example_id": 496, "predicted_answer": "A"}, {"example_id": 497, "predicted_answer": "A"}, {"example_id": 498, "predicted_answer": "A"}, {"example_id": 499, "predicted_answer": "A"}, {"example_id": 500, "predicted_answer": "A"}]
|
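The fixture above is a constant-answer baseline: one record per `example_id` from 1 to 500, each predicting `"A"`. A short sketch of how such a file could be generated is shown below; the helper name `constant_submission` and the constant `NUM_QUESTIONS` are illustrative and not part of the repository.

```python
import json

NUM_QUESTIONS = 500  # the fixture lists example_id 1..500


def constant_submission(answer="A", n=NUM_QUESTIONS):
    """Build a submission that gives the same answer to every question."""
    return [{"example_id": i, "predicted_answer": answer} for i in range(1, n + 1)]


submission = constant_submission()
# json.dumps with default separators yields the same single-line layout
# as the fixture; write `text` to a file to use it as a submission.
text = json.dumps(submission)
print(len(submission), text[:45])
```

A baseline like this is handy as a sanity check for the evaluator, since its score depends only on how often `"A"` is the correct option.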
tests/fixtures/oracle_submission.json
ADDED
|
@@ -0,0 +1 @@
[{"example_id": 1, "predicted_answer": "A"}, {"example_id": 2, "predicted_answer": "B"}, {"example_id": 3, "predicted_answer": "C"}, {"example_id": 4, "predicted_answer": "D"}, {"example_id": 5, "predicted_answer": "A"}, {"example_id": 6, "predicted_answer": "B"}, {"example_id": 7, "predicted_answer": "C"}, {"example_id": 8, "predicted_answer": "D"}, {"example_id": 9, "predicted_answer": "E"}, {"example_id": 10, "predicted_answer": "A"}, {"example_id": 11, "predicted_answer": "D"}, {"example_id": 12, "predicted_answer": "E"}, {"example_id": 13, "predicted_answer": "B"}, {"example_id": 14, "predicted_answer": "B"}, {"example_id": 15, "predicted_answer": "C"}, {"example_id": 16, "predicted_answer": "C"}, {"example_id": 17, "predicted_answer": "F"}, {"example_id": 18, "predicted_answer": "E"}, {"example_id": 19, "predicted_answer": "A"}, {"example_id": 20, "predicted_answer": "A"}, {"example_id": 21, "predicted_answer": "B"}, {"example_id": 22, "predicted_answer": "B"}, {"example_id": 23, "predicted_answer": "D"}, {"example_id": 24, "predicted_answer": "C"}, {"example_id": 25, "predicted_answer": "A"}, {"example_id": 26, "predicted_answer": "C"}, {"example_id": 27, "predicted_answer": "E"}, {"example_id": 28, "predicted_answer": "A"}, {"example_id": 29, "predicted_answer": "B"}, {"example_id": 30, "predicted_answer": "C"}, {"example_id": 31, "predicted_answer": "F"}, {"example_id": 32, "predicted_answer": "A"}, {"example_id": 33, "predicted_answer": "D"}, {"example_id": 34, "predicted_answer": "E"}, {"example_id": 35, "predicted_answer": "C"}, {"example_id": 36, "predicted_answer": "C"}, {"example_id": 37, "predicted_answer": "C"}, {"example_id": 38, "predicted_answer": "D"}, {"example_id": 39, "predicted_answer": "D"}, {"example_id": 40, "predicted_answer": "C"}, {"example_id": 41, "predicted_answer": "B"}, {"example_id": 42, "predicted_answer": "B"}, {"example_id": 43, "predicted_answer": "B"}, {"example_id": 44, "predicted_answer": "B"}, {"example_id": 45, 
"predicted_answer": "C"}, {"example_id": 46, "predicted_answer": "B"}, {"example_id": 47, "predicted_answer": "B"}, {"example_id": 48, "predicted_answer": "A"}, {"example_id": 49, "predicted_answer": "B"}, {"example_id": 50, "predicted_answer": "E"}, {"example_id": 51, "predicted_answer": "F"}, {"example_id": 52, "predicted_answer": "F"}, {"example_id": 53, "predicted_answer": "F"}, {"example_id": 54, "predicted_answer": "F"}, {"example_id": 55, "predicted_answer": "F"}, {"example_id": 56, "predicted_answer": "F"}, {"example_id": 57, "predicted_answer": "F"}, {"example_id": 58, "predicted_answer": "F"}, {"example_id": 59, "predicted_answer": "F"}, {"example_id": 60, "predicted_answer": "F"}, {"example_id": 61, "predicted_answer": "F"}, {"example_id": 62, "predicted_answer": "F"}, {"example_id": 63, "predicted_answer": "F"}, {"example_id": 64, "predicted_answer": "F"}, {"example_id": 65, "predicted_answer": "F"}, {"example_id": 66, "predicted_answer": "F"}, {"example_id": 67, "predicted_answer": "F"}, {"example_id": 68, "predicted_answer": "F"}, {"example_id": 69, "predicted_answer": "B"}, {"example_id": 70, "predicted_answer": "C"}, {"example_id": 71, "predicted_answer": "D"}, {"example_id": 72, "predicted_answer": "E"}, {"example_id": 73, "predicted_answer": "F"}, {"example_id": 74, "predicted_answer": "A"}, {"example_id": 75, "predicted_answer": "B"}, {"example_id": 76, "predicted_answer": "C"}, {"example_id": 77, "predicted_answer": "D"}, {"example_id": 78, "predicted_answer": "E"}, {"example_id": 79, "predicted_answer": "F"}, {"example_id": 80, "predicted_answer": "E"}, {"example_id": 81, "predicted_answer": "E"}, {"example_id": 82, "predicted_answer": "E"}, {"example_id": 83, "predicted_answer": "A"}, {"example_id": 84, "predicted_answer": "B"}, {"example_id": 85, "predicted_answer": "C"}, {"example_id": 86, "predicted_answer": "D"}, {"example_id": 87, "predicted_answer": "E"}, {"example_id": 88, "predicted_answer": "A"}, {"example_id": 89, "predicted_answer": 
"B"}, {"example_id": 90, "predicted_answer": "D"}, {"example_id": 91, "predicted_answer": "E"}, {"example_id": 92, "predicted_answer": "A"}, {"example_id": 93, "predicted_answer": "C"}, {"example_id": 94, "predicted_answer": "C"}, {"example_id": 95, "predicted_answer": "D"}, {"example_id": 96, "predicted_answer": "A"}, {"example_id": 97, "predicted_answer": "B"}, {"example_id": 98, "predicted_answer": "C"}, {"example_id": 99, "predicted_answer": "D"}, {"example_id": 100, "predicted_answer": "A"}, {"example_id": 101, "predicted_answer": "B"}, {"example_id": 102, "predicted_answer": "C"}, {"example_id": 103, "predicted_answer": "D"}, {"example_id": 104, "predicted_answer": "A"}, {"example_id": 105, "predicted_answer": "B"}, {"example_id": 106, "predicted_answer": "D"}, {"example_id": 107, "predicted_answer": "A"}, {"example_id": 108, "predicted_answer": "C"}, {"example_id": 109, "predicted_answer": "B"}, {"example_id": 110, "predicted_answer": "C"}, {"example_id": 111, "predicted_answer": "D"}, {"example_id": 112, "predicted_answer": "D"}, {"example_id": 113, "predicted_answer": "E"}, {"example_id": 114, "predicted_answer": "A"}, {"example_id": 115, "predicted_answer": "B"}, {"example_id": 116, "predicted_answer": "A"}, {"example_id": 117, "predicted_answer": "C"}, {"example_id": 118, "predicted_answer": "C"}, {"example_id": 119, "predicted_answer": "D"}, {"example_id": 120, "predicted_answer": "E"}, {"example_id": 121, "predicted_answer": "F"}, {"example_id": 122, "predicted_answer": "B"}, {"example_id": 123, "predicted_answer": "D"}, {"example_id": 124, "predicted_answer": "D"}, {"example_id": 125, "predicted_answer": "D"}, {"example_id": 126, "predicted_answer": "B"}, {"example_id": 127, "predicted_answer": "D"}, {"example_id": 128, "predicted_answer": "D"}, {"example_id": 129, "predicted_answer": "B"}, {"example_id": 130, "predicted_answer": "C"}, {"example_id": 131, "predicted_answer": "D"}, {"example_id": 132, "predicted_answer": "E"}, {"example_id": 133, 
"predicted_answer": "F"}, {"example_id": 134, "predicted_answer": "A"}, {"example_id": 135, "predicted_answer": "D"}, {"example_id": 136, "predicted_answer": "E"}, {"example_id": 137, "predicted_answer": "B"}, {"example_id": 138, "predicted_answer": "C"}, {"example_id": 139, "predicted_answer": "B"}, {"example_id": 140, "predicted_answer": "A"}, {"example_id": 141, "predicted_answer": "A"}, {"example_id": 142, "predicted_answer": "C"}, {"example_id": 143, "predicted_answer": "D"}, {"example_id": 144, "predicted_answer": "E"}, {"example_id": 145, "predicted_answer": "E"}, {"example_id": 146, "predicted_answer": "G"}, {"example_id": 147, "predicted_answer": "F"}, {"example_id": 148, "predicted_answer": "D"}, {"example_id": 149, "predicted_answer": "D"}, {"example_id": 150, "predicted_answer": "A"}, {"example_id": 151, "predicted_answer": "C"}, {"example_id": 152, "predicted_answer": "A"}, {"example_id": 153, "predicted_answer": "E"}, {"example_id": 154, "predicted_answer": "A"}, {"example_id": 155, "predicted_answer": "A"}, {"example_id": 156, "predicted_answer": "A"}, {"example_id": 157, "predicted_answer": "A"}, {"example_id": 158, "predicted_answer": "A"}, {"example_id": 159, "predicted_answer": "B"}, {"example_id": 160, "predicted_answer": "A"}, {"example_id": 161, "predicted_answer": "E"}, {"example_id": 162, "predicted_answer": "G"}, {"example_id": 163, "predicted_answer": "A"}, {"example_id": 164, "predicted_answer": "G"}, {"example_id": 165, "predicted_answer": "A"}, {"example_id": 166, "predicted_answer": "B"}, {"example_id": 167, "predicted_answer": "B"}, {"example_id": 168, "predicted_answer": "F"}, {"example_id": 169, "predicted_answer": "F"}, {"example_id": 170, "predicted_answer": "G"}, {"example_id": 171, "predicted_answer": "G"}, {"example_id": 172, "predicted_answer": "E"}, {"example_id": 173, "predicted_answer": "F"}, {"example_id": 174, "predicted_answer": "A"}, {"example_id": 175, "predicted_answer": "B"}, {"example_id": 176, "predicted_answer": 
"B"}, {"example_id": 177, "predicted_answer": "C"}, {"example_id": 178, "predicted_answer": "B"}, {"example_id": 179, "predicted_answer": "F"}, {"example_id": 180, "predicted_answer": "F"}, {"example_id": 181, "predicted_answer": "G"}, {"example_id": 182, "predicted_answer": "E"}, {"example_id": 183, "predicted_answer": "G"}, {"example_id": 184, "predicted_answer": "E"}, {"example_id": 185, "predicted_answer": "B"}, {"example_id": 186, "predicted_answer": "B"}, {"example_id": 187, "predicted_answer": "C"}, {"example_id": 188, "predicted_answer": "F"}, {"example_id": 189, "predicted_answer": "E"}, {"example_id": 190, "predicted_answer": "D"}, {"example_id": 191, "predicted_answer": "D"}, {"example_id": 192, "predicted_answer": "D"}, {"example_id": 193, "predicted_answer": "D"}, {"example_id": 194, "predicted_answer": "C"}, {"example_id": 195, "predicted_answer": "E"}, {"example_id": 196, "predicted_answer": "F"}, {"example_id": 197, "predicted_answer": "B"}, {"example_id": 198, "predicted_answer": "F"}, {"example_id": 199, "predicted_answer": "C"}, {"example_id": 200, "predicted_answer": "C"}, {"example_id": 201, "predicted_answer": "C"}, {"example_id": 202, "predicted_answer": "D"}, {"example_id": 203, "predicted_answer": "E"}, {"example_id": 204, "predicted_answer": "C"}, {"example_id": 205, "predicted_answer": "F"}, {"example_id": 206, "predicted_answer": "F"}, {"example_id": 207, "predicted_answer": "E"}, {"example_id": 208, "predicted_answer": "D"}, {"example_id": 209, "predicted_answer": "E"}, {"example_id": 210, "predicted_answer": "E"}, {"example_id": 211, "predicted_answer": "D"}, {"example_id": 212, "predicted_answer": "D"}, {"example_id": 213, "predicted_answer": "B"}, {"example_id": 214, "predicted_answer": "D"}, {"example_id": 215, "predicted_answer": "A"}, {"example_id": 216, "predicted_answer": "B"}, {"example_id": 217, "predicted_answer": "C"}, {"example_id": 218, "predicted_answer": "C"}, {"example_id": 219, "predicted_answer": "C"}, {"example_id": 
220, "predicted_answer": "F"}, {"example_id": 221, "predicted_answer": "E"}, {"example_id": 222, "predicted_answer": "E"}, {"example_id": 223, "predicted_answer": "D"}, {"example_id": 224, "predicted_answer": "D"}, {"example_id": 225, "predicted_answer": "B"}, {"example_id": 226, "predicted_answer": "D"}, {"example_id": 227, "predicted_answer": "C"}, {"example_id": 228, "predicted_answer": "C"}, {"example_id": 229, "predicted_answer": "A"}, {"example_id": 230, "predicted_answer": "B"}, {"example_id": 231, "predicted_answer": "E"}, {"example_id": 232, "predicted_answer": "E"}, {"example_id": 233, "predicted_answer": "C"}, {"example_id": 234, "predicted_answer": "F"}, {"example_id": 235, "predicted_answer": "D"}, {"example_id": 236, "predicted_answer": "F"}, {"example_id": 237, "predicted_answer": "C"}, {"example_id": 238, "predicted_answer": "B"}, {"example_id": 239, "predicted_answer": "A"}, {"example_id": 240, "predicted_answer": "B"}, {"example_id": 241, "predicted_answer": "C"}, {"example_id": 242, "predicted_answer": "B"}, {"example_id": 243, "predicted_answer": "C"}, {"example_id": 244, "predicted_answer": "C"}, {"example_id": 245, "predicted_answer": "D"}, {"example_id": 246, "predicted_answer": "D"}, {"example_id": 247, "predicted_answer": "C"}, {"example_id": 248, "predicted_answer": "C"}, {"example_id": 249, "predicted_answer": "D"}, {"example_id": 250, "predicted_answer": "C"}, {"example_id": 251, "predicted_answer": "C"}, {"example_id": 252, "predicted_answer": "C"}, {"example_id": 253, "predicted_answer": "C"}, {"example_id": 254, "predicted_answer": "C"}, {"example_id": 255, "predicted_answer": "C"}, {"example_id": 256, "predicted_answer": "C"}, {"example_id": 257, "predicted_answer": "E"}, {"example_id": 258, "predicted_answer": "C"}, {"example_id": 259, "predicted_answer": "B"}, {"example_id": 260, "predicted_answer": "B"}, {"example_id": 261, "predicted_answer": "B"}, {"example_id": 262, "predicted_answer": "C"}, {"example_id": 263, 
"predicted_answer": "C"}, {"example_id": 264, "predicted_answer": "B"}, {"example_id": 265, "predicted_answer": "B"}, {"example_id": 266, "predicted_answer": "B"}, {"example_id": 267, "predicted_answer": "C"}, {"example_id": 268, "predicted_answer": "F"}, {"example_id": 269, "predicted_answer": "F"}, {"example_id": 270, "predicted_answer": "A"}, {"example_id": 271, "predicted_answer": "A"}, {"example_id": 272, "predicted_answer": "E"}, {"example_id": 273, "predicted_answer": "F"}, {"example_id": 274, "predicted_answer": "F"}, {"example_id": 275, "predicted_answer": "D"}, {"example_id": 276, "predicted_answer": "B"}, {"example_id": 277, "predicted_answer": "C"}, {"example_id": 278, "predicted_answer": "F"}, {"example_id": 279, "predicted_answer": "F"}, {"example_id": 280, "predicted_answer": "E"}, {"example_id": 281, "predicted_answer": "C"}, {"example_id": 282, "predicted_answer": "D"}, {"example_id": 283, "predicted_answer": "C"}, {"example_id": 284, "predicted_answer": "E"}, {"example_id": 285, "predicted_answer": "C"}, {"example_id": 286, "predicted_answer": "B"}, {"example_id": 287, "predicted_answer": "D"}, {"example_id": 288, "predicted_answer": "E"}, {"example_id": 289, "predicted_answer": "A"}, {"example_id": 290, "predicted_answer": "D"}, {"example_id": 291, "predicted_answer": "B"}, {"example_id": 292, "predicted_answer": "D"}, {"example_id": 293, "predicted_answer": "E"}, {"example_id": 294, "predicted_answer": "F"}, {"example_id": 295, "predicted_answer": "C"}, {"example_id": 296, "predicted_answer": "A"}, {"example_id": 297, "predicted_answer": "C"}, {"example_id": 298, "predicted_answer": "D"}, {"example_id": 299, "predicted_answer": "B"}, {"example_id": 300, "predicted_answer": "D"}, {"example_id": 301, "predicted_answer": "A"}, {"example_id": 302, "predicted_answer": "C"}, {"example_id": 303, "predicted_answer": "A"}, {"example_id": 304, "predicted_answer": "A"}, {"example_id": 305, "predicted_answer": "E"}, {"example_id": 306, "predicted_answer": 
"F"}, {"example_id": 307, "predicted_answer": "E"}, {"example_id": 308, "predicted_answer": "D"}, {"example_id": 309, "predicted_answer": "D"}, {"example_id": 310, "predicted_answer": "F"}, {"example_id": 311, "predicted_answer": "D"}, {"example_id": 312, "predicted_answer": "B"}, {"example_id": 313, "predicted_answer": "A"}, {"example_id": 314, "predicted_answer": "B"}, {"example_id": 315, "predicted_answer": "C"}, {"example_id": 316, "predicted_answer": "D"}, {"example_id": 317, "predicted_answer": "B"}, {"example_id": 318, "predicted_answer": "A"}, {"example_id": 319, "predicted_answer": "E"}, {"example_id": 320, "predicted_answer": "C"}, {"example_id": 321, "predicted_answer": "A"}, {"example_id": 322, "predicted_answer": "B"}, {"example_id": 323, "predicted_answer": "D"}, {"example_id": 324, "predicted_answer": "D"}, {"example_id": 325, "predicted_answer": "D"}, {"example_id": 326, "predicted_answer": "F"}, {"example_id": 327, "predicted_answer": "D"}, {"example_id": 328, "predicted_answer": "C"}, {"example_id": 329, "predicted_answer": "C"}, {"example_id": 330, "predicted_answer": "B"}, {"example_id": 331, "predicted_answer": "D"}, {"example_id": 332, "predicted_answer": "F"}, {"example_id": 333, "predicted_answer": "E"}, {"example_id": 334, "predicted_answer": "F"}, {"example_id": 335, "predicted_answer": "E"}, {"example_id": 336, "predicted_answer": "C"}, {"example_id": 337, "predicted_answer": "E"}, {"example_id": 338, "predicted_answer": "E"}, {"example_id": 339, "predicted_answer": "E"}, {"example_id": 340, "predicted_answer": "F"}, {"example_id": 341, "predicted_answer": "F"}, {"example_id": 342, "predicted_answer": "D"}, {"example_id": 343, "predicted_answer": "A"}, {"example_id": 344, "predicted_answer": "C"}, {"example_id": 345, "predicted_answer": "E"}, {"example_id": 346, "predicted_answer": "B"}, {"example_id": 347, "predicted_answer": "B"}, {"example_id": 348, "predicted_answer": "B"}, {"example_id": 349, "predicted_answer": "E"}, {"example_id": 
350, "predicted_answer": "F"}, {"example_id": 351, "predicted_answer": "C"}, {"example_id": 352, "predicted_answer": "A"}, {"example_id": 353, "predicted_answer": "B"}, {"example_id": 354, "predicted_answer": "B"}, {"example_id": 355, "predicted_answer": "F"}, {"example_id": 356, "predicted_answer": "C"}, {"example_id": 357, "predicted_answer": "F"}, {"example_id": 358, "predicted_answer": "C"}, {"example_id": 359, "predicted_answer": "C"}, {"example_id": 360, "predicted_answer": "B"}, {"example_id": 361, "predicted_answer": "C"}, {"example_id": 362, "predicted_answer": "D"}, {"example_id": 363, "predicted_answer": "A"}, {"example_id": 364, "predicted_answer": "B"}, {"example_id": 365, "predicted_answer": "B"}, {"example_id": 366, "predicted_answer": "A"}, {"example_id": 367, "predicted_answer": "A"}, {"example_id": 368, "predicted_answer": "B"}, {"example_id": 369, "predicted_answer": "E"}, {"example_id": 370, "predicted_answer": "F"}, {"example_id": 371, "predicted_answer": "E"}, {"example_id": 372, "predicted_answer": "A"}, {"example_id": 373, "predicted_answer": "D"}, {"example_id": 374, "predicted_answer": "B"}, {"example_id": 375, "predicted_answer": "C"}, {"example_id": 376, "predicted_answer": "D"}, {"example_id": 377, "predicted_answer": "A"}, {"example_id": 378, "predicted_answer": "F"}, {"example_id": 379, "predicted_answer": "E"}, {"example_id": 380, "predicted_answer": "A"}, {"example_id": 381, "predicted_answer": "C"}, {"example_id": 382, "predicted_answer": "A"}, {"example_id": 383, "predicted_answer": "F"}, {"example_id": 384, "predicted_answer": "A"}, {"example_id": 385, "predicted_answer": "B"}, {"example_id": 386, "predicted_answer": "F"}, {"example_id": 387, "predicted_answer": "E"}, {"example_id": 388, "predicted_answer": "C"}, {"example_id": 389, "predicted_answer": "B"}, {"example_id": 390, "predicted_answer": "C"}, {"example_id": 391, "predicted_answer": "D"}, {"example_id": 392, "predicted_answer": "A"}, {"example_id": 393, 
"predicted_answer": "C"}, {"example_id": 394, "predicted_answer": "F"}, {"example_id": 395, "predicted_answer": "E"}, {"example_id": 396, "predicted_answer": "F"}, {"example_id": 397, "predicted_answer": "E"}, {"example_id": 398, "predicted_answer": "D"}, {"example_id": 399, "predicted_answer": "D"}, {"example_id": 400, "predicted_answer": "D"}, {"example_id": 401, "predicted_answer": "B"}, {"example_id": 402, "predicted_answer": "E"}, {"example_id": 403, "predicted_answer": "G"}, {"example_id": 404, "predicted_answer": "G"}, {"example_id": 405, "predicted_answer": "G"}, {"example_id": 406, "predicted_answer": "C"}, {"example_id": 407, "predicted_answer": "G"}, {"example_id": 408, "predicted_answer": "A"}, {"example_id": 409, "predicted_answer": "F"}, {"example_id": 410, "predicted_answer": "B"}, {"example_id": 411, "predicted_answer": "H"}, {"example_id": 412, "predicted_answer": "E"}, {"example_id": 413, "predicted_answer": "F"}, {"example_id": 414, "predicted_answer": "C"}, {"example_id": 415, "predicted_answer": "F"}, {"example_id": 416, "predicted_answer": "C"}, {"example_id": 417, "predicted_answer": "B"}, {"example_id": 418, "predicted_answer": "C"}, {"example_id": 419, "predicted_answer": "G"}, {"example_id": 420, "predicted_answer": "A"}, {"example_id": 421, "predicted_answer": "B"}, {"example_id": 422, "predicted_answer": "H"}, {"example_id": 423, "predicted_answer": "E"}, {"example_id": 424, "predicted_answer": "G"}, {"example_id": 425, "predicted_answer": "G"}, {"example_id": 426, "predicted_answer": "G"}, {"example_id": 427, "predicted_answer": "B"}, {"example_id": 428, "predicted_answer": "E"}, {"example_id": 429, "predicted_answer": "C"}, {"example_id": 430, "predicted_answer": "C"}, {"example_id": 431, "predicted_answer": "E"}, {"example_id": 432, "predicted_answer": "F"}, {"example_id": 433, "predicted_answer": "A"}, {"example_id": 434, "predicted_answer": "D"}, {"example_id": 435, "predicted_answer": "D"}, {"example_id": 436, "predicted_answer": 
"C"}, {"example_id": 437, "predicted_answer": "H"}, {"example_id": 438, "predicted_answer": "H"}, {"example_id": 439, "predicted_answer": "A"}, {"example_id": 440, "predicted_answer": "F"}, {"example_id": 441, "predicted_answer": "C"}, {"example_id": 442, "predicted_answer": "F"}, {"example_id": 443, "predicted_answer": "G"}, {"example_id": 444, "predicted_answer": "G"}, {"example_id": 445, "predicted_answer": "G"}, {"example_id": 446, "predicted_answer": "D"}, {"example_id": 447, "predicted_answer": "B"}, {"example_id": 448, "predicted_answer": "A"}, {"example_id": 449, "predicted_answer": "A"}, {"example_id": 450, "predicted_answer": "I"}, {"example_id": 451, "predicted_answer": "B"}, {"example_id": 452, "predicted_answer": "I"}, {"example_id": 453, "predicted_answer": "F"}, {"example_id": 454, "predicted_answer": "F"}, {"example_id": 455, "predicted_answer": "J"}, {"example_id": 456, "predicted_answer": "H"}, {"example_id": 457, "predicted_answer": "C"}, {"example_id": 458, "predicted_answer": "A"}, {"example_id": 459, "predicted_answer": "B"}, {"example_id": 460, "predicted_answer": "C"}, {"example_id": 461, "predicted_answer": "I"}, {"example_id": 462, "predicted_answer": "E"}, {"example_id": 463, "predicted_answer": "A"}, {"example_id": 464, "predicted_answer": "E"}, {"example_id": 465, "predicted_answer": "F"}, {"example_id": 466, "predicted_answer": "C"}, {"example_id": 467, "predicted_answer": "E"}, {"example_id": 468, "predicted_answer": "D"}, {"example_id": 469, "predicted_answer": "J"}, {"example_id": 470, "predicted_answer": "A"}, {"example_id": 471, "predicted_answer": "C"}, {"example_id": 472, "predicted_answer": "D"}, {"example_id": 473, "predicted_answer": "J"}, {"example_id": 474, "predicted_answer": "H"}, {"example_id": 475, "predicted_answer": "F"}, {"example_id": 476, "predicted_answer": "E"}, {"example_id": 477, "predicted_answer": "J"}, {"example_id": 478, "predicted_answer": "A"}, {"example_id": 479, "predicted_answer": "D"}, {"example_id": 
480, "predicted_answer": "F"}, {"example_id": 481, "predicted_answer": "G"}, {"example_id": 482, "predicted_answer": "A"}, {"example_id": 483, "predicted_answer": "E"}, {"example_id": 484, "predicted_answer": "J"}, {"example_id": 485, "predicted_answer": "C"}, {"example_id": 486, "predicted_answer": "F"}, {"example_id": 487, "predicted_answer": "F"}, {"example_id": 488, "predicted_answer": "B"}, {"example_id": 489, "predicted_answer": "F"}, {"example_id": 490, "predicted_answer": "C"}, {"example_id": 491, "predicted_answer": "D"}, {"example_id": 492, "predicted_answer": "C"}, {"example_id": 493, "predicted_answer": "A"}, {"example_id": 494, "predicted_answer": "H"}, {"example_id": 495, "predicted_answer": "H"}, {"example_id": 496, "predicted_answer": "A"}, {"example_id": 497, "predicted_answer": "H"}, {"example_id": 498, "predicted_answer": "F"}, {"example_id": 499, "predicted_answer": "A"}, {"example_id": 500, "predicted_answer": "B"}]
|
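The fixture above is a flat JSON list with one record per question, each carrying an `example_id` (1 through 500) and a `predicted_answer` letter. A constant-answer baseline in the same shape, such as the all-A fixture, could be generated with a sketch like the one below. The helper names `build_constant_submission` and `write_submission` are hypothetical and are not part of this commit.

```python
import json
import pathlib

# Taken from the fixture above: example_id runs from 1 to 500.
NUM_QUESTIONS = 500


def build_constant_submission(letter="A", n=NUM_QUESTIONS):
    """Return n records that all predict the same answer letter."""
    return [
        {"example_id": i, "predicted_answer": letter}
        for i in range(1, n + 1)
    ]


def write_submission(rows, path):
    """Serialize the records as a single JSON list, like the fixture."""
    path = pathlib.Path(path)
    path.write_text(json.dumps(rows))
    return path
```

Calling `write_submission(build_constant_submission("A"), "all_a_submission.json")` would write a file in the same record format as the fixture.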
tests/test_evaluator.py
ADDED
|
@@ -0,0 +1,52 @@
"""Tests for evaluator.score_submission().

Run from the EgoMemReason-Space/ directory:
    python -m pytest tests/ -q
"""

import json
import os
import pathlib
import sys

import pytest

# Make the Space root importable so `import evaluator` resolves.
ROOT = pathlib.Path(__file__).resolve().parents[1]
sys.path.insert(0, str(ROOT))

import evaluator

ANN = ROOT / "annotations_private.json"
ORACLE = ROOT / "tests" / "fixtures" / "oracle_submission.json"
ALL_A = ROOT / "tests" / "fixtures" / "all_a_submission.json"


# The private answer key is never committed, so skip every test without it.
pytestmark = pytest.mark.skipif(
    not ANN.exists(),
    reason=f"{ANN} not present (copy from ../EgoMemReason-EvalAI.archived/)",
)


def test_oracle_scores_100():
    metrics = evaluator.score_submission(str(ORACLE), str(ANN))
    for k, v in metrics.items():
        assert v == 100.0, f"{k} should be 100.0, got {v}"


def test_all_a_scores_around_14():
    # All-A's exact score depends on the A-letter frequency in the dataset;
    # we measured 14.2% during the EvalAI port. Allow a wide band.
    metrics = evaluator.score_submission(str(ALL_A), str(ANN))
    assert 10.0 <= metrics["Overall"] <= 20.0, metrics


def test_broken_submission_raises(tmp_path):
    broken = tmp_path / "broken.json"
    # Bogus letter + only 1 row. The context manager closes (and flushes)
    # the file before the evaluator reads it.
    with broken.open("w") as f:
        json.dump([{"example_id": 1, "predicted_answer": "ZZ"}], f)
    with pytest.raises(ValueError) as exc:
        evaluator.score_submission(str(broken), str(ANN))
    # Both problems must be reported in the same error message.
    assert "must be one of" in str(exc.value)
    assert "missing" in str(exc.value)
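The last test expects a single `ValueError` whose message mentions both the invalid letter ("must be one of") and the absent rows ("missing"). That suggests a validator that collects every problem before raising. The sketch below shows one way to get that behavior. It is not the repository's `evaluator.py`: the function `validate_submission`, the letter set A through J (inferred from the letters seen in the fixture), and the message wording are all assumptions.

```python
# Assumption: answers are single letters A-J, as seen in the fixture.
VALID_LETTERS = tuple("ABCDEFGHIJ")


def validate_submission(rows, expected_ids, valid_letters=VALID_LETTERS):
    """Collect every problem, then raise one ValueError listing them all."""
    errors = []
    seen = set()
    for row in rows:
        ex_id = row.get("example_id")
        letter = row.get("predicted_answer")
        if letter not in valid_letters:
            errors.append(
                f"example_id {ex_id}: predicted_answer {letter!r} "
                f"must be one of {list(valid_letters)}"
            )
        seen.add(ex_id)

    # Report ids that the answer key expects but the submission lacks.
    missing = sorted(set(expected_ids) - seen)
    if missing:
        errors.append(
            f"missing {len(missing)} example_id(s), e.g. {missing[:5]}"
        )

    if errors:
        # One exception carrying all messages, so a submitter can fix
        # everything in a single pass instead of one error per upload.
        raise ValueError("; ".join(errors))
```

Accumulating errors instead of raising on the first one is what lets a test assert on both substrings from a single failed call.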