Ziyang Wang committed on
Commit
1bf5b23
·
0 Parent(s):

initial Space

.gitignore ADDED
@@ -0,0 +1,18 @@
+ # Private answer key — pulled at boot from the private dataset, never committed
+ annotations_private.json
+
+ # Paper-baseline seed records — destined for the public DATASET repo, not the Space
+ seeds/
+
+ # Local dev
+ .venv/
+ __pycache__/
+ *.pyc
+
+ # Local snapshot caches
+ .cache/
+
+ # Editor / OS
+ .DS_Store
+ .idea/
+ .vscode/
README.md ADDED
@@ -0,0 +1,76 @@
+ ---
+ title: EgoMemReason Leaderboard
+ emoji: 🧠
+ colorFrom: indigo
+ colorTo: purple
+ sdk: gradio
+ sdk_version: 4.44.0
+ app_file: app.py
+ pinned: false
+ license: cc-by-nc-4.0
+ hf_oauth: true
+ hf_oauth_scopes:
+   - openid
+   - profile
+ ---
+
+ # EgoMemReason — Leaderboard Space
+
+ Live leaderboard for the **EgoMemReason** benchmark: 500 multiple-choice questions over week-long egocentric video, evaluating entity / event / behavior memory.
+
+ - 📄 Paper: *coming soon*
+ - 💻 Reference eval scripts: <https://github.com/Ted412/EgoMemReason>
+ - 📦 Public questions: <https://huggingface.co/datasets/Ted412/EgoMemReason>
+ - 🎬 Source frames: <https://egolife-ai.github.io/>
+
+ ## Operator notes
+
+ This Space lives at `Ted412/EgoMemReason` and writes one JSON record per submission to the public dataset `Ted412/EgoMemReason-Leaderboard`. The held-out answer key lives in a separate **private** dataset `Ted412/EgoMemReason-Private` and is pulled at boot via `snapshot_download(token=HF_TOKEN)`.
+
+ ### Required Space secret
+
+ | Name | Value | Scope |
+ |---|---|---|
+ | `HF_TOKEN` | Fine-grained HF token | Write on `Ted412/EgoMemReason-Leaderboard` + Read on `Ted412/EgoMemReason-Private` |
+
+ Create at <https://huggingface.co/settings/tokens> → fine-grained → grant only those two repos.
+
+ ### Local development
+
+ ```bash
+ python -m venv .venv && source .venv/bin/activate
+ pip install -r requirements.txt
+
+ # Copy the private answer key into cwd (skips the snapshot_download path).
+ cp ../EgoMemReason-EvalAI.archived/annotations/annotations_private.json .
+
+ # Run, optionally faking a user.
+ DEBUG_USER=alice python app.py
+ # → http://127.0.0.1:7860
+ ```
+
+ Tests:
+
+ ```bash
+ python -m pytest tests/ -q
+ ```
+
+ ### Architecture
+
+ ```
+ EgoMemReason-Space (this Space, public)
+ ├── app.py                     Gradio UI (Leaderboard / Submit / Manage / About)
+ ├── evaluator.py               pure scoring — port of the old EvalAI main.py
+ ├── ledger.py                  HF I/O: pulls private annotations at boot; writes
+ │                              one JSON record per submission to the public dataset
+ ├── auth.py                    resolves the HF username from gr.OAuthProfile
+ └── annotations_private.json   pulled at boot from the private dataset
+
+ Ted412/EgoMemReason-Private (HF dataset, private)
+ └── annotations_private.json   500 Qs WITH correct_answer
+
+ Ted412/EgoMemReason-Leaderboard (HF dataset, public)
+ └── submissions/
+     └── <uuid>.json            one immutable record per submission
+                                (only is_selected flips on a re-write)
+ ```
SUBMISSION_FORMAT.md ADDED
@@ -0,0 +1,77 @@
+ # Submission Format
+
+ A submission is a single JSON file (`.json`) containing a top-level array of 500 prediction objects — one per question.
+
+ ## Schema
+
+ ```json
+ [
+   {"example_id": 1, "predicted_answer": "A"},
+   {"example_id": 2, "predicted_answer": "C"},
+   {"example_id": 500, "predicted_answer": "B"}
+ ]
+ ```
+
+ **Required keys (per object):**
+ - `example_id` — integer in `[1, 500]`, matching `example_id` in `annotations_public.json`.
+ - `predicted_answer` — single uppercase letter that appears in that question's `options` dict.
+
+ **Important:** questions have **between 4 and 10 options**. The valid answer letters for any given question are exactly the keys of its `options` dict. Most are A-F; Event Ordering questions can extend to A-J. A letter outside the question's option set is rejected.
+
+ **Optional keys (ignored, but useful for your own debugging):** `raw_response`, `confidence`, `tokens`, etc.
+
+ ## Rules
+
+ 1. Top-level must be a JSON array (not an object).
+ 2. The submission must cover **exactly 500 unique `example_id`s**, one per question.
+ 3. Duplicate `example_id`s are rejected.
+ 4. Letters must be uppercase (whitespace is trimmed).
+ 5. File extension must be `.json`.
+
+ ## Converting from existing eval-script output
+
+ The reference inference scripts in the [EgoMemReason GitHub repo](https://github.com/Ted412/EgoMemReason) write a list of records with a `pred` field. One-liner to convert:
+
+ ```python
+ import json
+ src = json.load(open("results_my_model.json"))
+ sub = [{"example_id": r["example_id"], "predicted_answer": r["pred"]} for r in src]
+ json.dump(sub, open("submission.json", "w"))
+ ```
+
+ ## How submissions are scored
+
+ Accuracy (%) for each of the six `query_type` splits:
+
+ - Cumulative State Tracking (100 Qs)
+ - Temporal Counting (100 Qs)
+ - Event Ordering (100 Qs)
+ - Event Linking (100 Qs)
+ - Spatial Preference (50 Qs)
+ - Activity Pattern (50 Qs)
+
+ plus **Overall** accuracy on all 500. All seven values appear on the leaderboard; ranking is by Overall descending.
+
+ ## Submission limits
+
+ - **5 submissions per HF user per 24-hour window.**
+ - The 24-hour window is rolling, not midnight-aligned.
+
+ ## Selected submission
+
+ Submit as many times as you like under the cap. In the **Manage my submissions** tab you can mark **one** of your past submissions as your *selected* entry. The default leaderboard view shows only each team's selected entry; the "Show all submissions" toggle reveals all.
+
+ ## Required metadata fields
+
+ When you submit you must fill in:
+
+ | Field | Required | Notes |
+ |---|---|---|
+ | `team_name` | yes | Team or affiliation |
+ | `method_name` | yes | Short title displayed on the leaderboard |
+ | `uses_external_data` | yes (yes/no) | Did you train / finetune on anything beyond EgoLife? |
+ | `uses_video_frames` | yes | one of `frames-only` · `video-only` · `frames+audio` · `captions-only` · `other` |
+ | `model_size` | no | e.g. `8B`, `32B`, `API` |
+ | `method_description` | no | Free-form description |
+ | `project_url` | no | Project page |
+ | `publication_url` | no | arXiv / OpenReview link |
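Note on SUBMISSION_FORMAT.md: the rules in this file can be checked locally before uploading. The following is a minimal sketch, not the Space's own validator. It assumes the caller has already built a mapping from each `example_id` to its set of valid option letters (for instance from the `options` dicts in `annotations_public.json`); the function name `check_submission` and the toy two-question key are hypothetical.

```python
def check_submission(preds, questions):
    """Return a list of human-readable problems; an empty list means no issue was found.

    `questions` maps example_id -> set of valid option letters. It is assumed
    to be built by the caller from the public annotations file.
    """
    if not isinstance(preds, list):
        return ["top level must be a JSON array"]
    problems = []
    seen = set()
    for i, item in enumerate(preds):
        if not isinstance(item, dict):
            problems.append(f"item {i}: not an object")
            continue
        eid = item.get("example_id")
        ans = item.get("predicted_answer")
        if not isinstance(eid, int) or eid not in questions:
            problems.append(f"item {i}: unknown example_id {eid!r}")
            continue
        if eid in seen:
            problems.append(f"duplicate example_id {eid}")
        seen.add(eid)
        # Whitespace is trimmed; the letter must be one of this question's options.
        if not isinstance(ans, str) or ans.strip() not in questions[eid]:
            problems.append(f"example_id {eid}: invalid answer {ans!r}")
    missing = set(questions) - seen
    if missing:
        problems.append(f"missing {len(missing)} example_id(s)")
    return problems


# Toy two-question key standing in for the real 500-question file.
toy_questions = {1: {"A", "B", "C", "D"}, 2: {"A", "B", "C", "D", "E", "F"}}
good = [{"example_id": 1, "predicted_answer": "A"},
        {"example_id": 2, "predicted_answer": "F"}]
bad = [{"example_id": 1, "predicted_answer": "E"},
       {"example_id": 1, "predicted_answer": "A"}]

print(check_submission(good, toy_questions))  # []
for problem in check_submission(bad, toy_questions):
    print(problem)
```

The `bad` list triggers three separate problems (an out-of-range letter, a duplicate id, and a missing id), which mirrors the per-example error reporting described for `evaluator.py`.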
app.py ADDED
@@ -0,0 +1,331 @@
+ """EgoMemReason leaderboard — Gradio Space app.
+
+ Tabs:
+   - Leaderboard   public, auto-refresh, toggle selected-only / show-all
+   - Submit        HF login required; JSON upload + metadata form
+   - Manage        toggle is_selected on your own past submissions
+   - About         paper description + citation
+ """
+
+ import os
+
+ import gradio as gr
+ import pandas as pd
+
+ import auth
+ import evaluator
+ import ledger
+
+
+ # ---------------------------------------------------------------------------
+ # Boot: pull annotations_private.json from the private dataset repo.
+ # ---------------------------------------------------------------------------
+
+ try:
+     ledger.ensure_private_annotations()
+ except RuntimeError as e:
+     # In local dev without HF_TOKEN, allow the app to come up with a clear banner.
+     BOOT_ERROR = str(e)
+ else:
+     BOOT_ERROR = None
+
+
+ LEADERBOARD_COLUMNS = [
+     "Rank",
+     "Method",
+     "Team",
+     "Overall",
+     "Cumul. State",
+     "Temp. Count",
+     "Event Order",
+     "Event Link",
+     "Spatial Pref.",
+     "Activity Pat.",
+     "Model size",
+     "Ext. data",
+     "Modality",
+     "Date (UTC)",
+     "Links",
+ ]
+
+
+ def _row_from_submission(sub, rank):
+     m = sub["metrics"]
+     links = []
+     if sub.get("project_url"):
+         links.append(f"[project]({sub['project_url']})")
+     if sub.get("publication_url"):
+         links.append(f"[paper]({sub['publication_url']})")
+     return [
+         rank,
+         sub["method_name"],
+         sub["team_name"],
+         m["Overall"],
+         m["Cumulative State Tracking"],
+         m["Temporal Counting"],
+         m["Event Ordering"],
+         m["Event Linking"],
+         m["Spatial Preference"],
+         m["Activity Pattern"],
+         sub.get("model_size") or "—",
+         "yes" if sub.get("uses_external_data") else "no",
+         sub.get("uses_video_frames") or "—",
+         sub.get("submitted_at_utc", "")[:10],
+         " · ".join(links) or "—",
+     ]
+
+
+ def load_leaderboard(show_all):
+     subs = ledger.list_submissions()
+     if not show_all:
+         subs = [s for s in subs if s.get("is_selected")]
+     subs = sorted(subs, key=lambda s: s["metrics"]["Overall"], reverse=True)
+     rows = [_row_from_submission(s, i + 1) for i, s in enumerate(subs)]
+     return pd.DataFrame(rows, columns=LEADERBOARD_COLUMNS)
+
+
+ # ---------------------------------------------------------------------------
+ # Submit
+ # ---------------------------------------------------------------------------
+
+ def handle_submission(file, team_name, method_name, model_size, uses_external,
+                       uses_frames, method_description, project_url,
+                       publication_url, profile: gr.OAuthProfile | None):
+     user = auth.resolve_user(profile)
+     if user is None:
+         return "**Error:** sign in with Hugging Face first (button at the top of the page)."
+     if not team_name or not method_name:
+         return "**Error:** `team_name` and `method_name` are required."
+     if uses_external not in ("yes", "no"):
+         return "**Error:** answer `Uses external data?` (yes/no)."
+     if not uses_frames:
+         return "**Error:** pick a video input modality."
+     if file is None:
+         return "**Error:** upload a `.json` submission file."
+
+     recent = ledger.count_recent(user, hours=24)
+     if recent >= 5:
+         return (f"**Rate limit:** you have **{recent}** submissions in the last 24 h "
+                 "(max 5). Try again later.")
+
+     try:
+         metrics = evaluator.score_submission(file.name)
+     except ValueError as e:
+         return f"**Validation error:**\n```\n{e}\n```"
+     except Exception as e:
+         return f"**Internal error scoring submission:** `{type(e).__name__}: {e}`"
+
+     try:
+         sid = ledger.append_submission(
+             hf_user_id=user,
+             team_name=team_name,
+             method_name=method_name,
+             model_size=model_size,
+             uses_external_data=(uses_external == "yes"),
+             uses_video_frames=uses_frames,
+             method_description=method_description,
+             project_url=project_url,
+             publication_url=publication_url,
+             metrics=metrics,
+         )
+     except Exception as e:
+         return (f"**Scored, but failed to persist to ledger:** `{type(e).__name__}: {e}`\n\n"
+                 f"Your metrics were:\n```\n{metrics}\n```")
+
+     rows = "\n".join(f"| {k} | **{v:.2f}** |" for k, v in metrics.items())
+     return (
+         f"✅ **Submission logged.** `submission_id = {sid}`\n\n"
+         f"| Metric | Score (%) |\n|---|---|\n{rows}\n\n"
+         "Go to **Manage my submissions** to mark this as your official entry."
+     )
+
+
+ # ---------------------------------------------------------------------------
+ # Manage
+ # ---------------------------------------------------------------------------
+
+ MANAGE_COLUMNS = ["submission_id", "method_name", "Overall", "is_selected", "submitted_at_utc"]
+
+
+ def load_my_submissions(profile: gr.OAuthProfile | None):
+     user = auth.resolve_user(profile)
+     if user is None:
+         return pd.DataFrame(columns=MANAGE_COLUMNS)
+     rows = []
+     for sub in ledger.list_submissions():
+         if sub.get("hf_user_id") != user:
+             continue
+         rows.append([
+             sub["submission_id"],
+             sub["method_name"],
+             sub["metrics"]["Overall"],
+             sub.get("is_selected", False),
+             sub.get("submitted_at_utc", ""),
+         ])
+     rows.sort(key=lambda r: r[4], reverse=True)
+     return pd.DataFrame(rows, columns=MANAGE_COLUMNS)
+
+
+ def set_my_selected(submission_id, profile: gr.OAuthProfile | None):
+     user = auth.resolve_user(profile)
+     if user is None:
+         return "**Error:** sign in first.", load_my_submissions(profile)
+     if not submission_id or not submission_id.strip():
+         return "**Error:** paste a submission_id.", load_my_submissions(profile)
+     try:
+         ledger.set_selected(submission_id.strip(), user)
+     except (ValueError, PermissionError) as e:
+         return f"**Error:** {e}", load_my_submissions(profile)
+     return f"✅ `{submission_id.strip()}` is now your selected entry.", load_my_submissions(profile)
+
+
+ # ---------------------------------------------------------------------------
+ # About
+ # ---------------------------------------------------------------------------
+
+ ABOUT_MD = """\
+ ## EgoMemReason
+
+ **A Memory-driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding.**
+
+ EgoMemReason is a 500-question multiple-choice benchmark over week-long egocentric
+ videos (built on [EgoLife](https://egolife-ai.github.io/)). Models must answer
+ questions whose evidence is sparsely distributed across hours or days, exercising
+ three memory types:
+
+ - **Entity memory** — Cumulative State Tracking, Temporal Counting
+ - **Event memory** — Event Ordering, Event Linking
+ - **Behavior memory** — Spatial Preference Inference, Activity Pattern Inference
+
+ 500 Qs · avg. 5.1 evidence segments / Q · avg. 25.9 h memory backtracking. The
+ strongest model in the paper reaches **39.6% Overall**.
+
+ ### Resources
+
+ - 📄 Paper: *coming soon*
+ - 💻 Code & reference eval scripts: <https://github.com/Ted412/EgoMemReason>
+ - 📦 Public questions (no answers): <https://huggingface.co/datasets/Ted412/EgoMemReason>
+ - 🎬 EgoLife video frames: <https://egolife-ai.github.io/>
+
+ ### Submission
+
+ Upload a JSON file with 500 entries:
+
+ ```json
+ [
+   {"example_id": 1, "predicted_answer": "A"},
+   ...
+ ]
+ ```
+
+ Questions have 4-10 options (letters A-J) — `predicted_answer` must be a letter
+ that appears in that question's `options` dict. See
+ [SUBMISSION_FORMAT.md](https://github.com/Ted412/EgoMemReason/blob/main/SUBMISSION_FORMAT.md)
+ for the full spec.
+
+ ### License
+
+ - **Annotations** (this Space + the public dataset): CC BY-NC 4.0.
+ - **Video frames**: governed by the [EgoLife data license](https://egolife-ai.github.io/) — you must accept their terms separately.
+
+ ### Citation
+
+ ```bibtex
+ @article{wang2026egomemreason,
+   title = {EgoMemReason: A Memory-driven Reasoning Benchmark for
+            Long-Horizon Egocentric Video Understanding},
+   author = {Wang, Ziyang and Zhang, Yue and Yu, Shoubin and Zhang, Ce and
+             Zhao, Zengqi and Yoon, Jaehong and Lee, Hyunji and
+             Bertasius, Gedas and Bansal, Mohit},
+   year = {2026},
+   journal = {arXiv preprint}
+ }
+ ```
+ """
+
+
+ # ---------------------------------------------------------------------------
+ # UI
+ # ---------------------------------------------------------------------------
+
+ with gr.Blocks(title="EgoMemReason Leaderboard", theme=gr.themes.Soft()) as demo:
+     gr.Markdown("# 🧠 EgoMemReason — Leaderboard")
+     gr.Markdown(
+         "*Memory-driven reasoning over week-long egocentric video. 500 MCQs · "
+         "Entity / Event / Behavior memory.*"
+     )
+     if BOOT_ERROR:
+         gr.Markdown(f"⚠️ **Boot warning:** {BOOT_ERROR}\n\nSubmissions are disabled.")
+
+     login_btn = gr.LoginButton()
+
+     with gr.Tab("Leaderboard"):
+         with gr.Row():
+             show_all = gr.Checkbox(
+                 value=False,
+                 label="Show all submissions (not just each team's selected entry)",
+             )
+             refresh_btn = gr.Button("Refresh", size="sm")
+         leaderboard_df = gr.Dataframe(
+             value=load_leaderboard(False),
+             headers=LEADERBOARD_COLUMNS,
+             interactive=False,
+             wrap=True,
+         )
+         show_all.change(load_leaderboard, inputs=[show_all], outputs=[leaderboard_df])
+         refresh_btn.click(load_leaderboard, inputs=[show_all], outputs=[leaderboard_df])
+
+     with gr.Tab("Submit"):
+         gr.Markdown("**Sign in with Hugging Face (button above) before submitting.** "
+                     "Limit: 5 submissions per HF user per 24 h.")
+         with gr.Row():
+             team_name = gr.Textbox(label="Team name *", max_lines=1)
+             method_name = gr.Textbox(label="Method name *", max_lines=1)
+         with gr.Row():
+             model_size = gr.Textbox(label="Model size (e.g. 8B, 32B, API)", max_lines=1)
+             uses_external = gr.Radio(
+                 ["yes", "no"], label="Uses training data beyond EgoLife? *",
+             )
+         uses_frames = gr.Radio(
+             ["frames-only", "video-only", "frames+audio", "captions-only", "other"],
+             label="Video input modality *",
+         )
+         method_description = gr.Textbox(label="Method description", lines=3)
+         with gr.Row():
+             project_url = gr.Textbox(label="Project URL", max_lines=1)
+             publication_url = gr.Textbox(label="Publication URL (arXiv/OpenReview)", max_lines=1)
+         submission_file = gr.File(label="submission.json", file_types=[".json"])
+         submit_btn = gr.Button("Score & log", variant="primary")
+         result_md = gr.Markdown()
+         submit_btn.click(
+             handle_submission,
+             inputs=[submission_file, team_name, method_name, model_size,
+                     uses_external, uses_frames, method_description,
+                     project_url, publication_url],
+             outputs=[result_md],
+         )
+
+     with gr.Tab("Manage my submissions"):
+         gr.Markdown(
+             "Toggle which of your past submissions is the official **selected** entry. "
+             "Only your own submissions appear here. "
+             "Only one entry per HF user can be selected at a time."
+         )
+         my_subs = gr.Dataframe(
+             value=pd.DataFrame(columns=MANAGE_COLUMNS),
+             headers=MANAGE_COLUMNS,
+             interactive=False,
+             wrap=True,
+         )
+         selected_id = gr.Textbox(label="submission_id to mark as selected", max_lines=1)
+         select_btn = gr.Button("Mark as my selected entry")
+         manage_msg = gr.Markdown()
+         demo.load(load_my_submissions, outputs=[my_subs])
+         select_btn.click(set_my_selected, inputs=[selected_id], outputs=[manage_msg, my_subs])
+
+     with gr.Tab("About"):
+         gr.Markdown(ABOUT_MD)
+
+
+ if __name__ == "__main__":
+     demo.queue().launch()
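Note on app.py: the filtering and ranking done by `load_leaderboard` can be reproduced without Gradio or pandas. The sketch below assumes records shaped like the ones `ledger.append_submission` writes; `rank_rows` and the toy records are hypothetical and exist only for illustration.

```python
def rank_rows(subs, show_all=False):
    """Keep selected entries (unless show_all) and rank by Overall, descending."""
    if not show_all:
        subs = [s for s in subs if s.get("is_selected")]
    ordered = sorted(subs, key=lambda s: s["metrics"]["Overall"], reverse=True)
    return [(rank, s["method_name"], s["metrics"]["Overall"])
            for rank, s in enumerate(ordered, start=1)]


# Toy records; only the fields used by the ranking are included.
toy = [
    {"method_name": "baseline", "is_selected": True, "metrics": {"Overall": 31.2}},
    {"method_name": "v2", "is_selected": False, "metrics": {"Overall": 39.6}},
    {"method_name": "v1", "is_selected": True, "metrics": {"Overall": 35.0}},
]

print(rank_rows(toy))                 # [(1, 'v1', 35.0), (2, 'baseline', 31.2)]
print(rank_rows(toy, show_all=True))  # [(1, 'v2', 39.6), (2, 'v1', 35.0), (3, 'baseline', 31.2)]
```

Because ranks come from the position in the sorted list, two entries with the same Overall score get consecutive ranks rather than a shared one; Python's sort is stable, so tied entries keep the order in which they were listed.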
auth.py ADDED
@@ -0,0 +1,28 @@
+ """Resolve the current HF user from Gradio's OAuthProfile.
+
+ `gr.LoginButton()` populates `gr.OAuthProfile` for every callback that declares
+ it as a parameter. We add a `DEBUG_USER` escape hatch for local development,
+ gated on the SPACE_ID env var so it can never fire in production.
+ """
+
+ import os
+
+
+ def is_production():
+     """True when running inside the HF Space sandbox (vs local dev)."""
+     return os.environ.get("SPACE_ID") is not None
+
+
+ def resolve_user(profile):
+     """Returns the HF username of the requesting user, or None if not logged in.
+
+     `profile` is the `gr.OAuthProfile | None` Gradio passes to callbacks that
+     declare it. In local dev, set DEBUG_USER=alice to pretend to be `alice`.
+     """
+     if not is_production():
+         debug = os.environ.get("DEBUG_USER")
+         if debug:
+             return debug
+     if profile is None:
+         return None
+     return profile.username
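Note on auth.py: the precedence between `SPACE_ID`, `DEBUG_USER`, and the OAuth profile is easier to see in a pure function. This sketch restates the same decision with the environment passed in as a plain dict; `resolve_user_from` is a hypothetical name, and the `SPACE_ID` value used below is only a placeholder.

```python
def resolve_user_from(env, username):
    """Pure variant of resolve_user.

    `env` is a dict standing in for os.environ; `username` is the OAuth
    username, or None when nobody is signed in.
    """
    in_production = env.get("SPACE_ID") is not None
    # DEBUG_USER is honoured only outside the Space, and then it wins
    # over any real login.
    if not in_production and env.get("DEBUG_USER"):
        return env["DEBUG_USER"]
    return username


# Local dev with the escape hatch set: the fake user is returned.
print(resolve_user_from({"DEBUG_USER": "alice"}, None))  # alice
# Inside a Space, DEBUG_USER is ignored and an anonymous visitor stays anonymous.
print(resolve_user_from({"SPACE_ID": "placeholder/space", "DEBUG_USER": "alice"}, None))  # None
# Inside a Space with a real login.
print(resolve_user_from({"SPACE_ID": "placeholder/space"}, "bob"))  # bob
```

Taking the environment as an argument makes the three cases straightforward to unit-test without patching `os.environ`.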
evaluator.py ADDED
@@ -0,0 +1,121 @@
+ """Scoring logic for EgoMemReason.
+
+ Pure stdlib — no Gradio, no HF imports. Returns a flat metrics dict.
+ Raises ValueError with per-example messages on validation failure.
+ """
+
+ import json
+ from collections import defaultdict
+
+ # Order matches the leaderboard column order.
+ QUERY_TYPES = [
+     "Cumulative State Tracking",
+     "Temporal Counting",
+     "Event Ordering",
+     "Event Linking",
+     "Spatial Preference",
+     "Activity Pattern",
+ ]
+
+
+ def _load(path):
+     with open(path) as f:
+         return json.load(f)
+
+
+ def _build_gt(ann):
+     """Returns {example_id: (correct_letter, query_type, valid_option_letters)}.
+
+     Questions have 4-10 options (letters up to J), so the valid answer set
+     is per-question, not a fixed A-D.
+     """
+     samples = ann["samples"] if isinstance(ann, dict) else ann
+     gt = {}
+     for s in samples:
+         eid = s["example_id"]
+         opts = {str(k).strip().upper() for k in s["options"].keys()}
+         gt[eid] = (s["correct_answer"].strip().upper(), s["query_type"], opts)
+     return gt
+
+
+ def _validate(preds, gt):
+     if not isinstance(preds, list):
+         raise ValueError("Submission must be a JSON list of objects.")
+
+     errors = []
+     seen = set()
+     for i, item in enumerate(preds):
+         if not isinstance(item, dict):
+             errors.append(f"item {i}: not a JSON object")
+             continue
+         eid = item.get("example_id")
+         ans = item.get("predicted_answer")
+         if not isinstance(eid, int):
+             errors.append(f"item {i}: 'example_id' must be an int, got {type(eid).__name__}")
+             continue
+         if eid in seen:
+             errors.append(f"duplicate example_id: {eid}")
+         seen.add(eid)
+         if eid not in gt:
+             errors.append(f"unknown example_id: {eid}")
+             continue
+         valid = gt[eid][2]
+         if not isinstance(ans, str) or ans.strip().upper() not in valid:
+             errors.append(
+                 f"example_id {eid}: 'predicted_answer' must be one of "
+                 f"{'/'.join(sorted(valid))}, got {ans!r}"
+             )
+
+     missing = set(gt) - seen
+     if missing:
+         errors.append(
+             f"missing {len(missing)} example_ids (e.g. {sorted(missing)[:5]}); "
+             f"submission must cover all {len(gt)} questions"
+         )
+
+     if errors:
+         msg = "Submission validation failed:\n - " + "\n - ".join(errors[:20])
+         if len(errors) > 20:
+             msg += f"\n - ... and {len(errors) - 20} more error(s)"
+         raise ValueError(msg)
+
+
+ def _score(preds, gt):
+     correct_total = 0
+     count_by_qt = defaultdict(int)
+     correct_by_qt = defaultdict(int)
+
+     for _eid, (_gt_ans, qt, _opts) in gt.items():
+         count_by_qt[qt] += 1
+
+     for item in preds:
+         eid = item["example_id"]
+         ans = item["predicted_answer"].strip().upper()
+         gt_ans, qt, _opts = gt[eid]
+         if ans == gt_ans:
+             correct_total += 1
+             correct_by_qt[qt] += 1
+
+     metrics = {}
+     for qt in QUERY_TYPES:
+         n = count_by_qt.get(qt, 0)
+         metrics[qt] = round(100.0 * correct_by_qt[qt] / n, 2) if n else 0.0
+     metrics["Overall"] = round(100.0 * correct_total / len(gt), 2)
+     return metrics
+
+
+ def score_submission(submission_path, annotation_path="annotations_private.json"):
+     """Returns {"Cumulative State Tracking": ..., ..., "Overall": ...} as percentages."""
+     gt = _build_gt(_load(annotation_path))
+     preds = _load(submission_path)
+     _validate(preds, gt)
+     return _score(preds, gt)
+
+
+ if __name__ == "__main__":
+     import argparse, pprint
+     p = argparse.ArgumentParser()
+     p.add_argument("--annotation", default="annotations_private.json")
+     p.add_argument("--submission", required=True)
+     args = p.parse_args()
+     pprint.pp(score_submission(args.submission, args.annotation))
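Note on evaluator.py: `_score` computes Overall as correct answers over all questions, not as the mean of the per-type accuracies. A small worked example makes the difference visible. This is a simplified sketch with a made-up four-question key; `accuracy_by_type` is a hypothetical helper, not a function in the commit.

```python
from collections import Counter


def accuracy_by_type(gt, preds):
    """gt: {example_id: (correct_letter, query_type)}; preds: {example_id: letter}."""
    total = Counter(qt for _, qt in gt.values())
    correct = Counter(qt for eid, (ans, qt) in gt.items() if preds.get(eid) == ans)
    metrics = {qt: round(100.0 * correct[qt] / n, 2) for qt, n in total.items()}
    # Overall pools every question, so larger splits weigh more.
    metrics["Overall"] = round(100.0 * sum(correct.values()) / len(gt), 2)
    return metrics


# Made-up key: three Event Ordering questions and one Temporal Counting question.
toy_gt = {
    1: ("A", "Event Ordering"),
    2: ("C", "Event Ordering"),
    3: ("B", "Event Ordering"),
    4: ("D", "Temporal Counting"),
}
toy_preds = {1: "A", 2: "C", 3: "A", 4: "D"}

print(accuracy_by_type(toy_gt, toy_preds))
# {'Event Ordering': 66.67, 'Temporal Counting': 100.0, 'Overall': 75.0}
```

Here the per-type scores average to 83.33, while Overall is 3 of 4, or 75.0. On the real benchmark the same pooling means each 100-question split contributes twice as much to Overall as each 50-question split.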
ledger.py ADDED
@@ -0,0 +1,180 @@
+ """HF I/O for the EgoMemReason leaderboard.
+
+ Two repos:
+   - PUBLIC_DATASET   Ted412/EgoMemReason-Leaderboard (one JSON per submission)
+   - PRIVATE_DATASET  Ted412/EgoMemReason-Private     (annotations_private.json)
+
+ Boot path: ensure_private_annotations() downloads the private annotations file
+ on app start so evaluator.score_submission() can read it from cwd.
+ """
+
+ import functools
+ import io
+ import json
+ import os
+ import time
+ import uuid
+ from datetime import datetime, timedelta, timezone
+
+ from huggingface_hub import HfApi, snapshot_download
+
+ # Hard-coded for this challenge. Override via env vars in dev.
+ PUBLIC_DATASET = os.environ.get("EGOMEM_PUBLIC_DATASET", "Ted412/EgoMemReason-Leaderboard")
+ PRIVATE_DATASET = os.environ.get("EGOMEM_PRIVATE_DATASET", "Ted412/EgoMemReason-Private")
+ ANNOTATIONS_FILENAME = "annotations_private.json"
+
+ HF_TOKEN = os.environ.get("HF_TOKEN")  # write scope on PUBLIC_DATASET; read scope on PRIVATE_DATASET
+
+
+ def ensure_private_annotations(dest_path=ANNOTATIONS_FILENAME):
+     """Download annotations_private.json from the private dataset on app boot.
+
+     Only called once per Space restart. If the file is already present (local
+     dev case where you've copied it manually), do nothing.
+     """
+     if os.path.exists(dest_path):
+         return dest_path
+     if not HF_TOKEN:
+         raise RuntimeError(
+             "HF_TOKEN env var not set; cannot pull private annotations from "
+             f"{PRIVATE_DATASET}. Either set HF_TOKEN or place {dest_path} in cwd."
+         )
+     local_dir = snapshot_download(
+         repo_id=PRIVATE_DATASET,
+         repo_type="dataset",
+         token=HF_TOKEN,
+         allow_patterns=[ANNOTATIONS_FILENAME],
+     )
+     src = os.path.join(local_dir, ANNOTATIONS_FILENAME)
+     if not os.path.exists(src):
+         raise RuntimeError(
+             f"{ANNOTATIONS_FILENAME} not found in {PRIVATE_DATASET}. "
+             "Upload it via the HF Files UI of the private dataset repo."
+         )
+     # Symlink rather than copy — snapshot_download already cached it.
+     if not os.path.exists(dest_path):
+         os.symlink(src, dest_path)
+     return dest_path
+
+
+ def _now_iso():
+     return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
+
+
+ @functools.lru_cache(maxsize=1)
+ def _cached_submissions(cache_bucket):
+     """Pulls all submission JSON files. Bucket is int(time/60) so cache rolls every minute."""
+     del cache_bucket  # only here to invalidate the cache
+     try:
+         local_dir = snapshot_download(
+             repo_id=PUBLIC_DATASET,
+             repo_type="dataset",
+             token=HF_TOKEN,  # not strictly required for public read but avoids rate-limiting
+             allow_patterns=["submissions/*.json"],
+         )
+     except Exception:
+         return []
+     folder = os.path.join(local_dir, "submissions")
+     if not os.path.isdir(folder):
+         return []
+     out = []
+     for fn in os.listdir(folder):
+         if not fn.endswith(".json"):
+             continue
+         try:
+             with open(os.path.join(folder, fn)) as f:
+                 out.append(json.load(f))
+         except Exception:
+             continue
+     return out
+
+
+ def list_submissions():
+     return _cached_submissions(int(time.time() / 60))
+
+
+ def _invalidate_cache():
+     _cached_submissions.cache_clear()
+
+
+ def count_recent(hf_user_id, hours=24):
+     cutoff = datetime.now(timezone.utc) - timedelta(hours=hours)
+     n = 0
+     for sub in list_submissions():
+         if sub.get("hf_user_id") != hf_user_id:
+             continue
+         ts = sub.get("submitted_at_utc", "")
+         try:
+             t = datetime.fromisoformat(ts.rstrip("Z")).replace(tzinfo=timezone.utc)
+         except ValueError:
+             continue
+         if t >= cutoff:
+             n += 1
+     return n
+
+
+ def _upload_record(record):
+     payload = json.dumps(record, indent=2).encode("utf-8")
+     HfApi().upload_file(
+         path_or_fileobj=io.BytesIO(payload),
+         path_in_repo=f"submissions/{record['submission_id']}.json",
+         repo_id=PUBLIC_DATASET,
+         repo_type="dataset",
+         token=HF_TOKEN,
+         commit_message=f"submission {record['submission_id'][:8]} from {record['hf_user_id']}",
+     )
+
+
+ def append_submission(*, hf_user_id, team_name, method_name, model_size,
+                       uses_external_data, uses_video_frames, method_description,
+                       project_url, publication_url, metrics):
+     if not HF_TOKEN:
+         raise RuntimeError("HF_TOKEN not set; cannot persist submission.")
+     sid = str(uuid.uuid4())
+     record = {
+         "submission_id": sid,
+         "submitted_at_utc": _now_iso(),
+         "hf_user_id": hf_user_id,
+         "team_name": team_name,
+         "method_name": method_name,
+         "model_size": model_size or "",
+         "uses_external_data": bool(uses_external_data),
+         "uses_video_frames": uses_video_frames,
+         "method_description": method_description or "",
+         "project_url": project_url or "",
+         "publication_url": publication_url or "",
+         "is_selected": False,
+         "metrics": metrics,
+     }
+     _upload_record(record)
+     _invalidate_cache()
+     return sid
+
+
+ def set_selected(submission_id, requesting_user):
+     """Mark `submission_id` as the requesting_user's selected entry.
+
+     Enforces one-selected-per-user. Raises PermissionError if the submission
+     does not belong to requesting_user.
+     """
+     target = None
+     for sub in list_submissions():
+         if sub["submission_id"] == submission_id:
+             target = sub
+             break
+     if target is None:
+         raise ValueError(f"submission_id not found: {submission_id}")
+     if target["hf_user_id"] != requesting_user:
+         raise PermissionError("You can only modify your own submissions.")
+
+     # Un-select any other submission this user previously selected.
+     for sub in list_submissions():
+         if (sub["hf_user_id"] == requesting_user
+                 and sub["is_selected"]
+                 and sub["submission_id"] != submission_id):
+             sub["is_selected"] = False
+             _upload_record(sub)
+
+     target["is_selected"] = True
+     _upload_record(target)
+     _invalidate_cache()
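Note on ledger.py: the rolling 24-hour limit enforced through `count_recent` can be exercised in isolation by passing the current time in explicitly. The sketch below is a simplified stand-in, assuming timestamps in the `%Y-%m-%dT%H:%M:%SZ` format produced by `_now_iso`; `count_in_window` and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def count_in_window(timestamps, now, hours=24):
    """Count UTC 'Z' timestamps that fall within the last `hours` before `now`."""
    cutoff = now - timedelta(hours=hours)
    n = 0
    for ts in timestamps:
        try:
            t = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        except ValueError:
            continue  # malformed timestamps are skipped, as in count_recent
        if t >= cutoff:
            n += 1
    return n


now = datetime(2026, 1, 2, 12, 0, 0, tzinfo=timezone.utc)
stamps = [
    "2026-01-01T11:59:59Z",  # one second outside the window
    "2026-01-01T12:00:00Z",  # exactly on the cutoff, so it counts
    "2026-01-02T09:30:00Z",  # inside the window
    "not-a-timestamp",       # ignored
]

print(count_in_window(stamps, now))  # 2
```

Two properties of the committed code are worth keeping in mind when reading this: the count is taken from the cached submission list, and `_cached_submissions` returns an empty list if the download raises, so in that case `count_recent` would report zero and the limit would not block the submission.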
requirements.txt ADDED
@@ -0,0 +1,3 @@
+ gradio==4.44.0
+ huggingface_hub==0.25.0
+ pandas==2.2.3
tests/fixtures/all_a_submission.json ADDED
@@ -0,0 +1 @@
+ [{"example_id": 1, "predicted_answer": "A"}, {"example_id": 2, "predicted_answer": "A"}, {"example_id": 3, "predicted_answer": "A"}, {"example_id": 4, "predicted_answer": "A"}, {"example_id": 5, "predicted_answer": "A"}, {"example_id": 6, "predicted_answer": "A"}, {"example_id": 7, "predicted_answer": "A"}, {"example_id": 8, "predicted_answer": "A"}, {"example_id": 9, "predicted_answer": "A"}, {"example_id": 10, "predicted_answer": "A"}, {"example_id": 11, "predicted_answer": "A"}, {"example_id": 12, "predicted_answer": "A"}, {"example_id": 13, "predicted_answer": "A"}, {"example_id": 14, "predicted_answer": "A"}, {"example_id": 15, "predicted_answer": "A"}, {"example_id": 16, "predicted_answer": "A"}, {"example_id": 17, "predicted_answer": "A"}, {"example_id": 18, "predicted_answer": "A"}, {"example_id": 19, "predicted_answer": "A"}, {"example_id": 20, "predicted_answer": "A"}, {"example_id": 21, "predicted_answer": "A"}, {"example_id": 22, "predicted_answer": "A"}, {"example_id": 23, "predicted_answer": "A"}, {"example_id": 24, "predicted_answer": "A"}, {"example_id": 25, "predicted_answer": "A"}, {"example_id": 26, "predicted_answer": "A"}, {"example_id": 27, "predicted_answer": "A"}, {"example_id": 28, "predicted_answer": "A"}, {"example_id": 29, "predicted_answer": "A"}, {"example_id": 30, "predicted_answer": "A"}, {"example_id": 31, "predicted_answer": "A"}, {"example_id": 32, "predicted_answer": "A"}, {"example_id": 33, "predicted_answer": "A"}, {"example_id": 34, "predicted_answer": "A"}, {"example_id": 35, "predicted_answer": "A"}, {"example_id": 36, "predicted_answer": "A"}, {"example_id": 37, "predicted_answer": "A"}, {"example_id": 38, "predicted_answer": "A"}, {"example_id": 39, "predicted_answer": "A"}, {"example_id": 40, "predicted_answer": "A"}, {"example_id": 41, "predicted_answer": "A"}, {"example_id": 42, "predicted_answer": "A"}, {"example_id": 43, "predicted_answer": "A"}, {"example_id": 44, "predicted_answer": "A"}, {"example_id": 45, 
"predicted_answer": "A"}, {"example_id": 46, "predicted_answer": "A"}, {"example_id": 47, "predicted_answer": "A"}, {"example_id": 48, "predicted_answer": "A"}, {"example_id": 49, "predicted_answer": "A"}, {"example_id": 50, "predicted_answer": "A"}, {"example_id": 51, "predicted_answer": "A"}, {"example_id": 52, "predicted_answer": "A"}, {"example_id": 53, "predicted_answer": "A"}, {"example_id": 54, "predicted_answer": "A"}, {"example_id": 55, "predicted_answer": "A"}, {"example_id": 56, "predicted_answer": "A"}, {"example_id": 57, "predicted_answer": "A"}, {"example_id": 58, "predicted_answer": "A"}, {"example_id": 59, "predicted_answer": "A"}, {"example_id": 60, "predicted_answer": "A"}, {"example_id": 61, "predicted_answer": "A"}, {"example_id": 62, "predicted_answer": "A"}, {"example_id": 63, "predicted_answer": "A"}, {"example_id": 64, "predicted_answer": "A"}, {"example_id": 65, "predicted_answer": "A"}, {"example_id": 66, "predicted_answer": "A"}, {"example_id": 67, "predicted_answer": "A"}, {"example_id": 68, "predicted_answer": "A"}, {"example_id": 69, "predicted_answer": "A"}, {"example_id": 70, "predicted_answer": "A"}, {"example_id": 71, "predicted_answer": "A"}, {"example_id": 72, "predicted_answer": "A"}, {"example_id": 73, "predicted_answer": "A"}, {"example_id": 74, "predicted_answer": "A"}, {"example_id": 75, "predicted_answer": "A"}, {"example_id": 76, "predicted_answer": "A"}, {"example_id": 77, "predicted_answer": "A"}, {"example_id": 78, "predicted_answer": "A"}, {"example_id": 79, "predicted_answer": "A"}, {"example_id": 80, "predicted_answer": "A"}, {"example_id": 81, "predicted_answer": "A"}, {"example_id": 82, "predicted_answer": "A"}, {"example_id": 83, "predicted_answer": "A"}, {"example_id": 84, "predicted_answer": "A"}, {"example_id": 85, "predicted_answer": "A"}, {"example_id": 86, "predicted_answer": "A"}, {"example_id": 87, "predicted_answer": "A"}, {"example_id": 88, "predicted_answer": "A"}, {"example_id": 89, "predicted_answer": 
"A"}, {"example_id": 90, "predicted_answer": "A"}, {"example_id": 91, "predicted_answer": "A"}, {"example_id": 92, "predicted_answer": "A"}, {"example_id": 93, "predicted_answer": "A"}, {"example_id": 94, "predicted_answer": "A"}, {"example_id": 95, "predicted_answer": "A"}, {"example_id": 96, "predicted_answer": "A"}, {"example_id": 97, "predicted_answer": "A"}, {"example_id": 98, "predicted_answer": "A"}, {"example_id": 99, "predicted_answer": "A"}, {"example_id": 100, "predicted_answer": "A"}, {"example_id": 101, "predicted_answer": "A"}, {"example_id": 102, "predicted_answer": "A"}, {"example_id": 103, "predicted_answer": "A"}, {"example_id": 104, "predicted_answer": "A"}, {"example_id": 105, "predicted_answer": "A"}, {"example_id": 106, "predicted_answer": "A"}, {"example_id": 107, "predicted_answer": "A"}, {"example_id": 108, "predicted_answer": "A"}, {"example_id": 109, "predicted_answer": "A"}, {"example_id": 110, "predicted_answer": "A"}, {"example_id": 111, "predicted_answer": "A"}, {"example_id": 112, "predicted_answer": "A"}, {"example_id": 113, "predicted_answer": "A"}, {"example_id": 114, "predicted_answer": "A"}, {"example_id": 115, "predicted_answer": "A"}, {"example_id": 116, "predicted_answer": "A"}, {"example_id": 117, "predicted_answer": "A"}, {"example_id": 118, "predicted_answer": "A"}, {"example_id": 119, "predicted_answer": "A"}, {"example_id": 120, "predicted_answer": "A"}, {"example_id": 121, "predicted_answer": "A"}, {"example_id": 122, "predicted_answer": "A"}, {"example_id": 123, "predicted_answer": "A"}, {"example_id": 124, "predicted_answer": "A"}, {"example_id": 125, "predicted_answer": "A"}, {"example_id": 126, "predicted_answer": "A"}, {"example_id": 127, "predicted_answer": "A"}, {"example_id": 128, "predicted_answer": "A"}, {"example_id": 129, "predicted_answer": "A"}, {"example_id": 130, "predicted_answer": "A"}, {"example_id": 131, "predicted_answer": "A"}, {"example_id": 132, "predicted_answer": "A"}, {"example_id": 133, 
"predicted_answer": "A"}, {"example_id": 134, "predicted_answer": "A"}, {"example_id": 135, "predicted_answer": "A"}, {"example_id": 136, "predicted_answer": "A"}, {"example_id": 137, "predicted_answer": "A"}, {"example_id": 138, "predicted_answer": "A"}, {"example_id": 139, "predicted_answer": "A"}, {"example_id": 140, "predicted_answer": "A"}, {"example_id": 141, "predicted_answer": "A"}, {"example_id": 142, "predicted_answer": "A"}, {"example_id": 143, "predicted_answer": "A"}, {"example_id": 144, "predicted_answer": "A"}, {"example_id": 145, "predicted_answer": "A"}, {"example_id": 146, "predicted_answer": "A"}, {"example_id": 147, "predicted_answer": "A"}, {"example_id": 148, "predicted_answer": "A"}, {"example_id": 149, "predicted_answer": "A"}, {"example_id": 150, "predicted_answer": "A"}, {"example_id": 151, "predicted_answer": "A"}, {"example_id": 152, "predicted_answer": "A"}, {"example_id": 153, "predicted_answer": "A"}, {"example_id": 154, "predicted_answer": "A"}, {"example_id": 155, "predicted_answer": "A"}, {"example_id": 156, "predicted_answer": "A"}, {"example_id": 157, "predicted_answer": "A"}, {"example_id": 158, "predicted_answer": "A"}, {"example_id": 159, "predicted_answer": "A"}, {"example_id": 160, "predicted_answer": "A"}, {"example_id": 161, "predicted_answer": "A"}, {"example_id": 162, "predicted_answer": "A"}, {"example_id": 163, "predicted_answer": "A"}, {"example_id": 164, "predicted_answer": "A"}, {"example_id": 165, "predicted_answer": "A"}, {"example_id": 166, "predicted_answer": "A"}, {"example_id": 167, "predicted_answer": "A"}, {"example_id": 168, "predicted_answer": "A"}, {"example_id": 169, "predicted_answer": "A"}, {"example_id": 170, "predicted_answer": "A"}, {"example_id": 171, "predicted_answer": "A"}, {"example_id": 172, "predicted_answer": "A"}, {"example_id": 173, "predicted_answer": "A"}, {"example_id": 174, "predicted_answer": "A"}, {"example_id": 175, "predicted_answer": "A"}, {"example_id": 176, "predicted_answer": 
"A"}, {"example_id": 177, "predicted_answer": "A"}, {"example_id": 178, "predicted_answer": "A"}, {"example_id": 179, "predicted_answer": "A"}, {"example_id": 180, "predicted_answer": "A"}, {"example_id": 181, "predicted_answer": "A"}, {"example_id": 182, "predicted_answer": "A"}, {"example_id": 183, "predicted_answer": "A"}, {"example_id": 184, "predicted_answer": "A"}, {"example_id": 185, "predicted_answer": "A"}, {"example_id": 186, "predicted_answer": "A"}, {"example_id": 187, "predicted_answer": "A"}, {"example_id": 188, "predicted_answer": "A"}, {"example_id": 189, "predicted_answer": "A"}, {"example_id": 190, "predicted_answer": "A"}, {"example_id": 191, "predicted_answer": "A"}, {"example_id": 192, "predicted_answer": "A"}, {"example_id": 193, "predicted_answer": "A"}, {"example_id": 194, "predicted_answer": "A"}, {"example_id": 195, "predicted_answer": "A"}, {"example_id": 196, "predicted_answer": "A"}, {"example_id": 197, "predicted_answer": "A"}, {"example_id": 198, "predicted_answer": "A"}, {"example_id": 199, "predicted_answer": "A"}, {"example_id": 200, "predicted_answer": "A"}, {"example_id": 201, "predicted_answer": "A"}, {"example_id": 202, "predicted_answer": "A"}, {"example_id": 203, "predicted_answer": "A"}, {"example_id": 204, "predicted_answer": "A"}, {"example_id": 205, "predicted_answer": "A"}, {"example_id": 206, "predicted_answer": "A"}, {"example_id": 207, "predicted_answer": "A"}, {"example_id": 208, "predicted_answer": "A"}, {"example_id": 209, "predicted_answer": "A"}, {"example_id": 210, "predicted_answer": "A"}, {"example_id": 211, "predicted_answer": "A"}, {"example_id": 212, "predicted_answer": "A"}, {"example_id": 213, "predicted_answer": "A"}, {"example_id": 214, "predicted_answer": "A"}, {"example_id": 215, "predicted_answer": "A"}, {"example_id": 216, "predicted_answer": "A"}, {"example_id": 217, "predicted_answer": "A"}, {"example_id": 218, "predicted_answer": "A"}, {"example_id": 219, "predicted_answer": "A"}, {"example_id": 
220, "predicted_answer": "A"}, {"example_id": 221, "predicted_answer": "A"}, {"example_id": 222, "predicted_answer": "A"}, {"example_id": 223, "predicted_answer": "A"}, {"example_id": 224, "predicted_answer": "A"}, {"example_id": 225, "predicted_answer": "A"}, {"example_id": 226, "predicted_answer": "A"}, {"example_id": 227, "predicted_answer": "A"}, {"example_id": 228, "predicted_answer": "A"}, {"example_id": 229, "predicted_answer": "A"}, {"example_id": 230, "predicted_answer": "A"}, {"example_id": 231, "predicted_answer": "A"}, {"example_id": 232, "predicted_answer": "A"}, {"example_id": 233, "predicted_answer": "A"}, {"example_id": 234, "predicted_answer": "A"}, {"example_id": 235, "predicted_answer": "A"}, {"example_id": 236, "predicted_answer": "A"}, {"example_id": 237, "predicted_answer": "A"}, {"example_id": 238, "predicted_answer": "A"}, {"example_id": 239, "predicted_answer": "A"}, {"example_id": 240, "predicted_answer": "A"}, {"example_id": 241, "predicted_answer": "A"}, {"example_id": 242, "predicted_answer": "A"}, {"example_id": 243, "predicted_answer": "A"}, {"example_id": 244, "predicted_answer": "A"}, {"example_id": 245, "predicted_answer": "A"}, {"example_id": 246, "predicted_answer": "A"}, {"example_id": 247, "predicted_answer": "A"}, {"example_id": 248, "predicted_answer": "A"}, {"example_id": 249, "predicted_answer": "A"}, {"example_id": 250, "predicted_answer": "A"}, {"example_id": 251, "predicted_answer": "A"}, {"example_id": 252, "predicted_answer": "A"}, {"example_id": 253, "predicted_answer": "A"}, {"example_id": 254, "predicted_answer": "A"}, {"example_id": 255, "predicted_answer": "A"}, {"example_id": 256, "predicted_answer": "A"}, {"example_id": 257, "predicted_answer": "A"}, {"example_id": 258, "predicted_answer": "A"}, {"example_id": 259, "predicted_answer": "A"}, {"example_id": 260, "predicted_answer": "A"}, {"example_id": 261, "predicted_answer": "A"}, {"example_id": 262, "predicted_answer": "A"}, {"example_id": 263, 
"predicted_answer": "A"}, {"example_id": 264, "predicted_answer": "A"}, {"example_id": 265, "predicted_answer": "A"}, {"example_id": 266, "predicted_answer": "A"}, {"example_id": 267, "predicted_answer": "A"}, {"example_id": 268, "predicted_answer": "A"}, {"example_id": 269, "predicted_answer": "A"}, {"example_id": 270, "predicted_answer": "A"}, {"example_id": 271, "predicted_answer": "A"}, {"example_id": 272, "predicted_answer": "A"}, {"example_id": 273, "predicted_answer": "A"}, {"example_id": 274, "predicted_answer": "A"}, {"example_id": 275, "predicted_answer": "A"}, {"example_id": 276, "predicted_answer": "A"}, {"example_id": 277, "predicted_answer": "A"}, {"example_id": 278, "predicted_answer": "A"}, {"example_id": 279, "predicted_answer": "A"}, {"example_id": 280, "predicted_answer": "A"}, {"example_id": 281, "predicted_answer": "A"}, {"example_id": 282, "predicted_answer": "A"}, {"example_id": 283, "predicted_answer": "A"}, {"example_id": 284, "predicted_answer": "A"}, {"example_id": 285, "predicted_answer": "A"}, {"example_id": 286, "predicted_answer": "A"}, {"example_id": 287, "predicted_answer": "A"}, {"example_id": 288, "predicted_answer": "A"}, {"example_id": 289, "predicted_answer": "A"}, {"example_id": 290, "predicted_answer": "A"}, {"example_id": 291, "predicted_answer": "A"}, {"example_id": 292, "predicted_answer": "A"}, {"example_id": 293, "predicted_answer": "A"}, {"example_id": 294, "predicted_answer": "A"}, {"example_id": 295, "predicted_answer": "A"}, {"example_id": 296, "predicted_answer": "A"}, {"example_id": 297, "predicted_answer": "A"}, {"example_id": 298, "predicted_answer": "A"}, {"example_id": 299, "predicted_answer": "A"}, {"example_id": 300, "predicted_answer": "A"}, {"example_id": 301, "predicted_answer": "A"}, {"example_id": 302, "predicted_answer": "A"}, {"example_id": 303, "predicted_answer": "A"}, {"example_id": 304, "predicted_answer": "A"}, {"example_id": 305, "predicted_answer": "A"}, {"example_id": 306, "predicted_answer": 
"A"}, {"example_id": 307, "predicted_answer": "A"}, {"example_id": 308, "predicted_answer": "A"}, {"example_id": 309, "predicted_answer": "A"}, {"example_id": 310, "predicted_answer": "A"}, {"example_id": 311, "predicted_answer": "A"}, {"example_id": 312, "predicted_answer": "A"}, {"example_id": 313, "predicted_answer": "A"}, {"example_id": 314, "predicted_answer": "A"}, {"example_id": 315, "predicted_answer": "A"}, {"example_id": 316, "predicted_answer": "A"}, {"example_id": 317, "predicted_answer": "A"}, {"example_id": 318, "predicted_answer": "A"}, {"example_id": 319, "predicted_answer": "A"}, {"example_id": 320, "predicted_answer": "A"}, {"example_id": 321, "predicted_answer": "A"}, {"example_id": 322, "predicted_answer": "A"}, {"example_id": 323, "predicted_answer": "A"}, {"example_id": 324, "predicted_answer": "A"}, {"example_id": 325, "predicted_answer": "A"}, {"example_id": 326, "predicted_answer": "A"}, {"example_id": 327, "predicted_answer": "A"}, {"example_id": 328, "predicted_answer": "A"}, {"example_id": 329, "predicted_answer": "A"}, {"example_id": 330, "predicted_answer": "A"}, {"example_id": 331, "predicted_answer": "A"}, {"example_id": 332, "predicted_answer": "A"}, {"example_id": 333, "predicted_answer": "A"}, {"example_id": 334, "predicted_answer": "A"}, {"example_id": 335, "predicted_answer": "A"}, {"example_id": 336, "predicted_answer": "A"}, {"example_id": 337, "predicted_answer": "A"}, {"example_id": 338, "predicted_answer": "A"}, {"example_id": 339, "predicted_answer": "A"}, {"example_id": 340, "predicted_answer": "A"}, {"example_id": 341, "predicted_answer": "A"}, {"example_id": 342, "predicted_answer": "A"}, {"example_id": 343, "predicted_answer": "A"}, {"example_id": 344, "predicted_answer": "A"}, {"example_id": 345, "predicted_answer": "A"}, {"example_id": 346, "predicted_answer": "A"}, {"example_id": 347, "predicted_answer": "A"}, {"example_id": 348, "predicted_answer": "A"}, {"example_id": 349, "predicted_answer": "A"}, {"example_id": 
350, "predicted_answer": "A"}, {"example_id": 351, "predicted_answer": "A"}, {"example_id": 352, "predicted_answer": "A"}, {"example_id": 353, "predicted_answer": "A"}, {"example_id": 354, "predicted_answer": "A"}, {"example_id": 355, "predicted_answer": "A"}, {"example_id": 356, "predicted_answer": "A"}, {"example_id": 357, "predicted_answer": "A"}, {"example_id": 358, "predicted_answer": "A"}, {"example_id": 359, "predicted_answer": "A"}, {"example_id": 360, "predicted_answer": "A"}, {"example_id": 361, "predicted_answer": "A"}, {"example_id": 362, "predicted_answer": "A"}, {"example_id": 363, "predicted_answer": "A"}, {"example_id": 364, "predicted_answer": "A"}, {"example_id": 365, "predicted_answer": "A"}, {"example_id": 366, "predicted_answer": "A"}, {"example_id": 367, "predicted_answer": "A"}, {"example_id": 368, "predicted_answer": "A"}, {"example_id": 369, "predicted_answer": "A"}, {"example_id": 370, "predicted_answer": "A"}, {"example_id": 371, "predicted_answer": "A"}, {"example_id": 372, "predicted_answer": "A"}, {"example_id": 373, "predicted_answer": "A"}, {"example_id": 374, "predicted_answer": "A"}, {"example_id": 375, "predicted_answer": "A"}, {"example_id": 376, "predicted_answer": "A"}, {"example_id": 377, "predicted_answer": "A"}, {"example_id": 378, "predicted_answer": "A"}, {"example_id": 379, "predicted_answer": "A"}, {"example_id": 380, "predicted_answer": "A"}, {"example_id": 381, "predicted_answer": "A"}, {"example_id": 382, "predicted_answer": "A"}, {"example_id": 383, "predicted_answer": "A"}, {"example_id": 384, "predicted_answer": "A"}, {"example_id": 385, "predicted_answer": "A"}, {"example_id": 386, "predicted_answer": "A"}, {"example_id": 387, "predicted_answer": "A"}, {"example_id": 388, "predicted_answer": "A"}, {"example_id": 389, "predicted_answer": "A"}, {"example_id": 390, "predicted_answer": "A"}, {"example_id": 391, "predicted_answer": "A"}, {"example_id": 392, "predicted_answer": "A"}, {"example_id": 393, 
"predicted_answer": "A"}, {"example_id": 394, "predicted_answer": "A"}, {"example_id": 395, "predicted_answer": "A"}, {"example_id": 396, "predicted_answer": "A"}, {"example_id": 397, "predicted_answer": "A"}, {"example_id": 398, "predicted_answer": "A"}, {"example_id": 399, "predicted_answer": "A"}, {"example_id": 400, "predicted_answer": "A"}, {"example_id": 401, "predicted_answer": "A"}, {"example_id": 402, "predicted_answer": "A"}, {"example_id": 403, "predicted_answer": "A"}, {"example_id": 404, "predicted_answer": "A"}, {"example_id": 405, "predicted_answer": "A"}, {"example_id": 406, "predicted_answer": "A"}, {"example_id": 407, "predicted_answer": "A"}, {"example_id": 408, "predicted_answer": "A"}, {"example_id": 409, "predicted_answer": "A"}, {"example_id": 410, "predicted_answer": "A"}, {"example_id": 411, "predicted_answer": "A"}, {"example_id": 412, "predicted_answer": "A"}, {"example_id": 413, "predicted_answer": "A"}, {"example_id": 414, "predicted_answer": "A"}, {"example_id": 415, "predicted_answer": "A"}, {"example_id": 416, "predicted_answer": "A"}, {"example_id": 417, "predicted_answer": "A"}, {"example_id": 418, "predicted_answer": "A"}, {"example_id": 419, "predicted_answer": "A"}, {"example_id": 420, "predicted_answer": "A"}, {"example_id": 421, "predicted_answer": "A"}, {"example_id": 422, "predicted_answer": "A"}, {"example_id": 423, "predicted_answer": "A"}, {"example_id": 424, "predicted_answer": "A"}, {"example_id": 425, "predicted_answer": "A"}, {"example_id": 426, "predicted_answer": "A"}, {"example_id": 427, "predicted_answer": "A"}, {"example_id": 428, "predicted_answer": "A"}, {"example_id": 429, "predicted_answer": "A"}, {"example_id": 430, "predicted_answer": "A"}, {"example_id": 431, "predicted_answer": "A"}, {"example_id": 432, "predicted_answer": "A"}, {"example_id": 433, "predicted_answer": "A"}, {"example_id": 434, "predicted_answer": "A"}, {"example_id": 435, "predicted_answer": "A"}, {"example_id": 436, "predicted_answer": 
"A"}, {"example_id": 437, "predicted_answer": "A"}, {"example_id": 438, "predicted_answer": "A"}, {"example_id": 439, "predicted_answer": "A"}, {"example_id": 440, "predicted_answer": "A"}, {"example_id": 441, "predicted_answer": "A"}, {"example_id": 442, "predicted_answer": "A"}, {"example_id": 443, "predicted_answer": "A"}, {"example_id": 444, "predicted_answer": "A"}, {"example_id": 445, "predicted_answer": "A"}, {"example_id": 446, "predicted_answer": "A"}, {"example_id": 447, "predicted_answer": "A"}, {"example_id": 448, "predicted_answer": "A"}, {"example_id": 449, "predicted_answer": "A"}, {"example_id": 450, "predicted_answer": "A"}, {"example_id": 451, "predicted_answer": "A"}, {"example_id": 452, "predicted_answer": "A"}, {"example_id": 453, "predicted_answer": "A"}, {"example_id": 454, "predicted_answer": "A"}, {"example_id": 455, "predicted_answer": "A"}, {"example_id": 456, "predicted_answer": "A"}, {"example_id": 457, "predicted_answer": "A"}, {"example_id": 458, "predicted_answer": "A"}, {"example_id": 459, "predicted_answer": "A"}, {"example_id": 460, "predicted_answer": "A"}, {"example_id": 461, "predicted_answer": "A"}, {"example_id": 462, "predicted_answer": "A"}, {"example_id": 463, "predicted_answer": "A"}, {"example_id": 464, "predicted_answer": "A"}, {"example_id": 465, "predicted_answer": "A"}, {"example_id": 466, "predicted_answer": "A"}, {"example_id": 467, "predicted_answer": "A"}, {"example_id": 468, "predicted_answer": "A"}, {"example_id": 469, "predicted_answer": "A"}, {"example_id": 470, "predicted_answer": "A"}, {"example_id": 471, "predicted_answer": "A"}, {"example_id": 472, "predicted_answer": "A"}, {"example_id": 473, "predicted_answer": "A"}, {"example_id": 474, "predicted_answer": "A"}, {"example_id": 475, "predicted_answer": "A"}, {"example_id": 476, "predicted_answer": "A"}, {"example_id": 477, "predicted_answer": "A"}, {"example_id": 478, "predicted_answer": "A"}, {"example_id": 479, "predicted_answer": "A"}, {"example_id": 
480, "predicted_answer": "A"}, {"example_id": 481, "predicted_answer": "A"}, {"example_id": 482, "predicted_answer": "A"}, {"example_id": 483, "predicted_answer": "A"}, {"example_id": 484, "predicted_answer": "A"}, {"example_id": 485, "predicted_answer": "A"}, {"example_id": 486, "predicted_answer": "A"}, {"example_id": 487, "predicted_answer": "A"}, {"example_id": 488, "predicted_answer": "A"}, {"example_id": 489, "predicted_answer": "A"}, {"example_id": 490, "predicted_answer": "A"}, {"example_id": 491, "predicted_answer": "A"}, {"example_id": 492, "predicted_answer": "A"}, {"example_id": 493, "predicted_answer": "A"}, {"example_id": 494, "predicted_answer": "A"}, {"example_id": 495, "predicted_answer": "A"}, {"example_id": 496, "predicted_answer": "A"}, {"example_id": 497, "predicted_answer": "A"}, {"example_id": 498, "predicted_answer": "A"}, {"example_id": 499, "predicted_answer": "A"}, {"example_id": 500, "predicted_answer": "A"}]
tests/fixtures/oracle_submission.json ADDED
@@ -0,0 +1 @@
 
 
+ [{"example_id": 1, "predicted_answer": "A"}, {"example_id": 2, "predicted_answer": "B"}, {"example_id": 3, "predicted_answer": "C"}, {"example_id": 4, "predicted_answer": "D"}, {"example_id": 5, "predicted_answer": "A"}, {"example_id": 6, "predicted_answer": "B"}, {"example_id": 7, "predicted_answer": "C"}, {"example_id": 8, "predicted_answer": "D"}, {"example_id": 9, "predicted_answer": "E"}, {"example_id": 10, "predicted_answer": "A"}, {"example_id": 11, "predicted_answer": "D"}, {"example_id": 12, "predicted_answer": "E"}, {"example_id": 13, "predicted_answer": "B"}, {"example_id": 14, "predicted_answer": "B"}, {"example_id": 15, "predicted_answer": "C"}, {"example_id": 16, "predicted_answer": "C"}, {"example_id": 17, "predicted_answer": "F"}, {"example_id": 18, "predicted_answer": "E"}, {"example_id": 19, "predicted_answer": "A"}, {"example_id": 20, "predicted_answer": "A"}, {"example_id": 21, "predicted_answer": "B"}, {"example_id": 22, "predicted_answer": "B"}, {"example_id": 23, "predicted_answer": "D"}, {"example_id": 24, "predicted_answer": "C"}, {"example_id": 25, "predicted_answer": "A"}, {"example_id": 26, "predicted_answer": "C"}, {"example_id": 27, "predicted_answer": "E"}, {"example_id": 28, "predicted_answer": "A"}, {"example_id": 29, "predicted_answer": "B"}, {"example_id": 30, "predicted_answer": "C"}, {"example_id": 31, "predicted_answer": "F"}, {"example_id": 32, "predicted_answer": "A"}, {"example_id": 33, "predicted_answer": "D"}, {"example_id": 34, "predicted_answer": "E"}, {"example_id": 35, "predicted_answer": "C"}, {"example_id": 36, "predicted_answer": "C"}, {"example_id": 37, "predicted_answer": "C"}, {"example_id": 38, "predicted_answer": "D"}, {"example_id": 39, "predicted_answer": "D"}, {"example_id": 40, "predicted_answer": "C"}, {"example_id": 41, "predicted_answer": "B"}, {"example_id": 42, "predicted_answer": "B"}, {"example_id": 43, "predicted_answer": "B"}, {"example_id": 44, "predicted_answer": "B"}, {"example_id": 45, 
"predicted_answer": "C"}, {"example_id": 46, "predicted_answer": "B"}, {"example_id": 47, "predicted_answer": "B"}, {"example_id": 48, "predicted_answer": "A"}, {"example_id": 49, "predicted_answer": "B"}, {"example_id": 50, "predicted_answer": "E"}, {"example_id": 51, "predicted_answer": "F"}, {"example_id": 52, "predicted_answer": "F"}, {"example_id": 53, "predicted_answer": "F"}, {"example_id": 54, "predicted_answer": "F"}, {"example_id": 55, "predicted_answer": "F"}, {"example_id": 56, "predicted_answer": "F"}, {"example_id": 57, "predicted_answer": "F"}, {"example_id": 58, "predicted_answer": "F"}, {"example_id": 59, "predicted_answer": "F"}, {"example_id": 60, "predicted_answer": "F"}, {"example_id": 61, "predicted_answer": "F"}, {"example_id": 62, "predicted_answer": "F"}, {"example_id": 63, "predicted_answer": "F"}, {"example_id": 64, "predicted_answer": "F"}, {"example_id": 65, "predicted_answer": "F"}, {"example_id": 66, "predicted_answer": "F"}, {"example_id": 67, "predicted_answer": "F"}, {"example_id": 68, "predicted_answer": "F"}, {"example_id": 69, "predicted_answer": "B"}, {"example_id": 70, "predicted_answer": "C"}, {"example_id": 71, "predicted_answer": "D"}, {"example_id": 72, "predicted_answer": "E"}, {"example_id": 73, "predicted_answer": "F"}, {"example_id": 74, "predicted_answer": "A"}, {"example_id": 75, "predicted_answer": "B"}, {"example_id": 76, "predicted_answer": "C"}, {"example_id": 77, "predicted_answer": "D"}, {"example_id": 78, "predicted_answer": "E"}, {"example_id": 79, "predicted_answer": "F"}, {"example_id": 80, "predicted_answer": "E"}, {"example_id": 81, "predicted_answer": "E"}, {"example_id": 82, "predicted_answer": "E"}, {"example_id": 83, "predicted_answer": "A"}, {"example_id": 84, "predicted_answer": "B"}, {"example_id": 85, "predicted_answer": "C"}, {"example_id": 86, "predicted_answer": "D"}, {"example_id": 87, "predicted_answer": "E"}, {"example_id": 88, "predicted_answer": "A"}, {"example_id": 89, "predicted_answer": 
"B"}, {"example_id": 90, "predicted_answer": "D"}, {"example_id": 91, "predicted_answer": "E"}, {"example_id": 92, "predicted_answer": "A"}, {"example_id": 93, "predicted_answer": "C"}, {"example_id": 94, "predicted_answer": "C"}, {"example_id": 95, "predicted_answer": "D"}, {"example_id": 96, "predicted_answer": "A"}, {"example_id": 97, "predicted_answer": "B"}, {"example_id": 98, "predicted_answer": "C"}, {"example_id": 99, "predicted_answer": "D"}, {"example_id": 100, "predicted_answer": "A"}, {"example_id": 101, "predicted_answer": "B"}, {"example_id": 102, "predicted_answer": "C"}, {"example_id": 103, "predicted_answer": "D"}, {"example_id": 104, "predicted_answer": "A"}, {"example_id": 105, "predicted_answer": "B"}, {"example_id": 106, "predicted_answer": "D"}, {"example_id": 107, "predicted_answer": "A"}, {"example_id": 108, "predicted_answer": "C"}, {"example_id": 109, "predicted_answer": "B"}, {"example_id": 110, "predicted_answer": "C"}, {"example_id": 111, "predicted_answer": "D"}, {"example_id": 112, "predicted_answer": "D"}, {"example_id": 113, "predicted_answer": "E"}, {"example_id": 114, "predicted_answer": "A"}, {"example_id": 115, "predicted_answer": "B"}, {"example_id": 116, "predicted_answer": "A"}, {"example_id": 117, "predicted_answer": "C"}, {"example_id": 118, "predicted_answer": "C"}, {"example_id": 119, "predicted_answer": "D"}, {"example_id": 120, "predicted_answer": "E"}, {"example_id": 121, "predicted_answer": "F"}, {"example_id": 122, "predicted_answer": "B"}, {"example_id": 123, "predicted_answer": "D"}, {"example_id": 124, "predicted_answer": "D"}, {"example_id": 125, "predicted_answer": "D"}, {"example_id": 126, "predicted_answer": "B"}, {"example_id": 127, "predicted_answer": "D"}, {"example_id": 128, "predicted_answer": "D"}, {"example_id": 129, "predicted_answer": "B"}, {"example_id": 130, "predicted_answer": "C"}, {"example_id": 131, "predicted_answer": "D"}, {"example_id": 132, "predicted_answer": "E"}, {"example_id": 133, 
"predicted_answer": "F"}, {"example_id": 134, "predicted_answer": "A"}, {"example_id": 135, "predicted_answer": "D"}, {"example_id": 136, "predicted_answer": "E"}, {"example_id": 137, "predicted_answer": "B"}, {"example_id": 138, "predicted_answer": "C"}, {"example_id": 139, "predicted_answer": "B"}, {"example_id": 140, "predicted_answer": "A"}, {"example_id": 141, "predicted_answer": "A"}, {"example_id": 142, "predicted_answer": "C"}, {"example_id": 143, "predicted_answer": "D"}, {"example_id": 144, "predicted_answer": "E"}, {"example_id": 145, "predicted_answer": "E"}, {"example_id": 146, "predicted_answer": "G"}, {"example_id": 147, "predicted_answer": "F"}, {"example_id": 148, "predicted_answer": "D"}, {"example_id": 149, "predicted_answer": "D"}, {"example_id": 150, "predicted_answer": "A"}, {"example_id": 151, "predicted_answer": "C"}, {"example_id": 152, "predicted_answer": "A"}, {"example_id": 153, "predicted_answer": "E"}, {"example_id": 154, "predicted_answer": "A"}, {"example_id": 155, "predicted_answer": "A"}, {"example_id": 156, "predicted_answer": "A"}, {"example_id": 157, "predicted_answer": "A"}, {"example_id": 158, "predicted_answer": "A"}, {"example_id": 159, "predicted_answer": "B"}, {"example_id": 160, "predicted_answer": "A"}, {"example_id": 161, "predicted_answer": "E"}, {"example_id": 162, "predicted_answer": "G"}, {"example_id": 163, "predicted_answer": "A"}, {"example_id": 164, "predicted_answer": "G"}, {"example_id": 165, "predicted_answer": "A"}, {"example_id": 166, "predicted_answer": "B"}, {"example_id": 167, "predicted_answer": "B"}, {"example_id": 168, "predicted_answer": "F"}, {"example_id": 169, "predicted_answer": "F"}, {"example_id": 170, "predicted_answer": "G"}, {"example_id": 171, "predicted_answer": "G"}, {"example_id": 172, "predicted_answer": "E"}, {"example_id": 173, "predicted_answer": "F"}, {"example_id": 174, "predicted_answer": "A"}, {"example_id": 175, "predicted_answer": "B"}, {"example_id": 176, "predicted_answer": 
"B"}, {"example_id": 177, "predicted_answer": "C"}, {"example_id": 178, "predicted_answer": "B"}, {"example_id": 179, "predicted_answer": "F"}, {"example_id": 180, "predicted_answer": "F"}, {"example_id": 181, "predicted_answer": "G"}, {"example_id": 182, "predicted_answer": "E"}, {"example_id": 183, "predicted_answer": "G"}, {"example_id": 184, "predicted_answer": "E"}, {"example_id": 185, "predicted_answer": "B"}, {"example_id": 186, "predicted_answer": "B"}, {"example_id": 187, "predicted_answer": "C"}, {"example_id": 188, "predicted_answer": "F"}, {"example_id": 189, "predicted_answer": "E"}, {"example_id": 190, "predicted_answer": "D"}, {"example_id": 191, "predicted_answer": "D"}, {"example_id": 192, "predicted_answer": "D"}, {"example_id": 193, "predicted_answer": "D"}, {"example_id": 194, "predicted_answer": "C"}, {"example_id": 195, "predicted_answer": "E"}, {"example_id": 196, "predicted_answer": "F"}, {"example_id": 197, "predicted_answer": "B"}, {"example_id": 198, "predicted_answer": "F"}, {"example_id": 199, "predicted_answer": "C"}, {"example_id": 200, "predicted_answer": "C"}, {"example_id": 201, "predicted_answer": "C"}, {"example_id": 202, "predicted_answer": "D"}, {"example_id": 203, "predicted_answer": "E"}, {"example_id": 204, "predicted_answer": "C"}, {"example_id": 205, "predicted_answer": "F"}, {"example_id": 206, "predicted_answer": "F"}, {"example_id": 207, "predicted_answer": "E"}, {"example_id": 208, "predicted_answer": "D"}, {"example_id": 209, "predicted_answer": "E"}, {"example_id": 210, "predicted_answer": "E"}, {"example_id": 211, "predicted_answer": "D"}, {"example_id": 212, "predicted_answer": "D"}, {"example_id": 213, "predicted_answer": "B"}, {"example_id": 214, "predicted_answer": "D"}, {"example_id": 215, "predicted_answer": "A"}, {"example_id": 216, "predicted_answer": "B"}, {"example_id": 217, "predicted_answer": "C"}, {"example_id": 218, "predicted_answer": "C"}, {"example_id": 219, "predicted_answer": "C"}, {"example_id": 
220, "predicted_answer": "F"}, {"example_id": 221, "predicted_answer": "E"}, {"example_id": 222, "predicted_answer": "E"}, {"example_id": 223, "predicted_answer": "D"}, {"example_id": 224, "predicted_answer": "D"}, {"example_id": 225, "predicted_answer": "B"}, {"example_id": 226, "predicted_answer": "D"}, {"example_id": 227, "predicted_answer": "C"}, {"example_id": 228, "predicted_answer": "C"}, {"example_id": 229, "predicted_answer": "A"}, {"example_id": 230, "predicted_answer": "B"}, {"example_id": 231, "predicted_answer": "E"}, {"example_id": 232, "predicted_answer": "E"}, {"example_id": 233, "predicted_answer": "C"}, {"example_id": 234, "predicted_answer": "F"}, {"example_id": 235, "predicted_answer": "D"}, {"example_id": 236, "predicted_answer": "F"}, {"example_id": 237, "predicted_answer": "C"}, {"example_id": 238, "predicted_answer": "B"}, {"example_id": 239, "predicted_answer": "A"}, {"example_id": 240, "predicted_answer": "B"}, {"example_id": 241, "predicted_answer": "C"}, {"example_id": 242, "predicted_answer": "B"}, {"example_id": 243, "predicted_answer": "C"}, {"example_id": 244, "predicted_answer": "C"}, {"example_id": 245, "predicted_answer": "D"}, {"example_id": 246, "predicted_answer": "D"}, {"example_id": 247, "predicted_answer": "C"}, {"example_id": 248, "predicted_answer": "C"}, {"example_id": 249, "predicted_answer": "D"}, {"example_id": 250, "predicted_answer": "C"}, {"example_id": 251, "predicted_answer": "C"}, {"example_id": 252, "predicted_answer": "C"}, {"example_id": 253, "predicted_answer": "C"}, {"example_id": 254, "predicted_answer": "C"}, {"example_id": 255, "predicted_answer": "C"}, {"example_id": 256, "predicted_answer": "C"}, {"example_id": 257, "predicted_answer": "E"}, {"example_id": 258, "predicted_answer": "C"}, {"example_id": 259, "predicted_answer": "B"}, {"example_id": 260, "predicted_answer": "B"}, {"example_id": 261, "predicted_answer": "B"}, {"example_id": 262, "predicted_answer": "C"}, {"example_id": 263, 
"predicted_answer": "C"}, {"example_id": 264, "predicted_answer": "B"}, {"example_id": 265, "predicted_answer": "B"}, {"example_id": 266, "predicted_answer": "B"}, {"example_id": 267, "predicted_answer": "C"}, {"example_id": 268, "predicted_answer": "F"}, {"example_id": 269, "predicted_answer": "F"}, {"example_id": 270, "predicted_answer": "A"}, {"example_id": 271, "predicted_answer": "A"}, {"example_id": 272, "predicted_answer": "E"}, {"example_id": 273, "predicted_answer": "F"}, {"example_id": 274, "predicted_answer": "F"}, {"example_id": 275, "predicted_answer": "D"}, {"example_id": 276, "predicted_answer": "B"}, {"example_id": 277, "predicted_answer": "C"}, {"example_id": 278, "predicted_answer": "F"}, {"example_id": 279, "predicted_answer": "F"}, {"example_id": 280, "predicted_answer": "E"}, {"example_id": 281, "predicted_answer": "C"}, {"example_id": 282, "predicted_answer": "D"}, {"example_id": 283, "predicted_answer": "C"}, {"example_id": 284, "predicted_answer": "E"}, {"example_id": 285, "predicted_answer": "C"}, {"example_id": 286, "predicted_answer": "B"}, {"example_id": 287, "predicted_answer": "D"}, {"example_id": 288, "predicted_answer": "E"}, {"example_id": 289, "predicted_answer": "A"}, {"example_id": 290, "predicted_answer": "D"}, {"example_id": 291, "predicted_answer": "B"}, {"example_id": 292, "predicted_answer": "D"}, {"example_id": 293, "predicted_answer": "E"}, {"example_id": 294, "predicted_answer": "F"}, {"example_id": 295, "predicted_answer": "C"}, {"example_id": 296, "predicted_answer": "A"}, {"example_id": 297, "predicted_answer": "C"}, {"example_id": 298, "predicted_answer": "D"}, {"example_id": 299, "predicted_answer": "B"}, {"example_id": 300, "predicted_answer": "D"}, {"example_id": 301, "predicted_answer": "A"}, {"example_id": 302, "predicted_answer": "C"}, {"example_id": 303, "predicted_answer": "A"}, {"example_id": 304, "predicted_answer": "A"}, {"example_id": 305, "predicted_answer": "E"}, {"example_id": 306, "predicted_answer": 
"F"}, {"example_id": 307, "predicted_answer": "E"}, {"example_id": 308, "predicted_answer": "D"}, {"example_id": 309, "predicted_answer": "D"}, {"example_id": 310, "predicted_answer": "F"}, {"example_id": 311, "predicted_answer": "D"}, {"example_id": 312, "predicted_answer": "B"}, {"example_id": 313, "predicted_answer": "A"}, {"example_id": 314, "predicted_answer": "B"}, {"example_id": 315, "predicted_answer": "C"}, {"example_id": 316, "predicted_answer": "D"}, {"example_id": 317, "predicted_answer": "B"}, {"example_id": 318, "predicted_answer": "A"}, {"example_id": 319, "predicted_answer": "E"}, {"example_id": 320, "predicted_answer": "C"}, {"example_id": 321, "predicted_answer": "A"}, {"example_id": 322, "predicted_answer": "B"}, {"example_id": 323, "predicted_answer": "D"}, {"example_id": 324, "predicted_answer": "D"}, {"example_id": 325, "predicted_answer": "D"}, {"example_id": 326, "predicted_answer": "F"}, {"example_id": 327, "predicted_answer": "D"}, {"example_id": 328, "predicted_answer": "C"}, {"example_id": 329, "predicted_answer": "C"}, {"example_id": 330, "predicted_answer": "B"}, {"example_id": 331, "predicted_answer": "D"}, {"example_id": 332, "predicted_answer": "F"}, {"example_id": 333, "predicted_answer": "E"}, {"example_id": 334, "predicted_answer": "F"}, {"example_id": 335, "predicted_answer": "E"}, {"example_id": 336, "predicted_answer": "C"}, {"example_id": 337, "predicted_answer": "E"}, {"example_id": 338, "predicted_answer": "E"}, {"example_id": 339, "predicted_answer": "E"}, {"example_id": 340, "predicted_answer": "F"}, {"example_id": 341, "predicted_answer": "F"}, {"example_id": 342, "predicted_answer": "D"}, {"example_id": 343, "predicted_answer": "A"}, {"example_id": 344, "predicted_answer": "C"}, {"example_id": 345, "predicted_answer": "E"}, {"example_id": 346, "predicted_answer": "B"}, {"example_id": 347, "predicted_answer": "B"}, {"example_id": 348, "predicted_answer": "B"}, {"example_id": 349, "predicted_answer": "E"}, {"example_id": 
350, "predicted_answer": "F"}, {"example_id": 351, "predicted_answer": "C"}, {"example_id": 352, "predicted_answer": "A"}, {"example_id": 353, "predicted_answer": "B"}, {"example_id": 354, "predicted_answer": "B"}, {"example_id": 355, "predicted_answer": "F"}, {"example_id": 356, "predicted_answer": "C"}, {"example_id": 357, "predicted_answer": "F"}, {"example_id": 358, "predicted_answer": "C"}, {"example_id": 359, "predicted_answer": "C"}, {"example_id": 360, "predicted_answer": "B"}, {"example_id": 361, "predicted_answer": "C"}, {"example_id": 362, "predicted_answer": "D"}, {"example_id": 363, "predicted_answer": "A"}, {"example_id": 364, "predicted_answer": "B"}, {"example_id": 365, "predicted_answer": "B"}, {"example_id": 366, "predicted_answer": "A"}, {"example_id": 367, "predicted_answer": "A"}, {"example_id": 368, "predicted_answer": "B"}, {"example_id": 369, "predicted_answer": "E"}, {"example_id": 370, "predicted_answer": "F"}, {"example_id": 371, "predicted_answer": "E"}, {"example_id": 372, "predicted_answer": "A"}, {"example_id": 373, "predicted_answer": "D"}, {"example_id": 374, "predicted_answer": "B"}, {"example_id": 375, "predicted_answer": "C"}, {"example_id": 376, "predicted_answer": "D"}, {"example_id": 377, "predicted_answer": "A"}, {"example_id": 378, "predicted_answer": "F"}, {"example_id": 379, "predicted_answer": "E"}, {"example_id": 380, "predicted_answer": "A"}, {"example_id": 381, "predicted_answer": "C"}, {"example_id": 382, "predicted_answer": "A"}, {"example_id": 383, "predicted_answer": "F"}, {"example_id": 384, "predicted_answer": "A"}, {"example_id": 385, "predicted_answer": "B"}, {"example_id": 386, "predicted_answer": "F"}, {"example_id": 387, "predicted_answer": "E"}, {"example_id": 388, "predicted_answer": "C"}, {"example_id": 389, "predicted_answer": "B"}, {"example_id": 390, "predicted_answer": "C"}, {"example_id": 391, "predicted_answer": "D"}, {"example_id": 392, "predicted_answer": "A"}, {"example_id": 393, 
"predicted_answer": "C"}, {"example_id": 394, "predicted_answer": "F"}, {"example_id": 395, "predicted_answer": "E"}, {"example_id": 396, "predicted_answer": "F"}, {"example_id": 397, "predicted_answer": "E"}, {"example_id": 398, "predicted_answer": "D"}, {"example_id": 399, "predicted_answer": "D"}, {"example_id": 400, "predicted_answer": "D"}, {"example_id": 401, "predicted_answer": "B"}, {"example_id": 402, "predicted_answer": "E"}, {"example_id": 403, "predicted_answer": "G"}, {"example_id": 404, "predicted_answer": "G"}, {"example_id": 405, "predicted_answer": "G"}, {"example_id": 406, "predicted_answer": "C"}, {"example_id": 407, "predicted_answer": "G"}, {"example_id": 408, "predicted_answer": "A"}, {"example_id": 409, "predicted_answer": "F"}, {"example_id": 410, "predicted_answer": "B"}, {"example_id": 411, "predicted_answer": "H"}, {"example_id": 412, "predicted_answer": "E"}, {"example_id": 413, "predicted_answer": "F"}, {"example_id": 414, "predicted_answer": "C"}, {"example_id": 415, "predicted_answer": "F"}, {"example_id": 416, "predicted_answer": "C"}, {"example_id": 417, "predicted_answer": "B"}, {"example_id": 418, "predicted_answer": "C"}, {"example_id": 419, "predicted_answer": "G"}, {"example_id": 420, "predicted_answer": "A"}, {"example_id": 421, "predicted_answer": "B"}, {"example_id": 422, "predicted_answer": "H"}, {"example_id": 423, "predicted_answer": "E"}, {"example_id": 424, "predicted_answer": "G"}, {"example_id": 425, "predicted_answer": "G"}, {"example_id": 426, "predicted_answer": "G"}, {"example_id": 427, "predicted_answer": "B"}, {"example_id": 428, "predicted_answer": "E"}, {"example_id": 429, "predicted_answer": "C"}, {"example_id": 430, "predicted_answer": "C"}, {"example_id": 431, "predicted_answer": "E"}, {"example_id": 432, "predicted_answer": "F"}, {"example_id": 433, "predicted_answer": "A"}, {"example_id": 434, "predicted_answer": "D"}, {"example_id": 435, "predicted_answer": "D"}, {"example_id": 436, "predicted_answer": 
"C"}, {"example_id": 437, "predicted_answer": "H"}, {"example_id": 438, "predicted_answer": "H"}, {"example_id": 439, "predicted_answer": "A"}, {"example_id": 440, "predicted_answer": "F"}, {"example_id": 441, "predicted_answer": "C"}, {"example_id": 442, "predicted_answer": "F"}, {"example_id": 443, "predicted_answer": "G"}, {"example_id": 444, "predicted_answer": "G"}, {"example_id": 445, "predicted_answer": "G"}, {"example_id": 446, "predicted_answer": "D"}, {"example_id": 447, "predicted_answer": "B"}, {"example_id": 448, "predicted_answer": "A"}, {"example_id": 449, "predicted_answer": "A"}, {"example_id": 450, "predicted_answer": "I"}, {"example_id": 451, "predicted_answer": "B"}, {"example_id": 452, "predicted_answer": "I"}, {"example_id": 453, "predicted_answer": "F"}, {"example_id": 454, "predicted_answer": "F"}, {"example_id": 455, "predicted_answer": "J"}, {"example_id": 456, "predicted_answer": "H"}, {"example_id": 457, "predicted_answer": "C"}, {"example_id": 458, "predicted_answer": "A"}, {"example_id": 459, "predicted_answer": "B"}, {"example_id": 460, "predicted_answer": "C"}, {"example_id": 461, "predicted_answer": "I"}, {"example_id": 462, "predicted_answer": "E"}, {"example_id": 463, "predicted_answer": "A"}, {"example_id": 464, "predicted_answer": "E"}, {"example_id": 465, "predicted_answer": "F"}, {"example_id": 466, "predicted_answer": "C"}, {"example_id": 467, "predicted_answer": "E"}, {"example_id": 468, "predicted_answer": "D"}, {"example_id": 469, "predicted_answer": "J"}, {"example_id": 470, "predicted_answer": "A"}, {"example_id": 471, "predicted_answer": "C"}, {"example_id": 472, "predicted_answer": "D"}, {"example_id": 473, "predicted_answer": "J"}, {"example_id": 474, "predicted_answer": "H"}, {"example_id": 475, "predicted_answer": "F"}, {"example_id": 476, "predicted_answer": "E"}, {"example_id": 477, "predicted_answer": "J"}, {"example_id": 478, "predicted_answer": "A"}, {"example_id": 479, "predicted_answer": "D"}, {"example_id": 
480, "predicted_answer": "F"}, {"example_id": 481, "predicted_answer": "G"}, {"example_id": 482, "predicted_answer": "A"}, {"example_id": 483, "predicted_answer": "E"}, {"example_id": 484, "predicted_answer": "J"}, {"example_id": 485, "predicted_answer": "C"}, {"example_id": 486, "predicted_answer": "F"}, {"example_id": 487, "predicted_answer": "F"}, {"example_id": 488, "predicted_answer": "B"}, {"example_id": 489, "predicted_answer": "F"}, {"example_id": 490, "predicted_answer": "C"}, {"example_id": 491, "predicted_answer": "D"}, {"example_id": 492, "predicted_answer": "C"}, {"example_id": 493, "predicted_answer": "A"}, {"example_id": 494, "predicted_answer": "H"}, {"example_id": 495, "predicted_answer": "H"}, {"example_id": 496, "predicted_answer": "A"}, {"example_id": 497, "predicted_answer": "H"}, {"example_id": 498, "predicted_answer": "F"}, {"example_id": 499, "predicted_answer": "A"}, {"example_id": 500, "predicted_answer": "B"}]
tests/test_evaluator.py ADDED
@@ -0,0 +1,52 @@
+ """Tests for evaluator.score_submission().
+
+ Run from the EgoMemReason-Space/ directory:
+     python -m pytest tests/ -q
+ """
+
+ import json
+ import os
+ import pathlib
+ import sys
+
+ import pytest
+
+ ROOT = pathlib.Path(__file__).resolve().parents[1]
+ sys.path.insert(0, str(ROOT))
+
+ import evaluator
+
+ ANN = ROOT / "annotations_private.json"
+ ORACLE = ROOT / "tests" / "fixtures" / "oracle_submission.json"
+ ALL_A = ROOT / "tests" / "fixtures" / "all_a_submission.json"
+
+
+ pytestmark = pytest.mark.skipif(
+     not ANN.exists(),
+     reason=f"{ANN} not present (copy from ../EgoMemReason-EvalAI.archived/)",
+ )
+
+
+ def test_oracle_scores_100():
+     metrics = evaluator.score_submission(str(ORACLE), str(ANN))
+     for k, v in metrics.items():
+         assert v == 100.0, f"{k} should be 100.0, got {v}"
+
+
+ def test_all_a_scores_around_14():
+     # All-A's exact score depends on the A-letter frequency in the dataset
+     # — we measured 14.2% during the EvalAI port. Allow a wide band.
+     metrics = evaluator.score_submission(str(ALL_A), str(ANN))
+     assert 10.0 <= metrics["Overall"] <= 20.0, metrics
+
+
+ def test_broken_submission_raises(tmp_path):
+     broken = tmp_path / "broken.json"
+     # Bogus letter + only 1 row.
+     rows = [{"example_id": 1, "predicted_answer": "ZZ"}]
+     with broken.open("w") as f:  # closed (and flushed) before scoring
+         json.dump(rows, f)
+     with pytest.raises(ValueError) as exc:
+         evaluator.score_submission(str(broken), str(ANN))
+     assert "must be one of" in str(exc.value)
+     assert "missing" in str(exc.value)