---
title: FinMMEval Task 2 Final Test Portal
emoji: 📊
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
---

# Task 2 November Test Preparation

This folder prepares the FinMMEval Task 2 held-out test files from:

- `TheFinAI/PolyFiQA-Easy-November`
- `TheFinAI/PolyFiQA-Expert-November`

Task definition references:

- FinMMEval Lab task paper: https://arxiv.org/pdf/2602.10886
- PolyFiQA / MultiFinBen source paper: https://arxiv.org/pdf/2506.14028

The task asks systems to answer questions using English 10-K/10-Q excerpts
and multilingual news in English, Chinese, Japanese, Spanish, and Greek. The
answer should be concise, evidence-grounded, and no more than 100 words. The
primary metric is ROUGE-1; BLEURT and factual consistency are described as
secondary measures in the FinMMEval task paper.

Both source repositories are private and contain gold answers. Do not publish
files under `outputs/gold_private`, `outputs/raw_private`, or
`outputs/removed_public_overlap`.

## Clean Test Construction

The November datasets contain rows that overlap with the public PolyFiQA
release:

- `TheFinAI/PolyFiQA-Easy`
- `TheFinAI/PolyFiQA-Expert`

`prepare_task2_test.py` removes rows whose source `task_id` or full `query`
appears in the corresponding public dataset. The clean held-out test set is:

- Easy: 128 rows
- Expert: 128 rows
- Total: 256 rows

The public release files do not include `answer`.

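The overlap filter described above can be sketched as follows. This is a minimal illustration, not the code in `prepare_task2_test.py`: the function name `remove_public_overlap` and the toy rows are hypothetical, and it assumes each row is a dict with `task_id` and `query` keys and that queries are compared by exact string match.

```python
def remove_public_overlap(private_rows, public_rows):
    """Split private rows into (clean, removed).

    A row counts as overlapping when its task_id OR its full query string
    appears in the public dataset (exact match; an assumption of this sketch).
    """
    public_ids = {row["task_id"] for row in public_rows}
    public_queries = {row["query"] for row in public_rows}
    clean, removed = [], []
    for row in private_rows:
        overlaps = row["task_id"] in public_ids or row["query"] in public_queries
        (removed if overlaps else clean).append(row)
    return clean, removed


# Toy rows for illustration only; not real dataset content.
public = [{"task_id": "a1", "query": "Q one"}]
private = [
    {"task_id": "a1", "query": "Q one (reworded)"},  # same task_id
    {"task_id": "b7", "query": "Q one"},             # same full query
    {"task_id": "c3", "query": "Q three"},           # no overlap
]
clean, removed = remove_public_overlap(private, public)
print(len(clean), len(removed))
```

With the counts given in the Notes section (204 November rows and 76 overlapping rows per tier), a filter of this shape leaves 204 - 76 = 128 rows per tier.
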
## Files

- `outputs/release/task2_test_questions.jsonl`
  Public questions for participants.
- `outputs/release/task2_submission_template.jsonl`
  Participant submission template with empty answers.
- `outputs/gold_private/task2_test_gold.jsonl`
  Private organizer gold answers.
- `outputs/task2_test_summary.json`
  Counts and construction summary.

Public question files expose only `task_id`, `tier`, `query`, and `question`.
They intentionally do not expose answers, original source row indices, or source
task IDs.

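One way to derive the public rows and the submission template from a gold row is an explicit allow-list of fields, so that any private column is dropped by default. The sketch below is illustrative only: the helper names, the `source_task_id` and `source_index` keys, and the sample values are hypothetical and are not taken from the actual scripts.

```python
import json

# Only these keys may appear in the public question file.
PUBLIC_FIELDS = ("task_id", "tier", "query", "question")


def to_public_row(gold_row):
    """Keep only allow-listed fields; everything else is dropped."""
    return {key: gold_row[key] for key in PUBLIC_FIELDS}


def to_template_row(gold_row):
    """Submission template row: the task ID with an empty answer."""
    return {"task_id": gold_row["task_id"], "answer": ""}


# Hypothetical gold row; the private key names are assumptions.
gold_row = {
    "task_id": "easy_0001",
    "tier": "easy",
    "query": "...",
    "question": "...",
    "answer": "...",
    "source_task_id": "...",
    "source_index": 0,
}
print(json.dumps(to_public_row(gold_row), ensure_ascii=False))
print(json.dumps(to_template_row(gold_row), ensure_ascii=False))
```

An allow-list fails safe: a new private column added upstream never reaches the release files unless it is added to `PUBLIC_FIELDS` deliberately.
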
## Submission Format

Participants should submit JSONL with one row per task:

```json
{"task_id": "easy_0001", "answer": "..."}
```

The portal only accepts `.jsonl` uploads. A submission is marked complete only
when every expected `task_id` appears exactly once, all answers are non-empty,
and there are no extra task IDs. Answers longer than the 100-word guideline are
reported as a warning, not as a hard format failure, because reference-style
evidence answers can exceed this limit in practice.

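Participants may want to run a similar check locally before uploading. The sketch below mirrors the completeness rules stated above; it is not the portal's code. The function name `check_submission`, the report keys, and the `expert_0001`-style IDs are hypothetical, and word counts use whitespace splitting, which is only a rough proxy for Chinese or Japanese text.

```python
import json
from collections import Counter

WORD_LIMIT = 100  # guideline from the task description; exceeding it is a warning


def check_submission(lines, expected_ids):
    """Return a diagnostics dict for an iterable of JSONL lines."""
    ids, blank, too_long = [], [], []
    for line in lines:
        if not line.strip():
            continue  # ignore empty lines in the file
        row = json.loads(line)
        task_id = row["task_id"]
        answer = str(row.get("answer") or "").strip()
        ids.append(task_id)
        if not answer:
            blank.append(task_id)
        elif len(answer.split()) > WORD_LIMIT:
            too_long.append(task_id)
    counts = Counter(ids)
    expected = set(expected_ids)
    report = {
        "missing": sorted(expected - set(counts)),
        "extra": sorted(set(counts) - expected),
        "duplicate": sorted(t for t, n in counts.items() if n > 1),
        "blank": sorted(blank),
        "over_word_limit": sorted(too_long),  # warning only
    }
    # Complete = every expected ID exactly once, no blanks, no extras.
    report["complete"] = not (
        report["missing"] or report["extra"] or report["duplicate"] or report["blank"]
    )
    return report


sample_lines = [
    '{"task_id": "easy_0001", "answer": "Answer: ..."}',
    '{"task_id": "expert_0001", "answer": ""}',
]
sample_report = check_submission(
    sample_lines, ["easy_0001", "expert_0001", "expert_0002"]
)
print(sample_report)
```
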

Answers should follow the tier-specific PolyFiQA format:

- Easy:
  - `Answer: ...`
  - `News Evidence: ...`
- Expert:
  - `Answer: ...`
  - `Financial Statements Evidence: ...`

If the question cannot be answered or no relevant evidence is found, participants
should write `None`.

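A small helper can keep the two-part answer layout consistent across tiers. This sketch rests on several assumptions that the text above does not settle: that the two parts are separated by a newline, that `None` is written per part rather than for the whole answer, and that tiers are identified by the lowercase strings `easy` and `expert`. The function name is hypothetical.

```python
# Evidence label for each tier, taken from the format list above.
EVIDENCE_LABEL = {
    "easy": "News Evidence",
    "expert": "Financial Statements Evidence",
}


def format_answer(tier, answer, evidence):
    """Build the answer string; empty parts become the literal 'None'."""
    label = EVIDENCE_LABEL[tier]
    answer_text = answer.strip() or "None"
    evidence_text = evidence.strip() or "None"
    return f"Answer: {answer_text}\n{label}: {evidence_text}"


print(format_answer("easy", "Yes", ""))
print(format_answer("expert", "Revenue increased.", "Item 7 reports higher net sales."))
```
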
## Rebuild

Use the local Keychain-managed Hugging Face read token:

```bash
python3 "/Users/Zhuohan Xie/.codex/skills/keychain-secret-manager/scripts/secret_keychain.py" run --service hf --profile read -- \
  python3 task2_november_test/prepare_task2_test.py
```

## Evaluate

```bash
python3 task2_november_test/evaluate_task2_submission.py \
  --submission path/to/submission.jsonl
```

The evaluator reports ROUGE-1 precision, recall, and F1 overall and by tier.
The main ranking metric should be confirmed before public launch; ROUGE-1 F1 is
currently implemented as the default practical metric.

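For orientation, ROUGE-1 is the clipped unigram overlap between a candidate and a reference, reported as precision, recall, and F1. The sketch below shows that definition only; it is not `evaluate_task2_submission.py`. Lowercasing and whitespace tokenization are assumptions made here, and the real evaluator may tokenize differently, which matters in particular for Chinese and Japanese text.

```python
from collections import Counter


def rouge1(candidate, reference):
    """Return (precision, recall, f1) from clipped unigram overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


# Toy strings for illustration; 4 of 4 candidate tokens match, 4 of 5 reference tokens.
p, r, f = rouge1("revenue rose in 2023", "revenue rose sharply in 2023")
print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")  # P=1.000 R=0.800 F1=0.889
```
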
## Hidden Status Portal

`portal.py` provides a lightweight final-test submission portal. It accepts
participant JSONL uploads and displays only completion status:

- team name
- update time
- overall answered coverage
- Easy answered coverage
- Expert answered coverage
- completed yes/no
- non-answer-leaking validation status

It does not expose rank, ROUGE, gold answers, per-item correctness, or score
breakdowns through the page or public API. Registered email is collected for
deduplication and organizer-side contact, but is not displayed in the public
status table or public status API.

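The coverage figures in the status table could be computed along these lines. This is a sketch under assumptions, not the logic in `portal.py`: the function name and the two input mappings are hypothetical, and an item counts as answered when its answer is non-empty after stripping whitespace. Nothing in the result depends on gold answers, so it is safe to show publicly.

```python
def answered_coverage(answers, expected_tiers):
    """Return (overall, by_tier) answered fractions.

    answers:        task_id -> submitted answer text
    expected_tiers: task_id -> tier name, one entry per expected item (non-empty)
    """
    totals, answered = {}, {}
    for task_id, tier in expected_tiers.items():
        totals[tier] = totals.get(tier, 0) + 1
        if answers.get(task_id, "").strip():
            answered[tier] = answered.get(tier, 0) + 1
    by_tier = {tier: answered.get(tier, 0) / totals[tier] for tier in totals}
    overall = sum(answered.values()) / sum(totals.values())
    return overall, by_tier


# Toy data: IDs other than easy_0001 are made up for illustration.
expected_tiers = {"easy_0001": "easy", "easy_0002": "easy", "expert_0001": "expert"}
answers = {"easy_0001": "Answer: ...", "expert_0001": "   "}
overall, by_tier = answered_coverage(answers, expected_tiers)
print(overall, by_tier)
```
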

Optional automatic format-check notification emails can be enabled with SMTP
environment variables:

```bash
FINMMEVAL_NOTIFY_SMTP_HOST=<smtp-host>
FINMMEVAL_NOTIFY_SMTP_PORT=587
FINMMEVAL_NOTIFY_SMTP_USERNAME=<smtp-username>
FINMMEVAL_NOTIFY_SMTP_PASSWORD=<smtp-password-or-app-password>
FINMMEVAL_NOTIFY_FROM="FinMMEval Organizers <no-reply@example.org>"
FINMMEVAL_NOTIFY_REPLY_TO=<organizer-contact-email>
FINMMEVAL_SUBMISSION_DEADLINE_TEXT="25 May 2026 AoE"
```

When configured, submissions that fail the final-test format check automatically
trigger an email to the registered participant address. The email reports only
non-answer-leaking diagnostics such as missing task IDs, extra IDs, duplicate
IDs, blank answers, and coverage. Scores and ranks remain hidden.

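As a rough picture of how these variables might be consumed, the sketch below builds a diagnostics-only message with the standard library and sends it over SMTP. It is an assumption-laden illustration, not the portal's implementation: the function names, the subject line, the body layout, and the use of STARTTLS on the configured port are all hypothetical. Only the environment variable names come from the list above.

```python
import os
import smtplib
from email.message import EmailMessage


def build_notification(to_address, diagnostics, env=os.environ):
    """Build an email that carries format diagnostics only (no scores)."""
    msg = EmailMessage()
    msg["From"] = env["FINMMEVAL_NOTIFY_FROM"]
    msg["To"] = to_address
    if env.get("FINMMEVAL_NOTIFY_REPLY_TO"):
        msg["Reply-To"] = env["FINMMEVAL_NOTIFY_REPLY_TO"]
    msg["Subject"] = "FinMMEval Task 2: submission format check"  # hypothetical
    lines = [f"- {key}: {value}" for key, value in diagnostics.items()]
    deadline = env.get("FINMMEVAL_SUBMISSION_DEADLINE_TEXT")
    if deadline:
        lines.append(f"Submission deadline: {deadline}")
    msg.set_content("\n".join(lines))
    return msg


def send_notification(msg, env=os.environ):
    """Send via SMTP; STARTTLS is assumed here, matching the usual port 587."""
    host = env["FINMMEVAL_NOTIFY_SMTP_HOST"]
    port = int(env.get("FINMMEVAL_NOTIFY_SMTP_PORT", "587"))
    with smtplib.SMTP(host, port) as smtp:
        smtp.starttls()
        smtp.login(
            env["FINMMEVAL_NOTIFY_SMTP_USERNAME"],
            env["FINMMEVAL_NOTIFY_SMTP_PASSWORD"],
        )
        smtp.send_message(msg)


# Build (but do not send) a message from a stand-in environment.
demo_env = {
    "FINMMEVAL_NOTIFY_FROM": "FinMMEval Organizers <no-reply@example.org>",
    "FINMMEVAL_SUBMISSION_DEADLINE_TEXT": "25 May 2026 AoE",
}
demo_msg = build_notification("team@example.org", {"missing": 2, "blank": 1}, env=demo_env)
print(demo_msg.get_content())
```

Keeping message construction separate from sending makes it possible to inspect the exact email body in tests without an SMTP server.
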

Run locally:

```bash
PORT=8093 python3 task2_november_test/portal.py
```

Open:

```text
http://127.0.0.1:8093/task2/test
```

Public status API:

```text
http://127.0.0.1:8093/task2/test/api/status
```

The portal stores local uploads and private evaluation records under
`task2_november_test/portal_data/`, which is ignored by git. For a public
deployment, do not upload `outputs/gold_private/`, `outputs/raw_private/`, or
`outputs/removed_public_overlap/` to a public repository.

Before publishing participant-facing files, run:

```bash
python3 task2_november_test/check_public_release.py
```

## Notes

The published task/source papers describe PolyFiQA-Easy and PolyFiQA-Expert as
172 examples per tier. The November private repositories currently contain 204
rows per tier, and 76 rows per tier overlap with the public PolyFiQA release.
For the CLEF held-out test, this preparation keeps only the non-overlapping
November rows, resulting in 128 clean examples per tier. Public announcements
should therefore refer to this as the organizer-held Task 2 test set rather than
reusing the paper-level 172-example count.