
finmmeval-task2-final-portal / README.md

Zhuohan

Update Task 2 final deadline to 25 May 2026 AoE

480ba45 verified 1 day ago


metadata

title: FinMMEval Task 2 Final Test Portal
emoji: 📊
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false

Task 2 November Test Preparation

This folder prepares the FinMMEval Task 2 held-out test files from:

TheFinAI/PolyFiQA-Easy-November
TheFinAI/PolyFiQA-Expert-November

Task definition references:

FinMMEval Lab task paper: https://arxiv.org/pdf/2602.10886
PolyFiQA / MultiFinBen source paper: https://arxiv.org/pdf/2506.14028

The task asks systems to answer questions using English 10-K/10-Q excerpts and multilingual news in English, Chinese, Japanese, Spanish, and Greek. The answer should be concise, evidence-grounded, and no more than 100 words. The primary metric is ROUGE-1; BLEURT and factual consistency are described as secondary measures in the FinMMEval task paper.

Both source repositories are private and contain gold answers. Do not publish files under outputs/gold_private, outputs/raw_private, or outputs/removed_public_overlap.

Clean Test Construction

The November datasets contain rows that overlap with the public PolyFiQA release:

TheFinAI/PolyFiQA-Easy
TheFinAI/PolyFiQA-Expert

prepare_task2_test.py removes rows whose source task_id or full query appears in the corresponding public dataset. The clean held-out test set is:

Easy: 128 rows
Expert: 128 rows
Total: 256 rows

The public release files do not include gold answers.
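The overlap rule described above can be sketched as follows. This is a minimal illustration, not the code in prepare_task2_test.py: the function name remove_public_overlap is hypothetical, the rows are toy data, and it assumes each row carries task_id and query fields and that overlap is decided by exact string match.

```python
from typing import Iterable


def remove_public_overlap(private_rows: Iterable[dict], public_rows: Iterable[dict]) -> list:
    """Keep private rows whose task_id and query are both absent from the public rows."""
    public_rows = list(public_rows)
    public_ids = {row["task_id"] for row in public_rows}
    public_queries = {row["query"] for row in public_rows}
    return [
        row
        for row in private_rows
        if row["task_id"] not in public_ids and row["query"] not in public_queries
    ]


# Toy rows (hypothetical data, not taken from the real datasets).
public = [{"task_id": "a1", "query": "Q one"}]
private = [
    {"task_id": "a1", "query": "Q one (edited)"},  # dropped: task_id overlaps
    {"task_id": "b2", "query": "Q one"},           # dropped: full query overlaps
    {"task_id": "c3", "query": "Q three"},         # kept
]
clean = remove_public_overlap(private, public)
print([row["task_id"] for row in clean])
```

A row is dropped if either key matches, which errs on the side of removing too much rather than leaking a public example into the held-out set.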

Files

outputs/release/task2_test_questions.jsonl: Public questions for participants.
outputs/release/task2_submission_template.jsonl: Participant submission template with empty answers.
outputs/gold_private/task2_test_gold.jsonl: Private organizer gold answers.
outputs/task2_test_summary.json: Counts and construction summary.

Public question files expose only task_id, tier, query, and question. They intentionally do not expose answers, original source row indices, or source task IDs.
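A field allow-list check along these lines can catch accidental leaks before release. It is only a sketch: find_leaky_rows is a hypothetical helper, the sample rows are invented, and it is not claimed to match what check_public_release.py does.

```python
import json

# Fields the public question file is allowed to expose, per the description above.
ALLOWED_KEYS = {"task_id", "tier", "query", "question"}


def find_leaky_rows(jsonl_text: str) -> list:
    """Return (line_number, unexpected_keys) for each row with keys outside ALLOWED_KEYS."""
    problems = []
    for line_number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        extra = sorted(set(json.loads(line)) - ALLOWED_KEYS)
        if extra:
            problems.append((line_number, extra))
    return problems


# Invented sample: the second row wrongly carries an answer field.
sample = "\n".join([
    json.dumps({"task_id": "easy_0001", "tier": "easy", "query": "...", "question": "..."}),
    json.dumps({"task_id": "easy_0002", "tier": "easy", "query": "...", "question": "...", "answer": "leak"}),
])
print(find_leaky_rows(sample))
```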

Submission Format

Participants should submit JSONL with one row per task:

{"task_id": "easy_0001", "answer": "..."}

The portal only accepts .jsonl uploads. A submission is marked complete only when every expected task_id appears exactly once, all answers are non-empty, and there are no extra task IDs. Answers longer than the 100-word guideline are reported as a warning, not as a hard format failure, because reference-style evidence answers can exceed this limit in practice.
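The completeness rules above can be expressed as a short check. This sketch assumes every non-blank line is valid JSON; check_submission and the report keys are hypothetical names, and IDs other than easy_0001 are invented for illustration. The portal's actual implementation may differ.

```python
import json
from collections import Counter

WORD_LIMIT = 100  # guideline only: exceeding it produces a warning, not a failure


def check_submission(jsonl_text: str, expected_ids: set) -> dict:
    """Report missing, extra, duplicate, and blank task IDs plus over-length answers."""
    rows = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    ids = [str(row.get("task_id", "")) for row in rows]
    answers = [str(row.get("answer") or "") for row in rows]
    counts = Counter(ids)
    seen = set(counts)
    report = {
        "missing": sorted(expected_ids - seen),
        "extra": sorted(seen - expected_ids),
        "duplicates": sorted(i for i, n in counts.items() if n > 1),
        "blank": sorted({i for i, a in zip(ids, answers) if not a.strip()}),
        "over_word_limit": sorted({i for i, a in zip(ids, answers) if len(a.split()) > WORD_LIMIT}),
    }
    # Over-length answers are deliberately excluded from the completeness decision.
    report["complete"] = not (
        report["missing"] or report["extra"] or report["duplicates"] or report["blank"]
    )
    return report


expected = {"easy_0001", "easy_0002", "expert_0001"}
submission = "\n".join([
    '{"task_id": "easy_0001", "answer": "- Answer: ..."}',
    '{"task_id": "easy_0001", "answer": "duplicate row"}',
    '{"task_id": "expert_0001", "answer": "   "}',
    '{"task_id": "expert_9999", "answer": "unexpected id"}',
])
print(check_submission(submission, expected))
```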

Answers should follow the tier-specific PolyFiQA format:

Easy:
- Answer: ...
- News Evidence: ...
Expert:
- Answer: ...
- Financial Statements Evidence: ...

If the question cannot be answered or no relevant evidence is found, participants should write None.
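One way to produce answers in this shape is a small formatter such as the sketch below. Several details are assumptions rather than documented behavior: the helper name format_answer, the lowercase tier values "easy" and "expert", the newline between the two fields, and writing None per field when a part is unavailable.

```python
import json
from typing import Optional

# Evidence label for each tier, following the format shown above.
EVIDENCE_LABEL = {
    "easy": "News Evidence",
    "expert": "Financial Statements Evidence",
}


def format_answer(tier: str, answer: Optional[str], evidence: Optional[str]) -> str:
    """Build a two-line answer string, writing None for any missing part (assumption)."""
    label = EVIDENCE_LABEL[tier]
    return f"- Answer: {answer or 'None'}\n- {label}: {evidence or 'None'}"


# One JSONL submission row; the answer text is invented for illustration.
row = {
    "task_id": "easy_0001",
    "answer": format_answer("easy", "Revenue rose.", None),
}
print(json.dumps(row, ensure_ascii=False))
```

Passing ensure_ascii=False keeps any non-English characters readable in the output file; json.dumps escapes the embedded newline so each row stays on one line.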

Rebuild

Rebuild the test files using the local Keychain-managed Hugging Face read token:

python3 "/Users/Zhuohan Xie/.codex/skills/keychain-secret-manager/scripts/secret_keychain.py" run --service hf --profile read -- \
  python3 task2_november_test/prepare_task2_test.py

Evaluate

python3 task2_november_test/evaluate_task2_submission.py \
  --submission path/to/submission.jsonl

The evaluator reports ROUGE-1 precision, recall, and F1 overall and by tier. The main ranking metric should be confirmed before public launch; ROUGE-1 F1 is currently implemented as the default practical metric.
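For reference, ROUGE-1 compares unigram counts between a candidate and a reference answer. The sketch below shows the arithmetic under a simple assumption: lowercase, regex word tokenization. The function name rouge1 is hypothetical, and evaluate_task2_submission.py may tokenize differently, especially for non-English text.

```python
import re
from collections import Counter


def rouge1(candidate: str, reference: str) -> dict:
    """Unigram-overlap precision, recall, and F1 with clipped counts."""
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# 4 of 4 candidate tokens match; 4 of 6 reference tokens are covered.
scores = rouge1("revenue rose in 2023", "revenue rose sharply in fiscal 2023")
print(scores)
```

In this toy pair precision is 1.0 and recall is 4/6, so F1 is 0.8; a short but accurate answer is penalized on recall, which is why the choice between recall and F1 for ranking matters.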

Hidden Status Portal

portal.py provides a lightweight final-test submission portal. It accepts participant JSONL uploads and displays only completion status:

team name
update time
overall answered coverage
Easy answered coverage
Expert answered coverage
completed yes/no
non-answer-leaking validation status

It does not expose rank, ROUGE, gold answers, per-item correctness, or score breakdowns through the page or public API. Registered email is collected for deduplication and organizer-side contact, but is not displayed in the public status table or public status API.

Optional automatic format-check notification emails can be enabled by setting the following SMTP environment variables:

FINMMEVAL_NOTIFY_SMTP_HOST=<smtp-host>
FINMMEVAL_NOTIFY_SMTP_PORT=587
FINMMEVAL_NOTIFY_SMTP_USERNAME=<smtp-username>
FINMMEVAL_NOTIFY_SMTP_PASSWORD=<smtp-password-or-app-password>
FINMMEVAL_NOTIFY_FROM="FinMMEval Organizers <no-reply@example.org>"
FINMMEVAL_NOTIFY_REPLY_TO=<organizer-contact-email>
FINMMEVAL_SUBMISSION_DEADLINE_TEXT="25 May 2026 AoE"

When configured, submissions that fail the final-test format check automatically trigger an email to the registered participant address. The email reports only non-answer-leaking diagnostics such as missing task IDs, extra IDs, duplicate IDs, blank answers, and coverage. Scores and ranks remain hidden.
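A notification of this kind could be assembled with the standard library as sketched below. The environment variable names come from the list above; everything else is an assumption for illustration, including the function names, the subject and body wording, and the use of STARTTLS (suggested by port 587 but not confirmed). The example builds a message without sending it.

```python
import os
import smtplib
from email.message import EmailMessage
from typing import List, Mapping


def build_notification(to_address: str, diagnostics: List[str],
                       env: Mapping[str, str] = os.environ) -> EmailMessage:
    """Compose a format-check email containing only non-answer-leaking diagnostics."""
    message = EmailMessage()
    message["From"] = env["FINMMEVAL_NOTIFY_FROM"]
    message["To"] = to_address
    if env.get("FINMMEVAL_NOTIFY_REPLY_TO"):
        message["Reply-To"] = env["FINMMEVAL_NOTIFY_REPLY_TO"]
    message["Subject"] = "FinMMEval Task 2: submission format check failed"
    deadline = env.get("FINMMEVAL_SUBMISSION_DEADLINE_TEXT", "the announced deadline")
    lines = ["Your submission did not pass the format check:", ""]
    lines += [f"- {item}" for item in diagnostics]
    lines += ["", f"Please resubmit before {deadline}."]
    message.set_content("\n".join(lines))
    return message


def send_notification(message: EmailMessage, env: Mapping[str, str] = os.environ) -> None:
    """Send via SMTP with STARTTLS (assumed transport for port 587)."""
    port = int(env.get("FINMMEVAL_NOTIFY_SMTP_PORT", "587"))
    with smtplib.SMTP(env["FINMMEVAL_NOTIFY_SMTP_HOST"], port) as server:
        server.starttls()
        server.login(env["FINMMEVAL_NOTIFY_SMTP_USERNAME"], env["FINMMEVAL_NOTIFY_SMTP_PASSWORD"])
        server.send_message(message)


# Build (but do not send) a message using placeholder settings.
demo_env = {
    "FINMMEVAL_NOTIFY_FROM": "FinMMEval Organizers <no-reply@example.org>",
    "FINMMEVAL_NOTIFY_REPLY_TO": "organizers@example.org",
    "FINMMEVAL_SUBMISSION_DEADLINE_TEXT": "25 May 2026 AoE",
}
demo = build_notification("team@example.org", ["missing task IDs: easy_0002"], env=demo_env)
print(demo["Subject"])
```

Keeping message construction separate from sending makes it easy to verify that no score or rank text can reach the email body.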

Run locally:

PORT=8093 python3 task2_november_test/portal.py

Open:

http://127.0.0.1:8093/task2/test

Public status API:

http://127.0.0.1:8093/task2/test/api/status

The portal stores local uploads and private evaluation records under task2_november_test/portal_data/, which is ignored by git. For a public deployment, do not upload outputs/gold_private/, outputs/raw_private/, or outputs/removed_public_overlap/ to a public repository.

Before publishing participant-facing files, run:

python3 task2_november_test/check_public_release.py

Notes

The published task/source papers describe PolyFiQA-Easy and PolyFiQA-Expert as 172 examples per tier. The November private repositories currently contain 204 rows per tier, and 76 rows per tier overlap with the public PolyFiQA release. For the CLEF held-out test, this preparation keeps only the non-overlapping November rows, resulting in 128 clean examples per tier. Public announcements should therefore refer to this as the organizer-held Task 2 test set rather than reusing the paper-level 172-example count.