finmmeval-task2-final-portal / README.md

Zhuohan

Update Task 2 final deadline to 25 May 2026 AoE

480ba45 verified 1 day ago


	---
	title: FinMMEval Task 2 Final Test Portal
	emoji: 📊
	colorFrom: blue
	colorTo: indigo
	sdk: docker
	app_port: 7860
	pinned: false
	---

	# Task 2 November Test Preparation

	This folder prepares the FinMMEval Task 2 held-out test files from:

	- `TheFinAI/PolyFiQA-Easy-November`
	- `TheFinAI/PolyFiQA-Expert-November`

	Task definition references:

	- FinMMEval Lab task paper: https://arxiv.org/pdf/2602.10886
	- PolyFiQA / MultiFinBen source paper: https://arxiv.org/pdf/2506.14028

	The task asks systems to answer questions using English 10-K/10-Q excerpts
	and multilingual news in English, Chinese, Japanese, Spanish, and Greek. The
	answer should be concise, evidence-grounded, and no more than 100 words. The
	primary metric is ROUGE-1; BLEURT and factual consistency are described as
	secondary measures in the FinMMEval task paper.

	Both source repositories are private and contain gold answers. Do not publish
	files under `outputs/gold_private`, `outputs/raw_private`, or
	`outputs/removed_public_overlap`.

	## Clean Test Construction

	The November datasets contain rows that overlap with the public PolyFiQA
	release:

	- `TheFinAI/PolyFiQA-Easy`
	- `TheFinAI/PolyFiQA-Expert`

	`prepare_task2_test.py` removes rows whose source `task_id` or full `query`
	appears in the corresponding public dataset. The clean held-out test set is:

	- Easy: 128 rows
	- Expert: 128 rows
	- Total: 256 rows

	The public release files do not include the `answer` field.
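The overlap filter described above can be sketched as follows. This is a minimal illustration, not the code in `prepare_task2_test.py`: the function name `remove_public_overlap` is hypothetical, rows are assumed to be dictionaries with `task_id` and `query` keys, and matching is assumed to be exact string equality.

```python
def remove_public_overlap(private_rows, public_rows):
    """Drop private rows whose task_id or full query appears in the public rows."""
    public_ids = {row["task_id"] for row in public_rows}
    public_queries = {row["query"] for row in public_rows}
    kept, removed = [], []
    for row in private_rows:
        if row["task_id"] in public_ids or row["query"] in public_queries:
            removed.append(row)
        else:
            kept.append(row)
    return kept, removed


# Toy rows for illustration only; real rows come from the private repositories.
public = [{"task_id": "a1", "query": "What was revenue in Q1?"}]
private = [
    {"task_id": "a1", "query": "Reworded question"},        # same task_id
    {"task_id": "b7", "query": "What was revenue in Q1?"},  # same query
    {"task_id": "c3", "query": "What drove margin growth?"},
]
kept, removed = remove_public_overlap(private, public)
print([row["task_id"] for row in kept])  # ['c3']
```

Returning the removed rows as well as the kept ones makes it easy to audit the overlap count against the construction summary.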

	## Files

	- `outputs/release/task2_test_questions.jsonl`: public questions for participants.
	- `outputs/release/task2_submission_template.jsonl`: participant submission template with empty answers.
	- `outputs/gold_private/task2_test_gold.jsonl`: private organizer gold answers.
	- `outputs/task2_test_summary.json`: counts and construction summary.

	Public question files expose only `task_id`, `tier`, `query`, and `question`.
	They intentionally do not expose answers, original source row indices, or source
	task IDs.

	## Submission Format

	Participants should submit JSONL with one row per task:

	```json
	{"task_id": "easy_0001", "answer": "..."}
	```

	The portal only accepts `.jsonl` uploads. A submission is marked complete only
	when every expected `task_id` appears exactly once, all answers are non-empty,
	and there are no extra task IDs. Answers longer than the 100-word guideline are
	reported as a warning, not as a hard format failure, because reference-style
	evidence answers can exceed this limit in practice.
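The completeness rules above can be sketched as a small checker. This is an illustrative approximation, not the portal's actual validation code: the function name `check_submission` and the report keys are hypothetical, and word counts are assumed to be whitespace-delimited.

```python
import json

WORD_GUIDELINE = 100  # soft limit from the task description


def check_submission(lines, expected_ids):
    """Return (complete, report) for an iterable of JSONL lines."""
    seen = {}
    blank, over_length = [], []
    for line in lines:
        if not line.strip():
            continue  # skip empty lines between rows
        row = json.loads(line)
        task_id = row.get("task_id")
        answer = str(row.get("answer") or "").strip()
        seen[task_id] = seen.get(task_id, 0) + 1
        if not answer:
            blank.append(task_id)
        elif len(answer.split()) > WORD_GUIDELINE:
            over_length.append(task_id)  # warning only, not a failure
    expected = set(expected_ids)
    report = {
        "missing": sorted(expected - set(seen), key=str),
        "extra": sorted(set(seen) - expected, key=str),
        "duplicate": sorted((t for t, n in seen.items() if n > 1), key=str),
        "blank": blank,
        "over_length_warning": over_length,
    }
    complete = not (
        report["missing"] or report["extra"] or report["duplicate"] or report["blank"]
    )
    return complete, report


ok, report = check_submission(
    ['{"task_id": "easy_0001", "answer": "Answer: ..."}'],
    expected_ids=["easy_0001", "expert_0001"],
)
print(ok, report["missing"])  # False ['expert_0001']
```

Keeping the length check separate from the completeness flag mirrors the rule that over-length answers produce a warning rather than a rejection.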

	Answers should follow the tier-specific PolyFiQA format:

	- Easy:
	  - `Answer: ...`
	  - `News Evidence: ...`
	- Expert:
	  - `Answer: ...`
	  - `Financial Statements Evidence: ...`

	If the question cannot be answered or no relevant evidence is found, participants
	should write `None`.
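One possible way to assemble an answer string in this format is sketched below. Several details are assumptions rather than confirmed conventions: the helper name `format_answer` is hypothetical, the tier values are assumed to be the lowercase strings `easy` and `expert`, the two fields are assumed to be separated by a newline, and `None` is assumed to be written per field when that part is unavailable.

```python
# Evidence label per tier, taken from the format list above.
EVIDENCE_LABEL = {
    "easy": "News Evidence",
    "expert": "Financial Statements Evidence",
}


def format_answer(tier, answer=None, evidence=None):
    """Build a tier-specific answer string; missing parts become 'None'."""
    label = EVIDENCE_LABEL[tier]
    return f"Answer: {answer or 'None'}\n{label}: {evidence or 'None'}"


print(format_answer("expert", "Revenue rose 5%.", "10-K, Item 7"))
print(format_answer("easy"))
```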

	## Rebuild

	Use the local Keychain-managed Hugging Face read token:

	```bash
	python3 "/Users/Zhuohan Xie/.codex/skills/keychain-secret-manager/scripts/secret_keychain.py" run --service hf --profile read -- \
	python3 task2_november_test/prepare_task2_test.py
	```

	## Evaluate

	```bash
	python3 task2_november_test/evaluate_task2_submission.py \
	--submission path/to/submission.jsonl
	```

	The evaluator reports ROUGE-1 precision, recall, and F1, both overall and by
	tier. ROUGE-1 F1 is currently implemented as the default ranking metric; the
	main ranking metric should be confirmed before public launch.
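For reference, ROUGE-1 compares unigram overlap between a candidate answer and a reference. The sketch below is a simplified illustration and may differ from `evaluate_task2_submission.py`: it assumes lowercased whitespace tokens with no stemming, and the function name `rouge1` is hypothetical.

```python
from collections import Counter


def rouge1(candidate, reference):
    """Unigram-overlap precision, recall, and F1 with whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


p, r, f = rouge1("net income rose in 2023", "net income rose sharply")
print(round(p, 2), round(r, 2), round(f, 2))  # 0.6 0.75 0.67
```

Here three of the five candidate tokens match the four reference tokens, giving precision 3/5 and recall 3/4.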

	## Hidden Status Portal

	`portal.py` provides a lightweight final-test submission portal. It accepts
	participant JSONL uploads and displays only completion status:

	- team name
	- update time
	- overall answered coverage
	- Easy answered coverage
	- Expert answered coverage
	- completed yes/no
	- non-answer-leaking validation status

	It does not expose rank, ROUGE, gold answers, per-item correctness, or score
	breakdowns through the page or public API. Registered email is collected for
	deduplication and organizer-side contact, but is not displayed in the public
	status table or public status API.
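The coverage fields in the status table can be derived from the answered task IDs alone, without touching gold answers. The following is a minimal sketch, not the logic in `portal.py`: the function name `coverage_status` and the output shape are hypothetical, the `tier` values are assumed to be `easy` and `expert`, and the `completed` flag here ignores the extra-ID and duplicate checks that the real completeness rule also requires.

```python
def coverage_status(answered_ids, questions):
    """Summarize answered coverage overall and per tier without any scores."""
    answered = set(answered_ids)
    status = {}
    for tier in ("overall", "easy", "expert"):
        # "overall" matches every question; otherwise filter by the tier field.
        ids = {q["task_id"] for q in questions if tier in ("overall", q["tier"])}
        status[tier] = {"answered": len(ids & answered), "expected": len(ids)}
    status["completed"] = (
        status["overall"]["answered"] == status["overall"]["expected"]
    )
    return status


questions = [
    {"task_id": "easy_0001", "tier": "easy"},
    {"task_id": "easy_0002", "tier": "easy"},
    {"task_id": "expert_0001", "tier": "expert"},
]
status = coverage_status(["easy_0001", "expert_0001"], questions)
print(status["easy"], status["completed"])  # {'answered': 1, 'expected': 2} False
```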

	Optional automatic format-check notification emails can be enabled by setting
	the following SMTP environment variables:

	```bash
	FINMMEVAL_NOTIFY_SMTP_HOST=<smtp-host>
	FINMMEVAL_NOTIFY_SMTP_PORT=587
	FINMMEVAL_NOTIFY_SMTP_USERNAME=<smtp-username>
	FINMMEVAL_NOTIFY_SMTP_PASSWORD=<smtp-password-or-app-password>
	FINMMEVAL_NOTIFY_FROM="FinMMEval Organizers <no-reply@example.org>"
	FINMMEVAL_NOTIFY_REPLY_TO=<organizer-contact-email>
	FINMMEVAL_SUBMISSION_DEADLINE_TEXT="25 May 2026 AoE"
	```

	When configured, submissions that fail the final-test format check automatically
	trigger an email to the registered participant address. The email reports only
	non-answer-leaking diagnostics such as missing task IDs, extra IDs, duplicate
	IDs, blank answers, and coverage. Scores and ranks remain hidden.
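A notification of this kind could be composed and sent with the Python standard library, as sketched below. This is an assumption-laden illustration rather than the portal's implementation: the function names, subject line, and body layout are hypothetical, and the use of STARTTLS on the configured port is assumed, not confirmed.

```python
import os
import smtplib
from email.message import EmailMessage


def build_format_notice(to_addr, report, env=os.environ):
    """Compose a format-check email from a diagnostics dict; no scores included."""
    msg = EmailMessage()
    msg["From"] = env["FINMMEVAL_NOTIFY_FROM"]
    msg["To"] = to_addr
    if env.get("FINMMEVAL_NOTIFY_REPLY_TO"):
        msg["Reply-To"] = env["FINMMEVAL_NOTIFY_REPLY_TO"]
    msg["Subject"] = "FinMMEval Task 2: submission format check failed"
    lines = [f"{key}: {value}" for key, value in report.items()]
    deadline = env.get("FINMMEVAL_SUBMISSION_DEADLINE_TEXT", "")
    if deadline:
        lines.append(f"Submission deadline: {deadline}")
    msg.set_content("\n".join(lines))
    return msg


def send_notice(msg, env=os.environ):
    """Send the message over SMTP; STARTTLS is an assumption for port 587."""
    host = env["FINMMEVAL_NOTIFY_SMTP_HOST"]
    port = int(env.get("FINMMEVAL_NOTIFY_SMTP_PORT", "587"))
    with smtplib.SMTP(host, port) as smtp:
        smtp.starttls()
        smtp.login(
            env["FINMMEVAL_NOTIFY_SMTP_USERNAME"],
            env["FINMMEVAL_NOTIFY_SMTP_PASSWORD"],
        )
        smtp.send_message(msg)
```

Separating message construction from sending allows the email body to be inspected in tests without an SMTP server.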

	Run locally:

	```bash
	PORT=8093 python3 task2_november_test/portal.py
	```

	Open:

	```text
	http://127.0.0.1:8093/task2/test
	```

	Public status API:

	```text
	http://127.0.0.1:8093/task2/test/api/status
	```

	The portal stores local uploads and private evaluation records under
	`task2_november_test/portal_data/`, which is ignored by git. For a public
	deployment, do not upload `outputs/gold_private/`, `outputs/raw_private/`, or
	`outputs/removed_public_overlap/` to a public repository.

	Before publishing participant-facing files, run:

	```bash
	python3 task2_november_test/check_public_release.py
	```

	## Notes

	The published task/source papers describe PolyFiQA-Easy and PolyFiQA-Expert as
	having 172 examples per tier. The November private repositories currently
	contain 204 rows per tier, of which 76 overlap with the public PolyFiQA release.
	For the CLEF held-out test, this preparation keeps only the non-overlapping
	November rows, leaving 204 - 76 = 128 clean examples per tier. Public
	announcements should therefore refer to this as the organizer-held Task 2 test
	set rather than reusing the paper-level 172-example count.