# Temporal Twins Croissant Validation Notes

## 1. How to Validate

Use the official MLCommons Croissant tooling after the final release files are hosted.

1. Confirm the hosted URLs in `metadata/temporal_twins_croissant.json` match the current public dataset and code repositories.
2. Validate the file with the official Croissant validator from the MLCommons Croissant project. If you use the web validator, upload the final JSON-LD file or point it at the hosted Croissant URL.
3. As a local smoke check, you can also load the JSON-LD with a JSON parser before running the full validator:

```bash
python3 - <<'PY'
import json
from pathlib import Path
path = Path("metadata/temporal_twins_croissant.json")
with path.open() as f:
    json.load(f)
print("JSON parse OK")
PY
```
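
Between the JSON parse and the full validator run, a lightweight structural pre-check can also flag missing top-level Croissant keys early. The sketch below is not a substitute for the official validator; the key list reflects the Croissant vocabulary as we understand it and is an assumption:

```python
# Expected top-level keys per the Croissant vocabulary (assumed list, not
# authoritative -- the official validator remains the source of truth).
REQUIRED_TOP_LEVEL_KEYS = ["@context", "@type", "name", "distribution", "recordSet"]

def missing_croissant_keys(doc: dict) -> list:
    """Return the expected top-level keys absent from a parsed JSON-LD document."""
    return [key for key in REQUIRED_TOP_LEVEL_KEYS if key not in doc]

# Toy document; a real run would json.load() metadata/temporal_twins_croissant.json.
toy = {"@context": {}, "@type": "sc:Dataset", "name": "temporal-twins"}
print(missing_croissant_keys(toy))  # ['distribution', 'recordSet']
```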

4. After JSON parsing succeeds, run the official Croissant validation step and confirm that the record sets, fields, and distribution references resolve correctly.

## 2. Hosted URLs and Remaining Placeholders

Dataset-side URLs now resolve to:

- Dataset URL: `https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins`
- Croissant metadata URL: `https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/raw/main/metadata/temporal_twins_croissant.json`
- Croissant metadata browser page: `https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/blob/main/metadata/temporal_twins_croissant.json`
- Data URL: `https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/data`
- Results URL: `https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/results`
- Configs URL: `https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/configs`
- Metadata URL: `https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/metadata`
- Release landing URL: `https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins`
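
Because every dataset-side URL above hangs off the same Hugging Face repository root, a small helper can regenerate the list and diff it against the Croissant metadata mechanically. The helper names below are ours, purely illustrative:

```python
# Rebuild the hosted URLs from the single dataset repo root so this list and
# the Croissant metadata can be compared mechanically. Helper names are ours.
HF_ROOT = "https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins"

def tree_url(subdir: str) -> str:
    """Browser URL for a repository subdirectory on the main branch."""
    return f"{HF_ROOT}/tree/main/{subdir}"

def raw_url(path: str) -> str:
    """Direct-download URL for a single repository file."""
    return f"{HF_ROOT}/raw/main/{path}"

for subdir in ("data", "results", "configs", "metadata"):
    print(tree_url(subdir))
print(raw_url("metadata/temporal_twins_croissant.json"))
```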

Code repository URL:

- `https://huggingface.co/temporal-twins-benchmark/temporal-twins-code`

Paper URL status:

- Not available during double-blind review; to be added after publication.

## 3. Release Checklist

- Dataset URL is accessible to reviewers.
- Croissant file validates with the official MLCommons Croissant validator.
- Distribution URLs resolve to the intended hosted artifacts.
- Record-set columns match the actual hosted files.
- RAI fields are present.
- Dataset license is present (`CC-BY-4.0`).
- Code repository license is present (`Apache-2.0`).

## 4. Packaging Notes

- The Croissant file describes four dataset slices: `oracle_calib`, `easy`, `medium`, and `hard`.
- It assumes deterministic release seeds `0, 1, 2, 3, 4`.
- It assumes the paper-suite configuration `num_users=350`, `simulation_days=45`, `fast_mode=false`, and `n_checkpoints=8`.
- The `matched_prefix_examples` record set uses the release-facing column name `matched_local_event_idx`.
- If the final hosted matched-pairs files instead keep the internal pipeline column name `eval_local_event_idx`, either rename that column in the export or update the Croissant metadata so that the record-set field names match the hosted files exactly.
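
If the export path does keep the internal name, the rename described above is a one-liner at export time. A sketch with pandas on a toy frame (the real input would be each hosted `matched_pairs.parquet`; the second column is invented for illustration):

```python
import pandas as pd

# Toy stand-in for a matched-pairs table still carrying the internal
# pipeline column name.
matched_pairs = pd.DataFrame({"eval_local_event_idx": [3, 7], "pair_id": [0, 1]})

# Map the internal name to the release-facing name expected by the
# matched_prefix_examples record set.
matched_pairs = matched_pairs.rename(
    columns={"eval_local_event_idx": "matched_local_event_idx"}
)

print(matched_pairs.columns.tolist())  # ['matched_local_event_idx', 'pair_id']
```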

## 5. Official Croissant Checker Result

- Validator: `https://huggingface.co/spaces/JoaquinVanschoren/croissant-checker`
- Validation date: `2026-05-05`
- Hosted Croissant URL: `https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/raw/main/metadata/temporal_twins_croissant.json`

Status:

- JSON Format Validation: `PASS`
- Croissant Schema Validation: `PASS`
- Responsible AI Metadata: `PASS`
- Records Generation Test: `Known non-blocking streaming issue`

The records-generation test reaches `temporal_twins_data.zip` but fails while streaming Parquet fields from the zip archive. The checker reports unnamed or integer-indexed columns instead of the expected Parquet column names such as `sender_id`. This appears to be a checker or streaming compatibility issue with Parquet files inside the zip archive, not a schema or metadata failure.

Additional notes:

- The hosted archive contains `20` `transactions.parquet` files and `20` `matched_pairs.parquet` files.
- Hosted paths match:
  - `data/*/seed_*/transactions.parquet`
  - `data/*/seed_*/matched_pairs.parquet`
- The files are loadable directly with pandas/pyarrow using the instructions in `data/README_GENERATION.md`.
- Schema validation and Responsible AI metadata validation both pass.

### Reviewer Loading Snippet

```python
import zipfile
import pandas as pd

zip_path = "temporal_twins_data.zip"

with zipfile.ZipFile(zip_path) as zf:
    with zf.open("data/medium/seed_0/transactions.parquet") as f:
        transactions = pd.read_parquet(f)
    with zf.open("data/medium/seed_0/matched_pairs.parquet") as f:
        matched_pairs = pd.read_parquet(f)

print(transactions.columns.tolist())
print(matched_pairs.columns.tolist())
print(transactions.head())
print(matched_pairs.head())
```