# Temporal Twins Croissant Validation Notes

## 1. How to Validate

Use the official MLCommons Croissant tooling after the final release files are hosted.

1. Confirm the hosted URLs in `metadata/temporal_twins_croissant.json` match the current public dataset and code repositories.
2. Validate the file with the official Croissant validator from the MLCommons Croissant project. If you use the web validator, upload the final JSON-LD file or point it at the hosted Croissant URL.
3. As a local smoke check, you can also load the JSON-LD with a JSON parser before running the full validator:

```bash
python3 - <<'PY'
import json
from pathlib import Path
path = Path("metadata/temporal_twins_croissant.json")
with path.open() as f:
    json.load(f)
print("JSON parse OK")
PY
```
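
Between the JSON parse and the full validator run, a lightweight structural pre-check can also flag missing top-level Croissant keys early. The sketch below is not a substitute for the official validator; the key list reflects the Croissant vocabulary as we understand it and is an assumption:

```python
# Expected top-level keys per the Croissant vocabulary (assumed list, not
# authoritative -- the official validator remains the source of truth).
REQUIRED_TOP_LEVEL_KEYS = ["@context", "@type", "name", "distribution", "recordSet"]

def missing_croissant_keys(doc: dict) -> list:
    """Return the expected top-level keys absent from a parsed JSON-LD document."""
    return [key for key in REQUIRED_TOP_LEVEL_KEYS if key not in doc]

# Toy document; a real run would json.load() metadata/temporal_twins_croissant.json.
toy = {"@context": {}, "@type": "sc:Dataset", "name": "temporal-twins"}
print(missing_croissant_keys(toy))  # ['distribution', 'recordSet']
```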

4. After JSON parsing succeeds, run the official Croissant validation step and confirm that the record sets, fields, and distribution references resolve correctly.

## 2. Hosted URLs and Remaining Placeholders

Dataset-side URLs now resolve to:

- Dataset URL: `https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins`
- Croissant metadata URL: `https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/raw/main/metadata/temporal_twins_croissant.json`
- Croissant metadata browser page: `https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/blob/main/metadata/temporal_twins_croissant.json`
- Data URL: `https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/data`
- Results URL: `https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/results`
- Configs URL: `https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/configs`
- Metadata URL: `https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/metadata`
- Release landing URL: `https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins`
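
Because every dataset-side URL above hangs off the same Hugging Face repository root, a small helper can regenerate the list and diff it against the Croissant metadata mechanically. The helper names below are ours, purely illustrative:

```python
# Rebuild the hosted URLs from the single dataset repo root so this list and
# the Croissant metadata can be compared mechanically. Helper names are ours.
HF_ROOT = "https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins"

def tree_url(subdir: str) -> str:
    """Browser URL for a repository subdirectory on the main branch."""
    return f"{HF_ROOT}/tree/main/{subdir}"

def raw_url(path: str) -> str:
    """Direct-download URL for a single repository file."""
    return f"{HF_ROOT}/raw/main/{path}"

for subdir in ("data", "results", "configs", "metadata"):
    print(tree_url(subdir))
print(raw_url("metadata/temporal_twins_croissant.json"))
```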

Code repository URL:

- `https://huggingface.co/temporal-twins-benchmark/temporal-twins-code`

Paper URL status:

- Not available during double-blind review; to be added after publication.

## 3. Release Checklist

- Dataset URL is accessible to reviewers.
- Croissant file validates with the official MLCommons Croissant validator.
- Distribution URLs resolve to the intended hosted artifacts.
- Record-set columns match the actual hosted files.
- RAI fields are present.
- Dataset license is present (`CC-BY-4.0`).
- Code repository license is present (`Apache-2.0`).

## 4. Packaging Notes

- The Croissant file describes four dataset slices: `oracle_calib`, `easy`, `medium`, and `hard`.
- It assumes deterministic release seeds `0, 1, 2, 3, 4`.
- It assumes the paper-suite configuration `num_users=350`, `simulation_days=45`, `fast_mode=false`, and `n_checkpoints=8`.
- The `matched_prefix_examples` record set uses the release-facing column name `matched_local_event_idx`.
- If the final hosted matched-pairs files instead keep the internal pipeline column name `eval_local_event_idx`, either rename that column in the export or update the Croissant metadata so that the record-set field names match the hosted files exactly.
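
If the export path does keep the internal name, the rename described above is a one-liner at export time. A sketch with pandas on a toy frame (the real input would be each hosted `matched_pairs.parquet`; the second column is invented for illustration):

```python
import pandas as pd

# Toy stand-in for a matched-pairs table still carrying the internal
# pipeline column name.
matched_pairs = pd.DataFrame({"eval_local_event_idx": [3, 7], "pair_id": [0, 1]})

# Map the internal name to the release-facing name expected by the
# matched_prefix_examples record set.
matched_pairs = matched_pairs.rename(
    columns={"eval_local_event_idx": "matched_local_event_idx"}
)

print(matched_pairs.columns.tolist())  # ['matched_local_event_idx', 'pair_id']
```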

## 5. Official Croissant Checker Result

- Validator: `https://huggingface.co/spaces/JoaquinVanschoren/croissant-checker`
- Validation date: `2026-05-05`
- Hosted Croissant URL: `https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/raw/main/metadata/temporal_twins_croissant.json`

Status:

- JSON Format Validation: `PASS`
- Croissant Schema Validation: `PASS`
- Responsible AI Metadata: `PASS`
- Records Generation Test: `Known non-blocking streaming issue`

The records-generation test reaches `temporal_twins_data.zip` but fails while streaming Parquet fields from the zip archive. The checker reports unnamed or integer-indexed columns instead of the expected Parquet column names such as `sender_id`. This appears to be a checker or streaming compatibility issue with Parquet files inside the zip archive, not a schema or metadata failure.

Additional notes:

- The hosted archive contains `20` `transactions.parquet` files and `20` `matched_pairs.parquet` files.
- Hosted paths match:
  - `data/*/seed_*/transactions.parquet`
  - `data/*/seed_*/matched_pairs.parquet`
- The files are loadable directly with pandas/pyarrow using the instructions in `data/README_GENERATION.md`.
- Schema validation and Responsible AI metadata validation both pass.

### Reviewer Loading Snippet

```python
import zipfile
import pandas as pd

zip_path = "temporal_twins_data.zip"

with zipfile.ZipFile(zip_path) as zf:
    with zf.open("data/medium/seed_0/transactions.parquet") as f:
        transactions = pd.read_parquet(f)
    with zf.open("data/medium/seed_0/matched_pairs.parquet") as f:
        matched_pairs = pd.read_parquet(f)

print(transactions.columns.tolist())
print(matched_pairs.columns.tolist())
print(transactions.head())
print(matched_pairs.head())
```