Temporal Twins Croissant Validation Notes
1. How to Validate
Use the official MLCommons Croissant tooling after the final release files are hosted.
- Confirm the hosted URLs in `metadata/temporal_twins_croissant.json` match the current public dataset and code repositories.
- Validate the file with the official Croissant validator from the MLCommons Croissant project. If you use the web validator, upload the final JSON-LD file or point it at the hosted Croissant URL.
- As a local smoke check, you can also load the JSON-LD with a JSON parser before running the full validator:
```bash
python3 - <<'PY'
import json
from pathlib import Path

path = Path("metadata/temporal_twins_croissant.json")
with path.open() as f:
    json.load(f)  # raises on malformed JSON
print("JSON parse OK")
PY
```
- After JSON parsing succeeds, run the official Croissant validation step and confirm the record sets, fields, and distribution references resolve correctly.
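For a programmatic pass, the `mlcroissant` package can both load the JSON-LD (which runs its structural validation) and generate a few records. The following is a minimal sketch, assuming `mlcroissant` is installed (`pip install mlcroissant`); the record-set name `matched_prefix_examples` is taken from section 4, so adjust it if the final file differs:

```python
import itertools

import mlcroissant as mlc

# Loading the JSON-LD runs mlcroissant's structural validation and
# raises if the file is malformed.
ds = mlc.Dataset(jsonld="metadata/temporal_twins_croissant.json")

# Materializing a handful of records exercises the distribution and
# field references end to end.
for record in itertools.islice(ds.records(record_set="matched_prefix_examples"), 3):
    print(record)
```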
2. Hosted URLs and Remaining Placeholders
Dataset-side URLs now resolve to the following (a quick reachability check appears at the end of this section):
- Dataset URL: https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins
- Croissant metadata URL: https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/raw/main/metadata/temporal_twins_croissant.json
- Croissant metadata browser page: https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/blob/main/metadata/temporal_twins_croissant.json
- Data URL: https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/data
- Results URL: https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/results
- Configs URL: https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/configs
- Metadata URL: https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/metadata
- Release landing URL: https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins
Code repository URL: https://huggingface.co/temporal-twins-benchmark/temporal-twins-code
Paper URL status:
- Not available during double-blind review; to be added after publication.
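As a lightweight complement to the validator, the hosted URLs can be checked for reachability. This is a minimal sketch using `requests` (an assumed dependency); the URL list is copied from above:

```python
import requests

URLS = [
    "https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins",
    "https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/raw/main/metadata/temporal_twins_croissant.json",
    "https://huggingface.co/temporal-twins-benchmark/temporal-twins-code",
]

for url in URLS:
    # HEAD keeps the check cheap; follow redirects since the Hub may redirect.
    resp = requests.head(url, allow_redirects=True, timeout=30)
    print(resp.status_code, url)
```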
3. Release Checklist
- Dataset URL is accessible to reviewers.
- Croissant file validates with the official MLCommons Croissant validator.
- Distribution URLs resolve to the intended hosted artifacts.
- Record-set columns match the actual hosted files (a schema spot-check follows this list).
- RAI fields are present.
- Dataset license is present (`CC-BY-4.0`).
- Code repository license is present (`Apache-2.0`).
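The column-match item can be spot-checked without unpacking the archive. This is a minimal sketch, assuming the `temporal_twins_data.zip` layout described in section 5; the expected-column sets below list only the names mentioned in these notes (`sender_id`, `matched_local_event_idx`) and should be extended to the full Croissant field lists:

```python
import io
import zipfile

import pyarrow.parquet as pq

ZIP_PATH = "temporal_twins_data.zip"
# Partial expectations; extend with the full record-set field names.
EXPECTED = {
    "data/medium/seed_0/transactions.parquet": {"sender_id"},
    "data/medium/seed_0/matched_pairs.parquet": {"matched_local_event_idx"},
}

with zipfile.ZipFile(ZIP_PATH) as zf:
    for member, expected_cols in EXPECTED.items():
        # Read the member into memory so pyarrow gets a seekable buffer.
        schema = pq.read_schema(io.BytesIO(zf.read(member)))
        missing = expected_cols - set(schema.names)
        print(member, "OK" if not missing else f"missing: {sorted(missing)}")
```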
4. Packaging Notes
- The Croissant file describes four dataset slices: `oracle_calib`, `easy`, `medium`, and `hard`.
- It assumes deterministic release seeds `0, 1, 2, 3, 4`.
- It assumes the paper-suite configuration `num_users=350`, `simulation_days=45`, `fast_mode=false`, and `n_checkpoints=8`.
- The `matched_prefix_examples` record set uses the release-facing column name `matched_local_event_idx`.
- If the final hosted matched-pairs files keep the internal pipeline column name `eval_local_event_idx` instead, either rename that column in the export or update the Croissant metadata so the record-set field names match the hosted files exactly (a rename sketch follows this list).
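If the rename route is chosen, the export-side fix is a one-line column rename. A minimal sketch with pandas; the file path is illustrative:

```python
import pandas as pd

df = pd.read_parquet("matched_pairs.parquet")
# Map the internal pipeline name to the release-facing Croissant field name.
if "eval_local_event_idx" in df.columns:
    df = df.rename(columns={"eval_local_event_idx": "matched_local_event_idx"})
df.to_parquet("matched_pairs.parquet", index=False)
```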
5. Official Croissant Checker Result
- Validator: https://huggingface.co/spaces/JoaquinVanschoren/croissant-checker
- Validation date: 2026-05-05
- Hosted Croissant URL: https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/raw/main/metadata/temporal_twins_croissant.json
Status:
- JSON Format Validation: PASS
- Croissant Schema Validation: PASS
- Responsible AI Metadata: PASS
- Records Generation Test: known non-blocking streaming issue (see below)
The records-generation test reaches `temporal_twins_data.zip` but fails while streaming Parquet fields from the zip archive. The checker reports unnamed or integer-indexed columns instead of the expected Parquet column names such as `sender_id`. This appears to be a checker or streaming compatibility issue with Parquet files inside the zip archive, not a schema or metadata failure.
Additional notes:
- The hosted archive contains 20 `transactions.parquet` files and 20 `matched_pairs.parquet` files (a count check follows these notes).
- Hosted paths match:
  - `data/*/seed_*/transactions.parquet`
  - `data/*/seed_*/matched_pairs.parquet`
- The files are loadable directly with pandas/pyarrow using the instructions in `data/README_GENERATION.md`.
- Schema validation and Responsible AI metadata validation both pass.
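The file counts can be re-verified from the zip listing alone. A minimal sketch using the hosted path patterns above (archive name as hosted):

```python
import fnmatch
import zipfile

with zipfile.ZipFile("temporal_twins_data.zip") as zf:
    names = zf.namelist()
    for pattern in ("data/*/seed_*/transactions.parquet",
                    "data/*/seed_*/matched_pairs.parquet"):
        # 20 files are expected for each pattern.
        print(pattern, "->", len(fnmatch.filter(names, pattern)))
```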
Reviewer Loading Snippet
```python
import zipfile

import pandas as pd

zip_path = "temporal_twins_data.zip"
with zipfile.ZipFile(zip_path) as zf:
    # Load one slice/seed pair directly from the hosted archive.
    with zf.open("data/medium/seed_0/transactions.parquet") as f:
        transactions = pd.read_parquet(f)
    with zf.open("data/medium/seed_0/matched_pairs.parquet") as f:
        matched_pairs = pd.read_parquet(f)

# Named columns confirm the streaming issue is checker-side, not in the files.
print(transactions.columns.tolist())
print(matched_pairs.columns.tolist())
print(transactions.head())
print(matched_pairs.head())
```