Temporal Twins Croissant Validation Notes
1. How to Validate
Use the official MLCommons Croissant tooling after the final release files are hosted.
- Confirm the hosted URLs in `metadata/temporal_twins_croissant.json` match the current public dataset and code repositories.
- Validate the file with the official Croissant validator from the MLCommons Croissant project. If you use the web validator, upload the final JSON-LD file or point it at the hosted Croissant URL.
- As a local smoke check, you can also load the JSON-LD with a JSON parser before running the full validator:
```bash
python3 - <<'PY'
import json
from pathlib import Path

path = Path("metadata/temporal_twins_croissant.json")
with path.open() as f:
    json.load(f)  # raises on malformed JSON
print("JSON parse OK")
PY
```
- After JSON parsing succeeds, run the official Croissant validation step and confirm the record sets, fields, and distribution references resolve correctly.
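For a programmatic pass, the `mlcroissant` package can both load the JSON-LD (which runs its structural validation) and generate a few records. The following is a minimal sketch, assuming `mlcroissant` is installed (`pip install mlcroissant`); the record-set name `matched_prefix_examples` is taken from section 4, so adjust it if the final file differs:

```python
import itertools

import mlcroissant as mlc

# Loading the JSON-LD runs mlcroissant's structural validation and
# raises if the file is malformed.
ds = mlc.Dataset(jsonld="metadata/temporal_twins_croissant.json")

# Materializing a handful of records exercises the distribution and
# field references end to end.
for record in itertools.islice(ds.records(record_set="matched_prefix_examples"), 3):
    print(record)
```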
2. Hosted URLs and Remaining Placeholders
Dataset-side URLs now resolve to the following (a quick reachability check appears at the end of this section):
- Dataset URL: https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins
- Croissant metadata URL: https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/raw/main/metadata/temporal_twins_croissant.json
- Croissant metadata browser page: https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/blob/main/metadata/temporal_twins_croissant.json
- Data URL: https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/data
- Results URL: https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/results
- Configs URL: https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/configs
- Metadata URL: https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/metadata
- Release landing URL: https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins
Code repository URL: https://huggingface.co/temporal-twins-benchmark/temporal-twins-code
Paper URL status:
- Not available during double-blind review; to be added after publication.
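As a lightweight complement to the validator, the hosted URLs can be checked for reachability. This is a minimal sketch using `requests` (an assumed dependency); the URL list is copied from above:

```python
import requests

URLS = [
    "https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins",
    "https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/raw/main/metadata/temporal_twins_croissant.json",
    "https://huggingface.co/temporal-twins-benchmark/temporal-twins-code",
]

for url in URLS:
    # HEAD keeps the check cheap; follow redirects since the Hub may redirect.
    resp = requests.head(url, allow_redirects=True, timeout=30)
    print(resp.status_code, url)
```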
3. Release Checklist
- Dataset URL is accessible to reviewers.
- Croissant file validates with the official MLCommons Croissant validator.
- Distribution URLs resolve to the intended hosted artifacts.
- Record-set columns match the actual hosted files (a schema spot-check follows this list).
- RAI fields are present.
- Dataset license is present (`CC-BY-4.0`).
- Code repository license is present (`Apache-2.0`).
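The column-match item can be spot-checked without unpacking the archive. This is a minimal sketch, assuming the `temporal_twins_data.zip` layout described in section 5; the expected-column sets below list only the names mentioned in these notes (`sender_id`, `matched_local_event_idx`) and should be extended to the full Croissant field lists:

```python
import io
import zipfile

import pyarrow.parquet as pq

ZIP_PATH = "temporal_twins_data.zip"
# Partial expectations; extend with the full record-set field names.
EXPECTED = {
    "data/medium/seed_0/transactions.parquet": {"sender_id"},
    "data/medium/seed_0/matched_pairs.parquet": {"matched_local_event_idx"},
}

with zipfile.ZipFile(ZIP_PATH) as zf:
    for member, expected_cols in EXPECTED.items():
        # Read the member into memory so pyarrow gets a seekable buffer.
        schema = pq.read_schema(io.BytesIO(zf.read(member)))
        missing = expected_cols - set(schema.names)
        print(member, "OK" if not missing else f"missing: {sorted(missing)}")
```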
4. Packaging Notes
- The Croissant file describes four dataset slices: `oracle_calib`, `easy`, `medium`, and `hard`.
- It assumes deterministic release seeds `0, 1, 2, 3, 4`.
- It assumes the paper-suite configuration `num_users=350`, `simulation_days=45`, `fast_mode=false`, and `n_checkpoints=8`.
- The `matched_prefix_examples` record set uses the release-facing column name `matched_local_event_idx`.
- If the final hosted matched-pairs files keep the internal pipeline column name `eval_local_event_idx` instead, either rename that column in the export or update the Croissant metadata so the record-set field names match the hosted files exactly (a rename sketch follows this list).
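If the rename route is chosen, the export-side fix is a one-line column rename. A minimal sketch with pandas; the file path is illustrative:

```python
import pandas as pd

df = pd.read_parquet("matched_pairs.parquet")
# Map the internal pipeline name to the release-facing Croissant field name.
if "eval_local_event_idx" in df.columns:
    df = df.rename(columns={"eval_local_event_idx": "matched_local_event_idx"})
df.to_parquet("matched_pairs.parquet", index=False)
```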
5. Official Croissant Checker Result
- Validator: https://huggingface.co/spaces/JoaquinVanschoren/croissant-checker
- Validation date: 2026-05-05
- Hosted Croissant URL: https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/raw/main/metadata/temporal_twins_croissant.json
Status:
- JSON Format Validation: PASS
- Croissant Schema Validation: PASS
- Responsible AI Metadata: PASS
- Records Generation Test: known non-blocking streaming issue (see below)
The records-generation test reaches `temporal_twins_data.zip` but fails while streaming Parquet fields from the zip archive. The checker reports unnamed or integer-indexed columns instead of the expected Parquet column names such as `sender_id`. This appears to be a checker or streaming compatibility issue with Parquet files inside the zip archive, not a schema or metadata failure.
Additional notes:
- The hosted archive contains 20 `transactions.parquet` files and 20 `matched_pairs.parquet` files (a count check follows these notes).
- Hosted paths match:
  - `data/*/seed_*/transactions.parquet`
  - `data/*/seed_*/matched_pairs.parquet`
- The files are loadable directly with pandas/pyarrow using the instructions in `data/README_GENERATION.md`.
- Schema validation and Responsible AI metadata validation both pass.
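The file counts can be re-verified from the zip listing alone. A minimal sketch using the hosted path patterns above (archive name as hosted):

```python
import fnmatch
import zipfile

with zipfile.ZipFile("temporal_twins_data.zip") as zf:
    names = zf.namelist()
    for pattern in ("data/*/seed_*/transactions.parquet",
                    "data/*/seed_*/matched_pairs.parquet"):
        # 20 files are expected for each pattern.
        print(pattern, "->", len(fnmatch.filter(names, pattern)))
```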
Reviewer Loading Snippet
```python
import zipfile

import pandas as pd

zip_path = "temporal_twins_data.zip"
with zipfile.ZipFile(zip_path) as zf:
    # Load one slice/seed pair directly from the hosted archive.
    with zf.open("data/medium/seed_0/transactions.parquet") as f:
        transactions = pd.read_parquet(f)
    with zf.open("data/medium/seed_0/matched_pairs.parquet") as f:
        matched_pairs = pd.read_parquet(f)

# Named columns confirm the streaming issue is checker-side, not in the files.
print(transactions.columns.tolist())
print(matched_pairs.columns.tolist())
print(transactions.head())
print(matched_pairs.head())
```