temporal-twins-anon committed on
Commit
a19847a
·
verified ·
1 Parent(s): 8287b76

Update Croissant files

metadata/CROISSANT_VALIDATION_NOTES.md ADDED
@@ -0,0 +1,61 @@
+ # Temporal Twins Croissant Validation Notes
+
+ ## 1. How to Validate
+
+ Use the official MLCommons Croissant tooling after the dataset release files are hosted.
+
+ 1. Confirm the hosted dataset and code repository URLs in `metadata/temporal_twins_croissant.json` are correct for the current release.
+ 2. Validate the file with the official Croissant validator from the MLCommons Croissant project. If you use the web validator, upload the final JSON-LD file or point it at the hosted Croissant URL.
+ 3. As a local smoke check, you can also load the JSON-LD with a JSON parser before running the full validator:
+
+ ```bash
+ python3 - <<'PY'
+ import json
+ from pathlib import Path
+ path = Path("metadata/temporal_twins_croissant.json")
+ with path.open() as f:
+     json.load(f)
+ print("JSON parse OK")
+ PY
+ ```
+
+ 4. After JSON parsing succeeds, run the official Croissant validation step and confirm the record sets, fields, and distribution references resolve correctly.
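Between the JSON smoke check and the full validator, a reference-integrity pass can catch the kind of breakage this commit touches: a `containedIn` entry that still points at a distribution `@id` that no longer exists. The sketch below is a minimal, standard-library illustration and not a substitute for the official validator. The helper name `dangling_references` is hypothetical. The sketch also assumes that references appear under the unprefixed keys `containedIn`, `fileObject`, and `fileSet`.

```python
import json
from pathlib import Path

# Keys whose values are assumed to reference a distribution entry by "@id".
REFERENCE_KEYS = {"containedIn", "fileObject", "fileSet"}


def _ids(value):
    """Yield every "@id" string from a reference value (a dict or a list of dicts)."""
    items = value if isinstance(value, list) else [value]
    for item in items:
        if isinstance(item, dict) and "@id" in item:
            yield item["@id"]


def dangling_references(doc):
    """Return the sorted ids that are referenced but not declared in "distribution"."""
    declared = {d["@id"] for d in doc.get("distribution", []) if "@id" in d}
    missing = set()

    def walk(node):
        # Visit the whole JSON tree so record-set field sources are covered too.
        if isinstance(node, dict):
            for key, value in node.items():
                if key in REFERENCE_KEYS:
                    missing.update(i for i in _ids(value) if i not in declared)
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(doc)
    return sorted(missing)


if __name__ == "__main__":
    path = Path("metadata/temporal_twins_croissant.json")
    if path.exists():
        doc = json.loads(path.read_text(encoding="utf-8"))
        print(dangling_references(doc) or "no dangling references")
```

An empty result only shows that the references are internally consistent; it says nothing about whether the hosted files exist or match the declared fields.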
+
+ ## 2. Hosted URLs and Remaining Placeholders
+
+ Dataset-side URLs now resolve to:
+
+ - Dataset URL: `https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins`
+ - Croissant metadata URL: `https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/raw/main/metadata/temporal_twins_croissant.json`
+ - Croissant metadata browser page: `https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/blob/main/metadata/temporal_twins_croissant.json`
+ - Data URL: `https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/data`
+ - Results URL: `https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/results`
+ - Configs URL: `https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/configs`
+ - Metadata URL: `https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/metadata`
+ - Release landing URL: `https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins`
+
+ Code repository URL:
+
+ - `https://huggingface.co/temporal-twins-benchmark/temporal-twins-code`
+
+ Paper URL status:
+
+ - Not available during double-blind review; to be added after publication.
+
+ ## 3. Release Checklist
+
+ - Dataset URL is accessible to reviewers.
+ - Croissant file validates with the official MLCommons Croissant validator.
+ - Distribution URLs resolve to the intended hosted artifacts.
+ - Record-set columns match the actual hosted files.
+ - RAI fields are present.
+ - Dataset license is present (`CC-BY-4.0`).
+ - Code repository license is present (`Apache-2.0`).
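The "Distribution URLs resolve to the intended hosted artifacts" item can be partly checked offline. The `transactions-archive` entry removed in this commit declared `application/zip` while its `contentUrl` pointed at a `tree/main/data` directory listing, so one cheap test is to flag any `contentUrl` that targets a Hub browser page instead of a file. The sketch below assumes the `datasets/<org>/<repo>/<view>/<revision>/<path>` URL layout seen in the URLs listed in section 2. The helper names are hypothetical, and it makes no network requests.

```python
from urllib.parse import urlparse

# Path segments that the URLs in this release use for HTML views, not file bytes.
BROWSER_VIEWS = {"tree", "blob"}


def is_browser_page(url):
    """Return True if a huggingface.co dataset URL points at a page rather than a file."""
    parts = urlparse(url)
    if parts.netloc != "huggingface.co":
        return False  # only Hub URLs are classified in this sketch
    segments = parts.path.strip("/").split("/")
    if segments[0] != "datasets":
        return False
    # "datasets/<org>/<repo>" alone is the landing page; index 3 holds the view.
    return len(segments) <= 3 or segments[3] in BROWSER_VIEWS


def browser_page_entries(distribution):
    """Return (id, contentUrl) pairs for entries whose contentUrl is a browser page."""
    flagged = []
    for entry in distribution:
        url = entry.get("contentUrl")
        if url and is_browser_page(url):  # FileSets that only use "includes" are skipped
            flagged.append((entry.get("@id"), url))
    return flagged
```

Passing the parsed file's `distribution` list to `browser_page_entries` should return an empty list once every `contentUrl` points at a `raw` file URL. Whether those URLs actually respond still needs a live request or the official validator.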
+
+ ## 4. Packaging Notes
+
+ - The Croissant file describes four dataset slices: `oracle_calib`, `easy`, `medium`, and `hard`.
+ - It assumes deterministic release seeds `0, 1, 2, 3, 4`.
+ - It assumes paper-suite configuration `num_users=350`, `simulation_days=45`, `fast_mode=false`, and `n_checkpoints=8`.
+ - The `matched_prefix_examples` record set uses the release-facing column name `matched_local_event_idx`.
+ - If the final hosted matched-pairs files keep the internal pipeline column name `eval_local_event_idx` instead, either rename that column in the export or update the Croissant metadata so the record-set field names match the hosted files exactly.
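The packaging assumptions above imply a fixed file inventory: four slices times five seeds gives 20 files per parquet kind. A short sketch can enumerate that inventory and reconcile the column rename. It assumes the directory layout `data/<slice>/seed_<n>/<file>`, inferred from the `includes` globs in the Croissant file. The helper names are hypothetical, and the hosted file list and column names must be supplied by the caller, for example from a repository listing and a parquet reader.

```python
from fnmatch import fnmatch
from itertools import product

SLICES = ["oracle_calib", "easy", "medium", "hard"]
SEEDS = [0, 1, 2, 3, 4]


def expected_paths(filename):
    """List the 20 paths implied by four slices and five seeds (assumed layout)."""
    return [f"data/{s}/seed_{seed}/{filename}" for s, seed in product(SLICES, SEEDS)]


def missing_files(hosted, filename, pattern):
    """Return expected paths that no hosted path matching the glob provides."""
    # Note: fnmatch lets "*" cross "/" boundaries, so this is a loose match.
    matched = {p for p in hosted if fnmatch(p, pattern)}
    return [p for p in expected_paths(filename) if p not in matched]


def reconcile_columns(declared, actual, renames):
    """Apply renames to the hosted columns, then report declared fields still absent."""
    exported = [renames.get(column, column) for column in actual]
    absent = [field for field in declared if field not in exported]
    return exported, absent


# The rename described in the last bullet above.
RENAMES = {"eval_local_event_idx": "matched_local_event_idx"}
```

For example, `missing_files(hosted, "matched_pairs.parquet", "data/*/seed_*/matched_pairs.parquet")` returns the slice and seed combinations not yet uploaded, and `reconcile_columns(declared, actual, RENAMES)` shows whether applying the rename at export time would leave any declared field unmatched.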
metadata/temporal_twins_croissant.json CHANGED
@@ -1,5 +1,6 @@
  {
  "@context": {
+ "@language": "en",
  "@vocab": "https://schema.org/",
  "sc": "https://schema.org/",
  "cr": "http://mlcommons.org/croissant/",
@@ -51,79 +52,42 @@
  "reproducible benchmark"
  ],
  "distribution": [
- {
- "@id": "transactions-archive",
- "@type": "cr:FileObject",
- "name": "Transactions archive",
- "description": "Hosted archive containing synthetic transaction files for oracle_calib, easy, medium, and hard across seeds 0 through 4.",
- "contentUrl": "https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/data",
- "encodingFormat": "application/zip"
- },
- {
- "@id": "matched-prefix-archive",
- "@type": "cr:FileObject",
- "name": "Matched-prefix examples archive",
- "description": "Hosted release archive containing matched-prefix fraud/benign evaluation examples under release/data/*/seed_*/matched_pairs.parquet.",
- "contentUrl": "https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins",
- "encodingFormat": "application/zip"
- },
- {
- "@id": "configs-archive",
- "@type": "cr:FileObject",
- "name": "Configs archive",
- "description": "Hosted release archive containing benchmark configuration files under release/configs/*.yaml.",
- "contentUrl": "https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins",
- "encodingFormat": "application/zip"
- },
- {
- "@id": "results-archive",
- "@type": "cr:FileObject",
- "name": "Results archive",
- "description": "Hosted release archive containing the deterministic 5-seed paper-suite outputs under release/results/.",
- "contentUrl": "https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins",
- "encodingFormat": "application/zip"
- },
  {
  "@id": "metadata-files",
  "@type": "cr:FileSet",
  "name": "Metadata files",
- "description": "Metadata payload for the public release, including this Croissant file and companion notes.",
- "containedIn": {
- "@id": "results-archive"
- },
- "includes": "release/metadata/*"
+ "description": "Metadata payload for the hosted release, including this Croissant file and companion notes.",
+ "includes": "metadata/*"
  },
  {
  "@id": "transactions-files",
  "@type": "cr:FileSet",
  "name": "Synthetic transactions parquet files",
  "description": "Expected synthetic transaction files for benchmark modes oracle_calib, easy, medium, and hard across seeds 0 through 4.",
- "containedIn": {
- "@id": "transactions-archive"
- },
- "includes": "release/data/*/seed_*/transactions.parquet",
+ "includes": "data/*/seed_*/transactions.parquet",
  "encodingFormat": "application/x-parquet"
  },
  {
  "@id": "matched-prefix-files",
  "@type": "cr:FileSet",
  "name": "Matched-prefix example parquet files",
- "description": "Expected matched-prefix benchmark examples for the release. Each file contains fraud and benign twin examples evaluated at the same local prefix index.",
- "containedIn": {
- "@id": "matched-prefix-archive"
- },
- "includes": "release/data/*/seed_*/matched_pairs.parquet",
+ "description": "Expected matched-prefix benchmark examples for the hosted release. Each file contains fraud and benign twin examples evaluated at the same local prefix index.",
+ "includes": "data/*/seed_*/matched_pairs.parquet",
  "encodingFormat": "application/x-parquet"
  },
  {
  "@id": "config-files",
  "@type": "cr:FileSet",
  "name": "Benchmark config files",
- "description": "YAML configuration files for the public release.",
- "containedIn": {
- "@id": "configs-archive"
- },
- "includes": "release/configs/*.yaml"
+ "description": "YAML configuration files for the hosted release.",
+ "includes": "configs/*.yaml"
+ },
+ {
+ "@id": "results-files",
+ "@type": "cr:FileSet",
+ "name": "Results files",
+ "description": "Hosted result summaries and diagnostics for the deterministic paper suite.",
+ "includes": "results/*"
  },
  {
  "@id": "paper-suite-runs-csv",
@@ -131,9 +95,6 @@
  "name": "Per-run paper-suite results",
  "description": "Per-run deterministic results for the final 5-seed paper-scale suite.",
  "contentUrl": "https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/raw/main/results/paper_suite_runs.csv",
- "containedIn": {
- "@id": "results-archive"
- },
  "encodingFormat": "text/csv"
  },
  {
@@ -142,9 +103,6 @@
  "name": "Paper-suite summary results",
  "description": "Mean and standard deviation summary of the deterministic 5-seed paper suite.",
  "contentUrl": "https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/raw/main/results/paper_suite_summary.csv",
- "containedIn": {
- "@id": "results-archive"
- },
  "encodingFormat": "text/csv"
  },
  {
@@ -153,9 +111,6 @@
  "name": "Paper-suite runtime summary",
  "description": "Runtime and StaticGNN evaluation diagnostics for the final paper suite.",
  "contentUrl": "https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/raw/main/results/paper_suite_runtime.csv",
- "containedIn": {
- "@id": "results-archive"
- },
  "encodingFormat": "text/csv"
  },
  {
@@ -164,9 +119,6 @@
  "name": "Paper-suite failed gate checks",
  "description": "Gate-check and advisory-check outcomes for each run in the final paper suite.",
  "contentUrl": "https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/raw/main/results/paper_suite_failed_checks.csv",
- "containedIn": {
- "@id": "results-archive"
- },
  "encodingFormat": "text/csv"
  },
  {
@@ -175,9 +127,6 @@
  "name": "Temporal Twins Croissant metadata",
  "description": "MLCommons Croissant 1.1 metadata for the full Temporal Twins benchmark collection.",
  "contentUrl": "https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/raw/main/metadata/temporal_twins_croissant.json",
- "containedIn": {
- "@id": "metadata-files"
- },
  "encodingFormat": "application/ld+json"
  }
  ],