bruAristimunha committed
Commit e0a7464 · 0 Parent(s)

Initial Space: searchable EEGDash catalog

Files changed (5)
  1. DEPLOY.md +108 -0
  2. README.md +58 -0
  3. app.py +342 -0
  4. dataset_summary.csv +0 -0
  5. requirements.txt +3 -0
DEPLOY.md ADDED
@@ -0,0 +1,108 @@
+# Deploying the EEGDash Space and datasets
+
+One-time setup, per-push workflow, and how the dataset mirrors are kept in sync.
+
+## 1. Create the org (one-time)
+
+1. Sign in at <https://huggingface.co>.
+2. Create org → handle **`EEGDash`**, display name *EEG-DaSh*, link
+   `https://eegdash.org` and `https://github.com/eegdash/EEGDash`, upload
+   `docs/source/_static/eegdash_image_only.svg` as the logo.
+3. Add maintainers.
+4. Generate a **write** access token (Settings → Access Tokens) and export it as
+   `HF_TOKEN` locally and in CI.
+
+## 2. Create the Space
+
+```bash
+huggingface-cli login   # paste the write token
+huggingface-cli repo create \
+  --type space --space_sdk gradio EEGDash/catalog
+```
+
+## 3. Push the Space
+
+From the repo root:
+
+```bash
+cd huggingface-space
+
+git init -b main
+git remote add origin https://huggingface.co/spaces/EEGDash/catalog
+git add README.md app.py requirements.txt dataset_summary.csv
+git commit -m "Initial Space: searchable EEGDash catalog"
+git push origin main
+```
+
+The Space will build and be served at <https://huggingface.co/spaces/EEGDash/catalog>.
+
+### Keeping the catalog fresh
+
+`dataset_summary.csv` in this folder is a snapshot of
+`eegdash/dataset/dataset_summary.csv`. Refresh it whenever the source changes:
+
+```bash
+cp ../eegdash/dataset/dataset_summary.csv dataset_summary.csv
+git add dataset_summary.csv
+git commit -m "Refresh catalog snapshot"
+git push
+```
+
+A GitHub Action that runs on pushes to `develop` can automate this; see the
+stub in `.github/workflows/sync-hf-space.yml` (add when ready).
+
+## 4. Mirror datasets to `EEGDash/<slug>`
+
+This is what powers the `on 🤗` column. Push one or more datasets with the helper
+script at `scripts/push_to_hf.py`:
+
+```bash
+# Single dataset
+python scripts/push_to_hf.py --dataset ds002718
+
+# Batch, skipping anything already on the Hub, capped at 5 GB
+python scripts/push_to_hf.py \
+  --from-csv eegdash/dataset/dataset_summary.csv \
+  --max-size-gb 5 \
+  --skip-existing
+```
+
+Under the hood this calls `EEGDashDataset(...).push_to_hub("EEGDash/<slug>")`,
+which comes from braindecode's `HubDatasetMixin`. The resulting repo
+is laid out as follows:
+
+```
+EEGDash/<slug>/
+├── README.md                      # Dataset card with load snippets
+├── format_info.json               # Version + compression metadata
+└── sourcedata/braindecode/
+    ├── dataset_description.json   # BIDS-compliant
+    ├── participants.tsv           # BIDS-compliant
+    ├── dataset.zarr/              # blosc-compressed windowed data
+    └── sub-<label>/eeg/
+        ├── *_events.tsv
+        ├── *_channels.tsv
+        └── *_eeg.json
+```
+
+Users then load it with:
+
+```python
+from braindecode.datasets import BaseConcatDataset
+ds = BaseConcatDataset.pull_from_hub("EEGDash/ds002718")
+```
+
+## 5. Verify
+
+- Space renders: <https://huggingface.co/spaces/EEGDash/catalog>.
+- Org page shows the Space card + dataset repos: <https://huggingface.co/EEGDash>.
+- At least one dataset loadable end-to-end via `pull_from_hub`.
+
+## Troubleshooting
+
+| Symptom | Likely cause and fix |
+|---|---|
+| `on 🤗` column empty for everything | The Space has no outbound network access or was rate-limited; the lookup is cached once per process, so restart or redeploy the Space to retry. |
+| `push_to_hub` fails with `ImportError` | Optional dependencies are missing; run `pip install braindecode[hub]` (pulls in `zarr` + `huggingface_hub`). |
+| Repo exists but Space doesn't flag it | `HfApi().list_datasets(author="EEGDash", limit=500)` caps at 500; raise the limit in `app.py::_hf_repos` if the org grows beyond that. |
+| `dataset_summary.csv` out of sync | Re-run the refresh commands from step 3 or add the workflow stub. |
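
Note on step 4 of `DEPLOY.md`: `scripts/push_to_hf.py` is not part of this commit, so its selection logic is not visible here. The following is only a minimal sketch of how `--max-size-gb` and `--skip-existing` might pick datasets, assuming the cap applies per dataset and that each row carries a numeric size in GB. The function `select_for_push`, the `size_gb` key, and the `demo_*` slugs are hypothetical.

```python
def select_for_push(rows, existing, max_size_gb=None, skip_existing=False):
    """Return the slugs that would be pushed.

    rows: iterable of dicts with hypothetical keys 'dataset' and 'size_gb'.
    existing: set of slugs already present on the Hub.
    """
    selected = []
    for row in rows:
        slug = row["dataset"]
        # Skip datasets that already have a mirror, if requested.
        if skip_existing and slug in existing:
            continue
        # Assumption: the size cap is applied to each dataset individually.
        if max_size_gb is not None and row["size_gb"] > max_size_gb:
            continue
        selected.append(slug)
    return selected


# Placeholder rows; sizes are invented for illustration only.
rows = [
    {"dataset": "demo_a", "size_gb": 4.2},
    {"dataset": "demo_b", "size_gb": 12.0},
    {"dataset": "demo_c", "size_gb": 0.8},
]
picked = select_for_push(rows, existing={"demo_c"}, max_size_gb=5, skip_existing=True)
print(picked)
```

Here `demo_b` is dropped by the size cap and `demo_c` because it is already mirrored, so only `demo_a` remains.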
README.md ADDED
@@ -0,0 +1,58 @@
+---
+title: EEGDash Dataset Catalog
+emoji: 🧠
+colorFrom: blue
+colorTo: indigo
+sdk: gradio
+sdk_version: 4.44.0
+app_file: app.py
+pinned: true
+license: bsd-3-clause
+short_description: Search 200+ EEG/MEG datasets and load them with one line.
+tags:
+  - eeg
+  - meg
+  - neuroscience
+  - brain-computer-interface
+  - braindecode
+  - pytorch
+  - datasets
+hf_oauth: false
+---
+
+# EEGDash — Dataset Catalog
+
+Search, filter, and load 200+ publicly shared EEG/MEG datasets. Mirrors the
+catalog at [eegdash.org](https://eegdash.org) and generates one-liner load
+snippets for [EEGDash](https://github.com/eegdash/EEGDash) and
+[braindecode](https://braindecode.org).
+
+## How it works
+
+- The left panel filters the catalog by modality, subject type, source,
+  license, subject count, and sampling rate.
+- Selecting a row shows the dataset card + copy-paste load snippets.
+- Rows tagged **on 🤗** have a mirrored HF dataset repo at
+  `EEGDash/<slug>` and can be fetched with
+  `BaseConcatDataset.pull_from_hub(...)`.
+
+## Loading a dataset
+
+```python
+# Native EEGDash (streams from S3/NEMAR)
+from eegdash import EEGDashDataset
+ds = EEGDashDataset(dataset="ds002718", cache_dir="./cache")
+
+# From HF Hub (braindecode's pull_from_hub, BIDS-inspired Zarr)
+from braindecode.datasets import BaseConcatDataset
+ds = BaseConcatDataset.pull_from_hub("EEGDash/ds002718")
+```
+
+## Deploying / updating the Space
+
+See [`DEPLOY.md`](./DEPLOY.md) for the one-time org setup and per-push workflow.
+
+## License
+
+BSD-3-Clause. The hosted datasets retain their upstream licenses; consult each
+dataset card before redistribution.
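
Note on the README's "How it works" section: the filter panel combines its controls with AND semantics, and an empty selection means "no restriction". The sketch below restates that idea over plain dictionaries instead of a pandas DataFrame, so it is a simplified stand-in rather than the Space's actual `_filter`. The function `filter_rows` and the `demo_*` rows are hypothetical.

```python
def filter_rows(rows, query="", sources=(), min_subjects=0, only_on_hf=False):
    """Keep rows that satisfy every active filter (empty filters are ignored)."""
    q = query.lower().strip()
    kept = []
    for row in rows:
        # Free-text search over dataset id and author/year, case-insensitive.
        haystack = f"{row['dataset']} {row['author_year']}".lower()
        if q and q not in haystack:
            continue
        # Multi-select filter: an empty selection does not restrict anything.
        if sources and row["source"] not in sources:
            continue
        if row["n_subjects"] < min_subjects:
            continue
        if only_on_hf and row["on_hf"] != "✓":
            continue
        kept.append(row)
    return kept


# Placeholder catalog rows, invented for illustration.
rows = [
    {"dataset": "demo_a", "author_year": "Author A (2020)", "source": "openneuro",
     "n_subjects": 18, "on_hf": "✓"},
    {"dataset": "demo_b", "author_year": "Author B (2021)", "source": "nemar",
     "n_subjects": 120, "on_hf": ""},
]
by_year = [r["dataset"] for r in filter_rows(rows, query="2021")]
mirrored = [r["dataset"] for r in filter_rows(rows, only_on_hf=True)]
print(by_year, mirrored)
```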
app.py ADDED
@@ -0,0 +1,342 @@
+"""EEGDash Dataset Catalog — Hugging Face Space.
+
+Mirrors the searchable table from https://eegdash.org and generates one-liner
+load snippets for EEGDash and braindecode. Rows whose slug matches an existing
+repo under the ``EEGDash`` org on the Hub are flagged as ``on 🤗`` and can be
+loaded via ``BaseConcatDataset.pull_from_hub``.
+"""
+
+from __future__ import annotations
+
+import ast
+import json
+import os
+from functools import lru_cache
+from pathlib import Path
+
+import gradio as gr
+import pandas as pd
+from huggingface_hub import HfApi
+from huggingface_hub.utils import HfHubHTTPError
+
+HF_ORG = "EEGDash"
+CSV_PATH = Path(__file__).parent / "dataset_summary.csv"
+EEGDASH_URL = "https://eegdash.org"
+GITHUB_URL = "https://github.com/eegdash/EEGDash"
+
+TABLE_COLUMNS = [
+    "dataset",
+    "author_year",
+    "source",
+    "record_modality",
+    "Type Subject",
+    "modality of exp",
+    "type of exp",
+    "n_subjects",
+    "n_records",
+    "n_tasks",
+    "nchans",
+    "sfreq",
+    "size",
+    "license",
+    "on_hf",
+]
+
+DISPLAY_HEADERS = {
+    "dataset": "Dataset",
+    "author_year": "Author (year)",
+    "source": "Source",
+    "record_modality": "Recording",
+    "Type Subject": "Pathology",
+    "modality of exp": "Modality",
+    "type of exp": "Type",
+    "n_subjects": "Subjects",
+    "n_records": "Records",
+    "n_tasks": "Tasks",
+    "nchans": "Channels",
+    "sfreq": "Sampling rate (Hz)",
+    "size": "Size",
+    "license": "License",
+    "on_hf": "on 🤗",
+}
+
+
+def _parse_mode_from_json_col(cell: object) -> str:
+    """Return the most common value from a ``[{val, count}, ...]`` JSON cell.
+
+    The summary CSV stores per-recording distributions of channel counts and
+    sampling rates as a JSON list. The catalog UI wants a single
+    representative value: the one with the highest ``count``.
+    """
+    if not isinstance(cell, str) or not cell.strip():
+        return ""
+    try:
+        parsed = json.loads(cell)
+    except json.JSONDecodeError:
+        try:
+            parsed = ast.literal_eval(cell)
+        except (SyntaxError, ValueError):
+            return ""
+    if not parsed:
+        return ""
+    top = max(parsed, key=lambda d: d.get("count", 0))
+    val = top.get("val", "")
+    if isinstance(val, float) and val.is_integer():
+        val = int(val)
+    return str(val)
+
+
+@lru_cache(maxsize=1)
+def _hf_repos() -> set[str]:
+    """Slugs that exist as dataset repos under the EEGDash org.
+
+    Cached for the lifetime of the process. Failures (no network, rate limit)
+    degrade to an empty set rather than breaking the page load.
+    """
+    try:
+        api = HfApi()
+        repos = api.list_datasets(author=HF_ORG, limit=500)
+        return {r.id.split("/", 1)[-1] for r in repos}
+    except (HfHubHTTPError, Exception):  # noqa: BLE001
+        return set()
+
+
+def _load_catalog() -> pd.DataFrame:
+    df = pd.read_csv(CSV_PATH)
+    df["nchans"] = df["nchans_set"].apply(_parse_mode_from_json_col)
+    df["sfreq"] = df["sampling_freqs"].apply(_parse_mode_from_json_col)
+    on_hub = _hf_repos()
+    df["on_hf"] = df["dataset"].apply(lambda s: "✓" if s in on_hub else "")
+    for col in ("n_subjects", "n_records", "n_tasks"):
+        df[col] = pd.to_numeric(df[col], errors="coerce").fillna(0).astype(int)
+    for col in TABLE_COLUMNS:
+        if col not in df.columns:
+            df[col] = ""
+    df = df[TABLE_COLUMNS].fillna("")
+    return df
+
+
+def _unique_sorted(series: pd.Series) -> list[str]:
+    return sorted({str(v).strip() for v in series if str(v).strip()})
+
+
+def _filter(
+    df: pd.DataFrame,
+    query: str,
+    modalities: list[str],
+    subject_types: list[str],
+    sources: list[str],
+    licenses: list[str],
+    min_subjects: int,
+    only_on_hf: bool,
+) -> pd.DataFrame:
+    out = df
+    if query:
+        q = query.lower().strip()
+        hay = (
+            out["dataset"].str.lower()
+            + " "
+            + out["author_year"].str.lower()
+        )
+        out = out[hay.str.contains(q, regex=False, na=False)]
+    if modalities:
+        out = out[out["modality of exp"].isin(modalities)]
+    if subject_types:
+        out = out[out["Type Subject"].isin(subject_types)]
+    if sources:
+        out = out[out["source"].isin(sources)]
+    if licenses:
+        out = out[out["license"].isin(licenses)]
+    if min_subjects > 0:
+        out = out[out["n_subjects"] >= min_subjects]
+    if only_on_hf:
+        out = out[out["on_hf"] == "✓"]
+    return out
+
+
+def _render_table(df: pd.DataFrame) -> pd.DataFrame:
+    return df.rename(columns=DISPLAY_HEADERS)
+
+
+def _snippets(slug: str, on_hf: bool) -> str:
+    native = f"""```python
+# EEGDash (streams from S3 / NEMAR, preserves BIDS)
+from eegdash import EEGDashDataset
+
+ds = EEGDashDataset(dataset="{slug}", cache_dir="./cache")
+print(len(ds), "recordings")
+```"""
+    hf_block = f"""```python
+# From Hugging Face (braindecode Zarr format, pre-windowed)
+from braindecode.datasets import BaseConcatDataset
+
+ds = BaseConcatDataset.pull_from_hub("{HF_ORG}/{slug}")
+```"""
+    if not on_hf:
+        hf_block = (
+            "> ℹ️ Not mirrored on Hugging Face yet. "
+            "Open an issue on "
+            f"[github.com/eegdash/EEGDash]({GITHUB_URL}/issues) to request it, "
+            "or push it yourself:\n\n"
+            "```python\n"
+            "from eegdash import EEGDashDataset\n"
+            f'ds = EEGDashDataset(dataset="{slug}", cache_dir="./cache")\n'
+            f'ds.push_to_hub("{HF_ORG}/{slug}")\n'
+            "```"
+        )
+    return native + "\n\n" + hf_block
+
+
+def _detail(df: pd.DataFrame, slug: str) -> str:
+    if not slug:
+        return "Pick a dataset row above to see details and load snippets."
+    match = df[df["dataset"] == slug]
+    if match.empty:
+        return f"Dataset `{slug}` not found in the catalog."
+    row = match.iloc[0]
+    on_hf = row["on_hf"] == "✓"
+    doi = row.get("doi", "")
+    title = row.get("dataset_title", "") or slug
+    lines = [f"## `{slug}` — {title}"]
+    if on_hf:
+        lines.append(
+            f"[🤗 EEGDash/{slug}](https://huggingface.co/datasets/{HF_ORG}/{slug})"
+        )
+    if doi:
+        lines.append(f"[DOI: {doi}](https://doi.org/{doi})")
+    lines.append("")
+    lines.append("| | |")
+    lines.append("|--|--|")
+    for key, label in [
+        ("author_year", "Author (year)"),
+        ("source", "Source"),
+        ("record_modality", "Recording"),
+        ("Type Subject", "Pathology"),
+        ("modality of exp", "Modality"),
+        ("type of exp", "Type"),
+        ("n_subjects", "Subjects"),
+        ("n_records", "Records"),
+        ("n_tasks", "Tasks"),
+        ("nchans", "Channels"),
+        ("sfreq", "Sampling rate (Hz)"),
+        ("size", "Size"),
+        ("license", "License"),
+    ]:
+        val = row.get(key, "")
+        if str(val).strip():
+            lines.append(f"| **{label}** | {val} |")
+    lines.append("")
+    lines.append("### Load")
+    lines.append(_snippets(slug, on_hf))
+    return "\n".join(lines)
+
+
+CATALOG = _load_catalog()
+MODALITY_CHOICES = _unique_sorted(CATALOG["modality of exp"])
+SUBJECT_CHOICES = _unique_sorted(CATALOG["Type Subject"])
+SOURCE_CHOICES = _unique_sorted(CATALOG["source"])
+LICENSE_CHOICES = _unique_sorted(CATALOG["license"])
+
+
+CSS = """
+#detail { min-height: 320px; }
+.gradio-container { max-width: 1400px !important; }
+"""
+
+
+def _on_select(evt: gr.SelectData, df: pd.DataFrame) -> str:
+    if df is None or df.empty:
+        return ""
+    row = df.iloc[evt.index[0]]
+    return row["Dataset"]
+
+
+def _on_filter(
+    query, modalities, subject_types, sources, licenses, min_subjects, only_on_hf
+):
+    filtered = _filter(
+        CATALOG, query, modalities, subject_types, sources, licenses, min_subjects, only_on_hf
+    )
+    count_md = f"**{len(filtered)}** of {len(CATALOG)} datasets"
+    return _render_table(filtered), count_md
+
+
+with gr.Blocks(title="EEGDash Dataset Catalog", css=CSS, theme=gr.themes.Soft()) as demo:
+    gr.Markdown(
+        f"""# 🧠 EEGDash Dataset Catalog
+
+Search {len(CATALOG)}+ EEG/MEG datasets and get copy-paste load snippets.
+Mirrored from [eegdash.org]({EEGDASH_URL}) · Code on [GitHub]({GITHUB_URL}) ·
+Library on [PyPI](https://pypi.org/project/eegdash/).
+"""
+    )
+
+    with gr.Row():
+        with gr.Column(scale=1):
+            query = gr.Textbox(
+                label="Search",
+                placeholder="dataset id, author, year…",
+                show_label=True,
+            )
+            modalities = gr.CheckboxGroup(
+                label="Modality",
+                choices=MODALITY_CHOICES,
+                value=[],
+            )
+            subject_types = gr.CheckboxGroup(
+                label="Subject type",
+                choices=SUBJECT_CHOICES,
+                value=[],
+            )
+            sources = gr.CheckboxGroup(
+                label="Source",
+                choices=SOURCE_CHOICES,
+                value=[],
+            )
+            licenses = gr.Dropdown(
+                label="License",
+                choices=LICENSE_CHOICES,
+                multiselect=True,
+                value=[],
+            )
+            min_subjects = gr.Slider(
+                label="Min. subjects",
+                minimum=0,
+                maximum=500,
+                step=10,
+                value=0,
+            )
+            only_on_hf = gr.Checkbox(label="Only datasets mirrored on 🤗", value=False)
+            count = gr.Markdown(f"**{len(CATALOG)}** of {len(CATALOG)} datasets")
+
+        with gr.Column(scale=3):
+            table = gr.Dataframe(
+                value=_render_table(CATALOG),
+                interactive=False,
+                wrap=True,
+                column_widths=[
+                    "130px", "140px", "90px", "90px", "120px", "110px",
+                    "150px", "90px", "90px", "70px", "90px", "120px",
+                    "90px", "130px", "70px",
+                ],
+                label="Catalog",
+                show_search="filter",  # NOTE: may not be accepted by every Gradio release; verify against the pinned 4.44.0
+            )
+            detail = gr.Markdown(
+                "Pick a dataset row above to see details and load snippets.",
+                elem_id="detail",
+            )
+
+    filter_inputs = [
+        query, modalities, subject_types, sources, licenses, min_subjects, only_on_hf,
+    ]
+    for w in filter_inputs:
+        w.change(_on_filter, filter_inputs, [table, count])
+
+    selected_slug = gr.State("")
+    table.select(_on_select, [table], [selected_slug])
+    selected_slug.change(lambda s: _detail(CATALOG, s), [selected_slug], [detail])
+
+
+if __name__ == "__main__":
+    demo.launch(server_name="0.0.0.0", server_port=int(os.environ.get("PORT", 7860)))
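
Note on `_hf_repos` in `app.py`: because the function is wrapped in `lru_cache(maxsize=1)` and returns an empty set on failure, a failed lookup is itself cached for the life of the process. The sketch below reproduces that behaviour in isolation. `fetch_slugs` and the `state` dictionary are hypothetical stand-ins for the Hub call and the network condition, and `frozenset` is used here only to keep the cached value immutable.

```python
from functools import lru_cache

# Hypothetical stand-in for network state and a call counter.
state = {"network_up": False, "calls": 0}


def fetch_slugs():
    """Pretend Hub listing; raises while the simulated network is down."""
    state["calls"] += 1
    if not state["network_up"]:
        raise ConnectionError("no outbound network")
    return {"ds002718"}


@lru_cache(maxsize=1)
def hub_slugs():
    # Same shape as the Space's helper: any failure degrades to an empty set.
    try:
        return frozenset(fetch_slugs())
    except Exception:
        return frozenset()


first = hub_slugs()            # lookup fails, the empty set is cached
state["network_up"] = True
second = hub_slugs()           # still the cached empty set; no new fetch
hub_slugs.cache_clear()        # roughly what a restart or redeploy achieves
third = hub_slugs()            # fetch runs again and now succeeds
print(first, second, third, state["calls"])
```

If retrying without a restart ever becomes desirable, one option would be to skip caching on failure, at the cost of repeating the request on every page load while the Hub is unreachable.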
dataset_summary.csv ADDED
The diff for this file is too large to render.
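
Although the CSV itself is not rendered, the docstring of `_parse_mode_from_json_col` in `app.py` describes how the `nchans_set` and `sampling_freqs` cells are expected to look: a list of `{val, count}` entries. The standalone example below mirrors that parsing logic on made-up cells to show what the catalog would display. The function name `mode_from_cell` and the sample cell values are illustrative assumptions, not rows from the real file.

```python
import ast
import json


def mode_from_cell(cell):
    """Return the 'val' with the highest 'count' as a string, or '' if unparseable."""
    if not isinstance(cell, str) or not cell.strip():
        return ""
    try:
        parsed = json.loads(cell)
    except json.JSONDecodeError:
        # Fall back to Python-literal syntax (e.g. single-quoted keys).
        try:
            parsed = ast.literal_eval(cell)
        except (SyntaxError, ValueError):
            return ""
    if not parsed:
        return ""
    top = max(parsed, key=lambda d: d.get("count", 0))
    val = top.get("val", "")
    # Show 250.0 as 250 so the table stays compact.
    if isinstance(val, float) and val.is_integer():
        val = int(val)
    return str(val)


json_cell = '[{"val": 64, "count": 3}, {"val": 128, "count": 10}]'
python_cell = "[{'val': 250.0, 'count': 5}]"
print(mode_from_cell(json_cell))    # most frequent channel count
print(mode_from_cell(python_cell))  # parsed through the literal_eval fallback
print(repr(mode_from_cell("n/a")))  # unparseable cells become an empty string
```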
 
requirements.txt ADDED
@@ -0,0 +1,3 @@
+gradio==4.44.0
+pandas>=2.0
+huggingface_hub>=0.24