
bruAristimunha committed on Apr 19

Commit

e0a7464

0 Parent(s):

Initial Space: searchable EEGDash catalog


Files changed (5)

DEPLOY.md +108 -0
README.md +58 -0
app.py +342 -0
dataset_summary.csv +0 -0
requirements.txt +3 -0

DEPLOY.md ADDED

	@@ -0,0 +1,108 @@

+# Deploying the EEGDash Space and datasets
+One-time setup, per-push workflow, and how the dataset mirrors are kept in sync.
+## 1. Create the org (one-time)
+1. Sign in at <https://huggingface.co>.
+2. Create org → handle **`EEGDash`**, display name *EEG-DaSh*, link
+   `https://eegdash.org` and `https://github.com/eegdash/EEGDash`, upload
+   `docs/source/_static/eegdash_image_only.svg` as the logo.
+3. Add maintainers.
+4. Generate a **write** access token (Settings → Access Tokens) and export it as
+   `HF_TOKEN` locally and in CI.
+## 2. Create the Space
+```bash
+huggingface-cli login            # paste the write token
+huggingface-cli repo create \
+    --type space --space_sdk gradio EEGDash/catalog
+```
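The same Space could also be created from Python instead of the CLI. The sketch below assumes `huggingface_hub` is installed and that `HF_TOKEN` holds the write token from step 1. `create_space` is a hypothetical helper name, not something in this repo; it only wraps `HfApi.create_repo` with `repo_type="space"` and `space_sdk="gradio"`.

```python
def create_space(api, repo_id: str = "EEGDash/catalog", sdk: str = "gradio"):
    """Create the Space if it does not exist yet and return the API result.

    `api` is expected to behave like `huggingface_hub.HfApi`; passing it in
    keeps the helper easy to dry-run with a stand-in object.
    """
    return api.create_repo(
        repo_id=repo_id,
        repo_type="space",  # a Space, not a model or dataset repo
        space_sdk=sdk,      # the Hub needs an SDK when repo_type is "space"
        exist_ok=True,      # re-running is a no-op instead of an error
    )


# Real usage (needs network access and a write token):
#   import os
#   from huggingface_hub import HfApi
#   create_space(HfApi(token=os.environ["HF_TOKEN"]))
```

Because of `exist_ok=True`, the call can be repeated safely, which may be convenient if this step ever moves into CI.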
+## 3. Push the Space
+From the repo root:
+```bash
+cd huggingface-space
+git init -b main
+git remote add origin https://huggingface.co/spaces/EEGDash/catalog
+git add README.md app.py requirements.txt dataset_summary.csv
+git commit -m "Initial Space: searchable EEGDash catalog"
+git push origin main
+```
+The Space will build and become available at <https://huggingface.co/spaces/EEGDash/catalog>.
+### Keeping the catalog fresh
+`dataset_summary.csv` in this folder is a snapshot of
+`eegdash/dataset/dataset_summary.csv`. Refresh it whenever the source changes:
+```bash
+cp ../eegdash/dataset/dataset_summary.csv dataset_summary.csv
+git add dataset_summary.csv
+git commit -m "Refresh catalog snapshot"
+git push
+```
+A GitHub Action that runs on pushes to `develop` can automate this — see the
+stub in `.github/workflows/sync-hf-space.yml` (add when ready).
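Whether a refresh is needed can be checked before committing. The following is a minimal sketch, assuming a byte-for-byte comparison is an acceptable definition of "changed"; `file_digest` and `snapshot_is_stale` are hypothetical helper names, not part of this repo.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def snapshot_is_stale(source: Path, snapshot: Path) -> bool:
    """True when the snapshot is missing or its bytes differ from the source."""
    if not snapshot.exists():
        return True
    return file_digest(source) != file_digest(snapshot)
```

Run from `huggingface-space/`, `snapshot_is_stale(Path("../eegdash/dataset/dataset_summary.csv"), Path("dataset_summary.csv"))` would indicate whether the copy-and-commit steps above are worth running.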
+## 4. Mirror datasets to `EEGDash/<slug>`
+This is what powers the `on 🤗` column. Push one or more datasets with the helper
+script at `scripts/push_to_hf.py`:
+```bash
+# Single dataset
+python scripts/push_to_hf.py --dataset ds002718
+# Batch, skipping anything already on the Hub, capped at 5 GB
+python scripts/push_to_hf.py \
+    --from-csv eegdash/dataset/dataset_summary.csv \
+    --max-size-gb 5 \
+    --skip-existing
+```
+Under the hood this calls `EEGDashDataset(...).push_to_hub("EEGDash/<slug>")`,
+a method provided by braindecode's `HubDatasetMixin`. The resulting repo is
+laid out as follows:
+```
+EEGDash/<slug>/
+├── README.md                        # Dataset card with load snippets
+├── format_info.json                 # Version + compression metadata
+└── sourcedata/braindecode/
+    ├── dataset_description.json     # BIDS-compliant
+    ├── participants.tsv             # BIDS-compliant
+    ├── dataset.zarr/                # blosc-compressed windowed data
+    └── sub-<label>/eeg/
+        ├── *_events.tsv
+        ├── *_channels.tsv
+        └── *_eeg.json
+```
+Users then load it with:
+```python
+from braindecode.datasets import BaseConcatDataset
+ds = BaseConcatDataset.pull_from_hub("EEGDash/ds002718")
+```
+## 5. Verify
+- Space renders: <https://huggingface.co/spaces/EEGDash/catalog>.
+- Org page shows the Space card + dataset repos: <https://huggingface.co/EEGDash>.
+- At least one dataset loads end-to-end via `pull_from_hub`.
+## Troubleshooting
+| Symptom | Likely cause and fix |
+|---|---|
+| `on 🤗` column empty for everything | The Space has no outbound network access or was rate-limited. The repo list is cached once per process, so redeploy to retry. |
+| `push_to_hub` fails with `ImportError` | Optional Hub dependencies are missing. Install them with `pip install braindecode[hub]` (pulls in `zarr` + `huggingface_hub`). |
+| Repo exists but Space doesn't flag it | `HfApi().list_datasets(author="EEGDash", limit=500)` caps at 500. Raise the limit in `app.py::_hf_repos` if the org grows beyond that. |
+| `dataset_summary.csv` out of sync | The snapshot was not refreshed after the source changed. Re-run the refresh in step 3 or add the workflow stub. |
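The first row of the table follows from `_hf_repos` being wrapped in `lru_cache`, which also keeps the empty set returned after a failure. One possible alternative, sketched here with hypothetical names (`RepoCache`, `fetch`, `ttl_seconds`, `clock`), holds the last good result for a fixed time and retries on the next call after a failure. This is an illustration of the idea, not what `app.py` currently does.

```python
import time
from typing import Callable, Optional, Set


class RepoCache:
    """Cache a set of repo slugs, retrying on the next call after a failure."""

    def __init__(
        self,
        fetch: Callable[[], Set[str]],
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[Set[str]] = None
        self._stamp = 0.0

    def get(self) -> Set[str]:
        now = self._clock()
        if self._value is not None and now - self._stamp < self._ttl:
            return self._value  # still fresh, no network call
        try:
            value = set(self._fetch())
        except Exception:
            # Failure: serve the last good value if there is one, else an
            # empty set, and keep the old timestamp so the next call retries.
            return self._value if self._value is not None else set()
        self._value, self._stamp = value, now
        return value
```

In the Space, `fetch` could be a small function that lists the org's dataset repos with `HfApi().list_datasets(author="EEGDash")` and returns their slugs, as `_hf_repos` does.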

README.md ADDED

	@@ -0,0 +1,58 @@

+---
+title: EEGDash Dataset Catalog
+emoji: 🧠
+colorFrom: blue
+colorTo: indigo
+sdk: gradio
+sdk_version: 4.44.0
+app_file: app.py
+pinned: true
+license: bsd-3-clause
+short_description: Search 200+ EEG/MEG datasets and load them with one line.
+tags:
+  - eeg
+  - meg
+  - neuroscience
+  - brain-computer-interface
+  - braindecode
+  - pytorch
+  - datasets
+hf_oauth: false
+---
+# EEGDash — Dataset Catalog
+Search, filter, and load 200+ publicly shared EEG/MEG datasets. Mirrors the
+catalog at [eegdash.org](https://eegdash.org) and generates one-liner load
+snippets for [EEGDash](https://github.com/eegdash/EEGDash) and
+[braindecode](https://braindecode.org).
+## How it works
+- The left panel filters the catalog by free-text search, modality, subject
+  type, source, license, and minimum subject count.
+- Selecting a row shows the dataset card + copy-paste load snippets.
+- Rows tagged **on 🤗** have a mirrored HF dataset repo at
+  `EEGDash/<slug>` and can be fetched with
+  `BaseConcatDataset.pull_from_hub(...)`.
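The filter behaviour can be sketched in plain Python: an empty selection places no constraint, values within one filter are alternatives, and separate filters combine with AND. The toy rows and the `matches` helper are illustrative assumptions; the Space applies the same idea to a pandas DataFrame in `app.py`.

```python
from typing import Iterable, Mapping


def matches(
    row: Mapping[str, object],
    *,
    query: str = "",
    modalities: Iterable[str] = (),
    sources: Iterable[str] = (),
    min_subjects: int = 0,
) -> bool:
    """Return True when a catalog row passes every active filter."""
    q = query.lower().strip()
    haystack = f"{row['dataset']} {row['author_year']}".lower()
    if q and q not in haystack:
        return False
    modalities, sources = set(modalities), set(sources)
    if modalities and row["modality of exp"] not in modalities:
        return False
    if sources and row["source"] not in sources:
        return False
    return int(row["n_subjects"]) >= min_subjects


# Toy rows with made-up values (not taken from dataset_summary.csv).
rows = [
    {"dataset": "demo-a", "author_year": "Doe 2020", "source": "openneuro",
     "modality of exp": "Visual", "n_subjects": 12},
    {"dataset": "demo-b", "author_year": "Roe 2021", "source": "nemar",
     "modality of exp": "Auditory", "n_subjects": 40},
]

hits = [r["dataset"] for r in rows if matches(r, modalities=["Auditory"], min_subjects=20)]
print(hits)  # → ['demo-b']
```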
+## Loading a dataset
+```python
+# Native EEGDash (streams from S3/NEMAR)
+from eegdash import EEGDashDataset
+ds = EEGDashDataset(dataset="ds002718", cache_dir="./cache")
+# From HF Hub (braindecode's pull_from_hub, BIDS-inspired Zarr)
+from braindecode.datasets import BaseConcatDataset
+ds = BaseConcatDataset.pull_from_hub("EEGDash/ds002718")
+```
+## Deploying / updating the Space
+See [`DEPLOY.md`](./DEPLOY.md) for the one-time org setup and per-push workflow.
+## License
+BSD-3-Clause. The hosted datasets retain their upstream licenses — consult each
+dataset card before redistribution.

app.py ADDED

	@@ -0,0 +1,342 @@

+"""EEGDash Dataset Catalog — Hugging Face Space.
+Mirrors the searchable table from https://eegdash.org and generates one-liner
+load snippets for EEGDash and braindecode. Rows whose slug matches an existing
+repo under the ``EEGDash`` org on the Hub are flagged as ``on 🤗`` and can be
+loaded via ``BaseConcatDataset.pull_from_hub``.
+"""
+from __future__ import annotations
+import ast
+import json
+import os
+from functools import lru_cache
+from pathlib import Path
+import gradio as gr
+import pandas as pd
+from huggingface_hub import HfApi
+from huggingface_hub.utils import HfHubHTTPError
+HF_ORG = "EEGDash"
+CSV_PATH = Path(__file__).parent / "dataset_summary.csv"
+EEGDASH_URL = "https://eegdash.org"
+GITHUB_URL = "https://github.com/eegdash/EEGDash"
+TABLE_COLUMNS = [
+    "dataset",
+    "author_year",
+    "source",
+    "record_modality",
+    "Type Subject",
+    "modality of exp",
+    "type of exp",
+    "n_subjects",
+    "n_records",
+    "n_tasks",
+    "nchans",
+    "sfreq",
+    "size",
+    "license",
+    "on_hf",
+]
+DISPLAY_HEADERS = {
+    "dataset": "Dataset",
+    "author_year": "Author (year)",
+    "source": "Source",
+    "record_modality": "Recording",
+    "Type Subject": "Pathology",
+    "modality of exp": "Modality",
+    "type of exp": "Type",
+    "n_subjects": "Subjects",
+    "n_records": "Records",
+    "n_tasks": "Tasks",
+    "nchans": "Channels",
+    "sfreq": "Sampling rate (Hz)",
+    "size": "Size",
+    "license": "License",
+    "on_hf": "on 🤗",
+}
+def _parse_mode_from_json_col(cell: object) -> str:
+    """Return the most common value from a ``[{val, count}, ...]`` JSON cell.
+    The summary CSV stores per-recording distributions of channel counts and
+    sampling rates as a JSON list. The catalog UI wants a single
+    representative value: the one with the highest ``count``.
+    """
+    if not isinstance(cell, str) or not cell.strip():
+        return ""
+    try:
+        parsed = json.loads(cell)
+    except json.JSONDecodeError:
+        try:
+            parsed = ast.literal_eval(cell)
+        except (SyntaxError, ValueError):
+            return ""
+    if not parsed:
+        return ""
+    top = max(parsed, key=lambda d: d.get("count", 0))
+    val = top.get("val", "")
+    if isinstance(val, float) and val.is_integer():
+        val = int(val)
+    return str(val)
+@lru_cache(maxsize=1)
+def _hf_repos() -> set[str]:
+    """Slugs that exist as dataset repos under the EEGDash org.
+    Cached for the lifetime of the process. Failures (no network, rate limit)
+    degrade to an empty set rather than breaking the page load.
+    """
+    try:
+        api = HfApi()
+        repos = api.list_datasets(author=HF_ORG, limit=500)
+        return {r.id.split("/", 1)[-1] for r in repos}
+    except (HfHubHTTPError, Exception):  # noqa: BLE001
+        return set()
+def _load_catalog() -> pd.DataFrame:
+    df = pd.read_csv(CSV_PATH)
+    df["nchans"] = df["nchans_set"].apply(_parse_mode_from_json_col)
+    df["sfreq"] = df["sampling_freqs"].apply(_parse_mode_from_json_col)
+    on_hub = _hf_repos()
+    df["on_hf"] = df["dataset"].apply(lambda s: "✓" if s in on_hub else "")
+    for col in ("n_subjects", "n_records", "n_tasks"):
+        df[col] = pd.to_numeric(df[col], errors="coerce").fillna(0).astype(int)
+    for col in TABLE_COLUMNS:
+        if col not in df.columns:
+            df[col] = ""
+    df = df[TABLE_COLUMNS].fillna("")
+    return df
+def _unique_sorted(series: pd.Series) -> list[str]:
+    return sorted({str(v).strip() for v in series if str(v).strip()})
+def _filter(
+    df: pd.DataFrame,
+    query: str,
+    modalities: list[str],
+    subject_types: list[str],
+    sources: list[str],
+    licenses: list[str],
+    min_subjects: int,
+    only_on_hf: bool,
+) -> pd.DataFrame:
+    out = df
+    if query:
+        q = query.lower().strip()
+        hay = (
+            out["dataset"].str.lower()
+            + " "
+            + out["author_year"].str.lower()
+        )
+        out = out[hay.str.contains(q, regex=False, na=False)]
+    if modalities:
+        out = out[out["modality of exp"].isin(modalities)]
+    if subject_types:
+        out = out[out["Type Subject"].isin(subject_types)]
+    if sources:
+        out = out[out["source"].isin(sources)]
+    if licenses:
+        out = out[out["license"].isin(licenses)]
+    if min_subjects > 0:
+        out = out[out["n_subjects"] >= min_subjects]
+    if only_on_hf:
+        out = out[out["on_hf"] == "✓"]
+    return out
+def _render_table(df: pd.DataFrame) -> pd.DataFrame:
+    return df.rename(columns=DISPLAY_HEADERS)
+def _snippets(slug: str, on_hf: bool) -> str:
+    native = f"""```python
+# EEGDash (streams from S3 / NEMAR, preserves BIDS)
+from eegdash import EEGDashDataset
+ds = EEGDashDataset(dataset="{slug}", cache_dir="./cache")
+print(len(ds), "recordings")
+```"""
+    hf_block = f"""```python
+# From Hugging Face (braindecode Zarr format, pre-windowed)
+from braindecode.datasets import BaseConcatDataset
+ds = BaseConcatDataset.pull_from_hub("{HF_ORG}/{slug}")
+```"""
+    if not on_hf:
+        hf_block = (
+            "> ℹ️ Not mirrored on Hugging Face yet. "
+            "Open an issue on "
+            f"[github.com/eegdash/EEGDash]({GITHUB_URL}/issues) to request it, "
+            "or push it yourself:\n\n"
+            "```python\n"
+            "from eegdash import EEGDashDataset\n"
+            f'ds = EEGDashDataset(dataset="{slug}", cache_dir="./cache")\n'
+            f'ds.push_to_hub("{HF_ORG}/{slug}")\n'
+            "```"
+        )
+    return native + "\n\n" + hf_block
+def _detail(df: pd.DataFrame, slug: str) -> str:
+    if not slug:
+        return "Pick a dataset row above to see details and load snippets."
+    match = df[df["dataset"] == slug]
+    if match.empty:
+        return f"Dataset `{slug}` not found in the catalog."
+    row = match.iloc[0]
+    on_hf = row["on_hf"] == "✓"
+    doi = row.get("doi", "")
+    title = row.get("dataset_title", "") or slug
+    lines = [f"## `{slug}` — {title}"]
+    if on_hf:
+        lines.append(
+            f"[🤗 EEGDash/{slug}](https://huggingface.co/datasets/{HF_ORG}/{slug})"
+        )
+    if doi:
+        lines.append(f"[DOI: {doi}](https://doi.org/{doi})")
+    lines.append("")
+    lines.append("| | |")
+    lines.append("|--|--|")
+    for key, label in [
+        ("author_year", "Author (year)"),
+        ("source", "Source"),
+        ("record_modality", "Recording"),
+        ("Type Subject", "Pathology"),
+        ("modality of exp", "Modality"),
+        ("type of exp", "Type"),
+        ("n_subjects", "Subjects"),
+        ("n_records", "Records"),
+        ("n_tasks", "Tasks"),
+        ("nchans", "Channels"),
+        ("sfreq", "Sampling rate (Hz)"),
+        ("size", "Size"),
+        ("license", "License"),
+    ]:
+        val = row.get(key, "")
+        if str(val).strip():
+            lines.append(f"| **{label}** | {val} |")
+    lines.append("")
+    lines.append("### Load")
+    lines.append(_snippets(slug, on_hf))
+    return "\n".join(lines)
+CATALOG = _load_catalog()
+MODALITY_CHOICES = _unique_sorted(CATALOG["modality of exp"])
+SUBJECT_CHOICES = _unique_sorted(CATALOG["Type Subject"])
+SOURCE_CHOICES = _unique_sorted(CATALOG["source"])
+LICENSE_CHOICES = _unique_sorted(CATALOG["license"])
+CSS = """
+#detail { min-height: 320px; }
+.gradio-container { max-width: 1400px !important; }
+"""
+def _on_select(evt: gr.SelectData, df: pd.DataFrame) -> str:
+    if df is None or df.empty:
+        return ""
+    row = df.iloc[evt.index[0]]
+    return row["Dataset"]
+def _on_filter(
+    query, modalities, subject_types, sources, licenses, min_subjects, only_on_hf
+):
+    filtered = _filter(
+        CATALOG, query, modalities, subject_types, sources, licenses, min_subjects, only_on_hf
+    )
+    count_md = f"**{len(filtered)}** of {len(CATALOG)} datasets"
+    return _render_table(filtered), count_md
+with gr.Blocks(title="EEGDash Dataset Catalog", css=CSS, theme=gr.themes.Soft()) as demo:
+    gr.Markdown(
+        f"""# 🧠 EEGDash Dataset Catalog
+Search {len(CATALOG)}+ EEG/MEG datasets and get copy-paste load snippets.
+Mirrored from [eegdash.org]({EEGDASH_URL}) · Code on [GitHub]({GITHUB_URL}) ·
+Library on [PyPI](https://pypi.org/project/eegdash/).
+"""
+    )
+    with gr.Row():
+        with gr.Column(scale=1):
+            query = gr.Textbox(
+                label="Search",
+                placeholder="dataset id, author, year…",
+                show_label=True,
+            )
+            modalities = gr.CheckboxGroup(
+                label="Modality",
+                choices=MODALITY_CHOICES,
+                value=[],
+            )
+            subject_types = gr.CheckboxGroup(
+                label="Subject type",
+                choices=SUBJECT_CHOICES,
+                value=[],
+            )
+            sources = gr.CheckboxGroup(
+                label="Source",
+                choices=SOURCE_CHOICES,
+                value=[],
+            )
+            licenses = gr.Dropdown(
+                label="License",
+                choices=LICENSE_CHOICES,
+                multiselect=True,
+                value=[],
+            )
+            min_subjects = gr.Slider(
+                label="Min. subjects",
+                minimum=0,
+                maximum=500,
+                step=10,
+                value=0,
+            )
+            only_on_hf = gr.Checkbox(label="Only datasets mirrored on 🤗", value=False)
+            count = gr.Markdown(f"**{len(CATALOG)}** of {len(CATALOG)} datasets")
+        with gr.Column(scale=3):
+            table = gr.Dataframe(
+                value=_render_table(CATALOG),
+                interactive=False,
+                wrap=True,
+                column_widths=[
+                    "130px", "140px", "90px", "90px", "120px", "110px",
+                    "150px", "90px", "90px", "70px", "90px", "120px",
+                    "90px", "130px", "70px",
+                ],
+                label="Catalog",
+            )
+            detail = gr.Markdown(
+                "Pick a dataset row above to see details and load snippets.",
+                elem_id="detail",
+            )
+    filter_inputs = [
+        query, modalities, subject_types, sources, licenses, min_subjects, only_on_hf,
+    ]
+    for w in filter_inputs:
+        w.change(_on_filter, filter_inputs, [table, count])
+    selected_slug = gr.State("")
+    table.select(_on_select, [table], [selected_slug])
+    selected_slug.change(lambda s: _detail(CATALOG, s), [selected_slug], [detail])
+if __name__ == "__main__":
+    demo.launch(server_name="0.0.0.0", server_port=int(os.environ.get("PORT", 7860)))
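To make the `[{val, count}, ...]` convention handled by `_parse_mode_from_json_col` concrete, here is a standalone sketch with example cells. `mode_from_counts` is a hypothetical name, the sample strings are invented, and the type check on the parsed value is an added guard that the function in `app.py` does not have.

```python
import ast
import json


def mode_from_counts(cell: object) -> str:
    """Pick the `val` with the highest `count` from a JSON-like list cell."""
    if not isinstance(cell, str) or not cell.strip():
        return ""
    try:
        parsed = json.loads(cell)
    except json.JSONDecodeError:
        try:
            parsed = ast.literal_eval(cell)  # handles Python-repr cells
        except (SyntaxError, ValueError):
            return ""
    # Added guard (assumption): ignore anything that is not a list of dicts.
    if not isinstance(parsed, list) or not all(isinstance(d, dict) for d in parsed):
        return ""
    if not parsed:
        return ""
    val = max(parsed, key=lambda d: d.get("count", 0)).get("val", "")
    if isinstance(val, float) and val.is_integer():
        val = int(val)  # render 500.0 as "500"
    return str(val)


print(mode_from_counts('[{"val": 64, "count": 10}, {"val": 128, "count": 3}]'))  # → 64
print(mode_from_counts("[{'val': 500.0, 'count': 2}]"))  # → 500
print(repr(mode_from_counts("64")))  # → '' (a bare number is not a distribution)
```

Without the guard, a cell such as `"64"` would parse to an integer and `max` would raise a `TypeError`, so a check of this kind may be worth considering if the CSV is not fully trusted.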

dataset_summary.csv ADDED

The diff for this file is too large to render. See raw diff

requirements.txt ADDED

	@@ -0,0 +1,3 @@

+gradio==4.44.0
+pandas>=2.0
+huggingface_hub>=0.24