MatteoFasulo committed on
Commit
45d17fb
·
1 Parent(s): 6d283f0

refactor: EMG processing scripts and documentation

Files changed (9)
  1. README.md +47 -280
  2. scripts/README.md +26 -125
  3. scripts/db5.py +122 -13
  4. scripts/db6.py +108 -60
  5. scripts/db7.py +108 -58
  6. scripts/db8.py +87 -20
  7. scripts/emg2pose.py +118 -38
  8. scripts/epn.py +118 -16
  9. scripts/uci.py +132 -32
README.md CHANGED
@@ -103,310 +103,77 @@ tags:
103
 
104
  <div align="center">
105
  <img src="https://raw.githubusercontent.com/MatteoFasulo/BioFoundation/refs/heads/TinyMyo/docs/model/logo/TinyMyo_logo.png" alt="TinyMyo Logo" width="400" />
106
- <h1>TinyMyo: a Tiny Foundation Model for Flexible EMG Signal Processing at the Edge</h1>
107
- </div>
108
- <p align="center">
109
- <a href="https://github.com/pulp-bio/BioFoundation">
110
- <img src ="https://img.shields.io/github/stars/pulp-bio/BioFoundation?color=ccf" alt="Github">
111
- </a>
112
- <a href="https://creativecommons.org/licenses/by-nd/4.0/">
113
- <img src="https://img.shields.io/badge/License-CC_BY--ND_4.0-lightgrey.svg" alt="License">
114
- </a>
115
- <a href="https://arxiv.org/abs/2512.15729">
116
- <img src="https://img.shields.io/badge/arXiv-2512.15729-b31b1b.svg" alt="Paper">
117
- </a>
118
  </p>
119
 
120
- **TinyMyo** is a **3.6M-parameter** Transformer-based **foundation model for surface EMG (sEMG)**.
121
- It is pretrained on >480 GB of EMG data and optimized for **ultra-low-power, real-time deployment**, including **microcontrollers (GAP9)**, where it achieves an inference time of **0.785 s**, an energy of **44.91 mJ**, and a power envelope of **57.18 mW**.
122
-
123
- TinyMyo is built for **broad generalization** across datasets, sensor configurations, movement tasks, subjects, and domains (gesture, kinematics, speech).
124
-
125
- ---
126
-
127
- # 🔒 License & Usage (Model Weights)
128
-
129
- The released TinyMyo weights are licensed under **CC BY-ND 4.0**.
130
- This summary is not legal advice; please read the full license.
131
-
132
- ### ✅ You may
133
-
134
- * **Use** and **redistribute** the **unmodified** TinyMyo weights (including commercially) **with attribution**.
135
- * **Fine-tune/modify internally** for research or production without redistributing modified weights.
136
- * **Publish code, configs, evaluations, and papers** using TinyMyo.
137
-
138
- ### 🚫 You may not
139
-
140
- * **Share or host modified weights** in any form (including LoRA/adapter deltas, pruned/quantized models).
141
- * **Claim endorsement** from the TinyMyo authors without permission.
142
- * **Use the TinyMyo name** for derivative models.
143
-
144
- ### 🤝 Contributing Improvements
145
-
146
- To upstream improvements, submit a **PR** to the
147
- **[BioFoundation repository](https://github.com/pulp-bio/BioFoundation)** with:
148
-
149
- 1. Full reproducibility artifacts (configs, logs, seeds, environment).
150
- 2. Evaluation on standard protocols (e.g., DB5, EPN-612, UCI EMG, DB8, Silent Speech).
151
- 3. Comparison to TinyMyo's reported metrics.
152
-
153
- Approved PRs will be retrained and released as **official TinyMyo** checkpoints under CC BY-ND.
154
-
155
- ---
156
-
157
- # 🔎 1. Default Input & Preprocessing
158
-
159
- Unless specified otherwise, TinyMyo expects:
160
-
161
- * **Channels:** 16
162
- * **Sampling rate:** 2000 Hz
163
- * **Segment length:** 1000 samples (0.5 s)
164
- * **Windowing:** 50% overlap (pretraining)
165
- * **Preprocessing:**
166
-
167
- * 4th-order **20–450 Hz bandpass**
168
- * **50 Hz notch filter**
169
- * **Min–max normalization** (pretraining)
170
- * **Z-score normalization** (downstream)
171
-
172
- Datasets with <16 channels are **zero-padded (pretraining only)**.
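The default pipeline above can be sketched with SciPy. This is a minimal illustration, not the released code: the function name and the epsilon guard in the normalization are assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt, iirnotch

FS = 2000.0  # pretraining sampling rate (Hz)

def preprocess_window(emg: np.ndarray) -> np.ndarray:
    """emg: (T, C) raw window -> filtered, min-max normalized window."""
    # 4th-order 20-450 Hz Butterworth bandpass (zero-phase)
    b, a = butter(4, [20.0 / (FS / 2), 450.0 / (FS / 2)], btype="band")
    out = filtfilt(b, a, emg, axis=0)
    # 50 Hz notch for power-line interference
    bn, an = iirnotch(50.0, Q=30.0, fs=FS)
    out = filtfilt(bn, an, out, axis=0)
    # per-channel min-max normalization to [0, 1] (pretraining variant)
    mn, mx = out.min(axis=0), out.max(axis=0)
    return (out - mn) / (mx - mn + 1e-8)
```

For downstream tasks the min-max step would be swapped for z-score normalization, as stated above.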
173
-
174
- ---
175
-
176
- # 🔬 2. Pretraining Overview
177
-
178
- TinyMyo is pretrained via masked reconstruction on **three large-scale EMG datasets**:
179
-
180
- | Dataset | Subjects | fs | Channels | Size |
181
- | ----------- | -------- | ------- | -------- | ------- |
182
- | Ninapro DB6 | 10 | 2000 Hz | 14 | 20.3 GB |
183
- | Ninapro DB7 | 22 | 2000 Hz | 12 | 30.9 GB |
184
- | EMG2Pose | 192 | 2000 Hz | 16 | 431 GB |
185
-
186
- ## Tokenization: Channel-Independent Patches
187
-
188
- Unlike EEG FMs that mix channels early, TinyMyo uses **per-channel patching**:
189
-
190
- * Patch length: **20 samples**
191
- * Patch stride: **20 samples**
192
- * Tokens/channel: **50**
193
- * Total seq length: **800 tokens** (16 × 50)
194
- * Positional encoding: **RoPE**
195
-
196
- This preserves electrode-specific structure while allowing attention to learn cross-channel relationships.
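The token arithmetic above (16 channels × 50 patches of 20 samples = 800 tokens) can be checked with a small NumPy sketch; the array names are illustrative, and the linear projection to the 192-dim embedding is omitted.

```python
import numpy as np

# One 0.5 s window at 2 kHz: 16 channels x 1000 samples.
window = np.random.randn(16, 1000)

patch_len = 20  # samples per patch; stride 20 -> non-overlapping
C, T = window.shape

# Per-channel patching: each channel is split independently,
# so electrode-specific structure is preserved per token.
patches = window.reshape(C, T // patch_len, patch_len)      # (16, 50, 20)
tokens = patches.reshape(C * (T // patch_len), patch_len)   # (800, 20)
```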
197
-
198
- ## Transformer Encoder
199
-
200
- * **8 layers**, **3 heads**
201
- * Embedding dim: **192**
202
- * Pre-LayerNorm
203
- * Dropout & drop-path: **0.1**
204
-
205
- ## Lightweight Decoder
206
-
207
- A **single linear layer** (~3.9k params) reconstructs masked patches.
208
- Following SimMIM, this forces the encoder to learn robust latent structure.
209
-
210
- ## Masking Objective
211
-
212
- * **50% random masking** with a learnable `[MASK]` token
213
- * Loss: **Smooth L1** with small penalty on visible patches
214
- $$
215
- \mathcal{L} = \mathcal{L}_{\text{masked}} + 0.1\,\mathcal{L}_{\text{visible}}
216
- $$
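A minimal NumPy sketch of the stated objective: Smooth L1 on masked patches plus a 0.1-weighted term on visible ones. The helper names and the `beta` transition point are assumptions, not the training code.

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    # Quadratic below beta, linear above (Huber-style Smooth L1)
    d = np.abs(pred - target)
    return np.where(d < beta, 0.5 * d**2 / beta, d - 0.5 * beta).mean()

def reconstruction_loss(pred, target, mask, lam=0.1):
    """mask: boolean array (same shape), True where the patch was masked."""
    l_masked = smooth_l1(pred[mask], target[mask])
    l_visible = smooth_l1(pred[~mask], target[~mask])
    return l_masked + lam * l_visible
```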
217
-
218
- ## Training Setup
219
-
220
- * Optimizer: **AdamW** (β=(0.9, 0.98), wd=0.01)
221
- * LR: **1e-4** with cosine decay
222
- * Batch size: **512** (with grad accumulation)
223
- * Epochs: **50**, warm-up: 10
224
- * Hardware: **4× NVIDIA GH200 GPUs**
225
-
226
- ---
227
-
228
- # 🧠 3. Architecture Summary
229
-
230
- ### Model Variant
231
-
232
- | Variant | Params | (Layers, Dim) |
233
- | ------- | -------- | ------------- |
234
- | TinyMyo | **3.6M** | (8, 192) |
235
-
236
- ---
237
-
238
- # 🎯 4. Downstream Tasks
239
-
240
- TinyMyo generalizes across **gesture classification**, **kinematic regression**, and **speech EMG**, with state-of-the-art or competitive results.
241
-
242
- ---
243
-
244
- ## 4.1 Hand Gesture Classification
245
-
246
- Evaluated on:
247
-
248
- * **Ninapro DB5** (52 classes, 10 subjects, 200 Hz)
249
- * **EPN-612** (5 classes, 612 subjects, 200 Hz)
250
- * **UCI EMG** (6 classes, 36 subjects, 200 Hz)
251
-
252
- ### Preprocessing
253
-
254
- * EMG filtering: **20–90 Hz bandpass + 50 Hz notch**
255
- * Window sizes:
256
-
257
- * **200 samples** (1 sec, best for DB5)
258
- * **1000 samples** (5 sec, best for EPN, UCI)
259
-
260
- ### Linear Classification Head
261
-
262
- * Input: **C Γ— 192**
263
- * Params: **<40k**
264
-
265
- ### Performance (Fine-tuned)
266
-
267
- | Dataset | Metric | Result |
268
- | ------------------------ | ------ | ----------------- |
269
- | **Ninapro DB5** (1 sec)  | Acc    | **89.41 ± 0.16%** |
270
- | **EPN-612** (5 sec)      | Acc    | **96.74 ± 0.09%** |
271
- | **UCI EMG** (5 sec)      | Acc    | **97.56 ± 0.32%** |
272
-
273
- TinyMyo achieves **new state-of-the-art** on DB5, EPN-612, and UCI.
274
-
275
- ---
276
-
277
- ## 4.2 Hand Kinematic Regression (Ninapro DB8)
278
-
279
- * Predict **5 joint angles**
280
- * Windows: **100 ms** or **500 ms**
281
- * Normalization: z-score only
282
-
283
- ### Regression Head (~788k params)
284
-
285
- * Depthwise + pointwise convs
286
- * Upsampling
287
- * Global average pooling
288
- * Linear projection to 5 outputs
289
-
290
- ### Performance
291
-
292
- * **MAE = 8.77 ± 0.12°** (500 ms)
293
-
294
- Note: Prior works reporting ~6.9° MAE are **subject-specific**; TinyMyo trains a **single cross-subject model**, a significantly harder setting.
295
-
296
- ---
297
-
298
- ## 4.3 Speech Production & Recognition (Silent Speech)
299
-
300
- Dataset: **Gaddy Silent Speech**
301
- (8 channels, 1000 Hz, face/neck EMG)
302
-
303
- ### Speech Production (EMG → MFCC → HiFi-GAN → Audio)
304
-
305
- Pipeline:
306
-
307
- 1. Residual downsampling
308
- 2. TinyMyo encoder
309
- 3. Linear projection → **26-dim MFCC**
310
- 4. HiFi-GAN vocoder
311
-
312
- **WER:** **33.54 ± 1.12%**
313
- ≈ state-of-the-art with **>90% fewer params** in the transduction model.
314
-
315
- ### Speech Recognition (EMG → Text)
316
-
317
- * TinyMyo encoder
318
- * Linear projection → **37 characters**
319
- * **CTC** loss
320
- * 4-gram LM + beam search
321
-
322
- **WER:** **33.95 ± 0.97%**
323
-
324
- TinyMyo is EMG-only, unlike multimodal systems such as MONA-LISA.
325
 
326
  ---
327
 
328
- # ⚡ 5. Edge Deployment (GAP9 MCU)
329
-
330
- TinyMyo runs efficiently on **GAP9 (RISC-V)** via:
331
-
332
- * **INT8 quantization**, including attention
333
- * Multi-level streaming (L3 to L2 to L1)
334
- * Integer LayerNorm, GELU, softmax
335
- * Static memory arena via liveness analysis
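As an illustration of the INT8 step, here is a minimal symmetric per-tensor quantizer in NumPy; the actual GAP9 toolchain, calibration, and integer kernels are not reproduced here.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization: x ~= scale * q."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Round-trip error is bounded by half a quantization step (scale / 2).
x = np.random.randn(192).astype(np.float32)
q, s = quantize_int8(x)
err = np.abs(dequantize(q, s) - x).max()
```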
336
 
337
- ### Runtime (EPN-612 dataset)
338
 
339
- * **Inference time**: **0.785 s**
340
- * **Energy**: **44.91 mJ**
341
- * **Average power**: **57.18 mW**
342
-
343
- This is the **first EMG foundation model demonstrated on a microcontroller**.
344
-
345
- ---
346
 
347
- # 📊 6. Results Summary
348
-
349
- ### Pretraining
350
-
351
- * Smooth L1 reconstruction with high fidelity
352
- * Total compute β‰ˆ **4.0 GFLOPs**
353
-
354
- ### Downstream Highlights
355
-
356
- * **DB5:** 89.41%
357
- * **EPN-612:** 96.74%
358
- * **UCI EMG:** 97.56%
359
- * **Neuromotor:** 0.153 CLER
360
- * **DB8 Regression:** MAE 8.77°
361
- * **Silent Speech Production:** 33.54% WER
362
- * **Silent Speech Recognition:** 33.95% WER
363
 
364
- TinyMyo matches or exceeds state-of-the-art performance, while being smaller and more efficient than all prior EMG foundation models.
 
 
 
365
 
366
  ---
367
 
368
- # πŸ› οΈ Code & Usage
 
 
 
 
369
 
370
- To fine-tune TinyMyo on downstream tasks, follow the examples in the
371
- **[BioFoundation repository](https://github.com/pulp-bio/BioFoundation)**.
372
 
373
- ```bash
374
- python -u run_train.py +experiment=TinyMyo_finetune \
375
- pretrained_safetensors_path=/path/to/model.safetensors
376
- ```
377
-
378
- Environment variables:
379
-
380
- * `DATA_PATH` → dataset path
381
- * `CHECKPOINT_DIR` → checkpoint to load
382
 
383
  ---
384
 
385
- ## 🔗 Resources
386
-
387
- - **Code:** https://github.com/pulp-bio/BioFoundation
 
388
 
389
  ---
390
 
391
- # 📜 Citation
392
-
393
- Please cite TinyMyo using:
394
 
395
  ```bibtex
396
- @misc{fasulo2025tinymyotinyfoundationmodel,
397
- title={TinyMyo: a Tiny Foundation Model for Flexible EMG Signal Processing at the Edge},
398
  author={Matteo Fasulo and Giusy Spacone and Thorir Mar Ingolfsson and Yawei Li and Luca Benini and Andrea Cossettini},
399
- year={2025},
400
  eprint={2512.15729},
401
  archivePrefix={arXiv},
402
  primaryClass={eess.SP},
403
- url={https://arxiv.org/abs/2512.15729},
404
  }
405
- ```
406
-
407
- ---
408
-
409
- # 🧭 Contact & Support
410
-
411
- * Questions or issues?
412
- Open an issue on the **BioFoundation GitHub repository**.
 
103
 
104
  <div align="center">
105
  <img src="https://raw.githubusercontent.com/MatteoFasulo/BioFoundation/refs/heads/TinyMyo/docs/model/logo/TinyMyo_logo.png" alt="TinyMyo Logo" width="400" />
106
+ <h1>TinyMyo: a Tiny Foundation Model for Flexible EMG Signal Processing at the Edge</h1>
107
+ </div>
108
+ <p align="center">
109
+ <a href="https://github.com/pulp-bio/BioFoundation"><img src ="https://img.shields.io/github/stars/pulp-bio/BioFoundation?color=ccf" alt="Github"></a>
110
+ <a href="https://creativecommons.org/licenses/by-nd/4.0/"><img src="https://img.shields.io/badge/License-CC_BY--ND_4.0-lightgrey.svg" alt="License"></a>
111
+ <a href="https://arxiv.org/abs/2512.15729"><img src="https://img.shields.io/badge/arXiv-2512.15729-b31b1b.svg" alt="Paper"></a>
112
  </p>
113
 
114
+ **TinyMyo** is a **3.6M-parameter** Transformer foundation model for surface EMG (sEMG), optimized for ultra-low-power edge deployment (GAP9 MCU). It demonstrates state-of-the-art performance across gesture classification, kinematic regression, and speech synthesis.
 
115
 
116
  ---
117
 
118
+ ## 🚀 Quick Start
119
 
120
+ TinyMyo is built as a specialized model within the [BioFoundation](https://github.com/pulp-bio/BioFoundation) framework.
121
 
122
+ ### 1. Requirements
123
+ - **Preprocessing:** Dependencies for data scripts are in `scripts/requirements.txt`.
124
+ - **BioFoundation:** Full framework requirements for training/inference are in the [GitHub repository](https://github.com/pulp-bio/BioFoundation/blob/main/requirements.txt).
 
 
 
 
125
 
126
+ ### 2. Preprocessing
127
+ Process raw datasets into HDF5 format:
128
+ ```bash
129
+ python scripts/db5.py --data_dir $DATA_PATH/raw/ --save_dir $DATA_PATH/h5/ --seq_len 200 --stride 50
130
+ ```
131
+ *See [scripts/README.md](scripts/README.md) for all dataset commands.*
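The scripts write each split as an HDF5 group with `data` and `label` datasets (see, e.g., the `{"data": [], "label": []}` structure in `scripts/db6.py`); the exact layout may vary per dataset script. A round-trip sketch with illustrative shapes:

```python
import os
import tempfile

import h5py
import numpy as np

# Write a toy split the way the preparation scripts do, then load it back.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(path, "w") as f:
    g = f.create_group("train")
    g.create_dataset("data", data=np.zeros((4, 200, 16), dtype=np.float32))
    g.create_dataset("label", data=np.zeros(4, dtype=np.int64))

with h5py.File(path, "r") as f:
    X = f["train/data"][:]   # (N, seq_len, n_ch)
    y = f["train/label"][:]  # (N,)
```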
132
 
133
+ ### 3. Fine-tuning
134
+ ```bash
135
+ python run_train.py +experiment=TinyMyo_finetune pretrained_safetensors_path=/path/to/base.safetensors
136
+ ```
137
 
138
  ---
139
 
140
+ ## 🧠 Architecture & Pretraining
141
+ - **Core:** 8-layer Transformer encoder (192-dim embeddings, 3 heads).
142
+ - **Tokenization:** Channel-independent patching (20 samples/patch) with RoPE.
143
+ - **Data:** Pretrained on >480 GB of EMG (NinaPro DB6/7, EMG2Pose).
144
+ - **Specs:** 3.6M parameters, 4.0 GFLOPs.
145
 
146
+ ## 🎯 Benchmarks
 
147
 
148
+ | Task | Dataset | Metric | TinyMyo |
149
+ | :--- | :--- | :--- | :--- |
150
+ | **Gesture** | NinaPro DB5 | Accuracy | **89.41%** |
151
+ | **Gesture** | EPN-612 | Accuracy | **96.74%** |
152
+ | **Gesture** | UCI EMG | Accuracy | **97.56%** |
153
+ | **Regression**| NinaPro DB8 | MAE | **8.77Β°** |
154
+ | **Speech** | Gaddy (Speech Synthesis) | WER | **33.54%** |
155
+ | **Speech** | Gaddy (Speech Recognition) | WER | **33.95%** |
 
156
 
157
  ---
158
 
159
+ ## ⚡ Edge Performance (GAP9 MCU)
160
+ - **Inference:** 0.785 s
161
+ - **Energy:** 44.91 mJ
162
+ - **Power:** 57.18 mW
163
 
164
  ---
165
 
166
+ ## 📜 License & Citation
167
+ Weights are licensed under **CC BY-ND 4.0**. See [LICENSE](LICENSE) for details.
 
168
 
169
  ```bibtex
170
+ @misc{fasulo2026tinymyotinyfoundationmodel,
171
+ title={TinyMyo: a Tiny Foundation Model for Flexible EMG Signal Processing at the Edge},
172
  author={Matteo Fasulo and Giusy Spacone and Thorir Mar Ingolfsson and Yawei Li and Luca Benini and Andrea Cossettini},
173
+ year={2026},
174
  eprint={2512.15729},
175
  archivePrefix={arXiv},
176
  primaryClass={eess.SP},
177
+ url={https://arxiv.org/abs/2512.15729},
178
  }
179
+ ```
 
scripts/README.md CHANGED
@@ -1,137 +1,38 @@
1
- # Dataset Preparation Commands
2
 
3
- ## Overview
4
 
5
- This document provides the commands to prepare various EMG datasets for pretraining and downstream tasks. Each dataset preparation script takes in raw data, processes it into overlapping windows, and saves the processed data in HDF5 format for efficient loading during model training.
 
 
 
 
 
 
6
 
7
- Remember to add the flag `--download_data` if the dataset is not downloaded yet.
8
-
9
- Substitute the `$DATA_PATH` environment variable with your path for saving the dataset.
10
-
11
- The `seq_len` parameter in the scripts corresponds to the window size in samples, and the `stride` parameter corresponds to the step size between windows in samples. The sampling rate for the pretraining datasets is 2 kHz, while for the downstream datasets it is either 200 Hz or 2 kHz depending on the dataset.
12
-
13
- The required libraries for running the scripts are located inside the `requirements.txt` file.
14
 
15
  ## Pretraining Datasets
 
16
 
17
- For the pretraining datasets, we use a window size of 0.5 seconds with a 50% overlap at 2 kHz sampling rate:
18
-
19
- ### emg2pose (0.5 sec, 50% overlap)
20
-
21
- Note: due to the large size of emg2pose dataset, the `--download_data` flag is not available for this dataset.
22
-
23
- ```bash
24
- python scripts/emg2pose.py \
25
- --data_dir $DATA_PATH/datasets/emg2pose_data/ \
26
- --save_dir $DATA_PATH/datasets/emg2pose_data/h5/ \
27
- --seq_len 1000 \
28
- --stride 500
29
- ```
30
-
31
- ### Ninapro DB6 (0.5 sec, 50% overlap)
32
-
33
- ```bash
34
- python scripts/db6.py \
35
- --data_dir $DATA_PATH/datasets/ninapro/DB6/ \
36
- --save_dir $DATA_PATH/datasets/ninapro/DB6/h5/ \
37
- --seq_len 1000 \
38
- --stride 500
39
- ```
40
-
41
- ### Ninapro DB7 (0.5 sec, 50% overlap)
42
-
43
- ```bash
44
- python scripts/db7.py \
45
- --data_dir $DATA_PATH/datasets/ninapro/DB7/ \
46
- --save_dir $DATA_PATH/datasets/ninapro/DB7/h5/ \
47
- --seq_len 1000 \
48
- --stride 500
49
- ```
50
 
51
  ---
52
 
53
  ## Downstream Datasets
54
 
55
- For the downstream tasks, gesture classification is performed on NinaPro DB5, EMG-EPN612, and UCI EMG datasets (200 Hz) while regression is performed on NinaPro DB8 (2 kHz).
56
-
57
- ### Ninapro DB5 (1 sec, 25% overlap)
58
-
59
- ```bash
60
- python scripts/db5.py \
61
- --data_dir $DATA_PATH/datasets/ninapro/DB5/ \
62
- --save_dir $DATA_PATH/datasets/ninapro/DB5/h5/ \
63
- --seq_len 200 \
64
- --stride 50
65
- ```
66
-
67
- ### Ninapro DB5 (5 sec, 25% overlap)
68
-
69
- ```bash
70
- python scripts/db5.py \
71
- --data_dir $DATA_PATH/datasets/ninapro/DB5/ \
72
- --save_dir $DATA_PATH/datasets/ninapro/DB5/h5/ \
73
- --seq_len 1000 \
74
- --stride 250
75
- ```
76
-
77
- ### EMG-EPN612 (1 sec, no overlap)
78
-
79
- ```bash
80
- python scripts/epn.py \
81
- --data_dir $DATA_PATH/datasets/EPN612/ \
82
- --source_training $DATA_PATH/datasets/EPN612/trainingJSON/ \
83
- --source_testing $DATA_PATH/datasets/EPN612/testingJSON/ \
84
- --dest_dir $DATA_PATH/datasets/EPN612/h5/ \
85
- --seq_len 200
86
- ```
87
-
88
- ### EMG-EPN612 (5 sec, no overlap)
89
-
90
- ```bash
91
- python scripts/epn.py \
92
- --data_dir $DATA_PATH/datasets/EPN612/ \
93
- --source_training $DATA_PATH/datasets/EPN612/trainingJSON/ \
94
- --source_testing $DATA_PATH/datasets/EPN612/testingJSON/ \
95
- --dest_dir $DATA_PATH/datasets/EPN612/h5/ \
96
- --seq_len 1000
97
- ```
98
-
99
- ### UCI EMG (1 sec, 25% overlap)
100
-
101
- ```bash
102
- python scripts/uci.py \
103
- --data_dir $DATA_PATH/datasets/UCI_EMG/EMG_data_for_gestures-master/ \
104
- --save_dir $DATA_PATH/datasets/UCI_EMG/EMG_data_for_gestures-master/h5/ \
105
- --seq_len 200 \
106
- --stride 50
107
- ```
108
-
109
- ### UCI EMG (5 sec, 25% overlap)
110
-
111
- ```bash
112
- python scripts/uci.py \
113
- --data_dir $DATA_PATH/datasets/UCI_EMG/EMG_data_for_gestures-master/ \
114
- --save_dir $DATA_PATH/datasets/UCI_EMG/EMG_data_for_gestures-master/h5/ \
115
- --seq_len 1000 \
116
- --stride 250
117
- ```
118
-
119
- ### Ninapro DB8 (100 ms, no overlap)
120
-
121
- ```bash
122
- python scripts/db8.py \
123
- --data_dir $DATA_PATH/datasets/ninapro/DB8/ \
124
- --save_dir $DATA_PATH/datasets/ninapro/DB8/h5/ \
125
- --seq_len 200 \
126
- --stride 200
127
- ```
128
-
129
- ### Ninapro DB8 (500 ms, no overlap)
130
 
131
- ```bash
132
- python scripts/db8.py \
133
- --data_dir $DATA_PATH/datasets/ninapro/DB8/ \
134
- --save_dir $DATA_PATH/datasets/ninapro/DB8/h5/ \
135
- --seq_len 1000 \
136
- --stride 1000
137
- ```
 
1
+ # Dataset Preparation
2
 
3
+ This guide provides commands to process raw EMG data into HDF5 format using sliding windows.
4
 
5
+ ### Usage
6
+ - **Dependencies:** Install requirements specific to these scripts via `pip install -r scripts/requirements.txt`. Framework requirements for TinyMyo are in the [BioFoundation repository](https://github.com/pulp-bio/BioFoundation).
7
+ - Use `--download_data` if raw data is missing.
8
+ - Replace `$DATA_PATH` with your local storage path.
9
+ - `seq_len`: Window size (samples).
10
+ - `stride`: Step size (samples).
11
+ - Pretraining scripts use **2 kHz** sampling. Downstream scripts use **200 Hz** or **2 kHz**.
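Given `seq_len` and `stride`, the number of windows per recording follows directly. A small helper (illustrative only; boundary handling in the individual scripts may differ slightly):

```python
def n_windows(T: int, seq_len: int, stride: int) -> int:
    """Count of full sliding windows in a recording of T samples."""
    return 0 if T < seq_len else (T - seq_len) // stride + 1

# 0.5 s windows with 50% overlap at 2 kHz over a 10 s recording:
n_windows(20_000, seq_len=1000, stride=500)  # -> 39
```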
12
 
13
+ ---
14
 
15
  ## Pretraining Datasets
16
+ (0.5s windows, 50% overlap @ 2 kHz)
17
 
18
+ | Dataset | Size (GB) | Seq Len | Stride | Command |
19
+ | :--- | :--- | :--- | :--- | :--- |
20
+ | **EMG2Pose** | 431 | 1000 (0.5s) | 500 | `python scripts/emg2pose.py --data_dir $DATA_PATH/emg2pose_data/ --save_dir $DATA_PATH/emg2pose_data/h5/ --seq_len 1000 --stride 500` |
21
+ | **NinaPro DB6** | ~20 | 1000 (0.5s) | 500 | `python scripts/db6.py --data_dir $DATA_PATH/ninapro/DB6/ --save_dir $DATA_PATH/ninapro/DB6/h5/ --seq_len 1000 --stride 500` |
22
+ | **NinaPro DB7** | ~31 | 1000 (0.5s) | 500 | `python scripts/db7.py --data_dir $DATA_PATH/ninapro/DB7/ --save_dir $DATA_PATH/ninapro/DB7/h5/ --seq_len 1000 --stride 500` |
 
23
 
24
  ---
25
 
26
  ## Downstream Datasets
27
 
28
+ | Dataset | Task | Seq Len | Stride | Command |
29
+ | :--- | :--- | :--- | :--- | :--- |
30
+ | **NinaPro DB5** | Gesture | 200 (1s) | 50 | `python scripts/db5.py --data_dir $DATA_PATH/ninapro/DB5/ --save_dir $DATA_PATH/ninapro/DB5/h5/ --seq_len 200 --stride 50` |
31
+ | **NinaPro DB5** | Gesture | 1000 (5s) | 250 | `python scripts/db5.py --data_dir $DATA_PATH/ninapro/DB5/ --save_dir $DATA_PATH/ninapro/DB5/h5/ --seq_len 1000 --stride 250` |
32
+ | **EMG-EPN612** | Gesture | 200 (1s) | N/A | `python scripts/epn.py --data_dir $DATA_PATH/EPN612/ --source_training $DATA_PATH/EPN612/trainingJSON/ --source_testing $DATA_PATH/EPN612/testingJSON/ --dest_dir $DATA_PATH/EPN612/h5/ --seq_len 200` |
33
+ | **EMG-EPN612** | Gesture | 1000 (5s) | N/A | `python scripts/epn.py --data_dir $DATA_PATH/EPN612/ --source_training $DATA_PATH/EPN612/trainingJSON/ --source_testing $DATA_PATH/EPN612/testingJSON/ --dest_dir $DATA_PATH/EPN612/h5/ --seq_len 1000` |
34
+ | **UCI EMG** | Gesture | 200 (1s) | 50 | `python scripts/uci.py --data_dir $DATA_PATH/UCI_EMG/EMG_data_for_gestures-master/ --save_dir $DATA_PATH/UCI_EMG/EMG_data_for_gestures-master/h5/ --seq_len 200 --stride 50` |
35
+ | **UCI EMG** | Gesture | 1000 (5s) | 250 | `python scripts/uci.py --data_dir $DATA_PATH/UCI_EMG/EMG_data_for_gestures-master/ --save_dir $DATA_PATH/UCI_EMG/EMG_data_for_gestures-master/h5/ --seq_len 1000 --stride 250` |
36
+ | **NinaPro DB8** | Regression | 200 (0.1s) | 200 | `python scripts/db8.py --data_dir $DATA_PATH/ninapro/DB8/ --save_dir $DATA_PATH/ninapro/DB8/h5/ --seq_len 200 --stride 200` |
37
+ | **NinaPro DB8** | Regression | 1000 (0.5s) | 1000 | `python scripts/db8.py --data_dir $DATA_PATH/ninapro/DB8/ --save_dir $DATA_PATH/ninapro/DB8/h5/ --seq_len 1000 --stride 1000` |
38
 
scripts/db5.py CHANGED
@@ -1,5 +1,6 @@
1
  import os
2
  import sys
 
3
 
4
  import h5py
5
  import numpy as np
@@ -7,30 +8,77 @@ import scipy.io
7
  import scipy.signal as signal
8
  from scipy.signal import iirnotch
9
 
10
- sequence_to_seconds = lambda seq_len, fs: seq_len / fs
 
11
 
 
 
 
12
 
13
- # ==== Data augmentation functions ====
14
- def random_amplitude_scale(sig, scale_range=(0.9, 1.1)):
 
15
  scale = np.random.uniform(*scale_range)
16
  return sig * scale
17
 
18
 
19
- def random_time_jitter(sig, jitter_ratio=0.01):
20
  T, D = sig.shape
21
  std_ch = np.std(sig, axis=0)
22
  noise = np.random.randn(T, D) * (jitter_ratio * std_ch)
23
  return sig + noise
24
 
25
 
26
- def random_channel_dropout(sig, dropout_prob=0.05):
27
  T, D = sig.shape
28
  mask = np.random.rand(D) < dropout_prob
29
  sig[:, mask] = 0.0
30
  return sig
31
 
32
 
33
- def augment_one_sample(seg):
34
  out = seg.copy()
35
  out = random_amplitude_scale(out, (0.9, 1.1))
36
  out = random_time_jitter(out, 0.01)
@@ -38,7 +86,20 @@ def augment_one_sample(seg):
38
  return out
39
 
40
 
41
- def augment_train_data(data, labels, factor=3):
42
  if factor <= 0 or data.shape[0] == 0:
43
  return data, labels
44
  aug_segs = [data]
@@ -55,8 +116,19 @@ def augment_train_data(data, labels, factor=3):
55
  return new_data, new_labels
56
 
57
 
58
- # ==== Filter functions (operate at original fs=200) ====
59
- def notch_filter(data, notch_freq=50.0, Q=30.0, fs=200.0):
60
  b, a = iirnotch(notch_freq, Q, fs)
61
  out = np.zeros_like(data)
62
  for ch in range(data.shape[1]):
@@ -64,7 +136,25 @@ def notch_filter(data, notch_freq=50.0, Q=30.0, fs=200.0):
64
  return out
65
 
66
 
67
- def bandpass_filter_emg(emg, lowcut=20.0, highcut=90.0, fs=200.0, order=4):
68
  nyq = 0.5 * fs
69
  low = lowcut / nyq
70
  high = highcut / nyq
@@ -75,8 +165,28 @@ def bandpass_filter_emg(emg, lowcut=20.0, highcut=90.0, fs=200.0, order=4):
75
  return out
76
 
77
 
78
- # ==== Window segmentation ====
79
- def process_emg_features(emg, label, rerep, window_size=1024, stride=512):
80
  segs, lbls, reps = [], [], []
81
  N = len(label)
82
  for start in range(0, N, stride):
@@ -94,7 +204,6 @@ def process_emg_features(emg, label, rerep, window_size=1024, stride=512):
94
  return np.array(segs), np.array(lbls), np.array(reps)
95
 
96
 
97
- # ==== Main pipeline ====
98
  def main():
99
  import argparse
100
 
 
1
  import os
2
  import sys
3
+ from typing import Tuple, List, Optional, Union, Dict, Any, Callable
4
 
5
  import h5py
6
  import numpy as np
 
8
  import scipy.signal as signal
9
  from scipy.signal import iirnotch
10
 
11
+ def sequence_to_seconds(seq_len: int, fs: float) -> float:
12
+ """Converts a sequence length in samples to time in seconds.
13
 
14
+ Args:
15
+ seq_len (int): The number of samples in the sequence.
16
+ fs (float): The sampling frequency in Hz.
17
 
18
+ Returns:
19
+ float: The duration of the sequence in seconds.
20
+ """
21
+ return seq_len / fs
22
+
23
+
24
+ def random_amplitude_scale(sig: np.ndarray, scale_range: Tuple[float, float] = (0.9, 1.1)) -> np.ndarray:
25
+ """Applies random amplitude scaling to the input signal.
26
+
27
+ Args:
28
+ sig (np.ndarray): The input signal array of shape (T, D).
29
+ scale_range (Tuple[float, float], optional): The range [min, max] for the scaling factor.
30
+ Defaults to (0.9, 1.1).
31
+
32
+ Returns:
33
+ np.ndarray: The scaled signal array.
34
+ """
35
  scale = np.random.uniform(*scale_range)
36
  return sig * scale
37
 
38
 
39
+ def random_time_jitter(sig: np.ndarray, jitter_ratio: float = 0.01) -> np.ndarray:
40
+ """Adds random Gaussian noise (jitter) to the input signal.
41
+
42
+ Args:
43
+ sig (np.ndarray): The input signal array of shape (T, D).
44
+ jitter_ratio (float, optional): The ratio to scale the noise relative to
45
+ each channel's standard deviation. Defaults to 0.01.
46
+
47
+ Returns:
48
+ np.ndarray: The signal with added jitter.
49
+ """
50
  T, D = sig.shape
51
  std_ch = np.std(sig, axis=0)
52
  noise = np.random.randn(T, D) * (jitter_ratio * std_ch)
53
  return sig + noise
54
 
55
 
56
+ def random_channel_dropout(sig: np.ndarray, dropout_prob: float = 0.05) -> np.ndarray:
57
+ """Randomly zeros out channels in the signal based on a probability.
58
+
59
+ Args:
60
+ sig (np.ndarray): The input signal array of shape (T, D).
61
+ dropout_prob (float, optional): Probability of dropping each channel.
62
+ Defaults to 0.05.
63
+
64
+ Returns:
65
+ np.ndarray: The signal with dropped channels.
66
+ """
67
  T, D = sig.shape
68
  mask = np.random.rand(D) < dropout_prob
69
  sig[:, mask] = 0.0
70
  return sig
71
 
72
 
73
+ def augment_one_sample(seg: np.ndarray) -> np.ndarray:
74
+ """Applies a sequence of random augmentations to a single signal segment.
75
+
76
+ Args:
77
+ seg (np.ndarray): Single signal segment of shape (window_size, n_ch).
78
+
79
+ Returns:
80
+ np.ndarray: The augmented signal segment.
81
+ """
82
  out = seg.copy()
83
  out = random_amplitude_scale(out, (0.9, 1.1))
84
  out = random_time_jitter(out, 0.01)
 
86
  return out
87
 
88
 
89
+ def augment_train_data(data: np.ndarray, labels: np.ndarray, factor: int = 3) -> Tuple[np.ndarray, np.ndarray]:
90
+ """Augments the training dataset by creating multiple versions of each sample.
91
+
92
+ Args:
93
+ data (np.ndarray): The input dataset of shape (N, window_size, n_ch).
94
+ labels (np.ndarray): The corresponding labels of shape (N,).
95
+ factor (int, optional): The number of augmented versions to create for each sample.
96
+ Defaults to 3.
97
+
98
+ Returns:
99
+ Tuple[np.ndarray, np.ndarray]: A tuple containing:
100
+ - The augmented dataset.
101
+ - The augmented labels.
102
+ """
103
  if factor <= 0 or data.shape[0] == 0:
104
  return data, labels
105
  aug_segs = [data]
 
116
  return new_data, new_labels
117
 
118
 
119
+ def notch_filter(data: np.ndarray, notch_freq: float = 50.0, Q: float = 30.0, fs: float = 200.0) -> np.ndarray:
120
+ """Applies a notch filter to remove power line interference.
121
+
122
+ Args:
123
+ data (np.ndarray): The input signal array of shape (T, D).
124
+ notch_freq (float, optional): The frequency to be removed (e.g., 50 Hz or 60 Hz).
125
+ Defaults to 50.0.
126
+ Q (float, optional): The quality factor. Defaults to 30.0.
127
+ fs (float, optional): The sampling frequency of the signal. Defaults to 200.0.
128
+
129
+ Returns:
130
+ np.ndarray: The filtered signal array.
131
+ """
132
  b, a = iirnotch(notch_freq, Q, fs)
133
  out = np.zeros_like(data)
134
  for ch in range(data.shape[1]):
 
136
  return out
137
 
138
 
139
+ def bandpass_filter_emg(
140
+ emg: np.ndarray,
141
+ lowcut: float = 20.0,
142
+ highcut: float = 90.0,
143
+ fs: float = 200.0,
144
+ order: int = 4
145
+ ) -> np.ndarray:
146
+ """Applies a Butterworth bandpass filter to the EMG signal.
147
+
148
+ Args:
149
+ emg (np.ndarray): The input signal array of shape (T, D).
150
+ lowcut (float, optional): Lower bound of the passband in Hz. Defaults to 20.0.
151
+ highcut (float, optional): Upper bound of the passband in Hz. Defaults to 90.0.
152
+ fs (float, optional): The sampling frequency of the signal. Defaults to 200.0.
153
+ order (int, optional): The order of the filter. Defaults to 4.
154
+
155
+ Returns:
156
+ np.ndarray: The bandpass filtered signal array.
157
+ """
158
  nyq = 0.5 * fs
159
  low = lowcut / nyq
160
  high = highcut / nyq
 
165
  return out
166
 
167
 
168
+ def process_emg_features(
169
+ emg: np.ndarray,
170
+ label: np.ndarray,
171
+ rerep: np.ndarray,
172
+ window_size: int = 1024,
173
+ stride: int = 512
174
+ ) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
175
+ """Segments raw EMG signals into overlapping windows.
176
+
177
+ Args:
178
+ emg (np.ndarray): Raw EMG data of shape (T, n_ch).
179
+ label (np.ndarray): Gesture labels of shape (T,).
180
+ rerep (np.ndarray): Repetition indices of shape (T,).
181
+ window_size (int, optional): Number of samples per window. Defaults to 1024.
182
+ stride (int, optional): Number of samples to shift between windows. Defaults to 512.
183
+
184
+ Returns:
185
+ Tuple[np.ndarray, np.ndarray, np.ndarray]: A tuple containing:
186
+ - windowed segments (N, window_size, n_ch).
187
+ - labels for each window (N,).
188
+ - repetition indices for each window (N,).
189
+ """
190
  segs, lbls, reps = [], [], []
191
  N = len(label)
192
  for start in range(0, N, stride):
 
204
  return np.array(segs), np.array(lbls), np.array(reps)
205
 
206
 
 
207
  def main():
208
  import argparse
209
 
scripts/db6.py CHANGED
@@ -5,12 +5,13 @@ import h5py
5
  import numpy as np
6
  import scipy.io
7
  import scipy.signal as signal
 
8
  from scipy.signal import iirnotch
 
9
 
10
  sequence_to_seconds = lambda seq_len, fs: seq_len / fs
11
 
12
 
13
- # ─────────────── Filtering ──────────────────
14
  def notch_filter(data, notch_freq=50.0, Q=30.0, fs=2000.0):
15
  """Notch-filter every channel independently."""
16
  b, a = iirnotch(notch_freq, Q, fs)
@@ -29,7 +30,6 @@ def bandpass_filter_emg(emg, lowcut=20.0, highcut=90.0, fs=2000.0, order=4):
29
  return out
30
 
31
 
32
- # ─────────────── Sliding window ──────────────
33
  def sliding_window_segment(emg, label, rerepetition, window_size, stride):
34
  """
35
  Segment EMG with a sliding window.
@@ -49,7 +49,64 @@ def sliding_window_segment(emg, label, rerepetition, window_size, stride):
49
  return np.array(segments), np.array(labels), np.array(reps)
50
 
51
 
52
- # ─────────────── Main pipeline ───────────────
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
53
  def main():
54
  import argparse
55
 
@@ -65,6 +122,18 @@ def main():
65
  type=int,
66
  help="Step size between windows in samples for segmentation.",
67
  )
 
 
 
 
 
 
 
 
 
 
 
 
68
  args = args.parse_args()
69
  data_dir = args.data_dir # input folder with .mat files
70
  save_dir = args.save_dir # output folder for .h5 files
@@ -105,62 +174,30 @@ def main():
105
  "test": {"data": [], "label": []},
106
  }
107
 
108
- # iterate subjects
109
- for subj in sorted(os.listdir(data_dir)):
110
- subj_path = os.path.join(data_dir, subj)
111
- if not os.path.isdir(subj_path):
112
- continue
113
- print(f"Processing subject {subj} ...")
114
-
115
- subj_seg, subj_lbl, subj_rep = [], [], []
 
 
 
 
 
 
 
 
 
 
116
 
117
- # iterate .mat files
118
- for mat_file in sorted(os.listdir(subj_path)):
119
- if not mat_file.endswith(".mat"):
 
120
  continue
121
- mat_path = os.path.join(subj_path, mat_file)
122
- mat = scipy.io.loadmat(mat_path)
123
-
124
- emg = mat["emg"] # (N, 16)
125
- label = mat["restimulus"].ravel()
126
- rerep = mat["rerepetition"].ravel()
127
-
128
- # drop empty channels (index 8, 9 β†’ 0-based)
129
- emg = np.delete(emg, [8, 9], axis=1) # now (N, 14)
130
-
131
- # filtering
132
- emg = bandpass_filter_emg(emg, 20, 450, fs=fs)
133
- emg = notch_filter(emg, 50, 30, fs=fs)
134
-
135
- # z-score per channel
136
- mu = emg.mean(axis=0)
137
- sd = emg.std(axis=0, ddof=1)
138
- sd[sd == 0] = 1.0
139
- emg = (emg - mu) / sd
140
-
141
- # windowing
142
- seg, lbl, rep = sliding_window_segment(
143
- emg, label, rerep, window_size, stride
144
- )
145
- subj_seg.append(seg)
146
- subj_lbl.append(lbl)
147
- subj_rep.append(rep)
148
-
149
- if not subj_seg:
150
- continue
151
-
152
- seg = np.concatenate(subj_seg, axis=0) # (M, win, 14)
153
- lbl = np.concatenate(subj_lbl)
154
- rep = np.concatenate(subj_rep)
155
-
156
- # split by repetition id
157
- for split_name, mask in (
158
- ("train", np.isin(rep, train_reps)),
159
- ("val", np.isin(rep, val_reps)),
160
- ("test", np.isin(rep, test_reps)),
161
- ):
162
- X = seg[mask].transpose(0, 2, 1) # (N, 14, 1024)
163
- y = lbl[mask]
164
  splits[split_name]["data"].append(X)
165
  splits[split_name]["label"].append(y)
166
 
@@ -177,9 +214,20 @@ def main():
177
  else np.empty((0,), dtype=int)
178
  )
179
 
180
- with h5py.File(os.path.join(save_dir, f"{split}.h5"), "w") as f:
181
- f.create_dataset("data", data=X.astype(np.float32))
182
- f.create_dataset("label", data=y.astype(np.int64))
 
 
 
 
 
 
 
 
 
 
 
183
 
184
  uniq, cnt = np.unique(y, return_counts=True)
185
  print(f"\n{split.upper()} β†’ X={X.shape}, label distribution:")
 
  import numpy as np
  import scipy.io
  import scipy.signal as signal
+ from joblib import Parallel, delayed
  from scipy.signal import iirnotch
+ from tqdm import tqdm

  sequence_to_seconds = lambda seq_len, fs: seq_len / fs

  def notch_filter(data, notch_freq=50.0, Q=30.0, fs=2000.0):
  """Notch-filter every channel independently."""
  b, a = iirnotch(notch_freq, Q, fs)
  return out

  def sliding_window_segment(emg, label, rerepetition, window_size, stride):
  """
  Segment EMG with a sliding window.
  return np.array(segments), np.array(labels), np.array(reps)

+ def process_subject(
+ subj_path,
+ window_size,
+ stride,
+ fs,
+ train_reps,
+ val_reps,
+ test_reps,
+ ):
+ subj_seg, subj_lbl, subj_rep = [], [], []
+
+ for mat_file in sorted(os.listdir(subj_path)):
+ if not mat_file.endswith(".mat"):
+ continue
+ mat_path = os.path.join(subj_path, mat_file)
+ mat = scipy.io.loadmat(mat_path)
+
+ emg = mat["emg"] # (N, 16)
+ label = mat["restimulus"].ravel()
+ rerep = mat["rerepetition"].ravel()
+
+ emg = np.delete(emg, [8, 9], axis=1) # now (N, 14)
+ emg = bandpass_filter_emg(emg, 20, 450, fs=fs)
+ emg = notch_filter(emg, 50, 30, fs=fs)
+
+ mu = emg.mean(axis=0)
+ sd = emg.std(axis=0, ddof=1)
+ sd[sd == 0] = 1.0
+ emg = (emg - mu) / sd
+
+ seg, lbl, rep = sliding_window_segment(emg, label, rerep, window_size, stride)
+ subj_seg.append(seg)
+ subj_lbl.append(lbl)
+ subj_rep.append(rep)
+
+ if not subj_seg:
+ return {
+ "train": (np.empty((0, 14, window_size), dtype=np.float32), np.empty((0,), dtype=np.int64)),
+ "val": (np.empty((0, 14, window_size), dtype=np.float32), np.empty((0,), dtype=np.int64)),
+ "test": (np.empty((0, 14, window_size), dtype=np.float32), np.empty((0,), dtype=np.int64)),
+ }
+
+ seg = np.concatenate(subj_seg, axis=0)
+ lbl = np.concatenate(subj_lbl)
+ rep = np.concatenate(subj_rep)
+
+ out = {}
+ for split_name, mask in (
+ ("train", np.isin(rep, train_reps)),
+ ("val", np.isin(rep, val_reps)),
+ ("test", np.isin(rep, test_reps)),
+ ):
+ X = seg[mask].transpose(0, 2, 1).astype(np.float32)
+ y = lbl[mask].astype(np.int64)
+ out[split_name] = (X, y)
+ return out

  def main():
  import argparse
  type=int,
  help="Step size between windows in samples for segmentation.",
  )
+ args.add_argument(
+ "--group_size",
+ type=int,
+ default=1000,
+ help="Number of samples per group in the output HDF5 file.",
+ )
+ args.add_argument(
+ "--n_jobs",
+ type=int,
+ default=-1,
+ help="Number of subjects to process in parallel. -1 means all cores.",
+ )
  args = args.parse_args()
  data_dir = args.data_dir # input folder with .mat files
  save_dir = args.save_dir # output folder for .h5 files
  "test": {"data": [], "label": []},
  }
+ subject_paths = [
+ os.path.join(data_dir, subj)
+ for subj in sorted(os.listdir(data_dir))
+ if os.path.isdir(os.path.join(data_dir, subj))
+ ]
+
+ subject_results = Parallel(n_jobs=args.n_jobs)(
+ delayed(process_subject)(
+ subj_path,
+ window_size,
+ stride,
+ fs,
+ train_reps,
+ val_reps,
+ test_reps,
+ )
+ for subj_path in tqdm(subject_paths, desc="Processing subjects")
+ )

+ for result in subject_results:
+ for split_name in ["train", "val", "test"]:
+ X, y = result[split_name]
+ if X.shape[0] == 0:
  continue
  splits[split_name]["data"].append(X)
  splits[split_name]["label"].append(y)
  else np.empty((0,), dtype=int)
  )
+ out_path = os.path.join(save_dir, f"{split}.h5")
+ if os.path.exists(out_path):
+ os.remove(out_path)
+ print(f"Removed existing file {out_path} to avoid overwrite issues.")
+
+ with h5py.File(out_path, "w") as h5f:
+ for group_idx, start in enumerate(range(0, X.shape[0], args.group_size)):
+ end = min(start + args.group_size, X.shape[0])
+ x_chunk = X[start:end].astype(np.float32)
+ y_chunk = y[start:end].astype(np.int64)
+
+ grp = h5f.create_group(f"data_group_{group_idx}")
+ grp.create_dataset("X", data=x_chunk)
+ grp.create_dataset("y", data=y_chunk)

  uniq, cnt = np.unique(y, return_counts=True)
  print(f"\n{split.upper()} β†’ X={X.shape}, label distribution:")
scripts/db7.py CHANGED
@@ -5,12 +5,13 @@ import h5py
  import numpy as np
  import scipy.io
  import scipy.signal as signal
  from scipy.signal import iirnotch

  sequence_to_seconds = lambda seq_len, fs: seq_len / fs

- # ─────────────── Filtering ──────────────────
  def notch_filter(data, notch_freq=50.0, Q=30.0, fs=2000.0):
  """Notch-filter every channel independently."""
  b, a = iirnotch(notch_freq, Q, fs)
@@ -29,7 +30,6 @@ def bandpass_filter_emg(emg, lowcut=20.0, highcut=90.0, fs=2000.0, order=4):
  return out

- # ─────────────── Sliding window ──────────────
  def sliding_window_segment(emg, label, rerepetition, window_size, stride):
  """
  Segment EMG with a sliding window.
@@ -49,7 +49,63 @@ def sliding_window_segment(emg, label, rerepetition, window_size, stride):
  return np.array(segments), np.array(labels), np.array(reps)

- # ─────────────── Main pipeline ───────────────
  def main():
  import argparse
@@ -65,6 +121,18 @@ def main():
  type=int,
  help="Step size between windows in samples for segmentation.",
  )
  args = args.parse_args()
  data_dir = args.data_dir # input folder with .mat files
  save_dir = args.save_dir # output folder for .h5 files
@@ -100,59 +168,30 @@ def main():
  "test": {"data": [], "label": []},
  }

- # iterate subjects
- for subj in sorted(os.listdir(data_dir)):
- subj_path = os.path.join(data_dir, subj)
- if not os.path.isdir(subj_path):
- continue
- print(f"Processing subject {subj} ...")
-
- subj_seg, subj_lbl, subj_rep = [], [], []

- # iterate .mat files
- for mat_file in sorted(os.listdir(subj_path)):
- if not mat_file.endswith(".mat"):
  continue
- mat_path = os.path.join(subj_path, mat_file)
- mat = scipy.io.loadmat(mat_path)
-
- emg = mat["emg"] # (N, 16)
- label = mat["restimulus"].ravel()
- rerep = mat["rerepetition"].ravel()
-
- # filtering
- emg = bandpass_filter_emg(emg, 20.0, 450.0, fs=fs)
- emg = notch_filter(emg, 50.0, 30.0, fs=fs)
-
- # z-score per channel
- mu = emg.mean(axis=0)
- sd = emg.std(axis=0, ddof=1)
- sd[sd == 0] = 1.0
- emg = (emg - mu) / sd
-
- # windowing
- seg, lbl, rep = sliding_window_segment(
- emg, label, rerep, window_size, stride
- )
- subj_seg.append(seg)
- subj_lbl.append(lbl)
- subj_rep.append(rep)
-
- if not subj_seg:
- continue
-
- seg = np.concatenate(subj_seg, axis=0) # (M, win, 14)
- lbl = np.concatenate(subj_lbl)
- rep = np.concatenate(subj_rep)
-
- # split by repetition id
- for split_name, mask in (
- ("train", np.isin(rep, train_reps)),
- ("val", np.isin(rep, val_reps)),
- ("test", np.isin(rep, test_reps)),
- ):
- X = seg[mask].transpose(0, 2, 1) # (N, 14, 1024)
- y = lbl[mask]
  splits[split_name]["data"].append(X)
  splits[split_name]["label"].append(y)
@@ -161,7 +200,7 @@ def main():
  X = (
  np.concatenate(splits[split]["data"], axis=0)
  if splits[split]["data"]
- else np.empty((0, 14, window_size))
  )
  y = (
  np.concatenate(splits[split]["label"], axis=0)
@@ -169,9 +208,20 @@ def main():
  else np.empty((0,), dtype=int)
  )

- with h5py.File(os.path.join(save_dir, f"{split}.h5"), "w") as f:
- f.create_dataset("data", data=X.astype(np.float32))
- f.create_dataset("label", data=y.astype(np.int64))

  uniq, cnt = np.unique(y, return_counts=True)
  print(f"\n{split.upper()} β†’ X={X.shape}, label distribution:")
  import numpy as np
  import scipy.io
  import scipy.signal as signal
+ from joblib import Parallel, delayed
  from scipy.signal import iirnotch
+ from tqdm import tqdm

  sequence_to_seconds = lambda seq_len, fs: seq_len / fs

  def notch_filter(data, notch_freq=50.0, Q=30.0, fs=2000.0):
  """Notch-filter every channel independently."""
  b, a = iirnotch(notch_freq, Q, fs)
  return out

  def sliding_window_segment(emg, label, rerepetition, window_size, stride):
  """
  Segment EMG with a sliding window.
  return np.array(segments), np.array(labels), np.array(reps)

+ def process_subject(
+ subj_path,
+ window_size,
+ stride,
+ fs,
+ train_reps,
+ val_reps,
+ test_reps,
+ ):
+ subj_seg, subj_lbl, subj_rep = [], [], []
+
+ for mat_file in sorted(os.listdir(subj_path)):
+ if not mat_file.endswith(".mat"):
+ continue
+ mat_path = os.path.join(subj_path, mat_file)
+ mat = scipy.io.loadmat(mat_path)
+
+ emg = mat["emg"] # (N, 16)
+ label = mat["restimulus"].ravel()
+ rerep = mat["rerepetition"].ravel()
+
+ emg = bandpass_filter_emg(emg, 20.0, 450.0, fs=fs)
+ emg = notch_filter(emg, 50.0, 30.0, fs=fs)
+
+ mu = emg.mean(axis=0)
+ sd = emg.std(axis=0, ddof=1)
+ sd[sd == 0] = 1.0
+ emg = (emg - mu) / sd
+
+ seg, lbl, rep = sliding_window_segment(emg, label, rerep, window_size, stride)
+ subj_seg.append(seg)
+ subj_lbl.append(lbl)
+ subj_rep.append(rep)
+
+ if not subj_seg:
+ return {
+ "train": (np.empty((0, 16, window_size), dtype=np.float32), np.empty((0,), dtype=np.int64)),
+ "val": (np.empty((0, 16, window_size), dtype=np.float32), np.empty((0,), dtype=np.int64)),
+ "test": (np.empty((0, 16, window_size), dtype=np.float32), np.empty((0,), dtype=np.int64)),
+ }
+
+ seg = np.concatenate(subj_seg, axis=0)
+ lbl = np.concatenate(subj_lbl)
+ rep = np.concatenate(subj_rep)
+
+ out = {}
+ for split_name, mask in (
+ ("train", np.isin(rep, train_reps)),
+ ("val", np.isin(rep, val_reps)),
+ ("test", np.isin(rep, test_reps)),
+ ):
+ X = seg[mask].transpose(0, 2, 1).astype(np.float32)
+ y = lbl[mask].astype(np.int64)
+ out[split_name] = (X, y)
+ return out

  def main():
  import argparse
  type=int,
  help="Step size between windows in samples for segmentation.",
  )
+ args.add_argument(
+ "--group_size",
+ type=int,
+ default=1000,
+ help="Number of samples per group in the output HDF5 file.",
+ )
+ args.add_argument(
+ "--n_jobs",
+ type=int,
+ default=-1,
+ help="Number of subjects to process in parallel. -1 means all cores.",
+ )
  args = args.parse_args()
  data_dir = args.data_dir # input folder with .mat files
  save_dir = args.save_dir # output folder for .h5 files
  "test": {"data": [], "label": []},
  }
+ subject_paths = [
+ os.path.join(data_dir, subj)
+ for subj in sorted(os.listdir(data_dir))
+ if os.path.isdir(os.path.join(data_dir, subj))
+ ]
+
+ subject_results = Parallel(n_jobs=args.n_jobs)(
+ delayed(process_subject)(
+ subj_path,
+ window_size,
+ stride,
+ fs,
+ train_reps,
+ val_reps,
+ test_reps,
+ )
+ for subj_path in tqdm(subject_paths, desc="Processing subjects")
+ )

+ for result in subject_results:
+ for split_name in ["train", "val", "test"]:
+ X, y = result[split_name]
+ if X.shape[0] == 0:
  continue
  splits[split_name]["data"].append(X)
  splits[split_name]["label"].append(y)
  X = (
  np.concatenate(splits[split]["data"], axis=0)
  if splits[split]["data"]
+ else np.empty((0, 16, window_size))
  )
  y = (
  np.concatenate(splits[split]["label"], axis=0)
  else np.empty((0,), dtype=int)
  )
+ out_path = os.path.join(save_dir, f"{split}.h5")
+ if os.path.exists(out_path):
+ os.remove(out_path)
+ print(f"Removed existing file {out_path} to avoid overwrite issues.")
+
+ with h5py.File(out_path, "w") as h5f:
+ for group_idx, start in enumerate(range(0, X.shape[0], args.group_size)):
+ end = min(start + args.group_size, X.shape[0])
+ x_chunk = X[start:end].astype(np.float32)
+ y_chunk = y[start:end].astype(np.int64)
+
+ grp = h5f.create_group(f"data_group_{group_idx}")
+ grp.create_dataset("X", data=x_chunk)
+ grp.create_dataset("y", data=y_chunk)

  uniq, cnt = np.unique(y, return_counts=True)
  print(f"\n{split.upper()} β†’ X={X.shape}, label distribution:")
scripts/db8.py CHANGED
@@ -1,5 +1,6 @@
  import os
  import sys

  import h5py
  import numpy as np
@@ -9,7 +10,18 @@ from joblib import Parallel, delayed
  from scipy.signal import iirnotch
  from tqdm import tqdm

- sequence_to_seconds = lambda seq_len, fs: seq_len / fs

  _MATRIX_DOF2DOA_TRANSPOSED = np.array(
  # https://www.frontiersin.org/articles/10.3389/fnins.2019.00891/full
@@ -42,9 +54,18 @@ _MATRIX_DOF2DOA_TRANSPOSED = np.array(
  MATRIX_DOF2DOA = _MATRIX_DOF2DOA_TRANSPOSED.T

- # ─────────────── Filtering ──────────────────
- def notch_filter(data, notch_freq=50.0, Q=30.0, fs=1111.0):
- """Notch-filter every channel independently."""
  b, a = iirnotch(notch_freq, Q, fs)
  out = np.zeros_like(data)
  for ch in range(data.shape[1]):
@@ -52,7 +73,25 @@ def notch_filter(data, notch_freq=50.0, Q=30.0, fs=1111.0):
  return out

- def bandpass_filter_emg(emg, lowcut=20.0, highcut=90.0, fs=2000.0, order=4):
  nyq = 0.5 * fs
  b, a = signal.butter(order, [lowcut / nyq, highcut / nyq], btype="bandpass")
  out = np.zeros_like(emg)
@@ -61,11 +100,24 @@ def bandpass_filter_emg(emg, lowcut=20.0, highcut=90.0, fs=2000.0, order=4):
  return out

- # ─────────────── Sliding window ──────────────
- def sliding_window_segment(emg, label, window_size, stride):
- """
- Segment EMG with a sliding window.
- Use the frame at the window centre as the segment label / repetition index.
  """
  segments, labels = [], []
  n_samples = len(label)
@@ -80,34 +132,49 @@ def sliding_window_segment(emg, label, window_size, stride):
  return np.array(segments), np.array(labels)

- # ─────────────── Main pipeline ───────────────
- def process_mat_file(mat_path, window_size, stride, fs):
- """
- Load one .mat file, filter out NaNs, filter & normalize EMG, map DoF→DoA,
- segment, and return (split, segs, labels).
  """
  mat = scipy.io.loadmat(mat_path)
  emg = mat["emg"] # (T, 16)
  label = mat["glove"] # (T, DoF)

- # 1) Drop timesteps with any NaNs in glove data
  valid = ~np.isnan(label).any(axis=1)
  emg = emg[valid]
  label = label[valid]

- # 3) Z-score per channel
  mu = emg.mean(axis=0)
  sd = emg.std(axis=0, ddof=1)
  sd[sd == 0] = 1.0
  emg = (emg - mu) / sd

- # 4) DoF β†’ DoA
  y_doa = (MATRIX_DOF2DOA @ label.T).T

- # 5) Windowing
  segs, labs = sliding_window_segment(emg, y_doa, window_size, stride)

- # 6) Determine split
  fname = os.path.basename(mat_path)
  if "_A1" in fname:
  split = "train"
  import os
  import sys
+ from typing import Tuple, List, Optional, Union, Dict, Any

  import h5py
  import numpy as np
  from scipy.signal import iirnotch
  from tqdm import tqdm

+ def sequence_to_seconds(seq_len: int, fs: float) -> float:
+ """Converts a sequence length in samples to time in seconds.
+
+ Args:
+ seq_len (int): The number of samples in the sequence.
+ fs (float): The sampling frequency in Hz.
+
+ Returns:
+ float: The duration of the sequence in seconds.
+ """
+ return seq_len / fs

  _MATRIX_DOF2DOA_TRANSPOSED = np.array(
  # https://www.frontiersin.org/articles/10.3389/fnins.2019.00891/full
  MATRIX_DOF2DOA = _MATRIX_DOF2DOA_TRANSPOSED.T

+ def notch_filter(data: np.ndarray, notch_freq: float = 50.0, Q: float = 30.0, fs: float = 1111.0) -> np.ndarray:
+ """Applies a notch filter to every channel of the input data independently.
+
+ Args:
+ data (np.ndarray): The input signal array of shape (T, D).
+ notch_freq (float, optional): The frequency to be removed. Defaults to 50.0.
+ Q (float, optional): The quality factor. Defaults to 30.0.
+ fs (float, optional): The sampling frequency in Hz. Defaults to 1111.0.
+
+ Returns:
+ np.ndarray: The filtered signal array.
+ """
  b, a = iirnotch(notch_freq, Q, fs)
  out = np.zeros_like(data)
  for ch in range(data.shape[1]):
  return out

+ def bandpass_filter_emg(
+ emg: np.ndarray,
+ lowcut: float = 20.0,
+ highcut: float = 90.0,
+ fs: float = 2000.0,
+ order: int = 4
+ ) -> np.ndarray:
+ """Applies a Butterworth bandpass filter to the EMG signal.
+
+ Args:
+ emg (np.ndarray): The input signal array of shape (T, D).
+ lowcut (float, optional): Lower bound of the passband in Hz. Defaults to 20.0.
+ highcut (float, optional): Upper bound of the passband in Hz. Defaults to 90.0.
+ fs (float, optional): The sampling frequency in Hz. Defaults to 2000.0.
+ order (int, optional): The order of the filter. Defaults to 4.
+
+ Returns:
+ np.ndarray: The filtered signal array.
+ """
  nyq = 0.5 * fs
  b, a = signal.butter(order, [lowcut / nyq, highcut / nyq], btype="bandpass")
  out = np.zeros_like(emg)
  return out

+ def sliding_window_segment(
+ emg: np.ndarray,
+ label: np.ndarray,
+ window_size: int,
+ stride: int
+ ) -> Tuple[np.ndarray, np.ndarray]:
+ """Segments EMG and label data using a sliding window.
+
+ Args:
+ emg (np.ndarray): The raw EMG data of shape (T, n_ch).
+ label (np.ndarray): The corresponding labels/targets.
+ window_size (int): Number of samples per window.
+ stride (int): Number of samples to shift between windows.
+
+ Returns:
+ Tuple[np.ndarray, np.ndarray]: A tuple containing:
+ - segmented EMG tokens (N, window_size, n_ch).
+ - segmented label tokens (N, window_size, target_dim).
  """
  segments, labels = [], []
  n_samples = len(label)
  return np.array(segments), np.array(labels)

+ def process_mat_file(
+ mat_path: str,
+ window_size: int,
+ stride: int,
+ fs: float
+ ) -> Optional[Tuple[str, np.ndarray, np.ndarray]]:
+ """Processes a single NinaPro DB8 .mat file.
+
+ Loads the file, removes NaNs, normalizes EMG (Z-score), maps finger degrees
+ of freedom (DoF) to degrees of activation (DoA), and segments the data.
+
+ Args:
+ mat_path (str): Absolute path to the .mat file.
+ window_size (int): Temporal window size in samples.
+ stride (int): Stride between windows in samples.
+ fs (float): Sampling frequency in Hz.
+
+ Returns:
+ Optional[Tuple[str, np.ndarray, np.ndarray]]: A tuple of (split_name, segments, labels)
+ if the file is valid, else None.
  """
  mat = scipy.io.loadmat(mat_path)
  emg = mat["emg"] # (T, 16)
  label = mat["glove"] # (T, DoF)

+ # Drop timesteps with any NaNs in glove data
  valid = ~np.isnan(label).any(axis=1)
  emg = emg[valid]
  label = label[valid]

+ # Z-score per channel
  mu = emg.mean(axis=0)
  sd = emg.std(axis=0, ddof=1)
  sd[sd == 0] = 1.0
  emg = (emg - mu) / sd

+ # DoF β†’ DoA
  y_doa = (MATRIX_DOF2DOA @ label.T).T

+ # Windowing
  segs, labs = sliding_window_segment(emg, y_doa, window_size, stride)

+ # Determine split
  fname = os.path.basename(mat_path)
  if "_A1" in fname:
  split = "train"
scripts/emg2pose.py CHANGED
@@ -1,5 +1,7 @@
  import os
  from pathlib import Path

  import h5py
  import numpy as np
@@ -9,11 +11,31 @@ from joblib import Parallel, delayed
  from scipy.signal import iirnotch
  from tqdm import tqdm

- sequence_to_seconds = lambda seq_len, fs: seq_len / fs

- # ==== Filter functions (operate at original fs=2000) ====
- def notch_filter(data, notch_freq=50.0, Q=30.0, fs=2000.0):
  b, a = iirnotch(notch_freq, Q, fs)
  out = np.zeros_like(data)
  for ch in range(data.shape[1]):
@@ -21,7 +43,25 @@ def notch_filter(data, notch_freq=50.0, Q=30.0, fs=2000.0):
  return out

- def bandpass_filter_emg(emg, lowcut=20.0, highcut=90.0, fs=2000.0, order=4):
  nyq = 0.5 * fs
  low = lowcut / nyq
  high = highcut / nyq
@@ -32,9 +72,18 @@ def bandpass_filter_emg(emg, lowcut=20.0, highcut=90.0, fs=2000.0, order=4):
  return out

- # ==== Window segmentation ====
- def process_emg_features(emg, window_size=1000, stride=500):
- segs, lbls = [], []
  N = len(emg)
  for start in range(0, N, stride):
  end = start + window_size
@@ -45,10 +94,19 @@ def process_emg_features(emg, window_size=1000, stride=500):
  return np.array(segs)

- def process_one_recording(file_path, fs=2000.0, window_size=1000, stride=500):
- """
- Process a single recording file to extract EMG features and labels
- as to be used in the main pipeline with parallel processing.
  """
  with h5py.File(file_path, "r") as f:
  grp = f["emg2pose"]
@@ -71,7 +129,6 @@ def process_one_recording(file_path, fs=2000.0, window_size=1000, stride=500):
  return segs

- # ==== Main pipeline ====
  def main():
  import argparse
@@ -93,6 +150,12 @@ def main():
  default=-1,
  help="Number of parallel jobs to run. -1 means using all available cores.",
  )
  args.add_argument(
  "--seed", type=int, default=42, help="Random seed for reproducibility."
  )
@@ -109,45 +172,62 @@ def main():
  print(f"Window size: {window_size} samples ({window_seconds:.2f} seconds)")

  df = pd.read_csv(os.path.join(data_dir, "metadata.csv"))
- df = df.groupby("split").apply(
- lambda x: (
- x.sample(frac=args.subsample, random_state=args.seed)
- if args.subsample < 1.0
- else x
  )
- )
- df.reset_index(drop=True, inplace=True)

  splits = {}
  for split, df_ in df.groupby("split"):
  sessions = list(df_.filename)
- splits[split] = [
  Path(data_dir).expanduser().joinpath(f"{session}.hdf5")
  for session in sessions
  ]

- all_data = {"train": [], "val": [], "test": []}
-
  for split, files in splits.items():
- # Here we use joblib to parallelize the file processing, each file is processed independently as the task is embarrassingly parallel. We scale the processing across all available CPU cores since the number of files is around 25k (with training being 17k).
- results = Parallel(n_jobs=args.n_jobs)(
- delayed(process_one_recording)(file_path, fs, window_size, stride)
- for file_path in tqdm(files, desc=f"Processing {split} files")
- )
- # Collect results
- for segs in tqdm(results, desc=f"Collecting {split} data"):
- all_data[split].append(segs)

- # stack, augment train, transpose, save, and print stats
- X = np.concatenate(all_data[split], axis=0) # [N, window_size, ch]

- # transpose to [N, ch, window_size]
- X = X.transpose(0, 2, 1)

- # save
- with h5py.File(os.path.join(save_dir, f"{split}.h5"), "w") as hf:
- hf.create_dataset("data", data=X)

  if __name__ == "__main__":
- main()
  import os
+ import gc
  from pathlib import Path
+ from typing import Tuple, List, Optional, Union, Dict, Any

  import h5py
  import numpy as np
  from scipy.signal import iirnotch
  from tqdm import tqdm

+ def sequence_to_seconds(seq_len: int, fs: float) -> float:
+ """Converts a sequence length in samples to time in seconds.
+
+ Args:
+ seq_len (int): The number of samples in the sequence.
+ fs (float): The sampling frequency in Hz.
+
+ Returns:
+ float: The duration of the sequence in seconds.
+ """
+ return seq_len / fs

+ def notch_filter(data: np.ndarray, notch_freq: float = 50.0, Q: float = 30.0, fs: float = 2000.0) -> np.ndarray:
+ """Applies a notch filter to every channel of the input data independently.
+
+ Args:
+ data (np.ndarray): The input signal array of shape (T, D).
+ notch_freq (float, optional): The frequency to be removed in Hz. Defaults to 50.0.
+ Q (float, optional): The quality factor. Defaults to 30.0.
+ fs (float, optional): The sampling frequency in Hz. Defaults to 2000.0.
+
+ Returns:
+ np.ndarray: The filtered signal array.
+ """
  b, a = iirnotch(notch_freq, Q, fs)
  out = np.zeros_like(data)
  for ch in range(data.shape[1]):
  return out

+ def bandpass_filter_emg(
+ emg: np.ndarray,
+ lowcut: float = 20.0,
+ highcut: float = 90.0,
+ fs: float = 2000.0,
+ order: int = 4
+ ) -> np.ndarray:
+ """Applies a Butterworth bandpass filter to the EMG signal.
+
+ Args:
+ emg (np.ndarray): The input signal array of shape (T, D).
+ lowcut (float, optional): Lower bound of the passband in Hz. Defaults to 20.0.
+ highcut (float, optional): Upper bound of the passband in Hz. Defaults to 90.0.
+ fs (float, optional): The sampling frequency in Hz. Defaults to 2000.0.
+ order (int, optional): The order of the filter. Defaults to 4.
+
+ Returns:
+ np.ndarray: The filtered signal array.
+ """
  nyq = 0.5 * fs
  low = lowcut / nyq
  high = highcut / nyq
  return out

+ def process_emg_features(emg: np.ndarray, window_size: int = 1000, stride: int = 500) -> np.ndarray:
+ """Segments raw EMG signals into overlapping windows.
+
+ Args:
+ emg (np.ndarray): Raw EMG data of shape (T, n_ch).
+ window_size (int, optional): Number of samples per window. Defaults to 1000.
+ stride (int, optional): Number of samples to shift between windows. Defaults to 500.
+
+ Returns:
+ np.ndarray: Segmented data of shape (N, window_size, n_ch).
+ """
+ segs = []
  N = len(emg)
  for start in range(0, N, stride):
  end = start + window_size
  return np.array(segs)

+ def process_one_recording(file_path: str, fs: float = 2000.0, window_size: int = 1000, stride: int = 500) -> np.ndarray:
+ """Processes a single EMG2Pose recording file.
+
+ Loads HDF5 timeseries, filters EMG, normalizes (Z-score), and segments.
+
+ Args:
+ file_path (str): Absolute path to the .hdf5 recording file.
+ fs (float, optional): Sampling frequency in Hz. Defaults to 2000.0.
+ window_size (int, optional): Temporal window size in samples. Defaults to 1000.
+ stride (int, optional): Stride between windows in samples. Defaults to 500.
+
+ Returns:
+ np.ndarray: Array of processed segments (N, window_size, n_ch).
  """
  with h5py.File(file_path, "r") as f:
  grp = f["emg2pose"]
  return segs

  def main():
  import argparse
  default=-1,
  help="Number of parallel jobs to run. -1 means using all available cores.",
  )
+ args.add_argument(
+ "--group_size",
+ type=int,
+ default=1000,
+ help="Number of recordings processed per batch and written as one group in the output HDF5 file.",
+ )
  args.add_argument(
  "--seed", type=int, default=42, help="Random seed for reproducibility."
  )
  print(f"Window size: {window_size} samples ({window_seconds:.2f} seconds)")

  df = pd.read_csv(os.path.join(data_dir, "metadata.csv"))
+ if args.subsample < 1.0:
+ df = df.groupby("split", group_keys=False).sample(
+ frac=args.subsample, random_state=args.seed
  )
+ df = df.reset_index(drop=True)

  splits = {}
  for split, df_ in df.groupby("split"):
  sessions = list(df_.filename)
+ splits[split] = [
  Path(data_dir).expanduser().joinpath(f"{session}.hdf5")
  for session in sessions
  ]

  for split, files in splits.items():
+ out_file = os.path.join(save_dir, f"{split}.h5")
+
+ # Remove existing file if it exists so we don't accidentally append to old runs
+ if os.path.exists(out_file):
+ os.remove(out_file)
+
+ print(f"Processing {split} split ({len(files)} files)...")
+
+ with h5py.File(out_file, "w") as h5f:
+ group_idx = 0
+ with Parallel(n_jobs=args.n_jobs) as parallel:
+ with tqdm(total=len(files), desc=f"Processing & Saving {split}") as pbar:
+
+ # Iterate files in batches
+ for i in range(0, len(files), args.group_size):
+ batch_files = files[i : i + args.group_size]
+
+ # Process current batch
+ results = parallel(
+ delayed(process_one_recording)(file_path, fs, window_size, stride)
+ for file_path in batch_files
+ )
+
+ if results:
+ X_chunk = np.concatenate(results, axis=0) # [N, window_size, ch]
+ X_chunk = X_chunk.transpose(0, 2, 1) # [N, ch, window_size]
+ X_chunk = X_chunk.astype(np.float32)

+ # Write each processed batch as a group compatible with HDF5Loader
+ grp = h5f.create_group(f"data_group_{group_idx}")
+ grp.create_dataset("X", data=X_chunk)
+ group_idx += 1

+ # Explicitly clear memory of large numpy arrays
224
+ del results
225
+ if 'X_chunk' in locals():
226
+ del X_chunk
227
+ gc.collect()
228
 
229
+ pbar.update(len(batch_files))
 
 
230
 
231
 
232
  if __name__ == "__main__":
233
+ main()
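The sliding-window segmentation that `process_emg_features` applies (the loop body is elided in the hunk above) can be sketched in isolation. `segment_windows` below is a hypothetical standalone version for illustration; it simply drops any trailing samples that do not fill a complete window, which the script's own tail handling may treat differently:

```python
import numpy as np

def segment_windows(emg: np.ndarray, window_size: int = 1000, stride: int = 500) -> np.ndarray:
    """Slice a (T, n_ch) recording into overlapping (N, window_size, n_ch) windows."""
    segs = []
    # Stop once a full window no longer fits; partial tails are discarded.
    for start in range(0, len(emg) - window_size + 1, stride):
        segs.append(emg[start : start + window_size])
    return np.array(segs)

# 1 s of fake 2-channel data at 2 kHz -> 3 half-overlapping windows
x = np.arange(2000 * 2, dtype=np.float32).reshape(2000, 2)
windows = segment_windows(x)
print(windows.shape)  # (3, 1000, 2)
```

With the defaults (1000-sample windows, 500-sample stride) consecutive windows overlap by 50%, doubling the number of training segments per recording.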
scripts/epn.py CHANGED
@@ -2,6 +2,7 @@ import glob
 import json
 import os
 import sys
+from typing import Tuple, List, Optional, Union, Dict, Any
 
 import h5py
 import numpy as np
@@ -10,7 +11,17 @@ from joblib import Parallel, delayed
 from scipy.signal import iirnotch
 from tqdm.auto import tqdm
 
-sequence_to_seconds = lambda seq_len, fs: seq_len / fs
+def sequence_to_seconds(seq_len: int, fs: float) -> float:
+    """Converts a sequence length in samples to time in seconds.
+
+    Args:
+        seq_len (int): The number of samples in the sequence.
+        fs (float): The sampling frequency in Hz.
+
+    Returns:
+        float: The duration of the sequence in seconds.
+    """
+    return seq_len / fs
 
 # Sampling frequency and EMG channels
 tfs, n_ch = 200.0, 8
@@ -27,28 +38,77 @@ gesture_map = {
 }
 
 
-# Filtering utilities
-def bandpass_filter_emg(emg, low=20.0, high=90.0, fs=tfs, order=4):
+def bandpass_filter_emg(
+    emg: np.ndarray,
+    low: float = 20.0,
+    high: float = 90.0,
+    fs: float = tfs,
+    order: int = 4
+) -> np.ndarray:
+    """Applies a Butterworth bandpass filter to the EMG signal.
+
+    Args:
+        emg (np.ndarray): The input signal array of shape (n_ch, T).
+        low (float, optional): Lower bound of the passband in Hz. Defaults to 20.0.
+        high (float, optional): Upper bound of the passband in Hz. Defaults to 90.0.
+        fs (float, optional): The sampling frequency in Hz. Defaults to 200.0.
+        order (int, optional): The order of the filter. Defaults to 4.
+
+    Returns:
+        np.ndarray: The filtered signal array.
+    """
     nyq = 0.5 * fs
     b, a = signal.butter(order, [low / nyq, high / nyq], btype="bandpass")
     return signal.filtfilt(b, a, emg, axis=1)
 
 
-def notch_filter_emg(emg, notch=50.0, Q=30.0, fs=tfs):
+def notch_filter_emg(
+    emg: np.ndarray,
+    notch: float = 50.0,
+    Q: float = 30.0,
+    fs: float = tfs
+) -> np.ndarray:
+    """Applies a notch filter to remove power line interference.
+
+    Args:
+        emg (np.ndarray): The input signal array of shape (n_ch, T).
+        notch (float, optional): The frequency to be removed in Hz. Defaults to 50.0.
+        Q (float, optional): The quality factor. Defaults to 30.0.
+        fs (float, optional): The sampling frequency in Hz. Defaults to 200.0.
+
+    Returns:
+        np.ndarray: The filtered signal array.
+    """
     w0 = notch / (0.5 * fs)
     b, a = iirnotch(w0, Q)
     return signal.filtfilt(b, a, emg, axis=1)
 
 
-# Normalization helpers
-def zscore_per_channel(emg):
+def zscore_per_channel(emg: np.ndarray) -> np.ndarray:
+    """Normalizes the EMG signal using Z-score (per channel).
+
+    Args:
+        emg (np.ndarray): The input EMG signal of shape (n_ch, T).
+
+    Returns:
+        np.ndarray: The normalized EMG signal.
+    """
     mean = emg.mean(axis=1, keepdims=True)
     std = emg.std(axis=1, ddof=1, keepdims=True)
     std[std == 0] = 1.0
     return (emg - mean) / std
 
 
-def adjust_length(x, max_len):
+def adjust_length(x: np.ndarray, max_len: int) -> np.ndarray:
+    """Standardizes the temporal length of the signal by clipping or zero-padding.
+
+    Args:
+        x (np.ndarray): The input signal of shape (n_ch, T).
+        max_len (int): The target length in samples.
+
+    Returns:
+        np.ndarray: The standardized length signal of shape (n_ch, max_len).
+    """
     n_ch, seq_len = x.shape
     if seq_len >= max_len:
         return x[:, :max_len]
@@ -56,8 +116,18 @@ def adjust_length(x, max_len):
     return np.concatenate([x, pad], axis=1)
 
 
-# Single-sample processing
-def extract_emg_signal(sample, seq_len):
+def extract_emg_signal(sample: Dict[str, Any], seq_len: int) -> Tuple[np.ndarray, int]:
+    """Extracts, filters, and normalizes EMG data from a JSON sample.
+
+    Args:
+        sample (Dict[str, Any]): A single sample dictionary from the EPN612 JSON.
+        seq_len (int): Target temporal length.
+
+    Returns:
+        Tuple[np.ndarray, int]: A tuple containing:
+            - The preprocessed EMG signal (n_ch, seq_len).
+            - The gesture label ID.
+    """
     emg = np.stack([v for v in sample["emg"].values()], dtype=np.float32) / 128.0
     emg = bandpass_filter_emg(emg, 20.0, 90.0)
     emg = notch_filter_emg(emg, 50.0, 30.0)
@@ -67,8 +137,20 @@ def extract_emg_signal(sample, seq_len):
     return emg, label
 
 
-# Process one user JSON for train/validation
-def process_user_training(path, seq_len):
+def process_user_training(
+    path: str,
+    seq_len: int
+) -> Tuple[List[np.ndarray], List[int], List[np.ndarray], List[int]]:
+    """Processes a user's training JSON file for the training and validation splits.
+
+    Args:
+        path (str): Path to the user JSON file.
+        seq_len (int): Target temporal length for segmentation.
+
+    Returns:
+        Tuple[List[np.ndarray], List[int], List[np.ndarray], List[int]]:
+            (train_X, train_y, val_X, val_y) lists.
+    """
     train_X, train_y, val_X, val_y = [], [], [], []
     with open(path, "r", encoding="utf-8") as f:
         data = json.load(f)
@@ -79,14 +161,29 @@ def process_user_training(path, seq_len):
             train_y.append(lbl)
     for sample in data.get("testingSamples", {}).values():
         emg, lbl = extract_emg_signal(sample, seq_len)
         if lbl != 6:
            val_X.append(emg)
            val_y.append(lbl)
     return train_X, train_y, val_X, val_y
 
 
-# Process one user JSON for testing split
-def process_user_testing(path, seq_len):
+def process_user_testing(
+    path: str,
+    seq_len: int
+) -> Tuple[List[np.ndarray], List[int], List[np.ndarray], List[int]]:
+    """Processes a user's testing JSON file for the fine-tuning and test splits.
+
+    Args:
+        path (str): Path to the user JSON file.
+        seq_len (int): Target temporal length for segmentation.
+
+    Returns:
+        Tuple[List[np.ndarray], List[int], List[np.ndarray], List[int]]:
+            (tune_X, tune_y, test_X, test_y) lists.
+    """
     train_X, train_y, test_X, test_y = [], [], [], []
     with open(path, "r", encoding="utf-8") as f:
         data = json.load(f)
@@ -107,14 +204,19 @@ def process_user_testing(path, seq_len):
     return train_X, train_y, test_X, test_y
 
 
-# Save to HDF5
-def save_h5(path, data, labels):
+def save_h5(path: str, data: List[np.ndarray], labels: List[int]) -> None:
+    """Saves the processed EMG data and labels to an HDF5 file.
+
+    Args:
+        path (str): Output file path.
+        data (List[np.ndarray]): List of signal segments.
+        labels (List[int]): List of categorical labels.
+    """
     with h5py.File(path, "w") as f:
         f.create_dataset("data", data=np.asarray(data, np.float32))
         f.create_dataset("label", data=np.asarray(labels, np.int64))
 
 
-# Main parallelized pipeline
 def main():
     import argparse

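The clip-or-pad behaviour of `adjust_length` is easy to sanity-check in isolation. The sketch below reproduces its logic as shown in the diff, using the (n_ch, T) layout that epn.py works with:

```python
import numpy as np

def adjust_length(x: np.ndarray, max_len: int) -> np.ndarray:
    """Clip the time axis to max_len, or zero-pad on the right if too short."""
    n_ch, seq_len = x.shape
    if seq_len >= max_len:
        return x[:, :max_len]
    pad = np.zeros((n_ch, max_len - seq_len), dtype=x.dtype)
    return np.concatenate([x, pad], axis=1)

long = np.ones((8, 250), dtype=np.float32)   # longer than target -> clipped
short = np.ones((8, 120), dtype=np.float32)  # shorter than target -> zero-padded

clipped = adjust_length(long, 200)
padded = adjust_length(short, 200)
print(clipped.shape, padded.shape)  # (8, 200) (8, 200)
```

Zero-padding on the right keeps all segments at a fixed temporal length so they can be stacked into one HDF5 dataset, at the cost of silent samples at the end of short gestures.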
scripts/uci.py CHANGED
@@ -1,36 +1,78 @@
 import os
 import sys
 from pathlib import Path
+from typing import Tuple, List, Optional, Union, Dict, Any
 
 import h5py
 import numpy as np
 import scipy.signal as signal
 from scipy.signal import iirnotch
 
-sequence_to_seconds = lambda seq_len, fs: seq_len / fs
+def sequence_to_seconds(seq_len: int, fs: float) -> float:
+    """Converts a sequence length in samples to time in seconds.
+
+    Args:
+        seq_len (int): The number of samples in the sequence.
+        fs (float): The sampling frequency in Hz.
+
+    Returns:
+        float: The duration of the sequence in seconds.
+    """
+    return seq_len / fs
 
-# ─────────────────────────────────────────────
-# Filtering utilities
-# ─────────────────────────────────────────────
-def bandpass_filter_emg(emg, lowcut=20.0, highcut=90.0, fs=200.0, order=4):
+def bandpass_filter_emg(
+    emg: np.ndarray,
+    lowcut: float = 20.0,
+    highcut: float = 90.0,
+    fs: float = 200.0,
+    order: int = 4
+) -> np.ndarray:
+    """Applies a Butterworth bandpass filter to the EMG signal.
+
+    Args:
+        emg (np.ndarray): The input signal array of shape (T, D).
+        lowcut (float, optional): Lower bound of the passband in Hz. Defaults to 20.0.
+        highcut (float, optional): Upper bound of the passband in Hz. Defaults to 90.0.
+        fs (float, optional): The sampling frequency in Hz. Defaults to 200.0.
+        order (int, optional): The order of the filter. Defaults to 4.
+
+    Returns:
+        np.ndarray: The filtered signal array.
+    """
     nyq = 0.5 * fs
     b, a = signal.butter(order, [lowcut / nyq, highcut / nyq], btype="bandpass")
     return signal.filtfilt(b, a, emg, axis=0)
 
-
-def notch_filter_emg(emg, notch_freq=50.0, Q=30.0, fs=200.0):
+def notch_filter_emg(
+    emg: np.ndarray,
+    notch_freq: float = 50.0,
+    Q: float = 30.0,
+    fs: float = 200.0
+) -> np.ndarray:
+    """Applies a notch filter to remove power line interference.
+
+    Args:
+        emg (np.ndarray): The input signal array of shape (T, D).
+        notch_freq (float, optional): The frequency to be removed in Hz. Defaults to 50.0.
+        Q (float, optional): The quality factor. Defaults to 30.0.
+        fs (float, optional): The sampling frequency in Hz. Defaults to 200.0.
+
+    Returns:
+        np.ndarray: The filtered signal array.
+    """
     b, a = iirnotch(notch_freq / (0.5 * fs), Q)
     return signal.filtfilt(b, a, emg, axis=0)
 
-# ─────────────────────────────────────────────
-# Core I/O + preprocessing helpers
-# ─────────────────────────────────────────────
-def read_emg_txt(txt_path):
-    """
-    Read a txt file with columns: time ch1 … ch8 class.
-    Return float32 array of shape (N, 10).
+def read_emg_txt(txt_path: str) -> np.ndarray:
+    """Reads a UCI EMG text file into a numpy array.
+
+    The file is expected to have columns: [time, ch1, ..., ch8, class].
+
+    Args:
+        txt_path (str): Path to the .txt file.
+
+    Returns:
+        np.ndarray: A float32 array of shape (N, 10).
     """
     data = []
     with open(txt_path, "r") as f:
@@ -41,10 +83,22 @@ def read_emg_txt(txt_path):
     return np.asarray(data, dtype=np.float32)
 
 
-def preprocess_emg(arr, fs=200.0, remove_class0=True):
-    """
-    1) optional removal of class-0 rows
-    2) band-pass → notch → Z-score (on 8 channels)
+def preprocess_emg(arr: np.ndarray, fs: float = 200.0, remove_class0: bool = True) -> np.ndarray:
+    """Applies a standard preprocessing pipeline to the EMG data.
+
+    Pipeline includes:
+        1. Optional removal of rest (class 0).
+        2. Bandpass filtering (20-90 Hz).
+        3. Notch filtering (50 Hz).
+        4. Z-score normalization per channel.
+
+    Args:
+        arr (np.ndarray): Raw data array of shape (N, 10).
+        fs (float, optional): Sampling frequency in Hz. Defaults to 200.0.
+        remove_class0 (bool, optional): Whether to remove the "rest" class. Defaults to True.
+
+    Returns:
+        np.ndarray: The preprocessed data array.
     """
     if remove_class0:
         arr = arr[arr[:, -1] >= 1]
@@ -64,8 +118,15 @@ def preprocess_emg(arr, fs=200.0, remove_class0=True):
     return arr
 
 
-def find_label_runs(arr):
-    """Group consecutive rows with identical class labels."""
+def find_label_runs(arr: np.ndarray) -> List[Tuple[int, np.ndarray]]:
+    """Groups consecutive rows with identical class labels.
+
+    Args:
+        arr (np.ndarray): Data array where the last column is the class label.
+
+    Returns:
+        List[Tuple[int, np.ndarray]]: A list of tuples (label, sub-array).
+    """
     runs = []
     if arr.size == 0:
         return runs
@@ -80,7 +141,23 @@ def find_label_runs(arr):
     return runs
 
 
-def sliding_window_majority(seg_arr, window_size=1000, stride=500):
+def sliding_window_majority(
+    seg_arr: np.ndarray,
+    window_size: int = 1000,
+    stride: int = 500
+) -> Tuple[np.ndarray, np.ndarray]:
+    """Segments a label-consistent array using a sliding window and majority voting.
+
+    Args:
+        seg_arr (np.ndarray): Data array of shape (T, 10).
+        window_size (int, optional): Number of samples per window. Defaults to 1000.
+        stride (int, optional): Number of samples to shift between windows. Defaults to 500.
+
+    Returns:
+        Tuple[np.ndarray, np.ndarray]: A tuple containing:
+            - Windowed EMG segments (N, window_size, 8).
+            - Majority vote labels (N,).
+    """
     segs, labs = [], []
     for start in range(0, len(seg_arr) - window_size + 1, stride):
         win = seg_arr[start : start + window_size]
@@ -91,8 +168,24 @@ def sliding_window_majority(seg_arr, window_size=1000, stride=500):
 
 
 def users_with_gesture(
-    data_root, gesture_id, subj_range=range(1, 37), return_counts=False
-):
+    data_root: str,
+    gesture_id: int,
+    subj_range: range = range(1, 37),
+    return_counts: bool = False
+) -> Union[List[int], Dict[int, int]]:
+    """Identifies which subjects performed a specific gesture.
+
+    Args:
+        data_root (str): Root directory of the dataset.
+        gesture_id (int): The ID of the gesture to search for.
+        subj_range (range, optional): Range of subject IDs to check. Defaults to range(1, 37).
+        return_counts (bool, optional): If True, returns a dictionary with sample counts.
+            Defaults to False.
+
+    Returns:
+        Union[List[int], Dict[int, int]]: Either a list of subject IDs or a dictionary
+            mapping subject ID to occurrence count.
+    """
     found = {}
     for subj in subj_range:
         subj_dir = os.path.join(data_root, f"{subj:02d}")
@@ -122,21 +215,28 @@ def users_with_gesture(
     else:
         return sorted(found.keys())
 
-
-# ─────────────────────────────────────────────
-# Safe concatenation utilities
-# ─────────────────────────────────────────────
-def concat_data(lst):  # lst of (N,256,8)
+def concat_data(lst: List[np.ndarray]) -> np.ndarray:
+    """Concatenates a list of data arrays.
+
+    Args:
+        lst (List[np.ndarray]): List of arrays to concatenate.
+
+    Returns:
+        np.ndarray: Concatenated array or empty array if list is empty.
+    """
     return np.concatenate(lst, axis=0) if lst else np.empty((0, 1000, 8), np.float32)
 
-def concat_label(lst):
-    return np.concatenate(lst, axis=0) if lst else np.empty((0,), np.int32)
+def concat_label(lst: List[np.ndarray]) -> np.ndarray:
+    """Concatenates a list of label arrays.
+
+    Args:
+        lst (List[np.ndarray]): List of label arrays.
+
+    Returns:
+        np.ndarray: Concatenated array or empty array if list is empty.
+    """
+    return np.concatenate(lst, axis=0) if lst else np.empty((0,), np.int32)
 
-# ─────────────────────────────────────────────
-# Main
-# ─────────────────────────────────────────────
 if __name__ == "__main__":
     import argparse
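To illustrate the majority-vote windowing in uci.py's `sliding_window_majority` (whose loop body is elided in the hunk above), here is a minimal sketch on a toy (T, 4) array laid out as [time, ch1, ch2, class]. The real script operates on (T, 10) arrays with 8 EMG channels, and the voting details inside the elided body are an assumption here:

```python
import numpy as np

def sliding_window_majority(seg_arr, window_size=4, stride=2):
    """Window the rows and label each window with its most frequent class."""
    segs, labs = [], []
    for start in range(0, len(seg_arr) - window_size + 1, stride):
        win = seg_arr[start : start + window_size]
        labels = win[:, -1].astype(np.int64)
        segs.append(win[:, 1:-1])                  # keep only the EMG channels
        labs.append(np.bincount(labels).argmax())  # majority class; ties go to the lower id
    return np.array(segs), np.array(labs)

t = np.arange(6).reshape(6, 1)
ch = np.ones((6, 2))
cls = np.array([1, 1, 2, 2, 2, 2]).reshape(6, 1)
arr = np.hstack([t, ch, cls]).astype(np.float32)

X, y = sliding_window_majority(arr)
print(X.shape, y)  # (2, 4, 2) [1 2]
```

Majority voting matters at run boundaries: a window that straddles two label runs still gets a single, dominant label rather than being discarded.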