tags:
- setfit
- sentence-transformers
- text-classification
- generated_from_setfit_trainer
widget:
- text: >-
# Dataset Card for The Wilds Bioacoustics Monitors
This dataset contains passive acoustic recordings collected at [The Wilds
safari park](https://www.thewilds.org/) in Ohio during Summer 2025.
Recorders captured ambient soundscapes to support ecological monitoring,
animal behavior analysis, and acoustic biodiversity modeling.
## Dataset Details
### Dataset Description
- **Curated by:** Tanishka Wani, Vedant Patil, Rugved Katole, Bharath
Pillai, Anirudh Potlapally, Ally Bonney, and Jenna Kline
- **Repository:**
[https://github.com/Imageomics/naturelab](https://github.com/Imageomics/naturelab)
- **Paper:** [SmartWilds: Multimodal Wildlife Monitoring
Dataset](https://arxiv.org/abs/2509.18894)
This dataset was created to support multimodal wildlife monitoring
research using passive acoustic monitoring. Bioacoustic data were
collected using Wildlife Acoustics Song Meter devices deployed across four
field sites at The Wilds. The recordings capture natural soundscapes
including wildlife vocalizations, environmental sounds, and ambient audio
that can be used for species detection, behavioral analysis, and
biodiversity assessment.
### Supported Tasks and Leaderboards
- **Audio Classification:** Species identification from acoustic
recordings
- **Sound Event Detection:** Detection and localization of animal
vocalizations
- **Biodiversity Assessment:** Acoustic diversity indices and community
analysis
- **Behavioral Analysis:** Temporal activity patterns and acoustic
behavior studies
- **Soundscape Ecology:** Environmental audio analysis and habitat
characterization
[No benchmarks currently available]
## Dataset Structure
The dataset is organized hierarchically by site and deployment session:
```
/dataset/
    bioacoustic.txt
    The_Wilds_Bioacoustic_Log2025-06-30_21_54_59.csv
    The_Wilds_Bioacoustic_Log2025-07-04_20_18_38.csv
    TW05-SM01/
        metadata.md
        SD01_20250630_20250703/
            SM001_20250630_195900.wav
            SM001_20250630_200402.wav
            SM001_20250630_200902.wav
            ...
            SM001_20250703_064902.wav
            SM001_20250703_065402.wav
            SM001_20250703_065902.wav
    TW06-SM03/
        metadata.md
        SD03_20250630_20250703/
            SM03_20250630_140000.wav
            SM03_20250630_150000.wav
            SM03_20250630_160000.wav
            SM03_20250630_170000.wav
            ...
            SM03_20250703_140000.wav
            SM03_20250703_150000.wav
            SM03_20250703_160000.wav
    TW07-SM02/
        metadata.md
        SD02_20250630_20250703/
            SM002_20250630_195900.wav
            SM002_20250630_205902.wav
            SM002_20250701_050300.wav
            ...
            SM002_20250702_205902.wav
            SM002_20250703_050400.wav
            SM002_20250703_060402.wav
    TW08-SM04/
        metadata.md
        SD04_20250630_20250703/
            SM04_20250630_120000.wav
            SM04_20250630_130000.wav
            SM04_20250630_140000.wav
            ...
            SM04_20250703_150000.wav
            SM04_20250703_160000.wav
            SM04_20250703_170000.wav
```
### Data Instances
Each bioacoustic deployment folder contains:
- **Audio files:** .wav format recordings captured on programmed recording schedules
- **Metadata file:** `metadata.md` with deployment information and
recorder settings
**File Counts by Recorder:**
- **TW05-SM01:** 144 audio files (.wav recordings)
- **TW06-SM03:** 75 audio files (.wav recordings)
- **TW07-SM02:** 12 audio files (.wav recordings)
- **TW08-SM04:** 78 audio files (.wav recordings)
**Audio File Specifications:**
- **Format:** .wav (uncompressed)
- **Channels:** Mono
- **Bit depth:** 16-bit
- **Sample rate:** 48 kHz
- **Duration:** Variable based on recording schedule
**Filename Conventions:**
- **SM001/SM03/SM04 series:** SM0##_YYYYMMDD_HHMMSS.wav (TW05-SM01,
TW06-SM03, TW08-SM04)
- **SM002 series:** SM002_YYYYMMDD_HHMMSS.wav (TW07-SM02)
**Total Dataset Size:** 311 audio files across all bioacoustic monitor
deployments.
Each .wav file is a field recording captured according to programmed
recording schedules. File names include timestamps indicating the start
time of each recording session.
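As an illustration, the recorder ID and start time can be recovered from a filename with a few lines of Python (a minimal sketch based on the conventions above; not part of the dataset):
```python
from datetime import datetime
from pathlib import Path

# Sketch: split "SM001_20250630_195900.wav" into recorder ID and start time.
def parse_recording(filename: str):
    recorder, date, time = Path(filename).stem.split("_")
    return recorder, datetime.strptime(date + time, "%Y%m%d%H%M%S")

print(parse_recording("SM001_20250630_195900.wav"))
# ('SM001', datetime.datetime(2025, 6, 30, 19, 59))
```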
### Data Fields
**metadata.md** (found in each recorder deployment folder):
- **Recorder ID:** Unique device identifier (SM01, SM02, SM03, SM04)
- **Device Model:** Song Meter model name (e.g., Song Meter Micro 2)
- **Device Serial Number:** Manufacturer-assigned serial number
- **Site ID:** Location code where deployed (TW05, TW06, TW07, TW08)
- **Deployment Location Description:** Text description of exact location
and surroundings
- **GPS Coordinates:** Latitude and longitude in decimal format
- **Deployment Date and Time:** Recorder deployment timestamp (YYYY-MM-DD
HH:MM format)
- **Retrieval Date and Time:** Recorder retrieval timestamp (YYYY-MM-DD
HH:MM format)
- **Orientation / Microphone Facing:** Direction and environmental
considerations (e.g., "East, away from wind and road")
- **Mounting Height:** Approximate height of microphone from ground in
meters
- **Recording Schedule Preset:** Schedule or settings used for recording
(e.g., "1 hour at sunrise and sunset")
- **Time Zone Set on Device:** Local time zone configured (e.g., "USA
Eastern (UTC-5)")
- **Maintenance Notes:** Issues, configuration changes, or deviations from
standard settings
- **Observer:** Name or initials of person completing metadata
**CSV Log Files:**
- `The_Wilds_Bioacoustic_Log2025-06-30_21_54_59.csv`: Deployment log from
June 30, 2025
- `The_Wilds_Bioacoustic_Log2025-07-04_20_18_38.csv`: Retrieval log from
July 4, 2025
### Data Splits
This dataset has no predefined training/validation/test splits. Data are
organized by site (TW05-TW08) and deployment session. Users may create
their own splits based on:
- **Temporal splits:** Using recording timestamps across the deployment
period
- **Spatial splits:** Using different site locations (TW05, TW06, TW07,
TW08)
- **Recorder-based splits:** Using different Song Meter devices (SM01,
SM02, SM03, SM04)
The recommended approach depends on the modeling goals and research questions.
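For example, a recorder-based split might hold out one device entirely (a sketch; the grouping below is illustrative, not a recommended benchmark split):
```python
# Sketch: hold out one recorder for testing, train on the rest.
train_recorders = {"SM001", "SM03", "SM04"}   # TW05, TW06, TW08
test_recorders = {"SM002"}                    # TW07

def split_of(filename: str) -> str:
    recorder = filename.split("_")[0]
    return "test" if recorder in test_recorders else "train"

print(split_of("SM002_20250701_050300.wav"))  # test
```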
## Dataset Creation
### Curation Rationale
This dataset supports biodiversity monitoring, behavioral ecology
research, and the development of automated species detection and
classification models from passive acoustic recordings. Bioacoustic
monitoring provides complementary data to camera trap surveys and enables
detection of cryptic or nocturnal species that may be missed by visual
methods.
### Source Data
#### Data Collection and Processing
Recordings were collected at The Wilds safari park during summer 2025
using Wildlife Acoustics Song Meter devices. Four recorders (SM01-SM04)
were strategically deployed at sites TW05-TW08 from June 30 to July 3,
2025.
Devices were programmed for scheduled recordings with different sampling
strategies across sites. Recorders were mounted on trees or posts at
appropriate heights and orientations to minimize wind noise and maximize
acoustic detection. Upon retrieval, audio files were organized by
deployment session and basic metadata were recorded. No audio processing,
filtering, or annotation was applied to preserve the raw acoustic data.
#### Who are the source data producers?
The dataset was collected and curated by researchers and students from the
Imageomics Institute and Ohio State University in collaboration with
conservation staff at The Wilds safari park in Ohio.
### Annotations
#### Annotation process
No species identification or acoustic annotations are currently provided
with this initial dataset release. Manual and AI-assisted labeling efforts
for species detection, vocalization classification, and acoustic event
annotation are planned for future versions.
#### Who are the annotators?
N/A - annotations will be added in future releases
### Personal and Sensitive Information
The dataset includes GPS coordinates within The Wilds, a public
conservation
- text: >-
# Securing an MLOps Pipeline: Training to Deployment
> This dataset contains a technical article available in both French and English.
---
## Navigation
- [French Version](#version-francaise)
- [English Version](#english-version)
---
<a id="version-francaise"></a>
---
title: "Securing an MLOps Pipeline: From Training to Deployment"
author: "AYI-NEDJIMI Consultants"
date: "2026-02-21"
language: "fr"
tags:
- mlops
- securite
- pipeline
- supply-chain
- model-poisoning
- gpu-isolation
license: "cc-by-sa-4.0"
---
# Securing an MLOps Pipeline: From Training to Deployment
**Author**: AYI-NEDJIMI Consultants | **Date**: February 21, 2026 | **Reading time**: 11 min
---
## Introduction
MLOps pipelines represent a growing attack surface that most security teams still poorly understand. An AI model compromised in the supply chain can have catastrophic consequences: a poisoned intrusion-detection model that ignores certain attacks, a cybersecurity assistant that gives deliberately wrong recommendations, or a classification system that exfiltrates data through its predictions.
This article details the threats specific to MLOps pipelines and the countermeasures to implement, from training to production deployment. It complements our approach to [securing virtualized infrastructure](https://www.ayinedjimi-consultants.fr/virtualisation/hyperv-securisation-2025.html) and our [infrastructure audit](https://www.ayinedjimi-consultants.fr/audit-infrastructure.html) services.
## MLOps Threat Taxonomy
### Pipeline attack surface
```
[Data]    [Code]    [Model]    [Infra]      [Deployment]
   |         |          |          |              |
   v         v          v          v              v
 Data      Code       Model     Training      Serving
Poisoning Injection  Poisoning    Infra       Endpoint
   |         |          |      Compromise    Compromise
   v         v          v          |              |
Backdoor   Supply     Trojan      GPU            API
Dataset    Chain      Model    Hijacking      Injection
          Attack
```
### The 10 major risks
| Risk | Phase | Impact | Likelihood |
|--------|-------|--------|------------|
| Data poisoning | Training | Critical | Medium |
| Model backdoor | Training | Critical | Low |
| Supply chain (packages) | Development | High | High |
| GPU memory leakage | Training | High | Medium |
| Model serialization attack | Distribution | Critical | Medium |
| Adversarial inputs | Inference | Medium | High |
| Model inversion | Inference | High | Low |
| API injection | Deployment | High | High |
| Drift detection evasion | Production | Medium | Low |
| Exfiltration via predictions | Production | Critical | Low |
## Securing the Training Phase
### Data integrity
```python
import hashlib
import json
from pathlib import Path
from datetime import datetime

class SecurityError(Exception):
    """Raised when an integrity check fails."""

class DataIntegrityChecker:
    """Integrity verification for training data."""

    def __init__(self, manifest_path: str):
        self.manifest_path = manifest_path
        self.manifest = self._load_or_create_manifest()

    def _load_or_create_manifest(self):
        if Path(self.manifest_path).exists():
            with open(self.manifest_path) as f:
                return json.load(f)
        return {"files": {}, "created": datetime.now().isoformat()}

    def register_dataset(self, file_path: str) -> str:
        """Register a dataset file together with its hash."""
        sha256 = hashlib.sha256()
        with open(file_path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                sha256.update(chunk)
        file_hash = sha256.hexdigest()
        self.manifest["files"][file_path] = {
            "sha256": file_hash,
            "registered": datetime.now().isoformat(),
            "size": Path(file_path).stat().st_size,
        }
        self._save_manifest()
        return file_hash

    def verify_dataset(self, file_path: str) -> bool:
        """Verify the integrity of a dataset before training."""
        if file_path not in self.manifest["files"]:
            raise ValueError(f"Dataset not registered: {file_path}")
        expected_hash = self.manifest["files"][file_path]["sha256"]
        sha256 = hashlib.sha256()
        with open(file_path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                sha256.update(chunk)
        actual_hash = sha256.hexdigest()
        if actual_hash != expected_hash:
            raise SecurityError(
                f"INTEGRITY COMPROMISED: {file_path}\n"
                f"Expected: {expected_hash}\n"
                f"Found:    {actual_hash}"
            )
        return True

    def _save_manifest(self):
        with open(self.manifest_path, "w") as f:
            json.dump(self.manifest, f, indent=2)
```
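A minimal usage sketch (file paths are illustrative):
```python
# Register datasets once at ingestion, then verify before every training run.
checker = DataIntegrityChecker("data_manifest.json")
checker.register_dataset("datasets/train.csv")
checker.verify_dataset("datasets/train.csv")  # raises SecurityError on mismatch
```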
### GPU isolation and training hardening
```python
import subprocess
import os

class SecureTrainingEnvironment:
    """Hardened training environment with GPU isolation."""

    def __init__(self, gpu_id: int = 0):
        self.gpu_id = gpu_id

    def setup_gpu_isolation(self):
        """Configure GPU isolation for training."""
        # Restrict GPU visibility to the assigned device
        os.environ["CUDA_VISIBLE_DEVICES"] = str(self.gpu_id)
        # Enable MIG (Multi-Instance GPU) mode when available
        # (A100/H100 only)
        try:
            subprocess.run([
                "nvidia-smi", "mig", "-cgi", "9,9", "-C"
            ], check=True, capture_output=True)
        except subprocess.CalledProcessError:
            print("MIG unavailable, falling back to exclusive mode")
            subprocess.run([
                "nvidia-smi", "-i", str(self.gpu_id),
                "-c", "EXCLUSIVE_PROCESS"
            ], check=True)

    def clear_gpu_memory(self):
        """Release GPU memory after training."""
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            torch.cuda.synchronize()
        # Force a GPU reset (requires admin rights and no active clients)
        subprocess.run([
            "nvidia-smi", "--gpu-reset", "-i", str(self.gpu_id)
        ], capture_output=True)

    def verify_no_data_leakage(self):
        """Check that no compute process still holds GPU memory."""
        result = subprocess.run(
            ["nvidia-smi", "--query-compute-apps=pid,used_memory",
             "--format=csv,noheader", "-i", str(self.gpu_id)],
            capture_output=True, text=True
        )
        if result.stdout.strip():
            # SecurityError is defined in the previous listing
            raise SecurityError(
                f"Processes still active on GPU {self.gpu_id}: {result.stdout}"
            )
        return True
```
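A possible training-session wrapper (sketch; `run_training` stands in for your actual entry point):
```python
# Isolate the GPU, train, then always clean up and check for leaks.
env = SecureTrainingEnvironment(gpu_id=0)
env.setup_gpu_isolation()
try:
    run_training()  # hypothetical training entry point
finally:
    env.clear_gpu_memory()
    env.verify_no_data_leakage()
```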
## Securing the ML Supply Chain
### Dependency verification
```text
# requirements-secure.txt with hash pinning
# Install with: pip install --require-hashes -r requirements-secure.txt
# Hashes can be generated with, e.g., `pip-compile --generate-hashes` (pip-tools)
torch==2.2.0 \
--hash=sha256:abc123... \
--hash=sha256:def456...
transformers==4.38.0 \
--hash=sha256:ghi789...
peft==0.9.0 \
--hash=sha256:jkl012...
```
### Scanning models before deployment
```python
import pickle
import struct
from pathlib import Path

class ModelSecurityScanner:
    """Security scanner for ML model files."""

    # Pickle GLOBAL opcodes ('c' + module name) that indicate dangerous imports
    DANGEROUS_OPCODES = {
        b'cos\n',            # GLOBAL os (e.g., os.system)
        b'cposix\n',         # GLOBAL posix
        b'csubprocess\n',    # GLOBAL subprocess
        b'cbuiltins\n',      # GLOBAL builtins
        b'c__builtin__\n',   # GLOBAL __builtin__
    }

    def scan_pickle(self, model_path: str) -> dict:
        """Scan a pickle file for indicators of malicious code."""
        results = {"safe": True, "warning
- text: >-
# APIEval-20: A Benchmark for Black-Box API Test Suite Generation
---
## Motivation
Testing APIs thoroughly is one of the most critical, yet consistently
underserved, activities in software engineering. Despite a rich ecosystem
of API testing tools — Postman, RestAssured, Schemathesis, Dredd, and
others — we found ourselves asking a deceptively simple question:
**Given only the schema and an example payload of an API request — no
source code, no documentation, no prior knowledge — how well can an AI
agent generate a test suite that actually finds bugs?**
We searched for an existing benchmark that captured this black-box
scenario and came up empty. Every evaluation we found either required
access to the implementation, relied on rich API documentation, or
measured properties like schema compliance rather than actual bug-finding
capability. The practitioner reality is different: teams frequently
receive API payloads with little context and need to construct meaningful
tests quickly.
That gap is the reason **APIEval-20** exists.
APIEval-20 is not a model benchmark. It is a **task benchmark for AI
agents**. It evaluates end-to-end agent behavior — the ability to reason
about an API surface, design targeted tests, and uncover real bugs — not
just the quality of generated text.
---
## 1. Benchmark Overview
APIEval-20 consists of 20 carefully designed API scenarios drawn from
real-world application domains. Each scenario presents the agent with an
API request schema and a sample payload, then challenges it to produce a
test suite that exposes bugs hidden within a live reference
implementation.
### Domains Covered
The 20 scenarios span the following application domains, chosen to reflect
a broad range of validation patterns, business logic complexity, and
security sensitivity:
| Domain | Scenarios |
|---|---|
| **E-commerce** | Order placement, coupon redemption, inventory
adjustment |
| **Payments** | Transaction creation, refund processing, currency
conversion |
| **Authentication** | Login, token refresh, password reset, session
management |
| **User Management** | Account creation, profile update, role assignment
|
| **Scheduling** | Appointment booking, availability queries, recurring
events |
| **Notifications** | Email dispatch, push configuration, preference
management |
| **Search & Filtering** | Query construction, pagination, sort and rank |
---
## 2. Bug Spectrum
Each scenario contains between 3 and 8 planted bugs. Rather than
categorising bugs by severity, APIEval-20 classifies them by
**complexity** — reflecting how much reasoning is required to discover
them. Bugs range along a continuum from simple to complex.
### Simple Bugs
Require no semantic understanding of the domain. They test whether the API
handles basic structural issues correctly: missing required fields, empty
values (`""`, `null`, `[]`), and wrong data types.
### Moderate Bugs
Require understanding the meaning of individual fields and their
constraints: numeric values outside valid range, strings violating format
constraints (malformed email, invalid currency code, wrong date format),
and enum fields receiving boundary or undocumented values.
### Complex Bugs
Require understanding the *relationship* between multiple fields, or the
broader semantics of the operation: mutually exclusive fields both
provided, discounts applied to ineligible orders, fields whose validity
depends on the value of another field.
**A strong test suite should span the full complexity spectrum — simple
structural checks alone will not surface the bugs that matter most in
production.**
---
## 3. Agent I/O
### What the Agent Receives
For each scenario, the agent is given exactly two inputs. Nothing else —
no response schema, no implementation details, no error messages, no
changelog. This deliberate constraint reflects the black-box testing
reality and prevents agents from trivially exploiting documentation.
1. **JSON Schema** — The full request schema: field names, types,
required/optional status, and any documented constraints.
2. **Sample Payload** — A concrete example of a valid request, showing
realistic field values.
**Example Input — `POST /api/v1/orders`**
Schema:
```json
{
"user_id": { "type": "string", "required": true },
"items": { "type": "array", "required": true,
"items": { "product_id": "string", "quantity": "integer", "unit_price": "number" } },
"coupon_code": { "type": "string", "required": false },
"currency": { "type": "string", "required": true, "description": "ISO 4217 currency code" },
"shipping": { "type": "object", "required": true,
"properties": { "address": "string", "method": "string" } }
}
```
Sample Payload:
```json
{
"user_id": "usr_4821",
"items": [
{ "product_id": "prod_991", "quantity": 2, "unit_price": 29.99 }
],
"coupon_code": "SAVE10",
"currency": "USD",
"shipping": {
"address": "123 Main St, Springfield",
"method": "standard"
}
}
```
### What the Agent Produces
The agent must output a **test suite** — a list of test cases, where each
test case contains a short human-readable test name and the complete
request payload as a valid JSON object. No expected outcome is required.
Evaluation is performed by running each test case against the live
reference implementation and observing what actually happens.
**Example Test Case Output:**
```json
{
"test_name": "Order with zero quantity item",
"payload": {
"user_id": "usr_4821",
"items": [{ "product_id": "prod_991", "quantity": 0, "unit_price": 29.99 }],
"currency": "USD",
"shipping": { "address": "123 Main St, Springfield", "method": "standard" }
}
}
```
---
## 4. Evaluation Methodology
All 20 reference API implementations are deployed and running. Evaluation
is fully automated: each test case in the agent's output is executed
against the live API, and the responses are analysed to determine which
planted bugs were triggered.
A bug is considered **detected** if at least one test case in the suite
produces a response that deviates from the correct behaviour in a way that
corresponds to the planted bug — for example, a `200 OK` where a `400`
should have been returned, or a silently incorrect computed value in the
response body.
---
## 5. Scoring
The final score combines three factors, weighted to emphasise real-world
value: finding bugs matters most, systematic coverage rewards
thoroughness, and efficiency discourages noise.
| Component | Weight | Description |
|---|---|---|
| Bug Detection Score | 70% | Primary metric |
| Coverage Score | 20% | API surface exploration |
| Efficiency Score | 10% | Signal-to-noise ratio |
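As a sketch, the final score is a straight weighted sum of the three components, each already normalised to [0, 1] (illustrative code, not the benchmark's reference implementation):
```python
# Combine the three APIEval-20 components with the weights from the table above.
def final_score(bug_detection: float, coverage: float, efficiency: float) -> float:
    assert all(0.0 <= s <= 1.0 for s in (bug_detection, coverage, efficiency))
    return 0.7 * bug_detection + 0.2 * coverage + 0.1 * efficiency

print(final_score(0.75, 0.60, 0.90))  # 0.735
```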
### Bug Detection Score — Primary (70%)
Measures how many of the planted bugs were successfully triggered. This is
the core metric of the benchmark — an agent that finds more bugs scores
higher, regardless of how it gets there.
```
Bug Detection Rate = bugs_found / total_bugs
```
**Range: 0 – 1.** A score of 1 means every planted bug was triggered; 0
means none were. Scores below 0.3 indicate the agent is missing most bugs;
above 0.7 is considered strong performance on a scenario.
### Coverage Score — 20%
Measures how well the test suite explores the API surface across three
independently computed dimensions. Each dimension produces a value between
0 and 1; the three are averaged to produce the final Coverage Score.
```
Coverage Score = (param_coverage + edge_coverage + variation_score) / 3
```
**Range: 0 – 1.** All three sub-dimensions are individually bounded [0,
1], so the average is too. A score of 1 requires full field coverage, edge
tests on every field, and completely non-overlapping payloads — a high bar
that rewards comprehensive, systematic suites.
#### Parameter Coverage
What fraction of schema fields are the *focus* of at least one test —
i.e., differ from the valid sample payload in that test case (modified,
omitted, or set to an alternate value).
```
param_coverage
- text: >-
# UltraData-Math
<div align="center">
<img src="assets/ultradata-math-logo.png" width="600"/>
</div>
<p align="center">
<a href="https://huggingface.co/datasets/openbmb/UltraData-Math">🤗
Dataset</a> | <a
href="https://github.com/UltraData-OpenBMB/UltraData-Math">💻 Source
Code</a> | <a
href="https://huggingface.co/datasets/openbmb/UltraData-Math/blob/main/README_ZH.md">🇨🇳
中文 README</a>
</p>
***UltraData-Math*** is a large-scale, high-quality mathematical
pre-training dataset totaling **290B+ tokens** across three progressive
tiers—**L1** (170.5B tokens web corpus), **L2** (33.7B tokens
quality-selected), and **L3** (88B tokens multi-format refined)—designed
to systematically enhance mathematical reasoning in LLMs. It has been
applied to the mathematical pre-training of the [MiniCPM
Series](https://huggingface.co/collections/openbmb/minicpm4) models.
## 🆕 What's New
- **[2026.02.09]**: **UltraData-Math**, a large-scale high-quality
mathematical pre-training dataset with 290B+ tokens across three
progressive tiers (L1/L2-preview/L3), is now available on Hugging Face.
Released as part of the [UltraData](https://ultradata.openbmb.cn/)
ecosystem. 🔥🔥🔥
- **[2026.02.10]**: **UltraData-Math** tops the Hugging Face Datasets
Trending list, reaching the #1 spot! ⭐️⭐️⭐️
## 📚 Introduction
High-quality pre-training data is crucial for enhancing the mathematical
reasoning capabilities of large language models (LLMs). However, existing
mathematical pre-training data construction schemes have the following
shortcomings:
- **HTML Parsing**: General parsers (such as trafilatura and readability) are designed mainly for news and article extraction and lack specialized handling of mathematical formulas, often destroying or losing formula structure; mathematical discussions on forum-style pages are also difficult to extract completely.
- **Data Quality**: Existing datasets generally lack a systematic quality
grading mechanism, with high-value mathematical content mixed with
low-quality noise.
- **Data Diversity**: Mainstream datasets mostly originate from textbooks or competition question banks and lack the mathematical discussions and application scenarios found in real web pages; synthetic data tends to come in a single format, making it hard to cover needs such as multi-turn dialogues and multi-style expression.
To address these issues, we propose ***UltraData-Math***—a large-scale
high-quality pre-training dataset for mathematical reasoning tasks. This
dataset is developed based on the
[UltraData](https://ultradata.openbmb.cn/blog/position-paper) L0-L4 Tiered
Data Management Framework, containing four progressive levels:
- **L0 Raw Data**: Develops a mathematical parser based on *magic-html*,
combined with *w3m* layout preservation rendering and multi-level fallback
strategies, standardizing MathML, KaTeX, and AsciiMath into LaTeX format.
- **L1 Filtered Data**: Cleans noise through heuristic rules and performs
document-level deduplication.
- **L2 Selected Data**: Uses proprietary large models to annotate seed
data and distills it into a lightweight embedding classifier to achieve
efficient quality grading of the full corpus.
- **L3 Refined Data**: Produces structured content with clear reasoning
through rewriting, synthetic generation, and refinement in various formats
such as Q&A, multi-turn dialogues, multi-style rewriting, and
knowledge-grounded textbooks.
Experiments show that on the MiniCPM-1.2B architecture, ***UltraData-Math*** reaches **37.02** on the MATH500 benchmark, an improvement of **+3.62pp** over Nemotron-CC 4plus, and **61.79** on GSM8K, an improvement of **+3.34pp**, while maintaining code generation and general knowledge capabilities.
***UltraData-Math*** has been applied to the mathematical pre-training of
the [MiniCPM
Series](https://huggingface.co/collections/openbmb/minicpm-4-6841ab29d180257e940baa9b)
models.
-
**[UltraData-Math-L1](https://huggingface.co/datasets/openbmb/UltraData-Math)**:
Large-scale high-quality mathematical pre-training dataset, containing
170.5B tokens of web mathematical corpus.
-
**[UltraData-Math-L2](https://huggingface.co/datasets/openbmb/UltraData-Math-L2)**:
High-quality mathematical pre-training dataset selected by the quality
model, containing 33.7B tokens of high-quality web mathematical corpus.
-
**[UltraData-Math-L3](https://huggingface.co/datasets/openbmb/UltraData-Math-L3)**:
High-quality refined mathematical dataset, containing 88B tokens of
multi-format refined data (Q&A, multi-turn dialogues, knowledge textbooks,
etc.).
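A minimal loading sketch with the 🤗 `datasets` library (streaming shown to avoid a full download; the split name is an assumption, check the dataset page):
```python
from datasets import load_dataset

# Stream a few examples from the L1 web corpus.
ds = load_dataset("openbmb/UltraData-Math", split="train", streaming=True)
for example in ds.take(3):
    print(example)
```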
## 🏗️ Data Processing Pipeline
To break through the limitations of existing mathematical datasets in
quality and diversity, we established a refined grading standard centered
on "mathematical content integrity" and "information density".
***UltraData-Math*** adopts the **L0-L4 Tiered Data Management Framework**
proposed by the
[UltraData](https://ultradata.openbmb.cn/blog/position-paper) paper.
Through standardized level definitions, it achieves orderly management and
efficient flow of mathematical data assets. Each level represents higher
data purity and mathematical value, while also corresponding to a more
refined degree of processing.
<div align="center">
<img src="assets/ultradata-math-pipeline.png" width="900"/>
</div>
### L0: Raw Data Parsing and Standardization
**Goal**: Address the poor support of general HTML parsers for
mathematical formulas and maximize the preservation of mathematical
semantics in web pages.
The L0 phase mainly processes raw web data obtained from sources such as
Common Crawl. Given the specificity of mathematical web pages, we develop
specialized parsing strategies through the
[UltraData-Math-Parser](https://huggingface.co/spaces/openbmb/UltraData-Math-L0-Parser)
instead of directly using general parsers like trafilatura or readability.
- **Unified Parsing Mode**: Automatically identifies page types to ensure
complete content extraction as much as possible.
- **Multi-level Fallback Strategy**: To prevent data loss due to parsing
failures, we implement a multi-level fallback mechanism to ensure text
content is captured even if structured parsing fails.
- **Mathematical Formula Standardization**: We unify different
mathematical expressions in web pages into standard LaTeX format,
achieving data format normalization for unified model learning.
### L1: Heuristic Cleaning and Filtering
**Goal**: Remove format noise and improve data readability and
standardization.
After obtaining text containing complete mathematical formulas, we clean
the L0 data through a series of heuristic rules:
- **Format Repair**:
- Clean invisible characters, garbled text, and unnatural continuous line breaks.
- Remove irrelevant web noise such as navigation bars, footers, ad pop-ups, and "read more".
- **Content Filtering**:
- *Length Filtering*: Remove overly short text fragments, which usually lack context and are difficult to support effective mathematical reasoning training.
- *Language Identification*: Ensure the dataset is composed mainly of high-quality English and Chinese mathematical content.
- *Document Deduplication*: Perform deduplication at the document level to prevent duplicate content from biasing model training.
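As an illustration of the deduplication step, a sketch of exact document-level deduplication by content hash (the production pipeline may use fuzzier matching such as MinHash):
```python
import hashlib

# Keep the first occurrence of each exact document, drop later duplicates.
def dedup(documents):
    seen, unique = set(), []
    for text in documents:
        digest = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```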
### L2: Selection Based on Quality Models
**Goal**: Identify core corpora with high value from massive data.
Although L1 data has a clean format, the content quality varies. The L2
phase introduces a model-based quality assessment system:
- **Seed Data Annotation**: Use proprietary large models to score a
portion of seed data across multiple dimensions.
- **Classifier Training and Distillation**: Train lightweight embedding
classifiers based on annotated data to equip them with the ability to
identify high-value mathematical content.
- **Full-scale Inference**: Use the trained classifier to score and screen
L1 data in full.
- *Retention*: Content containing detailed problem-solving step
- text: >-
# Swiss Case Law Dataset
**962,724 published decisions from Swiss federal, cantonal, and regulatory
bodies.**
Full text, structured metadata, extracted case-citation references, and
daily updates. The March 20, 2026 snapshot contains German, French, and
Italian decisions; the export schema also reserves `rm` for Romansh.
[Website](https://opencaselaw.ch) · [GitHub](https://github.com/jonashertner/caselaw-repo-1) · [MCP server health](https://mcp.opencaselaw.ch/health) · [Data license: CC0 1.0](https://creativecommons.org/publicdomain/zero/1.0/) · [Code license](https://github.com/jonashertner/caselaw-repo-1/blob/main/LICENSE)
## Dataset Summary
The largest open collection of Swiss court decisions: 962,724 decisions
from 102 federal, cantonal, and regulatory courts or public bodies,
scraped from official publication channels. New decisions are added every
night.
- **20 federal courts and bodies**: BGer, BVGer, BStGer, BPatGer, BGE,
FINMA, WEKO, EDÖB, ECHR (Swiss cases), VPB, Sports Tribunal, and more
- **82 cantonal courts** across all 26 cantons
- **Current decision languages**: German (448,461; 46.6%), French
(434,663; 45.1%), Italian (79,600; 8.3%); the export schema also reserves
`rm`
- **Temporal range**: 1875–present (BGE historical vol. 1 from 1875)
- **8.76 million extracted case-citation references**
- **6.42 million resolved decision-to-decision links** (with confidence
scores)
- **11.23 million statute-decision links** (e.g., which decisions cite
Art. 41 OR)
- **80 federal laws indexed** with 39,000 articles in 3 languages
- **34 structured fields** per decision in Parquet; 24 in the FTS5 search
index
## Quick Start
### Load with HuggingFace datasets
```python
from datasets import load_dataset
# Load all courts
ds = load_dataset("voilaj/swiss-caselaw")
# Load a single court
bger = load_dataset("voilaj/swiss-caselaw",
data_files="data/bger.parquet")
```
### Load with pandas
```python
import pandas as pd
df = pd.read_parquet("hf://datasets/voilaj/swiss-caselaw/data/bger.parquet")
df_recent = df[df["decision_date"] >= "2024-01-01"]
print(f"{len(df_recent)} decisions since 2024")
# Filter by language
df_french = df[df["language"] == "fr"]
# Group by legal area
df.groupby("legal_area").size().sort_values(ascending=False).head(10)
```
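Building on the fields used above, a per-year output count is a one-liner (sketch; assumes `decision_date` parses as a date):
```python
# Decisions per year for the loaded court.
df["year"] = pd.to_datetime(df["decision_date"]).dt.year
print(df.groupby("year").size().tail(10))
```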
### Direct download
Every court is a single Parquet file:
```
https://huggingface.co/datasets/voilaj/swiss-caselaw/resolve/main/data/bger.parquet
https://huggingface.co/datasets/voilaj/swiss-caselaw/resolve/main/data/bvger.parquet
https://huggingface.co/datasets/voilaj/swiss-caselaw/resolve/main/data/zh_gerichte.parquet
```
Full list:
[huggingface.co/datasets/voilaj/swiss-caselaw/tree/main/data](https://huggingface.co/datasets/voilaj/swiss-caselaw/tree/main/data)
### REST API (no setup)
Query via the HuggingFace Datasets Server — no installation required:
```bash
# Get rows
curl "https://datasets-server.huggingface.co/rows?dataset=voilaj/swiss-caselaw&config=default&split=train&offset=0&length=5"

# Dataset info
curl "https://datasets-server.huggingface.co/info?dataset=voilaj/swiss-caselaw"
```
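The same rows query from Python (a sketch using `requests`; the endpoint and parameters are as in the curl example above):
```python
import requests

# Fetch the first five rows via the HuggingFace Datasets Server.
resp = requests.get(
    "https://datasets-server.huggingface.co/rows",
    params={"dataset": "voilaj/swiss-caselaw", "config": "default",
            "split": "train", "offset": 0, "length": 5},
    timeout=30,
)
print(len(resp.json()["rows"]))
```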
### Full-text search via MCP
Connect the dataset to Claude, ChatGPT, or Gemini for natural-language
search over all 962,724 decisions. The MCP surface is
deployment-dependent: local deployments can expose up to 21 tools, remote
mode omits local update tools, and legislation tools depend on
LexFind-backed configuration.
**Remote (no download needed):**
```bash
# Claude Code
claude mcp add swiss-caselaw --transport sse https://mcp.opencaselaw.ch
# Claude Desktop: Settings → Connectors → Add custom connector → https://mcp.opencaselaw.ch

# ChatGPT: Settings → Apps → Developer mode → Create app → https://mcp.opencaselaw.ch/sse (auth: None)
# Recommended with GPT-5.3

# Gemini CLI: add to ~/.gemini/settings.json
# { "mcpServers": { "swiss-caselaw": { "url": "https://mcp.opencaselaw.ch" } } }
```
Search results include enriched metadata: court name (human-readable),
court level, legal area, statute articles cited, citation count, and
leading-case flag.
**Local (offline access, ~65 GB disk):**
```bash
git clone https://github.com/jonashertner/caselaw-repo-1.git
cd caselaw-repo-1
python3 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\Activate.ps1
pip install mcp pydantic huggingface-hub pyarrow
claude mcp add swiss-caselaw -- /path/to/.venv/bin/python3 /path/to/mcp_server.py
# Windows: use .venv\Scripts\python.exe instead
```
On first search, the server downloads the Parquet files (~7 GB) from this
dataset and builds a local SQLite FTS5 index (~58 GB). This takes 30–60
minutes and only happens once. After that, searches are instant.
## Dataset Statistics
| Metric | Value |
|--------|-------|
| Total decisions | 962,724 |
| Courts | 102 |
| Temporal range | 1875–present |
| Average decision length | ~22,000 characters |
| Full text coverage | 100% |
| Regeste (headnote) coverage | ~54% |
| Extracted case-citation references | 8.76 million |
| Resolved decision links | 6.42 million |
| Statute-decision links | 11.23 million |
| Federal laws indexed | 80 (39,000 articles) |
| Legislation texts searchable | 33,000+ |
| MCP tools | Deployment-dependent (up to 21) |
**Language distribution:**
| Language | Count | Share |
|----------|-------|-------|
| German (de) | 448,215 | 46.58% |
| French (fr) | 434,470 | 45.15% |
| Italian (it) | 79,587 | 8.27% |
**Reference graph:** 8.76 million extracted case-citation references, 6.42
million resolved decision-to-decision links, and 11.23 million
statute-to-decision links. The most-cited decision is BGE 125 V 351 with
54,000 incoming citations.
**Search benchmark (frozen offline baseline):**
`benchmarks/search_benchmark_2026-03-19_offline_full.json` records a
100-query run against a 1,078,177-row local `decisions.db`, with MRR@10 =
0.4697, Recall@10 = 0.4958, nDCG@10 = 0.5250, and Hit@1 = 0.33. This is a
reproducible offline baseline, not a fully provisioned hosted-system
score.
## Intended Uses
- **Legal research and case law analysis**: full-text search and citation
network analysis across the Swiss court system
- **NLP research on multilingual legal text**: classification,
summarization, named entity recognition, and cross-lingual tasks on
German/French/Italian legal corpora
- **Legal tech development**: building search engines, citation analysis
tools, and document drafting assistants grounded in Swiss jurisprudence
- **Academic study of Swiss jurisprudence**: tracking doctrinal evolution,
identifying leading cases, analyzing court output over time
**Not intended for**: automated legal advice or replacing professional
legal counsel. This dataset is a research and analysis resource, not a
substitute for qualified legal representation.
## Limitations
- **Temporal coverage varies by court**: federal courts from 1996, some
cantonal courts from 2000+; historical BGE volumes from 1875
- **Historical OCR artifacts**: BGE decisions from volumes 1–79
(1875–1953) were digitized from print and may contain OCR errors
- **Publication delays**: some cantonal courts have irregular publication
schedules; decisions may appear weeks after being rendered
- **Language distribution is unbalanced by design**: it reflects actual
court output (German and French cantons are larger), not balanced sampling
- **Anonymization varies by court**: most federal decisions are
anonymized; some cantonal decisions may contain personal names or details
- **~1.9% short-text decisions**: some decisions are PDF-only publications
where text extraction produced fewer than 500 characters; full text may be
available at the source URL
## Dataset Creation
**Collection**: 54 automated scrapers target of
metrics:
- accuracy
pipeline_tag: text-classification
library_name: setfit
inference: true
datasets:
- davanstrien/hf-dataset-domain-labels-v0
base_model: BAAI/bge-small-en-v1.5
model-index:
- name: SetFit with BAAI/bge-small-en-v1.5
results:
- task:
type: text-classification
name: Text Classification
dataset:
name: davanstrien/hf-dataset-domain-labels-v0
type: davanstrien/hf-dataset-domain-labels-v0
split: test
metrics:
- type: accuracy
value: 0.8333333333333334
name: Accuracy
---

# SetFit with BAAI/bge-small-en-v1.5

This is a [SetFit](https://github.com/huggingface/setfit) model trained on the [davanstrien/hf-dataset-domain-labels-v0](https://huggingface.co/datasets/davanstrien/hf-dataset-domain-labels-v0) dataset that can be used for Text Classification. This SetFit model uses [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) as the Sentence Transformer embedding model. A [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance is used for classification.
The model has been trained using an efficient few-shot learning technique that involves:
- Fine-tuning a Sentence Transformer with contrastive learning.
- Training a classification head with features from the fine-tuned Sentence Transformer.
## Model Details

### Model Description

- **Model Type:** SetFit
- **Sentence Transformer body:** [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5)
- **Classification head:** a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance
- **Maximum Sequence Length:** 512 tokens
- **Number of Classes:** 9 classes
- **Training Dataset:** [davanstrien/hf-dataset-domain-labels-v0](https://huggingface.co/datasets/davanstrien/hf-dataset-domain-labels-v0)
### Model Sources

- **Repository:** [SetFit on GitHub](https://github.com/huggingface/setfit)
- **Paper:** [Efficient Few-Shot Learning Without Prompts](https://arxiv.org/abs/2209.11055)
- **Blogpost:** [SetFit: Efficient Few-Shot Learning Without Prompts](https://huggingface.co/blog/setfit)
### Model Labels

| Label | Examples |
|---|---|
| biology | |
| chemistry | |
| climate | |
| code | |
| cybersecurity | |
| finance | |
| legal | |
| math | |
| medical | |
## Evaluation

### Metrics

| Label | Accuracy |
|---|---|
| all | 0.8333 |
## Uses

### Direct Use for Inference

First install the SetFit library:

```bash
pip install setfit
```
Then you can load this model and run inference:

```python
from setfit import SetFitModel

# Download from the 🤗 Hub
model = SetFitModel.from_pretrained("davanstrien/setfit-hf-dataset-domain-v0")
# Run inference
preds = model("""# Dataset Card for The Wilds Bioacoustics Monitors
This dataset contains passive acoustic recordings collected at [The Wilds safari park](https://www.thewilds.org/) in Ohio during Summer 2025.
Recorders captured ambient soundscapes to support ecological monitoring, animal behavior analysis, and acoustic biodiversity modeling.
## Dataset Details
### Dataset Description
- **Curated by:** Tanishka Wani, Vedant Patil, Rugved Katole, Bharath Pillai, Anirudh Potlapally, Ally Bonney, and Jenna Kline
- **Repository:** [https://github.com/Imageomics/naturelab](https://github.com/Imageomics/naturelab)
- **Paper:** [SmartWilds: Multimodal Wildlife Monitoring Dataset](https://arxiv.org/abs/2509.18894)
This dataset was created to support multimodal wildlife monitoring research using passive acoustic monitoring. Bioacoustic data were collected using Wildlife Acoustics Song Meter devices deployed across four field sites at The Wilds. The recordings capture natural soundscapes including wildlife vocalizations, environmental sounds, and ambient audio that can be used for species detection, behavioral analysis, and biodiversity assessment.
### Supported Tasks and Leaderboards
- **Audio Classification:** Species identification from acoustic recordings
- **Sound Event Detection:** Detection and localization of animal vocalizations
- **Biodiversity Assessment:** Acoustic diversity indices and community analysis
- **Behavioral Analysis:** Temporal activity patterns and acoustic behavior studies
- **Soundscape Ecology:** Environmental audio analysis and habitat characterization
[No benchmarks currently available]
## Dataset Structure
The dataset is organized hierarchically by site and deployment session:
/dataset/ bioacoustic.txt The_Wilds_Bioacoustic_Log2025-06-30_21_54_59.csv The_Wilds_Bioacoustic_Log2025-07-04_20_18_38.csv TW05-SM01/ metadata.md SD01_20250630_20250703/ SM001_20250630_195900.wav SM001_20250630_200402.wav SM001_20250630_200902.wav ... SM001_20250703_064902.wav SM001_20250703_065402.wav SM001_20250703_065902.wav TW06-SM03/ metadata.md SD03_20250630_20250703/ SM03_20250630_140000.wav SM03_20250630_150000.wav SM03_20250630_160000.wav SM03_20250630_170000.wav ... SM03_20250703_140000.wav SM03_20250703_150000.wav SM03_20250703_160000.wav TW07-SM02/ metadata.md SD02_20250630_20250703/ SM002_20250630_195900.wav SM002_20250630_205902.wav SM002_20250701_050300.wav ... SM002_20250702_205902.wav SM002_20250703_050400.wav SM002_20250703_060402.wav TW08-SM04/ metadata.md SD04_20250630_20250703/ SM04_20250630_120000.wav SM04_20250630_130000.wav SM04_20250630_140000.wav ... SM04_20250703_150000.wav SM04_20250703_160000.wav SM04_20250703_170000.wav
### Data Instances
Each bioacoustic deployment folder contains:
- **Audio files:** .wav format recordings captured by scheduled recording
- **Metadata file:** `metadata.md` with deployment information and recorder settings
**File Counts by Recorder:**
- **TW05-SM01:** 144 audio files (.wav recordings)
- **TW06-SM03:** 75 audio files (.wav recordings)
- **TW07-SM02:** 12 audio files (.wav recordings)
- **TW08-SM04:** 78 audio files (.wav recordings)
**Audio File Specifications:**
- **Format:** .wav (uncompressed)
- **Channels:** Mono
- **Bit depth:** 16-bit
- **Sample rate:** 48 kHz
- **Duration:** Variable based on recording schedule
**Filename Conventions:**
- **SM001/SM03/SM04 series:** SM0##_YYYYMMDD_HHMMSS.wav (TW05-SM01, TW06-SM03, TW08-SM04)
- **SM002 series:** SM002_YYYYMMDD_HHMMSS.wav (TW07-SM02)
**Total Dataset Size:** 311 audio files across all bioacoustic monitor deployments.
Each .wav file is a field recording captured according to programmed recording schedules. File names include timestamps indicating the start time of each recording session.
### Data Fields
**metadata.md** (found in each recorder deployment folder):
- **Recorder ID:** Unique device identifier (SM01, SM02, SM03, SM04)
- **Device Model:** Song Meter model name (e.g., Song Meter Micro 2)
- **Device Serial Number:** Manufacturer-assigned serial number
- **Site ID:** Location code where deployed (TW05, TW06, TW07, TW08)
- **Deployment Location Description:** Text description of exact location and surroundings
- **GPS Coordinates:** Latitude and longitude in decimal format
- **Deployment Date and Time:** Recorder deployment timestamp (YYYY-MM-DD HH:MM format)
- **Retrieval Date and Time:** Recorder retrieval timestamp (YYYY-MM-DD HH:MM format)
- **Orientation / Microphone Facing:** Direction and environmental considerations (e.g., "East, away from wind and road")
- **Mounting Height:** Approximate height of microphone from ground in meters
- **Recording Schedule Preset:** Schedule or settings used for recording (e.g., "1 hour at sunrise and sunset")
- **Time Zone Set on Device:** Local time zone configured (e.g., "USA Eastern (UTC-5)")
- **Maintenance Notes:** Issues, configuration changes, or deviations from standard settings
- **Observer:** Name or initials of person completing metadata
**CSV Log Files:**
- `The_Wilds_Bioacoustic_Log2025-06-30_21_54_59.csv`: Deployment log from June 30, 2025
- `The_Wilds_Bioacoustic_Log2025-07-04_20_18_38.csv`: Retrieval log from July 4, 2025
### Data Splits
This dataset has no predefined training/validation/test splits. Data are organized by site (TW05-TW08) and deployment session. Users may create their own splits based on:
- **Temporal splits:** Using recording timestamps across the deployment period
- **Spatial splits:** Using different site locations (TW05, TW06, TW07, TW08)
- **Recorder-based splits:** Using different Song Meter devices (SM01, SM02, SM03, SM04)
Recommended approach depends on modeling goals and research questions.
## Dataset Creation
### Curation Rationale
This dataset supports biodiversity monitoring, behavioral ecology research, and the development of automated species detection and classification models from passive acoustic recordings. Bioacoustic monitoring provides complementary data to camera trap surveys and enables detection of cryptic or nocturnal species that may be missed by visual methods.
### Source Data
#### Data Collection and Processing
Recordings were collected at The Wilds safari park during summer 2025 using Wildlife Acoustics Song Meter devices. Four recorders (SM01-SM04) were strategically deployed at sites TW05-TW08 from June 30 to July 3, 2025.
Devices were programmed for scheduled recordings with different sampling strategies across sites. Recorders were mounted on trees or posts at appropriate heights and orientations to minimize wind noise and maximize acoustic detection. Upon retrieval, audio files were organized by deployment session and basic metadata were recorded. No audio processing, filtering, or annotation was applied to preserve the raw acoustic data.
#### Who are the source data producers?
The dataset was collected and curated by researchers and students from the Imageomics Institute and Ohio State University in collaboration with conservation staff at The Wilds safari park in Ohio.
### Annotations
#### Annotation process
No species identification or acoustic annotations are currently provided with this initial dataset release. Manual and AI-assisted labeling efforts for species detection, vocalization classification, and acoustic event annotation are planned for future versions.
#### Who are the annotators?
N/A - annotations will be added in future releases
### Personal and Sensitive Information
The dataset includes GPS coordinates within The Wilds, a public conservation """)
```
## Training Details

### Training Set Metrics

| Training set | Min | Median | Max |
|---|---|---|---|
| Word count | 2 | 400.3986 | 4498 |

| Label | Training Sample Count |
|---|---|
| biology | 149 |
| chemistry | 89 |
| climate | 135 |
| code | 200 |
| cybersecurity | 200 |
| finance | 200 |
| legal | 200 |
| math | 185 |
| medical | 200 |
### Training Hyperparameters

- batch_size: (32, 32)
- num_epochs: (1, 1)
- max_steps: -1
- sampling_strategy: oversampling
- num_iterations: 5
- body_learning_rate: (2e-05, 1e-05)
- head_learning_rate: 0.01
- loss: CosineSimilarityLoss
- distance_metric: cosine_distance
- margin: 0.25
- end_to_end: False
- use_amp: False
- warmup_proportion: 0.1
- l2_weight: 0.01
- seed: 42
- eval_max_steps: -1
- load_best_model_at_end: False
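
A sketch of how a comparable run could be launched with the `setfit` Trainer (split and column names are assumptions; check the dataset card):
```python
from datasets import load_dataset
from setfit import SetFitModel, Trainer, TrainingArguments

dataset = load_dataset("davanstrien/hf-dataset-domain-labels-v0")
model = SetFitModel.from_pretrained("BAAI/bge-small-en-v1.5")
args = TrainingArguments(
    batch_size=(32, 32),
    num_epochs=(1, 1),
    body_learning_rate=(2e-05, 1e-05),
    head_learning_rate=0.01,
    sampling_strategy="oversampling",
    num_iterations=5,
    seed=42,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],  # column names assumed: "text" / "label"
    eval_dataset=dataset["test"],
)
trainer.train()
print(trainer.evaluate())
```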
### Training Results

| Epoch | Step | Training Loss | Validation Loss |
|---|---|---|---|
| 0.0021 | 1 | 0.2723 | - |
| 0.1027 | 50 | 0.2194 | - |
| 0.2053 | 100 | 0.1241 | - |
| 0.3080 | 150 | 0.0837 | - |
| 0.4107 | 200 | 0.0693 | - |
| 0.5133 | 250 | 0.0579 | - |
| 0.6160 | 300 | 0.0501 | - |
| 0.7187 | 350 | 0.0443 | - |
| 0.8214 | 400 | 0.0415 | - |
| 0.9240 | 450 | 0.0394 | - |
### Framework Versions

- Python: 3.12.12
- SetFit: 1.1.3
- Sentence Transformers: 5.3.0
- Transformers: 4.50.3
- PyTorch: 2.11.0+cu130
- Datasets: 4.8.4
- Tokenizers: 0.21.4
## Citation

### BibTeX

```bibtex
@article{https://doi.org/10.48550/arxiv.2209.11055,
  doi = {10.48550/ARXIV.2209.11055},
  url = {https://arxiv.org/abs/2209.11055},
  author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
  keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {Efficient Few-Shot Learning Without Prompts},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}
```
# The ClimateCheck Dataset

This dataset is used for the ClimateCheck: Scientific Fact-checking of Social Media Posts on Climate Change Shared Task.
The 2025 iteration was hosted at the