tags:
  - setfit
  - sentence-transformers
  - text-classification
  - generated_from_setfit_trainer
widget:
  - text: >-
      # Dataset Card for The Wilds Bioacoustics Monitors


      This dataset contains passive acoustic recordings collected at [The Wilds
      safari park](https://www.thewilds.org/) in Ohio during Summer 2025. 

      Recorders captured ambient soundscapes to support ecological monitoring,
      animal behavior analysis, and acoustic biodiversity modeling.


      ## Dataset Details


      ### Dataset Description


      - **Curated by:** Tanishka Wani, Vedant Patil, Rugved Katole, Bharath
      Pillai, Anirudh Potlapally, Ally Bonney, and Jenna Kline

      - **Repository:**
      [https://github.com/Imageomics/naturelab](https://github.com/Imageomics/naturelab)  

      - **Paper:** [SmartWilds: Multimodal Wildlife Monitoring
      Dataset](https://arxiv.org/abs/2509.18894)


      This dataset was created to support multimodal wildlife monitoring
      research using passive acoustic monitoring. Bioacoustic data were
      collected using Wildlife Acoustics Song Meter devices deployed across four
      field sites at The Wilds. The recordings capture natural soundscapes
      including wildlife vocalizations, environmental sounds, and ambient audio
      that can be used for species detection, behavioral analysis, and
      biodiversity assessment.


      ### Supported Tasks and Leaderboards


      - **Audio Classification:** Species identification from acoustic
      recordings

      - **Sound Event Detection:** Detection and localization of animal
      vocalizations

      - **Biodiversity Assessment:** Acoustic diversity indices and community
      analysis

      - **Behavioral Analysis:** Temporal activity patterns and acoustic
      behavior studies

      - **Soundscape Ecology:** Environmental audio analysis and habitat
      characterization


      [No benchmarks currently available]


      ## Dataset Structure


      The dataset is organized hierarchically by site and deployment session:


      ```

      /dataset/
          bioacoustic.txt
          The_Wilds_Bioacoustic_Log2025-06-30_21_54_59.csv
          The_Wilds_Bioacoustic_Log2025-07-04_20_18_38.csv
          TW05-SM01/
              metadata.md
              SD01_20250630_20250703/
                  SM001_20250630_195900.wav
                  SM001_20250630_200402.wav
                  SM001_20250630_200902.wav
                  ...
                  SM001_20250703_064902.wav
                  SM001_20250703_065402.wav
                  SM001_20250703_065902.wav
          TW06-SM03/
              metadata.md
              SD03_20250630_20250703/
                  SM03_20250630_140000.wav
                  SM03_20250630_150000.wav
                  SM03_20250630_160000.wav
                  SM03_20250630_170000.wav
                  ...
                  SM03_20250703_140000.wav
                  SM03_20250703_150000.wav
                  SM03_20250703_160000.wav
          TW07-SM02/
              metadata.md
              SD02_20250630_20250703/
                  SM002_20250630_195900.wav
                  SM002_20250630_205902.wav
                  SM002_20250701_050300.wav
                  ...
                  SM002_20250702_205902.wav
                  SM002_20250703_050400.wav
                  SM002_20250703_060402.wav
          TW08-SM04/
              metadata.md
              SD04_20250630_20250703/
                  SM04_20250630_120000.wav
                  SM04_20250630_130000.wav
                  SM04_20250630_140000.wav
                  ...
                  SM04_20250703_150000.wav
                  SM04_20250703_160000.wav
                  SM04_20250703_170000.wav
      ```


      ### Data Instances


      Each bioacoustic deployment folder contains:

      - **Audio files:** `.wav` format recordings captured according to programmed
      recording schedules

      - **Metadata file:** `metadata.md` with deployment information and
      recorder settings


      **File Counts by Recorder:**

      - **TW05-SM01:** 144 audio files (.wav recordings)

      - **TW06-SM03:** 75 audio files (.wav recordings)

      - **TW07-SM02:** 12 audio files (.wav recordings)

      - **TW08-SM04:** 78 audio files (.wav recordings)


      **Audio File Specifications:**

      - **Format:** .wav (uncompressed)

      - **Channels:** Mono

      - **Bit depth:** 16-bit

      - **Sample rate:** 48 kHz

      - **Duration:** Variable based on recording schedule
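

      Given these specifications, on-disk size is easy to estimate: an
      uncompressed mono 16-bit stream at 48 kHz is 96,000 bytes per second. A
      quick sketch (the example durations are illustrative, not taken from the
      recording schedules):


      ```python

      SAMPLE_RATE = 48_000  # Hz

      BIT_DEPTH = 16        # bits per sample

      CHANNELS = 1          # mono


      def wav_size_mb(duration_s: float) -> float:
          """Approximate uncompressed WAV payload size in megabytes."""
          bytes_per_second = SAMPLE_RATE * BIT_DEPTH // 8 * CHANNELS  # 96,000 B/s
          return bytes_per_second * duration_s / 1_000_000


      print(wav_size_mb(300))   # 5-minute file -> 28.8 MB

      print(wav_size_mb(3600))  # 1-hour file   -> 345.6 MB
      ```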


      **Filename Conventions:**

      All files follow `<RecorderID>_YYYYMMDD_HHMMSS.wav`, where the timestamp
      marks the recording start time:

      - **SM001/SM03/SM04 series:** e.g., `SM001_20250630_195900.wav`
      (TW05-SM01, TW06-SM03, TW08-SM04)

      - **SM002 series:** e.g., `SM002_20250630_195900.wav` (TW07-SM02)


      **Total Dataset Size:** 311 audio files across all bioacoustic monitor
      deployments.


      Each .wav file is a field recording captured according to programmed
      recording schedules. File names include timestamps indicating the start
      time of each recording session.
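

      Because each filename encodes its recorder ID and start timestamp,
      recordings can be indexed programmatically. A minimal parsing sketch (the
      helper below is illustrative, not part of the dataset):


      ```python

      import re

      from datetime import datetime


      def parse_recording_name(name: str):
          """Split a Song Meter filename into (recorder_id, start_datetime)."""
          m = re.match(r"(SM\d+)_(\d{8})_(\d{6})\.wav$", name)
          if m is None:
              raise ValueError(f"Unexpected filename: {name}")
          recorder, date, time = m.groups()
          return recorder, datetime.strptime(date + time, "%Y%m%d%H%M%S")


      rec, start = parse_recording_name("SM001_20250630_195900.wav")

      # rec == "SM001"; start == datetime(2025, 6, 30, 19, 59)
      ```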


      ### Data Fields


      **metadata.md** (found in each recorder deployment folder):

      - **Recorder ID:** Unique device identifier (SM01, SM02, SM03, SM04)

      - **Device Model:** Song Meter model name (e.g., Song Meter Micro 2)

      - **Device Serial Number:** Manufacturer-assigned serial number

      - **Site ID:** Location code where deployed (TW05, TW06, TW07, TW08)

      - **Deployment Location Description:** Text description of exact location
      and surroundings

      - **GPS Coordinates:** Latitude and longitude in decimal format

      - **Deployment Date and Time:** Recorder deployment timestamp (YYYY-MM-DD
      HH:MM format)

      - **Retrieval Date and Time:** Recorder retrieval timestamp (YYYY-MM-DD
      HH:MM format)

      - **Orientation / Microphone Facing:** Direction and environmental
      considerations (e.g., "East, away from wind and road")

      - **Mounting Height:** Approximate height of microphone from ground in
      meters

      - **Recording Schedule Preset:** Schedule or settings used for recording
      (e.g., "1 hour at sunrise and sunset")

      - **Time Zone Set on Device:** Local time zone configured (e.g., "USA
      Eastern (UTC-5)")

      - **Maintenance Notes:** Issues, configuration changes, or deviations from
      standard settings

      - **Observer:** Name or initials of person completing metadata


      **CSV Log Files:**

      - `The_Wilds_Bioacoustic_Log2025-06-30_21_54_59.csv`: Deployment log from
      June 30, 2025

      - `The_Wilds_Bioacoustic_Log2025-07-04_20_18_38.csv`: Retrieval log from
      July 4, 2025


      ### Data Splits


      This dataset has no predefined training/validation/test splits. Data are
      organized by site (TW05-TW08) and deployment session. Users may create
      their own splits based on:

      - **Temporal splits:** Using recording timestamps across the deployment
      period

      - **Spatial splits:** Using different site locations (TW05, TW06, TW07,
      TW08)

      - **Recorder-based splits:** Using different Song Meter devices (SM01,
      SM02, SM03, SM04)


      Recommended approach depends on modeling goals and research questions.
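

      As one concrete option, a site-held-out (spatial) split can be derived
      directly from the directory layout; a sketch with hypothetical relative
      paths (the helper is ours, not shipped with the dataset):


      ```python

      def spatial_split(wav_paths, held_out_site):
          """Partition recordings by site code (e.g. 'TW06' from 'TW06-SM03')."""
          train, test = [], []
          for p in wav_paths:
              site = p.split("/")[0].split("-")[0]
              (test if site == held_out_site else train).append(p)
          return train, test


      paths = [
          "TW05-SM01/SD01_20250630_20250703/SM001_20250630_195900.wav",
          "TW06-SM03/SD03_20250630_20250703/SM03_20250630_140000.wav",
      ]

      train, test = spatial_split(paths, held_out_site="TW06")

      # train holds the TW05 file, test holds the TW06 file
      ```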


      ## Dataset Creation


      ### Curation Rationale


      This dataset supports biodiversity monitoring, behavioral ecology
      research, and the development of automated species detection and
      classification models from passive acoustic recordings. Bioacoustic
      monitoring provides complementary data to camera trap surveys and enables
      detection of cryptic or nocturnal species that may be missed by visual
      methods.


      ### Source Data


      #### Data Collection and Processing


      Recordings were collected at The Wilds safari park during summer 2025
      using Wildlife Acoustics Song Meter devices. Four recorders (SM01-SM04)
      were strategically deployed at sites TW05-TW08 from June 30 to July 3,
      2025. 


      Devices were programmed for scheduled recordings with different sampling
      strategies across sites. Recorders were mounted on trees or posts at
      appropriate heights and orientations to minimize wind noise and maximize
      acoustic detection. Upon retrieval, audio files were organized by
      deployment session and basic metadata were recorded. No audio processing,
      filtering, or annotation was applied to preserve the raw acoustic data.


      #### Who are the source data producers?


      The dataset was collected and curated by researchers and students from the
      Imageomics Institute and Ohio State University in collaboration with
      conservation staff at The Wilds safari park in Ohio.


      ### Annotations


      #### Annotation process


      No species identification or acoustic annotations are currently provided
      with this initial dataset release. Manual and AI-assisted labeling efforts
      for species detection, vocalization classification, and acoustic event
      annotation are planned for future versions.


      #### Who are the annotators?


      N/A - annotations will be added in future releases


      ### Personal and Sensitive Information


      The dataset includes GPS coordinates within The Wilds, a public
      conservation 
  - text: >-
      # Securing an MLOps Pipeline: Training to Deployment


      > This dataset contains a technical article available in both French and
      English.

      ---


      ## Navigation


      - [French Version](#version-francaise)

      - [English Version](#english-version)


      ---


      <a id="version-francaise"></a>


      ---

      title: "Securing an MLOps Pipeline: From Training to Deployment"

      author: "AYI-NEDJIMI Consultants"

      date: "2026-02-21"

      language: "en"

      tags:
        - mlops
        - securite
        - pipeline
        - supply-chain
        - model-poisoning
        - gpu-isolation
      license: "cc-by-sa-4.0"

      ---


      # Securing an MLOps Pipeline: From Training to Deployment


      **Author**: AYI-NEDJIMI Consultants | **Date**: February 21, 2026 |
      **Reading time**: 11 min


      ---


      ## Introduction


      MLOps pipelines are a growing attack surface that most security teams
      still understand poorly. An AI model compromised through the supply chain
      can have catastrophic consequences: a poisoned intrusion-detection model
      that ignores certain attacks, a cybersecurity assistant that gives
      deliberately wrong recommendations, or a classification system that
      exfiltrates data through its predictions.


      This article details the threats specific to MLOps pipelines and the
      countermeasures to implement, from training through to production
      deployment. The approach complements our work on [securing virtualized
      infrastructures](https://www.ayinedjimi-consultants.fr/virtualisation/hyperv-securisation-2025.html)
      and our [infrastructure
      audit](https://www.ayinedjimi-consultants.fr/audit-infrastructure.html)
      services.

      ## MLOps Threat Taxonomy


      ### Pipeline attack surface


      ```

      [Data]       [Code]      [Model]      [Infra]      [Deployment]
         |            |            |            |              |
         v            v            v            v              v
      Data         Code         Model       Training       Serving

      Poisoning    Injection    Poisoning   Infra          Endpoint
         |            |            |        Compromise     Compromise
         v            v            v            |              |
      Backdoor     Supply       Trojan        GPU           API

      Dataset      Chain        Model       Hijacking     Injection
                   Attack
      ```


      ### The 10 major risks


      | Risk | Phase | Impact | Likelihood |

      |--------|-------|--------|------------|

      | Data poisoning | Training | Critical | Medium |

      | Model backdoor | Training | Critical | Low |

      | Supply chain (packages) | Development | High | High |

      | GPU memory leakage | Training | High | Medium |

      | Model serialization attack | Distribution | Critical | Medium |

      | Adversarial inputs | Inference | Medium | High |

      | Model inversion | Inference | High | Low |

      | API injection | Deployment | High | High |

      | Drift detection evasion | Production | Medium | Low |

      | Exfiltration via predictions | Production | Critical | Low |


      ## Securing the Training Phase


      ### Data integrity


      ```python

      import hashlib

      import json

      from pathlib import Path

      from datetime import datetime


      class SecurityError(Exception):
          """Raised when a dataset integrity check fails."""


      class DataIntegrityChecker:
          """Integrity verification for training datasets."""

          def __init__(self, manifest_path: str):
              self.manifest_path = manifest_path
              self.manifest = self._load_or_create_manifest()

          def _load_or_create_manifest(self):
              if Path(self.manifest_path).exists():
                  with open(self.manifest_path) as f:
                      return json.load(f)
              return {"files": {}, "created": datetime.now().isoformat()}

          def register_dataset(self, file_path: str) -> str:
              """Register a dataset file together with its SHA-256 hash."""
              sha256 = hashlib.sha256()
              with open(file_path, "rb") as f:
                  for chunk in iter(lambda: f.read(8192), b""):
                      sha256.update(chunk)

              file_hash = sha256.hexdigest()
              self.manifest["files"][file_path] = {
                  "sha256": file_hash,
                  "registered": datetime.now().isoformat(),
                  "size": Path(file_path).stat().st_size,
              }
              self._save_manifest()
              return file_hash

          def verify_dataset(self, file_path: str) -> bool:
              """Verify a dataset's integrity before training."""
              if file_path not in self.manifest["files"]:
                  raise ValueError(f"Dataset not registered: {file_path}")

              expected_hash = self.manifest["files"][file_path]["sha256"]
              sha256 = hashlib.sha256()
              with open(file_path, "rb") as f:
                  for chunk in iter(lambda: f.read(8192), b""):
                      sha256.update(chunk)

              actual_hash = sha256.hexdigest()
              if actual_hash != expected_hash:
                  raise SecurityError(
                      f"INTEGRITY COMPROMISED: {file_path}\n"
                      f"Expected: {expected_hash}\n"
                      f"Found:    {actual_hash}"
                  )
              return True

          def _save_manifest(self):
              with open(self.manifest_path, "w") as f:
                  json.dump(self.manifest, f, indent=2)
      ```
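

      At its core this register/verify flow just compares a stored SHA-256
      digest against a freshly recomputed one; a condensed, self-contained
      illustration of the same idea:


      ```python

      import hashlib

      import tempfile


      def sha256_of(path):
          h = hashlib.sha256()
          with open(path, "rb") as f:
              for chunk in iter(lambda: f.read(8192), b""):
                  h.update(chunk)
          return h.hexdigest()


      with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as f:
          f.write(b"label,value\nok,1\n")
          dataset_path = f.name

      manifest = {dataset_path: sha256_of(dataset_path)}  # register

      assert sha256_of(dataset_path) == manifest[dataset_path]  # verify: clean


      with open(dataset_path, "ab") as f:  # simulate tampering
          f.write(b"poisoned,999\n")

      assert sha256_of(dataset_path) != manifest[dataset_path]  # verify: caught
      ```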


      ### GPU isolation and training hardening


      ```python

      import subprocess

      import os


      class SecurityError(Exception):
          """Raised when a GPU hygiene check fails."""


      class SecureTrainingEnvironment:
          """Hardened training environment with GPU isolation."""

          def __init__(self, gpu_id: int = 0):
              self.gpu_id = gpu_id

          def setup_gpu_isolation(self):
              """Configure GPU isolation for training."""
              # Restrict GPU visibility to this process
              os.environ["CUDA_VISIBLE_DEVICES"] = str(self.gpu_id)

              # Enable MIG (Multi-Instance GPU) mode where available
              # (A100/H100 only)
              try:
                  subprocess.run([
                      "nvidia-smi", "mig", "-cgi", "9,9", "-C"
                  ], check=True, capture_output=True)
              except subprocess.CalledProcessError:
                  print("MIG unavailable, falling back to exclusive mode")
                  subprocess.run([
                      "nvidia-smi", "-i", str(self.gpu_id),
                      "-c", "EXCLUSIVE_PROCESS"
                  ], check=True)

          def clear_gpu_memory(self):
              """Clean up GPU memory after training."""
              import torch
              if torch.cuda.is_available():
                  torch.cuda.empty_cache()
                  torch.cuda.synchronize()

              # Force a GPU reset
              subprocess.run([
                  "nvidia-smi", "--gpu-reset", "-i", str(self.gpu_id)
              ], capture_output=True)

          def verify_no_data_leakage(self):
              """Check that no process is still holding GPU memory."""
              result = subprocess.run(
                  ["nvidia-smi", "--query-compute-apps=pid,used_memory",
                   "--format=csv,noheader", "-i", str(self.gpu_id)],
                  capture_output=True, text=True
              )
              if result.stdout.strip():
                  raise SecurityError(
                      f"Processes still active on GPU {self.gpu_id}: {result.stdout}"
                  )
              return True
      ```


      ## Securing the ML Supply Chain


      ### Dependency verification


      ```

      # requirements-secure.txt with hash pinning

      # pip install --require-hashes -r requirements-secure.txt


      torch==2.2.0 \
          --hash=sha256:abc123... \
          --hash=sha256:def456...
      transformers==4.38.0 \
          --hash=sha256:ghi789...
      peft==0.9.0 \
          --hash=sha256:jkl012...
      ```


      ### Scanning models before deployment


      ```python

      import pickle

      import struct

      from pathlib import Path


      class ModelSecurityScanner:
          """Security scanner for ML model files."""

          # Pickle GLOBAL opcodes ('c' + module name) that pull in dangerous modules
          DANGEROUS_OPCODES = {
              b'cos\n',  # GLOBAL os (e.g. os.system)
              b'cposix\n',  # GLOBAL posix
              b'csubprocess\n',  # GLOBAL subprocess
              b'cbuiltins\n',  # GLOBAL builtins (eval/exec)
              b'c__builtin__\n',  # GLOBAL __builtin__ (Python 2)
          }

          def scan_pickle(self, model_path: str) -> dict:
              """Inspect a pickle file for signs of malicious code."""
              results = {"safe": True, "warning
  - text: >-
      # APIEval-20: A Benchmark for Black-Box API Test Suite Generation


      ---


      ## Motivation


      Testing APIs thoroughly is one of the most critical, yet consistently
      underserved, activities in software engineering. Despite a rich ecosystem
      of API testing tools — Postman, RestAssured, Schemathesis, Dredd, and
      others — we found ourselves asking a deceptively simple question:


      **Given only the schema and an example payload of an API request — no
      source code, no documentation, no prior knowledge — how well can an AI
      agent generate a test suite that actually finds bugs?**


      We searched for an existing benchmark that captured this black-box
      scenario and came up empty. Every evaluation we found either required
      access to the implementation, relied on rich API documentation, or
      measured properties like schema compliance rather than actual bug-finding
      capability. The practitioner reality is different: teams frequently
      receive API payloads with little context and need to construct meaningful
      tests quickly.


      That gap is the reason **APIEval-20** exists.


      APIEval-20 is not a model benchmark. It is a **task benchmark for AI
      agents**. It evaluates end-to-end agent behavior — the ability to reason
      about an API surface, design targeted tests, and uncover real bugs — not
      just the quality of generated text.


      ---


      ## 1. Benchmark Overview


      APIEval-20 consists of 20 carefully designed API scenarios drawn from
      real-world application domains. Each scenario presents the agent with an
      API request schema and a sample payload, then challenges it to produce a
      test suite that exposes bugs hidden within a live reference
      implementation.


      ### Domains Covered


      The 20 scenarios span the following application domains, chosen to reflect
      a broad range of validation patterns, business logic complexity, and
      security sensitivity:


      | Domain | Scenarios |

      |---|---|

      | **E-commerce** | Order placement, coupon redemption, inventory
      adjustment |

      | **Payments** | Transaction creation, refund processing, currency
      conversion |

      | **Authentication** | Login, token refresh, password reset, session
      management |

      | **User Management** | Account creation, profile update, role assignment
      |

      | **Scheduling** | Appointment booking, availability queries, recurring
      events |

      | **Notifications** | Email dispatch, push configuration, preference
      management |

      | **Search & Filtering** | Query construction, pagination, sort and rank |


      ---


      ## 2. Bug Spectrum


      Each scenario contains between 3 and 8 planted bugs. Rather than
      categorising bugs by severity, APIEval-20 classifies them by
      **complexity** — reflecting how much reasoning is required to discover
      them. Bugs range along a continuum from simple to complex.


      ### Simple Bugs


      Require no semantic understanding of the domain. They test whether the API
      handles basic structural issues correctly: missing required fields, empty
      values (`""`, `null`, `[]`), and wrong data types.


      ### Moderate Bugs


      Require understanding the meaning of individual fields and their
      constraints: numeric values outside valid range, strings violating format
      constraints (malformed email, invalid currency code, wrong date format),
      and enum fields receiving boundary or undocumented values.


      ### Complex Bugs


      Require understanding the *relationship* between multiple fields, or the
      broader semantics of the operation: mutually exclusive fields both
      provided, discounts applied to ineligible orders, fields whose validity
      depends on the value of another field.


      **A strong test suite should span the full complexity spectrum — simple
      structural checks alone will not surface the bugs that matter most in
      production.**
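

      To make the spectrum concrete, here is a sketch of one probe per tier
      against a hypothetical order payload (field names and probes are
      illustrative, not taken from any benchmark scenario):


      ```python

      valid = {"user_id": "usr_4821", "quantity": 2, "currency": "USD"}


      # Simple: structural breakage, no domain knowledge needed.

      missing_field = {k: v for k, v in valid.items() if k != "user_id"}


      # Moderate: one field violates its own semantic constraint.

      bad_currency = {**valid, "currency": "US DOLLARS"}  # not ISO 4217


      # Complex: fields valid in isolation, invalid in combination.

      zero_qty_coupon = {**valid, "quantity": 0, "coupon_code": "SAVE10"}


      suite = [missing_field, bad_currency, zero_qty_coupon]
      ```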


      ---


      ## 3. Agent I/O


      ### What the Agent Receives


      For each scenario, the agent is given exactly two inputs. Nothing else —
      no response schema, no implementation details, no error messages, no
      changelog. This deliberate constraint reflects the black-box testing
      reality and prevents agents from trivially exploiting documentation.


      1. **JSON Schema** — The full request schema: field names, types,
      required/optional status, and any documented constraints.

      2. **Sample Payload** — A concrete example of a valid request, showing
      realistic field values.


      **Example Input — `POST /api/v1/orders`**


      Schema:

      ```json

      {
        "user_id":    { "type": "string",  "required": true },
        "items":      { "type": "array",   "required": true,
          "items": { "product_id": "string", "quantity": "integer", "unit_price": "number" } },
        "coupon_code": { "type": "string",  "required": false },
        "currency":   { "type": "string",  "required": true, "description": "ISO 4217 currency code" },
        "shipping":   { "type": "object",  "required": true,
          "properties": { "address": "string", "method": "string" } }
      }

      ```


      Sample Payload:

      ```json

      {
        "user_id": "usr_4821",
        "items": [
          { "product_id": "prod_991", "quantity": 2, "unit_price": 29.99 }
        ],
        "coupon_code": "SAVE10",
        "currency": "USD",
        "shipping": {
          "address": "123 Main St, Springfield",
          "method": "standard"
        }
      }

      ```


      ### What the Agent Produces


      The agent must output a **test suite**: a list of test cases, where each
      test case contains a short human-readable test name and the complete
      request payload as a valid JSON object. No expected outcome is required.
      Evaluation is performed by running each test case against the live
      reference implementation and observing what actually happens.


      **Example Test Case Output:**

      ```json

      {
        "test_name": "Order with zero quantity item",
        "payload": {
          "user_id": "usr_4821",
          "items": [{ "product_id": "prod_991", "quantity": 0, "unit_price": 29.99 }],
          "currency": "USD",
          "shipping": { "address": "123 Main St, Springfield", "method": "standard" }
        }
      }

      ```


      ---


      ## 4. Evaluation Methodology


      All 20 reference API implementations are deployed and running. Evaluation
      is fully automated: each test case in the agent's output is executed
      against the live API, and the responses are analysed to determine which
      planted bugs were triggered.


      A bug is considered **detected** if at least one test case in the suite
      produces a response that deviates from the correct behaviour in a way that
      corresponds to the planted bug: for example, a `200 OK` where a `400`
      should have been returned, or a silently incorrect computed value in the
      response body.
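

      The detection loop can be sketched as follows; here `send` stands in for
      the live HTTP call and `oracle` for the benchmark's private mapping from
      observed responses to planted bugs, both of which are assumptions of this
      sketch:


      ```python

      def run_suite(test_suite, send, oracle):
          """Execute each case via send(payload) -> (status, body) and
          collect the planted bugs the oracle says were triggered."""
          detected = set()
          for case in test_suite:
              status, body = send(case["payload"])
              detected |= oracle(case, status, body)
          return detected


      # Toy stand-in: the API (buggily) accepts a zero-quantity item with 200.

      def fake_send(payload):
          return (200, "order accepted")


      def fake_oracle(case, status, body):
          qty = case["payload"]["items"][0]["quantity"]
          return {"BUG-zero-qty"} if qty == 0 and status == 200 else set()


      suite = [{"test_name": "Order with zero quantity item",
                "payload": {"user_id": "usr_4821", "items": [{"quantity": 0}]}}]

      print(run_suite(suite, fake_send, fake_oracle))  # {'BUG-zero-qty'}
      ```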


      ---


      ## 5. Scoring


      The final score combines three factors, weighted to emphasise real-world
      value: finding bugs matters most, systematic coverage rewards
      thoroughness, and efficiency discourages noise.


      | Component | Weight | Description |

      |---|---|---|

      | Bug Detection Score | 70% | Primary metric |

      | Coverage Score | 20% | API surface exploration |

      | Efficiency Score | 10% | Signal-to-noise ratio |
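

      The blend itself is a straightforward weighted sum; a sketch (function
      and argument names are ours, not part of the benchmark's tooling):


      ```python

      WEIGHTS = {"bugs": 0.70, "coverage": 0.20, "efficiency": 0.10}


      def final_score(bugs, coverage, efficiency):
          """Weighted blend of the three [0, 1] components."""
          return (WEIGHTS["bugs"] * bugs
                  + WEIGHTS["coverage"] * coverage
                  + WEIGHTS["efficiency"] * efficiency)


      print(round(final_score(0.8, 0.6, 0.5), 2))  # 0.73
      ```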


      ### Bug Detection Score — Primary (70%)


      Measures how many of the planted bugs were successfully triggered. This is
      the core metric of the benchmark: an agent that finds more bugs scores
      higher, regardless of how it gets there.


      ```

      Bug Detection Rate = bugs_found / total_bugs

      ```


      **Range: 0 to 1.** A score of 1 means every planted bug was triggered; 0
      means none were. Scores below 0.3 indicate the agent is missing most bugs;
      above 0.7 is considered strong performance on a scenario.


      ### Coverage Score — 20%


      Measures how well the test suite explores the API surface across three
      independently computed dimensions. Each dimension produces a value between
      0 and 1; the three are averaged to produce the final Coverage Score.


      ```

      Coverage Score = (param_coverage + edge_coverage + variation_score) / 3

      ```


      **Range: 0 to 1.** All three sub-dimensions are individually bounded [0,
      1], so the average is too. A score of 1 requires full field coverage,
      edge tests on every field, and completely non-overlapping payloads: a
      high bar that rewards comprehensive, systematic suites.


      #### Parameter Coverage


      What fraction of schema fields are the *focus* of at least one test,
      i.e., differ from the valid sample payload in that test case (modified,
      omitted, or set to an alternate value).


      ```

      param_coverage 
  - text: >-
      # UltraData-Math


      <div align="center">
        <img src="assets/ultradata-math-logo.png" width="600"/>
      </div>


      <p align="center">

      <a href="https://huggingface.co/datasets/openbmb/UltraData-Math">🤗
      Dataset</a> | <a
      href="https://github.com/UltraData-OpenBMB/UltraData-Math">💻 Source
      Code</a> | <a
      href="https://huggingface.co/datasets/openbmb/UltraData-Math/blob/main/README_ZH.md">🇨🇳
      中文 README</a>

      </p>


      ***UltraData-Math*** is a large-scale, high-quality mathematical
      pre-training dataset totaling **290B+ tokens** across three progressive
      tiers—**L1** (170.5B tokens web corpus), **L2** (33.7B tokens
      quality-selected), and **L3** (88B tokens multi-format refined)—designed
      to systematically enhance mathematical reasoning in LLMs. It has been
      applied to the mathematical pre-training of the [MiniCPM
      Series](https://huggingface.co/collections/openbmb/minicpm4) models.


      ## 🆕 What's New


      - **[2026.02.09]**: **UltraData-Math**, a large-scale high-quality
      mathematical pre-training dataset with 290B+ tokens across three
      progressive tiers (L1/L2-preview/L3), is now available on Hugging Face.
      Released as part of the [UltraData](https://ultradata.openbmb.cn/)
      ecosystem. 🔥🔥🔥

      - **[2026.02.10]**: **UltraData-Math** tops the Hugging Face Datasets
      Trending list, reaching the #1 spot! ⭐️⭐️⭐️


      ## 📚 Introduction


      High-quality pre-training data is crucial for enhancing the mathematical
      reasoning capabilities of large language models (LLMs). However, existing
      mathematical pre-training data construction schemes have the following
      shortcomings:


      - **HTML Parsing**: General parsers (such as trafilatura, readability)
      are designed mainly for news and article extraction and lack specialized
      handling of mathematical formulas, which often destroys or drops formula
      structure; mathematical discussions on forum-style pages are also
      difficult to extract completely.

      - **Data Quality**: Existing datasets generally lack a systematic quality
      grading mechanism, with high-value mathematical content mixed with
      low-quality noise.

      - **Data Diversity**: Mainstream datasets mostly originate from textbooks
      or competition question banks and lack the mathematical discussions and
      application scenarios found in real web pages; synthetic data tends to
      come in a single format, making it hard to cover diverse needs such as
      multi-turn dialogues and multi-style expression.


      To address these issues, we propose ***UltraData-Math***—a large-scale
      high-quality pre-training dataset for mathematical reasoning tasks. This
      dataset is developed based on the
      [UltraData](https://ultradata.openbmb.cn/blog/position-paper) L0-L4 Tiered
      Data Management Framework, containing four progressive levels:


      - **L0 Raw Data**: Develops a mathematical parser based on *magic-html*,
      combined with *w3m* layout preservation rendering and multi-level fallback
      strategies, standardizing MathML, KaTeX, and AsciiMath into LaTeX format.

      - **L1 Filtered Data**: Cleans noise through heuristic rules and performs
      document-level deduplication.

      - **L2 Selected Data**: Uses proprietary large models to annotate seed
      data and distills it into a lightweight embedding classifier to achieve
      efficient quality grading of the full corpus.

      - **L3 Refined Data**: Produces structured content with clear reasoning
      through rewriting, synthetic generation, and refinement in various formats
      such as Q&A, multi-turn dialogues, multi-style rewriting, and
      knowledge-grounded textbooks.


      Experiments show that on the MiniCPM-1.2B architecture,
      ***UltraData-Math*** reaches **37.02** on the MATH500 benchmark, an
      improvement of **+3.62 percentage points** over Nemotron-CC 4plus, and
      **61.79** on GSM8K, an improvement of **+3.34 percentage points**, while
      maintaining code generation and general knowledge capabilities.


      ***UltraData-Math*** has been applied to the mathematical pre-training of
      the [MiniCPM
      Series](https://huggingface.co/collections/openbmb/minicpm-4-6841ab29d180257e940baa9b)
      models.


      -
      **[UltraData-Math-L1](https://huggingface.co/datasets/openbmb/UltraData-Math)**:
      Large-scale high-quality mathematical pre-training dataset, containing
      170.5B tokens of web mathematical corpus. 

      -
      **[UltraData-Math-L2](https://huggingface.co/datasets/openbmb/UltraData-Math-L2)**:
      High-quality mathematical pre-training dataset selected by the quality
      model, containing 33.7B tokens of high-quality web mathematical corpus.

      -
      **[UltraData-Math-L3](https://huggingface.co/datasets/openbmb/UltraData-Math-L3)**:
      High-quality refined mathematical dataset, containing 88B tokens of
      multi-format refined data (Q&A, multi-turn dialogues, knowledge textbooks,
      etc.).


      ## 🏗️ Data Processing Pipeline


      To break through the limitations of existing mathematical datasets in
      quality and diversity, we established a refined grading standard centered
      on "mathematical content integrity" and "information density".
      ***UltraData-Math*** adopts the **L0-L4 Tiered Data Management Framework**
      proposed by the
      [UltraData](https://ultradata.openbmb.cn/blog/position-paper) paper.
      Standardized level definitions enable orderly management and efficient
      movement of mathematical data through the pipeline. Each successive level
      represents higher data purity and mathematical value, and a
      correspondingly more refined degree of processing.


      <div align="center">
        <img src="assets/ultradata-math-pipeline.png" width="900"/>
      </div>


      ### L0: Raw Data Parsing and Standardization


      **Goal**: Address the poor support of general HTML parsers for
      mathematical formulas and maximize the preservation of mathematical
      semantics in web pages.


      The L0 phase processes raw web data obtained from sources such as Common
      Crawl. Because mathematical web pages have distinctive markup, we
      developed specialized parsing strategies in the
      [UltraData-Math-Parser](https://huggingface.co/spaces/openbmb/UltraData-Math-L0-Parser)
      instead of relying on general-purpose parsers such as trafilatura or
      readability.


      - **Unified Parsing Mode**: Automatically identifies page types to
      extract content as completely as possible.

      - **Multi-level Fallback Strategy**: To prevent data loss due to parsing
      failures, we implement a multi-level fallback mechanism to ensure text
      content is captured even if structured parsing fails.

      - **Mathematical Formula Standardization**: We unify different
      mathematical expressions in web pages into standard LaTeX format,
      achieving data format normalization for unified model learning.
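

      To illustrate the fallback and standardization ideas, here is a minimal
      sketch; the HTML snippet, the `structured`/`plain` extractors, and the
      regex are illustrative assumptions, not the production parser:


      ```python

      import re


      # Toy page; the real pipeline consumes full Common Crawl HTML.

      html = "<main>Euler: \\(e^{i\\pi}+1=0\\)</main>"


      # Hypothetical fallback chain: a structured extractor first, then a crude tag-stripper that never fails.

      structured = lambda h: h.split("<main>")[1].split("</main>")[0] if "<main>" in h else None

      plain = lambda h: re.sub(r"<[^>]+>", " ", h).strip()

      text = next(t for t in (p(html) for p in (structured, plain)) if t)


      # Normalize \( ... \) inline math into $ ... $ delimiters, one small step toward a uniform LaTeX format.

      latex = re.sub(r"\\\((.+?)\\\)", r"$\1$", text)

      ```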


      ### L1: Heuristic Cleaning and Filtering


      **Goal**: Remove format noise and improve data readability and
      standardization.


      After obtaining text containing complete mathematical formulas, we clean
      the L0 data through a series of heuristic rules:


      - **Format Repair**:
        - Clean invisible characters, garbled text, and unnatural continuous line breaks.
        - Remove irrelevant web noise such as navigation bars, footers, ad pop-ups, and "read more".
      - **Content Filtering**:
        - *Length Filtering*: Remove overly short text fragments, which usually lack context and are difficult to support effective mathematical reasoning training.
        - *Language Identification*: Ensure the dataset is composed mainly of high-quality English and Chinese mathematical content.
        - *Document Deduplication*: Perform deduplication at the document level to prevent duplicate content from biasing model training.
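

      A minimal sketch of these rules (the sample documents, the character
      classes, and the length threshold are illustrative assumptions; the real
      pipeline also runs language identification):


      ```python

      import re


      docs = ["Solve \u200b$x^2=4$.\n\n\nAnswer: $x=\\pm 2$.", "Solve $x^2=4$.\n\n\nAnswer: $x=\\pm 2$.", "ad"]


      # Strip zero-width characters and collapse runs of blank lines.

      cleaned = [re.sub(r"\n{3,}", "\n\n", re.sub(r"[\u200b\u200c\ufeff]", "", d)) for d in docs]


      # Length filter: drop fragments too short to carry mathematical context.

      kept = [d for d in cleaned if len(d) >= 10]


      # Exact document-level deduplication, order-preserving.

      deduped = list(dict.fromkeys(kept))

      ```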

      ### L2: Selection Based on Quality Models


      **Goal**: Identify high-value core corpora within the massive filtered
      data.


      Although L1 data has a clean format, the content quality varies. The L2
      phase introduces a model-based quality assessment system:


      - **Seed Data Annotation**: Use proprietary large models to score a
      portion of seed data across multiple dimensions.

      - **Classifier Training and Distillation**: Train lightweight embedding
      classifiers on the annotated data so they can identify high-value
      mathematical content.

      - **Full-scale Inference**: Use the trained classifier to score and screen
      L1 data in full.
        - *Retention*: Content containing detailed problem-solving step
  - text: >-
      # Swiss Case Law Dataset


      **962,724 published decisions from Swiss federal, cantonal, and regulatory
      bodies.**


      Full text, structured metadata, extracted case-citation references, and
      daily updates. The March 20, 2026 snapshot contains German, French, and
      Italian decisions; the export schema also reserves `rm` for Romansh.


      [![Dashboard](https://img.shields.io/badge/Dashboard-live-d1242f)](https://opencaselaw.ch)

      [![GitHub](https://img.shields.io/badge/GitHub-source-black)](https://github.com/jonashertner/caselaw-repo-1)

      [![MCP
      Server](https://img.shields.io/badge/MCP-live-blue)](https://mcp.opencaselaw.ch/health)

      [![Data License:
      CC0--1.0](https://img.shields.io/badge/Data_License-CC0--1.0-blue.svg)](https://creativecommons.org/publicdomain/zero/1.0/)

      [![Code License:
      MIT](https://img.shields.io/badge/Code_License-MIT-green.svg)](https://github.com/jonashertner/caselaw-repo-1/blob/main/LICENSE)


      ## Dataset Summary


      The largest open collection of Swiss court decisions: 962,724 decisions
      from 102 federal, cantonal, and regulatory courts or public bodies,
      scraped from official publication channels. New decisions are added every
      night.


      - **20 federal courts and bodies**: BGer, BVGer, BStGer, BPatGer, BGE,
      FINMA, WEKO, EDÖB, ECHR (Swiss cases), VPB, Sports Tribunal, and more

      - **82 cantonal courts** across all 26 cantons

      - **Current decision languages**: German (448,461; 46.6%), French
      (434,663; 45.1%), Italian (79,600; 8.3%); the export schema also reserves
      `rm`

      - **Temporal range**: 1875–present (BGE historical vol. 1 from 1875)

      - **8.76 million extracted case-citation references**

      - **6.42 million resolved decision-to-decision links** (with confidence
      scores)

      - **11.23 million statute-decision links** (e.g., which decisions cite
      Art. 41 OR)

      - **80 federal laws indexed** with 39,000 articles in 3 languages

      - **34 structured fields** per decision in Parquet; 24 in the FTS5 search
      index


      ## Quick Start


      ### Load with HuggingFace datasets


      ```python

      from datasets import load_dataset


      # Load all courts

      ds = load_dataset("voilaj/swiss-caselaw")


      # Load a single court

      bger = load_dataset("voilaj/swiss-caselaw",
      data_files="data/bger.parquet")

      ```


      ### Load with pandas


      ```python

      import pandas as pd


      df =
      pd.read_parquet("hf://datasets/voilaj/swiss-caselaw/data/bger.parquet")

      df_recent = df[df["decision_date"] >= "2024-01-01"]

      print(f"{len(df_recent)} decisions since 2024")


      # Filter by language

      df_french = df[df["language"] == "fr"]


      # Group by legal area

      df.groupby("legal_area").size().sort_values(ascending=False).head(10)

      ```


      ### Direct download


      Every court is a single Parquet file:


      ```

      https://huggingface.co/datasets/voilaj/swiss-caselaw/resolve/main/data/bger.parquet

      https://huggingface.co/datasets/voilaj/swiss-caselaw/resolve/main/data/bvger.parquet

      https://huggingface.co/datasets/voilaj/swiss-caselaw/resolve/main/data/zh_gerichte.parquet

      ```


      Full list:
      [huggingface.co/datasets/voilaj/swiss-caselaw/tree/main/data](https://huggingface.co/datasets/voilaj/swiss-caselaw/tree/main/data)


      ### REST API (no setup)


      Query via the HuggingFace Datasets Server; no installation required:


      ```bash

      # Get rows

      curl
      "https://datasets-server.huggingface.co/rows?dataset=voilaj/swiss-caselaw&config=default&split=train&offset=0&length=5"


      # Dataset info

      curl
      "https://datasets-server.huggingface.co/info?dataset=voilaj/swiss-caselaw"

      ```


      ### Full-text search via MCP


      Connect the dataset to Claude, ChatGPT, or Gemini for natural-language
      search over all 962,724 decisions. The MCP surface is
      deployment-dependent: local deployments can expose up to 21 tools, remote
      mode omits local update tools, and legislation tools depend on
      LexFind-backed configuration.


      **Remote (no download needed):**


      ```bash

      # Claude Code

      claude mcp add swiss-caselaw --transport sse https://mcp.opencaselaw.ch


      # Claude Desktop: Settings → Connectors → Add custom connector →
      https://mcp.opencaselaw.ch


      # ChatGPT: Settings → Apps → Developer mode → Create app →
      https://mcp.opencaselaw.ch/sse (auth: None)

      # Recommended with GPT-5.3


      # Gemini CLI: add to ~/.gemini/settings.json

      # { "mcpServers": { "swiss-caselaw": { "url": "https://mcp.opencaselaw.ch"
      } } }

      ```


      Search results include enriched metadata: court name (human-readable),
      court level, legal area, statute articles cited, citation count, and
      leading-case flag.


      **Local (offline access, ~65 GB disk):**


      ```bash

      git clone https://github.com/jonashertner/caselaw-repo-1.git

      cd caselaw-repo-1

      python3 -m venv .venv

      source .venv/bin/activate          # Windows: .venv\Scripts\Activate.ps1

      pip install mcp pydantic huggingface-hub pyarrow

      claude mcp add swiss-caselaw -- /path/to/.venv/bin/python3
      /path/to/mcp_server.py

      # Windows: use .venv\Scripts\python.exe instead

      ```


      On first search, the server downloads the Parquet files (~7 GB) from this
      dataset and builds a local SQLite FTS5 index (~58 GB). This takes 30–60
      minutes and only happens once. After that, searches are instant.


      ## Dataset Statistics


      | Metric | Value |

      |--------|-------|

      | Total decisions | 962,724 |

      | Courts | 102 |

      | Temporal range | 1875–present |

      | Average decision length | ~22,000 characters |

      | Full text coverage | 100% |

      | Regeste (headnote) coverage | ~54% |

      | Extracted case-citation references | 8.76 million |

      | Resolved decision links | 6.42 million |

      | Statute-decision links | 11.23 million |

      | Federal laws indexed | 80 (39,000 articles) |

      | Legislation texts searchable | 33,000+ |

      | MCP tools | Deployment-dependent (up to 21) |


      **Language distribution:**


      | Language | Count | Share |

      |----------|-------|-------|

      | German (de) | 448,215 | 46.58% |

      | French (fr) | 434,470 | 45.15% |

      | Italian (it) | 79,587 | 8.27% |


      **Reference graph:** 8.76 million extracted case-citation references, 6.42
      million resolved decision-to-decision links, and 11.23 million
      statute-to-decision links. The most-cited decision is BGE 125 V 351 with
      54,000 incoming citations.


      **Search benchmark (frozen offline baseline):**
      `benchmarks/search_benchmark_2026-03-19_offline_full.json` records a
      100-query run against a 1,078,177-row local `decisions.db`, with MRR@10 =
      0.4697, Recall@10 = 0.4958, nDCG@10 = 0.5250, and Hit@1 = 0.33. This is a
      reproducible offline baseline, not a fully provisioned hosted-system
      score.
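

      For reference, MRR@10 (the headline metric above) can be computed as
      follows; the example ranks are made up, not taken from the benchmark run:


      ```python

      # Hypothetical per-query rank of the first relevant decision (None = not retrieved).

      first_relevant_rank = [1, 3, None, 2]


      # MRR@10: mean reciprocal rank of the first hit, counting zero when it falls outside the top 10.

      reciprocal = [1.0 / r if r is not None and r <= 10 else 0.0 for r in first_relevant_rank]

      mrr_at_10 = sum(reciprocal) / len(reciprocal)

      ```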


      ## Intended Uses


      - **Legal research and case law analysis**: full-text search and citation
      network analysis across the Swiss court system

      - **NLP research on multilingual legal text**: classification,
      summarization, named entity recognition, and cross-lingual tasks on
      German/French/Italian legal corpora

      - **Legal tech development**: building search engines, citation analysis
      tools, and document drafting assistants grounded in Swiss jurisprudence

      - **Academic study of Swiss jurisprudence**: tracking doctrinal evolution,
      identifying leading cases, analyzing court output over time


      **Not intended for**: automated legal advice or replacing professional
      legal counsel. This dataset is a research and analysis resource, not a
      substitute for qualified legal representation.


      ## Limitations


      - **Temporal coverage varies by court**: federal courts from 1996, some
      cantonal courts from 2000+; historical BGE volumes from 1875

      - **Historical OCR artifacts**: BGE decisions from volumes 1–79
      (1875–1953) were digitized from print and may contain OCR errors

      - **Publication delays**: some cantonal courts have irregular publication
      schedules; decisions may appear weeks after being rendered

      - **Language distribution is unbalanced by design**: it reflects actual
      court output (German and French cantons are larger), not balanced sampling

      - **Anonymization varies by court**: most federal decisions are
      anonymized; some cantonal decisions may contain personal names or details

      - **~1.9% short-text decisions**: some decisions are PDF-only publications
      where text extraction produced fewer than 500 characters; full text may be
      available at the source URL


      ## Dataset Creation


      **Collection**: 54 automated scrapers target of
metrics:
  - accuracy
pipeline_tag: text-classification
library_name: setfit
inference: true
datasets:
  - davanstrien/hf-dataset-domain-labels-v0
base_model: BAAI/bge-small-en-v1.5
model-index:
  - name: SetFit with BAAI/bge-small-en-v1.5
    results:
      - task:
          type: text-classification
          name: Text Classification
        dataset:
          name: davanstrien/hf-dataset-domain-labels-v0
          type: davanstrien/hf-dataset-domain-labels-v0
          split: test
        metrics:
          - type: accuracy
            value: 0.8333333333333334
            name: Accuracy

SetFit with BAAI/bge-small-en-v1.5

This is a SetFit model trained on the davanstrien/hf-dataset-domain-labels-v0 dataset that can be used for Text Classification. This SetFit model uses BAAI/bge-small-en-v1.5 as the Sentence Transformer embedding model. A LogisticRegression instance is used for classification.

The model has been trained using an efficient few-shot learning technique that involves:

  1. Fine-tuning a Sentence Transformer with contrastive learning.
  2. Training a classification head with features from the fine-tuned Sentence Transformer.

Model Details

Model Description

Model Sources

Model Labels

Label Examples
biology
  • '# Blind Spots of google/gemma-3-1b-pt\n\n## Model Tested\nModel: google/gemma-3-1b-pt \nParameters: 1B \nType: Pre-trained base language model (not instruction-tuned) \nTested by: Toka-Tarek
chemistry
  • '# Concrete Compressive Strength Testing Dataset for West Africa\n\n## Abstract\n\nThis dataset provides synthetic but research-grounded data on concrete compressive strength testing across West Africa, with particular focus on Nigerian construction practices. The data encompasses laboratory testing parameters, mix design variables, and compressive strength outcomes based on ASTM C39 standards and published empirical studies from the region. Each record contains detailed information on mix proportions, water-cement ratios, curing conditions, aggregate sources, and corresponding compressive strength values at various test ages.\n\nKeywords: concrete compressive strength, ASTM C39, mix design, water-cement ratio, West Africa, Nigeria, structural concrete\n\n---\n\n## 1. Introduction\n\n### 1.1 Background\n\nConcrete remains the most widely used construction material in West Africa, with compressive strength serving as the primary indicator of structural quality and performance. Understanding the relationships between mix parameters and resulting strength is essential for quality control, structural design, and construction practice optimization in the region.\n\n### 1.2 Problem Statement\n\nTraditional concrete mix design in Nigeria and neighboring West African countries relies heavily on empirical practices and local material sources. However, significant variability exists in:\n- Cement brands and their performance characteristics\n- Aggregate sources and quality\n- Curing conditions (especially in tropical climates)\n- Testing laboratory standards and practices\n\n### 1.3 Research Objectives\n\nThis dataset aims to provide:\n1. Comprehensive data on concrete compressive strength testing parameters\n2. Realistic statistical distributions based on Nigerian and West African empirical studies\n3. Variable combinations reflecting actual construction site conditions\n4. Quality control benchmarks aligned with international standards\n\n---\n\n## 2. 
Methodology\n\n### 2.1 Data Generation Framework\n\nThe synthetic data generation follows a DAG-based (Directed Acyclic Graph) sampling approach, where parent variables are sampled before dependent child variables. This ensures realistic correlations between parameters.\n\n### 2.2 Parameter Evidence Table\n\n
climate
  • 'logo\n\n# The ClimateCheck Dataset\n\nThis dataset is used for the ClimateCheck: Scientific Fact-checking of Social Media Posts on Climate Change Shared Task.\nThe 2025 iteration was hosted at the Scholarly Document Processing workshop at ACL 2025, and a new 2026 iteration will be hosted at the Natural Scientific Language Processing workshop at LREC 2026.\n\n## 2026 Update\n\nFor running the next iteration of the task, we added manually labelled training data, resulting in 3023 claim-abstract pairs overall. The claims used for testing are unchanged. \n\n## Dataset Development Process\n\nClaims\n\nThe claims used for this dataset were gathered from the following existing resources: ClimaConvo, DEBAGREEMENT, Climate-Fever, MultiFC, and ClimateFeedback. \nSome of which are extracted from social media (Twitter/X and Reddit) and some were created synthetically from news and media outlets using text style transfer techniques to resemble tweets. \nAll claims underwent a process of scientific check-worthiness detection and are formed as atomic claims (i.e. containing only one core claim).\n\nPublications Corpus\n\nTo retrieve relevant abstracts, a corpus of publications was gathered from OpenAlex and S2ORC, containining 394,269 abstracts. It can be accessed here: https://huggingface.co/datasets/rabuahmad/climatecheck_publications_corpus\n\n**Annotation Processes\n\nThe training and testing data for claim verification were annotated by five graduate students in the Climate and Environmental Sciences. \nUsing a TREC-like pooling approach, we retrieved the top 20 abstracts for each claim using BM25 followed by a neural cross-encoder trained on the MSMARCO data. \nThen we used 6 state-of-the-art models to classify claim-abstract pairs. If a pair resulted in at least 3 evidentiary predictions, it was added to the annotation corpus. 
\nEach claim-abstract pair was annotated by two students and resolved by a curator in cases of disagreements.\n\nThe training and testing data for narrative classification were annotated by four graduate students, all of whom annotated every unique claim in the dataset. \nThe final labels were chosen using a majority vote approach. When there was no majority, two curators annotated and discussed the final label choice. \n\nTraining Data**\n\nThe training data contains the following:\n\n- claim: a string value of a claim about climate change.\n- abstract: a string value of an abstract relating to the claim.\n- abstract_id: the ID of the connected abstract, which corresponds to the publications corpus (see above) and can be used to retrieve more metadata about the abstract.\n- annotation: a label of 'Supports', 'Refutes', or 'Not Enough Information', describing the relation of the connected abstract to the claim.\n- data_version: 2025 if the claim was released during the 1st iteration of the task and 2026 if it was added in the 2nd iteration.\n- narrative: a label according to the CARDS taxonomy denoting whether the claim is an example of a known climate disinformation narrative. Only the first two levels of the taxonomy were used. \n \nThe training data consists of 3023 instances with 782 unique claims. Each claim is connected to at least 1 and at most 5 abstracts. \n\nThe distribution of the labels for claim verification is as follows:\n\n
code
  • '# Dataset Card for python_code_instructions_18k_alpaca\n\nThe dataset contains problem descriptions and code in python language.\nThis dataset is taken from sahil2801/code_instructions_120k, which adds a prompt column in alpaca style. Refer to the source here.'
  • '# SERA — Consolidated & Rectified\n\n211,360 multi-turn SWE-agent coding trajectories from the SERA (Soft-Verified Efficient Repository Agents) project, consolidated from 4 source datasets into a single file with strict reasoning + tool-call format and validated FSM transitions.\n\n## Origin\n\nDerived from Allen AI's Open Coding Agents release:\n\n
cybersecurity
  • '# Bug Bounty & Méthodologies de Pentest\n\nMéthodologies (OWASP, PTES), checklists par type d app, techniques d attaque, plateformes, templates de rapports et outils.\n\n## Links\n- Version anglaise\n- AYI NEDJIMI Consultants'
  • '# 🛡️ Security Reasoning Dataset for Prompt injection and PII (senstive data) detection\n\nThis dataset contains 2,139 high-quality synthetic examples designed for training lightweight security models—specifically targeting the firewall-gemma-3-4b-it architecture—using the Distilling Step-by-Step methodology. \n\n## 📊 Dataset Analytics & Distribution\n\nThe dataset is engineered to handle real-world enterprise edge cases, specifically the "needle-in-a-haystack" problem where malicious payloads or sensitive data are buried deep within massive contexts.\n\n* Total Samples: 2,139 \n* Training Set: 2,000 examples\n* Test Set: 139 examples (Note: test sets can be adjusted to a strict 100 split based on the files used)\n* Length Distribution: Ranges from short 10-word direct triggers to complex payloads exceeding 1,500 characters. \n* Format: Multi-turn conversational formats, raw document text, and code blocks.\n\n### Category Breakdown & Domain Coverage\nThe dataset spans 50+ technical and business domains to ensure the firewall remains highly accurate across different enterprise environments.\n\n
finance
  • '# AMM-Events: A Multi-Protocol DeFi Event Dataset\n\n## Dataset Description\n\nAMM-Events is a high-fidelity, block-level dataset capturing 8.9 million on-chain events from the Ethereum mainnet, specifically designed for event-aware forecasting and market microstructure analysis in Decentralized Finance (DeFi). \n\nUnlike traditional financial datasets based on Limit Order Books (LOB), this dataset focuses on Automated Market Makers (AMMs), where price dynamics are triggered exclusively by discrete on-chain events (e.g., swaps, mints, burns) rather than continuous off-chain information.\n\n- Paper Title: Towards Event-Aware Forecasting in DeFi: Insights from On-chain Automated Market Maker Protocols\n- Total Events: 8,917,353\n- Time Span: Jan 1, 2024 – Sep 16, 2025\n- Block Range: 18,908,896 – 23,374,292\n- Protocols: Uniswap V3, Aave, Morpho, Pendle\n- Granularity: Block-level timestamps & transaction-level event types\n\n### Supported Tasks\n- Event Forecasting: Predicting the next event type (classification/TPP) and time-to-next-event (regression/TPP).\n- Market Microstructure Analysis: Analyzing causal synchronization between liquidity events and price shocks.\n- Anomaly Detection: Identifying "Black Swan" traffic surges or congestion events.\n\n---\n\n## Dataset Structure\n\nThe data is organized into a standardized JSON format. 
Each entry decouples complex smart contract logic into interpretable metrics.\n\n### Data Fields\n\n- block_number (int): The Ethereum block height where the event occurred.\n- timestamp (int): Unix timestamp of the block.\n- transaction_hash (string): Unique identifier for the transaction.\n- protocol (string): Origin protocol (Uniswap V3, Aave, Morpho, or Pendle).\n- event_type (string): The category of the event (Swap, Mint, Burn, UpdateImpliedRate, etc.).\n- payload (dict): Protocol-specific metrics (e.g., amount0, amount1, liquidity, tick for Uniswap).\n\n### Data Splits\n\nThe dataset covers 359 liquidity pools selected for high activity and representativeness:\n- Pendle: 296 pools (Yield Trading)\n- Aave: 53 pools (Lending)\n- Uniswap V3: 5 pools (Spot Trading)\n- Morpho: 5 pools (Lending Optimization)\n\n---\n\n## Usage\n\n### Loading the Data\nYou can load this dataset directly using the Hugging Face datasets library:\n\n```python\nfrom datasets import load_dataset\n\ndataset = load_dataset("Jackson668/AMM-Events")\n\n# Example: Accessing the first train example\nprint(dataset['train'])'
  • 'Dataset: Mimi1782/kaggle\n\n'
  • 'Dataset: Menoviar28/menov131\n\n'
legal
  • '# UAE Laws Q&A Dataset (IRAC Format)\n\nA high-quality dataset of 9,477 question-answer pairs about UAE laws, formatted in IRAC (Issue, Rule, Application, Conclusion) legal reasoning structure.\n\n## Dataset Creation\n\n### Source Documents\nThe dataset was built from a comprehensive collection of UAE legal documents, including:\n- Federal Decrees and Laws\n- Cabinet Resolutions\n- Ministerial Decisions\n- Civil and Commercial Codes\n- Labor Law\n- Traffic Law\n- And more\n\n### Creation Process\n\n1. Human-Written Seed Data: Manually crafted ~1,000 high-quality question-answer pairs directly from the source legal documents, ensuring accuracy and proper legal reasoning.\n\n2. Synthetic Generation: The remaining samples were synthetically generated using the source documents as context, following the patterns established by the human-written examples.\n\n3. Human Feedback & Validation: All synthetically generated samples underwent human review for:\n - Legal accuracy and correctness\n - Proper citation of articles and laws\n - Clarity and completeness of explanations\n - Adherence to IRAC format\n\n4. IRAC Formatting: All responses were standardized to follow the IRAC legal reasoning format for consistency and educational value.\n\n## Dataset Description\n\nAll responses follow the IRAC format:\n- Issue: Identifies the legal question or problem\n- Rule: States the relevant law, article, or regulation\n- Application: Explains how the rule applies to the situation\n- Conclusion: Provides the final answer\n\n## Dataset Statistics\n\n
math
  • '# FineProofs SFT\n\n## Dataset Description\n\nFineProofs SFT is a high-quality supervised fine-tuning dataset containing mathematical Olympiad problems paired with chain-of-thought reasoning and formal proofs distilled from DeepSeek-Math-V2. The dataset comprises 7,777 samples (4,300 unique problems) sourced from international Olympiad competitions and Art of Problem Solving (AoPS), each annotated with:\n\n- Detailed reasoning traces (thinking content) generated by deepseek-ai/DeepSeek-Math-V2\n- Formal mathematical proofs generated by deepseek-ai/DeepSeek-Math-V2\n- Expert grades from Gemini-3-Pro (0-7 point scale)\n- Model-based difficulty/reward scores from Qwen/Qwen3-4B-Thinking-2507\n- Problem categorization and competition metadata\n\nThis dataset is specifically designed for training reasoning models to generate proofs for competition-level mathematics.\n\n### Key Features\n\n- Chain-of-Thought Reasoning: Each solution includes explicit reasoning traces enclosed in <think> tags, distilled from deepseek-ai/DeepSeek-Math-V2\n- Formal Proofs: LaTeX-formatted mathematical proofs\n- Dual Quality Signals:\n - Expert grades from Gemini-3-Pro (continuous 0-7 scale)\n - Difficulty estimates from Qwen/Qwen3-4B-Thinking-2507 reward@128 scored by openai/gpt-oss-20b\n- Unfiltered Teacher Distillation: Includes both correct and incomplete solutions from teacher model to maximize learning signal for smaller models\n- Diverse Sources: Problems from IMO, APMO, USA(J)MO, and other prestigious competitions\n- Rich Metadata: Includes problem type categorization, competition information, and source attribution\n- Benchmark Decontaminated: No overlap with IMO-ProofBench or ProofBench evaluation sets\n\n## Dataset Structure\n\n### Data Fields\n\n
medical
  • "# Dataset Card for Dataset Name\n\n\n\nThis medium sized dataset 20K samples has been created with AOKVQA Train & Val split, Path-VQA Train & Val Split, TDIUC Val Split (Quantitative and Physical Reasoning Questions only). This is a multidomain dataset solely created to test the multidomain knowledge of VLM's, it can be used for inference or rapid prototyping. This is for educational and research purposes only. All the copyright belongs to the original owners of the datasets."
  • '# Medication-Specific QA Benchmark\n\n## Dataset Details\n\nTo construct a controlled evaluation benchmark, we selected 25 widely prescribed medications in Brazil across four therapeutic categories: antibiotics, analgesics and anti-inflammatory agents, antihypertensives, and antidiabetics. These categories were chosen to ensure clinical diversity across infectious, inflammatory, cardiovascular, and metabolic conditions.\n\nFor each selected medication, we verified the presence of its corresponding leaflet in the cleaned corpus. We then identified 10 standardized sections consistently present across documents, including indications, dosage, contraindications, warnings, adverse reactions, and drug interactions.\n\nQuestion generation was performed using a large language model conditioned on the medication name and structured leaflet content. To ensure balanced coverage and section-level control, we generated:\n\n- 4 questions per medication per section;\n- 40 questions per section across medications;\n- 400 total open-ended questions.\n\nThis benchmark, referred to as the Medication-Specific QA Benchmark, is explicitly designed to measure factual recall, section-level grounding, and evidence attribution when answers are directly localized within regulatory documents.\n\n## Citation\n\nThis work was accepted at The First Workshop on Language Technologies for Health (Lang4Health) is a workshop dedicated to the development and application of Natural Language Processing (NLP) technologies in the healthcare field.'
  • '# What this repo does\n\nThis repository provides a Clarus v0.8 clinical five-node dataset for detecting and reasoning about multi-organ failure cascade boundary transitions.\n\nThe dataset models situations where a patient state is no longer contained within a single metabolic-organ failure basin but is shifting between competing regimes such as:\n\n- metabolic overload with early organ strain\n- renal-hepatic failure transition\n- perfusion-linked organ cascade\n- refractory multi-organ failure\n\nThis is the conceptual upgrade introduced in Clarus v0.8.\n\nEarlier ladder versions detect instability, forecast deterioration, estimate collapse boundaries, model recovery geometry, and reason about intervention.\n\nv0.8 introduces regime transition geometry.\n\nThe system now measures not only distance to the nearest failure boundary but also distance to the nearest competing regime boundary.\n\nThis allows models to detect:\n\n- regime switching\n- competing failure modes\n- unstable regime identity\n- transition-aware intervention reasoning\n\n# Core five-node cascade\n\nThe five core variables in this dataset are:\n\n- metabolic_stress\n- physiologic_buffer\n- response_lag\n- organ_coupling\n- perfusion_stability\n\nOperational definitions:\n\nmetabolic_stress\n\nTotal metabolic burden imposed by acidosis, catabolism, electrolyte instability, and impaired substrate handling.\n\nphysiologic_buffer\n\nRemaining reserve available to absorb metabolic insult without organ-level cascade.\n\nresponse_lag\n\nDelay in correcting metabolic derangement or restoring systemic stability.\n\norgan_coupling\n\nDegree to which metabolic instability is propagating into coordinated failure across organs.\n\nperfusion_stability\n\nCurrent ability to maintain pressure-flow coherence across tissues despite worsening metabolic and organ stress.\n\n# Clinical variable mapping\n\n

Evaluation

Metrics

Label  Accuracy
all    0.8333

Uses

Direct Use for Inference

First install the SetFit library:

pip install setfit

Then you can load this model and run inference.

from setfit import SetFitModel

# Download from the 🤗 Hub
model = SetFitModel.from_pretrained("davanstrien/setfit-hf-dataset-domain-v0")
# Run inference
preds = model("# Dataset Card for The Wilds Bioacoustics Monitors

This dataset contains passive acoustic recordings collected at [The Wilds safari park](https://www.thewilds.org/) in Ohio during Summer 2025. 
Recorders captured ambient soundscapes to support ecological monitoring, animal behavior analysis, and acoustic biodiversity modeling.

## Dataset Details

### Dataset Description

- **Curated by:** Tanishka Wani, Vedant Patil, Rugved Katole, Bharath Pillai, Anirudh Potlapally, Ally Bonney, and Jenna Kline
- **Repository:** [https://github.com/Imageomics/naturelab](https://github.com/Imageomics/naturelab)  
- **Paper:** [SmartWilds: Multimodal Wildlife Monitoring Dataset](https://arxiv.org/abs/2509.18894)

This dataset was created to support multimodal wildlife monitoring research using passive acoustic monitoring. Bioacoustic data were collected using Wildlife Acoustics Song Meter devices deployed across four field sites at The Wilds. The recordings capture natural soundscapes including wildlife vocalizations, environmental sounds, and ambient audio that can be used for species detection, behavioral analysis, and biodiversity assessment.

### Supported Tasks and Leaderboards

- **Audio Classification:** Species identification from acoustic recordings
- **Sound Event Detection:** Detection and localization of animal vocalizations
- **Biodiversity Assessment:** Acoustic diversity indices and community analysis
- **Behavioral Analysis:** Temporal activity patterns and acoustic behavior studies
- **Soundscape Ecology:** Environmental audio analysis and habitat characterization

[No benchmarks currently available]

## Dataset Structure

The dataset is organized hierarchically by site and deployment session:

/dataset/
    bioacoustic.txt
    The_Wilds_Bioacoustic_Log2025-06-30_21_54_59.csv
    The_Wilds_Bioacoustic_Log2025-07-04_20_18_38.csv
    TW05-SM01/
        metadata.md
        SD01_20250630_20250703/
            SM001_20250630_195900.wav
            SM001_20250630_200402.wav
            SM001_20250630_200902.wav
            ...
            SM001_20250703_064902.wav
            SM001_20250703_065402.wav
            SM001_20250703_065902.wav
    TW06-SM03/
        metadata.md
        SD03_20250630_20250703/
            SM03_20250630_140000.wav
            SM03_20250630_150000.wav
            SM03_20250630_160000.wav
            SM03_20250630_170000.wav
            ...
            SM03_20250703_140000.wav
            SM03_20250703_150000.wav
            SM03_20250703_160000.wav
    TW07-SM02/
        metadata.md
        SD02_20250630_20250703/
            SM002_20250630_195900.wav
            SM002_20250630_205902.wav
            SM002_20250701_050300.wav
            ...
            SM002_20250702_205902.wav
            SM002_20250703_050400.wav
            SM002_20250703_060402.wav
    TW08-SM04/
        metadata.md
        SD04_20250630_20250703/
            SM04_20250630_120000.wav
            SM04_20250630_130000.wav
            SM04_20250630_140000.wav
            ...
            SM04_20250703_150000.wav
            SM04_20250703_160000.wav
            SM04_20250703_170000.wav


### Data Instances

Each bioacoustic deployment folder contains:
- **Audio files:** .wav format recordings captured by scheduled recording
- **Metadata file:** `metadata.md` with deployment information and recorder settings

**File Counts by Recorder:**
- **TW05-SM01:** 144 audio files (.wav recordings)
- **TW06-SM03:** 75 audio files (.wav recordings)
- **TW07-SM02:** 12 audio files (.wav recordings)
- **TW08-SM04:** 78 audio files (.wav recordings)

**Audio File Specifications:**
- **Format:** .wav (uncompressed)
- **Channels:** Mono
- **Bit depth:** 16-bit
- **Sample rate:** 48 kHz
- **Duration:** Variable based on recording schedule

**Filename Conventions:**
- **SM001/SM03/SM04 series:** SM0##_YYYYMMDD_HHMMSS.wav (TW05-SM01, TW06-SM03, TW08-SM04)
- **SM002 series:** SM002_YYYYMMDD_HHMMSS.wav (TW07-SM02)

**Total Dataset Size:** 311 audio files across all bioacoustic monitor deployments.

Each .wav file is a field recording captured according to programmed recording schedules. File names include timestamps indicating the start time of each recording session.

### Data Fields

**metadata.md** (found in each recorder deployment folder):
- **Recorder ID:** Unique device identifier (SM01, SM02, SM03, SM04)
- **Device Model:** Song Meter model name (e.g., Song Meter Micro 2)
- **Device Serial Number:** Manufacturer-assigned serial number
- **Site ID:** Location code where deployed (TW05, TW06, TW07, TW08)
- **Deployment Location Description:** Text description of exact location and surroundings
- **GPS Coordinates:** Latitude and longitude in decimal format
- **Deployment Date and Time:** Recorder deployment timestamp (YYYY-MM-DD HH:MM format)
- **Retrieval Date and Time:** Recorder retrieval timestamp (YYYY-MM-DD HH:MM format)
- **Orientation / Microphone Facing:** Direction and environmental considerations (e.g., "East, away from wind and road")
- **Mounting Height:** Approximate height of microphone from ground in meters
- **Recording Schedule Preset:** Schedule or settings used for recording (e.g., "1 hour at sunrise and sunset")
- **Time Zone Set on Device:** Local time zone configured (e.g., "USA Eastern (UTC-5)")
- **Maintenance Notes:** Issues, configuration changes, or deviations from standard settings
- **Observer:** Name or initials of person completing metadata

**CSV Log Files:**
- `The_Wilds_Bioacoustic_Log2025-06-30_21_54_59.csv`: Deployment log from June 30, 2025
- `The_Wilds_Bioacoustic_Log2025-07-04_20_18_38.csv`: Retrieval log from July 4, 2025

### Data Splits

This dataset has no predefined training/validation/test splits. Data are organized by site (TW05-TW08) and deployment session. Users may create their own splits based on:
- **Temporal splits:** Using recording timestamps across the deployment period
- **Spatial splits:** Using different site locations (TW05, TW06, TW07, TW08)
- **Recorder-based splits:** Using different Song Meter devices (SM01, SM02, SM03, SM04)

Recommended approach depends on modeling goals and research questions.

## Dataset Creation

### Curation Rationale

This dataset supports biodiversity monitoring, behavioral ecology research, and the development of automated species detection and classification models from passive acoustic recordings. Bioacoustic monitoring provides complementary data to camera trap surveys and enables detection of cryptic or nocturnal species that may be missed by visual methods.

### Source Data

#### Data Collection and Processing

Recordings were collected at The Wilds safari park during summer 2025 using Wildlife Acoustics Song Meter devices. Four recorders (SM01-SM04) were strategically deployed at sites TW05-TW08 from June 30 to July 3, 2025. 

Devices were programmed for scheduled recordings with different sampling strategies across sites. Recorders were mounted on trees or posts at appropriate heights and orientations to minimize wind noise and maximize acoustic detection. Upon retrieval, audio files were organized by deployment session and basic metadata were recorded. No audio processing, filtering, or annotation was applied to preserve the raw acoustic data.

#### Who are the source data producers?

The dataset was collected and curated by researchers and students from the Imageomics Institute and Ohio State University in collaboration with conservation staff at The Wilds safari park in Ohio.

### Annotations

#### Annotation process

No species identification or acoustic annotations are currently provided with this initial dataset release. Manual and AI-assisted labeling efforts for species detection, vocalization classification, and acoustic event annotation are planned for future versions.

#### Who are the annotators?

N/A - annotations will be added in future releases

### Personal and Sensitive Information

The dataset includes GPS coordinates within The Wilds, a public conservation ")
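A `SetFitModel` can also be called on a list of texts, returning one label per input. The helper below is a hypothetical convenience function (not part of the SetFit API) that tallies predicted domain labels over a batch of dataset-card excerpts; it accepts any callable classifier, so the logic can be exercised without downloading the checkpoint:

```python
from collections import Counter
from typing import Callable, Iterable, List

def domain_histogram(classify: Callable[[List[str]], Iterable], texts: Iterable[str]) -> Counter:
    """Count how often each domain label is predicted for a batch of
    dataset-card excerpts. `classify` may be a SetFitModel, which is
    callable on a list of strings and returns one label per input."""
    preds = classify(list(texts))
    return Counter(str(p) for p in preds)

# With the real checkpoint (requires network access):
#   from setfit import SetFitModel
#   model = SetFitModel.from_pretrained("davanstrien/setfit-hf-dataset-domain-v0")
#   print(domain_histogram(model, list_of_card_texts))
```

Because the helper only depends on a callable, it works equally well with `model.predict` or with a stub during testing.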

Training Details

Training Set Metrics

Training set   Min   Median     Max
Word count     2     400.3986   4498
Label Training Sample Count
biology 149
chemistry 89
climate 135
code 200
cybersecurity 200
finance 200
legal 200
math 185
medical 200

Training Hyperparameters

  • batch_size: (32, 32)
  • num_epochs: (1, 1)
  • max_steps: -1
  • sampling_strategy: oversampling
  • num_iterations: 5
  • body_learning_rate: (2e-05, 1e-05)
  • head_learning_rate: 0.01
  • loss: CosineSimilarityLoss
  • distance_metric: cosine_distance
  • margin: 0.25
  • end_to_end: False
  • use_amp: False
  • warmup_proportion: 0.1
  • l2_weight: 0.01
  • seed: 42
  • eval_max_steps: -1
  • load_best_model_at_end: False
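For readers wanting to train a similar classifier, the hyperparameters above correspond to fields of SetFit's `TrainingArguments`, where tuple-valued options configure the (embedding body, classification head) training phases separately. The sketch below is a hedged reconstruction of that configuration, not the exact training script used for this model; the base checkpoint and `train_ds` are placeholders, since the card does not state them:

```python
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, Trainer, TrainingArguments

# Hedged reconstruction of the hyperparameters listed above. Tuple values
# apply to the (sentence-transformer body, classifier head) phases.
args = TrainingArguments(
    batch_size=(32, 32),
    num_epochs=(1, 1),
    sampling_strategy="oversampling",
    num_iterations=5,
    body_learning_rate=(2e-05, 1e-05),
    head_learning_rate=0.01,
    loss=CosineSimilarityLoss,
    warmup_proportion=0.1,
    l2_weight=0.01,
    seed=42,
)

# Placeholder base model and dataset -- the card does not specify either.
model = SetFitModel.from_pretrained("BASE_SENTENCE_TRANSFORMER_CHECKPOINT")
trainer = Trainer(model=model, args=args, train_dataset=train_ds)
trainer.train()
```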

Training Results

Epoch Step Training Loss Validation Loss
0.0021 1 0.2723 -
0.1027 50 0.2194 -
0.2053 100 0.1241 -
0.3080 150 0.0837 -
0.4107 200 0.0693 -
0.5133 250 0.0579 -
0.6160 300 0.0501 -
0.7187 350 0.0443 -
0.8214 400 0.0415 -
0.9240 450 0.0394 -

Framework Versions

  • Python: 3.12.12
  • SetFit: 1.1.3
  • Sentence Transformers: 5.3.0
  • Transformers: 4.50.3
  • PyTorch: 2.11.0+cu130
  • Datasets: 4.8.4
  • Tokenizers: 0.21.4

Citation

BibTeX

@article{https://doi.org/10.48550/arxiv.2209.11055,
    doi = {10.48550/ARXIV.2209.11055},
    url = {https://arxiv.org/abs/2209.11055},
    author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
    keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
    title = {Efficient Few-Shot Learning Without Prompts},
    publisher = {arXiv},
    year = {2022},
    copyright = {Creative Commons Attribution 4.0 International}
}