BigScience Biomedical Datasets

non-profit

Activity Feed Request to join this org

AI & ML interests

We aim to unify the schema across many different biomedical NLP resources.

Recent Activity

minwoosun authored a paper 5 days ago

Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development

Abhijnan authored a paper 13 days ago

CRAFT: Grounded Multi-Agent Coordination Under Partial Information

MaziyarPanahi authored a paper about 2 months ago

Arcee Trinity Large Technical Report

View all activity

minwoosun

authored a paper 5 days ago

Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development

Paper • 2603.27460 • Published 15 days ago • 67

MaziyarPanahi

posted an update 12 days ago

Post

993

Training mRNA Language Models Across 25 Species for $165

We built an end-to-end protein AI pipeline covering structure prediction, sequence design, and codon optimization. After comparing multiple transformer architectures for codon-level language modeling, CodonRoBERTa-large-v2 emerged as the clear winner with a perplexity of 4.10 and a Spearman CAI correlation of 0.40, significantly outperforming ModernBERT. We then scaled to 25 species, trained 4 production models in 55 GPU-hours, and built a species-conditioned system that no other open-source project offers. Complete results, architectural decisions, and runnable code below.

https://huggingface.co/blog/OpenMed/training-mrna-models-25-species

Abhijnan

authored a paper 13 days ago

CRAFT: Grounded Multi-Agent Coordination Under Partial Information

Paper • 2603.25268 • Published 17 days ago

MaziyarPanahi

posted an update 19 days ago

Post

2198

We annotated 119K medical images with two frontier VLMs (Qwen 3.5, Kimi K2.5), cross-validated at 93% agreement, and produced 110K training records, all for under $500. Fine-tuning 3 small models (2-3B params) improved all benchmarks: best model reaches +15.0% average exact match.

Everything is open-sourced: datasets, adapters, and code.

https://huggingface.co/blog/OpenMed/synthvision

2 replies

Nymbo

posted an update 28 days ago

Post

6521

We should really have a release date range slider on the /models page. Tired of "trending/most downloaded" being the best way to sort and still seeing models from 2023 on the first page just because they're embedded in enterprise pipelines and get downloaded repeatedly. "Recently Created/Recently Updated" don't solve the discovery problem considering the amount of noise to sift through.

Slight caveat: Trending actually does have some recency bias, but it's not strong/precise enough.

3 replies

MaziyarPanahi

posted an update about 1 month ago

Post

4806

DNA, mRNA, proteins, AI. I spent the last year going deep into computational biology as an ML engineer. This is Part I of what I found. 🧬

In 2024, AlphaFold won the Nobel Prize in Chemistry.

By 2026, the open-source community had built alternatives that outperform it.

That's the story I find most interesting about protein AI right now. Not just the science (which is incredible), but the speed at which open-source caught up. Multiple teams, independently, reproduced and then exceeded AlphaFold 3's accuracy with permissive licenses. The field went from prediction to generation: we're not just modeling known proteins anymore, we're designing new ones.

I spent months mapping this landscape for ML engineers. What the architectures actually are (spoiler: transformers and diffusion models), which tools to use for what, and which ones you can actually ship commercially.

New post on the Hugging Face blog: https://huggingface.co/blog/MaziyarPanahi/protein-ai-landscape

Hope you all enjoy! 🤗

2 replies

albertvillanova

posted an update about 2 months ago

Post

2315

🚀 TRL v0.29.0 introduces trl-training: an agent-native training skill.

This makes the TRL CLI a structured, agent-readable capability, allowing AI agents to reliably execute training workflows such as:
- Supervised Fine-Tuning (SFT)
- Direct Preference Optimization (DPO)
- Group Relative Policy Optimization (GRPO)

We’re excited to see what the community builds on top of this.

If you’re working on AI agents, alignment research, or scalable RL training infrastructure: give TRL v0.29.0 a try! 🤗

The future of ML tooling is agent-native.
🔗 https://github.com/huggingface/trl/releases/tag/v0.29.0

Tonic

posted an update about 2 months ago

Post

3537

🤔 Who would win ?

- a fully subsidized ai lab
OR
- 3 random students named

kurakurai ?

demo : Tonic/fr-on-device

if you like it give the demo a little star and send a shoutout to : @MaxLSB @jddqd and @GAD-cell for absolutely obliterating the pareto frontier of the french language understanding .

4 replies

MaziyarPanahi

authored a paper about 2 months ago

Arcee Trinity Large Technical Report

Paper • 2602.17004 • Published Feb 19 • 19

mariagrandury

authored 2 papers about 2 months ago

BabyBabelLM: A Multilingual Benchmark of Developmentally Plausible Training Data

Paper • 2510.10159 • Published Oct 11, 2025 • 3

Measuring what Matters: Construct Validity in Large Language Model Benchmarks

Paper • 2511.04703 • Published Nov 3, 2025 • 8

Tonic

posted an update about 2 months ago

Post

3353

🙋🏻‍♂️hello my lovelies ,

it is with great pleasure i present to you my working one-click deploy 16GB ram completely free huggingface spaces deployment.

repo : Tonic/hugging-claw (use git clone to inspect)
literally the one-click link : Tonic/hugging-claw

you can also run it locally and see for yourself :

docker run -it -p 7860:7860 --platform=linux/amd64 \
-e HF_TOKEN="YOUR_VALUE_HERE" \
-e OPENCLAW_GATEWAY_TRUSTED_PROXIES="YOUR_VALUE_HERE" \
-e OPENCLAW_GATEWAY_PASSWORD="YOUR_VALUE_HERE" \
-e OPENCLAW_CONTROL_UI_ALLOWED_ORIGINS="YOUR_VALUE_HERE" \
registry.hf.space/tonic-hugging-claw:latest

just a few quite minor details i'll take care of but i wanted to share here first

2 replies

albertvillanova

posted an update 2 months ago

Post

1842

5 years already working in democratizing AI 🤗
Grateful to be part of such an awesome team making it happen every day.

MaziyarPanahi

posted an update 2 months ago

Post

2365

Announcing: OpenMed Multilingual PII Detection Models

Today I am releasing 105 open-source models for Personally Identifiable Information (PII) detection in French, German, and Italian.

All Apache 2.0 licensed. Free for commercial use. No restrictions.

Performance:

- French: 97.97% F1 (top model)
- German: 97.61% F1 (top model)
- Italian: 97.28% F1 (top model)

All top-10 models per language exceed 96% F1

Coverage:

55+ PII entity types per language
Native ID formats: NSS (French), Sozialversicherungsnummer (German), Codice Fiscale (Italian)
Language-specific address, phone, and name patterns

Training Data:

French: 49,580 samples
German: 42,250 samples
Italian: 40,944 samples

Why Multilingual?

European healthcare operates in European languages. Clinical notes, patient records, and medical documents are generated in French, German, Italian, and other languages.

Effective de-identification requires:

- Native language understanding — not translation
- Local ID format recognition — each country has unique patterns
- Cultural context awareness — names, addresses, and formats vary
- These models deliver production-ready accuracy without requiring data to leave your infrastructure or language.

HIPAA & GDPR Compliance
Built for US and European privacy regulations:

- On-premise deployment: Process data locally with zero external dependencies
- Data sovereignty: No API calls, no cloud services, no cross-border transfers
- Air-gapped capable: Deploy in fully isolated environments if required
- Regulatory-grade accuracy: Supporting Expert Determination standards
- HIPAA and GDPR compliance across languages, without compliance gaps.

Use Cases
- Hospital EHR systems: Automated patient record de-identification
- Clinical research: Multilingual dataset preparation for studies
- Insurance companies: Claims processing across

https://huggingface.co/collections/OpenMed/multilingual-pii-and-de-identification

1 reply

MaziyarPanahi

posted an update 2 months ago

Post

1309

From Golden Gate Bridge to Broken JSON: Why Anthropic's SAE Steering Fails for Structured Output

I ran 6 experiments trying to use Anthropic's SAE steering for JSON generation.

- Base model: 86.8% valid JSON
- Steering only: 24.4%
- Fine-tuned: 96.6%
- FSM constrained: 100%

Steering is for semantics, not syntax.

https://huggingface.co/blog/MaziyarPanahi/sae-steering-json

MaziyarPanahi

posted an update 2 months ago

Post

4057

🚨 Day 8/8: OpenMed Medical Reasoning Dataset Release - THE GRAND FINALE

Today I complete my 8-day release series with Medical-Reasoning-SFT-Mega.
The largest open medical reasoning dataset, combining 7 state-of-the-art AI models with fair distribution deduplication.

THE 7 SOURCE MODELS (Original Sample Counts):

1. Trinity-Mini: 810,284 samples
2. Qwen3-Next-80B: 604,249 samples
3. GPT-OSS-120B: 506,150 samples
4. Nemotron-Nano-30B: 444,544 samples
5. GLM-4.5-Air: 225,179 samples
6. MiniMax-M2.1: 204,773 samples
7. Baichuan-M3-235B: 124,520 samples

TOTAL BEFORE DEDUPLICATION: 2,919,699 samples

TOKEN COUNTS:
- Content tokens: 2.22 Billion
- Reasoning tokens: 1.56 Billion
- Total tokens: 3.78 Billion
- Samples with chain-of-thought: 100%

Quick Start:

from datasets import load_dataset
ds = load_dataset("OpenMed/Medical-Reasoning-SFT-Mega")

6 replies

jmhb

submitted a paper to Daily Papers 2 months ago

PaperSearchQA: Learning to Search and Reason over Scientific Papers with RLVR

Paper • 2601.18207 • Published Jan 26 • 19

aryopg

authored 2 papers 2 months ago

Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs

Paper • 2512.05648 • Published Dec 5, 2025

The Hot Mess of AI: How Does Misalignment Scale With Model Intelligence and Task Complexity?

Paper • 2601.23045 • Published Jan 30

jmhb

authored a paper 2 months ago

PaperSearchQA: Learning to Search and Reason over Scientific Papers with RLVR

Paper • 2601.18207 • Published Jan 26 • 19

AI & ML interests

Recent Activity

Team members 87

bigbio's activity