We Wanted to Train a Model on Antibody Sequences. We Ended Up Evaluating HuggingFace's New Storage Revolution.
We recognized a gap: OAS-unpaired, one of the most important public antibody sequence resources, is invaluable for research but not packaged for production-scale ML workflows.
We're releasing an open Hugging Face version of OAS-unpaired to make it easier for the broader research community to stream, query, and work with this data at scale, and ultimately to develop better drugs. The full corpus is now available for streaming, and per-study subsets let researchers load exactly the data they need without downloading billions of rows.
The dataset is available at: ConvergeBio/OAS-unpaired
In the process, we also got to evaluate Hugging Face's newer storage stack, Xet plus Parquet CDC, which makes updating very large Parquet datasets much more practical. That is what this post is about.
The Data: OAS Unpaired
Antibodies are Y-shaped proteins produced by B cells, each shaped to recognize and bind a specific molecular target. This specificity has made them the backbone of modern therapeutics, with five of the ten top-selling drugs worldwide being antibody-based. Over the past decade, next-generation sequencing (NGS) has let researchers read the genetic code of millions of antibodies from a single blood sample. These sequences have become the training data for a new generation of machine learning models. IgBert, IgT5, AntiBERTy, and AbLang all learn patterns from massive antibody datasets to enable computational drug design.
The Observed Antibody Space (OAS) is the largest public database of cleaned, annotated antibody sequences from next-generation sequencing studies, maintained by the Oxford Protein Informatics Group (OPIG) at the University of Oxford [1][2]. It holds billions of sequences across 88 studies covering multiple species (human, mouse, rhesus macaque, and others), isotypes, and disease contexts.
OPIG provides a web-based search interface that lets you filter sequences by species, disease, B cell type, tissue source, isotype, vaccine, and more. You can also submit a sequence and retrieve up to 1,000 matches with the same V and J germline genes. For targeted exploration of individual studies or specific subsets, this works well.
The underlying data, however, is distributed as thousands of individual .csv.gz files on their web server, one file per study, chain type (heavy or light), and isotype combination. There's no programmatic API, no column selection, no way to run a query across the full dataset at once. If you want to work with all 2.43 billion sequences, you need to download and parse 15,568 separate files.
The Setup
For readers less familiar with data formats: OAS uses CSV (Comma-Separated Values), a plain-text format where each row is a line and columns are separated by commas. It's the paper spreadsheet of the data world: simple, universal, and deeply inefficient at scale.
We converted everything to Parquet, a columnar binary format that organizes data by column rather than by row. CSV reads data row by row, like scrolling through a million-row spreadsheet to sum a single column. Parquet stores each column independently, so you can read only the columns you need without downloading the full dataset.
Parquet also compresses dramatically better. A column of V-gene names (mostly values like IGHV3-30 and IGHV1-69) compresses far more efficiently than a CSV row mixing gene names, amino acid sequences, quality scores, and metadata together.
The result: 4,030 Parquet files in two configs: heavy, with 2.07 billion sequences across 3,420 shards, and light, with 357 million sequences across 610 shards. Total size on disk: 2.45 TB, with 114 columns per shard (99 AIRR fields, 13 metadata columns, and 2 hash columns).
Now, anyone can stream the entire OAS unpaired database without a full download:
from datasets import load_dataset
ds = load_dataset("ConvergeBio/oas-unpaired", "heavy", split="train", streaming=True)
The dataset is organized so that each of the 88 studies is its own subset within the heavy and light configs. This means you can also load a single study directly, useful if you're working on a specific organism, disease, or experimental context:
# Load just one study
ds = load_dataset("ConvergeBio/oas-unpaired", "Briney et al., 2019",
                  split="heavy", streaming=True)
Or query across the entire database with SQL using DuckDB's HuggingFace integration [3]. No download needed, just HTTP range requests against remote Parquet:
-- Most common V-genes across all heavy chain sequences
SELECT v_call, COUNT(*)
FROM 'hf://datasets/ConvergeBio/oas-unpaired/data/unpaired_heavy/**/*.parquet'
GROUP BY v_call ORDER BY COUNT(*) DESC LIMIT 10;
-- Or drill into a single study
SELECT COUNT(*)
FROM 'hf://datasets/ConvergeBio/oas-unpaired/data/unpaired_heavy/Bolland et al., 2016/*.parquet';
So far so good. The interesting part came next.
Adding a Column
With the dataset uploaded, we wanted to assess its redundancy and perform basic exploratory data analysis (EDA) before training. Are the same sequences showing up across studies? How many unique sequences are there, really?
You can't just GROUP BY a variable-length amino acid string across 2.43 billion rows. That's expensive and slow. So we wanted to add a precomputed hash: take the sequence_alignment_aa column, run it through xxHash [4] to get a 128-bit digest, and store that as two new columns (aa_hash_hi and aa_hash_lo, the high and low 64-bit halves).
For readers unfamiliar with hashing: imagine you have a library of a billion books and need to find duplicates. Comparing every book's full text to every other book would take an astronomical number of operations. But if you stamp each book with a unique short fingerprint, a code derived deterministically from its content, then finding duplicates becomes trivial: just look for matching stamps. Two identical books always produce the same fingerprint. Two different books, with a good hash function, almost always produce different ones; collisions are vanishingly rare. That's hashing.
With the hash materialized in the dataset, any duplicate query becomes a GROUP BY on two integers. Fast, cheap, and you can do it remotely via DuckDB without downloading the full dataset.
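A minimal sketch of the scheme (MD5 from the stdlib stands in for xxh128 here purely because it also yields a 128-bit digest to split; the dataset itself uses xxHash):

```python
import hashlib
from collections import Counter

def hash128_hi_lo(seq: str) -> tuple[int, int]:
    # 128-bit digest split into high and low 64-bit halves,
    # mirroring the aa_hash_hi / aa_hash_lo columns
    digest = int.from_bytes(hashlib.md5(seq.encode()).digest(), "big")
    return digest >> 64, digest & 0xFFFFFFFFFFFFFFFF

# Toy amino acid sequences; the first and last are duplicates
seqs = ["EVQLVESGGGLVQ", "QVQLQQSGAELAR", "EVQLVESGGGLVQ"]

# The "GROUP BY on two integers": count occurrences of each (hi, lo) pair
counts = Counter(hash128_hi_lo(s) for s in seqs)
n_duplicates = sum(c - 1 for c in counts.values())
assert n_duplicates == 1
```

Grouping on two fixed-width integers is far cheaper than grouping on variable-length strings, which is the whole point of materializing the hash.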
The Problem: Git LFS
Under Git LFS, which was HuggingFace Hub's storage backend until the Xet migration began in early 2025, uploading a modified file means uploading the entire file. LFS works at file granularity. It doesn't know or care that 112 out of 114 columns are identical to what's already stored. Every byte of every shard would need to be re-transferred.
That's 2.19 TB of upload to add 16 bytes per row. At 100 Mbps, roughly two days of continuous transfer for two columns that represent about 2.8% of each file.
Two Layers of Storage Innovation by Hugging Face
The solution involves two technologies that must work together: one at the storage layer and one at the file format layer.
Layer 1: Xet (Storage)
In August 2024, HuggingFace acquired XetHub [5], a Seattle-based startup founded by ex-Apple ML infrastructure engineers. By early 2025, Xet was deployed as the Hub's new storage backend, and by May 2025 it became the default for all new repositories [6][7].
The key idea: instead of storing files as opaque blobs (like Git LFS does), Xet splits every file into variable-sized chunks of about 64 KB using a GearHash rolling hash. Each chunk gets hashed (Blake3) and stored in a global content-addressable store. When you upload a modified file, only the chunks that actually changed get transferred [8][9].
To give a sense of scale: as of October 2025, HuggingFace has migrated over 6 million repositories totaling 77+ petabytes from LFS to Xet [10].
Why Xet Alone Isn't Enough
You'd think byte-level chunking would solve the Parquet problem, but it doesn't. Here's why.
Parquet is a columnar format: data is organized into row groups, and within each row group, each column is written as a series of compressed data pages. The critical detail is that standard Parquet writers determine page boundaries by counting rows: every N values, a new page starts.
When you add a column to a Parquet file, the row group metadata changes and byte offsets shift; even though the actual column data is identical, the compression now operates on pages that start at slightly different positions. Different input positions produce different compressed bytes, and Xet's chunker sees different bytes and treats them as entirely new data.
We benchmarked this on a single 210 MB shard with 247K rows. Without any special handling, adding the hash columns resulted in only 3.9% chunk overlap between the original and modified files. Xet could deduplicate almost nothing.
Layer 2: Parquet CDC (File Format)
The fix is Parquet Content-Defined Chunking, a feature contributed to Apache Arrow by Krisztian Szucs, merged in May 2025, released in Apache Arrow 21.0.0 in July 2025 [11][12][13].
Instead of splitting data pages at fixed row counts, Parquet CDC uses a rolling hash (same GearHash family) over the actual column values to decide where to cut pages. Identical data sequences produce identical compressed pages regardless of their position in the file. When you add a column, the existing columns' page boundaries don't shift, because they're determined by content, not by position.
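To see why content-defined cut points survive shifts while fixed-size ones don't, here's a toy chunker in the GearHash style. This is a simplified sketch, not the actual Xet or Arrow implementation; the table, mask, and chunk sizes are illustrative:

```python
import hashlib
import random

random.seed(42)
GEAR = [random.getrandbits(64) for _ in range(256)]  # random table, one entry per byte value
MASK = (1 << 6) - 1  # cut when the low 6 bits are zero -> ~64-byte average chunks

def cdc_chunks(data: bytes) -> list[str]:
    """Cut at positions chosen by a rolling hash of the content."""
    chunks, h, start = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF  # GearHash-style update
        if h & MASK == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    chunks.append(data[start:])
    return [hashlib.sha256(c).hexdigest() for c in chunks]

def fixed_chunks(data: bytes, size: int = 64) -> list[str]:
    """Cut every `size` bytes, regardless of content."""
    return [hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]

original = bytes(random.randrange(256) for _ in range(4096))
modified = b"XX" + original  # prepend two bytes: every byte offset shifts

fixed_overlap = len(set(fixed_chunks(original)) & set(fixed_chunks(modified)))
cdc_overlap = len(set(cdc_chunks(original)) & set(cdc_chunks(modified)))
assert fixed_overlap == 0  # every fixed chunk changed
assert cdc_overlap > 10    # almost all content-defined chunks survived
```

The fixed chunker loses everything after a two-byte shift; the content-defined chunker realigns within one chunk and keeps nearly all of them. The same principle drives both Xet's storage chunks and Parquet CDC's page cuts.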
With datasets>=4.0.0 (which pins pyarrow>=21.0.0 for Parquet CDC support), this is enabled by default:
from datasets import Dataset, load_dataset
# 1. Load your CSV and push to Hub
ds = Dataset.from_csv("sequences.csv")
ds.push_to_hub("my-org/my-dataset")
# 2. Later: load from Hub, add a column, push again
ds = load_dataset("my-org/my-dataset", split="train")
ds = ds.map(lambda x: {"seq_length": len(x["sequence"])})
ds.push_to_hub("my-org/my-dataset")
# Only the new column's chunks are transferred
The Result
With both layers working together, Parquet CDC at the format level and XET at the storage level, our benchmark showed 98.8% chunk overlap. Only the new column data and some metadata were actually uploaded.
What would have been a multi-day upload became a lunch break.
Here we used this to add hash columns, but if you want to add any new feature, embedding, or tokenization to a large dataset, this will save you a lot of time.
Finding Duplicates
With hashes computed, we could finally quantify how redundant OAS actually is. Redundant for our purposes, that is: we care about amino acid sequences, not nucleotides.
Why duplicates exist
Every protein in nature is encoded by a gene, a stretch of DNA. The cell reads this DNA and translates it into a chain of amino acids. It's the amino acid sequence that determines what the protein does, and it's what we train our models on.
Translation works in triplets: every three nucleotides in the DNA (called a codon) map to one amino acid. There are 64 possible codons but only 20 amino acids, so multiple codons map to the same amino acid. For example, four different codons (GCT, GCC, GCA, GCG) all produce the amino acid Alanine.
This means two DNA sequences can look completely different at the nucleotide level but encode the exact same protein. OAS stores nucleotide sequences, but we train on amino acid sequences. At that level, these are duplicates.
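A toy translation sketch makes this concrete. The codon table below is a tiny excerpt of the standard genetic code, just enough for the example:

```python
# Tiny excerpt of the standard genetic code (DNA codon -> one-letter amino acid)
CODON_TO_AA = {
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",  # four codons for Alanine
    "TGG": "W",                                      # Tryptophan has only one
    "AAA": "K", "AAG": "K",                          # two codons for Lysine
}

def translate(dna: str) -> str:
    """Read the DNA three nucleotides at a time, mapping each codon to an amino acid."""
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna), 3))

# Different at two nucleotide positions, yet the same protein
assert translate("GCTTGGAAA") == "AWK"
assert translate("GCGTGGAAG") == "AWK"
```

At the amino acid level, where our models operate, those two DNA sequences are one and the same record.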
Antibodies specifically have additional sources of duplication. When the immune system finds an antibody that works, it amplifies it: the B cell that produces it divides rapidly, creating many copies of the same sequence. This is called clonal expansion. On top of that, sequencing the same blood sample across different experimental setups (targeting different antibody classes, such as IgG, IgM, or IgA) will capture the same underlying sequence multiple times. And across the 88 studies that make up OAS, overlapping patient populations and shared cell lines contribute further.
The Numbers
We ran duplicate detection across all 2.43 billion sequences and per study using the precomputed hash columns, a GROUP BY on two uint64 values, executed remotely via DuckDB against the HuggingFace-hosted Parquet files.
The full table is available at the dataset page on Hugging Face.
What We Learned
The toolchain is ready but barely documented. Parquet CDC shipped in Arrow 21.0.0 (July 2025), Xet became the default on HuggingFace in May 2025. The combination works, but figuring out the right approach required reading Apache Arrow PRs and HuggingFace engineering blog posts. There isn't a "how to efficiently update a large Parquet dataset" guide yet. Consider this a start.
AI coding assistants don't know this stack yet. Even Claude kept defaulting to the old patterns: download the full dataset, process locally, re-upload the whole thing. Getting them to use hf:// URIs or leverage DuckDB's remote Parquet support required explicit correction every time. You need to know what to ask for.
DuckDB + HuggingFace is underappreciated. SQL-querying 2.43 billion rows directly on HuggingFace, with no download and no local database, changes how you think about data exploration. Our duplicate detection query runs against remote Parquet shards. It's not fast, but it works, and you don't need to provision anything [3].
Content-defined chunking is a principle, not a single technology. The same idea, using content rather than position to determine boundaries, appears at three levels in this project: PyArrow uses it to decide Parquet page boundaries from column values, Xet uses it to decide storage chunk boundaries from file bytes, and our hash columns use it for application-level deduplication. Each layer solves a different problem with the same underlying insight.
Open data infrastructure matters as much as the science. The sequences in OAS were always open, but accessing them at scale required days of engineering before any science could begin. Converting to Parquet, hosting on HuggingFace, and validating every row is what turns "technically available" into "actually usable."
The dataset is open access. We're releasing it for the community, for ML researchers training antibody language models, for biologists querying specific studies or species, and for drug discovery teams building reproducible pipelines. The whole point of open data is what the community builds with it.
The side quest was worth it. We set out to prepare training data for our antibody language models. We ended up with the largest publicly accessible copy of OAS unpaired on HuggingFace, validated against every original source file, and along the way got to benchmark HuggingFace's new Xet storage and Parquet CDC, infrastructure that makes maintaining multi-terabyte biological datasets practical without blocking out a weekend every time you want to add a column.
References
[1] Kovaltsuk A, Leem J, Kelm S, Snowden J, Deane CM, Krawczyk K. "Observed Antibody Space: A Resource for Data Mining Next-Generation Sequencing of Antibody Repertoires." J Immunol. 2018; 201(8):2502-2509. PubMed
[2] Olsen TH, Boyles F, Deane CM. "Observed Antibody Space: A diverse database of cleaned, annotated, and translated unpaired and paired antibody sequences." Protein Science. 2022; 31(1):141-146. DOI
[3] "Access 150k+ Datasets from Hugging Face with DuckDB." DuckDB Blog, May 29, 2024. duckdb.org
[4] Yann Collet. xxHash, Extremely fast non-cryptographic hash algorithm. xxhash.com / GitHub
[5] "XetHub is joining Hugging Face!" HuggingFace Blog, August 8, 2024. huggingface.co/blog/xethub-joins-hf
[6] "From Files to Chunks: Improving HF Storage Efficiency." HuggingFace Blog, November 20, 2024. huggingface.co/blog/from-files-to-chunks
[7] "Xet is on the Hub." HuggingFace Blog, March 18, 2025. huggingface.co/blog/xet-on-the-hub
[8] "From Chunks to Blocks: Accelerating Uploads and Downloads on the Hub." HuggingFace Blog, February 12, 2025. huggingface.co/blog/from-chunks-to-blocks
[9] Xet Protocol Specification v1.0.0, Content-Defined Chunking. huggingface.co/docs/xet/en/chunking
[10] "huggingface_hub v1.0." HuggingFace Blog, October 27, 2025. huggingface.co/blog/huggingface-hub-v1
[11] "Parquet Content-Defined Chunking." HuggingFace Blog, July 25, 2025. huggingface.co/blog/parquet-cdc
[12] Apache Arrow PR #45360, Content-Defined Chunking for Parquet. Created January 27, 2025; merged May 13, 2025. github.com/apache/arrow/pull/45360
[13] Apache Arrow 21.0.0 Release. July 17, 2025. arrow.apache.org/blog/2025/07/17/21.0.0-release/
[14] "Improving Parquet Dedupe on Hugging Face Hub." HuggingFace Blog, October 5, 2024. huggingface.co/blog/improve_parquet_dedupe
[15] "Migrating the Hub from Git LFS to Xet." HuggingFace Blog, July 15, 2025. huggingface.co/blog/migrating-the-hub-to-xet
[16] OAS, Observed Antibody Space. OPIG, University of Oxford. opig.stats.ox.ac.uk/webapps/oas/
[17] HuggingFace Datasets, Streaming. huggingface.co/docs/datasets/en/stream