Transcoda 59M Zero-Shot v1
End-to-end zero-shot Optical Music Recognition. A compact 59M-parameter
vision-encoder-decoder that turns raw score images directly into
**kern (Humdrum) symbolic
transcriptions. Trained from scratch on synthetic data only.
Paired benchmarks: btrkeks/verovio-synth-omr
(synthetic) and btrkeks/polish-scores
(real historical scans).
Quick load (transformers)
```python
from transformers import AutoModelForCausalLM, PreTrainedTokenizerFast

model = AutoModelForCausalLM.from_pretrained(
    "btrkeks/transcoda-59M-zeroshot-v1",
    trust_remote_code=True,
)
tokenizer = PreTrainedTokenizerFast.from_pretrained("btrkeks/transcoda-59M-zeroshot-v1")
```
trust_remote_code=True is required because the model uses a custom
vision-encoder-decoder architecture not yet in the transformers registry.
On first load the ConvNeXt-V2-Tiny base weights are re-downloaded from
facebook/convnextv2-tiny-22k-224
and then overwritten by the Transcoda checkpoint; this is expected.
Preprocessing
The model expects a single page image normalized to 1485 × 1050 pixels
(height × width) with float32 pixel values in [-1, 1]:
- Convert to RGB.
- Resize to width 1050 preserving aspect ratio (bilinear).
- If the resized height exceeds 1485, top-crop; otherwise white-pad at the bottom to reach exactly 1485.
- Scale to [0, 1] (divide by 255), then normalize to [-1, 1] (subtract 0.5, divide by 0.5 per channel).
- Pass as `pixel_values` of shape `(1, 3, 1485, 1050)` plus an `image_sizes` tensor `[[1485, 1050]]` to `model.generate(...)`.
The canonical implementation is preprocess_pil_image in
src/data/preprocessing.py
in the project repository.
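The resize, crop, and pad rules above reduce to a small amount of arithmetic. The following is a minimal sketch of that geometry in pure Python. `plan_geometry` and `normalize_pixel` are hypothetical helper names, not functions from the project repository, and the truncating `int(...)` rounding is an assumption carried over from the inference snippet in this card.

```python
def plan_geometry(width, height, target_w=1050, target_h=1485):
    """Return (resized_height, action, rows) for a source image of the given size.

    action is "crop" (rows removed from the bottom), "pad" (white rows added
    at the bottom), or "none" when the resized page already fits exactly.
    """
    # Height after resizing to the target width while preserving aspect ratio.
    new_h = max(1, int(height * (target_w / width)))
    if new_h > target_h:
        return new_h, "crop", new_h - target_h
    if new_h < target_h:
        return new_h, "pad", target_h - new_h
    return new_h, "none", 0


def normalize_pixel(value):
    """Map an 8-bit channel value (0..255) to the [-1, 1] range."""
    return (value / 255.0 - 0.5) / 0.5
```

Under this mapping the white padding value 255 becomes 1.0, so padded rows sit at the upper end of the input range.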
Full inference snippet
```python
import numpy as np
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, PreTrainedTokenizerFast

REPO = "btrkeks/transcoda-59M-zeroshot-v1"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained(REPO, trust_remote_code=True).to(device).eval()
tokenizer = PreTrainedTokenizerFast.from_pretrained(REPO)


def preprocess(path, target_w=1050, target_h=1485):
    img = Image.open(path).convert("RGB")
    # Resize to the target width, preserving aspect ratio.
    new_h = max(1, int(img.height * (target_w / img.width)))
    img = img.resize((target_w, new_h), Image.BILINEAR)
    arr = np.array(img)
    if arr.shape[0] > target_h:
        # Too tall: keep the top of the page.
        arr = arr[:target_h]
    elif arr.shape[0] < target_h:
        # Too short: pad with white rows at the bottom.
        pad = np.full((target_h - arr.shape[0], target_w, 3), 255, dtype=arr.dtype)
        arr = np.concatenate([arr, pad], axis=0)
    # HWC uint8 -> CHW float in [0, 1], then normalize to [-1, 1].
    t = torch.from_numpy(arr).permute(2, 0, 1).float() / 255.0
    t = (t - 0.5) / 0.5
    return t.unsqueeze(0)


pixel_values = preprocess("page.png").to(device)
image_sizes = torch.tensor([[1485, 1050]], device=device)

with torch.no_grad():
    out = model.generate(
        pixel_values=pixel_values,
        image_sizes=image_sizes,
        max_length=2048,
        do_sample=False,
        num_beams=1,
        repetition_penalty=1.1,
    )

print(tokenizer.decode(out[0], skip_special_tokens=True))
```
For grammar-constrained decoding (guarantees formally valid **kern) and
beam search, use the project repository's scripts/inference.py.
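Without grammar constraints, a quick structural look at the decoded string can catch obviously degenerate output. The sketch below classifies **kern records by their leading character, following standard Humdrum conventions (`!` for comments, `*` for interpretations, `=` for barlines, tab-separated spines). `summarize_kern` is a hypothetical helper, and it assumes the decoded text contains literal newlines and tabs. It is a rough sanity check only, not a substitute for the grammar-constrained path.

```python
def summarize_kern(text):
    """Count record types and observed spine widths in a **kern transcript."""
    counts = {"comments": 0, "interpretations": 0, "barlines": 0, "data": 0}
    spine_widths = set()
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("!"):
            # Comment records are skipped when measuring spine width.
            counts["comments"] += 1
            continue
        spine_widths.add(len(line.split("\t")))
        if line.startswith("*"):
            counts["interpretations"] += 1
        elif line.startswith("="):
            counts["barlines"] += 1
        else:
            counts["data"] += 1
    counts["spine_widths"] = sorted(spine_widths)
    return counts


# Tiny hand-written two-spine example, not model output.
sample = "**kern\t**kern\n*clefF4\t*clefG2\n=1\t=1\n4C\t4e\n4D\t4f\n=2\t=2\n*-\t*-\n"
summary = summarize_kern(sample)
```

A transcript with no barlines, or with many different spine widths and no spine split or merge records, is a hint that decoding went wrong and that the constrained path may be worth trying.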
Loading the raw .ckpt (paper reproducibility)
The original Lightning checkpoint is also distributed in this repo as
transcoda-59M-zeroshot-v1.ckpt. Load it from a clone of the
Transcoda repository:
```python
import torch
from src.model.checkpoint_loader import load_model_from_checkpoint

loaded = load_model_from_checkpoint("transcoda-59M-zeroshot-v1.ckpt", torch.device("cpu"))
model = loaded.model
```
This matches the inference path used for the paper's reported metrics.
Architecture
| Component | Spec |
|---|---|
| Encoder | ConvNeXt-V2-Tiny, pretrained on ImageNet-22k |
| Patch grid | 47 × 33 (from a 1485 × 1050 input) |
| Projector | 2-layer MLP, encoder dim 768 → decoder dim 512 |
| Positional encoding | 2D sinusoidal on the encoder grid, RoPE on the decoder |
| Decoder | 8-layer Transformer, d_model=512, dim_ff=1024, 8 heads |
| Vocab | 3,000-token BPE (**kern source) |
| Max seq len | 2,048 |
| Parameters | 58.8 M total |
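The patch grid in the table is consistent with ConvNeXt's overall 32× spatial downsampling if partial patches at the image border are rounded up. The sketch below checks that arithmetic. The ceiling rounding is an assumption that happens to reproduce the stated 47 × 33 grid; it is not confirmed against the model code, and `patch_grid` is a hypothetical helper.

```python
import math


def patch_grid(height=1485, width=1050, stride=32):
    """Encoder grid size, assuming border patches are rounded up."""
    rows = math.ceil(height / stride)
    cols = math.ceil(width / stride)
    return rows, cols


rows, cols = patch_grid()
# Number of encoder positions the decoder attends over under this assumption.
num_tokens = rows * cols
```

Under these assumptions the decoder cross-attends over 47 × 33 = 1,551 encoder positions per page.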
Training data
310,554 synthetic page images rendered with Verovio,
sourced from PDMX, Musetrainer, Grandstaff, OpenScore Lieder, and OpenScore
String Quartets. Targets are canonicalized via a deterministic
**kern normalization pipeline so each score maps to a single canonical
sequence. Augmentation adds expressive marks (dynamics, pedals, tempo, etc.)
to the rendered image without changing the target — this asymmetry teaches
the decoder to ignore notation that is not part of **kern.
No real-scan or Polish data is used for training or model selection.
Benchmarks
OMR-NED on the released benchmarks (lower is better):
| Model | Params | Verovio synth ↓ | Polish (real) ↓ |
|---|---|---|---|
| SMT++ | 11M | 92.23% | 80.16% |
| Legato | 943M | 43.91% | 86.73% |
| Transcoda 59M (beam) | 59M | 18.46% | 63.97% |
Headline numbers use beam search with width 3, repetition_penalty=1.1,
max_length=2048. See
scripts/benchmark/README.md
in the project repository for reproduction commands and ablations.
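For readers unfamiliar with edit-distance metrics, the sketch below computes a generic token-level normalized edit distance as a percentage. It only illustrates why lower is better and is not the OMR-NED implementation: the tokenization, the normalization by the longer sequence, and the function names are all assumptions, and the actual definition lives in the benchmark tooling.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(ref, hyp):
    """Edit distance as a percentage of the longer sequence length."""
    if not ref and not hyp:
        return 0.0
    return 100.0 * edit_distance(ref, hyp) / max(len(ref), len(hyp))


reference = ["4c", "4d", "4e", "4f"]
prediction = ["4c", "4d", "4g", "4f"]  # one wrong pitch
score = normalized_edit_distance(reference, prediction)
```

In this toy case one substitution in four tokens gives 25%; a perfect transcription scores 0%.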
Decoding
Three configurations are documented in the paper:
- Greedy (default in this card's `generation_config.json`): `do_sample=false, num_beams=1, repetition_penalty=1.1, max_length=2048`.
- Beam search (best CER, headline metric): `num_beams=3, length_penalty=1.0`, otherwise as above.
- Grammar-constrained: applies a `**kern` grammar (`grammars/kern.gbnf`) via custom logits processors. Guarantees structural validity; requires the project repository.
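The first two configurations can be written as keyword arguments for `model.generate(...)`. This is a minimal sketch: the dictionary names and the `generation_kwargs` helper are hypothetical, and the grammar-constrained mode is omitted because its logits processors live in the project repository.

```python
# Values taken from the list above; names are illustrative only.
GREEDY = {
    "do_sample": False,
    "num_beams": 1,
    "repetition_penalty": 1.1,
    "max_length": 2048,
}
# Beam search reuses the greedy settings and overrides the beam width.
BEAM = {**GREEDY, "num_beams": 3, "length_penalty": 1.0}


def generation_kwargs(mode="greedy"):
    """Return a fresh copy of the keyword arguments for the given mode."""
    configs = {"greedy": GREEDY, "beam": BEAM}
    if mode not in configs:
        raise ValueError(f"unknown decoding mode: {mode!r}")
    return dict(configs[mode])


# Intended use (requires the loaded model and preprocessed inputs):
# out = model.generate(pixel_values=pixel_values, image_sizes=image_sizes,
#                      **generation_kwargs("beam"))
```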
The repetition penalty is preferred over no_repeat_ngram_size —
music is intrinsically repetitive at every scale, and banning n-grams
breaks legitimate output. See the project CLAUDE.md for details.
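The difference between the two mechanisms can be seen on a toy sequence. The sketch below finds the positions that an n-gram ban would block when a measure is legitimately repeated, and shows the softer rescaling applied by a repetition penalty (divide positive logits, multiply negative ones, as in the transformers repetition penalty processor). The token strings and function names are illustrative assumptions, not the model's actual vocabulary.

```python
def repeated_ngrams(tokens, n):
    """Indices of tokens that complete an n-gram already seen earlier."""
    seen = set()
    hits = []
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in seen:
            hits.append(i + n - 1)  # index of the token an n-gram ban would forbid
        seen.add(gram)
    return hits


def penalize(score, seen_before, penalty=1.1):
    """Rescale the logit of a previously generated token; others are unchanged."""
    if not seen_before:
        return score
    return score / penalty if score > 0 else score * penalty


measure = ["4c", "4e", "4g", "4e"]
tokens = measure + ["="] + measure  # the same measure played twice
blocked = repeated_ngrams(tokens, 3)
```

With a 3-gram ban, the last two notes of the repeated measure would be forbidden outright, whereas the penalty only lowers their logits and a confident model can still emit them.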
Limitations
- Trained exclusively on synthetic Verovio renders. Domain gap on real scans is significant: 18.46% OMR-NED on clean Verovio pages vs 63.97% on Polish historical scans.
- No fine-tuning on Polish or any other real-world corpus.
- Single layout engine in training data; other engravers may degrade.
- Fixed input geometry (1485 × 1050). Multi-page or unusual aspect ratios must be split or letterboxed by the caller.
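For the fixed-geometry limitation, one simple option for tall inputs is to cut the source image into horizontal bands that each fit the model's aspect ratio before preprocessing. The sketch below computes such bands in source pixel rows. `split_boxes` is a hypothetical helper, and a fixed-height cut like this can slice through a system of staves, so a real caller would likely want to choose cut points in white space.

```python
import math


def split_boxes(width, height, target_w=1050, target_h=1485):
    """Return (top, bottom) row ranges, each fitting one 1485 x 1050 input."""
    # Largest number of source rows that still fits target_h after the
    # image is resized to target_w with its aspect ratio preserved.
    max_rows = max(1, int(target_h * width / target_w))
    n_bands = math.ceil(height / max_rows)
    return [(i * max_rows, min(height, (i + 1) * max_rows)) for i in range(n_bands)]


# A 2100 x 7000 scan splits into three bands; the last one is shorter
# and would be white-padded during preprocessing.
boxes = split_boxes(2100, 7000)
```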
License
Citation
```bibtex
@misc{dratschuk2026transcodaendtoendzeroshotoptical,
  title={Transcoda: End-to-End Zero-Shot Optical Music Recognition via Data-Centric Synthetic Training},
  author={Daniel Dratschuk and Paul Swoboda},
  year={2026},
  eprint={2605.10835},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2605.10835},
}
```