Transcoda 59M Zero-Shot v1

End-to-end zero-shot Optical Music Recognition. A compact 59M-parameter vision-encoder-decoder that turns raw score images directly into **kern (Humdrum) symbolic transcriptions. Trained from scratch on synthetic data only.

Paired benchmarks: btrkeks/verovio-synth-omr (synthetic) and btrkeks/polish-scores (real historical scans).

Quick load (transformers)

from transformers import AutoModelForCausalLM, PreTrainedTokenizerFast

model = AutoModelForCausalLM.from_pretrained(
    "btrkeks/transcoda-59M-zeroshot-v1",
    trust_remote_code=True,
)
tokenizer = PreTrainedTokenizerFast.from_pretrained("btrkeks/transcoda-59M-zeroshot-v1")

trust_remote_code=True is required because the model uses a custom vision-encoder-decoder architecture not yet in the transformers registry. On first load, the ConvNeXt-V2-Tiny base weights are downloaded from facebook/convnextv2-tiny-22k-224 and then overwritten by the Transcoda checkpoint; this is expected.

Preprocessing

The model expects a single page image normalized to 1485 × 1050 pixels (height × width) with float32 pixel values in [-1, 1]:

  1. Convert to RGB.
  2. Resize to width 1050 preserving aspect ratio (bilinear).
  3. If the resized height exceeds 1485, keep the top 1485 rows and discard the rest; otherwise white-pad at the bottom to reach exactly 1485.
  4. Normalize to [-1, 1] (subtract 0.5, divide by 0.5 per channel).
  5. Pass as pixel_values of shape (1, 3, 1485, 1050) plus an image_sizes tensor [[1485, 1050]] to model.generate(...).

The canonical implementation is preprocess_pil_image in src/data/preprocessing.py in the project repository.
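
As a worked example of steps 2 and 3, the sketch below computes the resized height and how many rows would be cropped or padded for a given scan size. The function name plan_geometry and the sample page sizes are illustrative and not part of the project repository. The rounding (truncation with int) mirrors the inference snippet in this card and may differ from preprocess_pil_image.

```python
def plan_geometry(width, height, target_w=1050, target_h=1485):
    """Return (resized_height, rows_cropped, rows_padded) for one page.

    Hypothetical helper for illustration only: it reproduces the
    resize-then-crop-or-pad arithmetic without touching any pixels.
    """
    # Step 2: scale to the target width, preserving aspect ratio.
    new_h = max(1, int(height * (target_w / width)))
    # Step 3: rows removed from the bottom, or white rows added at the bottom.
    cropped = max(0, new_h - target_h)
    padded = max(0, target_h - new_h)
    return new_h, cropped, padded


a4 = plan_geometry(2480, 3508)      # A4 at 300 dpi
letter = plan_geometry(2550, 3300)  # US Letter at 300 dpi
tall = plan_geometry(1240, 2000)    # a page taller than the target ratio

print(a4)      # (1485, 0, 0)
print(letter)  # (1358, 0, 127)
print(tall)    # (1693, 208, 0)
```

An A4 scan lands on exactly 1485 rows, so it needs neither cropping nor padding. Shorter formats are padded with white, and taller ones lose their bottom rows.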

Full inference snippet

import torch
from PIL import Image
import numpy as np
from transformers import AutoModelForCausalLM, PreTrainedTokenizerFast

REPO = "btrkeks/transcoda-59M-zeroshot-v1"
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(REPO, trust_remote_code=True).to(device).eval()
tokenizer = PreTrainedTokenizerFast.from_pretrained(REPO)

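def preprocess(path, target_w=1050, target_h=1485):
    """Load one page image and return a (1, 3, target_h, target_w) float32 tensor in [-1, 1]."""
    img = Image.open(path).convert("RGB")
    # Resize to the target width, preserving aspect ratio.
    new_h = max(1, int(img.height * (target_w / img.width)))
    img = img.resize((target_w, new_h), Image.BILINEAR)
    arr = np.array(img)  # (H, W, 3), uint8
    if arr.shape[0] > target_h:
        arr = arr[:target_h]  # too tall: keep the top rows
    elif arr.shape[0] < target_h:
        # too short: pad the bottom with white
        pad = np.full((target_h - arr.shape[0], target_w, 3), 255, dtype=arr.dtype)
        arr = np.concatenate([arr, pad], axis=0)
    # HWC uint8 -> CHW float in [0, 1], then normalize to [-1, 1].
    t = torch.from_numpy(arr).permute(2, 0, 1).float() / 255.0
    t = (t - 0.5) / 0.5
    return t.unsqueeze(0)  # add the batch dimension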

pixel_values = preprocess("page.png").to(device)
image_sizes = torch.tensor([[1485, 1050]], device=device)

with torch.no_grad():
    out = model.generate(
        pixel_values=pixel_values,
        image_sizes=image_sizes,
        max_length=2048,
        do_sample=False,
        num_beams=1,
        repetition_penalty=1.1,
    )

print(tokenizer.decode(out[0], skip_special_tokens=True))

For grammar-constrained decoding (guarantees formally valid **kern) and beam search, use the project repository's scripts/inference.py.

Loading the raw .ckpt (paper reproducibility)

The original Lightning checkpoint is also distributed in this repo as transcoda-59M-zeroshot-v1.ckpt. Load it from a clone of the Transcoda repository:

import torch
from src.model.checkpoint_loader import load_model_from_checkpoint

loaded = load_model_from_checkpoint("transcoda-59M-zeroshot-v1.ckpt", torch.device("cpu"))
model = loaded.model

This matches the inference path used for the paper's reported metrics.

Architecture

| Component | Spec |
|---|---|
| Encoder | ConvNeXt-V2-Tiny, pretrained on ImageNet-22k |
| Patch grid | 47 × 33 (from a 1485 × 1050 input) |
| Projector | 2-layer MLP, encoder dim 768 → decoder dim 512 |
| Positional encoding | 2D sinusoidal on the encoder grid, RoPE on the decoder |
| Decoder | 8-layer Transformer, d_model=512, dim_ff=1024, 8 heads |
| Vocab | 3,000-token BPE (**kern source) |
| Max seq len | 2,048 |
| Parameters | 58.8 M total |
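
The 47 × 33 grid is consistent with ConvNeXt's total downsampling factor of 32 (a stride-4 stem followed by three stride-2 stages) if partial patches at the border are rounded up. The rounding-up is an assumption made here to reproduce the stated numbers; the actual padding behaviour is defined by the model code. A minimal sketch of the arithmetic, with patch_grid as a hypothetical helper:

```python
import math


def patch_grid(height, width, stride=32):
    """Rows and columns of encoder features, assuming ceiling rounding."""
    return math.ceil(height / stride), math.ceil(width / stride)


rows, cols = patch_grid(1485, 1050)
num_visual_tokens = rows * cols

print(rows, cols)          # 47 33
print(num_visual_tokens)   # 1551
```

Under this assumption the decoder cross-attends over 1,551 visual tokens per page.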

Training data

310,554 synthetic page images rendered with Verovio, sourced from PDMX, Musetrainer, Grandstaff, OpenScore Lieder, and OpenScore String Quartets. Targets are canonicalized via a deterministic **kern normalization pipeline so each score maps to a single canonical sequence. Augmentation adds expressive marks (dynamics, pedals, tempo, etc.) to the rendered image without changing the target — this asymmetry teaches the decoder to ignore notation that is not part of **kern.

No real-scan or Polish data is used for training or model selection.

Benchmarks

OMR-NED on the released benchmarks (lower is better):

| Model | Params | Verovio synth ↓ | Polish (real) ↓ |
|---|---|---|---|
| SMT++ | 11M | 92.23% | 80.16% |
| Legato | 943M | 43.91% | 86.73% |
| Transcoda 59M (beam) | 59M | 18.46% | 63.97% |

Headline numbers use beam search with width 3, repetition_penalty=1.1, max_length=2048. See scripts/benchmark/README.md in the project repository for reproduction commands and ablations.

Decoding

Three configurations are documented in the paper:

  • Greedy (default in this card's generation_config.json): do_sample=false, num_beams=1, repetition_penalty=1.1, max_length=2048.
  • Beam search (best CER, headline metric): num_beams=3, length_penalty=1.0, otherwise as above.
  • Grammar-constrained: applies a **kern grammar (grammars/kern.gbnf) via custom logits processors. Guarantees structural validity; requires the project repository.
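
The greedy and beam-search settings listed above can be written as keyword arguments for model.generate(...). This is a sketch: the names GREEDY and BEAM are illustrative, and the grammar-constrained mode is omitted because its logits processors live in the project repository.

```python
# Greedy decoding, matching the defaults described for generation_config.json.
GREEDY = dict(
    do_sample=False,
    num_beams=1,
    repetition_penalty=1.1,
    max_length=2048,
)

# Beam search: same settings, but with three beams and a neutral length penalty.
BEAM = {**GREEDY, "num_beams": 3, "length_penalty": 1.0}

print(BEAM)

# Usage, assuming model, pixel_values and image_sizes are already defined:
# out = model.generate(pixel_values=pixel_values, image_sizes=image_sizes, **BEAM)
```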

The repetition penalty is preferred over no_repeat_ngram_size — music is intrinsically repetitive at every scale, and banning n-grams breaks legitimate output. See the project CLAUDE.md for details.
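
To illustrate why n-gram banning is a poor fit for music, the sketch below finds the first position at which a no-repeat-n-gram rule would block a token. The token strings are simplified single-spine **kern tokens chosen for illustration, not the model's actual BPE pieces, and first_banned_position is a hypothetical helper.

```python
def first_banned_position(tokens, n):
    """Index of the first token that completes an already-seen n-gram, or None."""
    seen = set()
    for i in range(n - 1, len(tokens)):
        gram = tuple(tokens[i - n + 1 : i + 1])
        if gram in seen:
            return i
        seen.add(gram)
    return None


# One bar of a broken-chord figure, played twice in a row.
bar = ["4c", "4e", "4g", "4e", "="]
sequence = bar * 2

print(first_banned_position(sequence, 3))  # 7
```

With n = 3 the rule would block the third note of the repeated bar, even though a literal repeat is exactly what the score contains. A multiplicative repetition penalty only lowers the scores of repeated tokens and leaves them available.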

Limitations

  • Trained exclusively on synthetic Verovio renders. Domain gap on real scans is significant: 18.46% OMR-NED on clean Verovio pages vs 63.97% on Polish historical scans.
  • No fine-tuning on Polish or any other real-world corpus.
  • Single layout engine in training data; accuracy may degrade on scores typeset by other engraving software.
  • Fixed input geometry (1485 × 1050). Multi-page or unusual aspect ratios must be split or letterboxed by the caller.

License

CC BY 4.0.

Citation

@misc{dratschuk2026transcodaendtoendzeroshotoptical,
      title={Transcoda: End-to-End Zero-Shot Optical Music Recognition via Data-Centric Synthetic Training},
      author={Daniel Dratschuk and Paul Swoboda},
      year={2026},
      eprint={2605.10835},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.10835},
}