txtcaptcha CRNN — unified Brazilian court captcha model

A Convolutional Recurrent Neural Network (ResNet-style CNN → BiLSTM → CTC) trained jointly on ten labeled captcha datasets from Brazilian courts and tax authorities, originally published by the R captcha package. It replaces ten per-site CNN classifiers with a single model that reads variable-length alphanumeric strings from arbitrarily sized input images.

Usage

from txtcaptcha import decrypt, read_captcha

cap = read_captcha("captcha.png")
# The first decrypt() call downloads the weights from this repo into
# ~/.cache/huggingface; subsequent calls reuse the cached copy.
print(decrypt(cap))                  # greedy CTC decoding
print(decrypt(cap, length=5))        # force exactly 5 output chars
print(decrypt(cap, mask="[0-9]"))    # restrict to digits
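
For readers unfamiliar with greedy CTC decoding, the sketch below shows the general technique on plain Python lists: take the best class at each timestep, collapse consecutive repeats, and drop blanks. It is not the txtcaptcha implementation. The function name greedy_ctc_decode, the blank index 0, and the class order 0-9a-zA-Z are assumptions made for illustration, and the mask handling shows only one plausible way a character restriction could be applied.

```python
import re
import string

# Assumed class layout (hypothetical): index 0 is the CTC blank and
# indices 1..62 map to 0-9, a-z, A-Z in that order.
VOCAB = string.digits + string.ascii_lowercase + string.ascii_uppercase
BLANK = 0


def greedy_ctc_decode(scores, mask=None):
    """Decode a list of per-timestep score lists (63 floats each).

    mask is an optional regex matched against single characters;
    classes whose character does not match are never selected.
    """
    if mask is None:
        allowed = list(range(len(VOCAB) + 1))
    else:
        pattern = re.compile(mask)
        # The blank always stays selectable, otherwise repeats cannot be separated.
        allowed = [BLANK] + [
            i + 1 for i, ch in enumerate(VOCAB) if pattern.fullmatch(ch)
        ]

    chars = []
    prev = BLANK
    for step in scores:
        best = max(allowed, key=lambda i: step[i])  # argmax over allowed classes
        # Emit only when the class changes and is not the blank.
        if best != BLANK and best != prev:
            chars.append(VOCAB[best - 1])
        prev = best
    return "".join(chars)
```

A blank between two identical classes is what lets CTC output a doubled character such as "aa"; without it, the two frames collapse into a single "a".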

Explicit download (useful for notebooks or CI warmup):

from txtcaptcha import from_pretrained
model = from_pretrained("jtrecenti/txtcaptcha-crnn")

Install txtcaptcha:

pip install git+https://github.com/jtrecenti/txtcaptcha

Training data

The union of these datasets (published at https://github.com/decryptr/captcha/releases):

cadesp, esaj, jucesp, rfb, sei, tjmg, tjpe, tjrs, trf5, trt

Labels follow the filename convention <id>_<label>.<ext>. Vocabulary is the full alphanumeric range 0-9a-zA-Z (62 classes + CTC blank).
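
The filename convention can be parsed along these lines. This is a minimal sketch: parse_label is a hypothetical helper, not part of txtcaptcha, and it assumes the label is whatever follows the last underscore, so that ids containing underscores still split correctly.

```python
from pathlib import Path


def parse_label(filename):
    """Split '<id>_<label>.<ext>' into (id, label).

    Assumption: the label follows the LAST underscore in the file stem.
    """
    stem = Path(filename).stem
    ident, sep, label = stem.rpartition("_")
    if not sep or not ident or not label:
        raise ValueError(f"not in <id>_<label>.<ext> form: {filename!r}")
    # The vocabulary is 0-9a-zA-Z, so anything else is rejected.
    if not (label.isascii() and label.isalnum()):
        raise ValueError(f"label outside 0-9a-zA-Z: {label!r}")
    return ident, label
```

For example, parse_label("captcha0001_a7Kq.png") yields the id "captcha0001" and the label "a7Kq".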

Architecture

Stage	Details
Backbone	ResNet-style CNN, channels `64 → 128 → 256 → 256`, H/8 × W/4 downsample
Sequence head	2-layer bidirectional LSTM, hidden 256
Classifier	Linear `512 → 63` (62 chars + blank)
Loss	CTC, handles variable output length
Input	Any dimensions; height resized to 32 at inference, width preserved

The full config is in config.json.
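
The table implies some simple shape arithmetic, sketched below in plain Python. The helper crnn_output_shapes is hypothetical. It assumes floor division for the downsampling (the exact result depends on the padding and strides in config.json) and takes the image width after the height has been resized to 32. How the remaining feature rows are collapsed into a sequence is not stated here, so the sketch reports only the resulting sizes.

```python
def crnn_output_shapes(width, height=32, down_h=8, down_w=4,
                       channels=256, hidden=256, n_classes=63):
    """Return the tensor shapes implied by the architecture table.

    width is the image width after resizing the height to 32.
    Floor division is an assumption; real layers may round differently.
    """
    feat_h = height // down_h      # 32 / 8 = 4 feature rows
    timesteps = width // down_w    # one CTC timestep per 4 input columns
    lstm_out = 2 * hidden          # bidirectional: forward + backward states
    return {
        "feature_map": (channels, feat_h, timesteps),
        "lstm_out": (timesteps, lstm_out),
        "logits": (timesteps, n_classes),
    }


shapes = crnn_output_shapes(width=128)
print(shapes["logits"])
```

The bidirectional hidden size of 2 × 256 explains the 512 inputs of the classifier. Since CTC cannot emit more characters than there are timesteps, a very narrow image limits the longest string that can be decoded.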

Validation accuracy

~89% captcha-level exact match (the whole predicted string must equal the label) on a held-out 20% split of the training corpus. A per-dataset breakdown is available in notebooks/eval_per_dataset.ipynb.
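
Captcha-level exact match counts a prediction as correct only when every character agrees with the label. A minimal sketch of that metric follows. The function name is hypothetical and the comparison is assumed to be case-sensitive, since the vocabulary distinguishes a-z from A-Z; the evaluation notebook may compute it differently.

```python
def exact_match_accuracy(predictions, labels):
    """Fraction of captchas whose predicted string equals the label exactly."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    hits = sum(pred == true for pred, true in zip(predictions, labels))
    return hits / len(labels)


# One wrong character (K vs k) makes the whole captcha count as a miss.
print(exact_match_accuracy(["ab12", "x9Kq", "7777"], ["ab12", "x9kq", "7777"]))
```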

Limitations

Trained exclusively on Brazilian court captcha fonts/distortions. Generalization to US/European OCR captchas is untested.
The sei dataset is a math captcha; the model learned to transcribe the (already-solved) label as if it were a plain 4-char sequence.
Label length distribution is concentrated on 4–5 chars; longer sequences may degrade. Use decrypt(..., length=N) to enforce an exact length.

Versioning

Pin to a specific release with revision=:

from_pretrained("jtrecenti/txtcaptcha-crnn", revision="v0.1.0")

License

MIT. The original training data from the R captcha package is redistributed under its own terms.
