txtcaptcha CRNN: unified Brazilian court captcha model
A Convolutional Recurrent Neural Network (ResNet-style CNN → BiLSTM → CTC)
trained jointly on ten labeled captcha datasets from Brazilian courts and
tax authorities, originally published by the R captcha package. It replaces ten
per-site CNN classifiers with a single model that reads variable-length
alphanumeric strings from arbitrarily sized input images.
Usage
from txtcaptcha import decrypt, read_captcha
cap = read_captcha("captcha.png")
# First call downloads the weights from this repo into ~/.cache/huggingface;
# subsequent calls reuse the cached copy.
print(decrypt(cap)) # greedy CTC decoding
print(decrypt(cap, length=5)) # force exactly 5 output chars
print(decrypt(cap, mask="[0-9]")) # restrict to digits
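For readers unfamiliar with greedy CTC decoding, the idea can be sketched in plain Python: pick the best class at every time step, merge consecutive repeats, then drop blanks. This is an illustration only, not txtcaptcha's implementation. The blank index (0), the vocabulary order, and the `greedy_ctc_decode` and `one_hot` helpers are assumptions made for the example, as is the way `mask` is interpreted here.

```python
import re
import string

# Assumed layout: class 0 is the CTC blank, followed by 0-9, a-z, A-Z.
VOCAB = string.digits + string.ascii_lowercase + string.ascii_uppercase
BLANK = 0


def greedy_ctc_decode(scores, mask=None):
    """Decode a T x 63 score matrix (list of lists) into a string.

    mask is an optional regex character class such as "[0-9]"; classes whose
    character does not match it are skipped when picking the best class.
    """
    allowed = [True] * (len(VOCAB) + 1)  # the blank is always allowed
    if mask is not None:
        for i, ch in enumerate(VOCAB, start=1):
            allowed[i] = re.fullmatch(mask, ch) is not None

    out = []
    prev = BLANK
    for step in scores:
        candidates = (i for i in range(len(step)) if allowed[i])
        best = max(candidates, key=lambda i: step[i])
        # Emit only when the class changes and is not the blank.
        if best != BLANK and best != prev:
            out.append(VOCAB[best - 1])
        prev = best
    return "".join(out)


def one_hot(index, size=63):
    """Build a score row that prefers a single class (demo data only)."""
    row = [0.0] * size
    row[index] = 1.0
    return row


# Frames: "a", "a", blank, "a", "7". The blank keeps the two a's apart.
demo = [one_hot(11), one_hot(11), one_hot(0), one_hot(11), one_hot(8)]
decoded = greedy_ctc_decode(demo)
print(decoded)
```

Restricting the candidates before the argmax, as the `mask` branch does, is one simple way to constrain the output alphabet; the package may implement its `mask` and `length` options differently.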
Explicit download (useful for notebooks or CI warmup):
from txtcaptcha import from_pretrained
model = from_pretrained("jtrecenti/txtcaptcha-crnn")
Install txtcaptcha:
pip install git+https://github.com/jtrecenti/txtcaptcha
Training data
Merged union of these datasets (published at https://github.com/decryptr/captcha/releases):
cadesp, esaj, jucesp, rfb, sei, tjmg, tjpe, tjrs, trf5, trt
Labels follow the filename convention <id>_<label>.<ext>. Vocabulary is
the full alphanumeric range 0-9a-zA-Z (62 classes + CTC blank).
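As a minimal sketch of the filename convention, the snippet below extracts the label from a path and checks it against the 62-character vocabulary. The helper name `label_from_filename` and the example path are hypothetical, and splitting on the last underscore is an assumption (it tolerates ids that themselves contain underscores).

```python
import string
from pathlib import Path

# 62 characters; the CTC blank is the 63rd class and never appears in labels.
VOCAB = string.digits + string.ascii_lowercase + string.ascii_uppercase


def label_from_filename(path):
    """Return the <label> part of a '<id>_<label>.<ext>' filename."""
    stem = Path(path).stem                  # drop directories and extension
    _, sep, label = stem.rpartition("_")    # assumption: split on last "_"
    if not sep or not label or any(ch not in VOCAB for ch in label):
        raise ValueError(f"cannot read a label from {path!r}")
    return label


# Hypothetical file, for illustration only.
example = label_from_filename("data/esaj/000123_x7Kp2.png")
print(example)
```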
Architecture
| Stage | Details |
|---|---|
| Backbone | ResNet-style CNN, channels 64 → 128 → 256 → 256, H/8 × W/4 downsample |
| Sequence head | 2-layer bidirectional LSTM, hidden 256 |
| Classifier | Linear 512 → 63 (62 chars + blank) |
| Loss | CTC, handles variable output length |
| Input | Any dimensions; height resized to 32 at inference, width preserved |
The full config is in config.json.
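The downsampling factors in the table determine how many time steps the sequence head sees, which in turn bounds what CTC can emit. The arithmetic can be sketched as follows, assuming floor division for the strides (the exact value depends on the padding used in the backbone) and using hypothetical helper names. The lower bound in `min_frames_for` is a general property of CTC: one frame per character, plus a blank between identical neighbours.

```python
INPUT_HEIGHT = 32    # height after resizing, from the table
HEIGHT_STRIDE = 8    # H/8
WIDTH_STRIDE = 4     # W/4


def sequence_length(width):
    """Approximate number of time steps for an image of the given width."""
    return width // WIDTH_STRIDE


def min_frames_for(label):
    """Fewest CTC frames able to emit label."""
    repeats = sum(1 for a, b in zip(label, label[1:]) if a == b)
    return len(label) + repeats


def fits(width, label):
    """True if an image this wide yields enough frames for label."""
    return sequence_length(width) >= min_frames_for(label)


feature_height = INPUT_HEIGHT // HEIGHT_STRIDE
print(feature_height, sequence_length(128), min_frames_for("aab1"))
```

Under these assumptions a 128-pixel-wide captcha gives 32 time steps, comfortably more than the 4 to 5 characters typical of the training labels.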
Validation accuracy
~89% captcha-level exact match on a held-out 20% split of the training
corpus. A per-dataset breakdown is available in notebooks/eval_per_dataset.ipynb.
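Captcha-level exact match counts a prediction as correct only when the whole string equals the label. A minimal sketch of the metric is shown below; the function name and the toy data are illustrative, and the case-sensitive comparison is an assumption based on the vocabulary distinguishing a-z from A-Z.

```python
def exact_match_accuracy(predictions, labels):
    """Fraction of predictions that equal their label character for character."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one example")
    hits = sum(pred == label for pred, label in zip(predictions, labels))
    return hits / len(labels)


# Toy data: the second prediction differs only in letter case, so it is a miss.
score = exact_match_accuracy(["ab12", "x9Q", "7777"], ["ab12", "x9q", "7777"])
print(f"{score:.3f}")
```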
Limitations
- Trained exclusively on Brazilian court captcha fonts/distortions. Generalization to US/European OCR captchas is untested.
- The sei dataset is a math captcha; the model learned to transcribe the (already-solved) label as if it were a plain 4-char sequence.
- Label length distribution is concentrated on 4–5 chars; longer sequences may degrade. Use decrypt(..., length=N) to enforce an exact length.
Versioning
Pin to a specific release with revision=:
from_pretrained("jtrecenti/txtcaptcha-crnn", revision="v0.1.0")
License
MIT. The original training data from the R captcha package is redistributed
under its own terms.