---
language:
- en
license: mit
tags:
- latex
- ocr
- causal-lm
- custom_code
library_name: transformers
---

# LaTeX OCR Decoder

A lightweight causal language model pretrained on LaTeX expressions for OCR post-processing.

## Architecture

- **Type**: Decoder-only Transformer (GPT-style)
- **Layers**: 6
- **d_model**: 512
- **Heads**: 8
- **FFN**: SwiGLU, d_ff=1408
- **Position encoding**: RoPE (θ=10000)
- **Vocab size**: 8192 (custom BPE tokenizer)
- **Max sequence length**: 200
- **Parameters**: ~14M

## Training

- **Steps**: 100,000
- **Final loss**: 1.163
- **Optimizer**: AdamW (lr=3e-4, weight_decay=0.1)
- **Scheduler**: Cosine with warmup (1000 steps)
- **Precision**: bfloat16
- **Data**: LaTeX expressions from an OCR dataset

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# trust_remote_code=True is required because the repo ships custom model code.
tokenizer = AutoTokenizer.from_pretrained("harryrobert/latexOCR", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("harryrobert/latexOCR", trust_remote_code=True)
model.eval()

# Raw string so the backslashes in the LaTeX prompt are kept literally.
prompt = r"\frac{1}{2}"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        inputs["input_ids"],
        max_new_tokens=100,
        do_sample=True,  # without this, temperature and top_p are ignored
        temperature=0.7,
        top_p=0.9,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

## License

MIT
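
## Learning-rate schedule (illustrative)

The Training section lists a cosine schedule with 1000 warmup steps over 100,000 steps at a peak learning rate of 3e-4. The sketch below shows one common way such a schedule is computed. It is not taken from the training code: the linear warmup shape, the decay floor `min_lr=0.0`, and the helper name `lr_at_step` are all assumptions made for illustration.

```python
import math


def lr_at_step(step, base_lr=3e-4, warmup_steps=1000,
               total_steps=100_000, min_lr=0.0):
    """Learning rate at a given optimizer step (hypothetical helper).

    base_lr, warmup_steps and total_steps come from the model card;
    the linear warmup and min_lr=0.0 are assumptions.
    """
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr (assumed warmup shape).
        return base_lr * step / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    # Cosine decay from base_lr down to min_lr.
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Print the rate at the start, mid-warmup, peak, mid-decay and end.
    for step in (0, 500, 1000, 50_500, 100_000):
        print(step, lr_at_step(step))
```

Under these assumptions the rate peaks at step 1000 and reaches the floor at step 100,000. A schedule with a nonzero floor can be explored by passing a different `min_lr`.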