---
title: LREC 2026 LLM-as-Annotator
emoji: ✒️
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: Annotate historical and low-resource languages with LLMs
---

# LREC 2026: LLM-as-Annotator Workbench

A **corpus-centered** annotation app built around the LLM-as-annotator pipeline described in the LREC 2026 tutorial and the companion LoResLM 2026 paper. The text is the focal point; everything else (task schema, models, prompt, ICL pool, exports) lives in popups behind toolbar pills.

## What it does

- Loads a corpus (paste, file, or sandbox example from the four historical languages of the paper).
- Annotates **token by token** with one or more LLMs (single inference or Mixture-of-Experts).
- Highlights MoE **disagreements** so the reviewer focuses on contested tokens first.
- Lets you correct any token in a focused popup with per-model votes, keyboard navigation, bulk operations, and a "re-ask one model" action.
- Bootstrap loop: corrected sentences feed back into the few-shot pool (filtered by `(language, schema_hash)` to avoid task contamination).
- Exports as TSV (PIE-baseline round-trip), JSON (schema-conformant), CoNLL-U (UD standard), or JSONL (fine-tune format).

## Companion paper

Vidal-Gorène, C., Kindt, B., & Cafiero, F. (2026). *Under-resourced studies of under-resourced languages: lemmatization and POS-tagging with LLM annotators for historical Armenian, Georgian, Greek and Syriac.* **LoResLM 2026**. [https://aclanthology.org/2026.loreslm-1.28/](https://aclanthology.org/2026.loreslm-1.28/)

Tutorial repo: [floriancafiero/lrec2026-llm-as-annotator-tutorial](https://github.com/floriancafiero/lrec2026-llm-as-annotator-tutorial)

## Stack

- **Backend**: FastAPI + httpx (async OpenRouter client).
- **Frontend**: single static HTML page + Alpine.js (15 KB, CDN) + Tailwind CSS (CDN). No build step.

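
The disagreement highlighting listed under "What it does" can be illustrated with a minimal majority-vote sketch. This is not the app's `aggregate()` from `moe.py`: the function name `vote_token`, the model names, and the tie-breaking rule (the label seen first wins) are assumptions made for illustration.

```python
from collections import Counter


def vote_token(votes: dict[str, str]) -> tuple[str, bool]:
    """Return (winning_label, disagreement) for one token.

    `votes` maps a model name to the label that model proposed.
    Hypothetical helper: ties go to the label that was seen first.
    """
    if not votes:
        raise ValueError("at least one model vote is required")
    counts = Counter(votes.values())
    # most_common keeps first-encountered order among equal counts,
    # so the tie-break is deterministic.
    winner, _ = counts.most_common(1)[0]
    # Any second distinct label means the models did not all agree.
    disagreement = len(counts) > 1
    return winner, disagreement


# Illustrative model names and POS tags, not taken from the app.
consensus = vote_token({"model-a": "NOUN", "model-b": "NOUN", "model-c": "NOUN"})
contested = vote_token({"model-a": "NOUN", "model-b": "VERB", "model-c": "NOUN"})
print(consensus)  # ('NOUN', False)
print(contested)  # ('NOUN', True)
```

A token whose second element is `True` is the kind the reviewer would want to see first, even when a majority label exists.
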
## Run locally

```bash
cd app
pip install -r requirements.txt
python app.py            # or: uvicorn app:app --reload --port 7860
# open http://127.0.0.1:7860
```

The app expects the two sibling repos at:

```
LREC-tutorial/
├── code/
│   ├── EACL2026-historical-languages/          # sandbox corpora + tagsets
│   └── lrec2026-llm-as-annotator-tutorial/     # JSON schema + system prompts
└── app/                                        # this directory
```

## Workflow

1. **Sidebar → quick start**: click an example corpus (Ancient Greek, Old Armenian, Syriac). The toolbar updates the task, language, and models.
2. **Top bar → 🔑 OpenRouter**: paste your API key (kept in this browser session only).
3. **Top bar → ▶ Annotate all**: runs every model in parallel (Mixture-of-Experts if 2+ models). Tokens are colored by status: indigo = consensus, amber ⚠ = disagreement.
4. **Click any token** → popup with editable fields, per-model votes, keyboard navigation, "adopt from " and "re-ask one model" shortcuts.
5. **📥 to ICL** on a sentence: pushes the corrected annotation into the few-shot pool. The next run re-injects it.
6. **Top bar → export**: TSV / JSON / CoNLL-U / JSONL.

### Keyboard shortcuts

| Key | Action |
|---|---|
| `j` / `k` | next / previous token |
| `e` or `↵` | edit focused token |
| `1`–`9` | (in editor) assign the i-th visible tag |
| `x` | toggle selection of focused token |
| `r` | re-annotate the focused sentence |
| `↵` | (in editor) save edit & advance to next disagreement |
| `Esc` | close popup / clear selection |
| `shift+click` | multi-select tokens (then "Apply tag…") |
| `right-click` | per-token context menu |

## Deploy on HuggingFace Spaces

This `app/` directory is **self-contained**: the tagsets, schemas, system prompts, cheatsheet and a slice of the four sandbox corpora are vendored under [data/](data/) (≈ 900 KB). You do not need to push the parent repo or use git submodules.

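
One way to reconcile the vendored `data/` directory with the sibling-repo layout used for local runs is a lookup that returns whichever copy exists first. The sketch below is an assumption about how such a lookup could work, not the contents of `paths.py`: `resolve_resource`, the search order, and the example file name are all hypothetical.

```python
from pathlib import Path


def resolve_resource(relative: str, roots: list[Path]) -> Path:
    """Return the first existing `root / relative`, searching roots in order."""
    for root in roots:
        candidate = root / relative
        if candidate.exists():
            return candidate
    searched = ", ".join(str(root) for root in roots)
    raise FileNotFoundError(f"{relative!r} not found under: {searched}")


app_dir = Path.cwd()  # assumed to be the app/ directory
roots = [
    app_dir / "data",  # vendored copy shipped with the Space
    app_dir.parent / "code" / "lrec2026-llm-as-annotator-tutorial",  # sibling repo
]
try:
    print(resolve_resource("schema.json", roots))  # hypothetical file name
except FileNotFoundError as err:
    # Neither location has the file: report where the lookup searched.
    print(err)
```

Listing the vendored directory first means a deployed Space never depends on the parent repo being present, while a local checkout can still fall back to the sibling repositories.
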
### One-shot deploy

```bash
cd app
# Create a new Space (Docker SDK) at https://huggingface.co/new-space
# Then push this directory as the Space's root.
# -b main (Git 2.28+) makes sure the local branch matches the push target below.
git init -b main && git add . && git commit -m "init"
git remote add space https://huggingface.co/spaces//
git push --force space main
```

The Space builds from `Dockerfile`, boots `uvicorn` on port 7860, and serves the SPA at `/`.

### ⚠ Single-user demo

`SESSION` is module-global. **The Space serves one user at a time**: if two people open it simultaneously, they share the same corpus, the same selected models, and (briefly) the same API key. For the LREC tutorial we recommend:

> 🦆 **Each attendee clicks the "⋮ → Duplicate this Space" button** in the
> top-right of the Space page. They get a free private clone, isolated state,
> their own API key in their own browser.

This is the simplest way to fan out the tutorial. Document this prominently in the Space's README.

### Optional: ship a default OpenRouter key

If you want attendees to start without entering a key (e.g., a shared demo key with a rate limit), set a Space **Secret** named `OPENROUTER_API_KEY`. The backend reads it at startup; users can still override it from the UI.

API keys entered through the UI are **never persisted**: they live only in the in-memory `SESSION` dict and are forgotten on restart.

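
The key handling described above (a Space Secret read at startup, overridable from the UI, never written to disk) could look roughly like the following. Only the `OPENROUTER_API_KEY` variable name and the in-memory `SESSION` dict come from this README; the `"api_key"` field and the two helper names are assumptions, so treat this as a sketch and not as the code in `app.py`.

```python
import os
from typing import Optional

# In-memory state only: nothing is written to disk, so a restart forgets UI keys.
# The "api_key" field name is hypothetical.
SESSION: dict[str, Optional[str]] = {
    "api_key": os.environ.get("OPENROUTER_API_KEY"),  # default from the Space Secret
}


def set_ui_key(key: str) -> None:
    """Override the default key with one typed into the UI (memory only)."""
    cleaned = key.strip()
    if not cleaned:
        raise ValueError("API key must not be empty")
    SESSION["api_key"] = cleaned


def current_key() -> str:
    """Return the key for the next request, or fail with a clear message."""
    key = SESSION.get("api_key")
    if not key:
        raise RuntimeError(
            "No OpenRouter key: set OPENROUTER_API_KEY or enter one in the UI"
        )
    return key
```

Because a dict like this is shared by every request the process serves, a key set by one visitor is visible to the next, which is the reason for the single-user caveat above.
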
## File map

| File | Role |
|---|---|
| [app.py](app.py) | FastAPI app: state + REST endpoints |
| [static/index.html](static/index.html) | SPA layout: toolbar, sidebar, corpus panel, modals |
| [static/app.js](static/app.js) | Alpine.js state + handlers + keyboard shortcuts |
| [static/styles.css](static/styles.css) | Token chips, modals, polish |
| [provider.py](provider.py) | OpenRouter async client (JSON-Schema response_format + retry) |
| [moe.py](moe.py) | Pure `aggregate()`: vote / LCS / min / priority |
| [schemas.py](schemas.py) | `AnnotationSchema` + 8 presets |
| [prompts.py](prompts.py) | Templates from tutorial repo + `ICLPool` |
| [io_utils.py](io_utils.py) | Tokenizer + TSV / JSON / CoNLL-U / JSONL I/O |
| [tutorial.py](tutorial.py) | 3 guided examples prefilling the corpus |
| [paths.py](paths.py) | Resolves sibling repos (read-only) |

## License

MIT for this app code. Sandbox data and prompt templates remain under their upstream licenses (see the two `code/` repositories).