Spaces:

ENC-PSL
/

lrec2026-llm-annotator

Running

App Files Files Community

lrec2026-llm-annotator / README.md

dhuser

Update README.md

c216483 verified 6 days ago

preview code

raw

history blame contribute delete

6.28 kB

	---
	title: LREC 2026 LLM-as-Annotator
	emoji: ✒️
	colorFrom: indigo
	colorTo: purple
	sdk: docker
	app_port: 7860
	pinned: false
	license: mit
	short_description: Annotate historical and low-resource languages with LLMs
	---

	# LREC 2026 — LLM-as-Annotator Workbench

	A corpus-centered annotation app built around the LLM-as-annotator pipeline
	described in the LREC 2026 tutorial and the companion LoResLM 2026 paper. The
	text is the focal point; everything else (task schema, models, prompt, ICL
	pool, exports) lives in popups behind toolbar pills.

	## What it does

	- Loads a corpus (paste, file, or sandbox example from the four historical languages of the paper).
	- Annotates token by token with one or more LLMs (single inference or Mixture-of-Experts).
	- Highlights MoE disagreements so the reviewer focuses on contested tokens first.
	- Lets you correct any token in a focused popup with per-model votes, keyboard navigation, bulk operations, and a "re-ask one model" action.
	- Bootstrap loop: corrected sentences feed back into the few-shot pool (filtered by `(language, schema_hash)` to avoid task contamination).
	- Exports as TSV (PIE-baseline round-trip), JSON (schema-conformant), CoNLL-U (UD standard), or JSONL (fine-tune format).

	## Companion paper

	Vidal-Gorène, C., Kindt, B., & Cafiero, F. (2026). *Under-resourced studies of
	under-resourced languages: lemmatization and POS-tagging with LLM annotators
	for historical Armenian, Georgian, Greek and Syriac.* LoResLM 2026.
	[https://aclanthology.org/2026.loreslm-1.28/](https://aclanthology.org/2026.loreslm-1.28/)

	Tutorial repo: [floriancafiero/lrec2026-llm-as-annotator-tutorial](https://github.com/floriancafiero/lrec2026-llm-as-annotator-tutorial)

	## Stack

	- Backend: FastAPI + httpx (async OpenRouter client).
	- Frontend: single static HTML page + Alpine.js (15 KB, CDN) + Tailwind CSS (CDN). No build step.

	## Run locally

	```bash
	cd app
	pip install -r requirements.txt
	python app.py # or: uvicorn app:app --reload --port 7860
	# open http://127.0.0.1:7860
	```

	The app expects the two sibling repos at:

	```
	LREC-tutorial/
	├── code/
	│ ├── EACL2026-historical-languages/ # sandbox corpora + tagsets
	│ └── lrec2026-llm-as-annotator-tutorial/ # JSON schema + system prompts
	└── app/ # this directory
	```

	## Workflow

	1. Sidebar → quick start — click an example corpus (Ancient Greek, Old Armenian, Syriac). The toolbar updates the task, language, and models.
	2. Top bar → 🔑 OpenRouter — paste your API key (kept in this browser session only).
	3. Top bar → ▶ Annotate all — runs every model in parallel (Mixture-of-Experts if 2+ models). Tokens are colored by status: indigo = consensus, amber ⚠ = disagreement.
	4. Click any token → popup with editable fields, per-model votes, keyboard navigation, "adopt from <model>" and "re-ask one model" shortcuts.
	5. 📥 to ICL on a sentence — pushes the corrected annotation into the few-shot pool. The next run re-injects it.
	6. Top bar → export — TSV / JSON / CoNLL-U / JSONL.

	### Keyboard shortcuts

	\| Key \| Action \|
	\|---\|---\|
	\| `j` / `k` \| next / previous token \|
	\| `e` or `↵` \| edit focused token \|
	\| `1`–`9` \| (in editor) assign the i-th visible tag \|
	\| `x` \| toggle selection of focused token \|
	\| `r` \| re-annotate the focused sentence \|
	\| `↵` \| save edit & advance to next disagreement \|
	\| `Esc` \| close popup / clear selection \|
	\| `shift+click` \| multi-select tokens (then "Apply tag…") \|
	\| `right-click` \| per-token context menu \|

	## Deploy on HuggingFace Spaces

	This `app/` directory is self-contained: the tagsets, schemas, system
	prompts, cheatsheet and a slice of the four sandbox corpora are vendored under
	[data/](data/) (≈ 900 KB). You do not need to push the parent repo or use git
	submodules.

	### One-shot deploy

	```bash
	cd app
	# Create a new Space (Docker SDK) at https://huggingface.co/new-space
	# Then push this directory as the Space's root:
	git init && git add . && git commit -m "init"
	git remote add space https://huggingface.co/spaces/<your-user>/<space-name>
	git push --force space main
	```

	The Space builds from `Dockerfile`, boots `uvicorn` on port 7860, and serves
	the SPA at `/`.

	### ⚠ Single-user demo

	`SESSION` is module-global. The Space serves one user at a time — if two
	people open it simultaneously, they share the same corpus, the same selected
	models, and (briefly) the same API key. For the LREC tutorial we recommend:

	> 🦆 Each attendee clicks the "⋮ → Duplicate this Space" button in the
	> top-right of the Space page. They get a free private clone, isolated state,
	> their own API key in their own browser.

	This is the simplest way to fan out the tutorial. Document this prominently on
	the Space's README.

	### Optional: ship a default OpenRouter key

	If you want attendees to start without entering a key (e.g., a shared demo
	key with a rate limit), set a Space Secret named `OPENROUTER_API_KEY`.
	The backend reads it at startup; users can still override it from the UI.

	API keys entered through the UI are never persisted — they live only in
	the in-memory `SESSION` dict and are forgotten on restart.

	## File map

	\| File \| Role \|
	\|---\|---\|
	\| [app.py](app.py) \| FastAPI app: state + REST endpoints \|
	\| [static/index.html](static/index.html) \| SPA layout: toolbar, sidebar, corpus panel, modals \|
	\| [static/app.js](static/app.js) \| Alpine.js state + handlers + keyboard shortcuts \|
	\| [static/styles.css](static/styles.css) \| Token chips, modals, polish \|
	\| [provider.py](provider.py) \| OpenRouter async client (JSON-Schema response_format + retry) \|
	\| [moe.py](moe.py) \| Pure `aggregate()` — vote / LCS / min / priority \|
	\| [schemas.py](schemas.py) \| `AnnotationSchema` + 8 presets \|
	\| [prompts.py](prompts.py) \| Templates from tutorial repo + `ICLPool` \|
	\| [io_utils.py](io_utils.py) \| Tokenizer + TSV / JSON / CoNLL-U / JSONL I/O \|
	\| [tutorial.py](tutorial.py) \| 3 guided examples prefilling the corpus \|
	\| [paths.py](paths.py) \| Resolves sibling repos (read-only) \|

	## License

	MIT for this app code. Sandbox data and prompt templates remain under their
	upstream licenses (see the two `code/` repositories).