# Phase 2 — GLiNER (gliner_medium-v2.1) structured extraction
## Status
**Working end-to-end on a real corpus PDF, both backends.**
## Model
- **Model:** `urchade/gliner_medium-v2.1` (151M params)
- **License:** Apache-2.0 (verified — model card frontmatter; **NOT
  the `gliner_base` variant, which is CC-BY-NC-4.0**).
- **Loader:** `gliner.GLiNER.from_pretrained(...)` — pure HF, no
  third-party fine-tune framework.
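The load call in `extract.py` is the stock GLiNER API; a minimal sketch (nothing beyond the model id is assumed):

```python
from gliner import GLiNER

# First call downloads the checkpoint; subsequent loads reuse the cached
# weights (kept under .cache/ in this phase's layout).
model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")
```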
## Pipeline
1. **`extract.py`** — loads GLiNER, runs `predict_entities()` on a
   paragraph with the 5 typed labels:
   `nyc_location`, `dollar_amount`, `date_range`, `agency`,
   `infrastructure_project`. Threshold 0.45 (tuned by inspection); see
   the sketch after this list.
2. **`extract_from_pdf.py`** — pulls paragraph text from a corpus PDF
   via `pypdf`, runs GLiNER on the longest paragraphs.
3. **`emit_doc.py`** — packages the typed list into a
   `role: "document gliner_<source>"` chat message. doc_id format:
   `gliner_comptroller`, `gliner_dep`, etc.
4. **`run_double_gate.py`** — end-to-end on a corpus PDF + paired
   Ollama/vLLM probe.
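A condensed sketch of steps 1-3 (the blank-line paragraph splitter and the exact message fields here are simplifications, not the scripts' verbatim contents):

```python
from gliner import GLiNER
from pypdf import PdfReader

LABELS = ["nyc_location", "dollar_amount", "date_range", "agency",
          "infrastructure_project"]

def paragraphs_from_pdf(path: str) -> list[str]:
    """Pull page text via pypdf and split paragraphs on blank lines."""
    reader = PdfReader(path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    return sorted(paras, key=len, reverse=True)  # placeholder ranker (finding 4)

def emit_doc(entities: list[dict], source_id: str) -> dict:
    """Package typed spans as a document chat message for the reconciler."""
    lines = [f"[{e['label']}] {e['text']}" for e in entities]
    return {"role": f"document gliner_{source_id}",  # e.g. gliner_comptroller
            "content": "\n".join(lines)}

model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")
paragraph = paragraphs_from_pdf("comptroller_rain_2024.pdf")[0]
entities = model.predict_entities(paragraph, LABELS, threshold=0.45)
print(emit_doc(entities, "comptroller"))
```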
## Validation
### Hand-crafted paragraph (sanity)
> "The NYC Department of Environmental Protection allocated $5.6 million
> for the Bluebelt expansion in Hollis, Queens for fiscal year
> 2025-2027. The Newtown Creek wastewater treatment plant in
> Brooklyn will receive an additional $12 million from NYCHA's
> resilience fund."
GLiNER extracted 9/9 expected entities at score ≥ 0.59:
`[agency] NYC Department of Environmental Protection`,
`[dollar_amount] $5.6 million`,
`[infrastructure_project] Bluebelt expansion`,
`[nyc_location] Hollis, Queens`,
`[date_range] fiscal year 2025-2027`,
`[infrastructure_project] Newtown Creek wastewater treatment plant`,
`[nyc_location] Brooklyn`,
`[dollar_amount] $12 million`,
`[agency] NYCHA`.
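For context, `predict_entities()` returns flat span dicts. The first two hits above look like this (offsets computed from the paragraph; the scores are illustrative, since the run only guarantees every span scored ≥ 0.59):

```python
entities = [
    {"start": 4, "end": 46, "label": "agency", "score": 0.92,
     "text": "NYC Department of Environmental Protection"},
    {"start": 57, "end": 69, "label": "dollar_amount", "score": 0.88,
     "text": "$5.6 million"},
    # ...seven more spans
]
```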
### Real corpus PDF — `comptroller_rain_2024.pdf`
Running on the longest paragraph (~3 KB of text, methodology section):
- 15 entities extracted
- 13× `agency` (mostly `DEP` repeated, plus `New York City Comptroller`
  and `Comptroller's Office`)
- 1× `date_range`
- 1× `nyc_location`
- Three `dollar_amount` hits (`$15,000`, `$22.5 million`, `$875 million`)
  on a different paragraph in the same PDF (run with `--top 2`)
Citation discipline is preserved: the cited `[gliner_comptroller]` resolves
to a real input doc_id, agency tags align with actual surface text, and
neither backend's output contains hallucinated dollar amounts.
## Double-gating
`run_double_gate.py --pdf comptroller_rain_2024.pdf --source-id comptroller`:
| Backend | Latency | Cited content |
|---------|--------:|---------------|
| Ollama (M-series MPS) | 11.94 s | "The NYC Department of Environmental Protection (DEP) has committed to implementing flood mitigation measures as part of the city's preparedness for flash flooding, as detailed in the report by the Office of the NYC Comptroller Brad Lander **[gliner_comptroller]**." |
| vLLM (AMD MI300X) | 0.58 s | "The NYC Department of Environmental Protection (DEP) has committed to implementing flood mitigation measures as part of the City's preparedness outlined in New Normal, Rainfall Ready, and Ida, as documented by the NYC Comptroller's Office in the source **[gliner_comptroller]**." |
Both citations resolve correctly. vLLM is again ~20× faster than
Ollama at this prompt size.
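The probe itself can be a timed POST against each backend's OpenAI-compatible chat endpoint; a minimal sketch (the URLs and model tags below are placeholders, not the ones from this run):

```python
import time

import requests

# Placeholder endpoints and model tags; both servers expose the
# OpenAI-compatible /v1/chat/completions API.
BACKENDS = {
    "ollama": ("http://localhost:11434/v1/chat/completions",
               "granite3.1-dense"),
    "vllm": ("http://mi300x:8000/v1/chat/completions",
             "ibm-granite/granite-3.1-8b-instruct"),
}

def probe(messages: list[dict]) -> None:
    for name, (url, model) in BACKENDS.items():
        t0 = time.perf_counter()
        resp = requests.post(url, json={"model": model, "messages": messages},
                             timeout=120)
        resp.raise_for_status()
        answer = resp.json()["choices"][0]["message"]["content"]
        # Minimal gate: the answer must cite a doc_id we actually supplied.
        assert "[gliner_comptroller]" in answer, f"{name}: citation not resolved"
        print(f"{name}: {time.perf_counter() - t0:.2f} s")
```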
### Findings worth remembering
1. **Threshold tuning matters.** The GLiNER default `threshold=0.5` misses
   `agency = "Comptroller's Office"` (it scored 0.45). 0.45 catches it
   without producing many false positives in the policy corpus. Worth
   re-tuning per source PDF if integrated; see the sweep sketch after
   this list.
2. **GLiNER is fast even on CPU.** Per-paragraph extract is ~0.3 s on
M3 Pro. The model load itself is the dominant cost (~6 s); in
production it stays loaded, so per-call latency is sub-second.
3. **No comparative reasoning over the extractions.** GLiNER returns
typed spans, not relations. The reconciler infers the relation
("DEP allocated $X for Y in Z") from co-occurrence in the
   paragraph. That's fine for our briefings since they are
   paragraph-scoped, but stronger relational extraction (REBEL, etc.)
   would need a different model.
4. **The current ranker is a placeholder.** `extract_from_pdf.py` ranks
paragraphs by length, not query relevance. In production this
specialist consumes the existing Granite Embedding 278M retriever's
top-K rather than picking longest paragraphs.
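Re-tuning per PDF could be as simple as a sweep over candidate thresholds, eyeballing the span count and label mix (a hypothetical helper, not one of the phase's scripts):

```python
def sweep_thresholds(model, paragraph: str, labels: list[str]) -> None:
    """Print span counts per threshold to eyeball the precision/recall trade."""
    for t in (0.60, 0.55, 0.50, 0.45, 0.40):
        spans = model.predict_entities(paragraph, labels, threshold=t)
        found = sorted({s["label"] for s in spans})
        print(f"threshold={t:.2f}: {len(spans):2d} spans, labels={found}")
```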
## Files
```
02_gliner_extraction/
extract.py GLiNER load + predict_entities wrapper
extract_from_pdf.py pypdf paragraph splitter + GLiNER pass
emit_doc.py build gliner_<source> doc message
run_double_gate.py end-to-end + Ollama/vLLM probe
RESULTS.md (this file)
.cache/ GLiNER weights, double_gate_*.json
```
## Conclusion
Specialist works on both backends. **Recommended path forward:**
integrate as a wrapper over the existing `app/rag.py` retriever output:
GLiNER runs on the top-3 retrieved paragraphs and emits one
`gliner_<source_pdf>` doc per paragraph, with the source_id derived
from the PDF filename slug. The wrapper does not replace `rag.py`; it
adds typed structure to its output for the reconciler.
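A sketch of that wrapper, under stated assumptions: `retrieve_top_k()` is a stand-in for the real Granite Embedding 278M retriever call in `app/rag.py` (its actual name and signature will differ), and `emit_doc()` is the helper sketched in the Pipeline section:

```python
from pathlib import Path

def retrieve_top_k(query: str, pdf_path: str, k: int) -> list[str]:
    """Stub for the app/rag.py retriever (real interface not shown here)."""
    raise NotImplementedError

def gliner_wrap(query: str, pdf_path: str, model, labels: list[str],
                k: int = 3) -> list[dict]:
    """Add typed GLiNER structure on top of rag.py retrieval output."""
    source_id = Path(pdf_path).stem.lower().replace("-", "_")  # filename slug
    docs = []
    for para in retrieve_top_k(query, pdf_path, k=k):
        spans = model.predict_entities(para, labels, threshold=0.45)
        if spans:
            # One gliner_<source_pdf> doc per retrieved paragraph.
            docs.append(emit_doc(spans, source_id))  # emit_doc: Pipeline sketch
    return docs
```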