# Phase 2: GLiNER (`gliner_medium-v2.1`) structured extraction
## Status
Working end-to-end on a real corpus PDF, both backends.
## Model

- Model: `urchade/gliner_medium-v2.1` (151M params)
- License: Apache-2.0 (verified in the model card frontmatter; NOT the `gliner_base` variant, which is CC-BY-NC-4.0)
- Loader: `gliner.GLiNER.from_pretrained(...)`, pure HF, no third-party fine-tune framework
## Pipeline

- `extract.py`: loads GLiNER and runs `predict_entities()` on a paragraph with the 5 typed labels: `nyc_location`, `dollar_amount`, `date_range`, `agency`, `infrastructure_project`. Threshold 0.45 (tuned by inspection).
- `extract_from_pdf.py`: pulls paragraph text from a corpus PDF via `pypdf`, runs GLiNER on the longest paragraphs.
- `emit_doc.py`: packages the typed list into a `role: "document gliner_<source>"` chat message. doc_id format: `gliner_comptroller`, `gliner_dep`, etc.
- `run_double_gate.py`: end-to-end on a corpus PDF + paired Ollama/vLLM probe.
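The `extract.py` call shape can be sketched as follows. The GLiNER model call is stubbed here so the threshold filtering is visible standalone; the stub's spans and scores are illustrative, not real model output.

```python
# Sketch of the extract.py wrapper (assumed shape). The real script loads
# gliner.GLiNER.from_pretrained("urchade/gliner_medium-v2.1") and calls
# model.predict_entities(text, labels, threshold=...); here the predictor
# is injected so the filtering logic runs without the model.

LABELS = ["nyc_location", "dollar_amount", "date_range",
          "agency", "infrastructure_project"]
THRESHOLD = 0.45  # tuned by inspection; the 0.5 default drops "Comptroller's Office"

def extract(predict_entities, text, labels=LABELS, threshold=THRESHOLD):
    # predict_entities mimics GLiNER's output: dicts with text/label/score.
    spans = predict_entities(text, labels, threshold=threshold)
    # Filter defensively (GLiNER applies the threshold itself; a stub may not)
    # and sort highest-confidence first.
    return sorted((s for s in spans if s["score"] >= threshold),
                  key=lambda s: s["score"], reverse=True)

# Stub standing in for GLiNER on the sanity paragraph:
def fake_predict(text, labels, threshold):
    return [
        {"text": "NYC Department of Environmental Protection",
         "label": "agency", "score": 0.91},
        {"text": "Comptroller's Office", "label": "agency", "score": 0.45},
    ]

hits = extract(fake_predict, "The NYC Department of ...")
```

With `threshold=0.5` the same stub yields only one span, which is the behaviour the findings below attribute to the default threshold.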
## Validation

### Hand-crafted paragraph (sanity)
"The NYC Department of Environmental Protection allocated $5.6 million for the Bluebelt expansion in Hollis, Queens for fiscal year 2025-2027. The Newtown Creek wastewater treatment plant in Brooklyn will receive an additional $12 million from NYCHA's resilience fund."
GLiNER extracted 9/9 expected entities at score ≥ 0.59:
- [agency] NYC Department of Environmental Protection
- [dollar_amount] $5.6 million
- [infrastructure_project] Bluebelt expansion
- [nyc_location] Hollis, Queens
- [date_range] fiscal year 2025-2027
- [infrastructure_project] Newtown Creek wastewater treatment plant
- [nyc_location] Brooklyn
- [dollar_amount] $12 million
- [agency] NYCHA
### Real corpus PDF: `comptroller_rain_2024.pdf`
Running on the longest paragraph (~3 KB of text, methodology section):
- 15 entities extracted
- 13× `agency` (mostly `DEP` repeated, plus `New York City Comptroller` and `Comptroller's Office`)
- 1× `date_range`
- 1× `nyc_location`
- `dollar_amount` hits ($15,000, $22.5 million, $875 million) on a different paragraph in the same PDF (via `--top 2`)
Citation discipline is preserved: cited [gliner_comptroller] resolves
to a real input doc_id, agency tags align to actual surface text, no
hallucinated dollar amounts in either backend's output.
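The citation-discipline check can be sketched as a small resolver (hypothetical helper; the real probe's validation code may differ):

```python
import re

def check_citations(answer: str, input_doc_ids: set[str]) -> tuple[bool, list[str]]:
    """Return (all_resolve, cited_ids): every [doc_id] bracket in the
    answer must match a doc_id actually supplied as an input document."""
    cited = re.findall(r"\[([a-z0-9_]+)\]", answer)
    return all(c in input_doc_ids for c in cited), cited

ok, cited = check_citations(
    "...flood mitigation measures [gliner_comptroller].",
    {"gliner_comptroller", "gliner_dep"},
)
# ok is True; cited == ["gliner_comptroller"]
```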
## Double-gating

`run_double_gate.py --pdf comptroller_rain_2024.pdf --source-id comptroller`:
| Backend | Latency | Cited content |
|---|---|---|
| Ollama (M-series MPS) | 11.94 s | "The NYC Department of Environmental Protection (DEP) has committed to implementing flood mitigation measures as part of the city's preparedness for flash flooding, as detailed in the report by the Office of the NYC Comptroller Brad Lander [gliner_comptroller]." |
| vLLM (AMD MI300X) | 0.58 s | "The NYC Department of Environmental Protection (DEP) has committed to implementing flood mitigation measures as part of the City's preparedness outlined in New Normal, Rainfall Ready, and Ida, as documented by the NYC Comptroller's Office in the source [gliner_comptroller]." |
Both citations resolve correctly. vLLM is again ~20× faster than Ollama on this prompt size.
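The payload sent to each backend can be sketched as plain chat-message construction (field shape assumed from the `document gliner_<source>` convention above; both backends expose OpenAI-style chat endpoints, so the same messages go to each):

```python
def build_probe(doc_text: str, source_id: str, question: str) -> list[dict]:
    """Messages for one backend probe: the GLiNER document message followed
    by the user question. Sending the identical payload to Ollama and vLLM
    keeps the latency comparison apples-to-apples."""
    doc_id = f"gliner_{source_id}"
    return [
        {"role": f"document {doc_id}", "content": doc_text},
        {"role": "user", "content": question},
    ]

messages = build_probe("[agency] DEP ...", "comptroller",
                       "What has DEP committed to?")
# messages[0]["role"] == "document gliner_comptroller"
```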
## Findings worth remembering

- **Threshold tuning matters.** The default GLiNER `threshold=0.5` misses `agency = "Comptroller's Office"` (it scored 0.45). 0.45 catches it without producing many false positives in the policy corpus. Worth re-tuning per source PDF if integrated.
- **GLiNER is fast even on CPU.** Per-paragraph extraction is 0.3 s on an M3 Pro. The model load itself is the dominant cost (6 s); in production it stays loaded, so per-call latency is sub-second.
- **No comparative reasoning over the extractions.** GLiNER returns typed spans, not relations. The reconciler infers the relation ("DEP allocated $X for Y in Z") from co-occurrence in the paragraph. That's fine for our briefings since they are paragraph-scoped, but stronger relational extraction (REBEL, etc.) would need a different model.
- **The current ranker is a placeholder.** `extract_from_pdf.py` ranks paragraphs by length, not query relevance. In production this specialist consumes the existing Granite Embedding 278M retriever's top-K rather than picking the longest paragraphs.
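The length-based placeholder can be sketched in a few lines (assumed splitting rule: blank-line-separated paragraphs from the pypdf page text):

```python
def top_paragraphs(text: str, k: int = 2) -> list[str]:
    """Placeholder ranker: split extracted page text on blank lines and
    keep the k longest paragraphs. In production this step is replaced by
    the retriever's top-K, which ranks by query relevance instead."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return sorted(paragraphs, key=len, reverse=True)[:k]

paras = top_paragraphs("short\n\n" + "x" * 50 + "\n\n" + "y" * 30, k=2)
# longest first: the 50-char paragraph, then the 30-char one
```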
## Files

`02_gliner_extraction/`
- `extract.py`: GLiNER load + `predict_entities` wrapper
- `extract_from_pdf.py`: `pypdf` paragraph splitter + GLiNER pass
- `emit_doc.py`: builds the `gliner_<source>` doc message
- `run_double_gate.py`: end-to-end + Ollama/vLLM probe
- `RESULTS.md`: (this file)
- `.cache/`: GLiNER weights, `double_gate_*.json`
## Conclusion

Specialist works on both backends. Recommended path forward: integrate as a wrapper over the existing `app/rag.py` retriever output, so that GLiNER runs on the top-3 retrieved paragraphs and emits one `gliner_<source_pdf>` doc per paragraph, with the source_id derived from the PDF filename slug. The wrapper does not replace `rag.py`; it adds typed structure to its output for the reconciler.
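A minimal sketch of that wrapper, assuming a filename-stem slug rule and GLiNER's text/label entity shape (both assumptions; the report only says the source_id is derived from the PDF filename slug):

```python
from pathlib import Path

def slug_from_pdf(pdf_path: str) -> str:
    # Assumed slug rule: lowercase filename stem, hyphens to underscores.
    return Path(pdf_path).stem.lower().replace("-", "_")

def wrap_retrieved(paragraph_entities, pdf_path: str) -> list[dict]:
    """One gliner_<source_pdf> doc message per retrieved paragraph.
    paragraph_entities: list of (paragraph_text, entity_dicts) pairs,
    e.g. the retriever's top-3 paragraphs run through GLiNER."""
    slug = slug_from_pdf(pdf_path)
    docs = []
    for para, ents in paragraph_entities:
        lines = [f"[{e['label']}] {e['text']}" for e in ents]
        docs.append({
            "role": f"document gliner_{slug}",
            # Fall back to the raw paragraph if GLiNER found nothing.
            "content": "\n".join(lines) or para,
        })
    return docs
```

These docs are appended to the `rag.py` output rather than replacing it, matching the "adds typed structure" intent.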