---
tags:
- two-pass-hybrid
base_model: fastino/gliner2-large-v1
library_name: gliner2
license: apache-2.0
---

# GLiNER2 Data Mention Extractor — datause-extraction

Fine-tuned GLiNER2 LoRA adapter for extracting structured data mentions from
development economics and humanitarian research documents.

Mirrored from [`rafmacalaba/gliner2-datause-large-v1-hybrid-entities`](https://huggingface.co/rafmacalaba/gliner2-datause-large-v1-hybrid-entities).

## Architecture: Two-Pass Hybrid

A two-pass pipeline that works around the mode collapse that limits native `extract_json` to 1 mention per chunk:

- **Pass 1** (`extract_entities`): Finds ALL data mention spans using 3 entity types (`named_mention`, `descriptive_mention`, `vague_mention`). Bypasses count_pred entirely.
- **Pass 2** (`extract_json`): Classifies each span individually; count=1 is always correct since each call contains exactly 1 mention.

## Entity Types

- `named_mention`: Proper names and acronyms (DHS, LSMS, FAOSTAT)
- `descriptive_mention`: Described data with identifying detail but no formal name
- `vague_mention`: Generic data references with minimal identifying detail

## Classification Fields

- `typology_tag`: survey / census / administrative / database / indicator / geospatial / microdata / report / other
- `is_used`: True / False
- `usage_context`: primary / supporting / background

## Training

- **Training examples**: 8087
- **Val examples**: 563
- **Best val loss**: None
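The two-pass control flow itself is plain Python and can be sketched with stubbed model calls (the stubs below are illustrative only and stand in for the adapter's `extract_entities` and `extract_json` passes; the real API calls appear in the Usage section):

```python
# Illustrative stubs standing in for the fine-tuned GLiNER2 adapter.
def pass1_find_spans(text):
    # Pass 1: return ALL candidate data-mention spans, keyed by entity type.
    return {
        "named_mention": ["DHS 2018"],
        "descriptive_mention": ["administrative records from the National Statistics Office"],
        "vague_mention": [],
    }

def pass2_classify(mention):
    # Pass 2: classify exactly one mention per call, so count=1 is always correct.
    return {
        "mention_name": mention,
        "typology_tag": "survey" if "DHS" in mention else "administrative",
    }

text = "The analysis draws on the DHS 2018 and administrative records from the National Statistics Office."
records = []
for etype, spans in pass1_find_spans(text).items():
    for mention in spans:
        record = pass2_classify(mention)
        record["specificity"] = etype.replace("_mention", "")
        records.append(record)

print(records)
```

Because Pass 2 only ever sees one mention at a time, the number of output records is fixed entirely by the spans Pass 1 returns.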
## Installation

```bash
pip install git+https://github.com/rafmacalaba/GLiNER2.git@feat/main-mirror
```
## Usage

```python
from gliner2 import GLiNER2
import re

extractor = GLiNER2.from_pretrained("fastino/gliner2-large-v1")
extractor.load_adapter("ai4data/datause-extraction")

ENTITY_SCHEMA = {
    "entities": ["named_mention", "descriptive_mention", "vague_mention"],
    "entity_descriptions": {
        "named_mention": "A proper name or well-known acronym for a data source (DHS, LSMS, FAOSTAT).",
        "descriptive_mention": "A described data reference with identifying detail but no formal name.",
        "vague_mention": "A generic or loosely specified reference to data.",
    },
}

def extract_sentence_context(text, char_start, char_end, margin=1):
    """Return the sentence containing the span plus `margin` sentences on each side."""
    boundaries = [0] + [m.end() for m in re.finditer(r"[.!?]\s+", text)] + [len(text)]
    for i in range(len(boundaries) - 1):
        if boundaries[i] <= char_start < boundaries[i + 1]:
            s = max(0, i - margin)
            e = min(len(boundaries) - 1, i + margin + 1)
            return text[boundaries[s]:boundaries[e]].strip()
    return text

json_schema = (
    extractor.create_schema()
    .structure("data_mention")
    .field("mention_name", dtype="str")
    .field("typology_tag", dtype="str",
           choices=["survey", "census", "administrative", "database", "indicator",
                    "geospatial", "microdata", "report", "other"])
    .field("is_used", dtype="str", choices=["True", "False"])
    .field("usage_context", dtype="str", choices=["primary", "supporting", "background"])
)

text = "The analysis draws on the DHS 2018 and administrative records from the National Statistics Office."

# Pass 1: span detection
pass1 = extractor.extract(text, ENTITY_SCHEMA, threshold=0.3, include_confidence=True, include_spans=True)
entities = pass1.get("entities", {})

# Pass 2: classification per span
results = []
for etype in ["named_mention", "descriptive_mention", "vague_mention"]:
    for span in entities.get(etype, []):
        mention_text = span.get("text", span) if isinstance(span, dict) else span
        char_start = span.get("start", text.find(mention_text)) if isinstance(span, dict) else text.find(mention_text)
        char_end = span.get("end", char_start + len(mention_text)) if isinstance(span, dict) else char_start + len(mention_text)
        context = extract_sentence_context(text, char_start, char_end)
        tags = extractor.extract(context, json_schema)
        tag = (tags.get("data_mention") or [{}])[0]
        results.append({
            "mention_name": mention_text,
            "specificity": etype.replace("_mention", ""),
            "typology": tag.get("typology_tag"),
            "is_used": tag.get("is_used"),
            "usage_context": tag.get("usage_context"),
        })

for r in results:
    print(r)
```
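The `extract_sentence_context` helper above is plain Python with no model dependency, so its windowing behavior (with the default `margin=1`, one neighboring sentence on each side of the mention) can be checked in isolation:

```python
import re

# Same helper as in the usage example above.
def extract_sentence_context(text, char_start, char_end, margin=1):
    boundaries = [0] + [m.end() for m in re.finditer(r"[.!?]\s+", text)] + [len(text)]
    for i in range(len(boundaries) - 1):
        if boundaries[i] <= char_start < boundaries[i + 1]:
            s = max(0, i - margin)
            e = min(len(boundaries) - 1, i + margin + 1)
            return text[boundaries[s]:boundaries[e]].strip()
    return text

text = "First sentence. Second sentence mentions DHS data. Third sentence. Fourth."
start = text.find("DHS")
# The mention sits in sentence 2; margin=1 pulls in sentences 1 and 3.
print(extract_sentence_context(text, start, start + 3))
```

If the text contains no sentence-ending punctuation followed by whitespace, the helper falls back to returning the full text unchanged.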