---
license: apache-2.0
language:
- multilingual
- en
- de
- fr
- es
- pt
- nl
base_model: distilbert-base-multilingual-cased
tags:
- token-classification
- semantic-parsing
- hypergraph
- nlp
pipeline_tag: token-classification
library_name: transformers
---
# Atom Classifier
A multilingual token classifier for **semantic hypergraph parsing**. It classifies each token in a sentence into one of 39 semantic atom types/subtypes, serving as the first stage (alpha) of the [Alpha-Beta semantic hypergraph parser](https://github.com/hyperquest-hq/hyperbase-parser-ab).
## Model Details
- **Architecture:** DistilBertForTokenClassification
- **Base model:** distilbert-base-multilingual-cased
- **Labels:** 39 semantic atom types and subtypes
- **Max sequence length:** 512
## Label Taxonomy
Atoms are typed according to the [Semantic Hyperedge (SH) notation system](https://hyperquest.ai/hyperbase/manual/notation/). The labels fall into 7 groups: six main types with their subtypes, plus one special label for excluded tokens:
### Concepts (C)
| Label | Description |
|-------|-------------|
| `C` | Generic concept |
| `Cc` | Common noun |
| `Cp` | Proper noun |
| `Ca` | Adjective (as concept) |
| `Ci` | Pronoun |
| `Cd` | Determiner (as concept) |
| `Cm` | Nominal modifier |
| `Cw` | Interrogative word |
| `C#` | Number |
### Predicates (P)
| Label | Description |
|-------|-------------|
| `P` | Generic predicate |
| `Pd` | Declarative predicate |
| `P!` | Imperative predicate |
### Modifiers (M)
| Label | Description |
|-------|-------------|
| `M` | Generic modifier |
| `Ma` | Adjective modifier |
| `Mc` | Conceptual modifier |
| `Md` | Determiner modifier |
| `Me` | Adverbial modifier |
| `Mi` | Infinitive particle |
| `Mj` | Conjunctional modifier |
| `Ml` | Particle |
| `Mm` | Modal (auxiliary verb) |
| `Mn` | Negation |
| `Mp` | Possessive modifier |
| `Ms` | Superlative modifier |
| `Mt` | Prepositional modifier |
| `Mv` | Verbal modifier |
| `Mw` | Specifier |
| `M#` | Number modifier |
| `M=` | Comparative modifier |
| `M^` | Degree modifier |
### Builders (B)
| Label | Description |
|-------|-------------|
| `B` | Generic builder |
| `Bp` | Possessive builder |
| `Br` | Relational builder (preposition) |
### Triggers (T)
| Label | Description |
|-------|-------------|
| `T` | Generic trigger |
| `Tt` | Temporal trigger |
| `Tv` | Verbal trigger |
### Conjunctions (J)
| Label | Description |
|-------|-------------|
| `J` | Generic conjunction |
| `Jr` | Relational conjunction |
### Special
| Label | Description |
|-------|-------------|
| `X` | Excluded token (punctuation, etc.) |
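
As the tables above suggest, each label is a main-type letter optionally followed by a single subtype character. The sketch below rebuilds the label set from those tables and splits a label into its two parts. The names `SUBTYPES`, `ALL_LABELS`, and `split_label` are illustrative and not part of the model or of `hyperbase`, and the order of `ALL_LABELS` is not guaranteed to match the model's `id2label` mapping.

```python
from typing import Optional, Tuple

# Subtype characters per main type, transcribed from the tables above.
# Hypothetical helper data: not shipped with the model.
SUBTYPES = {
    "C": "cpaidmw#",           # concepts
    "P": "d!",                 # predicates
    "M": "acdeijlmnpstvw#=^",  # modifiers
    "B": "pr",                 # builders
    "T": "tv",                 # triggers
    "J": "r",                  # conjunctions
    "X": "",                   # excluded tokens (no subtypes)
}

# Each main type contributes its generic label plus one label per subtype.
ALL_LABELS = [
    main + sub
    for main, subs in SUBTYPES.items()
    for sub in [""] + list(subs)
]


def split_label(label: str) -> Tuple[str, Optional[str]]:
    """Split a label such as 'Cp' into (main type, subtype or None)."""
    if label not in ALL_LABELS:
        raise ValueError(f"unknown atom label: {label!r}")
    return label[0], (label[1:] or None)


print(len(ALL_LABELS))    # 39
print(split_label("Cp"))  # ('C', 'p')
print(split_label("J"))   # ('J', None)
```

Grouping by the first character can be convenient when only the main type matters, for example when treating every `M*` label as a modifier.
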
## Usage
```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hyperquest/atom-classifier")
model = AutoModelForTokenClassification.from_pretrained("hyperquest/atom-classifier")

sentence = "Berlin is the capital of Germany."
# Offsets map each sub-word token back to its character span in the sentence.
encoded = tokenizer(sentence, return_tensors="pt", return_offsets_mapping=True)
offset_mapping = encoded.pop("offset_mapping")  # not a model input

with torch.no_grad():
    outputs = model(**encoded)
# Highest-scoring label id for each token of the first (only) sequence.
predictions = outputs.logits.argmax(-1)[0].tolist()

word_ids = encoded.word_ids(0)
for idx, word_id in enumerate(word_ids):
    # Special tokens such as [CLS] and [SEP] have no word id; skip them.
    if word_id is not None:
        start, end = offset_mapping[0][idx].tolist()
        label = model.config.id2label[predictions[idx]]
        print(f"{sentence[start:end]:15s} -> {label}")
```
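
The snippet above prints one label per sub-word token, so a word that the tokenizer splits into several pieces appears more than once. A minimal sketch of reducing this to one label per word follows. It keeps the label of the first sub-word of each word, which is one common convention and not necessarily what `hyperbase-parser-ab` does. The function name `first_subword_labels` is hypothetical, and the `word_ids` and `labels` values in the example are hand-written placeholders, not model output.

```python
from typing import List, Optional


def first_subword_labels(
    word_ids: List[Optional[int]], labels: List[str]
) -> List[str]:
    """Return one label per word, taken from the word's first sub-word token.

    `word_ids` has the shape returned by `encoded.word_ids(0)`: one entry per
    token, with None for special tokens. `labels` holds the per-token labels.
    """
    seen = set()
    word_labels: List[str] = []
    for word_id, label in zip(word_ids, labels):
        # Skip special tokens and any sub-word after the first of a word.
        if word_id is None or word_id in seen:
            continue
        seen.add(word_id)
        word_labels.append(label)
    return word_labels


# Placeholder input: word 1 is split into two sub-word tokens.
example_word_ids = [None, 0, 1, 1, 2, None]
example_labels = ["X", "Cp", "Cc", "Ca", "X", "X"]
print(first_subword_labels(example_word_ids, example_labels))
# ['Cp', 'Cc', 'X']
```

With the usage snippet, the per-token labels could be built as `[model.config.id2label[p] for p in predictions]` and passed in together with `word_ids`.
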
## Intended Use
This model is intended as the first stage of the Alpha-Beta semantic hypergraph parser (`hyperbase-parser-ab`). It assigns an atom type to each token; the beta stage then combines the typed atoms into nested hypergraph structures using a rule-based grammar.
## Part of
- [hyperbase](https://github.com/hyperquest-hq/hyperbase) -- Semantic Hypergraph toolkit
- [hyperbase-parser-ab](https://github.com/hyperquest-hq/hyperbase-parser-ab) -- Alpha-Beta parser