---
license: apache-2.0
language:
  - multilingual
  - en
  - de
  - fr
  - es
  - pt
  - nl
base_model: distilbert-base-multilingual-cased
tags:
  - token-classification
  - semantic-parsing
  - hypergraph
  - nlp
pipeline_tag: token-classification
library_name: transformers
---

# Atom Classifier

A multilingual token classifier for **semantic hypergraph parsing**. It assigns each token in a sentence one of 39 semantic atom labels (types and subtypes) and serves as the first (alpha) stage of the [Alpha-Beta semantic hypergraph parser](https://github.com/hyperquest-hq/hyperbase-parser-ab).

## Model Details

- **Architecture:** DistilBertForTokenClassification
- **Base model:** distilbert-base-multilingual-cased
- **Labels:** 39 semantic atom labels (types and subtypes)
- **Max sequence length:** 512

## Label Taxonomy

Atoms are typed according to the [Semantic Hyperedge (SH) notation system](https://hyperquest.ai/hyperbase/manual/notation/). The labels fall into 7 groups, listed below with their subtypes:

### Concepts (C)
| Label | Description |
|-------|-------------|
| `C` | Generic concept |
| `Cc` | Common noun |
| `Cp` | Proper noun |
| `Ca` | Adjective (as concept) |
| `Ci` | Pronoun |
| `Cd` | Determiner (as concept) |
| `Cm` | Nominal modifier |
| `Cw` | Interrogative word |
| `C#` | Number |

### Predicates (P)
| Label | Description |
|-------|-------------|
| `P` | Generic predicate |
| `Pd` | Declarative predicate |
| `P!` | Imperative predicate |

### Modifiers (M)
| Label | Description |
|-------|-------------|
| `M` | Generic modifier |
| `Ma` | Adjective modifier |
| `Mc` | Conceptual modifier |
| `Md` | Determiner modifier |
| `Me` | Adverbial modifier |
| `Mi` | Infinitive particle |
| `Mj` | Conjunctional modifier |
| `Ml` | Particle |
| `Mm` | Modal (auxiliary verb) |
| `Mn` | Negation |
| `Mp` | Possessive modifier |
| `Ms` | Superlative modifier |
| `Mt` | Prepositional modifier |
| `Mv` | Verbal modifier |
| `Mw` | Specifier |
| `M#` | Number modifier |
| `M=` | Comparative modifier |
| `M^` | Degree modifier |

### Builders (B)
| Label | Description |
|-------|-------------|
| `B` | Generic builder |
| `Bp` | Possessive builder |
| `Br` | Relational builder (preposition) |

### Triggers (T)
| Label | Description |
|-------|-------------|
| `T` | Generic trigger |
| `Tt` | Temporal trigger |
| `Tv` | Verbal trigger |

### Conjunctions (J)
| Label | Description |
|-------|-------------|
| `J` | Generic conjunction |
| `Jr` | Relational conjunction |

### Special
| Label | Description |
|-------|-------------|
| `X` | Excluded token (punctuation, etc.) |
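
In the tables above, each label starts with a one-character main type, optionally followed by a one-character subtype. The sketch below splits a label into those two parts. The helper names `split_label` and `main_type_name` and the `MAIN_TYPES` dictionary are hypothetical, introduced only for illustration; they are not part of the model or the parser.

```python
from typing import Tuple

# Main type letters as listed in the tables above (illustrative mapping).
MAIN_TYPES = {
    "C": "concept",
    "P": "predicate",
    "M": "modifier",
    "B": "builder",
    "T": "trigger",
    "J": "conjunction",
    "X": "excluded",
}


def split_label(label: str) -> Tuple[str, str]:
    """Split a label such as "Cp" into (main type, subtype).

    The subtype is an empty string for generic labels such as "C" or "X".
    """
    if not label:
        raise ValueError("label must be a non-empty string")
    return label[0], label[1:]


def main_type_name(label: str) -> str:
    """Return the readable name of the label's main type."""
    main, _ = split_label(label)
    return MAIN_TYPES[main]


for example in ["Cp", "P!", "M#", "X"]:
    print(example, split_label(example), main_type_name(example))
```

Grouping predictions by main type in this way can be convenient when only the coarse category matters, for example when counting concepts versus modifiers in a sentence.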

## Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("hyperquest/atom-classifier")
model = AutoModelForTokenClassification.from_pretrained("hyperquest/atom-classifier")

sentence = "Berlin is the capital of Germany."
# Offsets map each subword token back to a character span in the sentence.
encoded = tokenizer(sentence, return_tensors="pt", return_offsets_mapping=True)
# The model does not accept offsets as input, so remove them before the forward pass.
offset_mapping = encoded.pop("offset_mapping")

with torch.no_grad():
    outputs = model(**encoded)

# Highest-scoring label id for each subword token of the first (only) sequence.
predictions = outputs.logits.argmax(-1)[0].tolist()
# word_ids is None for special tokens such as [CLS] and [SEP].
word_ids = encoded.word_ids(0)

# Prints one line per subword token, skipping special tokens.
for idx, word_id in enumerate(word_ids):
    if word_id is not None:
        start, end = offset_mapping[0][idx].tolist()
        label = model.config.id2label[predictions[idx]]
        print(f"{sentence[start:end]:15s} -> {label}")
```
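
The loop above emits one prediction per subword token, so a word that the tokenizer splits into several pieces appears more than once. A minimal sketch of collapsing these to one label per word follows, assuming the label of the first subword is taken as the label of the whole word. This is a common convention for token classification, not necessarily what `hyperbase-parser-ab` does internally, and the function name `first_subword_labels` is hypothetical. The demo uses made-up inputs so it runs without downloading the model.

```python
from typing import Dict, List, Optional


def first_subword_labels(
    word_ids: List[Optional[int]],
    predictions: List[int],
    id2label: Dict[int, str],
) -> List[str]:
    """Return one label per word, taken from each word's first subword.

    word_ids holds the word index of every token (None for special tokens),
    and predictions holds the predicted label id of every token.
    """
    labels: List[str] = []
    previous: Optional[int] = None
    for word_id, prediction in zip(word_ids, predictions):
        # Keep a prediction only when a new word starts at this token.
        if word_id is not None and word_id != previous:
            labels.append(id2label[prediction])
        previous = word_id
    return labels


# Made-up example: word 1 is split into two subwords; None marks special tokens.
demo_word_ids = [None, 0, 1, 1, 2, None]
demo_predictions = [5, 1, 2, 3, 0, 5]
demo_id2label = {0: "X", 1: "Cp", 2: "Pd", 3: "Md", 5: "C"}

print(first_subword_labels(demo_word_ids, demo_predictions, demo_id2label))
```

With the variables from the usage example, the call would be `first_subword_labels(word_ids, predictions, model.config.id2label)`.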

## Intended Use

This model is intended as the first stage of the Alpha-Beta semantic hypergraph parser (`hyperbase-parser-ab`). It assigns atom types to tokens, which the beta stage then combines into nested hypergraph structures using a rule-based grammar.

## Part of

- [hyperbase](https://github.com/hyperquest-hq/hyperbase) -- Semantic Hypergraph toolkit
- [hyperbase-parser-ab](https://github.com/hyperquest-hq/hyperbase-parser-ab) -- Alpha-Beta parser