---
license: apache-2.0
language:
  - multilingual
  - en
  - de
  - fr
  - es
  - pt
  - nl
base_model: distilbert-base-multilingual-cased
tags:
  - token-classification
  - semantic-parsing
  - hypergraph
  - nlp
pipeline_tag: token-classification
library_name: transformers
---

# Atom Classifier

A multilingual token classifier for **semantic hypergraph parsing**. It classifies each token in a sentence into one of 39 semantic atom types/subtypes, serving as the first stage (alpha) of the [Alpha-Beta semantic hypergraph parser](https://github.com/hyperquest-hq/hyperbase-parser-ab).

## Model Details

- **Architecture:** DistilBertForTokenClassification
- **Base model:** distilbert-base-multilingual-cased
- **Labels:** 39 semantic atom types and subtypes
- **Max sequence length:** 512

## Label Taxonomy

Atoms are typed according to the [Semantic Hyperedge (SH) notation system](https://hyperquest.ai/hyperbase/manual/notation/). The labels fall into 7 groups: six main types with their subtypes, plus a special label for excluded tokens.

### Concepts (C)
| Label | Description |
|-------|-------------|
| `C` | Generic concept |
| `Cc` | Common noun |
| `Cp` | Proper noun |
| `Ca` | Adjective (as concept) |
| `Ci` | Pronoun |
| `Cd` | Determiner (as concept) |
| `Cm` | Nominal modifier |
| `Cw` | Interrogative word |
| `C#` | Number |

### Predicates (P)
| Label | Description |
|-------|-------------|
| `P` | Generic predicate |
| `Pd` | Declarative predicate |
| `P!` | Imperative predicate |

### Modifiers (M)
| Label | Description |
|-------|-------------|
| `M` | Generic modifier |
| `Ma` | Adjective modifier |
| `Mc` | Conceptual modifier |
| `Md` | Determiner modifier |
| `Me` | Adverbial modifier |
| `Mi` | Infinitive particle |
| `Mj` | Conjunctional modifier |
| `Ml` | Particle |
| `Mm` | Modal (auxiliary verb) |
| `Mn` | Negation |
| `Mp` | Possessive modifier |
| `Ms` | Superlative modifier |
| `Mt` | Prepositional modifier |
| `Mv` | Verbal modifier |
| `Mw` | Specifier |
| `M#` | Number modifier |
| `M=` | Comparative modifier |
| `M^` | Degree modifier |

### Builders (B)
| Label | Description |
|-------|-------------|
| `B` | Generic builder |
| `Bp` | Possessive builder |
| `Br` | Relational builder (preposition) |

### Triggers (T)
| Label | Description |
|-------|-------------|
| `T` | Generic trigger |
| `Tt` | Temporal trigger |
| `Tv` | Verbal trigger |

### Conjunctions (J)
| Label | Description |
|-------|-------------|
| `J` | Generic conjunction |
| `Jr` | Relational conjunction |

### Special
| Label | Description |
|-------|-------------|
| `X` | Excluded token (punctuation, etc.) |
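
The labels in the tables above follow a simple pattern: the first character names the main type and an optional second character names the subtype. The sketch below splits a label on that assumption. `MAIN_TYPES` and `split_label` are hypothetical names used only for illustration; they are not part of the model or of `hyperbase-parser-ab`.

```python
from typing import Optional, Tuple

# Illustrative helper, not part of hyperbase or the model's API.
# Assumes every label is one main-type letter plus an optional
# one-character subtype, as in the tables above.
MAIN_TYPES = {
    "C": "concept",
    "P": "predicate",
    "M": "modifier",
    "B": "builder",
    "T": "trigger",
    "J": "conjunction",
    "X": "excluded",
}


def split_label(label: str) -> Tuple[str, Optional[str]]:
    """Return (main_type, subtype); subtype is None for generic labels."""
    if not label or label[0] not in MAIN_TYPES or len(label) > 2:
        raise ValueError(f"unknown atom label: {label!r}")
    return label[0], (label[1:] or None)


print(split_label("Cp"))  # ('C', 'p')
print(split_label("M"))   # ('M', None)
```

Grouping predictions by main type this way can be handy for coarse error analysis, for example checking whether mistakes stay within the modifier family.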

## Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load the tokenizer and the fine-tuned token classifier from the Hub.
tokenizer = AutoTokenizer.from_pretrained("hyperquest/atom-classifier")
model = AutoModelForTokenClassification.from_pretrained("hyperquest/atom-classifier")

sentence = "Berlin is the capital of Germany."
# Offsets map each subword piece back to its character span in the sentence.
encoded = tokenizer(sentence, return_tensors="pt", return_offsets_mapping=True)
offset_mapping = encoded.pop("offset_mapping")  # not a model input

with torch.no_grad():
    outputs = model(**encoded)

# Highest-scoring label id for each subword piece.
predictions = outputs.logits.argmax(-1)[0].tolist()
word_ids = encoded.word_ids(0)

# Print one line per subword piece, skipping special tokens such as [CLS] and [SEP].
for idx, word_id in enumerate(word_ids):
    if word_id is not None:
        start, end = offset_mapping[0][idx].tolist()
        label = model.config.id2label[predictions[idx]]
        print(f"{sentence[start:end]:15s} -> {label}")
```
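
The loop above prints one label per subword piece, so a word that the tokenizer splits into several pieces appears more than once. If word-level labels are needed, one common convention is to keep the prediction of each word's first piece. This is an assumption for illustration, not necessarily what `hyperbase-parser-ab` does internally. The hypothetical helper `first_piece_labels` below works on plain lists, so it can be tried without loading the model.

```python
from typing import List, Optional


def first_piece_labels(
    word_ids: List[Optional[int]], labels: List[str]
) -> List[str]:
    """Keep the label of the first subword piece of each word.

    word_ids: per-token word index, None for special tokens
              (the shape returned by `encoded.word_ids(0)`).
    labels:   per-token label strings, same length as word_ids.
    """
    if len(word_ids) != len(labels):
        raise ValueError("word_ids and labels must have the same length")
    word_labels: List[str] = []
    previous: Optional[int] = None
    for word_id, label in zip(word_ids, labels):
        if word_id is None:
            continue  # special token, carries no word
        if word_id != previous:
            word_labels.append(label)  # first piece of a new word
        previous = word_id
    return word_labels


# Made-up alignment and labels for illustration, not real model output.
word_ids = [None, 0, 1, 1, 2, None]
labels = ["X", "Cp", "Pd", "Md", "Cc", "X"]
print(first_piece_labels(word_ids, labels))  # ['Cp', 'Pd', 'Cc']
```

With the snippet above, the inputs would be `encoded.word_ids(0)` and `[model.config.id2label[p] for p in predictions]`.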

## Intended Use

This model is intended as the first stage of the Alpha-Beta semantic hypergraph parser (`hyperbase-parser-ab`). It assigns an atom type to each token; the beta stage then combines the typed tokens into nested hypergraph structures using a rule-based grammar.

## Part of

- [hyperbase](https://github.com/hyperquest-hq/hyperbase) -- Semantic Hypergraph toolkit
- [hyperbase-parser-ab](https://github.com/hyperquest-hq/hyperbase-parser-ab) -- Alpha-Beta parser