permutans committed on
Commit 4d0233b · verified · 1 Parent(s): 6d23e34

Upload folder using huggingface_hub

Files changed (5)
  1. README.md +106 -99
  2. config.json +66 -18
  3. model.safetensors +2 -2
  4. tokenizer.json +0 -0
  5. tokenizer_config.json +7 -5
README.md CHANGED
@@ -2,7 +2,7 @@
2
  license: mit
3
  tags:
4
  - text-classification
5
- - bert
6
  - orality
7
  - linguistics
8
  - rhetorical-analysis
@@ -12,7 +12,7 @@ metrics:
12
  - f1
13
  - accuracy
14
  base_model:
15
- - google-bert/bert-base-uncased
16
  pipeline_tag: text-classification
17
  library_name: transformers
18
  datasets:
@@ -25,16 +25,16 @@ model-index:
25
  name: Marker Subtype Classification
26
  metrics:
27
  - type: f1
28
- value: 0.500
29
  name: F1 (macro)
30
  - type: accuracy
31
- value: 0.498
32
  name: Accuracy
33
  ---
34
 
35
  # Havelock Marker Subtype Classifier
36
 
37
- BERT-based classifier for **71 fine-grained rhetorical marker subtypes** on the oral–literate spectrum, grounded in Walter Ong's *Orality and Literacy* (1982).
38
 
39
  This is the finest level of the Havelock span classification hierarchy. Given a text span identified as a rhetorical marker, the model classifies it into one of 71 specific rhetorical devices (e.g., `anaphora`, `epistemic_hedge`, `vocative`, `nested_clauses`).
40
 
@@ -42,14 +42,14 @@ This is the finest level of the Havelock span classification hierarchy. Given a
42
 
43
  | Property | Value |
44
  |----------|-------|
45
- | Base model | `bert-base-uncased` |
46
- | Architecture | `BertForSequenceClassification` |
47
  | Task | Multi-class classification (71 classes) |
48
  | Max sequence length | 128 tokens |
49
- | Test F1 (macro) | **0.500** |
50
- | Test Accuracy | **0.498** |
51
- | Missing labels (test) | 1/71 (`rhyme`) |
52
- | Parameters | ~109M |
53
 
54
  ## Usage
55
  ```python
@@ -100,7 +100,7 @@ print(f"Marker subtype: {model.config.id2label[pred]}")
100
 
101
  ### Data
102
 
103
- Span-level annotations from the Havelock corpus with marker types normalized against a canonical taxonomy at build time. Each span carries a `marker_subtype` field. Only subtypes with ≥50 examples are included. A stratified 80/10/10 train/val/test split was used with swap-based optimization to balance label distributions across splits. The test set contains 2,357 spans.
104
 
105
  ### Hyperparameters
106
 
@@ -115,7 +115,11 @@ Span-level annotations from the Havelock corpus with marker types normalized aga
115
  | Loss | Focal loss (γ=2.0) + class weights |
116
  | Mixout | 0.1 |
117
  | Mixed precision | FP16 |
118
- | Min examples per class | 50 |
 
 
 
 
119
 
120
  ### Test Set Classification Report
121
 
@@ -123,105 +127,106 @@ Span-level annotations from the Havelock corpus with marker types normalized aga
123
  ```
124
  precision recall f1-score support
125
 
126
- abstract_noun 0.376 0.364 0.370 88
127
- additive_formal 0.455 0.417 0.435 12
128
- agent_demoted 0.533 0.800 0.640 10
129
- agentless_passive 0.542 0.456 0.495 57
130
- alliteration 0.714 0.500 0.588 10
131
- anaphora 0.490 0.585 0.533 41
132
  antithesis 0.947 0.818 0.878 22
133
- aside 0.225 0.243 0.234 37
134
- assonance 0.926 1.000 0.962 25
135
- asyndeton 0.583 0.500 0.538 14
136
- audience_response 0.778 0.700 0.737 10
137
- categorical_statement 0.209 0.450 0.286 20
138
- causal_chain 0.425 0.405 0.415 42
139
- causal_explicit 0.537 0.468 0.500 47
140
- citation 0.794 0.587 0.675 46
141
- conceptual_metaphor 0.176 0.077 0.107 39
142
- concessive 0.617 0.644 0.630 45
143
- concessive_connector 0.833 0.833 0.833 18
144
- conditional 0.582 0.655 0.616 87
145
- conflict_frame 0.588 0.667 0.625 15
146
- contrastive 0.442 0.557 0.493 61
147
  cross_reference 0.733 0.458 0.564 24
148
- definitional_move 0.333 0.200 0.250 10
149
- discourse_formula 0.485 0.424 0.452 118
150
- dramatic_pause 0.875 0.700 0.778 10
151
- embodied_action 0.271 0.310 0.289 42
152
- enumeration 0.556 0.581 0.568 43
153
- epistemic_hedge 0.206 0.500 0.292 14
154
- epistrophe 0.778 0.875 0.824 16
155
- epithet 0.385 0.417 0.400 12
156
- everyday_example 0.278 0.179 0.217 28
157
- evidential 0.606 0.541 0.571 37
158
- footnote_reference 0.444 0.400 0.421 10
159
- imperative 0.628 0.590 0.608 100
160
- inclusive_we 0.561 0.627 0.592 59
161
- institutional_subject 0.947 0.857 0.900 21
162
- intensifier_doubling 0.905 0.864 0.884 22
163
- lexical_repetition 0.447 0.467 0.457 45
164
- list_structure 0.190 0.174 0.182 23
165
- metadiscourse 0.073 0.182 0.104 22
166
- methodological_framing 0.500 0.238 0.323 21
167
- named_individual 0.455 0.333 0.385 30
168
- nested_clauses 0.294 0.326 0.309 46
169
- nominalization 0.353 0.429 0.387 56
170
- objectifying_stance 0.167 0.300 0.214 10
171
- parallelism 0.188 0.222 0.203 27
172
- phatic_check 0.444 0.364 0.400 11
173
- phatic_filler 0.300 0.600 0.400 10
174
- polysyndeton 1.000 0.833 0.909 24
175
- probability 0.500 0.682 0.577 22
176
- proverb 0.059 0.100 0.074 10
177
- qualified_assertion 0.280 0.241 0.259 29
178
- refrain 0.850 0.708 0.773 24
179
- relative_chain 0.431 0.455 0.442 55
180
- religious_formula 1.000 0.688 0.815 16
181
- rhetorical_question 0.646 0.738 0.689 84
182
- rhyme 0.000 0.000 0.000 10
183
- rhythm 1.000 0.625 0.769 16
184
- second_person 0.573 0.474 0.519 116
185
- self_correction 0.952 0.500 0.656 40
186
- sensory_detail 0.538 0.350 0.424 20
187
- simple_conjunction 0.133 0.200 0.160 10
188
- specific_place 0.625 0.278 0.385 18
189
- technical_abbreviation 0.818 0.321 0.462 28
190
- technical_term 0.438 0.432 0.435 74
191
- temporal_anchor 0.472 0.500 0.486 34
192
- temporal_embedding 0.475 0.604 0.532 48
193
- third_person_reference 0.692 0.900 0.783 10
194
- tricolon 0.667 0.667 0.667 18
195
- us_them 0.750 0.500 0.600 18
196
- vocative 0.414 0.600 0.490 20
197
-
198
- accuracy 0.498 2357
199
- macro avg 0.528 0.497 0.500 2357
200
- weighted avg 0.525 0.498 0.502 2357
201
  ```
202
 
203
  </details>
204
 
205
- **Top performing subtypes (F1 ≥ 0.75):** `assonance` (0.962), `polysyndeton` (0.909), `institutional_subject` (0.900), `intensifier_doubling` (0.884), `antithesis` (0.878), `concessive_connector` (0.833), `epistrophe` (0.824), `religious_formula` (0.815), `third_person_reference` (0.783), `dramatic_pause` (0.778), `refrain` (0.773), `rhythm` (0.769).
206
 
207
- **Weakest subtypes (F1 < 0.20):** `rhyme` (0.000), `proverb` (0.074), `metadiscourse` (0.104), `conceptual_metaphor` (0.107), `simple_conjunction` (0.160), `list_structure` (0.182). These tend to be semantically diffuse classes that overlap heavily with neighbouring subtypes or have very low test support.
208
 
209
  ## Class Distribution
210
 
211
- The test set exhibits significant imbalance across 71 classes:
212
 
213
- | Support Range | Classes | % of Total |
214
- |---------------|---------|------------|
215
- | >100 | 3 (`discourse_formula`, `second_person`, `imperative`) | 4% |
216
- | 50–100 | 11 | 15% |
217
- | 25–50 | 26 | 37% |
218
- | 10–25 | 31 | 44% |
 
219
 
220
  ## Limitations
221
 
222
- - **71-way classification on ~22k spans**: The data budget per class is thin, particularly for classes near the 50-example minimum. More data or class consolidation would help.
223
  - **Semantic overlap**: Some subtypes are difficult to distinguish from surface text alone (e.g., `parallelism` vs `anaphora` vs `tricolon`; `epistemic_hedge` vs `qualified_assertion` vs `probability`). The model may benefit from hierarchical classification that conditions on type-level predictions.
224
- - **Recall-precision tradeoff on rare classes**: Many rare classes show high precision but lower recall (e.g., `self_correction`: P=0.952, R=0.500; `religious_formula`: P=1.000, R=0.688), suggesting the model learns narrow prototypes but misses variation.
225
  - **Span-level only**: Requires pre-extracted spans. Does not detect boundaries.
226
  - **128-token context window**: Longer spans are truncated.
227
 
@@ -235,7 +240,7 @@ The 71 subtypes represent the full granularity of the Havelock taxonomy, operati
235
  |-------|------|---------|-----|
236
  | [`HavelockAI/bert-marker-category`](https://huggingface.co/HavelockAI/bert-marker-category) | Binary (oral/literate) | 2 | 0.875 |
237
  | [`HavelockAI/bert-marker-type`](https://huggingface.co/HavelockAI/bert-marker-type) | Functional type | 18 | 0.583 |
238
- | **This model** | Fine-grained subtype | 71 | 0.500 |
239
  | [`HavelockAI/bert-orality-regressor`](https://huggingface.co/HavelockAI/bert-orality-regressor) | Document-level score | Regression | MAE 0.079 |
240
  | [`HavelockAI/bert-token-classifier`](https://huggingface.co/HavelockAI/bert-token-classifier) | Span detection (BIO) | 145 | 0.500 |
241
 
@@ -252,7 +257,9 @@ The 71 subtypes represent the full granularity of the Havelock taxonomy, operati
252
  ## References
253
 
254
  - Ong, Walter J. *Orality and Literacy: The Technologizing of the Word*. Routledge, 1982.
 
 
255
 
256
  ---
257
 
258
- *Model version: b31f147d · Trained: February 2026*
 
2
  license: mit
3
  tags:
4
  - text-classification
5
+ - modernbert
6
  - orality
7
  - linguistics
8
  - rhetorical-analysis
 
12
  - f1
13
  - accuracy
14
  base_model:
15
+ - answerdotai/ModernBERT-base
16
  pipeline_tag: text-classification
17
  library_name: transformers
18
  datasets:
 
25
  name: Marker Subtype Classification
26
  metrics:
27
  - type: f1
28
+ value: 0.493
29
  name: F1 (macro)
30
  - type: accuracy
31
+ value: 0.500
32
  name: Accuracy
33
  ---
34
 
35
  # Havelock Marker Subtype Classifier
36
 
37
+ ModernBERT-based classifier for **71 fine-grained rhetorical marker subtypes** on the oral–literate spectrum, grounded in Walter Ong's *Orality and Literacy* (1982).
38
 
39
  This is the finest level of the Havelock span classification hierarchy. Given a text span identified as a rhetorical marker, the model classifies it into one of 71 specific rhetorical devices (e.g., `anaphora`, `epistemic_hedge`, `vocative`, `nested_clauses`).
40
 
 
42
 
43
  | Property | Value |
44
  |----------|-------|
45
+ | Base model | `answerdotai/ModernBERT-base` |
46
+ | Architecture | `ModernBertForSequenceClassification` |
47
  | Task | Multi-class classification (71 classes) |
48
  | Max sequence length | 128 tokens |
49
+ | Test F1 (macro) | **0.493** |
50
+ | Test Accuracy | **0.500** |
51
+ | Missing labels (test) | 1/71 (`proverb`) |
52
+ | Parameters | ~149M |
53
 
54
  ## Usage
55
  ```python
 
100
 
101
  ### Data
102
 
103
+ 22,367 span-level annotations from the Havelock corpus with marker types normalized against a canonical taxonomy at build time. Each span carries a `marker_subtype` field. Only subtypes with ≥10 examples are included. A stratified 80/10/10 train/val/test split was used with swap-based optimization to balance label distributions across splits. The test set contains 2,357 spans.
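The filtering and splitting described above can be sketched roughly as follows. This is a minimal illustration only: the function name `stratified_split` and the toy span dictionaries are hypothetical, and the swap-based optimization mentioned in the README is not reproduced here.

```python
import random
from collections import defaultdict


def stratified_split(spans, min_examples=10, ratios=(0.8, 0.1, 0.1), seed=0):
    """Drop rare subtypes, then split each remaining subtype ~80/10/10.

    Assumption: each span is a dict with a `marker_subtype` key, as the
    README describes. The swap-based balancing step is NOT implemented.
    """
    by_label = defaultdict(list)
    for span in spans:
        by_label[span["marker_subtype"]].append(span)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(by_label):
        items = by_label[label]
        if len(items) < min_examples:
            continue  # subtype is below the minimum-examples threshold
        rng.shuffle(items)
        n_val = max(1, round(len(items) * ratios[1]))
        n_test = max(1, round(len(items) * ratios[2]))
        val.extend(items[:n_val])
        test.extend(items[n_val:n_val + n_test])
        train.extend(items[n_val + n_test:])
    return train, val, test


# Toy data: one subtype with 20 spans, one with only 5 (filtered out).
spans = [{"text": f"span {i}", "marker_subtype": "anaphora"} for i in range(20)]
spans += [{"text": f"rare {i}", "marker_subtype": "rare_subtype"} for i in range(5)]
train, val, test = stratified_split(spans)
print(len(train), len(val), len(test))
```

Splitting per subtype keeps every retained class represented in all three splits, which matters when the smallest classes have only around ten examples.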
104
 
105
  ### Hyperparameters
106
 
 
115
  | Loss | Focal loss (γ=2.0) + class weights |
116
  | Mixout | 0.1 |
117
  | Mixed precision | FP16 |
118
+ | Min examples per class | 10 |
119
+
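The loss row in the table above names focal loss with γ=2.0 plus class weights. A minimal pure-Python sketch of the standard per-example formulation is shown below. How the class weights are computed is not stated in the README, so the weights here are placeholders, and the function name `focal_loss` is hypothetical.

```python
import math


def focal_loss(logits, target, class_weights, gamma=2.0):
    """Class-weighted focal loss for one example.

    FL = -w_t * (1 - p_t)^gamma * log(p_t), where p_t is the softmax
    probability assigned to the true class. With gamma=0 and unit
    weights this reduces to ordinary cross-entropy.
    """
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    p_t = exps[target] / sum(exps)
    return -class_weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)


# Placeholder weights; the real weighting scheme is an assumption.
weights = [1.0, 1.0]
ce = focal_loss([0.0, 0.0], target=0, class_weights=weights, gamma=0.0)
fl = focal_loss([0.0, 0.0], target=0, class_weights=weights, gamma=2.0)
print(ce, fl)
```

The `(1 - p_t)^gamma` factor shrinks the contribution of examples the model already classifies confidently, so training focuses on hard examples, which is a common choice for imbalanced label sets like this one.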
120
+ ### Training Metrics
121
+
122
+ The best checkpoint was selected at epoch 15, using the number of missing labels as the primary criterion and F1 as the tiebreaker (0 missing labels, F1 0.486).
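The selection rule above (fewest missing labels first, then highest F1) can be expressed as a two-part sort key. The sketch below is illustrative: only the epoch 15 entry (0 missing, F1 0.486) comes from the README, and the other checkpoint values and the name `select_best_checkpoint` are made up for the example.

```python
def select_best_checkpoint(checkpoints):
    """Pick the checkpoint with the fewest missing labels; break ties by F1."""
    # Negating F1 makes min() prefer the higher F1 among equal missing counts.
    return min(checkpoints, key=lambda c: (c["missing_labels"], -c["f1"]))


checkpoints = [
    {"epoch": 9, "missing_labels": 2, "f1": 0.490},   # hypothetical
    {"epoch": 12, "missing_labels": 0, "f1": 0.480},  # hypothetical
    {"epoch": 15, "missing_labels": 0, "f1": 0.486},  # from the README
]
best = select_best_checkpoint(checkpoints)
print(best["epoch"])
```

Under this rule a checkpoint with a slightly higher F1 but more never-predicted labels loses to one that covers all labels.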
123
 
124
  ### Test Set Classification Report
125
 
 
127
  ```
128
  precision recall f1-score support
129
 
130
+ abstract_noun 0.408 0.330 0.365 88
131
+ additive_formal 0.286 0.167 0.211 12
132
+ agent_demoted 0.667 1.000 0.800 10
133
+ agentless_passive 0.583 0.491 0.533 57
134
+ alliteration 0.500 0.200 0.286 10
135
+ anaphora 0.500 0.537 0.518 41
136
  antithesis 0.947 0.818 0.878 22
137
+ aside 0.615 0.216 0.320 37
138
+ assonance 1.000 0.960 0.980 25
139
+ asyndeton 0.636 0.500 0.560 14
140
+ audience_response 1.000 0.800 0.889 10
141
+ categorical_statement 0.103 0.200 0.136 20
142
+ causal_chain 0.442 0.452 0.447 42
143
+ causal_explicit 0.400 0.468 0.431 47
144
+ citation 0.743 0.565 0.642 46
145
+ conceptual_metaphor 0.065 0.051 0.057 39
146
+ concessive 0.595 0.556 0.575 45
147
+ concessive_connector 0.882 0.833 0.857 18
148
+ conditional 0.596 0.609 0.602 87
149
+ conflict_frame 0.733 0.733 0.733 15
150
+ contrastive 0.533 0.525 0.529 61
151
  cross_reference 0.733 0.458 0.564 24
152
+ definitional_move 0.286 0.200 0.235 10
153
+ discourse_formula 0.405 0.508 0.451 118
154
+ dramatic_pause 0.833 0.500 0.625 10
155
+ embodied_action 0.375 0.214 0.273 42
156
+ enumeration 0.510 0.605 0.553 43
157
+ epistemic_hedge 0.102 0.357 0.159 14
158
+ epistrophe 0.824 0.875 0.848 16
159
+ epithet 0.333 0.250 0.286 12
160
+ everyday_example 0.312 0.179 0.227 28
161
+ evidential 0.667 0.432 0.525 37
162
+ footnote_reference 0.417 0.500 0.455 10
163
+ imperative 0.645 0.600 0.622 100
164
+ inclusive_we 0.630 0.576 0.602 59
165
+ institutional_subject 0.938 0.714 0.811 21
166
+ intensifier_doubling 0.944 0.773 0.850 22
167
+ lexical_repetition 0.417 0.556 0.476 45
168
+ list_structure 0.267 0.174 0.211 23
169
+ metadiscourse 0.085 0.182 0.116 22
170
+ methodological_framing 0.500 0.190 0.276 21
171
+ named_individual 0.500 0.300 0.375 30
172
+ nested_clauses 0.500 0.348 0.410 46
173
+ nominalization 0.288 0.304 0.296 56
174
+ objectifying_stance 0.267 0.400 0.320 10
175
+ parallelism 0.350 0.259 0.298 27
176
+ phatic_check 0.500 0.364 0.421 11
177
+ phatic_filler 0.333 0.800 0.471 10
178
+ polysyndeton 1.000 0.792 0.884 24
179
+ probability 0.500 0.455 0.476 22
180
+ proverb 0.000 0.000 0.000 10
181
+ qualified_assertion 0.250 0.241 0.246 29
182
+ refrain 0.944 0.708 0.810 24
183
+ relative_chain 0.350 0.509 0.415 55
184
+ religious_formula 0.857 0.750 0.800 16
185
+ rhetorical_question 0.688 0.762 0.723 84
186
+ rhyme 0.231 0.300 0.261 10
187
+ rhythm 0.909 0.625 0.741 16
188
+ second_person 0.571 0.586 0.579 116
189
+ self_correction 0.821 0.575 0.676 40
190
+ sensory_detail 0.364 0.200 0.258 20
191
+ simple_conjunction 0.167 0.300 0.214 10
192
+ specific_place 0.400 0.222 0.286 18
193
+ technical_abbreviation 0.900 0.321 0.474 28
194
+ technical_term 0.426 0.703 0.531 74
195
+ temporal_anchor 0.396 0.618 0.483 34
196
+ temporal_embedding 0.500 0.562 0.529 48
197
+ third_person_reference 0.700 0.700 0.700 10
198
+ tricolon 0.611 0.611 0.611 18
199
+ us_them 0.733 0.611 0.667 18
200
+ vocative 0.462 0.600 0.522 20
201
+
202
+ accuracy 0.500 2357
203
+ macro avg 0.535 0.484 0.493 2357
204
+ weighted avg 0.532 0.500 0.503 2357
205
  ```
206
 
207
  </details>
208
 
209
+ **Top performing subtypes (F1 ≥ 0.75):** `assonance` (0.980), `audience_response` (0.889), `polysyndeton` (0.884), `antithesis` (0.878), `concessive_connector` (0.857), `intensifier_doubling` (0.850), `epistrophe` (0.848), `institutional_subject` (0.811), `refrain` (0.810), `agent_demoted` (0.800), `religious_formula` (0.800).
210
 
211
+ **Weakest subtypes (F1 < 0.20):** `proverb` (0.000), `conceptual_metaphor` (0.057), `metadiscourse` (0.116), `categorical_statement` (0.136), `epistemic_hedge` (0.159). These tend to be semantically diffuse classes that overlap heavily with neighbouring subtypes or have very low test support.
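The thresholds used above can be checked against the classification report. The sketch below recomputes F1 from precision and recall for a handful of rows copied from the report, then buckets them by the same cut-offs. The helper name `f1_score` and the choice of rows are for illustration only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the test set classification report.
rows = {
    "assonance": (1.000, 0.960),
    "self_correction": (0.821, 0.575),
    "conflict_frame": (0.733, 0.733),
    "proverb": (0.000, 0.000),
    "epistemic_hedge": (0.102, 0.357),
}
scores = {name: f1_score(p, r) for name, (p, r) in rows.items()}

top = [name for name, f1 in scores.items() if f1 >= 0.75]
weak = [name for name, f1 in scores.items() if f1 < 0.20]
print(top, weak)
```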
212
 
213
  ## Class Distribution
214
 
215
+ The training set exhibits significant imbalance across 71 classes:
216
 
217
+ | Support Range | Example Classes | Count |
218
+ |---------------|-----------------|-------|
219
+ | >1000 | `discourse_formula`, `second_person` | 2 |
220
+ | 500–1000 | `conditional`, `rhetorical_question`, `technical_term`, `imperative` | 8 |
221
+ | 200–500 | `abstract_noun`, `contrastive`, `inclusive_we`, `nominalization` | 27 |
222
+ | 100–200 | `alliteration`, `antithesis`, `asyndeton`, `epistrophe`, `refrain` | 30 |
223
+ | <100 | `footnote_reference`, `phatic_check`, `technical_abbreviation` | 4 |
224
 
225
  ## Limitations
226
 
227
+ - **71-way classification on ~22k spans**: The data budget per class is thin, particularly for classes near the minimum. More data or class consolidation would help.
228
  - **Semantic overlap**: Some subtypes are difficult to distinguish from surface text alone (e.g., `parallelism` vs `anaphora` vs `tricolon`; `epistemic_hedge` vs `qualified_assertion` vs `probability`). The model may benefit from hierarchical classification that conditions on type-level predictions.
229
+ - **Recall-precision tradeoff on rare classes**: Many rare classes show high precision but lower recall (e.g., `self_correction`: P=0.821, R=0.575; `technical_abbreviation`: P=0.900, R=0.321), suggesting the model learns narrow prototypes but misses variation.
230
  - **Span-level only**: Requires pre-extracted spans. Does not detect boundaries.
231
  - **128-token context window**: Longer spans are truncated.
232
 
 
240
  |-------|------|---------|-----|
241
  | [`HavelockAI/bert-marker-category`](https://huggingface.co/HavelockAI/bert-marker-category) | Binary (oral/literate) | 2 | 0.875 |
242
  | [`HavelockAI/bert-marker-type`](https://huggingface.co/HavelockAI/bert-marker-type) | Functional type | 18 | 0.583 |
243
+ | **This model** | Fine-grained subtype | 71 | 0.493 |
244
  | [`HavelockAI/bert-orality-regressor`](https://huggingface.co/HavelockAI/bert-orality-regressor) | Document-level score | Regression | MAE 0.079 |
245
  | [`HavelockAI/bert-token-classifier`](https://huggingface.co/HavelockAI/bert-token-classifier) | Span detection (BIO) | 145 | 0.500 |
246
 
 
257
  ## References
258
 
259
  - Ong, Walter J. *Orality and Literacy: The Technologizing of the Word*. Routledge, 1982.
260
+ - Lee, C. et al. "Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models." ICLR 2020.
261
+ - Warner, B. et al. "Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference." 2024.
262
 
263
  ---
264
 
265
+ *Trained: February 2026*
config.json CHANGED
@@ -1,16 +1,23 @@
1
  {
2
- "add_cross_attention": false,
3
  "architectures": [
4
- "BertForSequenceClassification"
5
  ],
6
- "attention_probs_dropout_prob": 0.1,
7
- "bos_token_id": null,
8
- "classifier_dropout": null,
 
 
 
 
 
 
 
9
  "dtype": "float32",
10
- "eos_token_id": null,
 
 
11
  "gradient_checkpointing": false,
12
- "hidden_act": "gelu",
13
- "hidden_dropout_prob": 0.1,
14
  "hidden_size": 768,
15
  "id2label": {
16
  "0": "LABEL_0",
@@ -85,9 +92,9 @@
85
  "69": "LABEL_69",
86
  "70": "LABEL_70"
87
  },
 
88
  "initializer_range": 0.02,
89
- "intermediate_size": 3072,
90
- "is_decoder": false,
91
  "label2id": {
92
  "LABEL_0": 0,
93
  "LABEL_1": 1,
@@ -161,16 +168,57 @@
161
  "LABEL_8": 8,
162
  "LABEL_9": 9
163
  },
164
- "layer_norm_eps": 1e-12,
165
- "max_position_embeddings": 512,
166
- "model_type": "bert",
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
167
  "num_attention_heads": 12,
168
- "num_hidden_layers": 12,
169
- "pad_token_id": 0,
170
  "position_embedding_type": "absolute",
 
 
 
 
 
 
 
 
 
 
 
 
 
 
171
  "tie_word_embeddings": true,
172
  "transformers_version": "5.0.0",
173
- "type_vocab_size": 2,
174
- "use_cache": true,
175
- "vocab_size": 30522
176
  }
 
1
  {
 
2
  "architectures": [
3
+ "ModernBertForSequenceClassification"
4
  ],
5
+ "attention_bias": false,
6
+ "attention_dropout": 0.0,
7
+ "bos_token_id": 50281,
8
+ "classifier_activation": "gelu",
9
+ "classifier_bias": false,
10
+ "classifier_dropout": 0.0,
11
+ "classifier_pooling": "mean",
12
+ "cls_token_id": 50281,
13
+ "decoder_bias": true,
14
+ "deterministic_flash_attn": false,
15
  "dtype": "float32",
16
+ "embedding_dropout": 0.0,
17
+ "eos_token_id": 50282,
18
+ "global_attn_every_n_layers": 3,
19
  "gradient_checkpointing": false,
20
+ "hidden_activation": "gelu",
 
21
  "hidden_size": 768,
22
  "id2label": {
23
  "0": "LABEL_0",
 
92
  "69": "LABEL_69",
93
  "70": "LABEL_70"
94
  },
95
+ "initializer_cutoff_factor": 2.0,
96
  "initializer_range": 0.02,
97
+ "intermediate_size": 1152,
 
98
  "label2id": {
99
  "LABEL_0": 0,
100
  "LABEL_1": 1,
 
168
  "LABEL_8": 8,
169
  "LABEL_9": 9
170
  },
171
+ "layer_norm_eps": 1e-05,
172
+ "layer_types": [
173
+ "full_attention",
174
+ "sliding_attention",
175
+ "sliding_attention",
176
+ "full_attention",
177
+ "sliding_attention",
178
+ "sliding_attention",
179
+ "full_attention",
180
+ "sliding_attention",
181
+ "sliding_attention",
182
+ "full_attention",
183
+ "sliding_attention",
184
+ "sliding_attention",
185
+ "full_attention",
186
+ "sliding_attention",
187
+ "sliding_attention",
188
+ "full_attention",
189
+ "sliding_attention",
190
+ "sliding_attention",
191
+ "full_attention",
192
+ "sliding_attention",
193
+ "sliding_attention",
194
+ "full_attention"
195
+ ],
196
+ "local_attention": 128,
197
+ "max_position_embeddings": 8192,
198
+ "mlp_bias": false,
199
+ "mlp_dropout": 0.0,
200
+ "model_type": "modernbert",
201
+ "norm_bias": false,
202
+ "norm_eps": 1e-05,
203
  "num_attention_heads": 12,
204
+ "num_hidden_layers": 22,
205
+ "pad_token_id": 50283,
206
  "position_embedding_type": "absolute",
207
+ "repad_logits_with_grad": false,
208
+ "rope_parameters": {
209
+ "full_attention": {
210
+ "rope_theta": 160000.0,
211
+ "rope_type": "default"
212
+ },
213
+ "sliding_attention": {
214
+ "rope_theta": 10000.0,
215
+ "rope_type": "default"
216
+ }
217
+ },
218
+ "sep_token_id": 50282,
219
+ "sparse_pred_ignore_index": -100,
220
+ "sparse_prediction": false,
221
  "tie_word_embeddings": true,
222
  "transformers_version": "5.0.0",
223
+ "vocab_size": 50368
 
 
224
  }
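The `layer_types` list in the new config.json follows a regular pattern. With `num_hidden_layers` at 22 and `global_attn_every_n_layers` at 3, every third layer starting from layer 0 uses full attention and the rest use sliding attention. The sketch below reproduces the list from those two values. The helper name `build_layer_types` is hypothetical, and this is not necessarily how the transformers library derives the list internally.

```python
def build_layer_types(num_hidden_layers, global_attn_every_n_layers):
    """Return one attention type per layer, alternating 1 full : (n-1) sliding."""
    return [
        "full_attention" if i % global_attn_every_n_layers == 0 else "sliding_attention"
        for i in range(num_hidden_layers)
    ]


# Values taken from the new config.json in this commit.
layer_types = build_layer_types(num_hidden_layers=22, global_attn_every_n_layers=3)
print(layer_types.count("full_attention"), layer_types.count("sliding_attention"))
```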
model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:1ff78a23e1f73a3c2b1b41f7b253d652d236d03395d41483f87deba0000c9124
3
- size 780277732
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3bd7dd251812991ee67b82e0da1ecbef1e9597bab1478f3f23ee327741272840
3
+ size 1039849764
tokenizer.json CHANGED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json CHANGED
@@ -1,14 +1,16 @@
1
  {
2
  "backend": "tokenizers",
 
3
  "cls_token": "[CLS]",
4
- "do_lower_case": true,
5
  "is_local": false,
6
  "mask_token": "[MASK]",
7
- "model_max_length": 512,
 
 
 
 
8
  "pad_token": "[PAD]",
9
  "sep_token": "[SEP]",
10
- "strip_accents": null,
11
- "tokenize_chinese_chars": true,
12
- "tokenizer_class": "BertTokenizer",
13
  "unk_token": "[UNK]"
14
  }
 
1
  {
2
  "backend": "tokenizers",
3
+ "clean_up_tokenization_spaces": true,
4
  "cls_token": "[CLS]",
 
5
  "is_local": false,
6
  "mask_token": "[MASK]",
7
+ "model_input_names": [
8
+ "input_ids",
9
+ "attention_mask"
10
+ ],
11
+ "model_max_length": 8192,
12
  "pad_token": "[PAD]",
13
  "sep_token": "[SEP]",
14
+ "tokenizer_class": "TokenizersBackend",
 
 
15
  "unk_token": "[UNK]"
16
  }