init
Browse files
- README.md +250 -0
- config.json +39 -0
- model.safetensors +3 -0
- special_tokens_map.json +7 -0
- tokenizer.json +0 -0
- tokenizer_config.json +55 -0
- vocab.txt +0 -0
README.md
ADDED
@@ -0,0 +1,250 @@
---
base_model: distilbert/distilbert-base-multilingual-cased
language:
- en
- zh
- es
- hi
- ar
- bn
- pt
- ru
- ja
- de
- ms
- te
- vi
- ko
- fr
- tr
- it
- pl
- uk
- tl
- nl
- gsw
- sw
library_name: transformers
license: cc-by-nc-4.0
pipeline_tag: text-classification
tags:
- text-classification
- sentiment-analysis
- sentiment
- synthetic data
- multi-class
- social-media-analysis
- customer-feedback
- product-reviews
- brand-monitoring
- multilingual
- 🇪🇺
- region:eu
- synthetic
datasets:
- tabularisai/swahili_sentiment_dataset
---
# Multilingual Sentiment Classification Model (23 Languages)

<!-- TRY IT HERE: `coming soon` -->

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/Discord%20button.png" width="200"/>](https://discord.gg/sznxwdqBXj)

# NEWS!
- 2025/8: Major model update adds one new language, **Swahili**, along with general improvements across all languages.

- 2025/8: Free DEMO API for our model! See below.

- 2025/7: We've just released ModernFinBERT, a model we've been working on for a while. It's built on the ModernBERT architecture and trained on a mix of real and synthetic data, with LLM-based label correction applied to public datasets to fix human annotation errors. It performs well across a range of benchmarks, in some cases improving accuracy by up to 48% over existing models such as FinBERT. You can check it out on Hugging Face: https://huggingface.co/tabularisai/ModernFinBERT

- 2024/12: We are excited to introduce a multilingual sentiment model! Now you can analyze sentiment across multiple languages, enhancing your global reach.


## Hosted DEMO API

We provide a hosted inference API.

**Example request:**

```bash
curl -X POST https://api.tabularis.ai/ \
  -H "Content-Type: application/json" \
  -d '{"text":"I love the design","return_all_scores":false}'
```

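The same request can be sent from Python using only the standard library. The endpoint, field names, and body shape below simply mirror the curl example above; the response format is not documented here, so treat it as something to inspect rather than assume:

```python
import json
import urllib.request

API_URL = "https://api.tabularis.ai/"  # endpoint taken from the curl example above

def build_request(text: str, return_all_scores: bool = False) -> urllib.request.Request:
    """Build the JSON POST request shown in the curl example."""
    body = json.dumps({"text": text, "return_all_scores": return_all_scores}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_request("I love the design")
print(req.get_method(), req.get_full_url())  # POST https://api.tabularis.ai/
# response = urllib.request.urlopen(req)  # uncomment to actually call the demo API
```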
## Model Details
- `Model Name:` tabularisai/multilingual-sentiment-analysis
- `Base Model:` distilbert/distilbert-base-multilingual-cased
- `Task:` Text Classification (Sentiment Analysis)
- `Languages:` Supports English plus Chinese (中文), Spanish (Español), Hindi (हिन्दी), Arabic (العربية), Bengali (বাংলা), Portuguese (Português), Russian (Русский), Japanese (日本語), German (Deutsch), Malay (Bahasa Melayu), Telugu (తెలుగు), Vietnamese (Tiếng Việt), Korean (한국어), French (Français), Turkish (Türkçe), Italian (Italiano), Polish (Polski), Ukrainian (Українська), Tagalog, Dutch (Nederlands), Swiss German (Schweizerdeutsch), and Swahili.
- `Number of Classes:` 5 (*Very Negative, Negative, Neutral, Positive, Very Positive*)
- `Usage:`
  - Social media analysis
  - Customer feedback analysis
  - Product reviews classification
  - Brand monitoring
  - Market research
  - Customer service optimization
  - Competitive intelligence

> If you wish to use this model for commercial purposes, please obtain a license by contacting: info@tabularis.ai

## Model Description

This model is a fine-tuned version of `distilbert/distilbert-base-multilingual-cased` for multilingual sentiment analysis. It leverages synthetic data from multiple sources to achieve robust performance across different languages and cultural contexts.

### Training Data

Trained exclusively on synthetic multilingual data generated by advanced LLMs, ensuring wide coverage of sentiment expressions across languages.

### Training Procedure

- Fine-tuned for 3.5 epochs.
- Achieved an off-by-one accuracy (`train_acc_off_by_one`) of approximately 0.93 on the validation dataset.

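"Off-by-one" accuracy usually counts a prediction as correct if it lands within one class of the true label on the ordinal 0-4 scale. The exact definition behind `train_acc_off_by_one` is not stated above, so the following is an illustrative sketch of that common reading, with made-up labels:

```python
def acc_off_by_one(y_true, y_pred):
    # A prediction counts as a hit if it is within +/-1 of the true ordinal class.
    hits = sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

y_true = [0, 1, 2, 3, 4, 4]
y_pred = [0, 2, 2, 1, 4, 3]   # made-up predictions; the fourth pair is off by 2
print(acc_off_by_one(y_true, y_pred))  # 5/6, about 0.83
```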
## Intended Use

Ideal for:
- Multilingual social media monitoring
- International customer feedback analysis
- Global product review sentiment classification
- Worldwide brand sentiment tracking

## How to Use

Using a pipeline, it takes only four lines:

```python
from transformers import pipeline

# Load the classification pipeline with the specified model
pipe = pipeline("text-classification", model="tabularisai/multilingual-sentiment-analysis")

# Classify a new sentence
sentence = "I love this product! It's amazing and works perfectly."
result = pipe(sentence)

# Print the result
print(result)
```

Below is a Python example showing how to use the model without pipelines:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "tabularisai/multilingual-sentiment-analysis"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def predict_sentiment(texts):
    inputs = tokenizer(texts, return_tensors="pt", truncation=True, padding=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    sentiment_map = {0: "Very Negative", 1: "Negative", 2: "Neutral", 3: "Positive", 4: "Very Positive"}
    return [sentiment_map[p] for p in torch.argmax(probabilities, dim=-1).tolist()]

texts = [
    # English
    "I absolutely love the new design of this app!", "The customer service was disappointing.", "The weather is fine, nothing special.",
    # Chinese
    "这家餐厅的菜味道非常棒！", "我对他的回答很失望。", "天气今天一般。",
    # Spanish
    "¡Me encanta cómo quedó la decoración!", "El servicio fue terrible y muy lento.", "El libro estuvo más o menos.",
    # Arabic
    "الخدمة في هذا الفندق رائعة جدًا!", "لم يعجبني الطعام في هذا المطعم.", "كانت الرحلة عادية.",
    # Ukrainian
    "Мені дуже сподобалася ця вистава!", "Обслуговування було жахливим.", "Книга була посередньою.",
    # Hindi
    "यह जगह सच में अद्भुत है!", "यह अनुभव बहुत खराब था।", "फिल्म ठीक-ठाक थी।",
    # Bengali
    "এখানকার পরিবেশ অসাধারণ!", "সেবার মান একেবারেই খারাপ।", "খাবারটা মোটামুটি ছিল।",
    # Portuguese
    "Este livro é fantástico! Eu aprendi muitas coisas novas e inspiradoras.",
    "Não gostei do produto, veio quebrado.", "O filme foi ok, nada de especial.",
    # Japanese
    "このレストランの料理は本当に美味しいです！", "このホテルのサービスはがっかりしました。", "天気はまあまあです。",
    # Russian
    "Я в восторге от этого нового гаджета!", "Этот сервис оставил у меня только разочарование.", "Встреча была обычной, ничего особенного.",
    # French
    "J'adore ce restaurant, c'est excellent !", "L'attente était trop longue et frustrante.", "Le film était moyen, sans plus.",
    # Turkish
    "Bu otelin manzarasına bayıldım!", "Ürün tam bir hayal kırıklığıydı.", "Konser fena değildi, ortalamaydı.",
    # Italian
    "Adoro questo posto, è fantastico!", "Il servizio clienti è stato pessimo.", "La cena era nella media.",
    # Polish
    "Uwielbiam tę restaurację, jedzenie jest świetne!", "Obsługa klienta była rozczarowująca.", "Pogoda jest w porządku, nic szczególnego.",
    # Tagalog
    "Ang ganda ng lugar na ito, sobrang aliwalas!", "Hindi maganda ang serbisyo nila dito.", "Maayos lang ang palabas, walang espesyal.",
    # Dutch
    "Ik ben echt blij met mijn nieuwe aankoop!", "De klantenservice was echt slecht.", "De presentatie was gewoon oké, niet bijzonder.",
    # Malay
    "Saya suka makanan di sini, sangat sedap!", "Pengalaman ini sangat mengecewakan.", "Hari ini cuacanya biasa sahaja.",
    # Korean
    "이 가게의 케이크는 정말 맛있어요!", "서비스가 너무 별로였어요.", "날씨가 그저 그렇네요.",
    # Swiss German
    "Ich find dä Service i de Beiz mega guet!", "Däs Esä het mir nöd gfalle.", "D Wätter hüt isch so naja."
]

for text, sentiment in zip(texts, predict_sentiment(texts)):
    print(f"Text: {text}\nSentiment: {sentiment}\n")
```

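The softmax-then-argmax step in `predict_sentiment` can be illustrated without any dependencies. Note that for picking the label, the argmax of the softmax output equals the argmax of the raw logits; the logits below are invented for illustration, not real model outputs:

```python
import math

sentiment_map = {0: "Very Negative", 1: "Negative", 2: "Neutral", 3: "Positive", 4: "Very Positive"}

def softmax(logits):
    # Numerically stable softmax: shift by the max before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-2.1, -1.0, 0.3, 1.8, 3.2]  # made-up logits; index 4 dominates
probs = softmax(logits)
label = sentiment_map[probs.index(max(probs))]
print(label)  # Very Positive
```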
## Ethical Considerations

Synthetic training data can reduce annotation bias, but we still advise validating the model on real-world data before deployment.

## Citation
```bibtex
@misc{tabularisai_2025,
  author    = { tabularisai and Samuel Gyamfi and Vadim Borisov and Richard H. Schreiber },
  title     = { multilingual-sentiment-analysis (Revision 69afb83) },
  year      = 2025,
  url       = { https://huggingface.co/tabularisai/multilingual-sentiment-analysis },
  doi       = { 10.57967/hf/5968 },
  publisher = { Hugging Face }
}
```

## Contact

For inquiries, data, private APIs, or custom models, contact info@tabularis.ai

tabularis.ai

<table align="center">
  <tr>
    <td align="center">
      <a href="https://www.linkedin.com/company/tabularis-ai/">
        <img src="https://cdn.jsdelivr.net/gh/simple-icons/simple-icons/icons/linkedin.svg" alt="LinkedIn" width="30" height="30">
      </a>
    </td>
    <td align="center">
      <a href="https://x.com/tabularis_ai">
        <img src="https://cdn.jsdelivr.net/gh/simple-icons/simple-icons/icons/x.svg" alt="X" width="30" height="30">
      </a>
    </td>
    <td align="center">
      <a href="https://github.com/tabularis-ai">
        <img src="https://cdn.jsdelivr.net/gh/simple-icons/simple-icons/icons/github.svg" alt="GitHub" width="30" height="30">
      </a>
    </td>
    <td align="center">
      <a href="https://tabularis.ai">
        <img src="https://cdn.jsdelivr.net/gh/simple-icons/simple-icons/icons/internetarchive.svg" alt="Website" width="30" height="30">
      </a>
    </td>
  </tr>
</table>
config.json
ADDED
@@ -0,0 +1,39 @@
```json
{
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "Very Negative",
    "1": "Negative",
    "2": "Neutral",
    "3": "Positive",
    "4": "Very Positive"
  },
  "initializer_range": 0.02,
  "label2id": {
    "Negative": 1,
    "Neutral": 2,
    "Positive": 3,
    "Very Negative": 0,
    "Very Positive": 4
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.55.0",
  "vocab_size": 119547
}
```
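The `id2label` table in this config is what `pipeline()` uses to turn a predicted class index into a display label, and it should be the exact inverse of `label2id`. A quick standalone consistency check over a trimmed copy of the maps above (only the label fields are reproduced here):

```python
import json

# A trimmed copy of the config above; only the two label maps are reproduced.
config = json.loads("""
{
  "id2label": {"0": "Very Negative", "1": "Negative", "2": "Neutral",
               "3": "Positive", "4": "Very Positive"},
  "label2id": {"Very Negative": 0, "Negative": 1, "Neutral": 2,
               "Positive": 3, "Very Positive": 4}
}
""")

# JSON object keys are strings, so convert id2label keys back to ints first.
id2label = {int(k): v for k, v in config["id2label"].items()}
assert all(config["label2id"][label] == idx for idx, label in id2label.items())
print(id2label[4])  # Very Positive
```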
model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3ab3cecb8605da0a240e5b4e18d969704d44e27c6ea48533ef6693d31dbb926a
size 541326604
special_tokens_map.json
ADDED
@@ -0,0 +1,7 @@
```json
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
```
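These special tokens frame every encoded sequence: BERT-style tokenizers wrap a single input as `[CLS] ... [SEP]` and fill the rest of the batch width with `[PAD]`. A sketch of that layout (illustrative only, not actual tokenizer output; the `frame` helper is hypothetical):

```python
special = {"cls_token": "[CLS]", "sep_token": "[SEP]", "pad_token": "[PAD]"}

def frame(tokens, max_len=8):
    # BERT-style single-sequence layout: [CLS] tokens [SEP], padded to max_len.
    seq = [special["cls_token"], *tokens, special["sep_token"]]
    seq += [special["pad_token"]] * (max_len - len(seq))
    return seq

print(frame(["i", "love", "it"]))
# ['[CLS]', 'i', 'love', 'it', '[SEP]', '[PAD]', '[PAD]', '[PAD]']
```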
tokenizer.json
ADDED
The diff for this file is too large to render. See raw diff.
tokenizer_config.json
ADDED
@@ -0,0 +1,55 @@
```json
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "101": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "102": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "103": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": false,
  "cls_token": "[CLS]",
  "do_lower_case": false,
  "mask_token": "[MASK]",
  "model_max_length": 512,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "DistilBertTokenizer",
  "unk_token": "[UNK]"
}
```
vocab.txt
ADDED
The diff for this file is too large to render. See raw diff.