oralunal committed 80110ac · Parent: b8ea2c1

README.md ADDED
---
base_model: distilbert/distilbert-base-multilingual-cased
language:
- en
- zh
- es
- hi
- ar
- bn
- pt
- ru
- ja
- de
- ms
- te
- vi
- ko
- fr
- tr
- it
- pl
- uk
- tl
- nl
- gsw
- sw
library_name: transformers
license: cc-by-nc-4.0
pipeline_tag: text-classification
tags:
- text-classification
- sentiment-analysis
- sentiment
- synthetic data
- multi-class
- social-media-analysis
- customer-feedback
- product-reviews
- brand-monitoring
- multilingual
- 🇪🇺
- region:eu
- synthetic
datasets:
- tabularisai/swahili_sentiment_dataset
---

# 🚀 Multilingual Sentiment Classification Model (23 Languages)

<!-- TRY IT HERE: `coming soon`
-->
[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/Discord%20button.png" width="200"/>](https://discord.gg/sznxwdqBXj)

# NEWS!
- 2025/8: Major model update, +1 new language: **Swahili**! Also, general improvements across all languages.

- 2025/8: Free DEMO API for our model! Please see below!

- 2025/7: We've just released ModernFinBERT, a model we've been working on for a while. It's built on the ModernBERT architecture and trained on a mix of real and synthetic data, with LLM-based label correction applied to public datasets to fix human annotation errors.
It performs well across a range of benchmarks, in some cases improving accuracy by up to 48% over existing models like FinBERT.
You can check it out here on Hugging Face:
👉 https://huggingface.co/tabularisai/ModernFinBERT

- 2024/12: We are excited to introduce a multilingual sentiment model! Now you can analyze sentiment across multiple languages, enhancing your global reach.


## 🔌 Hosted DEMO API

We provide a hosted inference API.

**Example request:**

```bash
curl -X POST https://api.tabularis.ai/ \
  -H "Content-Type: application/json" \
  -d '{"text":"I love the design","return_all_scores":false}'
```
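The same request can be reproduced from Python with only the standard library. The endpoint and request body come from the curl example above; the response schema is not documented here, so treat the parsed JSON as an assumption to verify against the live API:

```python
import json
import urllib.request

API_URL = "https://api.tabularis.ai/"  # hosted DEMO endpoint shown above

def build_payload(text, return_all_scores=False):
    """Build the JSON body used in the curl example."""
    return {"text": text, "return_all_scores": return_all_scores}

def classify(text, return_all_scores=False, timeout=10):
    """POST the payload and parse the JSON response (schema assumed, not documented here)."""
    data = json.dumps(build_payload(text, return_all_scores)).encode("utf-8")
    req = urllib.request.Request(
        API_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Example (performs a network call):
# result = classify("I love the design")
```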

## Model Details
- `Model Name:` tabularisai/multilingual-sentiment-analysis
- `Base Model:` distilbert/distilbert-base-multilingual-cased
- `Task:` Text Classification (Sentiment Analysis)
- `Languages:` Supports English plus Chinese (中文), Spanish (Español), Hindi (हिन्दी), Arabic (العربية), Bengali (বাংলা), Portuguese (Português), Russian (Русский), Japanese (日本語), German (Deutsch), Malay (Bahasa Melayu), Telugu (తెలుగు), Vietnamese (Tiếng Việt), Korean (한국어), French (Français), Turkish (Türkçe), Italian (Italiano), Polish (Polski), Ukrainian (Українська), Tagalog, Dutch (Nederlands), Swiss German (Schweizerdeutsch), and Swahili.
- `Number of Classes:` 5 (*Very Negative, Negative, Neutral, Positive, Very Positive*)
- `Usage:`
  - Social media analysis
  - Customer feedback analysis
  - Product reviews classification
  - Brand monitoring
  - Market research
  - Customer service optimization
  - Competitive intelligence

> If you wish to use this model for commercial purposes, please obtain a license by contacting: info@tabularis.ai


## Model Description

This model is a fine-tuned version of `distilbert/distilbert-base-multilingual-cased` for multilingual sentiment analysis. It leverages synthetic data from multiple sources to achieve robust performance across different languages and cultural contexts.

### Training Data

Trained exclusively on synthetic multilingual data generated by advanced LLMs, ensuring wide coverage of sentiment expressions across languages.

### Training Procedure

- Fine-tuned for 3.5 epochs.
- Achieved a `train_acc_off_by_one` of approximately 0.93 on the validation dataset.
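Off-by-one accuracy treats a prediction as correct when it lands within one step of the true class on the ordinal 0–4 scale (so predicting "Positive" for a "Very Positive" review still counts). The exact training implementation is not shown here; a minimal sketch under that assumed definition:

```python
def acc_off_by_one(preds, labels):
    """Fraction of predictions within one ordinal step of the label.

    Classes are ordinal: 0=Very Negative ... 4=Very Positive.
    """
    assert len(preds) == len(labels) and preds, "need equal-length, non-empty sequences"
    hits = sum(abs(p - y) <= 1 for p, y in zip(preds, labels))
    return hits / len(preds)

# Predicting 3 ("Positive") for label 4 ("Very Positive") is a hit;
# predicting 0 for label 2 is a miss.
print(acc_off_by_one([3, 0, 2], [4, 2, 2]))  # 2 of 3 hits
```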

## Intended Use

Ideal for:
- Multilingual social media monitoring
- International customer feedback analysis
- Global product review sentiment classification
- Worldwide brand sentiment tracking

## How to Use

Using the `pipeline` API, it takes only a few lines:

```python
from transformers import pipeline

# Load the classification pipeline with the specified model
pipe = pipeline("text-classification", model="tabularisai/multilingual-sentiment-analysis")

# Classify a new sentence
sentence = "I love this product! It's amazing and works perfectly."
result = pipe(sentence)

# Print the result
print(result)
```

Below is a Python example of how to use the multilingual sentiment model without pipelines:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "tabularisai/multilingual-sentiment-analysis"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def predict_sentiment(texts):
    inputs = tokenizer(texts, return_tensors="pt", truncation=True, padding=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    sentiment_map = {0: "Very Negative", 1: "Negative", 2: "Neutral", 3: "Positive", 4: "Very Positive"}
    return [sentiment_map[p] for p in torch.argmax(probabilities, dim=-1).tolist()]

texts = [
    # English
    "I absolutely love the new design of this app!", "The customer service was disappointing.", "The weather is fine, nothing special.",
    # Chinese
    "这家餐厅的菜味道非常棒!", "我对他的回答很失望。", "天气今天一般。",
    # Spanish
    "¡Me encanta cómo quedó la decoración!", "El servicio fue terrible y muy lento.", "El libro estuvo más o menos.",
    # Arabic
    "الخدمة في هذا الفندق رائعة جدًا!", "لم يعجبني الطعام في هذا المطعم.", "كانت الرحلة عادية.",
    # Ukrainian
    "Мені дуже сподобалась ця вистава!", "Обслуговування було жахливим.", "Книга була посередньою.",
    # Hindi
    "यह जगह सच में अद्भुत है!", "यह अनुभव बहुत खराब था।", "फिल्म ठीक-ठाक थी।",
    # Bengali
    "এখানকার পরিবেশ অসাধারণ!", "সেবার মান একেবারেই খারাপ।", "খাবারটা মোটামুটি ছিল।",
    # Portuguese
    "Este livro é fantástico! Eu aprendi muitas coisas novas e inspiradoras.",
    "Não gostei do produto, veio quebrado.", "O filme foi ok, nada de especial.",
    # Japanese
    "このレストランの料理は本当に美味しいです!", "このホテルのサービスはがっかりしました。", "天気はまあまあです。",
    # Russian
    "Я в восторге от этого нового гаджета!", "Этот сервис оставил у меня только разочарование.", "Встреча была обычной, ничего особенного.",
    # French
    "J'adore ce restaurant, c'est excellent !", "L'attente était trop longue et frustrante.", "Le film était moyen, sans plus.",
    # Turkish
    "Bu otelin manzarasına bayıldım!", "Ürün tam bir hayal kırıklığıydı.", "Konser fena değildi, ortalamaydı.",
    # Italian
    "Adoro questo posto, è fantastico!", "Il servizio clienti è stato pessimo.", "La cena era nella media.",
    # Polish
    "Uwielbiam tę restaurację, jedzenie jest świetne!", "Obsługa klienta była rozczarowująca.", "Pogoda jest w porządku, nic szczególnego.",
    # Tagalog
    "Ang ganda ng lugar na ito, sobrang aliwalas!", "Hindi maganda ang serbisyo nila dito.", "Maayos lang ang palabas, walang espesyal.",
    # Dutch
    "Ik ben echt blij met mijn nieuwe aankoop!", "De klantenservice was echt slecht.", "De presentatie was gewoon oké, niet bijzonder.",
    # Malay
    "Saya suka makanan di sini, sangat sedap!", "Pengalaman ini sangat mengecewakan.", "Hari ini cuacanya biasa sahaja.",
    # Korean
    "이 가게의 케이크는 정말 맛있어요!", "서비스가 너무 별로였어요.", "날씨가 그저 그렇네요.",
    # Swiss German
    "Ich find dä Service i de Beiz mega guet!", "Däs Esä het mir nöd gfalle.", "D Wätter hüt isch so naja."
]

for text, sentiment in zip(texts, predict_sentiment(texts)):
    print(f"Text: {text}\nSentiment: {sentiment}\n")
```
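The label-mapping step in `predict_sentiment` above is just softmax followed by argmax over the five classes. A self-contained illustration with made-up logits (not real model output):

```python
import math

sentiment_map = {0: "Very Negative", 1: "Negative", 2: "Neutral", 3: "Positive", 4: "Very Positive"}

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits only; a real forward pass produces one row per input text.
logits = [-2.1, -0.3, 0.4, 2.2, 1.0]
probs = softmax(logits)
label = sentiment_map[probs.index(max(probs))]
print(label)  # Positive
```

Softmax preserves the argmax, so for a hard label the normalization only matters when you also want the class probabilities (e.g. for thresholding low-confidence predictions).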

## Ethical Considerations

Synthetic training data can reduce annotation bias, but validation in real-world scenarios is still advised.

## Citation
```bibtex
@misc{tabularisai_2025,
    author    = { tabularisai and Samuel Gyamfi and Vadim Borisov and Richard H. Schreiber },
    title     = { multilingual-sentiment-analysis (Revision 69afb83) },
    year      = 2025,
    url       = { https://huggingface.co/tabularisai/multilingual-sentiment-analysis },
    doi       = { 10.57967/hf/5968 },
    publisher = { Hugging Face }
}
```

## Contact

For inquiries, data, private APIs, or better models, contact info@tabularis.ai

tabularis.ai

<table align="center">
  <tr>
    <td align="center">
      <a href="https://www.linkedin.com/company/tabularis-ai/">
        <img src="https://cdn.jsdelivr.net/gh/simple-icons/simple-icons/icons/linkedin.svg" alt="LinkedIn" width="30" height="30">
      </a>
    </td>
    <td align="center">
      <a href="https://x.com/tabularis_ai">
        <img src="https://cdn.jsdelivr.net/gh/simple-icons/simple-icons/icons/x.svg" alt="X" width="30" height="30">
      </a>
    </td>
    <td align="center">
      <a href="https://github.com/tabularis-ai">
        <img src="https://cdn.jsdelivr.net/gh/simple-icons/simple-icons/icons/github.svg" alt="GitHub" width="30" height="30">
      </a>
    </td>
    <td align="center">
      <a href="https://tabularis.ai">
        <img src="https://cdn.jsdelivr.net/gh/simple-icons/simple-icons/icons/internetarchive.svg" alt="Website" width="30" height="30">
      </a>
    </td>
  </tr>
</table>
config.json ADDED

```json
{
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "Very Negative",
    "1": "Negative",
    "2": "Neutral",
    "3": "Positive",
    "4": "Very Positive"
  },
  "initializer_range": 0.02,
  "label2id": {
    "Negative": 1,
    "Neutral": 2,
    "Positive": 3,
    "Very Negative": 0,
    "Very Positive": 4
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.55.0",
  "vocab_size": 119547
}
```
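The `id2label` and `label2id` maps in the config should be exact inverses so that downstream tools agree on class indices. A quick local consistency check (values copied from the config above; note the JSON stores `id2label` keys as strings):

```python
# Maps as they appear in config.json
id2label = {"0": "Very Negative", "1": "Negative", "2": "Neutral", "3": "Positive", "4": "Very Positive"}
label2id = {"Negative": 1, "Neutral": 2, "Positive": 3, "Very Negative": 0, "Very Positive": 4}

# Invert id2label (casting keys to int) and compare with label2id.
inverted = {name: int(idx) for idx, name in id2label.items()}
assert inverted == label2id
print("id2label and label2id are consistent")
```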
model.safetensors ADDED

```
version https://git-lfs.github.com/spec/v1
oid sha256:3ab3cecb8605da0a240e5b4e18d969704d44e27c6ea48533ef6693d31dbb926a
size 541326604
```
special_tokens_map.json ADDED

```json
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
```
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED

```json
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "101": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "102": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "103": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": false,
  "cls_token": "[CLS]",
  "do_lower_case": false,
  "mask_token": "[MASK]",
  "model_max_length": 512,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "DistilBertTokenizer",
  "unk_token": "[UNK]"
}
```
vocab.txt ADDED
The diff for this file is too large to render. See raw diff