Model Details
llama3.2-3B-Thai-Toxic-Det was developed by fine-tuning meta-llama/Llama-3.2-3B-Instruct on a binary toxicity classification task using the nakcnx/bad-topics and pythainlp/thai-wiki-dataset-v3 datasets.
This model specializes in classifying toxicity in Thai sentences, covering pornography, betting and gambling, incomprehensible language, and code.
In evaluation, we tested the model on a fine-grained labeled dataset, where it achieved an average accuracy of 0.86 across all labels.
Model Description
- Base Model: meta-llama/Llama-3.2-3B-Instruct
- Training Datasets: nakcnx/bad-topics, pythainlp/thai-wiki-dataset-v3, and some data from my own web scraping
- Output: label 0 or 1
- Developed by: OpenThaiGPT
Evaluation
Here are the test results on a custom fine-grained labeled dataset of 12,000 rows, with 3,000 rows per label: wiki, bet, porn, code, and incomprehensible.
| Topic | Accuracy |
|---|---|
| Wiki | 0.82 |
| Porn | 0.90 |
| Bet | 0.88 |
| Code | 0.90 |
| Incomprehensible | 0.94 |
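The per-topic accuracies above can be computed by grouping predictions by topic and taking the fraction of exact label matches. A minimal sketch in plain Python (the `rows` data below is a made-up toy example, not the real evaluation set):

```python
def accuracy(preds, golds):
    """Fraction of predictions that exactly match the gold labels."""
    assert len(preds) == len(golds)
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def per_topic_accuracy(rows):
    """rows: iterable of (topic, predicted_label, gold_label) triples.

    Returns a dict mapping each topic to its accuracy.
    """
    by_topic = {}
    for topic, pred, gold in rows:
        by_topic.setdefault(topic, []).append((pred, gold))
    return {
        topic: accuracy([p for p, _ in pairs], [g for _, g in pairs])
        for topic, pairs in by_topic.items()
    }

# Toy example with invented predictions:
rows = [("wiki", 0, 0), ("wiki", 1, 0), ("porn", 1, 1), ("porn", 1, 1)]
print(per_topic_accuracy(rows))  # {'wiki': 0.5, 'porn': 1.0}
```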
Uses
You can run the following code for inference:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    "Anawil/llama3.2-3B-Thai-Toxic-Det",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Anawil/llama3.2-3B-Thai-Toxic-Det")

text = "พนันกับคาสิโนออนไลน์"  # the Thai sentence to classify

prompt = f"""Follow these instruction step by step
1. Classify the sentence given wheter it is involve porn, gambling or bet, or sensitive, policy or normal
2. You must follow these rules.
2.1 If the sentence include porn output '1'
2.2 If the sentence include bet or gambling output '1'
2.3 If the sentence include coding output '1'
2.4 If the setence include non comprehensible word or language output '1'
3. Output should be only 0 or 1 without prefixes or suffixes
Example:
Sentence Input: พนันกับคาสิโนออนไลน์, becuase it is included bet or gambling Your output should be 1
Sentence Input: ฉันกินข้าว, becuase it is normal sentence your output should be 0
Here is the Input: {text}
Classify wheter the given sentence based on the instruction
"""

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
# Store the templated string under a new name so it does not overwrite `text`
chat_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer(
    [chat_text],
    return_tensors="pt",
    max_length=1024,  # set max sequence length, adjust the value as needed
    truncation=True   # truncate sequences that exceed the max length
).to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=1,  # the label is a single token: '0' or '1'
    pad_token_id=tokenizer.eos_token_id,
)
# Strip the prompt tokens, keeping only the newly generated label token
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print('output:', response)
# output: 1
```
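The prompt instructs the model to emit exactly `0` or `1`, but decoded generations can still carry stray whitespace or unexpected characters. A defensive parser (a hypothetical helper, not part of the model card) might map the raw response to a binary label like this, treating anything unparseable as the non-toxic class 0:

```python
def parse_label(raw_output: str) -> int:
    """Extract a binary toxicity label from the model's raw text output.

    Returns the first '0' or '1' character found; defaults to 0
    (non-toxic) if the output contains neither digit.
    """
    for ch in raw_output.strip():
        if ch in ("0", "1"):
            return int(ch)
    return 0

print(parse_label(" 1\n"))   # 1
print(parse_label("0"))      # 0
print(parse_label("???"))    # 0
```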