# LLaMA 3.2 3B Fine-Tuned for TLINK Classification
## Overview
This model is a fully fine-tuned version of meta-llama/Llama-3.2-3B-Instruct for temporal relation classification (TLINK task).
It predicts the temporal relationship between events in text.
Labels:
- BEFORE
- AFTER
- OTHER
- NONE
## Task

**Temporal relation classification.** Given a sentence, the model predicts the temporal relationship between the events it mentions.
## Dataset

- Name: fahmidiqbal/tlink-classification
- Format: JSONL (text + label)
- Labels: BEFORE, AFTER, OTHER, NONE
- Distribution: balanced (50 samples per class in the test set)
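Records in this text + label JSONL format can be read with the standard library alone. A minimal sketch; the two example records below are hypothetical and only illustrate the schema, they are not taken from the dataset:

```python
import json

# Hypothetical records illustrating the JSONL schema (text + label);
# the real data lives in fahmidiqbal/tlink-classification.
raw_lines = [
    '{"text": "He ate lunch after the meeting ended.", "label": "AFTER"}',
    '{"text": "The alarm rang before she woke up.", "label": "BEFORE"}',
]

# One JSON object per line, parsed independently.
records = [json.loads(line) for line in raw_lines]
label_inventory = sorted({r["label"] for r in records})
print(label_inventory)
```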
## Training Details
- Model: meta-llama/Llama-3.2-3B-Instruct
- Fine-tuning: Full fine-tuning (all parameters updated)
- Epochs: 3
- Learning Rate: 2e-5
- Batch Size: 1 (with gradient accumulation)
- Optimizer: AdamW
- Precision: bfloat16
- Tracking: Weights & Biases (wandb)
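The listed hyperparameters roughly correspond to the following `transformers.TrainingArguments` sketch. Note the assumptions: the card only says "gradient accumulation" without a step count, so `gradient_accumulation_steps=8` is a placeholder, and `output_dir` is a hypothetical path:

```python
from transformers import TrainingArguments

# Sketch of the reported configuration; not the authors' actual script.
args = TrainingArguments(
    output_dir="./llama32-3b-tlink",   # hypothetical path
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,     # assumption: exact value not reported
    optim="adamw_torch",               # AdamW
    bf16=True,                         # bfloat16 precision
    report_to="wandb",                 # Weights & Biases tracking
)
```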
## Evaluation

### Validation Performance
| Metric | Score |
|---|---|
| Accuracy | 0.8480 |
| Macro-F1 | 0.8285 |
| Precision | 0.8327 |
| Recall | 0.8246 |
### Test Performance
| Metric | Score |
|---|---|
| Accuracy | 0.7950 |
| Macro-F1 | 0.7973 |
| Precision | 0.8100 |
| Recall | 0.7950 |
### Per-Class Performance (Test Set)
| Class | Precision | Recall | F1-score |
|---|---|---|---|
| BEFORE | 0.7826 | 0.7200 | 0.7500 |
| AFTER | 0.6613 | 0.8200 | 0.7321 |
| OTHER | 0.8462 | 0.8800 | 0.8627 |
| NONE | 0.9500 | 0.7600 | 0.8444 |
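The macro-averaged test metrics above are unweighted means of the per-class rows, which makes them easy to sanity-check:

```python
# Per-class test metrics copied from the table above.
per_class = {
    "BEFORE": {"precision": 0.7826, "recall": 0.7200, "f1": 0.7500},
    "AFTER":  {"precision": 0.6613, "recall": 0.8200, "f1": 0.7321},
    "OTHER":  {"precision": 0.8462, "recall": 0.8800, "f1": 0.8627},
    "NONE":   {"precision": 0.9500, "recall": 0.7600, "f1": 0.8444},
}

def macro(metric: str) -> float:
    """Macro average: unweighted mean of the metric over classes."""
    return sum(c[metric] for c in per_class.values()) / len(per_class)

print(round(macro("precision"), 4))  # 0.81
print(round(macro("recall"), 4))     # 0.795
print(round(macro("f1"), 4))         # 0.7973
```

These reproduce the Macro-F1 (0.7973), precision (0.8100), and recall (0.7950) reported in the test table.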
## Analysis
- The model achieves strong overall performance with a Macro-F1 of ~0.80 on the test set.
- OTHER and NONE classes show the highest performance.
- The AFTER class has lower precision but strong recall, indicating some over-prediction.
- The drop from validation (0.83 F1) to test (0.79 F1) suggests a mild generalization gap, but overall stable performance.
- Errors are mainly observed in:
  - BEFORE vs AFTER confusion
  - NONE vs OTHER boundary ambiguity
## Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "AnirbanSaha/llama32-3b-tlink-full-finetune"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

text = "The patient developed fever before taking the medication."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
pred = outputs.logits.argmax(dim=-1).item()

labels = ["BEFORE", "AFTER", "OTHER", "NONE"]
print(labels[pred])
```
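Classification logits can also be turned into a confidence score with a softmax. A minimal pure-Python sketch using a hypothetical logits vector (real values come from `outputs.logits`):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)                       # shift for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["BEFORE", "AFTER", "OTHER", "NONE"]
logits = [4.1, 0.3, -1.2, 0.8]            # hypothetical values for illustration
probs = softmax(logits)
best = max(range(len(labels)), key=probs.__getitem__)
print(labels[best], round(probs[best], 3))
```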
## Important Notice (LLaMA License)

This model is based on meta-llama/Llama-3.2-3B-Instruct. You must request access and accept the license at https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct; otherwise, loading this model will fail.
## Training Logs

Weights & Biases run: https://wandb.ai/anirbansaha002-univeristy-of-north-texas/llama32-3b-full-finetune/runs/mrdwp04g
## Limitations
- Performance depends on the training data distribution
- May not generalize well to unseen domains
- Sensitive to truncation of long inputs
- Some confusion remains between temporally similar classes (BEFORE vs AFTER)
## Research Context
This model was developed as part of research on:
- Temporal reasoning in NLP
- Relation classification using large language models
## Contact
Author: Anirban Saha Anik
Affiliation: University of North Texas
## Citation
If you use this model in your research, please cite appropriately.