MSME Legal Dispute Classifier (Longformer, 6-Class)
Model Overview
This model is a multi-class legal document classifier designed to categorize MSME-related dispute cases into six statutory dispute categories. It is fine-tuned from allenai/longformer-base-4096 and optimized for long-form legal documents up to 1200 tokens. The system is intended for automated dispute categorization, legal triage, and decision-support applications in MSME dispute resolution workflows.
Problem Statement
MSME dispute cases often involve lengthy legal narratives including:
- Statement of claim
- Buyer response
- Case summary
- Contractual and payment details
Manual classification is time-consuming and error-prone. This model automates dispute categorization into predefined legal classes.
Classification Labels
The model predicts one of the following six categories:
| Label ID | Category |
|---|---|
| 0 | Delayed payment (no dispute) |
| 1 | Quality dispute |
| 2 | No formal contract |
| 3 | Partial payment dispute |
| 4 | Government procurement delay |
| 5 | Service-related dispute |
Label mapping is included in label_mapping.json.
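The table above can be expressed as a simple lookup. The sketch below is illustrative only: the actual schema of label_mapping.json is not documented here, so the `ID2LABEL` dictionary and the `decode_label` helper are assumptions that mirror the table.

```python
# Hypothetical in-code mirror of the label table; the real label_mapping.json
# may use a different structure (for example, string keys).
ID2LABEL = {
    0: "Delayed payment (no dispute)",
    1: "Quality dispute",
    2: "No formal contract",
    3: "Partial payment dispute",
    4: "Government procurement delay",
    5: "Service-related dispute",
}
# Reverse lookup from category name to label ID.
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}


def decode_label(label_id: int) -> str:
    """Return the category name for a predicted label ID."""
    if label_id not in ID2LABEL:
        raise ValueError(f"Unknown label ID: {label_id}")
    return ID2LABEL[label_id]


print(decode_label(3))
```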
Model Architecture
- Base Model: Longformer
- Checkpoint: allenai/longformer-base-4096
- Max Sequence Length: 1200 tokens
- Hidden Size: 768
- Number of Layers: 12
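- Attention Type: Sliding-window local attention; classification from the CLS token, which receives global attention by default in `LongformerForSequenceClassification`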
- Classification Head: Linear layer (6 outputs)
Longformer was selected due to the long-document nature of legal dispute texts.
Dataset Information
- Final Dataset Size (after cleaning): 2152 samples
- Duplicates removed
- Label conflicts resolved
- Stratified 80/20 train/test split
- 5-fold stratified cross-validation
Class imbalance is handled using a weighted cross-entropy loss.
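Weighted cross-entropy scales each class's loss by a per-class weight. This card does not state how the weights were derived; a common choice, assumed here, is the "balanced" heuristic `n_samples / (n_classes * class_count)`. The label counts in the example are made up for illustration and are not the real class distribution, and `balanced_class_weights` is a hypothetical helper.

```python
from collections import Counter


def balanced_class_weights(labels, num_classes=6):
    """Inverse-frequency weights: n_samples / (num_classes * count_per_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    weights = []
    for cls in range(num_classes):
        if counts[cls] == 0:
            raise ValueError(f"Class {cls} has no samples")
        weights.append(n_samples / (num_classes * counts[cls]))
    return weights


# Toy labels (NOT the real dataset): class 0 is common, the others are rarer.
toy_labels = [0] * 6 + [1] * 2 + [2, 3, 4, 5]
weights = balanced_class_weights(toy_labels)
print(weights)
```

Rare classes receive larger weights, so their errors contribute more to the loss. In PyTorch, such a list could be converted to a tensor and passed as the `weight` argument of `torch.nn.CrossEntropyLoss`.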
Training Configuration
- Optimizer: AdamW
- Learning Rate: 2e-5
- Batch Size: 2
- Gradient Accumulation Steps: 4
- Effective Batch Size: 8
- Epochs: 3
- Warmup Steps: 200
- Mixed Precision (FP16): Enabled
- Loss Function: Weighted Cross Entropy
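The effective batch size is the product of the per-step batch size and the gradient accumulation steps. The sketch below reproduces that arithmetic and estimates the number of optimizer steps, assuming a single device and training on the 1721 samples that remain after the 431-sample test split (2152 - 431). The exact step count depends on how the trainer handles the final partial batch, so treat the numbers as approximate.

```python
import math

batch_size = 2
grad_accum_steps = 4
epochs = 3
warmup_steps = 200
train_samples = 2152 - 431  # assumption: the full 80% split is used for training

# Gradients from 4 batches of 2 are accumulated before each optimizer step.
effective_batch = batch_size * grad_accum_steps
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_fraction = warmup_steps / total_steps

print(effective_batch, steps_per_epoch, total_steps, round(warmup_fraction, 2))
```

Under these assumptions the run has about 648 optimizer steps, so the 200 warmup steps cover roughly 31% of training.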
Evaluation Results (Held-Out Test Set)
Test Set Size: 431 samples
| Metric | Score |
|---|---|
| Accuracy | 0.77 |
| Macro Precision | 0.76 |
| Macro Recall | 0.74 |
| Macro F1 Score | 0.75 |
| Macro AUC-ROC (OvR) | 0.948 |
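The macro AUC-ROC of 0.948 indicates strong class separability. Macro F1 (0.75) is close to accuracy (0.77), which suggests reasonably balanced performance across categories, although per-class scores are not reported here.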
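Macro-averaged metrics weight every class equally, regardless of how often it occurs. As a minimal sketch of how macro precision, recall, and F1 are computed, the example below uses a small made-up set of predictions, not the model's actual outputs; `macro_scores` is a hypothetical helper.

```python
def macro_scores(y_true, y_pred, num_classes):
    """Return (macro precision, macro recall, macro F1) as unweighted class means."""
    precisions, recalls, f1s = [], [], []
    pairs = list(zip(y_true, y_pred))
    for cls in range(num_classes):
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = num_classes
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy three-class example (illustrative values only).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
macro_p, macro_r, macro_f1 = macro_scores(y_true, y_pred, num_classes=3)
print(round(macro_p, 3), round(macro_r, 3), round(macro_f1, 3))
```

Here macro F1 is the mean of the per-class F1 scores, which is one common definition; the card does not state which variant was used.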
Intended Use
This model is suitable for:
- Automated legal dispute classification
- MSME case triage systems
- Online Dispute Resolution (ODR) platforms
- Legal analytics systems
- Case routing and prioritization tools
Limitations
- Inputs are truncated to 1200 tokens, so content beyond that limit is ignored and performance may degrade on significantly longer documents.
- Domain-specific to MSME dispute scenarios.
- Not designed for general legal classification tasks.
- Should not be used as a substitute for legal judgment.
Ethical Considerations
This model is intended as a decision-support tool. Human oversight is recommended for legal decision-making applications. It does not provide legal advice.
Usage Example
```python
import torch
from transformers import AutoTokenizer, LongformerForSequenceClassification

# Replace YOUR_USERNAME with the account that hosts the model.
model_id = "YOUR_USERNAME/msme-legal-dispute-classifier-longformer"
model = LongformerForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model.eval()  # ensure dropout is disabled for inference

text = "The buyer failed to release payment within the agreed 45-day period."
# Inputs longer than 1200 tokens are truncated.
inputs = tokenizer(text, truncation=True, max_length=1200, return_tensors="pt")
with torch.no_grad():  # gradients are not needed at inference time
    outputs = model(**inputs)

predicted_class = torch.argmax(outputs.logits, dim=1)
print("Predicted Label:", predicted_class.item())
```
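The argmax alone does not show how confident the model is. As a sketch, logits can be converted to class probabilities with a softmax; the logit values below are invented for illustration, and `softmax` is a plain-Python helper rather than part of the model's API.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [2.0, 0.5, -1.0, 0.0, -0.5, 0.1]  # made-up values, one per class
probs = softmax(example_logits)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted, round(probs[predicted], 3))
```

With real model outputs, `torch.softmax(outputs.logits, dim=1)` performs the same conversion on the logits tensor.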