LegalBERT · Contract Clause Classifier (LoRA)

A LoRA adapter fine-tuned on top of nlpaueb/legal-bert-base-uncased for multi-class contract clause classification across all 41 CUAD clause types.

The model significantly outperforms the untrained baseline (accuracy 3.28% → 71.46%, macro F1 0.005 → 0.502, weighted F1 0.008 → 0.677) after 5 epochs of LoRA fine-tuning.


Model Details

| Property | Value |
|---|---|
| Base model | nlpaueb/legal-bert-base-uncased |
| Adapter type | LoRA (PEFT) |
| Task | Multi-class sequence classification |
| Classes | 41 CUAD clause types |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.1 |
| Target modules | query, value |
| Max sequence length | 256 tokens |
| Epochs | 5 |
| Learning rate | 2e-4 |
| Batch size | 16 |
| Weight decay | 0.01 |
| Warmup ratio | 0.1 |
| Optimizer | AdamW (default HF Trainer) |
| Hardware | Kaggle GPU (T4) |
| PEFT version | 0.18.1 |
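The adapter hyperparameters in the table map directly onto a PEFT `LoraConfig`. A minimal sketch, assuming the reported values; this is a reconstruction, not the author's published training script:

```python
from peft import LoraConfig, TaskType

# Sketch of a LoRA config matching the hyperparameters reported above.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,         # sequence classification head
    r=16,                               # LoRA rank
    lora_alpha=32,                      # scaling factor
    lora_dropout=0.1,
    target_modules=["query", "value"],  # attention projections in BERT
)
```

Passing this config to `peft.get_peft_model` together with the base model would yield the adapter architecture described here.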

Training

The adapter was trained for 5 epochs on the CUAD dataset, which contains expert-labelled contract clauses across 41 legal categories. The dataset was split 80/20 (train/test) with stratification across all 41 labels.

  • Train size: ~7,930 examples
  • Test size: 1,983 examples
  • Split strategy: Stratified random split (random_state=42)
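The split above can be reproduced with scikit-learn's `train_test_split`. A minimal sketch on toy data; the `texts`/`labels` names are placeholders, not the original preprocessing code:

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the CUAD clause strings and their integer class ids
# (5 classes here for brevity; the real dataset has 41).
texts = [f"clause {i}" for i in range(100)]
labels = [i % 5 for i in range(100)]

train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels,
    test_size=0.20,     # 80/20 split as reported above
    stratify=labels,    # preserve per-class proportions in both splits
    random_state=42,    # seed reported above
)
```

With `stratify` set, each class keeps the same relative frequency in the train and test portions.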

Training Curve

| Epoch | Train Loss | Val Loss | Accuracy | Weighted F1 | Macro F1 |
|---|---|---|---|---|---|
| 1 | 5.992 | 4.285 | 43.22% | 0.316 | 0.158 |
| 2 | 2.881 | 2.485 | 65.81% | 0.601 | 0.382 |
| 3 | 2.203 | 2.124 | 69.79% | 0.651 | 0.448 |
| 4 | 1.958 | 2.005 | 71.05% | 0.668 | 0.488 |
| 5 | 1.852 | 1.944 | 71.46% | 0.677 | 0.502 |
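The optimizer settings reported in Model Details can be sketched as HF `TrainingArguments`. This is an assumed reconstruction (output path and evaluation cadence are guesses), not the author's original script:

```python
from transformers import TrainingArguments

# Assumed reconstruction of the training setup from the reported
# hyperparameters; the output_dir is a placeholder.
training_args = TrainingArguments(
    output_dir="legalbert-cuad-lora",
    num_train_epochs=5,
    learning_rate=2e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    warmup_ratio=0.1,
)
```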

Baseline Comparison

| Metric | Baseline (untrained) | Fine-Tuned (this model) |
|---|---|---|
| Accuracy | 3.28% | 71.46% |
| Weighted F1 | 0.0082 | 0.6771 |
| Macro F1 | 0.0053 | 0.5016 |

The baseline was evaluated by running nlpaueb/legal-bert-base-uncased directly on the test set, without any fine-tuning and with a randomly initialized classification head. Its near-random accuracy (3.28%, close to the 1/41 ≈ 2.4% chance level) confirms that the base model alone contributes no CUAD clause-type knowledge.
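The reported metrics are standard scikit-learn scores. A toy sketch of how accuracy, macro F1, and weighted F1 are computed (synthetic labels, not CUAD data):

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy 3-class predictions vs. ground truth; the real evaluation used the
# 1,983-example CUAD test split with 41 classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 2, 1]

accuracy = accuracy_score(y_true, y_pred)                   # fraction correct
macro_f1 = f1_score(y_true, y_pred, average="macro")        # unweighted mean over classes
weighted_f1 = f1_score(y_true, y_pred, average="weighted")  # support-weighted mean
```

Macro F1 treats every class equally, which is why it is much lower than weighted F1 here: rare CUAD classes with near-zero F1 drag the macro average down.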

General Benchmark: Catastrophic Forgetting Check

To verify the model did not lose general language understanding after fine-tuning, it was evaluated on a 100-sample subset of the MMLU Abstract Algebra benchmark:

| Metric | Base Model | Fine-Tuned |
|---|---|---|
| MMLU Abstract Algebra Accuracy | 19.00% | 24.00% |

No catastrophic forgetting was detected: the fine-tuned model scored 5 percentage points higher on the general reasoning benchmark than the base model, suggesting that domain-specific fine-tuning did not degrade general language ability (though a 100-sample evaluation leaves wide error bars).


Evaluation Results (Per-Class)

Selected high-performing classes from the classification report:

| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| Non-Disparagement (13) | 1.00 | 1.00 | 1.00 | 89 |
| Termination for Convenience (14) | 0.96 | 0.95 | 0.96 | 109 |
| Expiration Date (4) | 0.92 | 0.97 | 0.95 | 127 |
| Irrevocable or Perpetual License (29) | 0.72 | 0.79 | 0.75 | 89 |
| Audit Rights (32) | 0.84 | 0.93 | 0.88 | 82 |
| Effective Date (3) | 0.88 | 0.90 | 0.89 | 125 |
| Renewal Term (5) | 0.72 | 0.97 | 0.83 | 133 |
| Insurance (37)* | 0.00 | 0.00 | 0.00 | 33 |

* Some rare classes (e.g. Insurance, label index 37, and classes 0, 1, and 2) have very few training examples and score near zero; see the Limitations section below.


Example Inference Results

Real predictions from the fine-tuned model on unseen clauses:

| Clause | Predicted Type | Confidence |
|---|---|---|
| "Either party may terminate this Agreement upon 30 days written notice." | Termination for Convenience | 79.50% |
| "Licensee shall not transfer or sublicense any rights granted herein." | Anti-Assignment | 61.04% |
| "This Agreement shall be governed by the laws of California." | Governing Law | 96.87% |
| "The Company shall maintain insurance coverage of at least $1,000,000." | Insurance | 97.44% |
| "Neither party shall disclose confidential information to third parties." | Anti-Assignment | 41.98% |
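The confidence column is the maximum softmax probability over the class logits. A dependency-free sketch with toy 3-class logits (the model itself has 41 classes):

```python
import math

# Softmax over raw logits, then take the max probability as "confidence".
logits = [2.0, 0.5, -1.0]
exps = [math.exp(x) for x in logits]
total = sum(exps)
probs = [e / total for e in exps]   # probabilities summing to 1
confidence = max(probs)             # confidence of the argmax class
```

Low values like the 41.98% in the last row indicate the model is splitting probability mass across several classes, so such predictions deserve extra scrutiny.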

Usage

This is a PEFT LoRA adapter: load it on top of the base model using the peft library.

Installation

```bash
pip install torch transformers peft scikit-learn
```

Inference

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel

base_model_id = "nlpaueb/legal-bert-base-uncased"
adapter_id = "Mokshith31/legalbert-contract-clause-classification"

# Load tokenizer and base model
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
base_model = AutoModelForSequenceClassification.from_pretrained(
    base_model_id,
    num_labels=41
)

# Load LoRA adapter on top
model = PeftModel.from_pretrained(base_model, adapter_id)
model.eval()

# Label mapping (ID → clause type name)
id2label = {
    0: "Document Name", 1: "Parties", 2: "Agreement Date",
    3: "Effective Date", 4: "Expiration Date", 5: "Renewal Term",
    6: "Notice Period to Terminate Renewal", 7: "Governing Law",
    8: "Most Favored Nation", 9: "Non-Compete", 10: "Exclusivity",
    11: "No-Solicit of Customers", 12: "No-Solicit of Employees",
    13: "Non-Disparagement", 14: "Termination for Convenience",
    15: "ROFR / ROFO / ROFN", 16: "Change of Control",
    17: "Anti-Assignment", 18: "Revenue / Profit Sharing",
    19: "Price Restriction", 20: "Minimum Commitment",
    21: "Volume Restriction", 22: "IP Ownership Assignment",
    23: "Joint IP Ownership", 24: "License Grant",
    25: "Non-Transferable License", 26: "Affiliate License-Licensor",
    27: "Affiliate License-Licensee",
    28: "Unlimited / All-You-Can-Eat License",
    29: "Irrevocable or Perpetual License", 30: "Source Code Escrow",
    31: "Post-Termination Services", 32: "Audit Rights",
    33: "Uncapped Liability", 34: "Cap on Liability",
    35: "Liquidated Damages", 36: "Warranty Duration",
    37: "Insurance", 38: "Covenant Not to Sue",
    39: "Third Party Beneficiary", 40: "Other"
}

# Run inference
clause = "This Agreement shall be governed by the laws of California."

inputs = tokenizer(
    clause,
    return_tensors="pt",
    truncation=True,
    max_length=256
)

with torch.no_grad():
    outputs = model(**inputs)

probs = torch.softmax(outputs.logits, dim=-1)
pred_id = outputs.logits.argmax(dim=-1).item()
confidence = probs.max().item()

print(f"Predicted clause type: {id2label[pred_id]}")
print(f"Confidence: {confidence:.2%}")
```

With Merged Weights (pipeline API)

```python
import torch
from peft import PeftModel
from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer, pipeline)

base = AutoModelForSequenceClassification.from_pretrained(
    "nlpaueb/legal-bert-base-uncased", num_labels=41
)
model = PeftModel.from_pretrained(
    base,
    "Mokshith31/legalbert-contract-clause-classification"
)
model = model.merge_and_unload()  # fuse LoRA weights into base

tokenizer = AutoTokenizer.from_pretrained(
    "nlpaueb/legal-bert-base-uncased"
)

# Note: the model config has no id2label mapping here, so the pipeline
# reports generic labels such as "LABEL_7"; map them to names using the
# CUAD label schema below.
classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer
)

result = classifier(
    "Either party may terminate upon 30 days written notice.",
    truncation=True,
    max_length=256
)
print(result)
```

CUAD Label Schema

The model predicts one of the following 41 clause categories:

| ID | Clause Type |
|---|---|
| 0 | Document Name |
| 1 | Parties |
| 2 | Agreement Date |
| 3 | Effective Date |
| 4 | Expiration Date |
| 5 | Renewal Term |
| 6 | Notice Period to Terminate Renewal |
| 7 | Governing Law |
| 8 | Most Favored Nation |
| 9 | Non-Compete |
| 10 | Exclusivity |
| 11 | No-Solicit of Customers |
| 12 | No-Solicit of Employees |
| 13 | Non-Disparagement |
| 14 | Termination for Convenience |
| 15 | ROFR / ROFO / ROFN |
| 16 | Change of Control |
| 17 | Anti-Assignment |
| 18 | Revenue / Profit Sharing |
| 19 | Price Restriction |
| 20 | Minimum Commitment |
| 21 | Volume Restriction |
| 22 | IP Ownership Assignment |
| 23 | Joint IP Ownership |
| 24 | License Grant |
| 25 | Non-Transferable License |
| 26 | Affiliate License-Licensor |
| 27 | Affiliate License-Licensee |
| 28 | Unlimited / All-You-Can-Eat License |
| 29 | Irrevocable or Perpetual License |
| 30 | Source Code Escrow |
| 31 | Post-Termination Services |
| 32 | Audit Rights |
| 33 | Uncapped Liability |
| 34 | Cap on Liability |
| 35 | Liquidated Damages |
| 36 | Warranty Duration |
| 37 | Insurance |
| 38 | Covenant Not to Sue |
| 39 | Third Party Beneficiary |
| 40 | Other |

Limitations and Bias

  • Trained exclusively on English-language commercial contracts from the CUAD dataset. Performance may degrade on other legal domains (e.g. employment, real estate) or non-US contract styles.
  • Some CUAD classes have very few training examples (e.g. class 2, Agreement Date, has only 1 support sample), which leads to near-zero per-class performance on rare clause types. Classes 0, 1, 2, 7, 9, 21, 22, 27, 37, and 38 scored F1 = 0.00 due to insufficient training data.
  • Class imbalance in the CUAD dataset means the model favours more common clause types (e.g. Renewal Term, Effective Date).
  • The model is not a substitute for legal advice. Predictions should be reviewed by qualified legal professionals before use in any legal workflow.
  • Max sequence length is 256 tokens; longer clauses will be truncated and may lose important context.
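One common mitigation for the class imbalance noted above (not used in this training run, and shown only as a hedged sketch) is inverse-frequency class weighting for the loss:

```python
from collections import Counter

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical sketch: derive "balanced" class weights that a weighted
# cross-entropy loss could consume to counter class imbalance.
labels = [0] * 90 + [1] * 9 + [2] * 1   # toy imbalanced distribution
classes = np.array(sorted(Counter(labels)))
weights = compute_class_weight("balanced", classes=classes, y=np.array(labels))
# weights[c] = n_samples / (n_classes * count(c)): rare classes get larger weights
```

Rare classes such as Agreement Date would receive proportionally larger weights, at the cost of more volatile training on those classes.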

Citation

If you use this model, please cite the original CUAD dataset:

@article{hendrycks2021cuad,
  title={CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review},
  author={Hendrycks, Dan and Burns, Collin and Chen, Anya and Ball, Spencer},
  journal={arXiv preprint arXiv:2103.06268},
  year={2021}
}

And the LegalBERT base model:

@inproceedings{chalkidis-etal-2020-legal,
  title={LEGAL-BERT: The Muppets straight out of Law School},
  author={Chalkidis, Ilias and Fergadiotis, Manos and Malakasiotis,
          Prodromos and Aletras, Nikolaos and Androutsopoulos, Ion},
  booktitle={Findings of EMNLP},
  year={2020}
}

Experiment Tracking

Training was tracked using Weights & Biases:
🔗 W&B Project: contract-intelligence


Framework Versions

| Library | Version |
|---|---|
| Transformers | latest |
| PEFT | 0.18.1 |
| PyTorch | latest |
| Datasets | latest |
| scikit-learn | latest |
| Accelerate | latest |