LegalBERT · Contract Clause Classifier (LoRA)

A LoRA adapter fine-tuned on top of nlpaueb/legal-bert-base-uncased for multi-class contract clause classification across all 41 CUAD clause types.

The model significantly outperforms the untrained baseline (accuracy 3.28% → 71.46%, macro F1 0.005 → 0.502, weighted F1 0.008 → 0.677) after 5 epochs of LoRA fine-tuning.


Model Details

| Property | Value |
|---|---|
| Base model | nlpaueb/legal-bert-base-uncased |
| Adapter type | LoRA (PEFT) |
| Task | Multi-class sequence classification |
| Classes | 41 CUAD clause types |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.1 |
| Target modules | query, value |
| Max sequence length | 256 tokens |
| Epochs | 5 |
| Learning rate | 2e-4 |
| Batch size | 16 |
| Weight decay | 0.01 |
| Warmup ratio | 0.1 |
| Optimizer | AdamW (default HF Trainer) |
| Hardware | Kaggle GPU (T4) |
| PEFT version | 0.18.1 |
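The adapter hyperparameters in the table map directly onto a PEFT `LoraConfig`. A minimal sketch, assuming the reported values; this is a reconstruction, not the author's published training script:

```python
from peft import LoraConfig, TaskType

# Sketch of a LoRA config matching the hyperparameters reported above.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,         # sequence classification head
    r=16,                               # LoRA rank
    lora_alpha=32,                      # scaling factor
    lora_dropout=0.1,
    target_modules=["query", "value"],  # attention projections in BERT
)
```

Passing this config to `peft.get_peft_model` together with the base model would yield the adapter architecture described here.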

Training

The adapter was trained for 5 epochs on the CUAD dataset, which contains expert-labelled contract clauses across 41 legal categories. The dataset was split 80/20 (train/test) with stratification across all 41 labels.

  • Train size: ~7,930 examples
  • Test size: 1,983 examples
  • Split strategy: Stratified random split (random_state=42)
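The split above can be reproduced with scikit-learn's `train_test_split`. A minimal sketch on toy data; the `texts`/`labels` names are placeholders, not the original preprocessing code:

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the CUAD clause strings and their integer class ids
# (5 classes here for brevity; the real dataset has 41).
texts = [f"clause {i}" for i in range(100)]
labels = [i % 5 for i in range(100)]

train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels,
    test_size=0.20,     # 80/20 split as reported above
    stratify=labels,    # preserve per-class proportions in both splits
    random_state=42,    # seed reported above
)
```

With `stratify` set, each class keeps the same relative frequency in the train and test portions.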

Training Curve

| Epoch | Train Loss | Val Loss | Accuracy | Weighted F1 | Macro F1 |
|---|---|---|---|---|---|
| 1 | 5.992 | 4.285 | 43.22% | 0.316 | 0.158 |
| 2 | 2.881 | 2.485 | 65.81% | 0.601 | 0.382 |
| 3 | 2.203 | 2.124 | 69.79% | 0.651 | 0.448 |
| 4 | 1.958 | 2.005 | 71.05% | 0.668 | 0.488 |
| 5 | 1.852 | 1.944 | 71.46% | 0.677 | 0.502 |
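The optimizer settings reported in Model Details can be sketched as HF `TrainingArguments`. This is an assumed reconstruction (output path and evaluation cadence are guesses), not the author's original script:

```python
from transformers import TrainingArguments

# Assumed reconstruction of the training setup from the reported
# hyperparameters; the output_dir is a placeholder.
training_args = TrainingArguments(
    output_dir="legalbert-cuad-lora",
    num_train_epochs=5,
    learning_rate=2e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    warmup_ratio=0.1,
)
```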

Baseline Comparison

| Metric | Baseline (untrained) | Fine-Tuned (this model) |
|---|---|---|
| Accuracy | 3.28% | 71.46% |
| Weighted F1 | 0.0082 | 0.6771 |
| Macro F1 | 0.0053 | 0.5016 |

The baseline was evaluated by running nlpaueb/legal-bert-base-uncased directly on the test set, without any fine-tuning and with a randomly initialized classification head. Its near-random accuracy (3.28%, close to the 1/41 ≈ 2.4% chance level) confirms that the base model alone contributes no CUAD clause-type knowledge.
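The reported metrics are standard scikit-learn scores. A toy sketch of how accuracy, macro F1, and weighted F1 are computed (synthetic labels, not CUAD data):

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy 3-class predictions vs. ground truth; the real evaluation used the
# 1,983-example CUAD test split with 41 classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 2, 1]

accuracy = accuracy_score(y_true, y_pred)                   # fraction correct
macro_f1 = f1_score(y_true, y_pred, average="macro")        # unweighted mean over classes
weighted_f1 = f1_score(y_true, y_pred, average="weighted")  # support-weighted mean
```

Macro F1 treats every class equally, which is why it is much lower than weighted F1 here: rare CUAD classes with near-zero F1 drag the macro average down.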

General Benchmark: Catastrophic Forgetting Check

To verify the model did not lose general language understanding after fine-tuning, it was evaluated on a 100-sample subset of the MMLU Abstract Algebra benchmark:

| Metric | Base Model | Fine-Tuned |
|---|---|---|
| MMLU Abstract Algebra Accuracy | 19.00% | 24.00% |

No catastrophic forgetting was detected: the fine-tuned model scored 5 percentage points higher on the general reasoning benchmark than the base model, suggesting that domain-specific fine-tuning did not degrade general language ability (though a 100-sample evaluation leaves wide error bars).


Evaluation Results (Per-Class)

Selected high-performing classes from the classification report:

| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| Non-Disparagement (13) | 1.00 | 1.00 | 1.00 | 89 |
| Termination for Convenience (14) | 0.96 | 0.95 | 0.96 | 109 |
| Expiration Date (4) | 0.92 | 0.97 | 0.95 | 127 |
| Irrevocable or Perpetual License (29) | 0.72 | 0.79 | 0.75 | 89 |
| Audit Rights (32) | 0.84 | 0.93 | 0.88 | 82 |
| Effective Date (3) | 0.88 | 0.90 | 0.89 | 125 |
| Renewal Term (5) | 0.72 | 0.97 | 0.83 | 133 |
| Insurance (37)* | 0.00 | 0.00 | 0.00 | 33 |

* Some rare classes (e.g. Insurance, label index 37, and classes 0, 1, and 2) have very few training examples and score near zero; see the Limitations section below.


Example Inference Results

Real predictions from the fine-tuned model on unseen clauses:

| Clause | Predicted Type | Confidence |
|---|---|---|
| "Either party may terminate this Agreement upon 30 days written notice." | Termination for Convenience | 79.50% |
| "Licensee shall not transfer or sublicense any rights granted herein." | Anti-Assignment | 61.04% |
| "This Agreement shall be governed by the laws of California." | Governing Law | 96.87% |
| "The Company shall maintain insurance coverage of at least $1,000,000." | Insurance | 97.44% |
| "Neither party shall disclose confidential information to third parties." | Anti-Assignment | 41.98% |
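The confidence column is the maximum softmax probability over the class logits. A dependency-free sketch with toy 3-class logits (the model itself has 41 classes):

```python
import math

# Softmax over raw logits, then take the max probability as "confidence".
logits = [2.0, 0.5, -1.0]
exps = [math.exp(x) for x in logits]
total = sum(exps)
probs = [e / total for e in exps]   # probabilities summing to 1
confidence = max(probs)             # confidence of the argmax class
```

Low values like the 41.98% in the last row indicate the model is splitting probability mass across several classes, so such predictions deserve extra scrutiny.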

Usage

This is a PEFT LoRA adapter: load it on top of the base model using the peft library.

Installation

```bash
pip install torch transformers peft scikit-learn
```

Inference

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel

base_model_id = "nlpaueb/legal-bert-base-uncased"
adapter_id = "Mokshith31/legalbert-contract-clause-classification"

# Load tokenizer and base model
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
base_model = AutoModelForSequenceClassification.from_pretrained(
    base_model_id,
    num_labels=41
)

# Load LoRA adapter on top
model = PeftModel.from_pretrained(base_model, adapter_id)
model.eval()

# Label mapping (ID → clause type name)
id2label = {
    0: "Document Name", 1: "Parties", 2: "Agreement Date",
    3: "Effective Date", 4: "Expiration Date", 5: "Renewal Term",
    6: "Notice Period to Terminate Renewal", 7: "Governing Law",
    8: "Most Favored Nation", 9: "Non-Compete", 10: "Exclusivity",
    11: "No-Solicit of Customers", 12: "No-Solicit of Employees",
    13: "Non-Disparagement", 14: "Termination for Convenience",
    15: "ROFR / ROFO / ROFN", 16: "Change of Control",
    17: "Anti-Assignment", 18: "Revenue / Profit Sharing",
    19: "Price Restriction", 20: "Minimum Commitment",
    21: "Volume Restriction", 22: "IP Ownership Assignment",
    23: "Joint IP Ownership", 24: "License Grant",
    25: "Non-Transferable License", 26: "Affiliate License-Licensor",
    27: "Affiliate License-Licensee",
    28: "Unlimited / All-You-Can-Eat License",
    29: "Irrevocable or Perpetual License", 30: "Source Code Escrow",
    31: "Post-Termination Services", 32: "Audit Rights",
    33: "Uncapped Liability", 34: "Cap on Liability",
    35: "Liquidated Damages", 36: "Warranty Duration",
    37: "Insurance", 38: "Covenant Not to Sue",
    39: "Third Party Beneficiary", 40: "Other"
}

# Run inference
clause = "This Agreement shall be governed by the laws of California."

inputs = tokenizer(
    clause,
    return_tensors="pt",
    truncation=True,
    max_length=256
)

with torch.no_grad():
    outputs = model(**inputs)

probs = torch.softmax(outputs.logits, dim=-1)
pred_id = outputs.logits.argmax(dim=-1).item()
confidence = probs.max().item()

print(f"Predicted clause type: {id2label[pred_id]}")
print(f"Confidence: {confidence:.2%}")
```

With Merged Weights (pipeline API)

```python
import torch
from peft import PeftModel
from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer, pipeline)

base = AutoModelForSequenceClassification.from_pretrained(
    "nlpaueb/legal-bert-base-uncased", num_labels=41
)
model = PeftModel.from_pretrained(
    base,
    "Mokshith31/legalbert-contract-clause-classification"
)
model = model.merge_and_unload()  # fuse LoRA weights into base

tokenizer = AutoTokenizer.from_pretrained(
    "nlpaueb/legal-bert-base-uncased"
)

# Note: the model config has no id2label mapping here, so the pipeline
# reports generic labels such as "LABEL_7"; map them to names using the
# CUAD label schema below.
classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer
)

result = classifier(
    "Either party may terminate upon 30 days written notice.",
    truncation=True,
    max_length=256
)
print(result)
```

CUAD Label Schema

The model predicts one of the following 41 clause categories:

| ID | Clause Type |
|---|---|
| 0 | Document Name |
| 1 | Parties |
| 2 | Agreement Date |
| 3 | Effective Date |
| 4 | Expiration Date |
| 5 | Renewal Term |
| 6 | Notice Period to Terminate Renewal |
| 7 | Governing Law |
| 8 | Most Favored Nation |
| 9 | Non-Compete |
| 10 | Exclusivity |
| 11 | No-Solicit of Customers |
| 12 | No-Solicit of Employees |
| 13 | Non-Disparagement |
| 14 | Termination for Convenience |
| 15 | ROFR / ROFO / ROFN |
| 16 | Change of Control |
| 17 | Anti-Assignment |
| 18 | Revenue / Profit Sharing |
| 19 | Price Restriction |
| 20 | Minimum Commitment |
| 21 | Volume Restriction |
| 22 | IP Ownership Assignment |
| 23 | Joint IP Ownership |
| 24 | License Grant |
| 25 | Non-Transferable License |
| 26 | Affiliate License-Licensor |
| 27 | Affiliate License-Licensee |
| 28 | Unlimited / All-You-Can-Eat License |
| 29 | Irrevocable or Perpetual License |
| 30 | Source Code Escrow |
| 31 | Post-Termination Services |
| 32 | Audit Rights |
| 33 | Uncapped Liability |
| 34 | Cap on Liability |
| 35 | Liquidated Damages |
| 36 | Warranty Duration |
| 37 | Insurance |
| 38 | Covenant Not to Sue |
| 39 | Third Party Beneficiary |
| 40 | Other |

Limitations and Bias

  • Trained exclusively on English-language commercial contracts from the CUAD dataset. Performance may degrade on other legal domains (e.g. employment, real estate) or non-US contract styles.
  • Some CUAD classes have very few training examples (e.g. class 2, Agreement Date, has only 1 support sample), which leads to near-zero per-class performance on rare clause types. Classes 0, 1, 2, 7, 9, 21, 22, 27, 37, and 38 scored F1 = 0.00 due to insufficient training data.
  • Class imbalance in the CUAD dataset means the model favours more common clause types (e.g. Renewal Term, Effective Date).
  • The model is not a substitute for legal advice. Predictions should be reviewed by qualified legal professionals before use in any legal workflow.
  • Max sequence length is 256 tokens; longer clauses will be truncated and may lose important context.
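One common mitigation for the class imbalance noted above (not used in this training run, and shown only as a hedged sketch) is inverse-frequency class weighting for the loss:

```python
from collections import Counter

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical sketch: derive "balanced" class weights that a weighted
# cross-entropy loss could consume to counter class imbalance.
labels = [0] * 90 + [1] * 9 + [2] * 1   # toy imbalanced distribution
classes = np.array(sorted(Counter(labels)))
weights = compute_class_weight("balanced", classes=classes, y=np.array(labels))
# weights[c] = n_samples / (n_classes * count(c)): rare classes get larger weights
```

Rare classes such as Agreement Date would receive proportionally larger weights, at the cost of more volatile training on those classes.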

Citation

If you use this model, please cite the original CUAD dataset:

@article{hendrycks2021cuad,
  title={CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review},
  author={Hendrycks, Dan and Burns, Collin and Chen, Anya and Ball, Spencer},
  journal={arXiv preprint arXiv:2103.06268},
  year={2021}
}

And the LegalBERT base model:

@inproceedings{chalkidis-etal-2020-legal,
  title={LEGAL-BERT: The Muppets straight out of Law School},
  author={Chalkidis, Ilias and Fergadiotis, Manos and Malakasiotis,
          Prodromos and Aletras, Nikolaos and Androutsopoulos, Ion},
  booktitle={Findings of EMNLP},
  year={2020}
}

Experiment Tracking

Training was tracked using Weights & Biases:
🔗 W&B Project: contract-intelligence


Framework Versions

| Library | Version |
|---|---|
| Transformers | latest |
| PEFT | 0.18.1 |
| PyTorch | latest |
| Datasets | latest |
| scikit-learn | latest |
| Accelerate | latest |