Intelligent Legal Document Analysis Classifier (Longformer)

This model is a fine-tuned version of Legal-Longformer-Base-4096. It was developed at the Pune Institute of Computer Technology (PICT) as part of the research paper "Intelligent Legal Document Analysis using NLP".

The framework leverages NLP and Information Retrieval (IR) to classify unstructured documents into three primary domains: Criminal Law, Contract Law, and Education Law.

Model Details

Model Description

Traditional NLP pipelines often struggle with the complexity of legal discourse and domain-specific jargon. This model addresses these challenges by focusing on a tractable subset of twelve small-scale laws. It uses the Vector Space Model (VSM) for clause-level representation and Longformer's sliding-window attention to process documents spanning thousands of tokens.
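
As a rough illustration of why sliding-window attention scales to long inputs, the sketch below builds a toy attention mask in plain Python. The function name and the tiny sizes are illustrative assumptions, not the model's actual configuration, and Longformer's additional global attention on selected tokens is left out.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Return mask[i][j] == True when token i may attend to token j.

    Each token sees `window // 2` neighbours on each side plus itself,
    so the number of allowed pairs grows linearly with seq_len
    instead of quadratically.
    """
    half = window // 2
    return [
        [abs(i - j) <= half for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Toy sizes chosen for readability only.
mask = sliding_window_mask(seq_len=8, window=4)
allowed = sum(sum(row) for row in mask)
print(f"allowed pairs: {allowed} of {8 * 8}")  # allowed pairs: 34 of 64
```

With full self-attention every one of the 64 pairs would be computed; the windowed mask keeps only the pairs near the diagonal.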

  • Developed by: Tanishq Shinde, Nilakshi Sonawane, Sarang Joshi, Mansi Jangle, and Vaishnavi Madavi
  • Institution: Pune Institute of Computer Technology (PICT), Pune, India
  • Model type: Transformer-based Sequence Classifier
  • Finetuned from model: Saibo-creator/legal-longformer-base-4096

Model Sources

  • Research Project: Intelligent Legal Document Analysis using NLP (Paper ID 685)
  • Repository: Hugging Face Model Hub (Tanishq77/legal-classifier-v1)

Uses

Direct Use

The model is optimized to categorize legal clauses from the following twelve sub-domains into their three primary domains:

  • Criminal Law: Traffic signal violations, drunk driving penalties, petty theft, and noise pollution laws.
  • Contract Law: House rent agreements, lease termination, small loan disputes, and consumer redressal rules.
  • Education Law: Teacher appointments, service statutes, wage rules, and leave regulations.

Engineering-Inspired Features

The classifier is designed to support several high-level analytical components described in the paper:

  • Knapsack-based Term Selection: Selecting the most informative terms to maximize relevance within a "scope budget".
  • Fuzzy Word Identification: Flagging ambiguous expressions (e.g., "reasonable time") to highlight legal uncertainties for human review.
  • Finite State Machines (FSM): Modeling legal procedural flows, such as the transition from a "contract active" state to a "penalty imposed" state.
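
The three components above can be sketched in a few lines each. This is a minimal sketch under stated assumptions, not the paper's implementation: the relevance scores, integer costs, phrase list (apart from "reasonable time"), event names, and every state other than "contract active" and "penalty imposed" are hypothetical.

```python
def select_terms(terms, budget):
    """0/1 knapsack: pick terms maximising total relevance within a cost budget.

    `terms` is a list of (term, relevance, cost) tuples with integer costs.
    """
    # best[c] = (total relevance, chosen terms) using at most cost c
    best = [(0.0, [])] * (budget + 1)
    for term, relevance, cost in terms:
        # Iterate downwards so each term is used at most once.
        for c in range(budget, cost - 1, -1):
            candidate = best[c - cost][0] + relevance
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - cost][1] + [term])
    return best[budget][1]


# Hypothetical phrase list; only "reasonable time" comes from the card.
FUZZY_PHRASES = ("reasonable time", "as soon as possible", "due care")


def flag_fuzzy(clause):
    """Return the vague phrases found in a clause, for human review."""
    lowered = clause.lower()
    return [phrase for phrase in FUZZY_PHRASES if phrase in lowered]


# Hypothetical transition table: (current state, event) -> next state.
TRANSITIONS = {
    ("contract_active", "breach_detected"): "penalty_imposed",
    ("contract_active", "term_expired"): "contract_closed",
    ("penalty_imposed", "penalty_paid"): "contract_active",
}


def run_fsm(state, events):
    """Apply events in order; events with no matching transition are ignored."""
    for event in events:
        state = TRANSITIONS.get((state, event), state)
    return state


candidates = [
    ("lessee", 0.9, 3),
    ("penalty", 0.8, 2),
    ("termination", 0.7, 2),
    ("hereby", 0.1, 1),
]
print(select_terms(candidates, budget=4))
print(flag_fuzzy("The tenant shall vacate within a reasonable time."))
print(run_fsm("contract_active", ["breach_detected"]))
```

In this toy run the knapsack prefers two cheaper terms over the single highest-scoring one, because together they carry more relevance within the same budget.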

Training Details

Training Data

Training used a balanced dataset of 30,000 rows of legal text (10,000 per primary domain).

  • Preprocessing: Text normalization, tokenization, and retention of critical legal abbreviations and Latin expressions.
  • Representation: Documents were segmented into individual clauses, treating each as a mathematical point in high-dimensional space.
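
To make the "clause as a point in high-dimensional space" idea concrete, the sketch below maps clauses to sparse TF-IDF vectors and compares them with cosine similarity. TF-IDF weighting, the smoothed IDF formula, and the simplified tokenizer are assumptions for illustration; the card does not state which weighting the authors used, and this tokenizer does not implement the abbreviation and Latin-expression retention described above.

```python
import math
import re
from collections import Counter


def tokenize(clause: str) -> list[str]:
    """Lowercase and keep alphabetic runs only (a deliberate simplification)."""
    return re.findall(r"[a-z]+", clause.lower())


def tfidf_vectors(clauses: list[str]) -> list[dict[str, float]]:
    """Map each clause to a sparse TF-IDF vector (term -> weight)."""
    docs = [Counter(tokenize(clause)) for clause in clauses]
    n_docs = len(docs)
    # Number of clauses in which each term appears.
    doc_freq = Counter(term for doc in docs for term in doc)
    return [
        {term: tf * (math.log(n_docs / doc_freq[term]) + 1.0)
         for term, tf in doc.items()}
        for doc in docs
    ]


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


clauses = [
    "The lessee shall pay the rent every month.",
    "The lessee shall vacate on termination of the lease.",
    "Drunk driving attracts a fine.",
]
vectors = tfidf_vectors(clauses)
print(cosine(vectors[0], vectors[1]))  # two lease clauses share terms
print(cosine(vectors[0], vectors[2]))  # no shared terms
```

Clauses from the same domain tend to share vocabulary and so sit closer together in this space than clauses from different domains.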

Training Procedure

  • Hardware: Single NVIDIA P100 GPU (Kaggle).
  • Precision: FP16 Mixed Precision for accelerated computation.
  • Epochs: 1.
  • Effective Batch Size: 32 (per-device batch size 16 × 2 gradient accumulation steps).
  • Final Training Loss: 0.0842.
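
The batch-size arithmetic above works out as follows. The sketch assumes that all 30,000 rows were used for training with no held-out split, which the card does not confirm; the exact step count also depends on how the training loop handles the final partial batch.

```python
import math

per_device_batch_size = 16        # from the card
gradient_accumulation_steps = 2   # from the card
num_epochs = 1                    # from the card
num_rows = 30_000                 # assumption: every row is used for training

# Gradients from 2 micro-batches of 16 are accumulated before each update.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps

# Approximate number of optimizer updates over the whole run.
optimizer_steps = math.ceil(num_rows / effective_batch_size) * num_epochs

print(f"effective batch size: {effective_batch_size}")
print(f"approximate optimizer steps: {optimizer_steps}")
```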

Evaluation

Results

The model reached a final training loss of 0.0842 after one epoch, which indicates that it fits the training data closely across the Criminal, Contract, and Education law domains. Training loss alone does not measure accuracy, and this card reports no held-out evaluation metrics.

How to Get Started with the Model

from transformers import pipeline

# Load the fine-tuned legal classifier
classifier = pipeline("text-classification", model="Tanishq77/legal-classifier-v1")

# Test on a Contract Law clause
text = "The lessee shall be responsible for all utility payments during the lease term."

result = classifier(text)
print(f"Domain: {result[0]['label']} | Confidence: {result[0]['score']:.4f}")
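
For a multi-clause document, one possible approach is to classify each clause and combine the results. The helper below is a hypothetical sketch, not part of the released model: it sums the confidence per label and returns the label with the highest total. The stub classifier and its label strings are placeholders so the example runs without downloading the model; the real pipeline returns one {'label', 'score'} dictionary per input when called with a list of strings.

```python
from collections import defaultdict


def classify_document(classifier, clauses):
    """Classify each clause and return (top label, summed confidence per label).

    `classifier` is any callable that takes a list of strings and returns
    one {'label': ..., 'score': ...} dict per string. `clauses` must be
    non-empty.
    """
    totals = defaultdict(float)
    for prediction in classifier(clauses):
        totals[prediction["label"]] += prediction["score"]
    top_label = max(totals, key=totals.get)
    return top_label, dict(totals)


def stub_classifier(clauses):
    """Placeholder with made-up labels and scores, standing in for the pipeline."""
    canned = [
        {"label": "Contract Law", "score": 0.9},
        {"label": "Criminal Law", "score": 0.6},
        {"label": "Contract Law", "score": 0.7},
    ]
    return canned[: len(clauses)]


document = [
    "The lessee shall be responsible for all utility payments.",
    "Any damage caused wilfully shall be reported to the police.",
    "The lease may be terminated with one month's notice.",
]
label, totals = classify_document(stub_classifier, document)
print(label, totals)
```

To use the real model, pass the `classifier` pipeline object from the snippet above in place of `stub_classifier`.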

Ethical Considerations

As outlined in the research methodology, this model is intended for educational and analytical purposes only. It is not a substitute for professional legal advice. The framework assumes human oversight at every stage, with legal experts expected to validate outputs to prevent misinterpretation and misuse.

Citation

If you use this model or refer to the intelligent legal document analysis framework in your research, please cite it as follows:

BibTeX

@article{shinde2026intelligent,
  title={Intelligent Legal Document Analysis using NLP},
  author={Shinde, Tanishq and Sonawane, Nilakshi and Joshi, Sarang and Jangle, Mansi and Madavi, Vaishnavi},
  journal={Dept. of Computer Engineering, Pune Institute of Computer Technology},
  year={2026}
}