Upload folder using huggingface_hub
- README.md +92 -0
- config.json +63 -0
- label_config.json +30 -0
- model.safetensors +3 -0
- tokenizer.json +0 -0
- tokenizer_config.json +14 -0
README.md
ADDED
@@ -0,0 +1,92 @@
---
license: mit
language:
- en
tags:
- legal
- glacier
- distillation
- sequence-classification
pipeline_tag: text-classification
datasets:
- glacier-legal/legal-distillation-data
base_model: nlpaueb/legal-bert-base-uncased
---

# GLACIER glacier-document-classifier

**Distilled legal AI model** for the [GLACIER pipeline](https://github.com/OrionDevPartners/glacier-legal-mcp) (Gated Legal Analysis, Citation Intelligence, Evidence Routing).

## Model Description

This model is distilled from Claude Opus 4.6 (via AWS Bedrock) into a lightweight transformer for fast, local inference. It handles **legal document type classification (complaint, motion, brief, etc.)** as part of the six-stage GLACIER legal document production pipeline.

- **Base model:** [nlpaueb/legal-bert-base-uncased](https://huggingface.co/nlpaueb/legal-bert-base-uncased)
- **Task:** sequence-classification
- **Labels:** 12 classes
- **Max length:** 512 tokens

## Labels

- `complaint`
- `answer`
- `motion`
- `brief`
- `order`
- `opinion`
- `notice`
- `subpoena`
- `affidavit`
- `demand_letter`
- `bar_complaint`
- `other`

## Usage

```python
from glacier_distill.inference import GlacierPipeline

pipeline = GlacierPipeline()
result = pipeline.classify_document("your legal text here")
print(result)
```

Or use it directly with transformers:

```python
from transformers import pipeline

# As committed, config.json maps ids to generic LABEL_n names, so results are
# reported as LABEL_0..LABEL_11; label_config.json holds the class names.
classifier = pipeline("text-classification", model="glacier-legal/glacier-document-classifier")
result = classifier("your legal text here")
print(result)
```
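Because `config.json` in this upload still maps class ids to generic `LABEL_n` names, the `transformers` pipeline will most likely report predictions such as `LABEL_2` rather than `motion`. The following is a minimal sketch of translating those names with the `id2label` table from `label_config.json`. The helper name `to_class_name` and the sample prediction are hypothetical, and the table is inlined rather than read from disk so the snippet stays self-contained.

```python
# id2label as committed in label_config.json (inlined for a self-contained sketch).
ID2LABEL = {
    "0": "complaint", "1": "answer", "2": "motion", "3": "brief",
    "4": "order", "5": "opinion", "6": "notice", "7": "subpoena",
    "8": "affidavit", "9": "demand_letter", "10": "bar_complaint", "11": "other",
}

PREFIX = "LABEL_"


def to_class_name(prediction: dict) -> dict:
    """Return a copy of a prediction with a generic LABEL_n replaced by its class name.

    Predictions that already carry a class name are returned unchanged.
    """
    label = prediction["label"]
    if label.startswith(PREFIX):
        label = ID2LABEL[label[len(PREFIX):]]
    return {**prediction, "label": label}


# Hypothetical prediction, shaped like one text-classification pipeline result.
sample = {"label": "LABEL_2", "score": 0.91}
print(to_class_name(sample))
```

Applied to the pipeline call above, this would be `[to_class_name(p) for p in result]`.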

## Training

- **Teacher:** Claude Opus 4.6 (AWS Bedrock)
- **Method:** Knowledge distillation (Hinton et al., 2015) with temperature=4.0, alpha=0.7
- **Data:** CourtListener case law + synthetic labeled examples
- **Framework:** HuggingFace Transformers + custom DistillationLoss
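The custom `DistillationLoss` itself is not part of this upload. As a rough illustration of what a Hinton-style objective with temperature=4.0 and alpha=0.7 computes, here is a dependency-free sketch for a single example. It assumes that alpha weights the soft (teacher) term and `1 - alpha` the hard-label term, and that teacher logits over the 12 classes are available. Both are assumptions, since the README does not say how soft targets are obtained from an API-hosted teacher. The function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=4.0, alpha=0.7):
    """Single-example loss: alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE.

    Assumption: alpha scales the soft term; the actual DistillationLoss may differ.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL divergence between the softened teacher and student distributions.
    kl = sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student) if p > 0.0)
    # T^2 keeps the soft-term gradient scale comparable across temperatures.
    soft_term = (temperature ** 2) * kl
    # Ordinary cross-entropy against the hard label, at temperature 1.
    hard_term = -math.log(softmax(student_logits)[true_label])
    return alpha * soft_term + (1.0 - alpha) * hard_term


# Toy logits for a 3-class problem (illustrative values only).
print(distillation_loss([2.0, 0.5, -1.0], [3.0, 0.0, -2.0], true_label=0))
```

When student and teacher logits agree, the soft term vanishes and only the weighted cross-entropy remains.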

## GLACIER Pipeline

This model runs in Stage 1 (QUERY) of the GLACIER pipeline:

```
Stage 1: QUERY    -> jurisdiction-router + document-classifier
Stage 2: RESEARCH -> legal-ner (entity extraction)
Stage 3: WDC #1   -> (full model review)
Stage 4: DRAFT    -> legal-ner + citation-classifier
Stage 5: WDC #2   -> hallucination-detector + citation-classifier
Stage 6: FINAL    -> (human review)
```
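The stage listing above can also be written as a small lookup table, which makes it easy to ask which stages call a given distilled model. This is only an illustrative data structure. The names `STAGES` and `stages_using` are hypothetical and are not taken from the GLACIER codebase.

```python
# Stage name -> distilled models used, in pipeline order (from the listing above).
# Stages 3 and 6 use no distilled model: they are full-model and human review.
STAGES = [
    ("QUERY", ["jurisdiction-router", "document-classifier"]),
    ("RESEARCH", ["legal-ner"]),
    ("WDC #1", []),
    ("DRAFT", ["legal-ner", "citation-classifier"]),
    ("WDC #2", ["hallucination-detector", "citation-classifier"]),
    ("FINAL", []),
]


def stages_using(model_name):
    """Return the 1-based stage numbers whose model list contains model_name."""
    return [
        number
        for number, (_, models) in enumerate(STAGES, start=1)
        if model_name in models
    ]


print(stages_using("document-classifier"))
print(stages_using("citation-classifier"))
```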

## Limitations

- Distilled models are optimized for US legal text (federal + state)
- Not a substitute for full model review in GLACIER Stages 3/5
- Citation hallucination detection is a pre-filter, not a replacement for external verification
- Jurisdiction coverage: Florida, Mississippi, Federal (primary); other states (limited)

## License

MIT. Part of the GLACIER Legal AI Framework by Orion Dev Partners, LLC.
config.json
ADDED
@@ -0,0 +1,63 @@
{
  "add_cross_attention": false,
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "dtype": "float32",
  "eos_token_id": null,
  "eos_token_ids": 0,
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6",
    "7": "LABEL_7",
    "8": "LABEL_8",
    "9": "LABEL_9",
    "10": "LABEL_10",
    "11": "LABEL_11"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_decoder": false,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_10": 10,
    "LABEL_11": 11,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6,
    "LABEL_7": 7,
    "LABEL_8": 8,
    "LABEL_9": 9
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "pruned_heads": {},
  "tie_word_embeddings": true,
  "torchscript": false,
  "transformers_version": "5.5.3",
  "type_vocab_size": 2,
  "use_bfloat16": false,
  "use_cache": false,
  "vocab_size": 30522
}
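The `id2label` and `label2id` tables in this `config.json` are generic `LABEL_n` placeholders, while the class names live in the separate `label_config.json`. One possible one-time fix, sketched here under assumptions, is to copy the named tables into the model config before re-uploading. The function name `apply_label_names` is hypothetical, and the demo uses trimmed two-class dictionaries instead of the real files.

```python
def apply_label_names(config: dict, label_config: dict) -> dict:
    """Return a copy of config whose label tables come from label_config.

    Raises ValueError if the two label_config tables are not inverses of each
    other, or if the number of classes differs from the existing config.
    """
    id2label = label_config["id2label"]
    label2id = label_config["label2id"]
    if {name: int(idx) for idx, name in id2label.items()} != label2id:
        raise ValueError("id2label and label2id are not inverse mappings")
    if len(id2label) != len(config["id2label"]):
        raise ValueError("label count does not match the model config")
    patched = dict(config)  # shallow copy; the input config is left untouched
    patched["id2label"] = dict(id2label)
    patched["label2id"] = dict(label2id)
    return patched


# Trimmed two-class demo; the real files carry 12 classes.
demo_config = {
    "hidden_size": 768,
    "id2label": {"0": "LABEL_0", "1": "LABEL_1"},
    "label2id": {"LABEL_0": 0, "LABEL_1": 1},
}
demo_labels = {
    "id2label": {"0": "complaint", "1": "answer"},
    "label2id": {"complaint": 0, "answer": 1},
}
print(apply_label_names(demo_config, demo_labels)["id2label"])
```

With the real files, the two dictionaries would be loaded with `json.load` and the result written back with `json.dump`.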
label_config.json
ADDED
@@ -0,0 +1,30 @@
{
  "label2id": {
    "complaint": 0,
    "answer": 1,
    "motion": 2,
    "brief": 3,
    "order": 4,
    "opinion": 5,
    "notice": 6,
    "subpoena": 7,
    "affidavit": 8,
    "demand_letter": 9,
    "bar_complaint": 10,
    "other": 11
  },
  "id2label": {
    "0": "complaint",
    "1": "answer",
    "2": "motion",
    "3": "brief",
    "4": "order",
    "5": "opinion",
    "6": "notice",
    "7": "subpoena",
    "8": "affidavit",
    "9": "demand_letter",
    "10": "bar_complaint",
    "11": "other"
  }
}
model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1fc7e80124768afd7dbd78f29fac3c84cee9c3ffc849c70c5326aa5a5463f60c
size 437989384
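As a sanity check on the pointer's `size` field, the weight count can be estimated from the values in `config.json`. The sketch below assumes the standard `BertForSequenceClassification` layout (embeddings, 12 encoder layers, pooler, linear classifier head) and 4 bytes per weight for `float32`. Attributing the remaining bytes to the safetensors header is an inference, not something this commit states. The function name is hypothetical.

```python
def bert_classifier_param_count(vocab_size, max_positions, type_vocab_size,
                                hidden, intermediate, layers, num_labels):
    """Count weights in a standard BERT encoder with pooler and linear classifier."""
    layer_norm = 2 * hidden  # scale + bias
    embeddings = (vocab_size + max_positions + type_vocab_size) * hidden + layer_norm
    # Query, key, value and output projections (weights + biases), then LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (hidden -> intermediate -> hidden), then LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden)
                    + layer_norm)
    pooler = hidden * hidden + hidden
    classifier = hidden * num_labels + num_labels
    return embeddings + layers * (attention + feed_forward) + pooler + classifier


# Values taken from config.json; 12 labels from label_config.json.
params = bert_classifier_param_count(
    vocab_size=30522, max_positions=512, type_vocab_size=2,
    hidden=768, intermediate=3072, layers=12, num_labels=12,
)
weight_bytes = params * 4   # float32
lfs_size = 437989384        # "size" field of the LFS pointer
print(params, weight_bytes, lfs_size - weight_bytes)
```

Under these assumptions the estimate is about 109.5 million weights, and the file is only a few tens of kilobytes larger than the raw weight bytes, which is consistent with an unpruned BERT-base checkpoint stored in `float32`.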
tokenizer.json
ADDED
The diff for this file is too large to render.
tokenizer_config.json
ADDED
@@ -0,0 +1,14 @@
{
  "backend": "tokenizers",
  "cls_token": "[CLS]",
  "do_lower_case": true,
  "is_local": false,
  "mask_token": "[MASK]",
  "model_max_length": 512,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}