Model Card: PatentSBERTa Fine-Tuned on Green Patent Claims (Assignment 2)
Model Summary
This model is a fine-tuned version of AI-Growth-Lab/PatentSBERTa for binary classification of patent claims as green technology (Y02) or not. It was developed as part of a course assignment in Applied Deep Learning and AI at Aalborg University, exploring Human-in-the-Loop (HITL) data labeling pipelines for patent classification.
Model Details
- Developed by: Anders Sønderbý (as58zr@student.aau.dk)
- Model type: Sentence Transformer with classification head (binary)
- Base model: AI-Growth-Lab/PatentSBERTa
- Language: English
- License: MIT
- Task: Binary text classification — Green Technology (Y02) vs. Not Green
What This Model Does
Given the text of a patent claim, the model predicts whether the claim relates to green technology as defined by the CPC Y02 classification system. The output is a binary label:
- `1`: Green technology (Y02)
- `0`: Not green technology
Training Pipeline Overview
This model was produced through a 4-stage pipeline:
Stage 1 — Baseline (Frozen Embeddings)
A baseline classifier was trained using frozen PatentSBERTa embeddings with a Logistic Regression head on the train_silver split. This baseline was used to compute uncertainty scores for active learning.
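The baseline stage can be sketched as below. To keep the snippet self-contained, a random projection stands in for the frozen PatentSBERTa embeddings; in the real pipeline `embed` would be `SentenceTransformer("AI-Growth-Lab/PatentSBERTa").encode`, and the texts and labels come from train_silver (all names here are illustrative).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def embed(texts):
    # Stand-in for frozen 768-dim PatentSBERTa embeddings; in practice:
    # SentenceTransformer("AI-Growth-Lab/PatentSBERTa").encode(texts)
    return rng.normal(size=(len(texts), 768))

train_texts = [f"claim text {i}" for i in range(200)]   # placeholder claims
y_silver = rng.integers(0, 2, size=200)                 # silver labels from CPC Y02*

X = embed(train_texts)                                  # embeddings stay frozen
clf = LogisticRegression(max_iter=1000).fit(X, y_silver)

# p_green for a pool example, later fed into uncertainty sampling
p_green = clf.predict_proba(embed(["an unlabeled pool claim"]))[:, 1]
```

Only the logistic-regression head is trained here; the encoder is never updated at this stage.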
Stage 2 — Uncertainty Sampling
The baseline model computed p_green (predicted probability of green) for all examples in pool_unlabeled. An uncertainty score was computed as:
u = 1 − 2 · |p − 0.5|
The top 100 highest-uncertainty claims were exported as hitl_green_100.csv for human review.
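The scoring and selection step above can be written in a few lines of NumPy (the probabilities here are illustrative; the pipeline selects the top 100 rather than 3):

```python
import numpy as np

def uncertainty(p):
    # u = 1 - 2*|p - 0.5|: 1.0 at p = 0.5 (maximally uncertain),
    # 0.0 at p = 0 or p = 1 (maximally confident)
    return 1.0 - 2.0 * np.abs(p - 0.5)

p_green = np.array([0.02, 0.48, 0.51, 0.97, 0.55])  # example pool probabilities
u = uncertainty(p_green)
top_idx = np.argsort(-u)[:3]  # indices of the most uncertain claims
```

The selected rows would then be exported (e.g. with pandas `to_csv`) as hitl_green_100.csv for review.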
Stage 3 — LLM → Human HITL Labeling
For each of the 100 high-uncertainty claims, an LLM first suggested a label (llm_green_suggested), confidence (llm_confidence), and rationale (llm_rationale). A human reviewer then assigned the final gold label (is_green_human), overriding the LLM where necessary. Only the claim text was used during labeling — no CPC codes or metadata.
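A minimal sketch of the resulting labeling file, with the columns named above (the row values are illustrative, not real data; the actual file holds 100 rows):

```python
import pandas as pd

hitl = pd.DataFrame({
    "claim_text": ["...", "...", "..."],
    "llm_green_suggested": [1, 0, 1],            # LLM's proposed label
    "llm_confidence": [0.9, 0.6, 0.7],
    "llm_rationale": ["mentions solar", "mechanical linkage", "battery recycling"],
    "is_green_human": [1, 1, 1],                 # final gold label from the reviewer
})

# Disagreements are where the human overrode the LLM suggestion
hitl["disagreement"] = hitl["llm_green_suggested"] != hitl["is_green_human"]
n_disagreements = int(hitl["disagreement"].sum())
```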
Stage 4 — Fine-Tuning
PatentSBERTa was fine-tuned for binary classification using the combined train_silver + gold_100 dataset, where gold labels override silver labels for the 100 HITL-reviewed claims.
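The gold-over-silver override can be sketched as a left join, assuming a shared claim identifier (the `claim_id` column and toy values are hypothetical):

```python
import pandas as pd

train = pd.DataFrame({
    "claim_id": [1, 2, 3, 4],
    "is_green_silver": [0, 1, 0, 1],   # CPC-derived silver labels
})
gold = pd.DataFrame({
    "claim_id": [2, 3],
    "is_green_gold": [0, 0],           # human-reviewed gold labels
})

merged = train.merge(gold, on="claim_id", how="left")
# Where a gold label exists it takes precedence; otherwise keep the silver label
merged["label"] = merged["is_green_gold"].fillna(merged["is_green_silver"]).astype(int)
```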
Training Data
- Dataset: Derived from AI-Growth-Lab/patents_claims_1.5m_traim_test
- Working file: patents_50k_green.parquet, a balanced 50k sample (25,000 green, 25,000 not green)
- Silver label source: CPC Y02* classification codes (is_green_silver)
- Gold labels: 100 human-reviewed claims from uncertainty sampling (is_green_gold)
Dataset Splits
| Split | Size | Description |
|---|---|---|
| train_silver | ~40,000 | Silver-labeled training set (CPC-derived) |
| eval_silver | ~5,000 | Silver-labeled evaluation set |
| pool_unlabeled | ~5,000 | Unlabeled pool used for uncertainty sampling |
| gold_100 | 100 | Human-reviewed high-uncertainty claims |
Training Hyperparameters
| Parameter | Value |
|---|---|
| Base model | AI-Growth-Lab/PatentSBERTa |
| Max sequence length | 256 |
| Epochs | 1 |
| Learning rate | 2e-5 |
| Training set size | ~40,100 (train_silver + gold_100) |
Evaluation Results
| Evaluation Set | F1 Score | Notes |
|---|---|---|
| eval_silver (5,000) | 0.818 | Primary evaluation metric |
| gold_100 (100) | 0.667 | Human-reviewed high-uncertainty claims |
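The scores above are assumed to be standard binary F1 on the green class, which can be computed with scikit-learn (toy labels shown):

```python
from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0, 1]   # gold labels (illustrative)
y_pred = [1, 0, 0, 1, 1, 1]   # model predictions (illustrative)

# F1 = harmonic mean of precision and recall on the positive (green) class
f1 = f1_score(y_true, y_pred)
```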
HITL Reporting
As required by the assignment, the human reviewer assessed all 100 high-uncertainty claims. The LLM suggestion and human final label were recorded for each claim. Disagreements between the LLM and human reviewer were documented in the labeling file (hitl_final.csv).
Intended Use
- Primary use: Academic research and coursework in patent classification
- Intended users: Course instructors and students at Aalborg University
- Out-of-scope: Production patent classification systems, legal patent assessment, or any commercial use
Limitations
- Trained on a balanced 50k sample — performance may differ on the full unbalanced patent corpus
- Silver labels are derived from CPC codes, which may contain noise
- The model classifies based on claim text only and does not use metadata, citations, or CPC codes at inference time
- Fine-tuned for 1 epoch only due to compute constraints
Repository
The full code, notebooks, and data files for this assignment are available in the course GitHub repository.
Model: Anders-sonderby/patentsbert-finetuned (fine-tuned from AI-Growth-Lab/PatentSBERTa)