Model Card: PatentSBERTa Fine-Tuned on Green Patent Claims (Assignment 2)

Model Summary

This model is a fine-tuned version of AI-Growth-Lab/PatentSBERTa for binary classification of patent claims as green technology (Y02) or not. It was developed as part of a course assignment in Applied Deep Learning and AI at Aalborg University, exploring Human-in-the-Loop (HITL) data labeling pipelines for patent classification.


Model Details

  • Developed by: Anders Sønderbý (as58zr@student.aau.dk)
  • Model type: Sentence Transformer with classification head (binary)
  • Base model: AI-Growth-Lab/PatentSBERTa
  • Language: English
  • License: MIT
  • Task: Binary text classification — Green Technology (Y02) vs. Not Green

What This Model Does

Given the text of a patent claim, the model predicts whether the claim relates to green technology as defined by the CPC Y02 classification system. The output is a binary label:

  • 1 — Green technology (Y02)
  • 0 — Not green technology
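The mapping from the classification head's output to this binary label can be sketched as follows. This is a minimal illustration, not the card's actual inference code: it assumes the head emits two logits ordered `[not_green, green]` and that a 0.5 threshold on `p_green` is used, neither of which the card specifies.

```python
import numpy as np

def logits_to_label(logits, threshold=0.5):
    """Map two-class logits [not_green, green] to the binary label.

    The 0.5 threshold on p_green is an illustrative assumption;
    the card only defines the two output labels.
    """
    exp = np.exp(logits - np.max(logits))  # numerically stable softmax
    probs = exp / exp.sum()
    p_green = float(probs[1])
    return (1 if p_green >= threshold else 0), p_green

label, p_green = logits_to_label(np.array([0.2, 1.4]))  # label == 1 (green)
```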

Training Pipeline Overview

This model was produced through a 4-stage pipeline:

Stage 1 — Baseline (Frozen Embeddings)

A baseline classifier was trained using frozen PatentSBERTa embeddings with a Logistic Regression head on the train_silver split. This baseline was used to compute uncertainty scores for active learning.
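The baseline stage can be sketched as below. The random matrix stands in for frozen PatentSBERTa embeddings (in the real pipeline these would come from `model.encode(claim_texts)`, with embedding dimension 768); the labels are synthetic, so only the wiring is meaningful, not the numbers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for frozen PatentSBERTa embeddings of train_silver claims.
X_train = rng.normal(size=(200, 768))
y_train = rng.integers(0, 2, size=200)  # stand-in for is_green_silver

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# p_green: predicted probability of the green class, used later
# to compute uncertainty scores for active learning.
p_green = clf.predict_proba(X_train)[:, 1]
```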

Stage 2 — Uncertainty Sampling

The baseline model computed p_green (predicted probability of green) for all examples in pool_unlabeled. An uncertainty score was computed as:

u = 1 − 2 · |p − 0.5|

The top 100 highest-uncertainty claims were exported as hitl_green_100.csv for human review.
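The uncertainty score and top-100 selection above can be implemented directly; this sketch follows the formula u = 1 − 2·|p − 0.5|, which peaks at 1.0 when p = 0.5 and falls to 0.0 at p = 0 or 1.

```python
import numpy as np

def uncertainty(p_green):
    """u = 1 - 2*|p - 0.5|: maximal when the model is least sure."""
    p = np.asarray(p_green, dtype=float)
    return 1.0 - 2.0 * np.abs(p - 0.5)

def top_k_uncertain(p_green, k=100):
    """Indices of the k most uncertain claims (for export to CSV)."""
    u = uncertainty(p_green)
    return np.argsort(-u)[:k]

# e.g. uncertainty([0.5, 0.9, 0.1]) -> approximately [1.0, 0.2, 0.2]
```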

Stage 3 — LLM → Human HITL Labeling

For each of the 100 high-uncertainty claims, an LLM first suggested a label (llm_green_suggested), confidence (llm_confidence), and rationale (llm_rationale). A human reviewer then assigned the final gold label (is_green_human), overriding the LLM where necessary. Only the claim text was used during labeling — no CPC codes or metadata.
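The review logic reduces to a simple rule: the human label is always final, and LLM–human disagreements are logged for the report. The column names below are taken from the card; the rows themselves are hypothetical stand-ins for the 100 reviewed claims.

```python
# Hypothetical rows mirroring the hitl_final.csv columns from the card.
rows = [
    {"claim_id": 1, "llm_green_suggested": 1, "llm_confidence": 0.62, "is_green_human": 1},
    {"claim_id": 2, "llm_green_suggested": 1, "llm_confidence": 0.55, "is_green_human": 0},
    {"claim_id": 3, "llm_green_suggested": 0, "llm_confidence": 0.71, "is_green_human": 0},
]

# The human label is the gold label; disagreements are documented.
disagreements = [r["claim_id"] for r in rows
                 if r["llm_green_suggested"] != r["is_green_human"]]
gold = {r["claim_id"]: r["is_green_human"] for r in rows}
```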

Stage 4 — Fine-Tuning

PatentSBERTa was fine-tuned for binary classification using the combined train_silver + gold_100 dataset, where gold labels override silver labels for the 100 HITL-reviewed claims.
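The label-merging rule ("gold overrides silver") can be sketched with plain dictionaries keyed by claim id; the ids and values here are hypothetical.

```python
# Silver labels for train_silver (CPC-derived), hypothetical values.
silver = {101: 1, 102: 0, 103: 1}
# Human-reviewed gold labels for the HITL claims; 102 was corrected.
gold = {102: 1}

# Gold wins wherever both a silver and a gold label exist.
train_labels = {**silver, **gold}
```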


Training Data

  • Dataset: Derived from AI-Growth-Lab/patents_claims_1.5m_traim_test
  • Working file: patents_50k_green.parquet — a balanced 50k sample (25,000 green, 25,000 not green)
  • Silver label source: CPC Y02* classification codes (is_green_silver)
  • Gold labels: 100 human-reviewed claims from uncertainty sampling (is_green_gold)

Dataset Splits

| Split | Size | Description |
|---|---|---|
| train_silver | ~40,000 | Silver-labeled training set (CPC-derived) |
| eval_silver | ~5,000 | Silver-labeled evaluation set |
| pool_unlabeled | ~5,000 | Unlabeled pool used for uncertainty sampling |
| gold_100 | 100 | Human-reviewed high-uncertainty claims |
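A split of the balanced 50k sample into these sizes could look like the sketch below. The card gives only the approximate sizes; the two-step `train_test_split` procedure and the random seed are assumptions for illustration.

```python
from sklearn.model_selection import train_test_split

ids = list(range(50_000))  # indices into the balanced 50k sample

# Carve off the ~40k silver training set, then split the remainder
# evenly into eval_silver and pool_unlabeled.
train_silver, rest = train_test_split(ids, train_size=40_000, random_state=42)
eval_silver, pool_unlabeled = train_test_split(rest, test_size=0.5, random_state=42)
```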

Training Hyperparameters

| Parameter | Value |
|---|---|
| Base model | AI-Growth-Lab/PatentSBERTa |
| Max sequence length | 256 |
| Epochs | 1 |
| Learning rate | 2e-5 |
| Training set size | ~40,100 (train_silver + gold_100) |

Evaluation Results

| Evaluation Set | F1 Score | Notes |
|---|---|---|
| eval_silver (5,000) | 0.818 | Primary evaluation metric |
| gold_100 (100) | 0.667 | Human-reviewed high-uncertainty claims |

HITL Reporting

As required by the assignment, the human reviewer assessed all 100 high-uncertainty claims. The LLM suggestion and human final label were recorded for each claim. Disagreements between the LLM and human reviewer were documented in the labeling file (hitl_final.csv).


Intended Use

  • Primary use: Academic research and coursework in patent classification
  • Intended users: Course instructors and students at Aalborg University
  • Out-of-scope: Production patent classification systems, legal patent assessment, or any commercial use

Limitations

  • Trained on a balanced 50k sample — performance may differ on the full unbalanced patent corpus
  • Silver labels are derived from CPC codes, which may contain noise
  • The model classifies based on claim text only and does not use metadata, citations, or CPC codes at inference time
  • Fine-tuned for 1 epoch only due to compute constraints

Repository

The full code, notebooks, and data files for this assignment are available in the course GitHub repository.
