# PatentSBERTa Fine-Tuned for Green Patent Detection

## Model Description

A fine-tuned version of AI-Growth-Lab/PatentSBERTa for binary classification of patent claims as green technology (1) or non-green technology (0), based on CPC Y02 classification codes.
Green technology includes inventions related to:
- Renewable energy (solar, wind, hydro)
- Emission and pollution reduction
- Energy efficiency in buildings and transport
- Waste recycling and reduction
- Climate change mitigation and adaptation
## Training Pipeline
This model was developed through a three-iteration active learning pipeline:
| Iteration | Method | F1 Score |
|---|---|---|
| 1. Baseline | Frozen PatentSBERTa + Logistic Regression | 0.7686 |
| 2. Assignment 2 | Fine-tuned on Silver + Gold (Simple Generic LLM) | 0.8081 |
| 3. Assignment 3 | Fine-tuned on Silver + Gold (QLoRA MAS + HITL) | 0.8166 |
## Data Pipeline
- Source: 1,372,910 patent claims from AI-Growth-Lab/patents_claims_1.5m_traim_test
- Silver Labels: Created from 8 CPC Y02 columns (Y02A–Y02E+). Any patent with at least one Y02 code is labeled green (1)
- Balanced Subset: 25,000 green + 25,000 non-green = 50,000 total
- Splits: train_silver (40,000) / eval_silver (5,000) / pool_unlabeled (5,000)
- Uncertainty Sampling: The baseline model identified the 100 most uncertain claims (uncertainty: 0.9710–0.9997)
- QLoRA Fine-Tuning: Mistral-7B-Instruct-v0.3 fine-tuned with QLoRA (4-bit NF4) on 1,500 silver examples
- Human Review: An expert reviewed all 100 claims, yielding 42 green / 58 non-green gold labels
- Final Training: PatentSBERTa fine-tuned on 40,000 silver + 100 gold examples (gold weighted 5x)
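The uncertainty-sampling step can be sketched as follows. This is an illustrative reconstruction, not the project's actual script: it assumes the baseline classifier exposes positive-class probabilities and that uncertainty is scored as binary entropy (a common choice; the exact measure used above is not specified).

```python
import numpy as np

def binary_entropy(p: np.ndarray) -> np.ndarray:
    # Shannon entropy (in bits) of the positive-class probability;
    # maximal (1.0) when the model is maximally unsure (p = 0.5).
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def select_most_uncertain(probs: np.ndarray, k: int = 100) -> np.ndarray:
    # Return indices of the k pool examples with the highest entropy.
    scores = binary_entropy(probs)
    return np.argsort(scores)[::-1][:k]

# Hypothetical positive-class probabilities for a 6-claim pool.
pool_probs = np.array([0.02, 0.51, 0.97, 0.48, 0.75, 0.50])
picked = select_most_uncertain(pool_probs, k=3)  # indices of claims closest to p = 0.5
print(picked)
```

The selected claims are then sent to human review, which is how the 100 gold labels above were produced.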
## Training Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 3 |
| Batch Size | 16 |
| Learning Rate | 2e-5 |
| Warmup Steps | 100 |
| Weight Decay | 0.01 |
| Gold Weight | 5.0x |
| Max Sequence Length | 384 |
| Optimizer | AdamW |
| Scheduler | Linear with warmup |
| Gradient Clipping | max_norm=1.0 |
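A minimal sketch of the 5.0x gold weight, assuming it is implemented as per-example loss weights (the table does not specify the mechanism); `weighted_ce_loss` and `is_gold` are hypothetical names:

```python
import torch
import torch.nn.functional as F

def weighted_ce_loss(logits, labels, is_gold, gold_weight=5.0):
    # Per-example cross-entropy; human-reviewed (gold) examples count 5x.
    per_example = F.cross_entropy(logits, labels, reduction="none")
    weights = torch.where(is_gold,
                          torch.full_like(per_example, gold_weight),
                          torch.ones_like(per_example))
    # Weighted mean so the loss scale stays comparable across batches.
    return (per_example * weights).sum() / weights.sum()

# Toy batch: two silver examples and one gold example.
logits = torch.tensor([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5]])
labels = torch.tensor([0, 1, 1])
is_gold = torch.tensor([False, False, True])
loss = weighted_ce_loss(logits, labels, is_gold)
```

With all-silver batches this reduces to the ordinary mean cross-entropy, so the weighting only changes gradients on batches containing gold examples.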
## Training Progress
| Epoch | Train Loss | Eval Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 0.4653 | 0.4162 | 80.96% | 0.8116 | 0.8064 | 0.8090 |
| 2 | 0.3789 | 0.4125 | 81.64% | 0.8258 | 0.8020 | 0.8137 |
| 3 | 0.3087 | 0.4512 | 81.52% | 0.8105 | 0.8228 | 0.8166 |
## Evaluation Results

### eval_silver (5,000 balanced examples)
| Metric | Score |
|---|---|
| Accuracy | 81.52% |
| F1 Score (Green) | 0.8166 |
| Precision (Green) | 0.8105 |
| Recall (Green) | 0.8228 |
| F1 Macro | 0.8152 |
| F1 Weighted | 0.8152 |
```text
               precision    recall  f1-score   support

Non-Green (0)     0.8201    0.8076    0.8138      2500
    Green (1)     0.8105    0.8228    0.8166      2500

     accuracy                         0.8152      5000
    macro avg     0.8153    0.8152    0.8152      5000
 weighted avg     0.8153    0.8152    0.8152      5000
```
Confusion Matrix (eval_silver):
|  | Pred Non-Green | Pred Green |
|---|---|---|
| Actual Non-Green | 2019 | 481 |
| Actual Green | 443 | 2057 |
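As a sanity check, the headline eval_silver metrics can be recomputed directly from this confusion matrix:

```python
# Cell counts from the eval_silver confusion matrix above.
tn, fp, fn, tp = 2019, 481, 443, 2057

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 0.8152
precision = tp / (tp + fp)                   # 0.8105 (green class)
recall = tp / (tp + fn)                      # 0.8228 (green class)
f1 = 2 * precision * recall / (precision + recall)  # 0.8166

print(f"acc={accuracy:.4f} p={precision:.4f} r={recall:.4f} f1={f1:.4f}")
```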
### gold_100 (100 human-reviewed examples)
| Metric | Score |
|---|---|
| Accuracy | 88.00% |
| F1 Score (Green) | 0.8605 |
| Precision (Green) | 0.8409 |
| Recall (Green) | 0.8810 |
```text
               precision    recall  f1-score   support

Non-Green (0)     0.9107    0.8793    0.8947        58
    Green (1)     0.8409    0.8810    0.8605        42

     accuracy                         0.8800       100
    macro avg     0.8758    0.8801    0.8776       100
 weighted avg     0.8814    0.8800    0.8803       100
```
Confusion Matrix (gold_100):
|  | Pred Non-Green | Pred Green |
|---|---|---|
| Actual Non-Green | 51 | 7 |
| Actual Green | 5 | 37 |
## QLoRA Configuration (Mistral-7B Advocate Agent)
| Parameter | Value |
|---|---|
| Base Model | mistralai/Mistral-7B-Instruct-v0.3 |
| Quantization | 4-bit NF4 + double quantization |
| LoRA Rank | 16 |
| LoRA Alpha | 32 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Total Parameters | 7,289,966,592 |
| Trainable Parameters | 41,943,040 (0.58%) |
| Training Examples | 1,500 |
| Training Time | ~104 minutes |
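The configuration above maps onto Hugging Face `transformers` + `peft` roughly as follows. This is a sketch, not the project's actual training script; the compute dtype and LoRA dropout are assumptions not stated in the table.

```python
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization with double quantization, as listed above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype="bfloat16",  # assumption: not stated in the table
)

# LoRA rank 16 / alpha 32 on all attention and MLP projections.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,  # assumption: not stated in the table
    task_type="CAUSAL_LM",
)

# Typical usage (commented out; requires a GPU and the bitsandbytes package):
# model = AutoModelForCausalLM.from_pretrained(
#     "mistralai/Mistral-7B-Instruct-v0.3", quantization_config=bnb_config)
# model = get_peft_model(model, lora_config)
```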
## Usage
```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the fine-tuned encoder and the separately saved classifier head.
tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/PatentSBERTa-green-finetuned")
encoder = AutoModel.from_pretrained("YOUR_USERNAME/PatentSBERTa-green-finetuned")
classifier = torch.nn.Linear(768, 2)
classifier.load_state_dict(torch.load("classifier_head.pt", map_location="cpu"))
encoder.eval()
classifier.eval()

claim = (
    "A method for converting solar radiation into electrical energy using "
    "photovoltaic cells arranged in a panel configuration with maximum power "
    "point tracking."
)
inputs = tokenizer(claim, return_tensors="pt", truncation=True, max_length=384, padding=True)

with torch.no_grad():
    outputs = encoder(**inputs)
    # Mean pooling over tokens, weighted by the attention mask.
    mask = inputs["attention_mask"].unsqueeze(-1).expand(outputs.last_hidden_state.size()).float()
    pooled = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    logits = classifier(pooled)

probs = torch.softmax(logits, dim=1)
prediction = torch.argmax(logits, dim=1).item()
label = "Green" if prediction == 1 else "Non-Green"
confidence = probs[0][prediction].item()
print(f"Prediction: {label} (confidence: {confidence:.4f})")
```
## Limitations
- Trained on CPC Y02 silver labels, which may contain noise
- Only 100 human-reviewed gold examples; more gold data would likely improve performance
- Max sequence length of 384 tokens; longer patent claims are truncated
- Binary classification only; does not distinguish Y02 subcategories (Y02A, Y02B, etc.)
- Performance may vary on patents from domains not well represented in the training data