# PatentSBERTa Fine-Tuned for Green Patent Detection

## Model Description

A fine-tuned version of AI-Growth-Lab/PatentSBERTa for binary classification of patent claims as green technology (1) or non-green technology (0), based on CPC Y02 classification codes.
Green technology includes inventions related to:
- Renewable energy (solar, wind, hydro)
- Emission and pollution reduction
- Energy efficiency in buildings and transport
- Waste recycling and reduction
- Climate change mitigation and adaptation
## Training Pipeline
This model was developed through a three-iteration active learning pipeline:
| Iteration | Method | F1 Score |
|---|---|---|
| 1. Baseline | Frozen PatentSBERTa + Logistic Regression | 0.7686 |
| 2. Assignment 2 | Fine-tuned on Silver + Gold (Simple Generic LLM) | 0.8081 |
| 3. Assignment 3 | Fine-tuned on Silver + Gold (QLoRA MAS + HITL) | 0.8166 |
## Data Pipeline
- Source: 1,372,910 patent claims from AI-Growth-Lab/patents_claims_1.5m_traim_test
- Silver Labels: Created from 8 CPC Y02 columns (Y02A–Y02E+). Any patent with at least one Y02 code is labeled green (1)
- Balanced Subset: 25,000 green + 25,000 non-green = 50,000 total
- Splits: train_silver (40,000) / eval_silver (5,000) / pool_unlabeled (5,000)
- Uncertainty Sampling: The baseline model identified the 100 most uncertain claims (uncertainty: 0.9710–0.9997)
- QLoRA Fine-Tuning: Mistral-7B-Instruct-v0.3 fine-tuned with QLoRA (4-bit NF4) on 1,500 silver examples
- Human Review: An expert reviewed all 100 claims, yielding 42 green / 58 non-green gold labels
- Final Training: PatentSBERTa fine-tuned on 40,000 silver + 100 gold examples (gold weighted 5x)
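The uncertainty-sampling step can be sketched as follows. This is an illustrative reconstruction, not the project's actual script: it assumes the baseline classifier exposes positive-class probabilities and that uncertainty is scored as binary entropy (a common choice; the exact measure used above is not specified).

```python
import numpy as np

def binary_entropy(p: np.ndarray) -> np.ndarray:
    # Shannon entropy (in bits) of the positive-class probability;
    # maximal (1.0) when the model is maximally unsure (p = 0.5).
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def select_most_uncertain(probs: np.ndarray, k: int = 100) -> np.ndarray:
    # Return indices of the k pool examples with the highest entropy.
    scores = binary_entropy(probs)
    return np.argsort(scores)[::-1][:k]

# Hypothetical positive-class probabilities for a 6-claim pool.
pool_probs = np.array([0.02, 0.51, 0.97, 0.48, 0.75, 0.50])
picked = select_most_uncertain(pool_probs, k=3)  # indices of claims closest to p = 0.5
print(picked)
```

The selected claims are then sent to human review, which is how the 100 gold labels above were produced.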
## Training Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 3 |
| Batch Size | 16 |
| Learning Rate | 2e-5 |
| Warmup Steps | 100 |
| Weight Decay | 0.01 |
| Gold Weight | 5.0x |
| Max Sequence Length | 384 |
| Optimizer | AdamW |
| Scheduler | Linear with warmup |
| Gradient Clipping | max_norm=1.0 |
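A minimal sketch of the 5.0x gold weight, assuming it is implemented as per-example loss weights (the table does not specify the mechanism); `weighted_ce_loss` and `is_gold` are hypothetical names:

```python
import torch
import torch.nn.functional as F

def weighted_ce_loss(logits, labels, is_gold, gold_weight=5.0):
    # Per-example cross-entropy; human-reviewed (gold) examples count 5x.
    per_example = F.cross_entropy(logits, labels, reduction="none")
    weights = torch.where(is_gold,
                          torch.full_like(per_example, gold_weight),
                          torch.ones_like(per_example))
    # Weighted mean so the loss scale stays comparable across batches.
    return (per_example * weights).sum() / weights.sum()

# Toy batch: two silver examples and one gold example.
logits = torch.tensor([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5]])
labels = torch.tensor([0, 1, 1])
is_gold = torch.tensor([False, False, True])
loss = weighted_ce_loss(logits, labels, is_gold)
```

With all-silver batches this reduces to the ordinary mean cross-entropy, so the weighting only changes gradients on batches containing gold examples.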
## Training Progress
| Epoch | Train Loss | Eval Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 0.4653 | 0.4162 | 80.96% | 0.8116 | 0.8064 | 0.8090 |
| 2 | 0.3789 | 0.4125 | 81.64% | 0.8258 | 0.8020 | 0.8137 |
| 3 | 0.3087 | 0.4512 | 81.52% | 0.8105 | 0.8228 | 0.8166 |
## Evaluation Results

### eval_silver (5,000 balanced examples)
| Metric | Score |
|---|---|
| Accuracy | 81.52% |
| F1 Score (Green) | 0.8166 |
| Precision (Green) | 0.8105 |
| Recall (Green) | 0.8228 |
| F1 Macro | 0.8152 |
| F1 Weighted | 0.8152 |
```text
               precision    recall  f1-score   support

Non-Green (0)     0.8201    0.8076    0.8138      2500
    Green (1)     0.8105    0.8228    0.8166      2500

     accuracy                         0.8152      5000
    macro avg     0.8153    0.8152    0.8152      5000
 weighted avg     0.8153    0.8152    0.8152      5000
```
Confusion Matrix (eval_silver):
|  | Pred Non-Green | Pred Green |
|---|---|---|
| Actual Non-Green | 2019 | 481 |
| Actual Green | 443 | 2057 |
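As a sanity check, the headline eval_silver metrics can be recomputed directly from this confusion matrix:

```python
# Cell counts from the eval_silver confusion matrix above.
tn, fp, fn, tp = 2019, 481, 443, 2057

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 0.8152
precision = tp / (tp + fp)                   # 0.8105 (green class)
recall = tp / (tp + fn)                      # 0.8228 (green class)
f1 = 2 * precision * recall / (precision + recall)  # 0.8166

print(f"acc={accuracy:.4f} p={precision:.4f} r={recall:.4f} f1={f1:.4f}")
```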
### gold_100 (100 human-reviewed examples)
| Metric | Score |
|---|---|
| Accuracy | 88.00% |
| F1 Score (Green) | 0.8605 |
| Precision (Green) | 0.8409 |
| Recall (Green) | 0.8810 |
```text
               precision    recall  f1-score   support

Non-Green (0)     0.9107    0.8793    0.8947        58
    Green (1)     0.8409    0.8810    0.8605        42

     accuracy                         0.8800       100
    macro avg     0.8758    0.8801    0.8776       100
 weighted avg     0.8814    0.8800    0.8803       100
```
Confusion Matrix (gold_100):
|  | Pred Non-Green | Pred Green |
|---|---|---|
| Actual Non-Green | 51 | 7 |
| Actual Green | 5 | 37 |
## QLoRA Configuration (Mistral-7B Advocate Agent)
| Parameter | Value |
|---|---|
| Base Model | mistralai/Mistral-7B-Instruct-v0.3 |
| Quantization | 4-bit NF4 + double quantization |
| LoRA Rank | 16 |
| LoRA Alpha | 32 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Total Parameters | 7,289,966,592 |
| Trainable Parameters | 41,943,040 (0.58%) |
| Training Examples | 1,500 |
| Training Time | ~104 minutes |
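The configuration above maps onto Hugging Face `transformers` + `peft` roughly as follows. This is a sketch, not the project's actual training script; the compute dtype and LoRA dropout are assumptions not stated in the table.

```python
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization with double quantization, as listed above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype="bfloat16",  # assumption: not stated in the table
)

# LoRA rank 16 / alpha 32 on all attention and MLP projections.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,  # assumption: not stated in the table
    task_type="CAUSAL_LM",
)

# Typical usage (commented out; requires a GPU and the bitsandbytes package):
# model = AutoModelForCausalLM.from_pretrained(
#     "mistralai/Mistral-7B-Instruct-v0.3", quantization_config=bnb_config)
# model = get_peft_model(model, lora_config)
```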
## Usage
```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the fine-tuned encoder and the separately saved classifier head.
tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/PatentSBERTa-green-finetuned")
encoder = AutoModel.from_pretrained("YOUR_USERNAME/PatentSBERTa-green-finetuned")
classifier = torch.nn.Linear(768, 2)
classifier.load_state_dict(torch.load("classifier_head.pt", map_location="cpu"))
encoder.eval()
classifier.eval()

claim = (
    "A method for converting solar radiation into electrical energy using "
    "photovoltaic cells arranged in a panel configuration with maximum power "
    "point tracking."
)
inputs = tokenizer(claim, return_tensors="pt", truncation=True, max_length=384, padding=True)

with torch.no_grad():
    outputs = encoder(**inputs)
    # Mean pooling over tokens, weighted by the attention mask.
    mask = inputs["attention_mask"].unsqueeze(-1).expand(outputs.last_hidden_state.size()).float()
    pooled = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    logits = classifier(pooled)

probs = torch.softmax(logits, dim=1)
prediction = torch.argmax(logits, dim=1).item()
label = "Green" if prediction == 1 else "Non-Green"
confidence = probs[0][prediction].item()
print(f"Prediction: {label} (confidence: {confidence:.4f})")
```
## Limitations
- Trained on CPC Y02 silver labels, which may contain noise
- Only 100 human-reviewed gold examples; more gold data would likely improve performance
- Max sequence length of 384 tokens; longer patent claims are truncated
- Binary classification only; does not distinguish Y02 subcategories (Y02A, Y02B, etc.)
- Performance may vary on patents from domains not well represented in the training data