PatentSBERTa: Fine-Tuned for Green Patent Detection

Model Description

A fine-tuned version of AI-Growth-Lab/PatentSBERTa for binary classification of patent claims as green technology (1) or non-green technology (0), using CPC Y02 classification codes as the labeling signal.

Green technology includes inventions related to:

  • 🌱 Renewable energy (solar, wind, hydro)
  • 🏭 Emission and pollution reduction
  • ⚡ Energy efficiency in buildings and transport
  • ♻️ Waste recycling and reduction
  • 🌍 Climate change mitigation and adaptation

Training Pipeline

This model was developed through a three-iteration active learning pipeline:

| Iteration | Method | F1 Score |
|---|---|---|
| 1. Baseline | Frozen PatentSBERTa + Logistic Regression | 0.7686 |
| 2. Assignment 2 | Fine-tuned on Silver + Gold (Simple Generic LLM) | 0.8081 |
| 3. Assignment 3 | Fine-tuned on Silver + Gold (QLoRA MAS + HITL) | 0.8166 |

Data Pipeline

  1. Source: 1,372,910 patent claims from AI-Growth-Lab/patents_claims_1.5m_traim_test
  2. Silver Labels: Created from 8 CPC Y02 columns (Y02A–Y02E+). Any patent with at least one Y02 code = green (1)
  3. Balanced Subset: 25,000 green + 25,000 non-green = 50,000 total
  4. Splits: train_silver (40,000) / eval_silver (5,000) / pool_unlabeled (5,000)
  5. Uncertainty Sampling: Baseline model identified 100 most uncertain claims (uncertainty: 0.9710–0.9997)
  6. QLoRA Fine-Tuning: Mistral-7B-Instruct-v0.3 fine-tuned with QLoRA (4-bit NF4) on 1,500 silver examples
  7. Human Review: Expert reviewed all 100 claims → 42 green / 58 non-green gold labels
  8. Final Training: PatentSBERTa fine-tuned on 40,000 silver + 100 gold (gold weighted 5x)
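Steps 2 and 5 above can be sketched in a few lines. This is a minimal illustration, not the actual pipeline code: the Y02 column names are assumptions (the eight CPC Y02 subclasses), and the exact uncertainty score used for sampling is not stated above, so a common margin-based definition is shown.

```python
# Hypothetical helpers for silver labeling and uncertainty sampling.
# Column names and the uncertainty definition are assumptions.

Y02_COLUMNS = ["Y02A", "Y02B", "Y02C", "Y02D", "Y02E", "Y02P", "Y02T", "Y02W"]

def silver_label(row):
    """Green (1) if the patent carries at least one Y02 code, else 0."""
    return int(any(row.get(col, 0) for col in Y02_COLUMNS))

def uncertainty(p_green):
    """Margin-based uncertainty: 1.0 at p=0.5, 0.0 at p=0 or p=1."""
    return 1.0 - abs(p_green - 0.5) * 2.0

print(silver_label({"Y02E": 1, "Y02W": 0}))  # 1: has a Y02E code
print(uncertainty(0.515))  # near 0.97: close to the decision boundary
```

With this definition, the 100 sampled claims (uncertainty 0.9710–0.9997) would all have predicted green probabilities very close to 0.5.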

Training Hyperparameters

| Parameter | Value |
|---|---|
| Epochs | 3 |
| Batch Size | 16 |
| Learning Rate | 2e-5 |
| Warmup Steps | 100 |
| Weight Decay | 0.01 |
| Gold Weight | 5.0x |
| Max Sequence Length | 384 |
| Optimizer | AdamW |
| Scheduler | Linear with warmup |
| Gradient Clipping | max_norm=1.0 |
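The 5.0x gold weight is a per-example weight on the cross-entropy loss. A minimal pure-Python sketch of the arithmetic (the actual training loop presumably applies the same idea with a per-sample weight tensor in PyTorch; `weighted_ce` and the example logits here are illustrative):

```python
import math

def weighted_ce(logits, label, weight):
    # Numerically stable cross-entropy for one example, scaled by its weight
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return weight * (log_z - logits[label])

# Silver examples weigh 1.0, gold examples weigh 5.0
batch = [
    ([1.2, -0.8], 0, 1.0),  # silver, non-green
    ([-0.3, 0.9], 1, 5.0),  # gold, green
]
# Weighted mean: a gold example contributes 5x as much as a silver one
loss = sum(weighted_ce(lg, y, w) for lg, y, w in batch) / sum(w for _, _, w in batch)
print(f"{loss:.4f}")
```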

Training Progress

| Epoch | Train Loss | Eval Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 0.4653 | 0.4162 | 80.96% | 0.8116 | 0.8064 | 0.8090 |
| 2 | 0.3789 | 0.4125 | 81.64% | 0.8258 | 0.8020 | 0.8137 |
| 3 | 0.3087 | 0.4512 | 81.52% | 0.8105 | 0.8228 | 0.8166 |

Evaluation Results

eval_silver (5,000 balanced examples)

| Metric | Score |
|---|---|
| Accuracy | 81.52% |
| F1 Score (Green) | 0.8166 |
| Precision (Green) | 0.8105 |
| Recall (Green) | 0.8228 |
| F1 Macro | 0.8152 |
| F1 Weighted | 0.8152 |
```
               precision    recall  f1-score   support

Non-Green (0)     0.8201    0.8076    0.8138      2500
    Green (1)     0.8105    0.8228    0.8166      2500

     accuracy                         0.8152      5000
    macro avg     0.8153    0.8152    0.8152      5000
 weighted avg     0.8153    0.8152    0.8152      5000
```

Confusion Matrix (eval_silver):

| | Pred Non-Green | Pred Green |
|---|---|---|
| Actual Non-Green | 2019 | 481 |
| Actual Green | 443 | 2057 |
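The headline metrics can be reproduced directly from this confusion matrix:

```python
# Cell counts from the eval_silver confusion matrix above
tn, fp = 2019, 481   # actual non-green row
fn, tp = 443, 2057   # actual green row

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)          # green precision
recall = tp / (tp + fn)             # green recall
f1 = 2 * tp / (2 * tp + fp + fn)    # green F1

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
# → 0.8152 0.8105 0.8228 0.8166
```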

gold_100 (100 human-reviewed examples)

| Metric | Score |
|---|---|
| Accuracy | 88.00% |
| F1 Score (Green) | 0.8605 |
| Precision (Green) | 0.8409 |
| Recall (Green) | 0.8810 |
```
               precision    recall  f1-score   support

Non-Green (0)     0.9107    0.8793    0.8947        58
    Green (1)     0.8409    0.8810    0.8605        42

     accuracy                         0.8800       100
    macro avg     0.8758    0.8801    0.8776       100
 weighted avg     0.8814    0.8800    0.8803       100
```

Confusion Matrix (gold_100):

| | Pred Non-Green | Pred Green |
|---|---|---|
| Actual Non-Green | 51 | 7 |
| Actual Green | 5 | 37 |

QLoRA Configuration (Mistral-7B Advocate Agent)

| Parameter | Value |
|---|---|
| Base Model | mistralai/Mistral-7B-Instruct-v0.3 |
| Quantization | 4-bit NF4 + double quantization |
| LoRA Rank | 16 |
| LoRA Alpha | 32 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Total Parameters | 7,289,966,592 |
| Trainable Parameters | 41,943,040 (0.58%) |
| Training Examples | 1,500 |
| Training Time | ~104 minutes |
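The table above maps onto the standard `transformers`/`peft` configuration objects roughly as follows. This is a sketch, not the exact training script: the compute dtype and other defaults are assumptions not stated above.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization with double quantization, per the table above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: not stated above
)

# LoRA rank/alpha and target modules, per the table above
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```

Applying `lora_config` to the 4-bit base model via `peft.get_peft_model` yields the ~42M trainable parameters (0.58% of 7.29B) reported above.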

Usage

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load encoder and tokenizer
tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/PatentSBERTa-green-finetuned")
encoder = AutoModel.from_pretrained("YOUR_USERNAME/PatentSBERTa-green-finetuned")

# Load classifier head (shipped separately from the encoder weights)
classifier = torch.nn.Linear(768, 2)
classifier.load_state_dict(torch.load("classifier_head.pt", map_location="cpu"))

encoder.eval()
classifier.eval()

# Classify a patent claim
claim = "A method for converting solar radiation into electrical energy using photovoltaic cells arranged in a panel configuration with maximum power point tracking."

inputs = tokenizer(claim, return_tensors="pt", truncation=True, max_length=384, padding=True)

with torch.no_grad():
    outputs = encoder(**inputs)
    # Mean-pool token embeddings over non-padding positions
    mask = inputs["attention_mask"].unsqueeze(-1).expand(outputs.last_hidden_state.size()).float()
    pooled = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    logits = classifier(pooled)
    probs = torch.softmax(logits, dim=1)
    prediction = torch.argmax(logits, dim=1).item()

label = "Green" if prediction == 1 else "Non-Green"
confidence = probs[0][prediction].item()
print(f"Prediction: {label} (confidence: {confidence:.4f})")
```

Limitations

  • Trained on CPC Y02 silver labels, which may contain noise
  • Only 100 human-reviewed gold examples; more gold data would likely improve performance
  • Max sequence length of 384 tokens; longer patent claims are truncated
  • Binary classification only; does not distinguish Y02 subcategories (Y02A, Y02B, etc.)
  • Performance may vary on patents from domains not well represented in the training data