---
license: apache-2.0
language:
- en
tags:
- climate
- ESG
- sustainable-finance
- sequence-classification
base_model: climatebert/distilroberta-base-climate-detector
metrics:
- f1
- accuracy
---
# 🌿 Green Shareholder Proposal Detector


*A fine-tuned DistilRoBERTa-based language model that detects "greenness" in shareholder proposals.*
---

## 📋 Model Summary

Shareholder resolutions are often terse and semantically ambiguous when read in isolation. Consider a proposal requesting a report on **water risk management**: it may refer to environmental water stress (a climate risk) or to the human right to water access (a social issue). Such overlaps are pervasive in ESG discourse, where the same terminology routinely spans environmental, social, and governance dimensions.

This model is a fine-tuned version of [ClimateBERT](https://huggingface.co/climatebert/distilroberta-base-climate-detector), specifically engineered to classify shareholder proposals as **green** (climate/environmental) or **non-green**. It is trained to resolve precisely this kind of ambiguity: rather than surface-matching sustainability keywords, it learns to identify the **underlying environmental intent** of a proposal from its full contextual framing. As a result, the model is robust against false positives induced by generic ESG buzzwords (terms such as *neutrality*, *waste*, or *water* that frequently appear in non-environmental proposals) and maintains high precision in **mixed-ESG contexts** where environmental and social/governance themes co-occur.

> 🎯 **Designed for:** Extracting environmental signal from noisy, multi-topic ESG disclosures.
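To see why surface keyword matching misfires on proposals like the water example above, consider a deliberately naive baseline. This is an illustrative sketch only; the keyword list and `naive_green_flag` helper are hypothetical and not part of this model:

```python
# Illustrative only: a naive keyword matcher of the kind this model replaces.
# The keyword set below is a hypothetical example, not part of the model.
GREEN_KEYWORDS = {"climate", "emissions", "water", "waste", "neutrality"}

def naive_green_flag(proposal: str) -> bool:
    """Flag a proposal as 'green' if any sustainability keyword appears."""
    words = proposal.lower().split()
    return any(kw in words for kw in GREEN_KEYWORDS)

# A social (human-rights) proposal that mentions 'water' is wrongly flagged:
social = "Report on the human right to water access in company operations"
print(naive_green_flag(social))  # True: a false positive on 'water'

gov = "Report on governance of executive compensation clawback policy"
print(naive_green_flag(gov))  # False
```

The fine-tuned model avoids this failure mode by conditioning on the full contextual framing of the proposal rather than isolated terms.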
---

## 🚀 Usage

### ⚡ Quick Start

Install dependencies first:

```bash
pip install transformers torch
```

Then run the following:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
from transformers.pipelines.pt_utils import KeyDataset
import datasets
from tqdm.auto import tqdm

# ── Model ──────────────────────────────────────────────
model_name = "Jidi1997/ClimateBERT_GPROP_Detector"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=512)
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer,
                device=0)  # change to device=-1 if only CPU is available

# ── Data ───────────────────────────────────────────────
# Option A: Load your own dataset from a local CSV / JSON file
# dataset = datasets.load_dataset("csv", data_files="your_proposals.csv", split="train")

# Option B: Construct proposals inline using the recommended input format.
# Each entry should follow the structure below for best performance:
# "A(An) {sponsor_type}-type sponsor has filed a shareholder proposal to a(an)
# {sic2_des}-sector company. This proposal requests: {resolution}.
# It falls under a broader agenda class that may include items not directly
# relevant to this specific proposal: {AgendaCodeInformation}"
dataset = datasets.Dataset.from_dict({"text": [
    # Replace with your own proposals following the recommended input format above
    """A(An) institutional-type sponsor has filed a shareholder proposal to a(an) energy-sector company.
This proposal requests: the company to issue a report on its greenhouse gas emissions reduction targets. It falls under a broader agenda class: "..."""
]})

# ── Inference ──────────────────────────────────────────
# label='yes' → Green proposal (Label 1)
# label='no'  → Non-green proposal (Label 0)
for out in tqdm(pipe(KeyDataset(dataset, "text"), padding=True, truncation=True)):
    print(out)
```

---

### 📌 Recommended Input Format

To address ambiguity in raw proposal text, enrich the model's input with structured proposal- and firm-level context, mirroring the training data format:

```
"A(An) {sponsor_type}-type sponsor has filed a shareholder proposal to a(an)
{sic2_des}-sector company. This proposal requests: {resolution}.
It falls under a broader agenda class that may include items not directly
relevant to this specific proposal: {AgendaCodeInformation}"
```

| Field | Description | Example |
|:---|:---|:---|
| `{sponsor_type}` | Type of proposal sponsor | `institutional`, `individual`, `SRI fund`, `pension fund` |
| `{sic2_des}` | SIC-2 industry sector description | `energy`, `manufacturing` |
| `{resolution}` | Full text of the proposal resolution | *"Report on Climate Change Performance Metrics Into Executive Compensation Program..."* |
| `{AgendaCodeInformation}` | Description of the ISS agenda code | *"This code is used for proposals seeking..."* |

> 💡 **Tip:** The `{AgendaCodeInformation}` field is optional, but including it generally improves prediction confidence, as it adds categorical context to brief resolution text.

## 📦 Training Data

The model was fine-tuned on a custom **stratified dataset of 1,500 manually curated ISS shareholder proposals**. The dataset underwent rule-based correction to exclude purely social/governance and blended proposals.
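Both the training texts and inference inputs follow the recommended input format described above. It can be assembled programmatically; a minimal sketch, assuming the helper name `build_input` (our own, not part of the model's tooling) and treating the agenda sentence as droppable when the optional field is absent:

```python
def build_input(sponsor_type: str, sic2_des: str, resolution: str,
                agenda_code_info: str = "") -> str:
    """Assemble a proposal into the recommended input format.

    `agenda_code_info` is optional; when omitted, the trailing agenda
    sentence is dropped entirely.
    """
    text = (
        f"A(An) {sponsor_type}-type sponsor has filed a shareholder proposal "
        f"to a(an) {sic2_des}-sector company. "
        f"This proposal requests: {resolution}."
    )
    if agenda_code_info:
        text += (
            " It falls under a broader agenda class that may include items "
            f"not directly relevant to this specific proposal: {agenda_code_info}"
        )
    return text

example = build_input(
    sponsor_type="institutional",
    sic2_des="energy",
    resolution="the company to issue a report on its greenhouse gas emissions reduction targets",
)
print(example)
```

The resulting string can be placed directly into the `"text"` column consumed by the pipeline in the Quick Start example.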
📂 For full details on data sampling, text construction, and labeling rules, please refer to the **[gprop_training_dataset](https://huggingface.co/datasets/Jidi1997/gprop_training_dataset)**.

---

## ⚙️ Training Procedure

### 🔧 Hyperparameters

| Hyperparameter | Value |
|:---|:---:|
| 📏 Learning Rate | `2e-05` |
| 📦 Train Batch Size | `16` |
| 📦 Eval Batch Size | `16` |
| 🎲 Seed | `42` |
| ⚖️ Weight Decay | `0.05` |
| 🔍 Optimizer | AdamW |
| 🔄 Epochs | `10` |

### 📈 Training Results

The model weights from **Epoch 8 (`checkpoint-600`)** were selected as the best-performing checkpoint based on the validation F1 score.

| Epoch | Train Loss | Val Loss | Accuracy | F1 (Binary) |
|:---:|:---:|:---:|:---:|:---:|
| 1 | 0.3060 | 0.0968 | 0.9667 | 0.9675 |
| 2 | 0.0954 | 0.0898 | 0.9733 | 0.9740 |
| 3 | 0.0956 | 0.1808 | 0.9600 | 0.9623 |
| 4 | 0.0029 | 0.0783 | 0.9800 | 0.9805 |
| 5 | 0.0395 | 0.1026 | 0.9800 | 0.9803 |
| 6 | 0.0350 | 0.1308 | 0.9733 | 0.9744 |
| 7 | 0.0094 | 0.1108 | 0.9767 | 0.9772 |
| **8** ⭐ | **0.0003** | **0.1182** | **0.9800** | **0.9806** |
| 9 | 0.0004 | 0.1154 | 0.9767 | 0.9773 |
| 10 | 0.0002 | 0.1229 | 0.9767 | 0.9773 |

> ⭐ **Best checkpoint selected at Epoch 8**: highest validation F1 of **0.9806**

---

## 📚 Citation

If you use this model in your research, please cite the associated working paper: (Forthcoming)

---
*Built on top of [ClimateBERT](https://huggingface.co/climatebert) · Trained with 🤗 Hugging Face Transformers*