Commit d1ed19e by MaximilianWeiland (1 parent: 9263158): add model card

Files changed (1): README.md (+132 −0)
---
language:
- en
license: mit
tags:
- bert
- feature-extraction
- contrastive-learning
- embeddings
- political-science
- social-groups
- clustering
base_model: google-bert/bert-base-uncased
pipeline_tag: feature-extraction
library_name: transformers
---

# Contrastive Learning Mention Embedding

A BERT-base model with a linear projection head, fine-tuned via contrastive learning to produce embeddings that maximize separability between mentions of different social groups. It is designed for clustering social group mentions into qualitative categories.

This model is part of the [`group-appeal-detector`](https://github.com/MaximilianWeiland/group_appeal_detector) package, which also provides group mention detection and stance classification.

## Model Details

- **Base model:** `bert-base-uncased`
- **Architecture:** BERT-base + linear projection head (768 → 128 dimensions)
- **Training objective:** Triplet loss with hard negative mining
- **Training data:** Social group dictionary provided by [Will Horne, Alona O. Dolinsky, and Lena Maria Huber](https://osf.io/preprints/osf/fp2h3_v3)

## How It Works

Each mention is inserted into the following prompt template:

```
Social group of {mention} is: [MASK].
```

The hidden state at the `[MASK]` position is extracted, passed through the projection layer, and L2-normalized. Mentions of the same social group category are pulled together in embedding space; mentions of different categories are pushed apart.

The model was trained with a triplet loss. Each anchor is a term from a category in the social group dictionary, paired with a randomly sampled positive from the same category and a hard negative mined from a different category.

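The mining step can be sketched with stand-in tensors. Below, L2-normalized random vectors play the role of the model's 128-d embeddings; the batch size, candidate-pool size, and margin are illustrative assumptions, not the actual training configuration:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins for model outputs (unit-norm, 128-d, as the model produces).
anchor = F.normalize(torch.randn(4, 128), dim=1)        # terms from one category each
positive = F.normalize(torch.randn(4, 128), dim=1)      # same-category terms
candidates = F.normalize(torch.randn(4, 16, 128), dim=2)  # pool of other-category terms

# Hard negative mining: for each anchor, pick the candidate from a
# different category that is closest in cosine similarity.
sims = torch.einsum("bd,bnd->bn", anchor, candidates)   # (batch, pool)
hard_idx = sims.argmax(dim=1)
negative = candidates[torch.arange(4), hard_idx]

# Triplet loss pulls anchor toward positive and away from the hard negative.
loss = torch.nn.TripletMarginLoss(margin=0.5)(anchor, positive, negative)
```

A harder negative (higher similarity to the anchor) produces a larger loss, which is what forces the embedding space to separate confusable categories.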
## Usage

### Via the `group-appeal-detector` package (recommended)

```bash
pip install group-appeal-detector
```

```python
from group_appeal_detector import GroupAppealDetector, GroupMentionClusterer

detector = GroupAppealDetector(device="cpu")

# Collect mentions from a corpus
texts = [...]
all_mentions = detector.detect_mentions_batch(texts, batch_size=16, as_df=False)
mentions = [m["span"] for doc_mentions in all_mentions for m in doc_mentions]

# Cluster into categories
clusterer = GroupMentionClusterer(mentions, device="cpu")
results_df = clusterer.cluster(n_clusters=5, as_df=True)
results_df.head()
```

To find the optimal number of clusters automatically:

```python
best_k, all_scores = clusterer.find_optimal_k(k_range=(2, 20), metric="silhouette", visualize=True)
results_df = clusterer.cluster(n_clusters=best_k, as_df=True)
```

### Direct usage

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoConfig, AutoModel, AutoTokenizer
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

REPO_ID = "maxwlnd/cl_mention_embedding"

class ModelMask(nn.Module):
    def __init__(self, tokenizer, pretrained_model_name="bert-base-uncased", proj_dim=128):
        super().__init__()
        config = AutoConfig.from_pretrained(pretrained_model_name)
        self.encoder = AutoModel.from_config(config)
        self.mask_id = tokenizer.mask_token_id
        self.projector = nn.Sequential(nn.Linear(config.hidden_size, proj_dim))

    def encode(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        mask_positions = (input_ids == self.mask_id)
        # Average the hidden states at the [MASK] position(s) of each example
        h = torch.stack([
            outputs.last_hidden_state[i][mask_positions[i]].mean(dim=0)
            for i in range(input_ids.size(0))
        ])
        z = self.projector(h)
        return F.normalize(z, p=2, dim=1)

tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
model = ModelMask(tokenizer)
model.load_state_dict(load_file(hf_hub_download(REPO_ID, "model.safetensors")))
model.eval()

def embed(mention: str) -> torch.Tensor:
    prompt = f"Social group of {mention} is: {tokenizer.mask_token}."
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        return model.encode(inputs["input_ids"], inputs["attention_mask"])

emb_a = embed("farmers")
emb_b = embed("agricultural workers")
print(F.cosine_similarity(emb_a, emb_b).item())
```

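In the package, `GroupMentionClusterer` handles the clustering step for you. As an illustration of what clustering the unit-norm embeddings amounts to, here is a tiny k-means sketch on random stand-in vectors; the data, `k`, and iteration count are all assumptions for demonstration only:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins for embed(...) outputs: 12 mentions, 128-d, unit norm.
emb = F.normalize(torch.randn(12, 128), dim=1)

k = 3
# Initialize centroids from k random points.
centroids = emb[torch.randperm(12)[:k]].clone()
for _ in range(10):
    # Assign step: each mention goes to its nearest centroid.
    labels = torch.cdist(emb, centroids).argmin(dim=1)
    # Update step: recompute each centroid as the (re-normalized) mean
    # of its members, keeping centroids on the unit sphere.
    for j in range(k):
        if (labels == j).any():
            centroids[j] = F.normalize(emb[labels == j].mean(0), dim=0)
```

Because the embeddings are L2-normalized, Euclidean distance and cosine similarity give the same nearest-centroid ranking, so plain k-means behaves like spherical clustering here.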
## Related Models

This model is one of three models in the group appeal detection pipeline:

| Model | Task |
|---|---|
| [`maxwlnd/roberta_group_mention_detector`](https://huggingface.co/maxwlnd/roberta_group_mention_detector) | Detect social group mentions |
| [`maxwlnd/socialgroup_stance_classification_nli`](https://huggingface.co/maxwlnd/socialgroup_stance_classification_nli) | Classify stance toward a group as positive, negative, or neutral |
| [`maxwlnd/cl_mention_embedding`](https://huggingface.co/maxwlnd/cl_mention_embedding) | Embed mentions for clustering into qualitative categories (this model) |

## License

MIT