Text Ranking
sentence-transformers
Safetensors
Transformers
roberta
text-classification
feature-extraction
sentence-similarity
text-embeddings-inference
Instructions to use inthedarkness/klue-roberta-small-cross-encoder with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- sentence-transformers
How to use inthedarkness/klue-roberta-small-cross-encoder with sentence-transformers:

    from sentence_transformers import CrossEncoder

    model = CrossEncoder("inthedarkness/klue-roberta-small-cross-encoder")

    query = "Which planet is known as the Red Planet?"
    passages = [
        "Venus is often called Earth's twin because of its similar size and proximity.",
        "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
        "Jupiter, the largest planet in our solar system, has a prominent red spot.",
        "Saturn, famous for its rings, is sometimes mistaken for the Red Planet.",
    ]

    scores = model.predict([(query, passage) for passage in passages])
    print(scores)

- Transformers
How to use inthedarkness/klue-roberta-small-cross-encoder with Transformers:

    # Load model directly
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("inthedarkness/klue-roberta-small-cross-encoder")
    model = AutoModelForSequenceClassification.from_pretrained("inthedarkness/klue-roberta-small-cross-encoder")

- Notebooks
- Google Colab
- Kaggle
MINJUNJU committed on
Upload folder using huggingface_hub
- 1_Pooling/config.json +10 -0
- README.md +127 -0
- config.json +29 -0
- config_sentence_transformers.json +9 -0
- eval/similarity_evaluation_results.csv +5 -0
- model.safetensors +3 -0
- modules.json +14 -0
- sentence_bert_config.json +4 -0
- special_tokens_map.json +51 -0
- tokenizer.json +0 -0
- tokenizer_config.json +59 -0
- vocab.txt +0 -0
1_Pooling/config.json
ADDED
@@ -0,0 +1,10 @@
+{
+  "word_embedding_dimension": 768,
+  "pooling_mode_cls_token": false,
+  "pooling_mode_mean_tokens": true,
+  "pooling_mode_max_tokens": false,
+  "pooling_mode_mean_sqrt_len_tokens": false,
+  "pooling_mode_weightedmean_tokens": false,
+  "pooling_mode_lasttoken": false,
+  "include_prompt": true
+}
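The pooling config above enables only `pooling_mode_mean_tokens`. As a rough illustration of what that mode computes, here is a dependency-free sketch of masked mean pooling. The function name `masked_mean_pool` and the toy 3-dimensional vectors are assumptions made for this example (the real model uses 768 dimensions); this is not the library's implementation.

```python
from typing import List


def masked_mean_pool(token_embeddings: List[List[float]],
                     attention_mask: List[int]) -> List[float]:
    """Average token vectors, skipping positions whose mask is 0 (padding)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                summed[i] += value
    # Guard against division by zero when every position is padding.
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in summed]


# Toy input: two real tokens and one padding token (hypothetical values).
tokens = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(tokens, mask)
print(pooled)
```

The padding vector is ignored, so only the first two tokens contribute to the average.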
README.md
ADDED
@@ -0,0 +1,127 @@
+---
+library_name: sentence-transformers
+pipeline_tag: sentence-similarity
+tags:
+- sentence-transformers
+- feature-extraction
+- sentence-similarity
+- transformers
+
+---
+
+# {MODEL_NAME}
+
+This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.
+
+<!--- Describe your model here -->
+
+## Usage (Sentence-Transformers)
+
+Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
+
+```
+pip install -U sentence-transformers
+```
+
+Then you can use the model like this:
+
+```python
+from sentence_transformers import SentenceTransformer
+sentences = ["This is an example sentence", "Each sentence is converted"]
+
+model = SentenceTransformer('{MODEL_NAME}')
+embeddings = model.encode(sentences)
+print(embeddings)
+```
+
+
+
+## Usage (HuggingFace Transformers)
+Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: First, you pass your input through the transformer model, then you have to apply the right pooling-operation on-top of the contextualized word embeddings.
+
+```python
+from transformers import AutoTokenizer, AutoModel
+import torch
+
+
+#Mean Pooling - Take attention mask into account for correct averaging
+def mean_pooling(model_output, attention_mask):
+    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
+    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
+
+
+# Sentences we want sentence embeddings for
+sentences = ['This is an example sentence', 'Each sentence is converted']
+
+# Load model from HuggingFace Hub
+tokenizer = AutoTokenizer.from_pretrained('{MODEL_NAME}')
+model = AutoModel.from_pretrained('{MODEL_NAME}')
+
+# Tokenize sentences
+encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
+
+# Compute token embeddings
+with torch.no_grad():
+    model_output = model(**encoded_input)
+
+# Perform pooling. In this case, mean pooling.
+sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
+
+print("Sentence embeddings:")
+print(sentence_embeddings)
+```
+
+
+
+## Evaluation Results
+
+<!--- Describe how your model was evaluated -->
+
+For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name={MODEL_NAME})
+
+
+## Training
+The model was trained with the parameters:
+
+**DataLoader**:
+
+`torch.utils.data.dataloader.DataLoader` of length 657 with parameters:
+```
+{'batch_size': 16, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
+```
+
+**Loss**:
+
+`sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss`
+
+Parameters of the fit()-Method:
+```
+{
+    "epochs": 4,
+    "evaluation_steps": 1000,
+    "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
+    "max_grad_norm": 1,
+    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
+    "optimizer_params": {
+        "lr": 2e-05
+    },
+    "scheduler": "WarmupLinear",
+    "steps_per_epoch": null,
+    "warmup_steps": 100,
+    "weight_decay": 0.01
+}
+```
+
+
+## Full Model Architecture
+```
+SentenceTransformer(
+  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: RobertaModel
+  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+)
+```
+
+## Citing & Authors
+
+<!--- Describe where people can find more information -->
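The training section of the README lists a `WarmupLinear` scheduler with `lr` 2e-05, 100 warmup steps, 4 epochs, and a DataLoader of length 657. A minimal sketch of the learning-rate curve those numbers imply follows. It assumes `WarmupLinear` means a linear ramp from zero followed by linear decay to zero, and that the total step count is 657 × 4 = 2628; the function name `warmup_linear_lr` is hypothetical and this is not the library's scheduler code.

```python
STEPS_PER_EPOCH = 657   # DataLoader length from the README
EPOCHS = 4
TOTAL_STEPS = STEPS_PER_EPOCH * EPOCHS  # assumed total number of optimizer steps


def warmup_linear_lr(step: int,
                     base_lr: float = 2e-5,
                     warmup_steps: int = 100,
                     total_steps: int = TOTAL_STEPS) -> float:
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 50, 100, 1364, TOTAL_STEPS):
    print(step, warmup_linear_lr(step))
```

Under these assumptions the rate peaks at step 100 and reaches zero at step 2628, the end of the fourth epoch.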
config.json
ADDED
@@ -0,0 +1,29 @@
+{
+  "_name_or_path": "klue/roberta-base",
+  "architectures": [
+    "RobertaModel"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "bos_token_id": 0,
+  "classifier_dropout": null,
+  "eos_token_id": 2,
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_norm_eps": 1e-05,
+  "max_position_embeddings": 514,
+  "model_type": "roberta",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 1,
+  "position_embedding_type": "absolute",
+  "tokenizer_class": "BertTokenizer",
+  "torch_dtype": "float32",
+  "transformers_version": "4.40.1",
+  "type_vocab_size": 1,
+  "use_cache": true,
+  "vocab_size": 32000
+}
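The sizes in this config are enough for a back-of-the-envelope parameter count. The sketch below assumes the usual BERT/RoBERTa encoder layout (word, position, and token-type embeddings with a LayerNorm; per layer, four attention projections and a two-layer feed-forward block, each followed by a LayerNorm; plus a pooler dense layer). That layout is an assumption, not something read from the checkpoint, and `estimate_encoder_parameters` is a hypothetical helper.

```python
def estimate_encoder_parameters(cfg: dict) -> int:
    """Estimate parameter count for a BERT-style encoder from its config sizes."""
    h = cfg["hidden_size"]
    ffn = cfg["intermediate_size"]
    layer_norm = 2 * h  # weight + bias

    embeddings = (
        cfg["vocab_size"] + cfg["max_position_embeddings"] + cfg["type_vocab_size"]
    ) * h + layer_norm
    attention = 4 * (h * h + h) + layer_norm            # Q, K, V, output projections
    feed_forward = (h * ffn + ffn) + (ffn * h + h) + layer_norm
    pooler = h * h + h                                   # assumed to be present

    return embeddings + cfg["num_hidden_layers"] * (attention + feed_forward) + pooler


# Values copied from config.json above.
config = {
    "hidden_size": 768,
    "intermediate_size": 3072,
    "max_position_embeddings": 514,
    "num_hidden_layers": 12,
    "type_vocab_size": 1,
    "vocab_size": 32000,
}

total = estimate_encoder_parameters(config)
print(total, "parameters,", total * 4, "bytes as float32")
```

Under these assumptions the estimate is about 110.6 million parameters, or roughly 442.47 MB at 4 bytes each, which sits just under the 442,494,816-byte size recorded for `model.safetensors` in this commit.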
config_sentence_transformers.json
ADDED
@@ -0,0 +1,9 @@
+{
+  "__version__": {
+    "sentence_transformers": "2.7.0",
+    "transformers": "4.40.1",
+    "pytorch": "2.2.1+cu118"
+  },
+  "prompts": {},
+  "default_prompt_name": null
+}
eval/similarity_evaluation_results.csv
ADDED
@@ -0,0 +1,5 @@
+epoch,steps,cosine_pearson,cosine_spearman,euclidean_pearson,euclidean_spearman,manhattan_pearson,manhattan_spearman,dot_pearson,dot_spearman
+0,-1,0.9545304841120885,0.9056638834772096,0.9443079460728341,0.9068591261478569,0.9439727860354374,0.9058604957884819,0.9438354160908577,0.8910780902270788
+1,-1,0.9606693092268676,0.9175533548796789,0.9524796973784794,0.9180121191962418,0.9523029351015382,0.9176516639477474,0.9493910168818802,0.9002708738516022
+2,-1,0.9612209955060567,0.9193717787749159,0.952420146334871,0.9198482065927485,0.9523019439942308,0.9193955532309962,0.9486788258348828,0.9015565465838897
+3,-1,0.9619607636908197,0.9212519775662461,0.953898124124259,0.9211105199937801,0.9538096606372759,0.9207983485284669,0.9512574401766136,0.9052701762316409
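To read the evaluation log programmatically, a small sketch using only the standard library is shown here. It embeds three of the columns from the CSV above (values copied verbatim) and picks the epoch with the highest `cosine_spearman`; choosing that metric as the selection criterion is an assumption made for the example.

```python
import csv
import io

# Subset of columns copied from eval/similarity_evaluation_results.csv.
RESULTS_CSV = """epoch,steps,cosine_pearson,cosine_spearman
0,-1,0.9545304841120885,0.9056638834772096
1,-1,0.9606693092268676,0.9175533548796789
2,-1,0.9612209955060567,0.9193717787749159
3,-1,0.9619607636908197,0.9212519775662461
"""

rows = list(csv.DictReader(io.StringIO(RESULTS_CSV)))

# Select the row with the best Spearman correlation on cosine similarity.
best = max(rows, key=lambda row: float(row["cosine_spearman"]))
print("best epoch:", best["epoch"], "cosine_spearman:", best["cosine_spearman"])
```

Both cosine metrics in this subset rise with each epoch, so the last row is selected.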
model.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:159333a67862dd1fd65b67efd1b2c772e76c1b17664133268b143f4f0a777dd4
+size 442494816
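The three lines above are a Git LFS pointer, not the weights themselves: each line is a key, a space, and a value. A minimal parsing sketch follows; `parse_lfs_pointer` is a hypothetical helper written for this example, not part of any Git LFS tooling.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer file into its version, hash, and size fields."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:159333a67862dd1fd65b67efd1b2c772e76c1b17664133268b143f4f0a777dd4
size 442494816
"""

pointer = parse_lfs_pointer(POINTER)
print(pointer["hash_algorithm"], pointer["size_bytes"])
print(round(pointer["size_bytes"] / 2**20, 1), "MiB")
```

The `size` field is the byte length of the real `model.safetensors` object that the pointer stands in for.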
modules.json
ADDED
@@ -0,0 +1,14 @@
+[
+  {
+    "idx": 0,
+    "name": "0",
+    "path": "",
+    "type": "sentence_transformers.models.Transformer"
+  },
+  {
+    "idx": 1,
+    "name": "1",
+    "path": "1_Pooling",
+    "type": "sentence_transformers.models.Pooling"
+  }
+]
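`modules.json` lists the pipeline stages in order: a Transformer module stored at the repository root (empty `path`) followed by a Pooling module in `1_Pooling`. The sketch below only illustrates how that ordering can be read from the file; `pipeline_order` is a hypothetical helper, and mapping an empty path to `"."` is a choice made for this example.

```python
import json

# Contents copied from modules.json above.
MODULES_JSON = """
[
  {"idx": 0, "name": "0", "path": "", "type": "sentence_transformers.models.Transformer"},
  {"idx": 1, "name": "1", "path": "1_Pooling", "type": "sentence_transformers.models.Pooling"}
]
"""


def pipeline_order(modules_json: str) -> list:
    """Return (class name, directory) pairs sorted by module index."""
    modules = sorted(json.loads(modules_json), key=lambda module: module["idx"])
    return [
        (module["type"].rsplit(".", 1)[-1], module["path"] or ".")
        for module in modules
    ]


stages = pipeline_order(MODULES_JSON)
for class_name, directory in stages:
    print(class_name, "<-", directory)
```

This matches the "Full Model Architecture" listing in the README: the Transformer stage runs first and the Pooling stage second.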
sentence_bert_config.json
ADDED
@@ -0,0 +1,4 @@
+{
+  "max_seq_length": 512,
+  "do_lower_case": false
+}
special_tokens_map.json
ADDED
@@ -0,0 +1,51 @@
+{
+  "bos_token": {
+    "content": "[CLS]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "cls_token": {
+    "content": "[CLS]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "[SEP]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "[MASK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "[PAD]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "[SEP]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "[UNK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}
tokenizer.json
ADDED
The diff for this file is too large to render.
tokenizer_config.json
ADDED
@@ -0,0 +1,59 @@
+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "bos_token": "[CLS]",
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "do_basic_tokenize": true,
+  "do_lower_case": false,
+  "eos_token": "[SEP]",
+  "mask_token": "[MASK]",
+  "model_max_length": 512,
+  "never_split": null,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}
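The tokenizer config maps `[CLS]`, `[PAD]`, and `[SEP]` to IDs 0, 1, and 2 and names `BertTokenizer` as the tokenizer class. A rough sketch of how a BERT-style tokenizer frames inputs with those IDs follows: `[CLS] A [SEP]` for a single text and `[CLS] A [SEP] B [SEP]` for a pair, with padding and an attention mask for batching. The helper names and the word-piece IDs 10, 11, and 12 are hypothetical placeholders, not real vocabulary entries.

```python
# Special-token IDs taken from added_tokens_decoder above.
CLS_ID, PAD_ID, SEP_ID = 0, 1, 2


def add_special_tokens(ids_a, ids_b=None):
    """Frame one or two token-ID lists in the BERT single/pair layout."""
    framed = [CLS_ID] + list(ids_a) + [SEP_ID]
    if ids_b is not None:
        framed += list(ids_b) + [SEP_ID]
    return framed


def pad_batch(sequences):
    """Right-pad sequences to the longest one and build the attention mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [PAD_ID] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return input_ids, attention_mask


single = add_special_tokens([10, 11])        # hypothetical word-piece IDs
pair = add_special_tokens([10, 11], [12])
input_ids, attention_mask = pad_batch([single, pair])
print(input_ids)
print(attention_mask)
```

The attention mask built here is the same kind of mask that mean pooling uses to ignore `[PAD]` positions.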
vocab.txt
ADDED
The diff for this file is too large to render.