---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- transformers
- TAACO
language: ko
---

# TAACO_Similarity

This model is based on [Sentence-transformers](https://www.SBERT.net) and was trained on the KLUE STS (Semantic Textual Similarity) dataset. I built it to measure semantic cohesion between sentences, which is one of the indices of K-TAACO (working title), a tool I am developing for measuring cohesion between Korean sentences. I also plan to continue training with additional data, such as the sentence similarity data from the Modu Corpus (모두의 말뭉치).

## Train Data

- KLUE-sts-v1.1._train.json
- NLI-sts-train.tsv

## Usage (Sentence-Transformers)

To use this model, first install [Sentence-transformers](https://www.SBERT.net):

```
pip install -U sentence-transformers
```

Then load the model and encode sentences as in the code below.

```python
from sentence_transformers import SentenceTransformer, models

sentences = ["This is an example sentence", "Each sentence is converted"]

# Transformer module: loads the BERT weights and tokenizer from the Hub.
embedding_model = models.Transformer(
    model_name_or_path="KDHyun08/TAACO_STS",
    max_seq_length=256,
    do_lower_case=True,
)

# Pooling module: averages the token embeddings into one vector per sentence.
pooling_model = models.Pooling(
    embedding_model.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
    pooling_mode_cls_token=False,
    pooling_mode_max_tokens=False,
)

model = SentenceTransformer(modules=[embedding_model, pooling_model])

embeddings = model.encode(sentences)
print(embeddings)
```
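
The `Pooling` module above is configured with `pooling_mode_mean_tokens=True`. As a rough illustration of what mean pooling computes, here is a minimal NumPy sketch. It is not the library's implementation, and the helper name `mean_pool` and the toy numbers are made up for this example. It averages token vectors while skipping padding positions:

```python
import numpy as np


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors over real (non-padding) tokens only."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # avoid division by zero
    return summed / counts


# Toy batch (hypothetical values): 2 sentences, 3 token slots, 2-dimensional vectors.
token_embeddings = np.array([
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],  # third slot is padding and must be ignored
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
])
attention_mask = np.array([[1, 1, 0], [1, 1, 1]])

sentence_embeddings = mean_pool(token_embeddings, attention_mask)
print(sentence_embeddings)
```

Each row of `sentence_embeddings` is one fixed-size sentence vector, regardless of how many tokens the sentence had.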

## Usage (Comparing Similarity Between Actual Sentences)

After installing [Sentence-transformers](https://www.SBERT.net), you can compare the similarity between sentences as shown below.
The `query` variable holds the source sentence that serves as the basis for comparison. Put the sentences to compare against it in `docs` as a list.

```python
import torch
from sentence_transformers import SentenceTransformer, models, util

embedding_model = models.Transformer(
    model_name_or_path="KDHyun08/TAACO_STS",
    max_seq_length=256,
    do_lower_case=True,
)
pooling_model = models.Pooling(
    embedding_model.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
    pooling_mode_cls_token=False,
    pooling_mode_max_tokens=False,
)
model = SentenceTransformer(modules=[embedding_model, pooling_model])

docs = [
    '어제는 아내의 생일이었다',
    '생일을 맞이하여 아침을 준비하겠다고 오전 8시 30분부터 음식을 준비하였다. 주된 메뉴는 스테이크와 낙지볶음, 미역국, 잡채, 소야 등이었다',
    '스테이크는 자주 하는 음식이어서 자신이 준비하려고 했다',
    '앞뒤도 1분씩 3번 뒤집고 래스팅을 잘 하면 육즙이 가득한 스테이크가 준비된다',
    '아내도 그런 스테이크를 좋아한다. 그런데 상상도 못한 일이 벌어지고 말았다',
    '보통 시즈닝이 되지 않은 원육을 사서 스테이크를 했는데, 이번에는 시즈닝이 된 부챗살을 구입해서 했다',
    '그런데 케이스 안에 방부제가 들어있는 것을 인지하지 못하고 방부제와 동시에 프라이팬에 올려놓은 것이다',
    '그것도 인지 못한 체... 앞면을 센 불에 1분을 굽고 뒤집는 순간 방부제가 함께 구어진 것을 알았다',
    '아내의 생일이라 맛있게 구워보고 싶었는데 어처구니없는 상황이 발생한 것이다',
    '방부제가 센 불에 녹아서 그런지 물처럼 흘러내렸다',
    ' 고민을 했다. 방부제가 묻은 부문만 제거하고 다시 구울까 했는데 방부제에 절대 먹지 말라는 문구가 있어서 아깝지만 버리는 방향을 했다',
    '너무나 안타까웠다',
    '아침 일찍 아내가 좋아하는 스테이크를 준비하고 그것을 맛있게 먹는 아내의 모습을 보고 싶었는데 전혀 생각지도 못한 상황이 발생해서... 하지만 정신을 추스르고 바로 다른 메뉴로 변경했다',
    '소야, 소시지 야채볶음..',
    '아내가 좋아하는지 모르겠지만 냉장고 안에 있는 후랑크소세지를 보니 바로 소야를 해야겠다는 생각이 들었다. 음식은 성공적으로 완성이 되었다',
    '40번째를 맞이하는 아내의 생일은 성공적으로 준비가 되었다',
    '맛있게 먹어 준 아내에게도 감사했다',
    '매년 아내의 생일을 맞이하면 아침마다 생일을 차려야겠다. 오늘도 즐거운 하루가 되었으면 좋겠다',
    '생일이니까~',
]

# Encode each sentence into a vector
document_embeddings = model.encode(docs)
query = '생일을 맞이하여 아침을 준비하겠다고 오전 8시 30분부터 음식을 준비하였다'
query_embedding = model.encode(query)
top_k = min(10, len(docs))

# Compute the cosine similarity between the query and every document
cos_scores = util.pytorch_cos_sim(query_embedding, document_embeddings)[0]
# Pick the top_k sentences in descending order of cosine similarity
top_results = torch.topk(cos_scores, k=top_k)

print(f"Input sentence: {query}")
print(f"\n<Top {top_k} sentences most similar to the input sentence>\n")
for i, (score, idx) in enumerate(zip(top_results[0], top_results[1])):
    print(f"{i+1}: {docs[int(idx)]} (similarity: {score:.4f})\n")
```
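
To make the ranking step concrete without downloading the model, the following NumPy sketch reproduces the idea behind `util.pytorch_cos_sim` followed by `torch.topk`: normalize the vectors, take dot products, and keep the highest scores. The helper names `cosine_scores` and `top_k_indices` and the toy vectors are assumptions made for illustration only.

```python
import numpy as np


def cosine_scores(query_vec, doc_matrix):
    """Cosine similarity between one query vector and each row of doc_matrix."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    return d @ q


def top_k_indices(scores, k):
    """Return (index, score) pairs for the k highest scores, best first."""
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]


# Toy 2-dimensional "embeddings" (hypothetical values, not model output).
query_vec = np.array([1.0, 0.0])
doc_matrix = np.array([[2.0, 0.0], [1.0, 1.0], [0.0, 3.0], [-1.0, 0.0]])

ranked = top_k_indices(cosine_scores(query_vec, doc_matrix), k=3)
for rank, (idx, score) in enumerate(ranked, start=1):
    print(f"{rank}: doc {idx} (similarity: {score:.4f})")
```

Because the vectors are normalized first, vector length does not affect the score; only direction does.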

## Evaluation Results

Running the usage example above produces the following output. The closer a score is to 1, the more similar the sentence is to the input.

```
Input sentence: 생일을 맞이하여 아침을 준비하겠다고 오전 8시 30분부터 음식을 준비하였다
<Top 10 sentences most similar to the input sentence>
1: 생일을 맞이하여 아침을 준비하겠다고 오전 8시 30분부터 음식을 준비하였다. 주된 메뉴는 스테이크와 낙지볶음, 미역국, 잡채, 소야 등이었다 (similarity: 0.6687)
2: 매년 아내의 생일을 맞이하면 아침마다 생일을 차려야겠다. 오늘도 즐거운 하루가 되었으면 좋겠다 (similarity: 0.6468)
3: 40번째를 맞이하는 아내의 생일은 성공적으로 준비가 되었다 (similarity: 0.4647)
4: 아내의 생일이라 맛있게 구워보고 싶었는데 어처구니없는 상황이 발생한 것이다 (similarity: 0.4469)
5: 생일이니까~ (similarity: 0.4218)
6: 어제는 아내의 생일이었다 (similarity: 0.4192)
7: 아침 일찍 아내가 좋아하는 스테이크를 준비하고 그것을 맛있게 먹는 아내의 모습을 보고 싶었는데 전혀 생각지도 못한 상황이 발생해서... 하지만 정신을 추스르고 바로 다른 메뉴로 변경했다 (similarity: 0.4156)
8: 맛있게 먹어 준 아내에게도 감사했다 (similarity: 0.3093)
9: 아내가 좋아하는지 모르겠지만 냉장고 안에 있는 후랑크소세지를 보니 바로 소야를 해야겠다는 생각이 들었다. 음식은 성공적으로 완성이 되었다 (similarity: 0.2259)
10: 아내도 그런 스테이크를 좋아한다. 그런데 상상도 못한 일이 벌어지고 말았다 (similarity: 0.1967)
```

**DataLoader**:

`torch.utils.data.dataloader.DataLoader` of length 142 with parameters:
```
{'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```

**Loss**:

`sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss`

Parameters of the fit()-Method:
```
{
    "epochs": 4,
    "evaluation_steps": 1000,
    "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'transformers.optimization.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 10000,
    "weight_decay": 0.01
}
```
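
For readers unfamiliar with this loss: `CosineSimilarityLoss` scores each sentence pair by the cosine similarity of its two embeddings and, with its default settings, penalizes the squared difference from the gold similarity label. The NumPy sketch below illustrates that idea only. The function name and the toy embeddings are hypothetical, and it assumes the gold labels have already been scaled to the range [0, 1], which this card does not state.

```python
import numpy as np


def cosine_similarity_loss(emb_a, emb_b, gold):
    """Mean squared error between pairwise cosine similarity and gold scores."""
    dot = (emb_a * emb_b).sum(axis=1)
    norms = np.linalg.norm(emb_a, axis=1) * np.linalg.norm(emb_b, axis=1)
    predicted = dot / norms
    return float(np.mean((predicted - gold) ** 2))


# Two toy sentence pairs (hypothetical embeddings and labels).
emb_a = np.array([[1.0, 0.0], [1.0, 0.0]])
emb_b = np.array([[1.0, 0.0], [0.0, 1.0]])
gold = np.array([1.0, 0.5])  # assumed to be normalized similarity labels

loss = cosine_similarity_loss(emb_a, emb_b, gold)
print(loss)
```

Minimizing this quantity pushes the cosine similarity of each pair toward its annotated score, which is why cosine similarity is also the natural metric at inference time.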
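
The interaction between `warmup_steps`, `epochs`, and the DataLoader length can be checked with a little arithmetic. The sketch below assumes that `"steps_per_epoch": null` means one optimizer step per batch (142 per epoch) and that `WarmupLinear` is a linear warmup followed by linear decay. The function `warmup_linear_factor` is a hand-written approximation for illustration, not the library's code.

```python
def warmup_linear_factor(step, warmup_steps, total_steps):
    """Learning-rate multiplier: linear ramp-up to 1.0, then linear decay to 0.0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


# Values taken from the parameters listed above.
batches_per_epoch = 142   # DataLoader length
epochs = 4
warmup_steps = 10000
peak_lr = 2e-05

# Assumption: one optimizer step per batch, so training runs for 142 * 4 steps.
total_steps = batches_per_epoch * epochs
final_lr = peak_lr * warmup_linear_factor(total_steps, warmup_steps, total_steps)
print(total_steps, final_lr)
```

If these assumptions hold, training lasts 568 steps, far fewer than the 10,000 warmup steps, so the learning rate would still be ramping up when training ends and would reach only about 5.7% of `2e-05`. Anyone reusing these settings may want to verify this against their own training logs.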

## Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```

## Citing & Authors

<!--- Describe where people can find more information -->