# Quickstart
Once you have SentenceTransformers [installed](installation.md), the usage is simple:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# Our sentences to encode
sentences = ['This framework generates embeddings for each input sentence',
             'Sentences are passed as a list of strings.',
             'The quick brown fox jumps over the lazy dog.']

# Sentences are encoded by calling model.encode()
sentence_embeddings = model.encode(sentences)

# Print the embeddings
for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")
```
|
|
|
|
With `SentenceTransformer('all-MiniLM-L6-v2')` we define which sentence transformer model we want to load. In this example, we load *all-MiniLM-L6-v2*, a MiniLM model fine-tuned on a large dataset of over 1 billion training pairs.
|
|
BERT (and other transformer networks) outputs an embedding for each token in the input text. To create a fixed-size sentence embedding from this, the model applies mean pooling, i.e., the output embeddings of all tokens are averaged to yield a fixed-size vector.
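Conceptually, mean pooling can be sketched with NumPy as follows. The dimensions and values here are toy assumptions (the real model works with 384-dimensional token embeddings); in practice, an attention mask excludes padding tokens from the average, as shown:

```python
import numpy as np

# Toy token embeddings: 5 token positions, 8 dimensions each
token_embeddings = np.random.rand(5, 8)

# Attention mask: the last two positions are padding and must be ignored
attention_mask = np.array([1, 1, 1, 0, 0])

# Mean pooling: zero out padding tokens, then average over the real tokens
masked = token_embeddings * attention_mask[:, None]
sentence_embedding = masked.sum(axis=0) / attention_mask.sum()

print(sentence_embedding.shape)  # one fixed-size vector: (8,)
```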
|
|
## Comparing Sentence Similarities
|
|
The sentences (texts) are mapped such that sentences with similar meanings are close in vector space. One common method to measure similarity in vector space is [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity). For two sentences, this can be done like this:
|
|
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

# Sentences are encoded by calling model.encode()
emb1 = model.encode("This is a red cat with a hat.")
emb2 = model.encode("Have you seen my red cat?")

cos_sim = util.cos_sim(emb1, emb2)
print("Cosine-Similarity:", cos_sim)
```
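Under the hood, cosine similarity is the dot product of the two vectors divided by the product of their norms. A minimal NumPy sketch of the same computation (with made-up example vectors):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(a, b) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

emb1 = np.array([0.2, 0.5, 0.1])
emb2 = np.array([0.3, 0.4, 0.9])
print(cosine_similarity(emb1, emb2))
```

Identical directions score 1.0, orthogonal vectors score 0.0.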
|
|
If you have a list with more sentences, you can use the following code example:
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = ['A man is eating food.',
             'A man is eating a piece of bread.',
             'The girl is carrying a baby.',
             'A man is riding a horse.',
             'A woman is playing violin.',
             'Two men pushed carts through the woods.',
             'A man is riding a white horse on an enclosed ground.',
             'A monkey is playing drums.',
             'Someone in a gorilla costume is playing a set of drums.']

# Encode all sentences
embeddings = model.encode(sentences)

# Compute cosine similarity between all pairs
cos_sim = util.cos_sim(embeddings, embeddings)

# Add all pairs to a list with their cosine similarity score
all_sentence_combinations = []
for i in range(len(cos_sim) - 1):
    for j in range(i + 1, len(cos_sim)):
        all_sentence_combinations.append([cos_sim[i][j], i, j])

# Sort the list by descending cosine similarity score
all_sentence_combinations = sorted(all_sentence_combinations, key=lambda x: x[0], reverse=True)

print("Top-5 most similar pairs:")
for score, i, j in all_sentence_combinations[0:5]:
    print("{} \t {} \t {:.4f}".format(sentences[i], sentences[j], cos_sim[i][j]))
```
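The nested loop above is fine for a handful of sentences; for larger lists, the same pair ranking can be done in one vectorized step over the upper triangle of the similarity matrix. A sketch with NumPy, using a small made-up similarity matrix in place of the real `cos_sim` output:

```python
import numpy as np

# Stand-in for a cosine-similarity matrix as produced for 3 sentences
cos_sim = np.array([[1.0, 0.8, 0.1],
                    [0.8, 1.0, 0.3],
                    [0.1, 0.3, 1.0]])

# Indices of the strict upper triangle: each pair (i, j) with i < j exactly once
i, j = np.triu_indices(len(cos_sim), k=1)

# Sort all pairs by descending similarity in one vectorized step
order = np.argsort(-cos_sim[i, j])
top_pairs = list(zip(i[order], j[order], cos_sim[i, j][order]))

print(top_pairs)  # most similar pair first
```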
|
|
See the *Usage* sections on the left for more examples of how to use SentenceTransformers.
|
|
## Pre-Trained Models
Various pre-trained models exist, optimized for many tasks. For a full list, see **[Pretrained Models](pretrained_models.md)**.
|
|
|
|
|
|
## Training your own Embeddings
|
|
Training your own sentence embedding models for all types of use cases is easy and often requires only minimal coding effort. For a comprehensive tutorial, see [Training/Overview](training/overview.md).
|
|
You can also easily extend existing sentence embedding models to **further languages**. For details, see [Multi-Lingual Training](../examples/training/multilingual/README).
|
|