RafaelUI committed on
Commit cafa2c9 · verified · 1 Parent(s): 143ea6f

Upload folder using huggingface_hub

.DS_Store ADDED
Binary file (6.15 kB).
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
1_Pooling/config.json ADDED
@@ -0,0 +1,30 @@
+ {
+   "add_cross_attention": false,
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": null,
+   "classifier_dropout": null,
+   "dtype": "float32",
+   "eos_token_id": null,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 384,
+   "initializer_range": 0.02,
+   "intermediate_size": 1536,
+   "is_decoder": false,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "tie_word_embeddings": true,
+   "transformers_version": "5.5.4",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 250037
+ }
1_Pooling/config_sentence_transformers.json ADDED
@@ -0,0 +1,14 @@
+ {
+   "__version__": {
+     "pytorch": "2.11.0",
+     "sentence_transformers": "5.4.1",
+     "transformers": "5.5.4"
+   },
+   "default_prompt_name": null,
+   "model_type": "SentenceTransformer",
+   "prompts": {
+     "document": "",
+     "query": ""
+   },
+   "similarity_fn_name": "cosine"
+ }
1_Pooling/modules.json ADDED
@@ -0,0 +1,14 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.base.modules.transformer.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.sentence_transformer.modules.pooling.Pooling"
+   }
+ ]
1_Pooling/sentence_bert_config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "transformer_task": "feature-extraction",
+   "modality_config": {
+     "text": {
+       "method": "forward",
+       "method_output_name": "last_hidden_state"
+     }
+   },
+   "module_output_name": "token_embeddings"
+ }
1_Pooling/tokenizer_config.json ADDED
@@ -0,0 +1,23 @@
+ {
+   "backend": "tokenizers",
+   "bos_token": "<s>",
+   "cls_token": "<s>",
+   "do_lower_case": true,
+   "eos_token": "</s>",
+   "is_local": false,
+   "mask_token": "<mask>",
+   "max_length": 128,
+   "model_max_length": 128,
+   "pad_to_multiple_of": null,
+   "pad_token": "<pad>",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "</s>",
+   "stride": 0,
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "TokenizersBackend",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "<unk>"
+ }
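To make the `model_max_length`, `truncation_side`, and `padding_side` settings above concrete, here is a minimal pure-Python sketch of what right-side truncation and right-side padding do to a list of token IDs. The helper name `truncate_and_pad` and the `pad_id` default are hypothetical and exist only for illustration; the real tokenizer also inserts special tokens and only pads when asked to, which this sketch ignores.

```python
def truncate_and_pad(token_ids, max_length=128, pad_id=0):
    """Illustrative only: mimic right-side truncation and right-side padding.

    `pad_id` is a placeholder value, not read from the tokenizer.
    """
    # truncation_side == "right": keep the first max_length IDs, drop the tail.
    kept = list(token_ids[:max_length])
    # Attention mask: 1 for real tokens, 0 for padding positions.
    mask = [1] * len(kept)
    # padding_side == "right": append pad IDs until max_length is reached.
    n_pad = max_length - len(kept)
    return kept + [pad_id] * n_pad, mask + [0] * n_pad


if __name__ == "__main__":
    ids, mask = truncate_and_pad([5, 6, 7], max_length=5)
    print(ids, mask)
```

With these settings, anything past the 128th token of an input is silently dropped from the end, which is why long passages need to be split before encoding.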
README.md ADDED
@@ -0,0 +1,117 @@
+ ---
+ language:
+ - en
+ - ru
+ - fr
+ - de
+ - es
+ - it
+ - pt
+ license: apache-2.0
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - literary
+ - semantic-search
+ - multilingual
+ base_model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
+ datasets:
+ - rafaelui/literary-text-pairs
+ pipeline_tag: sentence-similarity
+ ---
+
+ # literary-minilm
+
+ A multilingual semantic search model fine-tuned for **literary text**: novels, short stories, and other fiction. Built on top of [`paraphrase-multilingual-MiniLM-L12-v2`](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2), this model is optimized to understand narrative language, character descriptions, plot dynamics, and thematic queries across 7 languages.
+
+ Developed for use in [Impulse](https://apps.apple.com/us/app/impulse-writers-studio/id6761264842?l=ru&mt=12), a macOS writing app for authors.
+
+ ## Model Details
+
+ | Property | Value |
+ |---|---|
+ | Base model | paraphrase-multilingual-MiniLM-L12-v2 |
+ | Architecture | BERT (12 layers, 384 hidden size) |
+ | Max sequence length | 128 tokens |
+ | Languages | English, Russian, French, German, Spanish, Italian, Portuguese |
+ | Training pairs | ~134,000 |
+ | Output dimension | 384 |
+ | License | Apache 2.0 |
+
+ ## Why literary-minilm?
+
+ General-purpose multilingual embeddings are trained on a broad mix of content: Wikipedia, Reddit, StackOverflow, scientific papers, and web crawls. This works well for factual retrieval but poorly for fiction, where meaning is conveyed through metaphor, subtext, character voice, and narrative context.
+
+ **literary-minilm** is domain-adapted exclusively on fiction. The result is a model that understands queries like:
+
+ - *"scene where the hero doubts himself"*
+ - *"description of a mysterious city at night"*
+ - *"character who sacrifices everything for love"*
+
+ ## Training Data
+
+ The model was fine-tuned on a custom dataset of ~134,000 literary text pairs across 7 languages, generated from:
+
+ - **English**: Project Gutenberg (via `emozilla/pg19`) and `manu/project_gutenberg`
+ - **Russian**: RusLit corpus (classical Russian prose) and `cointegrated/taiga_stripped_proza`
+ - **French, German, Spanish, Italian, Portuguese**: OPUS Books (`Helsinki-NLP/opus_books`) and `manu/project_gutenberg`
+
+ Each training example consists of:
+ - `anchor`: a passage of literary text (up to 256 tokens)
+ - `semantic_phrase`: a short natural-language search query describing the passage (5–10 words)
+ - `paraphrase`: a rephrasing of the anchor in different words
+
+ Training pairs were generated using a combination of YandexGPT, GPT-4.1-nano, and Qwen3 235B, then filtered for quality.
+
+ ## Usage
+
+ ### With sentence-transformers
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+ from sentence_transformers.util import cos_sim
+
+ model = SentenceTransformer("rafaelui/literary-minilm")
+
+ query = "hero says goodbye to a friend before war"
+ passages = [
+     "He embraced his friend and held on for a long time, knowing he would never see him again.",
+     "The sun was bright, birds sang in the garden.",
+     "She closed the book and sat thinking about what she had read."
+ ]
+
+ query_emb = model.encode(query)        # one 384-dimensional vector
+ passage_embs = model.encode(passages)  # one vector per passage
+
+ scores = cos_sim(query_emb, passage_embs)[0]  # similarity of the query to each passage
+ for passage, score in zip(passages, scores):
+     print(f"{score:.3f}: {passage}")
+ ```
+
+ Output:
+ ```
+ 0.621: He embraced his friend and held on for a long time...
+ -0.082: The sun was bright, birds sang in the garden.
+ 0.275: She closed the book and sat thinking about what she had read.
+ ```
+
+ ### CoreML (iOS / macOS)
+
+ A compiled `.mlpackage` is available for direct use in Apple platform apps. See the [Releases](https://huggingface.co/rafaelui/literary-minilm/tree/main) section.
+
+ ## Limitations
+
+ - Optimized for **fiction only**; performance on factual, technical, or conversational text may be lower than the base model
+ - Context window is limited to **128 tokens**; longer passages should be chunked
+ - Asian languages (Chinese, Japanese, Korean) are not included in fine-tuning; the model falls back to base multilingual capabilities for these
+
+ ## Author
+
+ **Alexei Goncharov**, [ImpulseLeap](https://www.impulseleap.com)
+
+ Built for [Impulse](https://apps.apple.com/us/app/impulse-writers-studio/id6761264842?l=ru&mt=12), a macOS app for writers.
+
+ ## License
+
+ Apache 2.0
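The README's Limitations section says that passages longer than 128 tokens should be chunked but does not say how. One possible approach, sketched below, is to split the text into overlapping word windows before encoding. The helper `chunk_words` and its defaults (`max_words=90`, `overlap=20`) are assumptions made for illustration and are not part of the model or of sentence-transformers; word count is only a rough proxy for subword tokens, so the window is deliberately kept below 128.

```python
def chunk_words(text, max_words=90, overlap=20):
    """Split text into overlapping word windows (hypothetical helper).

    Words approximate tokens only loosely; a subword tokenizer usually
    produces more tokens than words, hence the conservative default.
    """
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    if not words:
        return []
    step = max_words - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window already reaches the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(200))
    for chunk in chunk_words(sample):
        print(len(chunk.split()), chunk[:30])
```

Each chunk can then be passed to `model.encode` as in the README's usage example, with a passage scored by, for instance, its best-matching chunk. That scoring choice is likewise an assumption, not something the README prescribes.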
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b20cb0c384f0e394db15dd6ff520ef92d5919d290d1fa7e8d75f58508b159832
+ size 470637392
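As a rough sanity check, the LFS size above can be compared with the parameter count implied by the `config.json` values in this commit. The sketch below assumes a standard BERT layout (word, position, and token-type embeddings; per-layer attention and feed-forward blocks with LayerNorm; a final pooler layer) stored as `float32`. That layout is an assumption based on `"architectures": ["BertModel"]`, not something read from the weights file.

```python
# Values copied from the config.json hunk in this commit.
hidden, inter, layers = 384, 1536, 12
vocab, max_pos, type_vocab = 250037, 512, 2


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and bias
embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
attention = 4 * linear(hidden, hidden) + layer_norm  # Q, K, V, output projection
feed_forward = linear(hidden, inter) + linear(inter, hidden) + layer_norm
pooler = linear(hidden, hidden)  # assumption: the checkpoint includes a pooler

total_params = embeddings + layers * (attention + feed_forward) + pooler
expected_bytes = total_params * 4  # float32 = 4 bytes per parameter
lfs_size = 470637392               # from the LFS pointer above

print(total_params, expected_bytes, lfs_size - expected_bytes)
```

Under these assumptions the model has about 117.7 million parameters, roughly 96 million of them in the 250,037-entry word embedding table, and the weights account for all but about 22 kB of the file. That remainder would be consistent with file-format metadata, though this is an inference and not something verified against the file.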
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cad551d5600a84242d0973327029452a1e3672ba6313c2a3c3d69c4310e12719
+ size 17082987