	---
	language:
	- en
	- ru
	- fr
	- de
	- es
	- it
	- pt
	license: apache-2.0
	tags:
	- sentence-transformers
	- sentence-similarity
	- feature-extraction
	- literary
	- semantic-search
	- multilingual
	base_model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
	datasets:
	- rafaelui/literary-text-pairs
	pipeline_tag: sentence-similarity
	---

	# literary-minilm

	A multilingual semantic search model fine-tuned for literary text — novels, short stories, and other fiction. Built on top of [`paraphrase-multilingual-MiniLM-L12-v2`](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2), this model is optimized to understand narrative language, character descriptions, plot dynamics, and thematic queries across 7 languages.

	Developed for use in [Impulse](https://apps.apple.com/us/app/impulse-writers-studio/id6761264842?l=ru&mt=12) — a macOS writing app for authors.

	## Model Details

	| Property | Value |
	|---|---|
	| Base model | paraphrase-multilingual-MiniLM-L12-v2 |
	| Architecture | BERT (12 layers, 384 hidden size) |
	| Max sequence length | 128 tokens |
	| Languages | English, Russian, French, German, Spanish, Italian, Portuguese |
	| Training pairs | ~134,000 |
	| Output dimension | 384 |
	| License | Apache 2.0 |

	## Why literary-minilm?

	General-purpose multilingual embedding models are trained on a broad mix of content: Wikipedia, Reddit, Stack Overflow, scientific papers, and web crawls. That mix suits factual retrieval but is a weaker fit for fiction, where meaning is conveyed through metaphor, subtext, character voice, and narrative context.

	literary-minilm is fine-tuned exclusively on fiction, so it is better suited to queries such as:

	- "scene where the hero doubts himself"
	- "description of a mysterious city at night"
	- "character who sacrifices everything for love"

	## Training Data

	The model was fine-tuned on a custom dataset of ~134,000 literary text pairs across 7 languages, generated from:

	- English: Project Gutenberg (via `emozilla/pg19`) and `manu/project_gutenberg`
	- Russian: RusLit corpus (classical Russian prose) and `cointegrated/taiga_stripped_proza`
	- French, German, Spanish, Italian, Portuguese: OPUS Books (`Helsinki-NLP/opus_books`) and `manu/project_gutenberg`

	Each training example consists of:
	- `anchor` — a passage of literary text (up to 256 tokens)
	- `semantic_phrase` — a short natural-language search query describing the passage (5–10 words)
	- `paraphrase` — a rephrasing of the anchor in different words

	Training pairs were generated using a combination of YandexGPT, GPT-4.1-nano, and Qwen3 235B, then filtered for quality.
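
The three fields listed above map naturally onto a small record type. The sketch below is illustrative only: the class name `LiteraryExample`, the helper `to_training_pairs`, and the choice to pair the anchor with each of the other two fields are assumptions, since this card does not say how the pairs were fed to the training loss. The paraphrase text in the sample record is invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LiteraryExample:
    """One training record with the three fields named in this card."""
    anchor: str           # passage of literary text
    semantic_phrase: str  # short search-style description of the passage
    paraphrase: str       # the anchor reworded


def to_training_pairs(example: LiteraryExample) -> list[tuple[str, str]]:
    """Expand one record into (anchor, positive) pairs.

    Assumption: both the query-style phrase and the paraphrase are treated
    as positives for the anchor. The card does not confirm this pairing.
    """
    return [
        (example.anchor, example.semantic_phrase),
        (example.anchor, example.paraphrase),
    ]


# Sample record for illustration only; the paraphrase is invented
example = LiteraryExample(
    anchor="He embraced his friend and held on for a long time, knowing he would never see him again.",
    semantic_phrase="hero says goodbye to a friend before war",
    paraphrase="He hugged his friend for a long while, aware that this was their last meeting.",
)

for anchor, positive in to_training_pairs(example):
    print(f"{positive!r} -> {anchor[:40]!r}")
```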

	## Usage

	### With sentence-transformers

	```python
	from sentence_transformers import SentenceTransformer
	from sentence_transformers.util import cos_sim

	model = SentenceTransformer("rafaelui/literary-minilm")

	query = "hero says goodbye to a friend before war"
	passages = [
	    "He embraced his friend and held on for a long time, knowing he would never see him again.",
	    "The sun was bright, birds sang in the garden.",
	    "She closed the book and sat thinking about what she had read.",
	]

	# Encode the query and the candidate passages into embedding vectors
	query_emb = model.encode(query)
	passage_embs = model.encode(passages)

	# cos_sim returns a 1 x len(passages) matrix; take its only row
	scores = cos_sim(query_emb, passage_embs)[0]
	for passage, score in zip(passages, scores):
	    print(f"{score:.3f}: {passage}")
	```

	Example output (the first passage is truncated here):
	```
	0.621: He embraced his friend and held on for a long time...
	-0.082: The sun was bright, birds sang in the garden.
	0.275: She closed the book and sat thinking about what she had read.
	```
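
In a search setting the scores would usually be sorted so the best match comes first. Here is a minimal sketch in plain Python, using the scores printed above as fixed inputs so it runs without downloading the model. The helper name `rank_passages` and its `top_k` parameter are hypothetical and not part of sentence-transformers.

```python
def rank_passages(passages, scores, top_k=None):
    """Return (score, passage) tuples sorted from most to least similar."""
    ranked = sorted(zip(scores, passages), key=lambda pair: pair[0], reverse=True)
    return ranked if top_k is None else ranked[:top_k]


passages = [
    "He embraced his friend and held on for a long time, knowing he would never see him again.",
    "The sun was bright, birds sang in the garden.",
    "She closed the book and sat thinking about what she had read.",
]
# Scores copied from the example output in this card, not recomputed here
scores = [0.621, -0.082, 0.275]

for score, passage in rank_passages(passages, scores, top_k=2):
    print(f"{score:.3f}: {passage}")
```

With these inputs the farewell passage ranks first and the garden description is dropped by `top_k=2`. With real embeddings, `scores` could be obtained from `cos_sim(query_emb, passage_embs)[0].tolist()`.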

	### CoreML (iOS / macOS)

	A compiled `.mlpackage` is available for direct use in Apple platform apps. See the repository's [Files and versions](https://huggingface.co/rafaelui/literary-minilm/tree/main) tab.

	## Limitations

	- Optimized for fiction only; performance on factual, technical, or conversational text may be lower than the base model's
	- Context window is limited to 128 tokens, so longer passages should be chunked
	- East Asian languages (Chinese, Japanese, Korean) were not included in fine-tuning; for these, the model relies on the base model's multilingual capabilities
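
The chunking advice above can be sketched as a sliding window with overlap. This version counts whitespace-separated words as a rough stand-in for tokens, which is an assumption: the model's subword tokenizer usually produces more tokens than words, so the default of 80 words is a conservative guess rather than a measured value. The function name `chunk_words` and its parameters are hypothetical.

```python
def chunk_words(text, max_words=80, overlap=20):
    """Split text into overlapping windows of at most max_words words.

    Words are a rough proxy for tokens (assumption); for exact limits,
    count tokens with the model's own tokenizer instead.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in [0, max_words)")

    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reaches the end of the text
    return chunks


# Synthetic 200-word passage, for illustration only
long_passage = " ".join(f"word{i}" for i in range(200))
chunks = chunk_words(long_passage)
print(len(chunks), "chunks; last chunk has", len(chunks[-1].split()), "words")
```

Each chunk can then be encoded separately with `model.encode(chunks)` and searched like any other passage.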

	## Author

	Alexei Goncharov — [ImpulseLeap](https://www.impulseleap.com)

	Built for [Impulse](https://apps.apple.com/us/app/impulse-writers-studio/id6761264842?l=ru&mt=12), a macOS app for writers.

	## License

	Apache 2.0