---
title: Embedding Bench
emoji: 📐
colorFrom: blue
colorTo: purple
sdk: streamlit
sdk_version: "1.56.0"
app_file: app.py
pinned: false
license: mit
---

# embedding-bench

Compare text embedding models on quality, speed, and memory. Includes a Streamlit web UI and a CLI.

## Features

- **40+ pre-configured models** — sentence-transformers, BGE, E5, GTE, Nomic, Jina, Arctic, and more
- **4 backends** — sbert (PyTorch), fastembed (ONNX), gguf (llama-cpp), libembedding
- **7 built-in datasets** — STS Benchmark, Natural Questions, MS MARCO, SQuAD, TriviaQA, GooAQ, HotpotQA
- **Custom datasets** — upload your own CSV/TSV or load any HuggingFace dataset
- **Custom models** — add any HuggingFace embedding model from the UI
- **11 retrieval metrics** — MRR, MAP@k, NDCG@k, Precision@k, Recall@k (all configurable)
- **LLM as a Judge** — use OpenAI or Anthropic to rate retrieval relevance
- **Interactive charts** — Plotly-powered, with hover, zoom, and PNG export

## Setup

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

## Web UI

```bash
streamlit run app.py
```

The sidebar has three sections:

1. **Models** — select from the registry or add a custom HuggingFace model
2. **Datasets** — pick built-in presets, upload a CSV/TSV, or add any HuggingFace dataset
3. **Evaluation** — configure metrics, speed/memory benchmarks, LLM judge, and max pairs

### Custom datasets

You can add datasets two ways from the sidebar:

- **Upload file** — CSV or TSV (max 50 MB, 50k rows) with a query column and a passage column. Optionally include a numeric score column for Spearman correlation; otherwise retrieval metrics (MRR, Recall@k, etc.) are used.
- **HuggingFace Hub** — provide the dataset ID (e.g. `mteb/stsbenchmark-sts`), config, split, and column names. The dataset is validated on add.

### LLM as a Judge

Enable in the Evaluation section. Provide your OpenAI or Anthropic API key.
For each sampled query, the top-5 retrieved passages are rated for relevance (1–5) by the LLM. Reports judge_avg@1, judge_avg@5, and judge_nDCG@5.

### Metrics

| Dimension | Metrics | Method |
|-----------|---------|--------|
| Quality (scored) | Spearman | Cosine similarity vs gold scores |
| Quality (pairs) | MRR, MAP@5/10, NDCG@5/10, Precision@1/5/10, Recall@1/5/10 | Retrieval ranking of positive passages |
| LLM Judge | Avg@1, Avg@5, nDCG@5 | LLM relevance ratings on retrieved passages |
| Speed | Median encode time, sent/s | Wall-clock over N runs with warmup |
| Memory | Peak RSS delta (MB) | Isolated subprocess via `psutil` |

## CLI

```bash
# Full benchmark (quality + speed + memory)
python bench.py

# Specific models
python bench.py --models mpnet bge-small

# Compare backends
python bench.py --models bge-small bge-small-fe

# Skip expensive evals
python bench.py --skip-quality
python bench.py --skip-memory

# Multiple datasets with pair limit
python bench.py --models mpnet bge-small \
  --datasets sts natural-questions squad \
  --max-pairs 1000 --skip-speed --skip-memory

# Custom HF dataset
python bench.py --dataset my-org/my-pairs \
  --query-col query --passage-col passage --score-col none

# Export
python bench.py --csv results.csv --charts ./results
```

### Built-in dataset presets

| Preset | HF Dataset | Type |
|--------|-----------|------|
| `sts` | `mteb/stsbenchmark-sts` | Scored (Spearman) |
| `natural-questions` | `sentence-transformers/natural-questions` | Retrieval |
| `msmarco` | `sentence-transformers/msmarco-bm25` | Retrieval |
| `squad` | `sentence-transformers/squad` | Retrieval |
| `trivia-qa` | `sentence-transformers/trivia-qa` | Retrieval |
| `gooaq` | `sentence-transformers/gooaq` | Retrieval |
| `hotpotqa` | `sentence-transformers/hotpotqa` | Retrieval |

### CLI flags

```
--models          Models to benchmark (default: all)
--corpus-size     Sentences for speed/memory tests (default: 1000)
--batch-size      Encoding batch size (default: 64)
--num-runs        Speed benchmark runs (default: 3)
--skip-quality    Skip quality evaluation
--skip-speed      Skip speed measurement
--skip-memory     Skip memory measurement
--datasets        Dataset presets (default: sts)
--max-pairs       Limit pairs per dataset
--dataset         Custom HF dataset (overrides --datasets)
--config          Dataset config/subset name (e.g. 'triplet')
--split           Dataset split (default: test)
--query-col       Query column name (default: sentence1)
--passage-col     Passage column name (default: sentence2)
--score-col       Score column (default: score, 'none' for pairs)
--score-scale     Score normalization divisor (default: 5.0)
--csv             Export results to CSV
--charts          Save charts to directory
```

## Adding a model

From the web UI, click **Add Custom Model** in the sidebar — just provide a display name and a HuggingFace model ID.

Or edit `models.py` directly:

```python
"e5-small": ModelConfig(
    name="e5-small-v2",
    model_id="intfloat/e5-small-v2",
),
```

## Project structure

```
embedding-bench/
├── app.py              # Streamlit web UI
├── bench.py            # CLI entry point
├── models.py           # Model registry (40+ models)
├── wrapper.py          # Backend wrappers (sbert, fastembed, gguf, libembedding)
├── corpus.py           # Sentence corpus builder
├── dataset_config.py   # Dataset presets and configuration
├── report.py           # Table formatting, CSV export, charts (CLI)
├── evals/
│   ├── quality.py      # Quality evaluation (Spearman + retrieval metrics)
│   ├── speed.py        # Latency measurement
│   ├── memory.py       # Memory measurement
│   └── llm_judge.py    # LLM-as-a-Judge evaluation
└── requirements.txt
```
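
To make the "Quality (pairs)" row of the Metrics table more concrete, here is a minimal sketch of how retrieval metrics can be derived from the rank of each query's positive passage. It is an illustration only and is not taken from `evals/quality.py`: the function names `positive_ranks` and `retrieval_metrics` are hypothetical, the vectors are made-up toy data, and it assumes each query has exactly one relevant passage (the one at the same index) and that every passage in the set is a candidate.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def positive_ranks(query_vecs, passage_vecs):
    """1-based rank of passage i among all passages for query i.

    Assumption: query i's only relevant passage is passage i.
    """
    ranks = []
    for i, q in enumerate(query_vecs):
        sims = [cosine(q, p) for p in passage_vecs]
        # Rank = 1 + number of passages scoring strictly higher than the positive.
        ranks.append(1 + sum(s > sims[i] for s in sims))
    return ranks


def retrieval_metrics(ranks, k=5):
    """MRR, Recall@k, Precision@k and NDCG@k for one positive per query."""
    n = len(ranks)
    hits = [r for r in ranks if r <= k]
    return {
        "mrr": sum(1.0 / r for r in ranks) / n,
        f"recall@{k}": len(hits) / n,
        f"precision@{k}": len(hits) / (n * k),
        # With a single positive the ideal DCG is 1, so NDCG = 1 / log2(rank + 1).
        f"ndcg@{k}": sum(1.0 / math.log2(r + 1) for r in hits) / n,
    }


# Toy 2-d "embeddings", invented for illustration.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
passages = [[0.9, 0.1], [0.8, 0.6], [0.7, 0.7]]

ranks = positive_ranks(queries, passages)
metrics = retrieval_metrics(ranks, k=1)
print(ranks, metrics)
```

In this toy set the second query ranks its positive passage second, so MRR is (1 + 1/2 + 1) / 3 while Recall@1 counts only the two queries whose positive is ranked first. A real run would replace the toy vectors with embeddings produced by one of the configured backends.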