---
title: Embedding Bench
emoji: 📝
colorFrom: blue
colorTo: purple
sdk: streamlit
sdk_version: 1.56.0
app_file: app.py
pinned: false
license: mit
---

# embedding-bench

Compare text embedding models on quality, speed, and memory. Includes a Streamlit web UI and a CLI.

## Features

- **40+ pre-configured models** — sentence-transformers, BGE, E5, GTE, Nomic, Jina, Arctic, and more
- **4 backends** — sbert (PyTorch), fastembed (ONNX), gguf (llama-cpp), libembedding
- **7 built-in datasets** — STS Benchmark, Natural Questions, MS MARCO, SQuAD, TriviaQA, GooAQ, HotpotQA
- **Custom datasets** — upload your own CSV/TSV or load any HuggingFace dataset
- **Custom models** — add any HuggingFace embedding model from the UI
- **11 retrieval metrics** — MRR, MAP@k, NDCG@k, Precision@k, Recall@k (all configurable)
- **LLM as a Judge** — use OpenAI or Anthropic to rate retrieval relevance
- **Interactive charts** — Plotly-powered, with hover, zoom, and PNG export

## Setup

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

## Web UI

```bash
streamlit run app.py
```

The sidebar has three sections:

1. **Models** — select from the registry or add a custom HuggingFace model
2. **Datasets** — pick built-in presets, upload a CSV/TSV, or add any HuggingFace dataset
3. **Evaluation** — configure metrics, speed/memory benchmarks, LLM judge, and max pairs

## Custom datasets

You can add datasets in two ways from the sidebar:

- **Upload file** — CSV or TSV (max 50 MB, 50k rows) with a query column and a passage column. Optionally include a numeric score column for Spearman correlation; otherwise retrieval metrics (MRR, Recall@k, etc.) are used.
- **HuggingFace Hub** — provide the dataset ID (e.g. `mteb/stsbenchmark-sts`), config, split, and column names. The dataset is validated on add.
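For reference, a minimal scored upload could look like the file below. The column names here are only illustrative; point the UI at whichever columns your file actually uses.

```csv
query,passage,score
"How tall is Mount Everest?","Mount Everest rises 8,849 m above sea level.",5
"How tall is Mount Everest?","Everest was first summited in 1953.",2
```

With a score column present, Spearman correlation is reported; omit it to get the retrieval metrics instead.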

## LLM as a Judge

Enable it in the Evaluation section and provide your OpenAI or Anthropic API key. For each sampled query, the LLM rates the top-5 retrieved passages for relevance (1–5). Results are reported as `judge_avg@1`, `judge_avg@5`, and `judge_nDCG@5`.
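The judge metrics reduce to simple aggregations over the per-passage ratings. A minimal sketch of the standard formulas (function names are illustrative, not the app's internals):

```python
import math

def judge_avg_at_k(ratings, k):
    """Mean LLM rating over the top-k retrieved passages (best-ranked first)."""
    top = ratings[:k]
    return sum(top) / len(top)

def judge_ndcg_at_k(ratings, k=5):
    """NDCG@k over graded ratings: discounted gain vs. the ideal ordering."""
    top = ratings[:k]
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(top))
    ideal = sorted(top, reverse=True)
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

An nDCG of 1.0 means the highest-rated passages were also ranked first.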

## Metrics

| Dimension | Metrics | Method |
|---|---|---|
| Quality (scored) | Spearman | Cosine similarity vs gold scores |
| Quality (pairs) | MRR, MAP@5/10, NDCG@5/10, Precision@1/5/10, Recall@1/5/10 | Retrieval ranking of positive passages |
| LLM Judge | Avg@1, Avg@5, nDCG@5 | LLM relevance ratings on retrieved passages |
| Speed | Median encode time, sent/s | Wall-clock over N runs with warmup |
| Memory | Peak RSS delta (MB) | Isolated subprocess via psutil |
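The scored-quality row boils down to: embed both sides of each pair, take cosine similarity, then rank-correlate against the gold scores. A dependency-free sketch (no tie handling, unlike a production Spearman implementation):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)
```

Feeding the per-pair cosine similarities as `xs` and the gold scores as `ys` yields the Spearman figure reported in the quality column.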

## CLI

```bash
# Full benchmark (quality + speed + memory)
python bench.py

# Specific models
python bench.py --models mpnet bge-small

# Compare backends
python bench.py --models bge-small bge-small-fe

# Skip expensive evals
python bench.py --skip-quality
python bench.py --skip-memory

# Multiple datasets with pair limit
python bench.py --models mpnet bge-small \
  --datasets sts natural-questions squad \
  --max-pairs 1000 --skip-speed --skip-memory

# Custom HF dataset
python bench.py --dataset my-org/my-pairs \
  --query-col query --passage-col passage --score-col none

# Export
python bench.py --csv results.csv --charts ./results
```

## Built-in dataset presets

| Preset | HF Dataset | Type |
|---|---|---|
| `sts` | `mteb/stsbenchmark-sts` | Scored (Spearman) |
| `natural-questions` | `sentence-transformers/natural-questions` | Retrieval |
| `msmarco` | `sentence-transformers/msmarco-bm25` | Retrieval |
| `squad` | `sentence-transformers/squad` | Retrieval |
| `trivia-qa` | `sentence-transformers/trivia-qa` | Retrieval |
| `gooaq` | `sentence-transformers/gooaq` | Retrieval |
| `hotpotqa` | `sentence-transformers/hotpotqa` | Retrieval |
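For the retrieval-type presets, each query has a known positive passage, and the metrics score where that passage lands in the model's ranking. A minimal sketch of two of them (illustrative helpers, not the project's internals):

```python
def mrr(ranked_ids, positive_id):
    """Reciprocal rank of the positive passage (0.0 if not retrieved)."""
    for rank, pid in enumerate(ranked_ids, start=1):
        if pid == positive_id:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, positive_id, k):
    """1.0 if the positive passage appears in the top k results, else 0.0."""
    return 1.0 if positive_id in ranked_ids[:k] else 0.0
```

Averaging these per-query values over the dataset gives the figures shown in the results table.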

## CLI flags

```text
--models            Models to benchmark (default: all)
--corpus-size       Sentences for speed/memory tests (default: 1000)
--batch-size        Encoding batch size (default: 64)
--num-runs          Speed benchmark runs (default: 3)
--skip-quality      Skip quality evaluation
--skip-speed        Skip speed measurement
--skip-memory       Skip memory measurement
--datasets          Dataset presets (default: sts)
--max-pairs         Limit pairs per dataset
--dataset           Custom HF dataset (overrides --datasets)
--config            Dataset config/subset name (e.g. 'triplet')
--split             Dataset split (default: test)
--query-col         Query column name (default: sentence1)
--passage-col       Passage column name (default: sentence2)
--score-col         Score column (default: score, 'none' for pairs)
--score-scale       Score normalization divisor (default: 5.0)
--csv               Export results to CSV
--charts            Save charts to directory
```
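`--num-runs` and `--corpus-size` feed a straightforward timing loop: warm up once untimed, then take the median wall-clock time over N runs, as described in the Metrics table. A sketch under those assumptions (the `encode` callable stands in for any backend; this is not the project's actual API):

```python
import statistics
import time

def bench_speed(encode, sentences, num_runs=3, warmup=1):
    """Median encode time (s) and sentences/second over num_runs, after warmup."""
    for _ in range(warmup):
        encode(sentences)  # untimed warmup, lets caches and lazy init settle
    times = []
    for _ in range(num_runs):
        start = time.perf_counter()
        encode(sentences)
        times.append(time.perf_counter() - start)
    median = statistics.median(times)
    return median, len(sentences) / median
```

The median (rather than the mean) keeps one slow, noisy run from skewing the reported latency.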

## Adding a model

From the web UI, click **Add Custom Model** in the sidebar — just provide a display name and a HuggingFace model ID.

Or edit `models.py` directly:

```python
"e5-small": ModelConfig(
    name="e5-small-v2",
    model_id="intfloat/e5-small-v2",
),
```

## Project structure

```text
embedding-bench/
├── app.py               # Streamlit web UI
├── bench.py             # CLI entry point
├── models.py            # Model registry (40+ models)
├── wrapper.py           # Backend wrappers (sbert, fastembed, gguf, libembedding)
├── corpus.py            # Sentence corpus builder
├── dataset_config.py    # Dataset presets and configuration
├── report.py            # Table formatting, CSV export, charts (CLI)
├── evals/
│   ├── quality.py       # Quality evaluation (Spearman + retrieval metrics)
│   ├── speed.py         # Latency measurement
│   ├── memory.py        # Memory measurement
│   └── llm_judge.py     # LLM-as-a-Judge evaluation
└── requirements.txt
```