IconClip-ViT-L-14: Text Encoder (ONNX, INT8 quantized)

INT8 quantized ONNX export of the text-encoder half of likaixin/IconClip-ViT-L-14, a CLIP variant fine-tuned for icon imagery. Ships in the standard ONNX layout (onnx/model_quantized.onnx + tokenizer + config.json) that any ONNX-runtime caller can load directly; the library_name frontmatter on this card also surfaces an auto-generated transformers.js code snippet for browser consumers.


Why this exists

CLIP is a bi-encoder: an image tower that maps icons → 768-d vectors, and a text tower that maps queries → 768-d vectors in the same space. At query time you encode the user's text and cosine-rank it against pre-computed icon image vectors. Two design decisions follow from that:

  1. At query time you only need the text tower; the image vectors are already in your catalogue from the offline indexing pass. The image tower runs once per icon, offline (typically on a GPU); shipping it to every consumer wastes ~250 MB of weights they never call. The companion benchmark dataset Cortiq-Labs/iconclip-search-benchmark ships pre-computed image embeddings for 22 827 icons across 11 libraries, ready to load into your vector store.
  2. The fp32 text encoder is overkill for inference. INT8 dynamic quantization preserves 0.992 cosine parity with fp32, cuts the size 4×, and doubles steady-state throughput.
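
The cosine-ranking step can be sketched in a few lines. This is a minimal brute-force illustration, not the production retrieval path; the function names and the `{ id, vec }` shape of the icon list are assumptions made for the example.

```javascript
// Hypothetical helpers: brute-force top-k ranking over L2-normalised vectors.
// For unit vectors, cosine similarity reduces to a plain dot product.
function dot(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

function l2Normalize(vec) {
  const norm = Math.sqrt(dot(vec, vec)) || 1; // guard against zero vectors
  return Float32Array.from(vec, (x) => x / norm);
}

// icons: [{ id: string, vec: Float32Array }], each vec already L2-normalised
// (assumed shape). Returns the k best matches as { id, score }[].
function cosineTopK(queryVec, icons, k) {
  const q = l2Normalize(queryVec);
  return icons
    .map(({ id, vec }) => ({ id, score: dot(q, vec) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

A linear scan like this is usually fine for a few tens of thousands of 768-d vectors; an approximate index only becomes necessary for much larger catalogues or tighter latency budgets.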

This artefact is just the text encoder, INT8-quantized:

| | fp32 (upstream) | q8 (this) |
|---|---|---|
| Download | 472 MB | 119 MB (4× smaller) |
| Steady-state inference (Node) | 17.8 ms p50 | 8.6 ms p50 (2× faster) |
| Cosine parity vs fp32 | n/a | 0.992 mean |
| MRR@50 retention vs fp32 | n/a | 99.6% |
| ONNX layout | n/a | onnx/model_quantized.onnx + tokenizer at root |

The HF Hub CDN serves the file at no cost to consumers, and loader libraries like transformers.js and optimum.onnxruntime cache it client-side after the first visit.


What it's for

| Use case | Status |
|---|---|
| Browser-side semantic icon search (typed query → ranked icon list) | ✅ Primary target |
| Server-side icon retrieval at <50 ms per query, single-core Node | ✅ Also supported via onnxruntime-node |
| Re-ranking lexical icon search results with semantic similarity | ✅ DBSF-fuse with BM25 (see usage below) |
| General-purpose CLIP text encoding (Flickr30k retrieval, etc.) | ⚠️ Works but the underlying IconClip model is biased toward icon-style imagery. For general CLIP, use the upstream LAION base directly. |
| Image encoding | ❌ This repo ships only the text half. Pre-computed image embeddings for 22 827 icons across 11 libraries are bundled in the companion dataset Cortiq-Labs/iconclip-search-benchmark (data/embeddings/iconclip-vit-l-14.parquet); for novel icons, run the upstream image tower at index time. |

Benchmark results

All numbers measured on the IconClip Search Benchmark, a sibling dataset released alongside this model in the BEIR information-retrieval benchmark format. 120 paraphrase queries across 24 UI-intent categories, evaluated against a 22 800-icon corpus spanning 11 open-license icon libraries.

Quality: quantization fidelity

| Metric | Value | Interpretation |
|---|---|---|
| Cosine parity (q8 ↔ fp32) | 0.992 mean (per-query, L2-normalised) | INT8 quantization barely shifts the output direction: a cosine of 0.992 corresponds to an angle of roughly 7° between the q8 and fp32 vectors. |
| MRR@50 retained | 99.6% of fp32 ranking quality | The few queries that change rank under q8 cluster around ties, not meaningful drops. |
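
The parity figure is a mean of per-query cosines between the q8 and fp32 outputs. A minimal sketch of that computation, assuming the two arrays hold the embeddings for the same queries in the same order (the function names are illustrative, not part of any shipped harness):

```javascript
// Cosine similarity between two equal-length vectors. Normalising inside the
// function means the inputs do not need to be L2-normalised beforehand.
function cosine(a, b) {
  let dotAB = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dotAB += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dotAB / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Mean per-query cosine between the quantized and reference embeddings.
function meanCosineParity(q8Vecs, fp32Vecs) {
  if (q8Vecs.length !== fp32Vecs.length) {
    throw new Error("both arms must embed the same queries");
  }
  let total = 0;
  for (let i = 0; i < q8Vecs.length; i++) {
    total += cosine(q8Vecs[i], fp32Vecs[i]);
  }
  return total / q8Vecs.length;
}
```

Reporting the minimum alongside the mean can also be useful, since a high mean may hide a small number of queries that drift noticeably.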

Quality: text→image task (the cross-modal task CLIP was trained for)

CLIP-family bi-encoders run on the t2i task: encode the query with each model's text tower, encode each icon's 512×512 raster with its image tower, then cosine-rank. Lucide slice, 1 695 icons × 120 queries.

| # | Model | MRR@10 | MRR@50 | R@10 | R@50 | nDCG@10 |
|---|---|---|---|---|---|---|
| 1 | IconClip-ViT-L-14 (this q8 text + cached image vecs) | 0.495 | 0.505 | 0.271 | 0.487 | 0.345 |
| 2 | google/siglip-base-patch16-224 | 0.442 | 0.451 | 0.282 | 0.463 | 0.336 |
| 3 | laion/CLIP-ViT-L-14-laion2B-s32B-b82K (the LAION CLIP IconClip was fine-tuned from) | 0.392 | 0.399 | 0.234 | 0.406 | 0.280 |
| 4 | openai/clip-vit-large-patch14 | 0.367 | 0.378 | 0.187 | 0.341 | 0.239 |
| 5 | openai/clip-vit-base-patch32 | 0.287 | 0.298 | 0.116 | 0.254 | 0.170 |

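Lift over OpenAI CLIP-L/14 (row 4): MRR@50 0.378 → 0.505 (about +34% relative), nDCG@10 0.239 → 0.345 (about +44%). Against the LAION CLIP-L/14 in row 3, the lift is MRR@50 0.399 → 0.505 (about +27%) and nDCG@10 0.280 → 0.345 (about +23%). These gaps reflect the icon-domain fine-tune the upstream IconClip authors performed.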

Quality: text→text task (out of distribution for CLIP)

Same dataset, but document side is the icon's text field (names + tags + aliases + category synonyms) rather than its image. CLIP-style text encoders are bi-encoders aligned with image features, so on text-only retrieval they sit out-of-distribution against dedicated sentence transformers. Numbers are reported as a controlled comparison, not as a recommendation for using IconClip in a text-only setup.

| # | Model | MRR@10 | MRR@50 | nDCG@10 |
|---|---|---|---|---|
| 1 | BAAI/bge-small-en-v1.5 | 0.576 | 0.583 | 0.436 |
| 2 | sentence-transformers/all-MiniLM-L6-v2 | 0.566 | 0.570 | 0.432 |
| 3 | intfloat/e5-small-v2 | 0.527 | 0.533 | 0.420 |
| 4 | IconClip-ViT-L-14 text encoder (this q8) | 0.438 | 0.447 | 0.283 |
| 5 | openai/clip-vit-large-patch14 (text tower only) | 0.269 | 0.279 | 0.137 |

For text-only retrieval (no image embeddings indexed), use a sentence-transformer. For cross-modal retrieval over an icon catalogue with pre-computed image embeddings, use this artefact.

Quality: quantization-only comparison vs the hashed-TF-IDF baseline

The original v1 release compared q8 to hashed-TF-IDF over the same 22 800-icon corpus. Kept for back-compat:

| Metric | TF-IDF baseline | q8 IconClip | Δ |
|---|---|---|---|
| MRR@50 | 0.111 | 0.403 | +0.292 absolute (≈3.6× relative MRR) |
| Recall@10 | 0.7% | 3.5% | +2.8 pp |

This is in-app full-corpus retrieval. The IconClip MRR@50 is 0.403 here because the production retrieval path uses an HNSW shortlist plus score blending, a small approximation cost over the pure-cosine 0.447 in the t2t leaderboard above. Both numbers are correct; they measure different retrieval stacks on the same model.

Latency: Node (onnxruntime-node 1.26, x86-64 AMD CPU, post-warmup)

| Quantisation | p50 | p95 | p99 | Cold-load |
|---|---|---|---|---|
| q8 (this artefact) | 8.6 ms | 21.5 ms | 27.9 ms | 153 ms |
| fp32 (reference) | 17.8 ms | 53.1 ms | 82.8 ms | 350 ms |

Browser numbers via transformers.js WASM-SIMD will be 1.5-2× higher than these Node numbers on the same hardware, which is still well inside a 200 ms keystroke-debounce budget. Enable cross-origin isolation (COOP + COEP headers) to unlock SharedArrayBuffer + WASM multi-thread for further speedup.
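
Cross-origin isolation comes down to two response headers on the page that hosts the search UI. A minimal sketch, assuming a server whose response object exposes `setHeader(name, value)` as Node's `http.ServerResponse` does; the helper name is hypothetical:

```javascript
// The two response headers that make a page cross-origin isolated.
const ISOLATION_HEADERS = {
  "Cross-Origin-Opener-Policy": "same-origin",
  "Cross-Origin-Embedder-Policy": "require-corp",
};

// Hypothetical helper: works with any response object that exposes
// setHeader(name, value), such as Node's http.ServerResponse.
function applyIsolationHeaders(res) {
  for (const [name, value] of Object.entries(ISOLATION_HEADERS)) {
    res.setHeader(name, value);
  }
  return res;
}

// Usage sketch with Node's built-in http module (not executed here):
//   http.createServer((req, res) => {
//     applyIsolationHeaders(res);
//     /* ...serve the page... */
//   });
// In the page, `self.crossOriginIsolated === true` confirms it took effect.
```

With `require-corp` in place, every cross-origin resource the page loads has to opt in through CORS or a Cross-Origin-Resource-Policy header, so it is worth checking that third-party scripts, fonts, and images still load after enabling it.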


Benchmark corpus + methodology

The benchmark dataset is a separate release, Cortiq-Labs/iconclip-search-benchmark, in the BEIR information-retrieval format (corpus.jsonl + queries.jsonl + qrels/test.tsv). That's the same format MTEB and the BEIR benchmarks adopt, so any BEIR-compatible evaluator can score a custom retriever against this benchmark with no glue code.
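
For consumers who want to read the three files without a BEIR library, a minimal parsing sketch follows. It assumes the usual BEIR conventions (one JSON object per line keyed by `_id`, and a qrels TSV with a header row followed by query-id, corpus-id, score columns); the function names are illustrative.

```javascript
// corpus.jsonl / queries.jsonl: one JSON object per non-empty line.
function parseJsonl(text) {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
}

// qrels/test.tsv: assumed to start with a header row, followed by
// query-id <TAB> corpus-id <TAB> score on each line.
// Returns Map<queryId, Map<corpusId, score>>.
function parseQrels(tsv) {
  const qrels = new Map();
  const rows = tsv
    .split("\n")
    .filter((line) => line.trim() !== "")
    .slice(1); // drop the header row
  for (const row of rows) {
    const [queryId, corpusId, score] = row.trim().split("\t");
    if (!qrels.has(queryId)) qrels.set(queryId, new Map());
    qrels.get(queryId).set(corpusId, Number(score));
  }
  return qrels;
}
```

Extra fields on a query object are simply carried through by `parseJsonl`, so any per-query metadata in queries.jsonl remains available for slicing results.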

Corpus: 22 800 icons spanning 11 open-license icon libraries (Lucide, Phosphor, Tabler, Heroicons, Bootstrap, Carbon, Font Awesome, Iconoir, Ionicons, Material Symbols, RemixIcon). Per-library icon counts are in the dataset's README.

Methodology:

  1. 120 paraphrase queries across 24 UI-intent categories, each deliberately worded so that BM25 / TF-IDF cannot win via token overlap ("throw it away" → trash; "feeling blue" → frown; "unfasten the latch" → unlock). Difficulty banding: 38 medium / 65 hard paraphrase / 17 genuinely ambiguous-but-fair.
  2. Each query encoded by both arms (q8 IconClip and hashed TF-IDF); per-arm cosine top-50 against the 22 800-icon vector pool.
  3. 35 184 relevance judgements generated by expanding each query's "expected" patterns against every icon's name + aliases; emitted as TREC qrels (qrels/test.tsv in the dataset repo).
  4. Score MRR@50 and Recall@10 per arm; report the gap.
  5. Cosine-parity test for the q8↔fp32 drift number: dot(L2(q8_vec), L2(fp32_vec)) over the same 120 queries.

Reproducible end-to-end: load the BEIR dataset, plug in any retriever, and run a standard beir.retrieval.evaluation.EvaluateRetrieval. The queries file also carries our category + difficulty + expects_patterns fields (BEIR ignores unknown fields) so consumers can slice results.
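
The two headline metrics from the methodology can be sketched as follows. These are the textbook definitions of reciprocal rank and recall at a cutoff, written as an illustration; the benchmark's own scoring code may differ in details such as tie handling, and the function names and data shapes are assumptions.

```javascript
// ranked: doc ids in ranked order; relevant: Set of relevant doc ids.
// Reciprocal rank of the first relevant hit within the top k, else 0.
function reciprocalRankAtK(ranked, relevant, k) {
  const idx = ranked.slice(0, k).findIndex((id) => relevant.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

// Fraction of all relevant docs that appear within the top k.
function recallAtK(ranked, relevant, k) {
  if (relevant.size === 0) return 0;
  const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / relevant.size;
}

// runs: Map<queryId, string[]>, qrels: Map<queryId, Set<string>>.
// Averages a per-query metric over every query in the run.
function meanOverQueries(runs, qrels, metric, k) {
  if (runs.size === 0) return 0;
  let total = 0;
  for (const [queryId, ranked] of runs) {
    total += metric(ranked, qrels.get(queryId) ?? new Set(), k);
  }
  return total / runs.size;
}
```

Recall@10 for a query can be at most 10 divided by the number of relevant icons. With 35 184 judgements over 120 queries (about 293 per query on average), low single-digit Recall@10 values are to be expected if most of those judgements are positive.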


Files in this repo

| File | Purpose | Size |
|---|---|---|
| onnx/model_quantized.onnx | INT8 dynamic-quantized text encoder, 768-d output | ~119 MB |
| tokenizer.json | CLIP BPE tokenizer (49 408 vocab, BOS 49 406 / EOS 49 407) | ~1.8 MB |
| config.json | CLIPTextModel architecture spec | <1 KB |
| tokenizer_config.json, special_tokens_map.json, vocab.json, merges.txt | Tokenizer metadata for transformers.js auto-loading | <2 MB combined |

The fp32 export (~472 MB) is not shipped here; it's reserved for Node-side server use where streaming from local disk is fine. The q8 artefact is what the browser bundle needs.


Quantization design space

We considered three reduced precisions for the text-encoder export, alongside the fp32 reference. The shipped q8 lands in the sweet spot:

| Precision | Size | Steady-state p50 (Node CPU) | Cosine parity vs fp32 | Verdict |
|---|---|---|---|---|
| fp32 | 472 MB | 17.8 ms | 1.000 | Reference; too large for browser ship, fine for Node disk |
| INT8 (this artefact) | 119 MB | 8.6 ms | 0.992 mean | Shipped. 4× smaller + 2× faster vs fp32, no meaningful quality drop. |
| INT4 | ~70-80 MB | (variable) | not measured | Not shipped. ONNX Runtime Web's INT4 path unpacks weights to FP16 at runtime, eliminating the speed win; quality cliff for CLIP-class text encoders isn't well-studied. Not worth the risk for the marginal extra size savings. |
| FP16 | 236 MB | ~14 ms | ~0.998 | Not shipped. Only 2× smaller than fp32 vs q8's 4×; speed gain modest. q8 wins on size and speed, and FP16's slightly higher parity does not justify doubling the download. |

How to use this in production: DBSF fusion with BM25

The single best practice for icon search is NOT "replace BM25 with CLIP"; it is to fuse them with DBSF (Distribution-Based Score Fusion). CLIP wins on paraphrase intent; BM25 wins on exact-name match ("home" should put home first, which CLIP doesn't always do). Fusing recovers both signals:

import { dbsfFuse } from "your-fusion-lib";

// At index time (offline, GPU):
//   image_vec = iconClip.imageEncoder(icon.png)  // 768-d
//   bm25_doc  = orama.index({ name, tags, aliases, category_synonyms })

// At query time (browser, ~10 ms total):
const textVec = await iconClipText.encode(query);            // 768-d
const semantic = cosineTopK(textVec, image_vecs, 50);        // {id, score}[]
const lexical  = bm25.search(query, 50);                     // {id, score}[]
const fused = dbsfFuse([
  { weight: 0.5, results: semantic },
  { weight: 0.5, results: lexical },
]);

DBSF normalises each arm's scores against that arm's own distribution (mean ± 3σ as the scaling bounds), then takes a weighted sum, which makes it robust to per-arm score-distribution differences. Reciprocal Rank Fusion (RRF) uses ranks only and so discards how strongly each arm prefers a result; DBSF keeps that signal even though the two arms have genuinely different score scales. Reference: Mazzeschi 2023. The companion benchmark Cortiq-Labs/iconclip-search-benchmark includes a category field per query so you can check whether DBSF helps on every UI-intent category, not just the paraphrase-heavy ones.

50/50 weighting works as a default; tune via offline grid-search if your domain differs (e.g. heavy exact-name matching → upweight BM25).
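
The snippet above imports dbsfFuse from a placeholder library. A minimal sketch of what such a function could look like follows, assuming each arm is rescaled to [0, 1] with mean ± 3σ as the bounds (equivalent to mapping z-scores in [-3, 3] onto [0, 1]) and that an icon missing from an arm contributes nothing from that arm. Both choices are assumptions for illustration, not a description of any particular library.

```javascript
// Rescale one arm's scores to [0, 1] using mean ± 3σ as the bounds.
function normalizeArm(results) {
  const scores = results.map((r) => r.score);
  const mean = scores.reduce((s, x) => s + x, 0) / scores.length;
  const variance =
    scores.reduce((s, x) => s + (x - mean) ** 2, 0) / scores.length;
  const std = Math.sqrt(variance);
  const lo = mean - 3 * std;
  const hi = mean + 3 * std;
  return results.map(({ id, score }) => ({
    id,
    // If every score is identical (std = 0), fall back to a neutral 0.5;
    // otherwise clamp so outliers beyond 3σ stay inside [0, 1].
    score:
      hi === lo ? 0.5 : Math.min(1, Math.max(0, (score - lo) / (hi - lo))),
  }));
}

// arms: [{ weight: number, results: { id, score }[] }]
// Returns { id, score }[] sorted by fused score, best first.
function dbsfFuse(arms) {
  const fused = new Map();
  for (const { weight, results } of arms) {
    if (results.length === 0) continue; // an empty arm contributes nothing
    for (const { id, score } of normalizeArm(results)) {
      fused.set(id, (fused.get(id) ?? 0) + weight * score);
    }
  }
  return [...fused.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```

Because the statistics are computed over each arm's returned shortlist, the fused ordering depends on k; keeping k fixed between tuning and production avoids surprises.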


Use it

Browser (transformers.js v4)

import { AutoTokenizer, AutoModel } from "@huggingface/transformers";

const repo = "Cortiq-Labs/IconClip-ViT-L-14-text-encoder-ONNX";
const tokenizer = await AutoTokenizer.from_pretrained(repo);
const model = await AutoModel.from_pretrained(repo, {
  dtype: "q8",
  device: "wasm",  // or "webgpu" where supported
});

// CLIP text encoders use a fixed 77-token context window, so pad every input
// to that length. This matches OpenAI CLIP, LAION CLIP, and every other
// CLIP-family ONNX export on HF.
const enc = await tokenizer(["shopping cart"], {
  padding: "max_length",
  max_length: 77,
  truncation: true,
});
const out = await model(enc);
// The ONNX exposes the projected 768-d output under the `embeddings` key.
// transformers.js v4's EncoderOnly fallback uses this name when the
// CLIPTextModel config maps to the q8 quantized graph.
const vec = out.embeddings.data;  // Float32Array(768), L2-normalised

For best results, enable cross-origin isolation (COOP: same-origin + COEP: require-corp) so transformers.js can use WASM multi-threading + SIMD.

Node (onnxruntime-node)

import * as ort from "onnxruntime-node";

const session = await ort.InferenceSession.create(
  "onnx/model_quantized.onnx", // local path to the file downloaded from this repo
  { executionProviders: ["cpu"], graphOptimizationLevel: "all" },
);
// pair with a CLIP BPE tokenizer that produces the same input_ids shape
// (this repo's tokenizer.json works through @huggingface/tokenizers)
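
Whichever tokenizer is used, the session needs a fixed-length id sequence. A minimal sketch of that padding step follows, using the BOS/EOS ids listed for tokenizer.json in this card. The pad id (EOS, as in the OpenAI CLIP tokenizer), the int64 input type, and the `input_ids` input name are assumptions; check tokenizer_config.json and `session.inputNames` before relying on them.

```javascript
const BOS = 49406n; // BOS id listed for this repo's tokenizer.json
const EOS = 49407n; // EOS id listed for this repo's tokenizer.json
const CONTEXT_LENGTH = 77;

// tokenIds: BPE ids for the query WITHOUT special tokens (plain numbers).
// padId defaults to EOS, which is an assumption for this export.
function buildInputIds(tokenIds, padId = EOS) {
  // Leave room for BOS and EOS, truncating over-long queries.
  const body = tokenIds.slice(0, CONTEXT_LENGTH - 2).map((id) => BigInt(id));
  const ids = new BigInt64Array(CONTEXT_LENGTH).fill(padId);
  ids[0] = BOS;
  body.forEach((id, i) => {
    ids[i + 1] = id;
  });
  ids[body.length + 1] = EOS;
  return ids;
}

// Usage sketch with onnxruntime-node (not executed here):
//   const tensor = new ort.Tensor("int64", buildInputIds(ids), [1, 77]);
//   const out = await session.run({ input_ids: tensor });
// If session.inputNames also lists an attention mask, it must be fed too.
```

If the tokenizer library already pads to 77 tokens and inserts the special tokens itself, this helper is unnecessary and its output should be passed through as-is.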

Fuse with lexical search (DBSF, recommended for icon search)

Cosine on the CLIP embedding alone misses exact-name matches that users expect (typing "home" should put home first). Fuse with BM25:

import { dbsfFuse } from "your-fusion-lib"; // distribution-based score fusion

const semantic = cosineTopK(queryEmbedding, iconVectors, /* k */ 50);
const lexical = bm25Search(query, iconIndex, /* k */ 50);
const fused = dbsfFuse([
  { weight: 0.5, results: semantic },
  { weight: 0.5, results: lexical },
]);

DBSF (per-arm normalisation with mean ± 3σ bounds, then a weighted sum) is robust to per-arm score-distribution differences. Reference: Mazzeschi 2023. The icon-search production code uses 50/50 weighting; tune via offline grid-search if your domain differs.


Provenance

  • Base model: likaixin/IconClip-ViT-L-14, itself fine-tuned from laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K (13 B web-scale image-text pairs) on the IconStack-Captions-48M + IconStack-48M-Pre-Rendered + svg-stack datasets.
  • ONNX export: optimum.exporters.onnx, opset 17, dynamic axes for batch + sequence dimensions.
  • Quantization: ONNX Runtime quantize_dynamic, per-tensor INT8 weights, fp32 activations preserved on the input embedding lookup to maintain BPE token coverage.

Limitations

  • Icon domain bias. Trained on icon imagery, so general photographs / natural scenes will produce lower-quality embeddings than a general LAION CLIP. Use the upstream base directly for general retrieval.
  • English only. The CLIP BPE tokenizer's vocab is English; non-Latin scripts are essentially unsupported.
  • 77-token context limit. Inherited from CLIP; longer queries are truncated. Fine for icon-search queries (typically 1-5 words).
  • Training-data provenance for IconClip's icon corpus (IconStack-Captions-48M, IconStack-48M-Pre-Rendered) is not declared on the upstream model card. Verify the upstream license story matches your needs before commercial deployment.

License

MIT, inheriting from the LAION CLIP base. See upstream LAION model card for full attribution.


Citing

If this artefact helps your work, citations are appreciated for both this quantized export and the underlying base models:

@misc{peciukonis2026iconclipq8,
  author       = {Pe{\v{c}}iukonis, Matas (NullSense)},
  title        = {{IconClip-ViT-L-14 text encoder} --- INT8 ONNX export for
                  client-side icon search},
  year         = {2026},
  howpublished = {Hugging Face --- Cortiq Labs},
  url          = {https://huggingface.co/Cortiq-Labs/IconClip-ViT-L-14-text-encoder-ONNX},
  note         = {Quantized INT8 ONNX text-tower export with reproducible
                  benchmark harness against an 11-library, 22\,800-icon corpus.
                  Source + bench: github.com/Cortiq-Lab/lucide-similarity-explorer}
}

@misc{iconclip2024,
  author = {Kaixin Li (likaixin)},
  title  = {{IconClip-ViT-L-14}: a CLIP variant fine-tuned for icon imagery},
  year   = {2024},
  url    = {https://huggingface.co/likaixin/IconClip-ViT-L-14}
}

@inproceedings{gadre2023datacomp,
  title     = {{DataComp}: in search of the next generation of multimodal datasets},
  author    = {Gadre et al.},
  booktitle = {NeurIPS},
  year      = {2023},
  url       = {https://arxiv.org/abs/2304.14108}
}

And, if you wouldn't mind, a star on the source repo helps: Cortiq-Lab/lucide-similarity-explorer.
