TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding
Abstract
TabEmbed is a new generalist embedding model that unifies tabular classification and retrieval tasks in a shared embedding space, trained with large-scale contrastive learning and positive-aware hard negative mining.
Foundation models have established unified representations for natural language processing, yet this paradigm remains largely unexplored for tabular data. Existing methods face fundamental limitations: LLM-based approaches lack retrieval-compatible vector outputs, whereas text embedding models often fail to capture tabular structure and numerical semantics. To bridge this gap, we first introduce the Tabular Embedding Benchmark (TabBench), a comprehensive suite designed to evaluate the tabular understanding capability of embedding models. We then propose TabEmbed, the first generalist embedding model that unifies tabular classification and retrieval within a shared embedding space. By reformulating diverse tabular tasks as semantic matching problems, TabEmbed leverages large-scale contrastive learning with positive-aware hard negative mining to discern fine-grained structural and numerical nuances. Experimental results on TabBench demonstrate that TabEmbed significantly outperforms state-of-the-art text embedding models, establishing a new baseline for universal tabular representation learning. Code and datasets are publicly available at https://github.com/qiangminjie27/TabEmbed and https://huggingface.co/datasets/qiangminjie27/TabBench.
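The abstract names two training ingredients: a contrastive objective over query/row pairs and positive-aware hard negative mining. The paper's exact formulation is not reproduced here, so the following is a minimal NumPy sketch of the general pattern, assuming a standard InfoNCE-style loss and a simple similarity-threshold mining rule (`info_nce_loss`, `mine_hard_negatives`, and the `margin` filter are illustrative names, not the authors' API):

```python
import numpy as np

def info_nce_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE: pull the query toward its positive, push it from negatives."""
    def unit(x):
        # Normalize so dot products are cosine similarities.
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    q, p, n = unit(query), unit(positive), unit(negatives)
    # One positive logit followed by the negative logits.
    logits = np.concatenate(([q @ p], n @ q)) / temperature
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])  # cross-entropy with the positive as the label

def mine_hard_negatives(query, pool, positive_score, margin=0.05, k=4):
    """Positive-aware mining (assumed scheme): candidates scoring within
    `margin` of the true positive are likely false negatives and are dropped;
    among the rest, keep the k most similar (hardest) rows."""
    q = query / np.linalg.norm(query)
    sims = (pool / np.linalg.norm(pool, axis=1, keepdims=True)) @ q
    cand = np.where(sims < positive_score - margin)[0]
    cand = cand[np.argsort(-sims[cand])]  # hardest first
    return pool[cand[:k]]
```

The positive-aware filter is the part that distinguishes this from plain top-k mining: without it, near-duplicates of the positive would be pushed away as negatives, corrupting the embedding space.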
Community
TabEmbed: a generalist embedding model that unifies tabular classification and retrieval in a single shared space.
The cross-modal setup in TabEmbed, where natural language queries serve as anchors that bind a serialized tabular row to its constraints, is a clever departure from the usual row-to-row contrastive objectives. That design seems to be the lever that preserves numeric semantics and column-level meaning in the embedding space, something text-only or vanilla tabular encoders struggle with. One tight question: how well does this hold up under schema shifts, such as adding new columns or reinterpreting a column across domains, where the natural language constraint might drift? The arXivLens breakdown at https://arxivlens.com/PaperView/Details/tabembed-benchmarking-and-learning-generalist-embeddings-for-tabular-understanding-5676-ee07c573 helped me parse the method details, especially the triplet construction and training recipe. An ablation that removes the query anchor or the hard negative mining would really nail down which piece is driving the gains.
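The query-as-anchor triplet construction discussed above can be sketched in a few lines. The paper's actual serialization format and matching logic are not given here, so this assumes a common "column: value" flattening scheme and hypothetical helpers (`serialize_row`, `make_triplet`, and the `matches` flags are illustrative, not the authors' code):

```python
def serialize_row(row: dict) -> str:
    """Flatten one tabular row into 'column: value' text, a common
    serialization scheme for feeding tables to text encoders."""
    return " | ".join(f"{col}: {val}" for col, val in row.items())

def make_triplet(query: str, table: list, matches: list):
    """Build an (anchor, positive, negatives) triplet for contrastive
    training: the natural language query is the anchor, a row satisfying
    its constraint is the positive, the remaining rows are negatives."""
    pos = [serialize_row(r) for r, m in zip(table, matches) if m]
    neg = [serialize_row(r) for r, m in zip(table, matches) if not m]
    return query, pos[0], neg
```

Under this framing, row-to-row similarity is never contrasted directly; the query text carries the constraint (e.g. a numeric range), which is plausibly why numeric semantics survive in the embedding space.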
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Towards Universal Tabular Embeddings: A Benchmark Across Data Tasks (2026)
- SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs (2026)
- WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition (2026)
- MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models (2026)
- Beyond Statistical Co-occurrence: Unlocking Intrinsic Semantics for Tabular Data Clustering (2026)
- LLM2Vec-Gen: Generative Embeddings from Large Language Models (2026)
- ReaLM: Residual Quantization Bridges Knowledge Graph Embeddings and Large Language Models (2025)
Get this paper in your agent: hf papers read 2605.04962