new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Apr 15

Towards Generalizable Robotic Data Flywheel: High-Dimensional Factorization and Composition

The lack of sufficiently diverse data, coupled with limited data efficiency, remains a major bottleneck for generalist robotic models, yet systematic strategies for collecting and curating such data are not fully explored. Task diversity arises from implicit factors that are sparsely distributed across multiple dimensions and are difficult to define explicitly. To address this challenge, we propose F-ACIL, a heuristic factor-aware compositional iterative learning framework that enables structured data factorization and promotes compositional generalization. F-ACIL decomposes the data distribution into structured factor spaces such as object, action, and environment. Based on the factorized formulation, we develop a factor-wise data collection and an iterative training paradigm that promotes compositional generalization over the high-dimensional factor space, leading to more effective utilization of real-world robotic demonstrations. With extensive real-world experiments, we show that F-ACIL can achieve more than 45% performance gains with 5-10times fewer demonstrations comparing to that of which without the strategy. The results suggest that structured factorization offers a practical pathway toward efficient compositional generalization in real-world robotic learning. We believe F-ACIL can inspire more systematic research on building generalizable robotic data flywheel strategies. More demonstrations can be found at: https://f-acil.github.io/

  • 5 authors
·
Mar 25

Serpent: Scalable and Efficient Image Restoration via Multi-scale Structured State Space Models

The landscape of computational building blocks of efficient image restoration architectures is dominated by a combination of convolutional processing and various attention mechanisms. However, convolutional filters, while efficient, are inherently local and therefore struggle with modeling long-range dependencies in images. In contrast, attention excels at capturing global interactions between arbitrary image regions, but suffers from a quadratic cost in image dimension. In this work, we propose Serpent, an efficient architecture for high-resolution image restoration that combines recent advances in state space models (SSMs) with multi-scale signal processing in its core computational block. SSMs, originally introduced for sequence modeling, can maintain a global receptive field with a favorable linear scaling in input size. We propose a novel hierarchical architecture inspired by traditional signal processing principles, that converts the input image into a collection of sequences and processes them in a multi-scale fashion. Our experimental results demonstrate that Serpent can achieve reconstruction quality on par with state-of-the-art techniques, while requiring orders of magnitude less compute (up to 150 fold reduction in FLOPS) and a factor of up to 5times less GPU memory while maintaining a compact model size. The efficiency gains achieved by Serpent are especially notable at high image resolutions.

  • 3 authors
·
Mar 26, 2024

Unsupervised Learning under Latent Label Shift

What sorts of structure might enable a learner to discover classes from unlabeled data? Traditional approaches rely on feature-space similarity and heroic assumptions on the data. In this paper, we introduce unsupervised learning under Latent Label Shift (LLS), where we have access to unlabeled data from multiple domains such that the label marginals p_d(y) can shift across domains but the class conditionals p(x|y) do not. This work instantiates a new principle for identifying classes: elements that shift together group together. For finite input spaces, we establish an isomorphism between LLS and topic modeling: inputs correspond to words, domains to documents, and labels to topics. Addressing continuous data, we prove that when each label's support contains a separable region, analogous to an anchor word, oracle access to p(d|x) suffices to identify p_d(y) and p_d(y|x) up to permutation. Thus motivated, we introduce a practical algorithm that leverages domain-discriminative models as follows: (i) push examples through domain discriminator p(d|x); (ii) discretize the data by clustering examples in p(d|x) space; (iii) perform non-negative matrix factorization on the discrete data; (iv) combine the recovered p(y|d) with the discriminator outputs p(d|x) to compute p_d(y|x) ; forall d. With semi-synthetic experiments, we show that our algorithm can leverage domain information to improve upon competitive unsupervised classification methods. We reveal a failure mode of standard unsupervised classification methods when feature-space similarity does not indicate true groupings, and show empirically that our method better handles this case. Our results establish a deep connection between distribution shift and topic modeling, opening promising lines for future work.

  • 4 authors
·
Jul 26, 2022

SESA: Supervised Explicit Semantic Analysis

In recent years supervised representation learning has provided state of the art or close to the state of the art results in semantic analysis tasks including ranking and information retrieval. The core idea is to learn how to embed items into a latent space such that they optimize a supervised objective in that latent space. The dimensions of the latent space have no clear semantics, and this reduces the interpretability of the system. For example, in personalization models, it is hard to explain why a particular item is ranked high for a given user profile. We propose a novel model of representation learning called Supervised Explicit Semantic Analysis (SESA) that is trained in a supervised fashion to embed items to a set of dimensions with explicit semantics. The model learns to compare two objects by representing them in this explicit space, where each dimension corresponds to a concept from a knowledge base. This work extends Explicit Semantic Analysis (ESA) with a supervised model for ranking problems. We apply this model to the task of Job-Profile relevance in LinkedIn in which a set of skills defines our explicit dimensions of the space. Every profile and job are encoded to this set of skills their similarity is calculated in this space. We use RNNs to embed text input into this space. In addition to interpretability, our model makes use of the web-scale collaborative skills data that is provided by users for each LinkedIn profile. Our model provides state of the art result while it remains interpretable.

  • 2 authors
·
Aug 10, 2017

FLoRA: Low-Rank Core Space for N-dimension

Adapting pre-trained foundation models for various downstream tasks has been prevalent in artificial intelligence. Due to the vast number of tasks and high costs, adjusting all parameters becomes unfeasible. To mitigate this, several fine-tuning techniques have been developed to update the pre-trained model weights in a more resource-efficient manner, such as through low-rank adjustments. Yet, almost all of these methods focus on linear weights, neglecting the intricacies of parameter spaces in higher dimensions like 4D. Alternatively, some methods can be adapted for high-dimensional parameter space by compressing changes in the original space into two dimensions and then employing low-rank matrix decomposition. However, these approaches destructs the structural integrity of the involved high-dimensional spaces. To tackle the diversity of dimensional spaces across different foundation models and provide a more precise representation of the changes within these spaces, this paper introduces a generalized parameter-efficient fine-tuning framework, FLoRA, designed for various dimensional parameter space. Specifically, utilizing Tucker decomposition, FLoRA asserts that changes in each dimensional parameter space are based on a low-rank core space which maintains the consistent topological structure with the original space. It then models the changes through this core space alongside corresponding weights to reconstruct alterations in the original space. FLoRA effectively preserves the structural integrity of the change of original N-dimensional parameter space, meanwhile decomposes it via low-rank tensor decomposition. Extensive experiments on computer vision, natural language processing and multi-modal tasks validate FLoRA's effectiveness. Codes are available at https://github.com/SJTU-DeepVisionLab/FLoRA.

  • 9 authors
·
May 23, 2024

GraphShaper: Geometry-aware Alignment for Improving Transfer Learning in Text-Attributed Graphs

Graph foundation models represent a transformative paradigm for learning transferable representations across diverse graph domains. Recent methods leverage large language models to unify graph and text modalities into a shared representation space using contrastive learning. However, systematic evaluations reveal significant performance degradation at structural boundaries where distinct topological patterns converge, with accuracy losses exceeding 20 percentage points. This issue arises from a key limitation: current methods assume all graph structures can be encoded within a single Euclidean space. In reality, tree structures require hyperbolic geometry to preserve hierarchical branching, while cyclic patterns depend on spherical geometry for closure properties. At structural boundaries, nodes experience conflicting geometric constraints that uniform encoding spaces cannot resolve. This raises a crucial challenge: Can alignment frameworks be designed to respect the intrinsic geometric diversity of graph structures? We introduce GraphShaper, a geometry-aware framework that enhances graph encoding through multi-geometric specialization. Our approach employs expert networks tailored to different geometric spaces, dynamically computing fusion weights to adaptively integrate geometric properties based on local structural characteristics. This adaptive fusion preserves structural integrity before alignment with text embeddings. Extensive experiments demonstrate that GraphShaper achieves 9.47\% accuracy improvements on citation networks and 7.63\% on social networks in zero-shot settings.

  • 9 authors
·
Oct 13, 2025

The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook

Latent space is rapidly emerging as a native substrate for language-based models. While modern systems are still commonly understood through explicit token-level generation, an increasing body of work shows that many critical internal processes are more naturally carried out in continuous latent space than in human-readable verbal traces. This shift is driven by the structural limitations of explicit-space computation, including linguistic redundancy, discretization bottlenecks, sequential inefficiency, and semantic loss. This survey aims to provide a unified and up-to-date landscape of latent space in language-based models. We organize the survey into five sequential perspectives: Foundation, Evolution, Mechanism, Ability, and Outlook. We begin by delineating the scope of latent space, distinguishing it from explicit or verbal space and from the latent spaces commonly studied in generative visual models. We then trace the field's evolution from early exploratory efforts to the current large-scale expansion. To organize the technical landscape, we examine existing work through the complementary lenses of mechanism and ability. From the perspective of Mechanism, we identify four major lines of development: Architecture, Representation, Computation, and Optimization. From the perspective of Ability, we show how latent space supports a broad capability spectrum spanning Reasoning, Planning, Modeling, Perception, Memory, Collaboration, and Embodiment. Beyond consolidation, we discuss the key open challenges, and outline promising directions for future research. We hope this survey serves not only as a reference for existing work, but also as a foundation for understanding latent space as a general computational and systems paradigm for next-generation intelligence.

  • 37 authors
·
Apr 1 5

A Framework for End-to-End Learning on Semantic Tree-Structured Data

While learning models are typically studied for inputs in the form of a fixed dimensional feature vector, real world data is rarely found in this form. In order to meet the basic requirement of traditional learning models, structural data generally have to be converted into fix-length vectors in a handcrafted manner, which is tedious and may even incur information loss. A common form of structured data is what we term "semantic tree-structures", corresponding to data where rich semantic information is encoded in a compositional manner, such as those expressed in JavaScript Object Notation (JSON) and eXtensible Markup Language (XML). For tree-structured data, several learning models have been studied to allow for working directly on raw tree-structure data, However such learning models are limited to either a specific tree-topology or a specific tree-structured data format, e.g., synthetic parse trees. In this paper, we propose a novel framework for end-to-end learning on generic semantic tree-structured data of arbitrary topology and heterogeneous data types, such as data expressed in JSON, XML and so on. Motivated by the works in recursive and recurrent neural networks, we develop exemplar neural implementations of our framework for the JSON format. We evaluate our approach on several UCI benchmark datasets, including ablation and data-efficiency studies, and on a toy reinforcement learning task. Experimental results suggest that our framework yields comparable performance to use of standard models with dedicated feature-vectors in general, and even exceeds baseline performance in cases where compositional nature of the data is particularly important. The source code for a JSON-based implementation of our framework along with experiments can be downloaded at https://github.com/EndingCredits/json2vec.

  • 2 authors
·
Feb 13, 2020

Differentiable Neural Input Search for Recommender Systems

Latent factor models are the driving forces of the state-of-the-art recommender systems, with an important insight of vectorizing raw input features into dense embeddings. The dimensions of different feature embeddings are often set to a same value empirically, which limits the predictive performance of latent factor models. Existing works have proposed heuristic or reinforcement learning-based methods to search for mixed feature embedding dimensions. For efficiency concern, these methods typically choose embedding dimensions from a restricted set of candidate dimensions. However, this restriction will hurt the flexibility of dimension selection, leading to suboptimal performance of search results. In this paper, we propose Differentiable Neural Input Search (DNIS), a method that searches for mixed feature embedding dimensions in a more flexible space through continuous relaxation and differentiable optimization. The key idea is to introduce a soft selection layer that controls the significance of each embedding dimension, and optimize this layer according to model's validation performance. DNIS is model-agnostic and thus can be seamlessly incorporated with existing latent factor models for recommendation. We conduct experiments with various architectures of latent factor models on three public real-world datasets for rating prediction, Click-Through-Rate (CTR) prediction, and top-k item recommendation. The results demonstrate that our method achieves the best predictive performance compared with existing neural input search approaches with fewer embedding parameters and less time cost.

  • 3 authors
·
Jun 8, 2020

Unsupervised Matching of Data and Text

Entity resolution is a widely studied problem with several proposals to match records across relations. Matching textual content is a widespread task in many applications, such as question answering and search. While recent methods achieve promising results for these two tasks, there is no clear solution for the more general problem of matching textual content and structured data. We introduce a framework that supports this new task in an unsupervised setting for any pair of corpora, being relational tables or text documents. Our method builds a fine-grained graph over the content of the corpora and derives word embeddings to represent the objects to match in a low dimensional space. The learned representation enables effective and efficient matching at different granularity, from relational tuples to text sentences and paragraphs. Our flexible framework can exploit pre-trained resources, but it does not depends on their existence and achieves better quality performance in matching content when the vocabulary is domain specific. We also introduce optimizations in the graph creation process with an "expand and compress" approach that first identifies new valid relationships across elements, to improve matching, and then prunes nodes and edges, to reduce the graph size. Experiments on real use cases and public datasets show that our framework produces embeddings that outperform word embeddings and fine-tuned language models both in results' quality and in execution times.

  • 3 authors
·
Dec 16, 2021

XFACTORS: Disentangled Information Bottleneck via Contrastive Supervision

Disentangled representation learning aims to map independent factors of variation to independent representation components. On one hand, purely unsupervised approaches have proven successful on fully disentangled synthetic data, but fail to recover semantic factors from real data without strong inductive biases. On the other hand, supervised approaches are unstable and hard to scale to large attribute sets because they rely on adversarial objectives or auxiliary classifiers. We introduce XFactors, a weakly-supervised VAE framework that disentangles and provides explicit control over a chosen set of factors. Building on the Disentangled Information Bottleneck perspective, we decompose the representation into a residual subspace S and factor-specific subspaces T_1,ldots,T_K and a residual subspace S. Each target factor is encoded in its assigned T_i through contrastive supervision: an InfoNCE loss pulls together latents sharing the same factor value and pushes apart mismatched pairs. In parallel, KL regularization imposes a Gaussian structure on both S and the aggregated factor subspaces, organizing the geometry without additional supervision for non-targeted factors and avoiding adversarial training and classifiers. Across multiple datasets, with constant hyperparameters, XFactors achieves state-of-the-art disentanglement scores and yields consistent qualitative factor alignment in the corresponding subspaces, enabling controlled factor swapping via latent replacement. We further demonstrate that our method scales correctly with increasing latent capacity and evaluate it on the real-world dataset CelebA. Our code is available at https://github.com/ICML26-anon/XFactors{github.com/ICML26-anon/XFactors}.

  • 6 authors
·
Jan 29

What's In Your Field? Mapping Scientific Research with Knowledge Graphs and Large Language Models

The scientific literature's exponential growth makes it increasingly challenging to navigate and synthesize knowledge across disciplines. Large language models (LLMs) are powerful tools for understanding scientific text, but they fail to capture detailed relationships across large bodies of work. Unstructured approaches, like retrieval augmented generation, can sift through such corpora to recall relevant facts; however, when millions of facts influence the answer, unstructured approaches become cost prohibitive. Structured representations offer a natural complement -- enabling systematic analysis across the whole corpus. Recent work enhances LLMs with unstructured or semistructured representations of scientific concepts; to complement this, we try extracting structured representations using LLMs. By combining LLMs' semantic understanding with a schema of scientific concepts, we prototype a system that answers precise questions about the literature as a whole. Our schema applies across scientific fields and we extract concepts from it using only 20 manually annotated abstracts. To demonstrate the system, we extract concepts from 30,000 papers on arXiv spanning astrophysics, fluid dynamics, and evolutionary biology. The resulting database highlights emerging trends and, by visualizing the knowledge graph, offers new ways to explore the ever-growing landscape of scientific knowledge. Demo: abby101/surveyor-0 on HF Spaces. Code: https://github.com/chiral-carbon/kg-for-science.

  • 4 authors
·
Mar 12, 2025

Compositional Generalization Requires Linear, Orthogonal Representations in Vision Embedding Models

Compositional generalization, the ability to recognize familiar parts in novel contexts, is a defining property of intelligent systems. Although modern models are trained on massive datasets, they still cover only a tiny fraction of the combinatorial space of possible inputs, raising the question of what structure representations must have to support generalization to unseen combinations. We formalize three desiderata for compositional generalization under standard training (divisibility, transferability, stability) and show they impose necessary geometric constraints: representations must decompose linearly into per-concept components, and these components must be orthogonal across concepts. This provides theoretical grounding for the Linear Representation Hypothesis: the linear structure widely observed in neural representations is a necessary consequence of compositional generalization. We further derive dimension bounds linking the number of composable concepts to the embedding geometry. Empirically, we evaluate these predictions across modern vision models (CLIP, SigLIP, DINO) and find that representations exhibit partial linear factorization with low-rank, near-orthogonal per-concept factors, and that the degree of this structure correlates with compositional generalization on unseen combinations. As models continue to scale, these conditions predict the representational geometry they may converge to. Code is available at https://github.com/oshapio/necessary-compositionality.

  • 3 authors
·
Feb 27 3

DeepArchitect: Automatically Designing and Training Deep Architectures

In deep learning, performance is strongly affected by the choice of architecture and hyperparameters. While there has been extensive work on automatic hyperparameter optimization for simple spaces, complex spaces such as the space of deep architectures remain largely unexplored. As a result, the choice of architecture is done manually by the human expert through a slow trial and error process guided mainly by intuition. In this paper we describe a framework for automatically designing and training deep models. We propose an extensible and modular language that allows the human expert to compactly represent complex search spaces over architectures and their hyperparameters. The resulting search spaces are tree-structured and therefore easy to traverse. Models can be automatically compiled to computational graphs once values for all hyperparameters have been chosen. We can leverage the structure of the search space to introduce different model search algorithms, such as random search, Monte Carlo tree search (MCTS), and sequential model-based optimization (SMBO). We present experiments comparing the different algorithms on CIFAR-10 and show that MCTS and SMBO outperform random search. In addition, these experiments show that our framework can be used effectively for model discovery, as it is possible to describe expressive search spaces and discover competitive models without much effort from the human expert. Code for our framework and experiments has been made publicly available.

  • 2 authors
·
Apr 27, 2017

Navigating Ideation Space: Decomposed Conceptual Representations for Positioning Scientific Ideas

Scientific discovery is a cumulative process and requires new ideas to be situated within an ever-expanding landscape of existing knowledge. An emerging and critical challenge is how to identify conceptually relevant prior work from rapidly growing literature, and assess how a new idea differentiates from existing research. Current embedding approaches typically conflate distinct conceptual aspects into single representations and cannot support fine-grained literature retrieval; meanwhile, LLM-based evaluators are subject to sycophancy biases, failing to provide discriminative novelty assessment. To tackle these challenges, we introduce the Ideation Space, a structured representation that decomposes scientific knowledge into three distinct dimensions, i.e., research problem, methodology, and core findings, each learned through contrastive training. This framework enables principled measurement of conceptual distance between ideas, and modeling of ideation transitions that capture the logical connections within a proposed idea. Building upon this representation, we propose a Hierarchical Sub-Space Retrieval framework for efficient, targeted literature retrieval, and a Decomposed Novelty Assessment algorithm that identifies which aspects of an idea are novel. Extensive experiments demonstrate substantial improvements, where our approach achieves Recall@30 of 0.329 (16.7% over baselines), our ideation transition retrieval reaches Hit Rate@30 of 0.643, and novelty assessment attains 0.37 correlation with expert judgments. In summary, our work provides a promising paradigm for future research on accelerating and evaluating scientific discovery.

  • 4 authors
·
Jan 13

Towards Representation Alignment and Uniformity in Collaborative Filtering

Collaborative filtering (CF) plays a critical role in the development of recommender systems. Most CF methods utilize an encoder to embed users and items into the same representation space, and the Bayesian personalized ranking (BPR) loss is usually adopted as the objective function to learn informative encoders. Existing studies mainly focus on designing more powerful encoders (e.g., graph neural network) to learn better representations. However, few efforts have been devoted to investigating the desired properties of representations in CF, which is important to understand the rationale of existing CF methods and design new learning objectives. In this paper, we measure the representation quality in CF from the perspective of alignment and uniformity on the hypersphere. We first theoretically reveal the connection between the BPR loss and these two properties. Then, we empirically analyze the learning dynamics of typical CF methods in terms of quantified alignment and uniformity, which shows that better alignment or uniformity both contribute to higher recommendation performance. Based on the analyses results, a learning objective that directly optimizes these two properties is proposed, named DirectAU. We conduct extensive experiments on three public datasets, and the proposed learning framework with a simple matrix factorization model leads to significant performance improvements compared to state-of-the-art CF methods. Our implementations are publicly available at https://github.com/THUwangcy/DirectAU.

  • 7 authors
·
Jun 26, 2022

Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers

State-of-the-art results in large language models (LLMs) often rely on scale, which becomes computationally expensive. This has sparked a research agenda to reduce these models' parameter counts and computational costs without significantly impacting their performance. Our study focuses on transformer-based LLMs, specifically targeting the computationally intensive feedforward networks (FFNs), which are less studied than attention blocks. We consider three structured linear parameterizations of the FFN using efficient low-rank and block-diagonal matrices. In contrast to many previous works that examined these approximations, our study i) explores these structures from a training-from-scratch perspective, ii) scales up to 1.3B parameters, and iii) is conducted within recent Transformer-based LLMs rather than convolutional architectures. We demonstrate that these structures can lead to actual computational gains in various scenarios, including online decoding when using a pre-merge technique. Additionally, we propose a novel training regime, called self-guided training, aimed at improving the poor training dynamics that these approximations exhibit when used from initialization. Interestingly, the scaling performance of structured matrices is explored, revealing steeper curves in scaling training FLOPs, along with a favorable scaling trend in the overtraining regime. Specifically, we show that wide and structured networks can utilize training FLOPs more efficiently, with fewer parameters and lower loss than dense models at their optimal trade-off. Our code is available at https://github.com/CLAIRE-Labo/StructuredFFN/tree/main.

  • 4 authors
·
Jun 24, 2024

Enhancing Structured-Data Retrieval with GraphRAG: Soccer Data Case Study

Extracting meaningful insights from large and complex datasets poses significant challenges, particularly in ensuring the accuracy and relevance of retrieved information. Traditional data retrieval methods such as sequential search and index-based retrieval often fail when handling intricate and interconnected data structures, resulting in incomplete or misleading outputs. To overcome these limitations, we introduce Structured-GraphRAG, a versatile framework designed to enhance information retrieval across structured datasets in natural language queries. Structured-GraphRAG utilizes multiple knowledge graphs, which represent data in a structured format and capture complex relationships between entities, enabling a more nuanced and comprehensive retrieval of information. This graph-based approach reduces the risk of errors in language model outputs by grounding responses in a structured format, thereby enhancing the reliability of results. We demonstrate the effectiveness of Structured-GraphRAG by comparing its performance with that of a recently published method using traditional retrieval-augmented generation. Our findings show that Structured-GraphRAG significantly improves query processing efficiency and reduces response times. While our case study focuses on soccer data, the framework's design is broadly applicable, offering a powerful tool for data analysis and enhancing language model applications across various structured domains.

  • 5 authors
·
Sep 26, 2024 2

A Markov Categorical Framework for Language Modeling

Auto-regressive language models factorize sequence probabilities and are trained by minimizing the negative log-likelihood (NLL) objective. While empirically powerful, a deep theoretical understanding of why this simple objective yields such versatile representations remains elusive. This work introduces a unifying analytical framework using Markov Categories (MCs) to deconstruct the AR generation process and the NLL objective. We model the single-step generation map as a composition of Markov kernels in the category Stoch. This compositional view, when enriched with statistical divergences, allows us to dissect information flow and learned geometry. Our framework makes three main contributions. First, we provide a formal, information-theoretic rationale for the success of modern speculative decoding methods like EAGLE, quantifying the information surplus in hidden states that these methods exploit. Second, we formalize how NLL minimization forces the model to learn not just the next token, but the data's intrinsic conditional stochasticity, a process we analyze using categorical entropy. Third, and most centrally, we prove that NLL training acts as an implicit form of spectral contrastive learning. By analyzing the information geometry of the model's prediction head, we show that NLL implicitly forces the learned representation space to align with the eigenspectrum of a predictive similarity operator, thereby learning a geometrically structured space without explicit contrastive pairs. This compositional and information-geometric perspective reveals the deep structural principles underlying the effectiveness of modern LMs. Project Page: https://github.com/asiresearch/lm-theory

  • 1 authors
·
Jul 25, 2025

Sheaf Neural Networks for Graph-based Recommender Systems

Recent progress in Graph Neural Networks has resulted in wide adoption by many applications, including recommendation systems. The reason for Graph Neural Networks' superiority over other approaches is that many problems in recommendation systems can be naturally modeled as graphs, where nodes can be either users or items and edges represent preference relationships. In current Graph Neural Network approaches, nodes are represented with a static vector learned at training time. This static vector might only be suitable to capture some of the nuances of users or items they define. To overcome this limitation, we propose using a recently proposed model inspired by category theory: Sheaf Neural Networks. Sheaf Neural Networks, and its connected Laplacian, can address the previous problem by associating every node (and edge) with a vector space instead than a single vector. The vector space representation is richer and allows picking the proper representation at inference time. This approach can be generalized for different related tasks on graphs and achieves state-of-the-art performance in terms of F1-Score@N in collaborative filtering and Hits@20 in link prediction. For collaborative filtering, the approach is evaluated on the MovieLens 100K with a 5.1% improvement, on MovieLens 1M with a 5.4% improvement and on Book-Crossing with a 2.8% improvement, while for link prediction on the ogbl-ddi dataset with a 1.6% refinement with respect to the respective baselines.

  • 4 authors
·
Apr 7, 2023

Refining Contrastive Learning and Homography Relations for Multi-Modal Recommendation

Multi-modal recommender system focuses on utilizing rich modal information ( i.e., images and textual descriptions) of items to improve recommendation performance. The current methods have achieved remarkable success with the powerful structure modeling capability of graph neural networks. However, these methods are often hindered by sparse data in real-world scenarios. Although contrastive learning and homography ( i.e., homogeneous graphs) are employed to address the data sparsity challenge, existing methods still suffer two main limitations: 1) Simple multi-modal feature contrasts fail to produce effective representations, causing noisy modal-shared features and loss of valuable information in modal-unique features; 2) The lack of exploration of the homograph relations between user interests and item co-occurrence results in incomplete mining of user-item interplay. To address the above limitations, we propose a novel framework for REfining multi-modAl contRastive learning and hoMography relations (REARM). Specifically, we complement multi-modal contrastive learning by employing meta-network and orthogonal constraint strategies, which filter out noise in modal-shared features and retain recommendation-relevant information in modal-unique features. To mine homogeneous relationships effectively, we integrate a newly constructed user interest graph and an item co-occurrence graph with the existing user co-occurrence and item semantic graphs for graph learning. The extensive experiments on three real-world datasets demonstrate the superiority of REARM to various state-of-the-art baselines. Our visualization further shows an improvement made by REARM in distinguishing between modal-shared and modal-unique features. Code is available https://github.com/MrShouxingMa/REARM{here}.

  • 4 authors
·
Aug 19, 2025 2

einspace: Searching for Neural Architectures from Fundamental Operations

Neural architecture search (NAS) finds high performing networks for a given task. Yet the results of NAS are fairly prosaic; they did not e.g. create a shift from convolutional structures to transformers. This is not least because the search spaces in NAS often aren't diverse enough to include such transformations a priori. Instead, for NAS to provide greater potential for fundamental design shifts, we need a novel expressive search space design which is built from more fundamental operations. To this end, we introduce einspace, a search space based on a parameterised probabilistic context-free grammar. Our space is versatile, supporting architectures of various sizes and complexities, while also containing diverse network operations which allow it to model convolutions, attention components and more. It contains many existing competitive architectures, and provides flexibility for discovering new ones. Using this search space, we perform experiments to find novel architectures as well as improvements on existing ones on the diverse Unseen NAS datasets. We show that competitive architectures can be obtained by searching from scratch, and we consistently find large improvements when initialising the search with strong baselines. We believe that this work is an important advancement towards a transformative NAS paradigm where search space expressivity and strategic search initialisation play key roles.

  • 8 authors
·
May 31, 2024

Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data?

Despite the power of Large Language Models (LLMs) like GPT-4, they still struggle with tasks that require generating complex, structured outputs. In this study, we assess the capability of Current LLMs in generating complex structured data and propose a structure-aware fine-tuning approach as a solution to improve this ability. To perform a comprehensive evaluation, we propose Struc-Bench, include five representative LLMs (i.e., GPT-NeoX 20B, GPT-3.5, GPT-4, and Vicuna) and evaluate them on our carefully constructed datasets spanning raw text, HTML, and LaTeX tables. Based on our analysis of current model performance, we identify specific common formatting errors and areas of potential improvement. To address complex formatting requirements, we utilize FormatCoT (Chain-of-Thought) to generate format instructions from target outputs. Our experiments show that our structure-aware fine-tuning method, when applied to LLaMA-7B, significantly improves adherence to natural language constraints, outperforming other evaluated LLMs. Based on these results, we present an ability map of model capabilities from six dimensions (i.e., coverage, formatting, reasoning, comprehension, pragmatics, and hallucination). This map highlights the weaknesses of LLMs in handling complex structured outputs and suggests promising directions for future work. Our code and models can be found at https://github.com/gersteinlab/Struc-Bench.

  • 5 authors
·
Sep 16, 2023 1

DefSent+: Improving sentence embeddings of language models by projecting definition sentences into a quasi-isotropic or isotropic vector space of unlimited dictionary entries

This paper presents a significant improvement on the previous conference paper known as DefSent. The prior study seeks to improve sentence embeddings of language models by projecting definition sentences into the vector space of dictionary entries. We discover that this approach is not fully explored due to the methodological limitation of using word embeddings of language models to represent dictionary entries. This leads to two hindrances. First, dictionary entries are constrained by the single-word vocabulary, and thus cannot be fully exploited. Second, semantic representations of language models are known to be anisotropic, but pre-processing word embeddings for DefSent is not allowed because its weight is frozen during training and tied to the prediction layer. In this paper, we propose a novel method to progressively build entry embeddings not subject to the limitations. As a result, definition sentences can be projected into a quasi-isotropic or isotropic vector space of unlimited dictionary entries, so that sentence embeddings of noticeably better quality are attainable. We abbreviate our approach as DefSent+ (a plus version of DefSent), involving the following strengths: 1) the task performance on measuring sentence similarities is significantly improved compared to DefSent; 2) when DefSent+ is used to further train data-augmented models like SIMCSE, SNCSE, and SynCSE, state-of-the-art performance on measuring sentence similarities can be achieved among the approaches without using manually labeled datasets; 3) DefSent+ is also competitive in feature-based transfer for NLP downstream tasks.

  • 1 authors
·
May 25, 2024

Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations

Learned sparse representations form an attractive class of contextual embeddings for text retrieval. That is so because they are effective models of relevance and are interpretable by design. Despite their apparent compatibility with inverted indexes, however, retrieval over sparse embeddings remains challenging. That is due to the distributional differences between learned embeddings and term frequency-based lexical models of relevance such as BM25. Recognizing this challenge, a great deal of research has gone into, among other things, designing retrieval algorithms tailored to the properties of learned sparse representations, including approximate retrieval systems. In fact, this task featured prominently in the latest BigANN Challenge at NeurIPS 2023, where approximate algorithms were evaluated on a large benchmark dataset by throughput and recall. In this work, we propose a novel organization of the inverted index that enables fast yet effective approximate retrieval over learned sparse embeddings. Our approach organizes inverted lists into geometrically-cohesive blocks, each equipped with a summary vector. During query processing, we quickly determine if a block must be evaluated using the summaries. As we show experimentally, single-threaded query processing using our method, Seismic, reaches sub-millisecond per-query latency on various sparse embeddings of the MS MARCO dataset while maintaining high recall. Our results indicate that Seismic is one to two orders of magnitude faster than state-of-the-art inverted index-based solutions and further outperforms the winning (graph-based) submissions to the BigANN Challenge by a significant margin.

  • 4 authors
·
Apr 29, 2024

RecGPT: A Foundation Model for Sequential Recommendation

This work addresses a fundamental barrier in recommender systems: the inability to generalize across domains without extensive retraining. Traditional ID-based approaches fail entirely in cold-start and cross-domain scenarios where new users or items lack sufficient interaction history. Inspired by foundation models' cross-domain success, we develop a foundation model for sequential recommendation that achieves genuine zero-shot generalization capabilities. Our approach fundamentally departs from existing ID-based methods by deriving item representations exclusively from textual features. This enables immediate embedding of any new item without model retraining. We introduce unified item tokenization with Finite Scalar Quantization that transforms heterogeneous textual descriptions into standardized discrete tokens. This eliminates domain barriers that plague existing systems. Additionally, the framework features hybrid bidirectional-causal attention that captures both intra-item token coherence and inter-item sequential dependencies. An efficient catalog-aware beam search decoder enables real-time token-to-item mapping. Unlike conventional approaches confined to their training domains, RecGPT naturally bridges diverse recommendation contexts through its domain-invariant tokenization mechanism. Comprehensive evaluations across six datasets and industrial scenarios demonstrate consistent performance advantages.

  • 6 authors
·
Jun 6, 2025

Escaping Plato's Cave: Towards the Alignment of 3D and Text Latent Spaces

Recent works have shown that, when trained at scale, uni-modal 2D vision and text encoders converge to learned features that share remarkable structural properties, despite arising from different representations. However, the role of 3D encoders with respect to other modalities remains unexplored. Furthermore, existing 3D foundation models that leverage large datasets are typically trained with explicit alignment objectives with respect to frozen encoders from other representations. In this work, we investigate the possibility of a posteriori alignment of representations obtained from uni-modal 3D encoders compared to text-based feature spaces. We show that naive post-training feature alignment of uni-modal text and 3D encoders results in limited performance. We then focus on extracting subspaces of the corresponding feature spaces and discover that by projecting learned representations onto well-chosen lower-dimensional subspaces the quality of alignment becomes significantly higher, leading to improved accuracy on matching and retrieval tasks. Our analysis further sheds light on the nature of these shared subspaces, which roughly separate between semantic and geometric data representations. Overall, ours is the first work that helps to establish a baseline for post-training alignment of 3D uni-modal and text feature spaces, and helps to highlight both the shared and unique properties of 3D data compared to other representations.

  • 8 authors
·
Mar 7, 2025 2

Self-supervised Learning for Large-scale Item Recommendations

Large scale recommender models find most relevant items from huge catalogs, and they play a critical role in modern search and recommendation systems. To model the input space with large-vocab categorical features, a typical recommender model learns a joint embedding space through neural networks for both queries and items from user feedback data. However, with millions to billions of items in the corpus, users tend to provide feedback for a very small set of them, causing a power-law distribution. This makes the feedback data for long-tail items extremely sparse. Inspired by the recent success in self-supervised representation learning research in both computer vision and natural language understanding, we propose a multi-task self-supervised learning (SSL) framework for large-scale item recommendations. The framework is designed to tackle the label sparsity problem by learning better latent relationship of item features. Specifically, SSL improves item representation learning as well as serving as additional regularization to improve generalization. Furthermore, we propose a novel data augmentation method that utilizes feature correlations within the proposed framework. We evaluate our framework using two real-world datasets with 500M and 1B training examples respectively. Our results demonstrate the effectiveness of SSL regularization and show its superior performance over the state-of-the-art regularization techniques. We also have already launched the proposed techniques to a web-scale commercial app-to-app recommendation system, with significant improvements top-tier business metrics demonstrated in A/B experiments on live traffic. Our online results also verify our hypothesis that our framework indeed improves model performance even more on slices that lack supervision.

  • 11 authors
·
Jul 25, 2020

PRISM: Purified Representation and Integrated Semantic Modeling for Generative Sequential Recommendation

Generative Sequential Recommendation (GSR) has emerged as a promising paradigm, reframing recommendation as an autoregressive sequence generation task over discrete Semantic IDs (SIDs), typically derived via codebook-based quantization. Despite its great potential in unifying retrieval and ranking, existing GSR frameworks still face two critical limitations: (1) impure and unstable semantic tokenization, where quantization methods struggle with interaction noise and codebook collapse, resulting in SIDs with ambiguous discrimination; and (2) lossy and weakly structured generation, where reliance solely on coarse-grained discrete tokens inevitably introduces information loss and neglects items' hierarchical logic. To address these issues, we propose a novel generative recommendation framework, PRISM, with Purified Representation and Integrated Semantic Modeling. Specifically, to ensure high-quality tokenization, we design a Purified Semantic Quantizer that constructs a robust codebook via adaptive collaborative denoising and hierarchical semantic anchoring mechanisms. To compensate for information loss during quantization, we further propose an Integrated Semantic Recommender, which incorporates a dynamic semantic integration mechanism to integrate fine-grained semantics and enforces logical validity through a semantic structure alignment objective. PRISM consistently outperforms state-of-the-art baselines across four real-world datasets, demonstrating substantial performance gains, particularly in high-sparsity scenarios.

  • 5 authors
·
Jan 23

Better Generalization with Semantic IDs: A Case Study in Ranking for Recommendations

Randomly-hashed item ids are used ubiquitously in recommendation models. However, the learned representations from random hashing prevents generalization across similar items, causing problems of learning unseen and long-tail items, especially when item corpus is large, power-law distributed, and evolving dynamically. In this paper, we propose using content-derived features as a replacement for random ids. We show that simply replacing ID features with content-based embeddings can cause a drop in quality due to reduced memorization capability. To strike a good balance of memorization and generalization, we propose to use Semantic IDs -- a compact discrete item representation learned from frozen content embeddings using RQ-VAE that captures the hierarchy of concepts in items -- as a replacement for random item ids. Similar to content embeddings, the compactness of Semantic IDs poses a problem of easy adaption in recommendation models. We propose novel methods for adapting Semantic IDs in industry-scale ranking models, through hashing sub-pieces of of the Semantic-ID sequences. In particular, we find that the SentencePiece model that is commonly used in LLM tokenization outperforms manually crafted pieces such as N-grams. To the end, we evaluate our approaches in a real-world ranking model for YouTube recommendations. Our experiments demonstrate that Semantic IDs can replace the direct use of video IDs by improving the generalization ability on new and long-tail item slices without sacrificing overall model quality.

  • 12 authors
·
Jun 13, 2023

Towards Open-Ended Visual Scientific Discovery with Sparse Autoencoders

Scientific archives now contain hundreds of petabytes of data across genomics, ecology, climate, and molecular biology that could reveal undiscovered patterns if systematically analyzed at scale. Large-scale, weakly-supervised datasets in language and vision have driven the development of foundation models whose internal representations encode structure (patterns, co-occurrences and statistical regularities) beyond their training objectives. Most existing methods extract structure only for pre-specified targets; they excel at confirmation but do not support open-ended discovery of unknown patterns. We ask whether sparse autoencoders (SAEs) can enable open-ended feature discovery from foundation model representations. We evaluate this question in controlled rediscovery studies, where the learned SAE features are tested for alignment with semantic concepts on a standard segmentation benchmark and compared against strong label-free alternatives on concept-alignment metrics. Applied to ecological imagery, the same procedure surfaces fine-grained anatomical structure without access to segmentation or part labels, providing a scientific case study with ground-truth validation. While our experiments focus on vision with an ecology case study, the method is domain-agnostic and applicable to models in other sciences (e.g., proteins, genomics, weather). Our results indicate that sparse decomposition provides a practical instrument for exploring what scientific foundation models have learned, an important prerequisite for moving from confirmation to genuine discovery.

  • 4 authors
·
Nov 21, 2025

Language Representations Can be What Recommenders Need: Findings and Potentials

Recent studies empirically indicate that language models (LMs) encode rich world knowledge beyond mere semantics, attracting significant attention across various fields. However, in the recommendation domain, it remains uncertain whether LMs implicitly encode user preference information. Contrary to prevailing understanding that LMs and traditional recommenders learn two distinct representation spaces due to the huge gap in language and behavior modeling objectives, this work re-examines such understanding and explores extracting a recommendation space directly from the language representation space. Surprisingly, our findings demonstrate that item representations, when linearly mapped from advanced LM representations, yield superior recommendation performance. This outcome suggests the possible homomorphism between the advanced language representation space and an effective item representation space for recommendation, implying that collaborative signals may be implicitly encoded within LMs. Motivated by these findings, we explore the possibility of designing advanced collaborative filtering (CF) models purely based on language representations without ID-based embeddings. To be specific, we incorporate several crucial components to build a simple yet effective model, with item titles as the input. Empirical results show that such a simple model can outperform leading ID-based CF models, which sheds light on using language representations for better recommendation. Moreover, we systematically analyze this simple model and find several key features for using advanced language representations: a good initialization for item representations, zero-shot recommendation abilities, and being aware of user intention. Our findings highlight the connection between language modeling and behavior modeling, which can inspire both natural language processing and recommender system communities.

  • 6 authors
·
Jul 7, 2024

Unified Embedding: Battle-Tested Feature Representations for Web-Scale ML Systems

Learning high-quality feature embeddings efficiently and effectively is critical for the performance of web-scale machine learning systems. A typical model ingests hundreds of features with vocabularies on the order of millions to billions of tokens. The standard approach is to represent each feature value as a d-dimensional embedding, introducing hundreds of billions of parameters for extremely high-cardinality features. This bottleneck has led to substantial progress in alternative embedding algorithms. Many of these methods, however, make the assumption that each feature uses an independent embedding table. This work introduces a simple yet highly effective framework, Feature Multiplexing, where one single representation space is used across many different categorical features. Our theoretical and empirical analysis reveals that multiplexed embeddings can be decomposed into components from each constituent feature, allowing models to distinguish between features. We show that multiplexed representations lead to Pareto-optimal parameter-accuracy tradeoffs for three public benchmark datasets. Further, we propose a highly practical approach called Unified Embedding with three major benefits: simplified feature configuration, strong adaptation to dynamic data distributions, and compatibility with modern hardware. Unified embedding gives significant improvements in offline and online metrics compared to highly competitive baselines across five web-scale search, ads, and recommender systems, where it serves billions of users across the world in industry-leading products.

  • 7 authors
·
May 20, 2023

GSSF: Generalized Structural Sparse Function for Deep Cross-modal Metric Learning

Cross-modal metric learning is a prominent research topic that bridges the semantic heterogeneity between vision and language. Existing methods frequently utilize simple cosine or complex distance metrics to transform the pairwise features into a similarity score, which suffers from an inadequate or inefficient capability for distance measurements. Consequently, we propose a Generalized Structural Sparse Function to dynamically capture thorough and powerful relationships across modalities for pair-wise similarity learning while remaining concise but efficient. Specifically, the distance metric delicately encapsulates two formats of diagonal and block-diagonal terms, automatically distinguishing and highlighting the cross-channel relevancy and dependency inside a structured and organized topology. Hence, it thereby empowers itself to adapt to the optimal matching patterns between the paired features and reaches a sweet spot between model complexity and capability. Extensive experiments on cross-modal and two extra uni-modal retrieval tasks (image-text retrieval, person re-identification, fine-grained image retrieval) have validated its superiority and flexibility over various popular retrieval frameworks. More importantly, we further discover that it can be seamlessly incorporated into multiple application scenarios, and demonstrates promising prospects from Attention Mechanism to Knowledge Distillation in a plug-and-play manner. Our code is publicly available at: https://github.com/Paranioar/GSSF.

  • 6 authors
·
Oct 19, 2024

Semantic Tree Inference on Text Corpa using a Nested Density Approach together with Large Language Model Embeddings

Semantic text classification has undergone significant advances in recent years due to the rise of large language models (LLMs) and their high dimensional embeddings. While LLM-embeddings are frequently used to store and retrieve text by semantic similarity in vector databases, the global structure semantic relationships in text corpora often remains opaque. Herein we propose a nested density clustering approach, to infer hierarchical trees of semantically related texts. The method starts by identifying texts of strong semantic similarity as it searches for dense clusters in LLM embedding space. As the density criterion is gradually relaxed, these dense clusters merge into more diffuse clusters, until the whole dataset is represented by a single cluster -- the root of the tree. By embedding dense clusters into increasingly diffuse ones, we construct a tree structure that captures hierarchical semantic relationships among texts. We outline how this approach can be used to classify textual data for abstracts of scientific abstracts as a case study. This enables the data-driven discovery research areas and their subfields without predefined categories. To evaluate the general applicability of the method, we further apply it to established benchmark datasets such as the 20 Newsgroups and IMDB 50k Movie Reviews, demonstrating its robustness across domains. Finally we discuss possible applications on scientometrics, topic evolution, highlighting how nested density trees can reveal semantic structure and evolution in textual datasets.

  • 2 authors
·
Dec 29, 2025

I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data?

The remarkable achievements of large language models (LLMs) have led many to conclude that they exhibit a form of intelligence. This is as opposed to explanations of their capabilities based on their ability to perform relatively simple manipulations of vast volumes of data. To illuminate the distinction between these explanations, we introduce a novel generative model that generates tokens on the basis of human-interpretable concepts represented as latent discrete variables. Under mild conditions, even when the mapping from the latent space to the observed space is non-invertible, we establish an identifiability result, i.e., the representations learned by LLMs through next-token prediction can be approximately modeled as the logarithm of the posterior probabilities of these latent discrete concepts given input context, up to an invertible linear transformation. This theoretical finding not only provides evidence that LLMs capture underlying generative factors, but also provide a unified prospective for understanding of the linear representation hypothesis. Taking this a step further, our finding motivates a reliable evaluation of sparse autoencoders by treating the performance of supervised concept extractors as an upper bound. Pushing this idea even further, it inspires a structural variant that enforces dependence among latent concepts in addition to promoting sparsity. Empirically, we validate our theoretical results through evaluations on both simulation data and the Pythia, Llama, and DeepSeek model families, and demonstrate the effectiveness of our structured sparse autoencoder.

  • 9 authors
·
Mar 11, 2025

Relative representations enable zero-shot latent space communication

Neural networks embed the geometric structure of a data manifold lying in a high-dimensional space into latent representations. Ideally, the distribution of the data points in the latent space should depend only on the task, the data, the loss, and other architecture-specific constraints. However, factors such as the random weights initialization, training hyperparameters, or other sources of randomness in the training phase may induce incoherent latent spaces that hinder any form of reuse. Nevertheless, we empirically observe that, under the same data and modeling choices, the angles between the encodings within distinct latent spaces do not change. In this work, we propose the latent similarity between each sample and a fixed set of anchors as an alternative data representation, demonstrating that it can enforce the desired invariances without any additional training. We show how neural architectures can leverage these relative representations to guarantee, in practice, invariance to latent isometries and rescalings, effectively enabling latent space communication: from zero-shot model stitching to latent space comparison between diverse settings. We extensively validate the generalization capability of our approach on different datasets, spanning various modalities (images, text, graphs), tasks (e.g., classification, reconstruction) and architectures (e.g., CNNs, GCNs, transformers).

  • 6 authors
·
Sep 30, 2022

UniGraph2: Learning a Unified Embedding Space to Bind Multimodal Graphs

Existing foundation models, such as CLIP, aim to learn a unified embedding space for multimodal data, enabling a wide range of downstream web-based applications like search, recommendation, and content classification. However, these models often overlook the inherent graph structures in multimodal datasets, where entities and their relationships are crucial. Multimodal graphs (MMGs) represent such graphs where each node is associated with features from different modalities, while the edges capture the relationships between these entities. On the other hand, existing graph foundation models primarily focus on text-attributed graphs (TAGs) and are not designed to handle the complexities of MMGs. To address these limitations, we propose UniGraph2, a novel cross-domain graph foundation model that enables general representation learning on MMGs, providing a unified embedding space. UniGraph2 employs modality-specific encoders alongside a graph neural network (GNN) to learn a unified low-dimensional embedding space that captures both the multimodal information and the underlying graph structure. We propose a new cross-domain multi-graph pre-training algorithm at scale to ensure effective transfer learning across diverse graph domains and modalities. Additionally, we adopt a Mixture of Experts (MoE) component to align features from different domains and modalities, ensuring coherent and robust embeddings that unify the information across modalities. Extensive experiments on a variety of multimodal graph tasks demonstrate that UniGraph2 significantly outperforms state-of-the-art models in tasks such as representation learning, transfer learning, and multimodal generative tasks, offering a scalable and flexible solution for learning on MMGs.

  • 6 authors
·
Feb 2, 2025

Science Hierarchography: Hierarchical Organization of Science Literature

Scientific knowledge is growing rapidly, making it challenging to track progress and high-level conceptual links across broad disciplines. While existing tools like citation networks and search engines make it easy to access a few related papers, they fundamentally lack the flexible abstraction needed to represent the density of activity in various scientific subfields. We motivate SCIENCE HIERARCHOGRAPHY, the goal of organizing scientific literature into a high-quality hierarchical structure that allows for the categorization of scientific work across varying levels of abstraction, from very broad fields to very specific studies. Such a representation can provide insights into which fields are well-explored and which are under-explored. To achieve the goals of SCIENCE HIERARCHOGRAPHY, we develop a range of algorithms. Our primary approach combines fast embedding-based clustering with LLM-based prompting to balance the computational efficiency of embedding methods with the semantic precision offered by LLM prompting. We demonstrate that this approach offers the best trade-off between quality and speed compared to methods that heavily rely on LLM prompting, such as iterative tree construction with LLMs. To better reflect the interdisciplinary and multifaceted nature of research papers, our hierarchy captures multiple dimensions of categorization beyond simple topic labels. We evaluate the utility of our framework by assessing how effectively an LLM-based agent can locate target papers using the hierarchy. Results show that this structured approach enhances interpretability, supports trend discovery, and offers an alternative pathway for exploring scientific literature beyond traditional search methods. Code, data and demo: https://github.com/JHU-CLSP/science-hierarchography{https://github.com/JHU-CLSP/science-hierarchography}

  • 4 authors
·
Apr 18, 2025

Benchmarking Filtered Approximate Nearest Neighbor Search Algorithms on Transformer-based Embedding Vectors

Advances in embedding models for text, image, audio, and video drive progress across multiple domains, including retrieval-augmented generation, recommendation systems, vehicle/person reidentification, and face recognition. Many applications in these domains require an efficient method to retrieve items that are close to a given query in the embedding space while satisfying a filter condition based on the item's attributes, a problem known as Filtered Approximate Nearest Neighbor Search (FANNS). In this work, we present a comprehensive survey and taxonomy of FANNS methods and analyze how they are benchmarked in the literature. By doing so, we identify a key challenge in the current FANNS landscape: the lack of diverse and realistic datasets, particularly ones derived from the latest transformer-based text embedding models. To address this, we introduce a novel dataset consisting of embedding vectors for the abstracts of over 2.7 million research articles from the arXiv repository, accompanied by 11 real-world attributes such as authors and categories. We benchmark a wide range of FANNS methods on our novel dataset and find that each method has distinct strengths and limitations; no single approach performs best across all scenarios. ACORN, for example, supports various filter types and performs reliably across dataset scales but is often outperformed by more specialized methods. SeRF shows excellent performance for range filtering on ordered attributes but cannot handle categorical attributes. Filtered-DiskANN and UNG excel on the medium-scale dataset but fail on the large-scale dataset, highlighting the challenge posed by transformer-based embeddings, which are often more than an order of magnitude larger than earlier embeddings. We conclude that no universally best method exists.

  • 5 authors
·
Jul 29, 2025

FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models

Recent research has demonstrated that Feed-Forward Networks (FFNs) in Large Language Models (LLMs) play a pivotal role in storing diverse linguistic and factual knowledge. Conventional methods frequently face challenges due to knowledge confusion stemming from their monolithic and redundant architectures, which calls for more efficient solutions with minimal computational overhead, particularly for LLMs. In this paper, we explore the FFN computation paradigm in LLMs and introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications, while maintaining the same level of performance. Furthermore, we embed a router from the Mixture-of-Experts (MoE), combined with our devised Prior-Approximate (PA) loss term that facilitates the dynamic activation of experts and knowledge adaptation, thereby accelerating computational processes and enhancing performance using minimal training data and fine-tuning steps. FactorLLM thus enables efficient knowledge factorization and activates select groups of experts specifically tailored to designated tasks, emulating the interactive functional segmentation of the human brain. Extensive experiments across various benchmarks demonstrate the effectiveness of our proposed FactorLLM which achieves comparable performance to the source model securing up to 85% model performance while obtaining over a 30% increase in inference speed. Code: https://github.com/zhenwuweihe/FactorLLM.

  • 9 authors
·
Aug 15, 2024

Hyperdimensional Cross-Modal Alignment of Frozen Language and Image Models for Efficient Image Captioning

Large unimodal foundation models for vision and language encode rich semantic structures, yet aligning them typically requires computationally intensive multimodal fine-tuning. Such approaches depend on large-scale parameter updates, are resource intensive, and can perturb pretrained representations. Emerging evidence suggests, however, that independently trained foundation models may already exhibit latent semantic compatibility, reflecting shared structures in the data they model. This raises a fundamental question: can cross-modal alignment be achieved without modifying the models themselves? Here we introduce HDFLIM (HyperDimensional computing with Frozen Language and Image Models), a framework that establishes cross-modal mappings while keeping pretrained vision and language models fully frozen. HDFLIM projects unimodal embeddings into a shared hyperdimensional space and leverages lightweight symbolic operations -- binding, bundling, and similarity-based retrieval to construct associative cross-modal representations in a single pass over the data. Caption generation emerges from high-dimensional memory retrieval rather than iterative gradient-based optimization. We show that HDFLIM achieves performance comparable to end-to-end vision-language training methods and produces captions that are more semantically grounded than zero-shot baselines. By decoupling alignment from parameter tuning, our results suggest that semantic mapping across foundation models can be realized through symbolic operations on hyperdimensional encodings of the respective embeddings. More broadly, this work points toward an alternative paradigm for foundation model alignment in which frozen models are integrated through structured representational mappings rather than through large-scale retraining. The codebase for our implementation can be found at https://github.com/Abhishek-Dalvi410/HDFLIM.

  • 2 authors
·
Feb 26

Toward Effective Multimodal Graph Foundation Model: A Divide-and-Conquer Based Approach

Graph Foundation Models (GFMs) have achieved remarkable success in generalizing across diverse domains. However, they mainly focus on Text-Attributed Graphs (TAGs), leaving Multimodal-Attributed Graphs (MAGs) largely untapped. Developing Multimodal Graph Foundation Models (MGFMs) allows for leveraging the rich multimodal information in MAGs, and extends applicability to broader types of downstream tasks. While recent MGFMs integrate diverse modality information, our empirical investigation reveals two fundamental limitations of existing MGFMs: (1)they fail to explicitly model modality interaction, essential for capturing intricate cross-modal semantics beyond simple aggregation, and (2)they exhibit sub-optimal modality alignment, which is critical for bridging the significant semantic disparity between distinct modal spaces. To address these challenges, we propose PLANET (graPh topoLogy-aware modAlity iNteraction and alignmEnT), a novel framework employing a Divide-and-Conquer strategy to decouple modality interaction and alignment across distinct granularities. At the embedding granularity, (1)Embedding-wise Domain Gating (EDG) performs local semantic enrichment by adaptively infusing topology-aware cross-modal context, achieving modality interaction. At the node granularity, (2)Node-wise Discretization Retrieval (NDR) ensures global modality alignment by constructing a Discretized Semantic Representation Space (DSRS) to bridge modality gaps. Extensive experiments demonstrate that PLANET significantly outperforms state-of-the-art baselines across diverse graph-centric and multimodal generative tasks.

  • 7 authors
·
Feb 3

On the Theoretical Limitations of Embedding-Based Retrieval

Vector embeddings have been tasked with an ever-increasing set of retrieval tasks over the years, with a nascent rise in using them for reasoning, instruction-following, coding, and more. These new benchmarks push embeddings to work for any query and any notion of relevance that could be given. While prior works have pointed out theoretical limitations of vector embeddings, there is a common assumption that these difficulties are exclusively due to unrealistic queries, and those that are not can be overcome with better training data and larger models. In this work, we demonstrate that we may encounter these theoretical limitations in realistic settings with extremely simple queries. We connect known results in learning theory, showing that the number of top-k subsets of documents capable of being returned as the result of some query is limited by the dimension of the embedding. We empirically show that this holds true even if we restrict to k=2, and directly optimize on the test set with free parameterized embeddings. We then create a realistic dataset called LIMIT that stress tests models based on these theoretical results, and observe that even state-of-the-art models fail on this dataset despite the simple nature of the task. Our work shows the limits of embedding models under the existing single vector paradigm and calls for future research to develop methods that can resolve this fundamental limitation.

  • 4 authors
·
Aug 28, 2025 3

Autoregressive Search Engines: Generating Substrings as Document Identifiers

Knowledge-intensive language tasks require NLP systems to both provide the correct answer and retrieve supporting evidence for it in a given corpus. Autoregressive language models are emerging as the de-facto standard for generating answers, with newer and more powerful systems emerging at an astonishing pace. In this paper we argue that all this (and future) progress can be directly applied to the retrieval problem with minimal intervention to the models' architecture. Previous work has explored ways to partition the search space into hierarchical structures and retrieve documents by autoregressively generating their unique identifier. In this work we propose an alternative that doesn't force any structure in the search space: using all ngrams in a passage as its possible identifiers. This setup allows us to use an autoregressive model to generate and score distinctive ngrams, that are then mapped to full passages through an efficient data structure. Empirically, we show this not only outperforms prior autoregressive approaches but also leads to an average improvement of at least 10 points over more established retrieval solutions for passage-level retrieval on the KILT benchmark, establishing new state-of-the-art downstream performance on some datasets, while using a considerably lighter memory footprint than competing systems. Code and pre-trained models at https://github.com/facebookresearch/SEAL.

  • 6 authors
·
Apr 22, 2022

How to Index Item IDs for Recommendation Foundation Models

Recommendation foundation model utilizes large language models (LLM) for recommendation by converting recommendation tasks into natural language tasks. It enables generative recommendation which directly generates the item(s) to recommend rather than calculating a ranking score for each and every candidate item in traditional recommendation models, simplifying the recommendation pipeline from multi-stage filtering to single-stage filtering. To avoid generating excessively long text and hallucinated recommendation when deciding which item(s) to recommend, creating LLM-compatible item IDs to uniquely identify each item is essential for recommendation foundation models. In this study, we systematically examine the item indexing problem for recommendation foundation models, using P5 as an example of backbone model. To emphasize the importance of item indexing, we first discuss the issues of several trivial item indexing methods, such as independent indexing, title indexing, and random indexing. We then propose four simple yet effective solutions, including sequential indexing, collaborative indexing, semantic (content-based) indexing, and hybrid indexing. Our study highlights the significant influence of item indexing methods on the performance of LLM-based recommendation, and our results on real-world datasets validate the effectiveness of our proposed solutions. The research also demonstrates how recent advances on language modeling and traditional IR principles such as indexing can help each other for better learning and inference.

  • 4 authors
·
May 11, 2023

CRISP: Clustering Multi-Vector Representations for Denoising and Pruning

Multi-vector models, such as ColBERT, are a significant advancement in neural information retrieval (IR), delivering state-of-the-art performance by representing queries and documents by multiple contextualized token-level embeddings. However, this increased representation size introduces considerable storage and computational overheads which have hindered widespread adoption in practice. A common approach to mitigate this overhead is to cluster the model's frozen vectors, but this strategy's effectiveness is fundamentally limited by the intrinsic clusterability of these embeddings. In this work, we introduce CRISP (Clustered Representations with Intrinsic Structure Pruning), a novel multi-vector training method which learns inherently clusterable representations directly within the end-to-end training process. By integrating clustering into the training phase rather than imposing it post-hoc, CRISP significantly outperforms post-hoc clustering at all representation sizes, as well as other token pruning methods. On the BEIR retrieval benchmarks, CRISP achieves a significant rate of ~3x reduction in the number of vectors while outperforming the original unpruned model. This indicates that learned clustering effectively denoises the model by filtering irrelevant information, thereby generating more robust multi-vector representations. With more aggressive clustering, CRISP achieves an 11x reduction in the number of vectors with only a 3.6% quality loss.

  • 6 authors
·
May 16, 2025

Linking Datasets on Organizations Using Half A Billion Open Collaborated Records

Scholars studying organizations often work with multiple datasets lacking shared unique identifiers or covariates. In such situations, researchers may turn to approximate string matching methods to combine datasets. String matching, although useful, faces fundamental challenges. Even when two strings appear similar to humans, fuzzy matching often does not work because it fails to adapt to the informativeness of the character combinations presented. Worse, many entities have multiple names that are dissimilar (e.g., "Fannie Mae" and "Federal National Mortgage Association"), a case where string matching has little hope of succeeding. This paper introduces data from a prominent employment-related networking site (LinkedIn) as a tool to address these problems. We propose interconnected approaches to leveraging the massive amount of information from LinkedIn regarding organizational name-to-name links. The first approach builds a machine learning model for predicting matches from character strings, treating the trillions of user-contributed organizational name pairs as a training corpus: this approach constructs a string matching metric that explicitly maximizes match probabilities. A second approach identifies relationships between organization names using network representations of the LinkedIn data. A third approach combines the first and second. We document substantial improvements over fuzzy matching in applications, making all methods accessible in open-source software ("LinkOrgs").

  • 2 authors
·
Feb 5, 2023 1

Hypercube-Based Retrieval-Augmented Generation for Scientific Question-Answering

Large language models (LLMs) often need to incorporate external knowledge to solve theme-specific problems. Retrieval-augmented generation (RAG) has shown its high promise, empowering LLMs to generate more qualified responses with retrieved external data and knowledge. However, most RAG methods retrieve relevant documents based on either sparse or dense retrieval methods or their combinations, which overlooks the essential, multi-dimensional, and structured semantic information present in documents. This structured information plays a critical role in finding concise yet highly relevant information for domain knowledge-intensive tasks, such as scientific question-answering (QA). In this work, we introduce a multi-dimensional (cube) structure, Hypercube, which can index and allocate documents in a pre-defined multi-dimensional space. Built on the hypercube, we further propose Hypercube-RAG, a novel RAG framework for precise and efficient retrieval. Given a query, Hypercube-RAG first decomposes it based on its entities, phrases, and topics along with pre-defined hypercube dimensions, and then retrieves relevant documents from cubes by aligning these decomposed components with corresponding dimensions. Experiments on three datasets across different domains demonstrate that our method improves response accuracy by 3.7% and retrieval accuracy by 5.3% over the strongest RAG baseline. It also boosts retrieval efficiency (speed) by one or two magnitudes faster than graph-based RAG. Notably, our Hypercube-RAG inherently offers explainability by revealing those underlying dimensions used for retrieval. The code and data are available at https://github.com/JimengShi/Hypercube-RAG.

  • 8 authors
·
May 25, 2025

Discovering Failure Modes of Text-guided Diffusion Models via Adversarial Search

Text-guided diffusion models (TDMs) are widely applied but can fail unexpectedly. Common failures include: (i) natural-looking text prompts generating images with the wrong content, or (ii) different random samples of the latent variables that generate vastly different, and even unrelated, outputs despite being conditioned on the same text prompt. In this work, we aim to study and understand the failure modes of TDMs in more detail. To achieve this, we propose SAGE, the first adversarial search method on TDMs that systematically explores the discrete prompt space and the high-dimensional latent space, to automatically discover undesirable behaviors and failure cases in image generation. We use image classifiers as surrogate loss functions during searching, and employ human inspections to validate the identified failures. For the first time, our method enables efficient exploration of both the discrete and intricate human language space and the challenging latent space, overcoming the gradient vanishing problem. Then, we demonstrate the effectiveness of SAGE on five widely used generative models and reveal four typical failure modes: (1) We find a variety of natural text prompts that generate images failing to capture the semantics of input texts. We further discuss the underlying causes and potential solutions based on the results. (2) We find regions in the latent space that lead to distorted images independent of the text prompt, suggesting that parts of the latent space are not well-structured. (3) We also find latent samples that result in natural-looking images unrelated to the text prompt, implying a possible misalignment between the latent and prompt spaces. (4) By appending a single adversarial token embedding to any input prompts, we can generate a variety of specified target objects. Project page: https://sage-diffusion.github.io/

  • 5 authors
·
Jun 1, 2023