Thinking in Frames: How Visual Context and Test-Time Scaling Empower Video Reasoning (arXiv:2601.21037, published Jan 28, 2026)
ExpAlign: Expectation-Guided Vision-Language Alignment for Open-Vocabulary Grounding (arXiv:2601.22666, published Jan 30, 2026)
Lost in Embeddings: Information Loss in Vision-Language Models (arXiv:2509.11986, published Sep 15, 2025)
Hanfu-Bench: A Multimodal Benchmark on Cross-Temporal Cultural Understanding and Transcreation (arXiv:2506.01565, published Jun 2, 2025)
PokemonChat: Auditing ChatGPT for Pokémon Universe Knowledge (arXiv:2306.03024, published Jun 5, 2023)
Structural Similarities Between Language Models and Neural Response Measurements (arXiv:2306.01930, published Jun 2, 2023)
RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding (arXiv:2505.14462, published May 20, 2025)
FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture (arXiv:2406.11030, published Jun 16, 2024)
Understanding Retrieval Robustness for Retrieval-Augmented Image Captioning (arXiv:2406.02265, published Jun 4, 2024)
Do Vision and Language Models Share Concepts? A Vector Space Alignment Study (arXiv:2302.06555, published Feb 13, 2023)