Collections
Discover the best community collections!
Collections including paper arxiv:2511.15661
- ARE: Scaling Up Agent Environments and Evaluations (Paper • 2509.17158 • Published • 36)
- ARTDECO: Towards Efficient and High-Fidelity On-the-Fly 3D Reconstruction with Structured Scene Representation (Paper • 2510.08551 • Published • 34)
- Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention (Paper • 2510.04212 • Published • 26)
- ERA: Transforming VLMs into Embodied Agents via Embodied Prior Learning and Online Reinforcement Learning (Paper • 2510.12693 • Published • 28)

- MARS: A Multi-Agent Framework Incorporating Socratic Guidance for Automated Prompt Optimization (Paper • 2503.16874 • Published • 45)
- System Prompt Optimization with Meta-Learning (Paper • 2505.09666 • Published • 71)
- UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning (Paper • 2505.23380 • Published • 22)
- DeepTheorem: Advancing LLM Reasoning for Theorem Proving Through Natural Language and Reinforcement Learning (Paper • 2505.23754 • Published • 15)

- iFormer: Integrating ConvNet and Transformer for Mobile Application (Paper • 2501.15369 • Published • 13)
- VisPlay: Self-Evolving Vision-Language Models from Images (Paper • 2511.15661 • Published • 44)
- SnapGen++: Unleashing Diffusion Transformers for Efficient High-Fidelity Image Generation on Edge Devices (Paper • 2601.08303 • Published • 19)

- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters (Paper • 2402.04252 • Published • 30)
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models (Paper • 2402.03749 • Published • 15)
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding (Paper • 2402.04615 • Published • 44)
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss (Paper • 2402.05008 • Published • 23)

- MemLoRA: Distilling Expert Adapters for On-Device Memory Systems (Paper • 2512.04763 • Published • 5)
- VisPlay: Self-Evolving Vision-Language Models from Images (Paper • 2511.15661 • Published • 44)
- VersatileFFN: Achieving Parameter Efficiency in LLMs via Adaptive Wide-and-Deep Reuse (Paper • 2512.14531 • Published • 15)
- Improving Recursive Transformers with Mixture of LoRAs (Paper • 2512.12880 • Published • 6)

- A Survey on Vision-Language-Action Models: An Action Tokenization Perspective (Paper • 2507.01925 • Published • 39)
- Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning (Paper • 2507.16746 • Published • 35)
- MolmoAct: Action Reasoning Models that can Reason in Space (Paper • 2508.07917 • Published • 45)
- Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies (Paper • 2508.20072 • Published • 32)

- InfiR: Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning (Paper • 2502.11573 • Published • 9)
- Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking (Paper • 2502.02339 • Published • 23)
- video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model (Paper • 2502.11775 • Published • 9)
- Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search (Paper • 2412.18319 • Published • 39)