Collections

Collections including paper arxiv:2601.14490

Depth Anything V2

Paper • 2406.09414 • Published Jun 13, 2024 • 103
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels

Paper • 2406.09415 • Published Jun 13, 2024 • 51
Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion

Paper • 2406.04338 • Published Jun 6, 2024 • 39
SAM 2: Segment Anything in Images and Videos

Paper • 2408.00714 • Published Aug 1, 2024 • 122

GutenOCR: A Grounded Vision-Language Front-End for Documents

Paper • 2601.14490 • Published Jan 20 • 37

Data and models for optical character recognition

PubMed-OCR: PMC Open Access OCR Annotations

Paper • 2601.11425 • Published Jan 16 • 12
GutenOCR: A Grounded Vision-Language Front-End for Documents

Paper • 2601.14490 • Published Jan 20 • 37
rootsautomation/TABMEpp

Viewer • Updated Aug 23, 2024 • 122k • 114 • 5
rootsautomation/pubmed-ocr

Viewer • Updated Jan 22 • 1.55M • 4.24k • 70

GutenOCR: A Grounded Vision-Language Front-End for Documents

Paper • 2601.14490 • Published Jan 20 • 37

GutenOCR: A Grounded Vision-Language Front-End for Documents

Paper • 2601.14490 • Published Jan 20 • 37
rootsautomation/GutenOCR-3B

Image-Text-to-Text • 4B • Updated Mar 10 • 384 • 26
rootsautomation/GutenOCR-7B

Image-Text-to-Text • 8B • Updated Mar 11 • 239 • 25

DocLLM: A layout-aware generative language model for multimodal document understanding

Paper • 2401.00908 • Published Dec 31, 2023 • 191
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training

Paper • 2401.00849 • Published Jan 1, 2024 • 17
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

Paper • 2311.05437 • Published Nov 9, 2023 • 51
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing

Paper • 2311.00571 • Published Nov 1, 2023 • 42

