Collections including paper arxiv:2402.17764

- Attention Is All You Need
  Paper • 1706.03762 • Published • 120
- Scaling Laws for Neural Language Models
  Paper • 2001.08361 • Published • 10
- Training Compute-Optimal Large Language Models
  Paper • 2203.15556 • Published • 11
- Analogy Generation by Prompting Large Language Models: A Case Study of InstructGPT
  Paper • 2210.04186 • Published

- peteromallet/dataclaw-peteromallet
  Viewer • Updated • 549 • 646 • 298
- Qwen/Qwen3.5-35B-A3B
  Image-Text-to-Text • 36B • Updated • 3.89M • 1.39k
- Nanbeige/Nanbeige4.1-3B
  Text Generation • 4B • Updated • 272k • 1.09k
- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
  Paper • 2402.17764 • Published • 628

- A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale
  Paper • 2309.06497 • Published • 7
- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
  Paper • 2402.17764 • Published • 628
- Llama 2: Open Foundation and Fine-Tuned Chat Models
  Paper • 2307.09288 • Published • 251

- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
  Paper • 2402.17764 • Published • 628
- Hierarchical Reasoning Model
  Paper • 2506.21734 • Published • 50
- Less is More: Recursive Reasoning with Tiny Networks
  Paper • 2510.04871 • Published • 513
- Training language models to follow instructions with human feedback
  Paper • 2203.02155 • Published • 24

- deepseek-ai/DeepSeek-V3.2
  Text Generation • 685B • Updated • 10.3M • 1.41k
- Anthropic/AnthropicInterviewer
  Viewer • Updated • 1.25k • 1.49k • 367
- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
  Paper • 2402.17764 • Published • 628
- Qwen/Qwen-Image-Layered
  Image-Text-to-Image • Updated • 22k • 1.05k

- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
  Paper • 2402.17764 • Published • 628
- MiniMax-01: Scaling Foundation Models with Lightning Attention
  Paper • 2501.08313 • Published • 302
- Group Sequence Policy Optimization
  Paper • 2507.18071 • Published • 320
- Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth
  Paper • 2509.03867 • Published • 213