Collections
Discover the best community collections!
Collections including paper arxiv:2406.12624
-
Attention Is All You Need
Paper • 1706.03762 • Published • 120
Scaling Laws for Neural Language Models
Paper • 2001.08361 • Published • 10
Training Compute-Optimal Large Language Models
Paper • 2203.15556 • Published • 11
Analogy Generation by Prompting Large Language Models: A Case Study of InstructGPT
Paper • 2210.04186 • Published
-
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
Paper • 2406.12624 • Published • 37
A Survey on LLM-as-a-Judge
Paper • 2411.15594 • Published
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
Paper • 2412.05579 • Published • 2
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
Paper • 2411.16594 • Published • 39
-
Instruction Pre-Training: Language Models are Supervised Multitask Learners
Paper • 2406.14491 • Published • 96
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Paper • 2405.21060 • Published • 68
Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models
Paper • 2405.20541 • Published • 24
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
Paper • 2406.01574 • Published • 54
-
Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing
Paper • 2404.12253 • Published • 55
FlowMind: Automatic Workflow Generation with LLMs
Paper • 2404.13050 • Published • 34
How Far Can We Go with Practical Function-Level Program Repair?
Paper • 2404.12833 • Published • 7
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
Paper • 2404.18796 • Published • 71
-
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
Paper • 2403.04132 • Published • 40
Evaluating Very Long-Term Conversational Memory of LLM Agents
Paper • 2402.17753 • Published • 19
The FinBen: An Holistic Financial Benchmark for Large Language Models
Paper • 2402.12659 • Published • 24
TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization
Paper • 2402.13249 • Published • 15
-
Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning
Paper • 2512.20605 • Published • 62
Enhancing Human-Like Responses in Large Language Models
Paper • 2501.05032 • Published • 61
LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning
Paper • 2401.01325 • Published • 27
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
Paper • 2406.12624 • Published • 37
-
Learn Your Reference Model for Real Good Alignment
Paper • 2404.09656 • Published • 90
Aligning Teacher with Student Preferences for Tailored Training Data Generation
Paper • 2406.19227 • Published • 25
Self-Play Preference Optimization for Language Model Alignment
Paper • 2405.00675 • Published • 28
CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues
Paper • 2404.03820 • Published • 25
-
CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data
Paper • 2404.15653 • Published • 29
MoDE: CLIP Data Experts via Clustering
Paper • 2404.16030 • Published • 15
MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning
Paper • 2405.12130 • Published • 50
Reducing Transformer Key-Value Cache Size with Cross-Layer Attention
Paper • 2405.12981 • Published • 33
-
Judging LLM-as-a-judge with MT-Bench and Chatbot Arena
Paper • 2306.05685 • Published • 42
Generative Judge for Evaluating Alignment
Paper • 2310.05470 • Published • 1
Humans or LLMs as the Judge? A Study on Judgement Biases
Paper • 2402.10669 • Published
JudgeLM: Fine-tuned Large Language Models are Scalable Judges
Paper • 2310.17631 • Published • 35
-
CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent Evaluation
Paper • 2401.01275 • Published • 1
Introducing v0.5 of the AI Safety Benchmark from MLCommons
Paper • 2404.12241 • Published • 13
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
Paper • 2405.01535 • Published • 124
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
Paper • 2406.12624 • Published • 37