Judging LLM-as-a-judge with MT-Bench and Chatbot Arena Paper • 2306.05685 • Published Jun 9, 2023 • 42
ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent Paper • 2312.10003 • Published Dec 15, 2023 • 44
Leveraging Large Language Models for NLG Evaluation: A Survey Paper • 2401.07103 • Published Jan 13, 2024 • 4
Prometheus: Inducing Fine-grained Evaluation Capability in Language Models Paper • 2310.08491 • Published Oct 12, 2023 • 57
Judge Arena • Running • Agents • View and compare open-source AI model rankings with ELO scores • 111