arxiv:2308.03688

AgentBench: Evaluating LLMs as Agents

Published on Aug 7, 2023

Submitted by AK on Aug 8, 2023

#2 Paper of the day

Authors:

Xiao Liu, Hao Yu, Yifan Xu, Hanyu Lai, Yu Gu, Kaiwen Men, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Yuxiao Dong, Jie Tang

AI-generated summary

AgentBench is a multi-dimensional benchmark for evaluating LLMs as autonomous agents across various interactive environments, highlighting performance differences between commercial and open-source models.

Abstract

Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks. As a result, there is an urgent need to evaluate LLMs as agents on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional, evolving benchmark that currently consists of 8 distinct environments for assessing the reasoning and decision-making abilities of LLMs acting as agents in a multi-turn, open-ended generation setting. Our extensive tests across 25 LLMs (both API-based and open-source models) show that, while top commercial LLMs demonstrate a strong ability to act as agents in complex environments, there is a significant performance gap between them and their open-source competitors. AgentBench also serves as a component of an ongoing project with wider coverage and deeper consideration towards systematic LLM evaluation. Datasets, environments, and an integrated evaluation package for AgentBench are released at https://github.com/THUDM/AgentBench
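To make the "multi-turn" setting in the abstract concrete, the sketch below shows one generic way an agent can be scored against an interactive environment. It is an illustration only and does not reflect the AgentBench code or API: `GuessNumberEnv`, `binary_search_agent`, `run_episode`, and `success_rate` are hypothetical names, the environment is a toy, and the agent is a scripted stand-in for an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (role, text) pairs; role is "env" or "agent"


@dataclass
class GuessNumberEnv:
    """Hypothetical toy environment: guess a hidden integer in [0, 100]."""
    target: int
    max_turns: int = 8
    turns: int = 0

    def reset(self) -> str:
        self.turns = 0
        return "Guess an integer between 0 and 100."

    def step(self, action: str) -> Tuple[str, bool, bool]:
        """Consume one agent action; return (observation, done, success)."""
        self.turns += 1
        try:
            guess = int(action.strip())
        except ValueError:
            # Free-form output may not parse; the turn is spent anyway.
            obs = "Invalid action: reply with an integer."
        else:
            if guess == self.target:
                return "Correct.", True, True
            obs = "Too low." if guess < self.target else "Too high."
        return obs, self.turns >= self.max_turns, False


def binary_search_agent(history: History) -> str:
    """Scripted stand-in for an LLM: narrows the range from past feedback."""
    lo, hi, last_guess = 0, 100, None
    for role, text in history:
        if role == "agent":
            last_guess = int(text)
        elif last_guess is not None:
            if text == "Too low.":
                lo = last_guess + 1
            elif text == "Too high.":
                hi = last_guess - 1
    return str((lo + hi) // 2)


def run_episode(env: GuessNumberEnv, agent: Callable[[History], str]) -> bool:
    """Alternate agent actions and environment feedback until the episode ends."""
    history: History = [("env", env.reset())]
    done, success = False, False
    while not done:
        action = agent(history)
        history.append(("agent", action))
        obs, done, success = env.step(action)
        history.append(("env", obs))
    return success


def success_rate(envs: List[GuessNumberEnv], agent: Callable[[History], str]) -> float:
    """Fraction of episodes the agent solves within the turn budget."""
    results = [run_episode(env, agent) for env in envs]
    return sum(results) / len(results)


if __name__ == "__main__":
    tasks = [GuessNumberEnv(target=t) for t in (0, 37, 100)]
    print(success_rate(tasks, binary_search_agent))
```

In this framing, swapping `binary_search_agent` for a function that sends the history to a model and returns its reply would leave the loop and the metric unchanged, which is the property that lets different models be compared on the same environments.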