import pandas as pd
import streamlit as st


st.set_page_config(
    page_title="JuStRank",
    page_icon="🧑🏻‍⚖️",
    initial_sidebar_state="auto",
    menu_items=None,
)


st.title("🧑🏻‍⚖️ JuStRank: The Best Judges for Ranking Systems 🧑🏻‍⚖️")


url = "https://aclanthology.org/2025.acl-long.34/"
st.subheader(f"Check out our [ACL paper]({url}) for more details")


def prettify_judge_name(judge_name):
    """Capitalize a judge model name for display, fixing "Gpt" -> "GPT"."""
    pretty_judge = (judge_name[0].upper() + judge_name[1:]).replace("Gpt", "GPT")
    return pretty_judge
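# Illustrative examples (hypothetical inputs):
#   prettify_judge_name("gpt-4o")        -> "GPT-4o"
#   prettify_judge_name("mistral-large") -> "Mistral-large"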


def format_digits(flt, num_digits=3):
    """Format a float compactly, dropping the leading zero for values in (0, 1)."""
    format_str = "{:." + str(num_digits - 1) + "f}"
    format_str_zeroes = "{:." + str(num_digits) + "f}"
    return format_str_zeroes.format(flt)[1:] if (0 < flt < 1) else format_str.format(flt)
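# Illustrative examples (with the default num_digits=3):
#   format_digits(0.8123) -> ".812"  (leading zero dropped, 3 decimal places)
#   format_digits(1.0)    -> "1.00"  (2 decimal places for values outside (0, 1))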


# Load the leaderboard results and keep only the columns shown in the table
df = pd.read_csv("./best_judges_single_agg.csv")[
    ["Judge Model", "Realization", "Ranking Agreement", "Decisiveness", "Bias"]
]
df["Judge Model"] = df["Judge Model"].apply(prettify_judge_name)


# Style the table: a red-yellow-green gradient on Ranking Agreement,
# compact number formatting, and centered cells
styled_data = (
    df.style.background_gradient(
        subset=["Ranking Agreement"],
        cmap="RdYlGn",
        vmin=0.5,
        vmax=0.9,
    )
    .format(subset=["Ranking Agreement", "Decisiveness", "Bias"], formatter=format_digits)
    .set_properties(**{"text-align": "center"})
)


st.dataframe(styled_data, use_container_width=True, height=800, hide_index=True)


st.text("\n\n")
st.markdown(
    r"""
This leaderboard measures the **system-level performance and behavior of LLM judges**, and was created as part of the **[JuStRank paper](https://aclanthology.org/2025.acl-long.34/)** from ACL 2025.

Judges are sorted by their *Ranking Agreement* with humans, i.e., by how closely each judge's ranking of different systems (generative models) matches the human ranking of those systems on [LMSys Arena](https://lmarena.ai/leaderboard/text/hard-prompts-english).

We also compare judges in terms of the *Decisiveness* and *Bias* reflected in their judgment behaviors (refer to the paper for details).

In our research we tested 10 **LLM judges** and 8 **reward models**, asking them to score the [responses](https://huggingface.co/datasets/lmarena-ai/arena-hard-auto/tree/main/data/arena-hard-v0.1/model_answer) of 63 systems to the 500 questions from Arena Hard v0.1.
For each LLM judge we tried 4 different _realizations_, i.e., different prompt and scoring methods used with the LLM judge.

In total, the judge ranking is derived from **[1.5 million raw judgment scores](https://huggingface.co/datasets/ibm-research/justrank_judge_scores)** (48 judge realizations × 63 target systems × 500 instances).

If you find this useful, please cite our work 🤗

```bibtex
@inproceedings{gera2025justrank,
    title={JuStRank: Benchmarking LLM Judges for System Ranking},
    author={Gera, Ariel and Boni, Odellia and Perlitz, Yotam and Bar-Haim, Roy and Eden, Lilach and Yehudai, Asaf},
    booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    month={July},
    address={Vienna, Austria},
    year={2025},
    url={https://aclanthology.org/2025.acl-long.34/},
}
```
"""
)
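
# A minimal sketch of the "Ranking Agreement" idea described above -- NOT the
# paper's exact measure. It assumes Kendall's tau as the rank-correlation
# metric and uses hypothetical per-system score lists, aligned by system:
#
#     from scipy.stats import kendalltau
#
#     judge_system_scores = [7.2, 6.8, 5.1]     # hypothetical mean judge scores
#     human_system_scores = [1250, 1210, 1130]  # hypothetical Arena ratings
#     agreement, _ = kendalltau(judge_system_scores, human_system_scores)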