import pandas as pd
import streamlit as st


st.set_page_config(
    page_title="JuStRank",
    page_icon="🧑🏻‍⚖️",
    initial_sidebar_state="auto",
    menu_items=None,
)


st.title("🧑🏻‍⚖️ JuStRank: The Best Judges for Ranking Systems 🧑🏻‍⚖️")


url = "https://aclanthology.org/2025.acl-long.34/"
st.subheader(f"Check out our [ACL paper]({url}) for more details")


def prettify_judge_name(judge_name):
    """Capitalize a judge model name for display, fixing "Gpt" -> "GPT"."""
    pretty_judge = (judge_name[0].upper() + judge_name[1:]).replace("Gpt", "GPT")
    return pretty_judge
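# Illustrative examples (hypothetical inputs):
#   prettify_judge_name("gpt-4o")        -> "GPT-4o"
#   prettify_judge_name("mistral-large") -> "Mistral-large"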


def format_digits(flt, num_digits=3):
    """Format a float compactly, dropping the leading zero for values in (0, 1)."""
    format_str = "{:." + str(num_digits - 1) + "f}"
    format_str_zeroes = "{:." + str(num_digits) + "f}"
    return format_str_zeroes.format(flt)[1:] if (0 < flt < 1) else format_str.format(flt)
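# Illustrative examples (with the default num_digits=3):
#   format_digits(0.8123) -> ".812"  (leading zero dropped, 3 decimal places)
#   format_digits(1.0)    -> "1.00"  (2 decimal places for values outside (0, 1))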


# Load the leaderboard results and keep only the columns shown in the table
df = pd.read_csv("./best_judges_single_agg.csv")[
    ["Judge Model", "Realization", "Ranking Agreement", "Decisiveness", "Bias"]
]
df["Judge Model"] = df["Judge Model"].apply(prettify_judge_name)


# Style the table: a red-yellow-green gradient on Ranking Agreement,
# compact number formatting, and centered cells
styled_data = (
    df.style.background_gradient(
        subset=["Ranking Agreement"],
        cmap="RdYlGn",
        vmin=0.5,
        vmax=0.9,
    )
    .format(subset=["Ranking Agreement", "Decisiveness", "Bias"], formatter=format_digits)
    .set_properties(**{"text-align": "center"})
)


st.dataframe(styled_data, use_container_width=True, height=800, hide_index=True)


st.text("\n\n")
st.markdown(
    r"""
This leaderboard measures the **system-level performance and behavior of LLM judges**, and was created as part of the **[JuStRank paper](https://aclanthology.org/2025.acl-long.34/)** from ACL 2025.

Judges are sorted by their *Ranking Agreement* with humans, i.e., by how closely each judge's ranking of different systems (generative models) matches the human ranking of those systems on [LMSys Arena](https://lmarena.ai/leaderboard/text/hard-prompts-english).

We also compare judges in terms of the *Decisiveness* and *Bias* reflected in their judgment behaviors (refer to the paper for details).

In our research we tested 10 **LLM judges** and 8 **reward models**, asking them to score the [responses](https://huggingface.co/datasets/lmarena-ai/arena-hard-auto/tree/main/data/arena-hard-v0.1/model_answer) of 63 systems to the 500 questions from Arena Hard v0.1.
For each LLM judge we tried 4 different _realizations_, i.e., different prompt and scoring methods used with the LLM judge.

In total, the judge ranking is derived from **[1.5 million raw judgment scores](https://huggingface.co/datasets/ibm-research/justrank_judge_scores)** (48 judge realizations × 63 target systems × 500 instances).

If you find this useful, please cite our work 🤗

```bibtex
@inproceedings{gera2025justrank,
    title={JuStRank: Benchmarking LLM Judges for System Ranking},
    author={Gera, Ariel and Boni, Odellia and Perlitz, Yotam and Bar-Haim, Roy and Eden, Lilach and Yehudai, Asaf},
    booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    month={July},
    address={Vienna, Austria},
    year={2025},
    url={https://aclanthology.org/2025.acl-long.34/},
}
```
"""
)
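
# A minimal sketch of the "Ranking Agreement" idea described above -- NOT the
# paper's exact measure. It assumes Kendall's tau as the rank-correlation
# metric and uses hypothetical per-system score lists, aligned by system:
#
#     from scipy.stats import kendalltau
#
#     judge_system_scores = [7.2, 6.8, 5.1]     # hypothetical mean judge scores
#     human_system_scores = [1250, 1210, 1130]  # hypothetical Arena ratings
#     agreement, _ = kendalltau(judge_system_scores, human_system_scores)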