Add daily ChemGraph eval pipeline and redesign leaderboard

#4
by tdphamm - opened
Autonomous Scientific Agents org

Summary

  • Add daily evaluation pipeline (scripts/daily_eval.sh + scripts/chemgraph_to_leaderboard.py) that runs ChemGraph benchmarks, transforms results into leaderboard format, and pushes to HF Hub datasets
  • Redesign leaderboard tasks: Replace 15 old experiment-based tasks with 12 category-based tasks (SMILES Lookup, Optimization, Vibrations, Thermochemistry, Dipole, Energy, Reaction Energy)
  • Add Trends tab with Plotly chart showing model performance over time and 1-day/3-day/7-day rolling average summary table
  • Add local development mode (--local flag) to skip HF Hub downloads and scheduler for local testing
  • Fix Python 3.13+ compatibility by replacing make_dataclass with a plain class for AutoEvalColumn
  • Handle API-only models (OpenAI, Anthropic, etc.) gracefully by skipping HF Hub model checks
  • Update citation to published Communications Chemistry paper (2026)
  • Add plotly to requirements.txt; add dataset/model_map.json for model name mapping
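The `make_dataclass` fix above can be sketched as follows. The exact failure and field names are not shown in this diff, so the `ColumnContent` fields and column list here are illustrative assumptions; the pattern is simply to define `AutoEvalColumn` as a plain class with class-level attributes instead of building it dynamically with `dataclasses.make_dataclass`, which newer Python versions reject for some dynamically supplied defaults:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnContent:
    """One leaderboard column definition (field names are illustrative)."""
    name: str
    type: str
    displayed_by_default: bool = True
    hidden: bool = False


# A plain class instead of dataclasses.make_dataclass: attributes are bound
# once at class-definition time, so no dynamic dataclass machinery (and none
# of its stricter default-value validation on Python 3.13+) is involved.
class AutoEvalColumn:
    model = ColumnContent("Model", "markdown")
    average = ColumnContent("Average", "number")
    eval_date = ColumnContent("Eval Date", "str", displayed_by_default=False)
```

Callers can keep referencing columns as `AutoEvalColumn.model.name`, so the rest of the codebase does not need to change.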
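The 1-day/3-day/7-day rolling averages in the Trends tab can be computed with pandas time-based windows. This is a minimal sketch under assumed column names (`eval_date`, `model`, `average`); the real data comes from the results dataset on the HF Hub:

```python
import pandas as pd

# Hypothetical per-run scores; in the app these rows come from the
# leaderboard results dataset, one row per model per eval date.
runs = pd.DataFrame(
    {
        "eval_date": pd.to_datetime(
            ["2026-01-01", "2026-01-02", "2026-01-03", "2026-01-04"]
        ),
        "model": ["gpt-4o"] * 4,
        "average": [0.80, 0.82, 0.78, 0.84],
    }
)


def rolling_summary(df: pd.DataFrame, days: int) -> pd.Series:
    """Per-model rolling mean of `average` over the trailing `days` days."""
    return (
        df.sort_values("eval_date")
        .set_index("eval_date")
        .groupby("model")["average"]
        # A time-based window ("3D", "7D", ...) tolerates missing days,
        # unlike a fixed row count.
        .apply(lambda s: s.rolling(f"{days}D").mean())
    )
```

Computing the same frame for windows of 1, 3, and 7 days yields the columns of the summary table.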

Files changed (10)

app.py: Trends tab, local mode, graceful error handling
dataset/model_map.json: New model name mapping
requirements.txt: Add plotly
scripts/chemgraph_to_leaderboard.py: New ETL script
scripts/daily_eval.sh: New daily eval orchestration
src/about.py: 12 new task categories, updated docs/citation
src/display/utils.py: Plain-class AutoEvalColumn, trend columns
src/leaderboard/aggregate.py: New trend aggregation module
src/leaderboard/read_evals.py: Category-based scoring, eval_date support
src/populate.py: Trend data loading, multi-date result handling

Closing - created accidentally while setting up the PR.

tdphamm changed discussion status to closed
