Add daily ChemGraph eval pipeline and redesign leaderboard tasks

#1 · opened by tdphamm (Autonomous Scientific Agents org)

Summary

  • Redesign leaderboard tasks: Replace 15 paper experiments (exp1-exp15) with 8 task categories derived from ChemGraph's 14 ground-truth evaluation queries
  • Add automated eval pipeline: New scripts/chemgraph_to_leaderboard.py transforms ChemGraph benchmark results into leaderboard-compatible JSON files + pushes to HF Hub
  • Add cron wrapper: scripts/daily_eval.sh runs daily evals and updates the leaderboard automatically
  • Fix API model support: API-only models (OpenAI, Anthropic) no longer fail the HF Hub existence check
  • Add --local flag: python app.py --local skips HF Hub downloads for local testing
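As a rough sketch of the `--local` flag described above (the flag name comes from this PR; the parser structure and help text here are assumptions, not the actual `app.py` code):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical CLI sketch: only --local is taken from the PR description.
    parser = argparse.ArgumentParser(description="ChemGraph leaderboard app")
    parser.add_argument(
        "--local",
        action="store_true",
        help="Skip HF Hub downloads and read eval results from local disk",
    )
    return parser


args = build_parser().parse_args(["--local"])
print(args.local)  # True when --local is passed
```

With `action="store_true"` the flag defaults to `False`, so the normal (Hub-backed) path stays the default and local testing is strictly opt-in.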

Task Categories (8 groups from 14 queries)

| Task | Queries | Description |
|---|---|---|
| SMILES Lookup | q1-q2 | Name to SMILES string |
| Coordinate Gen | q3-q4 | SMILES to 3D coordinates |
| Geometry Opt | q5 | Geometry optimization |
| Vib Frequency | q6 | Vibrational frequency analysis |
| Thermochem | q7 | Thermochemical properties |
| Dipole | q8 | Dipole moment calculation |
| Energy | q9-q11 | Single-point energy + JSON extraction |
| Reaction Gibbs | q12-q14 | Reaction Gibbs free energy |
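The grouping above can be expressed as a simple lookup plus a per-category average. This is an illustrative sketch only: the dict name, task keys, and the assumption that per-query results are 0/1 pass scores are all invented, not taken from the repo.

```python
# Hypothetical mapping from ChemGraph query IDs to the 8 task categories,
# mirroring the table above. Key/value names are invented for illustration.
QUERY_TO_TASK = {
    "q1": "smiles_lookup", "q2": "smiles_lookup",
    "q3": "coordinate_gen", "q4": "coordinate_gen",
    "q5": "geometry_opt",
    "q6": "vib_frequency",
    "q7": "thermochem",
    "q8": "dipole",
    "q9": "energy", "q10": "energy", "q11": "energy",
    "q12": "reaction_gibbs", "q13": "reaction_gibbs", "q14": "reaction_gibbs",
}


def task_scores(per_query_pass: dict) -> dict:
    """Average per-query pass/fail (0 or 1) into one score per task category."""
    totals: dict = {}
    counts: dict = {}
    for query, passed in per_query_pass.items():
        task = QUERY_TO_TASK[query]
        totals[task] = totals.get(task, 0) + passed
        counts[task] = counts.get(task, 0) + 1
    return {task: totals[task] / counts[task] for task in totals}
```

For example, `task_scores({"q1": 1, "q2": 0, "q5": 1})` yields `0.5` for `smiles_lookup` and `1.0` for `geometry_opt`, with untested categories simply absent.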

Files Changed

| File | Change |
|---|---|
| `scripts/chemgraph_to_leaderboard.py` | New - core transform script |
| `scripts/daily_eval.sh` | New - cron wrapper for daily eval |
| `dataset/model_map.json` | Updated model name mappings |
| `src/about.py` | Redesigned task definitions + updated UI text |
| `src/leaderboard/read_evals.py` | Fixed hub check, request matching, status handling |
| `src/populate.py` | Graceful empty-DataFrame handling |
| `app.py` | Added `--local` flag, empty-DF guard |
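The core job of `scripts/chemgraph_to_leaderboard.py` - emitting leaderboard-compatible JSON per model - might look roughly like the sketch below. The function name and JSON schema (`config`/`results` keys) follow the common HF leaderboard layout but are assumptions about this repo, not its actual code.

```python
import json
from pathlib import Path


def write_leaderboard_result(model_name: str, scores: dict, out_dir: Path) -> Path:
    """Write one leaderboard-compatible result file for a model.

    `scores` maps task-category name -> numeric score. The schema and the
    results_<model>.json filename convention are illustrative assumptions.
    """
    payload = {
        "config": {"model_name": model_name},
        "results": {task: {"score": score} for task, score in scores.items()},
    }
    out_dir.mkdir(parents=True, exist_ok=True)
    # Slashes in org/model names cannot appear in filenames, so flatten them.
    path = out_dir / f"results_{model_name.replace('/', '_')}.json"
    path.write_text(json.dumps(payload, indent=2))
    return path
```

Writing one self-describing JSON file per model keeps the transform idempotent: a daily rerun simply overwrites each model's file before the push to the Hub.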

Testing

Tested locally with `python app.py --local` against ChemGraph eval results: all 5 models load correctly with all 8 task categories populated, and the Gradio UI serves HTTP 200.

Next Steps (after merge)

  1. Push new results/request data to HF Hub datasets with --push-to-hub
  2. Set up crontab for daily evaluation runs
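For step 2, a crontab entry along these lines would run the wrapper once a day (the schedule, repo path, and log path are placeholders for the actual deployment, not values from this PR):

```shell
# Hypothetical crontab line: run the daily eval wrapper at 02:00 and
# append stdout/stderr to a log file.
0 2 * * * /path/to/repo/scripts/daily_eval.sh >> /var/log/chemgraph_eval.log 2>&1
```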
tdphamm changed discussion status to closed
