Add daily ChemGraph eval pipeline and redesign leaderboard
#4
by tdphamm - opened
Summary
- Add a daily evaluation pipeline (scripts/daily_eval.sh + scripts/chemgraph_to_leaderboard.py) that runs the ChemGraph benchmarks, transforms the results into the leaderboard format, and pushes them to HF Hub datasets
- Redesign leaderboard tasks: Replace 15 old experiment-based tasks with 12 category-based tasks (SMILES Lookup, Optimization, Vibrations, Thermochemistry, Dipole, Energy, Reaction Energy)
- Add a Trends tab with a Plotly chart of model performance over time and a summary table of 1-day/3-day/7-day rolling averages
- Add a local development mode (--local flag) that skips HF Hub downloads and the scheduler for local testing
- Fix Python 3.13+ compatibility by replacing make_dataclass with a plain class for AutoEvalColumn
- Handle API-only models (OpenAI, Anthropic, etc.) gracefully by skipping HF Hub model checks
- Update citation to published Communications Chemistry paper (2026)
- Add plotly to requirements.txt, and add dataset/model_map.json for model name mapping
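
For reviewers, the 1-day/3-day/7-day rolling averages in the Trends tab could be computed roughly as follows. This is only a minimal sketch, not the code in src/leaderboard/aggregate.py: the function names `rolling_average` and `trend_summary`, the date-to-score dict layout, and the choice to average only the days that actually have a result are all assumptions made for illustration.

```python
from datetime import date, timedelta
from statistics import mean


def rolling_average(scores_by_date, end_date, window_days):
    """Mean score over the `window_days` days ending at `end_date` (inclusive).

    Assumption: days without a result are skipped rather than counted as zero.
    Returns None when the window contains no results at all.
    """
    start = end_date - timedelta(days=window_days - 1)
    window = [s for d, s in scores_by_date.items() if start <= d <= end_date]
    return mean(window) if window else None


def trend_summary(scores_by_date, end_date, windows=(1, 3, 7)):
    """Build one summary row, e.g. {"1-day": ..., "3-day": ..., "7-day": ...}."""
    return {
        f"{w}-day": rolling_average(scores_by_date, end_date, w) for w in windows
    }


# Hypothetical daily accuracy values for one model.
example_scores = {
    date(2026, 1, 1): 0.5,
    date(2026, 1, 2): 0.75,
    date(2026, 1, 3): 1.0,
}
summary = trend_summary(example_scores, end_date=date(2026, 1, 3))
```

With only three days of data, the 3-day and 7-day windows in this sketch cover the same results, so a real summary table may want to flag windows that are not yet full.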
Files changed (10)
| File | Changes |
|---|---|
| app.py | Trends tab, local mode, graceful error handling |
| dataset/model_map.json | New model name mapping |
| requirements.txt | Add plotly |
| scripts/chemgraph_to_leaderboard.py | New ETL script |
| scripts/daily_eval.sh | New daily eval orchestration |
| src/about.py | 12 new task categories, updated docs/citation |
| src/display/utils.py | Plain class AutoEvalColumn, trend columns |
| src/leaderboard/aggregate.py | New trend aggregation module |
| src/leaderboard/read_evals.py | Category-based scoring, eval_date support |
| src/populate.py | Trend data loading, multi-date result handling |
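
The `AutoEvalColumn` change listed for src/display/utils.py swaps a class generated by `dataclasses.make_dataclass` for a plain class. The description does not show the replacement, so the following is only a sketch of one way such a class could look: the `ColumnContent` fields, the `COLUMN_SPECS` entries, and the `fields` helper are hypothetical and are not taken from the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnContent:
    """Display metadata for one leaderboard column (hypothetical fields)."""
    name: str
    type: str
    displayed_by_default: bool = True
    hidden: bool = False


# Hypothetical column specs: (attribute name, column metadata).
COLUMN_SPECS = [
    ("model", ColumnContent("Model", "markdown")),
    ("average", ColumnContent("Average", "number")),
    ("smiles_lookup", ColumnContent("SMILES Lookup", "number")),
    ("trend_7d", ColumnContent("7-day avg", "number", displayed_by_default=False)),
]


class AutoEvalColumn:
    """Plain namespace class with one class attribute per column."""


# Attach the columns as ordinary class attributes instead of dataclass fields.
for attr, column in COLUMN_SPECS:
    setattr(AutoEvalColumn, attr, column)


def fields(cls):
    """Return the ColumnContent attributes of `cls` in definition order."""
    return [v for v in vars(cls).values() if isinstance(v, ColumnContent)]


visible = [c.name for c in fields(AutoEvalColumn) if c.displayed_by_default]
```

Because the columns are ordinary class attributes, code that reads `AutoEvalColumn.model.name` keeps working without any dataclass field defaults being involved.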
Closing - created accidentally while setting up the PR.
tdphamm changed discussion status to closed