Add daily ChemGraph eval pipeline and redesign leaderboard tasks
#1
by tdphamm - opened
## Summary
- Redesign leaderboard tasks: replace the 15 paper experiments (exp1-exp15) with 8 task categories derived from ChemGraph's 14 ground-truth evaluation queries
- Add automated eval pipeline: new `scripts/chemgraph_to_leaderboard.py` transforms ChemGraph benchmark results into leaderboard-compatible JSON files and pushes them to the HF Hub
- Add cron wrapper: `scripts/daily_eval.sh` runs daily evals and updates the leaderboard automatically
- Fix API model support: API-only models (OpenAI, Anthropic) no longer fail the HF Hub existence check
- Add `--local` flag: `python app.py --local` skips HF Hub downloads for local testing
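The transform step can be sketched roughly as below. This is a hypothetical outline, not the actual `chemgraph_to_leaderboard.py` code: the record field names and the per-query pass/fail input shape are illustrative assumptions.

```python
import json

# Hypothetical shape: ChemGraph results as {query_id: passed?} per model.
# Field names in the output record are illustrative, not the real schema.
def to_leaderboard_record(model_name: str, query_results: dict) -> dict:
    """Aggregate per-query pass/fail flags into one leaderboard JSON entry."""
    total = len(query_results)
    passed = sum(1 for ok in query_results.values() if ok)
    return {
        "model": model_name,
        "score": passed / total if total else 0.0,
        "results": query_results,
    }

record = to_leaderboard_record("example-model", {"q1": True, "q2": False})
print(json.dumps(record, indent=2))
```

In this sketch the leaderboard score is simply the fraction of queries passed; the real script may weight categories differently.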
## Task Categories (8 groups from 14 queries)
| Task | Queries | Description |
|---|---|---|
| SMILES Lookup | q1-q2 | Name to SMILES string |
| Coordinate Gen | q3-q4 | SMILES to 3D coordinates |
| Geometry Opt | q5 | Geometry optimization |
| Vib Frequency | q6 | Vibrational frequency analysis |
| Thermochem | q7 | Thermochemical properties |
| Dipole | q8 | Dipole moment calculation |
| Energy | q9-q11 | Single-point energy + JSON extraction |
| Reaction Gibbs | q12-q14 | Reaction Gibbs free energy |
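The grouping in the table above amounts to a query-to-category map. A minimal sketch (the actual script may structure this differently):

```python
# 8 task categories covering ChemGraph's 14 ground-truth queries.
QUERY_CATEGORIES = {
    "SMILES Lookup": ["q1", "q2"],
    "Coordinate Gen": ["q3", "q4"],
    "Geometry Opt": ["q5"],
    "Vib Frequency": ["q6"],
    "Thermochem": ["q7"],
    "Dipole": ["q8"],
    "Energy": ["q9", "q10", "q11"],
    "Reaction Gibbs": ["q12", "q13", "q14"],
}

# Inverted lookup: which category does a given query belong to?
CATEGORY_BY_QUERY = {q: cat for cat, qs in QUERY_CATEGORIES.items() for q in qs}
assert len(CATEGORY_BY_QUERY) == 14  # all 14 queries covered exactly once
```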
## Files Changed
| File | Change |
|---|---|
| scripts/chemgraph_to_leaderboard.py | New - Core transform script |
| scripts/daily_eval.sh | New - Cron wrapper for daily eval |
| dataset/model_map.json | Updated model name mappings |
| src/about.py | Redesigned task definitions + updated UI text |
| src/leaderboard/read_evals.py | Fixed hub check, request matching, status handling |
| src/populate.py | Graceful empty DataFrame handling |
| app.py | Added --local flag, empty DF guard |
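The `--local` flag in `app.py` could be wired up roughly as follows. This is a sketch of the pattern, not the exact code from the PR:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--local",
    action="store_true",
    help="Skip HF Hub downloads and read eval results from local disk",
)
# Parsing an explicit argv here for illustration; app.py would parse sys.argv.
args = parser.parse_args(["--local"])

if args.local:
    print("Local mode: skipping HF Hub downloads")
```

With `action="store_true"` the flag defaults to `False`, so normal runs still hit the Hub and only `python app.py --local` takes the offline path.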
## Testing
Tested locally with `python app.py --local` against ChemGraph eval results. 5 models load correctly with all 8 task categories populated. Gradio UI serves HTTP 200.
## Next Steps (after merge)
- Push new results/request data to HF Hub datasets with `--push-to-hub`
- Set up crontab for daily evaluation runs
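A daily crontab entry for the wrapper might look like this (the repo path, schedule, and log location are examples, not part of the PR):

```shell
# Run the daily eval at 02:00; append stdout/stderr to a log for debugging.
0 2 * * * /path/to/repo/scripts/daily_eval.sh >> /var/log/chemgraph_eval.log 2>&1
```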
tdphamm changed discussion status to closed