deceit1 / evaluate.py

Commit History

improve: abstention penalty, better prompt, mixed curriculum, more steps
253d1ff

Jayant-Kernel committed on

evaluate: switch to 0.5B model comparison, 200 episodes
6b64fd2

Jayant-Kernel committed on

fix: set N_EPISODES=200 constant (was still 30)
e662a77

Jayant-Kernel committed on

update: increase evaluation to 200 episodes per model
a5be204

Jayant-Kernel committed on

fix: parse_action confidence bug, numeric answers bug, missing reasoning field bug
66bdd16

Jayant-Kernel committed on

fix: debug model output parsing in evaluation
3d9195a

Jayant-Kernel committed on

add: evaluate 1.5B base vs trained, upload chart to HF Hub
77e0352

Jayant-Kernel committed on

add: evaluate 1.5B base vs trained, upload charts
68e5af2

Jayant-Kernel committed on

update: compare 0.5B vs 1.5B trained models
0bdaeb6

Jayant-Kernel committed on

update: evaluate retrained model, upload charts to HF Hub
b84ec51

Jayant-Kernel committed on

update: evaluate on 200 episodes each for more reliable results
a178a66

Jayant-Kernel committed on

fix: remove erroneous del model outside evaluate_model scope
c6e06ad

Jayant-Kernel committed on

fix: free GPU memory between model evaluations
4c67564

Jayant-Kernel committed on

add: evaluation script - base vs trained model comparison
8fb443c

Jayant-Kernel committed on