Ajsaxena/deceit1 / evaluate.py

Commit History

improve: abstention penalty, better prompt, mixed curriculum, more steps

253d1ff

Jayant-Kernel committed 12 days ago

evaluate: switch to 0.5B model comparison, 200 episodes

6b64fd2

Jayant-Kernel committed 13 days ago

fix: set N_EPISODES=200 constant (was still 30)

e662a77

Jayant-Kernel committed 13 days ago

update: increase evaluation to 200 episodes per model

a5be204

Jayant-Kernel committed 13 days ago

fix: parse_action confidence bug, numeric answers bug, missing reasoning field bug

66bdd16

Jayant-Kernel committed 13 days ago

fix: debug model output parsing in evaluation

3d9195a

Jayant-Kernel committed 13 days ago

add: evaluate 1.5B base vs trained, upload chart to HF Hub

77e0352

Jayant-Kernel committed 13 days ago

add: evaluate 1.5B base vs trained, upload charts

68e5af2

Jayant-Kernel committed 13 days ago

update: compare 0.5B vs 1.5B trained models

0bdaeb6

Jayant-Kernel committed 13 days ago

update: evaluate retrained model, upload charts to HF Hub

b84ec51
unverified

Jayant-Kernel committed 13 days ago

update: evaluate on 200 episodes each for more reliable results

a178a66
unverified

Jayant-Kernel committed 13 days ago

fix: remove erroneous del model outside evaluate_model scope

c6e06ad
unverified

Jayant-Kernel committed 13 days ago

fix: free GPU memory between model evaluations

4c67564
unverified

Jayant-Kernel committed 13 days ago

add: evaluation script - base vs trained model comparison

8fb443c
unverified

Jayant-Kernel committed 13 days ago
