
Zishan Shao



	# Quick Reasoning Rebuttal Check

	This is a quick-turnaround check on held-out, reasoning-heavy tasks.

	- Model: `meta-llama/Llama-2-7b-chat-hf` dtype=fp16 device=cuda
	- Tasks used for basis/eval: `gsm8k,commonsenseqa,strategyqa,arc_challenge,openbookqa,qasc,logiqa`
	- Held-out tasks run: `gsm8k,logiqa`
	- Per-task settings: n_eval=32, n_subspace=64, layer=10
	- Protocol: leave-one-task-out (LOTO) held-out, forced_choice=True, do_sample=False
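The fold structure implied by the settings above can be sketched as follows. This is a minimal illustration, not the project's code: it assumes LOTO means that each held-out task is removed from the task list and the remaining tasks supply the basis. The function name `loto_folds` is hypothetical; only the task names come from this summary.

```python
# Task names are copied from the summary; the helper below is a hypothetical sketch.
ALL_TASKS = "gsm8k,commonsenseqa,strategyqa,arc_challenge,openbookqa,qasc,logiqa".split(",")
HELD_OUT_RUN = "gsm8k,logiqa".split(",")


def loto_folds(all_tasks, held_out_run):
    """Yield (held_out, basis_tasks) pairs for a leave-one-task-out protocol.

    Assumption: the basis for a fold is built from every task except the
    held-out one, which is then used only for evaluation.
    """
    for held_out in held_out_run:
        if held_out not in all_tasks:
            raise ValueError(f"unknown task: {held_out}")
        basis = [task for task in all_tasks if task != held_out]
        yield held_out, basis


loto = dict(loto_folds(ALL_TASKS, HELD_OUT_RUN))
for held_out, basis in loto.items():
    print(f"{held_out}: basis from {len(basis)} tasks -> {','.join(basis)}")
```

Under this assumption each of the two folds run here draws its basis from the six remaining tasks.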

	## Per-task results

	| Held-out | Type | n | Baseline (%) | Decode-shared (%) | Prefill-shared (%) | Random (%) | D-P delta | p |
	|---|---|---|---|---|---|---|---|---|
	| gsm8k | Open-ended numeric reasoning | 32 | 0.0 | 0.0 | 0.0 | 0.0 | +0.0 [+0.0, +0.0] | 1 |
	| logiqa | Logical reasoning multiple choice | 32 | 31.2 (chance 25.0) | 15.6 | 34.4 | 34.4 | -18.8 [-31.2, -6.2] | 0.036 |
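To make the granularity of these numbers concrete, the rounded percentages in the `logiqa` row can be converted back to whole-item counts. This is a small arithmetic sketch using only values from the table. The helper `implied_correct` is a hypothetical name, and the recovered counts are inferred from the rounded percentages rather than read from the raw results.

```python
N_EVAL = 32  # n per held-out task, from the table

# Reported accuracies (percent) for the logiqa fold, copied from the table.
reported = {
    "baseline": 31.2,
    "decode_shared": 15.6,
    "prefill_shared": 34.4,
    "random": 34.4,
}


def implied_correct(acc_pct, n):
    """Nearest whole number of correct items consistent with a rounded percentage."""
    return round(acc_pct / 100 * n)


counts = {name: implied_correct(acc, N_EVAL) for name, acc in reported.items()}
exact_pct = {name: 100 * k / N_EVAL for name, k in counts.items()}

# Decode-minus-prefill delta recomputed from the implied counts.
dp_delta = exact_pct["decode_shared"] - exact_pct["prefill_shared"]
item_pct = 100 / N_EVAL  # percentage points contributed by a single item

print(counts)
print(f"D-P delta = {dp_delta:.2f} points; one item = {item_pct:.3f} points")
```

With n=32, one item is worth 3.125 points, so the D-P delta on `logiqa` corresponds to a net difference of six items.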

	## Aggregate

	- Mean accuracy: baseline=15.6, decode_shared=7.8, prefill_shared=17.2, random=17.2
	- Mean deltas vs baseline: decode=-7.8, prefill=+1.6, random=+1.6
	- Mean decode-minus-prefill delta: -9.4
	- Informative held-out tasks: `logiqa`
	- Inconclusive due to baseline floor/chance: `gsm8k`
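The aggregate figures above can be reproduced from the per-task rows. The sketch below recomputes them from the reported (rounded) percentages. The dictionary layout and variable names are illustrative, not taken from the project's code.

```python
# Per-task accuracies (percent), copied from the per-task results table.
per_task = {
    "gsm8k": {"baseline": 0.0, "decode_shared": 0.0, "prefill_shared": 0.0, "random": 0.0},
    "logiqa": {"baseline": 31.2, "decode_shared": 15.6, "prefill_shared": 34.4, "random": 34.4},
}
conditions = ["baseline", "decode_shared", "prefill_shared", "random"]

# Unweighted mean over held-out tasks for each condition.
mean_acc = {
    cond: sum(row[cond] for row in per_task.values()) / len(per_task)
    for cond in conditions
}

# Differences of the means, relative to baseline and between decode and prefill.
delta_vs_baseline = {
    cond: mean_acc[cond] - mean_acc["baseline"]
    for cond in conditions
    if cond != "baseline"
}
decode_minus_prefill = mean_acc["decode_shared"] - mean_acc["prefill_shared"]

for cond in conditions:
    print(f"mean {cond} = {mean_acc[cond]:.1f}")
for cond, delta in delta_vs_baseline.items():
    print(f"{cond} vs baseline = {delta:+.1f}")
print(f"decode minus prefill = {decode_minus_prefill:+.1f}")
```

Because `gsm8k` contributes 0.0 to every condition, each aggregate is half the corresponding `logiqa` value, so the aggregate adds no information beyond the `logiqa` row.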

	## Interpretation

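	- `gsm8k` is currently inconclusive: accuracy is 0.0 in every condition, baseline included, so this fold cannot say anything about decode-vs-prefill selectivity.
	- `logiqa` is informative: decode-shared lowers accuracy by 15.6 points vs baseline (31.2 to 15.6) and by 18.8 points vs prefill-shared (34.4 to 15.6; interval [-31.2, -6.2], p = 0.036), while prefill-shared and random both sit at 34.4.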
	- Use informative folds (in this run, only `logiqa`) as rebuttal evidence that the decode-shared phenomenon is not confined to short classification tasks.
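The informative/inconclusive split can be expressed as a simple rule. This summary does not state the rule actually used, so the sketch below is an assumption: a fold counts as informative only when baseline accuracy exceeds its floor, taken as chance when chance is defined and 0.0 otherwise. The function `fold_status` and the `margin_pct` parameter are hypothetical.

```python
from typing import Optional


def fold_status(baseline_pct: float, chance_pct: Optional[float], margin_pct: float = 0.0) -> str:
    """Label a held-out fold by whether its baseline clears the floor.

    Assumed rule (not confirmed by the summary): the floor is chance accuracy
    when one is defined, else 0.0; the baseline must exceed floor + margin.
    """
    floor = chance_pct if chance_pct is not None else 0.0
    return "informative" if baseline_pct > floor + margin_pct else "inconclusive"


# (baseline %, chance %) per fold, from the per-task table; gsm8k lists no chance level.
baselines = {"gsm8k": (0.0, None), "logiqa": (31.2, 25.0)}
status = {task: fold_status(base, chance) for task, (base, chance) in baselines.items()}
print(status)
```

With the default margin of zero this reproduces the labels in the Aggregate section. A stricter margin would change the outcome for `logiqa`, whose baseline is 6.2 points above chance.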