---
license: mit
library_name: pytorch
tags:
  - test-time-training
  - conformal-prediction
  - reasoning
  - early-stopping
  - llm
datasets:
  - wzekai99/ORCA
---

# ORCA TTT-Probes

Trained Test-Time Training probes for *Online Reasoning Calibration: Test-Time Training Enables Generalizable Conformal LLM Reasoning* ([arXiv:2604.01170](https://arxiv.org/abs/2604.01170)).

## Layout (17 probes)

```
qwen2.5-32b/supervised/{no_kq, qk_dh128,
                        qk_dh32, qk_dh64, qk_dh256, qk_dh512,
                        qk_dh128_ln, qk_dh128_ln_res, qk_dh128_share_kq,
                        qk_dh128_eta_learn, qk_dh128_mlp}/
qwen2.5-32b/consistent/{no_kq, qk_dh128}/
qwq-32b/supervised/{no_kq, qk_dh128}/
llama-3.3-70b/supervised/{no_kq, qk_dh128}/
```
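To iterate over every probe, the layout above can be expanded into explicit directory paths. The following is a minimal sketch, not part of the ORCA codebase: the `LAYOUT` dictionary and the `probe_dirs` helper are hypothetical names, and the default root `probes` assumes the probes were downloaded into a local directory of that name.

```python
from pathlib import PurePosixPath

# Variant names copied from the layout listing; the dictionary itself is
# an illustrative construct, not something shipped with the probes.
LAYOUT = {
    ("qwen2.5-32b", "supervised"): [
        "no_kq", "qk_dh128",
        "qk_dh32", "qk_dh64", "qk_dh256", "qk_dh512",
        "qk_dh128_ln", "qk_dh128_ln_res", "qk_dh128_share_kq",
        "qk_dh128_eta_learn", "qk_dh128_mlp",
    ],
    ("qwen2.5-32b", "consistent"): ["no_kq", "qk_dh128"],
    ("qwq-32b", "supervised"): ["no_kq", "qk_dh128"],
    ("llama-3.3-70b", "supervised"): ["no_kq", "qk_dh128"],
}


def probe_dirs(root="probes"):
    """Return one path per probe directory, in the order listed above."""
    return [
        PurePosixPath(root, model, label_mode, variant)
        for (model, label_mode), variants in LAYOUT.items()
        for variant in variants
    ]


if __name__ == "__main__":
    for path in probe_dirs():
        print(path)
```

Counting the returned paths is a quick way to check a local download against the 17 probes listed in the heading.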

Per probe directory:

| File           | Contents                                                                         |
|----------------|----------------------------------------------------------------------------------|
| `probe.pt`     | State dict: `W0`, `b0`, `log_eta`; QK variants also include `theta_K`, `theta_Q` |
| `config.json`  | Training hyperparameters (`d_hidden`, `base_lr`, `epochs`, ...)                  |
| `lambdas.json` | LTT thresholds, keyed by delta                                                   |
| `metrics.json` | Step-level savings and error rate per delta                                      |
| `ood_*.json`   | Per-OOD-benchmark metrics (Qwen2.5-32B probes only)                              |
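
The files in a probe directory can also be inspected without the official loader. The sketch below is an assumption-laden illustration: `load_probe_metadata` and `load_probe_weights` are hypothetical helpers, not functions from the ORCA repository, and the code makes no assumption about the internal schema of the JSON files beyond their being valid JSON. Loading `probe.pt` requires PyTorch, so that import is deferred until the weights are actually requested.

```python
import json
from pathlib import Path


def load_probe_metadata(probe_dir):
    """Read every *.json file in a probe directory.

    Returns a dict keyed by file stem, e.g. "config", "lambdas",
    "metrics", plus any "ood_*" files that happen to be present.
    """
    probe_dir = Path(probe_dir)
    return {
        path.stem: json.loads(path.read_text())
        for path in sorted(probe_dir.glob("*.json"))
    }


def load_probe_weights(probe_dir):
    """Load the raw state dict from probe.pt onto the CPU.

    Hypothetical helper: it only returns the saved tensors and does not
    rebuild the probe module the way the TTTProbe class would.
    """
    import torch  # deferred so the metadata helper works without PyTorch

    return torch.load(Path(probe_dir) / "probe.pt", map_location="cpu")
```

For example, `load_probe_metadata("probes/qwen2.5-32b/supervised/no_kq")["lambdas"]` would return the parsed contents of `lambdas.json`, assuming the probes were downloaded to `probes/`.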

## Use

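Probes are loaded by the `TTTProbe` class in https://github.com/wzekai99/ORCA. The example below downloads the probes and the dataset, then runs `code/test.py` on the supervised `no_kq` probe for Qwen2.5-32B: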

```bash
hf download wzekai99/ORCA --local-dir probes
hf download wzekai99/ORCA --repo-type dataset --local-dir data
python code/test.py \
    --method ttt --no_kq \
    --dataset_path data/qwen2.5-32b/s1k.pkl \
                   data/qwen2.5-32b/openr1_2k.pkl \
                   data/qwen2.5-32b/deepmath_2k.pkl \
    --probe_path probes/qwen2.5-32b/supervised/no_kq/probe.pt \
    --label_mode supervised --delta 0.1 --epsilon 0.05
```

## License

MIT.

## Citation

```bibtex
@article{zhou2026online,
  title={Online Reasoning Calibration: Test-Time Training Enables Generalizable Conformal LLM Reasoning},
  author={Zhou, Cai and Wang, Zekai and Wu, Menghua and Zhu, Qianyu Julie and Shi, Flora C and Wang, Chenyu and Wilson, Ashia and Jaakkola, Tommi and Bates, Stephen},
  journal={arXiv preprint arXiv:2604.01170},
  year={2026}
}
```