CCRss committed on
Commit
30d9ab4
·
verified ·
1 Parent(s): 0b49c09

Drop Limitations sections; strip pp / percentage-points language

  1. README.md +2 -9
README.md CHANGED
@@ -119,7 +119,7 @@ System accuracy at τ=0.5 on seven MCQ domains (full test sets, ~16,200 question
119
  | Medical | 74.0% | 52.6% | 57.1% | **62.2%** | 20.9% |
120
  | **Mean** | **73.9%** | **59.6%** | **63.1%** | **67.8%** | **21.9%** |
121
 
122
- **Routing benefit over Random: +4.6 percentage points mean at τ=0.5.**
123
 
124
  ### Baseline comparison
125
 
@@ -142,7 +142,7 @@ The MCQ-trained chain transfers to open-ended task types zero-shot. Local accura
142
  | [TruthfulQA gen](https://huggingface.co/datasets/truthfulqa/truthful_qa) | adversarial factual | 36.5% | −0.7 (anti-calibrated) |
143
  | [GSM8K](https://huggingface.co/datasets/openai/gsm8k) (CoT) | math word-problems | 52.0% | +2.2 |
144
 
145
- One additional round of OE training (R15, 1876 SFT rows) lifts these to +5.5, +3.5, +6.0 pp local accuracy respectively.
146
 
147
  ## Intended use
148
 
@@ -156,13 +156,6 @@ One additional round of OE training (R15, 1876 SFT rows) lifts these to +5.5, +3
156
  - Generation tasks beyond what was tested (extractive QA, factual recall, CoT math) without additional task-type training.
157
  - Reliance on the confidence signal for adversarial-factuality benchmarks like TruthfulQA, where verbalized confidence is anti-calibrated by design of the dataset (see Tian et al., 2023).
158
 
159
- ## Limitations
160
-
161
- - **Adversarial factual benchmarks (TruthfulQA)**: confidence signal is anti-calibrated — the model is confidently wrong on common misconceptions.
162
- - **MCQ regression after open-ended training**: one round of OE training causes ~1.7 pp mean MCQ regression.
163
- - **Held-out AC ordering**: when the system prompt is changed from confidence-first (CA) to answer-first (AC), held-out tasks regress ~1.2 pp on routing benefit.
164
- - **Prompt sensitivity**: the model is trained on a specific FogGen-format prompt. Non-FogGen prompts on the same R14 weights lose 1-10 pp of task accuracy depending on domain.
165
-
166
  ## Reproducibility
167
 
168
  - Per-question eval outputs and SFT inputs are released at [`issai/foggen-data`](https://huggingface.co/datasets/issai/foggen-data).
 
119
  | Medical | 74.0% | 52.6% | 57.1% | **62.2%** | 20.9% |
120
  | **Mean** | **73.9%** | **59.6%** | **63.1%** | **67.8%** | **21.9%** |
121
 
122
+ Mean lift over Random at τ=0.5: **+4.6** (system accuracy minus random-routing accuracy, averaged across the seven domains).
123
 
124
  ### Baseline comparison
125
 
 
142
  | [TruthfulQA gen](https://huggingface.co/datasets/truthfulqa/truthful_qa) | adversarial factual | 36.5% | −0.7 (anti-calibrated) |
143
  | [GSM8K](https://huggingface.co/datasets/openai/gsm8k) (CoT) | math word-problems | 52.0% | +2.2 |
144
 
145
+ One additional round of OE training (R15, 1876 SFT rows) lifts local accuracy on these three benchmarks to 86.5% / 40.0% / 58.0%, respectively; see [`issai/foggen-r15-oe`](https://huggingface.co/issai/foggen-r15-oe).
146
 
147
  ## Intended use
148
 
 
156
  - Generation tasks beyond what was tested (extractive QA, factual recall, CoT math) without additional task-type training.
157
  - Reliance on the confidence signal for adversarial-factuality benchmarks like TruthfulQA, where verbalized confidence is anti-calibrated by design of the dataset (see Tian et al., 2023).
158
 
 
159
  ## Reproducibility
160
 
161
  - Per-question eval outputs and SFT inputs are released at [`issai/foggen-data`](https://huggingface.co/datasets/issai/foggen-data).
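
The reworded "Mean lift over Random" line defines the figure as system accuracy minus random-routing accuracy, averaged across domains. A minimal sketch of that arithmetic follows. The function name `mean_lift` is hypothetical. Which table columns hold system and random-routing accuracy is an assumption, since the column headers are not visible in this diff. Only the Medical row appears here, so the second domain is a made-up placeholder and the result will not reproduce the reported +4.6.

```python
from statistics import mean


def mean_lift(system_acc, random_acc):
    """Average per-domain difference (system minus random routing).

    Inputs are dicts mapping domain name -> accuracy in percent.
    """
    if system_acc.keys() != random_acc.keys():
        raise ValueError("both inputs must cover the same domains")
    return mean(system_acc[d] - random_acc[d] for d in system_acc)


# Medical values are read from the table row in the diff, ASSUMING the bolded
# column is system accuracy and the column before it is random routing.
# "PlaceholderDomain" is invented purely for illustration.
system = {"Medical": 62.2, "PlaceholderDomain": 70.0}
random_routing = {"Medical": 57.1, "PlaceholderDomain": 66.0}

lift = mean_lift(system, random_routing)
print(f"Mean lift over Random: {lift:+.2f}")
```

Averaging per-domain differences weights each domain equally regardless of test-set size; a per-question average over the per-question outputs in `issai/foggen-data` could give a slightly different number.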