Literacy thresholds summary (outliers removed via IQR with ordering guard)
Source: data/factual_testing/full_details_evaluation_0_80_qwen3-30B_v2.json

Metrics:
- source_coverage (proxy for % information from source text)
- completeness (proxy for % information from gold summary)
- factual_attribution (factual support of claims)

Five-number summaries by label (min, q1, median, q3, max)

Low literacy
  factual_attribution: 0.0000, 0.3656, 0.5263, 0.6667, 1.0000
  completeness:        0.8500, 0.9600, 1.0000, 1.0000, 1.0000
  source_coverage:     0.0000, 0.1765, 0.2308, 0.3226, 0.5000

Intermediate literacy
  factual_attribution: 0.2500, 0.5000, 0.6111, 0.7692, 0.9412
  completeness:        0.8500, 0.9393, 1.0000, 1.0000, 1.0000
  source_coverage:     0.0000, 0.1818, 0.3036, 0.4091, 0.7419

Proficient literacy
  factual_attribution: 0.0000, 0.4901, 0.7207, 0.8213, 0.9600
  completeness:        0.6923, 0.9231, 1.0000, 1.0000, 1.0000
  source_coverage:     0.4286, 0.7725, 0.8758, 0.9347, 1.0000

Outliers removed (count) by label and metric
  Low:          factual_attribution=0, completeness=7,  source_coverage=3
  Intermediate: factual_attribution=7, completeness=9,  source_coverage=3
  Proficient:   factual_attribution=0, completeness=10, source_coverage=10

Suggested thresholds (based on cleaned quartiles/medians)

factual_attribution:
  low_to_intermediate:        0.5687
  intermediate_to_proficient: 0.6659
completeness:
  low_to_intermediate:        1.0000
  intermediate_to_proficient: 1.0000
source_coverage:
  low_to_intermediate:        0.2672
  intermediate_to_proficient: 0.5908

Interpretation for “% information needed” (use medians, with IQR as uncertainty band)

Low literacy:
  source text  ≈ 23% (IQR 18%–32%)
  gold summary ≈ 100% (IQR 96%–100%)
Intermediate literacy:
  source text  ≈ 30% (IQR 18%–41%)
  gold summary ≈ 100% (IQR 94%–100%)
Proficient literacy:
  source text  ≈ 88% (IQR 77%–93%)
  gold summary ≈ 100% (IQR 92%–100%)

Explanation of how the ranges were selected (for reporting)

1) For each literacy label, we collected the metric scores across all examples.
2) We removed outliers using the IQR rule (values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR).
3) For source_coverage and completeness, we kept the ordering proficient > intermediate > low by skipping outlier removal if the cleaned medians would violate that order.
4) We summarized each cleaned distribution with the five-number summary.
5) We report the median as the “typical” needed information and the IQR (Q1–Q3) as a robust uncertainty band around that typical value.
6) For thresholds between labels, we used the midpoint between adjacent distributions:
   - If there is a gap (Q3 of lower label < Q1 of upper label), use the midpoint between Q3 and Q1 to avoid overlap.
   - If distributions overlap, use the midpoint between the two medians as a conservative separating threshold.
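
Illustrative sketch of the selection procedure

The procedure described above can be sketched in Python as follows. This is a minimal illustration, not the script that produced this report. The function names, the label keys, and the quantile method (linear interpolation via statistics.quantiles with method="inclusive") are assumptions. The ordering guard is also an interpretation: it compares cleaned medians non-strictly and, on a violation, falls back to the raw scores for every label of that metric.

```python
from statistics import median, quantiles

# Hypothetical label keys, ordered from lowest to highest literacy.
LABELS = ("low", "intermediate", "proficient")


def quartiles(values):
    """Return (q1, median, q3). The interpolation method is an assumption."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q2, q3


def remove_outliers(values):
    """Drop values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    if len(values) < 2:
        return list(values)
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if low_fence <= v <= high_fence]


def clean_with_ordering_guard(scores_by_label):
    """Clean each label's scores; skip cleaning if medians fall out of order.

    Assumption: the order check is non-strict (low <= intermediate <=
    proficient), and a violation reverts all labels to their raw scores.
    """
    cleaned = {lab: remove_outliers(scores_by_label[lab]) for lab in LABELS}
    medians = [median(cleaned[lab]) for lab in LABELS]
    if medians != sorted(medians):
        return {lab: list(scores_by_label[lab]) for lab in LABELS}
    return cleaned


def five_number_summary(values):
    """Return (min, q1, median, q3, max) for a non-empty score list."""
    q1, q2, q3 = quartiles(values)
    return min(values), q1, q2, q3, max(values)


def threshold(lower, upper):
    """Separating threshold between two adjacent labels.

    lower and upper are (q1, median, q3) tuples. With a gap between the
    boxes, use the midpoint of lower Q3 and upper Q1; otherwise use the
    midpoint of the two medians.
    """
    _, lower_median, lower_q3 = lower
    upper_q1, upper_median, _ = upper
    if lower_q3 < upper_q1:
        return (lower_q3 + upper_q1) / 2
    return (lower_median + upper_median) / 2


if __name__ == "__main__":
    # source_coverage quartiles reported above: (q1, median, q3)
    intermediate = (0.1818, 0.3036, 0.4091)
    proficient = (0.7725, 0.8758, 0.9347)
    print(round(threshold(intermediate, proficient), 4))  # 0.5908
```

Applied to the quartiles reported above, the gap rule gives the source_coverage intermediate_to_proficient threshold (0.4091 and 0.7725 average to 0.5908), while the overlapping factual_attribution distributions fall back to the median midpoint (0.5263 and 0.6111 average to 0.5687).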