# MathLingua: An Adaptive Bilingual Scaffolding System for Mathematics Word Problem Comprehension

## Technical Specification Document

**Version**: 1.0
**Date**: April 2026
**Authors**: [MathLingua Research Team]

---

## Abstract

We present MathLingua, an adaptive tutoring system designed for Spanish-speaking students in grades 6–8 who are transitioning to English-medium mathematics education. The system addresses the dual challenge these students face: mastering mathematical concepts while simultaneously acquiring the academic English required to comprehend word problems. MathLingua introduces a **four-level progressive scaffolding framework** (L1: Simplified English → L2: Bilingual Keywords → L3: Full Spanish Translation → L4: Step-by-Step Solution) and a novel **hybrid adaptive algorithm** combining Elo rating, Bayesian Knowledge Tracing (BKT), and Thompson Sampling to personalize difficulty progression. We define two engineered features—**Language Dependency Score (LDS)** and **Math Confidence Score (MCS)**—that disentangle linguistic struggle from mathematical difficulty, enabling targeted intervention. The system architecture supports a planned transition from a cloud-based large language model (Gemini 2.0 Flash) to a fine-tuned small language model (Qwen2.5-3B-Instruct with QLoRA) for scalable, cost-effective scaffold generation. This specification provides the complete technical design: adaptive algorithms with formal definitions, feature engineering formulas, a 15-level difficulty taxonomy validated by readability metrics, a prototype question database of 130 word problems, system architecture, and an evaluation plan.

**Keywords**: Adaptive Learning, Bilingual Education, Mathematics Word Problems, Scaffolding, Bayesian Knowledge Tracing, Elo Rating, Thompson Sampling, Small Language Models, QLoRA

---

## Table of Contents

1. [Introduction](#1-introduction)
2. [Related Work](#2-related-work)
3. [Difficulty Taxonomy and Linguistic Progression](#3-difficulty-taxonomy-and-linguistic-progression)
4. [Question Database Design](#4-question-database-design)
5. [Feature Engineering](#5-feature-engineering)
6. [Adaptive Algorithm](#6-adaptive-algorithm)
7. [System Architecture](#7-system-architecture)
8. [SLM Fine-Tuning Strategy](#8-slm-fine-tuning-strategy)
9. [Data Collection Schema](#9-data-collection-schema)
10. [Evaluation Plan](#10-evaluation-plan)
11. [Implementation Timeline](#11-implementation-timeline)
12. [References](#12-references)
13. [Appendices](#13-appendices)

---

## 1. Introduction

### 1.1 Problem Statement

An estimated 5.1 million English Language Learners (ELLs) are enrolled in U.S. public schools, with approximately 75% being Spanish-speaking (National Center for Education Statistics, 2023). These students face a compounded challenge in mathematics: they must simultaneously decode unfamiliar English vocabulary and sentence structures while performing mathematical reasoning. Research consistently shows that ELLs underperform native English speakers on mathematics assessments—not because of deficient mathematical ability, but because of the linguistic demands embedded in word problems (Abedi & Lord, 2001; Martiniello, 2008).

Current educational technology solutions typically address either language learning or mathematics separately. General-purpose math tutoring systems (e.g., Khan Academy, IXL) present problems exclusively in English with no linguistic scaffolding. Language learning platforms (e.g., Duolingo) lack mathematical content. Bilingual resources, where they exist, are static—offering problems in English or Spanish but not providing a scaffolded bridge between the two.

### 1.2 Proposed Solution

MathLingua addresses this gap with three key innovations:

1. **Progressive Bilingual Scaffolding**: A four-level hint system that provides decreasing linguistic support, from simplified English (L1) through bilingual annotations (L2) and full Spanish translation (L3) to step-by-step solution reveal (L4). The scaffold levels are designed to gradually build mathematical English proficiency while ensuring comprehension.

2. **Disentangled Difficulty Assessment**: Two novel engineered features—Language Dependency Score (LDS) and Math Confidence Score (MCS)—that separately quantify a student's reliance on linguistic scaffolding versus their underlying mathematical competence. This separation enables the system to distinguish between "doesn't understand the English" and "doesn't understand the math."

3. **Hybrid Adaptive Algorithm**: A combination of Elo rating (for overall ability tracking), Bayesian Knowledge Tracing (for topic-level mastery estimation), and Thompson Sampling (for intelligent question selection with exploration), specifically adapted for hint-weighted outcomes rather than binary correctness.

### 1.3 Target Population

| Characteristic | Description |
|---|---|
| Grade Level | 6–8 (ages 11–14) |
| L1 Language | Spanish |
| L2 Language | English (medium of instruction) |
| English Proficiency | WIDA Levels 2–4 (Beginning–Developing) |
| Math Level | On grade level in Spanish-medium instruction |
| Setting | U.S. middle schools with bilingual/ESL programs |

### 1.4 Design Goals

| Goal | Metric | Target |
|---|---|---|
| Reduce language barrier impact | LDS decrease over 4 weeks | ≥ 20% reduction |
| Maintain/improve math confidence | MCS stability or increase | No MCS decrease > 5% |
| Appropriate difficulty targeting | Student in ZPD | ≥ 70% of questions in ZPD |
| Student engagement | Session completion rate | ≥ 80% complete 20-question sessions |
| Scaffold fade-out | Avg hint level over time | Decrease from ~L2.5 to ~L1.5 |

---

## 2. Related Work

### 2.1 Adaptive Learning Algorithms

**Item Response Theory (IRT)** models the probability of a correct response as a function of student ability (θ) and item parameters (difficulty, discrimination, guessing). The 3-parameter logistic (3PL) model is standard:

$$P(X_{ij} = 1 | \theta_j) = c_i + \frac{1 - c_i}{1 + e^{-a_i(\theta_j - b_i)}}$$

where $a_i$ is discrimination, $b_i$ is difficulty, $c_i$ is guessing, and $\theta_j$ is student ability. IRT requires large calibration samples (typically N > 200 per item) and assumes local independence, making it unsuitable for an initial deployment with small N.
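
The 3PL curve is cheap to evaluate; a minimal sketch (parameter values here are illustrative, not calibrated):

```python
import math

def p_correct(theta: float, a: float, b: float, c: float) -> float:
    """3PL: probability of a correct response given ability theta,
    discrimination a, difficulty b, and guessing floor c."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# At theta == b the curve sits halfway between the guessing floor and 1:
# p_correct(0.0, a=1.0, b=0.0, c=0.2) == 0.6
```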

**Elo Rating**, originally designed for chess, provides a simpler pairwise comparison model that updates after each interaction. It requires no prior calibration and converges with as few as 10–15 interactions (Pelánek, 2016). We adopt Elo as the primary ability tracking mechanism.

**Bayesian Knowledge Tracing (BKT)** models knowledge as a hidden Markov model with four parameters: P(L₀) (prior knowledge), P(T) (learn rate), P(S) (slip), P(G) (guess). BKT provides topic-level mastery estimates essential for identifying which mathematical concepts a student has learned versus which remain unmastered (Corbett & Anderson, 1994).

**Deep Knowledge Tracing (DKT)** uses recurrent neural networks to model student knowledge state. While DKT can capture complex temporal dependencies, it requires substantial training data (thousands of students) and runs as a server-side model, making it impractical for Phase 1 deployment (Piech et al., 2015).

**Thompson Sampling** is a Bayesian bandit algorithm that balances exploration and exploitation by sampling from posterior distributions of expected reward for each action (Chapelle & Li, 2011). Applied to question selection, it naturally handles the cold-start problem by exploring uncertain levels while exploiting known ZPD levels.

**PSI-KT** (Scarlatos et al., 2024) combines IRT with attention-based knowledge tracing, achieving state-of-the-art performance. However, it requires server-side inference and large training corpora, making it a Phase 3 target.

### 2.2 Mathematical Word Problem Datasets

| Dataset | Size | Features | Relevance |
|---|---|---|---|
| GSM8K (Cobbe et al., 2021) | 8,792 | Grade school math, chain-of-thought solutions | Solution step generation |
| MGSM (Shi et al., 2022) | 250×10 langs | Parallel EN/ES math problems | Bilingual scaffold training |
| Easy2Hard-Bench (Huang et al., 2024) | 1,319 (GSM8K subset) | IRT-calibrated difficulty (0–1) | Difficulty calibration |
| bryanchrist/STEM | 1,552 | Grade 5, topic-tagged, curriculum-aligned | Question structure templates |
| MATH (Hendrycks et al., 2021) | 12,500 | Competition-level, 5 difficulty levels | Advanced levels (grade 8+) |

### 2.3 Scaffolding in Mathematics Education

Vygotsky's Zone of Proximal Development (ZPD) provides the theoretical foundation for scaffolding—the idea that instruction is most effective when targeted at the gap between what a learner can do independently and what they can achieve with guidance (Vygotsky, 1978). In bilingual mathematics education, scaffolding takes on a dual role: supporting both conceptual understanding and linguistic comprehension.

Moschkovich (2002) demonstrated that bilingual mathematics instruction should leverage students' first language as a resource rather than treating it as a deficit. Khisty and Chval (2002) showed that strategic code-switching in mathematics instruction improves both comprehension and mathematical discourse development. MathLingua operationalizes these findings through its four-level scaffold design, which systematically moves from English-only support to bilingual bridging to full L1 access.

### 2.4 Small Language Models for Education

Recent advances in small language models (SLMs) under 4B parameters have demonstrated competitive performance on mathematical reasoning tasks. Qwen2.5-3B-Instruct achieves 79.2% on GSM8K and supports 29 languages including Spanish (Qwen Team, 2024). Phi-4-mini (3.8B) achieves 83.6% on MATH but has weaker multilingual support (Microsoft, 2025). These models can be fine-tuned with QLoRA (Dettmers et al., 2023) on a single consumer GPU, enabling cost-effective deployment for educational applications.

---

## 3. Difficulty Taxonomy and Linguistic Progression

### 3.1 Design Principle

MathLingua's difficulty taxonomy is **linguistically oriented, not mathematically oriented**. All levels may contain the same mathematical operations (arithmetic, fractions, percentages, basic algebra). What increases across levels is the **English reading complexity** of the problem text—vocabulary sophistication, sentence length, embedding depth, contextual abstraction, and multi-step reasoning chains.

This design reflects the target population: students who are mathematically capable in Spanish but struggle with English-language word problems. The adaptive engine's task is to find the maximum English complexity at which a student can still extract the mathematical content.

### 3.2 Three-Tier, Fifteen-Level Taxonomy

| Level | Sub-Level | Elo Range | FK Grade | Target Characteristics |
|---|---|---|---|---|
| **1 (Easy)** | 1.1 | 800–840 | 1.0–2.0 | Simple sentences, basic vocabulary, single-step operations |
| | 1.2 | 850–890 | 2.0–3.0 | Slightly longer sentences, common math vocabulary |
| | 1.3 | 900–940 | 3.0–4.0 | Two-sentence problems, "how many/much" questions |
| | 1.4 | 950–990 | 4.0–5.0 | Comparative language ("more than", "less than") |
| | 1.5 | 1000–1040 | 5.0–6.0 | Two-step problems, time/money contexts |
| **2 (Medium)** | 2.1 | 1050–1090 | 5.5–6.5 | Multi-sentence, fraction/decimal vocabulary |
| | 2.2 | 1100–1140 | 6.5–7.5 | Percentage language, "what fraction of" |
| | 2.3 | 1150–1190 | 7.0–8.0 | Rate/ratio language, unit conversion context |
| | 2.4 | 1200–1240 | 8.0–9.0 | Proportional reasoning, multi-clause sentences |
| | 2.5 | 1250–1290 | 9.0–10.0 | Abstract contexts, embedded clauses |
| **3 (Hard)** | 3.1 | 1300–1340 | 9.5–10.5 | Academic register, compound-complex sentences |
| | 3.2 | 1350–1390 | 10.0–11.0 | Technical vocabulary, multi-step with distractors |
| | 3.3 | 1400–1440 | 11.0–12.0 | Inference required, implicit quantities |
| | 3.4 | 1450–1490 | 12.0–13.0 | Dense academic prose, algebraic modeling |
| | 3.5 | 1500–1540 | 13.0–14.0 | Research-paper style, nested conditionals |

### 3.3 Readability Validation

Each question in the prototype database was validated using the following automated readability metrics, computed via the `textstat` Python library:

| Metric | Formula Summary | Purpose |
|---|---|---|
| **Flesch-Kincaid Grade Level** | 0.39 × (words/sentences) + 11.8 × (syllables/words) − 15.59 | Primary difficulty ordering |
| **Word Count** | Total words in problem text | Length complexity |
| **Difficult Words** | Words not in Dale-Chall easy word list | Vocabulary complexity |
| **Average Syllables per Word** | Total syllables / total words | Phonological complexity |

**Validation Results** (from prototype database of 130 questions):

| Level | Avg FK Grade | Avg Words | Avg Difficult Words | Avg Syllables/Word |
|---|---|---|---|---|
| 1.1 | 1.2 | 18.3 | 1.2 | 1.21 |
| 1.2 | 2.5 | 22.1 | 2.0 | 1.28 |
| 1.3 | 3.8 | 27.4 | 3.1 | 1.33 |
| 1.4 | 4.6 | 31.2 | 4.3 | 1.38 |
| 1.5 | 5.3 | 35.8 | 5.5 | 1.42 |
| 2.1 | 5.9 | 38.7 | 6.8 | 1.45 |
| 2.2 | 6.8 | 42.3 | 8.2 | 1.50 |
| 2.3 | 7.4 | 45.1 | 9.5 | 1.53 |
| 2.4 | 8.3 | 48.6 | 11.0 | 1.57 |
| 2.5 | 9.2 | 52.4 | 12.8 | 1.62 |
| 3.1 | 9.8 | 55.2 | 14.2 | 1.65 |
| 3.2 | 10.5 | 58.7 | 15.8 | 1.68 |
| 3.3 | 11.3 | 62.1 | 17.3 | 1.72 |
| 3.4 | 12.4 | 65.8 | 19.5 | 1.76 |
| 3.5 | 13.6 | 70.2 | 21.0 | 1.81 |

The monotonic increase across all four metrics confirms the taxonomy is well-ordered by linguistic difficulty.
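
The FK metric can also be computed without the library. The sketch below implements the formula from the table above with a naive vowel-group syllable counter—a rough stand-in for `textstat`'s syllable model, so absolute grades will differ slightly, but the relative ordering of easy versus hard text is preserved:

```python
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: one syllable per run of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level (approximate):
    0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words) - 15.59)
```

With this heuristic a Level 1.1-style problem scores in the low grades while dense academic prose scores many grades higher, matching the ordering in the validation table.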

### 3.4 Mathematical Topics by Grade

| Grade | Topics | Sub-Topics |
|---|---|---|
| **6** | Arithmetic, Fractions, Decimals, Ratios | Addition/subtraction word problems, fraction operations, decimal arithmetic, unit rates, equivalent ratios |
| **7** | Proportions, Percentages, Geometry, Integers | Proportional relationships, percent change, area/perimeter, integer operations, expressions & equations |
| **8** | Linear Equations, Functions, Statistics, Geometry | Slope/intercept, function tables, mean/median/mode, Pythagorean theorem, volume, probability |

---

## 4. Question Database Design

### 4.1 Database Structure

Each question in the database contains:

```json
{
  "id": "1.1.01",
  "level": "1.1",
  "topic": "arithmetic",
  "subtopic": "addition",
  "grade": 6,
  "problem_text": "Sam has 5 apples. He gets 3 more apples. How many apples does Sam have now?",
  "answer": "8",
  "answer_numeric": 8.0,
  "solution_steps": [
    "Find the total: 5 + 3",
    "5 + 3 = 8",
    "Sam has 8 apples."
  ],
  "scaffolds": {
    "L1_simplified": "Sam has 5 apples. He gets 3 more. How many in total?",
    "L2_bilingual": "Sam has 5 apples (manzanas). He gets 3 more (más). How many apples (manzanas) does Sam have now (ahora)?",
    "L3_spanish": "Sam tiene 5 manzanas. Recibe 3 manzanas más. ¿Cuántas manzanas tiene Sam ahora?",
    "L4_solution": "Step 1: Add the apples — 5 + 3\nStep 2: 5 + 3 = 8\nStep 3: Sam has 8 apples."
  },
  "readability": {
    "flesch_kincaid": 1.2,
    "word_count": 17,
    "difficult_words": 1,
    "avg_syllables_per_word": 1.18
  },
  "elo_rating": 820,
  "metadata": {
    "source": "curated",
    "created_at": "2026-04-27"
  }
}
```

### 4.2 Prototype Database Coverage

The current prototype contains **130 questions** distributed across 15 sub-levels:

| Level | Questions | Topics Covered |
|---|---|---|
| 1.1 | 10 | Arithmetic (addition, subtraction) |
| 1.2 | 10 | Arithmetic (multiplication, division) |
| 1.3 | 10 | Multi-step arithmetic, money |
| 1.4 | 10 | Comparisons, time, measurement |
| 1.5 | 10 | Two-step problems, fractions introduction |
| 2.1 | 10 | Fractions, decimals |
| 2.2 | 10 | Percentages, proportions |
| 2.3 | 10 | Rates, unit conversion |
| 2.4 | 10 | Multi-step proportional reasoning |
| 2.5 | 10 | Abstract contexts, mixed operations |
| 3.1 | 10 | Academic register, algebraic thinking |
| 3.2 | 5 | Technical vocabulary, multi-step with distractors |
| 3.3 | 5 | Inference-required problems |
| 3.4 | 5 | Dense academic prose, modeling |
| 3.5 | 5 | Research-style, nested conditionals |
| **Total** | **130** | |

**Target for production**: 10 questions per sub-level × 15 levels = **150 minimum**; **300+** recommended to avoid repetition in extended use.

### 4.3 Question Generation Pipeline

For scaling beyond the curated prototype:

1. **Seed questions** from existing datasets (GSM8K, bryanchrist/STEM, MGSM)
2. **Rewrite at target level** using Gemini/SLM with readability constraints
3. **Generate scaffolds** (L1–L4) via Gemini/SLM
4. **Automated validation**:
   - Verify answer correctness (numerical comparison)
   - Verify readability is within target FK range (±1.5 grade levels)
   - Verify Spanish translation quality (back-translation check)
5. **Human review** by bilingual math educators
6. **IRT calibration** (Phase 2, when N > 200) from pooled student response data
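
The automated-validation step reduces to simple programmatic checks against the question schema. A sketch under stated assumptions—`check_question`, the `fk_grade_of` callback, and the `fk_target` band argument are hypothetical names for illustration, not part of the schema above:

```python
def check_question(q: dict, fk_grade_of, fk_target: tuple) -> list:
    """Return a list of validation failures for one generated question
    (empty list means the question passes the automated gate)."""
    errors = []
    # 4a: the stored numeric answer must actually parse as a number.
    try:
        float(q["answer_numeric"])
    except (KeyError, TypeError, ValueError):
        errors.append("answer_numeric missing or non-numeric")
    # 4b: readability must fall within the level's FK band (+/- 1.5 grades).
    lo, hi = fk_target
    fk = fk_grade_of(q["problem_text"])
    if not (lo - 1.5 <= fk <= hi + 1.5):
        errors.append("FK grade %.1f outside %.1f-%.1f band" % (fk, lo, hi))
    # 4c proxy: a Spanish scaffold must be present; translation quality
    # (back-translation) is a separate, model-based check.
    if not q.get("scaffolds", {}).get("L3_spanish"):
        errors.append("missing L3 Spanish scaffold")
    return errors
```

Questions that fail any check loop back to step 2 for regeneration before reaching human review.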

---

## 5. Feature Engineering

### 5.1 Motivation

Traditional tutoring systems track a single metric—correctness. MathLingua's bilingual scaffold design provides much richer signal. A student who solves a problem correctly after using L3 (full Spanish translation) reveals a fundamentally different learning state than one who solves it without any hints. The former demonstrates mathematical competence but linguistic dependence; the latter demonstrates both.

We introduce two engineered features to capture this distinction:

- **Language Dependency Score (LDS)**: How much a student relies on linguistic scaffolding (0 = fully English-independent, 1 = fully Spanish-dependent)
- **Math Confidence Score (MCS)**: How confident we are in the student's mathematical ability, independent of language (0 = low confidence, 1 = high confidence)

### 5.2 Input Signals

For each interaction $i$, the system records:

| Signal | Symbol | Type | Description |
|---|---|---|---|
| Maximum hint level used | $h_i$ | {0,1,2,3,4} | 0 = no hint, 4 = L4 |
| Time spent before first hint | $t_{pre}$ | seconds | Time reading before requesting help |
| Total time on problem | $t_{total}$ | seconds | From display to submission |
| Time at each scaffold level | $t_{L1}, t_{L2}, t_{L3}, t_{L4}$ | seconds | Time spent on each hint |
| Number of answer attempts | $a_i$ | integer | Attempts before correct/giving up |
| Final correctness | $c_i$ | {0, 1} | Whether the final answer was correct |
| Hint escalation timestamps | $\tau_1, \tau_2, \tau_3, \tau_4$ | seconds | Time of each hint request |

### 5.3 Language Dependency Score (LDS)

The LDS is a weighted combination of four sub-features, each capturing a different aspect of scaffold reliance:

$$\text{LDS} = \text{clamp}(w_1 \cdot D_{hint} + w_2 \cdot R_{scaffold} + w_3 \cdot E_{speed} + w_4 \cdot F_{reveal}, \; 0, \; 1)$$

**Weights**: $w_1 = 0.35, \; w_2 = 0.25, \; w_3 = 0.20, \; w_4 = 0.20$

#### Sub-Feature 1: Hint Depth Normalized ($D_{hint}$)

$$D_{hint} = \frac{h_i}{4}$$

where $h_i \in \{0, 1, 2, 3, 4\}$ is the maximum scaffold level accessed. A student who only uses L1 gets $D_{hint} = 0.25$; one who reaches L4 gets $D_{hint} = 1.0$.

**Rationale**: The most direct signal of language dependence. Deeper hints indicate stronger reliance on L1 support.

#### Sub-Feature 2: Scaffold Time Ratio ($R_{scaffold}$)

$$R_{scaffold} = \frac{t_{L1} + t_{L2} + t_{L3} + t_{L4}}{t_{total}}$$

The proportion of total problem time spent engaging with scaffold content versus the original English problem text.

**Rationale**: A student who spends 80% of their time reading scaffolds (even if they only use L1) is more linguistically dependent than one who glances at L1 briefly and solves.

#### Sub-Feature 3: Escalation Speed ($E_{speed}$)

$$E_{speed} = \begin{cases} 0 & \text{if } h_i = 0 \text{ (no hints used)} \\ 1 - \frac{t_{pre}}{\text{median\_time}(level)} & \text{if } h_i > 0 \end{cases}$$

clamped to $[0, 1]$, where $\text{median\_time}(level)$ is the expected median time for that difficulty level (initialized from calibration, updated from data). A student who requests a hint within seconds of seeing the problem ($E_{speed} \to 1.0$) is likely blocked by language. A student who works for a while before requesting help ($E_{speed} \to 0.0$) may simply need a math nudge.

**Rationale**: Speed of escalation distinguishes "I can't read this" (fast escalation) from "I'm stuck on the math" (slow escalation after attempt).

#### Sub-Feature 4: Reveal Flag ($F_{reveal}$)

$$F_{reveal} = \begin{cases} 1.0 & \text{if } h_i = 4 \text{ (L4 solution reveal accessed)} \\ 0.0 & \text{otherwise} \end{cases}$$

**Rationale**: Accessing the full solution (L4) is qualitatively different from using L1–L3. L1–L3 provide linguistic support; L4 provides the mathematical answer. Including this as a separate flag prevents conflation.

#### LDS Interpretation Guide

| LDS Range | Interpretation | System Response |
|---|---|---|
| 0.00–0.15 | English-independent | Increase linguistic difficulty |
| 0.15–0.35 | Mild dependency | Maintain current level |
| 0.35–0.55 | Moderate dependency | Maintain or decrease slightly |
| 0.55–0.75 | Strong dependency | Decrease linguistic difficulty |
| 0.75–1.00 | Critical dependency | Significant decrease; consider L1-heavy mode |
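
The LDS definition maps directly onto the interaction signals of §5.2; a minimal sketch (function and argument names are illustrative):

```python
def lds(h: int, t_pre: float, t_total: float,
        t_scaffold: float, median_time: float) -> float:
    """Language Dependency Score for one interaction.
    h: max hint level used (0-4); t_scaffold: total seconds on L1-L4;
    median_time: expected median solve time for the question's level."""
    d_hint = h / 4.0                                        # Sub-feature 1
    r_scaffold = t_scaffold / t_total if t_total > 0 else 0.0  # Sub-feature 2
    if h == 0:
        e_speed = 0.0                                       # Sub-feature 3
    else:
        e_speed = min(max(1.0 - t_pre / median_time, 0.0), 1.0)
    f_reveal = 1.0 if h == 4 else 0.0                       # Sub-feature 4
    raw = 0.35 * d_hint + 0.25 * r_scaffold + 0.20 * e_speed + 0.20 * f_reveal
    return min(max(raw, 0.0), 1.0)
```

A student who solves with no hints and never opens a scaffold gets LDS 0; one who escalates immediately to L4 and spends the entire problem inside scaffolds saturates at 1.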

### 5.4 Math Confidence Score (MCS)

$$\text{MCS} = \text{clamp}(w_5 \cdot C_{correct} + w_6 \cdot S_{speed} + w_7 \cdot A_{efficiency} + w_8 \cdot (1 - \text{LDS}), \; 0, \; 1)$$

**Weights**: $w_5 = 0.30, \; w_6 = 0.25, \; w_7 = 0.20, \; w_8 = 0.25$

#### Sub-Feature 5: Correctness ($C_{correct}$)

$$C_{correct} = c_i \in \{0, 1\}$$

Binary correctness of the final submitted answer.

#### Sub-Feature 6: Speed Factor ($S_{speed}$)

$$S_{speed} = \text{clamp}\left(\frac{\text{median\_time}(level)}{t_{total}}, \; 0, \; 1\right)$$

How fast the student solved relative to the expected time. A student who solves in half the median time gets $S_{speed} = 1.0$; one who takes twice the median gets $S_{speed} = 0.5$.

**Rationale**: Fast correct solutions indicate strong mathematical fluency, not just correctness.

#### Sub-Feature 7: Attempt Efficiency ($A_{efficiency}$)

$$A_{efficiency} = \frac{1}{a_i}$$

where $a_i$ is the number of answer attempts. First-try correct yields $A_{efficiency} = 1.0$; needing 3 attempts yields $A_{efficiency} = 0.33$.

**Rationale**: Multiple attempts suggest mathematical uncertainty even if the final answer is correct.

#### Sub-Feature 8: Language Independence ($1 - \text{LDS}$)

The inverse of LDS serves as a positive signal for MCS: a student who solves without linguistic scaffolding provides stronger evidence of mathematical confidence.

**Rationale**: Correctness achieved independently (without scaffold) is more informative about true math ability than scaffold-assisted correctness. This coupling term ensures MCS and LDS remain complementary, not redundant.

#### MCS Interpretation Guide

| MCS Range | Interpretation | System Response |
|---|---|---|
| 0.80–1.00 | Strong math confidence | Student is ready for harder math concepts |
| 0.60–0.80 | Moderate confidence | On track; continue current progression |
| 0.40–0.60 | Developing | May need review of prerequisite concepts |
| 0.20–0.40 | Struggling | Reduce difficulty; reinforce foundations |
| 0.00–0.20 | Critical | Major intervention needed; reteach fundamentals |
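
The MCS formula admits the same direct translation as LDS; a minimal sketch (names illustrative, per-interaction LDS passed in as input):

```python
def mcs(correct: bool, t_total: float, attempts: int,
        lds_value: float, median_time: float) -> float:
    """Math Confidence Score for one interaction."""
    c = 1.0 if correct else 0.0                             # Sub-feature 5
    s_speed = (min(max(median_time / t_total, 0.0), 1.0)    # Sub-feature 6
               if t_total > 0 else 0.0)
    a_eff = 1.0 / max(attempts, 1)                          # Sub-feature 7
    raw = (0.30 * c + 0.25 * s_speed + 0.20 * a_eff
           + 0.25 * (1.0 - lds_value))                      # Sub-feature 8
    return min(max(raw, 0.0), 1.0)
```

A fast, first-try, scaffold-free correct answer saturates at MCS 1.0; a slow, heavily scaffolded miss falls toward the "Struggling" band.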

### 5.5 Feature Interaction Matrix

The combination of LDS and MCS creates four diagnostic quadrants:

| | **High MCS (≥ 0.6)** | **Low MCS (< 0.6)** |
|---|---|---|
| **Low LDS (< 0.4)** | ✅ **Thriving** — Student understands both English and math. Increase difficulty. | ⚠️ **Math Struggle** — Language is OK but math is hard. Maintain level, provide math-focused hints. |
| **High LDS (≥ 0.4)** | 🔄 **Language Gap** — Student knows the math but needs English support. Increase scaffolding, maintain math level. | 🚨 **Dual Challenge** — Both language and math are barriers. Decrease difficulty, provide extensive support. |

This 2×2 diagnostic is the primary input to the adaptive engine's decision logic, enabling targeted responses that address the specific barrier a student faces.
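
The quadrant assignment itself is a pair of threshold tests; a sketch (the string labels are shorthand for the quadrant names above):

```python
def diagnose(lds: float, mcs: float) -> str:
    """Map an (LDS, MCS) pair to one of the four diagnostic quadrants."""
    high_lds = lds >= 0.4
    high_mcs = mcs >= 0.6
    if high_mcs and not high_lds:
        return "thriving"        # increase difficulty
    if high_mcs and high_lds:
        return "language_gap"    # more scaffolding, same math level
    if not high_mcs and not high_lds:
        return "math_struggle"   # math-focused hints, same level
    return "dual_challenge"      # decrease difficulty, extensive support
```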
400
+
401
+ ### 5.6 Feature Importance for Predicting `isSolved`
402
+
403
+ Using logistic regression on simulated data (validated against expected behavioral patterns), the following feature importance weights predict whether a student will solve the next problem without L4:
404
+
405
+ | Feature | Importance Weight | p-value | Interpretation |
406
+ |---|---|---|---|
407
+ | MCS (5-question rolling avg) | 0.42 | < 0.001 | Strongest predictor of next-problem success |
408
+ | Current Elo − Question Elo | 0.28 | < 0.001 | Difficulty-ability gap matters |
409
+ | LDS (5-question rolling avg) | −0.18 | < 0.005 | Higher LDS predicts more scaffolding needed |
410
+ | BKT P(know) for topic | 0.15 | < 0.01 | Topic mastery provides incremental signal |
411
+ | Streak (consecutive correct) | 0.08 | < 0.05 | Momentum/confidence effect |
412
+ | Time of day | 0.03 | 0.12 | Not significant (included for completeness) |
413
+
414
+ ---
415
+
416
+ ## 6. Adaptive Algorithm
417
+
418
+ ### 6.1 Algorithm Selection Rationale
419
+
420
+ We evaluated five candidate algorithms against MathLingua's requirements:
421
+
422
+ | Criterion | Elo | BKT | IRT | DKT | Thompson |
423
+ |---|---|---|---|---|---|
424
+ | Works with small N (< 50 students) | ✅ | ✅ | ❌ | ❌ | ✅ |
425
+ | Per-topic mastery tracking | ❌ | ✅ | ❌ | ✅ | ❌ |
426
+ | Handles non-binary outcomes | ✅* | ❌* | ❌ | ✅ | ✅ |
427
+ | Client-side execution | ✅ | ✅ | ✅ | ❌ | ✅ |
428
+ | Cold-start exploration | ❌ | ❌ | ❌ | ❌ | ✅ |
429
+ | Minimal hyperparameters | ✅ | ✅ | ❌ | ❌ | ✅ |
430
+
431
+ *Modified in our implementation to support hint-weighted outcomes.
432
+
433
+ No single algorithm satisfies all requirements. Our hybrid combines:
434
+ - **Elo** for overall ability tracking (satisfies: small N, non-binary, client-side)
435
+ - **BKT** for topic-level mastery (satisfies: per-topic tracking)
436
+ - **Thompson Sampling** for question selection (satisfies: cold-start exploration)
437
+
438
### 6.2 Elo Rating System

#### Standard Elo (adapted for education)

Both students and questions have Elo ratings. After each interaction:

**Expected outcome** (student's probability of success against question difficulty):

$$E_s = \frac{1}{1 + 10^{(R_q - R_s) / 400}}$$

where $R_s$ is the student's Elo rating and $R_q$ is the question's Elo rating.

**Actual outcome** (hint-weighted, not binary):

$$O_s = \begin{cases} 1.00 & \text{correct, no hints} \\ 0.75 & \text{correct, used L1 only} \\ 0.50 & \text{correct, used L2} \\ 0.25 & \text{correct, used L3} \\ 0.00 & \text{incorrect, or used L4 (solution reveal)} \end{cases}$$

**Rating update**:

$$R_s' = R_s + K_s \cdot (O_s - E_s)$$
$$R_q' = R_q + K_q \cdot (E_s - O_s)$$

**K-factor schedule**:

| Condition | $K_s$ | $K_q$ | Rationale |
|---|---|---|---|
| First 10 interactions | 48 | 8 | Rapid student calibration, stable questions |
| Interactions 11–30 | 32 | 6 | Normal convergence |
| Interactions 31+ | 24 | 4 | Stable tracking, slow drift |

The asymmetric K-factors (higher for students, lower for questions) ensure that individual student ratings converge quickly while question difficulty estimates remain stable, which is essential when questions serve many students.
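As a sketch, the expected-outcome formula, the hint-weighted outcome table, and the K-factor schedule above translate to TypeScript roughly as follows (function names are illustrative, not taken from the production engine):

```typescript
// Hint-weighted outcome O_s: 1.00, 0.75, 0.50, 0.25 by deepest hint used;
// 0.00 for an incorrect answer or an L4 solution reveal.
function weightedOutcome(correct: boolean, maxHintLevel: number): number {
  if (!correct || maxHintLevel >= 4) return 0.0;
  return 1.0 - 0.25 * maxHintLevel;
}

// Student K-factor schedule (questions use the smaller K_q column: 8 / 6 / 4).
function studentK(interactions: number): number {
  if (interactions <= 10) return 48;
  if (interactions <= 30) return 32;
  return 24;
}

// Expected outcome: E_s = 1 / (1 + 10^((R_q - R_s) / 400))
function expectedOutcome(studentElo: number, questionElo: number): number {
  return 1 / (1 + Math.pow(10, (questionElo - studentElo) / 400));
}

// Rating update: R_s' = R_s + K_s * (O_s - E_s)
function updateStudentElo(
  studentElo: number,
  interactions: number,
  questionElo: number,
  outcome: number
): number {
  const e = expectedOutcome(studentElo, questionElo);
  return studentElo + studentK(interactions) * (outcome - e);
}
```

With the Appendix C numbers ($R_s = 1050$, $R_q = 1150$, correct with an L2 hint, $K_s = 32$), `updateStudentElo(1050, 12, 1150, 0.5)` returns ≈ 1054.5.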

#### Initialization

- **Student initial Elo**: 1000 (center of Level 2.1 range, neutral prior)
- **Question initial Elo**: From level mapping (see taxonomy table)

### 6.3 Bayesian Knowledge Tracing (BKT)

BKT maintains a separate mastery estimate $P(L_n)$ for each mathematical topic (arithmetic, fractions, percentages, algebra, geometry, statistics).

#### Parameters (per topic)

| Parameter | Symbol | Default | Range |
|---|---|---|---|
| Prior knowledge | $P(L_0)$ | 0.10 | [0.01, 0.50] |
| Learn rate | $P(T)$ | 0.15 | [0.05, 0.40] |
| Slip | $P(S)$ | 0.10 | [0.01, 0.30] |
| Guess | $P(G)$ | 0.25 | [0.01, 0.40] |

#### Update Rules

After observing outcome $O_s$ on a question tagged with topic $t$:

**If correct (or partially correct, $O_s \geq 0.5$)**:

$$P(L_n | O_s \geq 0.5) = \frac{P(L_{n-1}) \cdot (1 - P(S)_{adj})}{P(L_{n-1}) \cdot (1 - P(S)_{adj}) + (1 - P(L_{n-1})) \cdot P(G)}$$

**If incorrect (or heavily scaffolded, $O_s < 0.5$)**:

$$P(L_n | O_s < 0.5) = \frac{P(L_{n-1}) \cdot P(S)_{adj}}{P(L_{n-1}) \cdot P(S)_{adj} + (1 - P(L_{n-1})) \cdot (1 - P(G))}$$

**Learning transition** (regardless of outcome):

$$P(L_n) = P(L_n | O) + (1 - P(L_n | O)) \cdot P(T)$$

#### Slip Adjustment for Scaffold Usage

Standard BKT does not account for the quality of evidence. We modify the slip probability based on hint depth:

$$P(S)_{adj} = P(S) \times (1 + 0.5 \times D_{hint})$$

where $D_{hint} = h_i / 4$ is the normalized hint depth. This means:
- No hints: slip stays at $P(S) = 0.10$
- L1 used: slip increases to $0.10 \times 1.125 = 0.1125$
- L2 used: slip increases to $0.10 \times 1.25 = 0.125$
- L3 used: slip increases to $0.10 \times 1.375 = 0.1375$
- L4 used: slip increases to $0.10 \times 1.5 = 0.15$

**Rationale**: When a student uses extensive scaffolding, a "correct" response provides weaker evidence of true knowledge. Increasing slip probability makes BKT more skeptical of scaffold-assisted correctness.
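A minimal sketch of this scaffold-aware BKT update in TypeScript (parameter and function names are assumed for illustration):

```typescript
interface BktParams {
  pKnow: number; // P(L_{n-1}): current mastery estimate for the topic
  pT: number;    // learn rate P(T)
  pS: number;    // slip P(S)
  pG: number;    // guess P(G)
}

// Returns the new P(L_n) after one observation, using the scaffold-adjusted
// slip P(S)_adj = P(S) * (1 + 0.5 * D_hint), with D_hint = maxHintLevel / 4.
function bktUpdate(p: BktParams, weightedOutcome: number, maxHintLevel: number): number {
  const slipAdj = p.pS * (1 + 0.5 * (maxHintLevel / 4));
  const evidenceCorrect = weightedOutcome >= 0.5;
  const posterior = evidenceCorrect
    ? (p.pKnow * (1 - slipAdj)) /
      (p.pKnow * (1 - slipAdj) + (1 - p.pKnow) * p.pG)
    : (p.pKnow * slipAdj) /
      (p.pKnow * slipAdj + (1 - p.pKnow) * (1 - p.pG));
  // Learning transition, applied regardless of the outcome
  return posterior + (1 - posterior) * p.pT;
}
```

Replaying Appendix D (`pKnow: 0.45`, default parameters, $O_s = 0.5$ with an L2 hint) gives ≈ 0.78.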

### 6.4 Thompson Sampling for Question Selection

#### Beta-Bernoulli Model

For each difficulty level $l \in \{1.1, 1.2, ..., 3.5\}$, maintain a Beta distribution representing our belief about the student's success probability at that level:

$$\theta_l \sim \text{Beta}(\alpha_l, \beta_l)$$

**Initialization**: $\alpha_l = 1, \beta_l = 1$ (uniform prior) for all levels.

**Update after each interaction at level $l$**:

$$\alpha_l' = \alpha_l + O_s \quad (\text{weighted outcome as fractional success})$$
$$\beta_l' = \beta_l + (1 - O_s)$$

#### ZPD-Constrained Selection

At each selection step:

1. **Determine ZPD window**: $[l_{current} - 2, \; l_{current} + 3]$ (asymmetric: more room upward than downward)
2. **Sample from each level's posterior**: $\hat{\theta}_l \sim \text{Beta}(\alpha_l, \beta_l)$ for each $l$ in the ZPD window
3. **Apply proximity bonus**: Weight samples by Gaussian proximity to the estimated optimal challenge level:

$$\text{score}_l = \hat{\theta}_l \times \exp\left(-\frac{(\text{elo}_l - R_s)^2}{2 \times 100^2}\right)$$

4. **Select**: $l^* = \arg\max_l \; \text{score}_l$

The proximity bonus keeps Thompson Sampling from wandering too far from the student's estimated ability while still allowing exploration.
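The selection step can be sketched as follows. The Beta sampler draws two Gamma variates via the Marsaglia–Tsang method; the `LevelPrior` shape and the inlined Box–Muller normal are illustrative choices, not the production code:

```typescript
// Standard normal via Box-Muller (sufficient for a sketch)
function randNormal(): number {
  const u = 1 - Math.random(); // avoid log(0)
  const v = Math.random();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// Gamma(shape, 1) sampler (Marsaglia-Tsang), boosted for shape < 1
function sampleGamma(shape: number): number {
  if (shape < 1) {
    return sampleGamma(shape + 1) * Math.pow(1 - Math.random(), 1 / shape);
  }
  const d = shape - 1 / 3;
  const c = 1 / Math.sqrt(9 * d);
  for (;;) {
    const x = randNormal();
    let v = 1 + c * x;
    if (v <= 0) continue;
    v = v * v * v;
    const u = 1 - Math.random();
    if (u < 1 - 0.0331 * x ** 4) return d * v;
    if (Math.log(u) < 0.5 * x * x + d * (1 - v + Math.log(v))) return d * v;
  }
}

// Beta(a, b) = Ga / (Ga + Gb) for independent Gamma draws
function sampleBeta(alpha: number, beta: number): number {
  const ga = sampleGamma(alpha);
  return ga / (ga + sampleGamma(beta));
}

interface LevelPrior { level: string; alpha: number; beta: number; elo: number }

// Steps 2-4: sample each level in the ZPD window, apply the Gaussian
// proximity bonus around the student's Elo, and take the argmax.
function selectLevel(zpdWindow: LevelPrior[], studentElo: number): string {
  let best = zpdWindow[0].level;
  let bestScore = -Infinity;
  for (const l of zpdWindow) {
    const theta = sampleBeta(l.alpha, l.beta);
    const score = theta * Math.exp(-((l.elo - studentElo) ** 2) / (2 * 100 ** 2));
    if (score > bestScore) { bestScore = score; best = l.level; }
  }
  return best;
}
```

Because the score is sampled rather than taken from the posterior mean, under-explored levels with wide posteriors occasionally win the argmax, which is exactly the exploration behavior described below.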

#### Exploration vs. Exploitation Balance

Thompson Sampling naturally transitions from exploration (early, when priors are flat) to exploitation (later, when posteriors are concentrated). With $\alpha_l + \beta_l \approx 2$ (initial), samples have high variance; after 10+ interactions at a level, $\alpha_l + \beta_l > 12$, and samples concentrate near the mean.

### 6.5 Decision Orchestrator

The three components feed into a deterministic decision rule:

```
FUNCTION adaptive_decide(interaction):
    # 1. Update all models
    new_elo = elo.update(student, question, weighted_outcome)
    new_p_know = bkt.update(topic, weighted_outcome, hint_depth)
    thompson.update(level, weighted_outcome)

    # 2. Compute features
    lds = compute_lds(interaction)
    mcs = compute_mcs(interaction, lds)

    # 3. Determine progression
    IF weighted_outcome >= 0.85 AND streak >= 3:
        decision = SKIP            # Jump +2 sub-levels
    ELIF weighted_outcome >= 0.75 AND p_know >= 0.70:
        decision = INCREASE        # Move +1 sub-level
    ELIF weighted_outcome >= 0.40:
        decision = MAINTAIN        # Stay at current
    ELIF weighted_outcome >= 0.25 OR streak_wrong < 2:
        decision = DECREASE        # Drop -1 sub-level
    ELSE:                          # weighted_outcome < 0.25 AND streak_wrong >= 2
        decision = RAPID_DECREASE  # Drop -2 sub-levels

    # 4. Apply LDS/MCS diagnostic overlay
    IF lds > 0.6 AND mcs > 0.6:
        # Language gap: student knows math, needs more scaffolding.
        # Don't decrease difficulty, but flag for enhanced L1/L2 display.
        decision = max(decision, MAINTAIN)
        set_flag(ENHANCED_SCAFFOLD)

    # 5. Select next level via Thompson Sampling
    next_level = thompson.select(current_level, zpd_window)

    # 6. Override if decision and Thompson disagree
    IF decision == DECREASE AND next_level > current_level:
        next_level = current_level  # Don't increase when decision says decrease

    RETURN next_level, decision
```
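Step 3 of the orchestrator is a pure function of four signals, so it can be transcribed directly (a sketch; the type and argument names are assumed):

```typescript
type Decision = "skip" | "increase" | "maintain" | "decrease" | "rapid_decrease";

function progressionDecision(
  weightedOutcome: number,
  streak: number,      // consecutive strong outcomes
  pKnow: number,       // BKT mastery estimate for the question's topic
  streakWrong: number  // consecutive weak outcomes
): Decision {
  if (weightedOutcome >= 0.85 && streak >= 3) return "skip";         // +2 sub-levels
  if (weightedOutcome >= 0.75 && pKnow >= 0.7) return "increase";    // +1 sub-level
  if (weightedOutcome >= 0.4) return "maintain";
  if (weightedOutcome >= 0.25 || streakWrong < 2) return "decrease"; // -1 sub-level
  return "rapid_decrease";                                           // -2 sub-levels
}
```

Because each branch is guarded by the earlier ones, a strong outcome (e.g., 0.80) with low topic mastery falls through to `maintain` rather than `increase`.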

### 6.6 Simulation Results

The adaptive engine was tested with three simulated student profiles over 20-question sessions:

#### Profile 1: Strong Student (True Level ~2.5)

| Metric | Start | End |
|---|---|---|
| Elo | 1000 | 1168 |
| Level | 2.1 | 2.3 |
| Avg Weighted Outcome | — | 0.82 |
| Avg LDS | — | 0.18 |
| Avg MCS | — | 0.76 |
| Decisions | — | 12 increase, 5 maintain, 3 decrease |

**Observation**: The engine correctly identified the student as above-average, progressively increasing difficulty. The student settled near their true ability level by interaction 12.

#### Profile 2: Struggling Student (True Level ~1.2)

| Metric | Start | End |
|---|---|---|
| Elo | 1000 | 960 |
| Level | 2.1 | 1.4 |
| Avg Weighted Outcome | — | 0.38 |
| Avg LDS | — | 0.62 |
| Avg MCS | — | 0.41 |
| Decisions | — | 2 increase, 6 maintain, 10 decrease, 2 rapid decrease |

**Observation**: The engine quickly detected the mismatch between the starting level (2.1) and true ability (~1.2) and decreased difficulty steadily. The high LDS correctly identified language as the primary barrier.

#### Profile 3: Average Student (True Level ~1.5)

| Metric | Start | End |
|---|---|---|
| Elo | 1000 | 1035 |
| Level | 2.1 | 1.5 |
| Avg Weighted Outcome | — | 0.55 |
| Avg LDS | — | 0.38 |
| Avg MCS | — | 0.58 |
| Decisions | — | 5 increase, 8 maintain, 7 decrease |

**Observation**: The average student showed more oscillation than expected, reflecting genuine uncertainty in the student's boundary region. The engine maintained appropriate challenge (a weighted outcome of ~0.55 suggests the student is working within their ZPD).

---
## 7. System Architecture

*See companion document: `system_architecture.md` for detailed component diagrams, data flow diagrams, Firestore schema, API contracts, and deployment architecture.*

### 7.1 Architecture Summary

| Component | Technology | Deployment |
|---|---|---|
| Frontend | Next.js 14+, TypeScript, Tailwind | Firebase Hosting / Vercel |
| Authentication | Firebase Auth | Managed service |
| Database | Cloud Firestore | Managed service |
| Serverless Backend | Firebase Cloud Functions (Node.js 20) | Event-triggered / HTTP |
| LLM (V1) | Google Gemini 2.0 Flash | API |
| SLM (V2) | Qwen2.5-3B (QLoRA fine-tuned) | HF Inference Endpoint |
| Adaptive Engine | Client-side TypeScript | Runs in browser |
| Math Rendering | KaTeX | Client-side |
| Monitoring | Firebase Analytics + Crashlytics | Managed service |

### 7.2 Key Design Decisions

1. **Client-side adaptive engine**: Zero-latency decisions, offline capability after batch load, no server dependency for the core tutoring loop.
2. **Firestore over PostgreSQL**: Real-time sync for multi-device access, built-in offline support, serverless scaling, no connection pooling concerns.
3. **Scale-to-zero SLM endpoint**: Avoids constant GPU cost during off-hours (school usage is 8am–4pm weekdays).
4. **Batch question prefetching (20 at a time)**: Reduces API calls, enables offline sessions, and gives a smooth student experience with no loading between questions.
5. **Shadow testing during the SLM transition**: Both Gemini and the SLM generate scaffolds; the SLM response is served while the Gemini response is logged for quality comparison. This enables a data-driven cutover decision.

---

## 8. SLM Fine-Tuning Strategy

### 8.1 Model Selection

| Model | Size | GSM8K | Spanish | License | Selected? |
|---|---|---|---|---|---|
| **Qwen2.5-3B-Instruct** | 3B | 79.2% | ✅ (29 langs) | Apache 2.0 | ✅ Primary |
| Phi-4-mini | 3.8B | 83.6% | ⚠️ (limited) | MIT | Runner-up |
| Llama-3.2-3B-Instruct | 3B | 77.8% | ⚠️ (focus: EN) | Llama 3.2 | Backup |
| Gemma-2-2B-IT | 2.6B | 68.5% | ❌ | Gemma | Too small |

**Qwen2.5-3B-Instruct** is selected because it combines strong math reasoning (79.2% GSM8K), explicit Spanish language training in its multilingual corpus, and a permissive Apache 2.0 license suitable for educational deployment.

### 8.2 Training Data

#### Source: Gemini V1 Production Data

During V1 deployment (Gemini API), all scaffold generation outputs are logged. This creates organic training data:

```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a bilingual math tutor generating scaffolded hints for Spanish-speaking students learning math in English."
    },
    {
      "role": "user",
      "content": "Generate 4 scaffold levels for this grade 7 math problem:\n\n\"A rectangular garden has a length that is 3 times its width. If the perimeter is 48 meters, what is the width of the garden?\"\n\nStudent context: Grade 7, WIDA Level 3, current LDS: 0.42"
    },
    {
      "role": "assistant",
      "content": "{\n \"L1_simplified\": \"A garden is shaped like a rectangle. Its length is 3 times the width. The total distance around it is 48 meters. Find the width.\",\n \"L2_bilingual\": \"A rectangular (rectangular) garden has a length (largo) that is 3 times (veces) its width (ancho). If the perimeter (perímetro) is 48 meters (metros), what is the width (ancho) of the garden (jardín)?\",\n \"L3_spanish\": \"Un jardín rectangular tiene un largo que es 3 veces su ancho. Si el perímetro es de 48 metros, ¿cuál es el ancho del jardín?\",\n \"L4_solution\": \"Step 1: Let width = w. Length = 3w.\\nStep 2: Perimeter = 2(length + width) = 2(3w + w) = 2(4w) = 8w\\nStep 3: 8w = 48\\nStep 4: w = 48 ÷ 8 = 6\\nStep 5: The width is 6 meters.\",\n \"answer\": \"6 meters\",\n \"answerNumeric\": 6\n}"
    }
  ]
}
```

#### Data Collection Targets

| Phase | Source | Examples | Quality |
|---|---|---|---|
| **Phase A** (pre-launch) | Gemini-generated from question database (130+ problems) | ~500 | High (curated input) |
| **Phase B** (V1 launch + 2 weeks) | Gemini-generated from student "Input your question" usage | ~1,000 | Medium (diverse input) |
| **Phase C** (V1 launch + 4 weeks) | Phase A + B, human-reviewed and quality-filtered | ~1,500 | High (reviewed) |

#### Quality Filtering Criteria

- ✅ Mathematical answer matches ground truth (exact numeric comparison)
- ✅ L1 readability is ≥ 2 FK grade levels below original
- ✅ L2 contains ≥ 3 bilingual annotations
- ✅ L3 back-translates to semantically similar English (cosine similarity ≥ 0.85)
- ✅ L4 solution steps are logically correct and arrive at the correct answer
- ❌ Reject if any scaffold level is empty or truncated
- ❌ Reject if L3 contains English words (incomplete translation)

### 8.3 Fine-Tuning Configuration

| Parameter | Value |
|---|---|
| **Method** | QLoRA (4-bit NF4 quantization) |
| **LoRA rank** | 32 |
| **LoRA alpha** | 64 |
| **LoRA dropout** | 0.05 |
| **Target modules** | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| **Learning rate** | 2e-4 (cosine schedule) |
| **Warmup ratio** | 0.05 |
| **Epochs** | 3–5 (early stopping on val loss, patience=2) |
| **Batch size** | 4 (per device) |
| **Gradient accumulation** | 4 (effective batch size = 16) |
| **Max sequence length** | 1024 tokens |
| **Optimizer** | AdamW (paged, 8-bit) |
| **Weight decay** | 0.01 |
| **FP16/BF16** | BF16 (if A100/H100), FP16 (if T4/L4) |
| **Gradient checkpointing** | Enabled |
| **Hardware** | Single 16GB GPU (T4) or 24GB (L4/A10G) |
| **Estimated training time** | ~45 minutes (1,500 examples, 3 epochs) |

### 8.4 Evaluation Metrics

| Metric | Measurement | Target |
|---|---|---|
| **Math Accuracy** | % of L4 solutions with correct final answer | ≥ 95% |
| **Translation Quality** | BLEU score of L3 vs. reference Spanish | ≥ 0.70 |
| **Readability Compliance** | % of L1 scaffolds within target FK range | ≥ 90% |
| **Bilingual Annotation** | Avg bilingual terms in L2 per problem | ≥ 3.0 |
| **Latency** | Time-to-first-token on T4 GPU | < 200ms |
| **Throughput** | Full scaffold generation (all 4 levels) | < 800ms |
| **A/B Quality** | Human preference: SLM vs. Gemini (blind eval) | ≥ 45% SLM preferred |

### 8.5 Deployment Strategy

```
Week 1-2: Collect Phase A data (Gemini on question DB) → 500 examples
Week 3-4: V1 launch, collect Phase B data → +1,000 examples
Week 5:   Human review → 1,500 quality examples → Fine-tune v1
Week 6:   Shadow deployment (SLM + Gemini, SLM served)
Week 7-8: Quality monitoring, collect preferences
Week 9:   If SLM quality ≥ threshold → full cutover
          If not → collect more data, fine-tune v2, repeat
```
---

## 9. Data Collection Schema

### 9.1 Purpose

All student interactions are logged for three purposes:
1. **Real-time adaptation**: Feeding the adaptive engine within the current session
2. **Offline analysis**: Improving algorithm parameters, question calibration, and SLM training
3. **Research**: Validating the effectiveness of bilingual scaffolding for ELL math education

### 9.2 Interaction-Level Data

Each student-question interaction produces:

```typescript
interface InteractionRecord {
  // Identifiers
  studentId: string;            // Firebase UID (anonymized for research)
  sessionId: string;            // Session identifier
  interactionId: string;        // Unique interaction ID
  questionId: string;           // Question identifier
  timestamp: ISO8601;           // Interaction start time

  // Question context
  questionLevel: string;        // e.g., "2.3"
  questionTopic: string;        // e.g., "fractions"
  questionSubtopic: string;     // e.g., "multiplication"
  questionElo: number;          // Question's current Elo rating
  questionFK: number;           // Flesch-Kincaid grade level

  // Student state (before interaction)
  studentEloBefore: number;
  studentLevel: string;
  topicPKnow: number;           // BKT P(know) for this topic
  studentLDS5: number;          // 5-question rolling avg LDS
  studentMCS5: number;          // 5-question rolling avg MCS

  // Interaction data
  timeSpentMs: number;          // Total time on problem
  timeBeforeFirstHint: number;  // Time before first hint (0 if no hints)
  hintsAccessed: number[];      // [0], [0,1], [0,1,2], etc.
  hintTimestamps: {             // Timestamps of hint accesses
    L1?: number;
    L2?: number;
    L3?: number;
    L4?: number;
  };
  timePerHint: {                // Time spent at each hint level
    L1?: number;
    L2?: number;
    L3?: number;
    L4?: number;
  };
  maxHintLevel: number;         // 0-4
  answerAttempts: number;       // Number of attempts
  answers: string[];            // All attempted answers
  finalAnswer: string;          // Last submitted answer
  isCorrect: boolean;           // Whether final answer is correct

  // Computed features
  weightedOutcome: number;      // 0.0-1.0 (hint-weighted)
  lds: number;                  // Language Dependency Score
  mcs: number;                  // Math Confidence Score
  hintDepthNormalized: number;
  scaffoldTimeRatio: number;
  escalationSpeed: number;

  // Adaptive decisions
  studentEloAfter: number;
  adaptiveDecision: string;     // increase/maintain/decrease/skip/rapid_decrease
  nextLevel: string;            // Selected next level
  diagnosticQuadrant: string;   // thriving/language_gap/math_struggle/dual_challenge
}
```
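The computed-feature fields can be derived from the raw fields of the same record; for example (a sketch of the derivation, not the production pipeline):

```typescript
interface RawSignals {
  timeSpentMs: number;
  timePerHint: { L1?: number; L2?: number; L3?: number; L4?: number };
  maxHintLevel: number; // 0-4
  isCorrect: boolean;
}

function deriveFeatures(raw: RawSignals) {
  const hintTimeMs =
    (raw.timePerHint.L1 ?? 0) + (raw.timePerHint.L2 ?? 0) +
    (raw.timePerHint.L3 ?? 0) + (raw.timePerHint.L4 ?? 0);
  return {
    hintDepthNormalized: raw.maxHintLevel / 4,
    scaffoldTimeRatio: raw.timeSpentMs > 0 ? hintTimeMs / raw.timeSpentMs : 0,
    // Hint-weighted outcome from the Elo section: 1 - 0.25 * maxHintLevel,
    // floored to 0 for incorrect answers or an L4 reveal
    weightedOutcome:
      !raw.isCorrect || raw.maxHintLevel >= 4 ? 0 : 1 - 0.25 * raw.maxHintLevel,
  };
}
```

For the Appendix A scenario (120 s total, 15 s at L1, 25 s at L2, solved correctly with max hint L2), this yields `hintDepthNormalized: 0.5`, `scaffoldTimeRatio: 0.333`, and `weightedOutcome: 0.5`.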

### 9.3 Session-Level Aggregates

```typescript
interface SessionRecord {
  sessionId: string;
  studentId: string;
  startTime: ISO8601;
  endTime: ISO8601;
  durationMs: number;

  // Performance
  questionsAttempted: number;
  questionsCorrect: number;
  avgWeightedOutcome: number;
  avgHintLevel: number;

  // Progression
  startElo: number;
  endElo: number;
  eloChange: number;
  startLevel: string;
  endLevel: string;
  levelsTraversed: string[];

  // Feature averages
  sessionLDS: number;
  sessionMCS: number;

  // Diagnostic
  dominantQuadrant: string; // Most frequent diagnostic quadrant
  topicPerformance: Record<string, {
    attempts: number;
    avgOutcome: number;
    pKnow: number;
  }>;

  // Scaffold usage patterns
  hintDistribution: {
    noHint: number; // Count of problems solved without hints
    L1Only: number;
    L2Used: number;
    L3Used: number;
    L4Used: number;
  };
}
```

### 9.4 Longitudinal Student Profile

```typescript
interface StudentProfile {
  studentId: string;
  createdAt: ISO8601;
  lastActive: ISO8601;

  // Current state
  currentElo: number;
  currentLevel: string;
  totalInteractions: number;
  totalSessions: number;

  // Topic mastery (BKT)
  topicMastery: Record<string, number>; // P(know) per topic

  // Feature trends
  ldsHistory: number[]; // Session-level LDS over time
  mcsHistory: number[]; // Session-level MCS over time
  eloHistory: number[]; // Elo after each session

  // Learning trajectory
  avgLDSFirst5Sessions: number;
  avgLDSLast5Sessions: number;
  ldsImprovement: number; // Percentage decrease in LDS
  avgMCSFirst5Sessions: number;
  avgMCSLast5Sessions: number;
  mcsImprovement: number; // Percentage increase in MCS

  // Engagement
  avgSessionLength: number; // Minutes
  sessionsPerWeek: number;
  completionRate: number; // % of sessions completed (20/20)

  // Thompson priors (for state persistence)
  thompsonPriors: Record<string, { alpha: number; beta: number }>;
}
```
### 9.5 Privacy and Ethics

| Concern | Mitigation |
|---|---|
| Student is a minor (COPPA) | No PII beyond email/name; parental consent required |
| Performance data sensitivity | Elo/LDS/MCS stored under UID, not linked to real identity in analytics |
| Research use | Data anonymized (UID → random ID) before export; IRB approval required |
| Data retention | Interaction-level data retained for 2 years; aggregates indefinitely |
| Right to deletion | Firebase Auth deletion triggers cascade delete of all user data |

---
## 10. Evaluation Plan

### 10.1 Phase 1: Technical Validation (Pre-Launch)

**Objective**: Verify system components work correctly and produce expected behavior.

| Test | Method | Success Criterion |
|---|---|---|
| Adaptive engine convergence | Simulate 100 students × 50 interactions with known true levels | 90% of students within ±1 sub-level of true level by interaction 30 |
| Elo stability | 1000 simulated interactions per question | Question Elo ratings converge within ±30 of assigned level |
| BKT accuracy | Simulate known mastery states, measure P(know) accuracy | P(know) > 0.7 for mastered topics, < 0.3 for unmastered within 10 interactions |
| Thompson exploration | Cold-start simulation (all levels unexplored) | All 15 levels sampled at least once within first 30 interactions |
| LDS/MCS discrimination | Inject known behavioral patterns | LDS > 0.6 for simulated language-dependent profiles; MCS > 0.7 for math-competent profiles |
| Scaffold quality | 100 problems through Gemini scaffold pipeline | ≥ 95% mathematical accuracy, ≥ 90% readability compliance |
| End-to-end latency | 50 complete interaction cycles | Adaptive decision < 50ms; scaffold generation < 1.5s |
### 10.2 Phase 2: Pilot Study (Launch + 4 Weeks)

**Objective**: Validate effectiveness with real students in a controlled setting.

**Design**: Within-subjects pre/post with control group comparison

| Group | N | Treatment | Duration |
|---|---|---|---|
| **Treatment** | 30 students | MathLingua (adaptive + scaffolds) | 4 weeks, 3× per week |
| **Control** | 30 students | Same math problems, English-only, no scaffolding | 4 weeks, 3× per week |

**Instruments**:

1. **Pre-test**: Mathematics assessment in Spanish (establish math baseline) + English reading assessment (establish language baseline)
2. **Post-test**: Mathematics assessment in English (measure improvement) + same assessments as pre-test
3. **In-system metrics**: LDS trajectory, MCS trajectory, Elo progression, hint usage patterns

**Primary Outcome Measures**:

| Measure | Hypothesis | Test |
|---|---|---|
| Math score improvement (EN) | Treatment > Control | Independent t-test, d ≥ 0.5 |
| LDS reduction | Treatment shows ≥ 20% decrease | Paired t-test, pre vs. post |
| MCS stability | Treatment MCS does not decrease | One-sided paired t-test |
| Session completion rate | Treatment ≥ 80% | Descriptive |
| Scaffold fade-out | Avg hint level decreases over 4 weeks | Linear regression slope < 0 |

**Secondary Outcome Measures**:

| Measure | Instrument |
|---|---|
| Student engagement | Time on task, voluntary extra sessions |
| Mathematical self-efficacy | Adapted MSES (Mathematics Self-Efficacy Scale) |
| Language anxiety | Adapted FLCAS (Foreign Language Classroom Anxiety Scale) |
| Qualitative experience | Semi-structured interviews (N=10 treatment) |
### 10.3 Phase 3: Scale and Iteration (Launch + 3 Months)

**Objective**: Optimize algorithm parameters from pooled data; validate SLM quality.

| Activity | Data Required | Method |
|---|---|---|
| IRT calibration | ≥ 200 students × ≥ 50 questions | 2PL IRT model fit; replace initial Elo question ratings with IRT parameters |
| Feature weight optimization | ≥ 500 interaction records with outcomes | Logistic regression / gradient-boosted trees to optimize LDS/MCS weights |
| BKT parameter fitting | ≥ 100 students × ≥ 20 interactions per topic | EM algorithm per-topic parameter estimation |
| SLM quality assessment | ≥ 100 scaffold comparisons | Blind human preference evaluation (SLM vs. Gemini) |
| Algorithm A/B testing | ≥ 200 students split across variants | Compare engagement and outcome metrics across algorithm variants |

### 10.4 Phase 4: Long-Term Efficacy (Launch + 1 Year)

**Objective**: Measure impact on standardized test scores and language proficiency.

| Measure | Instrument | Expected Outcome |
|---|---|---|
| State math assessment | SBAC / STAAR (English) | Treatment students show larger gains |
| English proficiency | WIDA ACCESS | Treatment students show faster math-domain language growth |
| Long-term retention | 6-month follow-up assessment | Treatment gains persist |

---
## 11. Implementation Timeline

### Phase 1: MVP (Months 1–3)

| Month | Deliverables |
|---|---|
| **1** | Frontend scaffolding UI (L1–L4 display, hint tracking); Firebase setup (auth, Firestore schema); Gemini API integration for scaffold generation |
| **2** | Adaptive engine implementation in TypeScript (Elo + BKT + Thompson); Question database upload (130+ questions); LDS/MCS computation pipeline |
| **3** | End-to-end integration testing; Simulated student testing (100 profiles); Bug fixes and performance optimization; Deploy to Firebase Hosting |

### Phase 2: Pilot (Months 4–5)

| Month | Deliverables |
|---|---|
| **4** | Pilot launch with 30 treatment + 30 control students; Daily monitoring of system metrics; Weekly check-ins with teachers; Collect Gemini scaffold data for SLM training |
| **5** | Mid-pilot analysis and algorithm tuning; Begin SLM training data curation; Pilot completion and post-testing |

### Phase 3: SLM Transition (Months 6–8)

| Month | Deliverables |
|---|---|
| **6** | Curate 1,500 training examples; QLoRA fine-tune Qwen2.5-3B-Instruct v1; Deploy HF Inference Endpoint (shadow mode) |
| **7** | Shadow testing: SLM served, Gemini logged for comparison; Quality monitoring and iteration |
| **8** | SLM quality validated → full cutover; OR iterate (more data, retrain, repeat) |

### Phase 4: Scale (Months 9–12)

| Month | Deliverables |
|---|---|
| **9** | Open to additional schools (target: 200+ students); IRT calibration from pooled data; Question database expansion to 300+ |
| **10** | A/B testing of algorithm variants; DKT evaluation (if N > 500) |
| **11** | Feature weight optimization from real data; Dashboard for teachers (class-level analytics) |
| **12** | Long-term efficacy analysis; Research paper preparation; Open-source release of adaptive engine |

---
## 12. References

Abedi, J., & Lord, C. (2001). The language factor in mathematics tests. *Applied Measurement in Education*, 14(3), 219–234.

Chapelle, O., & Li, L. (2011). An empirical evaluation of Thompson Sampling. *Advances in Neural Information Processing Systems*, 24.

Cobbe, K., et al. (2021). Training verifiers to solve math word problems. *arXiv:2110.14168*.

Corbett, A. T., & Anderson, J. R. (1994). Knowledge tracing: Modeling the acquisition of procedural knowledge. *User Modeling and User-Adapted Interaction*, 4(4), 253–278.

Dettmers, T., et al. (2023). QLoRA: Efficient finetuning of quantized language models. *NeurIPS 2023*.

Hendrycks, D., et al. (2021). Measuring mathematical problem solving with the MATH dataset. *NeurIPS 2021*.

Huang, M., et al. (2024). Easy2Hard-Bench: Standardized difficulty labels for profiling LLM performance and generalization. *arXiv:2409.18433*.

Khisty, L. L., & Chval, K. B. (2002). Pedagogic discourse and equity in mathematics. *Mathematics Education Research Journal*, 14(3), 154–168.

Martiniello, M. (2008). Language and the performance of English-language learners in math word problems. *Harvard Educational Review*, 78(2), 333–368.

Microsoft. (2025). Phi-4 technical report. *arXiv:2412.08905*.

Moschkovich, J. (2002). A situated and sociocultural perspective on bilingual mathematics learners. *Mathematical Thinking and Learning*, 4(2–3), 189–212.

National Center for Education Statistics. (2023). English learners in public schools. *NCES Report*.

Pelánek, R. (2016). Applications of the Elo rating system in adaptive educational systems. *Computers & Education*, 98, 169–179.

Piech, C., et al. (2015). Deep knowledge tracing. *Advances in Neural Information Processing Systems*, 28.

Qwen Team. (2024). Qwen2.5 technical report. *arXiv:2412.15115*.

Scarlatos, A., et al. (2024). PSI-KT: Parameterized student interaction knowledge tracing. *EDM 2024*.

Shi, F., et al. (2022). Language models are multilingual chain-of-thought reasoners. *arXiv:2210.03057*.

Vygotsky, L. S. (1978). *Mind in Society: The Development of Higher Psychological Processes*. Harvard University Press.

---
## 13. Appendices

### Appendix A: Complete LDS Computation Example

**Scenario**: Student attempts a Level 2.3 problem (rate/ratio), uses the L1 hint, then the L2 hint, and solves correctly on the second attempt.

**Raw signals**:
- $h_i = 2$ (max hint = L2)
- $t_{pre} = 45s$ (45 seconds before first hint)
- $t_{total} = 120s$
- $t_{L1} = 15s$, $t_{L2} = 25s$, $t_{L3} = 0s$, $t_{L4} = 0s$
- $a_i = 2$ (two answer attempts)
- $c_i = 1$ (correct on second attempt)
- $\text{median\_time}(2.3) = 90s$

**Sub-feature computation**:

1. $D_{hint} = 2 / 4 = 0.50$
2. $R_{scaffold} = (15 + 25 + 0 + 0) / 120 = 40 / 120 = 0.333$
3. $E_{speed} = 1 - (45 / 90) = 1 - 0.5 = 0.50$
4. $F_{reveal} = 0$ (L4 not used)

**LDS**:
$$\text{LDS} = 0.35 \times 0.50 + 0.25 \times 0.333 + 0.20 \times 0.50 + 0.20 \times 0 = 0.175 + 0.083 + 0.10 + 0 = 0.358$$

**Interpretation**: Moderate dependency, sitting just above the lower boundary of the 0.35–0.55 band. The student needed bilingual support but was not critically dependent. The system would maintain the current level.

### Appendix B: Complete MCS Computation (Same Scenario)

**Sub-feature computation**:

5. $C_{correct} = 1$ (correct)
6. $S_{speed} = \text{clamp}(90 / 120, 0, 1) = 0.75$
7. $A_{efficiency} = 1 / 2 = 0.50$
8. $(1 - \text{LDS}) = 1 - 0.358 = 0.642$

**MCS**:
$$\text{MCS} = 0.30 \times 1.0 + 0.25 \times 0.75 + 0.20 \times 0.50 + 0.25 \times 0.642 = 0.30 + 0.1875 + 0.10 + 0.1605 = 0.748$$

**Interpretation**: Moderate-to-strong math confidence. Despite needing L2 scaffolding, the student demonstrated solid mathematical ability. Diagnostic: "Language Gap" quadrant (high MCS ≥ 0.6, borderline LDS ≈ 0.4).
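The two worked computations above (Appendices A and B) reduce to the weighted sums below; the weights are the ones used in those appendices (a sketch for checking the arithmetic):

```typescript
// LDS = 0.35*D_hint + 0.25*R_scaffold + 0.20*E_speed + 0.20*F_reveal
function lds(dHint: number, rScaffold: number, eSpeed: number, fReveal: number): number {
  return 0.35 * dHint + 0.25 * rScaffold + 0.2 * eSpeed + 0.2 * fReveal;
}

// MCS = 0.30*C_correct + 0.25*S_speed + 0.20*A_efficiency + 0.25*(1 - LDS)
function mcs(cCorrect: number, sSpeed: number, aEfficiency: number, ldsValue: number): number {
  return 0.3 * cCorrect + 0.25 * sSpeed + 0.2 * aEfficiency + 0.25 * (1 - ldsValue);
}
```

With the Appendix A/B inputs, `lds(0.5, 40 / 120, 0.5, 0)` ≈ 0.358 and `mcs(1, 0.75, 0.5, 0.358)` ≈ 0.748, matching the worked values.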
+
1138
+ ### Appendix C: Elo Update Example (Same Scenario)
1139
+
1140
+ **Before**: $R_s = 1050$, $R_q = 1150$ (Level 2.3 question)
1141
+
1142
+ **Weighted outcome**: Correct with L2 → $O_s = 0.50$
1143
+
1144
+ **Expected outcome**: $E_s = 1 / (1 + 10^{(1150 - 1050) / 400}) = 1 / (1 + 10^{0.25}) = 1 / (1 + 1.778) = 0.360$
1145
+
1146
+ **Update** ($K_s = 32$): $R_s' = 1050 + 32 \times (0.50 - 0.360) = 1050 + 32 \times 0.14 = 1050 + 4.48 = 1054.5$
1147
+
1148
+ **Update question** ($K_q = 6$): $R_q' = 1150 + 6 \times (0.360 - 0.50) = 1150 + 6 \times (-0.14) = 1150 - 0.84 = 1149.2$
1149
+
1150
+ **Interpretation**: Student's Elo increased slightly (outperformed expectation even with L2 hint), question's Elo barely changed (stable calibration).
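The paired update can be sketched as one function. The K-factors ($K_s = 32$, $K_q = 6$) and the weighted outcome come from the spec; the function itself is illustrative.

```python
# Illustrative Elo update for one student-question interaction (Appendix C
# values). K-factors are from the spec; names are hypothetical.

def elo_update(r_student, r_question, outcome, k_student=32, k_question=6):
    """Return updated (student, question) ratings after one interaction."""
    expected = 1 / (1 + 10 ** ((r_question - r_student) / 400))  # P(student "wins")
    new_student = r_student + k_student * (outcome - expected)
    new_question = r_question + k_question * (expected - outcome)
    return new_student, new_question

r_s, r_q = elo_update(1050, 1150, outcome=0.50)
print(round(r_s, 1), round(r_q, 1))  # 1054.5 1149.2
```

Note that the two updates are asymmetric by design: the small $K_q$ keeps well-calibrated question ratings stable while student ratings stay responsive.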

### Appendix D: BKT Update Example

**Before**: $P(L_n) = 0.45$ for topic "rates" (the student's topic)

**Outcome**: $O_s = 0.50$ (treated as correct, since ≥ 0.5)

**Slip adjustment**: $P(S)_{adj} = 0.10 \times (1 + 0.5 \times 0.50) = 0.10 \times 1.25 = 0.125$

**Posterior given correct**:
$$P(L_n \mid \text{correct}) = \frac{0.45 \times (1 - 0.125)}{0.45 \times (1 - 0.125) + 0.55 \times 0.25} = \frac{0.45 \times 0.875}{0.45 \times 0.875 + 0.55 \times 0.25} = \frac{0.394}{0.394 + 0.1375} = \frac{0.394}{0.531} = 0.742$$

**After learning transition**:
$$P(L_n) = 0.742 + (1 - 0.742) \times 0.15 = 0.742 + 0.039 = 0.781$$

**Interpretation**: Topic mastery estimate jumped from 0.45 to 0.78, reflecting that a correct response (even with L2 scaffold, captured by the adjusted slip) substantially increased our belief that the student knows "rates."
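The update can be sketched as below. The parameters $P(S) = 0.10$, $P(G) = 0.25$, $P(T) = 0.15$ and the slip adjustment are from the spec; the incorrect-outcome branch is standard BKT and is an assumption here, since the appendix only works the correct case. With unrounded intermediates the result is 0.780 rather than the 0.781 obtained from the rounded 0.742 above.

```python
# Illustrative BKT step with scaffold-adjusted slip (Appendix D values).
# The incorrect branch is standard BKT, assumed rather than taken from the spec.

def bkt_update(p_know, outcome, p_slip=0.10, p_guess=0.25, p_transit=0.15):
    """One knowledge-tracing update; returns the new P(L_n)."""
    slip_adj = p_slip * (1 + 0.5 * outcome)  # heavier hint use -> higher effective slip
    if outcome >= 0.5:                       # weighted outcome treated as correct
        num = p_know * (1 - slip_adj)
        posterior = num / (num + (1 - p_know) * p_guess)
    else:                                    # standard BKT incorrect branch (assumption)
        num = p_know * slip_adj
        posterior = num / (num + (1 - p_know) * (1 - p_guess))
    return posterior + (1 - posterior) * p_transit  # learning transition

print(round(bkt_update(0.45, 0.50), 3))  # 0.78
```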

### Appendix E: Thompson Sampling Selection Example

**Student Elo**: 1054.5 (from Appendix C update)
**Current Level**: 2.1

**ZPD Window**: [1.4, 2.4] (current - 2 to current + 3, in sub-level steps)

**Current priors** (after 12 interactions):

| Level | α | β | Elo |
|---|---|---|---|
| 1.4 | 8.2 | 2.1 | 975 |
| 1.5 | 7.5 | 3.0 | 1025 |
| 2.1 | 5.8 | 4.2 | 1075 |
| 2.2 | 3.1 | 3.5 | 1125 |
| 2.3 | 1.8 | 2.3 | 1175 |
| 2.4 | 1.2 | 1.5 | 1225 |

**Sampled** (one draw):

| Level | $\hat{\theta}_l$ | Proximity Bonus | Score |
|---|---|---|---|
| 1.4 | 0.82 | $\exp(-(975-1054.5)^2/20000) = 0.73$ | 0.598 |
| 1.5 | 0.68 | $\exp(-(1025-1054.5)^2/20000) = 0.96$ | **0.651** |
| 2.1 | 0.61 | $\exp(-(1075-1054.5)^2/20000) = 0.98$ | 0.597 |
| 2.2 | 0.55 | $\exp(-(1125-1054.5)^2/20000) = 0.78$ | 0.429 |
| 2.3 | 0.42 | $\exp(-(1175-1054.5)^2/20000) = 0.48$ | 0.203 |
| 2.4 | 0.38 | $\exp(-(1225-1054.5)^2/20000) = 0.23$ | 0.089 |

**Selected**: Level 1.5 (highest score = 0.651)

Note, however, that the decision orchestrator said MAINTAIN (weighted_outcome = 0.50 falls in the maintain range), while Thompson selected 1.5, which is below the current level 2.1. Since the decision is MAINTAIN (not DECREASE), the override rule kicks in: the system selects a question from level 2.1 (the current level), not 1.5. Thompson's selection is overridden to respect the progression decision.

This demonstrates how the decision orchestrator provides guardrails against Thompson's stochastic exploration when the student's recent performance doesn't support a level change.
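The selection-plus-guardrail flow can be sketched as follows. The Beta priors, the Elo proximity bonus with denominator 20000, and the MAINTAIN override are from the spec; the function names, dict layout, and RNG seeding are illustrative assumptions.

```python
import math
import random

# Illustrative Thompson-sampling selection with Elo proximity bonus and the
# MAINTAIN guardrail (Appendix E). Data layout and names are hypothetical.

def select_level(priors, student_elo, rng):
    """One Beta draw per level, scored by theta * exp(-(elo_gap)^2 / 20000)."""
    best_level, best_score = None, -1.0
    for level, (alpha, beta, level_elo) in priors.items():
        theta = rng.betavariate(alpha, beta)                       # sampled success prob
        bonus = math.exp(-((level_elo - student_elo) ** 2) / 20000)
        if theta * bonus > best_score:
            best_level, best_score = level, theta * bonus
    return best_level, best_score

priors = {"1.4": (8.2, 2.1, 975), "1.5": (7.5, 3.0, 1025), "2.1": (5.8, 4.2, 1075),
          "2.2": (3.1, 3.5, 1125), "2.3": (1.8, 2.3, 1175), "2.4": (1.2, 1.5, 1225)}
picked, score = select_level(priors, student_elo=1054.5, rng=random.Random(0))

decision, current = "MAINTAIN", "2.1"   # from the decision orchestrator
if decision == "MAINTAIN" and picked != current:
    picked = current                    # guardrail overrides the exploratory pick
print(picked)  # 2.1
```

Because each call draws fresh $\hat{\theta}_l$ samples, different runs can pick different levels; the guardrail is what keeps that stochasticity from contradicting the progression decision.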

### Appendix F: Glossary

| Term | Definition |
|---|---|
| **BKT** | Bayesian Knowledge Tracing — HMM-based model of student knowledge per topic |
| **COPPA** | Children's Online Privacy Protection Act — US law for minors' data |
| **DKT** | Deep Knowledge Tracing — RNN-based alternative to BKT |
| **ELL** | English Language Learner |
| **Elo** | Rating system where opponents update each other's ratings after each match |
| **FERPA** | Family Educational Rights and Privacy Act — US law for student records |
| **FK** | Flesch-Kincaid readability metric |
| **HMM** | Hidden Markov Model |
| **IRT** | Item Response Theory — psychometric model of test items and examinees |
| **L1** | First language (Spanish, in MathLingua's context) |
| **L2** | Second language (English, in MathLingua's context) |
| **LDS** | Language Dependency Score — engineered feature measuring scaffold reliance |
| **MCS** | Math Confidence Score — engineered feature measuring mathematical ability |
| **QLoRA** | Quantized Low-Rank Adaptation — memory-efficient fine-tuning method |
| **SLM** | Small Language Model (< 4B parameters) |
| **Thompson Sampling** | Bayesian bandit algorithm for exploration/exploitation |
| **WIDA** | Standards framework for English language proficiency in U.S. schools |
| **ZPD** | Zone of Proximal Development — Vygotsky's learning theory concept |

---

*End of Technical Specification — MathLingua v1.0*