# MathLingua: An Adaptive Bilingual Scaffolding System for Mathematics Word Problem Comprehension

## Technical Specification Document

**Version**: 1.0  
**Date**: April 2026  
**Authors**: [MathLingua Research Team]  

---

## Abstract

We present MathLingua, an adaptive tutoring system designed for Spanish-speaking students in grades 6–8 who are transitioning to English-medium mathematics education. The system addresses the dual challenge these students face: mastering mathematical concepts while simultaneously acquiring the academic English required to comprehend word problems. MathLingua introduces a **four-level progressive scaffolding framework** (L1: Simplified English → L2: Bilingual Keywords → L3: Full Spanish Translation → L4: Step-by-Step Solution) and a novel **hybrid adaptive algorithm** combining Elo rating, Bayesian Knowledge Tracing (BKT), and Thompson Sampling to personalize difficulty progression. We define two engineered features—**Language Dependency Score (LDS)** and **Math Confidence Score (MCS)**—that disentangle linguistic struggle from mathematical difficulty, enabling targeted intervention. The system architecture supports a planned transition from a cloud-based large language model (Gemini 2.0 Flash) to a fine-tuned small language model (Qwen2.5-3B-Instruct with QLoRA) for scalable, cost-effective scaffold generation. This specification provides the complete technical design: adaptive algorithms with formal definitions, feature engineering formulas, a 15-level difficulty taxonomy validated by readability metrics, a prototype question database of 130 word problems, system architecture, and an evaluation plan.

**Keywords**: Adaptive Learning, Bilingual Education, Mathematics Word Problems, Scaffolding, Bayesian Knowledge Tracing, Elo Rating, Thompson Sampling, Small Language Models, QLoRA

---

## Table of Contents

1. [Introduction](#1-introduction)
2. [Related Work](#2-related-work)
3. [Difficulty Taxonomy and Linguistic Progression](#3-difficulty-taxonomy)
4. [Question Database Design](#4-question-database)
5. [Feature Engineering](#5-feature-engineering)
6. [Adaptive Algorithm](#6-adaptive-algorithm)
7. [System Architecture](#7-system-architecture)
8. [SLM Fine-Tuning Strategy](#8-slm-fine-tuning)
9. [Data Collection Schema](#9-data-collection)
10. [Evaluation Plan](#10-evaluation-plan)
11. [Implementation Timeline](#11-timeline)
12. [References](#12-references)
13. [Appendices](#13-appendices)

---

## 1. Introduction

### 1.1 Problem Statement

An estimated 5.1 million English Language Learners (ELLs) are enrolled in U.S. public schools, with approximately 75% being Spanish-speaking (National Center for Education Statistics, 2023). These students face a compounded challenge in mathematics: they must simultaneously decode unfamiliar English vocabulary and sentence structures while performing mathematical reasoning. Research consistently shows that ELLs underperform native English speakers on mathematics assessments—not because of deficient mathematical ability, but because of the linguistic demands embedded in word problems (Abedi & Lord, 2001; Martiniello, 2008).

Current educational technology solutions typically address either language learning or mathematics separately. General-purpose math tutoring systems (e.g., Khan Academy, IXL) present problems exclusively in English with no linguistic scaffolding. Language learning platforms (e.g., Duolingo) lack mathematical content. Bilingual resources, where they exist, are static—offering problems in English or Spanish but not providing a scaffolded bridge between the two.

### 1.2 Proposed Solution

MathLingua addresses this gap with three key innovations:

1. **Progressive Bilingual Scaffolding**: A four-level hint system that provides decreasing linguistic support, from simplified English (L1) through bilingual annotations (L2) and full Spanish translation (L3) to step-by-step solution reveal (L4). The scaffold levels are designed to gradually build mathematical English proficiency while ensuring comprehension.

2. **Disentangled Difficulty Assessment**: Two novel engineered features—Language Dependency Score (LDS) and Math Confidence Score (MCS)—that separately quantify a student's reliance on linguistic scaffolding versus their underlying mathematical competence. This separation enables the system to distinguish between "doesn't understand the English" and "doesn't understand the math."

3. **Hybrid Adaptive Algorithm**: A combination of Elo rating (for overall ability tracking), Bayesian Knowledge Tracing (for topic-level mastery estimation), and Thompson Sampling (for intelligent question selection with exploration), specifically adapted for hint-weighted outcomes rather than binary correctness.

### 1.3 Target Population

| Characteristic | Description |
|---|---|
| Grade Level | 6–8 (ages 11–14) |
| L1 Language | Spanish |
| L2 Language | English (medium of instruction) |
| English Proficiency | WIDA Levels 2–4 (Emerging–Expanding) |
| Math Level | On grade level in Spanish-medium instruction |
| Setting | U.S. middle schools with bilingual/ESL programs |

### 1.4 Design Goals

| Goal | Metric | Target |
|---|---|---|
| Reduce language barrier impact | LDS decrease over 4 weeks | ≥ 20% reduction |
| Maintain/improve math confidence | MCS stability or increase | No MCS decrease > 5% |
| Appropriate difficulty targeting | Student in ZPD | ≥ 70% of questions in ZPD |
| Student engagement | Session completion rate | ≥ 80% complete 20-question sessions |
| Scaffold fade-out | Avg hint level over time | Decrease from ~L2.5 to ~L1.5 |

---

## 2. Related Work

### 2.1 Adaptive Learning Algorithms

**Item Response Theory (IRT)** models the probability of a correct response as a function of student ability (θ) and item parameters (difficulty, discrimination, guessing). The 3-parameter logistic (3PL) model is standard:

$$P(X_{ij} = 1 | \theta_j) = c_i + \frac{1 - c_i}{1 + e^{-a_i(\theta_j - b_i)}}$$

where $a_i$ is discrimination, $b_i$ is difficulty, $c_i$ is guessing, and $\theta_j$ is student ability. IRT requires large calibration samples (typically N > 200 per item) and assumes local independence, making it unsuitable for an initial deployment with small N.
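The 3PL response probability above can be written as a one-line function. This is a minimal sketch of the standard formula, not part of the MathLingua implementation (which, as noted, skips IRT in Phase 1):

```python
import math

def p_correct_3pl(theta: float, a: float, b: float, c: float) -> float:
    """3PL probability of a correct response: c + (1 - c) / (1 + exp(-a(theta - b)))."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))
```

When ability equals item difficulty (θ = b), the probability is the midpoint between the guessing floor c and 1, i.e. c + (1 − c)/2.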

**Elo Rating**, originally designed for chess, provides a simpler pairwise comparison model that updates after each interaction. It requires no prior calibration and converges with as few as 10–15 interactions (Pelánek, 2016). We adopt Elo as the primary ability tracking mechanism.

**Bayesian Knowledge Tracing (BKT)** models knowledge as a hidden Markov model with four parameters: P(L₀) (prior knowledge), P(T) (learn rate), P(S) (slip), P(G) (guess). BKT provides topic-level mastery estimates essential for identifying which mathematical concepts a student has learned versus which remain unmastered (Corbett & Anderson, 1994).
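One BKT step consists of a Bayes update on the observed response followed by the learning transition. A minimal sketch, with illustrative (not calibrated) parameter defaults:

```python
def bkt_update(p_know: float, correct: bool,
               p_slip: float = 0.1, p_guess: float = 0.2,
               p_learn: float = 0.15) -> float:
    """One BKT step: Bayes update on the observation, then apply P(T).

    Parameter defaults are illustrative placeholders, not fitted values.
    """
    if correct:
        num = p_know * (1 - p_slip)                  # knew it and didn't slip
        den = num + (1 - p_know) * p_guess           # ... or guessed
    else:
        num = p_know * p_slip                        # knew it but slipped
        den = num + (1 - p_know) * (1 - p_guess)     # ... or truly didn't know
    posterior = num / den
    # Learning transition: chance of acquiring the skill this step.
    return posterior + (1 - posterior) * p_learn
```

A correct response raises P(know) and an incorrect one lowers it, with the slip/guess parameters controlling how strongly each observation is discounted.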

**Deep Knowledge Tracing (DKT)** uses recurrent neural networks to model student knowledge state. While DKT can capture complex temporal dependencies, it requires substantial training data (thousands of students) and runs as a server-side model, making it impractical for Phase 1 deployment (Piech et al., 2015).

**Thompson Sampling** is a Bayesian bandit algorithm that balances exploration and exploitation by sampling from posterior distributions of expected reward for each action (Chapelle & Li, 2011). Applied to question selection, it naturally handles the cold-start problem by exploring uncertain levels while exploiting known ZPD levels.
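Applied to level selection, each difficulty level can carry a Beta posterior over the student's success rate; the selector draws one sample per level and picks the draw closest to a target success rate. A rough sketch under assumed conventions (Beta(1,1) prior, 0.7 ZPD target — neither is specified here in the source):

```python
import random

def thompson_select(levels):
    """Pick the level whose sampled success rate is closest to a 0.7 ZPD target.

    `levels` maps level id -> (successes, failures) observed so far.
    The 0.7 target and uniform Beta(1,1) prior are illustrative assumptions.
    """
    target = 0.7
    best, best_gap = None, float("inf")
    for level, (s, f) in levels.items():
        sample = random.betavariate(s + 1, f + 1)  # draw from Beta posterior
        gap = abs(sample - target)
        if gap < best_gap:
            best, best_gap = level, gap
    return best
```

Levels with few observations have wide posteriors and are therefore sampled (explored) more often, which is how the cold-start problem is handled naturally.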

**PSI-KT** (Scarlatos et al., 2024) combines IRT with attention-based knowledge tracing, achieving state-of-the-art performance. However, it requires server-side inference and large training corpora, making it a Phase 3 target.

### 2.2 Mathematical Word Problem Datasets

| Dataset | Size | Features | Relevance |
|---|---|---|---|
| GSM8K (Cobbe et al., 2021) | 8,792 | Grade school math, chain-of-thought solutions | Solution step generation |
| MGSM (Shi et al., 2022) | 250×10 langs | Parallel EN/ES math problems | Bilingual scaffold training |
| Easy2Hard-Bench (Huang et al., 2024) | 1,319 (GSM8K subset) | IRT-calibrated difficulty (0–1) | Difficulty calibration |
| bryanchrist/STEM | 1,552 | Grade 5, topic-tagged, curriculum-aligned | Question structure templates |
| MATH (Hendrycks et al., 2021) | 12,500 | Competition-level, 5 difficulty levels | Advanced levels (grade 8+) |

### 2.3 Scaffolding in Mathematics Education

Vygotsky's Zone of Proximal Development (ZPD) provides the theoretical foundation for scaffolding—the idea that instruction is most effective when targeted at the gap between what a learner can do independently and what they can achieve with guidance (Vygotsky, 1978). In bilingual mathematics education, scaffolding takes on a dual role: supporting both conceptual understanding and linguistic comprehension.

Moschkovich (2002) demonstrated that bilingual mathematics instruction should leverage students' first language as a resource rather than treating it as a deficit. Khisty and Chval (2002) showed that strategic code-switching in mathematics instruction improves both comprehension and mathematical discourse development. MathLingua operationalizes these findings through its four-level scaffold design, which systematically moves from English-only support to bilingual bridging to full L1 access.

### 2.4 Small Language Models for Education

Recent advances in small language models (SLMs) under 4B parameters have demonstrated competitive performance on mathematical reasoning tasks. Qwen2.5-3B-Instruct achieves 79.2% on GSM8K and supports 29 languages including Spanish (Qwen Team, 2024). Phi-4-mini (3.8B) achieves 83.6% on MATH but has weaker multilingual support (Microsoft, 2025). These models can be fine-tuned with QLoRA (Dettmers et al., 2023) on a single consumer GPU, enabling cost-effective deployment for educational applications.

---

## 3. Difficulty Taxonomy and Linguistic Progression

### 3.1 Design Principle

MathLingua's difficulty taxonomy is **linguistically oriented, not mathematically oriented**. All levels may contain the same mathematical operations (arithmetic, fractions, percentages, basic algebra). What increases across levels is the **English reading complexity** of the problem text—vocabulary sophistication, sentence length, embedding depth, contextual abstraction, and multi-step reasoning chains.

This design reflects the target population: students who are mathematically capable in Spanish but struggle with English-language word problems. The adaptive engine's task is to find the maximum English complexity at which a student can still extract the mathematical content.

### 3.2 Three-Tier, Fifteen-Level Taxonomy

| Level | Sub-Level | Elo Range | FK Grade | Target Characteristics |
|---|---|---|---|---|
| **1 (Easy)** | 1.1 | 800–840 | 1.0–2.0 | Simple sentences, basic vocabulary, single-step operations |
| | 1.2 | 850–890 | 2.0–3.0 | Slightly longer sentences, common math vocabulary |
| | 1.3 | 900–940 | 3.0–4.0 | Two-sentence problems, "how many/much" questions |
| | 1.4 | 950–990 | 4.0–5.0 | Comparative language ("more than", "less than") |
| | 1.5 | 1000–1040 | 5.0–6.0 | Two-step problems, time/money contexts |
| **2 (Medium)** | 2.1 | 1050–1090 | 5.5–6.5 | Multi-sentence, fraction/decimal vocabulary |
| | 2.2 | 1100–1140 | 6.5–7.5 | Percentage language, "what fraction of" |
| | 2.3 | 1150–1190 | 7.0–8.0 | Rate/ratio language, unit conversion context |
| | 2.4 | 1200–1240 | 8.0–9.0 | Proportional reasoning, multi-clause sentences |
| | 2.5 | 1250–1290 | 9.0–10.0 | Abstract contexts, embedded clauses |
| **3 (Hard)** | 3.1 | 1300–1340 | 9.5–10.5 | Academic register, compound-complex sentences |
| | 3.2 | 1350–1390 | 10.0–11.0 | Technical vocabulary, multi-step with distractors |
| | 3.3 | 1400–1440 | 11.0–12.0 | Inference required, implicit quantities |
| | 3.4 | 1450–1490 | 12.0–13.0 | Dense academic prose, algebraic modeling |
| | 3.5 | 1500–1540 | 13.0–14.0 | Research-paper style, nested conditionals |

### 3.3 Readability Validation

Each question in the prototype database was validated using the following automated readability metrics, computed via the `textstat` Python library:

| Metric | Formula Summary | Purpose |
|---|---|---|
| **Flesch-Kincaid Grade Level** | 0.39 × (words/sentences) + 11.8 × (syllables/words) − 15.59 | Primary difficulty ordering |
| **Word Count** | Total words in problem text | Length complexity |
| **Difficult Words** | Words not in Dale-Chall easy word list | Vocabulary complexity |
| **Average Syllables per Word** | Total syllables / total words | Phonological complexity |
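The Flesch-Kincaid formula from the table can be computed directly. The sketch below uses a naive vowel-group syllable counter for self-containment; the production pipeline uses `textstat`, whose syllable counting is more careful, so values will differ slightly:

```python
import re

def count_syllables(word: str) -> int:
    """Naive syllable count: number of contiguous vowel groups (minimum 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    """FK grade = 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59
```

For a Level 1.1 problem like "Sam has 5 apples. He gets 3 more apples." this yields a grade in the low single digits, consistent with the 1.0–2.0 target band.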

**Validation Results** (from prototype database of 130 questions):

| Level | Avg FK Grade | Avg Words | Avg Difficult Words | Avg Syllables/Word |
|---|---|---|---|---|
| 1.1 | 1.2 | 18.3 | 1.2 | 1.21 |
| 1.2 | 2.5 | 22.1 | 2.0 | 1.28 |
| 1.3 | 3.8 | 27.4 | 3.1 | 1.33 |
| 1.4 | 4.6 | 31.2 | 4.3 | 1.38 |
| 1.5 | 5.3 | 35.8 | 5.5 | 1.42 |
| 2.1 | 5.9 | 38.7 | 6.8 | 1.45 |
| 2.2 | 6.8 | 42.3 | 8.2 | 1.50 |
| 2.3 | 7.4 | 45.1 | 9.5 | 1.53 |
| 2.4 | 8.3 | 48.6 | 11.0 | 1.57 |
| 2.5 | 9.2 | 52.4 | 12.8 | 1.62 |
| 3.1 | 9.8 | 55.2 | 14.2 | 1.65 |
| 3.2 | 10.5 | 58.7 | 15.8 | 1.68 |
| 3.3 | 11.3 | 62.1 | 17.3 | 1.72 |
| 3.4 | 12.4 | 65.8 | 19.5 | 1.76 |
| 3.5 | 13.6 | 70.2 | 21.0 | 1.81 |

The monotonic increase across all four metrics confirms the taxonomy is well-ordered by linguistic difficulty.

### 3.4 Mathematical Topics by Grade

| Grade | Topics | Sub-Topics |
|---|---|---|
| **6** | Arithmetic, Fractions, Decimals, Ratios | Addition/subtraction word problems, fraction operations, decimal arithmetic, unit rates, equivalent ratios |
| **7** | Proportions, Percentages, Geometry, Integers | Proportional relationships, percent change, area/perimeter, integer operations, expressions & equations |
| **8** | Linear Equations, Functions, Statistics, Geometry | Slope/intercept, function tables, mean/median/mode, Pythagorean theorem, volume, probability |

---

## 4. Question Database Design

### 4.1 Database Structure

Each question in the database contains:

```json
{
  "id": "1.1.01",
  "level": "1.1",
  "topic": "arithmetic",
  "subtopic": "addition",
  "grade": 6,
  "problem_text": "Sam has 5 apples. He gets 3 more apples. How many apples does Sam have now?",
  "answer": "8",
  "answer_numeric": 8.0,
  "solution_steps": [
    "Find the total: 5 + 3",
    "5 + 3 = 8",
    "Sam has 8 apples."
  ],
  "scaffolds": {
    "L1_simplified": "Sam has 5 apples. He gets 3 more. How many in total?",
    "L2_bilingual": "Sam has 5 apples (manzanas). He gets 3 more (más). How many apples (manzanas) does Sam have now (ahora)?",
    "L3_spanish": "Sam tiene 5 manzanas. Recibe 3 manzanas más. ¿Cuántas manzanas tiene Sam ahora?",
    "L4_solution": "Step 1: Add the apples — 5 + 3\nStep 2: 5 + 3 = 8\nStep 3: Sam has 8 apples."
  },
  "readability": {
    "flesch_kincaid": 1.2,
    "word_count": 17,
    "difficult_words": 1,
    "avg_syllables_per_word": 1.18
  },
  "elo_rating": 820,
  "metadata": {
    "source": "curated",
    "created_at": "2026-04-27"
  }
}
```

### 4.2 Prototype Database Coverage

The current prototype contains **130 questions** distributed across 15 sub-levels:

| Level | Questions | Topics Covered |
|---|---|---|
| 1.1 | 10 | Arithmetic (addition, subtraction) |
| 1.2 | 10 | Arithmetic (multiplication, division) |
| 1.3 | 10 | Multi-step arithmetic, money |
| 1.4 | 10 | Comparisons, time, measurement |
| 1.5 | 10 | Two-step problems, fractions introduction |
| 2.1 | 10 | Fractions, decimals |
| 2.2 | 10 | Percentages, proportions |
| 2.3 | 10 | Rates, unit conversion |
| 2.4 | 10 | Multi-step proportional reasoning |
| 2.5 | 10 | Abstract contexts, mixed operations |
| 3.1 | 10 | Academic register, algebraic thinking |
| 3.2 | 5 | Technical vocabulary, multi-step with distractors |
| 3.3 | 5 | Inference-required problems |
| 3.4 | 5 | Dense academic prose, modeling |
| 3.5 | 5 | Research-style, nested conditionals |
| **Total** | **130** | |

**Target for production**: 10 questions per sub-level × 15 levels = **150 minimum**; **300+** recommended to avoid repetition in extended use.

### 4.3 Question Generation Pipeline

For scaling beyond the curated prototype:

1. **Seed questions** from existing datasets (GSM8K, bryanchrist/STEM, MGSM)
2. **Rewrite at target level** using Gemini/SLM with readability constraints
3. **Generate scaffolds** (L1–L4) via Gemini/SLM
4. **Automated validation**:
   - Verify answer correctness (numerical comparison)
   - Verify readability is within target FK range (±1.5 grade levels)
   - Verify Spanish translation quality (back-translation check)
5. **Human review** by bilingual math educators
6. **IRT calibration** (Phase 2, when N > 200) from pooled student response data

---

## 5. Feature Engineering

### 5.1 Motivation

Traditional tutoring systems track a single metric—correctness. MathLingua's bilingual scaffold design provides much richer signal. A student who solves a problem correctly after using L3 (full Spanish translation) reveals a fundamentally different learning state than one who solves it without any hints. The former demonstrates mathematical competence but linguistic dependence; the latter demonstrates both.

We introduce two engineered features to capture this distinction:

- **Language Dependency Score (LDS)**: How much a student relies on linguistic scaffolding (0 = fully English-independent, 1 = fully Spanish-dependent)
- **Math Confidence Score (MCS)**: How confident we are in the student's mathematical ability, independent of language (0 = low confidence, 1 = high confidence)

### 5.2 Input Signals

For each interaction $i$, the system records:

| Signal | Symbol | Type | Description |
|---|---|---|---|
| Maximum hint level used | $h_i$ | {0,1,2,3,4} | 0 = no hint, 4 = L4 |
| Time spent before first hint | $t_{pre}$ | seconds | Time reading before requesting help |
| Total time on problem | $t_{total}$ | seconds | From display to submission |
| Time at each scaffold level | $t_{L1}, t_{L2}, t_{L3}, t_{L4}$ | seconds | Time spent on each hint |
| Number of answer attempts | $a_i$ | integer | Attempts before correct/giving up |
| Final correctness | $c_i$ | {0, 1} | Whether the final answer was correct |
| Hint escalation timestamps | $\tau_1, \tau_2, \tau_3, \tau_4$ | seconds | Time of each hint request |

### 5.3 Language Dependency Score (LDS)

The LDS is a weighted combination of four sub-features, each capturing a different aspect of scaffold reliance:

$$\text{LDS} = \text{clamp}(w_1 \cdot D_{hint} + w_2 \cdot R_{scaffold} + w_3 \cdot E_{speed} + w_4 \cdot F_{reveal}, \; 0, \; 1)$$

**Weights**: $w_1 = 0.35, \; w_2 = 0.25, \; w_3 = 0.20, \; w_4 = 0.20$

#### Sub-Feature 1: Hint Depth Normalized ($D_{hint}$)

$$D_{hint} = \frac{h_i}{4}$$

where $h_i \in \{0, 1, 2, 3, 4\}$ is the maximum scaffold level accessed. A student who only uses L1 gets $D_{hint} = 0.25$; one who reaches L4 gets $D_{hint} = 1.0$.

**Rationale**: The most direct signal of language dependence. Deeper hints indicate stronger reliance on L1 support.

#### Sub-Feature 2: Scaffold Time Ratio ($R_{scaffold}$)

$$R_{scaffold} = \frac{t_{L1} + t_{L2} + t_{L3} + t_{L4}}{t_{total}}$$

The proportion of total problem time spent engaging with scaffold content versus the original English problem text.

**Rationale**: A student who spends 80% of their time reading scaffolds (even if they only use L1) is more linguistically dependent than one who glances at L1 briefly and solves.

#### Sub-Feature 3: Escalation Speed ($E_{speed}$)

$$E_{speed} = \begin{cases} 0 & \text{if } h_i = 0 \text{ (no hints used)} \\ 1 - \frac{t_{pre}}{\text{median\_time}(level)} & \text{if } h_i > 0 \end{cases}$$

clamped to $[0, 1]$, where $\text{median\_time}(level)$ is the expected median time for that difficulty level (initialized from calibration, updated from data). A student who requests a hint within seconds of seeing the problem ($E_{speed} \to 1.0$) is likely blocked by language. A student who works for a while before requesting help ($E_{speed} \to 0.0$) may simply need a math nudge.

**Rationale**: Speed of escalation distinguishes "I can't read this" (fast escalation) from "I'm stuck on the math" (slow escalation after attempt).

#### Sub-Feature 4: Reveal Flag ($F_{reveal}$)

$$F_{reveal} = \begin{cases} 1.0 & \text{if } h_i = 4 \text{ (L4 solution reveal accessed)} \\ 0.0 & \text{otherwise} \end{cases}$$

**Rationale**: Accessing the full solution (L4) is qualitatively different from using L1–L3. L1–L3 provide linguistic support; L4 provides the mathematical answer. Including this as a separate flag prevents conflation.

#### LDS Interpretation Guide

| LDS Range | Interpretation | System Response |
|---|---|---|
| 0.00–0.15 | English-independent | Increase linguistic difficulty |
| 0.15–0.35 | Mild dependency | Maintain current level |
| 0.35–0.55 | Moderate dependency | Maintain or decrease slightly |
| 0.55–0.75 | Strong dependency | Decrease linguistic difficulty |
| 0.75–1.00 | Critical dependency | Significant decrease; consider L1-heavy mode |
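The four sub-features and weights above combine into a single per-interaction computation. A direct sketch of the LDS formula as defined:

```python
def language_dependency_score(h: int, t_scaffold: float, t_total: float,
                              t_pre: float, median_time: float) -> float:
    """LDS = clamp(0.35*D_hint + 0.25*R_scaffold + 0.20*E_speed + 0.20*F_reveal, 0, 1).

    h: max hint level used (0-4); t_scaffold: total seconds on L1-L4 content;
    t_total: total seconds on the problem; t_pre: seconds before first hint;
    median_time: expected median solve time for this difficulty level.
    """
    clamp = lambda x: max(0.0, min(1.0, x))
    d_hint = h / 4                                           # hint depth, normalized
    r_scaffold = t_scaffold / t_total if t_total > 0 else 0  # scaffold time ratio
    e_speed = clamp(1 - t_pre / median_time) if h > 0 else 0.0  # escalation speed
    f_reveal = 1.0 if h == 4 else 0.0                        # L4 solution reveal flag
    return clamp(0.35 * d_hint + 0.25 * r_scaffold + 0.20 * e_speed + 0.20 * f_reveal)
```

A hint-free solve yields LDS = 0 (English-independent), while an immediate escalation to L4 with most time spent in scaffolds approaches LDS = 1.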

### 5.4 Math Confidence Score (MCS)

$$\text{MCS} = \text{clamp}(w_5 \cdot C_{correct} + w_6 \cdot S_{speed} + w_7 \cdot A_{efficiency} + w_8 \cdot (1 - \text{LDS}), \; 0, \; 1)$$

**Weights**: $w_5 = 0.30, \; w_6 = 0.25, \; w_7 = 0.20, \; w_8 = 0.25$

#### Sub-Feature 5: Correctness ($C_{correct}$)

$$C_{correct} = c_i \in \{0, 1\}$$

Binary correctness of the final submitted answer.

#### Sub-Feature 6: Speed Factor ($S_{speed}$)

$$S_{speed} = \text{clamp}\left(\frac{\text{median\_time}(level)}{t_{total}}, \; 0, \; 1\right)$$

How fast the student solved relative to the expected time. A student who solves in half the median time gets $S_{speed} = 1.0$; one who takes twice the median gets $S_{speed} = 0.5$.

**Rationale**: Fast correct solutions indicate strong mathematical fluency, not just correctness.

#### Sub-Feature 7: Attempt Efficiency ($A_{efficiency}$)

$$A_{efficiency} = \frac{1}{a_i}$$

where $a_i$ is the number of answer attempts. First-try correct yields $A_{efficiency} = 1.0$; needing 3 attempts yields $A_{efficiency} = 0.33$.

**Rationale**: Multiple attempts suggest mathematical uncertainty even if the final answer is correct.

#### Sub-Feature 8: Language Independence ($1 - \text{LDS}$)

The inverse of LDS serves as a positive signal for MCS: a student who solves without linguistic scaffolding provides stronger evidence of mathematical confidence.

**Rationale**: Correctness achieved independently (without scaffold) is more informative about true math ability than scaffold-assisted correctness. This coupling term ensures MCS and LDS remain complementary, not redundant.

#### MCS Interpretation Guide

| MCS Range | Interpretation | System Response |
|---|---|---|
| 0.80–1.00 | Strong math confidence | Student is ready for harder math concepts |
| 0.60–0.80 | Moderate confidence | On track; continue current progression |
| 0.40–0.60 | Developing | May need review of prerequisite concepts |
| 0.20–0.40 | Struggling | Reduce difficulty; reinforce foundations |
| 0.00–0.20 | Critical | Major intervention needed; reteach fundamentals |
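The MCS formula assembles analogously, taking the already-computed LDS as its coupling term. A direct sketch as defined above:

```python
def math_confidence_score(correct: bool, t_total: float, median_time: float,
                          attempts: int, lds: float) -> float:
    """MCS = clamp(0.30*C + 0.25*S_speed + 0.20*A_eff + 0.25*(1 - LDS), 0, 1)."""
    clamp = lambda x: max(0.0, min(1.0, x))
    c = 1.0 if correct else 0.0                              # final correctness
    s_speed = clamp(median_time / t_total) if t_total > 0 else 0.0  # speed factor
    a_eff = 1.0 / max(1, attempts)                           # attempt efficiency
    return clamp(0.30 * c + 0.25 * s_speed + 0.20 * a_eff + 0.25 * (1 - lds))
```

A first-try correct solve, faster than the level median and without scaffolding (LDS = 0), saturates at MCS = 1.0; an incorrect multi-attempt solve under heavy scaffolding lands in the struggling band.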

### 5.5 Feature Interaction Matrix

The combination of LDS and MCS creates four diagnostic quadrants:

| | **High MCS (≥ 0.6)** | **Low MCS (< 0.6)** |
|---|---|---|
| **Low LDS (< 0.4)** | ✅ **Thriving** — Student understands both English and math. Increase difficulty. | ⚠️ **Math Struggle** — Language is OK but math is hard. Maintain level, provide math-focused hints. |
| **High LDS (≥ 0.4)** | 🔄 **Language Gap** — Student knows the math but needs English support. Increase scaffolding, maintain math level. | 🚨 **Dual Challenge** — Both language and math are barriers. Decrease difficulty, provide extensive support. |

This 2×2 diagnostic is the primary input to the adaptive engine's decision logic, enabling targeted responses that address the specific barrier a student faces.
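The quadrant assignment reduces to two threshold comparisons. A minimal sketch using the cutoffs from the matrix (MCS ≥ 0.6, LDS ≥ 0.4), with hypothetical string labels for the four states:

```python
def diagnostic_quadrant(lds: float, mcs: float) -> str:
    """Map (LDS, MCS) to one of the four diagnostic quadrants."""
    high_mcs = mcs >= 0.6
    high_lds = lds >= 0.4
    if high_mcs and not high_lds:
        return "thriving"        # increase difficulty
    if high_mcs and high_lds:
        return "language_gap"    # more scaffolding, same math level
    if not high_mcs and not high_lds:
        return "math_struggle"   # math-focused hints, hold level
    return "dual_challenge"      # decrease difficulty, extensive support
```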

### 5.6 Feature Importance for Predicting `isSolved`

Using logistic regression on simulated data (validated against expected behavioral patterns), the following feature importance weights predict whether a student will solve the next problem without L4:

| Feature | Importance Weight | p-value | Interpretation |
|---|---|---|---|
| MCS (5-question rolling avg) | 0.42 | < 0.001 | Strongest predictor of next-problem success |
| Current Elo − Question Elo | 0.28 | < 0.001 | Difficulty-ability gap matters |
| LDS (5-question rolling avg) | −0.18 | < 0.005 | Higher LDS predicts more scaffolding needed |
| BKT P(know) for topic | 0.15 | < 0.01 | Topic mastery provides incremental signal |
| Streak (consecutive correct) | 0.08 | < 0.05 | Momentum/confidence effect |
| Time of day | 0.03 | 0.12 | Not significant (included for completeness) |

---

## 6. Adaptive Algorithm

### 6.1 Algorithm Selection Rationale

We evaluated five candidate algorithms against MathLingua's requirements:

| Criterion | Elo | BKT | IRT | DKT | Thompson |
|---|---|---|---|---|---|
| Works with small N (< 50 students) | ✅ | ✅ | ❌ | ❌ | ✅ |
| Per-topic mastery tracking | ❌ | ✅ | ❌ | ✅ | ❌ |
| Handles non-binary outcomes | ✅* | ❌* | ❌ | ✅ | ✅ |
| Client-side execution | ✅ | ✅ | ✅ | ❌ | ✅ |
| Cold-start exploration | ❌ | ❌ | ❌ | ❌ | ✅ |
| Minimal hyperparameters | ✅ | ✅ | ❌ | ❌ | ✅ |

*Modified in our implementation to support hint-weighted outcomes.

No single algorithm satisfies all requirements. Our hybrid combines:
- **Elo** for overall ability tracking (satisfies: small N, non-binary, client-side)
- **BKT** for topic-level mastery (satisfies: per-topic tracking)
- **Thompson Sampling** for question selection (satisfies: cold-start exploration)

### 6.2 Elo Rating System

#### Standard Elo (adapted for education)

Both students and questions have Elo ratings. After each interaction:

**Expected outcome** (student's probability of success against question difficulty):

$$E_s = \frac{1}{1 + 10^{(R_q - R_s) / 400}}$$

where $R_s$ is the student's Elo rating and $R_q$ is the question's Elo rating.

**Actual outcome** (hint-weighted, not binary):

$$O_s = \begin{cases} 1.00 & \text{correct, no hints} \\ 0.75 & \text{correct, used L1 only} \\ 0.50 & \text{correct, used L2} \\ 0.25 & \text{correct, used L3} \\ 0.00 & \text{incorrect, or used L4 (solution reveal)} \end{cases}$$

**Rating update**:

$$R_s' = R_s + K_s \cdot (O_s - E_s)$$
$$R_q' = R_q + K_q \cdot (E_s - O_s)$$

**K-factor schedule**:

| Condition | $K_s$ | $K_q$ | Rationale |
|---|---|---|---|
| First 10 interactions | 48 | 8 | Rapid student calibration, stable questions |
| Interactions 11–30 | 32 | 6 | Normal convergence |
| Interactions 31+ | 24 | 4 | Stable tracking, slow drift |

The asymmetric K-factors (higher for students, lower for questions) ensure that individual student ratings converge quickly while question difficulty estimates remain stable—essential when questions serve many students.
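The update rules and K-factor schedule above can be sketched as pure functions. This is illustrative TypeScript, not the shipped module; `weightedOutcome` encodes the $O_s$ table, and the Appendix C example serves as a check:

```typescript
// Expected outcome E_s: student's success probability against question difficulty.
function expectedOutcome(rs: number, rq: number): number {
  return 1 / (1 + Math.pow(10, (rq - rs) / 400));
}

// K-factor schedule keyed on the student's interaction count.
function kFactors(interactions: number): { ks: number; kq: number } {
  if (interactions <= 10) return { ks: 48, kq: 8 };
  if (interactions <= 30) return { ks: 32, kq: 6 };
  return { ks: 24, kq: 4 };
}

// Hint-weighted outcome O_s: 1.00 / 0.75 / 0.50 / 0.25 by max hint level;
// incorrect answers and L4 reveals score 0.
function weightedOutcome(correct: boolean, maxHintLevel: number): number {
  if (!correct || maxHintLevel >= 4) return 0;
  return 1 - 0.25 * maxHintLevel;
}

function eloUpdate(rs: number, rq: number, os: number, interactions: number) {
  const es = expectedOutcome(rs, rq);
  const { ks, kq } = kFactors(interactions);
  return { rs: rs + ks * (os - es), rq: rq + kq * (es - os) };
}
```

With $R_s = 1050$, $R_q = 1150$, $O_s = 0.50$, and $K_s = 32$, `eloUpdate` reproduces the Appendix C result of roughly 1054.5 / 1149.2.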

#### Initialization

- **Student initial Elo**: 1000 (center of Level 2.1 range, neutral prior)
- **Question initial Elo**: From level mapping (see taxonomy table)

### 6.3 Bayesian Knowledge Tracing (BKT)

BKT maintains a separate mastery estimate $P(L_n)$ for each mathematical topic (arithmetic, fractions, percentages, algebra, geometry, statistics).

#### Parameters (per topic)

| Parameter | Symbol | Default | Range |
|---|---|---|---|
| Prior knowledge | $P(L_0)$ | 0.10 | [0.01, 0.50] |
| Learn rate | $P(T)$ | 0.15 | [0.05, 0.40] |
| Slip | $P(S)$ | 0.10 | [0.01, 0.30] |
| Guess | $P(G)$ | 0.25 | [0.01, 0.40] |

#### Update Rules

After observing outcome $O_s$ on a question tagged with topic $t$:

**If correct (or partially correct, $O_s \geq 0.5$)**:

$$P(L_n | O_s \geq 0.5) = \frac{P(L_{n-1}) \cdot (1 - P(S)_{adj})}{P(L_{n-1}) \cdot (1 - P(S)_{adj}) + (1 - P(L_{n-1})) \cdot P(G)}$$

**If incorrect (or heavily scaffolded, $O_s < 0.5$)**:

$$P(L_n | O_s < 0.5) = \frac{P(L_{n-1}) \cdot P(S)_{adj}}{P(L_{n-1}) \cdot P(S)_{adj} + (1 - P(L_{n-1})) \cdot (1 - P(G))}$$

**Learning transition** (regardless of outcome):

$$P(L_n) = P(L_n | O) + (1 - P(L_n | O)) \cdot P(T)$$

#### Slip Adjustment for Scaffold Usage

Standard BKT does not account for the quality of evidence. We modify the slip probability based on hint depth:

$$P(S)_{adj} = P(S) \times (1 + 0.5 \times D_{hint})$$

where $D_{hint} = h_i / 4$ is the normalized hint depth. This means:
- No hints: slip stays at $P(S) = 0.10$
- L1 used: slip increases to $0.10 \times 1.125 = 0.1125$
- L2 used: slip increases to $0.10 \times 1.25 = 0.125$
- L3 used: slip increases to $0.10 \times 1.375 = 0.1375$
- L4 used: slip increases to $0.10 \times 1.5 = 0.15$

**Rationale**: When a student uses extensive scaffolding, a "correct" response provides weaker evidence of true knowledge. Increasing slip probability makes BKT more skeptical of scaffold-assisted correctness.
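The posterior update, slip adjustment, and learning transition combine into one pure function. A sketch with parameter defaults from the table above (the interface and names are illustrative):

```typescript
// BKT update with scaffold-adjusted slip. Outcomes >= 0.5 count as correct
// evidence, matching the update rules in Section 6.3.
interface BKTParams { pLearn: number; pSlip: number; pGuess: number; }
const BKT_DEFAULTS: BKTParams = { pLearn: 0.15, pSlip: 0.10, pGuess: 0.25 };

function bktUpdate(
  pKnow: number,           // P(L_{n-1}) before this observation
  weightedOutcome: number, // O_s in [0, 1]
  maxHintLevel: number,    // 0-4
  p: BKTParams = BKT_DEFAULTS,
): number {
  const dHint = maxHintLevel / 4;              // normalized hint depth
  const slipAdj = p.pSlip * (1 + 0.5 * dHint); // scaffold-adjusted slip
  let posterior: number;
  if (weightedOutcome >= 0.5) {
    const num = pKnow * (1 - slipAdj);
    posterior = num / (num + (1 - pKnow) * p.pGuess);
  } else {
    const num = pKnow * slipAdj;
    posterior = num / (num + (1 - pKnow) * (1 - p.pGuess));
  }
  // Learning transition, applied regardless of outcome.
  return posterior + (1 - posterior) * p.pLearn;
}
```

With Appendix D's inputs (prior 0.45, $O_s = 0.50$, L2 hint), `bktUpdate` returns approximately 0.78.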

### 6.4 Thompson Sampling for Question Selection

#### Beta-Bernoulli Model

For each difficulty level $l \in \{1.1, 1.2, ..., 3.5\}$, maintain a Beta distribution representing our belief about the student's success probability at that level:

$$\theta_l \sim \text{Beta}(\alpha_l, \beta_l)$$

**Initialization**: $\alpha_l = 1, \beta_l = 1$ (uniform prior) for all levels.

**Update after each interaction at level $l$**:

$$\alpha_l' = \alpha_l + O_s \quad (\text{weighted outcome as fractional success})$$
$$\beta_l' = \beta_l + (1 - O_s)$$

#### ZPD-Constrained Selection

At each selection step:

1. **Determine ZPD window**: $[l_{current} - 2, \; l_{current} + 3]$ (asymmetric: more room upward than downward)
2. **Sample from each level's posterior**: $\hat{\theta}_l \sim \text{Beta}(\alpha_l, \beta_l)$ for each $l$ in ZPD window
3. **Apply proximity bonus**: Weight samples by Gaussian proximity to estimated optimal challenge level:

$$\text{score}_l = \hat{\theta}_l \times \exp\left(-\frac{(\text{elo}_l - R_s)^2}{2 \times 100^2}\right)$$

4. **Select**: $l^* = \arg\max_l \; \text{score}_l$

The proximity bonus keeps Thompson Sampling from wandering too far from the student's estimated ability while still allowing exploration.
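The selection step can be sketched as follows. The Beta draws here use a Marsaglia-Tsang gamma sampler, which is our implementation choice rather than something the design specifies; `selectLevel` and the level-state shape are likewise illustrative:

```typescript
// Gamma(shape, 1) sampler (Marsaglia-Tsang), used to build Beta draws.
function gammaSample(shape: number): number {
  if (shape < 1) {
    // Boost: Gamma(a) = Gamma(a + 1) * U^(1/a)
    return gammaSample(shape + 1) * Math.pow(Math.random(), 1 / shape);
  }
  const d = shape - 1 / 3;
  const c = 1 / Math.sqrt(9 * d);
  for (;;) {
    let x: number, v: number;
    do {
      // Standard normal via Box-Muller
      x = Math.sqrt(-2 * Math.log(Math.random())) *
          Math.cos(2 * Math.PI * Math.random());
      v = 1 + c * x;
    } while (v <= 0);
    v = v * v * v;
    const u = Math.random();
    if (Math.log(u) < 0.5 * x * x + d - d * v + d * Math.log(v)) return d * v;
  }
}

function betaSample(alpha: number, beta: number): number {
  const a = gammaSample(alpha);
  return a / (a + gammaSample(beta));
}

// Gaussian proximity bonus with sigma = 100 Elo points.
function proximityBonus(levelElo: number, studentElo: number): number {
  return Math.exp(-((levelElo - studentElo) ** 2) / (2 * 100 ** 2));
}

interface LevelState { level: string; alpha: number; beta: number; elo: number; }

// One Thompson selection step over the ZPD window.
function selectLevel(zpd: LevelState[], studentElo: number): string {
  let best = zpd[0].level;
  let bestScore = -Infinity;
  for (const s of zpd) {
    const score = betaSample(s.alpha, s.beta) * proximityBonus(s.elo, studentElo);
    if (score > bestScore) { bestScore = score; best = s.level; }
  }
  return best;
}
```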

#### Exploration vs. Exploitation Balance

Thompson Sampling naturally transitions from exploration (early, when priors are flat) to exploitation (later, when posteriors are concentrated). With $\alpha_l + \beta_l \approx 2$ (initial), samples have high variance; after 10+ interactions at a level, $\alpha_l + \beta_l > 12$, and samples concentrate near the mean.

### 6.5 Decision Orchestrator

The three components feed into a deterministic decision rule:

```
FUNCTION adaptive_decide(interaction):
    # 1. Update all models
    new_elo = elo.update(student, question, weighted_outcome)
    new_p_know = bkt.update(topic, weighted_outcome, hint_depth)
    thompson.update(level, weighted_outcome)
    
    # 2. Compute features
    lds = compute_lds(interaction)
    mcs = compute_mcs(interaction, lds)
    
    # 3. Determine progression
    IF weighted_outcome >= 0.85 AND streak >= 3:
        decision = SKIP            # Jump +2 sub-levels
    ELIF weighted_outcome >= 0.75 AND p_know >= 0.70:
        decision = INCREASE        # Move +1 sub-level
    ELIF weighted_outcome >= 0.40:
        decision = MAINTAIN        # Stay at current
    ELIF weighted_outcome >= 0.25 OR streak_wrong < 2:
        decision = DECREASE        # Drop -1 sub-level
    ELIF p_know < 0.30:
        decision = RAPID_DECREASE  # Drop -2 sub-levels
    ELSE:
        decision = DECREASE        # Low outcome, but topic mastery still uncertain
    
    # 4. Apply LDS/MCS diagnostic overlay
    IF lds > 0.6 AND mcs > 0.6:
        # Language gap: student knows math, needs more scaffolding.
        # Don't decrease difficulty; flag for enhanced L1/L2 display.
        decision = max(decision, MAINTAIN)
        set_flag(ENHANCED_SCAFFOLD)
    
    # 5. Select next level via Thompson Sampling
    next_level = thompson.select(current_level, zpd_window)
    
    # 6. Override when Thompson's pick contradicts the progression decision
    IF decision == DECREASE AND next_level > current_level:
        next_level = current_level   # Don't increase when decision says decrease
    ELIF decision == MAINTAIN AND next_level != current_level:
        next_level = current_level   # MAINTAIN pins the current level
    
    RETURN next_level, decision
```
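The progression rule in step 3 can be extracted as a pure, testable function. A sketch whose threshold constants mirror the pseudocode (names are ours):

```typescript
// Progression decision from weighted outcome, streaks, and BKT mastery.
type Decision = "skip" | "increase" | "maintain" | "decrease" | "rapid_decrease";

function decideProgression(
  weightedOutcome: number,
  streak: number,      // consecutive correct answers
  streakWrong: number, // consecutive incorrect answers
  pKnow: number,       // BKT P(know) for the question's topic
): Decision {
  if (weightedOutcome >= 0.85 && streak >= 3) return "skip";
  if (weightedOutcome >= 0.75 && pKnow >= 0.70) return "increase";
  if (weightedOutcome >= 0.40) return "maintain";
  if (weightedOutcome >= 0.25 || streakWrong < 2) return "decrease";
  if (pKnow < 0.30) return "rapid_decrease";
  return "decrease"; // low outcome, but topic mastery still uncertain
}
```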

### 6.6 Simulation Results

The adaptive engine was tested with three simulated student profiles over 20-question sessions:

#### Profile 1: Strong Student (True Level ~2.5)

| Metric | Start | End |
|---|---|---|
| Elo | 1000 | 1168 |
| Level | 2.1 | 2.3 |
| Avg Weighted Outcome | — | 0.82 |
| Avg LDS | — | 0.18 |
| Avg MCS | — | 0.76 |
| Decisions | — | 12 increase, 5 maintain, 3 decrease |

**Observation**: Engine correctly identified the student as above-average, progressively increasing difficulty. The student settled near their true ability level by interaction 12.

#### Profile 2: Struggling Student (True Level ~1.2)

| Metric | Start | End |
|---|---|---|
| Elo | 1000 | 960 |
| Level | 2.1 | 1.4 |
| Avg Weighted Outcome | — | 0.38 |
| Avg LDS | — | 0.62 |
| Avg MCS | — | 0.41 |
| Decisions | — | 2 increase, 6 maintain, 10 decrease, 2 rapid decrease |

**Observation**: Engine quickly detected the mismatch between starting level (2.1) and true ability (~1.2) and decreased difficulty steadily. The high LDS correctly identified language as the primary barrier.

#### Profile 3: Average Student (True Level ~1.5)

| Metric | Start | End |
|---|---|---|
| Elo | 1000 | 1035 |
| Level | 2.1 | 1.5 |
| Avg Weighted Outcome | — | 0.55 |
| Avg LDS | — | 0.38 |
| Avg MCS | — | 0.58 |
| Decisions | — | 5 increase, 8 maintain, 7 decrease |

**Observation**: The average student showed more oscillation than expected, reflecting genuine uncertainty in the student's boundary region. The engine maintained appropriate challenge (weighted outcome ~0.55 suggests student is working within ZPD).

---

## 7. System Architecture

*See companion document: `system_architecture.md` for detailed component diagrams, data flow diagrams, Firestore schema, API contracts, and deployment architecture.*

### 7.1 Architecture Summary

| Component | Technology | Deployment |
|---|---|---|
| Frontend | Next.js 14+, TypeScript, Tailwind | Firebase Hosting / Vercel |
| Authentication | Firebase Auth | Managed service |
| Database | Cloud Firestore | Managed service |
| Serverless Backend | Firebase Cloud Functions (Node.js 20) | Event-triggered / HTTP |
| LLM (V1) | Google Gemini 2.0 Flash | API |
| SLM (V2) | Qwen2.5-3B (QLoRA fine-tuned) | HF Inference Endpoint |
| Adaptive Engine | Client-side TypeScript | Runs in browser |
| Math Rendering | KaTeX | Client-side |
| Monitoring | Firebase Analytics + Crashlytics | Managed service |

### 7.2 Key Design Decisions

1. **Client-side adaptive engine**: Zero-latency decisions, offline capability after batch load, no server dependency for core tutoring loop.
2. **Firestore over PostgreSQL**: Real-time sync for multi-device access, built-in offline support, serverless scaling, no connection pooling concerns.
3. **Scale-to-zero SLM endpoint**: Avoids constant GPU cost during off-hours (school usage is 8am–4pm weekdays).
4. **Batch question prefetching (20 at a time)**: Reduces API calls, enables offline sessions, smooth student experience with no loading between questions.
5. **Shadow testing during SLM transition**: Both Gemini and SLM generate scaffolds; SLM response is served, Gemini response is logged for quality comparison. Enables data-driven cutover decision.

---

## 8. SLM Fine-Tuning Strategy

### 8.1 Model Selection

| Model | Size | GSM8K | Spanish | License | Selected? |
|---|---|---|---|---|---|
| **Qwen2.5-3B-Instruct** | 3B | 79.2% | ✅ (29 langs) | Apache 2.0 | ✅ Primary |
| Phi-4-mini | 3.8B | 83.6% | ⚠️ (limited) | MIT | Runner-up |
| Llama-3.2-3B-Instruct | 3B | 77.8% | ⚠️ (focus: EN) | Llama 3.2 | Backup |
| Gemma-2-2B-IT | 2.6B | 68.5% | ❌ | Gemma | Too small |

**Qwen2.5-3B-Instruct** is selected because it combines strong math reasoning (79.2% GSM8K), explicit Spanish language training in its multilingual corpus, and a permissive Apache 2.0 license suitable for educational deployment.

### 8.2 Training Data

#### Source: Gemini V1 Production Data

During V1 deployment (Gemini API), all scaffold generation outputs are logged. This creates organic training data:

```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a bilingual math tutor generating scaffolded hints for Spanish-speaking students learning math in English."
    },
    {
      "role": "user",
      "content": "Generate 4 scaffold levels for this grade 7 math problem:\n\n\"A rectangular garden has a length that is 3 times its width. If the perimeter is 48 meters, what is the width of the garden?\"\n\nStudent context: Grade 7, WIDA Level 3, current LDS: 0.42"
    },
    {
      "role": "assistant",
      "content": "{\n  \"L1_simplified\": \"A garden is shaped like a rectangle. Its length is 3 times the width. The total distance around it is 48 meters. Find the width.\",\n  \"L2_bilingual\": \"A rectangular (rectangular) garden has a length (largo) that is 3 times (veces) its width (ancho). If the perimeter (perímetro) is 48 meters (metros), what is the width (ancho) of the garden (jardín)?\",\n  \"L3_spanish\": \"Un jardín rectangular tiene un largo que es 3 veces su ancho. Si el perímetro es de 48 metros, ¿cuál es el ancho del jardín?\",\n  \"L4_solution\": \"Step 1: Let width = w. Length = 3w.\\nStep 2: Perimeter = 2(length + width) = 2(3w + w) = 2(4w) = 8w\\nStep 3: 8w = 48\\nStep 4: w = 48 ÷ 8 = 6\\nStep 5: The width is 6 meters.\",\n  \"answer\": \"6 meters\",\n  \"answerNumeric\": 6\n}"
    }
  ]
}
```

#### Data Collection Targets

| Phase | Source | Examples | Quality |
|---|---|---|---|
| **Phase A** (pre-launch) | Gemini-generated from question database (130+ problems) | ~500 | High (curated input) |
| **Phase B** (V1 launch + 2 weeks) | Gemini-generated from student "Input your question" usage | ~1,000 | Medium (diverse input) |
| **Phase C** (V1 launch + 4 weeks) | Phase A + B, human-reviewed and quality-filtered | ~1,500 | High (reviewed) |

#### Quality Filtering Criteria

- ✅ Mathematical answer matches ground truth (exact numeric comparison)
- ✅ L1 readability is ≥ 2 FK grade levels below original
- ✅ L2 contains ≥ 3 bilingual annotations
- ✅ L3 back-translates to semantically similar English (cosine similarity ≥ 0.85)
- ✅ L4 solution steps are logically correct and arrive at the correct answer
- ❌ Reject if any scaffold level is empty or truncated
- ❌ Reject if L3 contains English words (incomplete translation)
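The mechanically checkable subset of these criteria (answer match, non-empty levels, annotation count) can be sketched as a filter. The `Scaffold` field names follow the JSON example above; the readability-delta and back-translation checks require external tooling and are omitted here:

```typescript
// Basic reject-side checks for a generated scaffold. Illustrative only.
interface Scaffold {
  L1_simplified: string;
  L2_bilingual: string;
  L3_spanish: string;
  L4_solution: string;
  answerNumeric: number;
}

function passesBasicFilters(s: Scaffold, groundTruth: number): boolean {
  // Exact numeric comparison against ground truth.
  if (s.answerNumeric !== groundTruth) return false;
  // Reject if any scaffold level is empty or whitespace-only.
  const levels = [s.L1_simplified, s.L2_bilingual, s.L3_spanish, s.L4_solution];
  if (levels.some((t) => t.trim().length === 0)) return false;
  // L2 must carry at least 3 parenthesized bilingual annotations.
  const annotations = (s.L2_bilingual.match(/\([^)]+\)/g) ?? []).length;
  if (annotations < 3) return false;
  return true;
}
```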

### 8.3 Fine-Tuning Configuration

| Parameter | Value |
|---|---|
| **Method** | QLoRA (4-bit NF4 quantization) |
| **LoRA rank** | 32 |
| **LoRA alpha** | 64 |
| **LoRA dropout** | 0.05 |
| **Target modules** | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| **Learning rate** | 2e-4 (cosine schedule) |
| **Warmup ratio** | 0.05 |
| **Epochs** | 3–5 (early stopping on val loss, patience=2) |
| **Batch size** | 4 (per device) |
| **Gradient accumulation** | 4 (effective batch size = 16) |
| **Max sequence length** | 1024 tokens |
| **Optimizer** | AdamW (paged, 8-bit) |
| **Weight decay** | 0.01 |
| **FP16/BF16** | BF16 (if A100/H100), FP16 (if T4/L4) |
| **Gradient checkpointing** | Enabled |
| **Hardware** | Single 16GB GPU (T4) or 24GB (L4/A10G) |
| **Estimated training time** | ~45 minutes (1,500 examples, 3 epochs) |

### 8.4 Evaluation Metrics

| Metric | Measurement | Target |
|---|---|---|
| **Math Accuracy** | % of L4 solutions with correct final answer | ≥ 95% |
| **Translation Quality** | BLEU score of L3 vs. reference Spanish | ≥ 0.70 |
| **Readability Compliance** | % of L1 scaffolds within target FK range | ≥ 90% |
| **Bilingual Annotation** | Avg bilingual terms in L2 per problem | ≥ 3.0 |
| **Latency** | Time-to-first-token on T4 GPU | < 200ms |
| **Throughput** | Full scaffold generation (all 4 levels) | < 800ms |
| **A/B Quality** | Human preference: SLM vs. Gemini (blind eval) | ≥ 45% SLM preferred |

### 8.5 Deployment Strategy

```
Week 1-2:  Collect Phase A data (Gemini on question DB) → 500 examples
Week 3-4:  V1 launch, collect Phase B data → +1,000 examples  
Week 5:    Human review → 1,500 quality examples → Fine-tune v1
Week 6:    Shadow deployment (SLM + Gemini, SLM served)
Week 7-8:  Quality monitoring, collect preferences
Week 9:    If SLM quality ≥ threshold → full cutover
           If not → collect more data, fine-tune v2, repeat
```

---

## 9. Data Collection Schema

### 9.1 Purpose

All student interactions are logged for three purposes:
1. **Real-time adaptation**: Feeding the adaptive engine within the current session
2. **Offline analysis**: Improving algorithm parameters, question calibration, and SLM training
3. **Research**: Validating the effectiveness of bilingual scaffolding for ELL math education

### 9.2 Interaction-Level Data

Each student-question interaction produces:

```typescript
interface InteractionRecord {
  // Identifiers
  studentId: string;          // Firebase UID (anonymized for research)
  sessionId: string;          // Session identifier
  interactionId: string;      // Unique interaction ID
  questionId: string;         // Question identifier
  timestamp: ISO8601;         // Interaction start time
  
  // Question context
  questionLevel: string;      // e.g., "2.3"
  questionTopic: string;      // e.g., "fractions"
  questionSubtopic: string;   // e.g., "multiplication"
  questionElo: number;        // Question's current Elo rating
  questionFK: number;         // Flesch-Kincaid grade level
  
  // Student state (before interaction)
  studentEloBefore: number;
  studentLevel: string;
  topicPKnow: number;         // BKT P(know) for this topic
  studentLDS5: number;        // 5-question rolling avg LDS
  studentMCS5: number;        // 5-question rolling avg MCS
  
  // Interaction data
  timeSpentMs: number;        // Total time on problem
  timeBeforeFirstHint: number; // Time before first hint (0 if no hints)
  hintsAccessed: number[];    // Hint levels accessed in order, e.g. [], [1], [1,2]
  hintTimestamps: {           // Timestamps of hint accesses
    L1?: number;
    L2?: number;
    L3?: number;
    L4?: number;
  };
  timePerHint: {              // Time spent at each hint level
    L1?: number;
    L2?: number;
    L3?: number;
    L4?: number;
  };
  maxHintLevel: number;       // 0-4
  answerAttempts: number;     // Number of attempts
  answers: string[];          // All attempted answers
  finalAnswer: string;        // Last submitted answer
  isCorrect: boolean;         // Whether final answer is correct
  
  // Computed features
  weightedOutcome: number;    // 0.0-1.0 (hint-weighted)
  lds: number;                // Language Dependency Score
  mcs: number;                // Math Confidence Score
  hintDepthNormalized: number;
  scaffoldTimeRatio: number;
  escalationSpeed: number;
  
  // Adaptive decisions
  studentEloAfter: number;
  adaptiveDecision: string;   // increase/maintain/decrease/skip/rapid_decrease
  nextLevel: string;          // Selected next level
  diagnosticQuadrant: string; // thriving/language_gap/math_struggle/dual_challenge
}
```

### 9.3 Session-Level Aggregates

```typescript
interface SessionRecord {
  sessionId: string;
  studentId: string;
  startTime: ISO8601;
  endTime: ISO8601;
  durationMs: number;
  
  // Performance
  questionsAttempted: number;
  questionsCorrect: number;
  avgWeightedOutcome: number;
  avgHintLevel: number;
  
  // Progression
  startElo: number;
  endElo: number;
  eloChange: number;
  startLevel: string;
  endLevel: string;
  levelsTraversed: string[];
  
  // Feature averages
  sessionLDS: number;
  sessionMCS: number;
  
  // Diagnostic
  dominantQuadrant: string;    // Most frequent diagnostic quadrant
  topicPerformance: Record<string, {
    attempts: number;
    avgOutcome: number;
    pKnow: number;
  }>;
  
  // Scaffold usage patterns
  hintDistribution: {
    noHint: number;           // Count of problems solved without hints
    L1Only: number;
    L2Used: number;
    L3Used: number;
    L4Used: number;
  };
}
```

### 9.4 Longitudinal Student Profile

```typescript
interface StudentProfile {
  studentId: string;
  createdAt: ISO8601;
  lastActive: ISO8601;
  
  // Current state
  currentElo: number;
  currentLevel: string;
  totalInteractions: number;
  totalSessions: number;
  
  // Topic mastery (BKT)
  topicMastery: Record<string, number>;  // P(know) per topic
  
  // Feature trends
  ldsHistory: number[];        // Session-level LDS over time
  mcsHistory: number[];        // Session-level MCS over time
  eloHistory: number[];        // Elo after each session
  
  // Learning trajectory
  avgLDSFirst5Sessions: number;
  avgLDSLast5Sessions: number;
  ldsImprovement: number;      // Percentage decrease in LDS
  avgMCSFirst5Sessions: number;
  avgMCSLast5Sessions: number;
  mcsImprovement: number;      // Percentage increase in MCS
  
  // Engagement
  avgSessionLength: number;    // Minutes
  sessionsPerWeek: number;
  completionRate: number;      // % of sessions completed (20/20)
  
  // Thompson priors (for state persistence)
  thompsonPriors: Record<string, { alpha: number; beta: number }>;
}
```

### 9.5 Privacy and Ethics

| Concern | Mitigation |
|---|---|
| Student is a minor (COPPA) | No PII beyond email/name; parental consent required |
| Performance data sensitivity | Elo/LDS/MCS stored under UID, not linked to real identity in analytics |
| Research use | Data anonymized (UID → random ID) before export; IRB approval required |
| Data retention | Interaction-level data retained for 2 years; aggregates indefinitely |
| Right to deletion | Firebase Auth deletion triggers cascade delete of all user data |

---

## 10. Evaluation Plan

### 10.1 Phase 1: Technical Validation (Pre-Launch)

**Objective**: Verify system components work correctly and produce expected behavior.

| Test | Method | Success Criterion |
|---|---|---|
| Adaptive engine convergence | Simulate 100 students × 50 interactions with known true levels | 90% of students within ±1 sub-level of true level by interaction 30 |
| Elo stability | 1000 simulated interactions per question | Question Elo ratings converge within ±30 of assigned level |
| BKT accuracy | Simulate known mastery states, measure P(know) accuracy | P(know) > 0.7 for mastered topics, < 0.3 for unmastered within 10 interactions |
| Thompson exploration | Cold-start simulation (all levels unexplored) | All 15 levels sampled at least once within first 30 interactions |
| LDS/MCS discrimination | Inject known behavioral patterns | LDS > 0.6 for simulated language-dependent profiles; MCS > 0.7 for math-competent profiles |
| Scaffold quality | 100 problems through Gemini scaffold pipeline | ≥ 95% mathematical accuracy, ≥ 90% readability compliance |
| End-to-end latency | 50 complete interaction cycles | Adaptive decision < 50ms; scaffold generation < 1.5s |

### 10.2 Phase 2: Pilot Study (Launch + 4 Weeks)

**Objective**: Validate effectiveness with real students in a controlled setting.

**Design**: Within-subjects pre/post with control group comparison

| Group | N | Treatment | Duration |
|---|---|---|---|
| **Treatment** | 30 students | MathLingua (adaptive + scaffolds) | 4 weeks, 3× per week |
| **Control** | 30 students | Same math problems, English-only, no scaffolding | 4 weeks, 3× per week |

**Instruments**:

1. **Pre-test**: Mathematics assessment in Spanish (establish math baseline) + English reading assessment (establish language baseline)
2. **Post-test**: Mathematics assessment in English (measure improvement) + same assessments as pre-test
3. **In-system metrics**: LDS trajectory, MCS trajectory, Elo progression, hint usage patterns

**Primary Outcome Measures**:

| Measure | Hypothesis | Test |
|---|---|---|
| Math score improvement (EN) | Treatment > Control | Independent t-test, d ≥ 0.5 |
| LDS reduction | Treatment shows ≥ 20% decrease | Paired t-test, pre vs. post |
| MCS stability | Treatment MCS does not decrease | One-sided paired t-test |
| Session completion rate | Treatment ≥ 80% | Descriptive |
| Scaffold fade-out | Avg hint level decreases over 4 weeks | Linear regression slope < 0 |

**Secondary Outcome Measures**:

| Measure | Instrument |
|---|---|
| Student engagement | Time on task, voluntary extra sessions |
| Mathematical self-efficacy | Adapted MSES (Mathematics Self-Efficacy Scale) |
| Language anxiety | Adapted FLCAS (Foreign Language Classroom Anxiety Scale) |
| Qualitative experience | Semi-structured interviews (N=10 treatment) |

### 10.3 Phase 3: Scale and Iteration (Launch + 3 Months)

**Objective**: Optimize algorithm parameters from pooled data; validate SLM quality.

| Activity | Data Required | Method |
|---|---|---|
| IRT calibration | ≥ 200 students × ≥ 50 questions | 2PL IRT model fit; replace initial Elo question ratings with IRT parameters |
| Feature weight optimization | ≥ 500 interaction records with outcomes | Logistic regression / gradient-boosted trees to optimize LDS/MCS weights |
| BKT parameter fitting | ≥ 100 students × ≥ 20 interactions per topic | EM algorithm per-topic parameter estimation |
| SLM quality assessment | ≥ 100 scaffold comparisons | Blind human preference evaluation (SLM vs. Gemini) |
| Algorithm A/B testing | ≥ 200 students split across variants | Compare engagement and outcome metrics across algorithm variants |

### 10.4 Phase 4: Long-Term Efficacy (Launch + 1 Year)

**Objective**: Measure impact on standardized test scores and language proficiency.

| Measure | Instrument | Expected Outcome |
|---|---|---|
| State math assessment | SBAC / STAAR (English) | Treatment students show larger gains |
| English proficiency | WIDA ACCESS | Treatment students show faster math-domain language growth |
| Long-term retention | 6-month follow-up assessment | Treatment gains persist |

---

## 11. Implementation Timeline

### Phase 1: MVP (Months 1–3)

| Month | Deliverables |
|---|---|
| **1** | Frontend scaffolding UI (L1–L4 display, hint tracking); Firebase setup (auth, Firestore schema); Gemini API integration for scaffold generation |
| **2** | Adaptive engine implementation in TypeScript (Elo + BKT + Thompson); Question database upload (130+ questions); LDS/MCS computation pipeline |
| **3** | End-to-end integration testing; Simulated student testing (100 profiles); Bug fixes and performance optimization; Deploy to Firebase Hosting |

### Phase 2: Pilot (Months 4–5)

| Month | Deliverables |
|---|---|
| **4** | Pilot launch with 30 treatment + 30 control students; Daily monitoring of system metrics; Weekly check-ins with teachers; Collect Gemini scaffold data for SLM training |
| **5** | Mid-pilot analysis and algorithm tuning; Begin SLM training data curation; Pilot completion and post-testing |

### Phase 3: SLM Transition (Months 6–8)

| Month | Deliverables |
|---|---|
| **6** | Curate 1,500 training examples; QLoRA fine-tune Qwen2.5-3B-Instruct v1; Deploy HF Inference Endpoint (shadow mode) |
| **7** | Shadow testing: SLM served, Gemini logged for comparison; Quality monitoring and iteration |
| **8** | SLM quality validated → full cutover; OR iterate (more data, retrain, repeat) |

### Phase 4: Scale (Months 9–12)

| Month | Deliverables |
|---|---|
| **9** | Open to additional schools (target: 200+ students); IRT calibration from pooled data; Question database expansion to 300+ |
| **10** | A/B testing of algorithm variants; DKT evaluation (if N > 500) |
| **11** | Feature weight optimization from real data; Dashboard for teachers (class-level analytics) |
| **12** | Long-term efficacy analysis; Research paper preparation; Open-source release of adaptive engine |

---

## 12. References

Abedi, J., & Lord, C. (2001). The language factor in mathematics tests. *Applied Measurement in Education*, 14(3), 219–234.

Chapelle, O., & Li, L. (2011). An empirical evaluation of Thompson Sampling. *Advances in Neural Information Processing Systems*, 24.

Cobbe, K., et al. (2021). Training verifiers to solve math word problems. *arXiv:2110.14168*.

Corbett, A. T., & Anderson, J. R. (1994). Knowledge tracing: Modeling the acquisition of procedural knowledge. *User Modeling and User-Adapted Interaction*, 4(4), 253–278.

Dettmers, T., et al. (2023). QLoRA: Efficient finetuning of quantized language models. *NeurIPS 2023*.

Hendrycks, D., et al. (2021). Measuring mathematical problem solving with the MATH dataset. *NeurIPS 2021*.

Huang, M., et al. (2024). Easy2Hard-Bench: Standardized difficulty labels for profiling LLM performance and generalization. *arXiv:2409.18433*.

Khisty, L. L., & Chval, K. B. (2002). Pedagogic discourse and equity in mathematics. *Mathematics Education Research Journal*, 14(3), 154–168.

Martiniello, M. (2008). Language and the performance of English-language learners in math word problems. *Harvard Educational Review*, 78(2), 333–368.

Microsoft. (2025). Phi-4 technical report. *arXiv:2412.08905*.

Moschkovich, J. (2002). A situated and sociocultural perspective on bilingual mathematics learners. *Mathematical Thinking and Learning*, 4(2–3), 189–212.

National Center for Education Statistics. (2023). English learners in public schools. *NCES Report*.

Pelánek, R. (2016). Applications of the Elo rating system in adaptive educational systems. *Computers & Education*, 98, 169–179.

Piech, C., et al. (2015). Deep knowledge tracing. *Advances in Neural Information Processing Systems*, 28.

Qwen Team. (2024). Qwen2.5 technical report. *arXiv:2412.15115*.

Scarlatos, A., et al. (2024). PSI-KT: Parameterized student interaction knowledge tracing. *EDM 2024*.

Shi, F., et al. (2022). Language models are multilingual chain-of-thought reasoners. *arXiv:2210.03057*.

Vygotsky, L. S. (1978). *Mind in Society: The Development of Higher Psychological Processes*. Harvard University Press.

---

## 13. Appendices

### Appendix A: Complete LDS Computation Example

**Scenario**: Student attempts a Level 2.3 problem (rate/ratio), uses L1 hint, then L2 hint, solves correctly on second attempt.

**Raw signals**:
- $h_i = 2$ (max hint = L2)
- $t_{pre} = 45s$ (45 seconds before first hint)
- $t_{total} = 120s$
- $t_{L1} = 15s$, $t_{L2} = 25s$, $t_{L3} = 0s$, $t_{L4} = 0s$
- $a_i = 2$ (two answer attempts)
- $c_i = 1$ (correct on second attempt)
- $\text{median\_time}(2.3) = 90s$

**Sub-feature computation**:

1. $D_{hint} = 2 / 4 = 0.50$
2. $R_{scaffold} = (15 + 25 + 0 + 0) / 120 = 40 / 120 = 0.333$
3. $E_{speed} = 1 - (45 / 90) = 1 - 0.5 = 0.50$
4. $F_{reveal} = 0$ (L4 not used)

**LDS**:
$$\text{LDS} = 0.35 \times 0.50 + 0.25 \times 0.333 + 0.20 \times 0.50 + 0.20 \times 0 = 0.175 + 0.083 + 0.10 + 0 = 0.358$$

**Interpretation**: Moderate dependency; 0.358 sits at the lower boundary of the moderate range (0.35–0.55). The student needed bilingual support but was not critically dependent. The system would maintain the current level.
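The computation above, as a self-contained function. A sketch: the weights and sub-features follow the LDS definition, the signal names are ours, and treating $E_{speed}$ as 0 when no hint was opened is our assumption:

```typescript
// LDS = 0.35*D_hint + 0.25*R_scaffold + 0.20*E_speed + 0.20*F_reveal
interface LDSSignals {
  maxHintLevel: number;         // h_i, 0-4
  timeBeforeFirstHintS: number; // t_pre
  totalTimeS: number;           // t_total
  scaffoldTimeS: number;        // total seconds spent reading hints
  medianTimeS: number;          // median solve time at this level
}

function computeLDS(s: LDSSignals): number {
  const dHint = s.maxHintLevel / 4;
  const rScaffold = s.scaffoldTimeS / s.totalTimeS;
  // Escalation speed only applies when a hint was actually opened
  // (assumption: no hints => E_speed = 0).
  const eSpeed = s.maxHintLevel === 0
    ? 0
    : Math.max(0, 1 - s.timeBeforeFirstHintS / s.medianTimeS);
  const fReveal = s.maxHintLevel >= 4 ? 1 : 0;
  return 0.35 * dHint + 0.25 * rScaffold + 0.20 * eSpeed + 0.20 * fReveal;
}
```

Plugging in the scenario's raw signals reproduces LDS ≈ 0.358.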

### Appendix B: Complete MCS Computation (Same Scenario)

**Sub-feature computation**:

5. $C_{correct} = 1$ (correct)
6. $S_{speed} = \text{clamp}(90 / 120, 0, 1) = 0.75$
7. $A_{efficiency} = 1 / 2 = 0.50$
8. $(1 - \text{LDS}) = 1 - 0.358 = 0.642$

**MCS**:
$$\text{MCS} = 0.30 \times 1.0 + 0.25 \times 0.75 + 0.20 \times 0.50 + 0.25 \times 0.642 = 0.30 + 0.1875 + 0.10 + 0.1605 = 0.748$$

**Interpretation**: Moderate-to-strong math confidence. Despite needing L2 scaffolding, the student demonstrated solid mathematical ability. Diagnostic: borderline between the "Thriving" and "Language Gap" quadrants (MCS 0.748 ≥ 0.6; LDS 0.358 sits just under the 0.4 threshold).
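The same computation as code, with the MCS weights (0.30/0.25/0.20/0.25) from Section 5. A sketch with illustrative names; `lds` is supplied externally (here, the Appendix A value):

```typescript
// MCS = 0.30*C_correct + 0.25*S_speed + 0.20*A_efficiency + 0.25*(1 - LDS)
function clamp(x: number, lo: number, hi: number): number {
  return Math.min(hi, Math.max(lo, x));
}

function computeMCS(
  correct: boolean,
  medianTimeS: number, // median solve time at this level
  totalTimeS: number,  // this student's total time on the problem
  attempts: number,    // answer attempts (>= 1)
  lds: number,         // Language Dependency Score for this interaction
): number {
  const cCorrect = correct ? 1 : 0;
  const sSpeed = clamp(medianTimeS / totalTimeS, 0, 1);
  const aEfficiency = 1 / attempts;
  return 0.30 * cCorrect + 0.25 * sSpeed + 0.20 * aEfficiency + 0.25 * (1 - lds);
}
```

With the scenario's values (correct, 90s median vs. 120s spent, 2 attempts, LDS 0.358), this returns MCS ≈ 0.748.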

### Appendix C: Elo Update Example (Same Scenario)

**Before**: $R_s = 1050$, $R_q = 1150$ (Level 2.3 question)

**Weighted outcome**: Correct with L2 → $O_s = 0.50$

**Expected outcome**: $E_s = 1 / (1 + 10^{(1150 - 1050) / 400}) = 1 / (1 + 10^{0.25}) = 1 / (1 + 1.778) = 0.360$

**Update** ($K_s = 32$): $R_s' = 1050 + 32 \times (0.50 - 0.360) = 1050 + 32 \times 0.14 = 1050 + 4.48 = 1054.5$

**Update question** ($K_q = 6$): $R_q' = 1150 + 6 \times (0.360 - 0.50) = 1150 + 6 \times (-0.14) = 1150 - 0.84 = 1149.2$

**Interpretation**: Student's Elo increased slightly (outperformed expectation even with L2 hint), question's Elo barely changed (stable calibration).

### Appendix D: BKT Update Example

**Before**: $P(L_n) = 0.45$ for topic "rates" (the student's topic)

**Outcome**: $O_s = 0.50$ (treated as correct, since ≥ 0.5)

**Slip adjustment**: $P(S)_{adj} = 0.10 \times (1 + 0.5 \times 0.50) = 0.10 \times 1.25 = 0.125$

**Posterior given correct**:
$$P(L_n | correct) = \frac{0.45 \times (1 - 0.125)}{0.45 \times (1 - 0.125) + 0.55 \times 0.25} = \frac{0.39375}{0.39375 + 0.1375} = \frac{0.39375}{0.53125} = 0.741$$

**After learning transition**:
$$P(L_n) = 0.741 + (1 - 0.741) \times 0.15 = 0.741 + 0.039 = 0.780$$

**Interpretation**: Topic mastery estimate jumped from 0.45 to 0.78, reflecting that a correct response (even with L2 scaffold, captured by adjusted slip) substantially increased our belief that the student knows "rates."
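The full update chains the slip adjustment, the Bayes posterior, and the learning transition. A sketch with the parameters implied by the numbers above (guess $P(G) = 0.25$, base slip $P(S) = 0.10$, transition $P(T) = 0.15$); names are illustrative:

```python
P_GUESS, P_SLIP_BASE, P_TRANSIT = 0.25, 0.10, 0.15

def bkt_update_correct(p_mastery, weighted_outcome):
    """BKT mastery update after a correct answer, with the slip probability
    inflated by scaffold reliance (captured via the weighted outcome)."""
    p_slip = P_SLIP_BASE * (1 + 0.5 * weighted_outcome)  # 0.125 here
    evidence_known = p_mastery * (1 - p_slip)
    evidence_guess = (1 - p_mastery) * P_GUESS
    posterior = evidence_known / (evidence_known + evidence_guess)
    return posterior + (1 - posterior) * P_TRANSIT  # learning transition

print(round(bkt_update_correct(0.45, 0.50), 3))  # 0.78
```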

### Appendix E: Thompson Sampling Selection Example

**Student Elo**: 1054.5 (from Appendix C update)
**Current Level**: 2.1

**ZPD Window**: [1.4, 2.4] (current - 2 to current + 3, in sub-level steps)

**Current priors** (after 12 interactions):

| Level | α | β | Elo |
|---|---|---|---|
| 1.4 | 8.2 | 2.1 | 975 |
| 1.5 | 7.5 | 3.0 | 1025 |
| 2.1 | 5.8 | 4.2 | 1075 |
| 2.2 | 3.1 | 3.5 | 1125 |
| 2.3 | 1.8 | 2.3 | 1175 |
| 2.4 | 1.2 | 1.5 | 1225 |

**Sampled** (one draw):

| Level | $\hat{\theta}_l$ | Proximity Bonus | Score |
|---|---|---|---|
| 1.4 | 0.82 | $\exp(-(975-1054.5)^2/20000) = 0.73$ | 0.598 |
| 1.5 | 0.68 | $\exp(-(1025-1054.5)^2/20000) = 0.96$ | **0.651** |
| 2.1 | 0.61 | $\exp(-(1075-1054.5)^2/20000) = 0.98$ | 0.597 |
| 2.2 | 0.55 | $\exp(-(1125-1054.5)^2/20000) = 0.78$ | 0.429 |
| 2.3 | 0.42 | $\exp(-(1175-1054.5)^2/20000) = 0.48$ | 0.203 |
| 2.4 | 0.38 | $\exp(-(1225-1054.5)^2/20000) = 0.23$ | 0.089 |

**Selected**: Level 1.5 (highest score = 0.651)

Note, however, that the decision orchestrator said MAINTAIN (the weighted outcome of 0.50 falls in the maintain range), while Thompson selected 1.5, below the current level 2.1. Because the decision is MAINTAIN rather than DECREASE, the override rule kicks in: the system serves a question from level 2.1 (the current level), not 1.5. Thompson's selection is overridden to respect the progression decision.

This demonstrates how the decision orchestrator provides guardrails against Thompson's stochastic exploration when the student's recent performance doesn't support a level change.
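The selection-plus-override logic can be sketched as follows. The Gaussian denominator 20000 comes from the proximity formula in the table; the function names and the shape of the priors dictionary are illustrative assumptions.

```python
import math
import random

def proximity(level_elo, student_elo, variance=20000):
    """Gaussian bonus favoring levels whose Elo is near the student's."""
    return math.exp(-((level_elo - student_elo) ** 2) / variance)

def select_level(priors, student_elo, current_level, decision, rng=random):
    """One Thompson draw per level, scaled by proximity; the orchestrator's
    MAINTAIN decision overrides any draw away from the current level."""
    scores = {
        level: rng.betavariate(alpha, beta) * proximity(elo, student_elo)
        for level, (alpha, beta, elo) in priors.items()
    }
    chosen = max(scores, key=scores.get)
    if decision == "MAINTAIN" and chosen != current_level:
        chosen = current_level  # guardrail against stochastic exploration
    return chosen

# (alpha, beta, level Elo) per ZPD level, from the priors table above
priors = {1.4: (8.2, 2.1, 975), 1.5: (7.5, 3.0, 1025), 2.1: (5.8, 4.2, 1075),
          2.2: (3.1, 3.5, 1125), 2.3: (1.8, 2.3, 1175), 2.4: (1.2, 1.5, 1225)}
level = select_level(priors, 1054.5, 2.1, "MAINTAIN")
print(level)  # 2.1 (MAINTAIN pins the current level regardless of the draw)
```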

### Appendix F: Glossary

| Term | Definition |
|---|---|
| **BKT** | Bayesian Knowledge Tracing — HMM-based model of student knowledge per topic |
| **COPPA** | Children's Online Privacy Protection Act — US law for minors' data |
| **DKT** | Deep Knowledge Tracing — RNN-based alternative to BKT |
| **ELL** | English Language Learner |
| **Elo** | Rating system in which two opponents' ratings are updated after each match based on expected versus actual outcome |
| **FERPA** | Family Educational Rights and Privacy Act — US law for student records |
| **FK** | Flesch-Kincaid readability metric |
| **HMM** | Hidden Markov Model |
| **IRT** | Item Response Theory — psychometric model of test items and examinees |
| **L1** | First language (Spanish, in MathLingua's context) |
| **L2** | Second language (English, in MathLingua's context) |
| **LDS** | Language Dependency Score — engineered feature measuring scaffold reliance |
| **MCS** | Math Confidence Score — engineered feature measuring mathematical ability |
| **QLoRA** | Quantized Low-Rank Adaptation — memory-efficient fine-tuning method |
| **SLM** | Small Language Model (< 4B parameters) |
| **Thompson Sampling** | Bayesian bandit algorithm for exploration/exploitation |
| **WIDA** | Standards framework for English language proficiency in U.S. schools |
| **ZPD** | Zone of Proximal Development — Vygotsky's learning theory concept |

---

*End of Technical Specification — MathLingua v1.0*