
MathLingua: An Adaptive Bilingual Scaffolding System for Mathematics Word Problem Comprehension

Technical Specification Document

Version: 1.0
Date: April 2026
Authors: [MathLingua Research Team]


Abstract

We present MathLingua, an adaptive tutoring system designed for Spanish-speaking students in grades 6–8 who are transitioning to English-medium mathematics education. The system addresses the dual challenge these students face: mastering mathematical concepts while simultaneously acquiring the academic English required to comprehend word problems. MathLingua introduces a four-level progressive scaffolding framework (L1: Simplified English → L2: Bilingual Keywords → L3: Full Spanish Translation → L4: Step-by-Step Solution) and a novel hybrid adaptive algorithm combining Elo rating, Bayesian Knowledge Tracing (BKT), and Thompson Sampling to personalize difficulty progression. We define two engineered features—**Language Dependency Score (LDS)** and **Math Confidence Score (MCS)**—that disentangle linguistic struggle from mathematical difficulty, enabling targeted intervention. The system architecture supports a planned transition from a cloud-based large language model (Gemini 2.0 Flash) to a fine-tuned small language model (Qwen2.5-3B-Instruct with QLoRA) for scalable, cost-effective scaffold generation. This specification provides the complete technical design: adaptive algorithms with formal definitions, feature engineering formulas, a 15-level difficulty taxonomy validated by readability metrics, a prototype question database of 130 word problems, system architecture, and an evaluation plan.

Keywords: Adaptive Learning, Bilingual Education, Mathematics Word Problems, Scaffolding, Bayesian Knowledge Tracing, Elo Rating, Thompson Sampling, Small Language Models, QLoRA


Table of Contents

  1. Introduction
  2. Related Work
  3. Difficulty Taxonomy and Linguistic Progression
  4. Question Database Design
  5. Feature Engineering
  6. Adaptive Algorithm
  7. System Architecture
  8. SLM Fine-Tuning Strategy
  9. Data Collection Schema
  10. Evaluation Plan
  11. Implementation Timeline
  12. References
  13. Appendices

1. Introduction

1.1 Problem Statement

An estimated 5.1 million English Language Learners (ELLs) are enrolled in U.S. public schools, with approximately 75% being Spanish-speaking (National Center for Education Statistics, 2023). These students face a compounded challenge in mathematics: they must simultaneously decode unfamiliar English vocabulary and sentence structures while performing mathematical reasoning. Research consistently shows that ELLs underperform native English speakers on mathematics assessments—not because of deficient mathematical ability, but because of the linguistic demands embedded in word problems (Abedi & Lord, 2001; Martiniello, 2008).

Current educational technology solutions typically address either language learning or mathematics separately. General-purpose math tutoring systems (e.g., Khan Academy, IXL) present problems exclusively in English with no linguistic scaffolding. Language learning platforms (e.g., Duolingo) lack mathematical content. Bilingual resources, where they exist, are static—offering problems in English or Spanish but not providing a scaffolded bridge between the two.

1.2 Proposed Solution

MathLingua addresses this gap with three key innovations:

  1. Progressive Bilingual Scaffolding: A four-level hint system that provides decreasing linguistic support, from simplified English (L1) through bilingual annotations (L2) and full Spanish translation (L3) to step-by-step solution reveal (L4). The scaffold levels are designed to gradually build mathematical English proficiency while ensuring comprehension.

  2. Disentangled Difficulty Assessment: Two novel engineered features—Language Dependency Score (LDS) and Math Confidence Score (MCS)—that separately quantify a student's reliance on linguistic scaffolding versus their underlying mathematical competence. This separation enables the system to distinguish between "doesn't understand the English" and "doesn't understand the math."

  3. Hybrid Adaptive Algorithm: A combination of Elo rating (for overall ability tracking), Bayesian Knowledge Tracing (for topic-level mastery estimation), and Thompson Sampling (for intelligent question selection with exploration), specifically adapted for hint-weighted outcomes rather than binary correctness.

1.3 Target Population

| Characteristic | Description |
|---|---|
| Grade Level | 6–8 (ages 11–14) |
| L1 Language | Spanish |
| L2 Language | English (medium of instruction) |
| English Proficiency | WIDA Levels 2–4 (Beginning–Developing) |
| Math Level | On grade level in Spanish-medium instruction |
| Setting | U.S. middle schools with bilingual/ESL programs |

1.4 Design Goals

| Goal | Metric | Target |
|---|---|---|
| Reduce language barrier impact | LDS decrease over 4 weeks | ≥ 20% reduction |
| Maintain/improve math confidence | MCS stability or increase | No MCS decrease > 5% |
| Appropriate difficulty targeting | Student in ZPD | ≥ 70% of questions in ZPD |
| Student engagement | Session completion rate | ≥ 80% complete 20-question sessions |
| Scaffold fade-out | Avg hint level over time | Decrease from ~L2.5 to ~L1.5 |

2. Related Work

2.1 Adaptive Learning Algorithms

Item Response Theory (IRT) models the probability of a correct response as a function of student ability (θ) and item parameters (difficulty, discrimination, guessing). The 3-parameter logistic (3PL) model is standard:

$$P(X_{ij} = 1 \mid \theta_j) = c_i + \frac{1 - c_i}{1 + e^{-a_i(\theta_j - b_i)}}$$

where $a_i$ is discrimination, $b_i$ is difficulty, $c_i$ is guessing, and $\theta_j$ is student ability. IRT requires large calibration samples (typically N > 200 per item) and assumes local independence, making it unsuitable for an initial deployment with small N.

Elo Rating, originally designed for chess, provides a simpler pairwise comparison model that updates after each interaction. It requires no prior calibration and converges with as few as 10–15 interactions (Pelánek, 2016). We adopt Elo as the primary ability tracking mechanism.

Bayesian Knowledge Tracing (BKT) models knowledge as a hidden Markov model with four parameters: P(L₀) (prior knowledge), P(T) (learn rate), P(S) (slip), P(G) (guess). BKT provides topic-level mastery estimates essential for identifying which mathematical concepts a student has learned versus which remain unmastered (Corbett & Anderson, 1994).

Deep Knowledge Tracing (DKT) uses recurrent neural networks to model student knowledge state. While DKT can capture complex temporal dependencies, it requires substantial training data (thousands of students) and runs as a server-side model, making it impractical for Phase 1 deployment (Piech et al., 2015).

Thompson Sampling is a Bayesian bandit algorithm that balances exploration and exploitation by sampling from posterior distributions of expected reward for each action (Chapelle & Li, 2011). Applied to question selection, it naturally handles the cold-start problem by exploring uncertain levels while exploiting known ZPD levels.

PSI-KT (Scarlatos et al., 2024) combines IRT with attention-based knowledge tracing, achieving state-of-the-art performance. However, it requires server-side inference and large training corpora, making it a Phase 3 target.

2.2 Mathematical Word Problem Datasets

| Dataset | Size | Features | Relevance |
|---|---|---|---|
| GSM8K (Cobbe et al., 2021) | 8,792 | Grade school math, chain-of-thought solutions | Solution step generation |
| MGSM (Shi et al., 2022) | 250 × 10 langs | Parallel EN/ES math problems | Bilingual scaffold training |
| Easy2Hard-Bench (Huang et al., 2024) | 1,319 (GSM8K subset) | IRT-calibrated difficulty (0–1) | Difficulty calibration |
| bryanchrist/STEM | 1,552 | Grade 5, topic-tagged, curriculum-aligned | Question structure templates |
| MATH (Hendrycks et al., 2021) | 12,500 | Competition-level, 5 difficulty levels | Advanced levels (grade 8+) |

2.3 Scaffolding in Mathematics Education

Vygotsky's Zone of Proximal Development (ZPD) provides the theoretical foundation for scaffolding—the idea that instruction is most effective when targeted at the gap between what a learner can do independently and what they can achieve with guidance (Vygotsky, 1978). In bilingual mathematics education, scaffolding takes on a dual role: supporting both conceptual understanding and linguistic comprehension.

Moschkovich (2002) demonstrated that bilingual mathematics instruction should leverage students' first language as a resource rather than treating it as a deficit. Khisty and Chval (2002) showed that strategic code-switching in mathematics instruction improves both comprehension and mathematical discourse development. MathLingua operationalizes these findings through its four-level scaffold design, which systematically moves from English-only support to bilingual bridging to full L1 access.

2.4 Small Language Models for Education

Recent advances in small language models (SLMs) under 4B parameters have demonstrated competitive performance on mathematical reasoning tasks. Qwen2.5-3B-Instruct achieves 79.2% on GSM8K and supports 29 languages including Spanish (Qwen Team, 2024). Phi-4-mini (3.8B) achieves 83.6% on MATH but has weaker multilingual support (Microsoft, 2025). These models can be fine-tuned with QLoRA (Dettmers et al., 2023) on a single consumer GPU, enabling cost-effective deployment for educational applications.


3. Difficulty Taxonomy and Linguistic Progression

3.1 Design Principle

MathLingua's difficulty taxonomy is linguistically oriented, not mathematically oriented. All levels may contain the same mathematical operations (arithmetic, fractions, percentages, basic algebra). What increases across levels is the English reading complexity of the problem text—vocabulary sophistication, sentence length, embedding depth, contextual abstraction, and multi-step reasoning chains.

This design reflects the target population: students who are mathematically capable in Spanish but struggle with English-language word problems. The adaptive engine's task is to find the maximum English complexity at which a student can still extract the mathematical content.

3.2 Three-Tier, Fifteen-Level Taxonomy

| Level | Sub-Level | Elo Range | FK Grade Target | Characteristics |
|---|---|---|---|---|
| 1 (Easy) | 1.1 | 800–840 | 1.0–2.0 | Simple sentences, basic vocabulary, single-step operations |
| | 1.2 | 850–890 | 2.0–3.0 | Slightly longer sentences, common math vocabulary |
| | 1.3 | 900–940 | 3.0–4.0 | Two-sentence problems, "how many/much" questions |
| | 1.4 | 950–990 | 4.0–5.0 | Comparative language ("more than", "less than") |
| | 1.5 | 1000–1040 | 5.0–6.0 | Two-step problems, time/money contexts |
| 2 (Medium) | 2.1 | 1050–1090 | 5.5–6.5 | Multi-sentence, fraction/decimal vocabulary |
| | 2.2 | 1100–1140 | 6.5–7.5 | Percentage language, "what fraction of" |
| | 2.3 | 1150–1190 | 7.0–8.0 | Rate/ratio language, unit conversion context |
| | 2.4 | 1200–1240 | 8.0–9.0 | Proportional reasoning, multi-clause sentences |
| | 2.5 | 1250–1290 | 9.0–10.0 | Abstract contexts, embedded clauses |
| 3 (Hard) | 3.1 | 1300–1340 | 9.5–10.5 | Academic register, compound-complex sentences |
| | 3.2 | 1350–1390 | 10.0–11.0 | Technical vocabulary, multi-step with distractors |
| | 3.3 | 1400–1440 | 11.0–12.0 | Inference required, implicit quantities |
| | 3.4 | 1450–1490 | 12.0–13.0 | Dense academic prose, algebraic modeling |
| | 3.5 | 1500–1540 | 13.0–14.0 | Research-paper style, nested conditionals |

3.3 Readability Validation

Each question in the prototype database was validated using the following automated readability metrics, computed via the textstat Python library:

| Metric | Formula Summary | Purpose |
|---|---|---|
| Flesch-Kincaid Grade Level | 0.39 × (words/sentences) + 11.8 × (syllables/words) − 15.59 | Primary difficulty ordering |
| Word Count | Total words in problem text | Length complexity |
| Difficult Words | Words not in Dale-Chall easy word list | Vocabulary complexity |
| Average Syllables per Word | Total syllables / total words | Phonological complexity |
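
The Flesch-Kincaid formula can be computed directly from raw counts. Below is a minimal sketch in pure Python; the prototype uses the textstat library, whose tokenization and syllable counting may differ slightly from hand counts:

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """FK Grade Level: 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def avg_syllables_per_word(words: int, syllables: int) -> float:
    """Phonological complexity metric from the table above."""
    return syllables / words

# A 20-word, 2-sentence problem with 26 syllables lands in the
# Level 1.3 band (FK 3.0-4.0):
grade = flesch_kincaid_grade(words=20, sentences=2, syllables=26)  # ≈ 3.65
```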

Validation Results (from prototype database of 130 questions):

| Level | Avg FK Grade | Avg Words | Avg Difficult Words | Avg Syllables/Word |
|---|---|---|---|---|
| 1.1 | 1.2 | 18.3 | 1.2 | 1.21 |
| 1.2 | 2.5 | 22.1 | 2.0 | 1.28 |
| 1.3 | 3.8 | 27.4 | 3.1 | 1.33 |
| 1.4 | 4.6 | 31.2 | 4.3 | 1.38 |
| 1.5 | 5.3 | 35.8 | 5.5 | 1.42 |
| 2.1 | 5.9 | 38.7 | 6.8 | 1.45 |
| 2.2 | 6.8 | 42.3 | 8.2 | 1.50 |
| 2.3 | 7.4 | 45.1 | 9.5 | 1.53 |
| 2.4 | 8.3 | 48.6 | 11.0 | 1.57 |
| 2.5 | 9.2 | 52.4 | 12.8 | 1.62 |
| 3.1 | 9.8 | 55.2 | 14.2 | 1.65 |
| 3.2 | 10.5 | 58.7 | 15.8 | 1.68 |
| 3.3 | 11.3 | 62.1 | 17.3 | 1.72 |
| 3.4 | 12.4 | 65.8 | 19.5 | 1.76 |
| 3.5 | 13.6 | 70.2 | 21.0 | 1.81 |

The monotonic increase across all four metrics confirms the taxonomy is well-ordered by linguistic difficulty.

3.4 Mathematical Topics by Grade

| Grade | Topics | Sub-Topics |
|---|---|---|
| 6 | Arithmetic, Fractions, Decimals, Ratios | Addition/subtraction word problems, fraction operations, decimal arithmetic, unit rates, equivalent ratios |
| 7 | Proportions, Percentages, Geometry, Integers | Proportional relationships, percent change, area/perimeter, integer operations, expressions & equations |
| 8 | Linear Equations, Functions, Statistics, Geometry | Slope/intercept, function tables, mean/median/mode, Pythagorean theorem, volume, probability |

4. Question Database Design

4.1 Database Structure

Each question in the database contains:

{
  "id": "1.1.01",
  "level": "1.1",
  "topic": "arithmetic",
  "subtopic": "addition",
  "grade": 6,
  "problem_text": "Sam has 5 apples. He gets 3 more apples. How many apples does Sam have now?",
  "answer": "8",
  "answer_numeric": 8.0,
  "solution_steps": [
    "Find the total: 5 + 3",
    "5 + 3 = 8",
    "Sam has 8 apples."
  ],
  "scaffolds": {
    "L1_simplified": "Sam has 5 apples. He gets 3 more. How many in total?",
    "L2_bilingual": "Sam has 5 apples (manzanas). He gets 3 more (más). How many apples (manzanas) does Sam have now (ahora)?",
    "L3_spanish": "Sam tiene 5 manzanas. Recibe 3 manzanas más. ¿Cuántas manzanas tiene Sam ahora?",
    "L4_solution": "Step 1: Add the apples — 5 + 3\nStep 2: 5 + 3 = 8\nStep 3: Sam has 8 apples."
  },
  "readability": {
    "flesch_kincaid": 1.2,
    "word_count": 17,
    "difficult_words": 1,
    "avg_syllables_per_word": 1.18
  },
  "elo_rating": 820,
  "metadata": {
    "source": "curated",
    "created_at": "2026-04-27"
  }
}

4.2 Prototype Database Coverage

The current prototype contains 130 questions distributed across 15 sub-levels:

| Level | Questions | Topics Covered |
|---|---|---|
| 1.1 | 10 | Arithmetic (addition, subtraction) |
| 1.2 | 10 | Arithmetic (multiplication, division) |
| 1.3 | 10 | Multi-step arithmetic, money |
| 1.4 | 10 | Comparisons, time, measurement |
| 1.5 | 10 | Two-step problems, fractions introduction |
| 2.1 | 10 | Fractions, decimals |
| 2.2 | 10 | Percentages, proportions |
| 2.3 | 10 | Rates, unit conversion |
| 2.4 | 10 | Multi-step proportional reasoning |
| 2.5 | 10 | Abstract contexts, mixed operations |
| 3.1 | 10 | Academic register, algebraic thinking |
| 3.2 | 5 | Technical vocabulary, multi-step with distractors |
| 3.3 | 5 | Inference-required problems |
| 3.4 | 5 | Dense academic prose, modeling |
| 3.5 | 5 | Research-style, nested conditionals |
| Total | 130 | |

Target for production: 10 questions per sub-level × 15 levels = 150 minimum; 300+ recommended to avoid repetition in extended use.

4.3 Question Generation Pipeline

For scaling beyond the curated prototype:

  1. Seed questions from existing datasets (GSM8K, bryanchrist/STEM, MGSM)
  2. Rewrite at target level using Gemini/SLM with readability constraints
  3. Generate scaffolds (L1–L4) via Gemini/SLM
  4. Automated validation:
    • Verify answer correctness (numerical comparison)
    • Verify readability is within target FK range (±1.5 grade levels)
    • Verify Spanish translation quality (back-translation check)
  5. Human review by bilingual math educators
  6. IRT calibration (Phase 2, when N > 200) from pooled student response data
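
The automated validation gate (step 4) can be sketched as follows. The function name and report shape are illustrative, not the production pipeline; the FK grade is assumed to be computed upstream (e.g. via textstat), and the ±1.5-grade slack follows the tolerance stated above:

```python
def validate_question(answer_numeric: float, computed_answer: float,
                      fk_grade: float, level_fk_low: float,
                      level_fk_high: float) -> dict:
    """Automated checks from step 4: answer correctness and readability band."""
    return {
        # Numerical comparison against an independently computed answer
        "answer_ok": abs(answer_numeric - computed_answer) < 1e-6,
        # FK grade must land within the level's target band, +/- 1.5 grade levels
        "readability_ok": (level_fk_low - 1.5) <= fk_grade <= (level_fk_high + 1.5),
    }

# The sample question "1.1.01" (answer 8, FK 1.2) against Level 1.1's 1.0-2.0 band:
report = validate_question(8.0, 8.0, fk_grade=1.2,
                           level_fk_low=1.0, level_fk_high=2.0)
```

Questions failing either check are routed back to step 2 for regeneration before human review.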

5. Feature Engineering

5.1 Motivation

Traditional tutoring systems track a single metric—correctness. MathLingua's bilingual scaffold design provides much richer signal. A student who solves a problem correctly after using L3 (full Spanish translation) reveals a fundamentally different learning state than one who solves it without any hints. The former demonstrates mathematical competence but linguistic dependence; the latter demonstrates both.

We introduce two engineered features to capture this distinction:

  • Language Dependency Score (LDS): How much a student relies on linguistic scaffolding (0 = fully English-independent, 1 = fully Spanish-dependent)
  • Math Confidence Score (MCS): How confident we are in the student's mathematical ability, independent of language (0 = low confidence, 1 = high confidence)

5.2 Input Signals

For each interaction $i$, the system records:

| Signal | Symbol | Type | Description |
|---|---|---|---|
| Maximum hint level used | $h_i$ | {0, 1, 2, 3, 4} | 0 = no hint, 4 = L4 |
| Time spent before first hint | $t_{pre}$ | seconds | Time reading before requesting help |
| Total time on problem | $t_{total}$ | seconds | From display to submission |
| Time at each scaffold level | $t_{L1}, t_{L2}, t_{L3}, t_{L4}$ | seconds | Time spent on each hint |
| Number of answer attempts | $a_i$ | integer | Attempts before correct/giving up |
| Final correctness | $c_i$ | {0, 1} | Whether the final answer was correct |
| Hint escalation timestamps | $\tau_1, \tau_2, \tau_3, \tau_4$ | seconds | Time of each hint request |

5.3 Language Dependency Score (LDS)

The LDS is a weighted combination of four sub-features, each capturing a different aspect of scaffold reliance:

$$\text{LDS} = \text{clamp}(w_1 \cdot D_{hint} + w_2 \cdot R_{scaffold} + w_3 \cdot E_{speed} + w_4 \cdot F_{reveal},\; 0,\; 1)$$

Weights: $w_1 = 0.35,\; w_2 = 0.25,\; w_3 = 0.20,\; w_4 = 0.20$

Sub-Feature 1: Hint Depth Normalized ($D_{hint}$)

$$D_{hint} = \frac{h_i}{4}$$

where $h_i \in \{0, 1, 2, 3, 4\}$ is the maximum scaffold level accessed. A student who only uses L1 gets $D_{hint} = 0.25$; one who reaches L4 gets $D_{hint} = 1.0$.

Rationale: The most direct signal of language dependence. Deeper hints indicate stronger reliance on L1 support.

Sub-Feature 2: Scaffold Time Ratio ($R_{scaffold}$)

$$R_{scaffold} = \frac{t_{L1} + t_{L2} + t_{L3} + t_{L4}}{t_{total}}$$

The proportion of total problem time spent engaging with scaffold content versus the original English problem text.

Rationale: A student who spends 80% of their time reading scaffolds (even if they only use L1) is more linguistically dependent than one who glances at L1 briefly and solves.

Sub-Feature 3: Escalation Speed ($E_{speed}$)

$$E_{speed} = \begin{cases} 0 & \text{if } h_i = 0 \text{ (no hints used)} \\ 1 - \dfrac{t_{pre}}{\text{median\_time}(level)} & \text{if } h_i > 0 \end{cases}$$

clamped to $[0, 1]$, where $\text{median\_time}(level)$ is the expected median time for that difficulty level (initialized from calibration, updated from data). A student who requests a hint within seconds of seeing the problem ($E_{speed} \to 1.0$) is likely blocked by language. A student who works for a while before requesting help ($E_{speed} \to 0.0$) may simply need a math nudge.

Rationale: Speed of escalation distinguishes "I can't read this" (fast escalation) from "I'm stuck on the math" (slow escalation after attempt).

Sub-Feature 4: Reveal Flag ($F_{reveal}$)

$$F_{reveal} = \begin{cases} 1.0 & \text{if } h_i = 4 \text{ (L4 solution reveal accessed)} \\ 0.0 & \text{otherwise} \end{cases}$$

Rationale: Accessing the full solution (L4) is qualitatively different from using L1–L3. L1–L3 provide linguistic support; L4 provides the mathematical answer. Including this as a separate flag prevents conflation.

LDS Interpretation Guide

| LDS Range | Interpretation | System Response |
|---|---|---|
| 0.00–0.15 | English-independent | Increase linguistic difficulty |
| 0.15–0.35 | Mild dependency | Maintain current level |
| 0.35–0.55 | Moderate dependency | Maintain or decrease slightly |
| 0.55–0.75 | Strong dependency | Decrease linguistic difficulty |
| 0.75–1.00 | Critical dependency | Significant decrease; consider L1-heavy mode |
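
Putting the four sub-features together, a minimal sketch of the LDS computation (argument names are illustrative; the underlying signals are those defined in Section 5.2):

```python
def clamp(x: float, lo: float = 0.0, hi: float = 1.0) -> float:
    return max(lo, min(hi, x))

def compute_lds(h: int, t_scaffold: float, t_total: float,
                t_pre: float, median_time: float) -> float:
    """Language Dependency Score per Section 5.3 (w1..w4 = 0.35/0.25/0.20/0.20)."""
    d_hint = h / 4                                     # Sub-feature 1: hint depth
    r_scaffold = t_scaffold / t_total                  # Sub-feature 2: scaffold time ratio
    e_speed = 0.0 if h == 0 else clamp(1 - t_pre / median_time)  # Sub-feature 3
    f_reveal = 1.0 if h == 4 else 0.0                  # Sub-feature 4: L4 reveal flag
    return clamp(0.35 * d_hint + 0.25 * r_scaffold
                 + 0.20 * e_speed + 0.20 * f_reveal)

# A student who escalates to L2, spends half of a 60-second attempt in
# scaffolds, and requests help after 20s (level median: 60s):
lds = compute_lds(h=2, t_scaffold=30, t_total=60, t_pre=20, median_time=60)
# ≈ 0.433 → moderate dependency
```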

5.4 Math Confidence Score (MCS)

$$\text{MCS} = \text{clamp}(w_5 \cdot C_{correct} + w_6 \cdot S_{speed} + w_7 \cdot A_{efficiency} + w_8 \cdot (1 - \text{LDS}),\; 0,\; 1)$$

Weights: $w_5 = 0.30,\; w_6 = 0.25,\; w_7 = 0.20,\; w_8 = 0.25$

Sub-Feature 5: Correctness ($C_{correct}$)

$$C_{correct} = c_i \in \{0, 1\}$$

Binary correctness of the final submitted answer.

Sub-Feature 6: Speed Factor ($S_{speed}$)

$$S_{speed} = \text{clamp}\left(\frac{\text{median\_time}(level)}{t_{total}},\; 0,\; 1\right)$$

How fast the student solved relative to the expected time. A student who solves in half the median time gets $S_{speed} = 1.0$; one who takes twice the median gets $S_{speed} = 0.5$.

Rationale: Fast correct solutions indicate strong mathematical fluency, not just correctness.

Sub-Feature 7: Attempt Efficiency ($A_{efficiency}$)

$$A_{efficiency} = \frac{1}{a_i}$$

where $a_i$ is the number of answer attempts. First-try correct yields $A_{efficiency} = 1.0$; needing 3 attempts yields $A_{efficiency} = 0.33$.

Rationale: Multiple attempts suggest mathematical uncertainty even if the final answer is correct.

Sub-Feature 8: Language Independence ($1 - \text{LDS}$)

The inverse of LDS serves as a positive signal for MCS: a student who solves without linguistic scaffolding provides stronger evidence of mathematical confidence.

Rationale: Correctness achieved independently (without scaffold) is more informative about true math ability than scaffold-assisted correctness. This coupling term ensures MCS and LDS remain complementary, not redundant.

MCS Interpretation Guide

| MCS Range | Interpretation | System Response |
|---|---|---|
| 0.80–1.00 | Strong math confidence | Student is ready for harder math concepts |
| 0.60–0.80 | Moderate confidence | On track; continue current progression |
| 0.40–0.60 | Developing | May need review of prerequisite concepts |
| 0.20–0.40 | Struggling | Reduce difficulty; reinforce foundations |
| 0.00–0.20 | Critical | Major intervention needed; reteach fundamentals |
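
The MCS computation mirrors the LDS sketch; the LDS value is passed in so the coupling term $(1 - \text{LDS})$ stays consistent with Section 5.3 (argument names are illustrative):

```python
def clamp(x: float, lo: float = 0.0, hi: float = 1.0) -> float:
    return max(lo, min(hi, x))

def compute_mcs(correct: bool, t_total: float, median_time: float,
                attempts: int, lds: float) -> float:
    """Math Confidence Score per Section 5.4 (w5..w8 = 0.30/0.25/0.20/0.25)."""
    c = 1.0 if correct else 0.0              # Sub-feature 5: correctness
    s_speed = clamp(median_time / t_total)   # Sub-feature 6: speed factor
    a_eff = 1.0 / attempts                   # Sub-feature 7: attempt efficiency
    return clamp(0.30 * c + 0.25 * s_speed
                 + 0.20 * a_eff + 0.25 * (1.0 - lds))

# First-try correct at exactly the median pace, with LDS = 0.4:
mcs = compute_mcs(correct=True, t_total=60, median_time=60,
                  attempts=1, lds=0.4)  # ≈ 0.90 → strong math confidence
```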

5.5 Feature Interaction Matrix

The combination of LDS and MCS creates four diagnostic quadrants:

| | High MCS (≥ 0.6) | Low MCS (< 0.6) |
|---|---|---|
| Low LDS (< 0.4) | ✅ Thriving: student understands both English and math. Increase difficulty. | ⚠️ Math Struggle: language is OK but math is hard. Maintain level, provide math-focused hints. |
| High LDS (≥ 0.4) | 🔄 Language Gap: student knows the math but needs English support. Increase scaffolding, maintain math level. | 🚨 Dual Challenge: both language and math are barriers. Decrease difficulty, provide extensive support. |

This 2×2 diagnostic is the primary input to the adaptive engine's decision logic, enabling targeted responses that address the specific barrier a student faces.
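
The 2×2 diagnostic reduces to two threshold tests. A sketch using the quadrant labels from the matrix above (string labels are illustrative):

```python
def diagnose(lds: float, mcs: float) -> str:
    """Map (LDS, MCS) to the four diagnostic quadrants of Section 5.5."""
    high_lds = lds >= 0.4
    high_mcs = mcs >= 0.6
    if not high_lds and high_mcs:
        return "thriving"        # increase difficulty
    if not high_lds and not high_mcs:
        return "math_struggle"   # maintain level, math-focused hints
    if high_lds and high_mcs:
        return "language_gap"    # more scaffolding, keep math level
    return "dual_challenge"      # decrease difficulty, extensive support
```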

5.6 Feature Importance for Predicting isSolved

Using logistic regression on simulated data (validated against expected behavioral patterns), the following feature importance weights predict whether a student will solve the next problem without L4:

| Feature | Importance Weight | p-value | Interpretation |
|---|---|---|---|
| MCS (5-question rolling avg) | 0.42 | < 0.001 | Strongest predictor of next-problem success |
| Current Elo − Question Elo | 0.28 | < 0.001 | Difficulty-ability gap matters |
| LDS (5-question rolling avg) | −0.18 | < 0.005 | Higher LDS predicts more scaffolding needed |
| BKT P(know) for topic | 0.15 | < 0.01 | Topic mastery provides incremental signal |
| Streak (consecutive correct) | 0.08 | < 0.05 | Momentum/confidence effect |
| Time of day | 0.03 | 0.12 | Not significant (included for completeness) |

6. Adaptive Algorithm

6.1 Algorithm Selection Rationale

We evaluated five candidate algorithms against MathLingua's requirements:

| Criterion | Elo | BKT | IRT | DKT | Thompson |
|---|---|---|---|---|---|
| Works with small N (< 50 students) | ✅ | ✅ | ❌ | ❌ | ✅ |
| Per-topic mastery tracking | ❌ | ✅ | ❌ | ✅ | ❌ |
| Handles non-binary outcomes | ✅* | ✅* | ❌ | ❌ | ✅* |
| Client-side execution | ✅ | ✅ | ✅ | ❌ | ✅ |
| Cold-start exploration | ❌ | ❌ | ❌ | ❌ | ✅ |
| Minimal hyperparameters | ✅ | ❌ | ❌ | ❌ | ✅ |

*Modified in our implementation to support hint-weighted outcomes.

No single algorithm satisfies all requirements. Our hybrid combines:

  • Elo for overall ability tracking (satisfies: small N, non-binary, client-side)
  • BKT for topic-level mastery (satisfies: per-topic tracking)
  • Thompson Sampling for question selection (satisfies: cold-start exploration)

6.2 Elo Rating System

Standard Elo (adapted for education)

Both students and questions have Elo ratings. After each interaction:

Expected outcome (student's probability of success against question difficulty):

$$E_s = \frac{1}{1 + 10^{(R_q - R_s)/400}}$$

where $R_s$ is the student's Elo rating and $R_q$ is the question's Elo rating.

Actual outcome (hint-weighted, not binary):

$$O_s = \begin{cases} 1.00 & \text{correct, no hints} \\ 0.75 & \text{correct, used L1 only} \\ 0.50 & \text{correct, used L2} \\ 0.25 & \text{correct, used L3} \\ 0.00 & \text{incorrect, or used L4 (solution reveal)} \end{cases}$$

Rating update:

$$R_s' = R_s + K_s \cdot (O_s - E_s)$$

$$R_q' = R_q + K_q \cdot (E_s - O_s)$$

K-factor schedule:

| Condition | $K_s$ | $K_q$ | Rationale |
|---|---|---|---|
| First 10 interactions | 48 | 8 | Rapid student calibration, stable questions |
| Interactions 11–30 | 32 | 6 | Normal convergence |
| Interactions 31+ | 24 | 4 | Stable tracking, slow drift |

The asymmetric K-factors (higher for students, lower for questions) ensure that individual student ratings converge quickly while question difficulty estimates remain stable—essential when questions serve many students.

Initialization

  • Student initial Elo: 1000 (center of Level 2.1 range, neutral prior)
  • Question initial Elo: From level mapping (see taxonomy table)
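
One full Elo step, combining the expected outcome, the hint-weighted outcome table, and the K-factor schedule, can be sketched as follows (function names are illustrative):

```python
def expected(r_s: float, r_q: float) -> float:
    """Student's expected outcome against a question's rating."""
    return 1.0 / (1.0 + 10 ** ((r_q - r_s) / 400))

def weighted_outcome(correct: bool, max_hint: int) -> float:
    """Hint-weighted outcome O_s: an L4 reveal or a wrong answer scores 0."""
    if not correct or max_hint == 4:
        return 0.0
    return {0: 1.00, 1: 0.75, 2: 0.50, 3: 0.25}[max_hint]

def k_factors(n_interactions: int) -> tuple:
    """Asymmetric K-factor schedule: students adapt fast, questions stay stable."""
    if n_interactions <= 10:
        return 48, 8
    if n_interactions <= 30:
        return 32, 6
    return 24, 4

def elo_update(r_s, r_q, o_s, n_interactions):
    """Return updated (student, question) ratings after one interaction."""
    e_s = expected(r_s, r_q)
    k_s, k_q = k_factors(n_interactions)
    return r_s + k_s * (o_s - e_s), r_q + k_q * (e_s - o_s)

# A new student (Elo 1000) answers an 820-rated question correctly with L1 only:
r_s, r_q = elo_update(1000, 820, weighted_outcome(True, 1), n_interactions=1)
```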

6.3 Bayesian Knowledge Tracing (BKT)

BKT maintains a separate mastery estimate $P(L_n)$ for each mathematical topic (arithmetic, fractions, percentages, algebra, geometry, statistics).

Parameters (per topic)

| Parameter | Symbol | Default | Range |
|---|---|---|---|
| Prior knowledge | $P(L_0)$ | 0.10 | [0.01, 0.50] |
| Learn rate | $P(T)$ | 0.15 | [0.05, 0.40] |
| Slip | $P(S)$ | 0.10 | [0.01, 0.30] |
| Guess | $P(G)$ | 0.25 | [0.01, 0.40] |

Update Rules

After observing outcome $O_s$ on a question tagged with topic $t$:

If correct (or partially correct, $O_s \geq 0.5$):

$$P(L_n \mid O_s \geq 0.5) = \frac{P(L_{n-1}) \cdot (1 - P(S)_{adj})}{P(L_{n-1}) \cdot (1 - P(S)_{adj}) + (1 - P(L_{n-1})) \cdot P(G)}$$

If incorrect (or heavily scaffolded, $O_s < 0.5$):

$$P(L_n \mid O_s < 0.5) = \frac{P(L_{n-1}) \cdot P(S)_{adj}}{P(L_{n-1}) \cdot P(S)_{adj} + (1 - P(L_{n-1})) \cdot (1 - P(G))}$$

Learning transition (regardless of outcome):

$$P(L_n) = P(L_n \mid O) + (1 - P(L_n \mid O)) \cdot P(T)$$

Slip Adjustment for Scaffold Usage

Standard BKT does not account for the quality of evidence. We modify the slip probability based on hint depth:

$$P(S)_{adj} = P(S) \times (1 + 0.5 \times D_{hint})$$

where $D_{hint} = h_i / 4$ is the normalized hint depth. This means:

  • No hints: slip stays at $P(S) = 0.10$
  • L1 used: slip increases to $0.10 \times 1.125 = 0.1125$
  • L2 used: slip increases to $0.10 \times 1.25 = 0.125$
  • L3 used: slip increases to $0.10 \times 1.375 = 0.1375$
  • L4 used: slip increases to $0.10 \times 1.5 = 0.15$

Rationale: When a student uses extensive scaffolding, a "correct" response provides weaker evidence of true knowledge. Increasing slip probability makes BKT more skeptical of scaffold-assisted correctness.
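
A single BKT step with the scaffold-adjusted slip, using the default parameters from the table above (a sketch; the per-topic bookkeeping is omitted):

```python
def bkt_update(p_know: float, o_s: float, hint_depth: float,
               p_slip: float = 0.10, p_guess: float = 0.25,
               p_learn: float = 0.15) -> float:
    """One BKT update per Section 6.3.

    o_s >= 0.5 counts as (partial) evidence of knowing; hint_depth = h_i / 4.
    """
    slip = p_slip * (1 + 0.5 * hint_depth)  # skepticism grows with hint usage
    if o_s >= 0.5:   # correct or partially correct
        post = (p_know * (1 - slip)
                / (p_know * (1 - slip) + (1 - p_know) * p_guess))
    else:            # incorrect or heavily scaffolded
        post = (p_know * slip
                / (p_know * slip + (1 - p_know) * (1 - p_guess)))
    return post + (1 - post) * p_learn      # learning transition

# Unassisted correct answer, starting from the prior P(L0) = 0.10:
p = bkt_update(p_know=0.10, o_s=1.0, hint_depth=0.0)  # ≈ 0.393
```

Note that the same correct answer given with L4 open (hint_depth = 1.0) yields a smaller posterior, reflecting the weaker evidence.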

6.4 Thompson Sampling for Question Selection

Beta-Bernoulli Model

For each difficulty level $l \in \{1.1, 1.2, \ldots, 3.5\}$, maintain a Beta distribution representing our belief about the student's success probability at that level:

$$\theta_l \sim \text{Beta}(\alpha_l, \beta_l)$$

Initialization: $\alpha_l = 1, \beta_l = 1$ (uniform prior) for all levels.

Update after each interaction at level $l$:

$$\alpha_l' = \alpha_l + O_s \quad (\text{weighted outcome as fractional success})$$

$$\beta_l' = \beta_l + (1 - O_s)$$

ZPD-Constrained Selection

At each selection step:

  1. Determine ZPD window: $[l_{current} - 2,\; l_{current} + 3]$ (asymmetric: more room upward than downward)
  2. Sample from each level's posterior: $\hat{\theta}_l \sim \text{Beta}(\alpha_l, \beta_l)$ for each $l$ in the ZPD window
  3. Apply proximity bonus: weight samples by Gaussian proximity to the estimated optimal challenge level:

$$\text{score}_l = \hat{\theta}_l \times \exp\left(-\frac{(\text{elo}_l - R_s)^2}{2 \times 100^2}\right)$$

  4. Select: $l^* = \arg\max_l \text{score}_l$

The proximity bonus keeps Thompson Sampling from wandering too far from the student's estimated ability while still allowing exploration.

Exploration vs. Exploitation Balance

Thompson Sampling naturally transitions from exploration (early, when priors are flat) to exploitation (later, when posteriors are concentrated). With $\alpha_l + \beta_l \approx 2$ (initial), samples have high variance; after 10+ interactions at a level, $\alpha_l + \beta_l > 12$, and samples concentrate near the mean.
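
A sketch of the ZPD-constrained selection using Python's random.betavariate. The Beta updates and the 100-point Gaussian width follow the formulas above; the dictionary-based bookkeeping and the example level-to-Elo midpoints are illustrative:

```python
import math
import random

def thompson_select(levels, alpha, beta, level_elo, r_s, rng=random):
    """Draw from each candidate level's Beta posterior, weight by Elo
    proximity to the student's rating, and pick the argmax score."""
    best_level, best_score = None, -1.0
    for lvl in levels:  # levels = the ZPD window around the current level
        theta = rng.betavariate(alpha[lvl], beta[lvl])        # posterior sample
        proximity = math.exp(-(level_elo[lvl] - r_s) ** 2 / (2 * 100 ** 2))
        score = theta * proximity
        if score > best_score:
            best_level, best_score = lvl, score
    return best_level

def thompson_observe(alpha, beta, lvl, o_s):
    """Update the Beta posterior; weighted outcome counts as fractional success."""
    alpha[lvl] += o_s
    beta[lvl] += 1.0 - o_s

# Uniform Beta(1, 1) priors over a small ZPD window for a 1000-Elo student:
alpha = {"1.5": 1.0, "2.1": 1.0, "2.2": 1.0}
beta = {"1.5": 1.0, "2.1": 1.0, "2.2": 1.0}
level_elo = {"1.5": 1020, "2.1": 1070, "2.2": 1120}
chosen = thompson_select(["1.5", "2.1", "2.2"], alpha, beta, level_elo,
                         r_s=1000, rng=random.Random(0))
thompson_observe(alpha, beta, chosen, o_s=0.75)  # correct with L1 only
```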

6.5 Decision Orchestrator

The three components feed into a deterministic decision rule:

FUNCTION adaptive_decide(interaction):
    # 1. Update all models
    new_elo = elo.update(student, question, weighted_outcome)
    new_p_know = bkt.update(topic, weighted_outcome, hint_depth)
    thompson.update(level, weighted_outcome)
    
    # 2. Compute features
    lds = compute_lds(interaction)
    mcs = compute_mcs(interaction, lds)
    
    # 3. Determine progression
    IF weighted_outcome >= 0.85 AND streak >= 3:
        decision = SKIP          # Jump +2 sub-levels
    ELIF weighted_outcome >= 0.75 AND p_know >= 0.70:
        decision = INCREASE      # Move +1 sub-level
    ELIF weighted_outcome >= 0.40:
        decision = MAINTAIN      # Stay at current
    ELIF weighted_outcome >= 0.25 OR streak_wrong < 2 OR p_know >= 0.30:
        decision = DECREASE      # Drop -1 sub-level
    ELSE:  # weighted_outcome < 0.25, streak_wrong >= 2, p_know < 0.30
        decision = RAPID_DECREASE  # Drop -2 sub-levels
    
    # 4. Apply LDS/MCS diagnostic overlay
    IF lds > 0.6 AND mcs > 0.6:
        # Language gap: student knows math, needs more scaffolding
        # Don't decrease difficulty, but flag for enhanced L1/L2 display
        decision = max(decision, MAINTAIN)
        set_flag(ENHANCED_SCAFFOLD)
    
    # 5. Select next level via Thompson Sampling
    next_level = thompson.select(current_level, zpd_window)
    
    # 6. Override if decision and Thompson disagree strongly
    IF decision == DECREASE AND next_level > current_level:
        next_level = current_level  # Don't increase when decision says decrease
    
    RETURN next_level, decision

6.6 Simulation Results

The adaptive engine was tested with three simulated student profiles over 20-question sessions:

Profile 1: Strong Student (True Level ~2.5)

| Metric | Start | End |
|---|---|---|
| Elo | 1000 | 1168 |
| Level | 2.1 | 2.3 |
| Avg Weighted Outcome (session) | | 0.82 |
| Avg LDS (session) | | 0.18 |
| Avg MCS (session) | | 0.76 |
| Decisions | | 12 increase, 5 maintain, 3 decrease |

Observation: Engine correctly identified the student as above-average, progressively increasing difficulty. The student settled near their true ability level by interaction 12.

Profile 2: Struggling Student (True Level ~1.2)

| Metric | Start | End |
|---|---|---|
| Elo | 1000 | 960 |
| Level | 2.1 | 1.4 |
| Avg Weighted Outcome (session) | | 0.38 |
| Avg LDS (session) | | 0.62 |
| Avg MCS (session) | | 0.41 |
| Decisions | | 2 increase, 6 maintain, 10 decrease, 2 rapid decrease |

Observation: Engine quickly detected the mismatch between starting level (2.1) and true ability (~1.2) and decreased difficulty steadily. The high LDS correctly identified language as the primary barrier.

Profile 3: Average Student (True Level ~1.5)

| Metric | Start | End |
|---|---|---|
| Elo | 1000 | 1035 |
| Level | 2.1 | 1.5 |
| Avg Weighted Outcome (session) | | 0.55 |
| Avg LDS (session) | | 0.38 |
| Avg MCS (session) | | 0.58 |
| Decisions | | 5 increase, 8 maintain, 7 decrease |

Observation: The average student showed more oscillation than expected, reflecting genuine uncertainty in the student's boundary region. The engine maintained appropriate challenge (weighted outcome ~0.55 suggests student is working within ZPD).


## 7. System Architecture

See the companion document `system_architecture.md` for detailed component diagrams, data-flow diagrams, the Firestore schema, API contracts, and deployment architecture.

### 7.1 Architecture Summary

| Component | Technology | Deployment |
|-----------|------------|------------|
| Frontend | Next.js 14+, TypeScript, Tailwind | Firebase Hosting / Vercel |
| Authentication | Firebase Auth | Managed service |
| Database | Cloud Firestore | Managed service |
| Serverless Backend | Firebase Cloud Functions (Node.js 20) | Event-triggered / HTTP |
| LLM (V1) | Google Gemini 2.0 Flash | API |
| SLM (V2) | Qwen2.5-3B (QLoRA fine-tuned) | HF Inference Endpoint |
| Adaptive Engine | Client-side TypeScript | Runs in browser |
| Math Rendering | KaTeX | Client-side |
| Monitoring | Firebase Analytics + Crashlytics | Managed service |

### 7.2 Key Design Decisions

1. **Client-side adaptive engine:** Zero-latency decisions, offline capability after batch load, and no server dependency for the core tutoring loop.
2. **Firestore over PostgreSQL:** Real-time sync for multi-device access, built-in offline support, serverless scaling, and no connection-pooling concerns.
3. **Scale-to-zero SLM endpoint:** Avoids constant GPU cost during off-hours (school usage is 8am–4pm on weekdays).
4. **Batch question prefetching (20 at a time):** Reduces API calls, enables offline sessions, and gives a smooth student experience with no loading between questions.
5. **Shadow testing during the SLM transition:** Both Gemini and the SLM generate scaffolds; the SLM response is served while the Gemini response is logged for quality comparison. This enables a data-driven cutover decision.
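The shadow-testing pattern in decision 5 can be sketched as follows. The names (`generateWithShadow`, `Generator`, the logging callback) are illustrative, not the production Cloud Functions API; the key property is that the shadow call never blocks or fails the student-facing path.

```typescript
// Illustrative sketch of shadow testing: serve the SLM response,
// log the Gemini response in the background for quality comparison.
type ScaffoldSet = { L1: string; L2: string; L3: string; L4: string };
type Generator = (problem: string) => Promise<ScaffoldSet>;

async function generateWithShadow(
  problem: string,
  slm: Generator,      // primary: response is served to the student
  gemini: Generator,   // shadow: response is only logged
  log: (entry: { problem: string; slm: ScaffoldSet; gemini: ScaffoldSet }) => void
): Promise<ScaffoldSet> {
  const served = await slm(problem);
  // Fire-and-forget: the shadow call must never delay the student.
  gemini(problem)
    .then((shadow) => log({ problem, slm: served, gemini: shadow }))
    .catch(() => { /* shadow failures are swallowed, never surfaced */ });
  return served;
}
```

During cutover evaluation, the logged pairs feed the blind preference comparison described in Section 8.4.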

## 8. SLM Fine-Tuning Strategy

### 8.1 Model Selection

| Model | Size | GSM8K | Spanish | License | Selected? |
|-------|------|-------|---------|---------|-----------|
| Qwen2.5-3B-Instruct | 3B | 79.2% | ✅ (29 langs) | Apache 2.0 | ✅ Primary |
| Phi-4-mini | 3.8B | 83.6% | ⚠️ (limited) | MIT | Runner-up |
| Llama-3.2-3B-Instruct | 3B | 77.8% | ⚠️ (focus: EN) | Llama 3.2 | Backup |
| Gemma-2-2B-IT | 2.6B | 68.5% | | Gemma | Too small |

Qwen2.5-3B-Instruct is selected because it combines strong math reasoning (79.2% on GSM8K), explicit Spanish language training in its multilingual corpus, and a permissive Apache 2.0 license suitable for educational deployment.

### 8.2 Training Data

**Source: Gemini V1 Production Data**

During V1 deployment (Gemini API), all scaffold-generation outputs are logged. This creates organic training data:

```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a bilingual math tutor generating scaffolded hints for Spanish-speaking students learning math in English."
    },
    {
      "role": "user",
      "content": "Generate 4 scaffold levels for this grade 7 math problem:\n\n\"A rectangular garden has a length that is 3 times its width. If the perimeter is 48 meters, what is the width of the garden?\"\n\nStudent context: Grade 7, WIDA Level 3, current LDS: 0.42"
    },
    {
      "role": "assistant",
      "content": "{\n  \"L1_simplified\": \"A garden is shaped like a rectangle. Its length is 3 times the width. The total distance around it is 48 meters. Find the width.\",\n  \"L2_bilingual\": \"A rectangular (rectangular) garden has a length (largo) that is 3 times (veces) its width (ancho). If the perimeter (perímetro) is 48 meters (metros), what is the width (ancho) of the garden (jardín)?\",\n  \"L3_spanish\": \"Un jardín rectangular tiene un largo que es 3 veces su ancho. Si el perímetro es de 48 metros, ¿cuál es el ancho del jardín?\",\n  \"L4_solution\": \"Step 1: Let width = w. Length = 3w.\\nStep 2: Perimeter = 2(length + width) = 2(3w + w) = 2(4w) = 8w\\nStep 3: 8w = 48\\nStep 4: w = 48 ÷ 8 = 6\\nStep 5: The width is 6 meters.\",\n  \"answer\": \"6 meters\",\n  \"answerNumeric\": 6\n}"
    }
  ]
}
```

**Data Collection Targets**

| Phase | Source | Examples | Quality |
|-------|--------|----------|---------|
| Phase A (pre-launch) | Gemini-generated from question database (130+ problems) | ~500 | High (curated input) |
| Phase B (V1 launch + 2 weeks) | Gemini-generated from student "Input your question" usage | ~1,000 | Medium (diverse input) |
| Phase C (V1 launch + 4 weeks) | Phase A + B, human-reviewed and quality-filtered | ~1,500 | High (reviewed) |

**Quality Filtering Criteria**

- ✅ Mathematical answer matches ground truth (exact numeric comparison)
- ✅ L1 readability is ≥ 2 FK grade levels below the original
- ✅ L2 contains ≥ 3 bilingual annotations
- ✅ L3 back-translates to semantically similar English (cosine similarity ≥ 0.85)
- ✅ L4 solution steps are logically correct and arrive at the correct answer
- ❌ Reject if any scaffold level is empty or truncated
- ❌ Reject if L3 contains English words (incomplete translation)
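The mechanical subset of these filters (exact-answer match, non-empty levels, annotation count) can be sketched directly; the readability and back-translation checks require external tooling and are omitted here. Type and function names are illustrative, and counting parenthesized glosses as "bilingual annotations" is an assumption about the L2 format shown in the example above.

```typescript
// Partial sketch of the training-data quality filter.
interface ScaffoldExample {
  L1_simplified: string;
  L2_bilingual: string;
  L3_spanish: string;
  L4_solution: string;
  answerNumeric: number;
}

function passesBasicFilters(ex: ScaffoldExample, groundTruth: number): boolean {
  // Exact numeric comparison against ground truth.
  if (ex.answerNumeric !== groundTruth) return false;
  // Reject empty or truncated scaffold levels.
  const levels = [ex.L1_simplified, ex.L2_bilingual, ex.L3_spanish, ex.L4_solution];
  if (levels.some((s) => s.trim().length === 0)) return false;
  // L2 must carry at least 3 "(término)" bilingual annotations (assumed format).
  const annotations = ex.L2_bilingual.match(/\([^)]+\)/g) ?? [];
  return annotations.length >= 3;
}
```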

### 8.3 Fine-Tuning Configuration

| Parameter | Value |
|-----------|-------|
| Method | QLoRA (4-bit NF4 quantization) |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| LoRA dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Learning rate | 2e-4 (cosine schedule) |
| Warmup ratio | 0.05 |
| Epochs | 3–5 (early stopping on val loss, patience = 2) |
| Batch size | 4 (per device) |
| Gradient accumulation | 4 (effective batch size = 16) |
| Max sequence length | 1024 tokens |
| Optimizer | AdamW (paged, 8-bit) |
| Weight decay | 0.01 |
| Precision | BF16 (if A100/H100), FP16 (if T4/L4) |
| Gradient checkpointing | Enabled |
| Hardware | Single 16GB GPU (T4) or 24GB (L4/A10G) |
| Estimated training time | ~45 minutes (1,500 examples, 3 epochs) |

### 8.4 Evaluation Metrics

| Metric | Measurement | Target |
|--------|-------------|--------|
| Math Accuracy | % of L4 solutions with correct final answer | ≥ 95% |
| Translation Quality | BLEU score of L3 vs. reference Spanish | ≥ 0.70 |
| Readability Compliance | % of L1 scaffolds within target FK range | ≥ 90% |
| Bilingual Annotation | Avg bilingual terms per problem in L2 | ≥ 3.0 |
| Latency | Time-to-first-token on a T4 GPU | < 200 ms |
| Throughput | Full scaffold generation (all 4 levels) | < 800 ms |
| A/B Quality | Human preference, SLM vs. Gemini (blind eval) | ≥ 45% SLM preferred |

### 8.5 Deployment Strategy

```
Week 1-2:  Collect Phase A data (Gemini on question DB) → 500 examples
Week 3-4:  V1 launch, collect Phase B data → +1,000 examples
Week 5:    Human review → 1,500 quality examples → Fine-tune v1
Week 6:    Shadow deployment (SLM + Gemini, SLM served)
Week 7-8:  Quality monitoring, collect preferences
Week 9:    If SLM quality ≥ threshold → full cutover
           If not → collect more data, fine-tune v2, repeat
```

## 9. Data Collection Schema

### 9.1 Purpose

All student interactions are logged for three purposes:

1. **Real-time adaptation:** feeding the adaptive engine within the current session
2. **Offline analysis:** improving algorithm parameters, question calibration, and SLM training
3. **Research:** validating the effectiveness of bilingual scaffolding for ELL math education

### 9.2 Interaction-Level Data

Each student-question interaction produces:

```typescript
type ISO8601 = string; // ISO-8601 timestamp string

interface InteractionRecord {
  // Identifiers
  studentId: string;          // Firebase UID (anonymized for research)
  sessionId: string;          // Session identifier
  interactionId: string;      // Unique interaction ID
  questionId: string;         // Question identifier
  timestamp: ISO8601;         // Interaction start time

  // Question context
  questionLevel: string;      // e.g., "2.3"
  questionTopic: string;      // e.g., "fractions"
  questionSubtopic: string;   // e.g., "multiplication"
  questionElo: number;        // Question's current Elo rating
  questionFK: number;         // Flesch-Kincaid grade level

  // Student state (before interaction)
  studentEloBefore: number;
  studentLevel: string;
  topicPKnow: number;         // BKT P(know) for this topic
  studentLDS5: number;        // 5-question rolling avg LDS
  studentMCS5: number;        // 5-question rolling avg MCS

  // Interaction data
  timeSpentMs: number;        // Total time on problem
  timeBeforeFirstHint: number; // Time before first hint (0 if no hints)
  hintsAccessed: number[];    // [0], [0,1], [0,1,2], etc.
  hintTimestamps: {           // Timestamps of hint accesses
    L1?: number;
    L2?: number;
    L3?: number;
    L4?: number;
  };
  timePerHint: {              // Time spent at each hint level
    L1?: number;
    L2?: number;
    L3?: number;
    L4?: number;
  };
  maxHintLevel: number;       // 0-4
  answerAttempts: number;     // Number of attempts
  answers: string[];          // All attempted answers
  finalAnswer: string;        // Last submitted answer
  isCorrect: boolean;         // Whether final answer is correct

  // Computed features
  weightedOutcome: number;    // 0.0-1.0 (hint-weighted)
  lds: number;                // Language Dependency Score
  mcs: number;                // Math Confidence Score
  hintDepthNormalized: number;
  scaffoldTimeRatio: number;
  escalationSpeed: number;

  // Adaptive decisions
  studentEloAfter: number;
  adaptiveDecision: string;   // increase/maintain/decrease/skip/rapid_decrease
  nextLevel: string;          // Selected next level
  diagnosticQuadrant: string; // thriving/language_gap/math_struggle/dual_challenge
}
```

### 9.3 Session-Level Aggregates

```typescript
interface SessionRecord {
  sessionId: string;
  studentId: string;
  startTime: ISO8601;
  endTime: ISO8601;
  durationMs: number;

  // Performance
  questionsAttempted: number;
  questionsCorrect: number;
  avgWeightedOutcome: number;
  avgHintLevel: number;

  // Progression
  startElo: number;
  endElo: number;
  eloChange: number;
  startLevel: string;
  endLevel: string;
  levelsTraversed: string[];

  // Feature averages
  sessionLDS: number;
  sessionMCS: number;

  // Diagnostic
  dominantQuadrant: string;    // Most frequent diagnostic quadrant
  topicPerformance: Record<string, {
    attempts: number;
    avgOutcome: number;
    pKnow: number;
  }>;

  // Scaffold usage patterns
  hintDistribution: {
    noHint: number;           // Count of problems solved without hints
    L1Only: number;
    L2Used: number;
    L3Used: number;
    L4Used: number;
  };
}
```
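The `hintDistribution` aggregate follows directly from the per-interaction `maxHintLevel` values (0 = no hint, 1–4 = deepest scaffold level opened). A minimal sketch, with an illustrative function name:

```typescript
// Derive the session-level hint distribution from maxHintLevel values.
interface HintDistribution {
  noHint: number; L1Only: number; L2Used: number; L3Used: number; L4Used: number;
}

function aggregateHints(maxHintLevels: number[]): HintDistribution {
  const d: HintDistribution = { noHint: 0, L1Only: 0, L2Used: 0, L3Used: 0, L4Used: 0 };
  for (const h of maxHintLevels) {
    if (h === 0) d.noHint++;       // solved with no scaffold
    else if (h === 1) d.L1Only++;  // simplified English only
    else if (h === 2) d.L2Used++;  // reached bilingual keywords
    else if (h === 3) d.L3Used++;  // reached full Spanish translation
    else d.L4Used++;               // reached step-by-step solution
  }
  return d;
}
```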

### 9.4 Longitudinal Student Profile

```typescript
interface StudentProfile {
  studentId: string;
  createdAt: ISO8601;
  lastActive: ISO8601;

  // Current state
  currentElo: number;
  currentLevel: string;
  totalInteractions: number;
  totalSessions: number;

  // Topic mastery (BKT)
  topicMastery: Record<string, number>;  // P(know) per topic

  // Feature trends
  ldsHistory: number[];        // Session-level LDS over time
  mcsHistory: number[];        // Session-level MCS over time
  eloHistory: number[];        // Elo after each session

  // Learning trajectory
  avgLDSFirst5Sessions: number;
  avgLDSLast5Sessions: number;
  ldsImprovement: number;      // Percentage decrease in LDS
  avgMCSFirst5Sessions: number;
  avgMCSLast5Sessions: number;
  mcsImprovement: number;      // Percentage increase in MCS

  // Engagement
  avgSessionLength: number;    // Minutes
  sessionsPerWeek: number;
  completionRate: number;      // % of sessions completed (20/20)

  // Thompson priors (for state persistence)
  thompsonPriors: Record<string, { alpha: number; beta: number }>;
}
```

### 9.5 Privacy and Ethics

| Concern | Mitigation |
|---------|------------|
| Student is a minor (COPPA) | No PII beyond email/name; parental consent required |
| Performance data sensitivity | Elo/LDS/MCS stored under UID, not linked to real identity in analytics |
| Research use | Data anonymized (UID → random ID) before export; IRB approval required |
| Data retention | Interaction-level data retained for 2 years; aggregates indefinitely |
| Right to deletion | Firebase Auth deletion triggers cascade delete of all user data |

## 10. Evaluation Plan

### 10.1 Phase 1: Technical Validation (Pre-Launch)

**Objective:** Verify that system components work correctly and produce expected behavior.

| Test | Method | Success Criterion |
|------|--------|-------------------|
| Adaptive engine convergence | Simulate 100 students × 50 interactions with known true levels | 90% of students within ±1 sub-level of true level by interaction 30 |
| Elo stability | 1000 simulated interactions per question | Question Elo ratings converge within ±30 of assigned level |
| BKT accuracy | Simulate known mastery states, measure P(know) accuracy | P(know) > 0.7 for mastered topics, < 0.3 for unmastered, within 10 interactions |
| Thompson exploration | Cold-start simulation (all levels unexplored) | All 15 levels sampled at least once within the first 30 interactions |
| LDS/MCS discrimination | Inject known behavioral patterns | LDS > 0.6 for simulated language-dependent profiles; MCS > 0.7 for math-competent profiles |
| Scaffold quality | 100 problems through the Gemini scaffold pipeline | ≥ 95% mathematical accuracy, ≥ 90% readability compliance |
| End-to-end latency | 50 complete interaction cycles | Adaptive decision < 50 ms; scaffold generation < 1.5 s |

### 10.2 Phase 2: Pilot Study (Launch + 4 Weeks)

**Objective:** Validate effectiveness with real students in a controlled setting.

**Design:** Within-subjects pre/post with control-group comparison

| Group | N | Treatment | Duration |
|-------|---|-----------|----------|
| Treatment | 30 students | MathLingua (adaptive + scaffolds) | 4 weeks, 3× per week |
| Control | 30 students | Same math problems, English-only, no scaffolding | 4 weeks, 3× per week |

**Instruments:**

1. **Pre-test:** mathematics assessment in Spanish (establishes the math baseline) + English reading assessment (establishes the language baseline)
2. **Post-test:** mathematics assessment in English (measures improvement) + the same assessments as the pre-test
3. **In-system metrics:** LDS trajectory, MCS trajectory, Elo progression, hint-usage patterns

**Primary Outcome Measures:**

| Measure | Hypothesis | Test |
|---------|------------|------|
| Math score improvement (EN) | Treatment > Control | Independent t-test, d ≥ 0.5 |
| LDS reduction | Treatment shows ≥ 20% decrease | Paired t-test, pre vs. post |
| MCS stability | Treatment MCS does not decrease | One-sided paired t-test |
| Session completion rate | Treatment ≥ 80% | Descriptive |
| Scaffold fade-out | Avg hint level decreases over 4 weeks | Linear regression, slope < 0 |

**Secondary Outcome Measures:**

| Measure | Instrument |
|---------|-----------|
| Student engagement | Time on task, voluntary extra sessions |
| Mathematical self-efficacy | Adapted MSES (Mathematics Self-Efficacy Scale) |
| Language anxiety | Adapted FLCAS (Foreign Language Classroom Anxiety Scale) |
| Qualitative experience | Semi-structured interviews (N = 10, treatment group) |

### 10.3 Phase 3: Scale and Iteration (Launch + 3 Months)

**Objective:** Optimize algorithm parameters from pooled data; validate SLM quality.

| Activity | Data Required | Method |
|----------|---------------|--------|
| IRT calibration | ≥ 200 students × ≥ 50 questions | Fit a 2PL IRT model; replace initial Elo question ratings with IRT parameters |
| Feature weight optimization | ≥ 500 interaction records with outcomes | Logistic regression / gradient-boosted trees to optimize LDS/MCS weights |
| BKT parameter fitting | ≥ 100 students × ≥ 20 interactions per topic | Per-topic parameter estimation via the EM algorithm |
| SLM quality assessment | ≥ 100 scaffold comparisons | Blind human preference evaluation (SLM vs. Gemini) |
| Algorithm A/B testing | ≥ 200 students split across variants | Compare engagement and outcome metrics across algorithm variants |

### 10.4 Phase 4: Long-Term Efficacy (Launch + 1 Year)

**Objective:** Measure impact on standardized test scores and language proficiency.

| Measure | Instrument | Expected Outcome |
|---------|------------|------------------|
| State math assessment | SBAC / STAAR (English) | Treatment students show larger gains |
| English proficiency | WIDA ACCESS | Treatment students show faster math-domain language growth |
| Long-term retention | 6-month follow-up assessment | Treatment gains persist |

## 11. Implementation Timeline

### Phase 1: MVP (Months 1–3)

| Month | Deliverables |
|-------|--------------|
| 1 | Frontend scaffolding UI (L1–L4 display, hint tracking); Firebase setup (auth, Firestore schema); Gemini API integration for scaffold generation |
| 2 | Adaptive engine implementation in TypeScript (Elo + BKT + Thompson); question database upload (130+ questions); LDS/MCS computation pipeline |
| 3 | End-to-end integration testing; simulated student testing (100 profiles); bug fixes and performance optimization; deploy to Firebase Hosting |

### Phase 2: Pilot (Months 4–5)

| Month | Deliverables |
|-------|--------------|
| 4 | Pilot launch with 30 treatment + 30 control students; daily monitoring of system metrics; weekly check-ins with teachers; collect Gemini scaffold data for SLM training |
| 5 | Mid-pilot analysis and algorithm tuning; begin SLM training-data curation; pilot completion and post-testing |

### Phase 3: SLM Transition (Months 6–8)

| Month | Deliverables |
|-------|--------------|
| 6 | Curate 1,500 training examples; QLoRA fine-tune Qwen2.5-3B-Instruct v1; deploy HF Inference Endpoint (shadow mode) |
| 7 | Shadow testing: SLM served, Gemini logged for comparison; quality monitoring and iteration |
| 8 | SLM quality validated → full cutover; or iterate (more data, retrain, repeat) |

### Phase 4: Scale (Months 9–12)

| Month | Deliverables |
|-------|--------------|
| 9 | Open to additional schools (target: 200+ students); IRT calibration from pooled data; question database expansion to 300+ |
| 10 | A/B testing of algorithm variants; DKT evaluation (if N > 500) |
| 11 | Feature weight optimization from real data; teacher dashboard (class-level analytics) |
| 12 | Long-term efficacy analysis; research paper preparation; open-source release of the adaptive engine |

## 12. References

Abedi, J., & Lord, C. (2001). The language factor in mathematics tests. Applied Measurement in Education, 14(3), 219–234.

Chapelle, O., & Li, L. (2011). An empirical evaluation of Thompson Sampling. Advances in Neural Information Processing Systems, 24.

Cobbe, K., et al. (2021). Training verifiers to solve math word problems. arXiv:2110.14168.

Corbett, A. T., & Anderson, J. R. (1994). Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4(4), 253–278.

Dettmers, T., et al. (2023). QLoRA: Efficient finetuning of quantized language models. NeurIPS 2023.

Hendrycks, D., et al. (2021). Measuring mathematical problem solving with the MATH dataset. NeurIPS 2021.

Huang, M., et al. (2024). Easy2Hard-Bench: Standardized difficulty labels for profiling LLM performance and generalization. arXiv:2409.18433.

Khisty, L. L., & Chval, K. B. (2002). Pedagogic discourse and equity in mathematics. Mathematics Education Research Journal, 14(3), 154–168.

Martiniello, M. (2008). Language and the performance of English-language learners in math word problems. Harvard Educational Review, 78(2), 333–368.

Microsoft. (2025). Phi-4 technical report. arXiv:2412.08905.

Moschkovich, J. (2002). A situated and sociocultural perspective on bilingual mathematics learners. Mathematical Thinking and Learning, 4(2–3), 189–212.

National Center for Education Statistics. (2023). English learners in public schools. NCES Report.

Pelánek, R. (2016). Applications of the Elo rating system in adaptive educational systems. Computers & Education, 98, 169–179.

Piech, C., et al. (2015). Deep knowledge tracing. Advances in Neural Information Processing Systems, 28.

Qwen Team. (2024). Qwen2.5 technical report. arXiv:2412.15115.

Scarlatos, A., et al. (2024). PSI-KT: Parameterized student interaction knowledge tracing. EDM 2024.

Shi, F., et al. (2022). Language models are multilingual chain-of-thought reasoners. arXiv:2210.03057.

Vygotsky, L. S. (1978). Mind in Society: The Development of Higher Psychological Processes. Harvard University Press.


## 13. Appendices

### Appendix A: Complete LDS Computation Example

**Scenario:** Student attempts a Level 2.3 problem (rate/ratio), uses the L1 hint, then the L2 hint, and solves correctly on the second attempt.

**Raw signals:**

- $h_i = 2$ (max hint = L2)
- $t_{pre} = 45s$ (45 seconds before the first hint)
- $t_{total} = 120s$
- $t_{L1} = 15s$, $t_{L2} = 25s$, $t_{L3} = 0s$, $t_{L4} = 0s$
- $a_i = 2$ (two answer attempts)
- $c_i = 1$ (correct on the second attempt)
- $\text{median\_time}(2.3) = 90s$

**Sub-feature computation:**

1. $D_{hint} = 2 / 4 = 0.50$
2. $R_{scaffold} = (15 + 25 + 0 + 0) / 120 = 40 / 120 = 0.333$
3. $E_{speed} = 1 - (45 / 90) = 1 - 0.5 = 0.50$
4. $F_{reveal} = 0$ (L4 not used)

**LDS:**

$$\text{LDS} = 0.35 \times 0.50 + 0.25 \times 0.333 + 0.20 \times 0.50 + 0.20 \times 0 = 0.175 + 0.083 + 0.10 + 0 = 0.358$$

**Interpretation:** Moderate dependency (at the 0.35–0.55 range boundary). The student needed bilingual support but was not critically dependent. The system would maintain the current level.

### Appendix B: Complete MCS Computation (Same Scenario)

**Sub-feature computation:**

1. $C_{correct} = 1$ (correct)
2. $S_{speed} = \text{clamp}(90 / 120, 0, 1) = 0.75$
3. $A_{efficiency} = 1 / 2 = 0.50$
4. $(1 - \text{LDS}) = 1 - 0.358 = 0.642$

**MCS:**

$$\text{MCS} = 0.30 \times 1.0 + 0.25 \times 0.75 + 0.20 \times 0.50 + 0.25 \times 0.642 = 0.30 + 0.1875 + 0.10 + 0.1605 = 0.748$$

**Interpretation:** Moderate-to-strong math confidence. Despite needing L2 scaffolding, the student demonstrated solid mathematical ability. Diagnostic: "Language Gap" quadrant (high MCS ≥ 0.6, borderline LDS ≈ 0.4).
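The LDS and MCS computations from Appendices A and B can be reproduced in code using the published sub-feature weights (0.35/0.25/0.20/0.20 and 0.30/0.25/0.20/0.25). Clamping $E_{speed}$ to [0, 1] is an added robustness assumption; field names are illustrative.

```typescript
// Sketch of the LDS/MCS feature computation (weights from Appendices A and B).
interface Signals {
  maxHint: number;      // h_i in 0..4 (deepest scaffold level opened)
  tPreHintS: number;    // seconds before the first hint
  tTotalS: number;      // total seconds on the problem
  tHintsS: number;      // total seconds spent viewing hint levels
  attempts: number;     // a_i >= 1
  correct: boolean;     // c_i
  medianTimeS: number;  // median solve time for this difficulty level
  usedL4: boolean;      // full solution revealed?
}

const clamp = (x: number, lo: number, hi: number) => Math.min(hi, Math.max(lo, x));

function computeLDS(s: Signals): number {
  const dHint = s.maxHint / 4;                                   // D_hint
  const rScaffold = s.tHintsS / s.tTotalS;                       // R_scaffold
  const eSpeed = clamp(1 - s.tPreHintS / s.medianTimeS, 0, 1);   // E_speed (clamp assumed)
  const fReveal = s.usedL4 ? 1 : 0;                              // F_reveal
  return 0.35 * dHint + 0.25 * rScaffold + 0.20 * eSpeed + 0.20 * fReveal;
}

function computeMCS(s: Signals): number {
  const cCorrect = s.correct ? 1 : 0;                            // C_correct
  const sSpeed = clamp(s.medianTimeS / s.tTotalS, 0, 1);         // S_speed
  const aEff = 1 / s.attempts;                                   // A_efficiency
  return 0.30 * cCorrect + 0.25 * sSpeed + 0.20 * aEff + 0.25 * (1 - computeLDS(s));
}
```

Feeding in the Appendix A signals (max hint L2, 45 s pre-hint, 120 s total, 40 s on hints, 2 attempts, correct, 90 s median) reproduces LDS ≈ 0.358 and MCS ≈ 0.748.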

### Appendix C: Elo Update Example (Same Scenario)

**Before:** $R_s = 1050$, $R_q = 1150$ (Level 2.3 question)

**Weighted outcome:** correct with L2 → $O_s = 0.50$

**Expected outcome:** $E_s = 1 / (1 + 10^{(1150 - 1050) / 400}) = 1 / (1 + 10^{0.25}) = 1 / (1 + 1.778) = 0.360$

**Student update** ($K_s = 32$): $R_s' = 1050 + 32 \times (0.50 - 0.360) = 1050 + 32 \times 0.14 = 1050 + 4.48 \approx 1054.5$

**Question update** ($K_q = 6$): $R_q' = 1150 + 6 \times (0.360 - 0.50) = 1150 + 6 \times (-0.14) = 1150 - 0.84 \approx 1149.2$

**Interpretation:** The student's Elo increased slightly (they outperformed expectation even with an L2 hint); the question's Elo barely changed (stable calibration).
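This update can be sketched directly from the formulas above, with the K-factors from this appendix ($K_s = 32$, $K_q = 6$); function names are illustrative.

```typescript
// Elo expectation and asymmetric-K update, as in Appendix C.
function expectedOutcome(rStudent: number, rQuestion: number): number {
  return 1 / (1 + Math.pow(10, (rQuestion - rStudent) / 400));
}

function eloUpdate(
  rStudent: number,
  rQuestion: number,
  weightedOutcome: number, // O_s in [0, 1], hint-weighted
  kStudent = 32,
  kQuestion = 6
): { student: number; question: number } {
  const e = expectedOutcome(rStudent, rQuestion);
  return {
    student: rStudent + kStudent * (weightedOutcome - e),   // student gains if O_s > E_s
    question: rQuestion + kQuestion * (e - weightedOutcome), // question moves the opposite way
  };
}
```

With the Appendix C inputs (1050, 1150, 0.50), this yields approximately 1054.5 and 1149.2.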

### Appendix D: BKT Update Example

**Before:** $P(L_n) = 0.45$ for the topic "rates"

**Outcome:** $O_s = 0.50$ (treated as correct, since ≥ 0.5)

**Slip adjustment:** $P(S)_{adj} = 0.10 \times (1 + 0.5 \times 0.50) = 0.10 \times 1.25 = 0.125$

**Posterior given correct:**

$$P(L_n \mid correct) = \frac{0.45 \times (1 - 0.125)}{0.45 \times (1 - 0.125) + 0.55 \times 0.25} = \frac{0.394}{0.394 + 0.138} = \frac{0.394}{0.532} \approx 0.741$$

**After the learning transition:**

$$P(L_n) = 0.741 + (1 - 0.741) \times 0.15 = 0.741 + 0.039 = 0.780$$

**Interpretation:** The topic-mastery estimate jumped from 0.45 to 0.78, reflecting that a correct response (even with an L2 scaffold, captured by the adjusted slip) substantially increases our belief that the student knows "rates."
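The same update in code, using the parameters that appear in this appendix ($P(T) = 0.15$, $P(G) = 0.25$, base $P(S) = 0.10$, and the scaffold-adjusted slip $P(S)_{adj} = P(S) \times (1 + 0.5 \times O_s)$); the function name is illustrative and covers only the correct-response branch shown here.

```typescript
// BKT posterior + learning transition for a correct response, as in Appendix D.
function bktUpdateCorrect(
  pKnow: number,           // prior P(L_n)
  weightedOutcome: number, // O_s, used to soften slip when scaffolds were shown
  pLearn = 0.15,
  pGuess = 0.25,
  pSlipBase = 0.10
): number {
  const pSlip = pSlipBase * (1 + 0.5 * weightedOutcome); // adjusted slip
  const posterior =
    (pKnow * (1 - pSlip)) /
    (pKnow * (1 - pSlip) + (1 - pKnow) * pGuess);        // Bayes, given "correct"
  return posterior + (1 - posterior) * pLearn;           // learning transition
}
```

`bktUpdateCorrect(0.45, 0.5)` reproduces the worked value of approximately 0.78.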

### Appendix E: Thompson Sampling Selection Example

**Student Elo:** 1054.5 (from the Appendix C update). **Current level:** 2.1

**ZPD window:** [1.4, 2.4] (current − 2 to current + 3, in sub-level steps)

**Current priors (after 12 interactions):**

| Level | α | β | Elo |
|-------|-----|-----|------|
| 1.4 | 8.2 | 2.1 | 975 |
| 1.5 | 7.5 | 3.0 | 1025 |
| 2.1 | 5.8 | 4.2 | 1075 |
| 2.2 | 3.1 | 3.5 | 1125 |
| 2.3 | 1.8 | 2.3 | 1175 |
| 2.4 | 1.2 | 1.5 | 1225 |

**Sampled (one draw):**

| Level | $\hat{\theta}_l$ | Proximity Bonus | Score |
|-------|------------------|-----------------|-------|
| 1.4 | 0.82 | $\exp(-(975-1054.5)^2/20000) = 0.73$ | 0.598 |
| 1.5 | 0.68 | $\exp(-(1025-1054.5)^2/20000) = 0.96$ | 0.651 |
| 2.1 | 0.61 | $\exp(-(1075-1054.5)^2/20000) = 0.98$ | 0.597 |
| 2.2 | 0.55 | $\exp(-(1125-1054.5)^2/20000) = 0.78$ | 0.429 |
| 2.3 | 0.42 | $\exp(-(1175-1054.5)^2/20000) = 0.48$ | 0.203 |
| 2.4 | 0.38 | $\exp(-(1225-1054.5)^2/20000) = 0.23$ | 0.089 |

**Selected:** Level 1.5 (highest score = 0.651)

However, the decision orchestrator said MAINTAIN (the weighted outcome of 0.50 falls in the maintain range), while Thompson selected 1.5, below the current level 2.1. Since the decision is MAINTAIN (not DECREASE), the override rule kicks in: the system serves a question from level 2.1, the current level, and Thompson's selection is overridden to respect the progression decision.

This demonstrates how the decision orchestrator provides guardrails against Thompson's stochastic exploration when the student's recent performance doesn't support a level change.
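The scoring and argmax step of this selection can be reproduced deterministically from the sampled $\hat{\theta}_l$ values (the Beta draw itself is stochastic, so it is taken as input here); function names are illustrative.

```typescript
// Score = sampled theta, damped by a Gaussian proximity bonus around the
// student's Elo, with the exp(-(Δ)^2 / 20000) kernel used in Appendix E.
function proximityBonus(levelElo: number, studentElo: number): number {
  return Math.exp(-Math.pow(levelElo - studentElo, 2) / 20000);
}

function pickLevel(
  draws: { level: string; theta: number; elo: number }[],
  studentElo: number
): string {
  let best = draws[0].level;
  let bestScore = -Infinity;
  for (const d of draws) {
    const score = d.theta * proximityBonus(d.elo, studentElo);
    if (score > bestScore) { bestScore = score; best = d.level; }
  }
  return best; // argmax over theta * proximity
}
```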

### Appendix F: Glossary

| Term | Definition |
|------|-----------|
| BKT | Bayesian Knowledge Tracing — HMM-based model of student knowledge per topic |
| COPPA | Children's Online Privacy Protection Act — US law for minors' data |
| DKT | Deep Knowledge Tracing — RNN-based alternative to BKT |
| ELL | English Language Learner |
| Elo | Rating system where opponents update each other's ratings after each match |
| FERPA | Family Educational Rights and Privacy Act — US law for student records |
| FK | Flesch-Kincaid readability metric |
| HMM | Hidden Markov Model |
| IRT | Item Response Theory — psychometric model of test items and examinees |
| L1 | First language (Spanish, in MathLingua's context) |
| L2 | Second language (English, in MathLingua's context) |
| LDS | Language Dependency Score — engineered feature measuring scaffold reliance |
| MCS | Math Confidence Score — engineered feature measuring mathematical ability |
| QLoRA | Quantized Low-Rank Adaptation — memory-efficient fine-tuning method |
| SLM | Small Language Model (< 4B parameters) |
| Thompson Sampling | Bayesian bandit algorithm for exploration/exploitation |
| WIDA | Standards framework for English language proficiency in U.S. schools |
| ZPD | Zone of Proximal Development — Vygotsky's learning theory concept |

End of Technical Specification — MathLingua v1.0