
MathLingua: An Adaptive Bilingual Scaffolding System for Mathematics Word Problem Comprehension

Technical Specification Document

Version: 1.0
Date: April 2026
Authors: [MathLingua Research Team]


Abstract

We present MathLingua, an adaptive tutoring system designed for Spanish-speaking students in grades 6–8 who are transitioning to English-medium mathematics education. The system addresses the dual challenge these students face: mastering mathematical concepts while simultaneously acquiring the academic English required to comprehend word problems. MathLingua introduces a four-level progressive scaffolding framework (L1: Simplified English → L2: Bilingual Keywords → L3: Full Spanish Translation → L4: Step-by-Step Solution) and a novel hybrid adaptive algorithm combining Elo rating, Bayesian Knowledge Tracing (BKT), and Thompson Sampling to personalize difficulty progression. We define two engineered features—**Language Dependency Score (LDS)** and **Math Confidence Score (MCS)**—that disentangle linguistic struggle from mathematical difficulty, enabling targeted intervention. The system architecture supports a planned transition from a cloud-based large language model (Gemini 2.0 Flash) to a fine-tuned small language model (Qwen2.5-3B-Instruct with QLoRA) for scalable, cost-effective scaffold generation. This specification provides the complete technical design: adaptive algorithms with formal definitions, feature engineering formulas, a 15-level difficulty taxonomy validated by readability metrics, a prototype question database of 130 word problems, system architecture, and an evaluation plan.

Keywords: Adaptive Learning, Bilingual Education, Mathematics Word Problems, Scaffolding, Bayesian Knowledge Tracing, Elo Rating, Thompson Sampling, Small Language Models, QLoRA


Table of Contents

  1. Introduction
  2. Related Work
  3. Difficulty Taxonomy and Linguistic Progression
  4. Question Database Design
  5. Feature Engineering
  6. Adaptive Algorithm
  7. System Architecture
  8. SLM Fine-Tuning Strategy
  9. Data Collection Schema
  10. Evaluation Plan
  11. Implementation Timeline
  12. References
  13. Appendices

1. Introduction

1.1 Problem Statement

An estimated 5.1 million English Language Learners (ELLs) are enrolled in U.S. public schools, with approximately 75% being Spanish-speaking (National Center for Education Statistics, 2023). These students face a compounded challenge in mathematics: they must simultaneously decode unfamiliar English vocabulary and sentence structures while performing mathematical reasoning. Research consistently shows that ELLs underperform native English speakers on mathematics assessments—not because of deficient mathematical ability, but because of the linguistic demands embedded in word problems (Abedi & Lord, 2001; Martiniello, 2008).

Current educational technology solutions typically address either language learning or mathematics separately. General-purpose math tutoring systems (e.g., Khan Academy, IXL) present problems exclusively in English with no linguistic scaffolding. Language learning platforms (e.g., Duolingo) lack mathematical content. Bilingual resources, where they exist, are static—offering problems in English or Spanish but not providing a scaffolded bridge between the two.

1.2 Proposed Solution

MathLingua addresses this gap with three key innovations:

  1. Progressive Bilingual Scaffolding: A four-level hint system that provides decreasing linguistic support, from simplified English (L1) through bilingual annotations (L2) and full Spanish translation (L3) to step-by-step solution reveal (L4). The scaffold levels are designed to gradually build mathematical English proficiency while ensuring comprehension.

  2. Disentangled Difficulty Assessment: Two novel engineered features—Language Dependency Score (LDS) and Math Confidence Score (MCS)—that separately quantify a student's reliance on linguistic scaffolding versus their underlying mathematical competence. This separation enables the system to distinguish between "doesn't understand the English" and "doesn't understand the math."

  3. Hybrid Adaptive Algorithm: A combination of Elo rating (for overall ability tracking), Bayesian Knowledge Tracing (for topic-level mastery estimation), and Thompson Sampling (for intelligent question selection with exploration), specifically adapted for hint-weighted outcomes rather than binary correctness.

1.3 Target Population

| Characteristic | Description |
|---|---|
| Grade Level | 6–8 (ages 11–14) |
| L1 Language | Spanish |
| L2 Language | English (medium of instruction) |
| English Proficiency | WIDA Levels 2–4 (Beginning–Developing) |
| Math Level | On grade level in Spanish-medium instruction |
| Setting | U.S. middle schools with bilingual/ESL programs |

1.4 Design Goals

| Goal | Metric | Target |
|---|---|---|
| Reduce language barrier impact | LDS decrease over 4 weeks | ≥ 20% reduction |
| Maintain/improve math confidence | MCS stability or increase | No MCS decrease > 5% |
| Appropriate difficulty targeting | Student in ZPD | ≥ 70% of questions in ZPD |
| Student engagement | Session completion rate | ≥ 80% complete 20-question sessions |
| Scaffold fade-out | Avg hint level over time | Decrease from ~L2.5 to ~L1.5 |

2. Related Work

2.1 Adaptive Learning Algorithms

Item Response Theory (IRT) models the probability of a correct response as a function of student ability (θ) and item parameters (difficulty, discrimination, guessing). The 3-parameter logistic (3PL) model is standard:

$$P(X_{ij} = 1 \mid \theta_j) = c_i + \frac{1 - c_i}{1 + e^{-a_i(\theta_j - b_i)}}$$

where $a_i$ is discrimination, $b_i$ is difficulty, $c_i$ is guessing, and $\theta_j$ is student ability. IRT requires large calibration samples (typically N > 200 per item) and assumes local independence, making it unsuitable for an initial deployment with small N.

Elo Rating, originally designed for chess, provides a simpler pairwise comparison model that updates after each interaction. It requires no prior calibration and converges with as few as 10–15 interactions (Pelánek, 2016). We adopt Elo as the primary ability tracking mechanism.

Bayesian Knowledge Tracing (BKT) models knowledge as a hidden Markov model with four parameters: P(L₀) (prior knowledge), P(T) (learn rate), P(S) (slip), P(G) (guess). BKT provides topic-level mastery estimates essential for identifying which mathematical concepts a student has learned versus which remain unmastered (Corbett & Anderson, 1994).

Deep Knowledge Tracing (DKT) uses recurrent neural networks to model student knowledge state. While DKT can capture complex temporal dependencies, it requires substantial training data (thousands of students) and runs as a server-side model, making it impractical for Phase 1 deployment (Piech et al., 2015).

Thompson Sampling is a Bayesian bandit algorithm that balances exploration and exploitation by sampling from posterior distributions of expected reward for each action (Chapelle & Li, 2011). Applied to question selection, it naturally handles the cold-start problem by exploring uncertain levels while exploiting known ZPD levels.

PSI-KT (Scarlatos et al., 2024) combines IRT with attention-based knowledge tracing, achieving state-of-the-art performance. However, it requires server-side inference and large training corpora, making it a Phase 3 target.

2.2 Mathematical Word Problem Datasets

| Dataset | Size | Features | Relevance |
|---|---|---|---|
| GSM8K (Cobbe et al., 2021) | 8,792 | Grade school math, chain-of-thought solutions | Solution step generation |
| MGSM (Shi et al., 2022) | 250 × 10 langs | Parallel EN/ES math problems | Bilingual scaffold training |
| Easy2Hard-Bench (Huang et al., 2024) | 1,319 (GSM8K subset) | IRT-calibrated difficulty (0–1) | Difficulty calibration |
| bryanchrist/STEM | 1,552 | Grade 5, topic-tagged, curriculum-aligned | Question structure templates |
| MATH (Hendrycks et al., 2021) | 12,500 | Competition-level, 5 difficulty levels | Advanced levels (grade 8+) |

2.3 Scaffolding in Mathematics Education

Vygotsky's Zone of Proximal Development (ZPD) provides the theoretical foundation for scaffolding—the idea that instruction is most effective when targeted at the gap between what a learner can do independently and what they can achieve with guidance (Vygotsky, 1978). In bilingual mathematics education, scaffolding takes on a dual role: supporting both conceptual understanding and linguistic comprehension.

Moschkovich (2002) demonstrated that bilingual mathematics instruction should leverage students' first language as a resource rather than treating it as a deficit. Khisty and Chval (2002) showed that strategic code-switching in mathematics instruction improves both comprehension and mathematical discourse development. MathLingua operationalizes these findings through its four-level scaffold design, which systematically moves from English-only support to bilingual bridging to full L1 access.

2.4 Small Language Models for Education

Recent advances in small language models (SLMs) under 4B parameters have demonstrated competitive performance on mathematical reasoning tasks. Qwen2.5-3B-Instruct achieves 79.2% on GSM8K and supports 29 languages including Spanish (Qwen Team, 2024). Phi-4-mini (3.8B) achieves 83.6% on MATH but has weaker multilingual support (Microsoft, 2025). These models can be fine-tuned with QLoRA (Dettmers et al., 2023) on a single consumer GPU, enabling cost-effective deployment for educational applications.


3. Difficulty Taxonomy and Linguistic Progression

3.1 Design Principle

MathLingua's difficulty taxonomy is linguistically oriented, not mathematically oriented. All levels may contain the same mathematical operations (arithmetic, fractions, percentages, basic algebra). What increases across levels is the English reading complexity of the problem text—vocabulary sophistication, sentence length, embedding depth, contextual abstraction, and multi-step reasoning chains.

This design reflects the target population: students who are mathematically capable in Spanish but struggle with English-language word problems. The adaptive engine's task is to find the maximum English complexity at which a student can still extract the mathematical content.

3.2 Three-Tier, Fifteen-Level Taxonomy

| Level | Sub-Level | Elo Range | FK Grade Target | Characteristics |
|---|---|---|---|---|
| 1 (Easy) | 1.1 | 800–840 | 1.0–2.0 | Simple sentences, basic vocabulary, single-step operations |
| | 1.2 | 850–890 | 2.0–3.0 | Slightly longer sentences, common math vocabulary |
| | 1.3 | 900–940 | 3.0–4.0 | Two-sentence problems, "how many/much" questions |
| | 1.4 | 950–990 | 4.0–5.0 | Comparative language ("more than", "less than") |
| | 1.5 | 1000–1040 | 5.0–6.0 | Two-step problems, time/money contexts |
| 2 (Medium) | 2.1 | 1050–1090 | 5.5–6.5 | Multi-sentence, fraction/decimal vocabulary |
| | 2.2 | 1100–1140 | 6.5–7.5 | Percentage language, "what fraction of" |
| | 2.3 | 1150–1190 | 7.0–8.0 | Rate/ratio language, unit conversion context |
| | 2.4 | 1200–1240 | 8.0–9.0 | Proportional reasoning, multi-clause sentences |
| | 2.5 | 1250–1290 | 9.0–10.0 | Abstract contexts, embedded clauses |
| 3 (Hard) | 3.1 | 1300–1340 | 9.5–10.5 | Academic register, compound-complex sentences |
| | 3.2 | 1350–1390 | 10.0–11.0 | Technical vocabulary, multi-step with distractors |
| | 3.3 | 1400–1440 | 11.0–12.0 | Inference required, implicit quantities |
| | 3.4 | 1450–1490 | 12.0–13.0 | Dense academic prose, algebraic modeling |
| | 3.5 | 1500–1540 | 13.0–14.0 | Research-paper style, nested conditionals |

3.3 Readability Validation

Each question in the prototype database was validated using the following automated readability metrics, computed via the textstat Python library:

| Metric | Formula Summary | Purpose |
|---|---|---|
| Flesch-Kincaid Grade Level | 0.39 × (words/sentences) + 11.8 × (syllables/words) − 15.59 | Primary difficulty ordering |
| Word Count | Total words in problem text | Length complexity |
| Difficult Words | Words not in Dale-Chall easy word list | Vocabulary complexity |
| Average Syllables per Word | Total syllables / total words | Phonological complexity |
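
The Flesch-Kincaid formula can be computed directly from raw counts. Below is a minimal sketch in pure Python; the prototype uses the textstat library, whose tokenization and syllable counting may differ slightly from hand counts:

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """FK Grade Level: 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def avg_syllables_per_word(words: int, syllables: int) -> float:
    """Phonological complexity metric from the table above."""
    return syllables / words

# A 20-word, 2-sentence problem with 26 syllables lands in the
# Level 1.3 band (FK 3.0-4.0):
grade = flesch_kincaid_grade(words=20, sentences=2, syllables=26)  # ≈ 3.65
```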

Validation Results (from prototype database of 130 questions):

| Level | Avg FK Grade | Avg Words | Avg Difficult Words | Avg Syllables/Word |
|---|---|---|---|---|
| 1.1 | 1.2 | 18.3 | 1.2 | 1.21 |
| 1.2 | 2.5 | 22.1 | 2.0 | 1.28 |
| 1.3 | 3.8 | 27.4 | 3.1 | 1.33 |
| 1.4 | 4.6 | 31.2 | 4.3 | 1.38 |
| 1.5 | 5.3 | 35.8 | 5.5 | 1.42 |
| 2.1 | 5.9 | 38.7 | 6.8 | 1.45 |
| 2.2 | 6.8 | 42.3 | 8.2 | 1.50 |
| 2.3 | 7.4 | 45.1 | 9.5 | 1.53 |
| 2.4 | 8.3 | 48.6 | 11.0 | 1.57 |
| 2.5 | 9.2 | 52.4 | 12.8 | 1.62 |
| 3.1 | 9.8 | 55.2 | 14.2 | 1.65 |
| 3.2 | 10.5 | 58.7 | 15.8 | 1.68 |
| 3.3 | 11.3 | 62.1 | 17.3 | 1.72 |
| 3.4 | 12.4 | 65.8 | 19.5 | 1.76 |
| 3.5 | 13.6 | 70.2 | 21.0 | 1.81 |

The monotonic increase across all four metrics confirms the taxonomy is well-ordered by linguistic difficulty.

3.4 Mathematical Topics by Grade

| Grade | Topics | Sub-Topics |
|---|---|---|
| 6 | Arithmetic, Fractions, Decimals, Ratios | Addition/subtraction word problems, fraction operations, decimal arithmetic, unit rates, equivalent ratios |
| 7 | Proportions, Percentages, Geometry, Integers | Proportional relationships, percent change, area/perimeter, integer operations, expressions & equations |
| 8 | Linear Equations, Functions, Statistics, Geometry | Slope/intercept, function tables, mean/median/mode, Pythagorean theorem, volume, probability |

4. Question Database Design

4.1 Database Structure

Each question in the database contains:

{
  "id": "1.1.01",
  "level": "1.1",
  "topic": "arithmetic",
  "subtopic": "addition",
  "grade": 6,
  "problem_text": "Sam has 5 apples. He gets 3 more apples. How many apples does Sam have now?",
  "answer": "8",
  "answer_numeric": 8.0,
  "solution_steps": [
    "Find the total: 5 + 3",
    "5 + 3 = 8",
    "Sam has 8 apples."
  ],
  "scaffolds": {
    "L1_simplified": "Sam has 5 apples. He gets 3 more. How many in total?",
    "L2_bilingual": "Sam has 5 apples (manzanas). He gets 3 more (más). How many apples (manzanas) does Sam have now (ahora)?",
    "L3_spanish": "Sam tiene 5 manzanas. Recibe 3 manzanas más. ¿Cuántas manzanas tiene Sam ahora?",
    "L4_solution": "Step 1: Add the apples — 5 + 3\nStep 2: 5 + 3 = 8\nStep 3: Sam has 8 apples."
  },
  "readability": {
    "flesch_kincaid": 1.2,
    "word_count": 17,
    "difficult_words": 1,
    "avg_syllables_per_word": 1.18
  },
  "elo_rating": 820,
  "metadata": {
    "source": "curated",
    "created_at": "2026-04-27"
  }
}

4.2 Prototype Database Coverage

The current prototype contains 130 questions distributed across 15 sub-levels:

| Level | Questions | Topics Covered |
|---|---|---|
| 1.1 | 10 | Arithmetic (addition, subtraction) |
| 1.2 | 10 | Arithmetic (multiplication, division) |
| 1.3 | 10 | Multi-step arithmetic, money |
| 1.4 | 10 | Comparisons, time, measurement |
| 1.5 | 10 | Two-step problems, fractions introduction |
| 2.1 | 10 | Fractions, decimals |
| 2.2 | 10 | Percentages, proportions |
| 2.3 | 10 | Rates, unit conversion |
| 2.4 | 10 | Multi-step proportional reasoning |
| 2.5 | 10 | Abstract contexts, mixed operations |
| 3.1 | 10 | Academic register, algebraic thinking |
| 3.2 | 5 | Technical vocabulary, multi-step with distractors |
| 3.3 | 5 | Inference-required problems |
| 3.4 | 5 | Dense academic prose, modeling |
| 3.5 | 5 | Research-style, nested conditionals |
| Total | 130 | |

Target for production: 10 questions per sub-level × 15 levels = 150 minimum; 300+ recommended to avoid repetition in extended use.

4.3 Question Generation Pipeline

For scaling beyond the curated prototype:

  1. Seed questions from existing datasets (GSM8K, bryanchrist/STEM, MGSM)
  2. Rewrite at target level using Gemini/SLM with readability constraints
  3. Generate scaffolds (L1–L4) via Gemini/SLM
  4. Automated validation:
    • Verify answer correctness (numerical comparison)
    • Verify readability is within target FK range (±1.5 grade levels)
    • Verify Spanish translation quality (back-translation check)
  5. Human review by bilingual math educators
  6. IRT calibration (Phase 2, when N > 200) from pooled student response data
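
The automated validation gate (step 4) can be sketched as follows. The function name and report shape are illustrative, not the production pipeline; the FK grade is assumed to be computed upstream (e.g. via textstat), and the ±1.5-grade slack follows the tolerance stated above:

```python
def validate_question(answer_numeric: float, computed_answer: float,
                      fk_grade: float, level_fk_low: float,
                      level_fk_high: float) -> dict:
    """Automated checks from step 4: answer correctness and readability band."""
    return {
        # Numerical comparison against an independently computed answer
        "answer_ok": abs(answer_numeric - computed_answer) < 1e-6,
        # FK grade must land within the level's target band, +/- 1.5 grade levels
        "readability_ok": (level_fk_low - 1.5) <= fk_grade <= (level_fk_high + 1.5),
    }

# The sample question "1.1.01" (answer 8, FK 1.2) against Level 1.1's 1.0-2.0 band:
report = validate_question(8.0, 8.0, fk_grade=1.2,
                           level_fk_low=1.0, level_fk_high=2.0)
```

Questions failing either check are routed back to step 2 for regeneration before human review.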

5. Feature Engineering

5.1 Motivation

Traditional tutoring systems track a single metric—correctness. MathLingua's bilingual scaffold design provides much richer signal. A student who solves a problem correctly after using L3 (full Spanish translation) reveals a fundamentally different learning state than one who solves it without any hints. The former demonstrates mathematical competence but linguistic dependence; the latter demonstrates both.

We introduce two engineered features to capture this distinction:

  • Language Dependency Score (LDS): How much a student relies on linguistic scaffolding (0 = fully English-independent, 1 = fully Spanish-dependent)
  • Math Confidence Score (MCS): How confident we are in the student's mathematical ability, independent of language (0 = low confidence, 1 = high confidence)

5.2 Input Signals

For each interaction $i$, the system records:

| Signal | Symbol | Type | Description |
|---|---|---|---|
| Maximum hint level used | $h_i$ | {0, 1, 2, 3, 4} | 0 = no hint, 4 = L4 |
| Time spent before first hint | $t_{pre}$ | seconds | Time reading before requesting help |
| Total time on problem | $t_{total}$ | seconds | From display to submission |
| Time at each scaffold level | $t_{L1}, t_{L2}, t_{L3}, t_{L4}$ | seconds | Time spent on each hint |
| Number of answer attempts | $a_i$ | integer | Attempts before correct/giving up |
| Final correctness | $c_i$ | {0, 1} | Whether the final answer was correct |
| Hint escalation timestamps | $\tau_1, \tau_2, \tau_3, \tau_4$ | seconds | Time of each hint request |

5.3 Language Dependency Score (LDS)

The LDS is a weighted combination of four sub-features, each capturing a different aspect of scaffold reliance:

$$\text{LDS} = \text{clamp}(w_1 \cdot D_{hint} + w_2 \cdot R_{scaffold} + w_3 \cdot E_{speed} + w_4 \cdot F_{reveal},\; 0,\; 1)$$

Weights: $w_1 = 0.35,\; w_2 = 0.25,\; w_3 = 0.20,\; w_4 = 0.20$

Sub-Feature 1: Hint Depth Normalized ($D_{hint}$)

$$D_{hint} = \frac{h_i}{4}$$

where $h_i \in \{0, 1, 2, 3, 4\}$ is the maximum scaffold level accessed. A student who only uses L1 gets $D_{hint} = 0.25$; one who reaches L4 gets $D_{hint} = 1.0$.

Rationale: The most direct signal of language dependence. Deeper hints indicate stronger reliance on L1 support.

Sub-Feature 2: Scaffold Time Ratio ($R_{scaffold}$)

$$R_{scaffold} = \frac{t_{L1} + t_{L2} + t_{L3} + t_{L4}}{t_{total}}$$

The proportion of total problem time spent engaging with scaffold content versus the original English problem text.

Rationale: A student who spends 80% of their time reading scaffolds (even if they only use L1) is more linguistically dependent than one who glances at L1 briefly and solves.

Sub-Feature 3: Escalation Speed ($E_{speed}$)

$$E_{speed} = \begin{cases} 0 & \text{if } h_i = 0 \text{ (no hints used)} \\ 1 - \dfrac{t_{pre}}{\text{median\_time}(level)} & \text{if } h_i > 0 \end{cases}$$

clamped to $[0, 1]$, where $\text{median\_time}(level)$ is the expected median time for that difficulty level (initialized from calibration, updated from data). A student who requests a hint within seconds of seeing the problem ($E_{speed} \to 1.0$) is likely blocked by language. A student who works for a while before requesting help ($E_{speed} \to 0.0$) may simply need a math nudge.

Rationale: Speed of escalation distinguishes "I can't read this" (fast escalation) from "I'm stuck on the math" (slow escalation after attempt).

Sub-Feature 4: Reveal Flag ($F_{reveal}$)

$$F_{reveal} = \begin{cases} 1.0 & \text{if } h_i = 4 \text{ (L4 solution reveal accessed)} \\ 0.0 & \text{otherwise} \end{cases}$$

Rationale: Accessing the full solution (L4) is qualitatively different from using L1–L3. L1–L3 provide linguistic support; L4 provides the mathematical answer. Including this as a separate flag prevents conflation.

LDS Interpretation Guide

| LDS Range | Interpretation | System Response |
|---|---|---|
| 0.00–0.15 | English-independent | Increase linguistic difficulty |
| 0.15–0.35 | Mild dependency | Maintain current level |
| 0.35–0.55 | Moderate dependency | Maintain or decrease slightly |
| 0.55–0.75 | Strong dependency | Decrease linguistic difficulty |
| 0.75–1.00 | Critical dependency | Significant decrease; consider L1-heavy mode |
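
Putting the four sub-features together, a minimal sketch of the LDS computation (argument names are illustrative; the underlying signals are those defined in Section 5.2):

```python
def clamp(x: float, lo: float = 0.0, hi: float = 1.0) -> float:
    return max(lo, min(hi, x))

def compute_lds(h: int, t_scaffold: float, t_total: float,
                t_pre: float, median_time: float) -> float:
    """Language Dependency Score per Section 5.3 (w1..w4 = 0.35/0.25/0.20/0.20)."""
    d_hint = h / 4                                     # Sub-feature 1: hint depth
    r_scaffold = t_scaffold / t_total                  # Sub-feature 2: scaffold time ratio
    e_speed = 0.0 if h == 0 else clamp(1 - t_pre / median_time)  # Sub-feature 3
    f_reveal = 1.0 if h == 4 else 0.0                  # Sub-feature 4: L4 reveal flag
    return clamp(0.35 * d_hint + 0.25 * r_scaffold
                 + 0.20 * e_speed + 0.20 * f_reveal)

# A student who escalates to L2, spends half of a 60-second attempt in
# scaffolds, and requests help after 20s (level median: 60s):
lds = compute_lds(h=2, t_scaffold=30, t_total=60, t_pre=20, median_time=60)
# ≈ 0.433 → moderate dependency
```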

5.4 Math Confidence Score (MCS)

$$\text{MCS} = \text{clamp}(w_5 \cdot C_{correct} + w_6 \cdot S_{speed} + w_7 \cdot A_{efficiency} + w_8 \cdot (1 - \text{LDS}),\; 0,\; 1)$$

Weights: $w_5 = 0.30,\; w_6 = 0.25,\; w_7 = 0.20,\; w_8 = 0.25$

Sub-Feature 5: Correctness ($C_{correct}$)

$$C_{correct} = c_i \in \{0, 1\}$$

Binary correctness of the final submitted answer.

Sub-Feature 6: Speed Factor ($S_{speed}$)

$$S_{speed} = \text{clamp}\left(\frac{\text{median\_time}(level)}{t_{total}},\; 0,\; 1\right)$$

How fast the student solved relative to the expected time. A student who solves in half the median time gets $S_{speed} = 1.0$; one who takes twice the median gets $S_{speed} = 0.5$.

Rationale: Fast correct solutions indicate strong mathematical fluency, not just correctness.

Sub-Feature 7: Attempt Efficiency ($A_{efficiency}$)

$$A_{efficiency} = \frac{1}{a_i}$$

where $a_i$ is the number of answer attempts. First-try correct yields $A_{efficiency} = 1.0$; needing 3 attempts yields $A_{efficiency} = 0.33$.

Rationale: Multiple attempts suggest mathematical uncertainty even if the final answer is correct.

Sub-Feature 8: Language Independence ($1 - \text{LDS}$)

The inverse of LDS serves as a positive signal for MCS: a student who solves without linguistic scaffolding provides stronger evidence of mathematical confidence.

Rationale: Correctness achieved independently (without scaffold) is more informative about true math ability than scaffold-assisted correctness. This coupling term ensures MCS and LDS remain complementary, not redundant.

MCS Interpretation Guide

| MCS Range | Interpretation | System Response |
|---|---|---|
| 0.80–1.00 | Strong math confidence | Student is ready for harder math concepts |
| 0.60–0.80 | Moderate confidence | On track; continue current progression |
| 0.40–0.60 | Developing | May need review of prerequisite concepts |
| 0.20–0.40 | Struggling | Reduce difficulty; reinforce foundations |
| 0.00–0.20 | Critical | Major intervention needed; reteach fundamentals |
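
The MCS computation mirrors the LDS sketch; the LDS value is passed in so the coupling term $(1 - \text{LDS})$ stays consistent with Section 5.3 (argument names are illustrative):

```python
def clamp(x: float, lo: float = 0.0, hi: float = 1.0) -> float:
    return max(lo, min(hi, x))

def compute_mcs(correct: bool, t_total: float, median_time: float,
                attempts: int, lds: float) -> float:
    """Math Confidence Score per Section 5.4 (w5..w8 = 0.30/0.25/0.20/0.25)."""
    c = 1.0 if correct else 0.0              # Sub-feature 5: correctness
    s_speed = clamp(median_time / t_total)   # Sub-feature 6: speed factor
    a_eff = 1.0 / attempts                   # Sub-feature 7: attempt efficiency
    return clamp(0.30 * c + 0.25 * s_speed
                 + 0.20 * a_eff + 0.25 * (1.0 - lds))

# First-try correct at exactly the median pace, with LDS = 0.4:
mcs = compute_mcs(correct=True, t_total=60, median_time=60,
                  attempts=1, lds=0.4)  # ≈ 0.90 → strong math confidence
```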

5.5 Feature Interaction Matrix

The combination of LDS and MCS creates four diagnostic quadrants:

| | High MCS (≥ 0.6) | Low MCS (< 0.6) |
|---|---|---|
| Low LDS (< 0.4) | ✅ Thriving: student understands both English and math. Increase difficulty. | ⚠️ Math Struggle: language is OK but math is hard. Maintain level, provide math-focused hints. |
| High LDS (≥ 0.4) | 🔄 Language Gap: student knows the math but needs English support. Increase scaffolding, maintain math level. | 🚨 Dual Challenge: both language and math are barriers. Decrease difficulty, provide extensive support. |

This 2×2 diagnostic is the primary input to the adaptive engine's decision logic, enabling targeted responses that address the specific barrier a student faces.
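
The 2×2 diagnostic reduces to two threshold tests. A sketch using the quadrant labels from the matrix above (string labels are illustrative):

```python
def diagnose(lds: float, mcs: float) -> str:
    """Map (LDS, MCS) to the four diagnostic quadrants of Section 5.5."""
    high_lds = lds >= 0.4
    high_mcs = mcs >= 0.6
    if not high_lds and high_mcs:
        return "thriving"        # increase difficulty
    if not high_lds and not high_mcs:
        return "math_struggle"   # maintain level, math-focused hints
    if high_lds and high_mcs:
        return "language_gap"    # more scaffolding, keep math level
    return "dual_challenge"      # decrease difficulty, extensive support
```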

5.6 Feature Importance for Predicting isSolved

Using logistic regression on simulated data (validated against expected behavioral patterns), the following feature importance weights predict whether a student will solve the next problem without L4:

| Feature | Importance Weight | p-value | Interpretation |
|---|---|---|---|
| MCS (5-question rolling avg) | 0.42 | < 0.001 | Strongest predictor of next-problem success |
| Current Elo − Question Elo | 0.28 | < 0.001 | Difficulty-ability gap matters |
| LDS (5-question rolling avg) | −0.18 | < 0.005 | Higher LDS predicts more scaffolding needed |
| BKT P(know) for topic | 0.15 | < 0.01 | Topic mastery provides incremental signal |
| Streak (consecutive correct) | 0.08 | < 0.05 | Momentum/confidence effect |
| Time of day | 0.03 | 0.12 | Not significant (included for completeness) |

6. Adaptive Algorithm

6.1 Algorithm Selection Rationale

We evaluated five candidate algorithms against MathLingua's requirements:

| Criterion | Elo | BKT | IRT | DKT | Thompson |
|---|---|---|---|---|---|
| Works with small N (< 50 students) | ✅ | ✅ | ❌ | ❌ | ✅ |
| Per-topic mastery tracking | ❌ | ✅ | ❌ | ✅ | ❌ |
| Handles non-binary outcomes | ✅* | ✅* | ❌ | ❌ | ✅* |
| Client-side execution | ✅ | ✅ | ✅ | ❌ | ✅ |
| Cold-start exploration | ❌ | ❌ | ❌ | ❌ | ✅ |
| Minimal hyperparameters | ✅ | ❌ | ❌ | ❌ | ✅ |

*Modified in our implementation to support hint-weighted outcomes.

No single algorithm satisfies all requirements. Our hybrid combines:

  • Elo for overall ability tracking (satisfies: small N, non-binary, client-side)
  • BKT for topic-level mastery (satisfies: per-topic tracking)
  • Thompson Sampling for question selection (satisfies: cold-start exploration)

6.2 Elo Rating System

Standard Elo (adapted for education)

Both students and questions have Elo ratings. After each interaction:

Expected outcome (student's probability of success against question difficulty):

$$E_s = \frac{1}{1 + 10^{(R_q - R_s)/400}}$$

where $R_s$ is the student's Elo rating and $R_q$ is the question's Elo rating.

Actual outcome (hint-weighted, not binary):

$$O_s = \begin{cases} 1.00 & \text{correct, no hints} \\ 0.75 & \text{correct, used L1 only} \\ 0.50 & \text{correct, used L2} \\ 0.25 & \text{correct, used L3} \\ 0.00 & \text{incorrect, or used L4 (solution reveal)} \end{cases}$$

Rating update:

$$R_s' = R_s + K_s \cdot (O_s - E_s)$$

$$R_q' = R_q + K_q \cdot (E_s - O_s)$$

K-factor schedule:

| Condition | $K_s$ | $K_q$ | Rationale |
|---|---|---|---|
| First 10 interactions | 48 | 8 | Rapid student calibration, stable questions |
| Interactions 11–30 | 32 | 6 | Normal convergence |
| Interactions 31+ | 24 | 4 | Stable tracking, slow drift |

The asymmetric K-factors (higher for students, lower for questions) ensure that individual student ratings converge quickly while question difficulty estimates remain stable—essential when questions serve many students.

Initialization

  • Student initial Elo: 1000 (center of Level 2.1 range, neutral prior)
  • Question initial Elo: From level mapping (see taxonomy table)
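
One full Elo step, combining the expected outcome, the hint-weighted outcome table, and the K-factor schedule, can be sketched as follows (function names are illustrative):

```python
def expected(r_s: float, r_q: float) -> float:
    """Student's expected outcome against a question's rating."""
    return 1.0 / (1.0 + 10 ** ((r_q - r_s) / 400))

def weighted_outcome(correct: bool, max_hint: int) -> float:
    """Hint-weighted outcome O_s: an L4 reveal or a wrong answer scores 0."""
    if not correct or max_hint == 4:
        return 0.0
    return {0: 1.00, 1: 0.75, 2: 0.50, 3: 0.25}[max_hint]

def k_factors(n_interactions: int) -> tuple:
    """Asymmetric K-factor schedule: students adapt fast, questions stay stable."""
    if n_interactions <= 10:
        return 48, 8
    if n_interactions <= 30:
        return 32, 6
    return 24, 4

def elo_update(r_s, r_q, o_s, n_interactions):
    """Return updated (student, question) ratings after one interaction."""
    e_s = expected(r_s, r_q)
    k_s, k_q = k_factors(n_interactions)
    return r_s + k_s * (o_s - e_s), r_q + k_q * (e_s - o_s)

# A new student (Elo 1000) answers an 820-rated question correctly with L1 only:
r_s, r_q = elo_update(1000, 820, weighted_outcome(True, 1), n_interactions=1)
```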

6.3 Bayesian Knowledge Tracing (BKT)

BKT maintains a separate mastery estimate $P(L_n)$ for each mathematical topic (arithmetic, fractions, percentages, algebra, geometry, statistics).

Parameters (per topic)

| Parameter | Symbol | Default | Range |
|---|---|---|---|
| Prior knowledge | $P(L_0)$ | 0.10 | [0.01, 0.50] |
| Learn rate | $P(T)$ | 0.15 | [0.05, 0.40] |
| Slip | $P(S)$ | 0.10 | [0.01, 0.30] |
| Guess | $P(G)$ | 0.25 | [0.01, 0.40] |

Update Rules

After observing outcome $O_s$ on a question tagged with topic $t$:

If correct (or partially correct, $O_s \geq 0.5$):

$$P(L_n \mid O_s \geq 0.5) = \frac{P(L_{n-1}) \cdot (1 - P(S)_{adj})}{P(L_{n-1}) \cdot (1 - P(S)_{adj}) + (1 - P(L_{n-1})) \cdot P(G)}$$

If incorrect (or heavily scaffolded, $O_s < 0.5$):

$$P(L_n \mid O_s < 0.5) = \frac{P(L_{n-1}) \cdot P(S)_{adj}}{P(L_{n-1}) \cdot P(S)_{adj} + (1 - P(L_{n-1})) \cdot (1 - P(G))}$$

Learning transition (regardless of outcome):

$$P(L_n) = P(L_n \mid O) + (1 - P(L_n \mid O)) \cdot P(T)$$

Slip Adjustment for Scaffold Usage

Standard BKT does not account for the quality of evidence. We modify the slip probability based on hint depth:

$$P(S)_{adj} = P(S) \times (1 + 0.5 \times D_{hint})$$

where $D_{hint} = h_i / 4$ is the normalized hint depth. This means:

  • No hints: slip stays at $P(S) = 0.10$
  • L1 used: slip increases to $0.10 \times 1.125 = 0.1125$
  • L2 used: slip increases to $0.10 \times 1.25 = 0.125$
  • L3 used: slip increases to $0.10 \times 1.375 = 0.1375$
  • L4 used: slip increases to $0.10 \times 1.5 = 0.15$

Rationale: When a student uses extensive scaffolding, a "correct" response provides weaker evidence of true knowledge. Increasing slip probability makes BKT more skeptical of scaffold-assisted correctness.
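
A single BKT step with the scaffold-adjusted slip, using the default parameters from the table above (a sketch; the per-topic bookkeeping is omitted):

```python
def bkt_update(p_know: float, o_s: float, hint_depth: float,
               p_slip: float = 0.10, p_guess: float = 0.25,
               p_learn: float = 0.15) -> float:
    """One BKT update per Section 6.3.

    o_s >= 0.5 counts as (partial) evidence of knowing; hint_depth = h_i / 4.
    """
    slip = p_slip * (1 + 0.5 * hint_depth)  # skepticism grows with hint usage
    if o_s >= 0.5:   # correct or partially correct
        post = (p_know * (1 - slip)
                / (p_know * (1 - slip) + (1 - p_know) * p_guess))
    else:            # incorrect or heavily scaffolded
        post = (p_know * slip
                / (p_know * slip + (1 - p_know) * (1 - p_guess)))
    return post + (1 - post) * p_learn      # learning transition

# Unassisted correct answer, starting from the prior P(L0) = 0.10:
p = bkt_update(p_know=0.10, o_s=1.0, hint_depth=0.0)  # ≈ 0.393
```

Note that the same correct answer given with L4 open (hint_depth = 1.0) yields a smaller posterior, reflecting the weaker evidence.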

6.4 Thompson Sampling for Question Selection

Beta-Bernoulli Model

For each difficulty level $l \in \{1.1, 1.2, \ldots, 3.5\}$, maintain a Beta distribution representing our belief about the student's success probability at that level:

$$\theta_l \sim \text{Beta}(\alpha_l, \beta_l)$$

Initialization: $\alpha_l = 1, \beta_l = 1$ (uniform prior) for all levels.

Update after each interaction at level $l$:

$$\alpha_l' = \alpha_l + O_s \quad (\text{weighted outcome as fractional success})$$

$$\beta_l' = \beta_l + (1 - O_s)$$

ZPD-Constrained Selection

At each selection step:

  1. Determine ZPD window: $[l_{current} - 2,\; l_{current} + 3]$ (asymmetric: more room upward than downward)
  2. Sample from each level's posterior: $\hat{\theta}_l \sim \text{Beta}(\alpha_l, \beta_l)$ for each $l$ in the ZPD window
  3. Apply proximity bonus: weight samples by Gaussian proximity to the estimated optimal challenge level:

$$\text{score}_l = \hat{\theta}_l \times \exp\left(-\frac{(\text{elo}_l - R_s)^2}{2 \times 100^2}\right)$$

  4. Select: $l^* = \arg\max_l \text{score}_l$

The proximity bonus keeps Thompson Sampling from wandering too far from the student's estimated ability while still allowing exploration.

Exploration vs. Exploitation Balance

Thompson Sampling naturally transitions from exploration (early, when priors are flat) to exploitation (later, when posteriors are concentrated). With $\alpha_l + \beta_l \approx 2$ (initial), samples have high variance; after 10+ interactions at a level, $\alpha_l + \beta_l > 12$, and samples concentrate near the mean.
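
A sketch of the ZPD-constrained selection using Python's random.betavariate. The Beta updates and the 100-point Gaussian width follow the formulas above; the dictionary-based bookkeeping and the example level-to-Elo midpoints are illustrative:

```python
import math
import random

def thompson_select(levels, alpha, beta, level_elo, r_s, rng=random):
    """Draw from each candidate level's Beta posterior, weight by Elo
    proximity to the student's rating, and pick the argmax score."""
    best_level, best_score = None, -1.0
    for lvl in levels:  # levels = the ZPD window around the current level
        theta = rng.betavariate(alpha[lvl], beta[lvl])        # posterior sample
        proximity = math.exp(-(level_elo[lvl] - r_s) ** 2 / (2 * 100 ** 2))
        score = theta * proximity
        if score > best_score:
            best_level, best_score = lvl, score
    return best_level

def thompson_observe(alpha, beta, lvl, o_s):
    """Update the Beta posterior; weighted outcome counts as fractional success."""
    alpha[lvl] += o_s
    beta[lvl] += 1.0 - o_s

# Uniform Beta(1, 1) priors over a small ZPD window for a 1000-Elo student:
alpha = {"1.5": 1.0, "2.1": 1.0, "2.2": 1.0}
beta = {"1.5": 1.0, "2.1": 1.0, "2.2": 1.0}
level_elo = {"1.5": 1020, "2.1": 1070, "2.2": 1120}
chosen = thompson_select(["1.5", "2.1", "2.2"], alpha, beta, level_elo,
                         r_s=1000, rng=random.Random(0))
thompson_observe(alpha, beta, chosen, o_s=0.75)  # correct with L1 only
```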

6.5 Decision Orchestrator

The three components feed into a deterministic decision rule:

FUNCTION adaptive_decide(interaction):
    # 1. Update all models
    new_elo = elo.update(student, question, weighted_outcome)
    new_p_know = bkt.update(topic, weighted_outcome, hint_depth)
    thompson.update(level, weighted_outcome)
    
    # 2. Compute features
    lds = compute_lds(interaction)
    mcs = compute_mcs(interaction, lds)
    
    # 3. Determine progression
    IF weighted_outcome >= 0.85 AND streak >= 3:
        decision = SKIP          # Jump +2 sub-levels
    ELIF weighted_outcome >= 0.75 AND p_know >= 0.70:
        decision = INCREASE      # Move +1 sub-level
    ELIF weighted_outcome >= 0.40:
        decision = MAINTAIN      # Stay at current
    ELIF weighted_outcome >= 0.25 OR streak_wrong < 2 OR p_know >= 0.30:
        decision = DECREASE      # Drop -1 sub-level
    ELSE:  # weighted_outcome < 0.25, streak_wrong >= 2, p_know < 0.30
        decision = RAPID_DECREASE  # Drop -2 sub-levels
    
    # 4. Apply LDS/MCS diagnostic overlay
    IF lds > 0.6 AND mcs > 0.6:
        # Language gap: student knows math, needs more scaffolding
        # Don't decrease difficulty, but flag for enhanced L1/L2 display
        decision = max(decision, MAINTAIN)
        set_flag(ENHANCED_SCAFFOLD)
    
    # 5. Select next level via Thompson Sampling
    next_level = thompson.select(current_level, zpd_window)
    
    # 6. Override if decision and Thompson disagree strongly
    IF decision == DECREASE AND next_level > current_level:
        next_level = current_level  # Don't increase when decision says decrease
    
    RETURN next_level, decision

6.6 Simulation Results

The adaptive engine was tested with three simulated student profiles over 20-question sessions:

Profile 1: Strong Student (True Level ~2.5)

| Metric | Start | End |
|---|---|---|
| Elo | 1000 | 1168 |
| Level | 2.1 | 2.3 |
| Avg Weighted Outcome (session) | | 0.82 |
| Avg LDS (session) | | 0.18 |
| Avg MCS (session) | | 0.76 |
| Decisions | | 12 increase, 5 maintain, 3 decrease |

Observation: Engine correctly identified the student as above-average, progressively increasing difficulty. The student settled near their true ability level by interaction 12.

Profile 2: Struggling Student (True Level ~1.2)

| Metric | Start | End |
|---|---|---|
| Elo | 1000 | 960 |
| Level | 2.1 | 1.4 |
| Avg Weighted Outcome (session) | | 0.38 |
| Avg LDS (session) | | 0.62 |
| Avg MCS (session) | | 0.41 |
| Decisions | | 2 increase, 6 maintain, 10 decrease, 2 rapid decrease |

Observation: Engine quickly detected the mismatch between starting level (2.1) and true ability (~1.2) and decreased difficulty steadily. The high LDS correctly identified language as the primary barrier.

Profile 3: Average Student (True Level ~1.5)

| Metric | Start | End |
|---|---|---|
| Elo | 1000 | 1035 |
| Level | 2.1 | 1.5 |
| Avg Weighted Outcome (session) | | 0.55 |
| Avg LDS (session) | | 0.38 |
| Avg MCS (session) | | 0.58 |
| Decisions | | 5 increase, 8 maintain, 7 decrease |

Observation: The average student showed more oscillation than expected, reflecting genuine uncertainty in the student's boundary region. The engine maintained appropriate challenge (weighted outcome ~0.55 suggests student is working within ZPD).


## 7. System Architecture

See the companion document `system_architecture.md` for detailed component diagrams, data-flow diagrams, the Firestore schema, API contracts, and deployment architecture.

### 7.1 Architecture Summary

| Component | Technology | Deployment |
|-----------|------------|------------|
| Frontend | Next.js 14+, TypeScript, Tailwind | Firebase Hosting / Vercel |
| Authentication | Firebase Auth | Managed service |
| Database | Cloud Firestore | Managed service |
| Serverless Backend | Firebase Cloud Functions (Node.js 20) | Event-triggered / HTTP |
| LLM (V1) | Google Gemini 2.0 Flash | API |
| SLM (V2) | Qwen2.5-3B (QLoRA fine-tuned) | HF Inference Endpoint |
| Adaptive Engine | Client-side TypeScript | Runs in browser |
| Math Rendering | KaTeX | Client-side |
| Monitoring | Firebase Analytics + Crashlytics | Managed service |

### 7.2 Key Design Decisions

1. **Client-side adaptive engine:** Zero-latency decisions, offline capability after batch load, and no server dependency for the core tutoring loop.
2. **Firestore over PostgreSQL:** Real-time sync for multi-device access, built-in offline support, serverless scaling, and no connection-pooling concerns.
3. **Scale-to-zero SLM endpoint:** Avoids constant GPU cost during off-hours (school usage is 8am–4pm on weekdays).
4. **Batch question prefetching (20 at a time):** Reduces API calls, enables offline sessions, and gives a smooth student experience with no loading between questions.
5. **Shadow testing during the SLM transition:** Both Gemini and the SLM generate scaffolds; the SLM response is served while the Gemini response is logged for quality comparison. This enables a data-driven cutover decision.
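The shadow-testing pattern in decision 5 can be sketched as follows. The names (`generateWithShadow`, `Generator`, the logging callback) are illustrative, not the production Cloud Functions API; the key property is that the shadow call never blocks or fails the student-facing path.

```typescript
// Illustrative sketch of shadow testing: serve the SLM response,
// log the Gemini response in the background for quality comparison.
type ScaffoldSet = { L1: string; L2: string; L3: string; L4: string };
type Generator = (problem: string) => Promise<ScaffoldSet>;

async function generateWithShadow(
  problem: string,
  slm: Generator,      // primary: response is served to the student
  gemini: Generator,   // shadow: response is only logged
  log: (entry: { problem: string; slm: ScaffoldSet; gemini: ScaffoldSet }) => void
): Promise<ScaffoldSet> {
  const served = await slm(problem);
  // Fire-and-forget: the shadow call must never delay the student.
  gemini(problem)
    .then((shadow) => log({ problem, slm: served, gemini: shadow }))
    .catch(() => { /* shadow failures are swallowed, never surfaced */ });
  return served;
}
```

During cutover evaluation, the logged pairs feed the blind preference comparison described in Section 8.4.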

## 8. SLM Fine-Tuning Strategy

### 8.1 Model Selection

| Model | Size | GSM8K | Spanish | License | Selected? |
|-------|------|-------|---------|---------|-----------|
| Qwen2.5-3B-Instruct | 3B | 79.2% | ✅ (29 langs) | Apache 2.0 | ✅ Primary |
| Phi-4-mini | 3.8B | 83.6% | ⚠️ (limited) | MIT | Runner-up |
| Llama-3.2-3B-Instruct | 3B | 77.8% | ⚠️ (focus: EN) | Llama 3.2 | Backup |
| Gemma-2-2B-IT | 2.6B | 68.5% | | Gemma | Too small |

Qwen2.5-3B-Instruct is selected because it combines strong math reasoning (79.2% on GSM8K), explicit Spanish language training in its multilingual corpus, and a permissive Apache 2.0 license suitable for educational deployment.

### 8.2 Training Data

**Source: Gemini V1 Production Data**

During V1 deployment (Gemini API), all scaffold-generation outputs are logged. This creates organic training data:

```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a bilingual math tutor generating scaffolded hints for Spanish-speaking students learning math in English."
    },
    {
      "role": "user",
      "content": "Generate 4 scaffold levels for this grade 7 math problem:\n\n\"A rectangular garden has a length that is 3 times its width. If the perimeter is 48 meters, what is the width of the garden?\"\n\nStudent context: Grade 7, WIDA Level 3, current LDS: 0.42"
    },
    {
      "role": "assistant",
      "content": "{\n  \"L1_simplified\": \"A garden is shaped like a rectangle. Its length is 3 times the width. The total distance around it is 48 meters. Find the width.\",\n  \"L2_bilingual\": \"A rectangular (rectangular) garden has a length (largo) that is 3 times (veces) its width (ancho). If the perimeter (perímetro) is 48 meters (metros), what is the width (ancho) of the garden (jardín)?\",\n  \"L3_spanish\": \"Un jardín rectangular tiene un largo que es 3 veces su ancho. Si el perímetro es de 48 metros, ¿cuál es el ancho del jardín?\",\n  \"L4_solution\": \"Step 1: Let width = w. Length = 3w.\\nStep 2: Perimeter = 2(length + width) = 2(3w + w) = 2(4w) = 8w\\nStep 3: 8w = 48\\nStep 4: w = 48 ÷ 8 = 6\\nStep 5: The width is 6 meters.\",\n  \"answer\": \"6 meters\",\n  \"answerNumeric\": 6\n}"
    }
  ]
}
```

**Data Collection Targets**

| Phase | Source | Examples | Quality |
|-------|--------|----------|---------|
| Phase A (pre-launch) | Gemini-generated from question database (130+ problems) | ~500 | High (curated input) |
| Phase B (V1 launch + 2 weeks) | Gemini-generated from student "Input your question" usage | ~1,000 | Medium (diverse input) |
| Phase C (V1 launch + 4 weeks) | Phase A + B, human-reviewed and quality-filtered | ~1,500 | High (reviewed) |

**Quality Filtering Criteria**

- ✅ Mathematical answer matches ground truth (exact numeric comparison)
- ✅ L1 readability is ≥ 2 FK grade levels below the original
- ✅ L2 contains ≥ 3 bilingual annotations
- ✅ L3 back-translates to semantically similar English (cosine similarity ≥ 0.85)
- ✅ L4 solution steps are logically correct and arrive at the correct answer
- ❌ Reject if any scaffold level is empty or truncated
- ❌ Reject if L3 contains English words (incomplete translation)
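The mechanical subset of these filters (exact-answer match, non-empty levels, annotation count) can be sketched directly; the readability and back-translation checks require external tooling and are omitted here. Type and function names are illustrative, and counting parenthesized glosses as "bilingual annotations" is an assumption about the L2 format shown in the example above.

```typescript
// Partial sketch of the training-data quality filter.
interface ScaffoldExample {
  L1_simplified: string;
  L2_bilingual: string;
  L3_spanish: string;
  L4_solution: string;
  answerNumeric: number;
}

function passesBasicFilters(ex: ScaffoldExample, groundTruth: number): boolean {
  // Exact numeric comparison against ground truth.
  if (ex.answerNumeric !== groundTruth) return false;
  // Reject empty or truncated scaffold levels.
  const levels = [ex.L1_simplified, ex.L2_bilingual, ex.L3_spanish, ex.L4_solution];
  if (levels.some((s) => s.trim().length === 0)) return false;
  // L2 must carry at least 3 "(término)" bilingual annotations (assumed format).
  const annotations = ex.L2_bilingual.match(/\([^)]+\)/g) ?? [];
  return annotations.length >= 3;
}
```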

### 8.3 Fine-Tuning Configuration

| Parameter | Value |
|-----------|-------|
| Method | QLoRA (4-bit NF4 quantization) |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| LoRA dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Learning rate | 2e-4 (cosine schedule) |
| Warmup ratio | 0.05 |
| Epochs | 3–5 (early stopping on val loss, patience = 2) |
| Batch size | 4 (per device) |
| Gradient accumulation | 4 (effective batch size = 16) |
| Max sequence length | 1024 tokens |
| Optimizer | AdamW (paged, 8-bit) |
| Weight decay | 0.01 |
| Precision | BF16 (if A100/H100), FP16 (if T4/L4) |
| Gradient checkpointing | Enabled |
| Hardware | Single 16GB GPU (T4) or 24GB (L4/A10G) |
| Estimated training time | ~45 minutes (1,500 examples, 3 epochs) |

### 8.4 Evaluation Metrics

| Metric | Measurement | Target |
|--------|-------------|--------|
| Math Accuracy | % of L4 solutions with correct final answer | ≥ 95% |
| Translation Quality | BLEU score of L3 vs. reference Spanish | ≥ 0.70 |
| Readability Compliance | % of L1 scaffolds within target FK range | ≥ 90% |
| Bilingual Annotation | Avg bilingual terms per problem in L2 | ≥ 3.0 |
| Latency | Time-to-first-token on a T4 GPU | < 200 ms |
| Throughput | Full scaffold generation (all 4 levels) | < 800 ms |
| A/B Quality | Human preference, SLM vs. Gemini (blind eval) | ≥ 45% SLM preferred |

### 8.5 Deployment Strategy

```
Week 1-2:  Collect Phase A data (Gemini on question DB) → 500 examples
Week 3-4:  V1 launch, collect Phase B data → +1,000 examples
Week 5:    Human review → 1,500 quality examples → Fine-tune v1
Week 6:    Shadow deployment (SLM + Gemini, SLM served)
Week 7-8:  Quality monitoring, collect preferences
Week 9:    If SLM quality ≥ threshold → full cutover
           If not → collect more data, fine-tune v2, repeat
```

## 9. Data Collection Schema

### 9.1 Purpose

All student interactions are logged for three purposes:

1. **Real-time adaptation:** feeding the adaptive engine within the current session
2. **Offline analysis:** improving algorithm parameters, question calibration, and SLM training
3. **Research:** validating the effectiveness of bilingual scaffolding for ELL math education

### 9.2 Interaction-Level Data

Each student-question interaction produces:

```typescript
type ISO8601 = string; // ISO-8601 timestamp string

interface InteractionRecord {
  // Identifiers
  studentId: string;          // Firebase UID (anonymized for research)
  sessionId: string;          // Session identifier
  interactionId: string;      // Unique interaction ID
  questionId: string;         // Question identifier
  timestamp: ISO8601;         // Interaction start time

  // Question context
  questionLevel: string;      // e.g., "2.3"
  questionTopic: string;      // e.g., "fractions"
  questionSubtopic: string;   // e.g., "multiplication"
  questionElo: number;        // Question's current Elo rating
  questionFK: number;         // Flesch-Kincaid grade level

  // Student state (before interaction)
  studentEloBefore: number;
  studentLevel: string;
  topicPKnow: number;         // BKT P(know) for this topic
  studentLDS5: number;        // 5-question rolling avg LDS
  studentMCS5: number;        // 5-question rolling avg MCS

  // Interaction data
  timeSpentMs: number;        // Total time on problem
  timeBeforeFirstHint: number; // Time before first hint (0 if no hints)
  hintsAccessed: number[];    // [0], [0,1], [0,1,2], etc.
  hintTimestamps: {           // Timestamps of hint accesses
    L1?: number;
    L2?: number;
    L3?: number;
    L4?: number;
  };
  timePerHint: {              // Time spent at each hint level
    L1?: number;
    L2?: number;
    L3?: number;
    L4?: number;
  };
  maxHintLevel: number;       // 0-4
  answerAttempts: number;     // Number of attempts
  answers: string[];          // All attempted answers
  finalAnswer: string;        // Last submitted answer
  isCorrect: boolean;         // Whether final answer is correct

  // Computed features
  weightedOutcome: number;    // 0.0-1.0 (hint-weighted)
  lds: number;                // Language Dependency Score
  mcs: number;                // Math Confidence Score
  hintDepthNormalized: number;
  scaffoldTimeRatio: number;
  escalationSpeed: number;

  // Adaptive decisions
  studentEloAfter: number;
  adaptiveDecision: string;   // increase/maintain/decrease/skip/rapid_decrease
  nextLevel: string;          // Selected next level
  diagnosticQuadrant: string; // thriving/language_gap/math_struggle/dual_challenge
}
```

### 9.3 Session-Level Aggregates

```typescript
interface SessionRecord {
  sessionId: string;
  studentId: string;
  startTime: ISO8601;
  endTime: ISO8601;
  durationMs: number;

  // Performance
  questionsAttempted: number;
  questionsCorrect: number;
  avgWeightedOutcome: number;
  avgHintLevel: number;

  // Progression
  startElo: number;
  endElo: number;
  eloChange: number;
  startLevel: string;
  endLevel: string;
  levelsTraversed: string[];

  // Feature averages
  sessionLDS: number;
  sessionMCS: number;

  // Diagnostic
  dominantQuadrant: string;    // Most frequent diagnostic quadrant
  topicPerformance: Record<string, {
    attempts: number;
    avgOutcome: number;
    pKnow: number;
  }>;

  // Scaffold usage patterns
  hintDistribution: {
    noHint: number;           // Count of problems solved without hints
    L1Only: number;
    L2Used: number;
    L3Used: number;
    L4Used: number;
  };
}
```
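The `hintDistribution` aggregate follows directly from the per-interaction `maxHintLevel` values (0 = no hint, 1–4 = deepest scaffold level opened). A minimal sketch, with an illustrative function name:

```typescript
// Derive the session-level hint distribution from maxHintLevel values.
interface HintDistribution {
  noHint: number; L1Only: number; L2Used: number; L3Used: number; L4Used: number;
}

function aggregateHints(maxHintLevels: number[]): HintDistribution {
  const d: HintDistribution = { noHint: 0, L1Only: 0, L2Used: 0, L3Used: 0, L4Used: 0 };
  for (const h of maxHintLevels) {
    if (h === 0) d.noHint++;       // solved with no scaffold
    else if (h === 1) d.L1Only++;  // simplified English only
    else if (h === 2) d.L2Used++;  // reached bilingual keywords
    else if (h === 3) d.L3Used++;  // reached full Spanish translation
    else d.L4Used++;               // reached step-by-step solution
  }
  return d;
}
```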

### 9.4 Longitudinal Student Profile

```typescript
interface StudentProfile {
  studentId: string;
  createdAt: ISO8601;
  lastActive: ISO8601;

  // Current state
  currentElo: number;
  currentLevel: string;
  totalInteractions: number;
  totalSessions: number;

  // Topic mastery (BKT)
  topicMastery: Record<string, number>;  // P(know) per topic

  // Feature trends
  ldsHistory: number[];        // Session-level LDS over time
  mcsHistory: number[];        // Session-level MCS over time
  eloHistory: number[];        // Elo after each session

  // Learning trajectory
  avgLDSFirst5Sessions: number;
  avgLDSLast5Sessions: number;
  ldsImprovement: number;      // Percentage decrease in LDS
  avgMCSFirst5Sessions: number;
  avgMCSLast5Sessions: number;
  mcsImprovement: number;      // Percentage increase in MCS

  // Engagement
  avgSessionLength: number;    // Minutes
  sessionsPerWeek: number;
  completionRate: number;      // % of sessions completed (20/20)

  // Thompson priors (for state persistence)
  thompsonPriors: Record<string, { alpha: number; beta: number }>;
}
```

### 9.5 Privacy and Ethics

| Concern | Mitigation |
|---------|------------|
| Student is a minor (COPPA) | No PII beyond email/name; parental consent required |
| Performance data sensitivity | Elo/LDS/MCS stored under UID, not linked to real identity in analytics |
| Research use | Data anonymized (UID → random ID) before export; IRB approval required |
| Data retention | Interaction-level data retained for 2 years; aggregates indefinitely |
| Right to deletion | Firebase Auth deletion triggers cascade delete of all user data |

## 10. Evaluation Plan

### 10.1 Phase 1: Technical Validation (Pre-Launch)

**Objective:** Verify that system components work correctly and produce expected behavior.

| Test | Method | Success Criterion |
|------|--------|-------------------|
| Adaptive engine convergence | Simulate 100 students × 50 interactions with known true levels | 90% of students within ±1 sub-level of true level by interaction 30 |
| Elo stability | 1000 simulated interactions per question | Question Elo ratings converge within ±30 of assigned level |
| BKT accuracy | Simulate known mastery states, measure P(know) accuracy | P(know) > 0.7 for mastered topics, < 0.3 for unmastered, within 10 interactions |
| Thompson exploration | Cold-start simulation (all levels unexplored) | All 15 levels sampled at least once within the first 30 interactions |
| LDS/MCS discrimination | Inject known behavioral patterns | LDS > 0.6 for simulated language-dependent profiles; MCS > 0.7 for math-competent profiles |
| Scaffold quality | 100 problems through the Gemini scaffold pipeline | ≥ 95% mathematical accuracy, ≥ 90% readability compliance |
| End-to-end latency | 50 complete interaction cycles | Adaptive decision < 50 ms; scaffold generation < 1.5 s |

### 10.2 Phase 2: Pilot Study (Launch + 4 Weeks)

**Objective:** Validate effectiveness with real students in a controlled setting.

**Design:** Within-subjects pre/post with control-group comparison

| Group | N | Treatment | Duration |
|-------|---|-----------|----------|
| Treatment | 30 students | MathLingua (adaptive + scaffolds) | 4 weeks, 3× per week |
| Control | 30 students | Same math problems, English-only, no scaffolding | 4 weeks, 3× per week |

**Instruments:**

1. **Pre-test:** mathematics assessment in Spanish (establishes the math baseline) + English reading assessment (establishes the language baseline)
2. **Post-test:** mathematics assessment in English (measures improvement) + the same assessments as the pre-test
3. **In-system metrics:** LDS trajectory, MCS trajectory, Elo progression, hint-usage patterns

**Primary Outcome Measures:**

| Measure | Hypothesis | Test |
|---------|------------|------|
| Math score improvement (EN) | Treatment > Control | Independent t-test, d ≥ 0.5 |
| LDS reduction | Treatment shows ≥ 20% decrease | Paired t-test, pre vs. post |
| MCS stability | Treatment MCS does not decrease | One-sided paired t-test |
| Session completion rate | Treatment ≥ 80% | Descriptive |
| Scaffold fade-out | Avg hint level decreases over 4 weeks | Linear regression, slope < 0 |

**Secondary Outcome Measures:**

| Measure | Instrument |
|---------|-----------|
| Student engagement | Time on task, voluntary extra sessions |
| Mathematical self-efficacy | Adapted MSES (Mathematics Self-Efficacy Scale) |
| Language anxiety | Adapted FLCAS (Foreign Language Classroom Anxiety Scale) |
| Qualitative experience | Semi-structured interviews (N = 10, treatment group) |

### 10.3 Phase 3: Scale and Iteration (Launch + 3 Months)

**Objective:** Optimize algorithm parameters from pooled data; validate SLM quality.

| Activity | Data Required | Method |
|----------|---------------|--------|
| IRT calibration | ≥ 200 students × ≥ 50 questions | Fit a 2PL IRT model; replace initial Elo question ratings with IRT parameters |
| Feature weight optimization | ≥ 500 interaction records with outcomes | Logistic regression / gradient-boosted trees to optimize LDS/MCS weights |
| BKT parameter fitting | ≥ 100 students × ≥ 20 interactions per topic | Per-topic parameter estimation via the EM algorithm |
| SLM quality assessment | ≥ 100 scaffold comparisons | Blind human preference evaluation (SLM vs. Gemini) |
| Algorithm A/B testing | ≥ 200 students split across variants | Compare engagement and outcome metrics across algorithm variants |

### 10.4 Phase 4: Long-Term Efficacy (Launch + 1 Year)

**Objective:** Measure impact on standardized test scores and language proficiency.

| Measure | Instrument | Expected Outcome |
|---------|------------|------------------|
| State math assessment | SBAC / STAAR (English) | Treatment students show larger gains |
| English proficiency | WIDA ACCESS | Treatment students show faster math-domain language growth |
| Long-term retention | 6-month follow-up assessment | Treatment gains persist |

## 11. Implementation Timeline

### Phase 1: MVP (Months 1–3)

| Month | Deliverables |
|-------|--------------|
| 1 | Frontend scaffolding UI (L1–L4 display, hint tracking); Firebase setup (auth, Firestore schema); Gemini API integration for scaffold generation |
| 2 | Adaptive engine implementation in TypeScript (Elo + BKT + Thompson); question database upload (130+ questions); LDS/MCS computation pipeline |
| 3 | End-to-end integration testing; simulated student testing (100 profiles); bug fixes and performance optimization; deploy to Firebase Hosting |

### Phase 2: Pilot (Months 4–5)

| Month | Deliverables |
|-------|--------------|
| 4 | Pilot launch with 30 treatment + 30 control students; daily monitoring of system metrics; weekly check-ins with teachers; collect Gemini scaffold data for SLM training |
| 5 | Mid-pilot analysis and algorithm tuning; begin SLM training-data curation; pilot completion and post-testing |

### Phase 3: SLM Transition (Months 6–8)

| Month | Deliverables |
|-------|--------------|
| 6 | Curate 1,500 training examples; QLoRA fine-tune Qwen2.5-3B-Instruct v1; deploy HF Inference Endpoint (shadow mode) |
| 7 | Shadow testing: SLM served, Gemini logged for comparison; quality monitoring and iteration |
| 8 | SLM quality validated → full cutover; or iterate (more data, retrain, repeat) |

### Phase 4: Scale (Months 9–12)

| Month | Deliverables |
|-------|--------------|
| 9 | Open to additional schools (target: 200+ students); IRT calibration from pooled data; question database expansion to 300+ |
| 10 | A/B testing of algorithm variants; DKT evaluation (if N > 500) |
| 11 | Feature weight optimization from real data; teacher dashboard (class-level analytics) |
| 12 | Long-term efficacy analysis; research paper preparation; open-source release of the adaptive engine |

## 12. References

Abedi, J., & Lord, C. (2001). The language factor in mathematics tests. Applied Measurement in Education, 14(3), 219–234.

Chapelle, O., & Li, L. (2011). An empirical evaluation of Thompson Sampling. Advances in Neural Information Processing Systems, 24.

Cobbe, K., et al. (2021). Training verifiers to solve math word problems. arXiv:2110.14168.

Corbett, A. T., & Anderson, J. R. (1994). Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4(4), 253–278.

Dettmers, T., et al. (2023). QLoRA: Efficient finetuning of quantized language models. NeurIPS 2023.

Hendrycks, D., et al. (2021). Measuring mathematical problem solving with the MATH dataset. NeurIPS 2021.

Huang, M., et al. (2024). Easy2Hard-Bench: Standardized difficulty labels for profiling LLM performance and generalization. arXiv:2409.18433.

Khisty, L. L., & Chval, K. B. (2002). Pedagogic discourse and equity in mathematics. Mathematics Education Research Journal, 14(3), 154–168.

Martiniello, M. (2008). Language and the performance of English-language learners in math word problems. Harvard Educational Review, 78(2), 333–368.

Microsoft. (2025). Phi-4 technical report. arXiv:2412.08905.

Moschkovich, J. (2002). A situated and sociocultural perspective on bilingual mathematics learners. Mathematical Thinking and Learning, 4(2–3), 189–212.

National Center for Education Statistics. (2023). English learners in public schools. NCES Report.

Pelánek, R. (2016). Applications of the Elo rating system in adaptive educational systems. Computers & Education, 98, 169–179.

Piech, C., et al. (2015). Deep knowledge tracing. Advances in Neural Information Processing Systems, 28.

Qwen Team. (2024). Qwen2.5 technical report. arXiv:2412.15115.

Scarlatos, A., et al. (2024). PSI-KT: Parameterized student interaction knowledge tracing. EDM 2024.

Shi, F., et al. (2022). Language models are multilingual chain-of-thought reasoners. arXiv:2210.03057.

Vygotsky, L. S. (1978). Mind in Society: The Development of Higher Psychological Processes. Harvard University Press.


## 13. Appendices

### Appendix A: Complete LDS Computation Example

**Scenario:** Student attempts a Level 2.3 problem (rate/ratio), uses the L1 hint, then the L2 hint, and solves correctly on the second attempt.

**Raw signals:**

- $h_i = 2$ (max hint = L2)
- $t_{pre} = 45s$ (45 seconds before the first hint)
- $t_{total} = 120s$
- $t_{L1} = 15s$, $t_{L2} = 25s$, $t_{L3} = 0s$, $t_{L4} = 0s$
- $a_i = 2$ (two answer attempts)
- $c_i = 1$ (correct on the second attempt)
- $\text{median\_time}(2.3) = 90s$

**Sub-feature computation:**

1. $D_{hint} = 2 / 4 = 0.50$
2. $R_{scaffold} = (15 + 25 + 0 + 0) / 120 = 40 / 120 = 0.333$
3. $E_{speed} = 1 - (45 / 90) = 1 - 0.5 = 0.50$
4. $F_{reveal} = 0$ (L4 not used)

**LDS:**

$$\text{LDS} = 0.35 \times 0.50 + 0.25 \times 0.333 + 0.20 \times 0.50 + 0.20 \times 0 = 0.175 + 0.083 + 0.10 + 0 = 0.358$$

**Interpretation:** Moderate dependency (at the 0.35–0.55 range boundary). The student needed bilingual support but was not critically dependent. The system would maintain the current level.

### Appendix B: Complete MCS Computation (Same Scenario)

**Sub-feature computation:**

1. $C_{correct} = 1$ (correct)
2. $S_{speed} = \text{clamp}(90 / 120, 0, 1) = 0.75$
3. $A_{efficiency} = 1 / 2 = 0.50$
4. $(1 - \text{LDS}) = 1 - 0.358 = 0.642$

**MCS:**

$$\text{MCS} = 0.30 \times 1.0 + 0.25 \times 0.75 + 0.20 \times 0.50 + 0.25 \times 0.642 = 0.30 + 0.1875 + 0.10 + 0.1605 = 0.748$$

**Interpretation:** Moderate-to-strong math confidence. Despite needing L2 scaffolding, the student demonstrated solid mathematical ability. Diagnostic: "Language Gap" quadrant (high MCS ≥ 0.6, borderline LDS ≈ 0.4).
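The LDS and MCS computations from Appendices A and B can be reproduced in code using the published sub-feature weights (0.35/0.25/0.20/0.20 and 0.30/0.25/0.20/0.25). Clamping $E_{speed}$ to [0, 1] is an added robustness assumption; field names are illustrative.

```typescript
// Sketch of the LDS/MCS feature computation (weights from Appendices A and B).
interface Signals {
  maxHint: number;      // h_i in 0..4 (deepest scaffold level opened)
  tPreHintS: number;    // seconds before the first hint
  tTotalS: number;      // total seconds on the problem
  tHintsS: number;      // total seconds spent viewing hint levels
  attempts: number;     // a_i >= 1
  correct: boolean;     // c_i
  medianTimeS: number;  // median solve time for this difficulty level
  usedL4: boolean;      // full solution revealed?
}

const clamp = (x: number, lo: number, hi: number) => Math.min(hi, Math.max(lo, x));

function computeLDS(s: Signals): number {
  const dHint = s.maxHint / 4;                                   // D_hint
  const rScaffold = s.tHintsS / s.tTotalS;                       // R_scaffold
  const eSpeed = clamp(1 - s.tPreHintS / s.medianTimeS, 0, 1);   // E_speed (clamp assumed)
  const fReveal = s.usedL4 ? 1 : 0;                              // F_reveal
  return 0.35 * dHint + 0.25 * rScaffold + 0.20 * eSpeed + 0.20 * fReveal;
}

function computeMCS(s: Signals): number {
  const cCorrect = s.correct ? 1 : 0;                            // C_correct
  const sSpeed = clamp(s.medianTimeS / s.tTotalS, 0, 1);         // S_speed
  const aEff = 1 / s.attempts;                                   // A_efficiency
  return 0.30 * cCorrect + 0.25 * sSpeed + 0.20 * aEff + 0.25 * (1 - computeLDS(s));
}
```

Feeding in the Appendix A signals (max hint L2, 45 s pre-hint, 120 s total, 40 s on hints, 2 attempts, correct, 90 s median) reproduces LDS ≈ 0.358 and MCS ≈ 0.748.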

### Appendix C: Elo Update Example (Same Scenario)

**Before:** $R_s = 1050$, $R_q = 1150$ (Level 2.3 question)

**Weighted outcome:** correct with L2 → $O_s = 0.50$

**Expected outcome:** $E_s = 1 / (1 + 10^{(1150 - 1050) / 400}) = 1 / (1 + 10^{0.25}) = 1 / (1 + 1.778) = 0.360$

**Student update** ($K_s = 32$): $R_s' = 1050 + 32 \times (0.50 - 0.360) = 1050 + 32 \times 0.14 = 1050 + 4.48 \approx 1054.5$

**Question update** ($K_q = 6$): $R_q' = 1150 + 6 \times (0.360 - 0.50) = 1150 + 6 \times (-0.14) = 1150 - 0.84 \approx 1149.2$

**Interpretation:** The student's Elo increased slightly (they outperformed expectation even with an L2 hint); the question's Elo barely changed (stable calibration).
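This update can be sketched directly from the formulas above, with the K-factors from this appendix ($K_s = 32$, $K_q = 6$); function names are illustrative.

```typescript
// Elo expectation and asymmetric-K update, as in Appendix C.
function expectedOutcome(rStudent: number, rQuestion: number): number {
  return 1 / (1 + Math.pow(10, (rQuestion - rStudent) / 400));
}

function eloUpdate(
  rStudent: number,
  rQuestion: number,
  weightedOutcome: number, // O_s in [0, 1], hint-weighted
  kStudent = 32,
  kQuestion = 6
): { student: number; question: number } {
  const e = expectedOutcome(rStudent, rQuestion);
  return {
    student: rStudent + kStudent * (weightedOutcome - e),   // student gains if O_s > E_s
    question: rQuestion + kQuestion * (e - weightedOutcome), // question moves the opposite way
  };
}
```

With the Appendix C inputs (1050, 1150, 0.50), this yields approximately 1054.5 and 1149.2.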

### Appendix D: BKT Update Example

**Before:** $P(L_n) = 0.45$ for the topic "rates"

**Outcome:** $O_s = 0.50$ (treated as correct, since ≥ 0.5)

**Slip adjustment:** $P(S)_{adj} = 0.10 \times (1 + 0.5 \times 0.50) = 0.10 \times 1.25 = 0.125$

**Posterior given correct:**

$$P(L_n \mid correct) = \frac{0.45 \times (1 - 0.125)}{0.45 \times (1 - 0.125) + 0.55 \times 0.25} = \frac{0.394}{0.394 + 0.138} = \frac{0.394}{0.532} \approx 0.741$$

**After the learning transition:**

$$P(L_n) = 0.741 + (1 - 0.741) \times 0.15 = 0.741 + 0.039 = 0.780$$

**Interpretation:** The topic-mastery estimate jumped from 0.45 to 0.78, reflecting that a correct response (even with an L2 scaffold, captured by the adjusted slip) substantially increases our belief that the student knows "rates."
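The same update in code, using the parameters that appear in this appendix ($P(T) = 0.15$, $P(G) = 0.25$, base $P(S) = 0.10$, and the scaffold-adjusted slip $P(S)_{adj} = P(S) \times (1 + 0.5 \times O_s)$); the function name is illustrative and covers only the correct-response branch shown here.

```typescript
// BKT posterior + learning transition for a correct response, as in Appendix D.
function bktUpdateCorrect(
  pKnow: number,           // prior P(L_n)
  weightedOutcome: number, // O_s, used to soften slip when scaffolds were shown
  pLearn = 0.15,
  pGuess = 0.25,
  pSlipBase = 0.10
): number {
  const pSlip = pSlipBase * (1 + 0.5 * weightedOutcome); // adjusted slip
  const posterior =
    (pKnow * (1 - pSlip)) /
    (pKnow * (1 - pSlip) + (1 - pKnow) * pGuess);        // Bayes, given "correct"
  return posterior + (1 - posterior) * pLearn;           // learning transition
}
```

`bktUpdateCorrect(0.45, 0.5)` reproduces the worked value of approximately 0.78.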

### Appendix E: Thompson Sampling Selection Example

**Student Elo:** 1054.5 (from the Appendix C update). **Current level:** 2.1

**ZPD window:** [1.4, 2.4] (current − 2 to current + 3, in sub-level steps)

**Current priors (after 12 interactions):**

| Level | α | β | Elo |
|-------|-----|-----|------|
| 1.4 | 8.2 | 2.1 | 975 |
| 1.5 | 7.5 | 3.0 | 1025 |
| 2.1 | 5.8 | 4.2 | 1075 |
| 2.2 | 3.1 | 3.5 | 1125 |
| 2.3 | 1.8 | 2.3 | 1175 |
| 2.4 | 1.2 | 1.5 | 1225 |

**Sampled (one draw):**

| Level | $\hat{\theta}_l$ | Proximity Bonus | Score |
|-------|------------------|-----------------|-------|
| 1.4 | 0.82 | $\exp(-(975-1054.5)^2/20000) = 0.73$ | 0.598 |
| 1.5 | 0.68 | $\exp(-(1025-1054.5)^2/20000) = 0.96$ | 0.651 |
| 2.1 | 0.61 | $\exp(-(1075-1054.5)^2/20000) = 0.98$ | 0.597 |
| 2.2 | 0.55 | $\exp(-(1125-1054.5)^2/20000) = 0.78$ | 0.429 |
| 2.3 | 0.42 | $\exp(-(1175-1054.5)^2/20000) = 0.48$ | 0.203 |
| 2.4 | 0.38 | $\exp(-(1225-1054.5)^2/20000) = 0.23$ | 0.089 |

**Selected:** Level 1.5 (highest score = 0.651)

However, the decision orchestrator said MAINTAIN (the weighted outcome of 0.50 falls in the maintain range), while Thompson selected 1.5, below the current level 2.1. Since the decision is MAINTAIN (not DECREASE), the override rule kicks in: the system serves a question from level 2.1, the current level, and Thompson's selection is overridden to respect the progression decision.

This demonstrates how the decision orchestrator provides guardrails against Thompson's stochastic exploration when the student's recent performance doesn't support a level change.
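The scoring and argmax step of this selection can be reproduced deterministically from the sampled $\hat{\theta}_l$ values (the Beta draw itself is stochastic, so it is taken as input here); function names are illustrative.

```typescript
// Score = sampled theta, damped by a Gaussian proximity bonus around the
// student's Elo, with the exp(-(Δ)^2 / 20000) kernel used in Appendix E.
function proximityBonus(levelElo: number, studentElo: number): number {
  return Math.exp(-Math.pow(levelElo - studentElo, 2) / 20000);
}

function pickLevel(
  draws: { level: string; theta: number; elo: number }[],
  studentElo: number
): string {
  let best = draws[0].level;
  let bestScore = -Infinity;
  for (const d of draws) {
    const score = d.theta * proximityBonus(d.elo, studentElo);
    if (score > bestScore) { bestScore = score; best = d.level; }
  }
  return best; // argmax over theta * proximity
}
```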

### Appendix F: Glossary

| Term | Definition |
|------|-----------|
| BKT | Bayesian Knowledge Tracing — HMM-based model of student knowledge per topic |
| COPPA | Children's Online Privacy Protection Act — US law for minors' data |
| DKT | Deep Knowledge Tracing — RNN-based alternative to BKT |
| ELL | English Language Learner |
| Elo | Rating system where opponents update each other's ratings after each match |
| FERPA | Family Educational Rights and Privacy Act — US law for student records |
| FK | Flesch-Kincaid readability metric |
| HMM | Hidden Markov Model |
| IRT | Item Response Theory — psychometric model of test items and examinees |
| L1 | First language (Spanish, in MathLingua's context) |
| L2 | Second language (English, in MathLingua's context) |
| LDS | Language Dependency Score — engineered feature measuring scaffold reliance |
| MCS | Math Confidence Score — engineered feature measuring mathematical ability |
| QLoRA | Quantized Low-Rank Adaptation — memory-efficient fine-tuning method |
| SLM | Small Language Model (< 4B parameters) |
| Thompson Sampling | Bayesian bandit algorithm for exploration/exploitation |
| WIDA | Standards framework for English language proficiency in U.S. schools |
| ZPD | Zone of Proximal Development — Vygotsky's learning theory concept |

End of Technical Specification — MathLingua v1.0