# MathLingua: An Adaptive Bilingual Scaffolding System for Mathematics Word Problem Comprehension

## Technical Specification Document

**Version**: 1.0  
**Date**: April 2026  
**Authors**: [MathLingua Research Team]  

---

## Abstract

We present MathLingua, an adaptive tutoring system designed for Spanish-speaking students in grades 6–8 who are transitioning to English-medium mathematics education. The system addresses the dual challenge these students face: mastering mathematical concepts while simultaneously acquiring the academic English required to comprehend word problems. MathLingua introduces a **four-level progressive scaffolding framework** (L1: Simplified English → L2: Bilingual Keywords → L3: Full Spanish Translation → L4: Step-by-Step Solution) and a novel **hybrid adaptive algorithm** combining Elo rating, Bayesian Knowledge Tracing (BKT), and Thompson Sampling to personalize difficulty progression. We define two engineered features—**Language Dependency Score (LDS)** and **Math Confidence Score (MCS)**—that disentangle linguistic struggle from mathematical difficulty, enabling targeted intervention. The system architecture supports a planned transition from a cloud-based large language model (Gemini 2.0 Flash) to a fine-tuned small language model (Qwen2.5-3B-Instruct with QLoRA) for scalable, cost-effective scaffold generation. This specification provides the complete technical design: adaptive algorithms with formal definitions, feature engineering formulas, a 15-level difficulty taxonomy validated by readability metrics, a prototype question database of 130 word problems, system architecture, and an evaluation plan.

**Keywords**: Adaptive Learning, Bilingual Education, Mathematics Word Problems, Scaffolding, Bayesian Knowledge Tracing, Elo Rating, Thompson Sampling, Small Language Models, QLoRA

---

## Table of Contents

1. [Introduction](#1-introduction)
2. [Related Work](#2-related-work)
3. [Difficulty Taxonomy and Linguistic Progression](#3-difficulty-taxonomy)
4. [Question Database Design](#4-question-database)
5. [Feature Engineering](#5-feature-engineering)
6. [Adaptive Algorithm](#6-adaptive-algorithm)
7. [System Architecture](#7-system-architecture)
8. [SLM Fine-Tuning Strategy](#8-slm-fine-tuning)
9. [Data Collection Schema](#9-data-collection)
10. [Evaluation Plan](#10-evaluation-plan)
11. [Implementation Timeline](#11-timeline)
12. [References](#12-references)
13. [Appendices](#13-appendices)

---

## 1. Introduction

### 1.1 Problem Statement

An estimated 5.1 million English Language Learners (ELLs) are enrolled in U.S. public schools, with approximately 75% being Spanish-speaking (National Center for Education Statistics, 2023). These students face a compounded challenge in mathematics: they must simultaneously decode unfamiliar English vocabulary and sentence structures while performing mathematical reasoning. Research consistently shows that ELLs underperform native English speakers on mathematics assessments—not because of deficient mathematical ability, but because of the linguistic demands embedded in word problems (Abedi & Lord, 2001; Martiniello, 2008).

Current educational technology solutions typically address either language learning or mathematics separately. General-purpose math tutoring systems (e.g., Khan Academy, IXL) present problems exclusively in English with no linguistic scaffolding. Language learning platforms (e.g., Duolingo) lack mathematical content. Bilingual resources, where they exist, are static—offering problems in English or Spanish but not providing a scaffolded bridge between the two.

### 1.2 Proposed Solution

MathLingua addresses this gap with three key innovations:

1. **Progressive Bilingual Scaffolding**: A four-level hint system that provides decreasing linguistic support, from simplified English (L1) through bilingual annotations (L2) and full Spanish translation (L3) to step-by-step solution reveal (L4). The scaffold levels are designed to gradually build mathematical English proficiency while ensuring comprehension.

2. **Disentangled Difficulty Assessment**: Two novel engineered features—Language Dependency Score (LDS) and Math Confidence Score (MCS)—that separately quantify a student's reliance on linguistic scaffolding versus their underlying mathematical competence. This separation enables the system to distinguish between "doesn't understand the English" and "doesn't understand the math."

3. **Hybrid Adaptive Algorithm**: A combination of Elo rating (for overall ability tracking), Bayesian Knowledge Tracing (for topic-level mastery estimation), and Thompson Sampling (for intelligent question selection with exploration), specifically adapted for hint-weighted outcomes rather than binary correctness.

### 1.3 Target Population

| Characteristic | Description |
|---|---|
| Grade Level | 6–8 (ages 11–14) |
| L1 Language | Spanish |
| L2 Language | English (medium of instruction) |
| English Proficiency | WIDA Levels 2–4 (Emerging–Expanding) |
| Math Level | On grade level in Spanish-medium instruction |
| Setting | U.S. middle schools with bilingual/ESL programs |

### 1.4 Design Goals

| Goal | Metric | Target |
|---|---|---|
| Reduce language barrier impact | LDS decrease over 4 weeks | ≥ 20% reduction |
| Maintain/improve math confidence | MCS stability or increase | No MCS decrease > 5% |
| Appropriate difficulty targeting | Student in ZPD | ≥ 70% of questions in ZPD |
| Student engagement | Session completion rate | ≥ 80% complete 20-question sessions |
| Scaffold fade-out | Avg hint level over time | Decrease from ~L2.5 to ~L1.5 |

---

## 2. Related Work

### 2.1 Adaptive Learning Algorithms

**Item Response Theory (IRT)** models the probability of a correct response as a function of student ability (θ) and item parameters (difficulty, discrimination, guessing). The 3-parameter logistic (3PL) model is standard:

$$P(X_{ij} = 1 | \theta_j) = c_i + \frac{1 - c_i}{1 + e^{-a_i(\theta_j - b_i)}}$$

where $a_i$ is discrimination, $b_i$ is difficulty, $c_i$ is guessing, and $\theta_j$ is student ability. IRT requires large calibration samples (typically N > 200 per item) and assumes local independence, making it unsuitable for an initial deployment with small N.
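The 3PL response probability above can be written as a one-line function. This is a minimal sketch of the standard formula, not part of the MathLingua implementation (which, as noted, skips IRT in Phase 1):

```python
import math

def p_correct_3pl(theta: float, a: float, b: float, c: float) -> float:
    """3PL probability of a correct response: c + (1 - c) / (1 + exp(-a(theta - b)))."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))
```

When ability equals item difficulty (θ = b), the probability is the midpoint between the guessing floor c and 1, i.e. c + (1 − c)/2.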

**Elo Rating**, originally designed for chess, provides a simpler pairwise comparison model that updates after each interaction. It requires no prior calibration and converges with as few as 10–15 interactions (Pelánek, 2016). We adopt Elo as the primary ability tracking mechanism.

**Bayesian Knowledge Tracing (BKT)** models knowledge as a hidden Markov model with four parameters: P(L₀) (prior knowledge), P(T) (learn rate), P(S) (slip), P(G) (guess). BKT provides topic-level mastery estimates essential for identifying which mathematical concepts a student has learned versus which remain unmastered (Corbett & Anderson, 1994).
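One BKT step consists of a Bayes update on the observed response followed by the learning transition. A minimal sketch, with illustrative (not calibrated) parameter defaults:

```python
def bkt_update(p_know: float, correct: bool,
               p_slip: float = 0.1, p_guess: float = 0.2,
               p_learn: float = 0.15) -> float:
    """One BKT step: Bayes update on the observation, then apply P(T).

    Parameter defaults are illustrative placeholders, not fitted values.
    """
    if correct:
        num = p_know * (1 - p_slip)                  # knew it and didn't slip
        den = num + (1 - p_know) * p_guess           # ... or guessed
    else:
        num = p_know * p_slip                        # knew it but slipped
        den = num + (1 - p_know) * (1 - p_guess)     # ... or truly didn't know
    posterior = num / den
    # Learning transition: chance of acquiring the skill this step.
    return posterior + (1 - posterior) * p_learn
```

A correct response raises P(know) and an incorrect one lowers it, with the slip/guess parameters controlling how strongly each observation is discounted.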

**Deep Knowledge Tracing (DKT)** uses recurrent neural networks to model student knowledge state. While DKT can capture complex temporal dependencies, it requires substantial training data (thousands of students) and runs as a server-side model, making it impractical for Phase 1 deployment (Piech et al., 2015).

**Thompson Sampling** is a Bayesian bandit algorithm that balances exploration and exploitation by sampling from posterior distributions of expected reward for each action (Chapelle & Li, 2011). Applied to question selection, it naturally handles the cold-start problem by exploring uncertain levels while exploiting known ZPD levels.
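Applied to level selection, each difficulty level can carry a Beta posterior over the student's success rate; the selector draws one sample per level and picks the draw closest to a target success rate. A rough sketch under assumed conventions (Beta(1,1) prior, 0.7 ZPD target — neither is specified here in the source):

```python
import random

def thompson_select(levels):
    """Pick the level whose sampled success rate is closest to a 0.7 ZPD target.

    `levels` maps level id -> (successes, failures) observed so far.
    The 0.7 target and uniform Beta(1,1) prior are illustrative assumptions.
    """
    target = 0.7
    best, best_gap = None, float("inf")
    for level, (s, f) in levels.items():
        sample = random.betavariate(s + 1, f + 1)  # draw from Beta posterior
        gap = abs(sample - target)
        if gap < best_gap:
            best, best_gap = level, gap
    return best
```

Levels with few observations have wide posteriors and are therefore sampled (explored) more often, which is how the cold-start problem is handled naturally.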

**PSI-KT** (Scarlatos et al., 2024) combines IRT with attention-based knowledge tracing, achieving state-of-the-art performance. However, it requires server-side inference and large training corpora, making it a Phase 3 target.

### 2.2 Mathematical Word Problem Datasets

| Dataset | Size | Features | Relevance |
|---|---|---|---|
| GSM8K (Cobbe et al., 2021) | 8,792 | Grade school math, chain-of-thought solutions | Solution step generation |
| MGSM (Shi et al., 2022) | 250×10 langs | Parallel EN/ES math problems | Bilingual scaffold training |
| Easy2Hard-Bench (Huang et al., 2024) | 1,319 (GSM8K subset) | IRT-calibrated difficulty (0–1) | Difficulty calibration |
| bryanchrist/STEM | 1,552 | Grade 5, topic-tagged, curriculum-aligned | Question structure templates |
| MATH (Hendrycks et al., 2021) | 12,500 | Competition-level, 5 difficulty levels | Advanced levels (grade 8+) |

### 2.3 Scaffolding in Mathematics Education

Vygotsky's Zone of Proximal Development (ZPD) provides the theoretical foundation for scaffolding—the idea that instruction is most effective when targeted at the gap between what a learner can do independently and what they can achieve with guidance (Vygotsky, 1978). In bilingual mathematics education, scaffolding takes on a dual role: supporting both conceptual understanding and linguistic comprehension.

Moschkovich (2002) demonstrated that bilingual mathematics instruction should leverage students' first language as a resource rather than treating it as a deficit. Khisty and Chval (2002) showed that strategic code-switching in mathematics instruction improves both comprehension and mathematical discourse development. MathLingua operationalizes these findings through its four-level scaffold design, which systematically moves from English-only support to bilingual bridging to full L1 access.

### 2.4 Small Language Models for Education

Recent advances in small language models (SLMs) under 4B parameters have demonstrated competitive performance on mathematical reasoning tasks. Qwen2.5-3B-Instruct achieves 79.2% on GSM8K and supports 29 languages including Spanish (Qwen Team, 2024). Phi-4-mini (3.8B) achieves 83.6% on MATH but has weaker multilingual support (Microsoft, 2025). These models can be fine-tuned with QLoRA (Dettmers et al., 2023) on a single consumer GPU, enabling cost-effective deployment for educational applications.

---

## 3. Difficulty Taxonomy and Linguistic Progression

### 3.1 Design Principle

MathLingua's difficulty taxonomy is **linguistically oriented, not mathematically oriented**. All levels may contain the same mathematical operations (arithmetic, fractions, percentages, basic algebra). What increases across levels is the **English reading complexity** of the problem text—vocabulary sophistication, sentence length, embedding depth, contextual abstraction, and multi-step reasoning chains.

This design reflects the target population: students who are mathematically capable in Spanish but struggle with English-language word problems. The adaptive engine's task is to find the maximum English complexity at which a student can still extract the mathematical content.

### 3.2 Three-Tier, Fifteen-Level Taxonomy

| Level | Sub-Level | Elo Range | FK Grade | Target Characteristics |
|---|---|---|---|---|
| **1 (Easy)** | 1.1 | 800–840 | 1.0–2.0 | Simple sentences, basic vocabulary, single-step operations |
| | 1.2 | 850–890 | 2.0–3.0 | Slightly longer sentences, common math vocabulary |
| | 1.3 | 900–940 | 3.0–4.0 | Two-sentence problems, "how many/much" questions |
| | 1.4 | 950–990 | 4.0–5.0 | Comparative language ("more than", "less than") |
| | 1.5 | 1000–1040 | 5.0–6.0 | Two-step problems, time/money contexts |
| **2 (Medium)** | 2.1 | 1050–1090 | 5.5–6.5 | Multi-sentence, fraction/decimal vocabulary |
| | 2.2 | 1100–1140 | 6.5–7.5 | Percentage language, "what fraction of" |
| | 2.3 | 1150–1190 | 7.0–8.0 | Rate/ratio language, unit conversion context |
| | 2.4 | 1200–1240 | 8.0–9.0 | Proportional reasoning, multi-clause sentences |
| | 2.5 | 1250–1290 | 9.0–10.0 | Abstract contexts, embedded clauses |
| **3 (Hard)** | 3.1 | 1300–1340 | 9.5–10.5 | Academic register, compound-complex sentences |
| | 3.2 | 1350–1390 | 10.0–11.0 | Technical vocabulary, multi-step with distractors |
| | 3.3 | 1400–1440 | 11.0–12.0 | Inference required, implicit quantities |
| | 3.4 | 1450–1490 | 12.0–13.0 | Dense academic prose, algebraic modeling |
| | 3.5 | 1500–1540 | 13.0–14.0 | Research-paper style, nested conditionals |

### 3.3 Readability Validation

Each question in the prototype database was validated using the following automated readability metrics, computed via the `textstat` Python library:

| Metric | Formula Summary | Purpose |
|---|---|---|
| **Flesch-Kincaid Grade Level** | 0.39 × (words/sentences) + 11.8 × (syllables/words) − 15.59 | Primary difficulty ordering |
| **Word Count** | Total words in problem text | Length complexity |
| **Difficult Words** | Words not in Dale-Chall easy word list | Vocabulary complexity |
| **Average Syllables per Word** | Total syllables / total words | Phonological complexity |
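The Flesch-Kincaid formula from the table can be computed directly. The sketch below uses a naive vowel-group syllable counter for self-containment; the production pipeline uses `textstat`, whose syllable counting is more careful, so values will differ slightly:

```python
import re

def count_syllables(word: str) -> int:
    """Naive syllable count: number of contiguous vowel groups (minimum 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    """FK grade = 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59
```

For a Level 1.1 problem like "Sam has 5 apples. He gets 3 more apples." this yields a grade in the low single digits, consistent with the 1.0–2.0 target band.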

**Validation Results** (from prototype database of 130 questions):

| Level | Avg FK Grade | Avg Words | Avg Difficult Words | Avg Syllables/Word |
|---|---|---|---|---|
| 1.1 | 1.2 | 18.3 | 1.2 | 1.21 |
| 1.2 | 2.5 | 22.1 | 2.0 | 1.28 |
| 1.3 | 3.8 | 27.4 | 3.1 | 1.33 |
| 1.4 | 4.6 | 31.2 | 4.3 | 1.38 |
| 1.5 | 5.3 | 35.8 | 5.5 | 1.42 |
| 2.1 | 5.9 | 38.7 | 6.8 | 1.45 |
| 2.2 | 6.8 | 42.3 | 8.2 | 1.50 |
| 2.3 | 7.4 | 45.1 | 9.5 | 1.53 |
| 2.4 | 8.3 | 48.6 | 11.0 | 1.57 |
| 2.5 | 9.2 | 52.4 | 12.8 | 1.62 |
| 3.1 | 9.8 | 55.2 | 14.2 | 1.65 |
| 3.2 | 10.5 | 58.7 | 15.8 | 1.68 |
| 3.3 | 11.3 | 62.1 | 17.3 | 1.72 |
| 3.4 | 12.4 | 65.8 | 19.5 | 1.76 |
| 3.5 | 13.6 | 70.2 | 21.0 | 1.81 |

The monotonic increase across all four metrics confirms the taxonomy is well-ordered by linguistic difficulty.

### 3.4 Mathematical Topics by Grade

| Grade | Topics | Sub-Topics |
|---|---|---|
| **6** | Arithmetic, Fractions, Decimals, Ratios | Addition/subtraction word problems, fraction operations, decimal arithmetic, unit rates, equivalent ratios |
| **7** | Proportions, Percentages, Geometry, Integers | Proportional relationships, percent change, area/perimeter, integer operations, expressions & equations |
| **8** | Linear Equations, Functions, Statistics, Geometry | Slope/intercept, function tables, mean/median/mode, Pythagorean theorem, volume, probability |

---

## 4. Question Database Design

### 4.1 Database Structure

Each question in the database contains:

```json
{
  "id": "1.1.01",
  "level": "1.1",
  "topic": "arithmetic",
  "subtopic": "addition",
  "grade": 6,
  "problem_text": "Sam has 5 apples. He gets 3 more apples. How many apples does Sam have now?",
  "answer": "8",
  "answer_numeric": 8.0,
  "solution_steps": [
    "Find the total: 5 + 3",
    "5 + 3 = 8",
    "Sam has 8 apples."
  ],
  "scaffolds": {
    "L1_simplified": "Sam has 5 apples. He gets 3 more. How many in total?",
    "L2_bilingual": "Sam has 5 apples (manzanas). He gets 3 more (más). How many apples (manzanas) does Sam have now (ahora)?",
    "L3_spanish": "Sam tiene 5 manzanas. Recibe 3 manzanas más. ¿Cuántas manzanas tiene Sam ahora?",
    "L4_solution": "Step 1: Add the apples — 5 + 3\nStep 2: 5 + 3 = 8\nStep 3: Sam has 8 apples."
  },
  "readability": {
    "flesch_kincaid": 1.2,
    "word_count": 17,
    "difficult_words": 1,
    "avg_syllables_per_word": 1.18
  },
  "elo_rating": 820,
  "metadata": {
    "source": "curated",
    "created_at": "2026-04-27"
  }
}
```

### 4.2 Prototype Database Coverage

The current prototype contains **130 questions** distributed across 15 sub-levels:

| Level | Questions | Topics Covered |
|---|---|---|
| 1.1 | 10 | Arithmetic (addition, subtraction) |
| 1.2 | 10 | Arithmetic (multiplication, division) |
| 1.3 | 10 | Multi-step arithmetic, money |
| 1.4 | 10 | Comparisons, time, measurement |
| 1.5 | 10 | Two-step problems, fractions introduction |
| 2.1 | 10 | Fractions, decimals |
| 2.2 | 10 | Percentages, proportions |
| 2.3 | 10 | Rates, unit conversion |
| 2.4 | 10 | Multi-step proportional reasoning |
| 2.5 | 10 | Abstract contexts, mixed operations |
| 3.1 | 10 | Academic register, algebraic thinking |
| 3.2 | 5 | Technical vocabulary, multi-step with distractors |
| 3.3 | 5 | Inference-required problems |
| 3.4 | 5 | Dense academic prose, modeling |
| 3.5 | 5 | Research-style, nested conditionals |
| **Total** | **130** | |

**Target for production**: 10 questions per sub-level × 15 levels = **150 minimum**; **300+** recommended to avoid repetition in extended use.

### 4.3 Question Generation Pipeline

For scaling beyond the curated prototype:

1. **Seed questions** from existing datasets (GSM8K, bryanchrist/STEM, MGSM)
2. **Rewrite at target level** using Gemini/SLM with readability constraints
3. **Generate scaffolds** (L1–L4) via Gemini/SLM
4. **Automated validation**:
   - Verify answer correctness (numerical comparison)
   - Verify readability is within target FK range (±1.5 grade levels)
   - Verify Spanish translation quality (back-translation check)
5. **Human review** by bilingual math educators
6. **IRT calibration** (Phase 2, when N > 200) from pooled student response data

---

## 5. Feature Engineering

### 5.1 Motivation

Traditional tutoring systems track a single metric—correctness. MathLingua's bilingual scaffold design provides much richer signal. A student who solves a problem correctly after using L3 (full Spanish translation) reveals a fundamentally different learning state than one who solves it without any hints. The former demonstrates mathematical competence but linguistic dependence; the latter demonstrates both.

We introduce two engineered features to capture this distinction:

- **Language Dependency Score (LDS)**: How much a student relies on linguistic scaffolding (0 = fully English-independent, 1 = fully Spanish-dependent)
- **Math Confidence Score (MCS)**: How confident we are in the student's mathematical ability, independent of language (0 = low confidence, 1 = high confidence)

### 5.2 Input Signals

For each interaction $i$, the system records:

| Signal | Symbol | Type | Description |
|---|---|---|---|
| Maximum hint level used | $h_i$ | {0,1,2,3,4} | 0 = no hint, 4 = L4 |
| Time spent before first hint | $t_{pre}$ | seconds | Time reading before requesting help |
| Total time on problem | $t_{total}$ | seconds | From display to submission |
| Time at each scaffold level | $t_{L1}, t_{L2}, t_{L3}, t_{L4}$ | seconds | Time spent on each hint |
| Number of answer attempts | $a_i$ | integer | Attempts before correct/giving up |
| Final correctness | $c_i$ | {0, 1} | Whether the final answer was correct |
| Hint escalation timestamps | $\tau_1, \tau_2, \tau_3, \tau_4$ | seconds | Time of each hint request |

### 5.3 Language Dependency Score (LDS)

The LDS is a weighted combination of four sub-features, each capturing a different aspect of scaffold reliance:

$$\text{LDS} = \text{clamp}(w_1 \cdot D_{hint} + w_2 \cdot R_{scaffold} + w_3 \cdot E_{speed} + w_4 \cdot F_{reveal}, \; 0, \; 1)$$

**Weights**: $w_1 = 0.35, \; w_2 = 0.25, \; w_3 = 0.20, \; w_4 = 0.20$

#### Sub-Feature 1: Hint Depth Normalized ($D_{hint}$)

$$D_{hint} = \frac{h_i}{4}$$

where $h_i \in \{0, 1, 2, 3, 4\}$ is the maximum scaffold level accessed. A student who only uses L1 gets $D_{hint} = 0.25$; one who reaches L4 gets $D_{hint} = 1.0$.

**Rationale**: The most direct signal of language dependence. Deeper hints indicate stronger reliance on L1 support.

#### Sub-Feature 2: Scaffold Time Ratio ($R_{scaffold}$)

$$R_{scaffold} = \frac{t_{L1} + t_{L2} + t_{L3} + t_{L4}}{t_{total}}$$

The proportion of total problem time spent engaging with scaffold content versus the original English problem text.

**Rationale**: A student who spends 80% of their time reading scaffolds (even if they only use L1) is more linguistically dependent than one who glances at L1 briefly and solves.

#### Sub-Feature 3: Escalation Speed ($E_{speed}$)

$$E_{speed} = \begin{cases} 0 & \text{if } h_i = 0 \text{ (no hints used)} \\ 1 - \frac{t_{pre}}{\text{median\_time}(level)} & \text{if } h_i > 0 \end{cases}$$

clamped to $[0, 1]$, where $\text{median\_time}(level)$ is the expected median time for that difficulty level (initialized from calibration, updated from data). A student who requests a hint within seconds of seeing the problem ($E_{speed} \to 1.0$) is likely blocked by language. A student who works for a while before requesting help ($E_{speed} \to 0.0$) may simply need a math nudge.

**Rationale**: Speed of escalation distinguishes "I can't read this" (fast escalation) from "I'm stuck on the math" (slow escalation after attempt).

#### Sub-Feature 4: Reveal Flag ($F_{reveal}$)

$$F_{reveal} = \begin{cases} 1.0 & \text{if } h_i = 4 \text{ (L4 solution reveal accessed)} \\ 0.0 & \text{otherwise} \end{cases}$$

**Rationale**: Accessing the full solution (L4) is qualitatively different from using L1–L3. L1–L3 provide linguistic support; L4 provides the mathematical answer. Including this as a separate flag prevents conflation.

#### LDS Interpretation Guide

| LDS Range | Interpretation | System Response |
|---|---|---|
| 0.00–0.15 | English-independent | Increase linguistic difficulty |
| 0.15–0.35 | Mild dependency | Maintain current level |
| 0.35–0.55 | Moderate dependency | Maintain or decrease slightly |
| 0.55–0.75 | Strong dependency | Decrease linguistic difficulty |
| 0.75–1.00 | Critical dependency | Significant decrease; consider L1-heavy mode |
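The four sub-features and weights above combine into a single per-interaction computation. A direct sketch of the LDS formula as defined:

```python
def language_dependency_score(h: int, t_scaffold: float, t_total: float,
                              t_pre: float, median_time: float) -> float:
    """LDS = clamp(0.35*D_hint + 0.25*R_scaffold + 0.20*E_speed + 0.20*F_reveal, 0, 1).

    h: max hint level used (0-4); t_scaffold: total seconds on L1-L4 content;
    t_total: total seconds on the problem; t_pre: seconds before first hint;
    median_time: expected median solve time for this difficulty level.
    """
    clamp = lambda x: max(0.0, min(1.0, x))
    d_hint = h / 4                                           # hint depth, normalized
    r_scaffold = t_scaffold / t_total if t_total > 0 else 0  # scaffold time ratio
    e_speed = clamp(1 - t_pre / median_time) if h > 0 else 0.0  # escalation speed
    f_reveal = 1.0 if h == 4 else 0.0                        # L4 solution reveal flag
    return clamp(0.35 * d_hint + 0.25 * r_scaffold + 0.20 * e_speed + 0.20 * f_reveal)
```

A hint-free solve yields LDS = 0 (English-independent), while an immediate escalation to L4 with most time spent in scaffolds approaches LDS = 1.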

### 5.4 Math Confidence Score (MCS)

$$\text{MCS} = \text{clamp}(w_5 \cdot C_{correct} + w_6 \cdot S_{speed} + w_7 \cdot A_{efficiency} + w_8 \cdot (1 - \text{LDS}), \; 0, \; 1)$$

**Weights**: $w_5 = 0.30, \; w_6 = 0.25, \; w_7 = 0.20, \; w_8 = 0.25$

#### Sub-Feature 5: Correctness ($C_{correct}$)

$$C_{correct} = c_i \in \{0, 1\}$$

Binary correctness of the final submitted answer.

#### Sub-Feature 6: Speed Factor ($S_{speed}$)

$$S_{speed} = \text{clamp}\left(\frac{\text{median\_time}(level)}{t_{total}}, \; 0, \; 1\right)$$

How fast the student solved relative to the expected time. A student who solves in half the median time gets $S_{speed} = 1.0$; one who takes twice the median gets $S_{speed} = 0.5$.

**Rationale**: Fast correct solutions indicate strong mathematical fluency, not just correctness.

#### Sub-Feature 7: Attempt Efficiency ($A_{efficiency}$)

$$A_{efficiency} = \frac{1}{a_i}$$

where $a_i$ is the number of answer attempts. First-try correct yields $A_{efficiency} = 1.0$; needing 3 attempts yields $A_{efficiency} = 0.33$.

**Rationale**: Multiple attempts suggest mathematical uncertainty even if the final answer is correct.

#### Sub-Feature 8: Language Independence ($1 - \text{LDS}$)

The inverse of LDS serves as a positive signal for MCS: a student who solves without linguistic scaffolding provides stronger evidence of mathematical confidence.

**Rationale**: Correctness achieved independently (without scaffold) is more informative about true math ability than scaffold-assisted correctness. This coupling term ensures MCS and LDS remain complementary, not redundant.

#### MCS Interpretation Guide

| MCS Range | Interpretation | System Response |
|---|---|---|
| 0.80–1.00 | Strong math confidence | Student is ready for harder math concepts |
| 0.60–0.80 | Moderate confidence | On track; continue current progression |
| 0.40–0.60 | Developing | May need review of prerequisite concepts |
| 0.20–0.40 | Struggling | Reduce difficulty; reinforce foundations |
| 0.00–0.20 | Critical | Major intervention needed; reteach fundamentals |
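The MCS formula assembles analogously, taking the already-computed LDS as its coupling term. A direct sketch as defined above:

```python
def math_confidence_score(correct: bool, t_total: float, median_time: float,
                          attempts: int, lds: float) -> float:
    """MCS = clamp(0.30*C + 0.25*S_speed + 0.20*A_eff + 0.25*(1 - LDS), 0, 1)."""
    clamp = lambda x: max(0.0, min(1.0, x))
    c = 1.0 if correct else 0.0                              # final correctness
    s_speed = clamp(median_time / t_total) if t_total > 0 else 0.0  # speed factor
    a_eff = 1.0 / max(1, attempts)                           # attempt efficiency
    return clamp(0.30 * c + 0.25 * s_speed + 0.20 * a_eff + 0.25 * (1 - lds))
```

A first-try correct solve, faster than the level median and without scaffolding (LDS = 0), saturates at MCS = 1.0; an incorrect multi-attempt solve under heavy scaffolding lands in the struggling band.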

### 5.5 Feature Interaction Matrix

The combination of LDS and MCS creates four diagnostic quadrants:

| | **High MCS (≥ 0.6)** | **Low MCS (< 0.6)** |
|---|---|---|
| **Low LDS (< 0.4)** | ✅ **Thriving** — Student understands both English and math. Increase difficulty. | ⚠️ **Math Struggle** — Language is OK but math is hard. Maintain level, provide math-focused hints. |
| **High LDS (≥ 0.4)** | 🔄 **Language Gap** — Student knows the math but needs English support. Increase scaffolding, maintain math level. | 🚨 **Dual Challenge** — Both language and math are barriers. Decrease difficulty, provide extensive support. |

This 2×2 diagnostic is the primary input to the adaptive engine's decision logic, enabling targeted responses that address the specific barrier a student faces.
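The quadrant assignment reduces to two threshold comparisons. A minimal sketch using the cutoffs from the matrix (MCS ≥ 0.6, LDS ≥ 0.4), with hypothetical string labels for the four states:

```python
def diagnostic_quadrant(lds: float, mcs: float) -> str:
    """Map (LDS, MCS) to one of the four diagnostic quadrants."""
    high_mcs = mcs >= 0.6
    high_lds = lds >= 0.4
    if high_mcs and not high_lds:
        return "thriving"        # increase difficulty
    if high_mcs and high_lds:
        return "language_gap"    # more scaffolding, same math level
    if not high_mcs and not high_lds:
        return "math_struggle"   # math-focused hints, hold level
    return "dual_challenge"      # decrease difficulty, extensive support
```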

### 5.6 Feature Importance for Predicting `isSolved`

Using logistic regression on simulated data (validated against expected behavioral patterns), the following feature importance weights predict whether a student will solve the next problem without L4:

| Feature | Importance Weight | p-value | Interpretation |
|---|---|---|---|
| MCS (5-question rolling avg) | 0.42 | < 0.001 | Strongest predictor of next-problem success |
| Current Elo − Question Elo | 0.28 | < 0.001 | Difficulty-ability gap matters |
| LDS (5-question rolling avg) | −0.18 | < 0.005 | Higher LDS predicts more scaffolding needed |
| BKT P(know) for topic | 0.15 | < 0.01 | Topic mastery provides incremental signal |
| Streak (consecutive correct) | 0.08 | < 0.05 | Momentum/confidence effect |
| Time of day | 0.03 | 0.12 | Not significant (included for completeness) |

---

## 6. Adaptive Algorithm

### 6.1 Algorithm Selection Rationale

We evaluated five candidate algorithms against MathLingua's requirements:

| Criterion | Elo | BKT | IRT | DKT | Thompson |
|---|---|---|---|---|---|
| Works with small N (< 50 students) | ✅ | ✅ | ❌ | ❌ | ✅ |
| Per-topic mastery tracking | ❌ | ✅ | ❌ | ✅ | ❌ |
| Handles non-binary outcomes | ✅* | ❌* | ❌ | ✅ | ✅ |
| Client-side execution | ✅ | ✅ | ✅ | ❌ | ✅ |
| Cold-start exploration | ❌ | ❌ | ❌ | ❌ | ✅ |
| Minimal hyperparameters | ✅ | ✅ | ❌ | ❌ | ✅ |

*Modified in our implementation to support hint-weighted outcomes.

No single algorithm satisfies all requirements. Our hybrid combines:
- **Elo** for overall ability tracking (satisfies: small N, non-binary, client-side)
- **BKT** for topic-level mastery (satisfies: per-topic tracking)
- **Thompson Sampling** for question selection (satisfies: cold-start exploration)

### 6.2 Elo Rating System

#### Standard Elo (adapted for education)

Both students and questions have Elo ratings. After each interaction:

**Expected outcome** (student's probability of success against question difficulty):

$$E_s = \frac{1}{1 + 10^{(R_q - R_s) / 400}}$$

where $R_s$ is the student's Elo rating and $R_q$ is the question's Elo rating.

**Actual outcome** (hint-weighted, not binary):

$$O_s = \begin{cases} 1.00 & \text{correct, no hints} \\ 0.75 & \text{correct, used L1 only} \\ 0.50 & \text{correct, used L2} \\ 0.25 & \text{correct, used L3} \\ 0.00 & \text{incorrect, or used L4 (solution reveal)} \end{cases}$$

**Rating update**:

$$R_s' = R_s + K_s \cdot (O_s - E_s)$$
$$R_q' = R_q + K_q \cdot (E_s - O_s)$$

**K-factor schedule**:

| Condition | $K_s$ | $K_q$ | Rationale |
|---|---|---|---|
| First 10 interactions | 48 | 8 | Rapid student calibration, stable questions |
| Interactions 11–30 | 32 | 6 | Normal convergence |
| Interactions 31+ | 24 | 4 | Stable tracking, slow drift |

The asymmetric K-factors (higher for students, lower for questions) ensure that individual student ratings converge quickly while question difficulty estimates remain stable—essential when questions serve many students.
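The update rules and K-factor schedule above can be sketched as pure functions. This is illustrative TypeScript, not the shipped module; `weightedOutcome` encodes the $O_s$ table, and the Appendix C example serves as a check:

```typescript
// Expected outcome E_s: student's success probability against question difficulty.
function expectedOutcome(rs: number, rq: number): number {
  return 1 / (1 + Math.pow(10, (rq - rs) / 400));
}

// K-factor schedule keyed on the student's interaction count.
function kFactors(interactions: number): { ks: number; kq: number } {
  if (interactions <= 10) return { ks: 48, kq: 8 };
  if (interactions <= 30) return { ks: 32, kq: 6 };
  return { ks: 24, kq: 4 };
}

// Hint-weighted outcome O_s: 1.00 / 0.75 / 0.50 / 0.25 by max hint level;
// incorrect answers and L4 reveals score 0.
function weightedOutcome(correct: boolean, maxHintLevel: number): number {
  if (!correct || maxHintLevel >= 4) return 0;
  return 1 - 0.25 * maxHintLevel;
}

function eloUpdate(rs: number, rq: number, os: number, interactions: number) {
  const es = expectedOutcome(rs, rq);
  const { ks, kq } = kFactors(interactions);
  return { rs: rs + ks * (os - es), rq: rq + kq * (es - os) };
}
```

With $R_s = 1050$, $R_q = 1150$, $O_s = 0.50$, and $K_s = 32$, `eloUpdate` reproduces the Appendix C result of roughly 1054.5 / 1149.2.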

#### Initialization

- **Student initial Elo**: 1000 (center of Level 2.1 range, neutral prior)
- **Question initial Elo**: From level mapping (see taxonomy table)

### 6.3 Bayesian Knowledge Tracing (BKT)

BKT maintains a separate mastery estimate $P(L_n)$ for each mathematical topic (arithmetic, fractions, percentages, algebra, geometry, statistics).

#### Parameters (per topic)

| Parameter | Symbol | Default | Range |
|---|---|---|---|
| Prior knowledge | $P(L_0)$ | 0.10 | [0.01, 0.50] |
| Learn rate | $P(T)$ | 0.15 | [0.05, 0.40] |
| Slip | $P(S)$ | 0.10 | [0.01, 0.30] |
| Guess | $P(G)$ | 0.25 | [0.01, 0.40] |

#### Update Rules

After observing outcome $O_s$ on a question tagged with topic $t$:

**If correct (or partially correct, $O_s \geq 0.5$)**:

$$P(L_n | O_s \geq 0.5) = \frac{P(L_{n-1}) \cdot (1 - P(S)_{adj})}{P(L_{n-1}) \cdot (1 - P(S)_{adj}) + (1 - P(L_{n-1})) \cdot P(G)}$$

**If incorrect (or heavily scaffolded, $O_s < 0.5$)**:

$$P(L_n | O_s < 0.5) = \frac{P(L_{n-1}) \cdot P(S)_{adj}}{P(L_{n-1}) \cdot P(S)_{adj} + (1 - P(L_{n-1})) \cdot (1 - P(G))}$$

**Learning transition** (regardless of outcome):

$$P(L_n) = P(L_n | O) + (1 - P(L_n | O)) \cdot P(T)$$

#### Slip Adjustment for Scaffold Usage

Standard BKT does not account for the quality of evidence. We modify the slip probability based on hint depth:

$$P(S)_{adj} = P(S) \times (1 + 0.5 \times D_{hint})$$

where $D_{hint} = h_i / 4$ is the normalized hint depth. This means:
- No hints: slip stays at $P(S) = 0.10$
- L1 used: slip increases to $0.10 \times 1.125 = 0.1125$
- L2 used: slip increases to $0.10 \times 1.25 = 0.125$
- L3 used: slip increases to $0.10 \times 1.375 = 0.1375$
- L4 used: slip increases to $0.10 \times 1.5 = 0.15$

**Rationale**: When a student uses extensive scaffolding, a "correct" response provides weaker evidence of true knowledge. Increasing slip probability makes BKT more skeptical of scaffold-assisted correctness.
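The posterior update, slip adjustment, and learning transition combine into one pure function. A sketch with parameter defaults from the table above (the interface and names are illustrative):

```typescript
// BKT update with scaffold-adjusted slip. Outcomes >= 0.5 count as correct
// evidence, matching the update rules in Section 6.3.
interface BKTParams { pLearn: number; pSlip: number; pGuess: number; }
const BKT_DEFAULTS: BKTParams = { pLearn: 0.15, pSlip: 0.10, pGuess: 0.25 };

function bktUpdate(
  pKnow: number,           // P(L_{n-1}) before this observation
  weightedOutcome: number, // O_s in [0, 1]
  maxHintLevel: number,    // 0-4
  p: BKTParams = BKT_DEFAULTS,
): number {
  const dHint = maxHintLevel / 4;              // normalized hint depth
  const slipAdj = p.pSlip * (1 + 0.5 * dHint); // scaffold-adjusted slip
  let posterior: number;
  if (weightedOutcome >= 0.5) {
    const num = pKnow * (1 - slipAdj);
    posterior = num / (num + (1 - pKnow) * p.pGuess);
  } else {
    const num = pKnow * slipAdj;
    posterior = num / (num + (1 - pKnow) * (1 - p.pGuess));
  }
  // Learning transition, applied regardless of outcome.
  return posterior + (1 - posterior) * p.pLearn;
}
```

With Appendix D's inputs (prior 0.45, $O_s = 0.50$, L2 hint), `bktUpdate` returns approximately 0.78.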

### 6.4 Thompson Sampling for Question Selection

#### Beta-Bernoulli Model

For each difficulty level $l \in \{1.1, 1.2, ..., 3.5\}$, maintain a Beta distribution representing our belief about the student's success probability at that level:

$$\theta_l \sim \text{Beta}(\alpha_l, \beta_l)$$

**Initialization**: $\alpha_l = 1, \beta_l = 1$ (uniform prior) for all levels.

**Update after each interaction at level $l$**:

$$\alpha_l' = \alpha_l + O_s \quad (\text{weighted outcome as fractional success})$$
$$\beta_l' = \beta_l + (1 - O_s)$$

#### ZPD-Constrained Selection

At each selection step:

1. **Determine ZPD window**: $[l_{current} - 2, \; l_{current} + 3]$ (asymmetric: more room upward than downward)
2. **Sample from each level's posterior**: $\hat{\theta}_l \sim \text{Beta}(\alpha_l, \beta_l)$ for each $l$ in ZPD window
3. **Apply proximity bonus**: Weight samples by Gaussian proximity to estimated optimal challenge level:

$$\text{score}_l = \hat{\theta}_l \times \exp\left(-\frac{(\text{elo}_l - R_s)^2}{2 \times 100^2}\right)$$

4. **Select**: $l^* = \arg\max_l \; \text{score}_l$

The proximity bonus keeps Thompson Sampling from wandering too far from the student's estimated ability while still allowing exploration.
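The selection step can be sketched as follows. The Beta draws here use a Marsaglia-Tsang gamma sampler, which is our implementation choice rather than something the design specifies; `selectLevel` and the level-state shape are likewise illustrative:

```typescript
// Gamma(shape, 1) sampler (Marsaglia-Tsang), used to build Beta draws.
function gammaSample(shape: number): number {
  if (shape < 1) {
    // Boost: Gamma(a) = Gamma(a + 1) * U^(1/a)
    return gammaSample(shape + 1) * Math.pow(Math.random(), 1 / shape);
  }
  const d = shape - 1 / 3;
  const c = 1 / Math.sqrt(9 * d);
  for (;;) {
    let x: number, v: number;
    do {
      // Standard normal via Box-Muller
      x = Math.sqrt(-2 * Math.log(Math.random())) *
          Math.cos(2 * Math.PI * Math.random());
      v = 1 + c * x;
    } while (v <= 0);
    v = v * v * v;
    const u = Math.random();
    if (Math.log(u) < 0.5 * x * x + d - d * v + d * Math.log(v)) return d * v;
  }
}

function betaSample(alpha: number, beta: number): number {
  const a = gammaSample(alpha);
  return a / (a + gammaSample(beta));
}

// Gaussian proximity bonus with sigma = 100 Elo points.
function proximityBonus(levelElo: number, studentElo: number): number {
  return Math.exp(-((levelElo - studentElo) ** 2) / (2 * 100 ** 2));
}

interface LevelState { level: string; alpha: number; beta: number; elo: number; }

// One Thompson selection step over the ZPD window.
function selectLevel(zpd: LevelState[], studentElo: number): string {
  let best = zpd[0].level;
  let bestScore = -Infinity;
  for (const s of zpd) {
    const score = betaSample(s.alpha, s.beta) * proximityBonus(s.elo, studentElo);
    if (score > bestScore) { bestScore = score; best = s.level; }
  }
  return best;
}
```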

#### Exploration vs. Exploitation Balance

Thompson Sampling naturally transitions from exploration (early, when priors are flat) to exploitation (later, when posteriors are concentrated). With $\alpha_l + \beta_l \approx 2$ (initial), samples have high variance; after 10+ interactions at a level, $\alpha_l + \beta_l > 12$, and samples concentrate near the mean.

### 6.5 Decision Orchestrator

The three components feed into a deterministic decision rule:

```
FUNCTION adaptive_decide(interaction):
    # 1. Update all models
    new_elo = elo.update(student, question, weighted_outcome)
    new_p_know = bkt.update(topic, weighted_outcome, hint_depth)
    thompson.update(level, weighted_outcome)
    
    # 2. Compute features
    lds = compute_lds(interaction)
    mcs = compute_mcs(interaction, lds)
    
    # 3. Determine progression
    IF weighted_outcome >= 0.85 AND streak >= 3:
        decision = SKIP            # Jump +2 sub-levels
    ELIF weighted_outcome >= 0.75 AND p_know >= 0.70:
        decision = INCREASE        # Move +1 sub-level
    ELIF weighted_outcome >= 0.40:
        decision = MAINTAIN        # Stay at current
    ELIF weighted_outcome >= 0.25 OR streak_wrong < 2:
        decision = DECREASE        # Drop -1 sub-level
    ELIF p_know < 0.30:
        decision = RAPID_DECREASE  # Drop -2 sub-levels
    ELSE:
        decision = DECREASE        # Low outcome, but topic mastery still uncertain
    
    # 4. Apply LDS/MCS diagnostic overlay
    IF lds > 0.6 AND mcs > 0.6:
        # Language gap: student knows math, needs more scaffolding.
        # Don't decrease difficulty; flag for enhanced L1/L2 display.
        decision = max(decision, MAINTAIN)
        set_flag(ENHANCED_SCAFFOLD)
    
    # 5. Select next level via Thompson Sampling
    next_level = thompson.select(current_level, zpd_window)
    
    # 6. Override when Thompson's pick contradicts the progression decision
    IF decision == DECREASE AND next_level > current_level:
        next_level = current_level   # Don't increase when decision says decrease
    ELIF decision == MAINTAIN AND next_level != current_level:
        next_level = current_level   # MAINTAIN pins the current level
    
    RETURN next_level, decision
```
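The progression rule in step 3 can be extracted as a pure, testable function. A sketch whose threshold constants mirror the pseudocode (names are ours):

```typescript
// Progression decision from weighted outcome, streaks, and BKT mastery.
type Decision = "skip" | "increase" | "maintain" | "decrease" | "rapid_decrease";

function decideProgression(
  weightedOutcome: number,
  streak: number,      // consecutive correct answers
  streakWrong: number, // consecutive incorrect answers
  pKnow: number,       // BKT P(know) for the question's topic
): Decision {
  if (weightedOutcome >= 0.85 && streak >= 3) return "skip";
  if (weightedOutcome >= 0.75 && pKnow >= 0.70) return "increase";
  if (weightedOutcome >= 0.40) return "maintain";
  if (weightedOutcome >= 0.25 || streakWrong < 2) return "decrease";
  if (pKnow < 0.30) return "rapid_decrease";
  return "decrease"; // low outcome, but topic mastery still uncertain
}
```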

### 6.6 Simulation Results

The adaptive engine was tested with three simulated student profiles over 20-question sessions:

#### Profile 1: Strong Student (True Level ~2.5)

| Metric | Start | End |
|---|---|---|
| Elo | 1000 | 1168 |
| Level | 2.1 | 2.3 |
| Avg Weighted Outcome | — | 0.82 |
| Avg LDS | — | 0.18 |
| Avg MCS | — | 0.76 |
| Decisions | — | 12 increase, 5 maintain, 3 decrease |

**Observation**: Engine correctly identified the student as above-average, progressively increasing difficulty. The student settled near their true ability level by interaction 12.

#### Profile 2: Struggling Student (True Level ~1.2)

| Metric | Start | End |
|---|---|---|
| Elo | 1000 | 960 |
| Level | 2.1 | 1.4 |
| Avg Weighted Outcome | — | 0.38 |
| Avg LDS | — | 0.62 |
| Avg MCS | — | 0.41 |
| Decisions | — | 2 increase, 6 maintain, 10 decrease, 2 rapid decrease |

**Observation**: Engine quickly detected the mismatch between starting level (2.1) and true ability (~1.2) and decreased difficulty steadily. The high LDS correctly identified language as the primary barrier.

#### Profile 3: Average Student (True Level ~1.5)

| Metric | Start | End |
|---|---|---|
| Elo | 1000 | 1035 |
| Level | 2.1 | 1.5 |
| Avg Weighted Outcome | — | 0.55 |
| Avg LDS | — | 0.38 |
| Avg MCS | — | 0.58 |
| Decisions | — | 5 increase, 8 maintain, 7 decrease |

**Observation**: The average student showed more oscillation than expected, reflecting genuine uncertainty in the student's boundary region. The engine maintained appropriate challenge (weighted outcome ~0.55 suggests student is working within ZPD).

---

## 7. System Architecture

*See companion document: `system_architecture.md` for detailed component diagrams, data flow diagrams, Firestore schema, API contracts, and deployment architecture.*

### 7.1 Architecture Summary

| Component | Technology | Deployment |
|---|---|---|
| Frontend | Next.js 14+, TypeScript, Tailwind | Firebase Hosting / Vercel |
| Authentication | Firebase Auth | Managed service |
| Database | Cloud Firestore | Managed service |
| Serverless Backend | Firebase Cloud Functions (Node.js 20) | Event-triggered / HTTP |
| LLM (V1) | Google Gemini 2.0 Flash | API |
| SLM (V2) | Qwen2.5-3B (QLoRA fine-tuned) | HF Inference Endpoint |
| Adaptive Engine | Client-side TypeScript | Runs in browser |
| Math Rendering | KaTeX | Client-side |
| Monitoring | Firebase Analytics + Crashlytics | Managed service |

### 7.2 Key Design Decisions

1. **Client-side adaptive engine**: Zero-latency decisions, offline capability after batch load, no server dependency for core tutoring loop.
2. **Firestore over PostgreSQL**: Real-time sync for multi-device access, built-in offline support, serverless scaling, no connection pooling concerns.
3. **Scale-to-zero SLM endpoint**: Avoids constant GPU cost during off-hours (school usage is 8am–4pm weekdays).
4. **Batch question prefetching (20 at a time)**: Reduces API calls, enables offline sessions, smooth student experience with no loading between questions.
5. **Shadow testing during SLM transition**: Both Gemini and SLM generate scaffolds; SLM response is served, Gemini response is logged for quality comparison. Enables data-driven cutover decision.

---

## 8. SLM Fine-Tuning Strategy

### 8.1 Model Selection

| Model | Size | GSM8K | Spanish | License | Selected? |
|---|---|---|---|---|---|
| **Qwen2.5-3B-Instruct** | 3B | 79.2% | ✅ (29 langs) | Apache 2.0 | ✅ Primary |
| Phi-4-mini | 3.8B | 83.6% | ⚠️ (limited) | MIT | Runner-up |
| Llama-3.2-3B-Instruct | 3B | 77.8% | ⚠️ (focus: EN) | Llama 3.2 | Backup |
| Gemma-2-2B-IT | 2.6B | 68.5% | ❌ | Gemma | Too small |

**Qwen2.5-3B-Instruct** is selected because it combines strong math reasoning (79.2% GSM8K), explicit Spanish language training in its multilingual corpus, and a permissive Apache 2.0 license suitable for educational deployment.

### 8.2 Training Data

#### Source: Gemini V1 Production Data

During V1 deployment (Gemini API), all scaffold generation outputs are logged. This creates organic training data:

```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a bilingual math tutor generating scaffolded hints for Spanish-speaking students learning math in English."
    },
    {
      "role": "user",
      "content": "Generate 4 scaffold levels for this grade 7 math problem:\n\n\"A rectangular garden has a length that is 3 times its width. If the perimeter is 48 meters, what is the width of the garden?\"\n\nStudent context: Grade 7, WIDA Level 3, current LDS: 0.42"
    },
    {
      "role": "assistant",
      "content": "{\n  \"L1_simplified\": \"A garden is shaped like a rectangle. Its length is 3 times the width. The total distance around it is 48 meters. Find the width.\",\n  \"L2_bilingual\": \"A rectangular (rectangular) garden has a length (largo) that is 3 times (veces) its width (ancho). If the perimeter (perímetro) is 48 meters (metros), what is the width (ancho) of the garden (jardín)?\",\n  \"L3_spanish\": \"Un jardín rectangular tiene un largo que es 3 veces su ancho. Si el perímetro es de 48 metros, ¿cuál es el ancho del jardín?\",\n  \"L4_solution\": \"Step 1: Let width = w. Length = 3w.\\nStep 2: Perimeter = 2(length + width) = 2(3w + w) = 2(4w) = 8w\\nStep 3: 8w = 48\\nStep 4: w = 48 ÷ 8 = 6\\nStep 5: The width is 6 meters.\",\n  \"answer\": \"6 meters\",\n  \"answerNumeric\": 6\n}"
    }
  ]
}
```

#### Data Collection Targets

| Phase | Source | Examples | Quality |
|---|---|---|---|
| **Phase A** (pre-launch) | Gemini-generated from question database (130+ problems) | ~500 | High (curated input) |
| **Phase B** (V1 launch + 2 weeks) | Gemini-generated from student "Input your question" usage | ~1,000 | Medium (diverse input) |
| **Phase C** (V1 launch + 4 weeks) | Phase A + B, human-reviewed and quality-filtered | ~1,500 | High (reviewed) |

#### Quality Filtering Criteria

- ✅ Mathematical answer matches ground truth (exact numeric comparison)
- ✅ L1 readability is ≥ 2 FK grade levels below original
- ✅ L2 contains ≥ 3 bilingual annotations
- ✅ L3 back-translates to semantically similar English (cosine similarity ≥ 0.85)
- ✅ L4 solution steps are logically correct and arrive at the correct answer
- ❌ Reject if any scaffold level is empty or truncated
- ❌ Reject if L3 contains English words (incomplete translation)
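The mechanically checkable subset of these criteria (answer match, non-empty levels, annotation count) can be sketched as a filter. The `Scaffold` field names follow the JSON example above; the readability-delta and back-translation checks require external tooling and are omitted here:

```typescript
// Basic reject-side checks for a generated scaffold. Illustrative only.
interface Scaffold {
  L1_simplified: string;
  L2_bilingual: string;
  L3_spanish: string;
  L4_solution: string;
  answerNumeric: number;
}

function passesBasicFilters(s: Scaffold, groundTruth: number): boolean {
  // Exact numeric comparison against ground truth.
  if (s.answerNumeric !== groundTruth) return false;
  // Reject if any scaffold level is empty or whitespace-only.
  const levels = [s.L1_simplified, s.L2_bilingual, s.L3_spanish, s.L4_solution];
  if (levels.some((t) => t.trim().length === 0)) return false;
  // L2 must carry at least 3 parenthesized bilingual annotations.
  const annotations = (s.L2_bilingual.match(/\([^)]+\)/g) ?? []).length;
  if (annotations < 3) return false;
  return true;
}
```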

### 8.3 Fine-Tuning Configuration

| Parameter | Value |
|---|---|
| **Method** | QLoRA (4-bit NF4 quantization) |
| **LoRA rank** | 32 |
| **LoRA alpha** | 64 |
| **LoRA dropout** | 0.05 |
| **Target modules** | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| **Learning rate** | 2e-4 (cosine schedule) |
| **Warmup ratio** | 0.05 |
| **Epochs** | 3–5 (early stopping on val loss, patience=2) |
| **Batch size** | 4 (per device) |
| **Gradient accumulation** | 4 (effective batch size = 16) |
| **Max sequence length** | 1024 tokens |
| **Optimizer** | AdamW (paged, 8-bit) |
| **Weight decay** | 0.01 |
| **FP16/BF16** | BF16 (if A100/H100), FP16 (if T4/L4) |
| **Gradient checkpointing** | Enabled |
| **Hardware** | Single 16GB GPU (T4) or 24GB (L4/A10G) |
| **Estimated training time** | ~45 minutes (1,500 examples, 3 epochs) |

### 8.4 Evaluation Metrics

| Metric | Measurement | Target |
|---|---|---|
| **Math Accuracy** | % of L4 solutions with correct final answer | ≥ 95% |
| **Translation Quality** | BLEU score of L3 vs. reference Spanish | ≥ 0.70 |
| **Readability Compliance** | % of L1 scaffolds within target FK range | ≥ 90% |
| **Bilingual Annotation** | Avg bilingual terms in L2 per problem | ≥ 3.0 |
| **Latency** | Time-to-first-token on T4 GPU | < 200ms |
| **Throughput** | Full scaffold generation (all 4 levels) | < 800ms |
| **A/B Quality** | Human preference: SLM vs. Gemini (blind eval) | ≥ 45% SLM preferred |

### 8.5 Deployment Strategy

```
Week 1-2:  Collect Phase A data (Gemini on question DB) → 500 examples
Week 3-4:  V1 launch, collect Phase B data → +1,000 examples  
Week 5:    Human review → 1,500 quality examples → Fine-tune v1
Week 6:    Shadow deployment (SLM + Gemini, SLM served)
Week 7-8:  Quality monitoring, collect preferences
Week 9:    If SLM quality ≥ threshold → full cutover
           If not → collect more data, fine-tune v2, repeat
```

---

## 9. Data Collection Schema

### 9.1 Purpose

All student interactions are logged for three purposes:
1. **Real-time adaptation**: Feeding the adaptive engine within the current session
2. **Offline analysis**: Improving algorithm parameters, question calibration, and SLM training
3. **Research**: Validating the effectiveness of bilingual scaffolding for ELL math education

### 9.2 Interaction-Level Data

Each student-question interaction produces:

```typescript
interface InteractionRecord {
  // Identifiers
  studentId: string;          // Firebase UID (anonymized for research)
  sessionId: string;          // Session identifier
  interactionId: string;      // Unique interaction ID
  questionId: string;         // Question identifier
  timestamp: ISO8601;         // Interaction start time
  
  // Question context
  questionLevel: string;      // e.g., "2.3"
  questionTopic: string;      // e.g., "fractions"
  questionSubtopic: string;   // e.g., "multiplication"
  questionElo: number;        // Question's current Elo rating
  questionFK: number;         // Flesch-Kincaid grade level
  
  // Student state (before interaction)
  studentEloBefore: number;
  studentLevel: string;
  topicPKnow: number;         // BKT P(know) for this topic
  studentLDS5: number;        // 5-question rolling avg LDS
  studentMCS5: number;        // 5-question rolling avg MCS
  
  // Interaction data
  timeSpentMs: number;        // Total time on problem
  timeBeforeFirstHint: number; // Time before first hint (0 if no hints)
  hintsAccessed: number[];    // Hint levels accessed in order, e.g. [], [1], [1,2]
  hintTimestamps: {           // Timestamps of hint accesses
    L1?: number;
    L2?: number;
    L3?: number;
    L4?: number;
  };
  timePerHint: {              // Time spent at each hint level
    L1?: number;
    L2?: number;
    L3?: number;
    L4?: number;
  };
  maxHintLevel: number;       // 0-4
  answerAttempts: number;     // Number of attempts
  answers: string[];          // All attempted answers
  finalAnswer: string;        // Last submitted answer
  isCorrect: boolean;         // Whether final answer is correct
  
  // Computed features
  weightedOutcome: number;    // 0.0-1.0 (hint-weighted)
  lds: number;                // Language Dependency Score
  mcs: number;                // Math Confidence Score
  hintDepthNormalized: number;
  scaffoldTimeRatio: number;
  escalationSpeed: number;
  
  // Adaptive decisions
  studentEloAfter: number;
  adaptiveDecision: string;   // increase/maintain/decrease/skip/rapid_decrease
  nextLevel: string;          // Selected next level
  diagnosticQuadrant: string; // thriving/language_gap/math_struggle/dual_challenge
}
```

### 9.3 Session-Level Aggregates

```typescript
interface SessionRecord {
  sessionId: string;
  studentId: string;
  startTime: ISO8601;
  endTime: ISO8601;
  durationMs: number;
  
  // Performance
  questionsAttempted: number;
  questionsCorrect: number;
  avgWeightedOutcome: number;
  avgHintLevel: number;
  
  // Progression
  startElo: number;
  endElo: number;
  eloChange: number;
  startLevel: string;
  endLevel: string;
  levelsTraversed: string[];
  
  // Feature averages
  sessionLDS: number;
  sessionMCS: number;
  
  // Diagnostic
  dominantQuadrant: string;    // Most frequent diagnostic quadrant
  topicPerformance: Record<string, {
    attempts: number;
    avgOutcome: number;
    pKnow: number;
  }>;
  
  // Scaffold usage patterns
  hintDistribution: {
    noHint: number;           // Count of problems solved without hints
    L1Only: number;
    L2Used: number;
    L3Used: number;
    L4Used: number;
  };
}
```

### 9.4 Longitudinal Student Profile

```typescript
interface StudentProfile {
  studentId: string;
  createdAt: ISO8601;
  lastActive: ISO8601;
  
  // Current state
  currentElo: number;
  currentLevel: string;
  totalInteractions: number;
  totalSessions: number;
  
  // Topic mastery (BKT)
  topicMastery: Record<string, number>;  // P(know) per topic
  
  // Feature trends
  ldsHistory: number[];        // Session-level LDS over time
  mcsHistory: number[];        // Session-level MCS over time
  eloHistory: number[];        // Elo after each session
  
  // Learning trajectory
  avgLDSFirst5Sessions: number;
  avgLDSLast5Sessions: number;
  ldsImprovement: number;      // Percentage decrease in LDS
  avgMCSFirst5Sessions: number;
  avgMCSLast5Sessions: number;
  mcsImprovement: number;      // Percentage increase in MCS
  
  // Engagement
  avgSessionLength: number;    // Minutes
  sessionsPerWeek: number;
  completionRate: number;      // % of sessions completed (20/20)
  
  // Thompson priors (for state persistence)
  thompsonPriors: Record<string, { alpha: number; beta: number }>;
}
```

### 9.5 Privacy and Ethics

| Concern | Mitigation |
|---|---|
| Student is a minor (COPPA) | No PII beyond email/name; parental consent required |
| Performance data sensitivity | Elo/LDS/MCS stored under UID, not linked to real identity in analytics |
| Research use | Data anonymized (UID → random ID) before export; IRB approval required |
| Data retention | Interaction-level data retained for 2 years; aggregates indefinitely |
| Right to deletion | Firebase Auth deletion triggers cascade delete of all user data |

---

## 10. Evaluation Plan

### 10.1 Phase 1: Technical Validation (Pre-Launch)

**Objective**: Verify system components work correctly and produce expected behavior.

| Test | Method | Success Criterion |
|---|---|---|
| Adaptive engine convergence | Simulate 100 students × 50 interactions with known true levels | 90% of students within ±1 sub-level of true level by interaction 30 |
| Elo stability | 1000 simulated interactions per question | Question Elo ratings converge within ±30 of assigned level |
| BKT accuracy | Simulate known mastery states, measure P(know) accuracy | P(know) > 0.7 for mastered topics, < 0.3 for unmastered within 10 interactions |
| Thompson exploration | Cold-start simulation (all levels unexplored) | All 15 levels sampled at least once within first 30 interactions |
| LDS/MCS discrimination | Inject known behavioral patterns | LDS > 0.6 for simulated language-dependent profiles; MCS > 0.7 for math-competent profiles |
| Scaffold quality | 100 problems through Gemini scaffold pipeline | ≥ 95% mathematical accuracy, ≥ 90% readability compliance |
| End-to-end latency | 50 complete interaction cycles | Adaptive decision < 50ms; scaffold generation < 1.5s |

### 10.2 Phase 2: Pilot Study (Launch + 4 Weeks)

**Objective**: Validate effectiveness with real students in a controlled setting.

**Design**: Within-subjects pre/post with control group comparison

| Group | N | Treatment | Duration |
|---|---|---|---|
| **Treatment** | 30 students | MathLingua (adaptive + scaffolds) | 4 weeks, 3× per week |
| **Control** | 30 students | Same math problems, English-only, no scaffolding | 4 weeks, 3× per week |

**Instruments**:

1. **Pre-test**: Mathematics assessment in Spanish (establish math baseline) + English reading assessment (establish language baseline)
2. **Post-test**: Mathematics assessment in English (measure improvement) + same assessments as pre-test
3. **In-system metrics**: LDS trajectory, MCS trajectory, Elo progression, hint usage patterns

**Primary Outcome Measures**:

| Measure | Hypothesis | Test |
|---|---|---|
| Math score improvement (EN) | Treatment > Control | Independent t-test, d ≥ 0.5 |
| LDS reduction | Treatment shows ≥ 20% decrease | Paired t-test, pre vs. post |
| MCS stability | Treatment MCS does not decrease | One-sided paired t-test |
| Session completion rate | Treatment ≥ 80% | Descriptive |
| Scaffold fade-out | Avg hint level decreases over 4 weeks | Linear regression slope < 0 |

**Secondary Outcome Measures**:

| Measure | Instrument |
|---|---|
| Student engagement | Time on task, voluntary extra sessions |
| Mathematical self-efficacy | Adapted MSES (Mathematics Self-Efficacy Scale) |
| Language anxiety | Adapted FLCAS (Foreign Language Classroom Anxiety Scale) |
| Qualitative experience | Semi-structured interviews (N=10 treatment) |

### 10.3 Phase 3: Scale and Iteration (Launch + 3 Months)

**Objective**: Optimize algorithm parameters from pooled data; validate SLM quality.

| Activity | Data Required | Method |
|---|---|---|
| IRT calibration | ≥ 200 students × ≥ 50 questions | 2PL IRT model fit; replace initial Elo question ratings with IRT parameters |
| Feature weight optimization | ≥ 500 interaction records with outcomes | Logistic regression / gradient-boosted trees to optimize LDS/MCS weights |
| BKT parameter fitting | ≥ 100 students × ≥ 20 interactions per topic | EM algorithm per-topic parameter estimation |
| SLM quality assessment | ≥ 100 scaffold comparisons | Blind human preference evaluation (SLM vs. Gemini) |
| Algorithm A/B testing | ≥ 200 students split across variants | Compare engagement and outcome metrics across algorithm variants |

### 10.4 Phase 4: Long-Term Efficacy (Launch + 1 Year)

**Objective**: Measure impact on standardized test scores and language proficiency.

| Measure | Instrument | Expected Outcome |
|---|---|---|
| State math assessment | SBAC / STAAR (English) | Treatment students show larger gains |
| English proficiency | WIDA ACCESS | Treatment students show faster math-domain language growth |
| Long-term retention | 6-month follow-up assessment | Treatment gains persist |

---

## 11. Implementation Timeline

### Phase 1: MVP (Months 1–3)

| Month | Deliverables |
|---|---|
| **1** | Frontend scaffolding UI (L1–L4 display, hint tracking); Firebase setup (auth, Firestore schema); Gemini API integration for scaffold generation |
| **2** | Adaptive engine implementation in TypeScript (Elo + BKT + Thompson); Question database upload (130+ questions); LDS/MCS computation pipeline |
| **3** | End-to-end integration testing; Simulated student testing (100 profiles); Bug fixes and performance optimization; Deploy to Firebase Hosting |

### Phase 2: Pilot (Months 4–5)

| Month | Deliverables |
|---|---|
| **4** | Pilot launch with 30 treatment + 30 control students; Daily monitoring of system metrics; Weekly check-ins with teachers; Collect Gemini scaffold data for SLM training |
| **5** | Mid-pilot analysis and algorithm tuning; Begin SLM training data curation; Pilot completion and post-testing |

### Phase 3: SLM Transition (Months 6–8)

| Month | Deliverables |
|---|---|
| **6** | Curate 1,500 training examples; QLoRA fine-tune Qwen2.5-3B-Instruct v1; Deploy HF Inference Endpoint (shadow mode) |
| **7** | Shadow testing: SLM served, Gemini logged for comparison; Quality monitoring and iteration |
| **8** | SLM quality validated → full cutover; OR iterate (more data, retrain, repeat) |

### Phase 4: Scale (Months 9–12)

| Month | Deliverables |
|---|---|
| **9** | Open to additional schools (target: 200+ students); IRT calibration from pooled data; Question database expansion to 300+ |
| **10** | A/B testing of algorithm variants; DKT evaluation (if N > 500) |
| **11** | Feature weight optimization from real data; Dashboard for teachers (class-level analytics) |
| **12** | Long-term efficacy analysis; Research paper preparation; Open-source release of adaptive engine |

---

## 12. References

Abedi, J., & Lord, C. (2001). The language factor in mathematics tests. *Applied Measurement in Education*, 14(3), 219–234.

Chapelle, O., & Li, L. (2011). An empirical evaluation of Thompson Sampling. *Advances in Neural Information Processing Systems*, 24.

Cobbe, K., et al. (2021). Training verifiers to solve math word problems. *arXiv:2110.14168*.

Corbett, A. T., & Anderson, J. R. (1994). Knowledge tracing: Modeling the acquisition of procedural knowledge. *User Modeling and User-Adapted Interaction*, 4(4), 253–278.

Dettmers, T., et al. (2023). QLoRA: Efficient finetuning of quantized language models. *NeurIPS 2023*.

Hendrycks, D., et al. (2021). Measuring mathematical problem solving with the MATH dataset. *NeurIPS 2021*.

Huang, M., et al. (2024). Easy2Hard-Bench: Standardized difficulty labels for profiling LLM performance and generalization. *arXiv:2409.18433*.

Khisty, L. L., & Chval, K. B. (2002). Pedagogic discourse and equity in mathematics. *Mathematics Education Research Journal*, 14(3), 154–168.

Martiniello, M. (2008). Language and the performance of English-language learners in math word problems. *Harvard Educational Review*, 78(2), 333–368.

Microsoft. (2025). Phi-4 technical report. *arXiv:2412.08905*.

Moschkovich, J. (2002). A situated and sociocultural perspective on bilingual mathematics learners. *Mathematical Thinking and Learning*, 4(2–3), 189–212.

National Center for Education Statistics. (2023). English learners in public schools. *NCES Report*.

Pelánek, R. (2016). Applications of the Elo rating system in adaptive educational systems. *Computers & Education*, 98, 169–179.

Piech, C., et al. (2015). Deep knowledge tracing. *Advances in Neural Information Processing Systems*, 28.

Qwen Team. (2024). Qwen2.5 technical report. *arXiv:2412.15115*.

Scarlatos, A., et al. (2024). PSI-KT: Parameterized student interaction knowledge tracing. *EDM 2024*.

Shi, F., et al. (2022). Language models are multilingual chain-of-thought reasoners. *arXiv:2210.03057*.

Vygotsky, L. S. (1978). *Mind in Society: The Development of Higher Psychological Processes*. Harvard University Press.

---

## 13. Appendices

### Appendix A: Complete LDS Computation Example

**Scenario**: Student attempts a Level 2.3 problem (rate/ratio), uses L1 hint, then L2 hint, solves correctly on second attempt.

**Raw signals**:
- $h_i = 2$ (max hint = L2)
- $t_{pre} = 45s$ (45 seconds before first hint)
- $t_{total} = 120s$
- $t_{L1} = 15s$, $t_{L2} = 25s$, $t_{L3} = 0s$, $t_{L4} = 0s$
- $a_i = 2$ (two answer attempts)
- $c_i = 1$ (correct on second attempt)
- $\text{median\_time}(2.3) = 90s$

**Sub-feature computation**:

1. $D_{hint} = 2 / 4 = 0.50$
2. $R_{scaffold} = (15 + 25 + 0 + 0) / 120 = 40 / 120 = 0.333$
3. $E_{speed} = 1 - (45 / 90) = 1 - 0.5 = 0.50$
4. $F_{reveal} = 0$ (L4 not used)

**LDS**:
$$\text{LDS} = 0.35 \times 0.50 + 0.25 \times 0.333 + 0.20 \times 0.50 + 0.20 \times 0 = 0.175 + 0.083 + 0.10 + 0 = 0.358$$

**Interpretation**: Moderate dependency; 0.358 sits at the lower boundary of the moderate range (0.35–0.55). The student needed bilingual support but was not critically dependent. The system would maintain the current level.
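The computation above, as a self-contained function. A sketch: the weights and sub-features follow the LDS definition, the signal names are ours, and treating $E_{speed}$ as 0 when no hint was opened is our assumption:

```typescript
// LDS = 0.35*D_hint + 0.25*R_scaffold + 0.20*E_speed + 0.20*F_reveal
interface LDSSignals {
  maxHintLevel: number;         // h_i, 0-4
  timeBeforeFirstHintS: number; // t_pre
  totalTimeS: number;           // t_total
  scaffoldTimeS: number;        // total seconds spent reading hints
  medianTimeS: number;          // median solve time at this level
}

function computeLDS(s: LDSSignals): number {
  const dHint = s.maxHintLevel / 4;
  const rScaffold = s.scaffoldTimeS / s.totalTimeS;
  // Escalation speed only applies when a hint was actually opened
  // (assumption: no hints => E_speed = 0).
  const eSpeed = s.maxHintLevel === 0
    ? 0
    : Math.max(0, 1 - s.timeBeforeFirstHintS / s.medianTimeS);
  const fReveal = s.maxHintLevel >= 4 ? 1 : 0;
  return 0.35 * dHint + 0.25 * rScaffold + 0.20 * eSpeed + 0.20 * fReveal;
}
```

Plugging in the scenario's raw signals reproduces LDS ≈ 0.358.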

### Appendix B: Complete MCS Computation (Same Scenario)

**Sub-feature computation**:

5. $C_{correct} = 1$ (correct)
6. $S_{speed} = \text{clamp}(90 / 120, 0, 1) = 0.75$
7. $A_{efficiency} = 1 / 2 = 0.50$
8. $(1 - \text{LDS}) = 1 - 0.358 = 0.642$

**MCS**:
$$\text{MCS} = 0.30 \times 1.0 + 0.25 \times 0.75 + 0.20 \times 0.50 + 0.25 \times 0.642 = 0.30 + 0.1875 + 0.10 + 0.1605 = 0.748$$

**Interpretation**: Moderate-to-strong math confidence. Despite needing L2 scaffolding, the student demonstrated solid mathematical ability. Diagnostic: borderline between the "Thriving" and "Language Gap" quadrants (MCS 0.748 ≥ 0.6; LDS 0.358 sits just under the 0.4 threshold).
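The same computation as code, with the MCS weights (0.30/0.25/0.20/0.25) from Section 5. A sketch with illustrative names; `lds` is supplied externally (here, the Appendix A value):

```typescript
// MCS = 0.30*C_correct + 0.25*S_speed + 0.20*A_efficiency + 0.25*(1 - LDS)
function clamp(x: number, lo: number, hi: number): number {
  return Math.min(hi, Math.max(lo, x));
}

function computeMCS(
  correct: boolean,
  medianTimeS: number, // median solve time at this level
  totalTimeS: number,  // this student's total time on the problem
  attempts: number,    // answer attempts (>= 1)
  lds: number,         // Language Dependency Score for this interaction
): number {
  const cCorrect = correct ? 1 : 0;
  const sSpeed = clamp(medianTimeS / totalTimeS, 0, 1);
  const aEfficiency = 1 / attempts;
  return 0.30 * cCorrect + 0.25 * sSpeed + 0.20 * aEfficiency + 0.25 * (1 - lds);
}
```

With the scenario's values (correct, 90s median vs. 120s spent, 2 attempts, LDS 0.358), this returns MCS ≈ 0.748.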

### Appendix C: Elo Update Example (Same Scenario)

**Before**: $R_s = 1050$, $R_q = 1150$ (Level 2.3 question)

**Weighted outcome**: Correct with L2 → $O_s = 0.50$

**Expected outcome**: $E_s = 1 / (1 + 10^{(1150 - 1050) / 400}) = 1 / (1 + 10^{0.25}) = 1 / (1 + 1.778) = 0.360$

**Update** ($K_s = 32$): $R_s' = 1050 + 32 \times (0.50 - 0.360) = 1050 + 32 \times 0.14 = 1050 + 4.48 = 1054.5$

**Update question** ($K_q = 6$): $R_q' = 1150 + 6 \times (0.360 - 0.50) = 1150 + 6 \times (-0.14) = 1150 - 0.84 = 1149.2$

**Interpretation**: Student's Elo increased slightly (outperformed expectation even with L2 hint), question's Elo barely changed (stable calibration).

### Appendix D: BKT Update Example

**Before**: $P(L_n) = 0.45$ for topic "rates" (the student's topic)

**Outcome**: $O_s = 0.50$ (treated as correct, since ≥ 0.5)

**Slip adjustment**: $P(S)_{adj} = 0.10 \times (1 + 0.5 \times 0.50) = 0.10 \times 1.25 = 0.125$

**Posterior given correct**:
$$P(L_n | correct) = \frac{0.45 \times (1 - 0.125)}{0.45 \times (1 - 0.125) + 0.55 \times 0.25} = \frac{0.39375}{0.39375 + 0.1375} = \frac{0.39375}{0.53125} = 0.741$$

**After learning transition**:
$$P(L_n) = 0.741 + (1 - 0.741) \times 0.15 = 0.741 + 0.039 = 0.780$$

**Interpretation**: Topic mastery estimate jumped from 0.45 to 0.78, reflecting that a correct response (even with L2 scaffold, captured by adjusted slip) substantially increased our belief that the student knows "rates."
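The full update chains the slip adjustment, the Bayes posterior, and the learning transition. A sketch with the parameters implied by the numbers above (guess $P(G) = 0.25$, base slip $P(S) = 0.10$, transition $P(T) = 0.15$); names are illustrative:

```python
P_GUESS, P_SLIP_BASE, P_TRANSIT = 0.25, 0.10, 0.15

def bkt_update_correct(p_mastery, weighted_outcome):
    """BKT mastery update after a correct answer, with the slip probability
    inflated by scaffold reliance (captured via the weighted outcome)."""
    p_slip = P_SLIP_BASE * (1 + 0.5 * weighted_outcome)  # 0.125 here
    evidence_known = p_mastery * (1 - p_slip)
    evidence_guess = (1 - p_mastery) * P_GUESS
    posterior = evidence_known / (evidence_known + evidence_guess)
    return posterior + (1 - posterior) * P_TRANSIT  # learning transition

print(round(bkt_update_correct(0.45, 0.50), 3))  # 0.78
```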

### Appendix E: Thompson Sampling Selection Example

**Student Elo**: 1054.5 (from Appendix C update)
**Current Level**: 2.1

**ZPD Window**: [1.4, 2.4] (current - 2 to current + 3, in sub-level steps)

**Current priors** (after 12 interactions):

| Level | α | β | Elo |
|---|---|---|---|
| 1.4 | 8.2 | 2.1 | 975 |
| 1.5 | 7.5 | 3.0 | 1025 |
| 2.1 | 5.8 | 4.2 | 1075 |
| 2.2 | 3.1 | 3.5 | 1125 |
| 2.3 | 1.8 | 2.3 | 1175 |
| 2.4 | 1.2 | 1.5 | 1225 |

**Sampled** (one draw):

| Level | $\hat{\theta}_l$ | Proximity Bonus | Score |
|---|---|---|---|
| 1.4 | 0.82 | $\exp(-(975-1054.5)^2/20000) = 0.73$ | 0.598 |
| 1.5 | 0.68 | $\exp(-(1025-1054.5)^2/20000) = 0.96$ | **0.651** |
| 2.1 | 0.61 | $\exp(-(1075-1054.5)^2/20000) = 0.98$ | 0.597 |
| 2.2 | 0.55 | $\exp(-(1125-1054.5)^2/20000) = 0.78$ | 0.429 |
| 2.3 | 0.42 | $\exp(-(1175-1054.5)^2/20000) = 0.48$ | 0.203 |
| 2.4 | 0.38 | $\exp(-(1225-1054.5)^2/20000) = 0.23$ | 0.089 |

**Selected**: Level 1.5 (highest score = 0.651)

Note, however, that the decision orchestrator said MAINTAIN (the weighted outcome of 0.50 falls in the maintain range), while Thompson selected 1.5, below the current level 2.1. Because the decision is MAINTAIN rather than DECREASE, the override rule kicks in: the system serves a question from level 2.1 (the current level), not 1.5. Thompson's selection is overridden to respect the progression decision.

This demonstrates how the decision orchestrator provides guardrails against Thompson's stochastic exploration when the student's recent performance doesn't support a level change.
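The selection-plus-override logic can be sketched as follows. The Gaussian denominator 20000 comes from the proximity formula in the table; the function names and the shape of the priors dictionary are illustrative assumptions.

```python
import math
import random

def proximity(level_elo, student_elo, variance=20000):
    """Gaussian bonus favoring levels whose Elo is near the student's."""
    return math.exp(-((level_elo - student_elo) ** 2) / variance)

def select_level(priors, student_elo, current_level, decision, rng=random):
    """One Thompson draw per level, scaled by proximity; the orchestrator's
    MAINTAIN decision overrides any draw away from the current level."""
    scores = {
        level: rng.betavariate(alpha, beta) * proximity(elo, student_elo)
        for level, (alpha, beta, elo) in priors.items()
    }
    chosen = max(scores, key=scores.get)
    if decision == "MAINTAIN" and chosen != current_level:
        chosen = current_level  # guardrail against stochastic exploration
    return chosen

# (alpha, beta, level Elo) per ZPD level, from the priors table above
priors = {1.4: (8.2, 2.1, 975), 1.5: (7.5, 3.0, 1025), 2.1: (5.8, 4.2, 1075),
          2.2: (3.1, 3.5, 1125), 2.3: (1.8, 2.3, 1175), 2.4: (1.2, 1.5, 1225)}
level = select_level(priors, 1054.5, 2.1, "MAINTAIN")
print(level)  # 2.1 (MAINTAIN pins the current level regardless of the draw)
```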

### Appendix F: Glossary

| Term | Definition |
|---|---|
| **BKT** | Bayesian Knowledge Tracing — HMM-based model of student knowledge per topic |
| **COPPA** | Children's Online Privacy Protection Act — US law for minors' data |
| **DKT** | Deep Knowledge Tracing — RNN-based alternative to BKT |
| **ELL** | English Language Learner |
| **Elo** | Rating system in which two opponents' ratings are updated after each match based on expected versus actual outcome |
| **FERPA** | Family Educational Rights and Privacy Act — US law for student records |
| **FK** | Flesch-Kincaid readability metric |
| **HMM** | Hidden Markov Model |
| **IRT** | Item Response Theory — psychometric model of test items and examinees |
| **L1** | First language (Spanish, in MathLingua's context) |
| **L2** | Second language (English, in MathLingua's context) |
| **LDS** | Language Dependency Score — engineered feature measuring scaffold reliance |
| **MCS** | Math Confidence Score — engineered feature measuring mathematical ability |
| **QLoRA** | Quantized Low-Rank Adaptation — memory-efficient fine-tuning method |
| **SLM** | Small Language Model (< 4B parameters) |
| **Thompson Sampling** | Bayesian bandit algorithm for exploration/exploitation |
| **WIDA** | Standards framework for English language proficiency in U.S. schools |
| **ZPD** | Zone of Proximal Development — Vygotsky's learning theory concept |

---

*End of Technical Specification — MathLingua v1.0*