MindScan — How It Works (Team Reference)

1 · Button click → 2 · fetch() → 3 · Flask → 4–7 · predict.py → 8 · JSON → 9–10 · UI

🖱

Step 01 · index.html

User clicks "Run all 12 models"

onclick="runAnalysis()" — defined in index.html <script>

What happens

The button has an onclick attribute pointing to runAnalysis(). This function grabs whatever text is in the textarea, disables the button, shows a spinning animation, and starts the process. Nothing touches any model yet — this is purely UI setup.

<!-- Button in HTML -->
<button onclick="runAnalysis()">
  Run all 12 models
</button>

// Function in <script> at bottom of index.html
// (btn, spinner and btnTxt are assumed to be element references defined elsewhere in the script)
async function runAnalysis() {
  const text = document.getElementById('textInput').value.trim();
  if (!text) return;   // do nothing if textarea is empty

  btn.disabled = true;          // disable button while running
  spinner.style.display = 'block'; // show spinning circle
  btnTxt.textContent = 'Running 12 models...';
  // next: send to backend ↓
}

Important for teammates

The function is declared async so that it can await the fetch() call in Step 02. While it waits for the server, the browser does NOT freeze: the user can still scroll the page. async/await is simply cleaner syntax for working with Promises.
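For teammates who mostly work on the Python side, the same idea can be sketched with asyncio. This is only an analogy, not code from the project: fake_fetch, keep_ui_alive and the timings are made up for illustration. While run_analysis is suspended at its await, other work keeps running, which is what keeps the page responsive in the browser.

```python
import asyncio

async def fake_fetch():
    """Hypothetical stand-in for the network round trip to the server."""
    await asyncio.sleep(0.2)
    return {"ok": True}

async def run_analysis(log):
    log.append("request sent")
    result = await fake_fetch()      # suspended here, the event loop is free
    log.append("response received")
    return result

async def keep_ui_alive(log):
    """Stands in for scrolling, repainting and other page activity."""
    for i in range(3):
        log.append(f"ui tick {i}")
        await asyncio.sleep(0.01)

async def main():
    log = []
    result, _ = await asyncio.gather(run_analysis(log), keep_ui_alive(log))
    return log, result

log, result = asyncio.run(main())
print(log)
```

The three "ui tick" entries land between "request sent" and "response received", showing that waiting on the slow call did not block anything else.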

📡

Step 02 · index.html

HTTP request sent to Flask

fetch('/predict') — browser's built-in HTTP function

What happens

The browser sends an HTTP POST request to the Flask server at /predict. The text is sent as JSON in the request body. The browser then waits for the response; the spinner from Step 01 stays visible during this wait (roughly 2–4 seconds).

// Still inside runAnalysis() in index.html
const r = await fetch('/predict', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ text })
  // sends: { "text": "I feel hopeless..." }
});

const d = await r.json();   // parse the JSON response
render(d, text);              // draw results on screen

Why /predict and not a full URL?

Because the frontend and backend are served from the same origin (localhost:5000): Flask serves both the HTML page and the API endpoint. A relative URL like /predict is resolved against the host that served the page.
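The resolution rule can be checked in Python with the standard library's urljoin, which applies the same relative-URL logic. This is a small illustration only; the second host name is a made-up example of what would happen after deploying somewhere else.

```python
from urllib.parse import urljoin

def resolve(page_url, relative="/predict"):
    """Resolve a relative URL against the page that issued the request."""
    return urljoin(page_url, relative)

# Local development: the page is served by Flask on localhost:5000
print(resolve("http://localhost:5000/"))        # http://localhost:5000/predict

# Hypothetical deployment: the same frontend code needs no change
print(resolve("https://example.org/mindscan"))  # https://example.org/predict
```

Because the path starts with a slash, only the scheme, host and port are taken from the page URL, so the frontend never needs to know where it is hosted.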

🐍

Step 03 · app.py

Flask receives the POST request

app.py → @app.route('/predict') → predict()

What happens

Flask matches the incoming request to the @app.route('/predict') decorator. The predict() function extracts the text from the request body, validates it (not empty, not too long), then calls predict_all() from predict.py. It wraps the result with processing time and sends it back as JSON.

# app.py
import time

from flask import Flask, jsonify, request

from predict import predict_all

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
  # silent=True returns None instead of raising when the body is not valid JSON
  data = request.get_json(silent=True) or {}
  text = str(data.get('text', '')).strip()   # a missing "text" key counts as empty

  # validation
  if not text:
    return jsonify({'error': 'Text cannot be empty'}), 400
  if len(text) > 5000:
    return jsonify({'error': 'Too long'}), 400

  t0 = time.time()
  result = predict_all(text)          # ← the big function (next steps)
  result['processing_time_ms'] = round((time.time() - t0) * 1000)
  return jsonify(result)              # sends JSON back to browser

Models load at STARTUP not per request

The 12 models are loaded once when you run python app.py (takes ~30s). Every subsequent request reuses them from RAM. If the models were loaded per request, every click would take ~30s.

1–3 · Browser → Flask → 4 · clean_text() → 5 · classical → 6 · XLM-R → 7 · assemble

🧹

Step 04 · predict.py

Text cleaning

predict.py → clean_text(raw_text)

What happens

The raw text is cleaned with the same pipeline used in training. This is critical — if you trained on cleaned text, you must clean the same way at prediction time. The classical models (LR, SVM, XGBoost) use the cleaned version. XLM-RoBERTa uses the original raw text because its tokeniser handles formatting itself.

# predict.py
import re
from string import punctuation

def clean_text(text):
  text = str(text).lower()                          # UPPERCASE → lowercase
  text = re.sub(r'http\S+|www\S+', '', text)       # remove URLs
  text = re.sub(r'@\w+', '', text)                  # remove @mentions
  text = re.sub(r'#', '', text)                       # remove # (keep word)
  text = text.translate(str.maketrans('','',punctuation))  # remove ASCII punctuation (!.,? etc.)
  text = re.sub(r'\s+', ' ', text).strip()            # collapse spaces
  return text

# Example:
# IN:  "@user I've been SO depressed https://t.co #mentalhealth 😢"
# OUT: "ive been so depressed mentalhealth 😢"
# (emoji are not part of string.punctuation, so none of the steps shown here remove them)

Used by

Classical models (LR/SVM/XGB): use the cleaned version, because URLs, @mentions and punctuation only add noise to the TF-IDF features.

XLM-RoBERTa: uses the ORIGINAL raw_text, because the transformer's tokeniser handles it better.
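To see why the classical models need the same cleaning at prediction time as at training time, here is a small sketch. The training sentence, the test sentence and the set-based vocabulary are invented for illustration; clean_text repeats the steps listed above.

```python
import re
from string import punctuation

def clean_text(text):
    text = str(text).lower()
    text = re.sub(r'http\S+|www\S+', '', text)
    text = re.sub(r'@\w+', '', text)
    text = re.sub(r'#', '', text)
    text = text.translate(str.maketrans('', '', punctuation))
    return re.sub(r'\s+', ' ', text).strip()

# Vocabulary learned from CLEANED training text (toy example)
train_vocab = set(clean_text("I've been SO depressed lately!!").split())

raw = "SO Depressed... #lately"

# Skipping the cleaning step: no token matches the training vocabulary
raw_hits = [t for t in raw.split() if t in train_vocab]

# Same cleaning as in training: every token is recognised
clean_hits = [t for t in clean_text(raw).split() if t in train_vocab]

print(raw_hits)    # []
print(clean_hits)  # ['so', 'depressed', 'lately']
```

Without the shared pipeline the vectoriser would see only unknown tokens and hand the model an empty feature vector.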

1–4 · Browser → clean → 5 · predict_classical() → 6 · XLM-R → 7 · assemble

⚙️

Step 05 · predict.py

Classical model predictions

predict.py → predict_classical(text_clean, ds)

What happens — 3 steps inside this function

1. TF-IDF transform: Converts the cleaned text into a vector of 50,000 numbers using the same vectoriser fitted during training.

2. Model.predict: Each of the 3 classical models takes the vector and outputs a class index (e.g. 4 = "postpartum").

3. Confidence score: Different method per model — LR and XGBoost use predict_proba(), SVM uses decision_function() converted via softmax.
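The TF-IDF transform in step 1 can be sketched in plain Python. This is an illustration with a three-document toy corpus, using a common smoothed IDF formula and L2 normalisation; the settings of the project's fitted vectoriser are not stated in this document, so treat the formula as an assumption.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Learn the vocabulary and one IDF weight per term from training docs."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))          # document frequency per term
    vocab = sorted(df)
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    return vocab, idf

def transform(text, vocab, idf):
    """Map text to a fixed-length vector; words unseen in training are dropped."""
    counts = Counter(t for t in text.split() if t in idf)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw)) or 1.0   # avoid dividing by zero
    return [x / norm for x in raw]

train_docs = ["so depressed lately", "feeling fine today", "so tired today"]
vocab, idf = fit_idf(train_docs)
vec = transform("so so depressed unseenword", vocab, idf)

print(len(vocab), len(vec))  # 7 7
```

The vector length always equals the vocabulary size fixed at training time (7 here, 50,000 in the real vectoriser), regardless of how long the input text is.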

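# predict.py: called 3× (once per dataset)
import numpy as np

# assumed mapping to the display names that appear in the Step 08 JSON
DISPLAY_NAMES = {'logistic_regression': 'Logistic Regression',
                 'svm': 'SVM', 'xgboost': 'XGBoost'}

def predict_classical(text_clean, ds):
  tfidf = _models[f'tfidf_{ds}']
  le    = _models[f'le_{ds}']
  vec   = tfidf.transform([text_clean])   # text → 50K-dim vector
  results = {}

  for model_name in ['logistic_regression', 'svm', 'xgboost']:
    model    = _models[f'{model_name}_{ds}']
    pred_idx = int(model.predict(vec)[0])      # → e.g. 4
    label    = str(le.classes_[pred_idx])      # 4 → "postpartum"

    # LR / XGBoost: direct probability
    if hasattr(model, 'predict_proba'):
      conf = model.predict_proba(vec)[0][pred_idx]

    # SVM: no predict_proba → use softmax of decision scores
    elif hasattr(model, 'decision_function'):
      scores = np.atleast_1d(model.decision_function(vec)[0])
      if scores.size == 1:                     # binary task: only one score comes back
        scores = np.array([0.0, scores[0]])    # softmax([0, s]) equals sigmoid(s)
      e    = np.exp(scores - scores.max())
      conf = e[pred_idx] / e.sum()             # normalise to 0–1

    # plain float so that jsonify can serialise it
    results[DISPLAY_NAMES[model_name]] = {'label': label, 'confidence': float(conf)}

  return results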

Why SVM needs special treatment

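SVM (LinearSVC) finds a decision boundary but does not model probabilities: it only answers "which side of the line?" Converting decision_function scores with softmax gives a reasonable confidence proxy. It is not a true probability but works well enough for display. Keep in mind that this proxy is not on the same scale as the predict_proba values of LR and XGBoost, even though Step 07 compares them directly when picking the winner.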
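The softmax conversion can be tried out without any models. The sketch below uses only the standard library; the decision scores are invented, and the handling of a single binary score (treating it as the pair [0, s]) is one reasonable convention rather than something this document prescribes.

```python
import math

def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def svm_confidence(decision_scores, pred_idx):
    """Confidence proxy for the predicted class from SVM decision scores."""
    # A binary linear SVM yields one signed score instead of one per class.
    if isinstance(decision_scores, (int, float)):
        decision_scores = [0.0, float(decision_scores)]
    return softmax(decision_scores)[pred_idx]

# Multi-class: one score per class (made-up values), class 0 predicted
multi = [2.0, 0.5, -1.0]
print(round(svm_confidence(multi, 0), 3))

# Binary: a score of 0.0 sits exactly on the boundary
print(svm_confidence(0.0, 1))  # 0.5
```

Larger margins push the proxy towards 1.0, but how fast depends on the scale of the scores, which is why the value should not be read as a calibrated probability.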

1–5 · Browser → classical → 6 · predict_xlmr() → 7 · assemble

🤖

Step 06 · predict.py

XLM-RoBERTa prediction

predict.py → predict_xlmr(raw_text, xlmr_model, le, max_len)

What happens — 4 steps

1. Tokenise: The tokeniser splits text into sub-word pieces and converts them to integer IDs (e.g. "hopeless" might become [1234, 5678]).

2. Forward pass: The 278M parameter model processes the token IDs and produces raw logit scores for each class.

3. Softmax: Converts logits to proper probabilities that sum to 1.0.

4. All class probs: Returns every class probability, not just the winner — this feeds the 6-class breakdown bars in Dataset 1.

# predict.py — called 3× (once per dataset)
def predict_xlmr(raw_text, xlmr_model, le, max_len=128):
  inputs = tokenizer(
    raw_text,
    return_tensors='pt',         # PyTorch tensors
    max_length=max_len,            # 128 for tweets, 256 for Reddit
    truncation=True,
    padding='max_length'
  ).to(device)                    # send to GPU if available

  with torch.no_grad():            # no_grad saves memory (not training)
    logits = xlmr_model(**inputs).logits

  probs    = torch.softmax(logits, dim=1)[0]  # → [0.91, 0.04, 0.02, ...]
  pred_idx = int(probs.argmax())               # index of highest
  label    = le.classes_[pred_idx]

  all_probs = {le.classes_[i]: float(p) for i, p in enumerate(probs)}
  # all_probs = {"postpartum":0.913, "bipolar":0.041, ...}
  # only D1 uses this for the breakdown chart

  return {'label': label, 'confidence': float(probs[pred_idx]), 'all_probs': all_probs}

max_length differs per dataset

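D1 and D2 are tweets (avg 31 words ≈ 40 tokens) → max_length=128. D3 is Reddit posts (avg 200 words ≈ 260 tokens) → max_length=256, so an average-length post already sits at the limit and longer ones are truncated. Doubling the sequence length roughly doubles memory usage for D3, which is why batch_size was halved during training.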
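What truncation=True and padding='max_length' do to a sequence can be sketched with plain lists. This is a simplification: PAD_ID is a placeholder value, the token IDs are dummies, and a real tokeniser also reserves positions for special tokens.

```python
PAD_ID = 1  # placeholder pad id, for illustration only

def pad_or_truncate(token_ids, max_len, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    ids = token_ids[:max_len]                  # truncation: drop the tail
    mask = [1] * len(ids)                      # 1 = real token
    n_pad = max_len - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad   # 0 = padding

# Tweet-sized input: about 40 tokens, padded up to 128
tweet_ids, tweet_mask = pad_or_truncate(list(range(100, 140)), 128)
print(len(tweet_ids), sum(tweet_mask))   # 128 40

# Reddit-sized input: about 260 tokens, cut down to 256
post_ids, post_mask = pad_or_truncate(list(range(100, 360)), 256)
print(len(post_ids), sum(post_mask))     # 256 256
```

Every input ends up with the same length per dataset, so the model always processes 128 or 256 positions no matter how short the text is.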

1–6 · all models run → 7 · predict_all() → 8 · JSON

🔧

Step 07 · predict.py

predict_all() assembles everything

predict.py → predict_all(raw_text) — the main function

What happens

predict_all() is the orchestrator. It calls predict_classical() 3 times (once per dataset) and predict_xlmr() 3 times. Then it finds the winner per dataset (highest confidence), runs the suicide majority vote across D3's 4 models, and packages everything into a single JSON-ready dictionary.

# predict.py — the main function Flask calls
def predict_all(raw_text):
  clean = clean_text(raw_text)

  # Run all 4 models per dataset
  d1 = predict_classical(clean, 'd1')  # → {LR:{}, SVM:{}, XGB:{}}
  d1['XLM-RoBERTa'] = predict_xlmr(raw_text, xlmr1, le1, 128)
  # same for d2, d3...

  # Winner = model with highest confidence
  d1_winner = max(d1.items(), key=lambda x: x[1]['confidence'])
  # → ('XGBoost', {'label': 'postpartum', 'confidence': 0.999})

  # Suicide risk = majority vote across 4 D3 models
  suicide_count = sum(
    1 for r in d3.values()
    if 'suicide' in r['label'] and 'non' not in r['label']
  )
  risk_flag = suicide_count >= 3   # ≥3 of 4 models → HIGH RISK

  return {
    'dataset1': {'models': d1, 'winner_model': d1_winner[0], ...},
    'dataset2': {...},
    'dataset3': {...},
    'risk_flag': risk_flag,
    'suicide_votes': f'{suicide_count}/4 models flagged suicide risk'
  }

The majority vote threshold — why 3 of 4?

We chose 3/4 (75%) as the threshold for the high-risk alert. 2/4 (50%) would be too sensitive: a tied vote, with only two models flagging, would already trigger an alert. 4/4 (100%) would be too strict: if a single model misses it, there is no alert. 3/4 balances sensitivity against false alarms for a research prototype.
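The effect of the threshold is easy to try out in isolation. The sketch below reuses the label test from predict_all(); the label strings "suicide" and "non-suicide" are assumed for illustration, since the exact class names of Dataset 3 are not listed here.

```python
def count_suicide_votes(labels):
    """Count labels that mention 'suicide' but are not the negative class."""
    return sum(1 for label in labels
               if 'suicide' in label and 'non' not in label)

def risk_flag(labels, threshold=3):
    """True when at least `threshold` of the models flag suicide risk."""
    return count_suicide_votes(labels) >= threshold

# Tied vote: two of four models flag the text
tied = ['suicide', 'non-suicide', 'suicide', 'non-suicide']
print(risk_flag(tied, threshold=2))   # True  (2/4 rule would alert)
print(risk_flag(tied, threshold=3))   # False (3/4 rule stays quiet)

# Three of four models agree
strong = ['suicide', 'suicide', 'suicide', 'non-suicide']
print(risk_flag(strong, threshold=3))  # True
print(risk_flag(strong, threshold=4))  # False (4/4 rule misses it)
```

Only the 3/4 rule behaves as intended on both examples: it ignores the tie but still fires when one model disagrees.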

1–7 · All predictions done → 8 · JSON response → 9–10 · UI renders

📦

Step 08 · app.py → browser

JSON sent back to browser

app.py → jsonify(result) → HTTP 200 response

What the browser receives

Flask wraps the predict_all() result in a JSON HTTP response. The browser's fetch() receives this and parses it. The structure below is exactly what flows into the render() function next.

{
  "dataset1": {
    "task": "Depression Type (6 Classes)",
    "models": {
      "Logistic Regression": { "label": "postpartum", "confidence": 0.958 },
      "SVM":                  { "label": "postpartum", "confidence": 0.828 },
      "XGBoost":              { "label": "postpartum", "confidence": 0.999 },
      "XLM-RoBERTa":         { "label": "postpartum", "confidence": 0.997 }
    },
    "winner_model":      "XGBoost",
    "winner_prediction": "postpartum",
    "winner_confidence": 0.999,
    "class_probs": { "postpartum": 0.997, "bipolar": 0.001, ... }
  },
  "dataset2": { ... },
  "dataset3": { ... },
  "risk_flag": false,
  "suicide_votes": "0/4 models flagged suicide risk",
  "processing_time_ms": 2341
}
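A quick way to get familiar with this structure is to parse a trimmed copy in Python and recompute the winner. The sample below keeps only the fields of dataset1 that are spelled out above; the elided parts (dataset2, dataset3, class_probs) are left out rather than guessed.

```python
import json

sample = """
{
  "dataset1": {
    "task": "Depression Type (6 Classes)",
    "models": {
      "Logistic Regression": { "label": "postpartum", "confidence": 0.958 },
      "SVM":                 { "label": "postpartum", "confidence": 0.828 },
      "XGBoost":             { "label": "postpartum", "confidence": 0.999 },
      "XLM-RoBERTa":         { "label": "postpartum", "confidence": 0.997 }
    },
    "winner_model": "XGBoost",
    "winner_prediction": "postpartum",
    "winner_confidence": 0.999
  },
  "risk_flag": false,
  "processing_time_ms": 2341
}
"""

data = json.loads(sample)
d1 = data["dataset1"]

# Recompute the winner the same way predict_all() does: highest confidence
name, res = max(d1["models"].items(), key=lambda kv: kv[1]["confidence"])

print(name, res["label"], res["confidence"])  # XGBoost postpartum 0.999
```

The recomputed winner matches winner_model, winner_prediction and winner_confidence, which is a handy consistency check when debugging the backend.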

1–8 · JSON received → 9 · render() + buildPanel() → 10 · CSS animation

🎨

Step 09 · index.html

render() draws the results

index.html → render(d, text) → buildPanel() × 3

What happens

render() fills in the three winner cards (depression type, depressed?, suicide risk) and then calls buildPanel() three times — once per dataset — to build the model comparison rows. Each row shows the model name, its prediction, a confidence bar, and a ★ if it's the winner.

// index.html — called after fetch() returns
function render(d, text) {

  // 1. Fill winner cards
  document.getElementById('wpA').textContent = d.dataset1.winner_prediction;
  document.getElementById('wcA').textContent = (d.dataset1.winner_confidence * 100).toFixed(1) + '%';

  // 2. Build per-model rows for each dataset
  buildPanel('p1', d.dataset1.models, d.dataset1.winner_model);
  buildPanel('p2', d.dataset2.models, d.dataset2.winner_model);
  buildPanel('p3', d.dataset3.models, d.dataset3.winner_model);

  // 3. Risk banner
  if (d.risk_flag) {
    riskBanner.className = 'risk-banner danger';
  } else {
    riskBanner.className = 'risk-banner safe';
  }

  // 4. Show results section
  document.getElementById('results').style.display = 'block';
}

function buildPanel(panelId, models, winner) {
  const panel = document.getElementById(panelId);  // container element, e.g. 'p1'
  let html = '';
  Object.entries(models).forEach(([name, res]) => {
    html += `<div class="mr ${name===winner?'winner':''}">
      <div class="mr-name">${name}</div>
      <div class="mr-pred">${res.label}</div>
      <div class="mr-fill" data-w="${(res.confidence*100).toFixed(1)}"></div>
      <div class="mr-pct">${(res.confidence*100).toFixed(1)}%</div>
    </div>`;
  });
  panel.innerHTML = html;  // inject HTML
  // bars animate next step ↓
}

1–9 · HTML rows created → 10 · CSS animation

✨

Step 10 · index.html + CSS

Confidence bars animate

setTimeout(80ms) → style.width → CSS transition

What happens

The bars are created with width: 0%. An 80ms delay gives the browser time to paint the DOM first. Then JavaScript sets each bar's width from its data-w attribute (e.g. "82.8"). The CSS transition property smoothly animates from 0% → 82.8% over 0.8 seconds. That's the fill animation you see.

/* CSS — transition defined in <style> */
.mr-fill {
  width: 0%;                            /* starts invisible */
  transition: width 0.8s cubic-bezier(.4,0,.2,1);  /* smooth ease-out */
}
.mr.winner .mr-fill { background: var(--purple); }  /* winner = purple */

// JavaScript — in buildPanel()
setTimeout(() => {
  panel.querySelectorAll('.mr-fill').forEach(el => {
    el.style.width = el.getAttribute('data-w') + '%';
    // sets e.g. "82.8%" → CSS transition plays automatically
  });
}, 80);  // 80ms wait for DOM to paint first

// The 6-class breakdown bars work the same way
// but use 200ms delay and .cp-fill class

Why the 80ms delay?

If you set style.width immediately after setting innerHTML, the browser hasn't painted the elements yet. The transition has nothing to animate from, so the bars jump to their final width instantly with no animation. The 80ms delay gives the browser at least one render frame to establish the 0% starting state, so the transition has a clean start point.

Complete flow summary

Total round trip time: ~2–4 seconds (dominated by XLM-RoBERTa inference on CPU)

Files involved: index.html → app.py → predict.py → back to index.html

Models called: 12 total, (LR + SVM + XGBoost + XLM-R) × 3 datasets

Winner selection: highest confidence per dataset, pure Python max()

Risk flag: majority vote, ≥3 of 4 Dataset 3 models predict "suicide"