fyliu Claude Opus 4.6 committed on
Commit
2e50ccd
·
0 Parent(s):

Add flight booking website (Google Flights clone)

Browse files

Full-stack flight search app for computer-use agent testing:
- Backend: FastAPI with deterministic pricing engine, route finder
(direct + 1-stop + 2-stop via hub detection), timezone-aware
flight generation, airport autocomplete, calendar pricing
- Frontend: React + TypeScript + Tailwind CSS with search form,
autocomplete, results page, filters, sorting, URL-driven state
- Docker: Multi-stage build (Node frontend → Python backend)
- Data: 3,770 airports, 55,627 routes, 604 airlines from
airline_routes.json loaded in-memory (~21 MB)
- All elements have data-testid attributes for agent testing
- Same search params always produce same results (SHA-256 seeded)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

This view is limited to 50 files because it contains too many changes. See raw diff
Files changed (50) hide show
  1. .dockerignore +10 -0
  2. .gitattributes +1 -0
  3. .gitignore +8 -0
  4. CLAUDE.md +518 -0
  5. Dockerfile +26 -0
  6. JOURNAL.md +0 -0
  7. airline_routes.json +3 -0
  8. backend/__init__.py +0 -0
  9. backend/api/__init__.py +0 -0
  10. backend/api/airports.py +47 -0
  11. backend/api/calendar.py +70 -0
  12. backend/api/search.py +144 -0
  13. backend/config.py +93 -0
  14. backend/data_loader.py +164 -0
  15. backend/flight_generator.py +270 -0
  16. backend/hub_detector.py +52 -0
  17. backend/main.py +59 -0
  18. backend/models.py +145 -0
  19. backend/price_engine.py +113 -0
  20. backend/requirements.txt +3 -0
  21. backend/route_finder.py +141 -0
  22. backend/seed_utils.py +20 -0
  23. description.md +1122 -0
  24. docker-compose.yml +6 -0
  25. frontend/eslint.config.js +23 -0
  26. frontend/index.html +13 -0
  27. frontend/package-lock.json +0 -0
  28. frontend/package.json +33 -0
  29. frontend/public/vite.svg +1 -0
  30. frontend/src/App.css +42 -0
  31. frontend/src/App.tsx +16 -0
  32. frontend/src/api/client.ts +39 -0
  33. frontend/src/api/types.ts +90 -0
  34. frontend/src/assets/react.svg +1 -0
  35. frontend/src/components/results/FilterPanel.tsx +133 -0
  36. frontend/src/components/results/FlightCard.tsx +112 -0
  37. frontend/src/components/results/FlightSegment.tsx +42 -0
  38. frontend/src/components/results/NoResults.tsx +29 -0
  39. frontend/src/components/results/SortBar.tsx +40 -0
  40. frontend/src/components/search/AirportInput.tsx +107 -0
  41. frontend/src/components/search/ClassSelector.tsx +30 -0
  42. frontend/src/components/search/DatePicker.tsx +22 -0
  43. frontend/src/components/search/PassengerSelector.tsx +86 -0
  44. frontend/src/components/search/SearchForm.tsx +114 -0
  45. frontend/src/components/search/SwapButton.tsx +18 -0
  46. frontend/src/components/search/TripTypeSelector.tsx +33 -0
  47. frontend/src/components/shared/Header.tsx +22 -0
  48. frontend/src/components/shared/Loading.tsx +8 -0
  49. frontend/src/hooks/useDebounce.ts +12 -0
  50. frontend/src/hooks/useFlightSearch.ts +70 -0
.dockerignore ADDED
@@ -0,0 +1,10 @@
 
 
 
 
 
 
 
 
 
 
 
1
+ node_modules
2
+ frontend/node_modules
3
+ frontend/dist
4
+ .venv
5
+ .venv_*
6
+ .git
7
+ __pycache__
8
+ *.pyc
9
+ .uv_cache
10
+ .uv_pythons
.gitattributes ADDED
@@ -0,0 +1 @@
 
 
1
+ airline_routes.json filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,8 @@
 
 
 
 
 
 
 
 
 
1
+ node_modules/
2
+ frontend/dist/
3
+ .venv/
4
+ .venv_*/
5
+ __pycache__/
6
+ *.pyc
7
+ .uv_cache/
8
+ .uv_pythons/
CLAUDE.md ADDED
@@ -0,0 +1,518 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # CLAUDE.md — Rules for AI Assistants (ECMoE Project)
2
+
3
+ ## MANDATORY FIRST STEPS
4
+
5
+ **Before taking ANY action on a task, you MUST:**
6
+
7
+ 1. Tell the user you have read CLAUDE.md and how you'll follow the THREE RULES
8
+ 2. **Actually read these files** (not optional):
9
+ - **README.md** — Directory structure, setup, how to run experiments
10
+ - **JOURNAL.md** — Recent bugs, what's broken/fixed, latest results
11
+ - **description.md** — Detailed method descriptions, design choices, hyperparameters
12
+
13
+ **Do NOT skip this to "get to work faster."** Skipping causes you to use wrong directories, miss known issues, and waste time on already-solved problems.
14
+
15
+ ---
16
+
17
+ ## THE THREE RULES
18
+
19
+ ### 1. EDIT, NEVER REWRITE
20
+ - **ALWAYS edit existing code, NEVER rewrite from scratch**
21
+ - Find the exact file/function, make surgical changes with Edit tool
22
+ - If you're about to write 50+ lines of new code doing something similar to existing code, STOP
23
+ - Reuse existing classes: `Compressor`, `Decompressor`, `StaleDecompressor`, `train_compressor`, etc.
24
+
25
+ ### 2. VALIDATE DATA BEFORE PLOTTING
26
+ - Always load results from JSON files, never hardcode values
27
+ - If a number looks different than expected, investigate before proceeding
28
+ - Check `results/summary/all_results_summary.json` for the canonical results
29
+
30
+ ### 3. COMMIT AND DOCUMENT IMMEDIATELY
31
+ - `git commit` after every fix (no remote configured — push when available)
32
+ - Update `JOURNAL.md` right after committing
33
+ - Don't batch changes — commit as you go
34
+
35
+ ---
36
+
37
+ ## MINDSET: NO SHORTCUTS
38
+
39
+ - Academic rigor means doing things RIGHT, not just doing things FAST
40
+ - Be skeptical of your own first approach — question whether it could be better
41
+ - Don't simplify the requirement — solve the actual problem
42
+
43
+ ---
44
+
45
+ ## Communication
46
+
47
+ **When showing results or finishing tasks:**
48
+ - ALWAYS provide the **full absolute path** to any files created or modified
49
+ - Example: "View the result at: `/project/6004852/lfy/ECMoE/results/summary/ppl_vs_ratio_all.png`"
50
+
51
+ ---
52
+
53
+ ## Project-Specific Rules
54
+
55
+ ### Environment Setup (Compute Canada)
56
+
57
+ ```bash
58
+ # Modules MUST be loaded BEFORE activating venv
59
+ module load cuda/12.6 arrow/22.0.0
60
+ source .venv/bin/activate
61
+
62
+ # HuggingFace cache goes to persistent project dir (home quota is small)
63
+ export HF_HOME=/home/lfy/projects/rrg-bengioy-ad/lfy/ECMoE/.cache/huggingface
64
+ ```
65
+
66
+ ### Directory Structure
67
+
68
+ ```
69
+ src/ # Python source code
70
+ scripts/ # Bash wrappers for each experiment
71
+ results/ # ALL experiment outputs (gitignored)
72
+ 01_distribution/ # Task 1: distribution analysis
73
+ 02_quantization/ # Task 2: quantization baseline
74
+ 03_neural_compressor/ # Task 3: shared neural compressor
75
+ 03b_perlayer_compressor/ # Task 3b: per-layer neural compressor
76
+ 04a_stale_compressed/ # Task 4a: stale-conditioned (compressed stale)
77
+ 04b_stale_uncompressed/ # Task 4b: stale-conditioned (uncompressed stale)
78
+ 05a_e2e_perlayer/ # Task 5a: e2e per-layer compressor (no stale)
79
+ 05b_e2e_stale/ # Task 5b: e2e stale-conditioned compressor
80
+ 05c_e2e_baseline/ # Task 5c: baseline (no compression, same pipeline)
81
+ 05c_megatron_e2e_baseline/ # Task 5c: baseline (Megatron variant)
82
+ 06a_megatron_e2e_pretrained_perlayer/ # Task 6a: e2e with 3b init (Megatron)
83
+ 06b_megatron_e2e_pretrained_stale/ # Task 6b: e2e with 4b init (Megatron)
84
+ 07a_megatron_e2e_split_perlayer/ # Task 7a: split-mode e2e (router=original)
85
+ 07b_megatron_e2e_split_stale/ # Task 7b: split-mode e2e + stale
86
+ 08_ep_compression/ # Task 8: EP compression eval (uses 7a/7b weights)
87
+ summary/ # Cross-method comparison plots and tables
88
+ data/hidden_states/ # Cached MoE hidden states (gitignored, ~37 GB in bfloat16)
89
+ ```
90
+
91
+ ### Key Code Architecture
92
+
93
+ - **`src/model_utils.py`** — Central library: model loading, MoE detection, hidden state
94
+ collection, ALL perplexity evaluation functions (baseline, shared, per-layer, stale)
95
+ - **`src/metrics.py`** — Reconstruction metrics: MSE, cosine similarity, relative error, SNR
96
+ - **`src/run_neural_compressor.py`** — Defines `Compressor`, `Decompressor`, `train_compressor()`.
97
+ Other scripts import from here — never duplicate these classes
98
+ - **`src/run_stale_compressor.py`** — Defines `StaleDecompressor`, `train_stale_compressor()`
99
+ - **`src/run_e2e_compressor.py`** — End-to-end training of per-layer compressors via LM loss.
100
+ Defines `E2ECompressorManager`, `SFTDataset`. Uses Dolci-Instruct-SFT with SFT mode
101
+ (response-only training). `_tokenize_sft_sample()` in `model_utils.py` handles the
102
+ response-only label masking.
103
+ - **`src/vllm_ep_compression.py`** — EP-aware compress/decompress registration for vLLM.
104
+ Sets `_ecmoe_compress_fn` / `_ecmoe_decompress_fn` on FusedMoE instances via
105
+ `apply_model()`. Supports per-layer and stale-conditioned methods. Requires patched
106
+ vLLM (`.venv_vllm_exp`).
107
+ - **`src/run_ep_compression_eval.py`** — Task 8 entry point: evaluates EP compression
108
+ with actual dispatch/combine in vLLM. Two modes: `simulation` (single-GPU) and `ep`
109
+ (multi-GPU with `enable_expert_parallel=True`). Uses Task 7a/7b weights.
110
+ - **`src/visualize_all_results.py`** — Generates all cross-method comparison plots and tables
111
+ - **`src/downstream_eval.py`** — Shared utility for downstream task evaluation via lm-eval-harness.
112
+ Provides hook registration functions (`register_quantization_hooks`, `register_perlayer_hooks`,
113
+ `register_stale_hooks`, `register_e2e_hooks`), `run_lm_eval()` wrapper, and result saving.
114
+ Imported by each task script when `--downstream-tasks` is specified.
115
+ Also provides vLLM backend support via apply_model pattern: `create_vllm_backend()`,
116
+ `register_perlayer_hooks_vllm()`, `register_stale_hooks_vllm()`,
117
+ `register_quantization_hooks_vllm()`, `remove_hooks_vllm()`.
118
+ Split (router-uncompressed) mode: `register_perlayer_hooks_split()`,
119
+ `register_stale_hooks_split()` for HF, and `register_perlayer_hooks_split_vllm()`,
120
+ `register_stale_hooks_split_vllm()` for vLLM. In split mode, the router sees original
121
+ hidden states while experts see decompressed — more realistic EP simulation.
122
+ - **`src/run_all_downstream.py`** — Standalone downstream evaluator. Loads model once,
123
+ evaluates all methods sequentially. Supports `--backend hf/vllm` and
124
+ `--router-mode compressed/uncompressed`.
125
+
126
+ ### Known Issues / Gotchas
127
+
128
+ **Layer sorting:** Always use `sorted(keys, key=layer_index)` from `model_utils`. Lexicographic
129
+ sorting puts layer 10 before layer 2 (`model.layers.10` < `model.layers.2`).
130
+
131
+ **Dtype mismatch:** Dequantized tensors and neural compressor outputs must match the model's
132
+ activation dtype (bfloat16). Always cast: `.to(x.dtype).to(x.device)`.
133
+
134
+ **What went wrong (2026-02-11):** `absmax_dequantize` returned float32 but model expected
135
+ bfloat16, causing `RuntimeError` during perplexity eval. Fix: explicit `.to(scale.dtype)` cast.
136
+
137
+ **What went wrong (2026-02-11):** When asked to remove quantization for Tasks 1–4, the agent
138
+ implemented the change (default `load_in_4bit=False`, `device="auto"`) without the user having
139
+ specified this as a hyperparameter. The model loading precision (BF16 vs 4-bit NF4) is a key
140
+ experimental parameter — changing it retroactively means old results are no longer reproducible
141
+ with default settings. **Lesson:** Treat model loading precision as a hyperparameter. Do NOT
142
+ change defaults that affect reproducibility without explicit user instruction. When the user says
143
+ "remove quantization", ASK whether they want it as a new default or as a CLI override.
144
+
145
+ **Response-only hidden state collection:** `collect_hidden_states()` defaults to
146
+ `response_only=True` — only assistant-response tokens are captured (labels != -100).
147
+ This ensures offline compressor training (Tasks 2–4) trains on the same distribution
148
+ that PPL evaluation measures. Use `--no-response-only` in `run_distribution.py` for
149
+ legacy all-token collection. Metadata records `"response_only": true/false`.
150
+
151
+ **Legacy Megatron script deleted:** `src/run_megatron_e2e_compressor.py` was removed because
152
+ it used `PackedTokenDataset` + `labels=input_ids` (standard LM, not SFT response-only),
153
+ did not use `get_split_indices()`, and misreported effective batch size with DP > 1.
154
+ Always use `src/megatron_e2e/train.py` for Megatron-based training.
155
+
156
+ **Large data files:** Hidden states for 100K tokens are ~18.5 GB per file in bfloat16
157
+ (dispatch + gather = ~37 GB). These are gitignored. Never try to `git add` them.
158
+
159
+ **Model VRAM:** Model is loaded in full BF16 (~60 GB). Tasks 1–4 use single GPU
160
+ (`device="cuda:0"`) — the model fits on one H100 80 GB with headroom for inference.
161
+ Task 5 uses multi-GPU (`device_map="auto"`) because backprop needs extra VRAM.
162
+ 4-bit NF4 loading (~15 GB) is available via `--load-in-4bit` but is NOT the default.
163
+
164
+ **device="auto" vs tensor ops:** When `device="auto"` is used for model loading (Task 5),
165
+ `"auto"` is NOT a valid torch device for tensor operations. Scripts that do `.to(device)` or
166
+ `train_compressor(device=...)` must use `compute_device` (resolved to `"cuda:0"` when
167
+ `device="auto"`). Only `load_model_and_tokenizer()` accepts `"auto"` directly.
168
+ Tasks 1–4 default to `device="cuda:0"` so this is only relevant for Task 5.
169
+
170
+ **Hook device safety (2026-02-17):** With `device_map="auto"`, model layers may reside on
171
+ different GPUs. PPL evaluation hooks in `model_utils.py` now explicitly call `.to(x.device)`
172
+ on compressor/decompressor outputs before returning them to the model. This is a no-op when
173
+ compressor and layer are on the same device but prevents cross-device errors when they differ.
174
+
175
+ ### vLLM Environment (downstream evaluation)
176
+
177
+ **vLLM backend:** `src/downstream_eval.py` + `src/run_all_downstream.py` — vLLM 0.8.4+
178
+ for downstream task evaluation with compression hooks.
179
+
180
+ ```bash
181
+ # Separate venv from HF-based experiments — CUDA 12.6
182
+ module load cuda/12.6 arrow/22.0.0
183
+ source .venv_vllm/bin/activate
184
+ export HF_HOME=/home/lfy/projects/rrg-bengioy-ad/lfy/ECMoE/.cache/huggingface
185
+
186
+ # Setup (first time only):
187
+ bash scripts/vllm_setup_env.sh
188
+ ```
189
+
190
+ **Known issues / gotchas (vLLM):**
191
+ - **vLLM V1 engine (>= 0.15):** The model runs in a **separate subprocess** (EngineCore).
192
+ You CANNOT access the model directly from the main process. The old path
193
+ `llm_engine.model_executor.driver_worker.model_runner.model` does NOT work.
194
+ Instead, use `vllm.LLM.apply_model(func)` to send functions to the worker process.
195
+ Functions are serialized via cloudpickle — they must be self-contained (include their
196
+ own imports and class definitions). Requires `VLLM_ALLOW_INSECURE_SERIALIZATION=1`.
197
+ `create_vllm_backend()` sets this automatically.
198
+ - **enforce_eager=True required:** vLLM's CUDA graph capture prevents PyTorch hooks
199
+ from being called. Always use `enforce_eager=True` when registering compression hooks.
200
+ `create_vllm_backend()` sets this automatically.
201
+ - **Hook registration pattern:** All vLLM hook functions use the apply_model pattern:
202
+ `_vllm_register_perlayer()` returns a closure → `vllm_llm.apply_model(closure)`.
203
+ The closure runs inside the worker, loads weights, creates compressor modules,
204
+ and registers PyTorch pre-hooks. Cleanup via `_vllm_remove_hooks()` → `remove_hooks_vllm()`.
205
+ - **Layer name mapping:** vLLM may use different module paths than HF. `_map_layer_name()`
206
+ maps by numeric layer index, which is robust to naming differences.
207
+ - **Two router modes (--router-mode):**
208
+ - `compressed` (default): Pre-hook compress→decompress. Router AND experts see
209
+ decompressed. Conservative lower bound — same as the original PPL evaluation hooks.
210
+ - `uncompressed`: Split forward — router sees ORIGINAL input, experts see decompressed.
211
+ More realistic EP simulation where router runs on source GPU with original data.
212
+ Both modes work for HF and vLLM backends.
213
+ - **No multi-device placement:** The plan called for `compressor_device` (attention GPU)
214
+ vs `decompressor_devices` (expert GPUs) to simulate the actual communication topology.
215
+ Current implementation puts both compressor and decompressor on the same device. This
216
+ doesn't affect quality measurement (the math is device-independent) but doesn't
217
+ demonstrate the real communication pattern or measure cross-device overhead.
218
+ - **No shared expert handling:** Split mode omits `shared_expert` /
219
+ `shared_expert_gate` logic. Qwen3-30B-A3B doesn't use shared experts so this is
220
+ correct for the current model, but reduces generality.
221
+ - **No separate E2E hooks for vLLM:** E2E and offline weights have identical format.
222
+ `register_perlayer_hooks_vllm()` works for 3b + 5a + 6a weights.
223
+ `register_stale_hooks_vllm()` works for 4a/4b + 5b + 6b weights.
224
+ - **TP > 1 with vLLM:** When using tensor parallelism, each rank has a partial model.
225
+ Hook registration should still work (hooks are on the full module), but compressor
226
+ modules stay on one device. Tested with TP=1 by default.
227
+
228
+ **vLLM-specific directories:**
229
+ ```
230
+ .venv_vllm/ # Separate virtual environment (gitignored)
231
+ ```
232
+
233
+ ### vLLM EP Compression Environment (Task 8)
234
+
235
+ **EP compression:** `src/vllm_ep_compression.py` — Sets compress/decompress functions
236
+ on FusedMoE instances. Patched `forward_impl()` calls compress BEFORE dispatch and
237
+ decompress AFTER, achieving real communication reduction.
238
+
239
+ ```bash
240
+ # Separate venv with patched vLLM 0.15.1 — CUDA 12.6
241
+ module load cuda/12.6 arrow/22.0.0
242
+ source .venv_vllm_exp/bin/activate
243
+ export HF_HOME=/home/lfy/projects/rrg-bengioy-ad/lfy/ECMoE/.cache/huggingface
244
+
245
+ # Setup (first time only):
246
+ bash scripts/vllm_exp_setup_env.sh
247
+ ```
248
+
249
+ **Key differences from .venv_vllm:**
250
+ - vLLM 0.15.1 pinned (for patch compatibility)
251
+ - `FusedMoE.forward_impl()` patched with 3 insertion points (~12 lines)
252
+ - Uses `_ecmoe_compress_fn` / `_ecmoe_decompress_fn` attributes (not PyTorch hooks)
253
+ - Supports `enable_expert_parallel=True` for actual EP dispatch
254
+
255
+ **Known issues / gotchas (EP compression):**
256
+ - **allgather_reducescatter backend:** vLLM's default `all2all_backend`. After dispatch,
257
+ every rank has ALL tokens. Stale cache approach works because token ordering is
258
+ consistent across layers.
259
+ - **Router unaffected:** `router_logits` are computed at `Qwen3MoeSparseMoeBlock.forward()`
260
+ BEFORE `FusedMoE.forward_impl()`, so compression never affects routing decisions.
261
+ - **Stale piggybacking:** Reference layers concatenate `cat(compressed, stale)` before
262
+ dispatch. After dispatch, decompress_fn splits and caches stale globally. Non-reference
263
+ layers dispatch only compressed (max compression), retrieve cached stale for decompression.
264
+
265
+ **vLLM EP compression directories:**
266
+ ```
267
+ .venv_vllm_exp/ # Patched vLLM environment (gitignored)
268
+ results/08_ep_compression/ # EP eval results
269
+ ```
270
+
271
+ ### Megatron-LM Environment (Task 5 Megatron variant)
272
+
273
+ **Megatron implementation:** `src/megatron_e2e/` package — EP-first, CUDA 12.9, Megatron Bridge.
274
+ (Legacy `src/run_megatron_e2e_compressor.py` was deleted due to SFT/split/batch bugs.)
275
+
276
+ ```bash
277
+ # Separate venv from HF-based experiments — CUDA 12.9 required
278
+ module load cuda/12.9 nccl arrow/22.0.0
279
+ source .venv_megatron/bin/activate
280
+ export HF_HOME=/home/lfy/projects/rrg-bengioy-ad/lfy/ECMoE/.cache/huggingface
281
+
282
+ # Setup (first time only):
283
+ bash scripts/megatron_setup_env.sh
284
+ ```
285
+
286
+ **Key differences from HF environment:**
287
+ - Uses `megatron-core` >=0.15.0 for model parallelism (EP, TP, DP, PP)
288
+ - Requires Transformer Engine (for Megatron Bridge and fused kernels)
289
+ - Uses `megatron-bridge` >=0.2.0 for HF→Megatron weight conversion
290
+ - Default parallelism: EP=4, TP=1, PP=1 (expert parallelism, not tensor)
291
+ - Launch via `torchrun`, not `python`
292
+
293
+ **Megatron-specific directories:**
294
+ ```
295
+ src/megatron_e2e/ # Package-based implementation (recommended)
296
+ .venv_megatron/ # Separate virtual environment (gitignored)
297
+ .uv_cache/ # uv cache on project disk (gitignored)
298
+ .uv_pythons/ # uv Python installs (gitignored)
299
+ third_party/ # Apex, etc. (gitignored, legacy only)
300
+ data/megatron_dolci/ # Preprocessed binary dataset (gitignored)
301
+ ```
302
+
303
+ **Known issues / gotchas (Megatron):**
304
+ - **CUDA version:** Megatron Bridge requires CUDA >= 12.8. Use `cuda/12.9` module
305
+ on Compute Canada, NOT `cuda/12.6`.
306
+ - **EP vs TP:** Default is EP=4 (expert parallelism). With EP, each GPU holds 32/128
307
+ experts per layer. TP=4 is the legacy approach and splits attention heads across GPUs.
308
+ - **Megatron layer names** differ from HF: `decoder.layers.N.mlp` vs `model.layers.N.mlp`.
309
+ `_megatron_to_hf_layer_name()` in `compressor_manager.py` handles conversion.
310
+ - Compressor weights are replicated across all ranks (not sharded), since they
311
+ are tiny (~200M total). Saved from rank 0 only.
312
+ - With EP>1, compressor is on source GPU (attention side), decompressor on
313
+ destination GPU (expert side) — different devices.
314
+ - `MegatronModelWrapper` bridges Megatron's forward interface to HF-style
315
+ `SimpleNamespace(loss=..., logits=...)`. Uses `vocab_parallel_cross_entropy`
316
+ for correct loss with TP > 1. SFT labels (-100) are clamped to 0 before
317
+ calling `vocab_parallel_cross_entropy`, and loss is masked via
318
+ `(per_token_loss * loss_mask).sum() / num_valid`.
319
+ - DistributedSampler must use DP rank/size (via `get_dp_info()`), NOT global
320
+ world size. All ranks in a TP group must see the SAME data.
321
+ - Saved weights use HF layer names (`model.layers.N.mlp`) for compatibility
322
+ with HF `E2ECompressorManager.load_weights()`.
323
+ - **Model loading:** `train.py` tries AutoBridge → MegatronBridge → manual fallback
324
+ for HF→Megatron conversion. If Bridge is not installed, falls back to manual
325
+ weight conversion using `load_megatron_qwen3()` from legacy code.
326
+ - **Train loss DP reduction (2026-02-17):** `train.py` now all-reduces step-level and
327
+ epoch-level train loss across DP ranks before logging. Previously, only rank 0's local
328
+ shard loss was logged, which was inaccurate with DP > 1. Wandb `train/loss` and
329
+ `train/epoch_loss` now reflect the true DP-averaged loss.
330
+
331
+ ### Running Experiments
332
+
333
+ Task 1 must run first (caches hidden states for Tasks 2–4). Task 5 is independent.
334
+ Tasks 1–4 use 1 GPU each; Task 5a/5b use 4 GPUs each.
335
+
336
+ **Data selection:** All tasks use seed=42 for reproducible 80/10/10 train/val/test
337
+ split of dataset rows. Tasks 1–4 draw from TRAIN split, PPL evaluation from TEST
338
+ split. No data leakage between splits.
339
+
340
+ **Task 5 config (HF):** batch_size=2, grad_accum=8 (effective=16), max_sequences=500K,
341
+ max_length=2048, val_interval=2500 steps, val_batch_size=8, SFT mode
342
+ (response-only training), wandb enabled by default.
343
+
344
+ **Task 5/6 config (Megatron):** Same as HF except max_sequences=100K,
345
+ val_interval=1000 steps. Task 6 uses same Megatron config with `--init-weights-dir`.
346
+
347
+ Tail micro-batches (when `len(dataloader) % grad_accum != 0`) are handled by rescaling
348
+ accumulated gradients and performing the optimizer step.
349
+
350
+ **Two evaluation stages:** Training-time val loss uses the VAL split (50K seqs,
351
+ batch_size=8, every 2500 steps) for checkpoint selection and wandb monitoring.
352
+ Final PPL evaluation uses the TEST split (50K seqs, batch_size=1, in
353
+ `model_utils.py`) for reported results. Different code paths — `--val-batch-size`
354
+ only affects training-time eval.
355
+
356
+ **SFT data loading:** All E2E training (Task 5) and perplexity evaluation now use
357
+ SFT mode: each sample is one conversation, tokenized independently. Labels are
358
+ -100 for non-assistant tokens (system, user, template markup) and actual token
359
+ IDs for assistant responses. Loss and perplexity are computed on response tokens
360
+ only. Data is loaded by sampling N sequences from the dataset (not packing tokens).
361
+ `_tokenize_sft_sample()` in `model_utils.py` handles the tokenization.
362
+
363
+ ```bash
364
+ # Phase 1: Megatron 5a + 5b in parallel (8 GPUs)
365
+ CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/05_megatron_e2e.sh none &
366
+ CUDA_VISIBLE_DEVICES=4,5,6,7 bash scripts/05_megatron_e2e.sh uncompressed &
367
+ wait
368
+
369
+ # Phase 2: Task 1 (re-cache with seed=42)
370
+ CUDA_VISIBLE_DEVICES=0 bash scripts/01_analyze_distribution.sh
371
+
372
+ # Phase 3: Tasks 2-4 + HF 5a (parallel)
373
+ CUDA_VISIBLE_DEVICES=0 bash scripts/02_run_quantization.sh &
374
+ CUDA_VISIBLE_DEVICES=1 bash scripts/03_run_neural_compressor.sh &
375
+ CUDA_VISIBLE_DEVICES=2 bash scripts/03b_run_perlayer_compressor.sh &
376
+ CUDA_VISIBLE_DEVICES=3 bash scripts/04_run_stale_compressor.sh compressed &
377
+ CUDA_VISIBLE_DEVICES=4,5,6,7 bash scripts/05_run_e2e_compressor.sh none &
378
+ wait
379
+
380
+ # Phase 4: Task 4b + HF 5b (parallel)
381
+ CUDA_VISIBLE_DEVICES=0 bash scripts/04_run_stale_compressor.sh uncompressed &
382
+ CUDA_VISIBLE_DEVICES=4,5,6,7 bash scripts/05_run_e2e_compressor.sh uncompressed &
383
+ wait
384
+
385
+ # Megatron-based E2E training (alternative to HF Task 5):
386
+ CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/05_megatron_e2e.sh none # 5a
387
+ CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/05_megatron_e2e.sh uncompressed # 5b
388
+
389
+ # Task 5c: Baseline evaluation (no compression, same pipeline):
390
+ CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/05_run_e2e_compressor.sh baseline # HF
391
+ CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/05_megatron_e2e.sh baseline # Megatron
392
+
393
+ # Task 6a/6b: E2E with pretrained init (requires Task 3b/4b weights):
394
+ CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/06_megatron_e2e_pretrained.sh none & # 6a (init from 3b)
395
+ CUDA_VISIBLE_DEVICES=4,5,6,7 bash scripts/06_megatron_e2e_pretrained.sh uncompressed & # 6b (init from 4b)
396
+ wait
397
+
398
+ # Task 7a/7b: Split-mode E2E (router sees original, experts see decompressed):
399
+ CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/07_megatron_e2e_split.sh none & # 7a (init from 3b)
400
+ CUDA_VISIBLE_DEVICES=4,5,6,7 bash scripts/07_megatron_e2e_split.sh uncompressed & # 7b (init from 4b)
401
+ wait
402
+ ```
403
+
404
+ ### Downstream Task Evaluation (lm-eval-harness)
405
+
406
+ Downstream eval is triggered by setting `DOWNSTREAM_TASKS` before running any script.
407
+ It runs **after** the existing PPL evaluation step, using `lm-eval-harness` with the
408
+ same compression hooks active. Results saved to `downstream_results.json` in each
409
+ task's output directory.
410
+
411
+ ```bash
412
+ # Run Task 2 + PPL eval + downstream eval:
413
+ DOWNSTREAM_TASKS="gsm8k_cot" bash scripts/02_run_quantization.sh
414
+
415
+ # Run Task 5a + PPL eval + downstream eval:
416
+ DOWNSTREAM_TASKS="gsm8k_cot" bash scripts/05_run_e2e_compressor.sh none
417
+
418
+ # Eval-only mode + downstream:
419
+ DOWNSTREAM_TASKS="gsm8k_cot" python src/run_e2e_compressor.py \
420
+ --skip-training --output-dir results/05a_e2e_perlayer --stale-mode none
421
+
422
+ # Smoke test with 10 examples:
423
+ DOWNSTREAM_TASKS="gsm8k_cot" DOWNSTREAM_LIMIT=10 bash scripts/05_run_e2e_compressor.sh none
424
+ ```
425
+
426
+ **Key code:** `src/downstream_eval.py` provides `register_*_hooks()` for each method,
427
+ `run_lm_eval()` wrapper, and `save_downstream_results()`. Each task script imports from
428
+ it when `--downstream-tasks` is specified. GSM8K variant: `gsm8k_cot` (8-shot CoT).
429
+
430
+ **vLLM backend:** Use `--backend vllm` (or `DOWNSTREAM_BACKEND=vllm`) for vLLM-based
431
+ downstream evaluation. Two router modes (`--router-mode compressed/uncompressed`):
432
+
433
+ ```bash
434
+ # Standalone vLLM eval (all methods, default router=compressed):
435
+ source .venv_vllm/bin/activate
436
+ python src/run_all_downstream.py --backend vllm --tasks gsm8k_cot
437
+
438
+ # Router-uncompressed mode (split: router sees original, experts see decompressed):
439
+ python src/run_all_downstream.py --backend vllm --router-mode uncompressed --method e2e_perlayer --tasks gsm8k_cot
440
+
441
+ # With tensor parallelism:
442
+ python src/run_all_downstream.py --backend vllm --tensor-parallel-size 4 --tasks gsm8k_cot
443
+
444
+ # Via task scripts (HF model, vLLM downstream):
445
+ DOWNSTREAM_TASKS="gsm8k_cot" DOWNSTREAM_BACKEND=vllm bash scripts/05_run_e2e_compressor.sh none
446
+ ```
447
+
448
+ ### Visualization
449
+
450
+ Regenerate all summary plots and tables:
451
+ ```bash
452
+ source .venv/bin/activate
453
+ python src/visualize_all_results.py
454
+ ```
455
+
456
+ Outputs to `results/summary/`:
457
+ - `ppl_vs_ratio_all.png` — PPL vs compression ratio (log-log)
458
+ - `reconstruction_vs_ratio_all.png` — MSE and CosSim vs ratio
459
+ - `ppl_bar_practical.png` — Bar chart at 2x and 4x
460
+ - `all_results_summary.json` — Machine-readable summary
461
+ - `param_count_table.{csv,md,json}` — Parameter counts for all methods
462
+
463
+ ---
464
+
465
+ ## Code Changes
466
+
467
+ **Before changing any code:**
468
+ 1. FIND the exact file that produces the current output
469
+ 2. READ and understand it
470
+ 3. EDIT only the specific lines needed (use Edit tool)
471
+ 4. TEST that output matches except for your intended change
472
+
473
+ **Adding new compression methods:**
474
+ - Reuse `Compressor`, `Decompressor` from `run_neural_compressor.py`
475
+ - Reuse `train_compressor()` for standard autoencoder training
476
+ - Add new perplexity evaluation functions to `model_utils.py`
477
+ - Follow the same JSON output format as existing experiments
478
+ - Update `visualize_all_results.py` to include the new method
479
+
480
+ ---
481
+
482
+ ## NEVER GUESS SILENTLY
483
+
484
+ **When you encounter ambiguity:**
485
+ 1. **STOP** — Do not make an arbitrary choice
486
+ 2. **ASK** — Present the options to the user
487
+ 3. **FLAG** — Note the documentation gap
488
+ 4. **FIX** — Update README.md or CLAUDE.md
489
+
490
+ ---
491
+
492
+ ## Version Control
493
+
494
+ - Commit after EVERY fix (don't wait)
495
+ - Check `git status` and file sizes before committing (no files >100MB)
496
+ - Update JOURNAL.md immediately after committing
497
+ - No git remote is currently configured — commits are local only
498
+
499
+ ---
500
+
501
+ ## Investigation
502
+
503
+ **When something seems wrong:**
504
+ 1. STOP — don't patch the visible symptom
505
+ 2. ASK WHY — trace back to data generation
506
+ 3. VERIFY — test hypotheses with minimal examples
507
+ 4. FIX ROOT — fix the source, not downstream
508
+
509
+ ---
510
+
511
+ ## Meta-Rule: Continuous Improvement
512
+
513
+ **When a preventable issue occurs:**
514
+ 1. Identify the root cause
515
+ 2. Add a "What went wrong" example to this file
516
+ 3. Commit the improvement
517
+
518
+ This file should evolve based on lessons learned.
Dockerfile ADDED
@@ -0,0 +1,26 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
# Stage 1: Build frontend
FROM node:22-alpine AS frontend-build
WORKDIR /app/frontend
# Copy manifests first so the npm-install layer is cached across source edits
COPY frontend/package.json frontend/package-lock.json ./
RUN npm ci
COPY frontend/ ./
RUN npm run build

# Stage 2: Run backend + serve frontend
FROM python:3.12-slim
WORKDIR /app

# Install Python deps
COPY backend/requirements.txt ./backend/
RUN pip install --no-cache-dir -r backend/requirements.txt

# Copy backend
COPY backend/ ./backend/
# Route/airport dataset; loaded in-memory at startup by backend.data_loader
COPY airline_routes.json ./

# Copy built frontend
# backend.main serves frontend/dist as the SPA when the directory exists
COPY --from=frontend-build /app/frontend/dist ./frontend/dist

EXPOSE 8080

CMD ["uvicorn", "backend.main:app", "--host", "0.0.0.0", "--port", "8080"]
JOURNAL.md ADDED
The diff for this file is too large to render. See raw diff
 
airline_routes.json ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7e8d97548f626230927e3e38b8ba9712710612ef90292e9a0696698d20b3bac3
3
+ size 21798276
backend/__init__.py ADDED
File without changes
backend/api/__init__.py ADDED
File without changes
backend/api/airports.py ADDED
@@ -0,0 +1,47 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Airport autocomplete and info endpoints."""
2
+
3
+ from fastapi import APIRouter, HTTPException, Query
4
+
5
+ from ..data_loader import get_route_graph
6
+ from ..models import AirportInfo, AutocompleteResult
7
+
8
+ router = APIRouter(prefix="/api/airports", tags=["airports"])
9
+
10
+
11
@router.get("/autocomplete", response_model=list[AutocompleteResult])
async def autocomplete(q: str = Query(..., min_length=1, max_length=50)):
    """Prefix-search airports; returns up to 10 matches ranked by the graph."""
    matches = get_route_graph().search_airports(q, limit=10)
    return [
        AutocompleteResult(
            iata=airport.iata,
            name=airport.name,
            city_name=airport.city_name,
            country=airport.country,
            display_name=airport.display_name,
            hub_score=airport.hub_score,
        )
        for airport in matches
    ]
26
+
27
+
28
@router.get("/{iata}", response_model=AirportInfo)
async def get_airport(iata: str):
    """Full details for one airport, looked up by case-insensitive IATA code."""
    graph = get_route_graph()
    code = iata.upper()
    airport = graph.airports.get(code)
    if airport is None:
        raise HTTPException(status_code=404, detail=f"Airport {code} not found")
    return AirportInfo(
        iata=airport.iata,
        name=airport.name,
        city_name=airport.city_name,
        country=airport.country,
        country_code=airport.country_code,
        continent=airport.continent,
        latitude=airport.latitude,
        longitude=airport.longitude,
        timezone=airport.timezone,
        hub_score=airport.hub_score,
        route_count=len(airport.routes),
    )
backend/api/calendar.py ADDED
@@ -0,0 +1,70 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Calendar pricing endpoint — cheapest price per day for a month."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import calendar
6
+ from datetime import date
7
+
8
+ from fastapi import APIRouter, HTTPException, Query
9
+
10
+ from ..data_loader import get_route_graph
11
+ from ..models import CalendarDay, CalendarResponse, CabinClass
12
+ from ..price_engine import compute_calendar_price
13
+ from ..seed_utils import seeded_random
14
+
15
+ router = APIRouter(prefix="/api", tags=["calendar"])
16
+
17
+
18
@router.get("/calendar", response_model=CalendarResponse)
async def get_calendar(
    origin: str = Query(..., min_length=3, max_length=3),
    destination: str = Query(..., min_length=3, max_length=3),
    year: int = Query(..., ge=2025, le=2028),
    month: int = Query(..., ge=1, le=12),
    cabin_class: CabinClass = Query(CabinClass.economy),
):
    """Cheapest deterministic fare for every day of the requested month."""
    graph = get_route_graph()
    origin = origin.upper()
    destination = destination.upper()

    if origin not in graph.airports:
        raise HTTPException(status_code=404, detail=f"Airport {origin} not found")
    if destination not in graph.airports:
        raise HTTPException(status_code=404, detail=f"Airport {destination} not found")

    route = graph.get_direct_route(origin, destination)
    if route is not None:
        distance = route.distance_km
        num_carriers = len(route.carriers)
    else:
        # No direct route: estimate a distance so pricing is still possible
        from ..route_finder import _estimate_distance
        distance = _estimate_distance(graph, origin, destination)
        if distance is None:
            raise HTTPException(status_code=404, detail="No route found")
        num_carriers = 2  # default estimate

    dest_airport = graph.airports[destination]
    _, num_days = calendar.monthrange(year, month)

    days = []
    for day_of_month in range(1, num_days + 1):
        target = date(year, month, day_of_month)
        # Per-day seed keeps calendar prices stable across requests
        rng = seeded_random(origin, destination, target.isoformat(), "calendar")
        days.append(CalendarDay(
            date=target,
            cheapest_price=compute_calendar_price(
                distance_km=distance,
                cabin_class=cabin_class.value,
                target_date=target,
                num_carriers=num_carriers,
                dest_continent=dest_airport.continent,
                rng=rng,
            ),
        ))

    return CalendarResponse(
        origin=origin,
        destination=destination,
        year=year,
        month=month,
        days=days,
    )
backend/api/search.py ADDED
@@ -0,0 +1,144 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Flight search endpoint."""
2
+
3
+ from __future__ import annotations
4
+
5
+ from fastapi import APIRouter, HTTPException
6
+
7
+ from ..config import MAX_RESULTS
8
+ from ..data_loader import get_route_graph
9
+ from ..flight_generator import generate_flights_for_route
10
+ from ..hub_detector import compute_hub_scores
11
+ from ..models import FlightOffer, SearchRequest, SearchResponse, SortBy
12
+ from ..route_finder import find_routes
13
+ from ..seed_utils import make_seed
14
+
15
+ router = APIRouter(prefix="/api", tags=["search"])
16
+
17
# Module-level hub cache
_hub_iatas: list[str] | None = None


def _get_hubs() -> list[str]:
    """Compute the top-hub IATA list once per process and memoize it."""
    global _hub_iatas
    if _hub_iatas is None:
        _hub_iatas = compute_hub_scores(get_route_graph())
    return _hub_iatas
27
+
28
+
29
+ def _apply_filters(flights: list[FlightOffer], req: SearchRequest) -> list[FlightOffer]:
30
+ f = req.filters
31
+ result = flights
32
+
33
+ if f.max_stops is not None:
34
+ result = [fl for fl in result if fl.stops <= f.max_stops]
35
+
36
+ if f.max_price is not None:
37
+ result = [fl for fl in result if fl.price_usd <= f.max_price]
38
+
39
+ if f.max_duration_minutes is not None:
40
+ result = [fl for fl in result if fl.total_duration_minutes <= f.max_duration_minutes]
41
+
42
+ if f.airlines:
43
+ airline_set = set(f.airlines)
44
+ result = [
45
+ fl for fl in result
46
+ if any(seg.airline_code in airline_set for seg in fl.segments)
47
+ ]
48
+
49
+ if f.departure_time_min:
50
+ h, m = map(int, f.departure_time_min.split(":"))
51
+ min_minutes = h * 60 + m
52
+ result = [
53
+ fl for fl in result
54
+ if fl.departure.hour * 60 + fl.departure.minute >= min_minutes
55
+ ]
56
+
57
+ if f.departure_time_max:
58
+ h, m = map(int, f.departure_time_max.split(":"))
59
+ max_minutes = h * 60 + m
60
+ result = [
61
+ fl for fl in result
62
+ if fl.departure.hour * 60 + fl.departure.minute <= max_minutes
63
+ ]
64
+
65
+ return result
66
+
67
+
68
def _sort_flights(flights: list[FlightOffer], sort_by: SortBy) -> list[FlightOffer]:
    """Return a new list ordered per the requested sort mode."""
    if sort_by == SortBy.cheapest:
        return sorted(flights, key=lambda offer: offer.price_usd)
    if sort_by == SortBy.fastest:
        return sorted(flights, key=lambda offer: offer.total_duration_minutes)

    # "best": weighted blend of normalized price (60%) and duration (40%)
    if not flights:
        return flights
    price_ceiling = max(offer.price_usd for offer in flights) or 1
    duration_ceiling = max(offer.total_duration_minutes for offer in flights) or 1

    def _score(offer) -> float:
        return (
            (offer.price_usd / price_ceiling) * 0.6
            + (offer.total_duration_minutes / duration_ceiling) * 0.4
        )

    return sorted(flights, key=_score)
82
+
83
+
84
@router.post("/search", response_model=SearchResponse)
async def search_flights(req: SearchRequest):
    """Search flights for the requested legs.

    Pipeline per leg: find candidate routings (direct + via hubs), expand each
    routing into concrete offers, then filter, sort, and cap at MAX_RESULTS.
    Round trips run the identical pipeline for the second leg — previously the
    pipeline was duplicated inline; it is now a single helper.

    Raises 400 when no legs are supplied, 404 for unknown airports.
    """
    graph = get_route_graph()
    hub_iatas = _get_hubs()

    if not req.legs:
        raise HTTPException(status_code=400, detail="At least one leg required")

    # Validate airports before doing any generation work
    for leg in req.legs:
        if leg.origin.upper() not in graph.airports:
            raise HTTPException(status_code=404, detail=f"Airport {leg.origin} not found")
        if leg.destination.upper() not in graph.airports:
            raise HTTPException(status_code=404, detail=f"Airport {leg.destination} not found")

    def _offers_for_leg(leg) -> list[FlightOffer]:
        # Shared route→generate→filter→sort→cap pipeline for one leg.
        leg_origin = leg.origin.upper()
        leg_dest = leg.destination.upper()
        plans = find_routes(graph, leg_origin, leg_dest, hub_iatas, max_stops=req.filters.max_stops)
        offers: list[FlightOffer] = []
        for plan in plans:
            offers.extend(
                generate_flights_for_route(graph, plan, leg.date, req.cabin_class, hub_iatas)
            )
        offers = _apply_filters(offers, req)
        offers = _sort_flights(offers, req.sort_by)
        return offers[:MAX_RESULTS]

    outbound_leg = req.legs[0]
    origin = outbound_leg.origin.upper()
    destination = outbound_leg.destination.upper()
    outbound_flights = _offers_for_leg(outbound_leg)

    # Generate return flights if round trip
    return_flights: list[FlightOffer] = []
    if req.trip_type.value == "round_trip" and len(req.legs) >= 2:
        return_flights = _offers_for_leg(req.legs[1])

    # Deterministic id derived from the outbound leg's search parameters
    search_id = str(make_seed(origin, destination, outbound_leg.date.isoformat()))

    return SearchResponse(
        outbound_flights=outbound_flights,
        return_flights=return_flights,
        search_id=search_id,
        origin=origin,
        destination=destination,
    )
backend/config.py ADDED
@@ -0,0 +1,93 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
"""Pricing constants and configuration.

All values feed the deterministic price engine and the flight/route
generators; changing any constant changes every generated fare.
"""

# Base price formula
BASE_FIXED_USD = 40
BASE_PER_KM_USD = 0.08

# Cabin class multipliers
CLASS_MULTIPLIERS = {
    "economy": 1.0,
    "premium_economy": 1.6,
    "business": 3.2,
    "first": 5.5,
}

# Day-of-week multipliers (0=Monday, 6=Sunday)
DAY_MULTIPLIERS = {
    0: 0.90,  # Monday
    1: 0.90,  # Tuesday
    2: 0.90,  # Wednesday
    3: 1.00,  # Thursday
    4: 1.15,  # Friday
    5: 1.05,  # Saturday
    6: 1.10,  # Sunday
}

# Season multipliers by month
SEASON_MULTIPLIERS = {
    1: 0.85,  # January - off season
    2: 0.85,  # February - off season
    3: 0.95,  # March
    4: 1.00,  # April
    5: 1.05,  # May
    6: 1.15,  # June - summer peak
    7: 1.20,  # July - summer peak
    8: 1.15,  # August - summer peak
    9: 0.90,  # September - off season
    10: 0.95,  # October
    11: 1.00,  # November
    12: 1.40,  # December - Christmas
}

# Season bonus for EU destinations in summer
EU_SUMMER_BONUS = 0.15  # +15% on top of summer multiplier
EU_CONTINENTS = {"EU"}
EU_SUMMER_MONTHS = {6, 7, 8}

# Demand multipliers
MONOPOLY_ROUTE_BONUS = 0.20  # +20% if only 1 carrier
HIGH_COMPETITION_DISCOUNT = 0.05  # -5% if 4+ carriers

# Advance booking multipliers (days before departure)
# NOTE(review): the tail is non-monotonic — 61-90 days out (-10%) is cheaper
# than 91+ days out (-5%). Confirm this is intentional.
ADVANCE_MULTIPLIERS = [
    (3, 1.50),  # 0-3 days: +50%
    (7, 1.35),  # 4-7 days: +35%
    (14, 1.20),  # 8-14 days: +20%
    (21, 1.10),  # 15-21 days: +10%
    (60, 1.00),  # 22-60 days: base
    (90, 0.90),  # 61-90 days: -10%
    (float("inf"), 0.95),  # 91+ days: -5%
]

# Jitter range (±8%)
JITTER_RANGE = 0.08

# Hub detection thresholds
HUB_MIN_ROUTES = 100
HUB_TOP_N = 125

# Connecting flight constraints
MAX_1STOP_DISTANCE_RATIO = 1.8  # Max total distance vs great-circle
MAX_2STOP_DISTANCE_RATIO = 2.5
MIN_LAYOVER_MINUTES = 60  # 1 hour
MAX_LAYOVER_MINUTES = 360  # 6 hours

# Flight generation
MIN_FLIGHTS_PER_DAY = 1
MAX_FLIGHTS_SINGLE_CARRIER = 3
MAX_FLIGHTS_MULTI_CARRIER = 15
DEPARTURE_HOUR_MIN = 5  # 05:00
DEPARTURE_HOUR_MAX = 23  # 23:00

# Aircraft types by distance: (max distance km, candidate fleet)
AIRCRAFT_BY_DISTANCE = [
    (500, ["E190", "E175", "CRJ-900"]),
    (2000, ["A320", "A321", "737-800", "737 MAX 8"]),
    (5000, ["A321neo LR", "757-200", "767-300ER"]),
    (10000, ["787-8", "787-9", "A330-300", "A350-900"]),
    (float("inf"), ["777-300ER", "A350-1000", "787-10", "A380"]),
]

# Search limits
MAX_RESULTS = 200
MAX_AUTOCOMPLETE_RESULTS = 10
backend/data_loader.py ADDED
@@ -0,0 +1,164 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Load airline_routes.json and build in-memory route graph + search index."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import json
6
+ import os
7
+ from dataclasses import dataclass, field
8
+
9
+
10
@dataclass
class Route:
    """One directed leg origin→destination as read from airline_routes.json."""

    destination: str  # destination IATA code
    distance_km: int  # leg distance in kilometres
    duration_min: int  # scheduled duration in minutes
    carriers: list[dict]  # [{"iata": "AA", "name": "American Airlines"}, ...]
16
+
17
+
18
@dataclass
class Airport:
    """Static airport record plus its outbound routes and computed hub score."""

    iata: str
    name: str
    city_name: str
    country: str
    country_code: str
    continent: str  # continent code, e.g. "EU"
    latitude: float
    longitude: float
    timezone: str  # IANA timezone name; defaults to "UTC" at load time
    elevation: int
    icao: str
    display_name: str
    routes: list[Route] = field(default_factory=list)  # outbound routes
    hub_score: float = 0.0  # set by hub_detector.compute_hub_scores (0-100)
34
+
35
+
36
class RouteGraph:
    """In-memory route graph and search index.

    Holds every airport keyed by IATA code, a nested origin→destination map
    for O(1) direct-route lookup, and a prefix index (every >=2-char prefix of
    every IATA/city/name/country token) for autocomplete.
    """

    def __init__(self) -> None:
        self.airports: dict[str, Airport] = {}
        # route_map[origin_iata][dest_iata] = Route
        self.route_map: dict[str, dict[str, Route]] = {}
        # Search index: lowercase tokens → set of IATA codes
        self._search_index: dict[str, set[str]] = {}

    def load(self, filepath: str) -> None:
        """Parse airline_routes.json and populate airports, route map, index."""
        with open(filepath) as f:
            data: dict = json.load(f)

        for iata, info in data.items():
            routes = []
            for r in info.get("routes", []):
                routes.append(Route(
                    destination=r["iata"],
                    distance_km=r["km"],
                    duration_min=r["min"],
                    carriers=r["carriers"],
                ))

            airport = Airport(
                iata=iata,
                name=info["name"],
                city_name=info["city_name"],
                country=info["country"],
                country_code=info["country_code"],
                continent=info["continent"],
                latitude=float(info["latitude"]) if info.get("latitude") is not None else 0.0,
                longitude=float(info["longitude"]) if info.get("longitude") is not None else 0.0,
                timezone=info.get("timezone", "UTC"),
                elevation=info.get("elevation", 0),
                icao=info.get("icao", ""),
                display_name=info.get("display_name", f"{info['city_name']} ({iata})"),
                routes=routes,
            )
            self.airports[iata] = airport

            # Build route map
            self.route_map.setdefault(iata, {})
            for route in routes:
                self.route_map[iata][route.destination] = route

            # Build search index
            self._index_airport(airport)

    def _index_airport(self, airport: Airport) -> None:
        """Add every >=2-char prefix of the airport's search tokens to the index."""
        tokens = set()
        # IATA code
        tokens.add(airport.iata.lower())
        # City name tokens
        for word in airport.city_name.lower().split():
            tokens.add(word)
        # Airport name tokens
        for word in airport.name.lower().split():
            tokens.add(word)
        # Country
        for word in airport.country.lower().split():
            tokens.add(word)
        # Country code
        tokens.add(airport.country_code.lower())

        for token in tokens:
            # Index exact token and all prefixes ≥ 2 chars
            for i in range(2, len(token) + 1):
                self._search_index.setdefault(token[:i], set()).add(airport.iata)

    def search_airports(self, query: str, limit: int = 10) -> list[Airport]:
        """Search airports by IATA code, city, name, or country.

        Ordering is deterministic: exact IATA match first, then matches ranked
        by hub score (descending) and city name.
        """
        q = query.strip().lower()
        if not q:
            return []

        # Exact IATA match first
        if len(q) == 3 and q.upper() in self.airports:
            exact = self.airports[q.upper()]
            # FIX: the previous implementation iterated the candidate *set*
            # directly, so the tail of the result list depended on hash order
            # (which varies across processes under string-hash randomization).
            # Sort like the general branch so results are reproducible.
            extras = sorted(
                (
                    self.airports[iata]
                    for iata in self._search_index.get(q, set())
                    if iata != exact.iata and iata in self.airports
                ),
                key=lambda a: (-a.hub_score, a.city_name),
            )
            return ([exact] + extras)[:limit]

        # Split query into tokens, intersect matches
        query_tokens = q.split()
        if not query_tokens:
            return []

        # Get candidates matching first token, then intersect with the rest
        candidates = self._search_index.get(query_tokens[0], set()).copy()
        for token in query_tokens[1:]:
            candidates &= self._search_index.get(token, set())

        if not candidates:
            return []

        # Sort by hub score (descending), then alphabetically by city
        airports = [self.airports[iata] for iata in candidates if iata in self.airports]
        airports.sort(key=lambda a: (-a.hub_score, a.city_name))
        return airports[:limit]

    def get_direct_route(self, origin: str, destination: str) -> Route | None:
        """Return the direct Route origin→destination, or None if absent."""
        return self.route_map.get(origin, {}).get(destination)

    def get_outbound_routes(self, origin: str) -> dict[str, Route]:
        """All routes departing `origin`, keyed by destination IATA."""
        return self.route_map.get(origin, {})
152
+
153
+
154
# Singleton
_graph: RouteGraph | None = None


def get_route_graph() -> RouteGraph:
    """Return the process-wide RouteGraph, loading the dataset on first call."""
    global _graph
    if _graph is None:
        _graph = RouteGraph()
        data_file = os.path.join(
            os.path.dirname(os.path.dirname(__file__)), "airline_routes.json"
        )
        _graph.load(data_file)
    return _graph
backend/flight_generator.py ADDED
@@ -0,0 +1,270 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Generate concrete flights for a route + date."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import random
6
+ from datetime import date, datetime, timedelta, timezone
7
+
8
+ from zoneinfo import ZoneInfo
9
+
10
+ from .config import (
11
+ AIRCRAFT_BY_DISTANCE,
12
+ DEPARTURE_HOUR_MAX,
13
+ DEPARTURE_HOUR_MIN,
14
+ MAX_FLIGHTS_MULTI_CARRIER,
15
+ MAX_FLIGHTS_SINGLE_CARRIER,
16
+ MAX_LAYOVER_MINUTES,
17
+ MIN_FLIGHTS_PER_DAY,
18
+ MIN_LAYOVER_MINUTES,
19
+ )
20
+ from .data_loader import Route, RouteGraph
21
+ from .models import CabinClass, FlightOffer, FlightSegment
22
+ from .price_engine import compute_price
23
+ from .route_finder import RoutePlan
24
+ from .seed_utils import seeded_random
25
+
26
+
27
def _pick_aircraft(distance_km: int, rng: random.Random) -> str:
    """Choose an aircraft type appropriate for the leg distance."""
    for max_distance, fleet in AIRCRAFT_BY_DISTANCE:
        if distance_km <= max_distance:
            return rng.choice(fleet)
    # Last tier's threshold is infinity, so this is effectively unreachable
    return "777-300ER"
32
+
33
+
34
+ def _make_flight_number(carrier_iata: str, rng: random.Random) -> str:
35
+ return f"{carrier_iata}{rng.randint(100, 9999)}"
36
+
37
+
38
+ def _get_timezone(graph: RouteGraph, iata: str) -> ZoneInfo:
39
+ airport = graph.airports.get(iata)
40
+ if airport and airport.timezone:
41
+ try:
42
+ return ZoneInfo(airport.timezone)
43
+ except KeyError:
44
+ pass
45
+ return ZoneInfo("UTC")
46
+
47
+
48
def generate_flights_for_route(
    graph: RouteGraph,
    route_plan: RoutePlan,
    departure_date: date,
    cabin_class: CabinClass,
    hub_iatas: list[str],
) -> list[FlightOffer]:
    """Generate concrete flight offers for a route plan on a given date.

    Seeds an RNG from the route + date + cabin so identical searches always
    produce identical offers, then dispatches on stop count.
    """
    waypoints = route_plan.waypoints
    seed_key = f"{waypoints[0]}-{waypoints[-1]}-{departure_date.isoformat()}-{cabin_class.value}"
    rng = seeded_random(seed_key, *waypoints)

    if route_plan.stops == 0:
        return _generate_direct_flights(graph, route_plan, departure_date, cabin_class, rng)
    return _generate_connecting_flights(graph, route_plan, departure_date, cabin_class, rng)
67
+
68
+
69
def _generate_direct_flights(
    graph: RouteGraph,
    route_plan: RoutePlan,
    departure_date: date,
    cabin_class: CabinClass,
    rng: random.Random,
) -> list[FlightOffer]:
    """Generate multiple direct flight options for a single-leg route.

    All randomness comes from `rng`, which the caller seeds from the search
    parameters, so the same route/date/cabin always yields the same offers.
    The rng call sequence is order-sensitive: reordering calls changes output.
    """
    leg = route_plan.legs[0]
    origin = route_plan.waypoints[0]
    destination = route_plan.waypoints[1]

    # Number of flights based on carrier count: more competition → more
    # departures per day.
    num_carriers = len(leg.carriers)
    if num_carriers == 1:
        num_flights = rng.randint(MIN_FLIGHTS_PER_DAY, MAX_FLIGHTS_SINGLE_CARRIER)
    elif num_carriers <= 3:
        num_flights = rng.randint(3, 8)
    else:
        num_flights = rng.randint(8, MAX_FLIGHTS_MULTI_CARRIER)

    # Generate departure times spread across the day.
    # NOTE(review): values are minutes after midnight despite the name.
    departure_hours = sorted([
        rng.randint(DEPARTURE_HOUR_MIN * 60, DEPARTURE_HOUR_MAX * 60)
        for _ in range(num_flights)
    ])

    origin_tz = _get_timezone(graph, origin)
    dest_tz = _get_timezone(graph, destination)
    origin_airport = graph.airports[origin]
    dest_airport = graph.airports[destination]

    flights = []
    for dep_minutes in departure_hours:
        carrier = rng.choice(leg.carriers)
        dep_hour = dep_minutes // 60
        dep_min = dep_minutes % 60

        # Departure is wall-clock time at the origin airport
        departure_dt = datetime(
            departure_date.year, departure_date.month, departure_date.day,
            dep_hour, dep_min,
            tzinfo=origin_tz,
        )

        # Calculate arrival, expressed in the destination's local timezone
        arrival_dt = departure_dt + timedelta(minutes=leg.duration_min)
        arrival_dt = arrival_dt.astimezone(dest_tz)

        price = compute_price(
            distance_km=leg.distance_km,
            cabin_class=cabin_class.value,
            departure_date=departure_date,
            departure_hour=dep_hour,
            num_carriers=num_carriers,
            dest_continent=dest_airport.continent,
            rng=rng,
        )

        # NOTE(review): two offers sampling the same minute with the same
        # carrier would share an id — confirm ids need not be unique.
        flight_id = f"{origin}{destination}{departure_date.isoformat()}{dep_minutes}{carrier['iata']}"

        segment = FlightSegment(
            airline_code=carrier["iata"],
            airline_name=carrier["name"],
            flight_number=_make_flight_number(carrier["iata"], rng),
            aircraft=_pick_aircraft(leg.distance_km, rng),
            origin=origin,
            origin_city=origin_airport.city_name,
            destination=destination,
            destination_city=dest_airport.city_name,
            departure=departure_dt,
            arrival=arrival_dt,
            duration_minutes=leg.duration_min,
        )

        flights.append(FlightOffer(
            id=flight_id,
            segments=[segment],
            total_duration_minutes=leg.duration_min,
            stops=0,
            price_usd=price,
            cabin_class=cabin_class,
            origin=origin,
            destination=destination,
            departure=departure_dt,
            arrival=arrival_dt,
        ))

    return flights
157
+
158
+
159
def _generate_connecting_flights(
    graph: RouteGraph,
    route_plan: RoutePlan,
    departure_date: date,
    cabin_class: CabinClass,
    rng: random.Random,
) -> list[FlightOffer]:
    """Generate connecting flight options (1-stop or 2-stop).

    Builds each itinerary leg-by-leg: random departure time, per-leg price
    (discounted 25% versus a standalone fare), random layovers between legs.
    Options whose itinerary drifts more than a day past the requested date
    are discarded. As with direct flights, the rng call order determines the
    output, so every option is deterministic per search parameters.
    """
    origin = route_plan.waypoints[0]
    destination = route_plan.waypoints[-1]
    dest_airport = graph.airports[destination]

    # Generate 2-5 options per connecting route
    num_options = rng.randint(2, 5)

    flights = []
    for option_idx in range(num_options):
        departure_minutes = rng.randint(DEPARTURE_HOUR_MIN * 60, DEPARTURE_HOUR_MAX * 60)

        segments = []
        # Running itinerary clock, starting at the chosen wall-clock time at origin
        current_time = datetime(
            departure_date.year, departure_date.month, departure_date.day,
            departure_minutes // 60, departure_minutes % 60,
            tzinfo=_get_timezone(graph, origin),
        )
        total_price = 0.0
        total_duration = 0

        valid = True
        for i, leg in enumerate(route_plan.legs):
            leg_origin = route_plan.waypoints[i]
            leg_dest = route_plan.waypoints[i + 1]
            origin_tz = _get_timezone(graph, leg_origin)
            dest_tz = _get_timezone(graph, leg_dest)
            origin_ap = graph.airports[leg_origin]
            dest_ap = graph.airports[leg_dest]

            carrier = rng.choice(leg.carriers)
            departure_dt = current_time.astimezone(origin_tz)
            arrival_dt = departure_dt + timedelta(minutes=leg.duration_min)
            arrival_dt = arrival_dt.astimezone(dest_tz)

            # Per-leg price
            leg_price = compute_price(
                distance_km=leg.distance_km,
                cabin_class=cabin_class.value,
                departure_date=departure_date,
                departure_hour=departure_dt.hour,
                num_carriers=len(leg.carriers),
                dest_continent=dest_ap.continent,
                rng=rng,
            )
            # Connecting flights get a discount
            leg_price *= 0.75
            total_price += leg_price

            segments.append(FlightSegment(
                airline_code=carrier["iata"],
                airline_name=carrier["name"],
                flight_number=_make_flight_number(carrier["iata"], rng),
                aircraft=_pick_aircraft(leg.distance_km, rng),
                origin=leg_origin,
                origin_city=origin_ap.city_name,
                destination=leg_dest,
                destination_city=dest_ap.city_name,
                departure=departure_dt,
                arrival=arrival_dt,
                duration_minutes=leg.duration_min,
            ))

            # Add layover time for next leg
            if i < len(route_plan.legs) - 1:
                layover = rng.randint(MIN_LAYOVER_MINUTES, MAX_LAYOVER_MINUTES)
                current_time = arrival_dt + timedelta(minutes=layover)
                total_duration += leg.duration_min + layover

                # Check if layover pushes to next day too far.
                # NOTE(review): midnight here is taken in the *current leg's*
                # origin timezone, not the itinerary origin's — confirm intended.
                if (current_time - datetime(
                    departure_date.year, departure_date.month, departure_date.day,
                    tzinfo=origin_tz
                )).days > 1:
                    valid = False
                    break
            else:
                total_duration += leg.duration_min

        if not valid:
            continue

        # Whole-dollar total for the combined itinerary
        total_price = round(total_price, 0)
        first_departure = segments[0].departure
        last_arrival = segments[-1].arrival

        flight_id = (
            f"{origin}{destination}{departure_date.isoformat()}"
            f"{departure_minutes}{'-'.join(route_plan.waypoints)}{option_idx}"
        )

        flights.append(FlightOffer(
            id=flight_id,
            segments=segments,
            total_duration_minutes=total_duration,
            stops=route_plan.stops,
            price_usd=total_price,
            cabin_class=cabin_class,
            origin=origin,
            destination=destination,
            departure=first_departure,
            arrival=last_arrival,
        ))

    return flights
backend/hub_detector.py ADDED
@@ -0,0 +1,52 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Compute hub scores for airports.
2
+
3
+ Hub score = route_count * carrier_diversity * continent_reach
4
+ Used for: connecting flight search (top hubs as waypoints) and autocomplete ranking.
5
+ """
6
+
7
+ from __future__ import annotations
8
+
9
+ from .config import HUB_MIN_ROUTES, HUB_TOP_N
10
+ from .data_loader import RouteGraph
11
+
12
+
13
def compute_hub_scores(graph: RouteGraph) -> list[str]:
    """Compute hub scores for all airports. Returns top hub IATA codes.

    Score = route_count * sqrt(carrier diversity) * continent_reach^0.3,
    normalized to 0-100. Mutates each Airport's hub_score in place.
    """
    for airport in graph.airports.values():
        num_routes = len(airport.routes)
        if num_routes < 5:
            airport.hub_score = 0.0
            continue

        # Carrier diversity: distinct operating carriers across all routes
        carrier_codes = {
            c["iata"] for route in airport.routes for c in route.carriers
        }
        # Continent reach: distinct continents among reachable destinations
        reachable_continents = {
            graph.airports[route.destination].continent
            for route in airport.routes
            if route.destination in graph.airports
        }
        airport.hub_score = (
            num_routes
            * (len(carrier_codes) ** 0.5)
            * (len(reachable_continents) ** 0.3)
        )

    # Normalize scores to 0-100
    top_score = max((a.hub_score for a in graph.airports.values()), default=1.0)
    if top_score > 0:
        for airport in graph.airports.values():
            airport.hub_score = round(airport.hub_score / top_score * 100, 2)

    # Return the best-scoring airports that meet the minimum route count
    ranked = sorted(
        (a for a in graph.airports.values() if len(a.routes) >= HUB_MIN_ROUTES),
        key=lambda a: -a.hub_score,
    )
    return [a.iata for a in ranked[:HUB_TOP_N]]
backend/main.py ADDED
@@ -0,0 +1,59 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """FastAPI application — flight search backend."""
2
+
3
+ import os
4
+ import time
5
+
6
+ from fastapi import FastAPI
7
+ from fastapi.middleware.cors import CORSMiddleware
8
+ from fastapi.staticfiles import StaticFiles
9
+ from fastapi.responses import FileResponse
10
+
11
+ from .api import airports, calendar, search
12
+ from .data_loader import get_route_graph
13
+ from .hub_detector import compute_hub_scores
14
+
15
app = FastAPI(title="Flight Search API", version="1.0.0")

# CORS for development
# NOTE(review): allow_origins=["*"] combined with allow_credentials=True is
# acceptable for a local test harness; tighten before any real deployment.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Register API routers (the SPA catch-all route is added later in this module)
app.include_router(airports.router)
app.include_router(search.router)
app.include_router(calendar.router)
30
+
31
+
32
@app.on_event("startup")
async def startup():
    """Load data and compute hub scores on startup."""
    started = time.time()
    graph = get_route_graph()
    hubs = compute_hub_scores(graph)
    elapsed = time.time() - started
    print(f"Loaded {len(graph.airports)} airports, {len(hubs)} hubs in {elapsed:.1f}s")
40
+
41
+
42
@app.get("/api/health")
async def health():
    """Liveness probe: reports how many airports are loaded."""
    return {"status": "ok", "airports": len(get_route_graph().airports)}
46
+
47
+
48
# Serve frontend static files (production)
STATIC_DIR = os.path.join(os.path.dirname(os.path.dirname(__file__)), "frontend", "dist")
if os.path.isdir(STATIC_DIR):
    app.mount("/assets", StaticFiles(directory=os.path.join(STATIC_DIR, "assets")), name="assets")

    @app.get("/{full_path:path}")
    async def serve_frontend(full_path: str):
        """Serve the React SPA; unknown paths fall back to index.html.

        SECURITY: the previous version joined the raw `:path` parameter onto
        STATIC_DIR, allowing ".." traversal to serve arbitrary files. Resolve
        the candidate and require it to stay inside the static root.
        """
        root = os.path.realpath(STATIC_DIR)
        candidate = os.path.realpath(os.path.join(root, full_path))
        if candidate.startswith(root + os.sep) and os.path.isfile(candidate):
            return FileResponse(candidate)
        return FileResponse(os.path.join(root, "index.html"))
backend/models.py ADDED
@@ -0,0 +1,145 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Pydantic models for API request/response contracts."""
2
+
3
+ from __future__ import annotations
4
+
5
+ from datetime import date, datetime
6
+ from enum import Enum
7
+ from typing import Optional
8
+
9
+ from pydantic import BaseModel, Field
10
+
11
+
12
class CabinClass(str, Enum):
    """Service cabin on a flight; values are the wire-format strings."""
    economy = "economy"
    premium_economy = "premium_economy"
    business = "business"
    first = "first"
17
+
18
+
19
class TripType(str, Enum):
    """Itinerary shape: one leg, out-and-back, or several independent legs."""
    one_way = "one_way"
    round_trip = "round_trip"
    multi_city = "multi_city"
23
+
24
+
25
class SortBy(str, Enum):
    """Result ordering requested by the client."""
    best = "best"
    cheapest = "cheapest"
    fastest = "fastest"
29
+
30
+
31
+ # --- Airport ---
32
+
33
class AirportInfo(BaseModel):
    """Static metadata for one airport, identified by its IATA code."""
    iata: str  # 3-letter IATA code
    name: str
    city_name: str
    country: str
    country_code: str
    continent: str
    latitude: float
    longitude: float
    timezone: str  # presumably an IANA timezone name — confirm against the data source
    hub_score: float = 0.0  # connectivity score; 0.0 until hub detection has run
    route_count: int = 0  # presumably the number of routes from this airport — confirm in loader
45
+
46
+
47
+ # --- Flight segment ---
48
+
49
class FlightSegment(BaseModel):
    """One nonstop leg operated by a single airline.

    Airport codes follow the same 3-letter IATA convention as SearchLeg.
    """
    airline_code: str
    airline_name: str
    flight_number: str
    aircraft: str
    origin: str
    origin_city: str
    destination: str
    destination_city: str
    departure: datetime
    arrival: datetime
    duration_minutes: int
61
+
62
+
63
+ # --- Flight offer (may have multiple segments) ---
64
+
65
class FlightOffer(BaseModel):
    """A bookable itinerary: one or more segments with a total price.

    The top-level origin/destination/departure/arrival fields denormalize the
    first and last segment for convenient display and sorting.
    """
    id: str
    segments: list[FlightSegment]
    total_duration_minutes: int
    stops: int  # layover count — presumably len(segments) - 1; confirm in the generator
    price_usd: float
    cabin_class: CabinClass
    origin: str
    destination: str
    departure: datetime
    arrival: datetime
76
+
77
+
78
+ # --- Search request ---
79
+
80
class SearchLeg(BaseModel):
    """One origin → destination → date leg of a search (multi-city has several)."""
    origin: str = Field(..., min_length=3, max_length=3, description="IATA code")
    destination: str = Field(..., min_length=3, max_length=3, description="IATA code")
    # Field name shadows datetime.date inside this class body; with the file's
    # postponed annotations pydantic still resolves the type from the module scope.
    date: date
84
+
85
+
86
class Passengers(BaseModel):
    """Passenger counts; validation caps at 9 adults/children and 4 infants."""
    adults: int = Field(1, ge=1, le=9)
    children: int = Field(0, ge=0, le=9)
    infants: int = Field(0, ge=0, le=4)

    @property
    def total(self) -> int:
        # Total travellers across all passenger types.
        return self.adults + self.children + self.infants
94
+
95
+
96
class Filters(BaseModel):
    """Optional result filters; None disables the corresponding filter."""
    max_stops: Optional[int] = None
    max_price: Optional[float] = None
    max_duration_minutes: Optional[int] = None
    airlines: Optional[list[str]] = None  # IATA codes to include
    departure_time_min: Optional[str] = None  # "HH:MM", e.g. "06:00"
    departure_time_max: Optional[str] = None  # "HH:MM", e.g. "18:00"
103
+
104
+
105
class SearchRequest(BaseModel):
    """Payload for a flight search (1-6 legs)."""
    trip_type: TripType = TripType.round_trip
    legs: list[SearchLeg] = Field(..., min_length=1, max_length=6)
    # NOTE(review): model-instance defaults rely on pydantic copying defaults
    # per instantiation (v2 behavior) — verify, or use Field(default_factory=...).
    passengers: Passengers = Passengers()
    cabin_class: CabinClass = CabinClass.economy
    filters: Filters = Filters()
    sort_by: SortBy = SortBy.best
112
+
113
+
114
class SearchResponse(BaseModel):
    """Search results; return_flights stays empty for one-way trips."""
    outbound_flights: list[FlightOffer]
    # NOTE(review): mutable default is safe here only because pydantic copies
    # field defaults per instance — verify.
    return_flights: list[FlightOffer] = []
    search_id: str
    origin: str
    destination: str
120
+
121
+
122
+ # --- Calendar ---
123
+
124
class CalendarDay(BaseModel):
    """One day in the price calendar; price is None when no price is available."""
    date: date
    cheapest_price: Optional[float] = None
127
+
128
+
129
class CalendarResponse(BaseModel):
    """One month of cheapest-price-per-day data for a route."""
    origin: str
    destination: str
    year: int
    month: int
    days: list[CalendarDay]
135
+
136
+
137
+ # --- Autocomplete ---
138
+
139
class AutocompleteResult(BaseModel):
    """One suggestion row for the airport autocomplete widget."""
    iata: str
    name: str
    city_name: str
    country: str
    display_name: str  # pre-formatted label shown in the dropdown
    hub_score: float = 0.0  # used for ranking suggestions; 0.0 when unscored
backend/price_engine.py ADDED
@@ -0,0 +1,113 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Price formula: base + 8 multipliers.
2
+
3
+ base_usd = 40 + (distance_km * 0.08)
4
+ final = base * class * day_of_week * time_of_day * season * demand * advance * jitter
5
+ """
6
+
7
+ from __future__ import annotations
8
+
9
+ import random
10
+ from datetime import date, datetime, timedelta
11
+
12
+ from .config import (
13
+ ADVANCE_MULTIPLIERS,
14
+ BASE_FIXED_USD,
15
+ BASE_PER_KM_USD,
16
+ CLASS_MULTIPLIERS,
17
+ DAY_MULTIPLIERS,
18
+ EU_CONTINENTS,
19
+ EU_SUMMER_BONUS,
20
+ EU_SUMMER_MONTHS,
21
+ HIGH_COMPETITION_DISCOUNT,
22
+ JITTER_RANGE,
23
+ MONOPOLY_ROUTE_BONUS,
24
+ SEASON_MULTIPLIERS,
25
+ )
26
+
27
+
28
def compute_price(
    distance_km: int,
    cabin_class: str,
    departure_date: date,
    departure_hour: int,
    num_carriers: int,
    dest_continent: str,
    rng: random.Random,
    booking_date: date | None = None,
) -> float:
    """Compute flight price using the full pricing formula.

    price = (BASE_FIXED_USD + distance_km * BASE_PER_KM_USD)
            * class * day-of-week * time-of-day * season * demand
            * advance-booking * jitter

    Args:
        distance_km: Route distance in kilometres.
        cabin_class: Key into CLASS_MULTIPLIERS; unknown classes fall back to 1.0.
        departure_date: Departure date (drives day-of-week and season factors).
        departure_hour: Departure hour 0-23 (peak / red-eye pricing).
        num_carriers: Carriers serving the route (monopoly premium vs.
            competition discount).
        dest_continent: Destination continent, used for the EU summer surcharge.
        rng: Seeded RNG so identical searches produce identical prices.
        booking_date: Purchase date; defaults to today. Controls the
            advance-booking multiplier.

    Returns:
        Price in USD, rounded to whole dollars, floored at $25.
    """
    base = BASE_FIXED_USD + (distance_km * BASE_PER_KM_USD)

    # 1. Cabin class
    class_mult = CLASS_MULTIPLIERS.get(cabin_class, 1.0)

    # 2. Day of week
    day_mult = DAY_MULTIPLIERS.get(departure_date.weekday(), 1.0)

    # 3. Time of day
    if 6 <= departure_hour <= 8:
        time_mult = 1.10  # Morning peak
    elif 16 <= departure_hour <= 19:
        time_mult = 1.15  # Evening peak
    elif departure_hour >= 22 or departure_hour <= 5:
        time_mult = 0.85  # Red-eye discount
    else:
        time_mult = 1.00

    # 4. Season
    season_mult = SEASON_MULTIPLIERS.get(departure_date.month, 1.0)
    # EU summer bonus
    if dest_continent in EU_CONTINENTS and departure_date.month in EU_SUMMER_MONTHS:
        season_mult += EU_SUMMER_BONUS

    # 5. Demand (based on competition)
    if num_carriers == 1:
        demand_mult = 1.0 + MONOPOLY_ROUTE_BONUS
    elif num_carriers >= 4:
        demand_mult = 1.0 - HIGH_COMPETITION_DISCOUNT
    else:
        demand_mult = 1.0

    # 6. Advance booking. Past departure dates are clamped to 0 days ahead.
    if booking_date is None:
        booking_date = date.today()
    days_advance = (departure_date - booking_date).days
    if days_advance < 0:
        days_advance = 0
    advance_mult = 1.0
    # First matching threshold wins — assumes ADVANCE_MULTIPLIERS is sorted
    # ascending by threshold; verify in config. Beyond the last threshold the
    # multiplier stays 1.0.
    for threshold, mult in ADVANCE_MULTIPLIERS:
        if days_advance <= threshold:
            advance_mult = mult
            break

    # 7. Jitter (seeded, so deterministic for a given search)
    jitter = 1.0 + rng.uniform(-JITTER_RANGE, JITTER_RANGE)

    price = base * class_mult * day_mult * time_mult * season_mult * demand_mult * advance_mult * jitter

    # Round to nearest dollar, minimum $25
    return max(25.0, round(price, 0))
90
+
91
+
92
def compute_calendar_price(
    distance_km: int,
    cabin_class: str,
    target_date: date,
    num_carriers: int,
    dest_continent: str,
    rng: random.Random,
) -> float:
    """Representative cheapest price for one date, as shown in the calendar view.

    Baseline assumptions: a noon departure, booked 14 days in advance.
    """
    # Fix the booking date relative to the target so calendar prices are
    # stable regardless of when the calendar is requested.
    assumed_booking = target_date - timedelta(days=14)
    return compute_price(
        distance_km=distance_km,
        cabin_class=cabin_class,
        departure_date=target_date,
        departure_hour=12,  # noon baseline
        num_carriers=num_carriers,
        dest_continent=dest_continent,
        rng=rng,
        booking_date=assumed_booking,
    )
backend/requirements.txt ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ fastapi>=0.110.0
2
+ uvicorn[standard]>=0.27.0
3
+ pydantic>=2.6.0
backend/route_finder.py ADDED
@@ -0,0 +1,141 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Direct + 1-stop + 2-stop route discovery."""
2
+
3
+ from __future__ import annotations
4
+
5
+ from dataclasses import dataclass
6
+
7
+ from .config import MAX_1STOP_DISTANCE_RATIO, MAX_2STOP_DISTANCE_RATIO
8
+ from .data_loader import Route, RouteGraph
9
+
10
+
11
@dataclass
class RoutePlan:
    """A planned route from origin to destination (may have multiple legs)."""
    legs: list[Route]  # Each leg has origin implicitly from position
    waypoints: list[str]  # [origin, hub1, ..., destination]
    total_distance_km: int
    total_duration_min: int  # includes the 90-min layover padding added by find_routes

    @property
    def stops(self) -> int:
        # 1 leg = nonstop (0 stops), 2 legs = 1 stop, etc.
        return len(self.legs) - 1
22
+
23
+
24
def find_routes(
    graph: RouteGraph,
    origin: str,
    destination: str,
    hub_iatas: list[str],
    max_stops: int | None = None,
) -> list[RoutePlan]:
    """Find all route plans from origin to destination.

    Returns direct, 1-stop, and 2-stop routes.

    Args:
        graph: Loaded route graph (airports + adjacency).
        origin: Departure airport IATA code.
        destination: Arrival airport IATA code.
        hub_iatas: Candidate connection airports — presumably ordered
            best-first, since the 2-stop search only considers the first 60.
        max_stops: Cap on connections; None means no cap (up to 2 stops).
    """
    results: list[RoutePlan] = []

    # A negative cap means "no routes allowed".
    if max_stops is not None and max_stops < 0:
        return results

    # Direct route
    direct = graph.get_direct_route(origin, destination)
    if direct:
        results.append(RoutePlan(
            legs=[direct],
            waypoints=[origin, destination],
            total_distance_km=direct.distance_km,
            total_duration_min=direct.duration_min,
        ))

    if max_stops is not None and max_stops == 0:
        return results

    # 1-stop routes through hubs. Connecting itineraries are pruned when their
    # total distance exceeds a configured multiple of the direct distance
    # (great-circle estimate when no direct route exists).
    direct_distance = direct.distance_km if direct else _estimate_distance(graph, origin, destination)
    if direct_distance is None:
        # Coordinates unknown for an endpoint — cannot prune detours, give up.
        return results

    origin_routes = graph.get_outbound_routes(origin)

    for hub in hub_iatas:
        if hub == origin or hub == destination:
            continue

        leg1 = origin_routes.get(hub)
        if not leg1:
            continue

        leg2 = graph.get_direct_route(hub, destination)
        if not leg2:
            continue

        total_dist = leg1.distance_km + leg2.distance_km
        if total_dist > direct_distance * MAX_1STOP_DISTANCE_RATIO:
            continue  # Too much of a detour

        total_dur = leg1.duration_min + leg2.duration_min + 90  # +90 min layover estimate
        results.append(RoutePlan(
            legs=[leg1, leg2],
            waypoints=[origin, hub, destination],
            total_distance_km=total_dist,
            total_duration_min=total_dur,
        ))

    if max_stops is not None and max_stops <= 1:
        return results

    # 2-stop routes through pairs of hubs (limit to top hubs for performance)
    top_hubs = hub_iatas[:60]
    for hub1 in top_hubs:
        if hub1 == origin or hub1 == destination:
            continue
        leg1 = origin_routes.get(hub1)
        if not leg1:
            continue

        hub1_routes = graph.get_outbound_routes(hub1)
        for hub2 in top_hubs:
            if hub2 == origin or hub2 == destination or hub2 == hub1:
                continue

            leg2 = hub1_routes.get(hub2)
            if not leg2:
                continue

            leg3 = graph.get_direct_route(hub2, destination)
            if not leg3:
                continue

            total_dist = leg1.distance_km + leg2.distance_km + leg3.distance_km
            if total_dist > direct_distance * MAX_2STOP_DISTANCE_RATIO:
                continue

            total_dur = (leg1.duration_min + leg2.duration_min + leg3.duration_min
                         + 90 + 90)  # Two layovers
            results.append(RoutePlan(
                legs=[leg1, leg2, leg3],
                waypoints=[origin, hub1, hub2, destination],
                total_distance_km=total_dist,
                total_duration_min=total_dur,
            ))

    return results
123
+
124
+
125
+ def _estimate_distance(graph: RouteGraph, origin: str, destination: str) -> int | None:
126
+ """Estimate great-circle distance between two airports using coordinates."""
127
+ import math
128
+
129
+ o = graph.airports.get(origin)
130
+ d = graph.airports.get(destination)
131
+ if not o or not d:
132
+ return None
133
+
134
+ lat1, lon1 = math.radians(o.latitude), math.radians(o.longitude)
135
+ lat2, lon2 = math.radians(d.latitude), math.radians(d.longitude)
136
+
137
+ dlat = lat2 - lat1
138
+ dlon = lon2 - lon1
139
+ a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
140
+ c = 2 * math.asin(math.sqrt(a))
141
+ return int(c * 6371) # Earth radius in km
backend/seed_utils.py ADDED
@@ -0,0 +1,20 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Deterministic seeding utilities.
2
+
3
+ Same search parameters always produce the same flights and prices.
4
+ Uses SHA-256 hash of search params → integer seed for random.Random.
5
+ """
6
+
7
+ import hashlib
8
+ import random
9
+
10
+
11
def make_seed(*parts: str | int | float) -> int:
    """Derive a reproducible 64-bit seed from arbitrary parts.

    The parts are stringified, joined with '|', and hashed with SHA-256;
    the first 16 hex digits (64 bits) of the digest become the seed.
    """
    joined = "|".join(map(str, parts))
    digest = hashlib.sha256(joined.encode()).hexdigest()
    return int(digest[:16], 16)
16
+
17
+
18
def seeded_random(*parts: str | int | float) -> random.Random:
    """Build a random.Random seeded deterministically from the given params."""
    seed = make_seed(*parts)
    return random.Random(seed)
description.md ADDED
@@ -0,0 +1,1122 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # ECMoE — Method and Experiment Description
2
+
3
+ ## 1. Problem Statement
4
+
5
+ In Mixture-of-Experts (MoE) models with expert parallelism, each token's hidden state must be communicated between GPUs twice per MoE layer:
6
+
7
+ 1. **Dispatch (all-to-all):** The hidden state is sent from the token's source GPU to the GPU hosting its assigned expert(s).
8
+ 2. **Gather (all-to-all):** The expert output is sent back to the source GPU.
9
+
10
+ For a model like Qwen3-30B-A3B with `hidden_dim=2048` and 48 MoE layers, each token requires transmitting `2 × 48 × 2048 × 2 bytes = 384 KB` of data per forward pass (in BF16). At scale, this communication dominates inference latency.
11
+
12
+ This project investigates methods to **compress these hidden-state vectors** before transmission, reducing communication volume while preserving model quality.
13
+
14
+ ### Training paradigms
15
+
16
+ This project uses two training paradigms:
17
+
18
+ **Offline (Tasks 2–4):** Compressors are trained on **cached hidden states**, not end-to-end through the LLM:
19
+
20
+ 1. **Capture:** Run the unmodified LLM on calibration data and cache MoE layer inputs/outputs to disk.
21
+ 2. **Train:** Train each compressor/decompressor pair independently on the cached data, minimizing a local reconstruction loss. No gradients flow through the LLM.
22
+ 3. **Evaluate:** Insert trained compressors into the live model via forward hooks and measure perplexity.
23
+
24
+ Each pair is trained in isolation — no joint optimization across layers, no end-to-end backpropagation. This is cheap (minutes per layer) but means compressors cannot adapt to how errors compound across layers.
25
+
26
+ **End-to-end (Task 5):** Compressors are trained **through the live LLM** using the language modeling objective:
27
+
28
+ 1. **Insert:** Register per-layer compressor/decompressor pairs as forward pre-hooks on each MoE layer.
29
+ 2. **Train:** Run standard next-token prediction. The LLM weights are frozen; only compressor parameters receive gradients. Gradients flow through the entire frozen LLM.
30
+ 3. **Evaluate:** Same hook-based perplexity evaluation as offline methods.
31
+
32
+ All 48 compressors are optimized jointly through a single global loss. This allows the system to learn how compression errors at early layers affect all downstream layers.
33
+
34
+ ---
35
+
36
+ ## 2. Model Specification
37
+
38
+ | Property | Value |
39
+ |---|---|
40
+ | Architecture | Qwen3-30B-A3B-Instruct-2507 |
41
+ | Total parameters | 30.53B |
42
+ | Activated parameters | 3.35B |
43
+ | Hidden dimension | 2048 |
44
+ | Number of layers | 48 (all MoE) |
45
+ | Number of experts | 128 per layer |
46
+ | Top-k routing | 8 experts per token |
47
+ | Attention heads | 32 (Q), 4 (KV) |
48
+ | Head dimension | 128 |
49
+ | MoE expert FFN intermediate size | 768 |
50
+ | Vocabulary size | 151,936 |
51
+
52
+ All tasks use the same model variant and precision:
53
+
54
+ | Variant | Used in | Loading | VRAM |
55
+ |---|---|---|---|
56
+ | Qwen3-30B-A3B-Instruct-2507 | All tasks (1–5) | Full BF16 | ~60 GB |
57
+
58
+ **Tasks 1–4:** Single GPU (`device="cuda:0"`). The ~60 GB model fits on one H100 80 GB with headroom for inference activations. Using single-GPU avoids the overhead of cross-GPU communication from `device_map="auto"`.
59
+
60
+ **Task 5:** Multi-GPU via `device_map="auto"` across 4 GPUs. Backpropagation through the frozen model during end-to-end training requires additional VRAM for activations and gradient checkpoints that exceed single-GPU capacity.
61
+
62
+ ---
63
+
64
+ ## 3. Data Collection
65
+
66
+ ### 3.1 Calibration Data
67
+
68
+ - **Dataset:** allenai/Dolci-Instruct-SFT (train split)
69
+ - **Format:** Chat-formatted instruction data, tokenized via `tokenizer.apply_chat_template()`
70
+ - **Sequences:** Up to 256 samples, each tokenized independently (one conversation = one sequence)
71
+ - **Max length:** 2048 tokens per sequence (configurable via `--max-length`)
72
+ - **SFT mode:** Labels mask non-assistant tokens with -100; perplexity computed on responses only
73
+ - **Response-only collection:** By default, only assistant-response tokens are captured
74
+ (positions where `labels != -100`). This ensures offline compressor training (Tasks 2–4)
75
+ trains on the same token distribution that PPL evaluation measures. Use `--no-response-only`
76
+ for legacy all-token collection.
77
+ - **Total tokens collected:** 100,000 per MoE layer (response tokens only by default)
78
+
79
+ ### 3.2 Hidden State Capture
80
+
81
+ PyTorch forward hooks are registered on each MoE module:
82
+ - **Pre-forward hook** captures dispatch states (MoE layer inputs)
83
+ - **Post-forward hook** captures gather states (MoE layer outputs)
84
+
85
+ **Token filtering:** `MoEHiddenStateCollector` supports a per-sequence boolean mask
86
+ (`set_token_mask(mask)`). When `response_only=True` (default), the mask is derived from
87
+ `labels != -100` before each forward pass. The same mask is applied to all 48 MoE layers
88
+ within a sequence, preserving token alignment across layers. Positions where the mask is
89
+ `False` (system, user, template markup, padding) are not collected.
90
+
91
+ Each captured tensor has shape `[N, 2048]` where N = number of response tokens (or all
92
+ tokens if `response_only=False`). States are stored in the model's native dtype (`bfloat16`)
93
+ on CPU.
94
+
95
+ **Implementation:** `MoEHiddenStateCollector` class in `src/model_utils.py`.
96
+
97
+ ### 3.3 Storage
98
+
99
+ ```
100
+ data/hidden_states/
101
+ ├── dispatch_states.pt # dict {layer_name: tensor [100000, 2048]}
102
+ ├── gather_states.pt # dict {layer_name: tensor [100000, 2048]}
103
+ └── metadata.json # model name, dims, token count, layer names
104
+ ```
105
+
106
+ Total size: ~37 GB (18.5 GB dispatch + 18.5 GB gather, bfloat16 = 2 bytes/value).
107
+
108
+ ---
109
+
110
+ ## 4. Evaluation Methodology
111
+
112
+ ### 4.1 Reconstruction Metrics (Offline)
113
+
114
+ Computed on cached hidden states without running the full model:
115
+
116
+ | Metric | Formula | Notes |
117
+ |---|---|---|
118
+ | MSE | `mean((x - x')²)` | Mean squared error |
119
+ | Cosine Similarity | `mean(cos(x, x'))` | Per-token, averaged |
120
+ | Relative Error | `mean(‖x - x'‖₂ / ‖x‖₂)` | Per-token L2 relative error |
121
+ | SNR (dB) | `10 · log₁₀(signal_power / noise_power)` | Signal-to-noise ratio |
122
+
123
+ **Implementation:** `src/metrics.py`
124
+
125
+ ### 4.2 End-to-End Perplexity (Online)
126
+
127
+ The true impact of compression is measured by evaluating cross-entropy perplexity on allenai/Dolci-Instruct-SFT (the same dataset used for calibration/training) with compression hooks active:
128
+
129
+ - **Dispatch compression:** A pre-forward hook on each MoE block applies `compress → decompress` to the input hidden states before they enter the block.
130
+ - **Evaluation:** 50,000 sequences, max length 2048 tokens.
131
+ - **SFT mode:** Perplexity is computed on assistant response tokens only. Non-response tokens
132
+ (system, user, template markup) are labeled with -100 and excluded from the loss.
133
+ This measures the model's ability to generate correct responses, not to predict prompt tokens.
134
+
135
+ **Caveat:** This simulation also affects the router's input. In real expert parallelism, the router runs on the original hidden state at the source node. Our simulation is therefore **conservative**: it upper-bounds the quality impact — the true degradation would be smaller.
136
+
137
+ **Implementation:** `evaluate_perplexity_with_compression()`, `evaluate_perplexity_with_perlayer_compression()`, `evaluate_perplexity_with_stale_compression()` in `src/model_utils.py`.
138
+
139
+ ---
140
+
141
+ ## 5. Method Descriptions
142
+
143
+ ### 5.1 Quantization Baseline (Task 2)
144
+
145
+ **Idea:** Reduce the bit width of hidden-state elements from BF16 (16 bits) to INT8/INT4/INT2.
146
+
147
+ **Symmetric (absmax) quantization:**
148
+ ```
149
+ scale = max(|x|) / (2^(bits-1) - 1) # per-token
150
+ x_q = round(x / scale) # quantize
151
+ x' = x_q * scale # dequantize
152
+ ```
153
+
154
+ **Asymmetric (zero-point) quantization:**
155
+ ```
156
+ scale = (max(x) - min(x)) / (2^bits - 1)
157
+ zero_point = round(-min(x) / scale)
158
+ x_q = round(x / scale + zero_point)
159
+ x' = (x_q - zero_point) * scale
160
+ ```
161
+
162
+ **Compression ratios:**
163
+
164
+ | Bits | Effective Ratio | Bytes/token (hidden_dim=2048) |
165
+ |---|---|---|
166
+ | INT8 (absmax) | ~2.0x | 2050 (2048 + 2 for scale) |
167
+ | INT4 (absmax) | ~4.0x | 1026 (1024 + 2 for scale) |
168
+ | INT2 (absmax) | ~8.0x | 514 (512 + 2 for scale) |
169
+
170
+ **Additional parameters:** 0 (quantization is parameter-free).
171
+
172
+ **Implementation:** `src/run_quantization.py`
173
+
174
+ ---
175
+
176
+ ### 5.2 Shared Neural Compressor (Task 3)
177
+
178
+ **Idea:** Train a single-layer linear autoencoder shared across all 48 MoE layers.
179
+
180
+ **Architecture:**
181
+ ```
182
+ Compressor: Linear(2048, bottleneck_dim) + bias
183
+ Decompressor: Linear(bottleneck_dim, 2048) + bias
184
+ ```
185
+
186
+ One compressor-decompressor pair is shared across all layers. Training data pools dispatch states from all 48 layers. Training is offline: the compressor minimizes reconstruction loss on cached hidden states, with no gradients flowing through the LLM.
187
+
188
+ **Compression ratios:** `hidden_dim / bottleneck_dim` = {2x, 4x, 8x, 16x} corresponding to `bottleneck_dim` = {1024, 512, 256, 128}.
189
+
190
+ **Training hyperparameters:**
191
+
192
+ | Parameter | Value |
193
+ |---|---|
194
+ | Optimizer | Adam |
195
+ | Learning rate | 1e-3 |
196
+ | LR schedule | Cosine annealing (T_max = epochs) |
197
+ | Max epochs | 50 |
198
+ | Batch size | 2048 |
199
+ | Early stopping patience | 8 epochs |
200
+ | Validation fraction | 10% |
201
+ | Loss function | MSE + 0.1 × (1 - cosine_similarity) |
202
+
203
+ **Loss function:**
204
+ ```
205
+ L = MSE(x', x) + λ · (1 - mean(cos_sim(x', x)))
206
+ ```
207
+ where `λ = 0.1` (cosine_weight). The cosine term encourages preserving direction, not just magnitude.
208
+
209
+ **Parameter count:**
210
+ ```
211
+ params = (2048 × b + b) + (b × 2048 + 2048)
212
+ ```
213
+ where `b` = bottleneck_dim.
214
+
215
+ | Ratio | Bottleneck | Parameters | % of Activated |
216
+ |---|---|---|---|
217
+ | 2x | 1024 | 4.20M | 0.125% |
218
+ | 4x | 512 | 2.10M | 0.063% |
219
+ | 8x | 256 | 1.05M | 0.031% |
220
+ | 16x | 128 | 0.53M | 0.016% |
221
+
222
+ **Implementation:** `src/run_neural_compressor.py`
223
+
224
+ ---
225
+
226
+ ### 5.3 Per-Layer Neural Compressor (Task 3b)
227
+
228
+ **Motivation:** Hidden state distributions vary dramatically across layers:
229
+ - Standard deviation: 0.16 (layer 0) → 1.21 (layer 47)
230
+ - Kurtosis: 3 (near-Gaussian, early layers) → 81,340 (extremely heavy-tailed, late layers)
231
+
232
+ A single shared compressor cannot adapt to this variation.
233
+
234
+ **Architecture:** Same `Compressor` + `Decompressor` structure, but **48 independent pairs** — one per MoE layer. Each layer's compressor is trained independently and only on that layer's cached dispatch data. There is no joint optimization across layers.
235
+
236
+ **Compression ratios:** Same as shared: {2x, 4x, 8x, 16x}.
237
+
238
+ **Training:** Same hyperparameters as shared (see Section 5.2). Each layer is trained independently on its own 100K token dispatch data (90% train / 10% val).
239
+
240
+ **Parameter count:**
241
+ ```
242
+ params = 48 × (2048 × b + b + b × 2048 + 2048)
243
+ ```
244
+
245
+ | Ratio | Bottleneck | Parameters | % of Activated |
246
+ |---|---|---|---|
247
+ | 2x | 1024 | 201.47M | 6.008% |
248
+ | 4x | 512 | 100.79M | 3.006% |
249
+ | 8x | 256 | 50.44M | 1.504% |
250
+ | 16x | 128 | 25.27M | 0.754% |
251
+
252
+ **Implementation:** `src/run_perlayer_compressor.py`
253
+
254
+ ---
255
+
256
+ ### 5.4 Stale-Conditioned Compressor (Tasks 4a/4b)
257
+
258
+ **Motivation:** Adjacent MoE layers process the same token, so their hidden states are correlated. A decompressor can exploit this by receiving a "stale" signal — the hidden state from a nearby layer that was already transmitted — as side information.
259
+
260
+ **Reference layer grouping (stride=12):**
261
+ - Reference layers: {0, 12, 24, 36} (4 layers)
262
+ - Layer 1–11 → stale from layer 0
263
+ - Layer 13–23 → stale from layer 12
264
+ - Layer 25–35 → stale from layer 24
265
+ - Layer 37–47 → stale from layer 36
266
+
267
+ **Architecture:**
268
+ - **Reference layers** use standard per-layer `Compressor` + `Decompressor` (no stale signal).
269
+ - **Non-reference layers** use `Compressor` + `StaleDecompressor`:
270
+
271
+ ```
272
+ Compressor: Linear(2048, bottleneck_dim) + bias
273
+ StaleDecompressor: Linear(bottleneck_dim + stale_dim, 2048) + bias
274
+ ```
275
+
276
+ The decompressor receives `cat(compressed_current, stale_signal)` as input.
277
+
278
+ **Two stale modes:**
279
+
280
+ | Mode | Task | Stale signal | StaleDecompressor input dim |
281
+ |---|---|---|---|
282
+ | Compressed (4a) | `--stale-mode compressed` | Compressed ref layer input (via ref's compressor) | `bottleneck_dim + bottleneck_dim` |
283
+ | Uncompressed (4b) | `--stale-mode uncompressed` | Raw ref layer input (full hidden dim) | `bottleneck_dim + 2048` |
284
+
285
+ **Training:**
286
+ 1. **Phase 1:** Train reference layer compressors independently (standard per-layer autoencoder, same hyperparameters as Section 5.2).
287
+ 2. **Phase 2:** Train non-reference layer compressors independently. For each non-ref layer:
288
+ - Current data: that layer's cached dispatch states
289
+ - Stale data: the reference layer's cached dispatch states (compressed or raw, depending on mode)
290
+ - The stale signal is **pre-computed and frozen** — the reference layer's compressor is not jointly optimized with non-reference layers
291
+ - Token alignment is guaranteed: `dispatch[layer_0][i]` and `dispatch[layer_5][i]` correspond to the same token
292
+
293
+ As with all neural methods in this project, training is offline on cached hidden states. No gradients flow through the LLM, and each layer's compressor is trained in isolation.
294
+
295
+ **Stale-conditioned training loss:** Same as Section 5.2 (`MSE + 0.1 × (1 - cos_sim)`), but the decompressor receives the concatenated input.
296
+
297
+ **Parameter count:**
298
+
299
+ For compressed stale (`stale_dim = bottleneck_dim`):
300
+ ```
301
+ ref_pair = (2048 × b + b) + (b × 2048 + 2048)
302
+ nonref_pair = (2048 × b + b) + ((b + b) × 2048 + 2048)
303
+ total = 4 × ref_pair + 44 × nonref_pair
304
+ ```
305
+
306
+ For uncompressed stale (`stale_dim = 2048`):
307
+ ```
308
+ ref_pair = (2048 × b + b) + (b × 2048 + 2048)
309
+ nonref_pair = (2048 × b + b) + ((b + 2048) × 2048 + 2048)
310
+ total = 4 × ref_pair + 44 × nonref_pair
311
+ ```
312
+
313
+ | Mode | Ratio | Bottleneck | Stale dim | Parameters | % of Activated |
314
+ |---|---|---|---|---|---|
315
+ | Compressed | 2x | 1024 | 1024 | 293.75M | 8.760% |
316
+ | Compressed | 4x | 512 | 512 | 146.92M | 4.382% |
317
+ | Compressed | 8x | 256 | 256 | 73.51M | 2.192% |
318
+ | Compressed | 16x | 128 | 128 | 36.80M | 1.098% |
319
+ | Uncompressed | 2x | 1024 | 2048 | 386.02M | 11.512% |
320
+ | Uncompressed | 4x | 512 | 2048 | 285.34M | 8.509% |
321
+ | Uncompressed | 8x | 256 | 2048 | 234.99M | 7.008% |
322
+ | Uncompressed | 16x | 128 | 2048 | 209.82M | 6.257% |
323
+
324
+ Note: The uncompressed stale method's parameter count does not scale down as aggressively because the `StaleDecompressor` input always includes the full 2048-dim stale signal, making the `(2048 × 2048)` weight block dominant.
325
+
326
+ **Perplexity evaluation with stale hooks:** During forward pass, a shared `stale_cache` dictionary stores reference layer inputs. PyTorch processes layers 0→47 sequentially, so layer 0's pre-hook fires before layer 1's, guaranteeing the stale cache is populated in time.
327
+
328
+ **Implementation:** `src/run_stale_compressor.py`, `evaluate_perplexity_with_stale_compression()` in `src/model_utils.py`.
329
+
330
+ ---
331
+
332
+ ### 5.5 End-to-End Per-Layer Compressor (Tasks 5a/5b)
333
+
334
+ **Motivation:** All offline methods (Tasks 3–4) share a fundamental limitation: each compressor is trained to minimize *local* reconstruction error in isolation. It cannot account for how its errors compound through downstream layers during a full forward pass. Additionally, the stale signal used during offline training is the *unperturbed* reference layer input, but during inference the reference layer itself is compressed, creating a train-inference mismatch.
335
+
336
+ End-to-end training addresses both issues by optimizing compressors through the full LLM forward pass using the language modeling objective.
337
+
338
+ **Architecture:** Same `Compressor` + `Decompressor` (5a) or `Compressor` + `StaleDecompressor` (5b) structure as Tasks 3b/4b. The compressor modules are identical — only the training objective differs.
339
+
340
+ **Training paradigm:**
341
+ 1. Load the LLM in full BF16 across 4 GPUs. Freeze all LLM weights.
342
+ 2. Insert per-layer compressor/decompressor pairs as forward pre-hooks on each MoE layer. Each pair is placed on the same GPU as its MoE layer.
343
+ 3. Run standard next-token prediction on training data. Only compressor/decompressor parameters receive gradients.
344
+ 4. Gradients flow backward through the entire frozen LLM, from the cross-entropy loss at the output back through all 48 layers to every compressor.
345
+
346
+ **Key difference from offline: joint optimization.** All 48 compressors share a single loss function (cross-entropy). Layer 0's compressor receives gradient signal about how its reconstruction error affects layers 1–47. The system implicitly learns to allocate more fidelity to layers where errors are most harmful to the final prediction.
347
+
348
+ **Stale signal gradient flow (5b):** Unlike offline Task 4b where the stale signal is pre-computed and frozen, end-to-end training does **not** detach the stale signal. Gradients flow through the stale path:
349
+ - A non-reference layer's decompressor receives `cat(compressed_current, stale)` where `stale` is the raw input to the reference layer
350
+ - During backward, gradients flow from the non-ref layer through `stale` to the reference layer's input, and further back to earlier layers
351
+ - This means reference layers' compressors are optimized not just for their own reconstruction, but also for how their inputs serve as stale side information for all downstream non-reference layers
352
+ - This eliminates the train-inference mismatch: during training, the stale signal already reflects upstream compression artifacts
353
+
354
+ **Near-identity initialization:**
355
+ - Compressor `W_c`: first `bottleneck_dim` rows of the identity matrix
356
+ - Decompressor `W_d`: first `bottleneck_dim` columns of the identity matrix
357
+ - Composition `W_d @ W_c ≈ I` (projects to first `b` dimensions and reconstructs)
358
+ - This ensures the initial forward pass is close to uncompressed, avoiding catastrophic initial loss from random projections. The optimizer then refines from this starting point.
359
+
360
+ **Model and data:**
361
+ - **Model:** Qwen/Qwen3-30B-A3B-Instruct-2507 (full BF16, same as all tasks)
362
+ - **Training data:** allenai/Dolci-Instruct-SFT, 500K sequences (HF) / 100K sequences (Megatron) sampled from train split,
363
+ max_length=2048 tokens per sequence
364
+ - **SFT mode:** Each conversation is tokenized independently (one sample = one sequence).
365
+ Labels mask non-assistant tokens with -100; loss is computed on assistant responses only.
366
+ Data is loaded by sampling N sequences from the dataset (not by packing tokens).
367
+ - **Evaluation:** allenai/Dolci-Instruct-SFT (same dataset, response-only perplexity)
368
+
369
+ **Two modes:**
370
+
371
+ | Mode | Task | Stale signal | Decompressor |
372
+ |---|---|---|---|
373
+ | No stale (5a) | `--stale-mode none` | None | `Decompressor(bottleneck_dim, 2048)` |
374
+ | Uncompressed stale (5b) | `--stale-mode uncompressed` | Raw ref layer input | `StaleDecompressor(bottleneck_dim, 2048, 2048)` |
375
+
376
+ **Training hyperparameters:**
377
+
378
+ | Parameter | Value |
379
+ |---|---|
380
+ | Optimizer | AdamW |
381
+ | Learning rate | 1e-4 |
382
+ | Weight decay | 0.01 |
383
+ | LR schedule | Cosine with 10% linear warmup |
384
+ | Max epochs | 1 |
385
+ | Batch size | 2 (gradient accumulation: 8, effective: 16) |
386
+ | Gradient clipping | max_norm = 1.0 |
387
+ | Early stopping patience | 5 epochs |
388
+ | Validation interval | Every 2500 optimizer steps (HF) / 1000 (Megatron) (configurable via `--val-interval`) |
389
+ | Validation batch size | 8 (configurable via `--val-batch-size`; larger than train because no backward) |
390
+ | Validation fraction | 10% |
391
+ | Max sequence length | 2048 (configurable via `--max-length`) |
392
+ | Loss function | Cross-entropy (response tokens only, SFT mode) |
393
+
394
+ Note the lower learning rate (1e-4 vs 1e-3 for offline) — the LM loss landscape propagates gradients through 48 frozen transformer layers, requiring more conservative updates.
395
+
396
+ **Tail micro-batch handling:** When `len(dataloader) % grad_accum != 0`, the remaining micro-batches
397
+ have their accumulated gradients rescaled by `grad_accum / remainder` (correcting the divisor from
398
+ `1/grad_accum` to `1/remainder`) before performing a final optimizer step. This ensures no training
399
+ data is discarded. Applied to both HF (`run_e2e_compressor.py`) and Megatron (`train.py`).
400
+
401
+ **Two evaluation stages (different data, different code paths):**
402
+
403
+ | Stage | Split | Batch size | Function | Purpose |
404
+ |---|---|---|---|---|
405
+ | Training-time val | VAL (50K seqs) | `--val-batch-size` (default 8) | `evaluate_val_loss()` in training script | Checkpoint selection, wandb monitoring |
406
+ | Final PPL | TEST (50K seqs) | 1 (per-sample) | `evaluate_perplexity()` in `model_utils.py` | Reported results |
407
+
408
+ The training-time validation runs every `--val-interval` optimizer steps and at epoch end, using the VAL split. It drives best-checkpoint selection. The final perplexity evaluation runs after training on the held-out TEST split (never seen during training or checkpoint selection) and produces the numbers reported in the results tables. These are separate code paths — `--val-batch-size` only affects the training-time evaluation.
409
+
410
+ **Parameter count:** Same as Tasks 3b (5a) and 4b-uncompressed (5b):
411
+
412
+ | Mode | Ratio | Bottleneck | Parameters | % of Activated |
413
+ |---|---|---|---|---|
414
+ | No stale (5a) | 2x | 1024 | 201.47M | 6.008% |
415
+ | No stale (5a) | 4x | 512 | 100.79M | 3.006% |
416
+ | No stale (5a) | 8x | 256 | 50.44M | 1.504% |
417
+ | No stale (5a) | 16x | 128 | 25.27M | 0.754% |
418
+ | Uncompressed stale (5b) | 2x | 1024 | 386.02M | 11.512% |
419
+ | Uncompressed stale (5b) | 4x | 512 | 285.34M | 8.509% |
420
+ | Uncompressed stale (5b) | 8x | 256 | 234.99M | 7.008% |
421
+ | Uncompressed stale (5b) | 16x | 128 | 209.82M | 6.257% |
422
+
423
+ **Multi-GPU setup:**
424
+ - Model distributed across 4 GPUs via `device_map="auto"` (~15 GB/GPU)
425
+ - Gradient checkpointing enabled (`use_reentrant=False`) to reduce activation memory
426
+ - 8 GPUs available → 05a and 05b run in parallel on separate GPU sets (GPUs 0-3 and 4-7)
427
+ - Each compressor is automatically placed on the same GPU as its MoE layer
428
+
429
+ **Implementation:** `src/run_e2e_compressor.py`, `scripts/05_run_e2e_compressor.sh`.
430
+
431
+ ---
432
+
433
+ ### 5.6 Megatron-LM E2E Training (Task 5 — Megatron variant)
434
+
435
+ **Motivation:** The HuggingFace-based Task 5 uses `device_map="auto"` for naive layer-sharded model parallelism. Only one GPU is active at a time during forward pass (sequential layer execution), with no tensor or data parallelism. This limits training throughput and cannot scale to multi-node.
436
+
437
+ **Approach:** Replace HuggingFace with Megatron-LM to get proper tensor parallelism (TP), expert parallelism (EP), and data parallelism (DP):
438
+ - All 4 GPUs active simultaneously via TP (each GPU holds shards of every layer)
439
+ - Multi-node scaling via DP across nodes + TP within nodes
440
+ - Megatron's optimized kernels (fused LayerNorm, FlashAttention, etc.)
441
+
442
+ **Compressor/decompressor placement:**
443
+
444
+ In real expert parallelism, the compressor and decompressor are on DIFFERENT GPUs:
445
+ - **Compressor:** Same GPU as attention output (source GPU where token originates)
446
+ - **Decompressor:** Same GPU as MoE expert (destination GPU after dispatch)
447
+
448
+ This is more realistic than the HF hook-based simulation where the router sees compressed-then-decompressed input. With Megatron, the router sees the ORIGINAL hidden state; only the dispatch is compressed.
449
+
450
+ **Phase A (TP only, EP=1):** Compressor and decompressor on same GPU (same as current HF approach). TP=4 shards each layer across 4 GPUs.
451
+
452
+ **Phase B (with EP):** Compressor on attention GPU, decompressor on expert GPU. MoE dispatch sends compressed tokens (reduced all-to-all volume). The `CompressedMoETokenDispatcher` wraps Megatron's dispatcher to:
453
+ 1. Compress on source GPU (attention side)
454
+ 2. Dispatch compressed tokens (smaller all-to-all)
455
+ 3. Decompress on destination GPU (expert side)
456
+
457
+ **Training pipeline:**
458
+ 1. Convert Qwen3-30B-A3B from HF format to Megatron format via Megatron Bridge
459
+ 2. Load with TP=4 (each GPU holds ~15-20 GB of sharded weights)
460
+ 3. Freeze all LLM parameters
461
+ 4. Insert per-layer compressor/decompressor pairs at MoE boundaries
462
+ 5. Train compressors via language modeling objective (same as HF Task 5)
463
+ 6. Save compressor weights (from rank 0, since all TP ranks have identical copies)
464
+
465
+ **TP-aware loss computation:** `MegatronModelWrapper._compute_loss()` uses
466
+ `vocab_parallel_cross_entropy` when TP > 1. SFT labels (-100) are clamped to 0 before
467
+ the call (avoiding garbage per-token loss for masked positions), and loss is computed as
468
+ `(per_token_loss * loss_mask).sum() / num_valid`. The non-TP path uses PyTorch's
469
+ `cross_entropy(ignore_index=-100)` which handles masking internally.
470
+
471
+ **Evaluation:** Uses existing HF-based evaluation code — load trained compressor weights into `E2ECompressorManager` and evaluate perplexity with hook-based simulation.
472
+
473
+ **Parallelism strategies:**
474
+
475
+ | Hardware | Configuration | Notes |
476
+ |---|---|---|
477
+ | 4 GPUs | TP=4, EP=1, PP=1, DP=1 | All GPUs active via tensor parallelism |
478
+ | 8 GPUs | TP=4, EP=1, PP=1, DP=2 | TP within 4 GPUs, DP across 2 replicas |
479
+ | N nodes × 4 GPUs | TP=4, DP=N | TP within node (NVLink), DP across nodes |
480
+ | EP variant | TP=2, EP=2, PP=1, DP=1 | Compressor on TP ranks, decompressor on EP ranks |
481
+
482
+ **Compressor weights with TP:**
483
+ - Compressors are replicated on all TP ranks (not sharded)
484
+ - Input is full hidden state (post-attention all-reduce)
485
+ - Gradients identical across ranks — no extra all-reduce needed
486
+ - Save from rank 0 only
487
+
488
+ **Implementation:** `src/megatron_e2e/` package with EP-first parallelism (EP=4, TP=1), CUDA 12.9, Megatron Bridge 0.2+, Transformer Engine. Entry point: `src/megatron_e2e/train.py`, bash wrapper: `scripts/05_megatron_e2e.sh`, setup: `scripts/megatron_setup_env.sh`. Multi-node: `scripts/05_megatron_e2e_multinode.sh`.
489
+
490
+ ---
491
+
492
+ ### 5.7 Baseline E2E Evaluation (Task 5c)
493
+
494
+ **Motivation:** Tasks 5a/5b report perplexity relative to an "untrained baseline" (the original model evaluated on the same test data). However, 5a/5b's training pipeline also loads and processes data through `load_e2e_data()`, computes SFT-masked loss on train/val splits, and may differ subtly from a raw model evaluation. Task 5c runs the exact same pipeline (same data loading, same loss computation, same evaluation) but WITHOUT inserting any compressors. This provides:
495
+
496
+ 1. **Train/val loss context:** If 5c's train loss is ~1.0, and 5a-2x's is 1.11, the compression overhead is only +0.11 — not the raw 1.11 value.
497
+ 2. **Pipeline consistency:** Confirms that the data pipeline itself does not introduce artifacts.
498
+ 3. **Fair comparison:** All three (5a, 5b, 5c) use identical code paths except for the compression hooks.
499
+
500
+ **What it does:**
501
+ - Loads data via `load_e2e_data()` (same function as 5a/5b)
502
+ - Evaluates train and val loss using `evaluate_loss_no_hooks()` — same as `evaluate_val_loss()` but without a compressor manager
503
+ - Evaluates baseline PPL on the TEST split (same as 5a/5b)
504
+ - No training, no compression ratios, no weight files
505
+
506
+ **Implementation:** Added as `--stale-mode baseline` to both `src/run_e2e_compressor.py` (HF) and `src/megatron_e2e/train.py` (Megatron). Output dirs: `results/05c_e2e_baseline/` (HF), `results/05c_megatron_e2e_baseline/` (Megatron).
507
+
508
+ ---
509
+
510
+ ### 5.8 E2E with Pretrained Init (Tasks 6a/6b)
511
+
512
+ **Motivation:** Tasks 5a/5b initialize compressor/decompressor weights with a near-identity matrix — the first `bottleneck_dim` dimensions are preserved, and the rest are zeroed out. This is a reasonable starting point but the optimizer must learn the full compression mapping from scratch using only the LM loss signal.
513
+
514
+ Tasks 3b and 4b already train compressors to minimize reconstruction loss on cached hidden states. While this offline objective doesn't directly optimize for LM quality, the resulting weights encode the structure of hidden-state distributions and provide a potentially better starting point for E2E fine-tuning.
515
+
516
+ Tasks 6a/6b test this hypothesis: does initializing E2E training from reconstruction-optimized weights (instead of near-identity) lead to faster convergence or better final quality?
517
+
518
+ **Architecture:** Identical to Tasks 5a/5b — same `Compressor`, `Decompressor`, `StaleDecompressor` classes, same training objective (cross-entropy), same hyperparameters. The only difference is the initial weight values.
519
+
520
+ **Two modes:**
521
+
522
+ | Mode | Task | Init from | Stale signal |
523
+ |---|---|---|---|
524
+ | No stale (6a) | `--stale-mode none --init-weights-dir results/03b_perlayer_compressor` | Task 3b (per-layer offline) | None |
525
+ | Uncompressed stale (6b) | `--stale-mode uncompressed --init-weights-dir results/04b_stale_uncompressed` | Task 4b (stale offline) | Raw ref layer input |
526
+
527
+ **Weight compatibility:** Tasks 3b/4b save weights keyed by HF layer names (`model.layers.N.mlp`) with `compressor` and `decompressor` sub-keys. The `MegatronCompressorManager.load_weights()` expects the same format (it converts Megatron names to HF names via `_megatron_to_hf_layer_name()`). The offline and E2E architectures use identical module classes, so `load_state_dict()` works directly.
528
+
529
+ **Parameter count:** Same as Tasks 5a/5b (identical architecture).
530
+
531
+ **Training hyperparameters:** Same as Tasks 5a/5b (same LR, warmup, epochs, etc.).
532
+
533
+ **Implementation:** Added `--init-weights-dir` argument to `src/megatron_e2e/train.py`. Auto-detects weight file naming pattern. Bash wrapper: `scripts/06_megatron_e2e_pretrained.sh`. Output dirs: `results/06a_megatron_e2e_pretrained_perlayer/` (6a), `results/06b_megatron_e2e_pretrained_stale/` (6b).
534
+
535
+ ### 5.9 Split-Mode E2E Training (Tasks 7a/7b)
536
+
537
+ **Motivation:** Tasks 5/6 use forward pre-hooks that compress→decompress the MoE input — both the router AND experts see the decompressed hidden state. This is a conservative lower bound on quality. In real expert parallelism, the router runs on the source GPU with the **original** hidden state (before compression), and only experts on the destination GPU see the decompressed version. Task 7 trains the compressor under this more realistic "split mode" to see whether the training signal improves when the router is not degraded by compression artifacts.
538
+
539
+ **Approach — Two-Level Pre-Hooks:**
540
+
541
+ Instead of monkey-patching MoE forward methods, two pre-hooks are registered per MoE layer:
542
+
543
+ 1. **MoE pre-hook:** Saves the original input, then returns the compress→decompress result. The MoE module's `forward()` receives the decompressed tensor as its input.
544
+ 2. **Router pre-hook:** Registered on the router/gate submodule. When the MoE's `forward()` calls `self.gate(hidden_states)`, this hook intercepts and swaps the input back to the saved original.
545
+
546
+ This works because:
547
+ - The MoE pre-hook changes what `forward()` receives (decompressed), so experts get decompressed data.
548
+ - The router pre-hook only affects the `gate` submodule's input, restoring the original.
549
+ - PyTorch hook execution order: MoE pre-hook runs first (on the outer module), then when `forward()` calls `self.gate(...)` internally, the gate pre-hook runs and swaps the argument.
550
+
551
+ **Two modes:**
552
+
553
+ | Mode | Task | Init from | Stale signal | Router input |
554
+ |---|---|---|---|---|
555
+ | No stale (7a) | `--stale-mode none --router-mode uncompressed --init-weights-dir results/03b_perlayer_compressor` | Task 3b | None | Original |
556
+ | Uncompressed stale (7b) | `--stale-mode uncompressed --router-mode uncompressed --init-weights-dir results/04b_stale_uncompressed` | Task 4b | Raw ref input | Original |
557
+
558
+ **Architecture:** Identical to Tasks 6a/6b — same classes, same init weights, same hyperparameters. The only difference is that `router_mode="uncompressed"` activates the two-level hook pattern during training and evaluation.
559
+
560
+ **Implementation:** Added `--router-mode` argument to `src/megatron_e2e/train.py` and `src/run_e2e_compressor.py`. Split-mode hooks added to `MegatronCompressorManager` (Megatron training) and `evaluate_perplexity_with_perlayer_compression`/`evaluate_perplexity_with_stale_compression` (HF PPL evaluation). Bash wrapper: `scripts/07_megatron_e2e_split.sh`. Output dirs: `results/07a_megatron_e2e_split_perlayer/` (7a), `results/07b_megatron_e2e_split_stale/` (7b).
561
+
562
+ ---
563
+
564
+ ## 6. Results
565
+
566
+ ### 6.1 Summary Table — All Methods
567
+
568
+ **Model:** Qwen3-30B-A3B-Instruct-2507 (full BF16)
569
+ **Dataset:** allenai/Dolci-Instruct-SFT
570
+
571
+ | Method | Ratio | MSE | CosSim | PPL | PPL Delta | HF Strict | HF Flex |
572
+ |---|---|---|---|---|---|---|---|
573
+ | Baseline (Tasks 2–4) | — | — | — | 3.89 | — | 44.12% | 82.79% |
574
+ | Baseline (5c / Megatron) | — | — | — | 3.94 | — | 44.12% | 82.79% |
575
+ | Quant INT8 | 2.0x | — | — | 3.90 | +0.01 | 48.90% | 82.26% |
576
+ | Quant INT4 | 4.0x | — | — | 4.51 | +0.62 | 56.41% | 68.54% |
577
+ | Quant INT2 | 8.0x | — | — | 1532.59 | +1528.70 | 0.00% | 0.00% |
578
+ | Neural (per-layer) | 2x | 0.0535 | 0.922 | 21.07 | +17.18 | 0.00% | 1.52% |
579
+ | Neural (per-layer) | 4x | 0.1073 | 0.835 | 425.75 | +421.87 | 0.00% | 0.00% |
580
+ | Neural (per-layer) | 8x | 0.1523 | 0.755 | 7949.78 | +7945.89 | 0.00% | 0.00% |
581
+ | Neural (per-layer) | 16x | 0.1893 | 0.683 | 52440.05 | +52436.16 | 0.00% | 0.00% |
582
+ | Stale-cond. (compressed) | 2x | 0.0379 | 0.947 | 6.13 | +2.24 | 3.41% | 62.55% |
583
+ | Stale-cond. (compressed) | 4x | 0.0876 | 0.869 | 31.64 | +27.75 | 0.61% | 1.52% |
584
+ | Stale-cond. (compressed) | 8x | 0.1330 | 0.791 | 2982.23 | +2978.34 | 0.00% | 0.00% |
585
+ | Stale-cond. (compressed) | 16x | 0.1720 | 0.717 | 17486.21 | +17482.32 | 0.00% | 0.00% |
586
+ | Stale-cond. (uncompressed) | 2x | 0.0346 | 0.952 | 6.24 | +2.36 | 2.81% | 67.10% |
587
+ | Stale-cond. (uncompressed) | 4x | 0.0690 | 0.900 | 16.11 | +12.22 | 0.99% | 6.14% |
588
+ | Stale-cond. (uncompressed) | 8x | 0.0966 | 0.855 | 423.68 | +419.79 | 0.00% | 0.00% |
589
+ | Stale-cond. (uncompressed) | 16x | 0.1173 | 0.819 | 3740.41 | +3736.53 | 0.00% | 0.00% |
590
+ | Megatron E2E per-layer (5a) | 2x | — | — | 2.77 | -1.17 | 61.33% | 61.64% |
591
+ | Megatron E2E per-layer (5a) | 4x | — | — | 4.28 | +0.35 | 20.70% | 21.30% |
592
+ | Megatron E2E per-layer (5a) | 8x | — | — | 7.49 | +3.55 | 1.82% | 2.12% |
593
+ | Megatron E2E per-layer (5a) | 16x | — | — | 11.26 | +7.33 | 0.91% | 2.73% |
594
+ | Megatron E2E stale (5b) | 2x | — | — | 2.71 | -1.23 | 60.27% | 60.65% |
595
+ | Megatron E2E stale (5b) | 4x | — | — | 3.61 | -0.33 | 31.54% | 32.37% |
596
+ | Megatron E2E stale (5b) | 8x | — | — | 4.98 | +1.04 | 4.93% | 5.00% |
597
+ | Megatron E2E stale (5b) | 16x | — | — | 6.34 | +2.41 | 2.12% | 2.27% |
598
+ | Megatron E2E pretrained per-layer (6a) | 2x | — | — | 2.41 | -1.53 | 79.98% | 80.06% |
599
+ | Megatron E2E pretrained per-layer (6a) | 4x | — | — | 3.18 | -0.76 | 55.04% | 55.19% |
600
+ | Megatron E2E pretrained per-layer (6a) | 8x | — | — | 4.52 | +0.58 | 16.98% | 16.98% |
601
+ | Megatron E2E pretrained per-layer (6a) | 16x | — | — | 7.34 | +3.40 | 2.27% | 2.27% |
602
+ | Megatron E2E pretrained stale (6b) | 2x | — | — | 2.25 | -1.69 | 82.49% | 82.64% |
603
+ | Megatron E2E pretrained stale (6b) | 4x | — | — | 2.57 | -1.37 | 64.37% | 64.52% |
604
+ | Megatron E2E pretrained stale (6b) | 8x | — | — | 3.04 | -0.90 | 45.79% | 45.94% |
605
+ | Megatron E2E pretrained stale (6b) | 16x | — | — | 3.47 | -0.47 | 25.85% | 25.85% |
606
+ | Split-mode E2E per-layer (7a) | 2x | — | — | 2.58 | -1.36 | 79.91% | 79.98% |
607
+ | Split-mode E2E per-layer (7a) | 4x | — | — | 3.72 | -0.22 | 42.08% | 42.15% |
608
+ | Split-mode E2E per-layer (7a) | 8x | — | — | 6.43 | +2.49 | 4.93% | 5.46% |
609
+ | Split-mode E2E per-layer (7a) | 16x | — | — | 908.20 | +904.26 | 0.00% | 0.53% |
610
+ | Split-mode E2E stale (7b) | 2x | — | — | 2.34 | -1.60 | 80.67% | 80.67% |
611
+ | Split-mode E2E stale (7b) | 4x | — | — | 2.80 | -1.14 | 65.81% | 65.96% |
612
+ | Split-mode E2E stale (7b) | 8x | — | — | 3.37 | -0.57 | 35.63% | 35.63% |
613
+ | Split-mode E2E stale (7b) | 16x | — | — | 4.28 | +0.34 | 16.53% | 16.68% |
614
+
615
+ Note: Tasks 2–4 and 5c baselines differ in PPL (3.89 vs 3.94) due to different evaluation
616
+ code paths (single-GPU HF vs Megatron pipeline). PPL deltas for offline methods use 3.89;
617
+ E2E methods use 3.94. HF Strict/Flex: GSM8K evaluated via HF backend (lm-eval-harness,
618
+ router-compressed mode). For Tasks 7a/7b, HF Strict/Flex is compressed-router only.
619
+ Uncompressed-router results for Tasks 7a/7b are in a dedicated table below Section 6.4.
620
+ GSM8K scores are identical for both baselines because GSM8K evaluation uses the same raw HF
621
+ model. GSM8K uses Megatron-trained weights for E2E methods. "Strict" requires exact
622
+ `#### <number>` format; "flexible" extracts the number from anywhere in the output.
623
+ HF-trained E2E weights (Tasks 5a/5b) were not available.
624
+
625
+ ### 6.2 Key Findings
626
+
627
+ 1. **E2E training is transformative** — E2E methods achieve PPL *below* baseline (3.94) at 2x. E2E stale stays below baseline at 4x (PPL=3.61).
628
+ 2. **E2E stale at 16x is moderate** — PPL=6.34 (+2.41), 61% above baseline, with GSM8K strict-match at 2.12%.
629
+ 3. **E2E dramatically outperforms offline** — Same architecture, same params: offline per-layer 4x PPL=425.75 vs E2E 4x PPL=4.28 (99x better). At 16x: 52440 vs 11.26 (4658x better).
630
+ 4. **Stale conditioning matters more at high compression** — At 2x the gap is small (E2E stale 2.71 vs E2E per-layer 2.77), but at 16x it's 1.8x (6.34 vs 11.26).
631
+ 5. **INT8 quantization is nearly lossless** — PPL 3.90 vs baseline 3.89 at 2x (+0.01), with GSM8K preserved (48.90% strict, 82.26% flexible).
632
+ 6. **INT4 quantization is acceptable** — PPL 4.51 at ~4x (+0.62 delta). GSM8K strict-match actually improves to 56.41%.
633
+ 7. **INT2 is catastrophic** — PPL 1533 at ~8x, completely unusable.
634
+ 8. **Offline methods degrade rapidly** — Per-layer neural: PPL=21 at 2x, PPL=425 at 4x, PPL=7950 at 8x. Stale-conditioning (uncompressed) helps at 2x (PPL=6.24) but collapses at 8x (PPL=424).
635
+ 9. **Below-baseline PPL** suggests E2E compressors act as regularizers, filtering noise from hidden states while preserving task-relevant information. Confirmed by GSM8K: E2E 2x scores 61.33% vs baseline 44.12%.
636
+ 10. **Downstream tasks are more sensitive than PPL** — Offline stale_uncomp_2x has PPL=6.24 (+2.36) but GSM8K drops from 44% to 3% strict-match. E2E methods maintain both PPL and GSM8K. See Section 6.4.
637
+ 11. **Offline compression destroys output format but partially preserves reasoning** — stale_uncomp_2x: 2.81% strict but 67.10% flexible-extract. E2E methods show no such gap (~0.3 pp).
638
+ 12. **Pretrained init (Task 6) dramatically improves E2E training** — Initializing from offline-trained weights (Tasks 3b/4b) instead of near-identity gives 13–45% PPL improvement and massive GSM8K gains. 6b at 2x achieves PPL=2.25 and 82.5% GSM8K strict-match (vs 5b: PPL=2.71, 60.3%). Even at 16x, 6b (PPL=3.47, GSM8K 25.9%) stays below baseline PPL (3.89) and retains meaningful downstream accuracy.
639
+ 13. **Pretrained init benefits grow with compression ratio** — For stale-conditioned (6b vs 5b): PPL improvement goes from 17% at 2x to 45% at 16x; GSM8K goes from +22 pp at 2x to +24 pp at 16x. The offline-trained weights provide a much better starting point for E2E optimization, especially at high compression where near-identity init struggles.
640
+ 14. **Split-mode training (Task 7) matches deployment reality** — Training with split-mode (router sees original, experts see decompressed) then evaluating in the same mode yields the best uncompressed-router results. 7b uncompressed at 2x achieves 83.3% GSM8K strict-match — the best result across all methods and modes.
641
+ 15. **7b uncompressed stays below baseline PPL at ALL ratios** — Even at 16x compression, 7b uncompressed PPL=3.27 remains below the no-compression baseline (3.89). This is the only method to maintain below-baseline PPL at every compression ratio, demonstrating that stale-conditioned split-mode E2E compressors can be simultaneously lossy (16x compression) and beneficial (regularization effect).
642
+ 16. **Split-mode training trades compressed-eval quality for uncompressed-eval quality** — 7a/7b compressed-eval PPL is worse than 6a/6b (e.g., 7a 16x compressed: 908 vs 6a: 8.49) because the model was not trained to have the router see decompressed data. But 7a/7b uncompressed-eval is better (7a 16x uncompressed: 6.64 vs 6a compressed: 8.49). This confirms the training mode should match the deployment mode.
643
+ 17. **Catastrophic collapse at extreme compression without stale** — 7a 16x compressed PPL=908 (vs 7a 16x uncompressed=6.64), showing that when per-layer compression is too lossy, correct routing (from original hidden states) becomes critical. Stale conditioning (7b) avoids this entirely: 7b 16x compressed=4.28, uncompressed=3.27.
644
+
645
+ ### 6.3 HF vs Megatron Comparison
646
+
647
+ **Note:** HF E2E results in this section are from an earlier training run. The HF E2E
648
+ weight files are no longer available in the current `results/05a_e2e_perlayer/` and
649
+ `results/05b_e2e_stale/` directories (only logs remain). The Megatron results are
650
+ from the current run and match the JSON files. The comparison below is preserved for
651
+ historical reference but the HF numbers cannot be independently verified from current data.
652
+
653
+ Both implementations use the same compressor architecture (Compressor + Decompressor / StaleDecompressor), the same model (Qwen3-30B-A3B-Instruct-2507), and the same training data (Dolci-Instruct-SFT). The key differences are in the distributed training strategy and model parallelism framework.
654
+
655
+ **Implementation differences:**
656
+
657
+ | Aspect | HuggingFace | Megatron |
658
+ |---|---|---|
659
+ | Framework | HF Transformers + `device_map="auto"` | Megatron-Core + AutoBridge |
660
+ | Parallelism | Naive layer sharding (sequential) | EP=4, TP=1, PP=1, DP=4 |
661
+ | GPU utilization | 1 GPU active at a time | All 4 GPUs active (DP) |
662
+ | Data parallelism | None (single data stream) | DP=4 (each rank sees 1/4 of data per step) |
663
+ | Optimizer | AdamW (single replica) | AdamW (replicated, gradients all-reduced) |
664
+ | CUDA | 12.6 | 12.9 |
665
+
666
+ **Task 5a — E2E per-layer (no stale):**
667
+
668
+ | Ratio | HF PPL | Megatron PPL | Gap (Meg−HF) |
669
+ |---|---|---|---|
670
+ | 2x | **2.645** (−1.58) | 2.682 (−1.54) | +0.04 |
671
+ | 4x | **3.687** (−0.54) | 4.410 (+0.19) | +0.72 |
672
+ | 8x | **6.371** (+2.15) | 8.182 (+3.96) | +1.81 |
673
+ | 16x | **9.157** (+4.93) | 11.670 (+7.44) | +2.51 |
674
+
675
+ **Task 5b — E2E stale-conditioned (uncompressed stale):**
676
+
677
+ | Ratio | HF PPL | Megatron PPL | Gap (Meg−HF) |
678
+ |---|---|---|---|
679
+ | 2x | 2.570 (−1.65) | **2.568** (−1.66) | −0.002 |
680
+ | 4x | **3.102** (−1.12) | 3.420 (−0.80) | +0.32 |
681
+ | 8x | **4.015** (−0.21) | 4.743 (+0.52) | +0.73 |
682
+ | 16x | **4.550** (+0.32) | 5.232 (+1.01) | +0.68 |
683
+
684
+ **Training losses (train / val):**
685
+
686
+ | Config | HF 5a | Megatron 5a | HF 5b | Megatron 5b |
687
+ |---|---|---|---|---|
688
+ | 2x | 1.215 / 1.093 | 1.258 / 1.109 | 1.193 / 1.070 | 1.210 / 1.068 |
689
+ | 4x | 1.786 / 1.447 | 2.103 / 1.627 | 1.579 / 1.286 | 1.784 / 1.375 |
690
+ | 8x | 2.412 / 2.004 | 2.776 / 2.242 | 1.921 / 1.555 | 2.206 / 1.724 |
691
+ | 16x | 2.768 / 2.326 | 3.180 / 2.567 | 2.069 / 1.686 | 2.344 / 1.823 |
692
+
693
+ **Analysis:**
694
+
695
+ 1. **At 2x, both implementations converge to the same quality.** The gap is negligible (0.04 for 5a, −0.002 for 5b). Near-identity initialization gives a strong starting point, and 2x compression is easy enough that both optimizers find similar solutions.
696
+
697
+ 2. **Megatron's gap grows at higher compression ratios for 5a** (no stale). At 4x the gap is +0.72, at 16x it's +2.51. The likely cause is that Megatron with DP=4 provides each rank with 1/4 of the data per step — effectively a noisier gradient estimate. HF's single-replica training sees the full data stream, leading to a slightly better optimizer trajectory for harder problems (higher compression).
698
+
699
+ 3. **Stale conditioning dramatically narrows the Megatron-HF gap.** Adding stale conditioning reduces the gap by 50–73% at all ratios:
700
+ - 4x: +0.72 → +0.32 (56% reduction)
701
+ - 8x: +1.81 → +0.73 (60% reduction)
702
+ - 16x: +2.51 → +0.68 (73% reduction)
703
+ The stale signal acts as an anchor that partially corrects for the noisier optimization — it provides a strong prior about the expected hidden state, reducing the difficulty of the decompression task.
704
+
705
+ 4. **Both Megatron variants produce usable compressors.** Megatron 5b at 4x (PPL=3.42) is still 19% below baseline, and even at 16x (PPL=5.23) the degradation is only +24%. For production deployment where Megatron's scalability is needed, these results are practical.
706
+
707
+ 5. **Recommendation:** Use Megatron with stale conditioning (5b mode) for production. At 2–4x compression, results match HF quality. At 8–16x, there is a modest quality gap, but Megatron's multi-node scalability and proper expert parallelism make it the right choice for large-scale deployment.
708
+
709
+ ### 6.4 Downstream Task Evaluation (GSM8K)
710
+
711
+ **Benchmark:** GSM8K chain-of-thought (gsm8k_cot), 8-shot, 1319 test examples.
712
+ Two metrics: **strict-match** (exact `#### <number>` format) and **flexible-extract**
713
+ (number extracted from anywhere in the output via regex).
714
+ Two router modes: **compressed** (router AND experts see decompressed hidden states)
715
+ and **uncompressed** (router sees original, experts see decompressed — more realistic EP
716
+ simulation). PPL, MSE, CosSim from HF-based evaluation (`model_utils.py`).
717
+ HF Strict/Flex from HF backend (lm-eval-harness, router-compressed mode).
718
+ vLLM columns from vLLM backend (`run_all_downstream.py`, both router modes).
719
+ For Tasks 7a/7b, vLLM Uncomp. columns show HF backend uncompressed-router results
720
+ (confirmed identical via both `run_all_downstream.py` and `run_e2e_compressor.py
721
+ --router-mode uncompressed`).
722
+
723
+ | Method | Ratio | MSE | CosSim | PPL | PPL Δ | HF Strict | HF Flex | vLLM Comp. Strict | vLLM Comp. Flex | vLLM Uncomp. Strict | vLLM Uncomp. Flex |
724
+ |---|---|---|---|---|---|---|---|---|---|---|---|
725
+ | Baseline | — | — | — | 3.89 | — | 44.1% | 82.8% | 43.3% | 82.9% | — | — |
726
+ | Quant INT8 | 2x | — | — | 3.90 | +0.01 | 48.9% | 82.3% | 43.7% | 82.2% | — | — |
727
+ | Quant INT4 | 4x | — | — | 4.51 | +0.62 | 56.4% | 68.5% | 46.8% | 65.4% | — | — |
728
+ | Quant INT2 | 8x | — | — | 1532.59 | +1528.70 | 0.0% | 0.0% | 0.0% | 0.0% | — | — |
729
+ | Neural (per-layer) | 2x | 0.0535 | 0.922 | 21.07 | +17.18 | 0.0% | 1.5% | 0.0% | 1.2% | 22.7% | 42.6% |
730
+ | Neural (per-layer) | 4x | 0.1073 | 0.835 | 425.75 | +421.87 | 0.0% | 0.0% | 0.0% | 0.4% | 1.0% | 2.4% |
731
+ | Neural (per-layer) | 8x | 0.1523 | 0.755 | 7949.78 | +7945.89 | 0.0% | 0.0% | 0.0% | 0.0% | 2.0% | 1.9% |
732
+ | Neural (per-layer) | 16x | 0.1893 | 0.683 | 52440.05 | +52436.16 | 0.0% | 0.0% | 0.0% | 0.0% | 1.5% | 1.5% |
733
+ | Stale-cond. (compressed) | 2x | 0.0379 | 0.947 | 6.13 | +2.24 | 3.4% | 62.6% | 0.2% | 0.8% | 34.1% | 69.7% |
734
+ | Stale-cond. (compressed) | 4x | 0.0876 | 0.869 | 31.64 | +27.75 | 0.6% | 1.5% | 0.0% | 0.6% | 2.7% | 4.9% |
735
+ | Stale-cond. (compressed) | 8x | 0.1330 | 0.791 | 2982.23 | +2978.34 | 0.0% | 0.0% | 0.0% | 0.0% | 1.3% | 1.8% |
736
+ | Stale-cond. (compressed) | 16x | 0.1720 | 0.717 | 17486.21 | +17482.32 | 0.0% | 0.0% | 0.0% | 0.0% | 1.8% | 2.0% |
737
+ | Stale-cond. (uncompressed) | 2x | 0.0346 | 0.952 | 6.24 | +2.36 | 2.8% | 67.1% | 0.2% | 1.1% | 30.7% | 72.6% |
738
+ | Stale-cond. (uncompressed) | 4x | 0.0690 | 0.900 | 16.11 | +12.22 | 1.0% | 6.1% | 0.0% | 0.6% | 6.1% | 9.3% |
739
+ | Stale-cond. (uncompressed) | 8x | 0.0966 | 0.855 | 423.68 | +419.79 | 0.0% | 0.0% | 0.0% | 0.0% | 1.2% | 2.5% |
740
+ | Stale-cond. (uncompressed) | 16x | 0.1173 | 0.819 | 3740.41 | +3736.53 | 0.0% | 0.0% | 0.0% | 0.0% | 1.4% | 2.0% |
741
+ | E2E per-layer (5a) | 2x | — | — | 2.77 | −1.17 | 61.3% | 61.6% | 61.5% | 61.6% | 52.4% | 59.6% |
742
+ | E2E per-layer (5a) | 4x | — | — | 4.28 | +0.35 | 20.7% | 21.3% | 21.2% | 22.4% | 11.0% | 12.9% |
743
+ | E2E per-layer (5a) | 8x | — | — | 7.49 | +3.55 | 1.8% | 2.1% | 0.0% | 0.0% | 0.0% | 0.0% |
744
+ | E2E per-layer (5a) | 16x | — | — | 11.26 | +7.33 | 0.9% | 2.7% | 0.0% | 0.0% | 0.0% | 0.1% |
745
+ | E2E stale (5b) | 2x | — | — | 2.71 | −1.23 | 60.3% | 60.7% | 61.3% | 61.6% | 53.2% | 61.2% |
746
+ | E2E stale (5b) | 4x | — | — | 3.61 | −0.33 | 31.5% | 32.4% | 33.0% | 33.2% | 18.6% | 22.1% |
747
+ | E2E stale (5b) | 8x | — | — | 4.98 | +1.04 | 4.9% | 5.0% | 3.4% | 4.3% | 0.2% | 2.4% |
748
+ | E2E stale (5b) | 16x | — | — | 6.34 | +2.41 | 2.1% | 2.3% | 0.0% | 0.2% | 0.0% | 0.1% |
749
+ | E2E pretrained per-layer (6a) | 2x | — | — | 2.41 | −1.53 | 80.0% | 80.1% | 80.1% | 80.0% | 80.6% | 80.8% |
750
+ | E2E pretrained per-layer (6a) | 4x | — | — | 3.18 | −0.76 | 55.0% | 55.2% | 52.8% | 52.9% | 43.3% | 43.9% |
751
+ | E2E pretrained per-layer (6a) | 8x | — | — | 4.52 | +0.58 | 17.0% | 17.0% | 13.5% | 14.0% | 6.7% | 7.6% |
752
+ | E2E pretrained per-layer (6a) | 16x | — | — | 7.34 | +3.40 | 2.3% | 2.3% | 0.3% | 1.1% | 1.1% | 2.1% |
753
+ | E2E pretrained stale (6b) | 2x | — | — | 2.25 | −1.69 | 82.5% | 82.6% | 82.0% | 82.3% | 83.9% | 84.0% |
754
+ | E2E pretrained stale (6b) | 4x | — | — | 2.57 | −1.37 | 64.4% | 64.5% | 71.0% | 71.1% | 68.8% | 68.9% |
755
+ | E2E pretrained stale (6b) | 8x | — | — | 3.04 | −0.90 | 45.8% | 45.9% | 37.6% | 37.6% | 24.3% | 24.3% |
756
+ | E2E pretrained stale (6b) | 16x | — | — | 3.47 | −0.47 | 25.9% | 25.9% | 18.7% | 18.7% | 9.0% | 9.6% |
757
+ | Split E2E per-layer (7a) | 2x | — | — | 2.58 | −1.31 | 79.9% | 80.0% | — | — | 79.5% | 79.7% |
758
+ | Split E2E per-layer (7a) | 4x | — | — | 3.72 | −0.17 | 42.1% | 42.2% | — | — | 51.6% | 51.8% |
759
+ | Split E2E per-layer (7a) | 8x | — | — | 6.43 | +2.54 | 4.9% | 5.5% | — | — | 18.5% | 18.7% |
760
+ | Split E2E per-layer (7a) | 16x | — | — | 908.20 | +904.31 | 0.0% | 0.5% | — | — | 2.0% | 2.5% |
761
+ | Split E2E stale (7b) | 2x | — | — | 2.34 | −1.55 | 80.7% | 80.7% | — | — | 83.3% | 83.4% |
762
+ | Split E2E stale (7b) | 4x | — | — | 2.80 | −1.09 | 65.8% | 66.0% | — | — | 70.7% | 70.7% |
763
+ | Split E2E stale (7b) | 8x | — | — | 3.37 | −0.51 | 35.6% | 35.6% | — | — | 47.2% | 47.2% |
764
+ | Split E2E stale (7b) | 16x | — | — | 4.28 | +0.39 | 16.5% | 16.7% | — | — | 27.1% | 27.1% |
765
+
766
+ Notes: HF = HF backend (router-compressed mode). vLLM Comp. = vLLM backend, router-compressed
767
+ (router+experts see decompressed). vLLM Uncomp. = vLLM backend, router-uncompressed (router sees
768
+ original, experts see decompressed — split forward). For Tasks 7a/7b, HF Strict/Flex = HF backend
769
+ with compressed router; vLLM Uncomp. = HF backend with uncompressed router (confirmed identical
770
+ results from both `run_all_downstream.py` and `run_e2e_compressor.py --router-mode uncompressed`).
771
+ Baseline and quantization have no split mode. PPL baseline: 3.89 (offline) / 3.94 (E2E). GSM8K
772
+ uses Megatron-trained weights for E2E methods. Task 7 PPL column shows compressed-router PPL.
773
+ Uncompressed-router results (confirmed identical via both original eval code path and
774
+ `run_e2e_compressor.py --router-mode uncompressed`):
775
+
776
+ | Ratio | 7a PPL | 7b PPL | Baseline PPL | 7a Strict | 7a Flex | 7b Strict | 7b Flex |
777
+ |-------|--------|--------|--------------|-----------|---------|-----------|---------|
778
+ | 2x | 2.38 | 2.23 | 3.89 | 79.5% | 79.7% | 83.3% | 83.4% |
779
+ | 4x | 3.08 | 2.53 | 3.89 | 51.6% | 51.8% | 70.7% | 70.7% |
780
+ | 8x | 4.18 | 2.89 | 3.89 | 18.5% | 18.7% | 47.2% | 47.2% |
781
+ | 16x | 6.64 | 3.27 | 3.89 | 2.0% | 2.5% | 27.1% | 27.1% |
782
+
783
+ **Key findings:**
784
+
785
+ 1. **E2E compression improves GSM8K over baseline.** Baseline strict-match is 44.12%.
786
+ E2E per-layer 2x achieves 61.33% (+17.2 pp) and E2E stale 2x achieves 60.27%
787
+ (+16.2 pp). This mirrors the below-baseline PPL effect — E2E compressors act as
788
+ regularizers that improve both perplexity and downstream task performance.
789
+
790
+ 2. **INT8 and INT4 quantization also improve strict-match.** INT8: 48.90% (+4.8 pp),
791
+ INT4: 56.41% (+12.3 pp). The flexible-extract gap is smaller (INT8: 82.26% vs
792
+ baseline 82.79%), suggesting quantization noise may regularize the strict output
793
+ format without hurting reasoning.
794
+
795
+ 3. **Offline methods catastrophically fail on generation tasks.** Per-layer neural
796
+ compressors score 0% strict-match at all ratios (even 2x, which has PPL=21.07).
797
+ Stale-conditioned 2x scores only 2.81% strict / 67.10% flexible. The flexible-extract
798
+ score reveals that the model still produces correct numerical answers but the output
799
+ format is destroyed — compression disrupts the learned generation patterns.
800
+
801
+ 4. **The strict-vs-flexible gap reveals a format disruption effect.** Offline methods
802
+ show huge gaps: stale_uncomp_2x has 2.81% strict but 67.10% flexible (64.3 pp gap).
803
+ E2E methods show almost no gap: e2e_2x has 61.33% strict vs 61.64% flexible (0.3 pp).
804
+ End-to-end training preserves both the model's reasoning ability AND its output
805
+ formatting, while offline compression preserves some reasoning but destroys formatting.
806
+
807
+ 5. **GSM8K is more sensitive than PPL to compression quality.** Stale_uncomp_2x has
808
+ PPL=6.24 (only +2.36 above baseline) yet scores 2.81% on GSM8K strict-match (vs
809
+ 44.12% baseline). E2E per-layer 4x has PPL=4.28 (only +0.35 above baseline) yet
810
+ drops to 20.70% GSM8K. Generation tasks amplify small distributional shifts that
811
+ PPL barely registers.
812
+
813
+ 6. **Stale conditioning matters for downstream tasks.** At 4x: E2E stale gets 31.54%
814
+ vs E2E per-layer 20.70% (+10.8 pp). At 8x: stale gets 4.93% vs per-layer 1.82%.
815
+ The stale signal helps preserve generation quality, consistent with PPL findings.
816
+
817
+ 7. **Pretrained init (Task 6) yields dramatic GSM8K improvements.** 6b stale at 2x
818
+ achieves 82.49% strict-match — nearly double baseline (44.12%) and +22 pp over 5b
819
+ (60.27%). 6a per-layer at 2x reaches 79.98% (+19 pp over 5a). Even at 8x, 6b retains
820
+ 45.79% (exceeding baseline) while 5b collapses to 4.93%.
821
+
822
+ 8. **Pretrained init enables useful compression at 16x.** 6b at 16x achieves 25.85%
823
+ GSM8K strict-match — down from baseline (44.12%) but still practically useful. Compare
824
+ with 5b at 16x (2.12%) or 5a at 16x (0.91%). Offline weights provide the optimizer
825
+ with a much better starting region of parameter space.
826
+
827
+ 9. **Best overall result: 6b at 2–4x compression.** 6b at 2x (PPL=2.25, GSM8K=82.5%)
828
+ and 4x (PPL=2.57, GSM8K=64.4%) both outperform baseline on PPL and at 4x still retain
829
+ strong downstream performance. This suggests stale-conditioned E2E compression with
830
+ pretrained init is a viable approach for reducing MoE communication by 2–4x with
831
+ minimal or even improved model quality.
832
+
833
+ ---
834
+
835
+ ## 7. Design Choices and Trade-offs
836
+
837
+ ### 7.1 Offline Independent Training vs End-to-End
838
+
839
+ **Offline training (Tasks 2–4)** trains compressors on cached hidden states, independently per layer:
840
+
841
+ | Aspect | Offline | End-to-End (Task 5) |
842
+ |---|---|---|
843
+ | Loss | MSE + cosine (reconstruction) | Cross-entropy (next-token prediction) |
844
+ | Optimization scope | Per-layer, independent | Joint, all 48 layers |
845
+ | Gradient flow | None through LLM | Through entire frozen LLM |
846
+ | Stale signal | Pre-computed, frozen | Live, gradients flow through |
847
+ | Model precision | Full BF16 (~60 GB, 1 GPU) | Full BF16 (~60 GB, 4 GPUs) |
848
+ | Training cost | Minutes per layer | Hours for all layers + ratios |
849
+ | Error compounding | Not accounted for | Naturally optimized via global loss |
850
+
851
+ **Offline advantages:**
852
+ - Fast and cheap (minutes per layer on a single GPU)
853
+ - No need to backpropagate through the full LLM
854
+ - Each layer's compressor can be trained in parallel
855
+
856
+ **Offline limitations (addressed by e2e):**
857
+ - Compressors cannot adapt to how their reconstruction errors compound across layers. A small error at layer 0 may shift the hidden state distribution at layer 1, but layer 1's compressor was trained on the *original* layer-1 distribution.
858
+ - No joint optimization means the system cannot learn to allocate more capacity to layers where errors are most harmful.
859
+ - The stale signal used during offline training is the *unperturbed* reference input, but during inference the reference layer itself is compressed, creating a train-inference mismatch.
860
+
861
+ **E2E advantages:**
862
+ - Compressors are optimized for the actual downstream impact of compression on model quality.
863
+ - Joint optimization: the system implicitly learns which layers need higher fidelity.
864
+ - Stale gradients flow: reference layer compressors are optimized for their dual role (own reconstruction + stale side information for downstream layers). The stale signal during training already reflects upstream compression artifacts, eliminating the train-inference mismatch.
865
+
866
+ **E2E limitations:**
867
+ - Requires full-precision model in memory for proper gradient flow (~60 GB across 4 GPUs).
868
+ - Training is slower (full forward + backward through 48 frozen transformer layers per step).
869
+ - More hyperparameter-sensitive (LR, warmup, gradient clipping matter more).
870
+
871
+ ### 7.2 Linear vs Non-linear Compressors
872
+
873
+ All compressors are single-layer linear networks (no activation functions). This was a deliberate choice:
874
+ - Linear compressors are equivalent to learning an optimal projection/reconstruction pair (related to PCA)
875
+ - They are fast to train and apply (single matrix multiply)
876
+ - They establish a clean baseline before trying non-linear architectures
877
+
878
+ ### 7.3 Loss Function
879
+
880
+ The combined `MSE + 0.1 × (1 - cos_sim)` loss was chosen because:
881
+ - MSE alone can be dominated by outlier values (which are common in later layers with kurtosis up to 81K)
882
+ - Cosine similarity preserves the direction of the hidden state vector, which matters more than exact magnitude for downstream attention and expert computations
883
+ - The 0.1 weighting keeps MSE as the primary objective while regularizing directions
884
+
885
+ ### 7.4 Reference Layer Stride
886
+
887
+ The stride of 12 (giving reference layers {0, 12, 24, 36}) was chosen as a balance:
888
+ - More reference layers (smaller stride) → better stale signals but more communication (ref layers use standard compression without stale)
889
+ - Fewer reference layers (larger stride) → stale signals become less correlated with non-ref layers
890
+ - stride=12 gives 4 reference layers covering 48 layers, with each non-ref layer at most 11 layers away from its reference
891
+
892
+ ### 7.5 Training Data Size
893
+
894
+ 100,000 tokens per layer (increased from initial 10,000). Each token produces a 2048-dim vector, so training data per layer is 100K × 2048 = 204.8M values. This is sufficient for learning a linear map with ~4M parameters (2x compression, per-layer).
895
+
896
+ ### 7.6 Model Precision
897
+
898
+ All tasks use the same model in full BF16 precision (no weight quantization). This ensures:
899
+ - Hidden states used for offline training exactly match inference conditions
900
+ - End-to-end training has proper gradient flow through frozen layers
901
+ - All methods share the same baseline perplexity, enabling direct comparison
902
+ - 4-bit NF4 quantization is available via `--load-in-4bit` but is not the default
903
+
904
+ ---
905
+
906
+ ## 8. Implementation Details
907
+
908
+ ### 8.1 Hook-Based Evaluation and Training
909
+
910
+ Four hook modes are used across experiments:
911
+
912
+ | Mode | Hook type | Used in |
913
+ |---|---|---|
914
+ | `evaluate_perplexity_with_compression` | Same compress/decompress for all layers | Shared compressor (Task 3) |
915
+ | `evaluate_perplexity_with_perlayer_compression` | Per-layer compress/decompress dicts | Per-layer compressor (Task 3b) |
916
+ | `evaluate_perplexity_with_stale_compression` | Per-layer + stale cache + ref/non-ref split | Stale-conditioned (Tasks 4a/4b) |
917
+ | `E2ECompressorManager.register_hooks()` | Per-layer, trainable, with/without stale cache | E2E training + eval (Task 5) |
918
+
919
+ The stale evaluation maintains a `stale_cache` dictionary that is populated by reference layer pre-hooks and read by subsequent non-reference layer hooks. This works because PyTorch processes layers sequentially (layer 0 before layer 1, etc.).
920
+
921
+ **Device safety in evaluation hooks:** With `device_map="auto"`, model layers may reside on
922
+ different GPUs. All evaluation hooks in `model_utils.py` (`evaluate_perplexity_with_perlayer_compression`
923
+ and `evaluate_perplexity_with_stale_compression`) explicitly call `.to(x.device)` on
924
+ compressor/decompressor outputs before returning them to the model. This ensures correctness
925
+ when compressor weights and MoE layers are on different devices.
926
+
927
+ **E2E training hooks (Task 5)** differ from evaluation hooks in two ways:
928
+ 1. Compressor/decompressor parameters have `requires_grad=True`, so the autograd graph is maintained through the hooks.
929
+ 2. For stale mode (5b), the cached stale signal is **not detached** — gradients flow through the stale path to earlier layers, enabling true end-to-end optimization.
930
+
931
+ ### 8.2 MoE Layer Detection
932
+
933
+ `find_moe_layers()` in `model_utils.py` detects MoE modules by:
934
+ 1. Checking if the class name contains "Moe", "MoE", or "SparseMoe"
935
+ 2. Checking for `experts` attribute
936
+ 3. Checking for both `gate` and `experts` attributes
937
+
938
+ This is model-agnostic and works for Qwen3, Mixtral, and other MoE architectures.
939
+
940
+ ### 8.3 File Organization
941
+
942
+ **Offline experiments (Tasks 1–4)** follow the same pattern:
943
+ 1. Load cached hidden states from `data/hidden_states/`
944
+ 2. Train compressors on dispatch states
945
+ 3. Evaluate reconstruction metrics (offline, on cached data)
946
+ 4. Load the full model and evaluate perplexity (online, with hooks)
947
+ 5. Save results to `results/{experiment}/`
948
+
949
+ **End-to-end experiments (Task 5)** follow a different pattern:
950
+ 1. Load the full model in BF16 across 4 GPUs
951
+ 2. Load and tokenize training data (Dolci-Instruct-SFT)
952
+ 3. For each compression ratio: create compressor manager, train e2e, save weights
953
+ 4. Evaluate perplexity on Dolci-Instruct-SFT (with hooks, same as offline)
954
+ 5. Save results to `results/05{a,b}_e2e_{perlayer,stale}/`
955
+
956
+ Bash wrappers in `scripts/` handle environment setup, module loading, and argument passing.
957
+
958
+ ### 8.4 Progress Tracking and Logging
959
+
960
+ All long-running loops use `tqdm` progress bars (written to stderr) for real-time progress monitoring with elapsed time and ETA. Key loops instrumented:
961
+
962
+ - **Training loops:** Epoch progress with loss/cosine postfix (all training functions)
963
+ - **Layer loops:** Per-layer training iteration (Tasks 3b, 4a/4b)
964
+ - **Data loading:** Calibration data and tokenization progress
965
+ - **Evaluation:** Perplexity evaluation sequence progress, quantization config iteration
966
+ - **Ratio loops:** Outer compression ratio iteration (all tasks)
967
+
968
+ Each bash script redirects output to two log files in the task's output directory:
969
+
970
+ | File | Contents | Source |
971
+ |---|---|---|
972
+ | `run.log` | Full output (print statements, results, summaries) | stdout |
973
+ | `progress.log` | tqdm progress bars (elapsed time, ETA, loss metrics) | stderr |
974
+
975
+ Monitor progress of a running experiment: `tail -f results/<task>/progress.log`
976
+
977
+ ---
978
+
979
+ ## 9. Reproducibility
980
+
981
+ ### 9.1 Software Environment
982
+
983
+ - Python 3.11
984
+ - PyTorch (via `pip install torch` with CUDA 12.6)
985
+ - Transformers (HuggingFace)
986
+ - bitsandbytes (optional, for 4-bit model loading)
987
+ - datasets (for allenai/Dolci-Instruct-SFT)
988
+ - matplotlib, numpy
989
+
990
+ ### 9.2 Hardware
991
+
992
+ - NVIDIA H100 80 GB GPUs (8 available)
993
+ - Tasks 1–4: single GPU sufficient (model in full BF16, ~60 GB on one H100 80 GB)
994
+ - Task 5: 4 GPUs per job (model in full BF16, ~60 GB + backprop memory); 05a and 05b run in parallel on GPUs 0-3 and 4-7
995
+ - 500+ GB system RAM (required for loading ~37 GB of hidden states for offline tasks)
996
+ - Compute Canada cluster
997
+
998
+ ### 9.3 Random Seeds and Data Splitting
999
+
1000
+ All experiments use **seed=42** for reproducibility. A deterministic 80/10/10
1001
+ train/val/test split of the Dolci-Instruct-SFT dataset rows is computed via
1002
+ `get_split_indices()` in `model_utils.py`:
1003
+
1004
+ ```python
1005
+ rng = random.Random(42)
1006
+ indices = list(range(dataset_size))
1007
+ rng.shuffle(indices)
1008
+ # 80% train, 10% val, 10% test
1009
+ ```
1010
+
1011
+ **Split consistency across tasks:**
1012
+ - Task 1 hidden state collection: TRAIN split (max_samples=10000)
1013
+ - Tasks 2–4 offline training: uses cached hidden states from Task 1 (TRAIN split)
1014
+ - Tasks 2–4 PPL evaluation: TEST split (max_samples_ppl=50000, response-only)
1015
+ - Task 5 E2E training: TRAIN split (500K sequences HF / 100K Megatron, SFT mode)
1016
+ - Task 5 E2E validation: VAL split sequences (SFT mode)
1017
+ - Task 5 PPL evaluation: TEST split (same as tasks 2–4, response-only)
1018
+
1019
+ **SFT data loading (Task 5 and PPL evaluation):**
1020
+ - Each conversation is tokenized independently (one sample = one sequence)
1021
+ - Labels are -100 for non-assistant tokens, actual token IDs for assistant responses
1022
+ - `_tokenize_sft_sample()` in `model_utils.py` finds assistant token boundaries
1023
+ via incremental prefix tokenization of the chat template
1024
+ - Max sequence length: 2048 (configurable via `--max-length`)
1025
+ - Loss and perplexity are computed on response tokens only
1026
+
1027
+ Additional seed setting in Task 5:
1028
+ - `random.seed(42)`, `np.random.seed(42)`, `torch.manual_seed(42)`,
1029
+ `torch.cuda.manual_seed_all(42)` at start of main()
1030
+ - DataLoader shuffling uses PyTorch's seeded RNG
1031
+
1032
+ ### 9.4 Experiment Tracking (Wandb)
1033
+
1034
+ Both HF and Megatron E2E scripts support Weights & Biases logging:
1035
+
1036
+ - **CLI:** `--wandb` / `--no-wandb`, `--wandb-project <name>`
1037
+ - **Logged metrics:** `train/loss` and `train/lr` per optimizer step,
1038
+ `val/loss` every `--val-interval` steps (default 2500) and at end of epoch,
1039
+ `train/epoch_loss` per epoch
1040
+ - **Projects:** `ecmoe-e2e` (HF), `ecmoe-megatron-e2e` (Megatron)
1041
+ - **Default:** Enabled in bash scripts via `WANDB_FLAG`; disable with
1042
+ `WANDB_FLAG="--no-wandb" bash scripts/05_run_e2e_compressor.sh none`
1043
+ - Megatron: only rank 0 logs to wandb
1044
+ - Megatron `train/loss` and `train/epoch_loss` are DP-averaged (all-reduced across
1045
+ data-parallel ranks) before logging, so wandb values reflect the true global loss
1046
+ - Graceful fallback if wandb is not installed (HAS_WANDB flag)
1047
+
1048
+ ---
1049
+
1050
+ ## 10. Task 8: EP Communication Compression in vLLM
1051
+
1052
+ ### 10.1 Motivation
1053
+
1054
+ Tasks 5–7 evaluate compression quality using PyTorch hooks that compress and decompress
1055
+ on the **same GPU** — simulating the quality impact but not achieving actual communication
1056
+ reduction. In real expert parallelism, the pipeline is:
1057
+
1058
+ 1. Router computes logits from **original** hidden states (attention GPU)
1059
+ 2. **Compressor** runs on attention GPU: `hidden_dim` → `bottleneck_dim`
1060
+ 3. All-to-all dispatch sends only the **compressed** tensor (reduced volume!)
1061
+ 4. **Decompressor** runs on expert GPU: `bottleneck_dim` → `hidden_dim`
1062
+ 5. Experts compute on decompressed states
1063
+
1064
+ Task 8 modifies vLLM's `FusedMoE.forward_impl()` to implement this pipeline,
1065
+ compressing BEFORE dispatch and decompressing AFTER.
1066
+
1067
+ ### 10.2 Implementation
1068
+
1069
+ **Patched vLLM (`scripts/patch_vllm_fused_moe.py`):** Adds ~12 lines to
1070
+ `FusedMoE.forward_impl()` at three locations:
1071
+
1072
+ 1. **Compress before dispatch (EP mode):** `_ecmoe_compress_fn(hidden_states)` →
1073
+ dispatches compressed tensor instead of full hidden_dim.
1074
+ 2. **Decompress after dispatch (EP mode):** After `get_ep_group().dispatch()`,
1075
+ `_ecmoe_decompress_fn(hidden_states_combined)` restores full hidden_dim.
1076
+ 3. **Single-GPU fallback:** When `do_naive_dispatch_combine=False` (TP=1),
1077
+ applies compress→decompress in-place for simulation mode.
1078
+
1079
+ When `_ecmoe_compress_fn` is None (default), behavior is identical to stock vLLM.
1080
+
1081
+ **EP-aware registration (`src/vllm_ep_compression.py`):** Uses `apply_model()`
1082
+ to set compress/decompress functions on each FusedMoE instance:
1083
+
1084
+ - **Per-layer:** `register_ep_perlayer()` — Independent linear compress/decompress per layer.
1085
+ - **Stale-conditioned:** `register_ep_stale()` — Reference layers piggyback stale signal
1086
+ on compressed tensor before dispatch. Non-reference layers dispatch only compressed data.
1087
+
1088
+ ### 10.3 Stale Broadcast via Dispatch Piggybacking
1089
+
1090
+ **Reference layers (0, 12, 24, 36):**
1091
+ - compress_fn: `cat(compressed[B, bottleneck], stale[B, stale_dim])` → dispatch `[B, bottleneck + stale_dim]`
1092
+ - decompress_fn: split → cache stale_part globally → decompress compressed_part
1093
+
1094
+ **Non-reference layers (all others):**
1095
+ - compress_fn: `compressed[B, bottleneck]` only → dispatch `[B, bottleneck]` (maximum compression!)
1096
+ - decompress_fn: retrieve cached stale → `cat(compressed, cached_stale)` → StaleDecomp
1097
+
1098
+ **Correctness:** vLLM's default `all2all_backend=allgather_reducescatter` means after
1099
+ dispatch, every rank has ALL tokens in consistent ordering. Stale cached from reference
1100
+ layers matches token ordering at non-reference layers.
1101
+
1102
+ ### 10.4 Communication Savings
1103
+
1104
+ | Mode | Ref layers (4/48) | Non-ref layers (44/48) | Weighted avg | vs baseline 2048 |
1105
+ |------|-------------------|----------------------|--------------|-------------------|
1106
+ | perlayer 2x | 1024 | 1024 | 1024 | **50% saving** |
1107
+ | perlayer 4x | 512 | 512 | 512 | **75% saving** |
1108
+ | stale(comp) 4x | 1024 | 512 | 555 | **73% saving** |
1109
+ | stale(uncomp) 4x | 2560 | 512 | 683 | **67% saving** |
1110
+ | stale(uncomp) 2x | 3072 | 1024 | 1195 | **42% saving** |
1111
+
1112
+ Stale broadcast cost is amortized over ~11 non-reference layers per reference layer.
1113
+
1114
+ ### 10.5 Evaluation Modes
1115
+
1116
+ - **simulation** (`--mode simulation`): Single-GPU (TP=1), no dispatch/combine.
1117
+ Validates numerical correctness against existing split-mode results.
1118
+ - **ep** (`--mode ep`): Multi-GPU (TP=4, `enable_expert_parallel=True`).
1119
+ Uses actual EP dispatch/combine with compressed tensors.
1120
+
1121
+ Both use Task 7a/7b weights (split-mode E2E trained) from
1122
+ `results/07a_megatron_e2e_split_perlayer/` and `results/07b_megatron_e2e_split_stale/`.
docker-compose.yml ADDED
@@ -0,0 +1,6 @@
 
 
 
 
 
 
 
1
+ services:
2
+ flight-search:
3
+ build: .
4
+ ports:
5
+ - "8080:8080"
6
+ restart: unless-stopped
frontend/eslint.config.js ADDED
@@ -0,0 +1,23 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import js from '@eslint/js'
2
+ import globals from 'globals'
3
+ import reactHooks from 'eslint-plugin-react-hooks'
4
+ import reactRefresh from 'eslint-plugin-react-refresh'
5
+ import tseslint from 'typescript-eslint'
6
+ import { defineConfig, globalIgnores } from 'eslint/config'
7
+
8
+ export default defineConfig([
9
+ globalIgnores(['dist']),
10
+ {
11
+ files: ['**/*.{ts,tsx}'],
12
+ extends: [
13
+ js.configs.recommended,
14
+ tseslint.configs.recommended,
15
+ reactHooks.configs.flat.recommended,
16
+ reactRefresh.configs.vite,
17
+ ],
18
+ languageOptions: {
19
+ ecmaVersion: 2020,
20
+ globals: globals.browser,
21
+ },
22
+ },
23
+ ])
frontend/index.html ADDED
@@ -0,0 +1,13 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!doctype html>
2
+ <html lang="en">
3
+ <head>
4
+ <meta charset="UTF-8" />
5
+ <link rel="icon" type="image/svg+xml" href="/vite.svg" />
6
+ <meta name="viewport" content="width=device-width, initial-scale=1.0" />
7
+ <title>Flight Search</title>
8
+ </head>
9
+ <body>
10
+ <div id="root"></div>
11
+ <script type="module" src="/src/main.tsx"></script>
12
+ </body>
13
+ </html>
frontend/package-lock.json ADDED
The diff for this file is too large to render. See raw diff
 
frontend/package.json ADDED
@@ -0,0 +1,33 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "name": "frontend",
3
+ "private": true,
4
+ "version": "0.0.0",
5
+ "type": "module",
6
+ "scripts": {
7
+ "dev": "vite",
8
+ "build": "tsc -b && vite build",
9
+ "lint": "eslint .",
10
+ "preview": "vite preview"
11
+ },
12
+ "dependencies": {
13
+ "react": "^19.2.0",
14
+ "react-dom": "^19.2.0",
15
+ "react-router-dom": "^7.13.1"
16
+ },
17
+ "devDependencies": {
18
+ "@eslint/js": "^9.39.1",
19
+ "@tailwindcss/vite": "^4.2.1",
20
+ "@types/node": "^24.10.1",
21
+ "@types/react": "^19.2.7",
22
+ "@types/react-dom": "^19.2.3",
23
+ "@vitejs/plugin-react": "^5.1.1",
24
+ "eslint": "^9.39.1",
25
+ "eslint-plugin-react-hooks": "^7.0.1",
26
+ "eslint-plugin-react-refresh": "^0.4.24",
27
+ "globals": "^16.5.0",
28
+ "tailwindcss": "^4.2.1",
29
+ "typescript": "~5.9.3",
30
+ "typescript-eslint": "^8.48.0",
31
+ "vite": "^7.3.1"
32
+ }
33
+ }
frontend/public/vite.svg ADDED
frontend/src/App.css ADDED
@@ -0,0 +1,42 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #root {
2
+ max-width: 1280px;
3
+ margin: 0 auto;
4
+ padding: 2rem;
5
+ text-align: center;
6
+ }
7
+
8
+ .logo {
9
+ height: 6em;
10
+ padding: 1.5em;
11
+ will-change: filter;
12
+ transition: filter 300ms;
13
+ }
14
+ .logo:hover {
15
+ filter: drop-shadow(0 0 2em #646cffaa);
16
+ }
17
+ .logo.react:hover {
18
+ filter: drop-shadow(0 0 2em #61dafbaa);
19
+ }
20
+
21
+ @keyframes logo-spin {
22
+ from {
23
+ transform: rotate(0deg);
24
+ }
25
+ to {
26
+ transform: rotate(360deg);
27
+ }
28
+ }
29
+
30
+ @media (prefers-reduced-motion: no-preference) {
31
+ a:nth-of-type(2) .logo {
32
+ animation: logo-spin infinite 20s linear;
33
+ }
34
+ }
35
+
36
+ .card {
37
+ padding: 2em;
38
+ }
39
+
40
+ .read-the-docs {
41
+ color: #888;
42
+ }
frontend/src/App.tsx ADDED
@@ -0,0 +1,16 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import { BrowserRouter, Route, Routes } from 'react-router-dom';
2
+ import Header from './components/shared/Header';
3
+ import SearchPage from './pages/SearchPage';
4
+ import ResultsPage from './pages/ResultsPage';
5
+
6
+ export default function App() {
7
+ return (
8
+ <BrowserRouter>
9
+ <Header />
10
+ <Routes>
11
+ <Route path="/" element={<SearchPage />} />
12
+ <Route path="/results" element={<ResultsPage />} />
13
+ </Routes>
14
+ </BrowserRouter>
15
+ );
16
+ }
frontend/src/api/client.ts ADDED
@@ -0,0 +1,39 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import type { AutocompleteResult, CalendarResponse, SearchRequest, SearchResponse } from './types';
2
+
3
+ const BASE_URL = '/api';
4
+
5
+ async function fetchJson<T>(url: string, init?: RequestInit): Promise<T> {
6
+ const res = await fetch(url, init);
7
+ if (!res.ok) {
8
+ const text = await res.text();
9
+ throw new Error(`API error ${res.status}: ${text}`);
10
+ }
11
+ return res.json();
12
+ }
13
+
14
+ export async function searchAirports(query: string): Promise<AutocompleteResult[]> {
15
+ if (!query || query.length < 1) return [];
16
+ return fetchJson<AutocompleteResult[]>(
17
+ `${BASE_URL}/airports/autocomplete?q=${encodeURIComponent(query)}`
18
+ );
19
+ }
20
+
21
+ export async function searchFlights(req: SearchRequest): Promise<SearchResponse> {
22
+ return fetchJson<SearchResponse>(`${BASE_URL}/search`, {
23
+ method: 'POST',
24
+ headers: { 'Content-Type': 'application/json' },
25
+ body: JSON.stringify(req),
26
+ });
27
+ }
28
+
29
+ export async function getCalendar(
30
+ origin: string,
31
+ destination: string,
32
+ year: number,
33
+ month: number,
34
+ cabinClass: string = 'economy'
35
+ ): Promise<CalendarResponse> {
36
+ return fetchJson<CalendarResponse>(
37
+ `${BASE_URL}/calendar?origin=${origin}&destination=${destination}&year=${year}&month=${month}&cabin_class=${cabinClass}`
38
+ );
39
+ }
frontend/src/api/types.ts ADDED
@@ -0,0 +1,90 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ export type CabinClass = 'economy' | 'premium_economy' | 'business' | 'first';
2
+ export type TripType = 'one_way' | 'round_trip' | 'multi_city';
3
+ export type SortBy = 'best' | 'cheapest' | 'fastest';
4
+
5
+ export interface AutocompleteResult {
6
+ iata: string;
7
+ name: string;
8
+ city_name: string;
9
+ country: string;
10
+ display_name: string;
11
+ hub_score: number;
12
+ }
13
+
14
+ export interface FlightSegment {
15
+ airline_code: string;
16
+ airline_name: string;
17
+ flight_number: string;
18
+ aircraft: string;
19
+ origin: string;
20
+ origin_city: string;
21
+ destination: string;
22
+ destination_city: string;
23
+ departure: string;
24
+ arrival: string;
25
+ duration_minutes: number;
26
+ }
27
+
28
+ export interface FlightOffer {
29
+ id: string;
30
+ segments: FlightSegment[];
31
+ total_duration_minutes: number;
32
+ stops: number;
33
+ price_usd: number;
34
+ cabin_class: CabinClass;
35
+ origin: string;
36
+ destination: string;
37
+ departure: string;
38
+ arrival: string;
39
+ }
40
+
41
+ export interface SearchLeg {
42
+ origin: string;
43
+ destination: string;
44
+ date: string; // YYYY-MM-DD
45
+ }
46
+
47
+ export interface Passengers {
48
+ adults: number;
49
+ children: number;
50
+ infants: number;
51
+ }
52
+
53
+ export interface Filters {
54
+ max_stops?: number | null;
55
+ max_price?: number | null;
56
+ max_duration_minutes?: number | null;
57
+ airlines?: string[] | null;
58
+ departure_time_min?: string | null;
59
+ departure_time_max?: string | null;
60
+ }
61
+
62
+ export interface SearchRequest {
63
+ trip_type: TripType;
64
+ legs: SearchLeg[];
65
+ passengers: Passengers;
66
+ cabin_class: CabinClass;
67
+ filters: Filters;
68
+ sort_by: SortBy;
69
+ }
70
+
71
+ export interface SearchResponse {
72
+ outbound_flights: FlightOffer[];
73
+ return_flights: FlightOffer[];
74
+ search_id: string;
75
+ origin: string;
76
+ destination: string;
77
+ }
78
+
79
+ export interface CalendarDay {
80
+ date: string;
81
+ cheapest_price: number | null;
82
+ }
83
+
84
+ export interface CalendarResponse {
85
+ origin: string;
86
+ destination: string;
87
+ year: number;
88
+ month: number;
89
+ days: CalendarDay[];
90
+ }
frontend/src/assets/react.svg ADDED
frontend/src/components/results/FilterPanel.tsx ADDED
@@ -0,0 +1,133 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import { useMemo } from 'react';
2
+ import type { Filters, FlightOffer } from '../../api/types';
3
+
4
interface Props {
  flights: FlightOffer[];
  filters: Filters;
  onChange: (f: Filters) => void;
}

/**
 * Filter sidebar for the results page: stops, max price, departure-time
 * window and airline selection. Fully controlled — every interaction is
 * reported through `onChange` with a fresh Filters object.
 */
export default function FilterPanel({ flights, filters, onChange }: Props) {
  // Distinct airlines appearing in any segment, sorted by display name.
  const airlines = useMemo(() => {
    const byCode = new Map<string, string>();
    for (const offer of flights) {
      for (const seg of offer.segments) {
        byCode.set(seg.airline_code, seg.airline_name);
      }
    }
    return [...byCode.entries()].sort((a, b) => a[1].localeCompare(b[1]));
  }, [flights]);

  // Highest stop count among the results (0 when the list is empty).
  const maxStopsAvailable = useMemo(
    () => flights.reduce((acc, offer) => Math.max(acc, offer.stops), 0),
    [flights],
  );

  const stopLabel = (choice: number | null): string => {
    if (choice === null) return 'Any';
    if (choice === 0) return 'Nonstop only';
    return `Up to ${choice} stop${choice > 1 ? 's' : ''}`;
  };

  // Airline checkboxes encode "all selected" as null. The first deselection
  // materialises the full list minus the clicked code; deselecting the last
  // one, or re-selecting all of them, collapses back to null ("any").
  const toggleAirline = (code: string, currentlySelected: boolean) => {
    let next: string[] | null;
    if (!filters.airlines) {
      next = airlines.filter(([c]) => c !== code).map(([c]) => c);
    } else if (currentlySelected) {
      next = filters.airlines.filter(c => c !== code);
      if (next.length === 0) next = null;
    } else {
      next = [...filters.airlines, code];
      if (next.length === airlines.length) next = null;
    }
    onChange({ ...filters, airlines: next });
  };

  return (
    <div className="space-y-6" data-testid="filter-panel">
      {/* Stops filter */}
      <div>
        <h3 className="mb-2 text-sm font-medium text-gray-900">Stops</h3>
        <div className="space-y-1">
          {[null, 0, 1, 2]
            .filter(choice => choice === null || choice <= maxStopsAvailable)
            .map(choice => (
              <label key={String(choice)} className="flex items-center gap-2 cursor-pointer">
                <input
                  type="radio"
                  name="stops"
                  checked={filters.max_stops === choice}
                  onChange={() => onChange({ ...filters, max_stops: choice })}
                  className="accent-[#1a73e8]"
                  data-testid={`filter-stops-${choice === null ? 'any' : choice}`}
                />
                <span className="text-sm text-gray-700">{stopLabel(choice)}</span>
              </label>
            ))}
        </div>
      </div>

      {/* Price filter */}
      <div>
        <h3 className="mb-2 text-sm font-medium text-gray-900">Max price</h3>
        <div className="flex items-center gap-2">
          <span className="text-sm text-gray-500">$</span>
          <input
            type="number"
            value={filters.max_price ?? ''}
            onChange={event =>
              onChange({
                ...filters,
                max_price: event.target.value ? Number(event.target.value) : null,
              })
            }
            placeholder="Any"
            className="w-24 rounded border border-gray-300 px-2 py-1 text-sm focus:border-[#1a73e8] focus:outline-none"
            min={0}
            data-testid="filter-max-price"
          />
        </div>
      </div>

      {/* Departure time filter */}
      <div>
        <h3 className="mb-2 text-sm font-medium text-gray-900">Departure time</h3>
        <div className="flex items-center gap-2">
          <input
            type="time"
            value={filters.departure_time_min ?? ''}
            onChange={event =>
              onChange({ ...filters, departure_time_min: event.target.value || null })
            }
            className="rounded border border-gray-300 px-2 py-1 text-sm focus:border-[#1a73e8] focus:outline-none"
            data-testid="filter-dep-time-min"
          />
          <span className="text-xs text-gray-500">to</span>
          <input
            type="time"
            value={filters.departure_time_max ?? ''}
            onChange={event =>
              onChange({ ...filters, departure_time_max: event.target.value || null })
            }
            className="rounded border border-gray-300 px-2 py-1 text-sm focus:border-[#1a73e8] focus:outline-none"
            data-testid="filter-dep-time-max"
          />
        </div>
      </div>

      {/* Airlines filter (only shown when there is something to choose) */}
      {airlines.length > 1 && (
        <div>
          <h3 className="mb-2 text-sm font-medium text-gray-900">Airlines</h3>
          <div className="max-h-48 space-y-1 overflow-y-auto">
            {airlines.map(([code, name]) => {
              const selected = !filters.airlines || filters.airlines.includes(code);
              return (
                <label key={code} className="flex items-center gap-2 cursor-pointer">
                  <input
                    type="checkbox"
                    checked={selected}
                    onChange={() => toggleAirline(code, selected)}
                    className="accent-[#1a73e8]"
                    data-testid={`filter-airline-${code}`}
                  />
                  <span className="text-sm text-gray-700">{name} ({code})</span>
                </label>
              );
            })}
          </div>
        </div>
      )}

      {/* Clear all */}
      <button
        onClick={() =>
          onChange({
            max_stops: null, max_price: null, max_duration_minutes: null,
            airlines: null, departure_time_min: null, departure_time_max: null,
          })
        }
        className="text-sm text-[#1a73e8] hover:underline cursor-pointer"
        data-testid="filter-clear"
      >
        Clear all filters
      </button>
    </div>
  );
}
frontend/src/components/results/FlightCard.tsx ADDED
@@ -0,0 +1,112 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import { useState } from 'react';
2
+ import type { FlightOffer } from '../../api/types';
3
+ import { formatDuration, formatPrice, formatStops, formatTime } from '../../utils/format';
4
+ import FlightSegmentView from './FlightSegment';
5
+
6
interface Props {
  flight: FlightOffer;
}

/**
 * One result row: airline badge, departure/arrival times, duration, stops
 * and price. Connecting flights (stops > 0) expand on click to show each
 * segment and its layover airport.
 */
export default function FlightCard({ flight }: Props) {
  const [expanded, setExpanded] = useState(false);
  const firstSeg = flight.segments[0];

  // Overnight ("+1") indicator: compare the date portion of the raw
  // timestamp strings. Parsing with `new Date()` first would convert both
  // to the browser's local timezone, which can mislabel flights whose
  // timestamps carry an airport-local UTC offset. Assumes the API returns
  // ISO-style strings starting with YYYY-MM-DD — confirm against backend.
  const dayDiff = flight.departure.slice(0, 10) !== flight.arrival.slice(0, 10);

  return (
    <div
      className="rounded-lg border border-gray-200 bg-white hover:shadow-md transition-shadow cursor-pointer"
      onClick={() => flight.stops > 0 && setExpanded(!expanded)}
      data-testid={`flight-card-${flight.id}`}
      role="article"
      aria-label={`Flight from ${flight.origin} to ${flight.destination}, ${formatPrice(flight.price_usd)}`}
    >
      <div className="flex items-center gap-4 p-4">
        {/* Airline badge */}
        <div className="flex h-10 w-10 items-center justify-center rounded-lg bg-gray-100 text-xs font-bold text-gray-600 flex-shrink-0" data-testid="airline-badge">
          {firstSeg.airline_code}
        </div>

        {/* Main info */}
        <div className="flex flex-1 items-center gap-6">
          {/* Times */}
          <div className="flex items-center gap-3 flex-1">
            <div className="text-right">
              <div className="text-lg font-medium" data-testid="departure-time">{formatTime(flight.departure)}</div>
              <div className="text-xs text-gray-500">{flight.origin}</div>
            </div>

            <div className="flex flex-1 flex-col items-center px-2">
              <div className="text-xs text-gray-500">{formatDuration(flight.total_duration_minutes)}</div>
              <div className="relative my-1 h-px w-full bg-gray-300">
                <div className="absolute left-0 top-1/2 h-2 w-2 -translate-y-1/2 rounded-full border-2 border-gray-400 bg-white" />
                <div className="absolute right-0 top-1/2 h-2 w-2 -translate-y-1/2 rounded-full border-2 border-gray-400 bg-white" />
                {/* Stop indicators: one dot per intermediate airport */}
                {flight.stops > 0 && flight.segments.slice(0, -1).map((_, i) => (
                  <div
                    key={i}
                    className="absolute top-1/2 h-2 w-2 -translate-y-1/2 rounded-full bg-gray-400"
                    style={{ left: `${((i + 1) / flight.segments.length) * 100}%` }}
                  />
                ))}
              </div>
              <div className="text-xs text-gray-500" data-testid="stops">{formatStops(flight.stops)}</div>
            </div>

            <div>
              <div className="text-lg font-medium" data-testid="arrival-time">
                {formatTime(flight.arrival)}
                {dayDiff && <sup className="ml-0.5 text-xs text-red-500">+1</sup>}
              </div>
              <div className="text-xs text-gray-500">{flight.destination}</div>
            </div>
          </div>

          {/* Airline name */}
          <div className="hidden md:block text-xs text-gray-500 min-w-[100px]" data-testid="airline-name">
            {firstSeg.airline_name}
          </div>
        </div>

        {/* Price */}
        <div className="text-right pl-4 min-w-[80px]">
          <div className="text-lg font-medium text-gray-900" data-testid="price">
            {formatPrice(flight.price_usd)}
          </div>
          <div className="text-xs text-gray-500">{flight.cabin_class.replace('_', ' ')}</div>
        </div>

        {/* Expand icon for connecting flights */}
        {flight.stops > 0 && (
          <svg
            className={`h-5 w-5 text-gray-400 transition-transform ${expanded ? 'rotate-180' : ''}`}
            viewBox="0 0 20 20" fill="currentColor"
          >
            <path fillRule="evenodd" d="M5.23 7.21a.75.75 0 011.06.02L10 11.168l3.71-3.938a.75.75 0 111.08 1.04l-4.25 4.5a.75.75 0 01-1.08 0l-4.25-4.5a.75.75 0 01.02-1.06z" clipRule="evenodd"/>
          </svg>
        )}
      </div>

      {/* Expanded segment details */}
      {expanded && flight.stops > 0 && (
        <div className="border-t border-gray-100 px-4 pb-4" data-testid="segments-detail">
          {flight.segments.map((seg, i) => (
            <div key={i}>
              <FlightSegmentView segment={seg} showDetails />
              {i < flight.segments.length - 1 && (
                <div className="ml-14 border-l-2 border-dashed border-gray-200 py-2 pl-4">
                  <span className="text-xs text-gray-400">
                    Layover at {seg.destination} ({seg.destination_city})
                  </span>
                </div>
              )}
            </div>
          ))}
        </div>
      )}
    </div>
  );
}
frontend/src/components/results/FlightSegment.tsx ADDED
@@ -0,0 +1,42 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import type { FlightSegment as SegmentType } from '../../api/types';
2
+ import { formatDuration, formatTime } from '../../utils/format';
3
+
4
interface Props {
  segment: SegmentType;
  showDetails?: boolean;
}

/**
 * Renders one flight segment as a timeline row: departure column, duration
 * bar, arrival column. With `showDetails`, the airline / flight number /
 * aircraft line is shown under the bar.
 */
export default function FlightSegmentView({ segment, showDetails }: Props) {
  const detailLine = `${segment.airline_name} · ${segment.flight_number} · ${segment.aircraft}`;

  return (
    <div className="flex items-center gap-4 py-2" data-testid="flight-segment">
      {/* Airline badge */}
      <div className="flex h-9 w-9 items-center justify-center rounded bg-gray-100 text-xs font-bold text-gray-600 flex-shrink-0" data-testid="airline-badge">
        {segment.airline_code}
      </div>

      {/* Timeline */}
      <div className="flex flex-1 items-center gap-3">
        <div className="text-right min-w-[70px]">
          <div className="text-base font-medium" data-testid="departure-time">{formatTime(segment.departure)}</div>
          <div className="text-xs text-gray-500" data-testid="origin-code">{segment.origin}</div>
        </div>

        <div className="flex flex-1 flex-col items-center px-2">
          <div className="text-xs text-gray-500">{formatDuration(segment.duration_minutes)}</div>
          <div className="relative my-1 h-px w-full bg-gray-300">
            <div className="absolute left-0 top-1/2 h-2 w-2 -translate-y-1/2 rounded-full border-2 border-gray-400 bg-white" />
            <div className="absolute right-0 top-1/2 h-2 w-2 -translate-y-1/2 rounded-full border-2 border-gray-400 bg-white" />
          </div>
          {showDetails && <div className="text-xs text-gray-400">{detailLine}</div>}
        </div>

        <div className="min-w-[70px]">
          <div className="text-base font-medium" data-testid="arrival-time">{formatTime(segment.arrival)}</div>
          <div className="text-xs text-gray-500" data-testid="destination-code">{segment.destination}</div>
        </div>
      </div>
    </div>
  );
}
frontend/src/components/results/NoResults.tsx ADDED
@@ -0,0 +1,29 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
interface Props {
  hasFilters: boolean;
  onClearFilters?: () => void;
}

/**
 * Empty-state panel for the results list. When filters are active (and a
 * clear callback is supplied) it also offers a one-click reset.
 */
export default function NoResults({ hasFilters, onClearFilters }: Props) {
  const hint = hasFilters
    ? 'Try adjusting your filters or search criteria.'
    : 'No direct or connecting flights available for this route.';

  return (
    <div className="flex flex-col items-center justify-center py-16 text-center" data-testid="no-results">
      <svg className="mb-4 h-16 w-16 text-gray-300" viewBox="0 0 24 24" fill="currentColor">
        <path d="M21 16v-2l-8-5V3.5c0-.83-.67-1.5-1.5-1.5S10 2.67 10 3.5V9l-8 5v2l8-2.5V19l-2 1.5V22l3.5-1 3.5 1v-1.5L13 19v-5.5l8 2.5z"/>
      </svg>
      <h3 className="text-lg font-medium text-gray-700">No flights found</h3>
      <p className="mt-1 text-sm text-gray-500">{hint}</p>
      {hasFilters && onClearFilters && (
        <button
          onClick={onClearFilters}
          className="mt-4 text-sm font-medium text-[#1a73e8] hover:underline cursor-pointer"
          data-testid="clear-filters-button"
        >
          Clear all filters
        </button>
      )}
    </div>
  );
}
frontend/src/components/results/SortBar.tsx ADDED
@@ -0,0 +1,40 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import type { SortBy } from '../../api/types';
2
+
3
// The three sort modes exposed by the backend, in display order.
const OPTIONS: { value: SortBy; label: string }[] = [
  { value: 'best', label: 'Best' },
  { value: 'cheapest', label: 'Cheapest' },
  { value: 'fastest', label: 'Fastest' },
];

interface Props {
  value: SortBy;
  onChange: (v: SortBy) => void;
  resultCount: number;
}

/** Result count plus a segmented Best/Cheapest/Fastest sort control. */
export default function SortBar({ value, onChange, resultCount }: Props) {
  const plural = resultCount !== 1 ? 's' : '';

  return (
    <div className="flex items-center justify-between" data-testid="sort-bar">
      <span className="text-sm text-gray-500" data-testid="result-count">
        {resultCount} result{plural}
      </span>
      <div className="flex gap-1 rounded-lg bg-gray-100 p-1">
        {OPTIONS.map(opt => {
          const active = value === opt.value;
          return (
            <button
              key={opt.value}
              onClick={() => onChange(opt.value)}
              className={`rounded-md px-4 py-1.5 text-sm font-medium transition-colors cursor-pointer ${
                active
                  ? 'bg-white text-[#1a73e8] shadow-sm'
                  : 'text-gray-600 hover:text-gray-900'
              }`}
              data-testid={`sort-${opt.value}`}
              aria-pressed={active}
            >
              {opt.label}
            </button>
          );
        })}
      </div>
    </div>
  );
}
frontend/src/components/search/AirportInput.tsx ADDED
@@ -0,0 +1,107 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import { useEffect, useRef, useState } from 'react';
2
+ import { searchAirports } from '../../api/client';
3
+ import type { AutocompleteResult } from '../../api/types';
4
+ import { useDebounce } from '../../hooks/useDebounce';
5
+
6
interface Props {
  label: string;
  value: string; // IATA code
  displayValue: string; // "New York (JFK)"
  onChange: (iata: string, display: string) => void;
  placeholder?: string;
  testId?: string;
}

/**
 * Airport autocomplete input. The typed query is debounced (200 ms) before
 * hitting the autocomplete API; choosing a result reports both the IATA code
 * and a human-readable display string to the parent. When unfocused, the
 * field mirrors the parent-supplied `displayValue`.
 */
export default function AirportInput({ label, value, displayValue, onChange, placeholder, testId }: Props) {
  const [query, setQuery] = useState(displayValue);
  const [results, setResults] = useState<AutocompleteResult[]>([]);
  const [open, setOpen] = useState(false);
  const [focused, setFocused] = useState(false);
  const debouncedQuery = useDebounce(query, 200);
  const wrapperRef = useRef<HTMLDivElement>(null);

  // Sync display value when parent changes it
  useEffect(() => {
    if (!focused) setQuery(displayValue);
  }, [displayValue, focused]);

  // Fetch autocomplete results
  useEffect(() => {
    if (!focused) return;
    if (debouncedQuery.length < 1) {
      setResults([]);
      return;
    }
    let cancelled = false;
    searchAirports(debouncedQuery)
      .then(r => {
        if (!cancelled) {
          setResults(r);
          setOpen(r.length > 0);
        }
      })
      .catch(() => {
        // Network/API failure: close the dropdown quietly rather than
        // leaking an unhandled promise rejection and stale results.
        if (!cancelled) {
          setResults([]);
          setOpen(false);
        }
      });
    return () => { cancelled = true; };
  }, [debouncedQuery, focused]);

  // Close on click outside; revert to empty if no airport was committed.
  useEffect(() => {
    function handler(e: MouseEvent) {
      if (wrapperRef.current && !wrapperRef.current.contains(e.target as Node)) {
        setOpen(false);
        setFocused(false);
        if (!value) setQuery('');
      }
    }
    document.addEventListener('mousedown', handler);
    return () => document.removeEventListener('mousedown', handler);
  }, [value]);

  // Commit a dropdown selection to local state and the parent.
  function select(r: AutocompleteResult) {
    onChange(r.iata, `${r.city_name} (${r.iata})`);
    setQuery(`${r.city_name} (${r.iata})`);
    setOpen(false);
    setFocused(false);
  }

  return (
    <div ref={wrapperRef} className="relative flex-1 min-w-[180px]" data-testid={testId}>
      <label className="absolute -top-2 left-3 bg-white px-1 text-xs text-gray-500 z-10">{label}</label>
      <input
        type="text"
        value={query}
        onChange={e => { setQuery(e.target.value); setOpen(true); }}
        onFocus={() => { setFocused(true); setQuery(''); setOpen(true); }}
        placeholder={placeholder || 'City or airport'}
        className="w-full rounded-md border border-gray-300 px-3 py-3 text-sm text-gray-900 placeholder-gray-400 hover:border-gray-400 focus:border-[#1a73e8] focus:outline-none"
        aria-label={label}
        data-testid={testId ? `${testId}-input` : undefined}
        autoComplete="off"
      />
      {open && results.length > 0 && (
        <ul
          className="absolute top-full left-0 right-0 z-50 mt-1 max-h-64 overflow-y-auto rounded-lg border border-gray-200 bg-white shadow-lg"
          data-testid={testId ? `${testId}-dropdown` : undefined}
          role="listbox"
        >
          {results.map(r => (
            <li
              key={r.iata}
              onClick={() => select(r)}
              className="flex cursor-pointer items-center gap-3 px-4 py-3 hover:bg-gray-50"
              role="option"
              data-testid={`airport-option-${r.iata}`}
              aria-selected={r.iata === value}
            >
              <svg className="h-5 w-5 flex-shrink-0 text-gray-400" viewBox="0 0 24 24" fill="none">
                <path d="M12 2C8.13 2 5 5.13 5 9c0 5.25 7 13 7 13s7-7.75 7-13c0-3.87-3.13-7-7-7zm0 9.5c-1.38 0-2.5-1.12-2.5-2.5s1.12-2.5 2.5-2.5 2.5 1.12 2.5 2.5-1.12 2.5-2.5 2.5z" fill="currentColor"/>
              </svg>
              <div className="flex-1 min-w-0">
                <div className="text-sm font-medium text-gray-900 truncate">{r.city_name} ({r.iata})</div>
                <div className="text-xs text-gray-500 truncate">{r.name}, {r.country}</div>
              </div>
            </li>
          ))}
        </ul>
      )}
    </div>
  );
}
frontend/src/components/search/ClassSelector.tsx ADDED
@@ -0,0 +1,30 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import type { CabinClass } from '../../api/types';
2
+ import { cabinClassLabel } from '../../utils/format';
3
+
4
// Cabin classes in ascending fare order; labels come from cabinClassLabel.
const OPTIONS: CabinClass[] = ['economy', 'premium_economy', 'business', 'first'];

interface Props {
  value: CabinClass;
  onChange: (v: CabinClass) => void;
}

/** Native <select> for cabin class with a custom chevron overlay. */
export default function ClassSelector({ value, onChange }: Props) {
  return (
    <div className="relative" data-testid="class-selector">
      <select
        value={value}
        onChange={event => onChange(event.target.value as CabinClass)}
        className="appearance-none rounded-md border border-gray-300 bg-white px-3 py-2 pr-8 text-sm text-gray-700 hover:border-gray-400 focus:border-[#1a73e8] focus:outline-none cursor-pointer"
        aria-label="Cabin class"
        data-testid="class-select"
      >
        {OPTIONS.map(cls => (
          <option key={cls} value={cls}>{cabinClassLabel(cls)}</option>
        ))}
      </select>
      <svg className="pointer-events-none absolute right-2 top-1/2 -translate-y-1/2 h-4 w-4 text-gray-500" viewBox="0 0 20 20" fill="currentColor">
        <path fillRule="evenodd" d="M5.23 7.21a.75.75 0 011.06.02L10 11.168l3.71-3.938a.75.75 0 111.08 1.04l-4.25 4.5a.75.75 0 01-1.08 0l-4.25-4.5a.75.75 0 01.02-1.06z" clipRule="evenodd"/>
      </svg>
    </div>
  );
}
frontend/src/components/search/DatePicker.tsx ADDED
@@ -0,0 +1,22 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
interface Props {
  label: string;
  value: string; // YYYY-MM-DD
  onChange: (v: string) => void;
  testId?: string;
}

/** Floating-label wrapper around a native date input. */
export default function DatePicker({ label, value, onChange, testId }: Props) {
  const inputTestId = testId ? `${testId}-input` : undefined;

  return (
    <div className="relative min-w-[150px]" data-testid={testId}>
      <label className="absolute -top-2 left-3 bg-white px-1 text-xs text-gray-500 z-10">{label}</label>
      <input
        type="date"
        value={value}
        onChange={event => onChange(event.target.value)}
        className="w-full rounded-md border border-gray-300 px-3 py-3 text-sm text-gray-900 hover:border-gray-400 focus:border-[#1a73e8] focus:outline-none cursor-pointer"
        aria-label={label}
        data-testid={inputTestId}
      />
    </div>
  );
}
frontend/src/components/search/PassengerSelector.tsx ADDED
@@ -0,0 +1,86 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import { useEffect, useRef, useState } from 'react';
2
+ import type { Passengers } from '../../api/types';
3
+
4
interface Props {
  value: Passengers;
  onChange: (v: Passengers) => void;
}

/**
 * Dropdown for choosing adult/child/infant counts (adults 1-9, others 0-9).
 * All buttons are explicitly type="button": this component is rendered
 * inside SearchForm's <form>, where the HTML default type ("submit") would
 * otherwise fire a search on every click.
 */
export default function PassengerSelector({ value, onChange }: Props) {
  const [open, setOpen] = useState(false);
  const ref = useRef<HTMLDivElement>(null);

  // Close the dropdown on any click outside the component.
  useEffect(() => {
    function handler(e: MouseEvent) {
      if (ref.current && !ref.current.contains(e.target as Node)) setOpen(false);
    }
    document.addEventListener('mousedown', handler);
    return () => document.removeEventListener('mousedown', handler);
  }, []);

  const total = value.adults + value.children + value.infants;

  // Apply delta and clamp: adults stay in [1, 9], children/infants in [0, 9].
  function update(field: keyof Passengers, delta: number) {
    const v = { ...value };
    v[field] = Math.max(field === 'adults' ? 1 : 0, Math.min(9, v[field] + delta));
    onChange(v);
  }

  return (
    <div ref={ref} className="relative" data-testid="passenger-selector">
      <button
        type="button"
        onClick={() => setOpen(!open)}
        className="rounded-md border border-gray-300 px-3 py-3 text-sm text-gray-700 hover:border-gray-400 focus:border-[#1a73e8] focus:outline-none flex items-center gap-1 cursor-pointer"
        aria-label="Passengers"
        data-testid="passenger-button"
      >
        <svg className="h-4 w-4 text-gray-500" viewBox="0 0 24 24" fill="currentColor">
          <path d="M12 12c2.21 0 4-1.79 4-4s-1.79-4-4-4-4 1.79-4 4 1.79 4 4 4zm0 2c-2.67 0-8 1.34-8 4v2h16v-2c0-2.66-5.33-4-8-4z"/>
        </svg>
        <span>{total}</span>
        <svg className="h-4 w-4 text-gray-400" viewBox="0 0 20 20" fill="currentColor">
          <path fillRule="evenodd" d="M5.23 7.21a.75.75 0 011.06.02L10 11.168l3.71-3.938a.75.75 0 111.08 1.04l-4.25 4.5a.75.75 0 01-1.08 0l-4.25-4.5a.75.75 0 01.02-1.06z" clipRule="evenodd"/>
        </svg>
      </button>

      {open && (
        <div className="absolute top-full right-0 z-50 mt-1 w-64 rounded-lg border border-gray-200 bg-white p-4 shadow-lg" data-testid="passenger-dropdown">
          {([
            { key: 'adults' as const, label: 'Adults', sub: '' },
            { key: 'children' as const, label: 'Children', sub: 'Aged 2-11' },
            { key: 'infants' as const, label: 'Infants', sub: 'Under 2' },
          ]).map(row => (
            <div key={row.key} className="flex items-center justify-between py-2">
              <div>
                <div className="text-sm font-medium text-gray-900">{row.label}</div>
                {row.sub && <div className="text-xs text-gray-500">{row.sub}</div>}
              </div>
              <div className="flex items-center gap-3">
                <button
                  type="button"
                  onClick={() => update(row.key, -1)}
                  disabled={value[row.key] <= (row.key === 'adults' ? 1 : 0)}
                  className="h-8 w-8 rounded-full border border-gray-300 text-gray-600 hover:bg-gray-50 disabled:opacity-30 disabled:cursor-not-allowed cursor-pointer flex items-center justify-center"
                  aria-label={`Decrease ${row.label}`}
                  data-testid={`${row.key}-decrease`}
                >−</button>
                <span className="w-4 text-center text-sm" data-testid={`${row.key}-count`}>{value[row.key]}</span>
                <button
                  type="button"
                  onClick={() => update(row.key, 1)}
                  disabled={value[row.key] >= 9}
                  className="h-8 w-8 rounded-full border border-gray-300 text-gray-600 hover:bg-gray-50 disabled:opacity-30 disabled:cursor-not-allowed cursor-pointer flex items-center justify-center"
                  aria-label={`Increase ${row.label}`}
                  data-testid={`${row.key}-increase`}
                >+</button>
              </div>
            </div>
          ))}
          <button
            type="button"
            onClick={() => setOpen(false)}
            className="mt-2 w-full rounded-md bg-[#1a73e8] py-2 text-sm text-white hover:bg-[#1765cc] cursor-pointer"
            data-testid="passenger-done"
          >Done</button>
        </div>
      )}
    </div>
  );
}
frontend/src/components/search/SearchForm.tsx ADDED
@@ -0,0 +1,114 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import { useState } from 'react';
2
+ import type { CabinClass, Passengers, TripType } from '../../api/types';
3
+ import { getDefaultDepartureDate, getDefaultReturnDate } from '../../utils/date';
4
+ import AirportInput from './AirportInput';
5
+ import ClassSelector from './ClassSelector';
6
+ import DatePicker from './DatePicker';
7
+ import PassengerSelector from './PassengerSelector';
8
+ import SwapButton from './SwapButton';
9
+ import TripTypeSelector from './TripTypeSelector';
10
+
11
export interface SearchFormData {
  tripType: TripType;
  origin: string;
  originDisplay: string;
  destination: string;
  destinationDisplay: string;
  departureDate: string;
  returnDate: string;
  passengers: Passengers;
  cabinClass: CabinClass;
}

interface Props {
  initial?: Partial<SearchFormData>;
  onSearch: (data: SearchFormData) => void;
  compact?: boolean;
}

/**
 * The main flight-search form: trip type / passengers / cabin class on the
 * first row, airports + dates + submit button on the second. State is seeded
 * from `initial` and handed back to the parent as one SearchFormData object
 * on submit. The Search button is disabled until both airports are chosen.
 */
export default function SearchForm({ initial, onSearch, compact }: Props) {
  const [tripType, setTripType] = useState<TripType>(initial?.tripType || 'round_trip');
  const [origin, setOrigin] = useState(initial?.origin || '');
  const [originDisplay, setOriginDisplay] = useState(initial?.originDisplay || '');
  const [destination, setDestination] = useState(initial?.destination || '');
  const [destinationDisplay, setDestinationDisplay] = useState(initial?.destinationDisplay || '');
  const [departureDate, setDepartureDate] = useState(initial?.departureDate || getDefaultDepartureDate());
  const [returnDate, setReturnDate] = useState(initial?.returnDate || getDefaultReturnDate());
  const [passengers, setPassengers] = useState<Passengers>(initial?.passengers || { adults: 1, children: 0, infants: 0 });
  const [cabinClass, setCabinClass] = useState<CabinClass>(initial?.cabinClass || 'economy');

  const canSearch = Boolean(origin) && Boolean(destination);

  // Exchange origin and destination — both the IATA codes and display text.
  function swapEndpoints() {
    setOrigin(destination);
    setOriginDisplay(destinationDisplay);
    setDestination(origin);
    setDestinationDisplay(originDisplay);
  }

  // Report the full form state to the parent; incomplete submits are ignored.
  function submit(e: React.FormEvent) {
    e.preventDefault();
    if (!canSearch) return;
    onSearch({
      tripType, origin, originDisplay, destination, destinationDisplay,
      departureDate, returnDate, passengers, cabinClass,
    });
  }

  return (
    <form onSubmit={submit} data-testid="search-form">
      {/* Row 1: Trip type, passengers, class */}
      <div className="mb-3 flex flex-wrap items-center gap-2">
        <TripTypeSelector value={tripType} onChange={setTripType} />
        <PassengerSelector value={passengers} onChange={setPassengers} />
        <ClassSelector value={cabinClass} onChange={setCabinClass} />
      </div>

      {/* Row 2: Airport inputs + dates + search button */}
      <div className={`flex flex-wrap items-end gap-2 ${compact ? '' : 'rounded-xl border border-gray-300 p-3'}`}>
        <AirportInput
          label="Where from?"
          value={origin}
          displayValue={originDisplay}
          onChange={(iata, display) => { setOrigin(iata); setOriginDisplay(display); }}
          testId="origin"
        />

        <SwapButton onClick={swapEndpoints} />

        <AirportInput
          label="Where to?"
          value={destination}
          displayValue={destinationDisplay}
          onChange={(iata, display) => { setDestination(iata); setDestinationDisplay(display); }}
          testId="destination"
        />

        <DatePicker
          label="Departure"
          value={departureDate}
          onChange={setDepartureDate}
          testId="departure-date"
        />

        {/* Return date only applies to round trips */}
        {tripType === 'round_trip' && (
          <DatePicker
            label="Return"
            value={returnDate}
            onChange={setReturnDate}
            testId="return-date"
          />
        )}

        <button
          type="submit"
          disabled={!canSearch}
          className="rounded-full bg-[#1a73e8] px-6 py-3 text-sm font-medium text-white hover:bg-[#1765cc] hover:shadow-md disabled:opacity-40 disabled:cursor-not-allowed focus:outline-none cursor-pointer"
          data-testid="search-button"
        >
          Search
        </button>
      </div>
    </form>
  );
}
frontend/src/components/search/SwapButton.tsx ADDED
@@ -0,0 +1,18 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
interface Props {
  onClick: () => void;
}

/**
 * Round icon button that swaps origin and destination. Explicitly
 * type="button" because it is rendered inside SearchForm's <form>:
 * the HTML default type ("submit") would trigger a search on every swap.
 */
export default function SwapButton({ onClick }: Props) {
  return (
    <button
      type="button"
      onClick={onClick}
      className="flex h-10 w-10 items-center justify-center rounded-full border border-gray-300 bg-white text-gray-500 hover:bg-gray-50 hover:text-gray-700 focus:outline-none self-center cursor-pointer"
      aria-label="Swap origin and destination"
      data-testid="swap-button"
    >
      <svg className="h-5 w-5" viewBox="0 0 24 24" fill="currentColor">
        <path d="M6.99 11L3 15l3.99 4v-3H14v-2H6.99v-3zM21 9l-3.99-4v3H10v2h7.01v3L21 9z"/>
      </svg>
    </button>
  );
}
frontend/src/components/search/TripTypeSelector.tsx ADDED
@@ -0,0 +1,33 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import type { TripType } from '../../api/types';
2
+
3
// Trip types supported by the search API, in display order.
const OPTIONS: { value: TripType; label: string }[] = [
  { value: 'round_trip', label: 'Round trip' },
  { value: 'one_way', label: 'One way' },
  { value: 'multi_city', label: 'Multi-city' },
];

interface Props {
  value: TripType;
  onChange: (v: TripType) => void;
}

/** Native <select> for trip type with a custom chevron overlay. */
export default function TripTypeSelector({ value, onChange }: Props) {
  return (
    <div className="relative" data-testid="trip-type-selector">
      <select
        value={value}
        onChange={event => onChange(event.target.value as TripType)}
        className="appearance-none rounded-md border border-gray-300 bg-white px-3 py-2 pr-8 text-sm text-gray-700 hover:border-gray-400 focus:border-[#1a73e8] focus:outline-none cursor-pointer"
        aria-label="Trip type"
        data-testid="trip-type-select"
      >
        {OPTIONS.map(opt => (
          <option key={opt.value} value={opt.value}>{opt.label}</option>
        ))}
      </select>
      <svg className="pointer-events-none absolute right-2 top-1/2 -translate-y-1/2 h-4 w-4 text-gray-500" viewBox="0 0 20 20" fill="currentColor">
        <path fillRule="evenodd" d="M5.23 7.21a.75.75 0 011.06.02L10 11.168l3.71-3.938a.75.75 0 111.08 1.04l-4.25 4.5a.75.75 0 01-1.08 0l-4.25-4.5a.75.75 0 01.02-1.06z" clipRule="evenodd"/>
      </svg>
    </div>
  );
}
frontend/src/components/shared/Header.tsx ADDED
@@ -0,0 +1,22 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import { useNavigate } from 'react-router-dom';
2
+
3
+ export default function Header() {
4
+ const navigate = useNavigate();
5
+
6
+ return (
7
+ <header className="border-b border-gray-200 bg-white" data-testid="header">
8
+ <div className="mx-auto max-w-7xl px-4 py-3 flex items-center gap-3">
9
+ <button
10
+ onClick={() => navigate('/')}
11
+ className="flex items-center gap-2 text-xl font-medium text-gray-900 hover:opacity-80 cursor-pointer"
12
+ data-testid="logo"
13
+ >
14
+ <svg width="24" height="24" viewBox="0 0 24 24" fill="none" className="text-[#1a73e8]">
15
+ <path d="M21 16v-2l-8-5V3.5c0-.83-.67-1.5-1.5-1.5S10 2.67 10 3.5V9l-8 5v2l8-2.5V19l-2 1.5V22l3.5-1 3.5 1v-1.5L13 19v-5.5l8 2.5z" fill="currentColor"/>
16
+ </svg>
17
+ Flights
18
+ </button>
19
+ </div>
20
+ </header>
21
+ );
22
+ }
frontend/src/components/shared/Loading.tsx ADDED
@@ -0,0 +1,8 @@
 
 
 
 
 
 
 
 
 
1
+ export default function Loading() {
2
+ return (
3
+ <div className="flex flex-col items-center justify-center py-20" data-testid="loading">
4
+ <div className="h-10 w-10 animate-spin rounded-full border-4 border-gray-200 border-t-[#1a73e8]" />
5
+ <p className="mt-4 text-sm text-gray-500">Searching flights...</p>
6
+ </div>
7
+ );
8
+ }
frontend/src/hooks/useDebounce.ts ADDED
@@ -0,0 +1,12 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import { useEffect, useState } from 'react';
2
+
3
+ export function useDebounce<T>(value: T, delay: number): T {
4
+ const [debounced, setDebounced] = useState(value);
5
+
6
+ useEffect(() => {
7
+ const timer = setTimeout(() => setDebounced(value), delay);
8
+ return () => clearTimeout(timer);
9
+ }, [value, delay]);
10
+
11
+ return debounced;
12
+ }
frontend/src/hooks/useFlightSearch.ts ADDED
@@ -0,0 +1,70 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import { useCallback, useState } from 'react';
2
+ import { searchFlights } from '../api/client';
3
+ import type { CabinClass, Filters, FlightOffer, Passengers, SearchRequest, SortBy, TripType } from '../api/types';
4
+
5
+ interface SearchState {
6
+ outboundFlights: FlightOffer[];
7
+ returnFlights: FlightOffer[];
8
+ loading: boolean;
9
+ error: string | null;
10
+ searched: boolean;
11
+ }
12
+
13
+ export function useFlightSearch() {
14
+ const [state, setState] = useState<SearchState>({
15
+ outboundFlights: [],
16
+ returnFlights: [],
17
+ loading: false,
18
+ error: null,
19
+ searched: false,
20
+ });
21
+
22
+ const search = useCallback(async (params: {
23
+ tripType: TripType;
24
+ origin: string;
25
+ destination: string;
26
+ departureDate: string;
27
+ returnDate?: string;
28
+ passengers: Passengers;
29
+ cabinClass: CabinClass;
30
+ filters: Filters;
31
+ sortBy: SortBy;
32
+ }) => {
33
+ setState(s => ({ ...s, loading: true, error: null }));
34
+
35
+ const legs = [{ origin: params.origin, destination: params.destination, date: params.departureDate }];
36
+ if (params.tripType === 'round_trip' && params.returnDate) {
37
+ legs.push({ origin: params.destination, destination: params.origin, date: params.returnDate });
38
+ }
39
+
40
+ const req: SearchRequest = {
41
+ trip_type: params.tripType,
42
+ legs,
43
+ passengers: params.passengers,
44
+ cabin_class: params.cabinClass,
45
+ filters: params.filters,
46
+ sort_by: params.sortBy,
47
+ };
48
+
49
+ try {
50
+ const res = await searchFlights(req);
51
+ setState({
52
+ outboundFlights: res.outbound_flights,
53
+ returnFlights: res.return_flights,
54
+ loading: false,
55
+ error: null,
56
+ searched: true,
57
+ });
58
+ } catch (err) {
59
+ setState({
60
+ outboundFlights: [],
61
+ returnFlights: [],
62
+ loading: false,
63
+ error: err instanceof Error ? err.message : 'Search failed',
64
+ searched: true,
65
+ });
66
+ }
67
+ }, []);
68
+
69
+ return { ...state, search };
70
+ }