microclimate-x / README.md
W1nd5pac's picture
Deploy 2026-05-20T06:52:08Z โ€” 11e81c5
4eefabb verified
---
title: MicroClimate-X
emoji: ๐ŸŒง๏ธ
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
pinned: false
license: mit
short_description: Hybrid microclimate risk for complex terrain (FYP demo)
---
# MicroClimate-X
> Intelligent Meteorological Analysis System for Complex Terrain
> ้ขๅ‘ๅคๆ‚ๅœฐๅฝข็š„ๆ™บ่ƒฝๆฐ”่ฑกๅˆ†ๆž็ณป็ปŸ
> **Live demo / ๅœจ็บฟๆผ”็คบ**: <https://huggingface.co/spaces/W1nd5pac/microclimate-x>
> (Deployed as a Hugging Face Space โ€” Docker SDK. See [`docs/DEPLOY_HF.md`](docs/DEPLOY_HF.md) for the deployment recipe.)
![CI](https://github.com/KyoukoLi/microclimate-x/actions/workflows/ci.yml/badge.svg)
![Python](https://img.shields.io/badge/Python-3.9%20%7C%203.11%20%7C%203.12-blue)
![FastAPI](https://img.shields.io/badge/FastAPI-0.110%2B-009688)
![Vue3](https://img.shields.io/badge/Vue.js-3-4FC08D)
![ML](https://img.shields.io/badge/ML-RandomForest-orange)
![Coverage](https://img.shields.io/badge/coverage-97%25-brightgreen)
![Tests](https://img.shields.io/badge/tests-70%20passing-success)
![Docker](https://img.shields.io/badge/Docker-multi--stage-2496ED?logo=docker&logoColor=white)
![License](https://img.shields.io/badge/License-MIT-green)
A Final Year Project at **Universiti Kebangsaan Malaysia (UKM)** โ€” Faculty of Information Science & Technology.
### For thesis supervisors / ๅฏผๅธˆ้˜…่ฏป่ทฏๅพ„
| Step | Document | What it shows |
|---|---|---|
| 1. Dataset | [`docs/dataset.md`](docs/dataset.md) | Source ยท schema ยท **Y derivation** ยท train/test split |
| 2. Model | [`models/MODEL_CARD.md`](models/MODEL_CARD.md) | Intended use ยท metrics ยท limitations ยท ethics |
| 3. Evaluation | [`figures/`](figures/) + [`figures/evaluation_summary.json`](figures/evaluation_summary.json) | 6 publication figures, all reproducible via `make evaluate` |
| 4. Architecture | [`docs/architecture.md`](docs/architecture.md) + [`docs/thresholds.md`](docs/thresholds.md) | Hybrid engine, every threshold cited |
| 5. Pipeline order | [`docs/pipeline_order.md`](docs/pipeline_order.md) | Explicit "dataset โ†’ model โ†’ app" sequence |
| 6. Meeting brief | [`docs/supervisor_meeting_brief.md`](docs/supervisor_meeting_brief.md) | Detailed bilingual EN/ZH script |
| 7. **Cheat sheet** | [`docs/MEETING_CHEAT_SHEET.md`](docs/MEETING_CHEAT_SHEET.md) ยท [HTML](docs/MEETING_CHEAT_SHEET.html) | **Open on screen during the meeting** โ€” tab-order ยท demo script ยท Q&A ยท checklist |
---
## 1. Problem Statement / ็—›็‚น
Traditional weather forecasting relies on **macro-scale grids (20 km ร— 20 km)** that fail catastrophically in complex terrain. A single forecast cell may cover a mountain peak, a valley floor, and a windward slope โ€” all of which have vastly different microclimates.
ไผ ็ปŸๅคฉๆฐ”้ข„ๆŠฅไฝฟ็”จ **20 km ร— 20 km ๅฎ่ง‚็ฝ‘ๆ ผ**๏ผŒๅœจๅฑฑๅŒบไผšไธฅ้‡ๅคฑ็œŸใ€‚ๅŒไธ€็ฝ‘ๆ ผๅ†…ๅฏ่ƒฝๅŒๆ—ถๅŒ…ๅซๅฑฑ้กถใ€่ฐทๅบ•ๅ’Œ่ฟŽ้ฃŽๅก๏ผŒไฝ†ๅฎƒไปฌ็š„ๅพฎๆฐ”ๅ€™ๅฎŒๅ…จไธๅŒใ€‚
## 2. Solution: The Hybrid Engine / ่งฃๅ†ณๆ–นๆกˆ
MicroClimate-X uses a **dual-engine hybrid architecture** combining a Machine Learning predictor with a topographic Rule-Based Expert System.
```
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ User clicks a coordinate on the map (lat, lon) โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”‚
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Open-Meteo (weather) + Open Topo Data (DEM) โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”‚
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ โ”‚
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Engine A โ”‚ โ”‚ Engine B โ”‚
โ”‚ Random Forest โ”‚ probabilityโ”‚ Topographic Rules โ”‚
โ”‚ (in-distribution โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–บโ”‚ + Veto Triggers โ”‚
โ”‚ rain probability) โ”‚ โ”‚ (safety-critical) โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”‚
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Risk Score 0-100 โ”‚
โ”‚ + Bilingual Advice โ”‚
โ”‚ + XAI Inference Log โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
```
### Why Hybrid? / ไธบไป€ไนˆๆททๅˆ๏ผŸ
Pure ML can fail catastrophically out-of-distribution. Example: feed Mount Everest coordinates โ†’ ML predicts 0% rain โ†’ returns "Safe" โ€” ignoring -30ยฐC, hypoxia, gale-force winds.
**Engine B's Veto mechanism** provides bounded safety guarantees by overriding the ML score when physical thresholds are breached. This follows the **Neuro-Symbolic AI** paradigm (Garcez & Lamb, 2020).
### Engine B internals โ€” one-to-one with D5 proposal ยง3.7 / P4
The rule engine is decomposed exactly along the lines of the thesis proposal so every line of code maps to a section number:
| Proposal step | Code | Output |
|---|---|---|
| **P4.1** Load Dynamic Risk Rules | `backend/config.py` | All thresholds, weights, and the R1-R4 decision table, each annotated with its academic citation |
| **P4.2** Fetch User Context | `?activity=hiker\|driver\|construction\|general` | Activity is plumbed into the request flow |
| **P4.3** Evaluate Environmental Risks | Four `score_*_risk()` functions in `rule_engine.py` | Rainfall / Fog / Wind-gust / Thunderstorm sub-scores (each 0-100) |
| **ยง3.7.2 Table 4.2** Decision Table | `apply_decision_table_3_7_2()` | Which of R1-R4 fired (hidden rain / no amplification / heavy downpour / standard rain) |
| Veto cascade | `_collect_veto_triggers()` | Life-safety overrides (Mt-Everest type) โ€” capped at 100 |
| **P4.4** Activity weighting | `apply_activity_weighting()` | (activity ร— hazard) weight matrix |
| **P4.5** Composite score | Same | `0.80 ยท max(weighted) + 0.20 ยท mean(rest)` โ€” dominant hazard wins |
| **P4.6** Actionable advice | `_normal_advice()` / `_veto_advice()` | Bilingual EN/ZH paragraph that names the dominant hazard |
Four hazard categories surfaced in the UI as four mini-gauges; the four R1-R4 indicators light up beside the score card whenever a rule fires.
## 3. Tech Stack / ๆŠ€ๆœฏๆ ˆ
| Layer | Technology |
|---|---|
| Frontend | Vue 3 (CDN) + Tailwind CSS + Leaflet.js + ECharts |
| Backend | Python 3.10+, FastAPI, Uvicorn |
| ML | Scikit-Learn (Random Forest), Pandas, NumPy |
| Storage | SQLite 3 (WAL mode, risk-adaptive TTL cache) |
| External | Open-Meteo Historical Archive (ERA5), Open Topo Data (SRTM DEM) |
## 4. Dataset / ๆ•ฐๆฎ้›†
- **Source**: [Open-Meteo Historical Weather API](https://open-meteo.com/en/docs/historical-weather-api) (ERA5 reanalysis)
- **Region**: Malaysian mountain areas (Genting Highlands, Cameron Highlands, Fraser's Hill, Klang Valley, Mount Kinabalu region)
- **Time Range**: 2020-01-01 to 2023-12-31 (hourly resolution, 5 sites ร— ~35 000 hours each)
- **Features (X)**: `elevation_m`, `temperature_c`, `humidity_pct`, `wind_speed_kmh`, `wind_direction_deg`, `surface_pressure_hpa`
- **Target (Y)**: `is_rain_event` โ€” binary, 1 if `precipitation(t+1h) > 0.1 mm` else 0 (per WMO trace-precipitation definition)
## 5. Quick Start / ๅฟซ้€Ÿๅผ€ๅง‹
```bash
git clone https://github.com/KyoukoLi/microclimate-x.git
cd microclimate-x
# Fast path โ€” everything via the Makefile
make install-dev # 1. create venv + install runtime + dev deps
make synth # 2. generate synthetic dataset (offline)
# โ€ฆor `make` nothing here and run `python scripts/1_download_dataset.py`
# to fetch real ERA5 data when network is available.
make preprocess # 3. feature engineering + Y derivation
make train # 4. RF training + time-based CV
make evaluate # 5. ROC / PR / calibration / threshold-sweep figures
make run # 6. uvicorn dev server on http://localhost:8000
# Then open frontend/index.html (or browse to http://localhost:8000/app/)
```
### Docker one-liner
```bash
docker compose up --build
# API lives on http://localhost:8000 ยท frontend on http://localhost:8000/app/
```
### Test it
```bash
make test # 70 tests, ~12 s
make lint # ruff โ€” zero errors expected
```
### Training results on real ERA5 data / ็œŸๅฎž ERA5 ๆ•ฐๆฎ่ฎญ็ปƒ็ป“ๆžœ
Trained on **175 315 hourly samples** from Open-Meteo Historical Archive
(ECMWF ERA5 reanalysis) covering five Malaysian mountain sites,
2020-01-01 โ†’ 2024-12-31. Time-based split: last 20 % per site held out
(n = 35 063 test samples). See [`models/MODEL_CARD.md`](models/MODEL_CARD.md)
for the full evaluation card and `figures/` for publication-ready plots.
| Metric | Value | Source |
|---|---|---|
| Test ROC AUC | **0.871** | `figures/01_roc_curve.png` |
| Test PR Average Precision | **0.750** | `figures/02_pr_curve.png` |
| Brier score (calibration) | **0.138** | `figures/03_calibration_curve.png` |
| Best F2 @ ฯ„ = 0.20 | **0.778** | `figures/04_threshold_sweep.png` |
| Recall (at chosen ฯ„ = 0.20) | **0.934** โ€” safety-critical recall |
| Class balance | 29.2 % positive (Malaysian mountain climatology) |
We deliberately operate at **ฯ„ = 0.20**, not the default 0.50, because
in safety-critical settings a missed rain event (false negative) on a
windward slope is dramatically worse than a false positive. F2 score
weights recall 4ร— higher than precision and is the principled metric
for this regime.
**5-fold time-series CV** on the training fold gives AUC ranging
0.828-0.908 (mean โ‰ˆ 0.858), confirming the model is not over-fitting a
single temporal slice.
#### Feature importance โ€” what the model actually learned
| Rank | Feature | Importance | Interpretation |
|---|---|---|---|
| 1 | `precipitation_lag_1h` | 37.1 % | Rain autocorrelation โ€” the well-documented "rain begets rain" persistence signal in short-term nowcasting (Wilson et al., 2010). |
| 2-3 | `hour_cos`, `hour_sin` | 18.6 % | Diurnal convective cycle โ€” Malaysian mountain rainfall peaks in late afternoon. |
| 4 | `pressure_change_3h` | 4.7 % | Falling pressure precedes incoming storms โ€” the classical synoptic-scale precursor. |
| 5-6 | `wind_v`, `dew_point_c` | 8.1 % | Moisture transport + saturation potential. |
| 7-14 | other meteorological X | 22 % | T, humidity, cloud cover, wind, dew-point depression, pressure. |
| 15-17 | `month_*`, `elevation_m` | 4 % | Low because the time-of-day and lag features already absorb most of the seasonal/static signal. |
| 18 | `cape_jkg` | **0.0 %** | โš ๏ธ ERA5 archive CAPE values for these coordinates are predominantly zero โ€” a known coverage gap. The Veto-rule engine still uses CAPE thresholds directly from the live Open-Meteo forecast at inference time. |
#### Why F2 instead of accuracy?
Accuracy is misleading on imbalanced safety-critical classification.
A model that predicts "no rain" 100 % of the time achieves
**69.2 % accuracy** here while being completely useless. F2 weights
recall twice as heavily as precision, which is correct for a
hiker-safety app where missing a real rain event (False Negative) is
far worse than a false alarm (False Positive).
See `models/training_report.json` for the full 5-fold CV report.
## 6. Project Structure / ้กน็›ฎ็ป“ๆž„
```
microclimate-x/
โ”œโ”€โ”€ backend/
โ”‚ โ”œโ”€โ”€ main.py # FastAPI app + lifespan
โ”‚ โ”œโ”€โ”€ ml_engine.py # Loads RF model, predict_proba
โ”‚ โ”œโ”€โ”€ rule_engine.py # Veto rules + risk scoring + bilingual advice
โ”‚ โ”œโ”€โ”€ terrain.py # DEM-based Valley/Slope/Flat classification
โ”‚ โ”œโ”€โ”€ cache.py # SQLite WAL cache, risk-adaptive TTL
โ”‚ โ”œโ”€โ”€ schemas.py # Pydantic request/response models
โ”‚ โ””โ”€โ”€ config.py # Thresholds + academic citations
โ”œโ”€โ”€ scripts/
โ”‚ โ”œโ”€โ”€ 1_download_dataset.py # Open-Meteo + Open-Topo-Data (real ERA5)
โ”‚ โ”œโ”€โ”€ 1b_synth_dataset.py # physically-plausible offline fallback
โ”‚ โ”œโ”€โ”€ 2_preprocess.py
โ”‚ โ””โ”€โ”€ 3_train_model.py
โ”œโ”€โ”€ frontend/
โ”‚ โ””โ”€โ”€ index.html # Single-file Vue3 SPA
โ”œโ”€โ”€ docs/
โ”‚ โ”œโ”€โ”€ architecture.md
โ”‚ โ””โ”€โ”€ thresholds.md # Veto thresholds with academic citations
โ”œโ”€โ”€ tests/
โ”‚ โ””โ”€โ”€ test_rule_engine.py
โ”œโ”€โ”€ data/ # raw/processed CSVs (gitignored)
โ”œโ”€โ”€ models/ # trained .pkl artifacts (gitignored)
โ””โ”€โ”€ requirements.txt
```
## 7. Key Design Decisions / ๅ…ณ้”ฎ่ฎพ่ฎก
| Decision | Rationale |
|---|---|
| **Random Forest over SVM / Deep Learning** | Handles non-linear weather-terrain interactions; outputs interpretable feature importance; no GPU needed; robust on tabular data |
| **Binary classification (`is_rain_event`)** | One-hour-ahead nowcasting matches the use case (hikers' immediate decisions) |
| **Time-based train/test split** | Random split would leak temporal correlation โ†’ inflated metrics |
| **Class-weight balanced** | Rain is the minority class (~25% in Malaysian mountains) |
| **Wind direction as u/v components** | Raw degrees treat 0ยฐ and 360ยฐ as far apart โ€” mathematically incorrect |
| **Risk-adaptive cache TTL** | High-risk scenarios refresh faster (60 s) than safe ones (600 s) |
| **SQLite WAL mode** | Allows concurrent reads during writes โ€” critical for FastAPI async |
## 8. Academic References / ๅญฆๆœฏๅ‚่€ƒ
1. **Bhuiyan, M. A. E., et al.** (2020). *Improving satellite-based precipitation estimates over complex terrain using machine learning algorithms*. **Journal of Hydrology**.
2. **Maclean, I. M., et al.** (2018). *Microclima: An R package for modelling meso- and microclimate*. **Methods in Ecology and Evolution**.
3. **Garcez, A. d., & Lamb, L. C.** (2020). *Neurosymbolic AI: The 3rd Wave*. arXiv:2012.05876.
4. **Luks, A. M., et al.** (2019). *Wilderness Medical Society Practice Guidelines for the Prevention and Treatment of Acute Altitude Illness*.
5. **Vandal, T., et al.** (2017). *DeepSD: Generating high-resolution climate change projections through single image super-resolution*. **KDD**.
See `docs/thresholds.md` for the full citation table per Veto threshold.
## 9. Roadmap
- [x] Frontend dashboard with XAI inference log
- [x] SQLite caching with WAL + risk-adaptive TTL
- [x] Terrain detection engine (Valley / Slope / Flat)
- [x] Rule-based Veto + 0-100 scoring engine (19/19 unit tests passing)
- [x] Bilingual (EN/ZH) advice generation
- [x] Dataset download script (Open-Meteo + Open Topo Data) + offline synthetic fallback
- [x] Preprocessing pipeline (feature engineering + label `is_rain_event`)
- [x] Random Forest training with time-based CV โ€” **trained on real ERA5 data, test AUC = 0.871**
- [ ] Model comparison (RFC vs LogReg vs XGBoost) โ€” thesis Chapter 5
- [ ] Hindcast validation against real Malaysian flood events
- [ ] PWA offline mode for low-network mountain use
## 10. License
MIT โ€” see `LICENSE`.
---
*Developed by L.ZH @ Universiti Kebangsaan Malaysia (UKM) for the Final Year Project (FYP).*