# Dataset Specification
# 数据集说明

> The exact dataset structure that the supervisor approved at the 4/15 review.
> 4 月 15 日导师 review 后确认的数据集结构。

## 1. Source / 数据来源

| Component | Source | URL |
|---|---|---|
| Hourly weather | Open-Meteo Historical Weather API (ECMWF ERA5 reanalysis) | https://open-meteo.com/en/docs/historical-weather-api |
| Elevation     | Open-Topo-Data (SRTM 30 m DEM) | https://www.opentopodata.org/datasets/srtm/ |

ERA5 is the gold-standard reanalysis dataset in academic meteorology, providing physically-consistent hourly values from 1940 to present.

## 2. Spatial coverage / 空间覆盖

Five Malaysian mountain locations, chosen to span a range of elevations and terrain types:

| Site | Latitude | Longitude | Approx. elev. | Terrain |
|---|---|---|---|---|
| Genting Highlands  | 3.4225 | 101.7935 | ~1865 m | Slope |
| Cameron Highlands  | 4.4694 | 101.3776 | ~1500 m | Highland plateau |
| Fraser's Hill      | 3.7256 | 101.7378 | ~1300 m | Slope |
| Klang Valley       | 3.0738 | 101.5183 |  ~100 m | Valley floor |
| Mt Kinabalu (base) | 6.0535 | 116.5586 | ~1800 m | Mountain |

## 3. Temporal coverage / 时间范围

**2020-01-01 → 2023-12-31**, hourly resolution (one row per hour per site).

Expected sample count: 5 sites × 4 years × 365.25 days × 24 hours ≈ **175 320 rows**.

## 4. Schema / 列结构

| Position | Column | Type | Role | Description |
|---|---|---|---|---|
| 0 | `site`               | str   | meta    | Site name |
| 1 | `latitude`           | float | meta    | WGS84 |
| 2 | `longitude`          | float | meta    | WGS84 |
| 3 | `elevation_m`        | float | **X**   | DEM-derived altitude (static per site) |
| 4 | `time`               | datetime | meta | Hourly UTC+8 (Asia/Kuala_Lumpur) |
| 5 | `temperature_c`      | float | **X**   | 2 m air temperature |
| 6 | `humidity_pct`       | float | **X**   | Relative humidity 0-100 |
| 7 | `precipitation`      | float | (raw)   | mm in past hour — used to derive Y |
| 8 | `wind_speed_kmh`     | float | **X**   | 10 m wind speed |
| 9 | `wind_direction_deg` | float | **X**   | Direction FROM which wind blows, 0-360° |
| 10 | `wind_u`            | float | **X**   | u = speed · sin(dir) |
| 11 | `wind_v`            | float | **X**   | v = speed · cos(dir) |
| 12 | `pressure_hpa`      | float | **X**   | Surface pressure |
| 13 | `pressure_change_3h`| float | **X**   | Δp over preceding 3 h (storm precursor) |
| 14 | `dew_point_c`       | float | **X**   | 2 m dew-point |
| 15 | `dew_point_depression` | float | **X** | T − T_dew (saturation proxy) |
| 16 | `cloud_cover_pct`   | float | **X**   | Total cloud cover 0-100 |
| 17 | `cape_jkg`          | float | **X**   | Convective Available Potential Energy |
| 18 | `precipitation_lag_1h` | float | **X** | Previous hour's precipitation |
| 19 | `hour_sin`, `hour_cos` | float | **X** | Cyclic encoding of hour-of-day |
| 20 | `month_sin`, `month_cos` | float | **X** | Cyclic encoding of month (captures monsoon) |
| 21 | **`is_rain_event`** | **int {0,1}** | **Y** | **1 if `precipitation(t+1h) > 0.1 mm` else 0** |

## 5. Target label derivation / 目标标签的衍生

This is **THE** column that earlier supervisor feedback flagged as missing in the raw CSV. It is engineered explicitly in `scripts/2_preprocess.py`:

```python
df['is_rain_event'] = (df['precipitation'].shift(-1) > 0.1).astype(int)
```

Three things the panel should notice:

1. **`.shift(-1)` means future**: features at time `t` are paired with the rain outcome at `t+1h`. The model never sees future data as input — this prevents temporal data leakage.
2. **0.1 mm threshold**: this matches the **WMO definition of trace precipitation** — i.e. it is *not* an arbitrary cutoff.
3. **Binary**, not amount-of-rain. The pipeline could be extended to a regression task; we deliberately model classification because the downstream user decision is binary ("go / no-go").

## 6. Train / test split / 划分策略

**Time-based**, not random. The last 20 % of each site's chronological data is reserved as the hold-out test set; the remaining 80 % goes to a 5-fold `TimeSeriesSplit` cross-validation. Random splits would leak temporal autocorrelation and inflate accuracy by 5-15 percentage points.

## 7. Class balance / 类别分布

Empirically in tropical Malaysia, `is_rain_event = 1` holds in approximately 20-30 % of hours (more in monsoon months, less in dry season). We pass `class_weight='balanced'` to the Random Forest to prevent it from collapsing to a trivial "always predict no-rain" classifier.

## 8. Reproducibility / 可复现性

```bash
# Real ERA5 path (preferred)
python scripts/1_download_dataset.py    # ~5-10 min, network-bound
python scripts/2_preprocess.py          # < 30 s
python scripts/3_train_model.py         # ~30-90 s on a modern laptop
```

All scripts are idempotent — re-running them does not duplicate data or re-download files that already exist locally.

## 9. Offline / synthetic-data fallback / 离线合成数据回退

For environments without network access (e.g. exam labs, restricted classroom networks) we ship `scripts/1b_synth_dataset.py`, a deterministic physics-informed synthetic generator (seed = 42, see file header for the meteorological assumptions encoded).

The synthetic dataset:
- has the **identical schema** as the Open-Meteo download,
- preserves Malaysia's bimodal monsoon seasonality, tropical diurnal cycle, lapse rate, hydrostatic pressure decay, and zero-inflated rain distribution,
- yields a comparable class balance (~26 % positive),
- lets the **entire pipeline + frontend + tests** be exercised without any external network calls.

It is **not** a substitute for real ERA5 data in the final thesis evaluation. The recommended workflow once network is restored is:

```bash
rm data/raw_*.csv data/processed.csv         # clear synthetic data
python scripts/1_download_dataset.py         # fetch real ERA5 via Open-Meteo
python scripts/2_preprocess.py
python scripts/3_train_model.py              # retrain on real data
```