# Dataset Specification
> The exact dataset structure that the supervisor approved at the April 15 review.
## 1. Source
| Component | Source | URL |
|---|---|---|
| Hourly weather | Open-Meteo Historical Weather API (ECMWF ERA5 reanalysis) | https://open-meteo.com/en/docs/historical-weather-api |
| Elevation | Open-Topo-Data (SRTM 30 m DEM) | https://www.opentopodata.org/datasets/srtm/ |
ERA5 is ECMWF's global reanalysis and a standard reference dataset in academic meteorology, providing physically consistent hourly values from 1940 to the present.
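As a rough illustration of how one site's hourly series might be requested, the sketch below only builds a request URL and makes no network call. The endpoint, the parameter names, and the three hourly variable names are assumptions based on the public Open-Meteo archive API and should be checked against the documentation linked above; `build_archive_url` is a hypothetical helper, not a function from this repository.

```python
from urllib.parse import urlencode

# Assumed endpoint for the Open-Meteo historical (archive) API.
ARCHIVE_ENDPOINT = "https://archive-api.open-meteo.com/v1/archive"


def build_archive_url(latitude, longitude, start_date, end_date, hourly_vars):
    """Return a query URL for one site; nothing is downloaded here."""
    params = {
        "latitude": latitude,
        "longitude": longitude,
        "start_date": start_date,   # ISO dates, inclusive
        "end_date": end_date,
        "hourly": ",".join(hourly_vars),
        "timezone": "Asia/Kuala_Lumpur",  # matches the UTC+8 `time` column
    }
    return f"{ARCHIVE_ENDPOINT}?{urlencode(params)}"


# Genting Highlands coordinates; the variable list is an illustrative subset.
url = build_archive_url(
    3.4225, 101.7935, "2020-01-01", "2023-12-31",
    ["temperature_2m", "relative_humidity_2m", "precipitation"],
)
print(url)
```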
## 2. Spatial coverage
Five Malaysian locations (four highland sites and one lowland valley site), chosen to span a range of elevations and terrain types:
| Site | Latitude | Longitude | Approx. elev. | Terrain |
|---|---|---|---|---|
| Genting Highlands | 3.4225 | 101.7935 | ~1865 m | Slope |
| Cameron Highlands | 4.4694 | 101.3776 | ~1500 m | Highland plateau |
| Fraser's Hill | 3.7256 | 101.7378 | ~1300 m | Slope |
| Klang Valley | 3.0738 | 101.5183 | ~100 m | Valley floor |
| Mt Kinabalu (base) | 6.0535 | 116.5586 | ~1800 m | Mountain |
## 3. Temporal coverage
**2020-01-01 → 2023-12-31**, hourly resolution (one row per hour per site).
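Expected sample count: 5 sites × 4 years × 365.25 days × 24 hours = **175 320 rows**. The figure is exact rather than approximate, because the range contains exactly one leap year (2020), giving 1 461 days per site.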
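The row count can be cross-checked by counting hours between the two boundary timestamps directly. This is a small sanity check rather than part of the pipeline, and the variable names are illustrative.

```python
from datetime import datetime

SITES = 5  # number of rows in the site table above

start = datetime(2020, 1, 1, 0)
end_exclusive = datetime(2024, 1, 1, 0)  # first hour after 2023-12-31 23:00

# Whole hours in the half-open interval [start, end_exclusive).
hours_per_site = int((end_exclusive - start).total_seconds() // 3600)
total_rows = SITES * hours_per_site

print(hours_per_site, total_rows)  # 35064 175320
```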
## 4. Schema
| Position | Column | Type | Role | Description |
|---|---|---|---|---|
| 0 | `site` | str | meta | Site name |
| 1 | `latitude` | float | meta | WGS84 |
| 2 | `longitude` | float | meta | WGS84 |
| 3 | `elevation_m` | float | **X** | DEM-derived altitude (static per site) |
| 4 | `time` | datetime | meta | Hourly UTC+8 (Asia/Kuala_Lumpur) |
| 5 | `temperature_c` | float | **X** | 2 m air temperature |
| 6 | `humidity_pct` | float | **X** | Relative humidity 0-100 |
| 7 | `precipitation` | float | (raw) | mm in past hour — used to derive Y |
| 8 | `wind_speed_kmh` | float | **X** | 10 m wind speed |
| 9 | `wind_direction_deg` | float | **X** | Direction FROM which wind blows, 0-360° |
| 10 | `wind_u` | float | **X** | u = speed · sin(dir) |
| 11 | `wind_v` | float | **X** | v = speed · cos(dir) |
| 12 | `pressure_hpa` | float | **X** | Surface pressure |
| 13 | `pressure_change_3h` | float | **X** | Δp over preceding 3 h (storm precursor) |
| 14 | `dew_point_c` | float | **X** | 2 m dew-point |
| 15 | `dew_point_depression` | float | **X** | T − T_dew (saturation proxy) |
| 16 | `cloud_cover_pct` | float | **X** | Total cloud cover 0-100 |
| 17 | `cape_jkg` | float | **X** | Convective Available Potential Energy |
| 18 | `precipitation_lag_1h` | float | **X** | Previous hour's precipitation |
| 19-20 | `hour_sin`, `hour_cos` | float | **X** | Cyclic encoding of hour-of-day |
| 21-22 | `month_sin`, `month_cos` | float | **X** | Cyclic encoding of month (captures monsoon) |
| 23 | **`is_rain_event`** | **int {0,1}** | **Y** | **1 if `precipitation(t+1h) > 0.1 mm` else 0** |
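The engineered columns in the table can be derived from the raw ones roughly as follows. This is a minimal sketch, not the code in `scripts/2_preprocess.py`: `add_derived_features` is a hypothetical helper, it assumes a single site's rows sorted by `time` at hourly spacing, it reads Δp as p(t) − p(t − 3 h), and it uses periods of 24 and 12 for the cyclic encodings. The wind components follow the formulas exactly as written in the table, with the direction converted to radians first; the usual meteorological wind vector for a "from" direction would carry a minus sign on both components.

```python
import numpy as np
import pandas as pd


def add_derived_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add engineered columns to one site's hourly, time-sorted frame."""
    out = df.copy()

    # Trigonometric functions expect radians, not degrees.
    direction_rad = np.deg2rad(out["wind_direction_deg"])
    out["wind_u"] = out["wind_speed_kmh"] * np.sin(direction_rad)
    out["wind_v"] = out["wind_speed_kmh"] * np.cos(direction_rad)

    # Backward-looking differences and lags only, so no future data is used.
    out["pressure_change_3h"] = out["pressure_hpa"].diff(3)
    out["precipitation_lag_1h"] = out["precipitation"].shift(1)
    out["dew_point_depression"] = out["temperature_c"] - out["dew_point_c"]

    # Cyclic encodings keep hour 23 adjacent to hour 0 (and Dec to Jan).
    hour = out["time"].dt.hour
    month = out["time"].dt.month
    out["hour_sin"] = np.sin(2 * np.pi * hour / 24)
    out["hour_cos"] = np.cos(2 * np.pi * hour / 24)
    out["month_sin"] = np.sin(2 * np.pi * month / 12)
    out["month_cos"] = np.cos(2 * np.pi * month / 12)
    return out


# Tiny made-up frame, for illustration only.
demo = pd.DataFrame({
    "time": pd.to_datetime(["2020-01-01 00:00", "2020-01-01 01:00",
                            "2020-01-01 02:00", "2020-01-01 03:00"]),
    "wind_speed_kmh": [10.0, 10.0, 5.0, 0.0],
    "wind_direction_deg": [0.0, 90.0, 180.0, 270.0],
    "pressure_hpa": [1010.0, 1009.5, 1009.0, 1008.0],
    "temperature_c": [25.0, 24.0, 24.0, 23.0],
    "dew_point_c": [22.0, 22.0, 23.0, 23.0],
    "precipitation": [0.0, 0.5, 0.0, 2.0],
})
feat = add_derived_features(demo)
print(feat[["wind_u", "wind_v", "pressure_change_3h", "precipitation_lag_1h"]])
```

The first three rows of `pressure_change_3h` and the first row of `precipitation_lag_1h` are `NaN` by construction, so a real pipeline has to decide whether to drop or impute them.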
## 5. Target label derivation
This is **THE** column that earlier supervisor feedback flagged as missing in the raw CSV. It is engineered explicitly in `scripts/2_preprocess.py`:
```python
df['is_rain_event'] = (df['precipitation'].shift(-1) > 0.1).astype(int)
```
Three things the panel should notice:
1. **`.shift(-1)` means future**: features at time `t` are paired with the rain outcome at `t+1h`. The model never sees future data as input — this prevents temporal data leakage.
2. **0.1 mm threshold**: 0.1 mm is the conventional lower limit of measurable precipitation, and smaller amounts are reported as trace precipitation under WMO practice, so it is *not* an arbitrary cutoff.
3. **Binary**, not amount-of-rain. The pipeline could be extended to a regression task; we deliberately model classification because the downstream user decision is binary ("go / no-go").
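One detail worth checking when the frame holds several sites stacked together: a plain `.shift(-1)` pairs the last hour of one site with the first hour of the next, and the final row gets a `NaN` that silently becomes label 0. The sketch below shows one way to guard against both, assuming `site`, `time`, and `precipitation` columns as in the schema. `add_rain_label` is a hypothetical helper and may differ from what `scripts/2_preprocess.py` actually does.

```python
import pandas as pd


def add_rain_label(df: pd.DataFrame, threshold_mm: float = 0.1) -> pd.DataFrame:
    """Label each row with next-hour rain, shifting within each site only."""
    out = df.sort_values(["site", "time"]).copy()
    # Group-wise shift: the last hour of a site never sees another site's data.
    out["next_precip"] = out.groupby("site")["precipitation"].shift(-1)
    # The final hour of each site has no t+1h outcome, so drop it
    # instead of letting NaN turn into a 0 label.
    out = out.dropna(subset=["next_precip"]).copy()
    out["is_rain_event"] = (out["next_precip"] > threshold_mm).astype(int)
    return out.drop(columns="next_precip")


# Made-up values for two illustrative sites, three hours each.
hours = pd.to_datetime(["2020-01-01 00:00", "2020-01-01 01:00", "2020-01-01 02:00"])
demo = pd.DataFrame({
    "site": ["A"] * 3 + ["B"] * 3,
    "time": list(hours) * 2,
    "precipitation": [0.0, 0.5, 0.0, 2.0, 0.05, 0.3],
})
labeled = add_rain_label(demo)
print(labeled)
```

With a naive shift over the stacked frame, the last row of site A would be labeled 1 because of site B's first-hour rainfall; here that row is dropped instead.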
## 6. Train / test split
**Time-based**, not random. The last 20 % of each site's chronological data is reserved as the hold-out test set; the remaining 80 % goes to a 5-fold `TimeSeriesSplit` cross-validation. A random split would place adjacent, strongly autocorrelated hours on both sides of the split, leaking information into the test set and inflating accuracy by an estimated 5-15 percentage points.
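A minimal sketch of the split on row positions for one site, assuming the rows are already sorted by time. `chronological_holdout` is a hypothetical helper; only `TimeSeriesSplit` is a real scikit-learn class.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit


def chronological_holdout(n_rows: int, test_fraction: float = 0.2):
    """Return (train_idx, test_idx) with the newest rows held out."""
    n_test = int(round(n_rows * test_fraction))
    cut = n_rows - n_test
    return np.arange(cut), np.arange(cut, n_rows)


# 100 time-ordered rows stand in for one site's series.
train_idx, test_idx = chronological_holdout(100)

# Expanding-window folds: every validation block lies after its training block.
folds = list(TimeSeriesSplit(n_splits=5).split(train_idx))
for fold_train, fold_val in folds:
    print(len(fold_train), fold_val.min(), fold_val.max())
```

When several sites are pooled into one frame, the same idea applies per site, or after sorting the pooled frame by `time`, so that no fold trains on hours later than the ones it validates on.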
## 7. Class balance
Empirically in tropical Malaysia, `is_rain_event = 1` holds in approximately 20-30 % of hours (more in monsoon months, less in dry season). We pass `class_weight='balanced'` to the Random Forest to prevent it from collapsing to a trivial "always predict no-rain" classifier.
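To make the effect of `class_weight='balanced'` concrete, the sketch below computes the weights scikit-learn would assign for a made-up label vector with 25 % positives. The weights follow `n_samples / (n_classes * count_per_class)`; the label vector is illustrative, not taken from the dataset.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative labels: 75 dry hours, 25 rain hours.
y = np.array([0] * 75 + [1] * 25)

weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)
# Class 0: 100 / (2 * 75), class 1: 100 / (2 * 25)
print(dict(zip([0, 1], weights)))
```

Each rain hour therefore counts three times as much as a dry hour in this example, which offsets the 3:1 imbalance.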
## 8. Reproducibility
```bash
# Real ERA5 path (preferred)
python scripts/1_download_dataset.py # ~5-10 min, network-bound
python scripts/2_preprocess.py # < 30 s
python scripts/3_train_model.py # ~30-90 s on a modern laptop
```
All scripts are idempotent — re-running them does not duplicate data or re-download files that already exist locally.
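One common way to get this behaviour is to skip any step whose output file already exists and to write new files through a temporary name, so that an interrupted run never leaves a half-written file behind. The sketch below illustrates the pattern only; `ensure_file` and `fake_download` are hypothetical names, and the repository's scripts may implement idempotency differently.

```python
import tempfile
from pathlib import Path


def ensure_file(path, produce) -> bool:
    """Write produce() to path unless it exists. Returns True if written."""
    path = Path(path)
    if path.exists():
        return False  # already present: skip the expensive step
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(produce())
    tmp.replace(path)  # rename only after the content is fully written
    return True


calls = []


def fake_download() -> str:
    """Stand-in for a network fetch; records how often it is invoked."""
    calls.append(1)
    return "site,time,precipitation\n"


with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "raw_demo.csv"
    first = ensure_file(target, fake_download)
    second = ensure_file(target, fake_download)

print(first, second, len(calls))
```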
## 9. Offline / synthetic-data fallback
For environments without network access (e.g. exam labs, restricted classroom networks) we ship `scripts/1b_synth_dataset.py`, a deterministic physics-informed synthetic generator (seed = 42, see file header for the meteorological assumptions encoded).
The synthetic dataset:
- has the **identical schema** as the Open-Meteo download,
- preserves Malaysia's bimodal monsoon seasonality, tropical diurnal cycle, lapse rate, hydrostatic pressure decay, and zero-inflated rain distribution,
- yields a comparable class balance (~26 % positive),
- lets the **entire pipeline + frontend + tests** be exercised without any external network calls.
It is **not** a substitute for real ERA5 data in the final thesis evaluation. The recommended workflow once network is restored is:
```bash
rm -f data/raw_*.csv data/processed.csv # clear synthetic data (-f: no error if already absent)
python scripts/1_download_dataset.py # fetch real ERA5 via Open-Meteo
python scripts/2_preprocess.py
python scripts/3_train_model.py # retrain on real data
```
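Because the synthetic and real datasets are meant to share one schema, a quick comparison of column names, order, and dtypes can catch drift between the two generators before retraining. A minimal sketch, with `schema_mismatches` as a hypothetical helper and tiny made-up frames standing in for the two CSVs:

```python
import pandas as pd


def schema_mismatches(reference: pd.DataFrame, candidate: pd.DataFrame) -> list:
    """Return human-readable schema differences; empty list means identical."""
    problems = []
    if list(reference.columns) != list(candidate.columns):
        problems.append("column names or order differ")
    # Compare dtypes only for columns present in both frames.
    for col in reference.columns.intersection(candidate.columns):
        if reference[col].dtype != candidate[col].dtype:
            problems.append(f"dtype differs for {col!r}")
    return problems


real = pd.DataFrame({"site": ["A"], "precipitation": [0.0]})
synthetic_ok = pd.DataFrame({"site": ["B"], "precipitation": [1.2]})
synthetic_bad = pd.DataFrame({"site": ["B"], "precipitation": [1]})  # int, not float

print(schema_mismatches(real, synthetic_ok))
print(schema_mismatches(real, synthetic_bad))
```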