Spaces:
Paused
Dataset Specification
数据集说明
The exact dataset structure that the supervisor approved at the 4/15 review. 4 月 15 日导师 review 后确认的数据集结构。
1. Source / 数据来源
| Component | Source | URL |
|---|---|---|
| Hourly weather | Open-Meteo Historical Weather API (ECMWF ERA5 reanalysis) | https://open-meteo.com/en/docs/historical-weather-api |
| Elevation | Open-Topo-Data (SRTM 30 m DEM) | https://www.opentopodata.org/datasets/srtm/ |
ERA5 is the gold-standard reanalysis dataset in academic meteorology, providing physically-consistent hourly values from 1940 to present.
2. Spatial coverage / 空间覆盖
Five Malaysian mountain locations, chosen to span a range of elevations and terrain types:
| Site | Latitude | Longitude | Approx. elev. | Terrain |
|---|---|---|---|---|
| Genting Highlands | 3.4225 | 101.7935 | ~1865 m | Slope |
| Cameron Highlands | 4.4694 | 101.3776 | ~1500 m | Highland plateau |
| Fraser's Hill | 3.7256 | 101.7378 | ~1300 m | Slope |
| Klang Valley | 3.0738 | 101.5183 | ~100 m | Valley floor |
| Mt Kinabalu (base) | 6.0535 | 116.5586 | ~1800 m | Mountain |
3. Temporal coverage / 时间范围
2020-01-01 → 2023-12-31, hourly resolution (one row per hour per site).
Expected sample count: 5 sites × 4 years × 365.25 days × 24 hours ≈ 175 320 rows.
4. Schema / 列结构
| Position | Column | Type | Role | Description |
|---|---|---|---|---|
| 0 | site |
str | meta | Site name |
| 1 | latitude |
float | meta | WGS84 |
| 2 | longitude |
float | meta | WGS84 |
| 3 | elevation_m |
float | X | DEM-derived altitude (static per site) |
| 4 | time |
datetime | meta | Hourly UTC+8 (Asia/Kuala_Lumpur) |
| 5 | temperature_c |
float | X | 2 m air temperature |
| 6 | humidity_pct |
float | X | Relative humidity 0-100 |
| 7 | precipitation |
float | (raw) | mm in past hour — used to derive Y |
| 8 | wind_speed_kmh |
float | X | 10 m wind speed |
| 9 | wind_direction_deg |
float | X | Direction FROM which wind blows, 0-360° |
| 10 | wind_u |
float | X | u = speed · sin(dir) |
| 11 | wind_v |
float | X | v = speed · cos(dir) |
| 12 | pressure_hpa |
float | X | Surface pressure |
| 13 | pressure_change_3h |
float | X | Δp over preceding 3 h (storm precursor) |
| 14 | dew_point_c |
float | X | 2 m dew-point |
| 15 | dew_point_depression |
float | X | T − T_dew (saturation proxy) |
| 16 | cloud_cover_pct |
float | X | Total cloud cover 0-100 |
| 17 | cape_jkg |
float | X | Convective Available Potential Energy |
| 18 | precipitation_lag_1h |
float | X | Previous hour's precipitation |
| 19 | hour_sin, hour_cos |
float | X | Cyclic encoding of hour-of-day |
| 20 | month_sin, month_cos |
float | X | Cyclic encoding of month (captures monsoon) |
| 21 | is_rain_event |
int {0,1} | Y | 1 if precipitation(t+1h) > 0.1 mm else 0 |
5. Target label derivation / 目标标签的衍生
This is THE column that earlier supervisor feedback flagged as missing in the raw CSV. It is engineered explicitly in scripts/2_preprocess.py:
df['is_rain_event'] = (df['precipitation'].shift(-1) > 0.1).astype(int)
Three things the panel should notice:
.shift(-1)means future: features at timetare paired with the rain outcome att+1h. The model never sees future data as input — this prevents temporal data leakage.- 0.1 mm threshold: this matches the WMO definition of trace precipitation — i.e. it is not an arbitrary cutoff.
- Binary, not amount-of-rain. The pipeline could be extended to a regression task; we deliberately model classification because the downstream user decision is binary ("go / no-go").
6. Train / test split / 划分策略
Time-based, not random. The last 20 % of each site's chronological data is reserved as the hold-out test set; the remaining 80 % goes to a 5-fold TimeSeriesSplit cross-validation. Random splits would leak temporal autocorrelation and inflate accuracy by 5-15 percentage points.
7. Class balance / 类别分布
Empirically in tropical Malaysia, is_rain_event = 1 holds in approximately 20-30 % of hours (more in monsoon months, less in dry season). We pass class_weight='balanced' to the Random Forest to prevent it from collapsing to a trivial "always predict no-rain" classifier.
8. Reproducibility / 可复现性
# Real ERA5 path (preferred)
python scripts/1_download_dataset.py # ~5-10 min, network-bound
python scripts/2_preprocess.py # < 30 s
python scripts/3_train_model.py # ~30-90 s on a modern laptop
All scripts are idempotent — re-running them does not duplicate data or re-download files that already exist locally.
9. Offline / synthetic-data fallback / 离线合成数据回退
For environments without network access (e.g. exam labs, restricted classroom networks) we ship scripts/1b_synth_dataset.py, a deterministic physics-informed synthetic generator (seed = 42, see file header for the meteorological assumptions encoded).
The synthetic dataset:
- has the identical schema as the Open-Meteo download,
- preserves Malaysia's bimodal monsoon seasonality, tropical diurnal cycle, lapse rate, hydrostatic pressure decay, and zero-inflated rain distribution,
- yields a comparable class balance (~26 % positive),
- lets the entire pipeline + frontend + tests be exercised without any external network calls.
It is not a substitute for real ERA5 data in the final thesis evaluation. The recommended workflow once network is restored is:
rm data/raw_*.csv data/processed.csv # clear synthetic data
python scripts/1_download_dataset.py # fetch real ERA5 via Open-Meteo
python scripts/2_preprocess.py
python scripts/3_train_model.py # retrain on real data