Spaces:

W1nd5pac
/

microclimate-x

Paused

App Files Files Community

microclimate-x / docs /dataset.md

W1nd5pac

Deploy 2026-05-20T06:52:08Z — 11e81c5

4eefabb verified 1 day ago

preview code

raw

history blame contribute delete

6.11 kB

Dataset Specification

数据集说明

The exact dataset structure that the supervisor approved at the 4/15 review. 4 月 15 日导师 review 后确认的数据集结构。

1. Source / 数据来源

Component	Source	URL
Hourly weather	Open-Meteo Historical Weather API (ECMWF ERA5 reanalysis)	https://open-meteo.com/en/docs/historical-weather-api
Elevation	Open-Topo-Data (SRTM 30 m DEM)	https://www.opentopodata.org/datasets/srtm/

ERA5 is the gold-standard reanalysis dataset in academic meteorology, providing physically-consistent hourly values from 1940 to present.

2. Spatial coverage / 空间覆盖

Five Malaysian mountain locations, chosen to span a range of elevations and terrain types:

Site	Latitude	Longitude	Approx. elev.	Terrain
Genting Highlands	3.4225	101.7935	~1865 m	Slope
Cameron Highlands	4.4694	101.3776	~1500 m	Highland plateau
Fraser's Hill	3.7256	101.7378	~1300 m	Slope
Klang Valley	3.0738	101.5183	~100 m	Valley floor
Mt Kinabalu (base)	6.0535	116.5586	~1800 m	Mountain

3. Temporal coverage / 时间范围

2020-01-01 → 2023-12-31, hourly resolution (one row per hour per site).

Expected sample count: 5 sites × 4 years × 365.25 days × 24 hours ≈ 175 320 rows.

4. Schema / 列结构

Position	Column	Type	Role	Description
0	`site`	str	meta	Site name
1	`latitude`	float	meta	WGS84
2	`longitude`	float	meta	WGS84
3	`elevation_m`	float	X	DEM-derived altitude (static per site)
4	`time`	datetime	meta	Hourly UTC+8 (Asia/Kuala_Lumpur)
5	`temperature_c`	float	X	2 m air temperature
6	`humidity_pct`	float	X	Relative humidity 0-100
7	`precipitation`	float	(raw)	mm in past hour — used to derive Y
8	`wind_speed_kmh`	float	X	10 m wind speed
9	`wind_direction_deg`	float	X	Direction FROM which wind blows, 0-360°
10	`wind_u`	float	X	u = speed · sin(dir)
11	`wind_v`	float	X	v = speed · cos(dir)
12	`pressure_hpa`	float	X	Surface pressure
13	`pressure_change_3h`	float	X	Δp over preceding 3 h (storm precursor)
14	`dew_point_c`	float	X	2 m dew-point
15	`dew_point_depression`	float	X	T − T_dew (saturation proxy)
16	`cloud_cover_pct`	float	X	Total cloud cover 0-100
17	`cape_jkg`	float	X	Convective Available Potential Energy
18	`precipitation_lag_1h`	float	X	Previous hour's precipitation
19	`hour_sin`, `hour_cos`	float	X	Cyclic encoding of hour-of-day
20	`month_sin`, `month_cos`	float	X	Cyclic encoding of month (captures monsoon)
21	`is_rain_event`	int {0,1}	Y	1 if `precipitation(t+1h) > 0.1 mm` else 0

5. Target label derivation / 目标标签的衍生

This is THE column that earlier supervisor feedback flagged as missing in the raw CSV. It is engineered explicitly in scripts/2_preprocess.py:

df['is_rain_event'] = (df['precipitation'].shift(-1) > 0.1).astype(int)

Three things the panel should notice:

.shift(-1) means future: features at time t are paired with the rain outcome at t+1h. The model never sees future data as input — this prevents temporal data leakage.
0.1 mm threshold: this matches the WMO definition of trace precipitation — i.e. it is not an arbitrary cutoff.
Binary, not amount-of-rain. The pipeline could be extended to a regression task; we deliberately model classification because the downstream user decision is binary ("go / no-go").

6. Train / test split / 划分策略

Time-based, not random. The last 20 % of each site's chronological data is reserved as the hold-out test set; the remaining 80 % goes to a 5-fold TimeSeriesSplit cross-validation. Random splits would leak temporal autocorrelation and inflate accuracy by 5-15 percentage points.

7. Class balance / 类别分布

Empirically in tropical Malaysia, is_rain_event = 1 holds in approximately 20-30 % of hours (more in monsoon months, less in dry season). We pass class_weight='balanced' to the Random Forest to prevent it from collapsing to a trivial "always predict no-rain" classifier.

8. Reproducibility / 可复现性

# Real ERA5 path (preferred)
python scripts/1_download_dataset.py    # ~5-10 min, network-bound
python scripts/2_preprocess.py          # < 30 s
python scripts/3_train_model.py         # ~30-90 s on a modern laptop

All scripts are idempotent — re-running them does not duplicate data or re-download files that already exist locally.

9. Offline / synthetic-data fallback / 离线合成数据回退

For environments without network access (e.g. exam labs, restricted classroom networks) we ship scripts/1b_synth_dataset.py, a deterministic physics-informed synthetic generator (seed = 42, see file header for the meteorological assumptions encoded).

The synthetic dataset:

has the identical schema as the Open-Meteo download,
preserves Malaysia's bimodal monsoon seasonality, tropical diurnal cycle, lapse rate, hydrostatic pressure decay, and zero-inflated rain distribution,
yields a comparable class balance (~26 % positive),
lets the entire pipeline + frontend + tests be exercised without any external network calls.

It is not a substitute for real ERA5 data in the final thesis evaluation. The recommended workflow once network is restored is:

rm data/raw_*.csv data/processed.csv         # clear synthetic data
python scripts/1_download_dataset.py         # fetch real ERA5 via Open-Meteo
python scripts/2_preprocess.py
python scripts/3_train_model.py              # retrain on real data