bong9513 committed
Commit 4ac637a · verified · 1 Parent(s): b53e714

Upload README.md

Files changed (1)
  1. README.md +254 -51
README.md CHANGED
@@ -1,69 +1,272 @@
- # Visibility Prediction Modeling Project

- This project aims to develop machine learning and deep learning models that analyze and predict air quality data.

- ## Project Structure

  ```
- paper/
- ├── Analysis_code/                       # Analysis and modeling code
- │   ├── find_reason/                     # Root-cause analysis code
- │   ├── optima/                          # Optimization code
- │   ├── sampling_data_test/              # Sampling test code
- │   ├── final_test/                      # Final test code
- │   ├── make_oversample_data/            # Oversampled data generation
- │   ├── save_model/                      # Saved models
- │   ├── model_result/                    # Model results
- │   ├── models/                          # Model implementation code
- │   └── model_voting_test_best_sample/   # Ensemble model tests
- └── data/                                # Data files
  ```

- ## Key Files

- ### Data Preprocessing and Analysis
- - `0.air_data_merge.ipynb`: Merges the air quality data
- - `1.data_merge.ipynb`: Merges the additional data
- - `2.eda_preproccesing.ipynb`: Exploratory data analysis and preprocessing
- - `3.oversampling.ipynb`: Data oversampling

- ### Modeling
- - `deeplearning_model_binary.ipynb`: Binary classification deep learning model
- - `deeplearning_model_multi.ipynb`: Multiclass classification deep learning model
- - `model_visualize.ipynb`: Model result visualization
- - `make_train_test.ipynb`: Train/test data split

- ### Model Implementations
- - `resnet_like.py`: ResNet-based model implementation
- - `deepgbm.py`: DeepGBM model implementation
- - `ft_transformer.py`: Feature Transformer model implementation

- ### Saved Models
- - `best_model_f1.pth`: Model with the best F1 score
- - `best_deepgbm_model.pth`: Best-performing DeepGBM model

- ## Usage

- 1. Prepare the data
-    - Place the required data files in the `data/` folder.

- 2. Preprocess the data
-    - Run the notebooks in order, starting from `0.air_data_merge.ipynb`, to prepare the data.

- 3. Train the models
-    - Run `deeplearning_model_binary.ipynb` or `deeplearning_model_multi.ipynb` to train a model.

- 4. Analyze the results
-    - Run `model_visualize.ipynb` to analyze the model results.

- ## Requirements

- - Python 3.8 or later
- - PyTorch
- - pandas
- - numpy
- - scikit-learn
- - matplotlib
- - seaborn

- ## License

+ ### Visibility Prediction Modeling Project

+ This project predicts visibility (`visi`) by integrating meteorological, air pollution, and aviation data (ASOS, DataOn, TAF). Imbalanced data is augmented with SMOTENC/CTGAN, and GBDT models (LightGBM/XGBoost) are combined with tabular deep learning models (ResNet-like, FT-Transformer, DeepGBM) to perform multiclass and binary classification.

+ ### Tech Stack

+ - Data processing: `pandas`, `numpy`
+ - EDA/visualization: `matplotlib`, `seaborn`
+ - Sampling/imbalance handling: `imbalanced-learn (SMOTENC)`, `CTGAN`, `Optuna` (CTGAN hyperparameters), region/year-based splitting
+ - Modeling (GBDT): `LightGBM`, `XGBoost` (GPU option included, custom CSI evaluation)
+ - Modeling (deep learning): `PyTorch`-based `ResNetLike`, `FTTransformer`, `DeepGBM`
+ - Optimization: `hyperopt` (LightGBM/XGBoost), `Optuna` (CTGAN)
+ - Utilities/persistence: `joblib`
+
+ ### System Architecture (Pipeline)
+
+ 1) Data collection/loading: `data/ASOS`, `data/dataon`, `data/data_for_TAF`
+ 2) Merging/preprocessing: `0.air_data_merge.ipynb` → `1.data_merge.ipynb` → `2.eda_preproccesing.ipynb`
+ 3) Data augmentation (imbalance handling): in `Analysis_code/make_oversample_data/`, `SMOTENC` → `CTGAN(+Optuna)` → rule-based filtering
+ 4) Data splitting: per region (`*_train.csv`, `*_test.csv`), year-based 3-fold holdout
+ 5) Training: GBDT (`optima/*.py`) and deep learning notebooks (`deeplearning_model_*`)
+ 6) Evaluation/analysis: custom `CSI` + F1/Accuracy, `model_visualize.ipynb`, `find_reason/*` (trends, distribution comparison)
+ 7) Ensemble/final: `model_voting_test_best_sample/*`, `final_test/final.ipynb`
+
+ ### TL;DR (Quick Start)
+
+ 1) Prepare a Python environment, then install the required packages
+
+ ```bash
+ pip install -U torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
+ pip install pandas numpy scikit-learn matplotlib seaborn imbalanced-learn optuna ctgan xgboost lightgbm joblib
  ```
+
+ 2) Place the data
+ - Put raw and intermediate artifacts under `data/`. For the training CSV/feather files, see `data/data_for_modeling/`.
+
+ 3) Run oversampling (SMOTE/CTGAN)
+
+ ```bash
+ cd Analysis_code/make_oversample_data
+ python smote_sample_1.py
+ python oversampling_code.py
  ```

+ 4) GBDT optimization/training example (Seoul)
+
+ ```bash
+ cd ../optima
+ python LGB_smote_seoul.py
+ python XGB_smote_seoul.py
+ ```
+
+ 5) Deep learning model training/evaluation: run the notebooks (`.ipynb` files in `Analysis_code/`)
+
+ ---
+
+ ### Project Structure
+
+ ```
+ visibility_prediction/
+ ├── Analysis_code/
+ │   ├── 0.air_data_merge.ipynb
+ │   ├── 1.data_merge.ipynb
+ │   ├── 2.eda_preproccesing.ipynb
+ │   ├── 3.oversampling.ipynb
+ │   ├── deeplearning_model_binary.ipynb
+ │   ├── deeplearning_model_multi.ipynb
+ │   ├── make_train_test.ipynb
+ │   ├── model_visualize.ipynb
+ │   ├── final_test/
+ │   │   └── final.ipynb
+ │   ├── find_reason/                 # Per-region trend/root-cause analysis notebooks
+ │   ├── sampling_data_test/          # Notebooks testing performance on sampled data
+ │   ├── model_voting_test_best_sample/
+ │   │   └── ensemble__voting_best_sample.ipynb
+ │   ├── make_oversample_data/
+ │   │   ├── oversampling_code.py     # SMOTENC+CTGAN pipeline
+ │   │   ├── smote_sample_1.py        # SMOTE sample including year handling/preprocessing
+ │   │   └── (gan_sample_*.py, etc.)
+ │   ├── optima/                      # GBDT hyperparameter search/training scripts
+ │   │   ├── LGB_smote_seoul.py
+ │   │   └── XGB_smote_seoul.py
+ │   ├── models/
+ │   │   ├── best_resnet_model.pth
+ │   │   └── tabnet_model.zip
+ │   ├── deepgbm.py
+ │   ├── ft_transformer.py
+ │   └── resnet_like.py
+ ├── data/
+ │   ├── ASOS/                        # Weather
+ │   ├── dataon/                      # Air pollution (large daily CSVs)
+ │   ├── data_for_modeling/           # Per-region train/test CSVs and feather files
+ │   ├── data_for_demo/
+ │   ├── data_for_TAF/                # Airport TAF (aviation weather) CSVs
+ │   └── data_oversampled/
+ │       ├── smote/
+ │       ├── ctgan7000/
+ │       ├── ctgan10000/
+ │       └── ctgan20000/
+ └── README.md
+ ```
+
+ ---
+
+ ### Data and Variables
+
+ - Target variables
+   - `visi`: visibility (continuous). Example bins, as seen in the synthetic-sample filtering rules: class 0 is [0, 100), class 1 is [100, 500), and class 2 is the remaining range.
+   - `multi_class`: multiclass label (integer 0/1/2)
+   - `binary_class`: binary label. Rule: `binary_class = 0 if multi_class == 2 else 1`
+
+ - Main feature groups (as used in the code)
+   - Weather (ASOS): `temp_C`, `precip_mm`, `wind_speed`, `wind_dir` (calm, `'정온'`, replaced with 0), `hm`, `vap_pressure`, `dewpoint_C`, `loc_pressure`, `sea_pressure`, `solarRad`, `snow_cm`, `cloudcover` (int), `lm_cloudcover` (int), `low_cloudbase`, `groundtemp`
+   - Air pollution (DataOn): `O3`, `NO2`, `PM10`, `PM25`
+   - Time/cyclical: `year` (int), `month` (int), `hour` (int), `hour_sin`, `hour_cos`, `month_sin`, `month_cos`
+   - Derived: `ground_temp - temp_C` (ground temperature minus air temperature)
+
+ - Categorical variables (from the model/sampling perspective)
+   - `wind_dir`, `cloudcover`, `lm_cloudcover`, and the `int`-typed time variables (`year`, `month`, `hour`) are treated as categorical by SMOTENC/GBDT (the code automatically detects the indices of columns whose dtype is not `float64`)
+
+ - Preprocessing rules (excerpt)
+   - In `wind_dir`, replace `'정온'` (calm) with "0", then cast to integer
+   - Cast `cloudcover, lm_cloudcover` to integer
+   - During training, separate the target/auxiliary columns (`multi_class, binary_class`) and recompute them when needed
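
The labeling rules listed above can be sketched in plain Python. This is a minimal illustration, not the repository's code: the function names `to_multi_class` and `to_binary_class` are hypothetical, and the bin edges are the example ranges quoted above (class 0 for [0, 100), class 1 for [100, 500), class 2 for everything else).

```python
def to_multi_class(visi: float) -> int:
    """Map a continuous visibility value to the 0/1/2 label (assumed bin edges)."""
    if 0 <= visi < 100:
        return 0
    if 100 <= visi < 500:
        return 1
    # Everything outside the two low-visibility bins falls into class 2.
    return 2


def to_binary_class(multi_class: int) -> int:
    """README rule: binary_class = 0 if multi_class == 2 else 1."""
    return 0 if multi_class == 2 else 1


if __name__ == "__main__":
    for visi in (50.0, 100.0, 499.9, 500.0, 2000.0):
        label = to_multi_class(visi)
        print(visi, label, to_binary_class(label))
```

Deriving `binary_class` from `multi_class` rather than storing it independently keeps the two labels consistent after resampling, which is presumably why the pipeline recomputes it.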
+
+ ---
+
+ ### EDA and Preprocessing
+
+ - Merging/cleanup
+   - Remove the index column: drop `Unnamed: 0`
+   - Dtype consistency: `cloudcover`, `lm_cloudcover` as integers; `year`, `month`, `hour` as integers
+   - Special-value replacement: `wind_dir == '정온'` (calm) → "0", then cast to integer
+
+ - Feature engineering
+   - Cyclical encoding: `hour_sin`, `hour_cos`, `month_sin`, `month_cos`
+   - Difference feature: `ground_temp - temp_C`
+
+ - Distribution/trend analysis
+   - Per-region time-series trends: `find_reason/*_trend.ipynb`
+   - Distribution comparison/shift detection: `find_reason/wasserstein_distance.ipynb` (quantifies distribution differences using the Wasserstein distance)
+
+ - Data splitting
+   - Region-level datasets (`*_train.csv`, `*_test.csv`)
+   - Year-based 3-fold holdout (combinations of 2018–2020) to validate generalization performance
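
As an illustration of the cyclical encoding named above, the sketch below maps `hour` and `month` onto a circle so that, for example, hour 23 and hour 0 end up close together. The exact formula used in the notebooks is not shown in this README, so the periods (24 and 12) and the shift of `month` from 1..12 to 0..11 are assumptions, and `cyclical_encode` and `add_time_features` are hypothetical names.

```python
import math
from typing import Dict, Tuple


def cyclical_encode(value: float, period: float) -> Tuple[float, float]:
    """Return (sin, cos) of the angle that `value` occupies within `period`."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def add_time_features(row: Dict[str, float]) -> Dict[str, float]:
    """Return a copy of `row` with hour/month sin-cos features added."""
    out = dict(row)
    out["hour_sin"], out["hour_cos"] = cyclical_encode(row["hour"], 24)
    # Assumption: month is 1..12, so shift to 0..11 before encoding.
    out["month_sin"], out["month_cos"] = cyclical_encode(row["month"] - 1, 12)
    return out


if __name__ == "__main__":
    print(add_time_features({"hour": 6, "month": 1}))
```

Using both sine and cosine is what makes the encoding unambiguous: sine alone would map 3 a.m. and 9 a.m. to the same value.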
+
+ ### Imbalance Handling and Synthetic Sampling
+
+ - SMOTENC
+   - Categorical indices: the positional indices of input columns whose dtype is not `float64`
+   - Example sampling strategies: `{0: 10000, 1: 10000, 2: existing count}` or, depending on the data size, `{0: 500/1000, 1: ceil(n1/100)*100, 2: n2}`
+   - Recomputation: after sampling, restore `binary_class` from `multi_class` and rebuild the cyclical/difference features
+
+ - CTGAN(+Optuna)
+   - For classes 0 and 1, search `embedding_dim, generator_dim, discriminator_dim, pac, batch_size, discriminator_steps` with Optuna, then synthesize samples
+   - Quality filter for generated samples: `class 0 → 0 ≤ visi < 100`, `class 1 → 100 ≤ visi < 500`
+   - After the final merge, restore the derived/auxiliary features (`binary_class`, cyclical/difference items)
+
+ - Outputs
+   - Per-region CSVs are saved under `data/data_oversampled/smote/`, `ctgan7000/`, `ctgan10000/`, `ctgan20000/`
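
The quality filter for generated samples can be sketched as follows. This is a simplified stand-in that works on plain dictionaries instead of the DataFrames the scripts presumably use; `VISI_RANGES` and `filter_synthetic` are hypothetical names, and only the two ranges quoted above are assumed.

```python
# Assumed visibility ranges per class, taken from the filter rules above.
VISI_RANGES = {0: (0.0, 100.0), 1: (100.0, 500.0)}


def filter_synthetic(rows, target_class):
    """Keep generated rows whose `visi` lies in [low, high) for the target class."""
    low, high = VISI_RANGES[target_class]
    return [row for row in rows if low <= row["visi"] < high]


generated = [{"visi": 42.0}, {"visi": 130.0}, {"visi": -5.0}, {"visi": 99.9}]
kept = filter_synthetic(generated, target_class=0)
print(kept)  # only the rows with visi 42.0 and 99.9 remain
```

A filter like this matters because a generative model trained on class 0 rows can still emit values outside the class's visibility range, and such rows would carry a label that contradicts their own `visi`.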
+
+ ---
+
+ ### Model Architecture (Details)
+
+ - Deep learning (tabular)
+   - `resnet_like.py`
+     - Input: `x_num [B, N_num]`, `x_cat [B, N_cat]` → concat → input linear layer (`d_main=128`) → residual blocks (`n_blocks=4`, `d_hidden=64`, `dropout_first=0.25`) → output layer
+     - Output: `num_classes == 2 → 1 logit`, `> 2 → K logits`
+   - `ft_transformer.py`
+     - Numeric: Linear(`num_features → d_token=192`); categorical: one `nn.Embedding(d_token)` per entry of `cat_cardinalities`, then combined
+     - Encoder: `TransformerEncoderLayer(d_model=d_token, nhead=8, dropout≈0.2)` × `n_blocks=6` → mean pooling → classification head
+   - `deepgbm.py`
+     - Numeric Linear (`d_main=128`) + summed categorical embeddings → residual MLP blocks (`n_blocks=4`, `d_hidden=64`, `dropout≈0.2`) → classification head
+
+ - GBDT
+   - LightGBM (`optima/LGB_smote_seoul.py`): `objective='multiclassova'`, `n_estimators≈4000`, early stopping, a GPU option example is included, `hyperopt` search over `max_depth, min_child_weight, num_leaves, subsample, learning_rate`
+   - XGBoost (`optima/XGB_smote_seoul.py`): `objective='multi:softprob'`, `tree_method='hist'`, `enable_categorical=True`, GPU option, `hyperopt` search over the key hyperparameters, `eval_metric=CSI`
+
+ ---
+
+ ### Training/Validation Strategy
+
+ - Year-based 3-fold holdout (example)
+   - Fold1: Train 2018–2019 → Val 2020
+   - Fold2: Train 2018, 2020 → Val 2019
+   - Fold3: Train 2019–2020 → Val 2018
+ - Models are trained separately per region (e.g., `seoul_train.csv`)
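
A year-based holdout of this kind, where each fold validates on one year and trains on the other two, can be generated as in the sketch below. The helper name `year_holdout_folds` is hypothetical, and the descending order of validation years is chosen only to match the fold numbering listed above.

```python
def year_holdout_folds(years):
    """Yield (train_years, val_year) pairs, holding out one year per fold."""
    ordered = sorted(years)
    # Hold out the most recent year first so Fold1 validates on the last year.
    for val_year in reversed(ordered):
        train_years = [y for y in ordered if y != val_year]
        yield train_years, val_year


folds = list(year_holdout_folds([2018, 2019, 2020]))
for i, (train_years, val_year) in enumerate(folds, start=1):
    print(f"Fold{i}: train={train_years} val={val_year}")
```

In practice each pair would be used to select rows by the `year` column, for example keeping rows whose `year` is in `train_years` for fitting and rows whose `year` equals `val_year` for validation.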
+
+ ---
+
+ ### Evaluation Metrics
+
+ - Custom CSI (Critical Success Index), multiclass version
+
+ ```python
+ H = cm[0, 0] + cm[1, 1]
+ F = (cm[1, 0] + cm[2, 0] + cm[0, 1] + cm[2, 1])
+ M = (cm[0, 2] + cm[1, 2])
+ CSI = H / (H + F + M + 1e-10)
+ ```
+
+ - Other metrics: accuracy, F1, and others are also checked in the notebooks/scripts
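
For reference, here is a self-contained version of the CSI formula above that runs without NumPy. It is a sketch under two assumptions: the confusion matrix is a 3x3 nested list, and rows are true classes while columns are predicted classes (the scikit-learn convention; the README does not state which one the scripts use). The function name `multiclass_csi` is hypothetical.

```python
def multiclass_csi(cm):
    """CSI = H / (H + F + M) using the same terms as the snippet above."""
    hits = cm[0][0] + cm[1][1]
    false_alarms = cm[1][0] + cm[2][0] + cm[0][1] + cm[2][1]
    misses = cm[0][2] + cm[1][2]
    # The small epsilon avoids division by zero when no low-visibility cases exist.
    return hits / (hits + false_alarms + misses + 1e-10)


cm = [
    [8, 1, 1],
    [2, 6, 2],
    [3, 4, 90],
]
print(round(multiclass_csi(cm), 4))  # 14 / 27 -> 0.5185
```

Note that `cm[2][2]` never enters the formula: correct predictions of the majority class 2 do not raise the score, which is what makes CSI more informative than accuracy on imbalanced visibility data.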
+
+ ---
+
+ ### How to Run (Details)
+
+ - Environment setup
+   - Python 3.8+ recommended; with CUDA support, a GPU can be used (speeds up CTGAN/GBDT)
+   - If the GPU build of LightGBM is not installed, either use the CPU version via `pip install lightgbm` or build the GPU version
+
+ - Data preparation
+   - `data/ASOS/`: raw weather data by year
+   - `data/dataon/`: daily air pollution CSVs (large)
+   - `data/data_for_modeling/`: per-region training/evaluation sets (`*_train.csv`, `*_test.csv`, `df_*.feather`)
+   - `data/data_for_TAF/`: per-airport TAF (aviation weather)
+
+ - Preprocessing/exploration
+   - `Analysis_code/0.air_data_merge.ipynb` → `1.data_merge.ipynb` → `2.eda_preproccesing.ipynb`
+
+ - Oversampling
+   - Run the scripts in `Analysis_code/make_oversample_data/` (see the TL;DR above)
+
+ - GBDT optimization/training
+   - Run `Analysis_code/optima/LGB_smote_seoul.py` and `XGB_smote_seoul.py`
+   - Trained models are saved as `.pkl` files under `Analysis_code/save_model/`
+
+ - Deep learning training
+   - Train/evaluate the tabular models in the `deeplearning_model_*` notebooks, and visualize with `model_visualize.ipynb`
+
+ - Ensemble/final evaluation
+   - `model_voting_test_best_sample/ensemble__voting_best_sample.ipynb`
+   - `final_test/final.ipynb`
+
+ ---
+
+ ### Model I/O Specification (Summary)
+
+ - Numeric input `x_num`: `float32` tensor `[batch, num_numeric_features]`
+ - Categorical input `x_cat`: integer index tensor `[batch, num_categorical_features]`
+ - Output: binary (1 logit) or multiclass (K logits). For the loss and threshold, see the settings in the notebooks
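
To make the two output shapes concrete, the sketch below shows one common way to turn logits into class predictions: a sigmoid plus threshold for the single-logit binary case, and an argmax for the K-logit multiclass case. It is not taken from the notebooks; the 0.5 threshold is an assumed default (the notebooks define their own), and the function names are hypothetical.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predict_binary(logit: float, threshold: float = 0.5) -> int:
    """Single-logit case: class 1 if sigmoid(logit) reaches the threshold."""
    return 1 if sigmoid(logit) >= threshold else 0


def predict_multiclass(logits: Sequence[float]) -> int:
    """K-logit case: argmax (softmax is monotonic, so it is not needed here)."""
    return max(range(len(logits)), key=lambda k: logits[k])


if __name__ == "__main__":
    print(predict_binary(2.0), predict_binary(-2.0))
    print(predict_multiclass([0.1, 2.3, -1.0]))
```

For an imbalanced target, lowering the binary threshold below 0.5 trades more false alarms for fewer missed low-visibility events, which is one reason the threshold is worth tuning rather than fixing.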

+ ---

+ ### Reproducibility/Seeds

+ - Fixed values are used, such as `random_state=42` (SMOTENC) and `random_state=120` in the model scripts
+ - Results may not reproduce exactly across different data or hardware, so setting the fold and seed explicitly is recommended

+ ---

+ ### Notes/Troubleshooting

+ - The `sys.path.append(...)` call in `optima/LGB_smote_seoul.py` points to an environment-specific path. In a standard environment, `from lightgbm import LGBMClassifier` should work even after removing it.
+ - The scripts assume relative paths. Before running them, check that the current working directory is under `Analysis_code/*`.
+ - If the replacement and type conversion of the `'정온'` (calm) value in `wind_dir` is skipped, LightGBM/XGBoost may raise errors.
+ - `dataon/` is very large. Make sure enough memory is available, or process it by year/region.

+ ---

+ ### Dependencies

+ - Python 3.8+
+ - PyTorch, pandas, numpy, scikit-learn, imbalanced-learn, optuna, ctgan, xgboost, lightgbm, joblib, matplotlib, seaborn

+ ---

+ ### License/Citation

+ - License: to be updated later
+ - When citing this project or its results, please credit the `visibility_prediction` repository and the data sources used (ASOS, DataOn, TAF).