hanjunlee committed on
Commit 3a36548 · verified · 1 Parent(s): 6c1a68f

Upload 23 files
.env.example ADDED
@@ -0,0 +1,15 @@
+ # Hugging Face settings
+ # Create a token at https://huggingface.co/settings/tokens
+ HF_TOKEN=your_huggingface_token_here
+
+ # Democratic Party (minjoo) dataset repository
+ HF_REPO_ID=your_username/minjoo-press-releases
+
+ # People Power Party (ppp) dataset repository
+ HF_REPO_ID_PPP=your_username/ppp-press-releases
+
+ # Usage:
+ # 1. Copy this file to .env
+ # 2. Put your actual token in HF_TOKEN
+ # 3. Change HF_REPO_ID to your desired dataset name
+ # 4. Change HF_REPO_ID_PPP to the People Power Party dataset name
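The per-party repository variables follow a fixed naming pattern; the remaining parties use `HF_REPO_ID_REBUILDING`, `HF_REPO_ID_REFORM`, `HF_REPO_ID_BASIC_INCOME`, and `HF_REPO_ID_JINBO`, as listed in README.md. A minimal sketch of how a script might resolve the repository for a party code (`resolve_repo_id` is a hypothetical helper; the fallback names are illustrative defaults, mirroring the pattern in the crawlers):

```python
import os

# Map each party code to its dataset-repo environment variable.
# Variable names match this repo's crawlers; fallback repo names
# are illustrative defaults, not real repositories.
REPO_ENV_VARS = {
    "minjoo": "HF_REPO_ID",
    "ppp": "HF_REPO_ID_PPP",
    "rebuilding": "HF_REPO_ID_REBUILDING",
    "reform": "HF_REPO_ID_REFORM",
    "basic_income": "HF_REPO_ID_BASIC_INCOME",
    "jinbo": "HF_REPO_ID_JINBO",
}

def resolve_repo_id(party: str) -> str:
    """Return the configured repo for a party, or an illustrative default."""
    env_var = REPO_ENV_VARS[party]
    default = f"your_username/{party.replace('_', '-')}-press-releases"
    return os.getenv(env_var, default)
```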
QUICKSTART.md ADDED
@@ -0,0 +1,94 @@
+ # Quick Start Guide
+
+ ## Get started in 5 minutes
+
+ ### Step 1: Installation (1 min)
+ ```bash
+ setup.bat
+ ```
+
+ ### Step 2: Hugging Face token setup (2 min)
+
+ 1. Go to https://huggingface.co/settings/tokens
+ 2. "New token" → name: `party-crawler` → permission: **Write** → create and copy
+
+ 3. Open the `.env` file in a text editor and enter:
+ ```
+ HF_TOKEN=paste_the_copied_token_here
+
+ HF_REPO_ID=your_username/minjoo-press-releases
+ HF_REPO_ID_PPP=your_username/ppp-press-releases
+ HF_REPO_ID_REBUILDING=your_username/rebuilding-press-releases
+ HF_REPO_ID_REFORM=your_username/reform-press-releases
+ HF_REPO_ID_BASIC_INCOME=your_username/basic-income-press-releases
+ HF_REPO_ID_JINBO=your_username/jinbo-press-releases
+ ```
+
+ > **Important**: replace `your_username` with your actual Hugging Face username!
+
+ ### Step 3: Run (2 min)
+
+ #### Collect all parties at once (recommended)
+ ```bash
+ python main.py
+ ```
+
+ #### Collect a specific party only
+ ```bash
+ python main.py --party minjoo          # Democratic Party of Korea
+ python main.py --party ppp             # People Power Party
+ python main.py --party rebuilding      # Rebuilding Korea Party
+ python main.py --party reform          # New Reform Party
+ python main.py --party basic_income    # Basic Income Party
+ python main.py --party jinbo           # Progressive Party
+ ```
+
+ #### Specify a date range
+ ```bash
+ python main.py --start-date 2024-01-01
+ python main.py --party reform --start-date 2024-01-01 --end-date 2024-06-30
+ ```
+
+ ## Done!
+
+ Where the data is stored:
+ - **Local**: the `./data/` folder (CSV, Excel)
+ - **Hugging Face**: uploaded automatically to each party's repository
+
+ ## Option summary
+
+ | Command | Description |
+ |--------|------|
+ | `python main.py` | Incremental update for all 6 parties |
+ | `python main.py --party [code]` | A specific party only |
+ | `python main.py --start-date YYYY-MM-DD` | Set the start date |
+ | `python unified_scheduler.py` | Automatic daily run (scheduler) |
+
+ ## Party codes
+
+ | Code | Party |
+ |------|------|
+ | `minjoo` | Democratic Party of Korea |
+ | `ppp` | People Power Party |
+ | `rebuilding` | Rebuilding Korea Party |
+ | `reform` | New Reform Party |
+ | `basic_income` | Basic Income Party |
+ | `jinbo` | Progressive Party |
+ | `all` | All parties (default) |
+
+ ## Troubleshooting
+
+ | Problem | Fix |
+ |------|------|
+ | "HF_TOKEN is not set" | Check `HF_TOKEN` in the `.env` file |
+ | "Module not found" | Run `setup.bat` again |
+ | Crawling is slow | Increase `concurrent_requests` in `crawler_config.json` (mind the server-load caution in README.md) |
+ | Only one party fails | Run it individually with `python main.py --party [code]` to investigate |
+
+ ## Help
+
+ ```bash
+ python main.py --help
+ ```
+
+ Full documentation: `README.md`
QUICKSTART_UNIFIED.md ADDED
@@ -0,0 +1,4 @@
+ # Quick Start Guide
+
+ > **This file has been merged into QUICKSTART.md.**
+ > See [QUICKSTART.md](QUICKSTART.md) for the latest guide.
README.md ADDED
@@ -0,0 +1,209 @@
+ # Political Party Press Release Crawler
+
+ A crawler that automatically collects press releases, commentaries/briefings, and opening remarks from the websites of 6 Korean political parties and uploads them to Hugging Face.
+
+ **Supported parties**: Democratic Party of Korea, People Power Party, Rebuilding Korea Party, New Reform Party, Basic Income Party, Progressive Party
+
+ ## Key features
+
+ - **Asynchronous processing (asyncio + aiohttp)**: 10-20x faster than the previous synchronous version
+ - **Parallel crawling of all 6 parties**: runs them simultaneously to save time
+ - **Incremental updates**: collects only data published since the last crawl
+ - **Automatic Hugging Face upload**: auto-merges into each party's own repository
+
+ ## Installation
+
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ Or, on Windows:
+ ```bash
+ setup.bat
+ ```
+
+ ## Environment variables
+
+ Create a `.env` file with the following content:
+
+ ```
+ HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
+
+ # Hugging Face dataset repository for each party
+ HF_REPO_ID=your_username/minjoo-press-releases
+ HF_REPO_ID_PPP=your_username/ppp-press-releases
+ HF_REPO_ID_REBUILDING=your_username/rebuilding-press-releases
+ HF_REPO_ID_REFORM=your_username/reform-press-releases
+ HF_REPO_ID_BASIC_INCOME=your_username/basic-income-press-releases
+ HF_REPO_ID_JINBO=your_username/jinbo-press-releases
+ ```
+
+ ## Usage
+
+ ### main.py - unified entry point (recommended)
+
+ ```bash
+ # Incremental update for all parties (default)
+ python main.py
+
+ # A specific party only
+ python main.py --party minjoo          # Democratic Party of Korea
+ python main.py --party ppp             # People Power Party
+ python main.py --party rebuilding      # Rebuilding Korea Party
+ python main.py --party reform          # New Reform Party
+ python main.py --party basic_income    # Basic Income Party
+ python main.py --party jinbo           # Progressive Party
+
+ # Date range
+ python main.py --start-date 2024-01-01
+ python main.py --party reform --start-date 2024-01-01 --end-date 2024-06-30
+
+ # Help
+ python main.py --help
+ ```
+
+ ### Running individual crawlers directly
+
+ ```bash
+ python minjoo_crawler_async.py
+ python ppp_crawler_async.py
+ python rebuilding_crawler_async.py
+ python reform_crawler_async.py
+ python basic_income_crawler_async.py
+ python jinbo_crawler_async.py
+ ```
+
+ ### Automatic daily runs (scheduler)
+
+ ```bash
+ python unified_scheduler.py    # runs everything automatically at 9 AM every day
+ ```
+
+ ### Windows batch files
+
+ | File | Description |
+ |------|------|
+ | `run_unified.bat` | Crawl everything at once (single run) |
+ | `run_unified_scheduler.bat` | Automatic daily crawl of everything |
+ | `run_once.bat` | Democratic Party only |
+ | `run_ppp.bat` | People Power Party only |
+
+ ## Collected data
+
+ | Party | Boards | Collection start date |
+ |------|--------|------------|
+ | Democratic Party of Korea | Press releases, commentary/briefings, opening remarks | 2003-11-11 |
+ | People Power Party | Spokesperson commentary & press releases, floor press releases, media committee | 2000-03-10 |
+ | Rebuilding Korea Party | Press conference statements, commentary/briefings, press releases | 2024-03-04 |
+ | New Reform Party | Press releases, commentary/briefings | 2024-02-13 |
+ | Basic Income Party | Commentary & press releases (commentary/remarks/press releases) | 2020-01-08 |
+ | Progressive Party | Press releases, commentary, opening remarks | 2017-10-14 |
+
+ ## Configuration (crawler_config.json)
+
+ Each party can be configured independently:
+
+ ```json
+ {
+   "minjoo": { ... },
+   "ppp": { ... },
+   "rebuilding": { ... },
+   "reform": { ... },
+   "basic_income": { ... },
+   "jinbo": { ... }
+ }
+ ```
+
+ | Setting | Description |
+ |------|------|
+ | `boards` | Boards to collect |
+ | `start_date` | Initial crawl start date |
+ | `max_pages` | Maximum number of pages |
+ | `concurrent_requests` | Concurrent requests (mind server load) |
+ | `request_delay` | Delay between requests (seconds) |
+ | `output_path` | Local output path |
+
+ ## File structure
+
+ ```
+ 정당크롤러/
+ ├── main.py                         # unified entry point (CLI arguments)
+ ├── unified_crawler.py              # unified crawler for all 6 parties
+ ├── unified_scheduler.py            # unified scheduler
+ ├── minjoo_crawler_async.py         # Democratic Party of Korea
+ ├── ppp_crawler_async.py            # People Power Party
+ ├── rebuilding_crawler_async.py     # Rebuilding Korea Party
+ ├── reform_crawler_async.py         # New Reform Party
+ ├── basic_income_crawler_async.py   # Basic Income Party
+ ├── jinbo_crawler_async.py          # Progressive Party
+ ├── scheduler.py                    # Democratic-Party-only scheduler (legacy)
+ ├── crawler_config.json             # crawl settings (all 6 parties)
+ ├── crawler_state.json              # crawl state (auto-generated)
+ ├── requirements.txt                # Python dependencies
+ └── .env                            # environment variables (create yourself)
+ ```
+
+ ## Data columns (common)
+
+ | Column | Description |
+ |------|------|
+ | `board_name` | Board name |
+ | `title` | Title |
+ | `category` | Category/classification |
+ | `date` | Publication date |
+ | `writer` | Author |
+ | `text` | Body text |
+ | `url` | Source URL |
+
+ > **Note**: People Power Party data includes extra `section` and `no` columns instead of `category`
+
+ ## Performance
+
+ | Item | Async version | Previous sync version |
+ |------|------------|--------------|
+ | One party (1000 articles) | ~5 min | ~80 min |
+ | All 6 parties at once | ~5-10 min | ~480 min |
+
+ ## How incremental updates work
+
+ 1. **First run**: collect everything from `start_date` to today
+ 2. **Subsequent runs**: collect only from the day after the last crawl date
+ 3. **Hugging Face merge**: auto-merge with the existing dataset, with URL-based deduplication
+ 4. **State management**: recorded per party in `crawler_state.json`
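The date-window and merge steps above can be sketched as follows. This is a minimal illustration; `next_start_date` and `merge_dedup` are hypothetical helper names, and the actual logic lives inside each crawler's `run_incremental` and `upload_to_huggingface` methods:

```python
from datetime import datetime, timedelta
import pandas as pd

def next_start_date(last_crawl_date, first_run_default="2020-01-08"):
    """First run: crawl from the configured start date.
    Later runs: resume the day after the last recorded crawl date."""
    if last_crawl_date is None:
        return first_run_default
    resumed = datetime.strptime(last_crawl_date, "%Y-%m-%d") + timedelta(days=1)
    return resumed.strftime("%Y-%m-%d")

def merge_dedup(existing: pd.DataFrame, new: pd.DataFrame) -> pd.DataFrame:
    """Merge new rows into the existing dataset, keeping the newest
    copy of each article (the URL is the unique key)."""
    combined = pd.concat([existing, new], ignore_index=True)
    return combined.drop_duplicates(subset=["url"], keep="last").reset_index(drop=True)
```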
+
+ ## Troubleshooting
+
+ | Problem | Fix |
+ |------|----------|
+ | `HF_TOKEN is not set` | Check `HF_TOKEN` in the `.env` file |
+ | Crawling is slow | Increase `concurrent_requests` in `crawler_config.json` |
+ | Server connection errors | Increase `request_delay` in `crawler_config.json` |
+ | Only one party fails | Run it individually with `python main.py --party [code]` to investigate |
+
+ ## Checking the logs
+
+ ```bash
+ type main.log                  # main.py run log
+ type unified_crawler.log       # unified crawler log
+ type unified_scheduler.log     # scheduler log
+ ```
+
+ ## Running in the background on Windows
+
+ ```bash
+ # Batch file
+ start /B python main.py > main.log 2>&1
+
+ # Or Windows Task Scheduler
+ # Trigger: daily at 9 AM → Action: python unified_scheduler.py
+ ```
+
+ ## Caveats
+
+ 1. Keep `concurrent_requests` at or below 10-20 (to minimize server load)
+ 2. Check each website's robots.txt before collecting
+ 3. Before publishing the data, check it for personal information and cite the source
+
+ ## License
+
+ MIT License
README_UNIFIED.md ADDED
@@ -0,0 +1,4 @@
+ # Unified Party Crawler
+
+ > **This file has been merged into README.md.**
+ > See [README.md](README.md) for the latest documentation.
basic_income_crawler_async.py ADDED
@@ -0,0 +1,370 @@
+ #!/usr/bin/env python3
+ # -*- coding: utf-8 -*-
+ """
+ Basic Income Party crawler - high-performance async version with automatic Hugging Face upload
+ - Gnuboard 5 based site (basicincomeparty.kr)
+ - td.td_subject / td.td_datetime (YY.MM.DD.) / div#bo_v_con structure
+ """
+
+ import os
+ import json
+ import re
+ import asyncio
+ from datetime import datetime, timedelta
+ from typing import List, Dict, Optional
+ import pandas as pd
+ from tqdm.asyncio import tqdm as async_tqdm
+ import aiohttp
+ from bs4 import BeautifulSoup
+ from dotenv import load_dotenv
+ from huggingface_hub import login
+ from datasets import Dataset, load_dataset
+
+ load_dotenv()
+
+
+ class BasicIncomeAsyncCrawler:
+     def __init__(self, config_path="crawler_config.json"):
+         self.base_url = "https://basicincomeparty.kr"
+         self.party_name = "기본소득당"
+         self.config_path = config_path
+         self.state_path = "crawler_state.json"
+
+         self.load_config()
+
+         self.hf_token = os.getenv("HF_TOKEN")
+         self.hf_repo_id = os.getenv("HF_REPO_ID_BASIC_INCOME", "basic-income-press-releases")
+
+         # Cap concurrency at the configured number of simultaneous requests
+         self.semaphore = asyncio.Semaphore(self.config.get("concurrent_requests", 10))
+
+     def load_config(self):
+         default_config = {
+             "boards": {
+                 "논평보도자료": "bikr/press"
+             },
+             "start_date": "2020-01-08",
+             "max_pages": 10000,
+             "concurrent_requests": 10,
+             "request_delay": 0.3,
+             "output_path": "./data"
+         }
+
+         if os.path.exists(self.config_path):
+             with open(self.config_path, 'r', encoding='utf-8') as f:
+                 config = json.load(f)
+             self.config = config.get('basic_income', default_config)
+         else:
+             self.config = default_config
+
+         self.boards = self.config["boards"]
+         self.start_date = self.config["start_date"]
+         self.max_pages = self.config["max_pages"]
+         self.output_path = self.config["output_path"]
+
+     def load_state(self) -> Dict:
+         if os.path.exists(self.state_path):
+             with open(self.state_path, 'r', encoding='utf-8') as f:
+                 state = json.load(f)
+             return state.get('basic_income', {})
+         return {}
+
+     def save_state(self, state: Dict):
+         all_state = {}
+         if os.path.exists(self.state_path):
+             with open(self.state_path, 'r', encoding='utf-8') as f:
+                 all_state = json.load(f)
+         all_state['basic_income'] = state
+         with open(self.state_path, 'w', encoding='utf-8') as f:
+             json.dump(all_state, f, ensure_ascii=False, indent=2)
+
+     @staticmethod
+     def parse_date(date_str: str) -> Optional[datetime]:
+         """Parse YY.MM.DD., YYYY.MM.DD., or YYYY-MM-DD."""
+         date_str = date_str.strip().rstrip('.')
+         try:
+             parts = date_str.split('.')
+             if len(parts) >= 3:
+                 year = int(parts[0])
+                 year = 2000 + year if year < 100 else year
+                 return datetime(year, int(parts[1]), int(parts[2]))
+         except (ValueError, IndexError):
+             pass
+         try:
+             return datetime.strptime(date_str[:10], '%Y-%m-%d')
+         except ValueError:
+             return None
+
+     @staticmethod
+     def clean_text(text: str) -> str:
+         # Strip non-breaking and zero-width spaces
+         text = text.replace('\xa0', '').replace('\u200b', '')
+         return text.strip()
+
+     async def fetch_with_retry(self, session: aiohttp.ClientSession, url: str,
+                                max_retries: int = 3) -> Optional[str]:
+         async with self.semaphore:
+             for attempt in range(max_retries):
+                 try:
+                     await asyncio.sleep(self.config.get("request_delay", 0.3))
+                     async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as response:
+                         if response.status == 200:
+                             return await response.text()
+                 except Exception:
+                     if attempt < max_retries - 1:
+                         await asyncio.sleep(1)
+                     else:
+                         return None
+             return None
+
+     async def fetch_list_page(self, session: aiohttp.ClientSession,
+                               board_name: str, board_path: str, page_num: int,
+                               start_date: datetime, end_date: datetime) -> tuple:
+         url = f"{self.base_url}/{board_path}?page={page_num}"
+
+         html = await self.fetch_with_retry(session, url)
+         if not html:
+             return [], False
+
+         soup = BeautifulSoup(html, 'html.parser')
+         rows = soup.select('table tbody tr')
+         if not rows:
+             return [], True
+
+         data = []
+         stop_flag = False
+
+         for row in rows:
+             try:
+                 # Title and URL: td.td_subject div.bo_tit a
+                 title_a = row.select_one('td.td_subject div.bo_tit a')
+                 if not title_a:
+                     continue
+
+                 title = title_a.get_text(strip=True)
+                 href = title_a.get('href', '')
+                 # Drop the query string (page parameter), then make the URL absolute
+                 article_url = re.sub(r'\?.*$', '', href)
+                 if not article_url.startswith('http'):
+                     article_url = self.base_url + article_url
+
+                 # Date: td.td_datetime (YY.MM.DD. format)
+                 date_td = row.select_one('td.td_datetime')
+                 if not date_td:
+                     continue
+                 date_str = date_td.get_text(strip=True)
+
+                 # Category: td.td_num2 a.bo_cate_link
+                 cate_a = row.select_one('td.td_num2 a.bo_cate_link')
+                 category = cate_a.get_text(strip=True) if cate_a else ""
+
+                 article_date = self.parse_date(date_str)
+                 if not article_date:
+                     continue
+                 if article_date < start_date:
+                     stop_flag = True
+                     break
+                 if article_date > end_date:
+                     continue
+
+                 data.append({
+                     'board_name': board_name,
+                     'title': title,
+                     'category': category,
+                     'date': article_date.strftime('%Y-%m-%d'),  # normalized to YYYY-MM-DD
+                     'url': article_url
+                 })
+             except Exception:
+                 continue
+
+         return data, stop_flag
+
+     async def fetch_article_detail(self, session: aiohttp.ClientSession, url: str) -> Dict:
+         html = await self.fetch_with_retry(session, url)
+         if not html:
+             return {'text': "failed to fetch body", 'writer': ""}
+
+         soup = BeautifulSoup(html, 'html.parser')
+         text_parts = []
+         writer = ""
+
+         # Body: div#bo_v_con
+         contents_div = soup.find('div', id='bo_v_con')
+         if contents_div:
+             for p in contents_div.find_all('p'):
+                 cleaned = self.clean_text(p.get_text(strip=True))
+                 if cleaned:
+                     text_parts.append(cleaned)
+
+         # Author: span.sv_member inside section#bo_v_info div.profile_info_ct
+         info_div = soup.select_one('section#bo_v_info div.profile_info_ct')
+         if info_div:
+             writer_el = info_div.find('span', class_='sv_member')
+             if writer_el:
+                 writer = writer_el.get_text(strip=True)
+
+         return {'text': '\n'.join(text_parts), 'writer': writer}
+
+     async def collect_board(self, board_name: str, board_path: str,
+                             start_date: str, end_date: str) -> List[Dict]:
+         start_dt = datetime.strptime(start_date, '%Y-%m-%d')
+         end_dt = datetime.strptime(end_date, '%Y-%m-%d')
+
+         print(f"\n▶ [{board_name}] collecting list pages...")
+
+         headers = {
+             'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
+             'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
+             'Accept-Language': 'ko-KR,ko;q=0.9',
+         }
+
+         async with aiohttp.ClientSession(headers=headers) as session:
+             all_items = []
+             page_num = 1
+             empty_pages = 0
+             max_empty_pages = 3
+
+             with async_tqdm(desc=f"[{board_name}] list", unit="page") as pbar:
+                 while page_num <= self.max_pages:
+                     items, stop_flag = await self.fetch_list_page(
+                         session, board_name, board_path, page_num, start_dt, end_dt
+                     )
+
+                     if not items:
+                         empty_pages += 1
+                         if empty_pages >= max_empty_pages or stop_flag:
+                             break
+                     else:
+                         empty_pages = 0
+                         all_items.extend(items)
+
+                     pbar.update(1)
+                     pbar.set_postfix({"collected": len(all_items)})
+
+                     if stop_flag:
+                         break
+
+                     page_num += 1
+
+             print(f"  ✓ {len(all_items)} items found")
+
+             if all_items:
+                 print(f"  ▶ fetching detail pages...")
+                 tasks = [self.fetch_article_detail(session, item['url']) for item in all_items]
+
+                 # tqdm.gather preserves task order, so each detail is
+                 # matched to the right list item (as_completed would not)
+                 details = await async_tqdm.gather(*tasks, desc=f"[{board_name}] details")
+
+                 for item, detail in zip(all_items, details):
+                     item.update(detail)
+
+             print(f"✓ [{board_name}] done: {len(all_items)} items")
+             return all_items
+
+     async def collect_all(self, start_date: Optional[str] = None,
+                           end_date: Optional[str] = None) -> pd.DataFrame:
+         if not end_date:
+             end_date = datetime.now().strftime('%Y-%m-%d')
+         if not start_date:
+             start_date = self.start_date
+
+         print(f"\n{'='*60}")
+         print(f"Basic Income Party press release collection - high-performance async version")
+         print(f"Period: {start_date} ~ {end_date}")
+         print(f"{'='*60}")
+
+         tasks = [
+             self.collect_board(board_name, board_path, start_date, end_date)
+             for board_name, board_path in self.boards.items()
+         ]
+         results = await asyncio.gather(*tasks)
+
+         all_data = []
+         for items in results:
+             all_data.extend(items)
+
+         if not all_data:
+             print("\n⚠️ No data collected")
+             return pd.DataFrame()
+
+         df = pd.DataFrame(all_data)
+         df = df[['board_name', 'title', 'category', 'date', 'writer', 'text', 'url']]
+         df = df[(df['title'] != "") & (df['text'] != "")]
+         df['date'] = pd.to_datetime(df['date'], errors='coerce')
+
+         print(f"\n✓ collected {len(df)} items in total")
+         return df
+
+     def save_local(self, df: pd.DataFrame):
+         os.makedirs(self.output_path, exist_ok=True)
+         timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
+         csv_path = os.path.join(self.output_path, f"{self.party_name}_{timestamp}.csv")
+         xlsx_path = os.path.join(self.output_path, f"{self.party_name}_{timestamp}.xlsx")
+         df.to_csv(csv_path, index=False, encoding='utf-8-sig')
+         df.to_excel(xlsx_path, index=False, engine='openpyxl')
+         print(f"✓ CSV: {csv_path}")
+         print(f"✓ Excel: {xlsx_path}")
+
+     def upload_to_huggingface(self, df: pd.DataFrame):
+         if not self.hf_token:
+             print("\n⚠️ HF_TOKEN is not set.")
+             return
+
+         print(f"\n▶ uploading to Hugging Face... (repo: {self.hf_repo_id})")
+         try:
+             login(token=self.hf_token)
+             new_dataset = Dataset.from_pandas(df)
+             try:
+                 existing_dataset = load_dataset(self.hf_repo_id, split='train')
+                 existing_df = existing_dataset.to_pandas()
+                 combined_df = pd.concat([existing_df, df], ignore_index=True)
+                 combined_df = combined_df.drop_duplicates(subset=['url'], keep='last')
+                 combined_df = combined_df.sort_values('date', ascending=False).reset_index(drop=True)
+                 final_dataset = Dataset.from_pandas(combined_df)
+                 print(f"  ✓ after merge: {len(final_dataset)} items")
+             except Exception:
+                 final_dataset = new_dataset
+                 print(f"  ℹ️ creating a new dataset")
+             final_dataset.push_to_hub(self.hf_repo_id, token=self.hf_token)
+             print(f"✓ Hugging Face upload complete!")
+         except Exception as e:
+             print(f"✗ upload failed: {e}")
+
+     async def run_incremental(self):
+         state = self.load_state()
+         last_date = state.get('last_crawl_date')
+
+         if last_date:
+             start_date = (datetime.strptime(last_date, '%Y-%m-%d') + timedelta(days=1)).strftime('%Y-%m-%d')
+             print(f"📅 incremental update: collecting data from {start_date} onward")
+         else:
+             start_date = self.start_date
+             print(f"📅 full collection: from {start_date}")
+
+         end_date = datetime.now().strftime('%Y-%m-%d')
+         df = await self.collect_all(start_date, end_date)
+
+         if df.empty:
+             print("✓ no new data")
+             return
+
+         self.save_local(df)
+         self.upload_to_huggingface(df)
+
+         state['last_crawl_date'] = end_date
+         state['last_crawl_time'] = datetime.now().isoformat()
+         state['last_count'] = len(df)
+         self.save_state(state)
+
+         print(f"\n{'='*60}\n✓ done!\n{'='*60}\n")
+
+
+ async def main():
+     crawler = BasicIncomeAsyncCrawler()
+     await crawler.run_incremental()
+
+
+ if __name__ == "__main__":
+     asyncio.run(main())
crawler_config.json ADDED
@@ -0,0 +1,71 @@
+ {
+   "minjoo": {
+     "boards": {
+       "보도자료": "188",
+       "논평_브리핑": "11",
+       "모두발언": "230"
+     },
+     "start_date": "2003-11-11",
+     "max_pages": 10000,
+     "concurrent_requests": 20,
+     "request_delay": 0.1,
+     "output_path": "./data"
+   },
+   "ppp": {
+     "boards": {
+       "대변인_논평보도자료": "BBSDD0001",
+       "원내_보도자료": "BBSDD0002",
+       "미디어특위_보도자료": "BBSDD0042"
+     },
+     "start_date": "2000-03-10",
+     "max_pages": 10000,
+     "concurrent_requests": 20,
+     "request_delay": 0.1,
+     "output_path": "./data"
+   },
+   "rebuilding": {
+     "boards": {
+       "기자회견문": "news/press-conference",
+       "논평브리핑": "news/commentary-briefing",
+       "보도자료": "news/press-release"
+     },
+     "start_date": "2024-03-04",
+     "max_pages": 10000,
+     "concurrent_requests": 10,
+     "request_delay": 0.5,
+     "output_path": "./data"
+   },
+   "reform": {
+     "boards": {
+       "보도자료": "press",
+       "논평브리핑": "briefing"
+     },
+     "start_date": "2024-02-13",
+     "max_pages": 10000,
+     "concurrent_requests": 10,
+     "request_delay": 0.3,
+     "output_path": "./data"
+   },
+   "basic_income": {
+     "boards": {
+       "논평보도자료": "bikr/press"
+     },
+     "start_date": "2020-01-08",
+     "max_pages": 10000,
+     "concurrent_requests": 10,
+     "request_delay": 0.3,
+     "output_path": "./data"
+   },
+   "jinbo": {
+     "boards": {
+       "보도자료": {"p": "286", "b": "b_1_111"},
+       "논평": {"p": "15", "b": "b_1_2"},
+       "모두발언": {"p": "14", "b": "b_1_1"}
+     },
+     "start_date": "2017-10-14",
+     "max_pages": 10000,
+     "concurrent_requests": 10,
+     "request_delay": 0.3,
+     "output_path": "./data"
+   }
+ }
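Each crawler reads its own section of this file and falls back to built-in defaults when the file or key is missing, as `basic_income_crawler_async.py` does in `load_config`. A minimal sketch of that pattern (`load_party_config` is a hypothetical standalone helper; the defaults shown are illustrative):

```python
import json
import os

# Illustrative defaults, mirroring the per-party sections above
DEFAULTS = {
    "max_pages": 10000,
    "concurrent_requests": 10,
    "request_delay": 0.3,
    "output_path": "./data",
}

def load_party_config(path: str, party: str, defaults: dict = DEFAULTS) -> dict:
    """Return the per-party section of crawler_config.json,
    merged over defaults; missing file or key falls back entirely."""
    section = {}
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            config = json.load(f)
        section = config.get(party, {})
    # Values from the file win over the defaults
    return {**defaults, **section}
```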
jinbo_crawler_async.py ADDED
@@ -0,0 +1,424 @@
1
+ #!/usr/bin/env python3
2
+ # -*- coding: utf-8 -*-
3
+ """
4
+ ์ง„๋ณด๋‹น ํฌ๋กค๋Ÿฌ - ๊ณ ์„ฑ๋Šฅ ๋น„๋™๊ธฐ ๋ฒ„์ „ + ํ—ˆ๊น…ํŽ˜์ด์Šค ์ž๋™ ์—…๋กœ๋“œ
5
+ - jinboparty.com ์ž์ฒด CMS ์‚ฌ์šฉ
6
+ - ๋ณด๋„์ž๋ฃŒ: ์นด๋“œํ˜• ๋ ˆ์ด์•„์›ƒ (div.img_list_item)
7
+ - ๋…ผํ‰/๋ชจ๋‘๋ฐœ์–ธ: ํ…Œ์ด๋ธ”ํ˜• ๋ ˆ์ด์•„์›ƒ (div#moTable)
8
+ - js_board_view('ID') โ†’ /pages/?p=...&b=...&bn=ID&m=read ํŒจํ„ด
9
+ """
10
+
11
+ import os
12
+ import json
13
+ import re
14
+ import asyncio
15
+ from datetime import datetime, timedelta
16
+ from typing import List, Dict, Optional
17
+ import pandas as pd
18
+ from tqdm.asyncio import tqdm as async_tqdm
19
+ import aiohttp
20
+ from bs4 import BeautifulSoup
21
+ from dotenv import load_dotenv
22
+ from huggingface_hub import HfApi, login
23
+ from datasets import Dataset, load_dataset
24
+
25
+ load_dotenv()
26
+
27
+
28
+ class JinboAsyncCrawler:
29
+ def __init__(self, config_path="crawler_config.json"):
30
+ self.base_url = "https://jinboparty.com"
31
+ self.party_name = "์ง„๋ณด๋‹น"
32
+ self.config_path = config_path
33
+ self.state_path = "crawler_state.json"
34
+
35
+ self.load_config()
36
+
37
+ self.hf_token = os.getenv("HF_TOKEN")
38
+ self.hf_repo_id = os.getenv("HF_REPO_ID_JINBO", "jinbo-press-releases")
39
+
40
+ self.semaphore = asyncio.Semaphore(10)
41
+
42
+ def load_config(self):
43
+ # boards ๊ฐ’์€ {"p": "...", "b": "..."} ํ˜•ํƒœ์˜ dict
44
+ default_config = {
45
+ "boards": {
46
+ "๋ณด๋„์ž๋ฃŒ": {"p": "286", "b": "b_1_111"},
47
+ "๋…ผํ‰": {"p": "15", "b": "b_1_2"},
48
+ "๋ชจ๋‘๋ฐœ์–ธ": {"p": "14", "b": "b_1_1"}
49
+ },
50
+ "start_date": "2017-10-14",
51
+ "max_pages": 10000,
52
+ "concurrent_requests": 10,
53
+ "request_delay": 0.3,
54
+ "output_path": "./data"
55
+ }
56
+
57
+ if os.path.exists(self.config_path):
58
+ with open(self.config_path, 'r', encoding='utf-8') as f:
59
+ config = json.load(f)
60
+ self.config = config.get('jinbo', default_config)
61
+ else:
62
+ self.config = default_config
63
+
64
+ self.boards = self.config["boards"]
65
+ self.start_date = self.config["start_date"]
66
+ self.max_pages = self.config["max_pages"]
67
+ self.output_path = self.config["output_path"]
68
+
69
+ def load_state(self) -> Dict:
70
+ if os.path.exists(self.state_path):
71
+ with open(self.state_path, 'r', encoding='utf-8') as f:
72
+ state = json.load(f)
73
+ return state.get('jinbo', {})
74
+ return {}
75
+
76
+ def save_state(self, state: Dict):
77
+ all_state = {}
78
+ if os.path.exists(self.state_path):
79
+ with open(self.state_path, 'r', encoding='utf-8') as f:
80
+ all_state = json.load(f)
81
+ all_state['jinbo'] = state
82
+ with open(self.state_path, 'w', encoding='utf-8') as f:
83
+ json.dump(all_state, f, ensure_ascii=False, indent=2)
84
+
85
+ @staticmethod
86
+ def parse_date(date_str: str) -> Optional[datetime]:
87
+ """YYYY.MM.DD ๋˜๋Š” YYYY-MM-DD ํŒŒ์‹ฑ"""
88
+ date_str = date_str.strip()
89
+ for fmt in ('%Y.%m.%d', '%Y-%m-%d'):
90
+ try:
91
+ return datetime.strptime(date_str[:10], fmt)
92
+ except:
93
+ continue
94
+ return None
95
+
96
+ @staticmethod
97
+ def clean_text(text: str) -> str:
98
+ text = text.replace('\xa0', '').replace('\u200b', '').replace('โ€‹', '')
99
+ return text.strip()
100
+
101
+ @staticmethod
102
+ def extract_board_id(href: str) -> Optional[str]:
103
+ """js_board_view('ID') ์—์„œ ID ์ถ”์ถœ"""
104
+ match = re.search(r"js_board_view\('(\d+)'\)", href)
105
+ return match.group(1) if match else None
106
+
107
+ async def fetch_with_retry(self, session: aiohttp.ClientSession, url: str,
108
+ max_retries: int = 3) -> Optional[str]:
109
+ async with self.semaphore:
110
+ for attempt in range(max_retries):
111
+ try:
112
+ await asyncio.sleep(self.config.get("request_delay", 0.3))
113
+ async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as response:
114
+ if response.status == 200:
115
+ return await response.text()
116
+ except Exception:
117
+ if attempt < max_retries - 1:
118
+ await asyncio.sleep(1)
119
+ else:
120
+ return None
121
+ return None
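`fetch_with_retry` combines three ideas: a semaphore to cap in-flight requests, a politeness delay before each attempt, and a bounded retry loop that gives up with `None`. A runnable sketch of that pattern with the aiohttp GET swapped for an injected `fetch` callable (the flaky stub below is purely illustrative):

```python
import asyncio
from typing import Optional

SEM = asyncio.Semaphore(10)  # cap concurrent requests, as the crawler does

async def fetch_with_retry(fetch, url: str, max_retries: int = 3,
                           delay: float = 0.0) -> Optional[str]:
    # `fetch` stands in for the aiohttp GET; it may raise on transient errors.
    async with SEM:
        for attempt in range(max_retries):
            try:
                await asyncio.sleep(delay)  # politeness delay before each try
                return await fetch(url)
            except Exception:
                if attempt < max_retries - 1:
                    await asyncio.sleep(0)  # back-off, shortened for the sketch
                else:
                    return None
    return None

calls = {'n': 0}

async def flaky(url: str) -> str:
    # Fails twice, then succeeds -- exercises the retry path.
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('transient')
    return 'ok'

result = asyncio.run(fetch_with_retry(flaky, 'https://example.org'))
```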
122
+
123
+ async def fetch_list_page(self, session: aiohttp.ClientSession,
124
+ board_name: str, board_cfg: Dict, page_num: int,
125
+ start_date: datetime, end_date: datetime) -> tuple:
126
+ p = board_cfg['p']
127
+ b = board_cfg['b']
128
+ url = f"{self.base_url}/pages/index.php?nPage={page_num}&p={p}&b={b}"
129
+
130
+ html = await self.fetch_with_retry(session, url)
131
+ if not html:
132
+ return [], False
133
+
134
+ soup = BeautifulSoup(html, 'html.parser')
135
+ data = []
136
+ stop_flag = False
137
+
138
+ # โ”€โ”€ ์นด๋“œํ˜• ๋ ˆ์ด์•„์›ƒ (๋ณด๋„์ž๋ฃŒ) โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
139
+ card_items = soup.select('div.img_list_item')
140
+ if card_items:
141
+ for item in card_items:
142
+ try:
143
+ link = item.select_one('a[href]')
144
+ if not link:
145
+ continue
146
+ bn = self.extract_board_id(link.get('href', ''))
147
+ if not bn:
148
+ continue
149
+
150
+ title_el = item.select_one('h4._tit span')
151
+ title = title_el.get_text(strip=True) if title_el else ""
152
+
153
+ # ๋‚ ์งœ: icon_cal ๋‹ค์Œ span
154
+ date_str = ""
155
+ for span in item.select('div.item_bottom span'):
156
+ text = span.get_text(strip=True)
157
+ if re.match(r'\d{4}\.\d{2}\.\d{2}', text):
158
+ date_str = text[:10]
159
+ break
160
+
161
+ if not date_str:
162
+ continue
163
+
164
+ article_date = self.parse_date(date_str)
165
+ if not article_date:
166
+ continue
167
+ if article_date < start_date:
168
+ stop_flag = True
169
+ break
170
+ if article_date > end_date:
171
+ continue
172
+
173
+ detail_url = f"{self.base_url}/pages/?p={p}&b={b}&bn={bn}&m=read"
174
+ data.append({
175
+ 'board_name': board_name,
176
+ 'title': title,
177
+ 'category': board_name,
178
+ 'date': article_date.strftime('%Y-%m-%d'),
179
+ 'url': detail_url
180
+ })
181
+ except Exception:
182
+ continue
183
+ return data, stop_flag
184
+
185
+ # โ”€โ”€ ํ…Œ์ด๋ธ”ํ˜• ๋ ˆ์ด์•„์›ƒ (๋…ผํ‰ยท๋ชจ๋‘๋ฐœ์–ธ) โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
186
+ table_items = soup.select('div#moTable li:not(.t_head)')
187
+ if table_items:
188
+ for item in table_items:
189
+ try:
190
+ link = item.select_one('div.tb_title_area a')
191
+ if not link:
192
+ continue
193
+ bn = self.extract_board_id(link.get('href', ''))
194
+ if not bn:
195
+ continue
196
+
197
+ title_el = item.select_one('p.title')
198
+ title = title_el.get_text(strip=True) if title_el else ""
199
+
200
+ # ๋‚ ์งœ: div.col.wid_140 ("๋“ฑ๋ก์ผ YYYY.MM.DD")
201
+ date_div = item.select_one('div.col.wid_140')
202
+ date_str = ""
203
+ if date_div:
204
+ raw = re.sub(r'๋“ฑ๋ก์ผ\s*', '', date_div.get_text(strip=True)).strip()
205
+ date_str = raw[:10]
206
+
207
+ if not date_str:
208
+ continue
209
+
210
+ article_date = self.parse_date(date_str)
211
+ if not article_date:
212
+ continue
213
+ if article_date < start_date:
214
+ stop_flag = True
215
+ break
216
+ if article_date > end_date:
217
+ continue
218
+
219
+ detail_url = f"{self.base_url}/pages/?p={p}&b={b}&bn={bn}&m=read"
220
+ data.append({
221
+ 'board_name': board_name,
222
+ 'title': title,
223
+ 'category': board_name,
224
+ 'date': article_date.strftime('%Y-%m-%d'),
225
+ 'url': detail_url
226
+ })
227
+ except Exception:
228
+ continue
229
+ return data, stop_flag
230
+
231
+ # ๋‘˜ ๋‹ค ์—†์œผ๋ฉด ๋นˆ ํŽ˜์ด์ง€
232
+ return [], True
233
+
234
+ async def fetch_article_detail(self, session: aiohttp.ClientSession, url: str) -> Dict:
235
+ html = await self.fetch_with_retry(session, url)
236
+ if not html:
237
+ return {'text': "๋ณธ๋ฌธ ์กฐํšŒ ์‹คํŒจ", 'writer': ""}
238
+
239
+ soup = BeautifulSoup(html, 'html.parser')
240
+ text_parts = []
241
+ writer = ""
242
+
243
+ # ๋ณธ๋ฌธ: div.content_box (class="td wid_full content_box")
244
+ contents_div = soup.select_one('div.content_box')
245
+ if contents_div:
246
+ for p in contents_div.find_all('p'):
247
+ cleaned = self.clean_text(p.get_text(strip=True))
248
+ if cleaned:
249
+ text_parts.append(cleaned)
250
+
251
+ # ์ž‘์„ฑ์ž: ul.info_list li ์ค‘ "์ž‘์„ฑ์ž" ํ•ญ๋ชฉ
252
+ for li in soup.select('ul.info_list li'):
253
+ b_tag = li.find('b')
254
+ if b_tag and '์ž‘์„ฑ์ž' in b_tag.get_text():
255
+ writer = li.get_text(strip=True).replace(b_tag.get_text(strip=True), '').strip()
256
+ break
257
+
258
+ return {'text': '\n'.join(text_parts), 'writer': writer}
259
+
260
+ async def collect_board(self, board_name: str, board_cfg: Dict,
261
+ start_date: str, end_date: str) -> List[Dict]:
262
+ start_dt = datetime.strptime(start_date, '%Y-%m-%d')
263
+ end_dt = datetime.strptime(end_date, '%Y-%m-%d')
264
+
265
+ print(f"\nโ–ถ [{board_name}] ๋ชฉ๋ก ์ˆ˜์ง‘ ์‹œ์ž‘...")
266
+
267
+ headers = {
268
+ 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
269
+ 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
270
+ 'Accept-Language': 'ko-KR,ko;q=0.9',
271
+ }
272
+
273
+ async with aiohttp.ClientSession(headers=headers) as session:
274
+ all_items = []
275
+ page_num = 1
276
+ empty_pages = 0
277
+ max_empty_pages = 3
278
+
279
+ with async_tqdm(desc=f"[{board_name}] ๋ชฉ๋ก", unit="ํŽ˜์ด์ง€") as pbar:
280
+ while page_num <= self.max_pages:
281
+ items, stop_flag = await self.fetch_list_page(
282
+ session, board_name, board_cfg, page_num, start_dt, end_dt
283
+ )
284
+
285
+ if not items:
286
+ empty_pages += 1
287
+ if empty_pages >= max_empty_pages or stop_flag:
288
+ break
289
+ else:
290
+ empty_pages = 0
291
+ all_items.extend(items)
292
+
293
+ pbar.update(1)
294
+ pbar.set_postfix({"์ˆ˜์ง‘": len(all_items)})
295
+
296
+ if stop_flag:
297
+ break
298
+
299
+ page_num += 1
300
+
301
+ print(f" โœ“ {len(all_items)}๊ฐœ ํ•ญ๋ชฉ ๋ฐœ๊ฒฌ")
302
+
303
+ if all_items:
304
+ print(f" โ–ถ ์ƒ์„ธ ํŽ˜์ด์ง€ ์ˆ˜์ง‘ ์ค‘...")
305
+ tasks = [self.fetch_article_detail(session, item['url']) for item in all_items]
306
+
307
+ details = []
308
+ for coro in async_tqdm(asyncio.as_completed(tasks),
309
+ total=len(tasks),
310
+ desc=f"[{board_name}] ์ƒ์„ธ"):
311
+ detail = await coro
312
+ details.append(detail)
313
+
314
+ for item, detail in zip(all_items, details):
315
+ item.update(detail)
316
+
317
+ print(f"โœ“ [{board_name}] ์™„๋ฃŒ: {len(all_items)}๊ฐœ")
318
+ return all_items
319
+
320
+ async def collect_all(self, start_date: Optional[str] = None,
321
+ end_date: Optional[str] = None) -> pd.DataFrame:
322
+ if not end_date:
323
+ end_date = datetime.now().strftime('%Y-%m-%d')
324
+ if not start_date:
325
+ start_date = self.start_date
326
+
327
+ print(f"\n{'='*60}")
328
+ print(f"์ง„๋ณด๋‹น ๋ณด๋„์ž๋ฃŒ ์ˆ˜์ง‘ - ๋น„๋™๊ธฐ ๊ณ ์„ฑ๋Šฅ ๋ฒ„์ „")
329
+ print(f"๊ธฐ๊ฐ„: {start_date} ~ {end_date}")
330
+ print(f"{'='*60}")
331
+
332
+ tasks = [
333
+ self.collect_board(board_name, board_cfg, start_date, end_date)
334
+ for board_name, board_cfg in self.boards.items()
335
+ ]
336
+ results = await asyncio.gather(*tasks)
337
+
338
+ all_data = []
339
+ for items in results:
340
+ all_data.extend(items)
341
+
342
+ if not all_data:
343
+ print("\nโš ๏ธ ์ˆ˜์ง‘๋œ ๋ฐ์ดํ„ฐ ์—†์Œ")
344
+ return pd.DataFrame()
345
+
346
+ df = pd.DataFrame(all_data)
347
+ df = df[['board_name', 'title', 'category', 'date', 'writer', 'text', 'url']]
348
+ df = df[(df['title'] != "") & (df['text'] != "")]
349
+ df['date'] = pd.to_datetime(df['date'], errors='coerce')
350
+
351
+ print(f"\nโœ“ ์ด {len(df)}๊ฐœ ์ˆ˜์ง‘ ์™„๋ฃŒ")
352
+ return df
353
+
354
+ def save_local(self, df: pd.DataFrame):
355
+ os.makedirs(self.output_path, exist_ok=True)
356
+ timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
357
+ csv_path = os.path.join(self.output_path, f"{self.party_name}_{timestamp}.csv")
358
+ xlsx_path = os.path.join(self.output_path, f"{self.party_name}_{timestamp}.xlsx")
359
+ df.to_csv(csv_path, index=False, encoding='utf-8-sig')
360
+ df.to_excel(xlsx_path, index=False, engine='openpyxl')
361
+ print(f"โœ“ CSV: {csv_path}")
362
+ print(f"โœ“ Excel: {xlsx_path}")
363
+
364
+ def upload_to_huggingface(self, df: pd.DataFrame):
365
+ if not self.hf_token:
366
+ print("\nโš ๏ธ HF_TOKEN์ด ์„ค์ •๋˜์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค.")
367
+ return
368
+
369
+ print(f"\nโ–ถ ํ—ˆ๊น…ํŽ˜์ด์Šค ์—…๋กœ๋“œ ์ค‘... (repo: {self.hf_repo_id})")
370
+ try:
371
+ login(token=self.hf_token)
372
+ new_dataset = Dataset.from_pandas(df)
373
+ try:
374
+ existing_dataset = load_dataset(self.hf_repo_id, split='train')
375
+ existing_df = existing_dataset.to_pandas()
376
+ combined_df = pd.concat([existing_df, df], ignore_index=True)
377
+ combined_df = combined_df.drop_duplicates(subset=['url'], keep='last')
378
+ combined_df = combined_df.sort_values('date', ascending=False).reset_index(drop=True)
379
+ final_dataset = Dataset.from_pandas(combined_df)
380
+ print(f" โœ“ ๋ณ‘ํ•ฉ ํ›„: {len(final_dataset)}๊ฐœ")
381
+ except Exception:
382
+ final_dataset = new_dataset
383
+ print(f" โ„น๏ธ ์‹ ๊ทœ ๋ฐ์ดํ„ฐ์…‹ ์ƒ์„ฑ")
384
+ final_dataset.push_to_hub(self.hf_repo_id, token=self.hf_token)
385
+ print(f"โœ“ ํ—ˆ๊น…ํŽ˜์ด์Šค ์—…๋กœ๋“œ ์™„๋ฃŒ!")
386
+ except Exception as e:
387
+ print(f"โœ— ์—…๋กœ๋“œ ์‹คํŒจ: {e}")
388
+
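`upload_to_huggingface` merges new rows into the existing dataset, keeps the most recent row per URL (`drop_duplicates(subset=['url'], keep='last')`), and sorts newest-first. The same merge rule can be sketched without pandas, using plain dicts:

```python
def merge_keep_last(existing: list, new: list) -> list:
    # Later rows win: a re-crawled URL replaces the previously stored row,
    # mirroring drop_duplicates(subset=['url'], keep='last').
    by_url = {}
    for row in existing + new:
        by_url[row['url']] = row
    # Newest first, like sort_values('date', ascending=False).
    return sorted(by_url.values(), key=lambda r: r['date'], reverse=True)

existing = [{'url': 'u1', 'date': '2024-01-01', 'title': 'old'}]
new = [{'url': 'u1', 'date': '2024-01-01', 'title': 'revised'},
       {'url': 'u2', 'date': '2024-02-01', 'title': 'fresh'}]
merged = merge_keep_last(existing, new)
```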
389
+ async def run_incremental(self):
390
+ state = self.load_state()
391
+ last_date = state.get('last_crawl_date')
392
+
393
+ if last_date:
394
+ start_date = (datetime.strptime(last_date, '%Y-%m-%d') + timedelta(days=1)).strftime('%Y-%m-%d')
395
+ print(f"๐Ÿ“… ์ฆ๋ถ„ ์—…๋ฐ์ดํŠธ: {start_date} ์ดํ›„ ๋ฐ์ดํ„ฐ๋งŒ ์ˆ˜์ง‘")
396
+ else:
397
+ start_date = self.start_date
398
+ print(f"๐Ÿ“… ์ „์ฒด ์ˆ˜์ง‘: {start_date}๋ถ€ํ„ฐ")
399
+
400
+ end_date = datetime.now().strftime('%Y-%m-%d')
401
+ df = await self.collect_all(start_date, end_date)
402
+
403
+ if df.empty:
404
+ print("โœ“ ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ ์—†์Œ")
405
+ return
406
+
407
+ self.save_local(df)
408
+ self.upload_to_huggingface(df)
409
+
410
+ state['last_crawl_date'] = end_date
411
+ state['last_crawl_time'] = datetime.now().isoformat()
412
+ state['last_count'] = len(df)
413
+ self.save_state(state)
414
+
415
+ print(f"\n{'='*60}\nโœ“ ์™„๋ฃŒ!\n{'='*60}\n")
416
+
417
+
418
+ async def main():
419
+ crawler = JinboAsyncCrawler()
420
+ await crawler.run_incremental()
421
+
422
+
423
+ if __name__ == "__main__":
424
+ asyncio.run(main())
main.py ADDED
@@ -0,0 +1,145 @@
1
+ #!/usr/bin/env python3
2
+ # -*- coding: utf-8 -*-
3
+ """
4
+ ์ •๋‹น ๋ณด๋„์ž๋ฃŒ ํฌ๋กค๋Ÿฌ - ๋ฉ”์ธ ์ง„์ž…์ 
5
+ ์ง€์› ์ •๋‹น: ๋”๋ถˆ์–ด๋ฏผ์ฃผ๋‹น, ๊ตญ๋ฏผ์˜ํž˜, ์กฐ๊ตญํ˜์‹ ๋‹น, ๊ฐœํ˜์‹ ๋‹น, ๊ธฐ๋ณธ์†Œ๋“๋‹น, ์ง„๋ณด๋‹น
6
+
7
+ ์‚ฌ์šฉ๋ฒ•:
8
+ python main.py # ์ „์ฒด ์ •๋‹น ์ฆ๋ถ„ ์—…๋ฐ์ดํŠธ
9
+ python main.py --party minjoo # ๋”๋ถˆ์–ด๋ฏผ์ฃผ๋‹น๋งŒ
10
+ python main.py --party ppp # ๊ตญ๋ฏผ์˜ํž˜๋งŒ
11
+ python main.py --party rebuilding # ์กฐ๊ตญํ˜์‹ ๋‹น๋งŒ
12
+ python main.py --party reform # ๊ฐœํ˜์‹ ๋‹น๋งŒ
13
+ python main.py --party basic_income # ๊ธฐ๋ณธ์†Œ๋“๋‹น๋งŒ
14
+ python main.py --party jinbo # ์ง„๋ณด๋‹น๋งŒ
15
+ python main.py --start-date 2024-01-01 # ๋‚ ์งœ ๋ฒ”์œ„ ์ง€์ •
16
+ python main.py --party ppp --start-date 2024-01-01 --end-date 2024-06-30
17
+ """
18
+
19
+ import asyncio
20
+ import argparse
21
+ import logging
22
+ from datetime import datetime
23
+
24
+ from minjoo_crawler_async import MinjooAsyncCrawler
25
+ from ppp_crawler_async import PPPAsyncCrawler
26
+ from rebuilding_crawler_async import RebuildingAsyncCrawler
27
+ from reform_crawler_async import ReformAsyncCrawler
28
+ from basic_income_crawler_async import BasicIncomeAsyncCrawler
29
+ from jinbo_crawler_async import JinboAsyncCrawler
30
+
31
+ logging.basicConfig(
32
+ level=logging.INFO,
33
+ format='%(asctime)s [%(levelname)s] %(message)s',
34
+ handlers=[
35
+ logging.FileHandler('main.log', encoding='utf-8'),
36
+ logging.StreamHandler()
37
+ ]
38
+ )
39
+ logger = logging.getLogger(__name__)
40
+
41
+ PARTY_LABELS = {
42
+ 'minjoo': '๋”๋ถˆ์–ด๋ฏผ์ฃผ๋‹น',
43
+ 'ppp': '๊ตญ๋ฏผ์˜ํž˜',
44
+ 'rebuilding': '์กฐ๊ตญํ˜์‹ ๋‹น',
45
+ 'reform': '๊ฐœํ˜์‹ ๋‹น',
46
+ 'basic_income':'๊ธฐ๋ณธ์†Œ๋“๋‹น',
47
+ 'jinbo': '์ง„๋ณด๋‹น',
48
+ 'all': '์ „์ฒด (6๊ฐœ ์ •๋‹น)',
49
+ }
50
+
51
+ ALL_PARTIES = ['minjoo', 'ppp', 'rebuilding', 'reform', 'basic_income', 'jinbo']
52
+
53
+
54
+ def parse_args():
55
+ parser = argparse.ArgumentParser(
56
+ description='์ •๋‹น ๋ณด๋„์ž๋ฃŒ ํฌ๋กค๋Ÿฌ',
57
+ formatter_class=argparse.RawTextHelpFormatter
58
+ )
59
+ parser.add_argument(
60
+ '--party',
61
+ choices=list(PARTY_LABELS.keys()),
62
+ default='all',
63
+ help=(
64
+ 'ํฌ๋กค๋งํ•  ์ •๋‹น ์„ ํƒ (๊ธฐ๋ณธ๊ฐ’: all)\n'
65
+ ' minjoo : ๋”๋ถˆ์–ด๋ฏผ์ฃผ๋‹น\n'
66
+ ' ppp : ๊ตญ๋ฏผ์˜ํž˜\n'
67
+ ' rebuilding : ์กฐ๊ตญํ˜์‹ ๋‹น\n'
68
+ ' reform : ๊ฐœํ˜์‹ ๋‹น\n'
69
+ ' basic_income : ๊ธฐ๋ณธ์†Œ๋“๋‹น\n'
70
+ ' jinbo : ์ง„๋ณด๋‹น\n'
71
+ ' all : ์ „์ฒด ๋™์‹œ ํฌ๋กค๋ง'
72
+ )
73
+ )
74
+ parser.add_argument(
75
+ '--start-date',
76
+ metavar='YYYY-MM-DD',
77
+ default=None,
78
+ help='์ˆ˜์ง‘ ์‹œ์ž‘ ๋‚ ์งœ (์˜ˆ: 2024-01-01)\n๋ฏธ์ž…๋ ฅ ์‹œ ๋งˆ์ง€๋ง‰ ํฌ๋กค๋ง ์ดํ›„๋ถ€ํ„ฐ (์ฆ๋ถ„ ์—…๋ฐ์ดํŠธ)'
79
+ )
80
+ parser.add_argument(
81
+ '--end-date',
82
+ metavar='YYYY-MM-DD',
83
+ default=None,
84
+ help='์ˆ˜์ง‘ ์ข…๋ฃŒ ๋‚ ์งœ (์˜ˆ: 2024-12-31)\n๋ฏธ์ž…๋ ฅ ์‹œ ์˜ค๋Š˜ ๋‚ ์งœ'
85
+ )
86
+ return parser.parse_args()
87
+
88
+
89
+ def get_crawler(party: str):
90
+ """์ •๋‹น ์ฝ”๋“œ์— ๋งž๋Š” ํฌ๋กค๋Ÿฌ ์ธ์Šคํ„ด์Šค ๋ฐ˜ํ™˜"""
91
+ return {
92
+ 'minjoo': MinjooAsyncCrawler,
93
+ 'ppp': PPPAsyncCrawler,
94
+ 'rebuilding': RebuildingAsyncCrawler,
95
+ 'reform': ReformAsyncCrawler,
96
+ 'basic_income': BasicIncomeAsyncCrawler,
97
+ 'jinbo': JinboAsyncCrawler,
98
+ }[party]()
99
+
100
+
101
+ async def run_party(party: str, start_date=None, end_date=None):
102
+ """๋‹จ์ผ ์ •๋‹น ํฌ๋กค๋ง ์‹คํ–‰"""
103
+ crawler = get_crawler(party)
104
+ if start_date or end_date:
105
+ df = await crawler.collect_all(start_date, end_date)
106
+ if not df.empty:
107
+ crawler.save_local(df)
108
+ crawler.upload_to_huggingface(df)
109
+ else:
110
+ await crawler.run_incremental()
111
+
112
+
113
+ async def main():
114
+ args = parse_args()
115
+ start_time = datetime.now()
116
+
117
+ target_parties = ALL_PARTIES if args.party == 'all' else [args.party]
118
+
119
+ logger.info("=" * 60)
120
+ logger.info("์ •๋‹น ๋ณด๋„์ž๋ฃŒ ํฌ๋กค๋Ÿฌ ์‹œ์ž‘")
121
+ logger.info(f"๋Œ€์ƒ ์ •๋‹น : {PARTY_LABELS[args.party]}")
122
+ logger.info(f"์ˆ˜์ง‘ ๊ธฐ๊ฐ„ : {args.start_date or '์ฆ๋ถ„ ์—…๋ฐ์ดํŠธ'} ~ {args.end_date or '์˜ค๋Š˜'}")
123
+ logger.info("=" * 60)
124
+
125
+ if len(target_parties) == 1:
126
+ await run_party(target_parties[0], args.start_date, args.end_date)
127
+ else:
128
+ results = await asyncio.gather(
129
+ *[run_party(p, args.start_date, args.end_date) for p in target_parties],
130
+ return_exceptions=True
131
+ )
132
+ for party, result in zip(target_parties, results):
133
+ if isinstance(result, Exception):
134
+ logger.error(f"{PARTY_LABELS[party]} ํฌ๋กค๋ง ์‹คํŒจ: {result}")
135
+ else:
136
+ logger.info(f"{PARTY_LABELS[party]} ํฌ๏ฟฝ๏ฟฝ๏ฟฝ๋ง ์™„๋ฃŒ")
137
+
138
+ duration = (datetime.now() - start_time).total_seconds()
139
+ logger.info("=" * 60)
140
+ logger.info(f"์ „์ฒด ์™„๋ฃŒ! ์†Œ์š” ์‹œ๊ฐ„: {duration:.1f}์ดˆ ({duration / 60:.1f}๋ถ„)")
141
+ logger.info("=" * 60)
142
+
143
+
144
+ if __name__ == "__main__":
145
+ asyncio.run(main())
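When `--party all` is selected, `main()` fans the crawlers out with `asyncio.gather(..., return_exceptions=True)` so one failing party cannot abort the rest; exceptions come back as values and are logged per party. A minimal sketch of that error-isolation pattern (the coroutine below is a stand-in for `run_party`):

```python
import asyncio

async def crawl(party: str) -> str:
    # Stand-in for run_party(); one party fails to show the isolation.
    if party == 'broken':
        raise RuntimeError(f'{party} failed')
    return f'{party} done'

async def main() -> list:
    return await asyncio.gather(
        *[crawl(p) for p in ('minjoo', 'broken', 'jinbo')],
        return_exceptions=True,  # exceptions return as values, not raised
    )

results = asyncio.run(main())
```

Without `return_exceptions=True`, the first `RuntimeError` would propagate and cancel the sibling tasks.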
minjoo_crawler_async.py ADDED
@@ -0,0 +1,453 @@
1
+ #!/usr/bin/env python3
2
+ # -*- coding: utf-8 -*-
3
+ """
4
+ ๋”๋ถˆ์–ด๋ฏผ์ฃผ๋‹น ํฌ๋กค๋Ÿฌ - ๊ณ ์„ฑ๋Šฅ ๋น„๋™๊ธฐ ๋ฒ„์ „ + ํ—ˆ๊น…ํŽ˜์ด์Šค ์ž๋™ ์—…๋กœ๋“œ
5
+ - asyncio + aiohttp (10-20๋ฐฐ ๋น ๋ฅธ ์†๋„)
6
+ - ๋™์‹œ ์š”์ฒญ ์ˆ˜ ์ œ์–ด (์„œ๋ฒ„ ๋ถ€๋‹ด ์ตœ์†Œํ™”)
7
+ - ์ฆ๋ถ„ ์—…๋ฐ์ดํŠธ (๋งˆ์ง€๋ง‰ ๋‚ ์งœ ์ดํ›„๋งŒ ํฌ๋กค๋ง)
8
+ - ํ—ˆ๊น…ํŽ˜์ด์Šค ์ž๋™ ์—…๋กœ๋“œ
9
+ - ์ผ ๋‹จ์œ„ ์Šค์ผ€์ค„๋ง
10
+ """
11
+
12
+ import os
13
+ import json
14
+ import time
15
+ import re
16
+ import asyncio
17
+ from datetime import datetime, timedelta
18
+ from typing import List, Dict, Optional
19
+ import pandas as pd
20
+ from tqdm.asyncio import tqdm as async_tqdm
21
+ import aiohttp
22
+ from bs4 import BeautifulSoup
23
+ from dotenv import load_dotenv
24
+ from huggingface_hub import HfApi, login
25
+ from datasets import Dataset, load_dataset, concatenate_datasets
26
+
27
+ # .env ํŒŒ์ผ ๋กœ๋“œ
28
+ load_dotenv()
29
+
30
+ class MinjooAsyncCrawler:
31
+ def __init__(self, config_path="crawler_config.json"):
32
+ self.base_url = "https://theminjoo.kr/main/sub"
33
+ self.party_name = "๋”๋ถˆ์–ด๋ฏผ์ฃผ๋‹น"
34
+ self.config_path = config_path
35
+ self.state_path = "crawler_state.json"
36
+
37
+ # ์„ค์ • ๋กœ๋“œ
38
+ self.load_config()
39
+
40
+ # ํ—ˆ๊น…ํŽ˜์ด์Šค ์„ค์ •
41
+ self.hf_token = os.getenv("HF_TOKEN")
42
+ self.hf_repo_id = os.getenv("HF_REPO_ID", "minjoo-press-releases")
43
+
44
+ # ๋™์‹œ ์š”์ฒญ ์ˆ˜ ์ œํ•œ (์„œ๋ฒ„ ๋ถ€๋‹ด ๋ฐฉ์ง€)
45
+ self.semaphore = asyncio.Semaphore(20)
46
+
47
+ def load_config(self):
48
+ """์„ค์ • ํŒŒ์ผ ๋กœ๋“œ"""
49
+ default_config = {
50
+ "boards": {
51
+ "๋ณด๋„์ž๋ฃŒ": "188",
52
+ "๋…ผํ‰_๋ธŒ๋ฆฌํ•‘": "11",
53
+ "๋ชจ๋‘๋ฐœ์–ธ": "230"
54
+ },
55
+ "start_date": "2003-11-11",
56
+ "max_pages": 10000,
57
+ "concurrent_requests": 20,
58
+ "request_delay": 0.1,
59
+ "output_path": "./data"
60
+ }
61
+
62
+ if os.path.exists(self.config_path):
63
+ with open(self.config_path, 'r', encoding='utf-8') as f:
64
+ config = json.load(f)
65
+ # ๋ฏผ์ฃผ๋‹น ์„ค์ •๋งŒ ์ถ”์ถœ
66
+ if 'minjoo' in config:
67
+ self.config = config['minjoo']
68
+ else:
69
+ self.config = default_config
70
+ else:
71
+ self.config = default_config
72
+
73
+ self.boards = self.config["boards"]
74
+ self.start_date = self.config["start_date"]
75
+ self.max_pages = self.config["max_pages"]
76
+ self.output_path = self.config["output_path"]
77
+
78
+ def load_state(self) -> Dict:
79
+ """ํฌ๋กค๋Ÿฌ ์ƒํƒœ ๋กœ๋“œ (๋งˆ์ง€๋ง‰ ํฌ๋กค๋ง ๋‚ ์งœ)"""
80
+ if os.path.exists(self.state_path):
81
+ with open(self.state_path, 'r', encoding='utf-8') as f:
82
+ state = json.load(f)
83
+ return state.get('minjoo', {})
84
+ return {}
85
+
86
+ def save_state(self, state: Dict):
87
+ """ํฌ๋กค๋Ÿฌ ์ƒํƒœ ์ €์žฅ"""
88
+ all_state = {}
89
+ if os.path.exists(self.state_path):
90
+ with open(self.state_path, 'r', encoding='utf-8') as f:
91
+ all_state = json.load(f)
92
+
93
+ all_state['minjoo'] = state
94
+
95
+ with open(self.state_path, 'w', encoding='utf-8') as f:
96
+ json.dump(all_state, f, ensure_ascii=False, indent=2)
97
+
98
+ @staticmethod
99
+ def parse_date(date_str: str) -> Optional[datetime]:
100
+ """๋‚ ์งœ ํŒŒ์‹ฑ"""
101
+ try:
102
+ return datetime.strptime(date_str.strip().split()[0], '%Y-%m-%d')
103
+ except (ValueError, IndexError):
104
+ return None
105
+
106
+ @staticmethod
107
+ def clean_text(text: str) -> str:
108
+ """ํ…์ŠคํŠธ ์ •๋ฆฌ"""
109
+ text = text.replace('\xa0', '').replace('\u200b', '').replace('โ€‹', '')
110
+ return text.strip()
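`clean_text` strips the non-breaking spaces and zero-width characters that pad the party sites' HTML before the text is joined into the article body. Standalone:

```python
def clean_text(text: str) -> str:
    # Remove non-breaking spaces (\xa0) and zero-width spaces (\u200b),
    # then trim ordinary surrounding whitespace.
    for ch in ('\xa0', '\u200b'):
        text = text.replace(ch, '')
    return text.strip()
```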
111
+
112
+ async def fetch_with_retry(self, session: aiohttp.ClientSession, url: str,
113
+ max_retries: int = 3) -> Optional[str]:
114
+ """์žฌ์‹œ๋„ ๋กœ์ง์ด ์žˆ๋Š” ๋น„๋™๊ธฐ ์š”์ฒญ"""
115
+ async with self.semaphore:
116
+ for attempt in range(max_retries):
117
+ try:
118
+ await asyncio.sleep(self.config.get("request_delay", 0.1))
119
+ async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as response:
120
+ if response.status == 200:
121
+ return await response.text()
122
+ except Exception as e:
123
+ if attempt < max_retries - 1:
124
+ await asyncio.sleep(1)
125
+ else:
126
+ return None
127
+ return None
128
+
129
+ async def fetch_list_page(self, session: aiohttp.ClientSession,
130
+ board_id: str, page_num: int,
131
+ start_date: datetime, end_date: datetime) -> tuple:
132
+ """๋ชฉ๋ก ํŽ˜์ด์ง€ ํ•˜๋‚˜ ๊ฐ€์ ธ์˜ค๊ธฐ"""
133
+ if page_num == 0:
134
+ url = f"{self.base_url}/news/list.php?brd={board_id}"
135
+ else:
136
+ url = f"{self.base_url}/news/list.php?sno={page_num}&par=&&brd={board_id}"
137
+
138
+ html = await self.fetch_with_retry(session, url)
139
+ if not html:
140
+ return [], False
141
+
142
+ soup = BeautifulSoup(html, 'html.parser')
143
+ board_items = soup.find_all('div', {'class': 'board-item'})
144
+
145
+ if not board_items:
146
+ return [], True # ๋นˆ ํŽ˜์ด์ง€
147
+
148
+ data = []
149
+ stop_flag = False
150
+
151
+ for item in board_items:
152
+ try:
153
+ link_tag = item.find('a')
154
+ if not link_tag:
155
+ continue
156
+
157
+ title_span = link_tag.find('span')
158
+ if not title_span:
159
+ continue
160
+
161
+ title = title_span.get_text(strip=True).replace('\n', ' ')
162
+
163
+ # URL ์ฒ˜๋ฆฌ
164
+ article_url = link_tag.get('href', '')
165
+ if article_url.startswith('./'):
166
+ article_url = self.base_url + '/news/' + article_url[2:]
167
+ elif not article_url.startswith('http'):
168
+ article_url = self.base_url + article_url
169
+
170
+ # ์นดํ…Œ๊ณ ๋ฆฌ
171
+ category_tag = item.find('p', {'class': 'category'})
172
+ category = ""
173
+ if category_tag:
174
+ category_span = category_tag.find('span')
175
+ if category_span:
176
+ category = category_span.get_text(strip=True)
177
+
178
+ # ๋‚ ์งœ
179
+ time_tag = item.find('time')
180
+ if not time_tag:
181
+ continue
182
+
183
+ date_str = time_tag.get('datetime', '') or time_tag.get_text(strip=True)
184
+ article_date = self.parse_date(date_str)
185
+
186
+ if not article_date:
187
+ continue
188
+ if article_date < start_date:
189
+ stop_flag = True
190
+ break
191
+ if article_date > end_date:
192
+ continue
193
+
194
+ data.append({
195
+ 'category': category,
196
+ 'title': title,
197
+ 'date': date_str.split()[0] if ' ' in date_str else date_str,
198
+ 'url': article_url
199
+ })
200
+ except Exception:
201
+ continue
202
+
203
+ return data, stop_flag
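The theminjoo.kr list pages mix `./`-relative links with absolute URLs, so `fetch_list_page` normalizes both before storing them. The same branch, isolated (base URL taken from the class above):

```python
def absolutize(base_url: str, href: str) -> str:
    # './view.php?...' links live under /news/; anything else that is not
    # already absolute is joined onto the base URL as-is.
    if href.startswith('./'):
        return base_url + '/news/' + href[2:]
    if not href.startswith('http'):
        return base_url + href
    return href

base = 'https://theminjoo.kr/main/sub'
```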
204
+
205
+ async def fetch_article_detail(self, session: aiohttp.ClientSession,
206
+ url: str) -> Dict:
207
+ """์ƒ์„ธ ํŽ˜์ด์ง€ ๊ฐ€์ ธ์˜ค๊ธฐ"""
208
+ html = await self.fetch_with_retry(session, url)
209
+ if not html:
210
+ return {'text': "๋ณธ๋ฌธ ์กฐํšŒ ์‹คํŒจ", 'writer': "", 'published_date': ""}
211
+
212
+ soup = BeautifulSoup(html, 'html.parser')
213
+ text_parts = []
214
+ writer = ""
215
+ published_date = ""
216
+
217
+ # ๊ฒŒ์‹œ์ผ
218
+ date_li = soup.find('li', {'class': 'date'})
219
+ if date_li:
220
+ date_text = date_li.get_text(strip=True)
221
+ match = re.search(r'(\d{4}-\d{2}-\d{2})', date_text)
222
+ if match:
223
+ published_date = match.group(1)
224
+
225
+ # ๋ณธ๋ฌธ
226
+ contents_div = soup.find('div', {'class': 'board-view__contents'})
227
+ if contents_div:
228
+ for element in contents_div.descendants:
229
+ if element.name == 'p':
230
+ text = element.get_text(strip=True)
231
+ cleaned = self.clean_text(text)
232
+ if cleaned:
233
+ text_parts.append(cleaned)
234
+ elif element.name == 'b':
235
+ text = element.get_text(strip=True)
236
+ cleaned = self.clean_text(text)
237
+ if cleaned and not writer:
238
+ if '๋ฏผ์ฃผ๋‹น' in cleaned or '๊ณต๋ณด๊ตญ' in cleaned or '๋Œ€๋ณ€์ธ' in cleaned:
239
+ writer = cleaned
240
+
241
+ return {
242
+ 'text': '\n'.join(text_parts),
243
+ 'writer': writer,
244
+ 'published_date': published_date
245
+ }
246
+
247
+ async def collect_board(self, board_name: str, board_id: str,
248
+ start_date: str, end_date: str) -> List[Dict]:
249
+ """ํ•œ ๊ฒŒ์‹œํŒ ์ „์ฒด ์ˆ˜์ง‘ (๋น„๋™๊ธฐ)"""
250
+ start_dt = datetime.strptime(start_date, '%Y-%m-%d')
251
+ end_dt = datetime.strptime(end_date, '%Y-%m-%d')
252
+
253
+ print(f"\nโ–ถ [{board_name}] ๋ชฉ๋ก ์ˆ˜์ง‘ ์‹œ์ž‘...")
254
+
255
+ headers = {
256
+ 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
257
+ 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
258
+ 'Accept-Language': 'ko-KR,ko;q=0.9',
259
+ }
260
+
261
+ async with aiohttp.ClientSession(headers=headers) as session:
262
+ # 1๋‹จ๊ณ„: ๋ชฉ๋ก ํŽ˜์ด์ง€ ์ˆ˜์ง‘
263
+ all_items = []
264
+ page_num = 0
265
+ empty_pages = 0
266
+ max_empty_pages = 3
267
+
268
+ with async_tqdm(desc=f"[{board_name}] ๋ชฉ๋ก", unit="ํŽ˜์ด์ง€") as pbar:
269
+ while page_num <= self.max_pages * 20:
270
+ items, stop_flag = await self.fetch_list_page(
271
+ session, board_id, page_num, start_dt, end_dt
272
+ )
273
+
274
+ if not items:
275
+ empty_pages += 1
276
+ if empty_pages >= max_empty_pages or stop_flag:
277
+ break
278
+ else:
279
+ empty_pages = 0
280
+ all_items.extend(items)
281
+
282
+ pbar.update(1)
283
+ pbar.set_postfix({"์ˆ˜์ง‘": len(all_items)})
284
+
285
+ if stop_flag:
286
+ break
287
+
288
+ page_num += 20
289
+
290
+ print(f" โœ“ {len(all_items)}๊ฐœ ํ•ญ๋ชฉ ๋ฐœ๊ฒฌ")
291
+
292
+ # 2๋‹จ๊ณ„: ์ƒ์„ธ ํŽ˜์ด์ง€ ์ˆ˜์ง‘ (๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ)
293
+ if all_items:
294
+ print(f" โ–ถ ์ƒ์„ธ ํŽ˜์ด์ง€ ์ˆ˜์ง‘ ์ค‘...")
295
+ tasks = [self.fetch_article_detail(session, item['url']) for item in all_items]
296
+
297
+ # ์ง„ํ–‰๋ฅ  ํ‘œ์‹œ์™€ ํ•จ๊ป˜ ๋ณ‘๋ ฌ ์‹คํ–‰
298
+ details = []
299
+ for coro in async_tqdm(asyncio.as_completed(tasks),
300
+ total=len(tasks),
301
+ desc=f"[{board_name}] ์ƒ์„ธ"):
302
+ detail = await coro
303
+ details.append(detail)
304
+
305
+ # ์ƒ์„ธ ์ •๋ณด ๋ณ‘ํ•ฉ
306
+ for item, detail in zip(all_items, details):
307
+ item.update(detail)
308
+ item['board_name'] = board_name
309
+
310
+ print(f"โœ“ [{board_name}] ์™„๋ฃŒ: {len(all_items)}๊ฐœ")
311
+ return all_items
312
+
313
+ async def collect_all(self, start_date: Optional[str] = None,
314
+ end_date: Optional[str] = None) -> pd.DataFrame:
315
+ """๋ชจ๋“  ๊ฒŒ์‹œํŒ ์ˆ˜์ง‘"""
316
+ if not end_date:
317
+ end_date = datetime.now().strftime('%Y-%m-%d')
318
+ if not start_date:
319
+ start_date = self.start_date
320
+
321
+ print(f"\n{'='*60}")
322
+ print(f"๋”๋ถˆ์–ด๋ฏผ์ฃผ๋‹น ๋ณด๋„์ž๋ฃŒ ์ˆ˜์ง‘ - ๋น„๋™๊ธฐ ๊ณ ์„ฑ๋Šฅ ๋ฒ„์ „")
323
+ print(f"๊ธฐ๊ฐ„: {start_date} ~ {end_date}")
324
+ print(f"{'='*60}")
325
+
326
+ # ๋ชจ๋“  ๊ฒŒ์‹œํŒ ๋ณ‘๋ ฌ ์ˆ˜์ง‘
327
+ tasks = [
328
+ self.collect_board(board_name, board_id, start_date, end_date)
329
+ for board_name, board_id in self.boards.items()
330
+ ]
331
+
332
+ results = await asyncio.gather(*tasks)
333
+
334
+ # ๋ฐ์ดํ„ฐ ๊ฒฐํ•ฉ
335
+ all_data = []
336
+ for items in results:
337
+ all_data.extend(items)
338
+
339
+ if not all_data:
340
+ print("\nโš ๏ธ ์ˆ˜์ง‘๋œ ๋ฐ์ดํ„ฐ ์—†์Œ")
341
+ return pd.DataFrame()
342
+
343
+ df = pd.DataFrame(all_data)
344
+ df = df[['board_name', 'title', 'category', 'published_date', 'writer', 'text', 'url']]
345
+ df = df[(df['title'] != "") & (df['text'] != "")]
346
+ df['published_date'] = pd.to_datetime(df['published_date'], errors='coerce')
347
+ df = df.rename(columns={'published_date': 'date'})
348
+
349
+ print(f"\nโœ“ ์ด {len(df)}๊ฐœ ์ˆ˜์ง‘ ์™„๋ฃŒ")
350
+ return df
351
+
352
+ def save_local(self, df: pd.DataFrame):
353
+ """๋กœ์ปฌ์— ์ €์žฅ"""
354
+ os.makedirs(self.output_path, exist_ok=True)
355
+ timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
356
+
357
+ # CSV
358
+ csv_path = os.path.join(self.output_path, f"{self.party_name}_{timestamp}.csv")
359
+ df.to_csv(csv_path, index=False, encoding='utf-8-sig')
360
+
361
+ # Excel
362
+ xlsx_path = os.path.join(self.output_path, f"{self.party_name}_{timestamp}.xlsx")
363
+ df.to_excel(xlsx_path, index=False, engine='openpyxl')
364
+
365
+ print(f"โœ“ CSV: {csv_path}")
366
+ print(f"โœ“ Excel: {xlsx_path}")
367
+
368
+ def upload_to_huggingface(self, df: pd.DataFrame):
369
+ """ํ—ˆ๊น…ํŽ˜์ด์Šค์— ์—…๋กœ๋“œ"""
370
+ if not self.hf_token:
371
+ print("\nโš ๏ธ HF_TOKEN์ด ์„ค์ •๋˜์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค. .env ํŒŒ์ผ์„ ํ™•์ธํ•˜์„ธ์š”.")
372
+ return
373
+
374
+ print(f"\nโ–ถ ํ—ˆ๊น…ํŽ˜์ด์Šค ์—…๋กœ๋“œ ์ค‘... (repo: {self.hf_repo_id})")
375
+
376
+ try:
377
+ # ๋กœ๊ทธ์ธ
378
+ login(token=self.hf_token)
379
+ api = HfApi()
380
+
381
+ # ์ƒˆ ๋ฐ์ดํ„ฐ์…‹ ์ƒ์„ฑ
382
+ new_dataset = Dataset.from_pandas(df)
383
+
384
+ # ๊ธฐ์กด ๋ฐ์ดํ„ฐ์…‹ ํ™•์ธ ๋ฐ ๋ณ‘ํ•ฉ
385
+ try:
386
+ existing_dataset = load_dataset(self.hf_repo_id, split='train')
387
+ print(f" โ„น๏ธ ๊ธฐ์กด ๋ฐ์ดํ„ฐ: {len(existing_dataset)}๊ฐœ")
388
+
389
+ # ์ค‘๋ณต ์ œ๊ฑฐ๋ฅผ ์œ„ํ•ด URL ๊ธฐ์ค€์œผ๋กœ ๋ณ‘ํ•ฉ
390
+ existing_df = existing_dataset.to_pandas()
391
+ combined_df = pd.concat([existing_df, df], ignore_index=True)
392
+ combined_df = combined_df.drop_duplicates(subset=['url'], keep='last')
393
+ combined_df = combined_df.sort_values('date', ascending=False).reset_index(drop=True)
394
+
395
+ final_dataset = Dataset.from_pandas(combined_df)
396
+ print(f" โœ“ ๋ณ‘ํ•ฉ ํ›„: {len(final_dataset)}๊ฐœ (์ค‘๋ณต ์ œ๊ฑฐ๋จ)")
397
+ except Exception:
398
+ print(f" โ„น๏ธ ์‹ ๊ทœ ๋ฐ์ดํ„ฐ์…‹ ์ƒ์„ฑ")
399
+ final_dataset = new_dataset
400
+
401
+ # ์—…๋กœ๋“œ
402
+ final_dataset.push_to_hub(self.hf_repo_id, token=self.hf_token)
403
+ print(f"โœ“ ํ—ˆ๊น…ํŽ˜์ด์Šค ์—…๋กœ๋“œ ์™„๋ฃŒ!")
404
+ print(f" ๐Ÿ”— https://huggingface.co/datasets/{self.hf_repo_id}")
405
+
406
+ except Exception as e:
407
+ print(f"โœ— ์—…๋กœ๋“œ ์‹คํŒจ: {e}")
408
+
409
+ async def run_incremental(self):
410
+ """์ฆ๋ถ„ ์—…๋ฐ์ดํŠธ ์‹คํ–‰ (๋งˆ์ง€๋ง‰ ๋‚ ์งœ ์ดํ›„๋งŒ)"""
411
+ state = self.load_state()
412
+ last_date = state.get('last_crawl_date')
413
+
414
+ if last_date:
415
+ # ๋งˆ์ง€๋ง‰ ํฌ๋กค๋ง ๋‚ ์งœ ๋‹ค์Œ๋‚ ๋ถ€ํ„ฐ
416
+ start_date = (datetime.strptime(last_date, '%Y-%m-%d') + timedelta(days=1)).strftime('%Y-%m-%d')
417
+ print(f"๐Ÿ“… ์ฆ๋ถ„ ์—…๋ฐ์ดํŠธ: {start_date} ์ดํ›„ ๋ฐ์ดํ„ฐ๋งŒ ์ˆ˜์ง‘")
418
+ else:
419
+ start_date = self.start_date
420
+ print(f"๐Ÿ“… ์ „์ฒด ์ˆ˜์ง‘: {start_date}๋ถ€ํ„ฐ")
421
+
422
+ end_date = datetime.now().strftime('%Y-%m-%d')
423
+
424
+ # ํฌ๋กค๋ง
425
+ df = await self.collect_all(start_date, end_date)
426
+
427
+ if df.empty:
428
+ print("โœ“ ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ ์—†์Œ")
429
+ return
430
+
431
+ # ๋กœ์ปฌ ์ €์žฅ
432
+ self.save_local(df)
433
+
434
+ # ํ—ˆ๊น…ํŽ˜์ด์Šค ์—…๋กœ๋“œ
435
+ self.upload_to_huggingface(df)
436
+
437
+ # ์ƒํƒœ ์ €์žฅ
438
+ state['last_crawl_date'] = end_date
439
+ state['last_crawl_time'] = datetime.now().isoformat()
440
+ state['last_count'] = len(df)
441
+ self.save_state(state)
442
+
443
+ print(f"\n{'='*60}")
444
+ print(f"โœ“ ์™„๋ฃŒ! ๋‹ค์Œ ์‹คํ–‰: ๋‚ด์ผ")
445
+ print(f"{'='*60}\n")
446
+
447
+ async def main():
448
+ """๋ฉ”์ธ ํ•จ์ˆ˜"""
449
+ crawler = MinjooAsyncCrawler()
450
+ await crawler.run_incremental()
451
+
452
+ if __name__ == "__main__":
453
+ asyncio.run(main())
ppp_crawler_async.py ADDED
@@ -0,0 +1,446 @@
1
+ #!/usr/bin/env python3
2
+ # -*- coding: utf-8 -*-
3
+ """
4
+ ๊ตญ๋ฏผ์˜ํž˜ ํฌ๋กค๋Ÿฌ - ๊ณ ์„ฑ๋Šฅ ๋น„๋™๊ธฐ ๋ฒ„์ „ + ํ—ˆ๊น…ํŽ˜์ด์Šค ์ž๋™ ์—…๋กœ๋“œ
5
+ - asyncio + aiohttp (10-20๋ฐฐ ๋น ๋ฅธ ์†๋„)
6
+ - ๋™์‹œ ์š”์ฒญ ์ˆ˜ ์ œ์–ด (์„œ๋ฒ„ ๋ถ€๋‹ด ์ตœ์†Œํ™”)
7
+ - ์ฆ๋ถ„ ์—…๋ฐ์ดํŠธ (๋งˆ์ง€๋ง‰ ๋‚ ์งœ ์ดํ›„๋งŒ ํฌ๋กค๋ง)
8
+ - ํ—ˆ๊น…ํŽ˜์ด์Šค ์ž๋™ ์—…๋กœ๋“œ
9
+ - ์ผ ๋‹จ์œ„ ์Šค์ผ€์ค„๋ง
10
+ """
11
+
12
+ import os
13
+ import json
14
+ import re
15
+ import asyncio
16
+ from datetime import datetime, timedelta
17
+ from typing import List, Dict, Optional
18
+ import pandas as pd
19
+ from tqdm.asyncio import tqdm as async_tqdm
20
+ import aiohttp
21
+ from bs4 import BeautifulSoup
22
+ from dotenv import load_dotenv
23
+ from huggingface_hub import HfApi, login
24
+ from datasets import Dataset, load_dataset
25
+
26
+ # .env ํŒŒ์ผ ๋กœ๋“œ
27
+ load_dotenv()
28
+
29
+ class PPPAsyncCrawler:
30
+ def __init__(self, config_path="crawler_config.json"):
31
+ self.base_url = "https://www.peoplepowerparty.kr"
32
+ self.party_name = "๊ตญ๋ฏผ์˜ํž˜"
33
+ self.config_path = config_path
34
+ self.state_path = "crawler_state.json"
35
+
36
+ # ์„ค์ • ๋กœ๋“œ
37
+ self.load_config()
38
+
39
+ # ํ—ˆ๊น…ํŽ˜์ด์Šค ์„ค์ •
40
+ self.hf_token = os.getenv("HF_TOKEN")
41
+ self.hf_repo_id = os.getenv("HF_REPO_ID_PPP", "ppp-press-releases")
42
+
43
+ # ๋™์‹œ ์š”์ฒญ ์ˆ˜ ์ œํ•œ
44
+ self.semaphore = asyncio.Semaphore(20)
45
+
46
+ def load_config(self):
47
+ """์„ค์ • ํŒŒ์ผ ๋กœ๋“œ"""
48
+ default_config = {
49
+ "boards": {
50
+ "๋Œ€๋ณ€์ธ_๋…ผํ‰๋ณด๋„์ž๋ฃŒ": "BBSDD0001",
51
+ "์›๋‚ด_๋ณด๋„์ž๋ฃŒ": "BBSDD0002",
52
+ "๋ฏธ๋””์–ดํŠน์œ„_๋ณด๋„์ž๋ฃŒ": "BBSDD0042"
53
+ },
54
+ "start_date": "2000-03-10",
55
+ "max_pages": 10000,
56
+ "concurrent_requests": 20,
57
+ "request_delay": 0.1,
58
+ "output_path": "./data"
59
+ }
60
+
61
+ if os.path.exists(self.config_path):
62
+ with open(self.config_path, 'r', encoding='utf-8') as f:
63
+ config = json.load(f)
64
+ # ๊ตญ๋ฏผ์˜ํž˜ ์„ค์ •๋งŒ ์ถ”์ถœ
65
+ if 'ppp' in config:
66
+ self.config = config['ppp']
67
+ else:
68
+ self.config = default_config
69
+ else:
70
+ self.config = default_config
71
+
72
+ self.boards = self.config["boards"]
73
+ self.start_date = self.config["start_date"]
74
+ self.max_pages = self.config["max_pages"]
75
+ self.output_path = self.config["output_path"]
76
+
77
+ def load_state(self) -> Dict:
78
+ """ํฌ๋กค๋Ÿฌ ์ƒํƒœ ๋กœ๋“œ"""
79
+ if os.path.exists(self.state_path):
80
+ with open(self.state_path, 'r', encoding='utf-8') as f:
81
+ state = json.load(f)
82
+ return state.get('ppp', {})
83
+ return {}
84
+
85
+ def save_state(self, state: Dict):
86
+ """ํฌ๋กค๋Ÿฌ ์ƒํƒœ ์ €์žฅ"""
87
+ all_state = {}
88
+ if os.path.exists(self.state_path):
89
+ with open(self.state_path, 'r', encoding='utf-8') as f:
90
+ all_state = json.load(f)
91
+
92
+ all_state['ppp'] = state
93
+
94
+ with open(self.state_path, 'w', encoding='utf-8') as f:
95
+ json.dump(all_state, f, ensure_ascii=False, indent=2)
96
+
97
+ @staticmethod
98
+ def parse_date(date_str: str) -> Optional[datetime]:
99
+ """๋‚ ์งœ ํŒŒ์‹ฑ"""
100
+ try:
101
+ return datetime.strptime(date_str.strip(), '%Y-%m-%d')
102
+ except ValueError:
103
+ return None
104
+
105
+ @staticmethod
106
+ def clean_text(text: str) -> str:
107
+ """ํ…์ŠคํŠธ ์ •๋ฆฌ"""
108
+ text = text.replace('\xa0', '').replace('\u200b', '')  # NBSP, zero-width space
109
+ return text.strip()
110
+
111
+ async def fetch_with_retry(self, session: aiohttp.ClientSession, url: str,
112
+ max_retries: int = 3) -> Optional[str]:
113
+ """์žฌ์‹œ๋„ ๋กœ์ง์ด ์žˆ๋Š” ๋น„๋™๊ธฐ ์š”์ฒญ"""
114
+ async with self.semaphore:
115
+ for attempt in range(max_retries):
116
+ try:
117
+ await asyncio.sleep(self.config.get("request_delay", 0.1))
118
+ async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as response:
119
+ if response.status == 200:
120
+ return await response.text()
121
+ except Exception as e:
122
+ if attempt < max_retries - 1:
123
+ await asyncio.sleep(1)
124
+ else:
125
+ return None
126
+ return None
127
+
128
+ async def fetch_list_page(self, session: aiohttp.ClientSession,
129
+ board_id: str, page_num: int,
130
+ start_date: datetime, end_date: datetime) -> tuple:
131
+ """๋ชฉ๋ก ํŽ˜์ด์ง€ ํ•˜๋‚˜ ๊ฐ€์ ธ์˜ค๊ธฐ"""
132
+ url = f"{self.base_url}/news/comment/{board_id}?page={page_num}"
133
+
134
+ html = await self.fetch_with_retry(session, url)
135
+ if not html:
136
+ return [], False
137
+
138
+ soup = BeautifulSoup(html, 'html.parser')
139
+
140
+ table_div = soup.find('div', {'class': 'board-tbl'})
141
+ if not table_div:
142
+ return [], True
143
+
144
+ tbody = table_div.find('tbody')
145
+ if not tbody:
146
+ return [], True
147
+
148
+ rows = tbody.find_all('tr')
149
+ if not rows:
150
+ return [], True
151
+
152
+ data = []
153
+ stop_flag = False
154
+
155
+ for row in rows:
156
+ cols = row.find_all('td')
157
+ if len(cols) < 3:
158
+ continue
159
+
160
+ try:
161
+ no_td = row.find('td', {'class': 'no'})
162
+ class_td = row.find('td', {'class': 'class'})
163
+
164
+ no = no_td.get_text(strip=True) if no_td else cols[0].get_text(strip=True)
165
+ section = class_td.get_text(strip=True) if class_td else cols[1].get_text(strip=True)
166
+
167
+ link_tag = row.find('a')
168
+ if not link_tag:
169
+ continue
170
+
171
+ title = link_tag.get_text(strip=True).replace('\n', ' ')
172
+ article_url = self.base_url + link_tag.get('href', '')
173
+
174
+ # ๋‚ ์งœ ์ถ”์ถœ
175
+ date_str = ""
176
+ if len(cols) >= 4:
177
+ date_str = cols[3].get_text(strip=True)
178
+
179
+ if not date_str or not re.match(r'\d{4}-\d{2}-\d{2}', date_str):
180
+ dd_date = row.find('dd', {'class': 'date'})
181
+ if dd_date:
182
+ span = dd_date.find('span')
183
+ if span:
184
+ span.decompose()
185
+ date_str = dd_date.get_text(strip=True)
186
+
187
+ article_date = self.parse_date(date_str)
188
+
189
+ if not article_date:
190
+ continue
191
+ if article_date < start_date:
192
+ stop_flag = True
193
+ break
194
+ if article_date > end_date:
195
+ continue
196
+
197
+ data.append({
198
+ 'no': no,
199
+ 'section': section,
200
+ 'title': title,
201
+ 'date': date_str,
202
+ 'url': article_url
203
+ })
204
+ except Exception:
205
+ continue
206
+
207
+ return data, stop_flag
208
+
209
+ async def fetch_article_detail(self, session: aiohttp.ClientSession,
210
+ url: str) -> Dict:
211
+ """์ƒ์„ธ ํŽ˜์ด์ง€ ๊ฐ€์ ธ์˜ค๊ธฐ"""
212
+ html = await self.fetch_with_retry(session, url)
213
+ if not html:
214
+ return {'text': "๋ณธ๋ฌธ ์กฐํšŒ ์‹คํŒจ", 'writer': ""}
215
+
216
+ soup = BeautifulSoup(html, 'html.parser')
217
+ text_parts = []
218
+ writer = ""
219
+
220
+ conts_tag = soup.select_one('dd.conts')
221
+
222
+ if conts_tag:
223
+ hwp_div = conts_tag.find('div', {'id': 'hwpEditorBoardContent'})
224
+ if hwp_div:
225
+ hwp_div.decompose()
226
+
227
+ p_tags = conts_tag.find_all('p')
228
+
229
+ for p in p_tags:
230
+ style = p.get('style', '')
231
+ is_center = 'text-align:center' in style.replace(' ', '').lower()
232
+
233
+ raw_text = p.get_text(strip=True)
234
+ cleaned_text = self.clean_text(raw_text)
235
+
236
+ if not cleaned_text:
237
+ continue
238
+
239
+ if is_center:
240
+ if not re.match(r'\d{4}\.\s*\d{1,2}\.\s*\d{1,2}', cleaned_text):
241
+ writer = cleaned_text
242
+ else:
243
+ text_parts.append(cleaned_text)
244
+
245
+ return {'text': '\n'.join(text_parts), 'writer': writer}
246
+
247
+ async def collect_board(self, board_name: str, board_id: str,
248
+ start_date: str, end_date: str) -> List[Dict]:
249
+ """ํ•œ ๊ฒŒ์‹œํŒ ์ „์ฒด ์ˆ˜์ง‘ (๋น„๋™๊ธฐ)"""
250
+ start_dt = datetime.strptime(start_date, '%Y-%m-%d')
251
+ end_dt = datetime.strptime(end_date, '%Y-%m-%d')
252
+
253
+ print(f"\nโ–ถ [{board_name}] ๋ชฉ๋ก ์ˆ˜์ง‘ ์‹œ์ž‘...")
254
+
255
+ headers = {
256
+ 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
257
+ 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
258
+ 'Accept-Language': 'ko-KR,ko;q=0.9',
259
+ }
260
+
261
+ async with aiohttp.ClientSession(headers=headers) as session:
262
+ # 1๋‹จ๊ณ„: ๋ชฉ๋ก ํŽ˜์ด์ง€ ์ˆ˜์ง‘
263
+ all_items = []
264
+ page_num = 1
265
+ empty_pages = 0
266
+ max_empty_pages = 3
267
+
268
+ with async_tqdm(desc=f"[{board_name}] ๋ชฉ๋ก", unit="ํŽ˜์ด์ง€") as pbar:
269
+ while page_num <= self.max_pages:
270
+ items, stop_flag = await self.fetch_list_page(
271
+ session, board_id, page_num, start_dt, end_dt
272
+ )
273
+
274
+ if not items:
275
+ empty_pages += 1
276
+ if empty_pages >= max_empty_pages or stop_flag:
277
+ break
278
+ else:
279
+ empty_pages = 0
280
+ all_items.extend(items)
281
+
282
+ pbar.update(1)
283
+ pbar.set_postfix({"์ˆ˜์ง‘": len(all_items)})
284
+
285
+ if stop_flag:
286
+ break
287
+
288
+ page_num += 1
289
+
290
+ print(f" โœ“ {len(all_items)}๊ฐœ ํ•ญ๋ชฉ ๋ฐœ๊ฒฌ")
291
+
292
+ # 2๋‹จ๊ณ„: ์ƒ์„ธ ํŽ˜์ด์ง€ ์ˆ˜์ง‘ (๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ)
293
+ if all_items:
294
+ print(f" โ–ถ ์ƒ์„ธ ํŽ˜์ด์ง€ ์ˆ˜์ง‘ ์ค‘...")
295
+ tasks = [self.fetch_article_detail(session, item['url']) for item in all_items]
296
+
297
+ details = []
298
+ for coro in async_tqdm(asyncio.as_completed(tasks),
299
+ total=len(tasks),
300
+ desc=f"[{board_name}] ์ƒ์„ธ"):
301
+ detail = await coro
302
+ details.append(detail)
303
+
304
+ # ์ƒ์„ธ ์ •๋ณด ๋ณ‘ํ•ฉ
305
+ for item, detail in zip(all_items, details):
306
+ item.update(detail)
307
+ item['board_name'] = board_name
308
+
309
+ print(f"โœ“ [{board_name}] ์™„๋ฃŒ: {len(all_items)}๊ฐœ")
310
+ return all_items
311
+
312
+ async def collect_all(self, start_date: Optional[str] = None,
313
+ end_date: Optional[str] = None) -> pd.DataFrame:
314
+ """๋ชจ๋“  ๊ฒŒ์‹œํŒ ์ˆ˜์ง‘"""
315
+ if not end_date:
316
+ end_date = datetime.now().strftime('%Y-%m-%d')
317
+ if not start_date:
318
+ start_date = self.start_date
319
+
320
+ print(f"\n{'='*60}")
321
+ print(f"๊ตญ๋ฏผ์˜ํž˜ ๋ณด๋„์ž๋ฃŒ ์ˆ˜์ง‘ - ๋น„๋™๊ธฐ ๊ณ ์„ฑ๋Šฅ ๋ฒ„์ „")
322
+ print(f"๊ธฐ๊ฐ„: {start_date} ~ {end_date}")
323
+ print(f"{'='*60}")
324
+
325
+ # ๋ชจ๋“  ๊ฒŒ์‹œํŒ ๋ณ‘๋ ฌ ์ˆ˜์ง‘
326
+ tasks = [
327
+ self.collect_board(board_name, board_id, start_date, end_date)
328
+ for board_name, board_id in self.boards.items()
329
+ ]
330
+
331
+ results = await asyncio.gather(*tasks)
332
+
333
+ # ๋ฐ์ดํ„ฐ ๊ฒฐํ•ฉ
334
+ all_data = []
335
+ for items in results:
336
+ all_data.extend(items)
337
+
338
+ if not all_data:
339
+ print("\nโš ๏ธ ์ˆ˜์ง‘๋œ ๋ฐ์ดํ„ฐ ์—†์Œ")
340
+ return pd.DataFrame()
341
+
342
+ df = pd.DataFrame(all_data)
343
+ df = df[['board_name', 'no', 'title', 'section', 'date', 'writer', 'text', 'url']]
344
+ df = df[(df['title'] != "") & (df['text'] != "")]
345
+ df['date'] = pd.to_datetime(df['date'], errors='coerce')
346
+
347
+ print(f"\nโœ“ ์ด {len(df)}๊ฐœ ์ˆ˜์ง‘ ์™„๋ฃŒ")
348
+ return df
349
+
350
+ def save_local(self, df: pd.DataFrame):
351
+ """๋กœ์ปฌ์— ์ €์žฅ"""
352
+ os.makedirs(self.output_path, exist_ok=True)
353
+ timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
354
+
355
+ # CSV
356
+ csv_path = os.path.join(self.output_path, f"{self.party_name}_{timestamp}.csv")
357
+ df.to_csv(csv_path, index=False, encoding='utf-8-sig')
358
+
359
+ # Excel
360
+ xlsx_path = os.path.join(self.output_path, f"{self.party_name}_{timestamp}.xlsx")
361
+ df.to_excel(xlsx_path, index=False, engine='openpyxl')
362
+
363
+ print(f"โœ“ CSV: {csv_path}")
364
+ print(f"โœ“ Excel: {xlsx_path}")
365
+
366
+ def upload_to_huggingface(self, df: pd.DataFrame):
367
+ """ํ—ˆ๊น…ํŽ˜์ด์Šค์— ์—…๋กœ๋“œ"""
368
+ if not self.hf_token:
369
+ print("\nโš ๏ธ HF_TOKEN์ด ์„ค์ •๋˜์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค. .env ํŒŒ์ผ์„ ํ™•์ธํ•˜์„ธ์š”.")
370
+ return
371
+
372
+ print(f"\nโ–ถ ํ—ˆ๊น…ํŽ˜์ด์Šค ์—…๋กœ๋“œ ์ค‘... (repo: {self.hf_repo_id})")
373
+
374
+ try:
375
+ login(token=self.hf_token)
376
+ api = HfApi()
377
+
378
+ new_dataset = Dataset.from_pandas(df)
379
+
380
+ # ๊ธฐ์กด ๋ฐ์ดํ„ฐ์…‹ ํ™•์ธ ๋ฐ ๋ณ‘ํ•ฉ
381
+ try:
382
+ existing_dataset = load_dataset(self.hf_repo_id, split='train')
383
+ print(f" โ„น๏ธ ๊ธฐ์กด ๋ฐ์ดํ„ฐ: {len(existing_dataset)}๊ฐœ")
384
+
385
+ existing_df = existing_dataset.to_pandas()
386
+ combined_df = pd.concat([existing_df, df], ignore_index=True)
387
+ combined_df = combined_df.drop_duplicates(subset=['url'], keep='last')
388
+ combined_df = combined_df.sort_values('date', ascending=False).reset_index(drop=True)
389
+
390
+ final_dataset = Dataset.from_pandas(combined_df)
391
+ print(f" โœ“ ๋ณ‘ํ•ฉ ํ›„: {len(final_dataset)}๊ฐœ (์ค‘๋ณต ์ œ๊ฑฐ๋จ)")
392
+ except Exception:
393
+ print(f" โ„น๏ธ ์‹ ๊ทœ ๋ฐ์ดํ„ฐ์…‹ ์ƒ์„ฑ")
394
+ final_dataset = new_dataset
395
+
396
+ final_dataset.push_to_hub(self.hf_repo_id, token=self.hf_token)
397
+ print(f"โœ“ ํ—ˆ๊น…ํŽ˜์ด์Šค ์—…๋กœ๋“œ ์™„๋ฃŒ!")
398
+ print(f" ๐Ÿ”— https://huggingface.co/datasets/{self.hf_repo_id}")
399
+
400
+ except Exception as e:
401
+ print(f"โœ— ์—…๋กœ๋“œ ์‹คํŒจ: {e}")
402
+
403
+ async def run_incremental(self):
404
+ """์ฆ๋ถ„ ์—…๋ฐ์ดํŠธ ์‹คํ–‰"""
405
+ state = self.load_state()
406
+ last_date = state.get('last_crawl_date')
407
+
408
+ if last_date:
409
+ start_date = (datetime.strptime(last_date, '%Y-%m-%d') + timedelta(days=1)).strftime('%Y-%m-%d')
410
+ print(f"๐Ÿ“… ์ฆ๋ถ„ ์—…๋ฐ์ดํŠธ: {start_date} ์ดํ›„ ๋ฐ์ดํ„ฐ๋งŒ ์ˆ˜์ง‘")
411
+ else:
412
+ start_date = self.start_date
413
+ print(f"๐Ÿ“… ์ „์ฒด ์ˆ˜์ง‘: {start_date}๋ถ€ํ„ฐ")
414
+
415
+ end_date = datetime.now().strftime('%Y-%m-%d')
416
+
417
+ # ํฌ๋กค๋ง
418
+ df = await self.collect_all(start_date, end_date)
419
+
420
+ if df.empty:
421
+ print("โœ“ ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ ์—†์Œ")
422
+ return
423
+
424
+ # ๋กœ์ปฌ ์ €์žฅ
425
+ self.save_local(df)
426
+
427
+ # ํ—ˆ๊น…ํŽ˜์ด์Šค ์—…๋กœ๋“œ
428
+ self.upload_to_huggingface(df)
429
+
430
+ # ์ƒํƒœ ์ €์žฅ
431
+ state['last_crawl_date'] = end_date
432
+ state['last_crawl_time'] = datetime.now().isoformat()
433
+ state['last_count'] = len(df)
434
+ self.save_state(state)
435
+
436
+ print(f"\n{'='*60}")
437
+ print(f"โœ“ ์™„๋ฃŒ!")
438
+ print(f"{'='*60}\n")
439
+
440
+ async def main():
441
+ """๋ฉ”์ธ ํ•จ์ˆ˜"""
442
+ crawler = PPPAsyncCrawler()
443
+ await crawler.run_incremental()
444
+
445
+ if __name__ == "__main__":
446
+ asyncio.run(main())
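The merge step inside `upload_to_huggingface` above follows one pandas idiom: append the new crawl to the existing dataset, drop duplicates by `url` keeping the latest crawl, then sort newest-first. A minimal sketch with made-up rows (the URLs and titles here are illustrative, not real data):

```python
import pandas as pd

existing = pd.DataFrame({
    "url": ["https://a/1", "https://a/2"],
    "title": ["old-1", "old-2"],
    "date": pd.to_datetime(["2024-01-01", "2024-01-02"]),
})
new = pd.DataFrame({
    "url": ["https://a/2", "https://a/3"],  # /2 was re-crawled, /3 is brand new
    "title": ["new-2", "new-3"],
    "date": pd.to_datetime(["2024-01-02", "2024-01-03"]),
})

combined = pd.concat([existing, new], ignore_index=True)
# keep="last" means a re-crawled article replaces its older copy
combined = combined.drop_duplicates(subset=["url"], keep="last")
combined = combined.sort_values("date", ascending=False).reset_index(drop=True)

print(combined["title"].tolist())  # ['new-3', 'new-2', 'old-1']
```

Because `keep="last"` prefers the row from the newer crawl, the Hub dataset converges to one up-to-date row per press-release URL across repeated incremental runs.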
rebuilding_crawler_async.py ADDED
@@ -0,0 +1,376 @@
1
+ #!/usr/bin/env python3
2
+ # -*- coding: utf-8 -*-
3
+ """
4
+ ์กฐ๊ตญํ˜์‹ ๋‹น ํฌ๋กค๋Ÿฌ - ๊ณ ์„ฑ๋Šฅ ๋น„๋™๊ธฐ ๋ฒ„์ „ + ํ—ˆ๊น…ํŽ˜์ด์Šค ์ž๋™ ์—…๋กœ๋“œ
5
+ - ๊ธฐ์กด sync(requests) ๋ฐฉ์‹์„ async(aiohttp) ๋กœ ์ „ํ™˜
6
+ - ์ฆ๋ถ„ ์—…๋ฐ์ดํŠธ, ํ—ˆ๊น…ํŽ˜์ด์Šค ์ž๋™ ์—…๋กœ๋“œ
7
+ """
8
+
9
+ import os
10
+ import json
11
+ import re
12
+ import asyncio
13
+ from datetime import datetime, timedelta
14
+ from typing import List, Dict, Optional
15
+ import pandas as pd
16
+ from tqdm.asyncio import tqdm as async_tqdm
17
+ import aiohttp
18
+ from bs4 import BeautifulSoup
19
+ from dotenv import load_dotenv
20
+ from huggingface_hub import HfApi, login
21
+ from datasets import Dataset, load_dataset
22
+
23
+ load_dotenv()
24
+
25
+
26
+ class RebuildingAsyncCrawler:
27
+ def __init__(self, config_path="crawler_config.json"):
28
+ self.base_url = "https://rebuildingkoreaparty.kr"
29
+ self.party_name = "์กฐ๊ตญํ˜์‹ ๋‹น"
30
+ self.config_path = config_path
31
+ self.state_path = "crawler_state.json"
32
+
33
+ self.load_config()
34
+
35
+ self.hf_token = os.getenv("HF_TOKEN")
36
+ self.hf_repo_id = os.getenv("HF_REPO_ID_REBUILDING", "rebuilding-press-releases")
37
+
38
+ self.semaphore = asyncio.Semaphore(10)
39
+
40
+ def load_config(self):
41
+ default_config = {
42
+ "boards": {
43
+ "๊ธฐ์žํšŒ๊ฒฌ๋ฌธ": "news/press-conference",
44
+ "๋…ผํ‰๋ธŒ๋ฆฌํ•‘": "news/commentary-briefing",
45
+ "๋ณด๋„์ž๋ฃŒ": "news/press-release"
46
+ },
47
+ "start_date": "2024-03-04",
48
+ "max_pages": 10000,
49
+ "concurrent_requests": 10,
50
+ "request_delay": 0.5,
51
+ "output_path": "./data"
52
+ }
53
+
54
+ if os.path.exists(self.config_path):
55
+ with open(self.config_path, 'r', encoding='utf-8') as f:
56
+ config = json.load(f)
57
+ self.config = config.get('rebuilding', default_config)
58
+ else:
59
+ self.config = default_config
60
+
61
+ self.boards = self.config["boards"]
62
+ self.start_date = self.config["start_date"]
63
+ self.max_pages = self.config["max_pages"]
64
+ self.output_path = self.config["output_path"]
65
+
66
+ def load_state(self) -> Dict:
67
+ if os.path.exists(self.state_path):
68
+ with open(self.state_path, 'r', encoding='utf-8') as f:
69
+ state = json.load(f)
70
+ return state.get('rebuilding', {})
71
+ return {}
72
+
73
+ def save_state(self, state: Dict):
74
+ all_state = {}
75
+ if os.path.exists(self.state_path):
76
+ with open(self.state_path, 'r', encoding='utf-8') as f:
77
+ all_state = json.load(f)
78
+ all_state['rebuilding'] = state
79
+ with open(self.state_path, 'w', encoding='utf-8') as f:
80
+ json.dump(all_state, f, ensure_ascii=False, indent=2)
81
+
82
+ @staticmethod
83
+ def parse_date(date_str: str) -> Optional[datetime]:
84
+ try:
85
+ return datetime.strptime(date_str.strip(), '%Y-%m-%d')
86
+ except ValueError:
87
+ return None
88
+
89
+ @staticmethod
90
+ def clean_text(text: str) -> str:
91
+ text = text.replace('\xa0', '').replace('\u200b', '')  # NBSP, zero-width space
92
+ return text.strip()
93
+
94
+ async def fetch_with_retry(self, session: aiohttp.ClientSession, url: str,
95
+ max_retries: int = 3) -> Optional[str]:
96
+ async with self.semaphore:
97
+ for attempt in range(max_retries):
98
+ try:
99
+ await asyncio.sleep(self.config.get("request_delay", 0.5))
100
+ async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as response:
101
+ if response.status == 200:
102
+ return await response.text()
103
+ except Exception:
104
+ if attempt < max_retries - 1:
105
+ await asyncio.sleep(1)
106
+ else:
107
+ return None
108
+ return None
109
+
110
+ async def fetch_list_page(self, session: aiohttp.ClientSession,
111
+ board_name: str, board_path: str, page_num: int,
112
+ start_date: datetime, end_date: datetime) -> tuple:
113
+ if page_num == 1:
114
+ url = f"{self.base_url}/{board_path}"
115
+ else:
116
+ url = f"{self.base_url}/{board_path}?page={page_num}"
117
+
118
+ html = await self.fetch_with_retry(session, url)
119
+ if not html:
120
+ return [], False
121
+
122
+ soup = BeautifulSoup(html, 'html.parser')
123
+
124
+ # <a href="/news/{board_path}/..."> ํŒจํ„ด์œผ๋กœ ๊ฒŒ์‹œ๊ธ€ ๋งํฌ ํƒ์ƒ‰
125
+ article_links = soup.find_all('a', href=re.compile(f'^/news/{re.escape(board_path)}/'))
126
+ if not article_links:
127
+ return [], True
128
+
129
+ data = []
130
+ stop_flag = False
131
+ seen_urls = set()
132
+
133
+ for link in article_links:
134
+ try:
135
+ article_url = link.get('href', '')
136
+ if article_url.startswith('/'):
137
+ article_url = self.base_url + article_url
138
+ if article_url in seen_urls:
139
+ continue
140
+ seen_urls.add(article_url)
141
+
142
+ title = link.get_text(strip=True).replace('\n', ' ')
143
+
144
+ # ๊ฐ™์€ <ul> ์•ˆ์—์„œ ๋‚ ์งœยท์นดํ…Œ๊ณ ๋ฆฌ ์ถ”์ถœ
145
+ parent = link.find_parent('ul')
146
+ if not parent:
147
+ parent_li = link.find_parent('li')
148
+ if parent_li:
149
+ parent = parent_li.find_parent('ul')
150
+
151
+ date_str = ""
152
+ category = ""
153
+ if parent:
154
+ date_li = parent.find('li', {'class': 'td date'})
155
+ if date_li:
156
+ date_str = date_li.get_text(strip=True)
157
+ cate_li = parent.find('li', {'class': 'td category'})
158
+ if cate_li:
159
+ category = cate_li.get_text(strip=True)
160
+
161
+ if not date_str:
162
+ continue
163
+
164
+ article_date = self.parse_date(date_str)
165
+ if not article_date:
166
+ continue
167
+ if article_date < start_date:
168
+ stop_flag = True
169
+ break
170
+ if article_date > end_date:
171
+ continue
172
+
173
+ data.append({
174
+ 'board_name': board_name,
175
+ 'category': category,
176
+ 'title': title,
177
+ 'date': date_str,
178
+ 'url': article_url
179
+ })
180
+ except Exception:
181
+ continue
182
+
183
+ return data, stop_flag
184
+
185
+ async def fetch_article_detail(self, session: aiohttp.ClientSession, url: str) -> Dict:
186
+ html = await self.fetch_with_retry(session, url)
187
+ if not html:
188
+ return {'text': "๋ณธ๋ฌธ ์กฐํšŒ ์‹คํŒจ", 'writer': ""}
189
+
190
+ soup = BeautifulSoup(html, 'html.parser')
191
+ text_parts = []
192
+ writer = ""
193
+
194
+ # ๋ณธ๋ฌธ: <div class="editor ck-content"> ์•ˆ์˜ <p> ํƒœ๊ทธ
195
+ contents_div = soup.find('div', {'class': 'editor ck-content'})
196
+ if contents_div:
197
+ paragraphs = contents_div.find_all('p')
198
+ for p in paragraphs:
199
+ cleaned = self.clean_text(p.get_text(strip=True))
200
+ if cleaned:
201
+ text_parts.append(cleaned)
202
+
203
+ # ์ž‘์„ฑ์ž: ๋์ชฝ <p> ์—์„œ ๋‹น๋ช…/๋Œ€๋ณ€์ธ ํฌํ•จ ํ…์ŠคํŠธ
204
+ for p in reversed(paragraphs):
205
+ cleaned = self.clean_text(p.get_text(strip=True))
206
+ if '์กฐ๊ตญํ˜์‹ ๋‹น' in cleaned or '๋Œ€๋ณ€์ธ' in cleaned or '์œ„์›ํšŒ' in cleaned:
207
+ writer = cleaned
208
+ break
209
+
210
+ return {'text': '\n'.join(text_parts), 'writer': writer}
211
+
212
+ async def collect_board(self, board_name: str, board_path: str,
213
+ start_date: str, end_date: str) -> List[Dict]:
214
+ start_dt = datetime.strptime(start_date, '%Y-%m-%d')
215
+ end_dt = datetime.strptime(end_date, '%Y-%m-%d')
216
+
217
+ print(f"\nโ–ถ [{board_name}] ๋ชฉ๋ก ์ˆ˜์ง‘ ์‹œ์ž‘...")
218
+
219
+ headers = {
220
+ 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
221
+ 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
222
+ 'Accept-Language': 'ko-KR,ko;q=0.9',
223
+ }
224
+
225
+ async with aiohttp.ClientSession(headers=headers) as session:
226
+ all_items = []
227
+ page_num = 1
228
+ empty_pages = 0
229
+ max_empty_pages = 3
230
+
231
+ with async_tqdm(desc=f"[{board_name}] ๋ชฉ๋ก", unit="ํŽ˜์ด์ง€") as pbar:
232
+ while page_num <= self.max_pages:
233
+ items, stop_flag = await self.fetch_list_page(
234
+ session, board_name, board_path, page_num, start_dt, end_dt
235
+ )
236
+
237
+ if not items:
238
+ empty_pages += 1
239
+ if empty_pages >= max_empty_pages or stop_flag:
240
+ break
241
+ else:
242
+ empty_pages = 0
243
+ all_items.extend(items)
244
+
245
+ pbar.update(1)
246
+ pbar.set_postfix({"์ˆ˜์ง‘": len(all_items)})
247
+
248
+ if stop_flag:
249
+ break
250
+
251
+ page_num += 1
252
+
253
+ print(f" โœ“ {len(all_items)}๊ฐœ ํ•ญ๋ชฉ ๋ฐœ๊ฒฌ")
254
+
255
+ if all_items:
256
+ print(f" โ–ถ ์ƒ์„ธ ํŽ˜์ด์ง€ ์ˆ˜์ง‘ ์ค‘...")
257
+ tasks = [self.fetch_article_detail(session, item['url']) for item in all_items]
258
+
259
+ details = []
260
+ for coro in async_tqdm(asyncio.as_completed(tasks),
261
+ total=len(tasks),
262
+ desc=f"[{board_name}] ์ƒ์„ธ"):
263
+ detail = await coro
264
+ details.append(detail)
265
+
266
+ for item, detail in zip(all_items, details):
267
+ item.update(detail)
268
+
269
+ print(f"โœ“ [{board_name}] ์™„๋ฃŒ: {len(all_items)}๊ฐœ")
270
+ return all_items
271
+
272
+ async def collect_all(self, start_date: Optional[str] = None,
273
+ end_date: Optional[str] = None) -> pd.DataFrame:
274
+ if not end_date:
275
+ end_date = datetime.now().strftime('%Y-%m-%d')
276
+ if not start_date:
277
+ start_date = self.start_date
278
+
279
+ print(f"\n{'='*60}")
280
+ print(f"์กฐ๊ตญํ˜์‹ ๋‹น ๋ณด๋„์ž๋ฃŒ ์ˆ˜์ง‘ - ๋น„๋™๊ธฐ ๊ณ ์„ฑ๋Šฅ ๋ฒ„์ „")
281
+ print(f"๊ธฐ๊ฐ„: {start_date} ~ {end_date}")
282
+ print(f"{'='*60}")
283
+
284
+ tasks = [
285
+ self.collect_board(board_name, board_path, start_date, end_date)
286
+ for board_name, board_path in self.boards.items()
287
+ ]
288
+ results = await asyncio.gather(*tasks)
289
+
290
+ all_data = []
291
+ for items in results:
292
+ all_data.extend(items)
293
+
294
+ if not all_data:
295
+ print("\nโš ๏ธ ์ˆ˜์ง‘๋œ ๋ฐ์ดํ„ฐ ์—†์Œ")
296
+ return pd.DataFrame()
297
+
298
+ df = pd.DataFrame(all_data)
299
+ df = df[['board_name', 'title', 'category', 'date', 'writer', 'text', 'url']]
300
+ df = df[(df['title'] != "") & (df['text'] != "")]
301
+ df['date'] = pd.to_datetime(df['date'], errors='coerce')
302
+
303
+ print(f"\nโœ“ ์ด {len(df)}๊ฐœ ์ˆ˜์ง‘ ์™„๋ฃŒ")
304
+ return df
305
+
306
+ def save_local(self, df: pd.DataFrame):
307
+ os.makedirs(self.output_path, exist_ok=True)
308
+ timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
309
+ csv_path = os.path.join(self.output_path, f"{self.party_name}_{timestamp}.csv")
310
+ xlsx_path = os.path.join(self.output_path, f"{self.party_name}_{timestamp}.xlsx")
311
+ df.to_csv(csv_path, index=False, encoding='utf-8-sig')
312
+ df.to_excel(xlsx_path, index=False, engine='openpyxl')
313
+ print(f"โœ“ CSV: {csv_path}")
314
+ print(f"โœ“ Excel: {xlsx_path}")
315
+
316
+ def upload_to_huggingface(self, df: pd.DataFrame):
317
+ if not self.hf_token:
318
+ print("\nโš ๏ธ HF_TOKEN์ด ์„ค์ •๋˜์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค.")
319
+ return
320
+
321
+ print(f"\nโ–ถ ํ—ˆ๊น…ํŽ˜์ด์Šค ์—…๋กœ๋“œ ์ค‘... (repo: {self.hf_repo_id})")
322
+ try:
323
+ login(token=self.hf_token)
324
+ new_dataset = Dataset.from_pandas(df)
325
+ try:
326
+ existing_dataset = load_dataset(self.hf_repo_id, split='train')
327
+ existing_df = existing_dataset.to_pandas()
328
+ combined_df = pd.concat([existing_df, df], ignore_index=True)
329
+ combined_df = combined_df.drop_duplicates(subset=['url'], keep='last')
330
+ combined_df = combined_df.sort_values('date', ascending=False).reset_index(drop=True)
331
+ final_dataset = Dataset.from_pandas(combined_df)
332
+ print(f" โœ“ ๋ณ‘ํ•ฉ ํ›„: {len(final_dataset)}๊ฐœ")
333
+ except Exception:
334
+ final_dataset = new_dataset
335
+ print(f" โ„น๏ธ ์‹ ๊ทœ ๋ฐ์ดํ„ฐ์…‹ ์ƒ์„ฑ")
336
+ final_dataset.push_to_hub(self.hf_repo_id, token=self.hf_token)
337
+ print(f"โœ“ ํ—ˆ๊น…ํŽ˜์ด์Šค ์—…๋กœ๋“œ ์™„๋ฃŒ!")
338
+ except Exception as e:
339
+ print(f"โœ— ์—…๋กœ๋“œ ์‹คํŒจ: {e}")
340
+
341
+ async def run_incremental(self):
342
+ state = self.load_state()
343
+ last_date = state.get('last_crawl_date')
344
+
345
+ if last_date:
346
+ start_date = (datetime.strptime(last_date, '%Y-%m-%d') + timedelta(days=1)).strftime('%Y-%m-%d')
347
+ print(f"๐Ÿ“… ์ฆ๋ถ„ ์—…๋ฐ์ดํŠธ: {start_date} ์ดํ›„ ๋ฐ์ดํ„ฐ๋งŒ ์ˆ˜์ง‘")
348
+ else:
349
+ start_date = self.start_date
350
+ print(f"๐Ÿ“… ์ „์ฒด ์ˆ˜์ง‘: {start_date}๋ถ€ํ„ฐ")
351
+
352
+ end_date = datetime.now().strftime('%Y-%m-%d')
353
+ df = await self.collect_all(start_date, end_date)
354
+
355
+ if df.empty:
356
+ print("โœ“ ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ ์—†์Œ")
357
+ return
358
+
359
+ self.save_local(df)
360
+ self.upload_to_huggingface(df)
361
+
362
+ state['last_crawl_date'] = end_date
363
+ state['last_crawl_time'] = datetime.now().isoformat()
364
+ state['last_count'] = len(df)
365
+ self.save_state(state)
366
+
367
+ print(f"\n{'='*60}\nโœ“ ์™„๋ฃŒ!\n{'='*60}\n")
368
+
369
+
370
+ async def main():
371
+ crawler = RebuildingAsyncCrawler()
372
+ await crawler.run_incremental()
373
+
374
+
375
+ if __name__ == "__main__":
376
+ asyncio.run(main())
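The `run_incremental` methods in these crawlers all compute their crawl window the same way: resume from the day after the recorded `last_crawl_date`, or fall back to the configured start date on the first run. A standalone sketch of that logic (the helper name `next_window` is hypothetical, pulled out for illustration):

```python
from datetime import datetime, timedelta

def next_window(state: dict, default_start: str, today: str) -> tuple:
    """Return (start_date, end_date) for an incremental crawl."""
    last = state.get("last_crawl_date")
    if last:
        # resume from the day after the last successful crawl
        start = (datetime.strptime(last, "%Y-%m-%d")
                 + timedelta(days=1)).strftime("%Y-%m-%d")
    else:
        # first run: crawl the full configured range
        start = default_start
    return start, today

print(next_window({"last_crawl_date": "2025-01-31"}, "2024-03-04", "2025-02-10"))
# ('2025-02-01', '2025-02-10')
print(next_window({}, "2024-03-04", "2025-02-10"))
# ('2024-03-04', '2025-02-10')
```

Since the state file is only updated after a successful upload, a failed run leaves `last_crawl_date` untouched and the same window is retried on the next scheduled execution.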
reform_crawler_async.py ADDED
@@ -0,0 +1,358 @@
1
+ #!/usr/bin/env python3
2
+ # -*- coding: utf-8 -*-
3
+ """
4
+ ๊ฐœํ˜์‹ ๋‹น ํฌ๋กค๋Ÿฌ - ๊ณ ์„ฑ๋Šฅ ๋น„๋™๊ธฐ ๋ฒ„์ „ + ํ—ˆ๊น…ํŽ˜์ด์Šค ์ž๋™ ์—…๋กœ๋“œ
5
+ - ๊ทธ๋ˆ„๋ณด๋“œ 5 ๊ธฐ๋ฐ˜ ์‚ฌ์ดํŠธ (reformparty.kr)
6
+ - td.td_subject / td.td_datetime / div#bo_v_con ๊ตฌ์กฐ
7
+ """
8
+
9
+ import os
10
+ import json
11
+ import re
12
+ import asyncio
13
+ from datetime import datetime, timedelta
14
+ from typing import List, Dict, Optional
15
+ import pandas as pd
16
+ from tqdm.asyncio import tqdm as async_tqdm
17
+ import aiohttp
18
+ from bs4 import BeautifulSoup
19
+ from dotenv import load_dotenv
20
+ from huggingface_hub import HfApi, login
21
+ from datasets import Dataset, load_dataset
22
+
23
+ load_dotenv()
24
+
25
+
26
+ class ReformAsyncCrawler:
27
+ def __init__(self, config_path="crawler_config.json"):
28
+ self.base_url = "https://www.reformparty.kr"
29
+ self.party_name = "๊ฐœํ˜์‹ ๋‹น"
30
+ self.config_path = config_path
31
+ self.state_path = "crawler_state.json"
32
+
33
+ self.load_config()
34
+
35
+ self.hf_token = os.getenv("HF_TOKEN")
36
+ self.hf_repo_id = os.getenv("HF_REPO_ID_REFORM", "reform-press-releases")
37
+
38
+ self.semaphore = asyncio.Semaphore(10)
39
+
40
+ def load_config(self):
41
+ default_config = {
42
+ "boards": {
43
+ "๋ณด๋„์ž๋ฃŒ": "press",
44
+ "๋…ผํ‰๋ธŒ๋ฆฌํ•‘": "briefing"
45
+ },
46
+ "start_date": "2024-02-13",
47
+ "max_pages": 10000,
48
+ "concurrent_requests": 10,
49
+ "request_delay": 0.3,
50
+ "output_path": "./data"
51
+ }
52
+
53
+ if os.path.exists(self.config_path):
54
+ with open(self.config_path, 'r', encoding='utf-8') as f:
55
+ config = json.load(f)
56
+ self.config = config.get('reform', default_config)
57
+ else:
58
+ self.config = default_config
59
+
60
+ self.boards = self.config["boards"]
61
+ self.start_date = self.config["start_date"]
62
+ self.max_pages = self.config["max_pages"]
63
+ self.output_path = self.config["output_path"]
64
+
65
+ def load_state(self) -> Dict:
66
+ if os.path.exists(self.state_path):
67
+ with open(self.state_path, 'r', encoding='utf-8') as f:
68
+ state = json.load(f)
69
+ return state.get('reform', {})
70
+ return {}
71
+
72
+ def save_state(self, state: Dict):
73
+ all_state = {}
74
+ if os.path.exists(self.state_path):
75
+ with open(self.state_path, 'r', encoding='utf-8') as f:
76
+ all_state = json.load(f)
77
+ all_state['reform'] = state
78
+ with open(self.state_path, 'w', encoding='utf-8') as f:
79
+ json.dump(all_state, f, ensure_ascii=False, indent=2)
80
+
81
+ @staticmethod
82
+ def parse_date(date_str: str) -> Optional[datetime]:
83
+ """YYYY-MM-DD HH:MM:SS ๋˜๋Š” YYYY-MM-DD ํŒŒ์‹ฑ"""
84
+ try:
85
+ return datetime.strptime(date_str.strip()[:10], '%Y-%m-%d')
86
+ except ValueError:
87
+ return None
88
+
89
+ @staticmethod
90
+ def clean_text(text: str) -> str:
91
+ text = text.replace('\xa0', '').replace('\u200b', '')  # NBSP, zero-width space
92
+ return text.strip()
93
+
94
+ async def fetch_with_retry(self, session: aiohttp.ClientSession, url: str,
95
+ max_retries: int = 3) -> Optional[str]:
96
+ async with self.semaphore:
97
+ for attempt in range(max_retries):
98
+ try:
99
+ await asyncio.sleep(self.config.get("request_delay", 0.3))
100
+ async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as response:
101
+ if response.status == 200:
102
+ return await response.text()
103
+ except Exception:
104
+ if attempt < max_retries - 1:
105
+ await asyncio.sleep(1)
106
+ else:
107
+ return None
108
+ return None
109
+
110
+ async def fetch_list_page(self, session: aiohttp.ClientSession,
111
+ board_name: str, board_slug: str, page_num: int,
112
+ start_date: datetime, end_date: datetime) -> tuple:
113
+ url = f"{self.base_url}/{board_slug}?page={page_num}"
114
+
115
+ html = await self.fetch_with_retry(session, url)
116
+ if not html:
117
+ return [], False
118
+
119
+ soup = BeautifulSoup(html, 'html.parser')
120
+ rows = soup.select('table tbody tr')
121
+ if not rows:
122
+ return [], True
123
+
124
+ data = []
125
+ stop_flag = False
126
+
127
+ for row in rows:
128
+ try:
129
+ # ์ œ๋ชฉยทURL: td.td_subject div.bo_tit a
130
+ title_a = row.select_one('td.td_subject div.bo_tit a')
131
+ if not title_a:
132
+ continue
133
+
134
+ title = title_a.get_text(strip=True)
135
+ href = title_a.get('href', '')
136
+ # page ํŒŒ๋ผ๋ฏธํ„ฐ ์ œ๊ฑฐ ํ›„ ์ ˆ๋Œ€ URL
137
+ article_url = self.base_url + re.sub(r'\?.*$', '', href)
138
+
139
+ # ๏ฟฝ๏ฟฝ์งœ: td.td_datetime (YYYY-MM-DD HH:MM:SS)
140
+ date_td = row.select_one('td.td_datetime')
141
+ if not date_td:
142
+ continue
143
+ date_str = date_td.get_text(strip=True)[:10]
144
+
145
+ # ์นดํ…Œ๊ณ ๋ฆฌ: td.td_cate a.bo_cate
146
+ cate_a = row.select_one('td.td_cate a.bo_cate')
147
+ category = cate_a.get_text(strip=True) if cate_a else ""
148
+
149
+ article_date = self.parse_date(date_str)
150
+ if not article_date:
151
+ continue
152
+ if article_date < start_date:
153
+ stop_flag = True
154
+ break
155
+ if article_date > end_date:
156
+ continue
157
+
158
+ data.append({
159
+ 'board_name': board_name,
160
+ 'title': title,
161
+ 'category': category,
162
+ 'date': date_str,
163
+ 'url': article_url
164
+ })
165
+ except:
166
+ continue
167
+
168
+ return data, stop_flag
169
+
170
+ async def fetch_article_detail(self, session: aiohttp.ClientSession, url: str) -> Dict:
171
+ html = await self.fetch_with_retry(session, url)
172
+ if not html:
173
+ return {'text': "๋ณธ๋ฌธ ์กฐํšŒ ์‹คํŒจ", 'writer': ""}
174
+
175
+ soup = BeautifulSoup(html, 'html.parser')
176
+ text_parts = []
177
+ writer = ""
178
+
179
+ # ๋ณธ๋ฌธ: div#bo_v_con
180
+ contents_div = soup.find('div', id='bo_v_con')
181
+ if contents_div:
182
+ for p in contents_div.find_all('p'):
183
+ cleaned = self.clean_text(p.get_text(strip=True))
184
+ if cleaned:
185
+ text_parts.append(cleaned)
186
+
187
+ # ์ž‘์„ฑ์ž: p.name span.content span.sv_member
188
+ writer_el = soup.select_one('p.name span.content span.sv_member')
189
+ if writer_el:
190
+ writer = writer_el.get_text(strip=True)
191
+
192
+ return {'text': '\n'.join(text_parts), 'writer': writer}
193
+
194
+ async def collect_board(self, board_name: str, board_slug: str,
195
+ start_date: str, end_date: str) -> List[Dict]:
196
+ start_dt = datetime.strptime(start_date, '%Y-%m-%d')
197
+ end_dt = datetime.strptime(end_date, '%Y-%m-%d')
198
+
199
+ print(f"\nโ–ถ [{board_name}] ๋ชฉ๋ก ์ˆ˜์ง‘ ์‹œ์ž‘...")
200
+
201
+ headers = {
202
+ 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
203
+ 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
204
+ 'Accept-Language': 'ko-KR,ko;q=0.9',
205
+ }
206
+
207
+ async with aiohttp.ClientSession(headers=headers) as session:
208
+ all_items = []
209
+ page_num = 1
210
+ empty_pages = 0
211
+ max_empty_pages = 3
212
+
213
+ with async_tqdm(desc=f"[{board_name}] ๋ชฉ๋ก", unit="ํŽ˜์ด์ง€") as pbar:
214
+ while page_num <= self.max_pages:
215
+ items, stop_flag = await self.fetch_list_page(
216
+ session, board_name, board_slug, page_num, start_dt, end_dt
217
+ )
218
+
219
+ if not items:
220
+ empty_pages += 1
221
+ if empty_pages >= max_empty_pages or stop_flag:
222
+ break
223
+ else:
224
+ empty_pages = 0
225
+ all_items.extend(items)
226
+
227
+ pbar.update(1)
228
+ pbar.set_postfix({"์ˆ˜์ง‘": len(all_items)})
229
+
230
+ if stop_flag:
231
+ break
232
+
233
+ page_num += 1
234
+
235
+ print(f" โœ“ {len(all_items)}๊ฐœ ํ•ญ๋ชฉ ๋ฐœ๊ฒฌ")
236
+
237
+ if all_items:
238
+ print(f" โ–ถ ์ƒ์„ธ ํŽ˜์ด์ง€ ์ˆ˜์ง‘ ์ค‘...")
239
+ tasks = [self.fetch_article_detail(session, item['url']) for item in all_items]
240
+
241
+ details = []
242
+ for coro in async_tqdm(asyncio.as_completed(tasks),
243
+ total=len(tasks),
244
+ desc=f"[{board_name}] ์ƒ์„ธ"):
245
+ detail = await coro
246
+ details.append(detail)
247
+
248
+ for item, detail in zip(all_items, details):
249
+ item.update(detail)
250
+
251
+ print(f"โœ“ [{board_name}] ์™„๋ฃŒ: {len(all_items)}๊ฐœ")
252
+ return all_items
253
+
254
+ async def collect_all(self, start_date: Optional[str] = None,
255
+ end_date: Optional[str] = None) -> pd.DataFrame:
256
+ if not end_date:
257
+ end_date = datetime.now().strftime('%Y-%m-%d')
258
+ if not start_date:
259
+ start_date = self.start_date
260
+
261
+ print(f"\n{'='*60}")
262
+ print(f"๊ฐœํ˜์‹ ๋‹น ๋ณด๋„์ž๋ฃŒ ์ˆ˜์ง‘ - ๋น„๋™๊ธฐ ๊ณ ์„ฑ๋Šฅ ๋ฒ„์ „")
263
+ print(f"๊ธฐ๊ฐ„: {start_date} ~ {end_date}")
264
+ print(f"{'='*60}")
265
+
266
+ tasks = [
267
+ self.collect_board(board_name, board_slug, start_date, end_date)
268
+ for board_name, board_slug in self.boards.items()
269
+ ]
270
+ results = await asyncio.gather(*tasks)
271
+
272
+ all_data = []
273
+ for items in results:
274
+ all_data.extend(items)
275
+
276
+ if not all_data:
277
+ print("\nโš ๏ธ ์ˆ˜์ง‘๋œ ๋ฐ์ดํ„ฐ ์—†์Œ")
278
+ return pd.DataFrame()
279
+
280
+ df = pd.DataFrame(all_data)
281
+ df = df[['board_name', 'title', 'category', 'date', 'writer', 'text', 'url']]
282
+ df = df[(df['title'] != "") & (df['text'] != "")]
283
+ df['date'] = pd.to_datetime(df['date'], errors='coerce')
284
+
285
+ print(f"\nโœ“ ์ด {len(df)}๊ฐœ ์ˆ˜์ง‘ ์™„๋ฃŒ")
286
+ return df
287
+
288
+ def save_local(self, df: pd.DataFrame):
289
+ os.makedirs(self.output_path, exist_ok=True)
290
+ timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
291
+ csv_path = os.path.join(self.output_path, f"{self.party_name}_{timestamp}.csv")
292
+ xlsx_path = os.path.join(self.output_path, f"{self.party_name}_{timestamp}.xlsx")
293
+ df.to_csv(csv_path, index=False, encoding='utf-8-sig')
294
+ df.to_excel(xlsx_path, index=False, engine='openpyxl')
295
+ print(f"โœ“ CSV: {csv_path}")
296
+ print(f"โœ“ Excel: {xlsx_path}")
297
+
298
+ def upload_to_huggingface(self, df: pd.DataFrame):
299
+ if not self.hf_token:
300
+ print("\nโš ๏ธ HF_TOKEN์ด ์„ค์ •๋˜์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค.")
301
+ return
302
+
303
+ print(f"\nโ–ถ ํ—ˆ๊น…ํŽ˜์ด์Šค ์—…๋กœ๋“œ ์ค‘... (repo: {self.hf_repo_id})")
304
+ try:
305
+ login(token=self.hf_token)
306
+ new_dataset = Dataset.from_pandas(df)
307
+ try:
308
+ existing_dataset = load_dataset(self.hf_repo_id, split='train')
309
+ existing_df = existing_dataset.to_pandas()
310
+ combined_df = pd.concat([existing_df, df], ignore_index=True)
311
+ combined_df = combined_df.drop_duplicates(subset=['url'], keep='last')
312
+ combined_df = combined_df.sort_values('date', ascending=False).reset_index(drop=True)
313
+ final_dataset = Dataset.from_pandas(combined_df)
314
+ print(f" โœ“ ๋ณ‘ํ•ฉ ํ›„: {len(final_dataset)}๊ฐœ")
315
+ except:
316
+ final_dataset = new_dataset
317
+ print(f" โ„น๏ธ ์‹ ๊ทœ ๋ฐ์ดํ„ฐ์…‹ ์ƒ์„ฑ")
318
+ final_dataset.push_to_hub(self.hf_repo_id, token=self.hf_token)
319
+ print(f"โœ“ ํ—ˆ๊น…ํŽ˜์ด์Šค ์—…๋กœ๋“œ ์™„๋ฃŒ!")
320
+ except Exception as e:
321
+ print(f"โœ— ์—…๋กœ๋“œ ์‹คํŒจ: {e}")
322
+
323
+ async def run_incremental(self):
324
+ state = self.load_state()
325
+ last_date = state.get('last_crawl_date')
326
+
327
+ if last_date:
328
+ start_date = (datetime.strptime(last_date, '%Y-%m-%d') + timedelta(days=1)).strftime('%Y-%m-%d')
329
+ print(f"๐Ÿ“… ์ฆ๋ถ„ ์—…๋ฐ์ดํŠธ: {start_date} ์ดํ›„ ๋ฐ์ดํ„ฐ๋งŒ ์ˆ˜์ง‘")
330
+ else:
331
+ start_date = self.start_date
332
+ print(f"๐Ÿ“… ์ „์ฒด ์ˆ˜์ง‘: {start_date}๋ถ€ํ„ฐ")
333
+
334
+ end_date = datetime.now().strftime('%Y-%m-%d')
335
+ df = await self.collect_all(start_date, end_date)
336
+
337
+ if df.empty:
338
+ print("โœ“ ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ ์—†์Œ")
339
+ return
340
+
341
+ self.save_local(df)
342
+ self.upload_to_huggingface(df)
343
+
344
+ state['last_crawl_date'] = end_date
345
+ state['last_crawl_time'] = datetime.now().isoformat()
346
+ state['last_count'] = len(df)
347
+ self.save_state(state)
348
+
349
+ print(f"\n{'='*60}\nโœ“ ์™„๋ฃŒ!\n{'='*60}\n")
350
+
351
+
352
+ async def main():
353
+ crawler = ReformAsyncCrawler()
354
+ await crawler.run_incremental()
355
+
356
+
357
+ if __name__ == "__main__":
358
+ asyncio.run(main())
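Two of the parsing helpers above are easy to sanity-check in isolation. The sketch below mirrors `parse_date` and the list-page URL normalization; `BASE_URL` and `normalize_article_url` are hypothetical stand-ins for illustration, not names from the crawler:

```python
import re
from datetime import datetime

# Hypothetical base URL standing in for the crawler's self.base_url
BASE_URL = "https://www.example-party.kr"

def parse_date(date_str):
    # Same approach as ReformAsyncCrawler.parse_date: keep only the
    # YYYY-MM-DD prefix so both "2024-06-01" and "2024-06-01 10:30:00" parse
    try:
        return datetime.strptime(date_str.strip()[:10], '%Y-%m-%d')
    except (ValueError, AttributeError):
        return None

def normalize_article_url(href):
    # Same approach as the list-page parser: drop the query string
    # (e.g. ?page=3) and prepend the base URL
    return BASE_URL + re.sub(r'\?.*$', '', href)

print(parse_date("2024-06-01 10:30:00").date())   # 2024-06-01
print(parse_date("not a date"))                   # None
print(normalize_article_url("/press/123?page=4"))
# https://www.example-party.kr/press/123
```

Truncating to the first ten characters before `strptime` is what lets one parser handle both date formats the board emits.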
requirements.txt ADDED
@@ -0,0 +1,21 @@
+ # Web crawling
+ aiohttp==3.9.1
+ beautifulsoup4==4.12.2
+ lxml==5.1.0
+
+ # Data processing
+ pandas==2.1.4
+ openpyxl==3.1.2
+
+ # Hugging Face
+ huggingface-hub==0.20.2
+ datasets==2.16.1
+
+ # Scheduling
+ APScheduler==3.10.4
+
+ # Environment variables
+ python-dotenv==1.0.0
+
+ # Progress bars
+ tqdm==4.66.1
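Exact pins like these can drift from what is actually installed; the stdlib `importlib.metadata` module can verify them. This is an illustrative sketch only — the `PINS` subset and the `check_pins` helper are hypothetical, not part of the repo:

```python
from importlib import metadata

# A few of the pins from requirements.txt (illustrative subset)
PINS = {"aiohttp": "3.9.1", "pandas": "2.1.4", "tqdm": "4.66.1"}

def check_pins(pins):
    # Compare installed package versions against the pinned ones
    report = {}
    for pkg, wanted in pins.items():
        try:
            installed = metadata.version(pkg)
            report[pkg] = "ok" if installed == wanted else f"installed {installed}"
        except metadata.PackageNotFoundError:
            report[pkg] = "missing"
    return report

for pkg, status in check_pins(PINS).items():
    print(f"{pkg}=={PINS[pkg]}: {status}")
```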
run_once.bat ADDED
@@ -0,0 +1,14 @@
+ @echo off
+ chcp 65001 > nul
+ echo ============================================
+ echo Democratic Party crawler - single run
+ echo ============================================
+ echo.
+
+ python minjoo_crawler_async.py
+
+ echo.
+ echo ============================================
+ echo Done!
+ echo ============================================
+ pause
run_ppp.bat ADDED
@@ -0,0 +1,14 @@
+ @echo off
+ chcp 65001 > nul
+ echo ============================================
+ echo People Power Party crawler - single run
+ echo ============================================
+ echo.
+
+ python ppp_crawler_async.py
+
+ echo.
+ echo ============================================
+ echo Done!
+ echo ============================================
+ pause
run_scheduler.bat ADDED
@@ -0,0 +1,16 @@
+ @echo off
+ chcp 65001 > nul
+ echo ============================================
+ echo Democratic Party crawler - scheduled run
+ echo ============================================
+ echo.
+ echo Runs the crawl automatically every day at 9 AM.
+ echo Press Ctrl+C to stop.
+ echo.
+ echo Log file: crawler_scheduler.log
+ echo ============================================
+ echo.
+
+ python scheduler.py
+
+ pause
run_unified.bat ADDED
@@ -0,0 +1,15 @@
+ @echo off
+ chcp 65001 > nul
+ echo ============================================
+ echo Unified party crawler - single run
+ echo (Democratic Party, People Power Party, Rebuilding Korea Party, Reform Party, Basic Income Party, Progressive Party)
+ echo ============================================
+ echo.
+
+ python unified_crawler.py
+
+ echo.
+ echo ============================================
+ echo Done!
+ echo ============================================
+ pause
run_unified_scheduler.bat ADDED
@@ -0,0 +1,17 @@
+ @echo off
+ chcp 65001 > nul
+ echo ============================================
+ echo Unified party crawler - scheduled run
+ echo (Democratic Party, People Power Party, Rebuilding Korea Party, Reform Party, Basic Income Party, Progressive Party)
+ echo ============================================
+ echo.
+ echo Runs the crawl automatically every day at 9 AM.
+ echo Press Ctrl+C to stop.
+ echo.
+ echo Log file: unified_scheduler.log
+ echo ============================================
+ echo.
+
+ python unified_scheduler.py
+
+ pause
scheduler.py ADDED
@@ -0,0 +1,71 @@
+ #!/usr/bin/env python3
+ # -*- coding: utf-8 -*-
+ """
+ Democratic Party crawler scheduler
+ - Runs automatically every day at a set time
+ - Supports background execution
+ - Writes a log file
+ """
+
+ import asyncio
+ import logging
+ from datetime import datetime
+ from apscheduler.schedulers.asyncio import AsyncIOScheduler
+ from apscheduler.triggers.cron import CronTrigger
+ from minjoo_crawler_async import MinjooAsyncCrawler
+
+ # Logging setup
+ logging.basicConfig(
+     level=logging.INFO,
+     format='%(asctime)s [%(levelname)s] %(message)s',
+     handlers=[
+         logging.FileHandler('crawler_scheduler.log', encoding='utf-8'),
+         logging.StreamHandler()
+     ]
+ )
+
+ logger = logging.getLogger(__name__)
+
+ async def scheduled_task():
+     """The scheduled job"""
+     logger.info("="*60)
+     logger.info("Scheduled crawl starting")
+     logger.info("="*60)
+
+     try:
+         crawler = MinjooAsyncCrawler()
+         await crawler.run_incremental()
+         logger.info("Crawl finished")
+     except Exception as e:
+         logger.error(f"Crawl failed: {e}", exc_info=True)
+
+ def main():
+     """Scheduler entry point"""
+     scheduler = AsyncIOScheduler()
+
+     # Run every day at 9 AM
+     scheduler.add_job(
+         scheduled_task,
+         trigger=CronTrigger(hour=9, minute=0),
+         id='daily_crawl',
+         name='Democratic Party crawler daily run',
+         replace_existing=True
+     )
+
+     # Run once immediately (for testing)
+     # scheduler.add_job(scheduled_task, 'date', run_date=datetime.now())
+
+     logger.info("Scheduler started")
+     logger.info("Crawl runs every day at 9 AM")
+     logger.info("Press Ctrl+C to stop")
+
+     scheduler.start()
+
+     try:
+         # Keep the event loop running
+         asyncio.get_event_loop().run_forever()
+     except (KeyboardInterrupt, SystemExit):
+         logger.info("Scheduler stopped")
+
+ if __name__ == "__main__":
+     main()
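`CronTrigger(hour=9, minute=0)` fires once per day at 09:00. The next-fire-time behaviour it implements can be approximated with a small stdlib sketch; `next_daily_run` is a hypothetical helper for illustration, not part of APScheduler or this repo:

```python
from datetime import datetime, timedelta

def next_daily_run(now, hour=9, minute=0):
    # Next occurrence of hour:minute strictly after `now`,
    # approximating CronTrigger(hour=9, minute=0)
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate

print(next_daily_run(datetime(2024, 6, 1, 10, 0)))  # 2024-06-02 09:00:00
print(next_daily_run(datetime(2024, 6, 1, 8, 0)))   # 2024-06-01 09:00:00
```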
setup.bat ADDED
@@ -0,0 +1,49 @@
+ @echo off
+ chcp 65001 > nul
+ echo ============================================
+ echo Democratic Party crawler setup
+ echo ============================================
+ echo.
+
+ echo [1/3] Checking Python version...
+ python --version
+ if errorlevel 1 (
+     echo ❌ Python is not installed.
+     echo Install Python from https://www.python.org/downloads/
+     pause
+     exit /b 1
+ )
+ echo ✓ Python found
+ echo.
+
+ echo [2/3] Installing dependencies...
+ pip install -r requirements.txt
+ if errorlevel 1 (
+     echo ❌ Failed to install dependencies
+     pause
+     exit /b 1
+ )
+ echo ✓ Dependencies installed
+ echo.
+
+ echo [3/3] Setting up environment variables...
+ if not exist .env (
+     copy .env.example .env
+     echo ✓ .env file created.
+     echo ⚠️ Open the .env file and set HF_TOKEN!
+     echo You can create a token at https://huggingface.co/settings/tokens
+ ) else (
+     echo ℹ️ .env file already exists.
+ )
+ echo.
+
+ echo ============================================
+ echo ✓ Setup complete!
+ echo ============================================
+ echo.
+ echo Next steps:
+ echo 1. Open the .env file and set HF_TOKEN
+ echo 2. Run run_once.bat for a single run
+ echo 3. Run run_scheduler.bat for daily automatic runs
+ echo.
+ pause
unified_crawler.py ADDED
@@ -0,0 +1,83 @@
+ #!/usr/bin/env python3
+ # -*- coding: utf-8 -*-
+ """
+ Unified party crawler
+ - Crawls the Democratic Party, People Power Party, Rebuilding Korea Party, Reform Party, Basic Income Party, and Progressive Party concurrently
+ - Uploads each party to its own Hugging Face dataset
+ - Asynchronous parallel processing
+
+ ※ Use main.py if you need CLI arguments.
+ """
+
+ import asyncio
+ import logging
+ from datetime import datetime
+
+ from minjoo_crawler_async import MinjooAsyncCrawler
+ from ppp_crawler_async import PPPAsyncCrawler
+ from rebuilding_crawler_async import RebuildingAsyncCrawler
+ from reform_crawler_async import ReformAsyncCrawler
+ from basic_income_crawler_async import BasicIncomeAsyncCrawler
+ from jinbo_crawler_async import JinboAsyncCrawler
+
+ logging.basicConfig(
+     level=logging.INFO,
+     format='%(asctime)s [%(levelname)s] %(message)s',
+     handlers=[
+         logging.FileHandler('unified_crawler.log', encoding='utf-8'),
+         logging.StreamHandler()
+     ]
+ )
+ logger = logging.getLogger(__name__)
+
+ CRAWLERS = {
+     '더불어민주당': MinjooAsyncCrawler,
+     '국민의힘': PPPAsyncCrawler,
+     '조국혁신당': RebuildingAsyncCrawler,
+     '개혁신당': ReformAsyncCrawler,
+     '기본소득당': BasicIncomeAsyncCrawler,
+     '진보당': JinboAsyncCrawler,
+ }
+
+
+ async def crawl_all_parties():
+     """Crawl all six parties concurrently"""
+     logger.info("=" * 60)
+     logger.info("Unified party crawler starting")
+     logger.info(" + ".join(CRAWLERS.keys()))
+     logger.info("=" * 60)
+
+     start_time = datetime.now()
+
+     crawlers = [cls() for cls in CRAWLERS.values()]
+     party_names = list(CRAWLERS.keys())
+
+     results = await asyncio.gather(
+         *[crawler.run_incremental() for crawler in crawlers],
+         return_exceptions=True
+     )
+
+     for party, result in zip(party_names, results):
+         if isinstance(result, Exception):
+             logger.error(f"{party} crawl failed: {result}")
+         else:
+             logger.info(f"{party} crawl finished")
+
+     duration = (datetime.now() - start_time).total_seconds()
+     logger.info("=" * 60)
+     logger.info("All crawls finished")
+     logger.info(f"Elapsed: {duration:.1f} s ({duration / 60:.1f} min)")
+     logger.info("=" * 60)
+
+
+ # Kept for backward compatibility with the earlier two-party version
+ async def crawl_both_parties():
+     await crawl_all_parties()
+
+
+ async def main():
+     await crawl_all_parties()
+
+
+ if __name__ == "__main__":
+     asyncio.run(main())
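The `return_exceptions=True` flag in `crawl_all_parties` is what keeps one failing party from aborting the other five: failures come back as exception objects in the results list instead of propagating. A minimal self-contained sketch of the same pattern (the coroutine and party names here are made up):

```python
import asyncio

async def ok_crawl():
    return "done"

async def failing_crawl():
    raise RuntimeError("site unreachable")

async def run_all():
    # With return_exceptions=True, the failing coroutine does not cancel
    # the other one; its exception is placed in the results list instead
    return await asyncio.gather(ok_crawl(), failing_crawl(), return_exceptions=True)

results = asyncio.run(run_all())
for name, result in zip(["party_a", "party_b"], results):
    if isinstance(result, Exception):
        print(f"{name} failed: {result}")
    else:
        print(f"{name} finished: {result}")
```

Without the flag, `asyncio.gather` would re-raise the `RuntimeError` and the per-party success/failure loop in `crawl_all_parties` would never run.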
unified_scheduler.py ADDED
@@ -0,0 +1,60 @@
+ #!/usr/bin/env python3
+ # -*- coding: utf-8 -*-
+ """
+ Unified party crawler scheduler
+ - Crawls the Democratic Party, People Power Party, Rebuilding Korea Party, Reform Party, Basic Income Party, and Progressive Party automatically every day
+ """
+
+ import asyncio
+ import logging
+ from apscheduler.schedulers.asyncio import AsyncIOScheduler
+ from apscheduler.triggers.cron import CronTrigger
+ from unified_crawler import crawl_all_parties
+
+ logging.basicConfig(
+     level=logging.INFO,
+     format='%(asctime)s [%(levelname)s] %(message)s',
+     handlers=[
+         logging.FileHandler('unified_scheduler.log', encoding='utf-8'),
+         logging.StreamHandler()
+     ]
+ )
+ logger = logging.getLogger(__name__)
+
+
+ async def scheduled_task():
+     logger.info("=" * 60)
+     logger.info("Scheduled crawl starting (6 parties)")
+     logger.info("=" * 60)
+     try:
+         await crawl_all_parties()
+         logger.info("Scheduled crawl finished")
+     except Exception as e:
+         logger.error(f"Crawl failed: {e}", exc_info=True)
+
+
+ def main():
+     scheduler = AsyncIOScheduler()
+
+     scheduler.add_job(
+         scheduled_task,
+         trigger=CronTrigger(hour=9, minute=0),
+         id='daily_crawl_all',
+         name='Unified party crawler daily run',
+         replace_existing=True
+     )
+
+     logger.info("Unified party crawler scheduler started")
+     logger.info("All 6 parties are crawled every day at 9 AM")
+     logger.info("Press Ctrl+C to stop")
+
+     scheduler.start()
+
+     try:
+         asyncio.get_event_loop().run_forever()
+     except (KeyboardInterrupt, SystemExit):
+         logger.info("Scheduler stopped")
+
+
+ if __name__ == "__main__":
+     main()