hanjunlee committed on
Commit 3a36548 · verified · 1 Parent(s): 6c1a68f

Upload 23 files
.env.example ADDED
@@ -0,0 +1,15 @@
+ # Hugging Face settings
+ # Create a token at https://huggingface.co/settings/tokens
+ HF_TOKEN=your_huggingface_token_here
+
+ # Democratic Party (minjoo) dataset repository
+ HF_REPO_ID=your_username/minjoo-press-releases
+
+ # People Power Party (ppp) dataset repository
+ HF_REPO_ID_PPP=your_username/ppp-press-releases
+
+ # Usage:
+ # 1. Copy this file to .env
+ # 2. Put your actual token in HF_TOKEN
+ # 3. Change HF_REPO_ID to your desired dataset name
+ # 4. Change HF_REPO_ID_PPP to the People Power Party dataset name
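The per-party repository variables follow a fixed naming pattern; the remaining parties use `HF_REPO_ID_REBUILDING`, `HF_REPO_ID_REFORM`, `HF_REPO_ID_BASIC_INCOME`, and `HF_REPO_ID_JINBO`, as listed in README.md. A minimal sketch of how a script might resolve the repository for a party code (`resolve_repo_id` is a hypothetical helper; the fallback names are illustrative defaults, mirroring the pattern in the crawlers):

```python
import os

# Map each party code to its dataset-repo environment variable.
# Variable names match this repo's crawlers; fallback repo names
# are illustrative defaults, not real repositories.
REPO_ENV_VARS = {
    "minjoo": "HF_REPO_ID",
    "ppp": "HF_REPO_ID_PPP",
    "rebuilding": "HF_REPO_ID_REBUILDING",
    "reform": "HF_REPO_ID_REFORM",
    "basic_income": "HF_REPO_ID_BASIC_INCOME",
    "jinbo": "HF_REPO_ID_JINBO",
}

def resolve_repo_id(party: str) -> str:
    """Return the configured repo for a party, or an illustrative default."""
    env_var = REPO_ENV_VARS[party]
    default = f"your_username/{party.replace('_', '-')}-press-releases"
    return os.getenv(env_var, default)
```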
QUICKSTART.md ADDED
@@ -0,0 +1,94 @@
+ # Quick Start Guide
+
+ ## Get started in 5 minutes
+
+ ### Step 1: Installation (1 min)
+ ```bash
+ setup.bat
+ ```
+
+ ### Step 2: Hugging Face token setup (2 min)
+
+ 1. Go to https://huggingface.co/settings/tokens
+ 2. "New token" → name: `party-crawler` → permission: **Write** → create and copy
+
+ 3. Open the `.env` file in a text editor and enter:
+ ```
+ HF_TOKEN=paste_the_copied_token_here
+
+ HF_REPO_ID=your_username/minjoo-press-releases
+ HF_REPO_ID_PPP=your_username/ppp-press-releases
+ HF_REPO_ID_REBUILDING=your_username/rebuilding-press-releases
+ HF_REPO_ID_REFORM=your_username/reform-press-releases
+ HF_REPO_ID_BASIC_INCOME=your_username/basic-income-press-releases
+ HF_REPO_ID_JINBO=your_username/jinbo-press-releases
+ ```
+
+ > **Important**: replace `your_username` with your actual Hugging Face username!
+
+ ### Step 3: Run (2 min)
+
+ #### Collect all parties at once (recommended)
+ ```bash
+ python main.py
+ ```
+
+ #### Collect a specific party only
+ ```bash
+ python main.py --party minjoo          # Democratic Party of Korea
+ python main.py --party ppp             # People Power Party
+ python main.py --party rebuilding      # Rebuilding Korea Party
+ python main.py --party reform          # New Reform Party
+ python main.py --party basic_income    # Basic Income Party
+ python main.py --party jinbo           # Progressive Party
+ ```
+
+ #### Specify a date range
+ ```bash
+ python main.py --start-date 2024-01-01
+ python main.py --party reform --start-date 2024-01-01 --end-date 2024-06-30
+ ```
+
+ ## Done!
+
+ Where the data is stored:
+ - **Local**: the `./data/` folder (CSV, Excel)
+ - **Hugging Face**: uploaded automatically to each party's repository
+
+ ## Option summary
+
+ | Command | Description |
+ |--------|------|
+ | `python main.py` | Incremental update for all 6 parties |
+ | `python main.py --party [code]` | A specific party only |
+ | `python main.py --start-date YYYY-MM-DD` | Set the start date |
+ | `python unified_scheduler.py` | Automatic daily run (scheduler) |
+
+ ## Party codes
+
+ | Code | Party |
+ |------|------|
+ | `minjoo` | Democratic Party of Korea |
+ | `ppp` | People Power Party |
+ | `rebuilding` | Rebuilding Korea Party |
+ | `reform` | New Reform Party |
+ | `basic_income` | Basic Income Party |
+ | `jinbo` | Progressive Party |
+ | `all` | All parties (default) |
+
+ ## Troubleshooting
+
+ | Problem | Fix |
+ |------|------|
+ | "HF_TOKEN is not set" | Check `HF_TOKEN` in the `.env` file |
+ | "Module not found" | Run `setup.bat` again |
+ | Crawling is slow | Increase `concurrent_requests` in `crawler_config.json` (mind the server-load caution in README.md) |
+ | Only one party fails | Run it individually with `python main.py --party [code]` to investigate |
+
+ ## Help
+
+ ```bash
+ python main.py --help
+ ```
+
+ Full documentation: `README.md`
QUICKSTART_UNIFIED.md ADDED
@@ -0,0 +1,4 @@
+ # Quick Start Guide
+
+ > **This file has been merged into QUICKSTART.md.**
+ > See [QUICKSTART.md](QUICKSTART.md) for the latest guide.
README.md ADDED
@@ -0,0 +1,209 @@
+ # Political Party Press Release Crawler
+
+ A crawler that automatically collects press releases, commentaries/briefings, and opening remarks from the websites of 6 Korean political parties and uploads them to Hugging Face.
+
+ **Supported parties**: Democratic Party of Korea, People Power Party, Rebuilding Korea Party, New Reform Party, Basic Income Party, Progressive Party
+
+ ## Key features
+
+ - **Asynchronous processing (asyncio + aiohttp)**: 10-20x faster than the previous synchronous version
+ - **Parallel crawling of all 6 parties**: runs them simultaneously to save time
+ - **Incremental updates**: collects only data published since the last crawl
+ - **Automatic Hugging Face upload**: auto-merges into each party's own repository
+
+ ## Installation
+
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ Or, on Windows:
+ ```bash
+ setup.bat
+ ```
+
+ ## Environment variables
+
+ Create a `.env` file with the following content:
+
+ ```
+ HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
+
+ # Hugging Face dataset repository for each party
+ HF_REPO_ID=your_username/minjoo-press-releases
+ HF_REPO_ID_PPP=your_username/ppp-press-releases
+ HF_REPO_ID_REBUILDING=your_username/rebuilding-press-releases
+ HF_REPO_ID_REFORM=your_username/reform-press-releases
+ HF_REPO_ID_BASIC_INCOME=your_username/basic-income-press-releases
+ HF_REPO_ID_JINBO=your_username/jinbo-press-releases
+ ```
+
+ ## Usage
+
+ ### main.py - unified entry point (recommended)
+
+ ```bash
+ # Incremental update for all parties (default)
+ python main.py
+
+ # A specific party only
+ python main.py --party minjoo          # Democratic Party of Korea
+ python main.py --party ppp             # People Power Party
+ python main.py --party rebuilding      # Rebuilding Korea Party
+ python main.py --party reform          # New Reform Party
+ python main.py --party basic_income    # Basic Income Party
+ python main.py --party jinbo           # Progressive Party
+
+ # Date range
+ python main.py --start-date 2024-01-01
+ python main.py --party reform --start-date 2024-01-01 --end-date 2024-06-30
+
+ # Help
+ python main.py --help
+ ```
+
+ ### Running individual crawlers directly
+
+ ```bash
+ python minjoo_crawler_async.py
+ python ppp_crawler_async.py
+ python rebuilding_crawler_async.py
+ python reform_crawler_async.py
+ python basic_income_crawler_async.py
+ python jinbo_crawler_async.py
+ ```
+
+ ### Automatic daily runs (scheduler)
+
+ ```bash
+ python unified_scheduler.py    # runs everything automatically at 9 AM every day
+ ```
+
+ ### Windows batch files
+
+ | File | Description |
+ |------|------|
+ | `run_unified.bat` | Crawl everything at once (single run) |
+ | `run_unified_scheduler.bat` | Automatic daily crawl of everything |
+ | `run_once.bat` | Democratic Party only |
+ | `run_ppp.bat` | People Power Party only |
+
+ ## Collected data
+
+ | Party | Boards | Collection start date |
+ |------|--------|------------|
+ | Democratic Party of Korea | Press releases, commentary/briefings, opening remarks | 2003-11-11 |
+ | People Power Party | Spokesperson commentary & press releases, floor press releases, media committee | 2000-03-10 |
+ | Rebuilding Korea Party | Press conference statements, commentary/briefings, press releases | 2024-03-04 |
+ | New Reform Party | Press releases, commentary/briefings | 2024-02-13 |
+ | Basic Income Party | Commentary & press releases (commentary/remarks/press releases) | 2020-01-08 |
+ | Progressive Party | Press releases, commentary, opening remarks | 2017-10-14 |
+
+ ## Configuration (crawler_config.json)
+
+ Each party can be configured independently:
+
+ ```json
+ {
+   "minjoo": { ... },
+   "ppp": { ... },
+   "rebuilding": { ... },
+   "reform": { ... },
+   "basic_income": { ... },
+   "jinbo": { ... }
+ }
+ ```
+
+ | Setting | Description |
+ |------|------|
+ | `boards` | Boards to collect |
+ | `start_date` | Initial crawl start date |
+ | `max_pages` | Maximum number of pages |
+ | `concurrent_requests` | Concurrent requests (mind server load) |
+ | `request_delay` | Delay between requests (seconds) |
+ | `output_path` | Local output path |
+
+ ## File structure
+
+ ```
+ 정당크롤러/
+ ├── main.py                         # unified entry point (CLI arguments)
+ ├── unified_crawler.py              # unified crawler for all 6 parties
+ ├── unified_scheduler.py            # unified scheduler
+ ├── minjoo_crawler_async.py         # Democratic Party of Korea
+ ├── ppp_crawler_async.py            # People Power Party
+ ├── rebuilding_crawler_async.py     # Rebuilding Korea Party
+ ├── reform_crawler_async.py         # New Reform Party
+ ├── basic_income_crawler_async.py   # Basic Income Party
+ ├── jinbo_crawler_async.py          # Progressive Party
+ ├── scheduler.py                    # Democratic-Party-only scheduler (legacy)
+ ├── crawler_config.json             # crawl settings (all 6 parties)
+ ├── crawler_state.json              # crawl state (auto-generated)
+ ├── requirements.txt                # Python dependencies
+ └── .env                            # environment variables (create yourself)
+ ```
+
+ ## Data columns (common)
+
+ | Column | Description |
+ |------|------|
+ | `board_name` | Board name |
+ | `title` | Title |
+ | `category` | Category/classification |
+ | `date` | Publication date |
+ | `writer` | Author |
+ | `text` | Body text |
+ | `url` | Source URL |
+
+ > **Note**: People Power Party data includes extra `section` and `no` columns instead of `category`
+
+ ## Performance
+
+ | Item | Async version | Previous sync version |
+ |------|------------|--------------|
+ | One party (1000 articles) | ~5 min | ~80 min |
+ | All 6 parties at once | ~5-10 min | ~480 min |
+
+ ## How incremental updates work
+
+ 1. **First run**: collect everything from `start_date` to today
+ 2. **Subsequent runs**: collect only from the day after the last crawl date
+ 3. **Hugging Face merge**: auto-merge with the existing dataset, with URL-based deduplication
+ 4. **State management**: recorded per party in `crawler_state.json`
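The date-window and merge steps above can be sketched as follows. This is a minimal illustration; `next_start_date` and `merge_dedup` are hypothetical helper names, and the actual logic lives inside each crawler's `run_incremental` and `upload_to_huggingface` methods:

```python
from datetime import datetime, timedelta
import pandas as pd

def next_start_date(last_crawl_date, first_run_default="2020-01-08"):
    """First run: crawl from the configured start date.
    Later runs: resume the day after the last recorded crawl date."""
    if last_crawl_date is None:
        return first_run_default
    resumed = datetime.strptime(last_crawl_date, "%Y-%m-%d") + timedelta(days=1)
    return resumed.strftime("%Y-%m-%d")

def merge_dedup(existing: pd.DataFrame, new: pd.DataFrame) -> pd.DataFrame:
    """Merge new rows into the existing dataset, keeping the newest
    copy of each article (the URL is the unique key)."""
    combined = pd.concat([existing, new], ignore_index=True)
    return combined.drop_duplicates(subset=["url"], keep="last").reset_index(drop=True)
```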
+
+ ## Troubleshooting
+
+ | Problem | Fix |
+ |------|----------|
+ | `HF_TOKEN is not set` | Check `HF_TOKEN` in the `.env` file |
+ | Crawling is slow | Increase `concurrent_requests` in `crawler_config.json` |
+ | Server connection errors | Increase `request_delay` in `crawler_config.json` |
+ | Only one party fails | Run it individually with `python main.py --party [code]` to investigate |
+
+ ## Checking the logs
+
+ ```bash
+ type main.log                  # main.py run log
+ type unified_crawler.log       # unified crawler log
+ type unified_scheduler.log     # scheduler log
+ ```
+
+ ## Running in the background on Windows
+
+ ```bash
+ # Batch file
+ start /B python main.py > main.log 2>&1
+
+ # Or Windows Task Scheduler
+ # Trigger: daily at 9 AM → Action: python unified_scheduler.py
+ ```
+
+ ## Caveats
+
+ 1. Keep `concurrent_requests` at or below 10-20 (to minimize server load)
+ 2. Check each website's robots.txt before collecting
+ 3. Before publishing the data, check it for personal information and cite the source
+
+ ## License
+
+ MIT License
README_UNIFIED.md ADDED
@@ -0,0 +1,4 @@
+ # Unified Party Crawler
+
+ > **This file has been merged into README.md.**
+ > See [README.md](README.md) for the latest documentation.
basic_income_crawler_async.py ADDED
@@ -0,0 +1,370 @@
+ #!/usr/bin/env python3
+ # -*- coding: utf-8 -*-
+ """
+ Basic Income Party crawler - high-performance async version with automatic Hugging Face upload
+ - Gnuboard 5 based site (basicincomeparty.kr)
+ - td.td_subject / td.td_datetime (YY.MM.DD.) / div#bo_v_con structure
+ """
+
+ import os
+ import json
+ import re
+ import asyncio
+ from datetime import datetime, timedelta
+ from typing import List, Dict, Optional
+ import pandas as pd
+ from tqdm.asyncio import tqdm as async_tqdm
+ import aiohttp
+ from bs4 import BeautifulSoup
+ from dotenv import load_dotenv
+ from huggingface_hub import login
+ from datasets import Dataset, load_dataset
+
+ load_dotenv()
+
+
+ class BasicIncomeAsyncCrawler:
+     def __init__(self, config_path="crawler_config.json"):
+         self.base_url = "https://basicincomeparty.kr"
+         self.party_name = "기본소득당"
+         self.config_path = config_path
+         self.state_path = "crawler_state.json"
+
+         self.load_config()
+
+         self.hf_token = os.getenv("HF_TOKEN")
+         self.hf_repo_id = os.getenv("HF_REPO_ID_BASIC_INCOME", "basic-income-press-releases")
+
+         # Cap concurrency at the configured number of simultaneous requests
+         self.semaphore = asyncio.Semaphore(self.config.get("concurrent_requests", 10))
+
+     def load_config(self):
+         default_config = {
+             "boards": {
+                 "논평보도자료": "bikr/press"
+             },
+             "start_date": "2020-01-08",
+             "max_pages": 10000,
+             "concurrent_requests": 10,
+             "request_delay": 0.3,
+             "output_path": "./data"
+         }
+
+         if os.path.exists(self.config_path):
+             with open(self.config_path, 'r', encoding='utf-8') as f:
+                 config = json.load(f)
+             self.config = config.get('basic_income', default_config)
+         else:
+             self.config = default_config
+
+         self.boards = self.config["boards"]
+         self.start_date = self.config["start_date"]
+         self.max_pages = self.config["max_pages"]
+         self.output_path = self.config["output_path"]
+
+     def load_state(self) -> Dict:
+         if os.path.exists(self.state_path):
+             with open(self.state_path, 'r', encoding='utf-8') as f:
+                 state = json.load(f)
+             return state.get('basic_income', {})
+         return {}
+
+     def save_state(self, state: Dict):
+         all_state = {}
+         if os.path.exists(self.state_path):
+             with open(self.state_path, 'r', encoding='utf-8') as f:
+                 all_state = json.load(f)
+         all_state['basic_income'] = state
+         with open(self.state_path, 'w', encoding='utf-8') as f:
+             json.dump(all_state, f, ensure_ascii=False, indent=2)
+
+     @staticmethod
+     def parse_date(date_str: str) -> Optional[datetime]:
+         """Parse YY.MM.DD., YYYY.MM.DD., or YYYY-MM-DD."""
+         date_str = date_str.strip().rstrip('.')
+         try:
+             parts = date_str.split('.')
+             if len(parts) >= 3:
+                 year = int(parts[0])
+                 year = 2000 + year if year < 100 else year
+                 return datetime(year, int(parts[1]), int(parts[2]))
+         except (ValueError, IndexError):
+             pass
+         try:
+             return datetime.strptime(date_str[:10], '%Y-%m-%d')
+         except ValueError:
+             return None
+
+     @staticmethod
+     def clean_text(text: str) -> str:
+         # Strip non-breaking and zero-width spaces
+         text = text.replace('\xa0', '').replace('\u200b', '')
+         return text.strip()
+
+     async def fetch_with_retry(self, session: aiohttp.ClientSession, url: str,
+                                max_retries: int = 3) -> Optional[str]:
+         async with self.semaphore:
+             for attempt in range(max_retries):
+                 try:
+                     await asyncio.sleep(self.config.get("request_delay", 0.3))
+                     async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as response:
+                         if response.status == 200:
+                             return await response.text()
+                 except Exception:
+                     if attempt < max_retries - 1:
+                         await asyncio.sleep(1)
+                     else:
+                         return None
+             return None
+
+     async def fetch_list_page(self, session: aiohttp.ClientSession,
+                               board_name: str, board_path: str, page_num: int,
+                               start_date: datetime, end_date: datetime) -> tuple:
+         url = f"{self.base_url}/{board_path}?page={page_num}"
+
+         html = await self.fetch_with_retry(session, url)
+         if not html:
+             return [], False
+
+         soup = BeautifulSoup(html, 'html.parser')
+         rows = soup.select('table tbody tr')
+         if not rows:
+             return [], True
+
+         data = []
+         stop_flag = False
+
+         for row in rows:
+             try:
+                 # Title and URL: td.td_subject div.bo_tit a
+                 title_a = row.select_one('td.td_subject div.bo_tit a')
+                 if not title_a:
+                     continue
+
+                 title = title_a.get_text(strip=True)
+                 href = title_a.get('href', '')
+                 # Drop the query string (page parameter), then make the URL absolute
+                 article_url = re.sub(r'\?.*$', '', href)
+                 if not article_url.startswith('http'):
+                     article_url = self.base_url + article_url
+
+                 # Date: td.td_datetime (YY.MM.DD. format)
+                 date_td = row.select_one('td.td_datetime')
+                 if not date_td:
+                     continue
+                 date_str = date_td.get_text(strip=True)
+
+                 # Category: td.td_num2 a.bo_cate_link
+                 cate_a = row.select_one('td.td_num2 a.bo_cate_link')
+                 category = cate_a.get_text(strip=True) if cate_a else ""
+
+                 article_date = self.parse_date(date_str)
+                 if not article_date:
+                     continue
+                 if article_date < start_date:
+                     stop_flag = True
+                     break
+                 if article_date > end_date:
+                     continue
+
+                 data.append({
+                     'board_name': board_name,
+                     'title': title,
+                     'category': category,
+                     'date': article_date.strftime('%Y-%m-%d'),  # normalized to YYYY-MM-DD
+                     'url': article_url
+                 })
+             except Exception:
+                 continue
+
+         return data, stop_flag
+
+     async def fetch_article_detail(self, session: aiohttp.ClientSession, url: str) -> Dict:
+         html = await self.fetch_with_retry(session, url)
+         if not html:
+             return {'text': "failed to fetch body", 'writer': ""}
+
+         soup = BeautifulSoup(html, 'html.parser')
+         text_parts = []
+         writer = ""
+
+         # Body: div#bo_v_con
+         contents_div = soup.find('div', id='bo_v_con')
+         if contents_div:
+             for p in contents_div.find_all('p'):
+                 cleaned = self.clean_text(p.get_text(strip=True))
+                 if cleaned:
+                     text_parts.append(cleaned)
+
+         # Author: span.sv_member inside section#bo_v_info div.profile_info_ct
+         info_div = soup.select_one('section#bo_v_info div.profile_info_ct')
+         if info_div:
+             writer_el = info_div.find('span', class_='sv_member')
+             if writer_el:
+                 writer = writer_el.get_text(strip=True)
+
+         return {'text': '\n'.join(text_parts), 'writer': writer}
+
+     async def collect_board(self, board_name: str, board_path: str,
+                             start_date: str, end_date: str) -> List[Dict]:
+         start_dt = datetime.strptime(start_date, '%Y-%m-%d')
+         end_dt = datetime.strptime(end_date, '%Y-%m-%d')
+
+         print(f"\n▶ [{board_name}] collecting list pages...")
+
+         headers = {
+             'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
+             'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
+             'Accept-Language': 'ko-KR,ko;q=0.9',
+         }
+
+         async with aiohttp.ClientSession(headers=headers) as session:
+             all_items = []
+             page_num = 1
+             empty_pages = 0
+             max_empty_pages = 3
+
+             with async_tqdm(desc=f"[{board_name}] list", unit="page") as pbar:
+                 while page_num <= self.max_pages:
+                     items, stop_flag = await self.fetch_list_page(
+                         session, board_name, board_path, page_num, start_dt, end_dt
+                     )
+
+                     if not items:
+                         empty_pages += 1
+                         if empty_pages >= max_empty_pages or stop_flag:
+                             break
+                     else:
+                         empty_pages = 0
+                         all_items.extend(items)
+
+                     pbar.update(1)
+                     pbar.set_postfix({"collected": len(all_items)})
+
+                     if stop_flag:
+                         break
+
+                     page_num += 1
+
+             print(f"  ✓ {len(all_items)} items found")
+
+             if all_items:
+                 print(f"  ▶ fetching detail pages...")
+                 tasks = [self.fetch_article_detail(session, item['url']) for item in all_items]
+
+                 # tqdm.gather preserves task order, so each detail is
+                 # matched to the right list item (as_completed would not)
+                 details = await async_tqdm.gather(*tasks, desc=f"[{board_name}] details")
+
+                 for item, detail in zip(all_items, details):
+                     item.update(detail)
+
+             print(f"✓ [{board_name}] done: {len(all_items)} items")
+             return all_items
+
+     async def collect_all(self, start_date: Optional[str] = None,
+                           end_date: Optional[str] = None) -> pd.DataFrame:
+         if not end_date:
+             end_date = datetime.now().strftime('%Y-%m-%d')
+         if not start_date:
+             start_date = self.start_date
+
+         print(f"\n{'='*60}")
+         print(f"Basic Income Party press release collection - high-performance async version")
+         print(f"Period: {start_date} ~ {end_date}")
+         print(f"{'='*60}")
+
+         tasks = [
+             self.collect_board(board_name, board_path, start_date, end_date)
+             for board_name, board_path in self.boards.items()
+         ]
+         results = await asyncio.gather(*tasks)
+
+         all_data = []
+         for items in results:
+             all_data.extend(items)
+
+         if not all_data:
+             print("\n⚠️ No data collected")
+             return pd.DataFrame()
+
+         df = pd.DataFrame(all_data)
+         df = df[['board_name', 'title', 'category', 'date', 'writer', 'text', 'url']]
+         df = df[(df['title'] != "") & (df['text'] != "")]
+         df['date'] = pd.to_datetime(df['date'], errors='coerce')
+
+         print(f"\n✓ collected {len(df)} items in total")
+         return df
+
+     def save_local(self, df: pd.DataFrame):
+         os.makedirs(self.output_path, exist_ok=True)
+         timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
+         csv_path = os.path.join(self.output_path, f"{self.party_name}_{timestamp}.csv")
+         xlsx_path = os.path.join(self.output_path, f"{self.party_name}_{timestamp}.xlsx")
+         df.to_csv(csv_path, index=False, encoding='utf-8-sig')
+         df.to_excel(xlsx_path, index=False, engine='openpyxl')
+         print(f"✓ CSV: {csv_path}")
+         print(f"✓ Excel: {xlsx_path}")
+
+     def upload_to_huggingface(self, df: pd.DataFrame):
+         if not self.hf_token:
+             print("\n⚠️ HF_TOKEN is not set.")
+             return
+
+         print(f"\n▶ uploading to Hugging Face... (repo: {self.hf_repo_id})")
+         try:
+             login(token=self.hf_token)
+             new_dataset = Dataset.from_pandas(df)
+             try:
+                 existing_dataset = load_dataset(self.hf_repo_id, split='train')
+                 existing_df = existing_dataset.to_pandas()
+                 combined_df = pd.concat([existing_df, df], ignore_index=True)
+                 combined_df = combined_df.drop_duplicates(subset=['url'], keep='last')
+                 combined_df = combined_df.sort_values('date', ascending=False).reset_index(drop=True)
+                 final_dataset = Dataset.from_pandas(combined_df)
+                 print(f"  ✓ after merge: {len(final_dataset)} items")
+             except Exception:
+                 final_dataset = new_dataset
+                 print(f"  ℹ️ creating a new dataset")
+             final_dataset.push_to_hub(self.hf_repo_id, token=self.hf_token)
+             print(f"✓ Hugging Face upload complete!")
+         except Exception as e:
+             print(f"✗ upload failed: {e}")
+
+     async def run_incremental(self):
+         state = self.load_state()
+         last_date = state.get('last_crawl_date')
+
+         if last_date:
+             start_date = (datetime.strptime(last_date, '%Y-%m-%d') + timedelta(days=1)).strftime('%Y-%m-%d')
+             print(f"📅 incremental update: collecting data from {start_date} onward")
+         else:
+             start_date = self.start_date
+             print(f"📅 full collection: from {start_date}")
+
+         end_date = datetime.now().strftime('%Y-%m-%d')
+         df = await self.collect_all(start_date, end_date)
+
+         if df.empty:
+             print("✓ no new data")
+             return
+
+         self.save_local(df)
+         self.upload_to_huggingface(df)
+
+         state['last_crawl_date'] = end_date
+         state['last_crawl_time'] = datetime.now().isoformat()
+         state['last_count'] = len(df)
+         self.save_state(state)
+
+         print(f"\n{'='*60}\n✓ done!\n{'='*60}\n")
+
+
+ async def main():
+     crawler = BasicIncomeAsyncCrawler()
+     await crawler.run_incremental()
+
+
+ if __name__ == "__main__":
+     asyncio.run(main())
crawler_config.json ADDED
@@ -0,0 +1,71 @@
+ {
+   "minjoo": {
+     "boards": {
+       "보도자료": "188",
+       "논평_브리핑": "11",
+       "모두발언": "230"
+     },
+     "start_date": "2003-11-11",
+     "max_pages": 10000,
+     "concurrent_requests": 20,
+     "request_delay": 0.1,
+     "output_path": "./data"
+   },
+   "ppp": {
+     "boards": {
+       "대변인_논평보도자료": "BBSDD0001",
+       "원내_보도자료": "BBSDD0002",
+       "미디어특위_보도자료": "BBSDD0042"
+     },
+     "start_date": "2000-03-10",
+     "max_pages": 10000,
+     "concurrent_requests": 20,
+     "request_delay": 0.1,
+     "output_path": "./data"
+   },
+   "rebuilding": {
+     "boards": {
+       "기자회견문": "news/press-conference",
+       "논평브리핑": "news/commentary-briefing",
+       "보도자료": "news/press-release"
+     },
+     "start_date": "2024-03-04",
+     "max_pages": 10000,
+     "concurrent_requests": 10,
+     "request_delay": 0.5,
+     "output_path": "./data"
+   },
+   "reform": {
+     "boards": {
+       "보도자료": "press",
+       "논평브리핑": "briefing"
+     },
+     "start_date": "2024-02-13",
+     "max_pages": 10000,
+     "concurrent_requests": 10,
+     "request_delay": 0.3,
+     "output_path": "./data"
+   },
+   "basic_income": {
+     "boards": {
+       "논평보도자료": "bikr/press"
+     },
+     "start_date": "2020-01-08",
+     "max_pages": 10000,
+     "concurrent_requests": 10,
+     "request_delay": 0.3,
+     "output_path": "./data"
+   },
+   "jinbo": {
+     "boards": {
+       "보도자료": {"p": "286", "b": "b_1_111"},
+       "논평": {"p": "15", "b": "b_1_2"},
+       "모두발언": {"p": "14", "b": "b_1_1"}
+     },
+     "start_date": "2017-10-14",
+     "max_pages": 10000,
+     "concurrent_requests": 10,
+     "request_delay": 0.3,
+     "output_path": "./data"
+   }
+ }
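Each crawler reads its own section of this file and falls back to built-in defaults when the file or key is missing, as `basic_income_crawler_async.py` does in `load_config`. A minimal sketch of that pattern (`load_party_config` is a hypothetical standalone helper; the defaults shown are illustrative):

```python
import json
import os

# Illustrative defaults, mirroring the per-party sections above
DEFAULTS = {
    "max_pages": 10000,
    "concurrent_requests": 10,
    "request_delay": 0.3,
    "output_path": "./data",
}

def load_party_config(path: str, party: str, defaults: dict = DEFAULTS) -> dict:
    """Return the per-party section of crawler_config.json,
    merged over defaults; missing file or key falls back entirely."""
    section = {}
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            config = json.load(f)
        section = config.get(party, {})
    # Values from the file win over the defaults
    return {**defaults, **section}
```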
jinbo_crawler_async.py ADDED
@@ -0,0 +1,424 @@
1
+ #!/usr/bin/env python3
2
+ # -*- coding: utf-8 -*-
3
+ """
4
+ ์ง„๋ณด๋‹น ํฌ๋กค๋Ÿฌ - ๊ณ ์„ฑ๋Šฅ ๋น„๋™๊ธฐ ๋ฒ„์ „ + ํ—ˆ๊น…ํŽ˜์ด์Šค ์ž๋™ ์—…๋กœ๋“œ
5
+ - jinboparty.com ์ž์ฒด CMS ์‚ฌ์šฉ
6
+ - ๋ณด๋„์ž๋ฃŒ: ์นด๋“œํ˜• ๋ ˆ์ด์•„์›ƒ (div.img_list_item)
7
+ - ๋…ผํ‰/๋ชจ๋‘๋ฐœ์–ธ: ํ…Œ์ด๋ธ”ํ˜• ๋ ˆ์ด์•„์›ƒ (div#moTable)
8
+ - js_board_view('ID') โ†’ /pages/?p=...&b=...&bn=ID&m=read ํŒจํ„ด
9
+ """
10
+
11
+ import os
12
+ import json
13
+ import re
14
+ import asyncio
15
+ from datetime import datetime, timedelta
16
+ from typing import List, Dict, Optional
17
+ import pandas as pd
18
+ from tqdm.asyncio import tqdm as async_tqdm
19
+ import aiohttp
20
+ from bs4 import BeautifulSoup
21
+ from dotenv import load_dotenv
22
+ from huggingface_hub import HfApi, login
23
+ from datasets import Dataset, load_dataset
24
+
25
+ load_dotenv()
26
+
27
+
28
+ class JinboAsyncCrawler:
29
+ def __init__(self, config_path="crawler_config.json"):
30
+ self.base_url = "https://jinboparty.com"
31
+ self.party_name = "์ง„๋ณด๋‹น"
32
+ self.config_path = config_path
33
+ self.state_path = "crawler_state.json"
34
+
35
+ self.load_config()
36
+
37
+ self.hf_token = os.getenv("HF_TOKEN")
38
+ self.hf_repo_id = os.getenv("HF_REPO_ID_JINBO", "jinbo-press-releases")
39
+
40
+ self.semaphore = asyncio.Semaphore(10)
41
+
42
+ def load_config(self):
43
+ # boards ๊ฐ’์€ {"p": "...", "b": "..."} ํ˜•ํƒœ์˜ dict
44
+ default_config = {
45
+ "boards": {
46
+ "๋ณด๋„์ž๋ฃŒ": {"p": "286", "b": "b_1_111"},
47
+ "๋…ผํ‰": {"p": "15", "b": "b_1_2"},
48
+ "๋ชจ๋‘๋ฐœ์–ธ": {"p": "14", "b": "b_1_1"}
49
+ },
50
+ "start_date": "2017-10-14",
51
+ "max_pages": 10000,
52
+ "concurrent_requests": 10,
53
+ "request_delay": 0.3,
54
+ "output_path": "./data"
55
+ }
56
+
57
+ if os.path.exists(self.config_path):
58
+ with open(self.config_path, 'r', encoding='utf-8') as f:
59
+ config = json.load(f)
60
+ self.config = config.get('jinbo', default_config)
61
+ else:
62
+ self.config = default_config
63
+
64
+ self.boards = self.config["boards"]
65
+ self.start_date = self.config["start_date"]
66
+ self.max_pages = self.config["max_pages"]
67
+ self.output_path = self.config["output_path"]
68
+
69
+ def load_state(self) -> Dict:
70
+ if os.path.exists(self.state_path):
71
+ with open(self.state_path, 'r', encoding='utf-8') as f:
72
+ state = json.load(f)
73
+ return state.get('jinbo', {})
74
+ return {}
75
+
76
+ def save_state(self, state: Dict):
77
+ all_state = {}
78
+ if os.path.exists(self.state_path):
79
+ with open(self.state_path, 'r', encoding='utf-8') as f:
80
+ all_state = json.load(f)
81
+ all_state['jinbo'] = state
82
+ with open(self.state_path, 'w', encoding='utf-8') as f:
83
+ json.dump(all_state, f, ensure_ascii=False, indent=2)
84
+
85
+ @staticmethod
86
+ def parse_date(date_str: str) -> Optional[datetime]:
87
+ """YYYY.MM.DD ๋˜๋Š” YYYY-MM-DD ํŒŒ์‹ฑ"""
88
+ date_str = date_str.strip()
89
+ for fmt in ('%Y.%m.%d', '%Y-%m-%d'):
90
+ try:
91
+ return datetime.strptime(date_str[:10], fmt)
92
+ except:
93
+ continue
94
+ return None
95
+
96
+ @staticmethod
97
+ def clean_text(text: str) -> str:
98
+ text = text.replace('\xa0', '').replace('\u200b', '').replace('โ€‹', '')
99
+ return text.strip()
100
+
101
+ @staticmethod
102
+ def extract_board_id(href: str) -> Optional[str]:
103
+ """js_board_view('ID') ์—์„œ ID ์ถ”์ถœ"""
104
+ match = re.search(r"js_board_view\('(\d+)'\)", href)
105
+ return match.group(1) if match else None
106
+
107
+ async def fetch_with_retry(self, session: aiohttp.ClientSession, url: str,
108
+ max_retries: int = 3) -> Optional[str]:
109
+ async with self.semaphore:
110
+ for attempt in range(max_retries):
111
+ try:
112
+ await asyncio.sleep(self.config.get("request_delay", 0.3))
113
+ async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as response:
114
+ if response.status == 200:
115
+ return await response.text()
116
+ except Exception:
117
+ if attempt < max_retries - 1:
118
+ await asyncio.sleep(1)
119
+ else:
120
+ return None
121
+ return None
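`fetch_with_retry` combines three ideas: a semaphore to cap in-flight requests, a politeness delay before each attempt, and a bounded retry loop that gives up with `None`. A runnable sketch of that pattern with the aiohttp GET swapped for an injected `fetch` callable (the flaky stub below is purely illustrative):

```python
import asyncio
from typing import Optional

SEM = asyncio.Semaphore(10)  # cap concurrent requests, as the crawler does

async def fetch_with_retry(fetch, url: str, max_retries: int = 3,
                           delay: float = 0.0) -> Optional[str]:
    # `fetch` stands in for the aiohttp GET; it may raise on transient errors.
    async with SEM:
        for attempt in range(max_retries):
            try:
                await asyncio.sleep(delay)  # politeness delay before each try
                return await fetch(url)
            except Exception:
                if attempt < max_retries - 1:
                    await asyncio.sleep(0)  # back-off, shortened for the sketch
                else:
                    return None
    return None

calls = {'n': 0}

async def flaky(url: str) -> str:
    # Fails twice, then succeeds -- exercises the retry path.
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('transient')
    return 'ok'

result = asyncio.run(fetch_with_retry(flaky, 'https://example.org'))
```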
122
+
123
+ async def fetch_list_page(self, session: aiohttp.ClientSession,
124
+ board_name: str, board_cfg: Dict, page_num: int,
125
+ start_date: datetime, end_date: datetime) -> tuple:
126
+ p = board_cfg['p']
127
+ b = board_cfg['b']
128
+ url = f"{self.base_url}/pages/index.php?nPage={page_num}&p={p}&b={b}"
129
+
130
+ html = await self.fetch_with_retry(session, url)
131
+ if not html:
132
+ return [], False
133
+
134
+ soup = BeautifulSoup(html, 'html.parser')
135
+ data = []
136
+ stop_flag = False
137
+
138
+ # โ”€โ”€ ์นด๋“œํ˜• ๋ ˆ์ด์•„์›ƒ (๋ณด๋„์ž๋ฃŒ) โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
139
+ card_items = soup.select('div.img_list_item')
140
+ if card_items:
141
+ for item in card_items:
142
+ try:
143
+ link = item.select_one('a[href]')
144
+ if not link:
145
+ continue
146
+ bn = self.extract_board_id(link.get('href', ''))
147
+ if not bn:
148
+ continue
149
+
150
+ title_el = item.select_one('h4._tit span')
151
+ title = title_el.get_text(strip=True) if title_el else ""
152
+
153
+ # ๋‚ ์งœ: icon_cal ๋‹ค์Œ span
154
+ date_str = ""
155
+ for span in item.select('div.item_bottom span'):
156
+ text = span.get_text(strip=True)
157
+ if re.match(r'\d{4}\.\d{2}\.\d{2}', text):
158
+ date_str = text[:10]
159
+ break
160
+
161
+ if not date_str:
162
+ continue
163
+
164
+ article_date = self.parse_date(date_str)
165
+ if not article_date:
166
+ continue
167
+ if article_date < start_date:
168
+ stop_flag = True
169
+ break
170
+ if article_date > end_date:
171
+ continue
172
+
173
+ detail_url = f"{self.base_url}/pages/?p={p}&b={b}&bn={bn}&m=read"
174
+ data.append({
175
+ 'board_name': board_name,
176
+ 'title': title,
177
+ 'category': board_name,
178
+ 'date': article_date.strftime('%Y-%m-%d'),
179
+ 'url': detail_url
180
+ })
181
+ except Exception:
182
+ continue
183
+ return data, stop_flag
184
+
185
+ # โ”€โ”€ ํ…Œ์ด๋ธ”ํ˜• ๋ ˆ์ด์•„์›ƒ (๋…ผํ‰ยท๋ชจ๋‘๋ฐœ์–ธ) โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
186
+ table_items = soup.select('div#moTable li:not(.t_head)')
187
+ if table_items:
188
+ for item in table_items:
189
+ try:
190
+ link = item.select_one('div.tb_title_area a')
191
+ if not link:
192
+ continue
193
+ bn = self.extract_board_id(link.get('href', ''))
194
+ if not bn:
195
+ continue
196
+
197
+ title_el = item.select_one('p.title')
198
+ title = title_el.get_text(strip=True) if title_el else ""
199
+
200
+ # ๋‚ ์งœ: div.col.wid_140 ("๋“ฑ๋ก์ผ YYYY.MM.DD")
201
+ date_div = item.select_one('div.col.wid_140')
202
+ date_str = ""
203
+ if date_div:
204
+ raw = re.sub(r'๋“ฑ๋ก์ผ\s*', '', date_div.get_text(strip=True)).strip()
205
+ date_str = raw[:10]
206
+
207
+ if not date_str:
208
+ continue
209
+
210
+ article_date = self.parse_date(date_str)
211
+ if not article_date:
212
+ continue
213
+ if article_date < start_date:
214
+ stop_flag = True
215
+ break
216
+ if article_date > end_date:
217
+ continue
218
+
219
+ detail_url = f"{self.base_url}/pages/?p={p}&b={b}&bn={bn}&m=read"
220
+ data.append({
221
+ 'board_name': board_name,
222
+ 'title': title,
223
+ 'category': board_name,
224
+ 'date': article_date.strftime('%Y-%m-%d'),
225
+ 'url': detail_url
226
+ })
227
+ except Exception:
228
+ continue
229
+ return data, stop_flag
230
+
231
+ # ๋‘˜ ๋‹ค ์—†์œผ๋ฉด ๋นˆ ํŽ˜์ด์ง€
232
+ return [], True
233
+
234
+ async def fetch_article_detail(self, session: aiohttp.ClientSession, url: str) -> Dict:
235
+ html = await self.fetch_with_retry(session, url)
236
+ if not html:
237
+ return {'text': "๋ณธ๋ฌธ ์กฐํšŒ ์‹คํŒจ", 'writer': ""}
238
+
239
+ soup = BeautifulSoup(html, 'html.parser')
240
+ text_parts = []
241
+ writer = ""
242
+
243
+ # ๋ณธ๋ฌธ: div.content_box (class="td wid_full content_box")
244
+ contents_div = soup.select_one('div.content_box')
245
+ if contents_div:
246
+ for p in contents_div.find_all('p'):
247
+ cleaned = self.clean_text(p.get_text(strip=True))
248
+ if cleaned:
249
+ text_parts.append(cleaned)
250
+
251
+ # ์ž‘์„ฑ์ž: ul.info_list li ์ค‘ "์ž‘์„ฑ์ž" ํ•ญ๋ชฉ
252
+ for li in soup.select('ul.info_list li'):
253
+ b_tag = li.find('b')
254
+ if b_tag and '์ž‘์„ฑ์ž' in b_tag.get_text():
255
+ writer = li.get_text(strip=True).replace(b_tag.get_text(strip=True), '').strip()
256
+ break
257
+
258
+ return {'text': '\n'.join(text_parts), 'writer': writer}
259
+
260
+ async def collect_board(self, board_name: str, board_cfg: Dict,
261
+ start_date: str, end_date: str) -> List[Dict]:
262
+ start_dt = datetime.strptime(start_date, '%Y-%m-%d')
263
+ end_dt = datetime.strptime(end_date, '%Y-%m-%d')
264
+
265
+ print(f"\nโ–ถ [{board_name}] ๋ชฉ๋ก ์ˆ˜์ง‘ ์‹œ์ž‘...")
266
+
267
+ headers = {
268
+ 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
269
+ 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
270
+ 'Accept-Language': 'ko-KR,ko;q=0.9',
271
+ }
272
+
273
+ async with aiohttp.ClientSession(headers=headers) as session:
274
+ all_items = []
275
+ page_num = 1
276
+ empty_pages = 0
277
+ max_empty_pages = 3
278
+
279
+ with async_tqdm(desc=f"[{board_name}] ๋ชฉ๋ก", unit="ํŽ˜์ด์ง€") as pbar:
280
+ while page_num <= self.max_pages:
281
+ items, stop_flag = await self.fetch_list_page(
282
+ session, board_name, board_cfg, page_num, start_dt, end_dt
283
+ )
284
+
285
+ if not items:
286
+ empty_pages += 1
287
+ if empty_pages >= max_empty_pages or stop_flag:
288
+ break
289
+ else:
290
+ empty_pages = 0
291
+ all_items.extend(items)
292
+
293
+ pbar.update(1)
294
+ pbar.set_postfix({"์ˆ˜์ง‘": len(all_items)})
295
+
296
+ if stop_flag:
297
+ break
298
+
299
+ page_num += 1
300
+
301
+ print(f" โœ“ {len(all_items)}๊ฐœ ํ•ญ๋ชฉ ๋ฐœ๊ฒฌ")
302
+
303
+ if all_items:
304
+ print(f" โ–ถ ์ƒ์„ธ ํŽ˜์ด์ง€ ์ˆ˜์ง‘ ์ค‘...")
305
+ tasks = [self.fetch_article_detail(session, item['url']) for item in all_items]
306
+
307
+ details = []
308
+ for coro in async_tqdm(asyncio.as_completed(tasks),
309
+ total=len(tasks),
310
+ desc=f"[{board_name}] ์ƒ์„ธ"):
311
+ detail = await coro
312
+ details.append(detail)
313
+
314
+ for item, detail in zip(all_items, details):
315
+ item.update(detail)
316
+
317
+ print(f"โœ“ [{board_name}] ์™„๋ฃŒ: {len(all_items)}๊ฐœ")
318
+ return all_items
319
+
320
+ async def collect_all(self, start_date: Optional[str] = None,
321
+ end_date: Optional[str] = None) -> pd.DataFrame:
322
+ if not end_date:
323
+ end_date = datetime.now().strftime('%Y-%m-%d')
324
+ if not start_date:
325
+ start_date = self.start_date
326
+
327
+ print(f"\n{'='*60}")
328
+ print(f"์ง„๋ณด๋‹น ๋ณด๋„์ž๋ฃŒ ์ˆ˜์ง‘ - ๋น„๋™๊ธฐ ๊ณ ์„ฑ๋Šฅ ๋ฒ„์ „")
329
+ print(f"๊ธฐ๊ฐ„: {start_date} ~ {end_date}")
330
+ print(f"{'='*60}")
331
+
332
+ tasks = [
333
+ self.collect_board(board_name, board_cfg, start_date, end_date)
334
+ for board_name, board_cfg in self.boards.items()
335
+ ]
336
+ results = await asyncio.gather(*tasks)
337
+
338
+ all_data = []
339
+ for items in results:
340
+ all_data.extend(items)
341
+
342
+ if not all_data:
343
+ print("\nโš ๏ธ ์ˆ˜์ง‘๋œ ๋ฐ์ดํ„ฐ ์—†์Œ")
344
+ return pd.DataFrame()
345
+
346
+ df = pd.DataFrame(all_data)
347
+ df = df[['board_name', 'title', 'category', 'date', 'writer', 'text', 'url']]
348
+ df = df[(df['title'] != "") & (df['text'] != "")]
349
+ df['date'] = pd.to_datetime(df['date'], errors='coerce')
350
+
351
+ print(f"\nโœ“ ์ด {len(df)}๊ฐœ ์ˆ˜์ง‘ ์™„๋ฃŒ")
352
+ return df
353
+
354
+ def save_local(self, df: pd.DataFrame):
355
+ os.makedirs(self.output_path, exist_ok=True)
356
+ timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
357
+ csv_path = os.path.join(self.output_path, f"{self.party_name}_{timestamp}.csv")
358
+ xlsx_path = os.path.join(self.output_path, f"{self.party_name}_{timestamp}.xlsx")
359
+ df.to_csv(csv_path, index=False, encoding='utf-8-sig')
360
+ df.to_excel(xlsx_path, index=False, engine='openpyxl')
361
+ print(f"โœ“ CSV: {csv_path}")
362
+ print(f"โœ“ Excel: {xlsx_path}")
363
+
364
+ def upload_to_huggingface(self, df: pd.DataFrame):
365
+ if not self.hf_token:
366
+ print("\nโš ๏ธ HF_TOKEN์ด ์„ค์ •๋˜์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค.")
367
+ return
368
+
369
+ print(f"\nโ–ถ ํ—ˆ๊น…ํŽ˜์ด์Šค ์—…๋กœ๋“œ ์ค‘... (repo: {self.hf_repo_id})")
370
+ try:
371
+ login(token=self.hf_token)
372
+ new_dataset = Dataset.from_pandas(df)
373
+ try:
374
+ existing_dataset = load_dataset(self.hf_repo_id, split='train')
375
+ existing_df = existing_dataset.to_pandas()
376
+ combined_df = pd.concat([existing_df, df], ignore_index=True)
377
+ combined_df = combined_df.drop_duplicates(subset=['url'], keep='last')
378
+ combined_df = combined_df.sort_values('date', ascending=False).reset_index(drop=True)
379
+ final_dataset = Dataset.from_pandas(combined_df)
380
+ print(f" โœ“ ๋ณ‘ํ•ฉ ํ›„: {len(final_dataset)}๊ฐœ")
381
+ except Exception:
382
+ final_dataset = new_dataset
383
+ print(f" โ„น๏ธ ์‹ ๊ทœ ๋ฐ์ดํ„ฐ์…‹ ์ƒ์„ฑ")
384
+ final_dataset.push_to_hub(self.hf_repo_id, token=self.hf_token)
385
+ print(f"โœ“ ํ—ˆ๊น…ํŽ˜์ด์Šค ์—…๋กœ๋“œ ์™„๋ฃŒ!")
386
+ except Exception as e:
387
+ print(f"โœ— ์—…๋กœ๋“œ ์‹คํŒจ: {e}")
388
+
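`upload_to_huggingface` merges new rows into the existing dataset, keeps the most recent row per URL (`drop_duplicates(subset=['url'], keep='last')`), and sorts newest-first. The same merge rule can be sketched without pandas, using plain dicts:

```python
def merge_keep_last(existing: list, new: list) -> list:
    # Later rows win: a re-crawled URL replaces the previously stored row,
    # mirroring drop_duplicates(subset=['url'], keep='last').
    by_url = {}
    for row in existing + new:
        by_url[row['url']] = row
    # Newest first, like sort_values('date', ascending=False).
    return sorted(by_url.values(), key=lambda r: r['date'], reverse=True)

existing = [{'url': 'u1', 'date': '2024-01-01', 'title': 'old'}]
new = [{'url': 'u1', 'date': '2024-01-01', 'title': 'revised'},
       {'url': 'u2', 'date': '2024-02-01', 'title': 'fresh'}]
merged = merge_keep_last(existing, new)
```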
389
+ async def run_incremental(self):
390
+ state = self.load_state()
391
+ last_date = state.get('last_crawl_date')
392
+
393
+ if last_date:
394
+ start_date = (datetime.strptime(last_date, '%Y-%m-%d') + timedelta(days=1)).strftime('%Y-%m-%d')
395
+ print(f"๐Ÿ“… ์ฆ๋ถ„ ์—…๋ฐ์ดํŠธ: {start_date} ์ดํ›„ ๋ฐ์ดํ„ฐ๋งŒ ์ˆ˜์ง‘")
396
+ else:
397
+ start_date = self.start_date
398
+ print(f"๐Ÿ“… ์ „์ฒด ์ˆ˜์ง‘: {start_date}๋ถ€ํ„ฐ")
399
+
400
+ end_date = datetime.now().strftime('%Y-%m-%d')
401
+ df = await self.collect_all(start_date, end_date)
402
+
403
+ if df.empty:
404
+ print("โœ“ ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ ์—†์Œ")
405
+ return
406
+
407
+ self.save_local(df)
408
+ self.upload_to_huggingface(df)
409
+
410
+ state['last_crawl_date'] = end_date
411
+ state['last_crawl_time'] = datetime.now().isoformat()
412
+ state['last_count'] = len(df)
413
+ self.save_state(state)
414
+
415
+ print(f"\n{'='*60}\nโœ“ ์™„๋ฃŒ!\n{'='*60}\n")
416
+
417
+
418
+ async def main():
419
+ crawler = JinboAsyncCrawler()
420
+ await crawler.run_incremental()
421
+
422
+
423
+ if __name__ == "__main__":
424
+ asyncio.run(main())
main.py ADDED
@@ -0,0 +1,145 @@
1
+ #!/usr/bin/env python3
2
+ # -*- coding: utf-8 -*-
3
+ """
4
+ ์ •๋‹น ๋ณด๋„์ž๋ฃŒ ํฌ๋กค๋Ÿฌ - ๋ฉ”์ธ ์ง„์ž…์ 
5
+ ์ง€์› ์ •๋‹น: ๋”๋ถˆ์–ด๋ฏผ์ฃผ๋‹น, ๊ตญ๋ฏผ์˜ํž˜, ์กฐ๊ตญํ˜์‹ ๋‹น, ๊ฐœํ˜์‹ ๋‹น, ๊ธฐ๋ณธ์†Œ๋“๋‹น, ์ง„๋ณด๋‹น
6
+
7
+ ์‚ฌ์šฉ๋ฒ•:
8
+ python main.py # ์ „์ฒด ์ •๋‹น ์ฆ๋ถ„ ์—…๋ฐ์ดํŠธ
9
+ python main.py --party minjoo # ๋”๋ถˆ์–ด๋ฏผ์ฃผ๋‹น๋งŒ
10
+ python main.py --party ppp # ๊ตญ๋ฏผ์˜ํž˜๋งŒ
11
+ python main.py --party rebuilding # ์กฐ๊ตญํ˜์‹ ๋‹น๋งŒ
12
+ python main.py --party reform # ๊ฐœํ˜์‹ ๋‹น๋งŒ
13
+ python main.py --party basic_income # ๊ธฐ๋ณธ์†Œ๋“๋‹น๋งŒ
14
+ python main.py --party jinbo # ์ง„๋ณด๋‹น๋งŒ
15
+ python main.py --start-date 2024-01-01 # ๋‚ ์งœ ๋ฒ”์œ„ ์ง€์ •
16
+ python main.py --party ppp --start-date 2024-01-01 --end-date 2024-06-30
17
+ """
18
+
19
+ import asyncio
20
+ import argparse
21
+ import logging
22
+ from datetime import datetime
23
+
24
+ from minjoo_crawler_async import MinjooAsyncCrawler
25
+ from ppp_crawler_async import PPPAsyncCrawler
26
+ from rebuilding_crawler_async import RebuildingAsyncCrawler
27
+ from reform_crawler_async import ReformAsyncCrawler
28
+ from basic_income_crawler_async import BasicIncomeAsyncCrawler
29
+ from jinbo_crawler_async import JinboAsyncCrawler
30
+
31
+ logging.basicConfig(
32
+ level=logging.INFO,
33
+ format='%(asctime)s [%(levelname)s] %(message)s',
34
+ handlers=[
35
+ logging.FileHandler('main.log', encoding='utf-8'),
36
+ logging.StreamHandler()
37
+ ]
38
+ )
39
+ logger = logging.getLogger(__name__)
40
+
41
+ PARTY_LABELS = {
42
+ 'minjoo': '๋”๋ถˆ์–ด๋ฏผ์ฃผ๋‹น',
43
+ 'ppp': '๊ตญ๋ฏผ์˜ํž˜',
44
+ 'rebuilding': '์กฐ๊ตญํ˜์‹ ๋‹น',
45
+ 'reform': '๊ฐœํ˜์‹ ๋‹น',
46
+ 'basic_income':'๊ธฐ๋ณธ์†Œ๋“๋‹น',
47
+ 'jinbo': '์ง„๋ณด๋‹น',
48
+ 'all': '์ „์ฒด (6๊ฐœ ์ •๋‹น)',
49
+ }
50
+
51
+ ALL_PARTIES = ['minjoo', 'ppp', 'rebuilding', 'reform', 'basic_income', 'jinbo']
52
+
53
+
54
+ def parse_args():
55
+ parser = argparse.ArgumentParser(
56
+ description='์ •๋‹น ๋ณด๋„์ž๋ฃŒ ํฌ๋กค๋Ÿฌ',
57
+ formatter_class=argparse.RawTextHelpFormatter
58
+ )
59
+ parser.add_argument(
60
+ '--party',
61
+ choices=list(PARTY_LABELS.keys()),
62
+ default='all',
63
+ help=(
64
+ 'ํฌ๋กค๋งํ•  ์ •๋‹น ์„ ํƒ (๊ธฐ๋ณธ๊ฐ’: all)\n'
65
+ ' minjoo : ๋”๋ถˆ์–ด๋ฏผ์ฃผ๋‹น\n'
66
+ ' ppp : ๊ตญ๋ฏผ์˜ํž˜\n'
67
+ ' rebuilding : ์กฐ๊ตญํ˜์‹ ๋‹น\n'
68
+ ' reform : ๊ฐœํ˜์‹ ๋‹น\n'
69
+ ' basic_income : ๊ธฐ๋ณธ์†Œ๋“๋‹น\n'
70
+ ' jinbo : ์ง„๋ณด๋‹น\n'
71
+ ' all : ์ „์ฒด ๋™์‹œ ํฌ๋กค๋ง'
72
+ )
73
+ )
74
+ parser.add_argument(
75
+ '--start-date',
76
+ metavar='YYYY-MM-DD',
77
+ default=None,
78
+ help='์ˆ˜์ง‘ ์‹œ์ž‘ ๋‚ ์งœ (์˜ˆ: 2024-01-01)\n๋ฏธ์ž…๋ ฅ ์‹œ ๋งˆ์ง€๋ง‰ ํฌ๋กค๋ง ์ดํ›„๋ถ€ํ„ฐ (์ฆ๋ถ„ ์—…๋ฐ์ดํŠธ)'
79
+ )
80
+ parser.add_argument(
81
+ '--end-date',
82
+ metavar='YYYY-MM-DD',
83
+ default=None,
84
+ help='์ˆ˜์ง‘ ์ข…๋ฃŒ ๋‚ ์งœ (์˜ˆ: 2024-12-31)\n๋ฏธ์ž…๋ ฅ ์‹œ ์˜ค๋Š˜ ๋‚ ์งœ'
85
+ )
86
+ return parser.parse_args()
87
+
88
+
89
+ def get_crawler(party: str):
90
+ """์ •๋‹น ์ฝ”๋“œ์— ๋งž๋Š” ํฌ๋กค๋Ÿฌ ์ธ์Šคํ„ด์Šค ๋ฐ˜ํ™˜"""
91
+ return {
92
+ 'minjoo': MinjooAsyncCrawler,
93
+ 'ppp': PPPAsyncCrawler,
94
+ 'rebuilding': RebuildingAsyncCrawler,
95
+ 'reform': ReformAsyncCrawler,
96
+ 'basic_income': BasicIncomeAsyncCrawler,
97
+ 'jinbo': JinboAsyncCrawler,
98
+ }[party]()
99
+
100
+
101
+ async def run_party(party: str, start_date=None, end_date=None):
102
+ """๋‹จ์ผ ์ •๋‹น ํฌ๋กค๋ง ์‹คํ–‰"""
103
+ crawler = get_crawler(party)
104
+ if start_date or end_date:
105
+ df = await crawler.collect_all(start_date, end_date)
106
+ if not df.empty:
107
+ crawler.save_local(df)
108
+ crawler.upload_to_huggingface(df)
109
+ else:
110
+ await crawler.run_incremental()
111
+
112
+
113
+ async def main():
114
+ args = parse_args()
115
+ start_time = datetime.now()
116
+
117
+ target_parties = ALL_PARTIES if args.party == 'all' else [args.party]
118
+
119
+ logger.info("=" * 60)
120
+ logger.info("์ •๋‹น ๋ณด๋„์ž๋ฃŒ ํฌ๋กค๋Ÿฌ ์‹œ์ž‘")
121
+ logger.info(f"๋Œ€์ƒ ์ •๋‹น : {PARTY_LABELS[args.party]}")
122
+ logger.info(f"์ˆ˜์ง‘ ๊ธฐ๊ฐ„ : {args.start_date or '์ฆ๋ถ„ ์—…๋ฐ์ดํŠธ'} ~ {args.end_date or '์˜ค๋Š˜'}")
123
+ logger.info("=" * 60)
124
+
125
+ if len(target_parties) == 1:
126
+ await run_party(target_parties[0], args.start_date, args.end_date)
127
+ else:
128
+ results = await asyncio.gather(
129
+ *[run_party(p, args.start_date, args.end_date) for p in target_parties],
130
+ return_exceptions=True
131
+ )
132
+ for party, result in zip(target_parties, results):
133
+ if isinstance(result, Exception):
134
+ logger.error(f"{PARTY_LABELS[party]} ํฌ๋กค๋ง ์‹คํŒจ: {result}")
135
+ else:
136
+ logger.info(f"{PARTY_LABELS[party]} ํฌ๏ฟฝ๏ฟฝ๏ฟฝ๋ง ์™„๋ฃŒ")
137
+
138
+ duration = (datetime.now() - start_time).total_seconds()
139
+ logger.info("=" * 60)
140
+ logger.info(f"์ „์ฒด ์™„๋ฃŒ! ์†Œ์š” ์‹œ๊ฐ„: {duration:.1f}์ดˆ ({duration / 60:.1f}๋ถ„)")
141
+ logger.info("=" * 60)
142
+
143
+
144
+ if __name__ == "__main__":
145
+ asyncio.run(main())
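When `--party all` is selected, `main()` fans the crawlers out with `asyncio.gather(..., return_exceptions=True)` so one failing party cannot abort the rest; exceptions come back as values and are logged per party. A minimal sketch of that error-isolation pattern (the coroutine below is a stand-in for `run_party`):

```python
import asyncio

async def crawl(party: str) -> str:
    # Stand-in for run_party(); one party fails to show the isolation.
    if party == 'broken':
        raise RuntimeError(f'{party} failed')
    return f'{party} done'

async def main() -> list:
    return await asyncio.gather(
        *[crawl(p) for p in ('minjoo', 'broken', 'jinbo')],
        return_exceptions=True,  # exceptions return as values, not raised
    )

results = asyncio.run(main())
```

Without `return_exceptions=True`, the first `RuntimeError` would propagate and cancel the sibling tasks.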
minjoo_crawler_async.py ADDED
@@ -0,0 +1,453 @@
1
+ #!/usr/bin/env python3
2
+ # -*- coding: utf-8 -*-
3
+ """
4
+ ๋”๋ถˆ์–ด๋ฏผ์ฃผ๋‹น ํฌ๋กค๋Ÿฌ - ๊ณ ์„ฑ๋Šฅ ๋น„๋™๊ธฐ ๋ฒ„์ „ + ํ—ˆ๊น…ํŽ˜์ด์Šค ์ž๋™ ์—…๋กœ๋“œ
5
+ - asyncio + aiohttp (10-20๋ฐฐ ๋น ๋ฅธ ์†๋„)
6
+ - ๋™์‹œ ์š”์ฒญ ์ˆ˜ ์ œ์–ด (์„œ๋ฒ„ ๋ถ€๋‹ด ์ตœ์†Œํ™”)
7
+ - ์ฆ๋ถ„ ์—…๋ฐ์ดํŠธ (๋งˆ์ง€๋ง‰ ๋‚ ์งœ ์ดํ›„๋งŒ ํฌ๋กค๋ง)
8
+ - ํ—ˆ๊น…ํŽ˜์ด์Šค ์ž๋™ ์—…๋กœ๋“œ
9
+ - ์ผ ๋‹จ์œ„ ์Šค์ผ€์ค„๋ง
10
+ """
11
+
12
+ import os
13
+ import json
14
+ import time
15
+ import re
16
+ import asyncio
17
+ from datetime import datetime, timedelta
18
+ from typing import List, Dict, Optional
19
+ import pandas as pd
20
+ from tqdm.asyncio import tqdm as async_tqdm
21
+ import aiohttp
22
+ from bs4 import BeautifulSoup
23
+ from dotenv import load_dotenv
24
+ from huggingface_hub import HfApi, login
25
+ from datasets import Dataset, load_dataset, concatenate_datasets
26
+
27
+ # .env ํŒŒ์ผ ๋กœ๋“œ
28
+ load_dotenv()
29
+
30
+ class MinjooAsyncCrawler:
31
+ def __init__(self, config_path="crawler_config.json"):
32
+ self.base_url = "https://theminjoo.kr/main/sub"
33
+ self.party_name = "๋”๋ถˆ์–ด๋ฏผ์ฃผ๋‹น"
34
+ self.config_path = config_path
35
+ self.state_path = "crawler_state.json"
36
+
37
+ # ์„ค์ • ๋กœ๋“œ
38
+ self.load_config()
39
+
40
+ # ํ—ˆ๊น…ํŽ˜์ด์Šค ์„ค์ •
41
+ self.hf_token = os.getenv("HF_TOKEN")
42
+ self.hf_repo_id = os.getenv("HF_REPO_ID", "minjoo-press-releases")
43
+
44
+ # ๋™์‹œ ์š”์ฒญ ์ˆ˜ ์ œํ•œ (์„œ๋ฒ„ ๋ถ€๋‹ด ๋ฐฉ์ง€)
45
+ self.semaphore = asyncio.Semaphore(20)
46
+
47
+ def load_config(self):
48
+ """์„ค์ • ํŒŒ์ผ ๋กœ๋“œ"""
49
+ default_config = {
50
+ "boards": {
51
+ "๋ณด๋„์ž๋ฃŒ": "188",
52
+ "๋…ผํ‰_๋ธŒ๋ฆฌํ•‘": "11",
53
+ "๋ชจ๋‘๋ฐœ์–ธ": "230"
54
+ },
55
+ "start_date": "2003-11-11",
56
+ "max_pages": 10000,
57
+ "concurrent_requests": 20,
58
+ "request_delay": 0.1,
59
+ "output_path": "./data"
60
+ }
61
+
62
+ if os.path.exists(self.config_path):
63
+ with open(self.config_path, 'r', encoding='utf-8') as f:
64
+ config = json.load(f)
65
+ # ๋ฏผ์ฃผ๋‹น ์„ค์ •๋งŒ ์ถ”์ถœ
66
+ if 'minjoo' in config:
67
+ self.config = config['minjoo']
68
+ else:
69
+ self.config = default_config
70
+ else:
71
+ self.config = default_config
72
+
73
+ self.boards = self.config["boards"]
74
+ self.start_date = self.config["start_date"]
75
+ self.max_pages = self.config["max_pages"]
76
+ self.output_path = self.config["output_path"]
77
+
78
+ def load_state(self) -> Dict:
79
+ """ํฌ๋กค๋Ÿฌ ์ƒํƒœ ๋กœ๋“œ (๋งˆ์ง€๋ง‰ ํฌ๋กค๋ง ๋‚ ์งœ)"""
80
+ if os.path.exists(self.state_path):
81
+ with open(self.state_path, 'r', encoding='utf-8') as f:
82
+ state = json.load(f)
83
+ return state.get('minjoo', {})
84
+ return {}
85
+
86
+ def save_state(self, state: Dict):
87
+ """ํฌ๋กค๋Ÿฌ ์ƒํƒœ ์ €์žฅ"""
88
+ all_state = {}
89
+ if os.path.exists(self.state_path):
90
+ with open(self.state_path, 'r', encoding='utf-8') as f:
91
+ all_state = json.load(f)
92
+
93
+ all_state['minjoo'] = state
94
+
95
+ with open(self.state_path, 'w', encoding='utf-8') as f:
96
+ json.dump(all_state, f, ensure_ascii=False, indent=2)
97
+
98
+ @staticmethod
99
+ def parse_date(date_str: str) -> Optional[datetime]:
100
+ """๋‚ ์งœ ํŒŒ์‹ฑ"""
101
+ try:
102
+ return datetime.strptime(date_str.strip().split()[0], '%Y-%m-%d')
103
+ except (ValueError, IndexError):
104
+ return None
105
+
106
+ @staticmethod
107
+ def clean_text(text: str) -> str:
108
+ """ํ…์ŠคํŠธ ์ •๋ฆฌ"""
109
+ text = text.replace('\xa0', '').replace('\u200b', '').replace('โ€‹', '')
110
+ return text.strip()
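`clean_text` strips the non-breaking spaces and zero-width characters that pad the party sites' HTML before the text is joined into the article body. Standalone:

```python
def clean_text(text: str) -> str:
    # Remove non-breaking spaces (\xa0) and zero-width spaces (\u200b),
    # then trim ordinary surrounding whitespace.
    for ch in ('\xa0', '\u200b'):
        text = text.replace(ch, '')
    return text.strip()
```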
111
+
112
+ async def fetch_with_retry(self, session: aiohttp.ClientSession, url: str,
113
+ max_retries: int = 3) -> Optional[str]:
114
+ """์žฌ์‹œ๋„ ๋กœ์ง์ด ์žˆ๋Š” ๋น„๋™๊ธฐ ์š”์ฒญ"""
115
+ async with self.semaphore:
116
+ for attempt in range(max_retries):
117
+ try:
118
+ await asyncio.sleep(self.config.get("request_delay", 0.1))
119
+ async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as response:
120
+ if response.status == 200:
121
+ return await response.text()
122
+ except Exception as e:
123
+ if attempt < max_retries - 1:
124
+ await asyncio.sleep(1)
125
+ else:
126
+ return None
127
+ return None
128
+
129
+ async def fetch_list_page(self, session: aiohttp.ClientSession,
130
+ board_id: str, page_num: int,
131
+ start_date: datetime, end_date: datetime) -> tuple:
132
+ """๋ชฉ๋ก ํŽ˜์ด์ง€ ํ•˜๋‚˜ ๊ฐ€์ ธ์˜ค๊ธฐ"""
133
+ if page_num == 0:
134
+ url = f"{self.base_url}/news/list.php?brd={board_id}"
135
+ else:
136
+ url = f"{self.base_url}/news/list.php?sno={page_num}&par=&&brd={board_id}"
137
+
138
+ html = await self.fetch_with_retry(session, url)
139
+ if not html:
140
+ return [], False
141
+
142
+ soup = BeautifulSoup(html, 'html.parser')
143
+ board_items = soup.find_all('div', {'class': 'board-item'})
144
+
145
+ if not board_items:
146
+ return [], True # ๋นˆ ํŽ˜์ด์ง€
147
+
148
+ data = []
149
+ stop_flag = False
150
+
151
+ for item in board_items:
152
+ try:
153
+ link_tag = item.find('a')
154
+ if not link_tag:
155
+ continue
156
+
157
+ title_span = link_tag.find('span')
158
+ if not title_span:
159
+ continue
160
+
161
+ title = title_span.get_text(strip=True).replace('\n', ' ')
162
+
163
+ # URL ์ฒ˜๋ฆฌ
164
+ article_url = link_tag.get('href', '')
165
+ if article_url.startswith('./'):
166
+ article_url = self.base_url + '/news/' + article_url[2:]
167
+ elif not article_url.startswith('http'):
168
+ article_url = self.base_url + article_url
169
+
170
+ # ์นดํ…Œ๊ณ ๋ฆฌ
171
+ category_tag = item.find('p', {'class': 'category'})
172
+ category = ""
173
+ if category_tag:
174
+ category_span = category_tag.find('span')
175
+ if category_span:
176
+ category = category_span.get_text(strip=True)
177
+
178
+ # ๋‚ ์งœ
179
+ time_tag = item.find('time')
180
+ if not time_tag:
181
+ continue
182
+
183
+ date_str = time_tag.get('datetime', '') or time_tag.get_text(strip=True)
184
+ article_date = self.parse_date(date_str)
185
+
186
+ if not article_date:
187
+ continue
188
+ if article_date < start_date:
189
+ stop_flag = True
190
+ break
191
+ if article_date > end_date:
192
+ continue
193
+
194
+ data.append({
195
+ 'category': category,
196
+ 'title': title,
197
+ 'date': date_str.split()[0] if ' ' in date_str else date_str,
198
+ 'url': article_url
199
+ })
200
+ except Exception:
201
+ continue
202
+
203
+ return data, stop_flag
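The theminjoo.kr list pages mix `./`-relative links with absolute URLs, so `fetch_list_page` normalizes both before storing them. The same branch, isolated (base URL taken from the class above):

```python
def absolutize(base_url: str, href: str) -> str:
    # './view.php?...' links live under /news/; anything else that is not
    # already absolute is joined onto the base URL as-is.
    if href.startswith('./'):
        return base_url + '/news/' + href[2:]
    if not href.startswith('http'):
        return base_url + href
    return href

base = 'https://theminjoo.kr/main/sub'
```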
204
+
205
+ async def fetch_article_detail(self, session: aiohttp.ClientSession,
206
+ url: str) -> Dict:
207
+ """์ƒ์„ธ ํŽ˜์ด์ง€ ๊ฐ€์ ธ์˜ค๊ธฐ"""
208
+ html = await self.fetch_with_retry(session, url)
209
+ if not html:
210
+ return {'text': "๋ณธ๋ฌธ ์กฐํšŒ ์‹คํŒจ", 'writer': "", 'published_date': ""}
211
+
212
+ soup = BeautifulSoup(html, 'html.parser')
213
+ text_parts = []
214
+ writer = ""
215
+ published_date = ""
216
+
217
+ # ๊ฒŒ์‹œ์ผ
218
+ date_li = soup.find('li', {'class': 'date'})
219
+ if date_li:
220
+ date_text = date_li.get_text(strip=True)
221
+ match = re.search(r'(\d{4}-\d{2}-\d{2})', date_text)
222
+ if match:
223
+ published_date = match.group(1)
224
+
225
+ # ๋ณธ๋ฌธ
226
+ contents_div = soup.find('div', {'class': 'board-view__contents'})
227
+ if contents_div:
228
+ for element in contents_div.descendants:
229
+ if element.name == 'p':
230
+ text = element.get_text(strip=True)
231
+ cleaned = self.clean_text(text)
232
+ if cleaned:
233
+ text_parts.append(cleaned)
234
+ elif element.name == 'b':
235
+ text = element.get_text(strip=True)
236
+ cleaned = self.clean_text(text)
237
+ if cleaned and not writer:
238
+ if '๋ฏผ์ฃผ๋‹น' in cleaned or '๊ณต๋ณด๊ตญ' in cleaned or '๋Œ€๋ณ€์ธ' in cleaned:
239
+ writer = cleaned
240
+
241
+ return {
242
+ 'text': '\n'.join(text_parts),
243
+ 'writer': writer,
244
+ 'published_date': published_date
245
+ }
246
+
247
+ async def collect_board(self, board_name: str, board_id: str,
248
+ start_date: str, end_date: str) -> List[Dict]:
249
+ """ํ•œ ๊ฒŒ์‹œํŒ ์ „์ฒด ์ˆ˜์ง‘ (๋น„๋™๊ธฐ)"""
250
+ start_dt = datetime.strptime(start_date, '%Y-%m-%d')
251
+ end_dt = datetime.strptime(end_date, '%Y-%m-%d')
252
+
253
+ print(f"\nโ–ถ [{board_name}] ๋ชฉ๋ก ์ˆ˜์ง‘ ์‹œ์ž‘...")
254
+
255
+ headers = {
256
+ 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
257
+ 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
258
+ 'Accept-Language': 'ko-KR,ko;q=0.9',
259
+ }
260
+
261
+ async with aiohttp.ClientSession(headers=headers) as session:
262
+ # 1๋‹จ๊ณ„: ๋ชฉ๋ก ํŽ˜์ด์ง€ ์ˆ˜์ง‘
263
+ all_items = []
264
+ page_num = 0
265
+ empty_pages = 0
266
+ max_empty_pages = 3
267
+
268
+ with async_tqdm(desc=f"[{board_name}] ๋ชฉ๋ก", unit="ํŽ˜์ด์ง€") as pbar:
269
+ while page_num <= self.max_pages * 20:
270
+ items, stop_flag = await self.fetch_list_page(
271
+ session, board_id, page_num, start_dt, end_dt
272
+ )
273
+
274
+ if not items:
275
+ empty_pages += 1
276
+ if empty_pages >= max_empty_pages or stop_flag:
277
+ break
278
+ else:
279
+ empty_pages = 0
280
+ all_items.extend(items)
281
+
282
+ pbar.update(1)
283
+ pbar.set_postfix({"์ˆ˜์ง‘": len(all_items)})
284
+
285
+ if stop_flag:
286
+ break
287
+
288
+ page_num += 20
289
+
290
+ print(f" โœ“ {len(all_items)}๊ฐœ ํ•ญ๋ชฉ ๋ฐœ๊ฒฌ")
291
+
292
+ # 2๋‹จ๊ณ„: ์ƒ์„ธ ํŽ˜์ด์ง€ ์ˆ˜์ง‘ (๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ)
293
+ if all_items:
294
+ print(f" โ–ถ ์ƒ์„ธ ํŽ˜์ด์ง€ ์ˆ˜์ง‘ ์ค‘...")
295
+ tasks = [self.fetch_article_detail(session, item['url']) for item in all_items]
296
+
297
+ # ์ง„ํ–‰๋ฅ  ํ‘œ์‹œ์™€ ํ•จ๊ป˜ ๋ณ‘๋ ฌ ์‹คํ–‰
298
+ details = []
299
+ for coro in async_tqdm(asyncio.as_completed(tasks),
300
+ total=len(tasks),
301
+ desc=f"[{board_name}] ์ƒ์„ธ"):
302
+ detail = await coro
303
+ details.append(detail)
304
+
305
+ # ์ƒ์„ธ ์ •๋ณด ๋ณ‘ํ•ฉ
306
+ for item, detail in zip(all_items, details):
307
+ item.update(detail)
308
+ item['board_name'] = board_name
309
+
310
+ print(f"โœ“ [{board_name}] ์™„๋ฃŒ: {len(all_items)}๊ฐœ")
311
+ return all_items
312
+
313
+ async def collect_all(self, start_date: Optional[str] = None,
314
+ end_date: Optional[str] = None) -> pd.DataFrame:
315
+ """๋ชจ๋“  ๊ฒŒ์‹œํŒ ์ˆ˜์ง‘"""
316
+ if not end_date:
317
+ end_date = datetime.now().strftime('%Y-%m-%d')
318
+ if not start_date:
319
+ start_date = self.start_date
320
+
321
+ print(f"\n{'='*60}")
322
+ print(f"๋”๋ถˆ์–ด๋ฏผ์ฃผ๋‹น ๋ณด๋„์ž๋ฃŒ ์ˆ˜์ง‘ - ๋น„๋™๊ธฐ ๊ณ ์„ฑ๋Šฅ ๋ฒ„์ „")
323
+ print(f"๊ธฐ๊ฐ„: {start_date} ~ {end_date}")
324
+ print(f"{'='*60}")
325
+
326
+ # ๋ชจ๋“  ๊ฒŒ์‹œํŒ ๋ณ‘๋ ฌ ์ˆ˜์ง‘
327
+ tasks = [
328
+ self.collect_board(board_name, board_id, start_date, end_date)
329
+ for board_name, board_id in self.boards.items()
330
+ ]
331
+
332
+ results = await asyncio.gather(*tasks)
333
+
334
+ # ๋ฐ์ดํ„ฐ ๊ฒฐํ•ฉ
335
+ all_data = []
336
+ for items in results:
337
+ all_data.extend(items)
338
+
339
+ if not all_data:
340
+ print("\nโš ๏ธ ์ˆ˜์ง‘๋œ ๋ฐ์ดํ„ฐ ์—†์Œ")
341
+ return pd.DataFrame()
342
+
343
+ df = pd.DataFrame(all_data)
344
+ df = df[['board_name', 'title', 'category', 'published_date', 'writer', 'text', 'url']]
345
+ df = df[(df['title'] != "") & (df['text'] != "")]
346
+ df['published_date'] = pd.to_datetime(df['published_date'], errors='coerce')
347
+ df = df.rename(columns={'published_date': 'date'})
348
+
349
+ print(f"\nโœ“ ์ด {len(df)}๊ฐœ ์ˆ˜์ง‘ ์™„๋ฃŒ")
350
+ return df
351
+
352
+ def save_local(self, df: pd.DataFrame):
353
+ """๋กœ์ปฌ์— ์ €์žฅ"""
354
+ os.makedirs(self.output_path, exist_ok=True)
355
+ timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
356
+
357
+ # CSV
358
+ csv_path = os.path.join(self.output_path, f"{self.party_name}_{timestamp}.csv")
359
+ df.to_csv(csv_path, index=False, encoding='utf-8-sig')
360
+
361
+ # Excel
362
+ xlsx_path = os.path.join(self.output_path, f"{self.party_name}_{timestamp}.xlsx")
363
+ df.to_excel(xlsx_path, index=False, engine='openpyxl')
364
+
365
+ print(f"โœ“ CSV: {csv_path}")
366
+ print(f"โœ“ Excel: {xlsx_path}")
367
+
368
+ def upload_to_huggingface(self, df: pd.DataFrame):
369
+ """ํ—ˆ๊น…ํŽ˜์ด์Šค์— ์—…๋กœ๋“œ"""
370
+ if not self.hf_token:
371
+ print("\nโš ๏ธ HF_TOKEN์ด ์„ค์ •๋˜์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค. .env ํŒŒ์ผ์„ ํ™•์ธํ•˜์„ธ์š”.")
372
+ return
373
+
374
+ print(f"\nโ–ถ ํ—ˆ๊น…ํŽ˜์ด์Šค ์—…๋กœ๋“œ ์ค‘... (repo: {self.hf_repo_id})")
375
+
376
+ try:
377
+ # ๋กœ๊ทธ์ธ
378
+ login(token=self.hf_token)
379
+ api = HfApi()
380
+
381
+ # ์ƒˆ ๋ฐ์ดํ„ฐ์…‹ ์ƒ์„ฑ
382
+ new_dataset = Dataset.from_pandas(df)
383
+
384
+ # ๊ธฐ์กด ๋ฐ์ดํ„ฐ์…‹ ํ™•์ธ ๋ฐ ๋ณ‘ํ•ฉ
385
+ try:
386
+ existing_dataset = load_dataset(self.hf_repo_id, split='train')
387
+ print(f" โ„น๏ธ ๊ธฐ์กด ๋ฐ์ดํ„ฐ: {len(existing_dataset)}๊ฐœ")
388
+
389
+ # ์ค‘๋ณต ์ œ๊ฑฐ๋ฅผ ์œ„ํ•ด URL ๊ธฐ์ค€์œผ๋กœ ๋ณ‘ํ•ฉ
390
+ existing_df = existing_dataset.to_pandas()
391
+ combined_df = pd.concat([existing_df, df], ignore_index=True)
392
+ combined_df = combined_df.drop_duplicates(subset=['url'], keep='last')
393
+ combined_df = combined_df.sort_values('date', ascending=False).reset_index(drop=True)
394
+
395
+ final_dataset = Dataset.from_pandas(combined_df)
396
+ print(f" โœ“ ๋ณ‘ํ•ฉ ํ›„: {len(final_dataset)}๊ฐœ (์ค‘๋ณต ์ œ๊ฑฐ๋จ)")
397
+ except Exception:
398
+ print(f" โ„น๏ธ ์‹ ๊ทœ ๋ฐ์ดํ„ฐ์…‹ ์ƒ์„ฑ")
399
+ final_dataset = new_dataset
400
+
401
+ # ์—…๋กœ๋“œ
402
+ final_dataset.push_to_hub(self.hf_repo_id, token=self.hf_token)
403
+ print(f"โœ“ ํ—ˆ๊น…ํŽ˜์ด์Šค ์—…๋กœ๋“œ ์™„๋ฃŒ!")
404
+ print(f" ๐Ÿ”— https://huggingface.co/datasets/{self.hf_repo_id}")
405
+
406
+ except Exception as e:
407
+ print(f"โœ— ์—…๋กœ๋“œ ์‹คํŒจ: {e}")
408
+
409
+ async def run_incremental(self):
410
+ """์ฆ๋ถ„ ์—…๋ฐ์ดํŠธ ์‹คํ–‰ (๋งˆ์ง€๋ง‰ ๋‚ ์งœ ์ดํ›„๋งŒ)"""
411
+ state = self.load_state()
412
+ last_date = state.get('last_crawl_date')
413
+
414
+ if last_date:
415
+ # ๋งˆ์ง€๋ง‰ ํฌ๋กค๋ง ๋‚ ์งœ ๋‹ค์Œ๋‚ ๋ถ€ํ„ฐ
416
+ start_date = (datetime.strptime(last_date, '%Y-%m-%d') + timedelta(days=1)).strftime('%Y-%m-%d')
417
+ print(f"๐Ÿ“… ์ฆ๋ถ„ ์—…๋ฐ์ดํŠธ: {start_date} ์ดํ›„ ๋ฐ์ดํ„ฐ๋งŒ ์ˆ˜์ง‘")
418
+ else:
419
+ start_date = self.start_date
420
+ print(f"๐Ÿ“… ์ „์ฒด ์ˆ˜์ง‘: {start_date}๋ถ€ํ„ฐ")
421
+
422
+ end_date = datetime.now().strftime('%Y-%m-%d')
423
+
424
+ # ํฌ๋กค๋ง
425
+ df = await self.collect_all(start_date, end_date)
426
+
427
+ if df.empty:
428
+ print("โœ“ ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ ์—†์Œ")
429
+ return
430
+
431
+ # ๋กœ์ปฌ ์ €์žฅ
432
+ self.save_local(df)
433
+
434
+ # ํ—ˆ๊น…ํŽ˜์ด์Šค ์—…๋กœ๋“œ
435
+ self.upload_to_huggingface(df)
436
+
437
+ # ์ƒํƒœ ์ €์žฅ
438
+ state['last_crawl_date'] = end_date
439
+ state['last_crawl_time'] = datetime.now().isoformat()
440
+ state['last_count'] = len(df)
441
+ self.save_state(state)
442
+
443
+ print(f"\n{'='*60}")
444
+ print(f"โœ“ ์™„๋ฃŒ! ๋‹ค์Œ ์‹คํ–‰: ๋‚ด์ผ")
445
+ print(f"{'='*60}\n")
446
+
447
+ async def main():
448
+ """๋ฉ”์ธ ํ•จ์ˆ˜"""
449
+ crawler = MinjooAsyncCrawler()
450
+ await crawler.run_incremental()
451
+
452
+ if __name__ == "__main__":
453
+ asyncio.run(main())
ppp_crawler_async.py ADDED
@@ -0,0 +1,446 @@
1
+ #!/usr/bin/env python3
2
+ # -*- coding: utf-8 -*-
3
+ """
4
+ ๊ตญ๋ฏผ์˜ํž˜ ํฌ๋กค๋Ÿฌ - ๊ณ ์„ฑ๋Šฅ ๋น„๋™๊ธฐ ๋ฒ„์ „ + ํ—ˆ๊น…ํŽ˜์ด์Šค ์ž๋™ ์—…๋กœ๋“œ
5
+ - asyncio + aiohttp (10-20๋ฐฐ ๋น ๋ฅธ ์†๋„)
6
+ - ๋™์‹œ ์š”์ฒญ ์ˆ˜ ์ œ์–ด (์„œ๋ฒ„ ๋ถ€๋‹ด ์ตœ์†Œํ™”)
7
+ - ์ฆ๋ถ„ ์—…๋ฐ์ดํŠธ (๋งˆ์ง€๋ง‰ ๋‚ ์งœ ์ดํ›„๋งŒ ํฌ๋กค๋ง)
8
+ - ํ—ˆ๊น…ํŽ˜์ด์Šค ์ž๋™ ์—…๋กœ๋“œ
9
+ - ์ผ ๋‹จ์œ„ ์Šค์ผ€์ค„๋ง
10
+ """
11
+
12
+ import os
13
+ import json
14
+ import re
15
+ import asyncio
16
+ from datetime import datetime, timedelta
17
+ from typing import List, Dict, Optional
18
+ import pandas as pd
19
+ from tqdm.asyncio import tqdm as async_tqdm
20
+ import aiohttp
21
+ from bs4 import BeautifulSoup
22
+ from dotenv import load_dotenv
23
+ from huggingface_hub import HfApi, login
24
+ from datasets import Dataset, load_dataset
25
+
26
+ # .env ํŒŒ์ผ ๋กœ๋“œ
27
+ load_dotenv()
28
+
29
+ class PPPAsyncCrawler:
30
+ def __init__(self, config_path="crawler_config.json"):
31
+ self.base_url = "https://www.peoplepowerparty.kr"
32
+ self.party_name = "๊ตญ๋ฏผ์˜ํž˜"
33
+ self.config_path = config_path
34
+ self.state_path = "crawler_state.json"
35
+
36
+ # ์„ค์ • ๋กœ๋“œ
37
+ self.load_config()
38
+
39
+ # ํ—ˆ๊น…ํŽ˜์ด์Šค ์„ค์ •
40
+ self.hf_token = os.getenv("HF_TOKEN")
41
+ self.hf_repo_id = os.getenv("HF_REPO_ID_PPP", "ppp-press-releases")
42
+
43
+ # ๋™์‹œ ์š”์ฒญ ์ˆ˜ ์ œํ•œ
44
+ self.semaphore = asyncio.Semaphore(20)
45
+
46
+ def load_config(self):
47
+ """์„ค์ • ํŒŒ์ผ ๋กœ๋“œ"""
48
+ default_config = {
49
+ "boards": {
50
+ "๋Œ€๋ณ€์ธ_๋…ผํ‰๋ณด๋„์ž๋ฃŒ": "BBSDD0001",
51
+ "์›๋‚ด_๋ณด๋„์ž๋ฃŒ": "BBSDD0002",
52
+ "๋ฏธ๋””์–ดํŠน์œ„_๋ณด๋„์ž๋ฃŒ": "BBSDD0042"
53
+ },
54
+ "start_date": "2000-03-10",
55
+ "max_pages": 10000,
56
+ "concurrent_requests": 20,
57
+ "request_delay": 0.1,
58
+ "output_path": "./data"
59
+ }
60
+
61
+ if os.path.exists(self.config_path):
62
+ with open(self.config_path, 'r', encoding='utf-8') as f:
63
+ config = json.load(f)
64
+ # ๊ตญ๋ฏผ์˜ํž˜ ์„ค์ •๋งŒ ์ถ”์ถœ
65
+ if 'ppp' in config:
66
+ self.config = config['ppp']
67
+ else:
68
+ self.config = default_config
69
+ else:
70
+ self.config = default_config
71
+
72
+ self.boards = self.config["boards"]
73
+ self.start_date = self.config["start_date"]
74
+ self.max_pages = self.config["max_pages"]
75
+ self.output_path = self.config["output_path"]
76
+
77
+ def load_state(self) -> Dict:
78
+ """ํฌ๋กค๋Ÿฌ ์ƒํƒœ ๋กœ๋“œ"""
79
+ if os.path.exists(self.state_path):
80
+ with open(self.state_path, 'r', encoding='utf-8') as f:
81
+ state = json.load(f)
82
+ return state.get('ppp', {})
83
+ return {}
84
+
85
+ def save_state(self, state: Dict):
86
+ """ํฌ๋กค๋Ÿฌ ์ƒํƒœ ์ €์žฅ"""
87
+ all_state = {}
88
+ if os.path.exists(self.state_path):
89
+ with open(self.state_path, 'r', encoding='utf-8') as f:
90
+ all_state = json.load(f)
91
+
92
+ all_state['ppp'] = state
93
+
94
+ with open(self.state_path, 'w', encoding='utf-8') as f:
95
+ json.dump(all_state, f, ensure_ascii=False, indent=2)
96
+
97
+ @staticmethod
98
+ def parse_date(date_str: str) -> Optional[datetime]:
99
+ """๋‚ ์งœ ํŒŒ์‹ฑ"""
100
+ try:
101
+ return datetime.strptime(date_str.strip(), '%Y-%m-%d')
102
+ except ValueError:
103
+ return None
104
+
105
+ @staticmethod
106
+ def clean_text(text: str) -> str:
107
+ """ํ…์ŠคํŠธ ์ •๋ฆฌ"""
108
+ text = text.replace('\xa0', '').replace('\u200b', '')  # NBSP, zero-width space
109
+ return text.strip()
110
+
111
+ async def fetch_with_retry(self, session: aiohttp.ClientSession, url: str,
112
+ max_retries: int = 3) -> Optional[str]:
113
+ """์žฌ์‹œ๋„ ๋กœ์ง์ด ์žˆ๋Š” ๋น„๋™๊ธฐ ์š”์ฒญ"""
114
+ async with self.semaphore:
115
+ for attempt in range(max_retries):
116
+ try:
117
+ await asyncio.sleep(self.config.get("request_delay", 0.1))
118
+ async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as response:
119
+ if response.status == 200:
120
+ return await response.text()
121
+ except Exception as e:
122
+ if attempt < max_retries - 1:
123
+ await asyncio.sleep(1)
124
+ else:
125
+ return None
126
+ return None
127
+
128
+ async def fetch_list_page(self, session: aiohttp.ClientSession,
129
+ board_id: str, page_num: int,
130
+ start_date: datetime, end_date: datetime) -> tuple:
131
+ """๋ชฉ๋ก ํŽ˜์ด์ง€ ํ•˜๋‚˜ ๊ฐ€์ ธ์˜ค๊ธฐ"""
132
+ url = f"{self.base_url}/news/comment/{board_id}?page={page_num}"
133
+
134
+ html = await self.fetch_with_retry(session, url)
135
+ if not html:
136
+ return [], False
137
+
138
+ soup = BeautifulSoup(html, 'html.parser')
139
+
140
+ table_div = soup.find('div', {'class': 'board-tbl'})
141
+ if not table_div:
142
+ return [], True
143
+
144
+ tbody = table_div.find('tbody')
145
+ if not tbody:
146
+ return [], True
147
+
148
+ rows = tbody.find_all('tr')
149
+ if not rows:
150
+ return [], True
151
+
152
+ data = []
153
+ stop_flag = False
154
+
155
+ for row in rows:
156
+ cols = row.find_all('td')
157
+ if len(cols) < 3:
158
+ continue
159
+
160
+ try:
161
+ no_td = row.find('td', {'class': 'no'})
162
+ class_td = row.find('td', {'class': 'class'})
163
+
164
+ no = no_td.get_text(strip=True) if no_td else cols[0].get_text(strip=True)
165
+ section = class_td.get_text(strip=True) if class_td else cols[1].get_text(strip=True)
166
+
167
+ link_tag = row.find('a')
168
+ if not link_tag:
169
+ continue
170
+
171
+ title = link_tag.get_text(strip=True).replace('\n', ' ')
172
+ article_url = self.base_url + link_tag.get('href', '')
173
+
174
+ # ๋‚ ์งœ ์ถ”์ถœ
175
+ date_str = ""
176
+ if len(cols) >= 4:
177
+ date_str = cols[3].get_text(strip=True)
178
+
179
+ if not date_str or not re.match(r'\d{4}-\d{2}-\d{2}', date_str):
180
+ dd_date = row.find('dd', {'class': 'date'})
181
+ if dd_date:
182
+ span = dd_date.find('span')
183
+ if span:
184
+ span.decompose()
185
+ date_str = dd_date.get_text(strip=True)
186
+
187
+ article_date = self.parse_date(date_str)
188
+
189
+ if not article_date:
190
+ continue
191
+ if article_date < start_date:
192
+ stop_flag = True
193
+ break
194
+ if article_date > end_date:
195
+ continue
196
+
197
+ data.append({
198
+ 'no': no,
199
+ 'section': section,
200
+ 'title': title,
201
+ 'date': date_str,
202
+ 'url': article_url
203
+ })
204
+ except Exception:
205
+ continue
206
+
207
+ return data, stop_flag
208
+
209
+ async def fetch_article_detail(self, session: aiohttp.ClientSession,
210
+ url: str) -> Dict:
211
+ """์ƒ์„ธ ํŽ˜์ด์ง€ ๊ฐ€์ ธ์˜ค๊ธฐ"""
212
+ html = await self.fetch_with_retry(session, url)
213
+ if not html:
214
+ return {'text': "๋ณธ๋ฌธ ์กฐํšŒ ์‹คํŒจ", 'writer': ""}
215
+
216
+ soup = BeautifulSoup(html, 'html.parser')
217
+ text_parts = []
218
+ writer = ""
219
+
220
+ conts_tag = soup.select_one('dd.conts')
221
+
222
+ if conts_tag:
223
+ hwp_div = conts_tag.find('div', {'id': 'hwpEditorBoardContent'})
224
+ if hwp_div:
225
+ hwp_div.decompose()
226
+
227
+ p_tags = conts_tag.find_all('p')
228
+
229
+ for p in p_tags:
230
+ style = p.get('style', '')
231
+ is_center = 'text-align:center' in style.replace(' ', '').lower()
232
+
233
+ raw_text = p.get_text(strip=True)
234
+ cleaned_text = self.clean_text(raw_text)
235
+
236
+ if not cleaned_text:
237
+ continue
238
+
239
+ if is_center:
240
+ if not re.match(r'\d{4}\.\s*\d{1,2}\.\s*\d{1,2}', cleaned_text):
241
+ writer = cleaned_text
242
+ else:
243
+ text_parts.append(cleaned_text)
244
+
245
+ return {'text': '\n'.join(text_parts), 'writer': writer}
246
+
247
+ async def collect_board(self, board_name: str, board_id: str,
248
+ start_date: str, end_date: str) -> List[Dict]:
249
+ """ํ•œ ๊ฒŒ์‹œํŒ ์ „์ฒด ์ˆ˜์ง‘ (๋น„๋™๊ธฐ)"""
250
+ start_dt = datetime.strptime(start_date, '%Y-%m-%d')
251
+ end_dt = datetime.strptime(end_date, '%Y-%m-%d')
252
+
253
+ print(f"\nโ–ถ [{board_name}] ๋ชฉ๋ก ์ˆ˜์ง‘ ์‹œ์ž‘...")
254
+
255
+ headers = {
256
+ 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
257
+ 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
258
+ 'Accept-Language': 'ko-KR,ko;q=0.9',
259
+ }
260
+
261
+ async with aiohttp.ClientSession(headers=headers) as session:
262
+ # 1๋‹จ๊ณ„: ๋ชฉ๋ก ํŽ˜์ด์ง€ ์ˆ˜์ง‘
263
+ all_items = []
264
+ page_num = 1
265
+ empty_pages = 0
266
+ max_empty_pages = 3
267
+
268
+ with async_tqdm(desc=f"[{board_name}] ๋ชฉ๋ก", unit="ํŽ˜์ด์ง€") as pbar:
269
+ while page_num <= self.max_pages:
270
+ items, stop_flag = await self.fetch_list_page(
271
+ session, board_id, page_num, start_dt, end_dt
272
+ )
273
+
274
+ if not items:
275
+ empty_pages += 1
276
+ if empty_pages >= max_empty_pages or stop_flag:
277
+ break
278
+ else:
279
+ empty_pages = 0
280
+ all_items.extend(items)
281
+
282
+ pbar.update(1)
283
+ pbar.set_postfix({"์ˆ˜์ง‘": len(all_items)})
284
+
285
+ if stop_flag:
286
+ break
287
+
288
+ page_num += 1
289
+
290
+ print(f" โœ“ {len(all_items)}๊ฐœ ํ•ญ๋ชฉ ๋ฐœ๊ฒฌ")
291
+
292
+ # 2๋‹จ๊ณ„: ์ƒ์„ธ ํŽ˜์ด์ง€ ์ˆ˜์ง‘ (๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ)
293
+ if all_items:
294
+ print(f" โ–ถ ์ƒ์„ธ ํŽ˜์ด์ง€ ์ˆ˜์ง‘ ์ค‘...")
295
+ tasks = [self.fetch_article_detail(session, item['url']) for item in all_items]
296
+
297
+ details = []
298
+ for coro in async_tqdm(asyncio.as_completed(tasks),
299
+ total=len(tasks),
300
+ desc=f"[{board_name}] ์ƒ์„ธ"):
301
+ detail = await coro
302
+ details.append(detail)
303
+
304
+ # ์ƒ์„ธ ์ •๋ณด ๋ณ‘ํ•ฉ
305
+ for item, detail in zip(all_items, details):
306
+ item.update(detail)
307
+ item['board_name'] = board_name
308
+
309
+ print(f"โœ“ [{board_name}] ์™„๋ฃŒ: {len(all_items)}๊ฐœ")
310
+ return all_items
311
+
312
+ async def collect_all(self, start_date: Optional[str] = None,
313
+ end_date: Optional[str] = None) -> pd.DataFrame:
314
+ """๋ชจ๋“  ๊ฒŒ์‹œํŒ ์ˆ˜์ง‘"""
315
+ if not end_date:
316
+ end_date = datetime.now().strftime('%Y-%m-%d')
317
+ if not start_date:
318
+ start_date = self.start_date
319
+
320
+ print(f"\n{'='*60}")
321
+ print(f"๊ตญ๋ฏผ์˜ํž˜ ๋ณด๋„์ž๋ฃŒ ์ˆ˜์ง‘ - ๋น„๋™๊ธฐ ๊ณ ์„ฑ๋Šฅ ๋ฒ„์ „")
322
+ print(f"๊ธฐ๊ฐ„: {start_date} ~ {end_date}")
323
+ print(f"{'='*60}")
324
+
325
+ # ๋ชจ๋“  ๊ฒŒ์‹œํŒ ๋ณ‘๋ ฌ ์ˆ˜์ง‘
326
+ tasks = [
327
+ self.collect_board(board_name, board_id, start_date, end_date)
328
+ for board_name, board_id in self.boards.items()
329
+ ]
330
+
331
+ results = await asyncio.gather(*tasks)
332
+
333
+ # ๋ฐ์ดํ„ฐ ๊ฒฐํ•ฉ
334
+ all_data = []
335
+ for items in results:
336
+ all_data.extend(items)
337
+
338
+ if not all_data:
339
+ print("\nโš ๏ธ ์ˆ˜์ง‘๋œ ๋ฐ์ดํ„ฐ ์—†์Œ")
340
+ return pd.DataFrame()
341
+
342
+ df = pd.DataFrame(all_data)
343
+ df = df[['board_name', 'no', 'title', 'section', 'date', 'writer', 'text', 'url']]
344
+ df = df[(df['title'] != "") & (df['text'] != "")]
345
+ df['date'] = pd.to_datetime(df['date'], errors='coerce')
346
+
347
+ print(f"\nโœ“ ์ด {len(df)}๊ฐœ ์ˆ˜์ง‘ ์™„๋ฃŒ")
348
+ return df
349
+
350
+ def save_local(self, df: pd.DataFrame):
351
+ """๋กœ์ปฌ์— ์ €์žฅ"""
352
+ os.makedirs(self.output_path, exist_ok=True)
353
+ timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
354
+
355
+ # CSV
356
+ csv_path = os.path.join(self.output_path, f"{self.party_name}_{timestamp}.csv")
357
+ df.to_csv(csv_path, index=False, encoding='utf-8-sig')
358
+
359
+ # Excel
360
+ xlsx_path = os.path.join(self.output_path, f"{self.party_name}_{timestamp}.xlsx")
361
+ df.to_excel(xlsx_path, index=False, engine='openpyxl')
362
+
363
+ print(f"โœ“ CSV: {csv_path}")
364
+ print(f"โœ“ Excel: {xlsx_path}")
365
+
366
+ def upload_to_huggingface(self, df: pd.DataFrame):
367
+ """ํ—ˆ๊น…ํŽ˜์ด์Šค์— ์—…๋กœ๋“œ"""
368
+ if not self.hf_token:
369
+ print("\nโš ๏ธ HF_TOKEN์ด ์„ค์ •๋˜์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค. .env ํŒŒ์ผ์„ ํ™•์ธํ•˜์„ธ์š”.")
370
+ return
371
+
372
+ print(f"\nโ–ถ ํ—ˆ๊น…ํŽ˜์ด์Šค ์—…๋กœ๋“œ ์ค‘... (repo: {self.hf_repo_id})")
373
+
374
+ try:
375
+ login(token=self.hf_token)
376
+ api = HfApi()
377
+
378
+ new_dataset = Dataset.from_pandas(df)
379
+
380
+ # ๊ธฐ์กด ๋ฐ์ดํ„ฐ์…‹ ํ™•์ธ ๋ฐ ๋ณ‘ํ•ฉ
381
+ try:
382
+ existing_dataset = load_dataset(self.hf_repo_id, split='train')
383
+ print(f" โ„น๏ธ ๊ธฐ์กด ๋ฐ์ดํ„ฐ: {len(existing_dataset)}๊ฐœ")
384
+
385
+ existing_df = existing_dataset.to_pandas()
386
+ combined_df = pd.concat([existing_df, df], ignore_index=True)
387
+ combined_df = combined_df.drop_duplicates(subset=['url'], keep='last')
388
+ combined_df = combined_df.sort_values('date', ascending=False).reset_index(drop=True)
389
+
390
+ final_dataset = Dataset.from_pandas(combined_df)
391
+ print(f" โœ“ ๋ณ‘ํ•ฉ ํ›„: {len(final_dataset)}๊ฐœ (์ค‘๋ณต ์ œ๊ฑฐ๋จ)")
392
+ except Exception:
393
+ print(f" โ„น๏ธ ์‹ ๊ทœ ๋ฐ์ดํ„ฐ์…‹ ์ƒ์„ฑ")
394
+ final_dataset = new_dataset
395
+
396
+ final_dataset.push_to_hub(self.hf_repo_id, token=self.hf_token)
397
+ print(f"โœ“ ํ—ˆ๊น…ํŽ˜์ด์Šค ์—…๋กœ๋“œ ์™„๋ฃŒ!")
398
+ print(f" ๐Ÿ”— https://huggingface.co/datasets/{self.hf_repo_id}")
399
+
400
+ except Exception as e:
401
+ print(f"โœ— ์—…๋กœ๋“œ ์‹คํŒจ: {e}")
402
+
403
+ async def run_incremental(self):
404
+ """์ฆ๋ถ„ ์—…๋ฐ์ดํŠธ ์‹คํ–‰"""
405
+ state = self.load_state()
406
+ last_date = state.get('last_crawl_date')
407
+
408
+ if last_date:
409
+ start_date = (datetime.strptime(last_date, '%Y-%m-%d') + timedelta(days=1)).strftime('%Y-%m-%d')
410
+ print(f"๐Ÿ“… ์ฆ๋ถ„ ์—…๋ฐ์ดํŠธ: {start_date} ์ดํ›„ ๋ฐ์ดํ„ฐ๋งŒ ์ˆ˜์ง‘")
411
+ else:
412
+ start_date = self.start_date
413
+ print(f"๐Ÿ“… ์ „์ฒด ์ˆ˜์ง‘: {start_date}๋ถ€ํ„ฐ")
414
+
415
+ end_date = datetime.now().strftime('%Y-%m-%d')
416
+
417
+ # ํฌ๋กค๋ง
418
+ df = await self.collect_all(start_date, end_date)
419
+
420
+ if df.empty:
421
+ print("โœ“ ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ ์—†์Œ")
422
+ return
423
+
424
+ # ๋กœ์ปฌ ์ €์žฅ
425
+ self.save_local(df)
426
+
427
+ # ํ—ˆ๊น…ํŽ˜์ด์Šค ์—…๋กœ๋“œ
428
+ self.upload_to_huggingface(df)
429
+
430
+ # ์ƒํƒœ ์ €์žฅ
431
+ state['last_crawl_date'] = end_date
432
+ state['last_crawl_time'] = datetime.now().isoformat()
433
+ state['last_count'] = len(df)
434
+ self.save_state(state)
435
+
436
+ print(f"\n{'='*60}")
437
+ print(f"โœ“ ์™„๋ฃŒ!")
438
+ print(f"{'='*60}\n")
439
+
440
+ async def main():
441
+ """๋ฉ”์ธ ํ•จ์ˆ˜"""
442
+ crawler = PPPAsyncCrawler()
443
+ await crawler.run_incremental()
444
+
445
+ if __name__ == "__main__":
446
+ asyncio.run(main())
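The merge step inside `upload_to_huggingface` above follows one pandas idiom: append the new crawl to the existing dataset, drop duplicates by `url` keeping the latest crawl, then sort newest-first. A minimal sketch with made-up rows (the URLs and titles here are illustrative, not real data):

```python
import pandas as pd

existing = pd.DataFrame({
    "url": ["https://a/1", "https://a/2"],
    "title": ["old-1", "old-2"],
    "date": pd.to_datetime(["2024-01-01", "2024-01-02"]),
})
new = pd.DataFrame({
    "url": ["https://a/2", "https://a/3"],  # /2 was re-crawled, /3 is brand new
    "title": ["new-2", "new-3"],
    "date": pd.to_datetime(["2024-01-02", "2024-01-03"]),
})

combined = pd.concat([existing, new], ignore_index=True)
# keep="last" means a re-crawled article replaces its older copy
combined = combined.drop_duplicates(subset=["url"], keep="last")
combined = combined.sort_values("date", ascending=False).reset_index(drop=True)

print(combined["title"].tolist())  # ['new-3', 'new-2', 'old-1']
```

Because `keep="last"` prefers the row from the newer crawl, the Hub dataset converges to one up-to-date row per press-release URL across repeated incremental runs.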
rebuilding_crawler_async.py ADDED
@@ -0,0 +1,376 @@
1
+ #!/usr/bin/env python3
2
+ # -*- coding: utf-8 -*-
3
+ """
4
+ ์กฐ๊ตญํ˜์‹ ๋‹น ํฌ๋กค๋Ÿฌ - ๊ณ ์„ฑ๋Šฅ ๋น„๋™๊ธฐ ๋ฒ„์ „ + ํ—ˆ๊น…ํŽ˜์ด์Šค ์ž๋™ ์—…๋กœ๋“œ
5
+ - ๊ธฐ์กด sync(requests) ๋ฐฉ์‹์„ async(aiohttp) ๋กœ ์ „ํ™˜
6
+ - ์ฆ๋ถ„ ์—…๋ฐ์ดํŠธ, ํ—ˆ๊น…ํŽ˜์ด์Šค ์ž๋™ ์—…๋กœ๋“œ
7
+ """
8
+
9
+ import os
10
+ import json
11
+ import re
12
+ import asyncio
13
+ from datetime import datetime, timedelta
14
+ from typing import List, Dict, Optional
15
+ import pandas as pd
16
+ from tqdm.asyncio import tqdm as async_tqdm
17
+ import aiohttp
18
+ from bs4 import BeautifulSoup
19
+ from dotenv import load_dotenv
20
+ from huggingface_hub import HfApi, login
21
+ from datasets import Dataset, load_dataset
22
+
23
+ load_dotenv()
24
+
25
+
26
+ class RebuildingAsyncCrawler:
27
+ def __init__(self, config_path="crawler_config.json"):
28
+ self.base_url = "https://rebuildingkoreaparty.kr"
29
+ self.party_name = "์กฐ๊ตญํ˜์‹ ๋‹น"
30
+ self.config_path = config_path
31
+ self.state_path = "crawler_state.json"
32
+
33
+ self.load_config()
34
+
35
+ self.hf_token = os.getenv("HF_TOKEN")
36
+ self.hf_repo_id = os.getenv("HF_REPO_ID_REBUILDING", "rebuilding-press-releases")
37
+
38
+ self.semaphore = asyncio.Semaphore(10)
39
+
40
+ def load_config(self):
41
+ default_config = {
42
+ "boards": {
43
+ "๊ธฐ์žํšŒ๊ฒฌ๋ฌธ": "news/press-conference",
44
+ "๋…ผํ‰๋ธŒ๋ฆฌํ•‘": "news/commentary-briefing",
45
+ "๋ณด๋„์ž๋ฃŒ": "news/press-release"
46
+ },
47
+ "start_date": "2024-03-04",
48
+ "max_pages": 10000,
49
+ "concurrent_requests": 10,
50
+ "request_delay": 0.5,
51
+ "output_path": "./data"
52
+ }
53
+
54
+ if os.path.exists(self.config_path):
55
+ with open(self.config_path, 'r', encoding='utf-8') as f:
56
+ config = json.load(f)
57
+ self.config = config.get('rebuilding', default_config)
58
+ else:
59
+ self.config = default_config
60
+
61
+ self.boards = self.config["boards"]
62
+ self.start_date = self.config["start_date"]
63
+ self.max_pages = self.config["max_pages"]
64
+ self.output_path = self.config["output_path"]
65
+
66
+ def load_state(self) -> Dict:
67
+ if os.path.exists(self.state_path):
68
+ with open(self.state_path, 'r', encoding='utf-8') as f:
69
+ state = json.load(f)
70
+ return state.get('rebuilding', {})
71
+ return {}
72
+
73
+ def save_state(self, state: Dict):
74
+ all_state = {}
75
+ if os.path.exists(self.state_path):
76
+ with open(self.state_path, 'r', encoding='utf-8') as f:
77
+ all_state = json.load(f)
78
+ all_state['rebuilding'] = state
79
+ with open(self.state_path, 'w', encoding='utf-8') as f:
80
+ json.dump(all_state, f, ensure_ascii=False, indent=2)
81
+
82
+ @staticmethod
83
+ def parse_date(date_str: str) -> Optional[datetime]:
84
+ try:
85
+ return datetime.strptime(date_str.strip(), '%Y-%m-%d')
86
+ except ValueError:
87
+ return None
88
+
89
+ @staticmethod
90
+ def clean_text(text: str) -> str:
91
+ text = text.replace('\xa0', '').replace('\u200b', '')  # NBSP, zero-width space
92
+ return text.strip()
93
+
94
+ async def fetch_with_retry(self, session: aiohttp.ClientSession, url: str,
95
+ max_retries: int = 3) -> Optional[str]:
96
+ async with self.semaphore:
97
+ for attempt in range(max_retries):
98
+ try:
99
+ await asyncio.sleep(self.config.get("request_delay", 0.5))
100
+ async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as response:
101
+ if response.status == 200:
102
+ return await response.text()
103
+ except Exception:
104
+ if attempt < max_retries - 1:
105
+ await asyncio.sleep(1)
106
+ else:
107
+ return None
108
+ return None
109
+
110
+ async def fetch_list_page(self, session: aiohttp.ClientSession,
111
+ board_name: str, board_path: str, page_num: int,
112
+ start_date: datetime, end_date: datetime) -> tuple:
113
+ if page_num == 1:
114
+ url = f"{self.base_url}/{board_path}"
115
+ else:
116
+ url = f"{self.base_url}/{board_path}?page={page_num}"
117
+
118
+ html = await self.fetch_with_retry(session, url)
119
+ if not html:
120
+ return [], False
121
+
122
+ soup = BeautifulSoup(html, 'html.parser')
123
+
124
+ # <a href="/news/{board_path}/..."> ํŒจํ„ด์œผ๋กœ ๊ฒŒ์‹œ๊ธ€ ๋งํฌ ํƒ์ƒ‰
125
+ article_links = soup.find_all('a', href=re.compile(f'^/news/{re.escape(board_path)}/'))
126
+ if not article_links:
127
+ return [], True
128
+
129
+ data = []
130
+ stop_flag = False
131
+ seen_urls = set()
132
+
133
+ for link in article_links:
134
+ try:
135
+ article_url = link.get('href', '')
136
+ if article_url.startswith('/'):
137
+ article_url = self.base_url + article_url
138
+ if article_url in seen_urls:
139
+ continue
140
+ seen_urls.add(article_url)
141
+
142
+ title = link.get_text(strip=True).replace('\n', ' ')
143
+
144
+ # ๊ฐ™์€ <ul> ์•ˆ์—์„œ ๋‚ ์งœยท์นดํ…Œ๊ณ ๋ฆฌ ์ถ”์ถœ
145
+ parent = link.find_parent('ul')
146
+ if not parent:
147
+ parent_li = link.find_parent('li')
148
+ if parent_li:
149
+ parent = parent_li.find_parent('ul')
150
+
151
+ date_str = ""
152
+ category = ""
153
+ if parent:
154
+ date_li = parent.find('li', {'class': 'td date'})
155
+ if date_li:
156
+ date_str = date_li.get_text(strip=True)
157
+ cate_li = parent.find('li', {'class': 'td category'})
158
+ if cate_li:
159
+ category = cate_li.get_text(strip=True)
160
+
161
+ if not date_str:
162
+ continue
163
+
164
+ article_date = self.parse_date(date_str)
165
+ if not article_date:
166
+ continue
167
+ if article_date < start_date:
168
+ stop_flag = True
169
+ break
170
+ if article_date > end_date:
171
+ continue
172
+
173
+ data.append({
174
+ 'board_name': board_name,
175
+ 'category': category,
176
+ 'title': title,
177
+ 'date': date_str,
178
+ 'url': article_url
179
+ })
180
+ except Exception:
181
+ continue
182
+
183
+ return data, stop_flag
184
+
185
+ async def fetch_article_detail(self, session: aiohttp.ClientSession, url: str) -> Dict:
186
+ html = await self.fetch_with_retry(session, url)
187
+ if not html:
188
+ return {'text': "๋ณธ๋ฌธ ์กฐํšŒ ์‹คํŒจ", 'writer': ""}
189
+
190
+ soup = BeautifulSoup(html, 'html.parser')
191
+ text_parts = []
192
+ writer = ""
193
+
194
+ # ๋ณธ๋ฌธ: <div class="editor ck-content"> ์•ˆ์˜ <p> ํƒœ๊ทธ
195
+ contents_div = soup.find('div', {'class': 'editor ck-content'})
196
+ if contents_div:
197
+ paragraphs = contents_div.find_all('p')
198
+ for p in paragraphs:
199
+ cleaned = self.clean_text(p.get_text(strip=True))
200
+ if cleaned:
201
+ text_parts.append(cleaned)
202
+
203
+ # ์ž‘์„ฑ์ž: ๋์ชฝ <p> ์—์„œ ๋‹น๋ช…/๋Œ€๋ณ€์ธ ํฌํ•จ ํ…์ŠคํŠธ
204
+ for p in reversed(paragraphs):
205
+ cleaned = self.clean_text(p.get_text(strip=True))
206
+ if '์กฐ๊ตญํ˜์‹ ๋‹น' in cleaned or '๋Œ€๋ณ€์ธ' in cleaned or '์œ„์›ํšŒ' in cleaned:
207
+ writer = cleaned
208
+ break
209
+
210
+ return {'text': '\n'.join(text_parts), 'writer': writer}
211
+
212
+ async def collect_board(self, board_name: str, board_path: str,
213
+ start_date: str, end_date: str) -> List[Dict]:
214
+ start_dt = datetime.strptime(start_date, '%Y-%m-%d')
215
+ end_dt = datetime.strptime(end_date, '%Y-%m-%d')
216
+
217
+ print(f"\nโ–ถ [{board_name}] ๋ชฉ๋ก ์ˆ˜์ง‘ ์‹œ์ž‘...")
218
+
219
+ headers = {
220
+ 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
221
+ 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
222
+ 'Accept-Language': 'ko-KR,ko;q=0.9',
223
+ }
224
+
225
+ async with aiohttp.ClientSession(headers=headers) as session:
226
+ all_items = []
227
+ page_num = 1
228
+ empty_pages = 0
229
+ max_empty_pages = 3
230
+
231
+ with async_tqdm(desc=f"[{board_name}] ๋ชฉ๋ก", unit="ํŽ˜์ด์ง€") as pbar:
232
+ while page_num <= self.max_pages:
233
+ items, stop_flag = await self.fetch_list_page(
234
+ session, board_name, board_path, page_num, start_dt, end_dt
235
+ )
236
+
237
+ if not items:
238
+ empty_pages += 1
239
+ if empty_pages >= max_empty_pages or stop_flag:
240
+ break
241
+ else:
242
+ empty_pages = 0
243
+ all_items.extend(items)
244
+
245
+ pbar.update(1)
246
+ pbar.set_postfix({"์ˆ˜์ง‘": len(all_items)})
247
+
248
+ if stop_flag:
249
+ break
250
+
251
+ page_num += 1
252
+
253
+ print(f" โœ“ {len(all_items)}๊ฐœ ํ•ญ๋ชฉ ๋ฐœ๊ฒฌ")
254
+
255
+ if all_items:
256
+ print(f" โ–ถ ์ƒ์„ธ ํŽ˜์ด์ง€ ์ˆ˜์ง‘ ์ค‘...")
257
+ tasks = [self.fetch_article_detail(session, item['url']) for item in all_items]
258
+
259
+ details = []
260
+ for coro in async_tqdm(asyncio.as_completed(tasks),
261
+ total=len(tasks),
262
+ desc=f"[{board_name}] ์ƒ์„ธ"):
263
+ detail = await coro
264
+ details.append(detail)
265
+
266
+ for item, detail in zip(all_items, details):
267
+ item.update(detail)
268
+
269
+ print(f"โœ“ [{board_name}] ์™„๋ฃŒ: {len(all_items)}๊ฐœ")
270
+ return all_items
271
+
272
+ async def collect_all(self, start_date: Optional[str] = None,
273
+ end_date: Optional[str] = None) -> pd.DataFrame:
274
+ if not end_date:
275
+ end_date = datetime.now().strftime('%Y-%m-%d')
276
+ if not start_date:
277
+ start_date = self.start_date
278
+
279
+ print(f"\n{'='*60}")
280
+ print(f"์กฐ๊ตญํ˜์‹ ๋‹น ๋ณด๋„์ž๋ฃŒ ์ˆ˜์ง‘ - ๋น„๋™๊ธฐ ๊ณ ์„ฑ๋Šฅ ๋ฒ„์ „")
281
+ print(f"๊ธฐ๊ฐ„: {start_date} ~ {end_date}")
282
+ print(f"{'='*60}")
283
+
284
+ tasks = [
285
+ self.collect_board(board_name, board_path, start_date, end_date)
286
+ for board_name, board_path in self.boards.items()
287
+ ]
288
+ results = await asyncio.gather(*tasks)
289
+
290
+ all_data = []
291
+ for items in results:
292
+ all_data.extend(items)
293
+
294
+ if not all_data:
295
+ print("\nโš ๏ธ ์ˆ˜์ง‘๋œ ๋ฐ์ดํ„ฐ ์—†์Œ")
296
+ return pd.DataFrame()
297
+
298
+ df = pd.DataFrame(all_data)
299
+ df = df[['board_name', 'title', 'category', 'date', 'writer', 'text', 'url']]
300
+ df = df[(df['title'] != "") & (df['text'] != "")]
301
+ df['date'] = pd.to_datetime(df['date'], errors='coerce')
302
+
303
+ print(f"\nโœ“ ์ด {len(df)}๊ฐœ ์ˆ˜์ง‘ ์™„๋ฃŒ")
304
+ return df
305
+
306
+ def save_local(self, df: pd.DataFrame):
307
+ os.makedirs(self.output_path, exist_ok=True)
308
+ timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
309
+ csv_path = os.path.join(self.output_path, f"{self.party_name}_{timestamp}.csv")
310
+ xlsx_path = os.path.join(self.output_path, f"{self.party_name}_{timestamp}.xlsx")
311
+ df.to_csv(csv_path, index=False, encoding='utf-8-sig')
312
+ df.to_excel(xlsx_path, index=False, engine='openpyxl')
313
+ print(f"โœ“ CSV: {csv_path}")
314
+ print(f"โœ“ Excel: {xlsx_path}")
315
+
316
+ def upload_to_huggingface(self, df: pd.DataFrame):
317
+ if not self.hf_token:
318
+ print("\nโš ๏ธ HF_TOKEN์ด ์„ค์ •๋˜์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค.")
319
+ return
320
+
321
+ print(f"\nโ–ถ ํ—ˆ๊น…ํŽ˜์ด์Šค ์—…๋กœ๋“œ ์ค‘... (repo: {self.hf_repo_id})")
322
+ try:
323
+ login(token=self.hf_token)
324
+ new_dataset = Dataset.from_pandas(df)
325
+ try:
326
+ existing_dataset = load_dataset(self.hf_repo_id, split='train')
327
+ existing_df = existing_dataset.to_pandas()
328
+ combined_df = pd.concat([existing_df, df], ignore_index=True)
329
+ combined_df = combined_df.drop_duplicates(subset=['url'], keep='last')
330
+ combined_df = combined_df.sort_values('date', ascending=False).reset_index(drop=True)
331
+ final_dataset = Dataset.from_pandas(combined_df)
332
+ print(f" โœ“ ๋ณ‘ํ•ฉ ํ›„: {len(final_dataset)}๊ฐœ")
333
+ except Exception:
334
+ final_dataset = new_dataset
335
+ print(f" โ„น๏ธ ์‹ ๊ทœ ๋ฐ์ดํ„ฐ์…‹ ์ƒ์„ฑ")
336
+ final_dataset.push_to_hub(self.hf_repo_id, token=self.hf_token)
337
+ print(f"โœ“ ํ—ˆ๊น…ํŽ˜์ด์Šค ์—…๋กœ๋“œ ์™„๋ฃŒ!")
338
+ except Exception as e:
339
+ print(f"โœ— ์—…๋กœ๋“œ ์‹คํŒจ: {e}")
340
+
341
+ async def run_incremental(self):
342
+ state = self.load_state()
343
+ last_date = state.get('last_crawl_date')
344
+
345
+ if last_date:
346
+ start_date = (datetime.strptime(last_date, '%Y-%m-%d') + timedelta(days=1)).strftime('%Y-%m-%d')
347
+ print(f"๐Ÿ“… ์ฆ๋ถ„ ์—…๋ฐ์ดํŠธ: {start_date} ์ดํ›„ ๋ฐ์ดํ„ฐ๋งŒ ์ˆ˜์ง‘")
348
+ else:
349
+ start_date = self.start_date
350
+ print(f"๐Ÿ“… ์ „์ฒด ์ˆ˜์ง‘: {start_date}๋ถ€ํ„ฐ")
351
+
352
+ end_date = datetime.now().strftime('%Y-%m-%d')
353
+ df = await self.collect_all(start_date, end_date)
354
+
355
+ if df.empty:
356
+ print("โœ“ ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ ์—†์Œ")
357
+ return
358
+
359
+ self.save_local(df)
360
+ self.upload_to_huggingface(df)
361
+
362
+ state['last_crawl_date'] = end_date
363
+ state['last_crawl_time'] = datetime.now().isoformat()
364
+ state['last_count'] = len(df)
365
+ self.save_state(state)
366
+
367
+ print(f"\n{'='*60}\nโœ“ ์™„๋ฃŒ!\n{'='*60}\n")
368
+
369
+
370
+ async def main():
371
+ crawler = RebuildingAsyncCrawler()
372
+ await crawler.run_incremental()
373
+
374
+
375
+ if __name__ == "__main__":
376
+ asyncio.run(main())
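The `run_incremental` methods in these crawlers all compute their crawl window the same way: resume from the day after the recorded `last_crawl_date`, or fall back to the configured start date on the first run. A standalone sketch of that logic (the helper name `next_window` is hypothetical, pulled out for illustration):

```python
from datetime import datetime, timedelta

def next_window(state: dict, default_start: str, today: str) -> tuple:
    """Return (start_date, end_date) for an incremental crawl."""
    last = state.get("last_crawl_date")
    if last:
        # resume from the day after the last successful crawl
        start = (datetime.strptime(last, "%Y-%m-%d")
                 + timedelta(days=1)).strftime("%Y-%m-%d")
    else:
        # first run: crawl the full configured range
        start = default_start
    return start, today

print(next_window({"last_crawl_date": "2025-01-31"}, "2024-03-04", "2025-02-10"))
# ('2025-02-01', '2025-02-10')
print(next_window({}, "2024-03-04", "2025-02-10"))
# ('2024-03-04', '2025-02-10')
```

Since the state file is only updated after a successful upload, a failed run leaves `last_crawl_date` untouched and the same window is retried on the next scheduled execution.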
reform_crawler_async.py ADDED
@@ -0,0 +1,358 @@
1
+ #!/usr/bin/env python3
2
+ # -*- coding: utf-8 -*-
3
+ """
4
+ ๊ฐœํ˜์‹ ๋‹น ํฌ๋กค๋Ÿฌ - ๊ณ ์„ฑ๋Šฅ ๋น„๋™๊ธฐ ๋ฒ„์ „ + ํ—ˆ๊น…ํŽ˜์ด์Šค ์ž๋™ ์—…๋กœ๋“œ
5
+ - ๊ทธ๋ˆ„๋ณด๋“œ 5 ๊ธฐ๋ฐ˜ ์‚ฌ์ดํŠธ (reformparty.kr)
6
+ - td.td_subject / td.td_datetime / div#bo_v_con ๊ตฌ์กฐ
7
+ """
8
+
9
+ import os
10
+ import json
11
+ import re
12
+ import asyncio
13
+ from datetime import datetime, timedelta
14
+ from typing import List, Dict, Optional
15
+ import pandas as pd
16
+ from tqdm.asyncio import tqdm as async_tqdm
17
+ import aiohttp
18
+ from bs4 import BeautifulSoup
19
+ from dotenv import load_dotenv
20
+ from huggingface_hub import HfApi, login
21
+ from datasets import Dataset, load_dataset
22
+
23
+ load_dotenv()
24
+
25
+
26
+ class ReformAsyncCrawler:
27
+ def __init__(self, config_path="crawler_config.json"):
28
+ self.base_url = "https://www.reformparty.kr"
29
+ self.party_name = "๊ฐœํ˜์‹ ๋‹น"
30
+ self.config_path = config_path
31
+ self.state_path = "crawler_state.json"
32
+
33
+ self.load_config()
34
+
35
+ self.hf_token = os.getenv("HF_TOKEN")
36
+ self.hf_repo_id = os.getenv("HF_REPO_ID_REFORM", "reform-press-releases")
37
+
38
+ self.semaphore = asyncio.Semaphore(10)
39
+
40
+ def load_config(self):
41
+ default_config = {
42
+ "boards": {
43
+ "๋ณด๋„์ž๋ฃŒ": "press",
44
+ "๋…ผํ‰๋ธŒ๋ฆฌํ•‘": "briefing"
45
+ },
46
+ "start_date": "2024-02-13",
47
+ "max_pages": 10000,
48
+ "concurrent_requests": 10,
49
+ "request_delay": 0.3,
50
+ "output_path": "./data"
51
+ }
52
+
53
+ if os.path.exists(self.config_path):
54
+ with open(self.config_path, 'r', encoding='utf-8') as f:
55
+ config = json.load(f)
56
+ self.config = config.get('reform', default_config)
57
+ else:
58
+ self.config = default_config
59
+
60
+ self.boards = self.config["boards"]
61
+ self.start_date = self.config["start_date"]
62
+ self.max_pages = self.config["max_pages"]
63
+ self.output_path = self.config["output_path"]
64
+
65
+ def load_state(self) -> Dict:
66
+ if os.path.exists(self.state_path):
67
+ with open(self.state_path, 'r', encoding='utf-8') as f:
68
+ state = json.load(f)
69
+ return state.get('reform', {})
70
+ return {}
71
+
72
+ def save_state(self, state: Dict):
73
+ all_state = {}
74
+ if os.path.exists(self.state_path):
75
+ with open(self.state_path, 'r', encoding='utf-8') as f:
76
+ all_state = json.load(f)
77
+ all_state['reform'] = state
78
+ with open(self.state_path, 'w', encoding='utf-8') as f:
79
+ json.dump(all_state, f, ensure_ascii=False, indent=2)
80
+
81
+ @staticmethod
82
+ def parse_date(date_str: str) -> Optional[datetime]:
83
+ """YYYY-MM-DD HH:MM:SS ๋˜๋Š” YYYY-MM-DD ํŒŒ์‹ฑ"""
84
+ try:
85
+ return datetime.strptime(date_str.strip()[:10], '%Y-%m-%d')
86
+ except ValueError:
87
+ return None
88
+
89
+ @staticmethod
90
+ def clean_text(text: str) -> str:
91
+ text = text.replace('\xa0', '').replace('\u200b', '')  # NBSP, zero-width space
92
+ return text.strip()
93
+
94
+ async def fetch_with_retry(self, session: aiohttp.ClientSession, url: str,
95
+ max_retries: int = 3) -> Optional[str]:
96
+ async with self.semaphore:
97
+ for attempt in range(max_retries):
98
+ try:
99
+ await asyncio.sleep(self.config.get("request_delay", 0.3))
100
+ async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as response:
101
+ if response.status == 200:
102
+ return await response.text()
103
+ except Exception:
104
+ if attempt < max_retries - 1:
105
+ await asyncio.sleep(1)
106
+ else:
107
+ return None
108
+ return None
109
+
110
+ async def fetch_list_page(self, session: aiohttp.ClientSession,
111
+ board_name: str, board_slug: str, page_num: int,
112
+ start_date: datetime, end_date: datetime) -> tuple:
113
+ url = f"{self.base_url}/{board_slug}?page={page_num}"
114
+
115
+ html = await self.fetch_with_retry(session, url)
116
+ if not html:
117
+ return [], False
118
+
119
+ soup = BeautifulSoup(html, 'html.parser')
120
+ rows = soup.select('table tbody tr')
121
+ if not rows:
122
+ return [], True
123
+
124
+ data = []
125
+ stop_flag = False
126
+
127
+ for row in rows:
128
+ try:
129
+ # ์ œ๋ชฉยทURL: td.td_subject div.bo_tit a
130
+ title_a = row.select_one('td.td_subject div.bo_tit a')
131
+ if not title_a:
132
+ continue
133
+
134
+ title = title_a.get_text(strip=True)
135
+ href = title_a.get('href', '')
136
+ # page ํŒŒ๋ผ๋ฏธํ„ฐ ์ œ๊ฑฐ ํ›„ ์ ˆ๋Œ€ URL
137
+ article_url = self.base_url + re.sub(r'\?.*$', '', href)
138
+
139
+ # ๏ฟฝ๏ฟฝ์งœ: td.td_datetime (YYYY-MM-DD HH:MM:SS)
140
+ date_td = row.select_one('td.td_datetime')
141
+ if not date_td:
142
+ continue
143
+ date_str = date_td.get_text(strip=True)[:10]
144
+
145
+ # ์นดํ…Œ๊ณ ๋ฆฌ: td.td_cate a.bo_cate
146
+ cate_a = row.select_one('td.td_cate a.bo_cate')
147
+ category = cate_a.get_text(strip=True) if cate_a else ""
148
+
149
+ article_date = self.parse_date(date_str)
150
+ if not article_date:
151
+ continue
152
+ if article_date < start_date:
153
+ stop_flag = True
154
+ break
155
+ if article_date > end_date:
156
+ continue
157
+
158
+ data.append({
159
+ 'board_name': board_name,
160
+ 'title': title,
161
+ 'category': category,
162
+ 'date': date_str,
163
+ 'url': article_url
164
+ })
165
+ except:
166
+ continue
167
+
168
+ return data, stop_flag
169
+
170
+ async def fetch_article_detail(self, session: aiohttp.ClientSession, url: str) -> Dict:
171
+ html = await self.fetch_with_retry(session, url)
172
+ if not html:
173
+ return {'text': "๋ณธ๋ฌธ ์กฐํšŒ ์‹คํŒจ", 'writer': ""}
174
+
175
+ soup = BeautifulSoup(html, 'html.parser')
176
+ text_parts = []
177
+ writer = ""
178
+
179
+ # ๋ณธ๋ฌธ: div#bo_v_con
180
+ contents_div = soup.find('div', id='bo_v_con')
181
+ if contents_div:
182
+ for p in contents_div.find_all('p'):
183
+ cleaned = self.clean_text(p.get_text(strip=True))
184
+ if cleaned:
185
+ text_parts.append(cleaned)
186
+
187
+ # ์ž‘์„ฑ์ž: p.name span.content span.sv_member
188
+ writer_el = soup.select_one('p.name span.content span.sv_member')
189
+ if writer_el:
190
+ writer = writer_el.get_text(strip=True)
191
+
192
+ return {'text': '\n'.join(text_parts), 'writer': writer}
193
+
194
+ async def collect_board(self, board_name: str, board_slug: str,
195
+ start_date: str, end_date: str) -> List[Dict]:
196
+ start_dt = datetime.strptime(start_date, '%Y-%m-%d')
197
+ end_dt = datetime.strptime(end_date, '%Y-%m-%d')
198
+
199
+ print(f"\nโ–ถ [{board_name}] ๋ชฉ๋ก ์ˆ˜์ง‘ ์‹œ์ž‘...")
200
+
201
+ headers = {
202
+ 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
203
+ 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
204
+ 'Accept-Language': 'ko-KR,ko;q=0.9',
205
+ }
206
+
207
+ async with aiohttp.ClientSession(headers=headers) as session:
208
+ all_items = []
209
+ page_num = 1
210
+ empty_pages = 0
211
+ max_empty_pages = 3
212
+
213
+ with async_tqdm(desc=f"[{board_name}] ๋ชฉ๋ก", unit="ํŽ˜์ด์ง€") as pbar:
214
+ while page_num <= self.max_pages:
215
+ items, stop_flag = await self.fetch_list_page(
216
+ session, board_name, board_slug, page_num, start_dt, end_dt
217
+ )
218
+
219
+ if not items:
220
+ empty_pages += 1
221
+ if empty_pages >= max_empty_pages or stop_flag:
222
+ break
223
+ else:
224
+ empty_pages = 0
225
+ all_items.extend(items)
226
+
227
+ pbar.update(1)
228
+ pbar.set_postfix({"์ˆ˜์ง‘": len(all_items)})
229
+
230
+ if stop_flag:
231
+ break
232
+
233
+ page_num += 1
234
+
235
+ print(f" โœ“ {len(all_items)}๊ฐœ ํ•ญ๋ชฉ ๋ฐœ๊ฒฌ")
236
+
237
+ if all_items:
238
+ print(f" โ–ถ ์ƒ์„ธ ํŽ˜์ด์ง€ ์ˆ˜์ง‘ ์ค‘...")
239
+ tasks = [self.fetch_article_detail(session, item['url']) for item in all_items]
240
+
241
+ details = []
242
+ for coro in async_tqdm(asyncio.as_completed(tasks),
243
+ total=len(tasks),
244
+ desc=f"[{board_name}] ์ƒ์„ธ"):
245
+ detail = await coro
246
+ details.append(detail)
247
+
248
+ for item, detail in zip(all_items, details):
249
+ item.update(detail)
250
+
251
+ print(f"โœ“ [{board_name}] ์™„๋ฃŒ: {len(all_items)}๊ฐœ")
252
+ return all_items
253
+
254
+ async def collect_all(self, start_date: Optional[str] = None,
255
+ end_date: Optional[str] = None) -> pd.DataFrame:
256
+ if not end_date:
257
+ end_date = datetime.now().strftime('%Y-%m-%d')
258
+ if not start_date:
259
+ start_date = self.start_date
260
+
261
+ print(f"\n{'='*60}")
262
+ print(f"๊ฐœํ˜์‹ ๋‹น ๋ณด๋„์ž๋ฃŒ ์ˆ˜์ง‘ - ๋น„๋™๊ธฐ ๊ณ ์„ฑ๋Šฅ ๋ฒ„์ „")
263
+ print(f"๊ธฐ๊ฐ„: {start_date} ~ {end_date}")
264
+ print(f"{'='*60}")
265
+
266
+ tasks = [
267
+ self.collect_board(board_name, board_slug, start_date, end_date)
268
+ for board_name, board_slug in self.boards.items()
269
+ ]
270
+ results = await asyncio.gather(*tasks)
271
+
272
+ all_data = []
273
+ for items in results:
274
+ all_data.extend(items)
275
+
276
+ if not all_data:
277
+ print("\nโš ๏ธ ์ˆ˜์ง‘๋œ ๋ฐ์ดํ„ฐ ์—†์Œ")
278
+ return pd.DataFrame()
279
+
280
+ df = pd.DataFrame(all_data)
281
+ df = df[['board_name', 'title', 'category', 'date', 'writer', 'text', 'url']]
282
+ df = df[(df['title'] != "") & (df['text'] != "")]
283
+ df['date'] = pd.to_datetime(df['date'], errors='coerce')
284
+
285
+ print(f"\nโœ“ ์ด {len(df)}๊ฐœ ์ˆ˜์ง‘ ์™„๋ฃŒ")
286
+ return df
287
+
288
+ def save_local(self, df: pd.DataFrame):
289
+ os.makedirs(self.output_path, exist_ok=True)
290
+ timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
291
+ csv_path = os.path.join(self.output_path, f"{self.party_name}_{timestamp}.csv")
292
+ xlsx_path = os.path.join(self.output_path, f"{self.party_name}_{timestamp}.xlsx")
293
+ df.to_csv(csv_path, index=False, encoding='utf-8-sig')
294
+ df.to_excel(xlsx_path, index=False, engine='openpyxl')
295
+ print(f"โœ“ CSV: {csv_path}")
296
+ print(f"โœ“ Excel: {xlsx_path}")
297
+
298
+ def upload_to_huggingface(self, df: pd.DataFrame):
299
+ if not self.hf_token:
300
+ print("\nโš ๏ธ HF_TOKEN์ด ์„ค์ •๋˜์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค.")
301
+ return
302
+
303
+ print(f"\nโ–ถ ํ—ˆ๊น…ํŽ˜์ด์Šค ์—…๋กœ๋“œ ์ค‘... (repo: {self.hf_repo_id})")
304
+ try:
305
+ login(token=self.hf_token)
306
+ new_dataset = Dataset.from_pandas(df)
307
+ try:
308
+ existing_dataset = load_dataset(self.hf_repo_id, split='train')
309
+ existing_df = existing_dataset.to_pandas()
310
+ combined_df = pd.concat([existing_df, df], ignore_index=True)
311
+ combined_df = combined_df.drop_duplicates(subset=['url'], keep='last')
312
+ combined_df = combined_df.sort_values('date', ascending=False).reset_index(drop=True)
313
+ final_dataset = Dataset.from_pandas(combined_df)
314
+ print(f" โœ“ ๋ณ‘ํ•ฉ ํ›„: {len(final_dataset)}๊ฐœ")
315
+ except:
316
+ final_dataset = new_dataset
317
+ print(f" โ„น๏ธ ์‹ ๊ทœ ๋ฐ์ดํ„ฐ์…‹ ์ƒ์„ฑ")
318
+ final_dataset.push_to_hub(self.hf_repo_id, token=self.hf_token)
319
+ print(f"โœ“ ํ—ˆ๊น…ํŽ˜์ด์Šค ์—…๋กœ๋“œ ์™„๋ฃŒ!")
320
+ except Exception as e:
321
+ print(f"โœ— ์—…๋กœ๋“œ ์‹คํŒจ: {e}")
322
+
323
+ async def run_incremental(self):
324
+ state = self.load_state()
325
+ last_date = state.get('last_crawl_date')
326
+
327
+ if last_date:
328
+ start_date = (datetime.strptime(last_date, '%Y-%m-%d') + timedelta(days=1)).strftime('%Y-%m-%d')
329
+ print(f"๐Ÿ“… ์ฆ๋ถ„ ์—…๋ฐ์ดํŠธ: {start_date} ์ดํ›„ ๋ฐ์ดํ„ฐ๋งŒ ์ˆ˜์ง‘")
330
+ else:
331
+ start_date = self.start_date
332
+ print(f"๐Ÿ“… ์ „์ฒด ์ˆ˜์ง‘: {start_date}๋ถ€ํ„ฐ")
333
+
334
+ end_date = datetime.now().strftime('%Y-%m-%d')
335
+ df = await self.collect_all(start_date, end_date)
336
+
337
+ if df.empty:
338
+ print("โœ“ ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ ์—†์Œ")
339
+ return
340
+
341
+ self.save_local(df)
342
+ self.upload_to_huggingface(df)
343
+
344
+ state['last_crawl_date'] = end_date
345
+ state['last_crawl_time'] = datetime.now().isoformat()
346
+ state['last_count'] = len(df)
347
+ self.save_state(state)
348
+
349
+ print(f"\n{'='*60}\nโœ“ ์™„๋ฃŒ!\n{'='*60}\n")
350
+
351
+
352
+ async def main():
353
+ crawler = ReformAsyncCrawler()
354
+ await crawler.run_incremental()
355
+
356
+
357
+ if __name__ == "__main__":
358
+ asyncio.run(main())
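Two of the parsing helpers above are easy to sanity-check in isolation. The sketch below mirrors `parse_date` and the list-page URL normalization; `BASE_URL` and `normalize_article_url` are hypothetical stand-ins for illustration, not names from the crawler:

```python
import re
from datetime import datetime

# Hypothetical base URL standing in for the crawler's self.base_url
BASE_URL = "https://www.example-party.kr"

def parse_date(date_str):
    # Same approach as ReformAsyncCrawler.parse_date: keep only the
    # YYYY-MM-DD prefix so both "2024-06-01" and "2024-06-01 10:30:00" parse
    try:
        return datetime.strptime(date_str.strip()[:10], '%Y-%m-%d')
    except (ValueError, AttributeError):
        return None

def normalize_article_url(href):
    # Same approach as the list-page parser: drop the query string
    # (e.g. ?page=3) and prepend the base URL
    return BASE_URL + re.sub(r'\?.*$', '', href)

print(parse_date("2024-06-01 10:30:00").date())   # 2024-06-01
print(parse_date("not a date"))                   # None
print(normalize_article_url("/press/123?page=4"))
# https://www.example-party.kr/press/123
```

Truncating to the first ten characters before `strptime` is what lets one parser handle both date formats the board emits.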
requirements.txt ADDED
@@ -0,0 +1,21 @@
+ # Web crawling
+ aiohttp==3.9.1
+ beautifulsoup4==4.12.2
+ lxml==5.1.0
+
+ # Data processing
+ pandas==2.1.4
+ openpyxl==3.1.2
+
+ # Hugging Face
+ huggingface-hub==0.20.2
+ datasets==2.16.1
+
+ # Scheduling
+ APScheduler==3.10.4
+
+ # Environment variables
+ python-dotenv==1.0.0
+
+ # Progress bars
+ tqdm==4.66.1
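Exact pins like these can drift from what is actually installed; the stdlib `importlib.metadata` module can verify them. This is an illustrative sketch only — the `PINS` subset and the `check_pins` helper are hypothetical, not part of the repo:

```python
from importlib import metadata

# A few of the pins from requirements.txt (illustrative subset)
PINS = {"aiohttp": "3.9.1", "pandas": "2.1.4", "tqdm": "4.66.1"}

def check_pins(pins):
    # Compare installed package versions against the pinned ones
    report = {}
    for pkg, wanted in pins.items():
        try:
            installed = metadata.version(pkg)
            report[pkg] = "ok" if installed == wanted else f"installed {installed}"
        except metadata.PackageNotFoundError:
            report[pkg] = "missing"
    return report

for pkg, status in check_pins(PINS).items():
    print(f"{pkg}=={PINS[pkg]}: {status}")
```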
run_once.bat ADDED
@@ -0,0 +1,14 @@
+ @echo off
+ chcp 65001 > nul
+ echo ============================================
+ echo Democratic Party crawler - single run
+ echo ============================================
+ echo.
+
+ python minjoo_crawler_async.py
+
+ echo.
+ echo ============================================
+ echo Done!
+ echo ============================================
+ pause
run_ppp.bat ADDED
@@ -0,0 +1,14 @@
+ @echo off
+ chcp 65001 > nul
+ echo ============================================
+ echo People Power Party crawler - single run
+ echo ============================================
+ echo.
+
+ python ppp_crawler_async.py
+
+ echo.
+ echo ============================================
+ echo Done!
+ echo ============================================
+ pause
run_scheduler.bat ADDED
@@ -0,0 +1,16 @@
+ @echo off
+ chcp 65001 > nul
+ echo ============================================
+ echo Democratic Party crawler - scheduled run
+ echo ============================================
+ echo.
+ echo Runs the crawl automatically every day at 9 AM.
+ echo Press Ctrl+C to stop.
+ echo.
+ echo Log file: crawler_scheduler.log
+ echo ============================================
+ echo.
+
+ python scheduler.py
+
+ pause
run_unified.bat ADDED
@@ -0,0 +1,15 @@
+ @echo off
+ chcp 65001 > nul
+ echo ============================================
+ echo Unified party crawler - single run
+ echo (Democratic Party, People Power Party, Rebuilding Korea Party, Reform Party, Basic Income Party, Progressive Party)
+ echo ============================================
+ echo.
+
+ python unified_crawler.py
+
+ echo.
+ echo ============================================
+ echo Done!
+ echo ============================================
+ pause
run_unified_scheduler.bat ADDED
@@ -0,0 +1,17 @@
+ @echo off
+ chcp 65001 > nul
+ echo ============================================
+ echo Unified party crawler - scheduled run
+ echo (Democratic Party, People Power Party, Rebuilding Korea Party, Reform Party, Basic Income Party, Progressive Party)
+ echo ============================================
+ echo.
+ echo Runs the crawl automatically every day at 9 AM.
+ echo Press Ctrl+C to stop.
+ echo.
+ echo Log file: unified_scheduler.log
+ echo ============================================
+ echo.
+
+ python unified_scheduler.py
+
+ pause
scheduler.py ADDED
@@ -0,0 +1,71 @@
+ #!/usr/bin/env python3
+ # -*- coding: utf-8 -*-
+ """
+ Democratic Party crawler scheduler
+ - Runs automatically every day at a set time
+ - Supports background execution
+ - Writes a log file
+ """
+
+ import asyncio
+ import logging
+ from datetime import datetime
+ from apscheduler.schedulers.asyncio import AsyncIOScheduler
+ from apscheduler.triggers.cron import CronTrigger
+ from minjoo_crawler_async import MinjooAsyncCrawler
+
+ # Logging setup
+ logging.basicConfig(
+     level=logging.INFO,
+     format='%(asctime)s [%(levelname)s] %(message)s',
+     handlers=[
+         logging.FileHandler('crawler_scheduler.log', encoding='utf-8'),
+         logging.StreamHandler()
+     ]
+ )
+
+ logger = logging.getLogger(__name__)
+
+ async def scheduled_task():
+     """The scheduled job"""
+     logger.info("="*60)
+     logger.info("Scheduled crawl starting")
+     logger.info("="*60)
+
+     try:
+         crawler = MinjooAsyncCrawler()
+         await crawler.run_incremental()
+         logger.info("Crawl finished")
+     except Exception as e:
+         logger.error(f"Crawl failed: {e}", exc_info=True)
+
+ def main():
+     """Scheduler entry point"""
+     scheduler = AsyncIOScheduler()
+
+     # Run every day at 9 AM
+     scheduler.add_job(
+         scheduled_task,
+         trigger=CronTrigger(hour=9, minute=0),
+         id='daily_crawl',
+         name='Democratic Party crawler daily run',
+         replace_existing=True
+     )
+
+     # Run once immediately (for testing)
+     # scheduler.add_job(scheduled_task, 'date', run_date=datetime.now())
+
+     logger.info("Scheduler started")
+     logger.info("Crawl runs every day at 9 AM")
+     logger.info("Press Ctrl+C to stop")
+
+     scheduler.start()
+
+     try:
+         # Keep the event loop running
+         asyncio.get_event_loop().run_forever()
+     except (KeyboardInterrupt, SystemExit):
+         logger.info("Scheduler stopped")
+
+ if __name__ == "__main__":
+     main()
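`CronTrigger(hour=9, minute=0)` fires once per day at 09:00. The next-fire-time behaviour it implements can be approximated with a small stdlib sketch; `next_daily_run` is a hypothetical helper for illustration, not part of APScheduler or this repo:

```python
from datetime import datetime, timedelta

def next_daily_run(now, hour=9, minute=0):
    # Next occurrence of hour:minute strictly after `now`,
    # approximating CronTrigger(hour=9, minute=0)
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate

print(next_daily_run(datetime(2024, 6, 1, 10, 0)))  # 2024-06-02 09:00:00
print(next_daily_run(datetime(2024, 6, 1, 8, 0)))   # 2024-06-01 09:00:00
```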
setup.bat ADDED
@@ -0,0 +1,49 @@
+ @echo off
+ chcp 65001 > nul
+ echo ============================================
+ echo Democratic Party crawler setup
+ echo ============================================
+ echo.
+
+ echo [1/3] Checking Python version...
+ python --version
+ if errorlevel 1 (
+     echo ❌ Python is not installed.
+     echo Install Python from https://www.python.org/downloads/
+     pause
+     exit /b 1
+ )
+ echo ✓ Python found
+ echo.
+
+ echo [2/3] Installing dependencies...
+ pip install -r requirements.txt
+ if errorlevel 1 (
+     echo ❌ Failed to install dependencies
+     pause
+     exit /b 1
+ )
+ echo ✓ Dependencies installed
+ echo.
+
+ echo [3/3] Setting up environment variables...
+ if not exist .env (
+     copy .env.example .env
+     echo ✓ .env file created.
+     echo ⚠️ Open the .env file and set HF_TOKEN!
+     echo You can create a token at https://huggingface.co/settings/tokens
+ ) else (
+     echo ℹ️ .env file already exists.
+ )
+ echo.
+
+ echo ============================================
+ echo ✓ Setup complete!
+ echo ============================================
+ echo.
+ echo Next steps:
+ echo 1. Open the .env file and set HF_TOKEN
+ echo 2. Run run_once.bat for a single run
+ echo 3. Run run_scheduler.bat for daily automatic runs
+ echo.
+ pause
unified_crawler.py ADDED
@@ -0,0 +1,83 @@
+ #!/usr/bin/env python3
+ # -*- coding: utf-8 -*-
+ """
+ Unified party crawler
+ - Crawls the Democratic Party, People Power Party, Rebuilding Korea Party, Reform Party, Basic Income Party, and Progressive Party concurrently
+ - Uploads each party to its own Hugging Face dataset
+ - Asynchronous parallel processing
+
+ ※ Use main.py if you need CLI arguments.
+ """
+
+ import asyncio
+ import logging
+ from datetime import datetime
+
+ from minjoo_crawler_async import MinjooAsyncCrawler
+ from ppp_crawler_async import PPPAsyncCrawler
+ from rebuilding_crawler_async import RebuildingAsyncCrawler
+ from reform_crawler_async import ReformAsyncCrawler
+ from basic_income_crawler_async import BasicIncomeAsyncCrawler
+ from jinbo_crawler_async import JinboAsyncCrawler
+
+ logging.basicConfig(
+     level=logging.INFO,
+     format='%(asctime)s [%(levelname)s] %(message)s',
+     handlers=[
+         logging.FileHandler('unified_crawler.log', encoding='utf-8'),
+         logging.StreamHandler()
+     ]
+ )
+ logger = logging.getLogger(__name__)
+
+ CRAWLERS = {
+     '더불어민주당': MinjooAsyncCrawler,
+     '국민의힘': PPPAsyncCrawler,
+     '조국혁신당': RebuildingAsyncCrawler,
+     '개혁신당': ReformAsyncCrawler,
+     '기본소득당': BasicIncomeAsyncCrawler,
+     '진보당': JinboAsyncCrawler,
+ }
+
+
+ async def crawl_all_parties():
+     """Crawl all six parties concurrently"""
+     logger.info("=" * 60)
+     logger.info("Unified party crawler starting")
+     logger.info(" + ".join(CRAWLERS.keys()))
+     logger.info("=" * 60)
+
+     start_time = datetime.now()
+
+     crawlers = [cls() for cls in CRAWLERS.values()]
+     party_names = list(CRAWLERS.keys())
+
+     results = await asyncio.gather(
+         *[crawler.run_incremental() for crawler in crawlers],
+         return_exceptions=True
+     )
+
+     for party, result in zip(party_names, results):
+         if isinstance(result, Exception):
+             logger.error(f"{party} crawl failed: {result}")
+         else:
+             logger.info(f"{party} crawl finished")
+
+     duration = (datetime.now() - start_time).total_seconds()
+     logger.info("=" * 60)
+     logger.info("All crawls finished")
+     logger.info(f"Elapsed: {duration:.1f} s ({duration / 60:.1f} min)")
+     logger.info("=" * 60)
+
+
+ # Kept for backward compatibility with the earlier two-party version
+ async def crawl_both_parties():
+     await crawl_all_parties()
+
+
+ async def main():
+     await crawl_all_parties()
+
+
+ if __name__ == "__main__":
+     asyncio.run(main())
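The `return_exceptions=True` flag in `crawl_all_parties` is what keeps one failing party from aborting the other five: failures come back as exception objects in the results list instead of propagating. A minimal self-contained sketch of the same pattern (the coroutine and party names here are made up):

```python
import asyncio

async def ok_crawl():
    return "done"

async def failing_crawl():
    raise RuntimeError("site unreachable")

async def run_all():
    # With return_exceptions=True, the failing coroutine does not cancel
    # the other one; its exception is placed in the results list instead
    return await asyncio.gather(ok_crawl(), failing_crawl(), return_exceptions=True)

results = asyncio.run(run_all())
for name, result in zip(["party_a", "party_b"], results):
    if isinstance(result, Exception):
        print(f"{name} failed: {result}")
    else:
        print(f"{name} finished: {result}")
```

Without the flag, `asyncio.gather` would re-raise the `RuntimeError` and the per-party success/failure loop in `crawl_all_parties` would never run.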
unified_scheduler.py ADDED
@@ -0,0 +1,60 @@
+ #!/usr/bin/env python3
+ # -*- coding: utf-8 -*-
+ """
+ Unified party crawler scheduler
+ - Crawls the Democratic Party, People Power Party, Rebuilding Korea Party, Reform Party, Basic Income Party, and Progressive Party automatically every day
+ """
+
+ import asyncio
+ import logging
+ from apscheduler.schedulers.asyncio import AsyncIOScheduler
+ from apscheduler.triggers.cron import CronTrigger
+ from unified_crawler import crawl_all_parties
+
+ logging.basicConfig(
+     level=logging.INFO,
+     format='%(asctime)s [%(levelname)s] %(message)s',
+     handlers=[
+         logging.FileHandler('unified_scheduler.log', encoding='utf-8'),
+         logging.StreamHandler()
+     ]
+ )
+ logger = logging.getLogger(__name__)
+
+
+ async def scheduled_task():
+     logger.info("=" * 60)
+     logger.info("Scheduled crawl starting (6 parties)")
+     logger.info("=" * 60)
+     try:
+         await crawl_all_parties()
+         logger.info("Scheduled crawl finished")
+     except Exception as e:
+         logger.error(f"Crawl failed: {e}", exc_info=True)
+
+
+ def main():
+     scheduler = AsyncIOScheduler()
+
+     scheduler.add_job(
+         scheduled_task,
+         trigger=CronTrigger(hour=9, minute=0),
+         id='daily_crawl_all',
+         name='Unified party crawler daily run',
+         replace_existing=True
+     )
+
+     logger.info("Unified party crawler scheduler started")
+     logger.info("All 6 parties are crawled every day at 9 AM")
+     logger.info("Press Ctrl+C to stop")
+
+     scheduler.start()
+
+     try:
+         asyncio.get_event_loop().run_forever()
+     except (KeyboardInterrupt, SystemExit):
+         logger.info("Scheduler stopped")
+
+
+ if __name__ == "__main__":
+     main()