---
tags:
- ml-intern
---

# VulnIA V2: Java Security Dataset Scraper

End-to-end pipeline for scraping, extracting, cleaning, and preparing a Java dataset for continual fine-tuning of CodeBERT V2.

## Objective

- V1 labels (0-65) are kept unchanged
- New labels: CRYPTO=66, SAFE=67
- 4 target families: XXE, Deserialization, Crypto, SAFE
- At least 1500 examples per family
- SHA256 deduplication, confidence score ≥ 0.75, no data leakage

## Structure

```
vulnia-v2/
├── config/
│   └── scraping_queries.yaml       # GitHub queries per family/subtype
├── scraper/
│   ├── __init__.py
│   ├── constants.py                # API patterns, CWE mapping
│   ├── utils.py                    # Java code cleaning, hashing, leakage checks
│   ├── ghsa_client.py              # GHSA API client + synthetic fallback
│   ├── reference_extractor.py      # GitHub URL extraction
│   ├── code_fetcher.py             # Fetches .java files
│   ├── fence_extractor.py          # Code fence extraction
│   ├── reporter.py                 # Quality report
│   └── stages/                     # Pipeline scripts
├── run_pipeline.py                 # Unified orchestrator
├── scrape_vulnia_v2.py             # Main pipeline (synthetic + cleaning)
├── verify_vulnia_v2.py             # Final dataset verification
├── vulnia_v2_label_mapping.json    # Label mapping 0-67
├── requirements.txt
└── README.md
```

## Installation

```bash
git clone https://huggingface.co/MaryamEl/vulnia-v2-scraper
cd vulnia-v2-scraper
pip install -r requirements.txt
```

## Usage

### 1. Full pipeline (with a GitHub token)

```bash
export GITHUB_TOKEN="ghp_xxxxxxxx"
python run_pipeline.py --stage all
```

### 2. Individual stages

```bash
python run_pipeline.py --stage collect-advisories
python run_pipeline.py --stage dedup-advisories
python run_pipeline.py --stage extract-advisory-references
python run_pipeline.py --stage fetch-advisory-code
python run_pipeline.py --stage extract-advisory-code-fences
python run_pipeline.py --stage advisory-report
```

### 3. Dry-run mode (no token)

```bash
python run_pipeline.py --stage all --dry-run
```

### 4. Synthetic dataset generation only

```bash
python scrape_vulnia_v2.py --skip-github
```

### 5. Final dataset verification

```bash
python verify_vulnia_v2.py --dataset output/vulnia_v2_balanced.csv
```

## Outputs

| File | Description |
|------|-------------|
| `output/vulnia_v2_balanced.csv` | Main training-ready dataset |
| `output/train.csv` | 70% stratified split |
| `output/val.csv` | 15% stratified split |
| `output/test.csv` | 15% stratified split |
| `output/test_safe_heavy.csv` | Special test set focused on SAFE hard negatives |
| `data/processed/advisories_deduplicated.csv` | **Seed-only**: do NOT use for training |
| `reports/advisory_seed_quality_report.md` | Readiness verdict |

## Label Mapping

- V1 labels: 0-65 (unchanged)
- CRYPTO: 66
- SAFE: 67

See `vulnia_v2_label_mapping.json` for the full mapping.

## Author

ML Intern, VulnIA project

## Generated by ML Intern

This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern
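
## Appendix: deduplication and confidence filter (illustrative sketch)

The Objective section calls for SHA256 deduplication and a confidence score of at least 0.75. The snippet below is a minimal sketch of how those two rules could be combined. It is not the code in `scraper/utils.py`: the function names, the row keys (`code`, `label`, `confidence`), the whitespace normalisation step, and the sample rows are all assumptions made for illustration.

```python
import hashlib

CONFIDENCE_THRESHOLD = 0.75  # threshold stated in the Objective section


def code_hash(code: str) -> str:
    """SHA256 of a snippet after stripping blank lines and indentation.

    The normalisation is an assumption; the real pipeline may hash differently.
    """
    lines = (line.strip() for line in code.strip().splitlines())
    normalized = "\n".join(line for line in lines if line)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedup_and_filter(rows):
    """Keep the first occurrence of each snippet that meets the threshold."""
    seen = set()
    kept = []
    for row in rows:
        if row["confidence"] < CONFIDENCE_THRESHOLD:
            continue  # below the confidence threshold
        digest = code_hash(row["code"])
        if digest in seen:
            continue  # same normalised code already kept
        seen.add(digest)
        kept.append(row)
    return kept


# Hypothetical rows: the second is a whitespace variant of the first,
# the third falls below the confidence threshold.
rows = [
    {"code": 'Cipher.getInstance("DES");', "label": 66, "confidence": 0.90},
    {"code": '  Cipher.getInstance("DES");  ', "label": 66, "confidence": 0.80},
    {"code": 'MessageDigest.getInstance("SHA-256");', "label": 67, "confidence": 0.60},
]
print(len(dedup_and_filter(rows)))
```

Hashing a normalised form rather than the raw text means that copies differing only in indentation collapse to one entry, which is usually what is wanted when the same file is scraped from several sources.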
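
## Appendix: stratified 70/15/15 split (illustrative sketch)

The Outputs table lists stratified `train.csv`, `val.csv`, and `test.csv` files at 70%, 15%, and 15%. As a rough sketch of what a stratified split does, the snippet below splits each label group separately so that every split keeps the same label proportions. The function name, the fixed seed, and the in-memory row format are assumptions; the repository may well use a library routine instead.

```python
import random
from collections import defaultdict


def stratified_split(rows, ratios=(0.70, 0.15, 0.15), seed=42):
    """Split rows into train/val/test, preserving per-label proportions."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row["label"]].append(row)

    rng = random.Random(seed)  # fixed seed for reproducible splits
    train, val, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = round(len(group) * ratios[0])
        n_val = round(len(group) * ratios[1])
        train.extend(group[:n_train])
        val.extend(group[n_train:n_train + n_val])
        test.extend(group[n_train + n_val:])  # remainder goes to test
    return train, val, test


# Hypothetical data: 100 rows for each of the two new labels.
rows = [
    {"code": f"snippet_{label}_{i}", "label": label}
    for label in (66, 67)
    for i in range(100)
]
train, val, test = stratified_split(rows)
print(len(train), len(val), len(test))
```

Because every row lands in exactly one of the three lists, the splits are disjoint by construction; checking for leakage of near-duplicate snippets would still require comparing content hashes across splits.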