| --- |
| tags: |
| - ml-intern |
| --- |
| # VulnIA V2 — Java Security Dataset Scraper |
|
|
| End-to-end pipeline for scraping, extracting, cleaning, and preparing a Java dataset for continual fine-tuning of CodeBERT V2. |
|
|
| ## Objective |
|
|
| - V1 labels (0-65) are kept unchanged |
| - New labels: CRYPTO=66, SAFE=67 |
| - 4 target families: XXE, Deserialization, Crypto, SAFE |
| - At least 1,500 examples per family |
| - SHA256 deduplication, confidence score ≥ 0.75, no leakage |
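
The deduplication and confidence rules listed above can be sketched as follows. This is a minimal illustration, not the repository's actual implementation: the field names (`code`, `label`, `confidence`), the whitespace normalization step, and the function names are assumptions made for the example. Only the SHA256 hashing and the 0.75 threshold come from the list.

```python
import hashlib

# Threshold taken from the README; everything else in this sketch is assumed.
CONFIDENCE_THRESHOLD = 0.75


def normalize(code: str) -> str:
    """Collapse whitespace so trivially reformatted copies hash identically (assumed step)."""
    return " ".join(code.split())


def sha256_of(code: str) -> str:
    """Hex SHA256 digest of the normalized snippet."""
    return hashlib.sha256(normalize(code).encode("utf-8")).hexdigest()


def dedup_and_filter(samples):
    """Keep the first occurrence of each snippet whose confidence meets the threshold."""
    seen = set()
    kept = []
    for sample in samples:
        if sample["confidence"] < CONFIDENCE_THRESHOLD:
            continue  # drop low-confidence samples
        digest = sha256_of(sample["code"])
        if digest in seen:
            continue  # drop duplicates of an already-kept snippet
        seen.add(digest)
        kept.append({**sample, "sha256": digest})
    return kept


# Hypothetical samples: two copies of one snippet and one low-confidence snippet.
samples = [
    {"code": 'Cipher.getInstance("DES");', "label": 66, "confidence": 0.9},
    {"code": 'Cipher.getInstance("DES");  ', "label": 66, "confidence": 0.8},
    {"code": 'MessageDigest.getInstance("MD5");', "label": 66, "confidence": 0.5},
]
clean = dedup_and_filter(samples)
```

Hashing a normalized form rather than the raw text is a design choice: it catches copies that differ only in spacing, at the cost of treating whitespace inside string literals as insignificant.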
|
|
| ## Structure |
|
|
| ``` |
| vulnia-v2/ |
| ├── config/ |
| │ └── scraping_queries.yaml # GitHub queries per family/subtype |
| ├── scraper/ |
| │ ├── __init__.py |
| │ ├── constants.py # API patterns, CWE mapping |
| │ ├── utils.py # Java code cleaning, hashing, leakage checks |
| │ ├── ghsa_client.py # GHSA API client + synthetic fallback |
| │ ├── reference_extractor.py # GitHub URL extraction |
| │ ├── code_fetcher.py # Fetches .java files |
| │ ├── fence_extractor.py # Code fence extraction |
| │ ├── reporter.py # Quality report |
| │ └── stages/ # Pipeline scripts |
| ├── run_pipeline.py # Unified orchestrator |
| ├── scrape_vulnia_v2.py # Main pipeline (synthetic + cleaning) |
| ├── verify_vulnia_v2.py # Final dataset verification |
| ├── vulnia_v2_label_mapping.json # Label mapping 0-67 |
| ├── requirements.txt |
| └── README.md |
| ``` |
|
|
| ## Installation |
|
|
| ```bash |
| git clone https://huggingface.co/MaryamEl/vulnia-v2-scraper |
| cd vulnia-v2-scraper |
| pip install -r requirements.txt |
| ``` |
|
|
| ## Usage |
|
|
| ### 1. Full pipeline (with a GitHub token) |
|
|
| ```bash |
| export GITHUB_TOKEN="ghp_xxxxxxxx" |
| python run_pipeline.py --stage all |
| ``` |
|
|
| ### 2. Individual stages |
|
|
| ```bash |
| python run_pipeline.py --stage collect-advisories |
| python run_pipeline.py --stage dedup-advisories |
| python run_pipeline.py --stage extract-advisory-references |
| python run_pipeline.py --stage fetch-advisory-code |
| python run_pipeline.py --stage extract-advisory-code-fences |
| python run_pipeline.py --stage advisory-report |
| ``` |
|
|
| ### 3. Dry-run mode (no token) |
|
|
| ```bash |
| python run_pipeline.py --stage all --dry-run |
| ``` |
|
|
| ### 4. Synthetic dataset generation only |
|
|
| ```bash |
| python scrape_vulnia_v2.py --skip-github |
| ``` |
|
|
| ### 5. Final dataset verification |
|
|
| ```bash |
| python verify_vulnia_v2.py --dataset output/vulnia_v2_balanced.csv |
| ``` |
|
|
| ## Outputs |
|
|
| | File | Description | |
| |---------|-------------| |
| | `output/vulnia_v2_balanced.csv` | Main training-ready dataset | |
| | `output/train.csv` | 70% stratified split | |
| | `output/val.csv` | 15% stratified split | |
| | `output/test.csv` | 15% stratified split | |
| | `output/test_safe_heavy.csv` | Special test set of SAFE hard negatives | |
| | `data/processed/advisories_deduplicated.csv` | **Seed only**: do NOT use for training | |
| | `reports/advisory_seed_quality_report.md` | Readiness verdict | |
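
For readers who want to reproduce the 70/15/15 stratified split outside the pipeline, a standard-library sketch is shown here. It is an illustration under stated assumptions, not the code used by `scrape_vulnia_v2.py`: the row fields (`code`, `label`), the seed, and the helper names `stratified_split` and `leaked_hashes` are hypothetical.

```python
import hashlib
import random
from collections import defaultdict


def stratified_split(rows, ratios=(0.70, 0.15, 0.15), seed=42):
    """Split rows into train/val/test, keeping each label's share roughly constant."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row["label"]].append(row)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = round(len(group) * ratios[0])
        n_val = round(len(group) * ratios[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # remainder goes to test
    return train, val, test


def code_digest(row):
    """SHA256 digest of a row's code, used to compare splits."""
    return hashlib.sha256(row["code"].encode("utf-8")).hexdigest()


def leaked_hashes(train, held_out):
    """Digests of snippets that appear in both the training split and a held-out split."""
    return {code_digest(r) for r in train} & {code_digest(r) for r in held_out}


# Hypothetical rows: 100 unique snippets for each of labels 66 and 67.
rows = [{"code": f"class C{i} {{}}", "label": 66 + (i % 2)} for i in range(200)]
train, val, test = stratified_split(rows)
```

Splitting per label rather than over the whole dataset keeps small families represented in every split, and an empty result from `leaked_hashes(train, test)` is one simple way to express the "no leakage" requirement.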
|
|
| ## Label Mapping |
|
|
| - V1 labels: 0-65 (unchanged) |
| - CRYPTO: 66 |
| - SAFE: 67 |
|
|
| See `vulnia_v2_label_mapping.json` for the full mapping. |
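
A quick consistency check on the mapping might look like the sketch below. The schema of `vulnia_v2_label_mapping.json` is not documented here, so the example assumes a flat `{name: id}` dictionary and uses placeholder V1 names; `mapping_problems` is a hypothetical helper, not part of the repository.

```python
# Ids 0-65 come from V1; 66 and 67 are the new CRYPTO and SAFE labels.
EXPECTED_IDS = set(range(68))
NEW_LABELS = {"CRYPTO": 66, "SAFE": 67}


def mapping_problems(name_to_id):
    """Return a list of problems; an empty list means the mapping looks consistent."""
    problems = []
    ids = list(name_to_id.values())
    if len(ids) != len(set(ids)):
        problems.append("duplicate label ids")
    if set(ids) != EXPECTED_IDS:
        problems.append("ids do not cover exactly 0-67")
    for name, expected in NEW_LABELS.items():
        if name_to_id.get(name) != expected:
            problems.append(f"{name} should map to {expected}")
    return problems


# Placeholder V1 names (assumed); the real names live in vulnia_v2_label_mapping.json.
mapping = {f"V1_LABEL_{i}": i for i in range(66)}
mapping.update(NEW_LABELS)
problems = mapping_problems(mapping)
```

With a real file, the dictionary would be loaded via `json.load` and adapted to whatever structure the file actually uses.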
|
|
| ## Author |
|
|
| ML Intern, VulnIA project |
|
|
| <!-- ml-intern-provenance --> |
| ## Generated by ML Intern |
|
|
| This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub. |
|
|
| - Try ML Intern: https://smolagents-ml-intern.hf.space |
| - Source code: https://github.com/huggingface/ml-intern |
|
|