VulnIA V2 — Java Security Dataset Scraper
End-to-end pipeline for scraping, extracting, cleaning, and preparing a Java dataset for continual fine-tuning of CodeBERT V2.
Goal
- V1 labels (0-65) kept unchanged
- New labels: CRYPTO=66, SAFE=67
- 4 target families: XXE, Deserialization, Crypto, SAFE
- At least 1500 examples per family
- SHA256 deduplication, confidence score ≥ 0.75, no data leakage
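The deduplication and confidence criteria above could be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the `code` and `confidence` field names, the `normalize_java` helper, and its normalization rules are assumptions. Only the use of SHA256 and the 0.75 threshold come from this README.

```python
import hashlib
import re

MIN_CONFIDENCE = 0.75  # threshold stated in this README


def normalize_java(code: str) -> str:
    """Hypothetical normalizer: drop comments and collapse whitespace.

    Deliberately naive: comment markers inside string literals
    (for example "http://...") are not handled.
    """
    code = re.sub(r"/\*.*?\*/", " ", code, flags=re.DOTALL)  # block comments
    code = re.sub(r"//[^\n]*", " ", code)  # line comments
    return re.sub(r"\s+", " ", code).strip()


def code_hash(code: str) -> str:
    """SHA256 hex digest of the normalized snippet."""
    return hashlib.sha256(normalize_java(code).encode("utf-8")).hexdigest()


def dedup_and_filter(samples):
    """Keep the first occurrence of each snippet whose confidence is >= 0.75.

    `samples` is assumed to be a list of dicts with `code` and `confidence` keys.
    """
    seen = set()
    kept = []
    for sample in samples:
        if sample["confidence"] < MIN_CONFIDENCE:
            continue
        digest = code_hash(sample["code"])
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(sample)
    return kept
```

Hashing the normalized text rather than the raw text means two snippets that differ only in comments or spacing count as duplicates. Whether the real pipeline normalizes before hashing is not stated here.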
Structure
vulnia-v2/
├── config/
│ └── scraping_queries.yaml # GitHub queries per family/subtype
├── scraper/
│ ├── __init__.py
│ ├── constants.py # API patterns, CWE mapping
│ ├── utils.py # Java code cleaning, hashing, leakage checks
│ ├── ghsa_client.py # GHSA API client + synthetic fallback
│ ├── reference_extractor.py # GitHub URL extraction
│ ├── code_fetcher.py # Fetches .java files
│ ├── fence_extractor.py # Code fence extraction
│ ├── reporter.py # Quality report
│ └── stages/ # Pipeline scripts
├── run_pipeline.py # Unified orchestrator
├── scrape_vulnia_v2.py # Main pipeline (synthetic + cleaning)
├── verify_vulnia_v2.py # Final dataset verification
├── vulnia_v2_label_mapping.json # Label mapping 0-67
├── requirements.txt
└── README.md
Installation
git clone https://huggingface.co/MaryamEl/vulnia-v2-scraper
cd vulnia-v2-scraper
pip install -r requirements.txt
Usage
1. Full pipeline (with GitHub token)
export GITHUB_TOKEN="ghp_xxxxxxxx"
python run_pipeline.py --stage all
2. Individual stages
python run_pipeline.py --stage collect-advisories
python run_pipeline.py --stage dedup-advisories
python run_pipeline.py --stage extract-advisory-references
python run_pipeline.py --stage fetch-advisory-code
python run_pipeline.py --stage extract-advisory-code-fences
python run_pipeline.py --stage advisory-report
3. Dry-run mode (no token)
python run_pipeline.py --stage all --dry-run
4. Synthetic dataset generation only
python scrape_vulnia_v2.py --skip-github
5. Final dataset verification
python verify_vulnia_v2.py --dataset output/vulnia_v2_balanced.csv
Outputs
| File | Description |
|---|---|
| output/vulnia_v2_balanced.csv | Main training-ready dataset |
| output/train.csv | 70% stratified split |
| output/val.csv | 15% stratified split |
| output/test.csv | 15% stratified split |
| output/test_safe_heavy.csv | Special test set of SAFE hard negatives |
| data/processed/advisories_deduplicated.csv | Seed-only: DO NOT use for training |
| reports/advisory_seed_quality_report.md | Readiness verdict |
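A 70/15/15 stratified split with a leakage check, as described by the outputs above, might look like the sketch below. It uses only the standard library and is an illustration under assumptions: the `label` and `code_hash` column names, the fixed seed, and the helper names are hypothetical and are not taken from the repository.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key="label", ratios=(0.70, 0.15, 0.15), seed=42):
    """Split rows into train/val/test while keeping each label's proportions.

    `rows` is assumed to be a list of dicts; `label_key` names the label column.
    """
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = round(len(group) * ratios[0])
        n_val = round(len(group) * ratios[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # remainder goes to test
    return train, val, test


def has_leakage(split_a, split_b, key="code_hash"):
    """True if any hash value appears in both splits."""
    return bool({r[key] for r in split_a} & {r[key] for r in split_b})
```

Splitting per label keeps small families such as CRYPTO and SAFE represented in every split, and comparing hashes across splits is one simple way to confirm that no snippet appears on both sides.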
Label Mapping
- V1 labels: 0-65 (unchanged)
- CRYPTO: 66
- SAFE: 67
See vulnia_v2_label_mapping.json for the full mapping.
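A quick sanity check of the mapping could be sketched as below. The schema of vulnia_v2_label_mapping.json is not documented here, so this assumes a flat JSON object of label name to integer ID. That layout, and the `load_mapping` and `check_label_mapping` helpers, are hypothetical.

```python
import json

# New V2 labels as stated in this README.
EXPECTED_NEW_LABELS = {"CRYPTO": 66, "SAFE": 67}
NUM_LABELS = 68  # IDs 0-67


def load_mapping(path):
    """Load the mapping, assuming a flat {"NAME": id} JSON object."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def check_label_mapping(mapping):
    """Return a list of problems; an empty list means the mapping looks consistent."""
    problems = []
    if sorted(mapping.values()) != list(range(NUM_LABELS)):
        problems.append("label IDs are not exactly 0..67")
    for name, expected in EXPECTED_NEW_LABELS.items():
        if mapping.get(name) != expected:
            problems.append(f"{name} should map to {expected}")
    return problems
```

If the file does follow that layout, `check_label_mapping(load_mapping("vulnia_v2_label_mapping.json"))` should return an empty list.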
Author
ML Intern, VulnIA Project
Generated by ML Intern
This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern