---
license: apache-2.0
tags:
- code-generation
- differential-privacy
- continued-pretraining
- lora
library_name: peft
---

# CodeDP-CPT Models V2

LoRA adapters from continued pretraining (CPT) on code, with and without differential privacy (DP-SGD), across 7 model families.

## Models Included

Each model family is trained in multiple variants:

- `base` / `base_attn` — CPT without DP (no privacy)
- `dp3` / `dp3_attn` — DP-SGD with ε=3 (strong privacy)
- `dp8` / `dp8_attn` — DP-SGD with ε=8 (moderate privacy)
- `*_v2` — re-runs with improved hyperparameters (LR=5e-4, 5 epochs, min_lr_ratio=0.15)

### Model Families

| Family | Variants | Base Model |
|--------|----------|------------|
| `starcoder2-7b` | base, dp3, dp8 | `bigcode/starcoder2-7b` |
| `llama3-8b` | base, dp3, dp8, dp8_v2 | `meta-llama/Meta-Llama-3-8B` |
| `llama3.1-8b` | dp3, dp8 | `meta-llama/Llama-3.1-8B` |
| `llama3.2-3b` | base, dp3, dp8 | `meta-llama/Llama-3.2-3B` |
| `qwen3-8b-base` | base, dp3, dp8, dp3_v2, dp8_v2 | `Qwen/Qwen3-8B-Base` |
| `granite-4.0-h-tiny` | base_attn, dp3_attn, dp8_attn | `ibm-granite/granite-4.0-h-tiny-base` |
| `qwen1.5-moe-a2.7b` | dp3_attn, base_attn_v2, dp3_attn_v2, dp8_attn_v2 | `Qwen/Qwen1.5-MoE-A2.7B` |

Total: **24 LoRA adapters**

## Training Data

All adapters are trained on `melihcatal/codedp-cpt` — a code corpus with embedded canary secrets for DP auditing and membership inference (MIA) evaluation. A sketch for scoring canaries against a loaded model is included under "Usage Sketches" below.

## Directory Structure

Each variant directory contains:

```
<family>/<variant>/
├── adapter/                 # Final LoRA adapter (PEFT format)
│   ├── adapter_config.json
│   ├── adapter_model.safetensors
│   └── README.md
├── tokenizer/               # Tokenizer (may include added canary tokens)
├── resolved_config.yaml     # Training configuration
├── metrics.jsonl            # Training metrics per step
├── train.log                # Training log
├── canary_meta.json         # Canary metadata for MIA evaluation
├── summary.json             # Run summary
├── audit_results.json       # DP audit results
└── audit_scores.npz         # DP audit raw scores
```

To fetch a single variant without cloning the whole repo, see the download sketch under "Usage Sketches" below.

## Loading a Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder2-7b",
    dtype="bfloat16",
)

# Load tokenizer (important: uses trained tokenizer with canary tokens)
tokenizer = AutoTokenizer.from_pretrained(
    "melihcatal/codedp-cpt-models-v2",
    subfolder="starcoder2-7b/base/tokenizer",
)

# Resize embeddings to match tokenizer
base_model.resize_token_embeddings(len(tokenizer))

# Load LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    "melihcatal/codedp-cpt-models-v2",
    subfolder="starcoder2-7b/base/adapter",
)
```

A short generation example follows under "Usage Sketches" below.

## Notes

- **Qwen1.5-MoE requires the `--model hf` backend** (lm-eval / transformers). vLLM's MoE routing produces incorrect output for this model; see the lm-eval sketch below.
- **DP collapse at 8B scale**: the Llama-3-8B, Llama-3.1-8B, and Qwen3-8B DP variants collapse to 0% on HumanEval, while the StarCoder2-7B, Granite-tiny, and Llama-3.2-3B DP variants retain utility.
- All DP runs target ε=3 or ε=8 with δ=1e-5 (see the accountant sketch below for how such targets map to a noise multiplier).

## Evaluation

Evaluated on:

- **HumanEval** (`openai_humaneval`) — basic code completion
- **CodeDP-FC** (`melihcatal/codedp-bench-fc-cpt-v2`) — in-domain function completion
- **BigCodeBench** (`bigcode/bigcodebench`) — library-heavy code generation
- **Canary MIA** (`codedp-ase26/codedp-bench-canary-mia`) — membership inference attack

## Citation

```bibtex
@misc{codedp-cpt-models-v2,
  title={CodeDP-CPT: Differentially Private Continued Pretraining for Code Models},
  author={Catal, Melih},
  year={2026},
  url={https://huggingface.co/melihcatal/codedp-cpt-models-v2},
}
```
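## Usage Sketches

The snippets below are minimal sketches, not released tooling; any path, key, or hyperparameter value not shown elsewhere on this card is an assumption to adapt to your setup.

### Downloading a single variant

Because every variant lives in its own subdirectory, `huggingface_hub.snapshot_download` with an `allow_patterns` filter can fetch one adapter (plus its tokenizer and logs) without pulling all 24. The `starcoder2-7b/base` path follows the directory layout above.

```python
import json
from pathlib import Path

from huggingface_hub import snapshot_download

# Fetch only the starcoder2-7b/base variant from the repo.
local_dir = snapshot_download(
    repo_id="melihcatal/codedp-cpt-models-v2",
    allow_patterns=["starcoder2-7b/base/*"],
)

# metrics.jsonl holds one JSON object per training step.
metrics_path = Path(local_dir) / "starcoder2-7b" / "base" / "metrics.jsonl"
metrics = [json.loads(line) for line in metrics_path.read_text().splitlines()]
print(f"{len(metrics)} logged steps")
```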
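### Generating with a loaded adapter

Continuing from the variables set up in "Loading a Model" above (`model`, `tokenizer`), generation goes through the standard transformers API; the prompt is illustrative.

```python
import torch

prompt = "def fibonacci(n):\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=False,  # greedy decoding for reproducible completions
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```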
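### Scoring canaries for a loss-based MIA signal

The training corpus embeds canary secrets, and each run ships a `canary_meta.json`. Its schema is not documented on this card, so the `"canaries"` key below is an assumption; the scoring function itself just computes mean token cross-entropy under the loaded model (`model` / `tokenizer` from "Loading a Model"), a standard loss-based membership-inference signal where lower loss suggests memorization.

```python
import json

import torch

def sequence_loss(model, tokenizer, text: str) -> float:
    """Mean token-level cross-entropy of `text` under `model`."""
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return out.loss.item()

# Assumed schema -- inspect canary_meta.json for the actual layout.
with open("starcoder2-7b/base/canary_meta.json") as f:
    meta = json.load(f)
for canary in meta.get("canaries", []):
    print(f"{sequence_loss(model, tokenizer, canary):.3f}  {canary[:40]!r}")
```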
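### Evaluating Qwen1.5-MoE with the hf backend

As noted above, Qwen1.5-MoE must be evaluated through the transformers (`hf`) backend rather than vLLM. A hedged sketch using lm-eval's Python entry point: the local adapter path is an assumption (download the variant first so `peft=` points at a directory containing `adapter_config.json`), and recent lm-eval releases require confirming unsafe-code execution for HumanEval.

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # transformers backend; vLLM mis-routes this MoE model
    model_args=(
        "pretrained=Qwen/Qwen1.5-MoE-A2.7B,"
        "peft=/path/to/qwen1.5-moe-a2.7b/dp3_attn/adapter,"  # assumed local path
        "dtype=bfloat16"
    ),
    tasks=["humaneval"],
    confirm_run_unsafe_code=True,  # HumanEval executes model-generated code
)
print(results["results"]["humaneval"])
```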
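### From (ε, δ) targets to a noise multiplier

An (ε, δ) target such as ε=3, δ=1e-5 only determines the DP-SGD noise multiplier once the sampling rate and training length are fixed. A sketch with Opacus' accountant utilities; the batch and dataset sizes are illustrative assumptions (the actual values for each run are recorded in its `resolved_config.yaml`).

```python
from opacus.accountants.utils import get_noise_multiplier

sample_rate = 256 / 100_000  # assumed batch_size / dataset_size
epochs = 5                   # matches the *_v2 re-runs; other runs may differ

for target_epsilon in (3.0, 8.0):
    sigma = get_noise_multiplier(
        target_epsilon=target_epsilon,
        target_delta=1e-5,
        sample_rate=sample_rate,
        epochs=epochs,
        accountant="rdp",
    )
    print(f"eps={target_epsilon}: noise multiplier sigma={sigma:.2f}")
```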