# 🧠 DeBERTa-v3-Base Code Quality Classifier
|
|
A fine-tuned **DeBERTa-v3-base** model trained to classify **clean** vs. **buggy** code using the CodeXGLUE Defect Detection dataset.
This model is designed specifically for **dataset filtering** to improve downstream **code language model training** (e.g., Qwen2.5-Coder).
|
|
---
|
|
## 📌 Model Summary
|
|
This classifier predicts whether a given code snippet is **non-defective** (label `0`) or **buggy** (label `1`).
The output probabilities are used to **rank samples by quality** and select the highest-quality subset, as in the sketch below.
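
A minimal filtering sketch: the local checkpoint path `filter_model` and the keep-top-half ratio are illustrative assumptions, not fixed parts of the pipeline.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# "filter_model" is an assumed local path to the fine-tuned checkpoint.
tokenizer = AutoTokenizer.from_pretrained("filter_model")
model = AutoModelForSequenceClassification.from_pretrained("filter_model")
model.eval()

def clean_score(code: str) -> float:
    """Return P(label = 0): the predicted probability that the snippet is clean."""
    inputs = tokenizer(code, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 0].item()

# Rank candidates by clean probability and keep the best-scoring half.
snippets = ["int add(int a, int b) { return a + b; }"]
ranked = sorted(snippets, key=clean_score, reverse=True)
kept = ranked[: max(1, len(ranked) // 2)]
```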
|
|
This model is part of a research pipeline analyzing how **data quality affects token-level performance** in generative code models.
|
|
---
|
|
## 🎯 Expected Result
|
|
Early benchmarking of the tuned DeBERTa filter suggests roughly a 5% improvement in downstream perplexity.
|
|
## 🧱 Model Description
|
|
### Architecture
- Base model: **microsoft/deberta-v3-base**
- Task: binary sequence classification
- Labels (mapping sketch below):
  - `0` = clean code
  - `1` = buggy / defective code
- Max sequence length: 512 tokens
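
A small sketch of making that label mapping explicit when loading the base model; the `id2label` strings are descriptive choices, not names baked into the checkpoint.

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base",
    num_labels=2,
    id2label={0: "clean", 1: "buggy"},   # assumed human-readable names
    label2id={"clean": 0, "buggy": 1},
)
```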
|
|
### Purpose
This model is intended for:
- Dataset quality filtering
- Improving generative model training stability
- Research on LLM token quality and perplexity
- Understanding the effects of removing noisy samples
|
|
This model is **not** intended for real-world bug detection or vulnerability scanning.
|
|
---
|
|
## 📚 Dataset
|
|
### Training Dataset
**CodeXGLUE Defect Detection**
(https://huggingface.co/datasets/code_x_glue_cc_defect_detection)

- `"func"`: raw function-level source code
- `"target"`: binary label (0 = clean, 1 = buggy)
- ~21,000 training examples (loading sketch below)
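
A quick loading-and-inspection sketch with 🤗 `datasets`; the `int()` cast is defensive, in case the label is stored as a boolean on the Hub.

```python
from datasets import load_dataset

ds = load_dataset("code_x_glue_cc_defect_detection")
print(ds)                       # splits and example counts

example = ds["train"][0]
print(example["func"][:120])    # raw function-level source code
print(int(example["target"]))   # 0 = clean, 1 = buggy
```
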
### Preprocessing
- Tokenized with the DeBERTa-v3-base tokenizer
- Truncated to 512 tokens
- Padded dynamically using `DataCollatorWithPadding` (sketched below)
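
A preprocessing sketch under the assumption that `ds` comes from the loading snippet above; the `tokenize` helper name is ours.

```python
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

def tokenize(batch):
    # Truncate to the 512-token limit; padding is deferred to the collator.
    return tokenizer(batch["func"], truncation=True, max_length=512)

train_ds = ds["train"].map(tokenize, batched=True)
# Trainer expects an integer "labels" column.
train_ds = train_ds.map(lambda b: {"labels": [int(t) for t in b["target"]]}, batched=True)

# Pads each batch to its longest member instead of a fixed length.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
```
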
---
## 🧪 Training Procedure
### Hyperparameters

| Hyperparameter | Value |
|----------------|-------|
| Epochs | 1 |
| Learning rate | 2e-5 |
| Batch size | 8 |
| FP16 | Yes |
| Max length | 512 tokens |
| Optimizer | AdamW |
| Loss | Cross-entropy |
| `remove_unused_columns` | `False` |
### Training Code Snippet

```python
from transformers import AutoModelForSequenceClassification, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=2
)

training_args = TrainingArguments(
    output_dir="filter_model",
    learning_rate=2e-5,
    num_train_epochs=1,
    fp16=True,
    per_device_train_batch_size=8,
    logging_steps=50,
    remove_unused_columns=False,
)
```
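
For completeness, a sketch of wiring the pieces together with `Trainer`, assuming `train_ds` and `data_collator` from the preprocessing sketch above:

```python
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    data_collator=data_collator,
)
trainer.train()
```

With integer labels, `AutoModelForSequenceClassification` applies cross-entropy loss by default, matching the hyperparameter table above.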