# CodeLlama-7B PR Review LoRA Adapter (v1)
A QLoRA-fine-tuned LoRA adapter for CodeLlama-7B-Instruct, trained to generate inline code review comments on Python pull requests. This is the v1 prototype from a larger project that ships the model behind a FastAPI + GitHub App + Kubernetes deployment.
⚠️ This is a v1 prototype. It catches some code review patterns (e.g., missing context managers) but misses others (e.g., SQL injection). See the Evaluation section for honest failure modes. Not production-ready. A v2 trained with Unsloth on more data is planned; see the project repo.
Project repo: https://github.com/Zenlyst/PR_Review_AI
## Model Details
| Field | Value |
|---|---|
| Base model | codellama/CodeLlama-7b-Instruct-hf |
| Adapter type | LoRA (via PEFT) |
| Quantization at training | 4-bit NF4 with double quantization (QLoRA) |
| Parameter count (trainable) | |
| Adapter size | ~20–30 MB |
| Task | Code review comment generation from before/after Python code pairs |
| Language scope | Python only |
| Intended use | Research / portfolio demonstration. Not production. |
## LoRA Configuration
| Hyperparameter | Value |
|---|---|
| Rank (r) | 16 |
| Alpha | 32 |
| Dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj |
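This configuration maps onto a PEFT `LoraConfig` roughly as follows (a sketch: `bias` and `task_type` are common defaults for causal-LM LoRA fine-tuning, not values stated on this card):

```python
from peft import LoraConfig

# LoRA hyperparameters from the table above
lora_config = LoraConfig(
    r=16,                    # rank
    lora_alpha=32,           # scaling factor (alpha / r = 2.0)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",             # assumed default, not stated on this card
    task_type="CAUSAL_LM",   # assumed default, not stated on this card
)
```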
## Intended Use
This adapter is intended for:
- Research on code review generation with small open-source LLMs
- Portfolio demonstration of the end-to-end QLoRA fine-tuning pipeline (data filtering → training → eval → adapter export → deployment)
- Educational exploration of how fine-tuned 7B models compare to frontier APIs on a domain-specific task
It is not intended for:
- Production code review (the model misses critical security issues like SQL injection; see evaluation below)
- Languages other than Python
- Replacing human code review
- Any use where reviewer omissions could cause harm
## Training Data
Dataset: `ronantakizawa/github-codereview`, 355K+ real human code review comments from public GitHub PRs.
### Filtering applied
| Filter | Value | Rationale |
|---|---|---|
| Language | `language == "Python"` (file-level) | Focused scope for v1 |
| Quality score | >= 0.5 | Drop noisy low-quality reviews |
| Code length | 5–200 lines (both before and after) | Remove trivial and very large diffs |
| Token length | <= 2048 after prompt formatting | Fit T4 VRAM + training efficiency |
| Sample limit | 10,000 | Keep Colab T4 training under ~13 h |
Both positive examples (real reviews) and negative examples (`is_negative=True`, "No issues found") were kept so the model learns when code is fine and doesn't need a comment.
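As a sketch, the row-level filters above can be expressed as a single predicate and applied with `datasets.Dataset.filter`. The field names (`language`, `quality_score`, `before_code`, `after_code`) are assumptions about the dataset schema, and the token-length filter is applied separately after prompt formatting:

```python
def keep_example(ex):
    """Predicate implementing the v1 row-level filters from the table above.

    Field names are assumed; the <= 2048 token filter runs later, once the
    example has been rendered into the prompt template.
    """
    python_file = ex["language"] == "Python"
    good_quality = ex["quality_score"] >= 0.5
    sane_length = all(
        5 <= len(code.splitlines()) <= 200
        for code in (ex["before_code"], ex["after_code"])
    )
    return python_file and good_quality and sane_length

# Usage with Hugging Face datasets (sketch):
# from datasets import load_dataset
# ds = load_dataset("ronantakizawa/github-codereview", split="train")
# ds = ds.filter(keep_example).shuffle(seed=42).select(range(10_000))
```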
## Training Configuration
| Parameter | Value |
|---|---|
| Base model | CodeLlama-7B-Instruct |
| Method | QLoRA (4-bit NF4 + double quantization) |
| Training samples | 10,000 |
| Epochs | 1 |
| Max sequence length | 2,048 tokens |
| Per-device batch size | 1 |
| Gradient accumulation steps | 8 (effective batch size = 8) |
| Learning rate | 2e-4 |
| LR scheduler | Cosine, 50 warmup steps |
| Weight decay | 0.01 |
| Precision | FP16 |
| Packing | Enabled (multiple short samples per sequence) |
| Evaluation | Once per epoch (not per step, for speed) |
| GPU | Google Colab Pro T4 (16GB VRAM) |
| Training time | ~13 hours |
| Training stack | Transformers + PEFT + TRL (transformers==4.47.1, trl==0.17.0) |
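The table corresponds to a TRL `SFTConfig` roughly like the following (a sketch: the `output_dir` is made up here, and some argument names have shifted across TRL versions, so check the `trl==0.17.0` docs before reusing):

```python
from trl import SFTConfig

# Training arguments from the table above (sketch for trl==0.17.0)
args = SFTConfig(
    output_dir="codellama-pr-review-lora-v1",  # assumed path, not from the card
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch size = 8
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_steps=50,
    weight_decay=0.01,
    fp16=True,
    max_seq_length=2048,
    packing=True,                    # pack multiple short samples per sequence
    eval_strategy="epoch",           # evaluate once per epoch, not per step
)
# args is then passed to trl.SFTTrainer together with the 4-bit base model,
# the LoRA config, and the formatted train/eval datasets.
```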
### Prompt template

```
### Instruction:
You are a senior code reviewer. Compare the before and after versions of the code below. Identify potential issues and provide improvement suggestions.
### File: {file_path}
### Before:
{before_code}
### After:
{after_code}
### Review:
{reviewer_comment}
```

At inference, everything after `### Review:` is generated.
## How to Use
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

BASE_MODEL = "codellama/CodeLlama-7b-Instruct-hf"
ADAPTER_REPO = "zenlyst/codellama-7b-pr-review-lora-v1"  # replace after upload

# 4-bit load (matches training-time quantization)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, ADAPTER_REPO)
model.eval()

PROMPT = """### Instruction:
You are a senior code reviewer. Compare the before and after versions of the code below. Identify potential issues and provide improvement suggestions.
### File: {file_path}
### Before:
{before_code}
### After:
{after_code}
### Review:
"""

before = """def parse_config(path):
    with open(path) as f:
        data = json.load(f)
    return data"""

after = """def parse_config(path):
    with open(path) as f:
        data = yaml.safe_load(f)
    return data"""

prompt = PROMPT.format(file_path="utils/parser.py", before_code=before, after_code=after)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        repetition_penalty=1.1,
    )

review = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(review.strip())
```
Requirements: `transformers`, `peft`, `bitsandbytes`, `accelerate`, `torch`. A CUDA GPU is required for 4-bit loading (`bitsandbytes` does not support macOS).
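One practical note: because training used packing, the model can run past the end of a review and start a new `### ...` section at inference time. A small post-processing helper (an assumption about a failure mode that is common with this kind of template, not documented model behavior) trims the decoded output at the first header:

```python
def extract_review(generated: str) -> str:
    """Keep only the text before the next '###' section header, if any.

    Models fine-tuned on packed '### Section:' prompts sometimes run on
    into a new section instead of stopping; this trims that trailing text.
    """
    review, _, _ = generated.partition("###")
    return review.strip()
```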
## Evaluation

Evaluated on 3 hand-crafted Python code review cases covering error handling, security, and resource management. Scoring is qualitative (Good / Shallow / Failed); this is a prototype eval, not a benchmark.
### Summary
| Test | Category | Verdict |
|---|---|---|
| JSON → YAML migration (no error handling) | Imports / error handling | ⚠️ Shallow |
| SQL query without parameterization | Security | ❌ Failed |
| Missing resource cleanup (`open()` without `with`) | Resource management | ✅ Good |

Overall: 1 / 3 useful reviews.
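The three cases were run by hand, but the setup amounts to a loop like this (a sketch: `generate_review` stands in for the generation code in the How to Use section, the case names here are hypothetical labels, and the Good/Shallow/Failed verdicts are assigned manually afterwards):

```python
# Hypothetical minimal harness for the 3-case qualitative eval.
EVAL_CASES = [
    {"name": "json_to_yaml", "category": "imports / error handling"},
    {"name": "sql_injection", "category": "security"},
    {"name": "missing_context_manager", "category": "resource management"},
]

def run_eval(cases, generate_review):
    """Generate one review per case; verdicts are then assigned by hand,
    as in the summary table above."""
    return {case["name"]: generate_review(case) for case in cases}
```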
### Case 1: JSON to YAML migration (⚠️ Shallow)

The code switches `json.load` to `yaml.safe_load` without adding `import yaml` or handling `yaml.YAMLError`.

Model output:

```
import yaml
```

Analysis: Caught the missing import but missed the more important issues: no error handling for malformed YAML, and no consideration of the silent format migration's impact on callers.
### Case 2: SQL injection via f-string (❌ Failed)

The code uses `cursor.execute(f'SELECT * FROM users WHERE id = {user_id}')`, a textbook SQL injection vulnerability.

Model output:

```
cursor.execute(f'SELECT * FROM users WHERE id = {user_id} AND active = 1')
```

Analysis: The model echoed the code back without identifying the SQL injection vulnerability. This is the most critical failure: security issues should be high-priority review items. Likely root causes: (1) SQL injection patterns are under-represented in 10K random samples; (2) one epoch may be insufficient for the model to consolidate rare security patterns.
### Case 3: Missing context manager on file handle (✅ Good)

The code replaces `with open(dest, 'wb') as f:` with a bare `f = open(dest, 'wb')`, losing automatic close on exception.

Model output:

> The `requests` module allows streaming of responses using the `stream` parameter. This means that we can read the content in chunks without loading it all into memory at once. In this case, we could use a context manager to ensure the file is properly closed after writing. Here's an example of how to implement this:

```python
with open(dest, 'wb') as f:
    for chunk in response.iter_content(8192):
        f.write(chunk)
```

Analysis: Correctly identified the missing context manager, explained why it matters, and provided the fix. This is the quality level the model aims for across all reviews.
## Limitations and Biases
- Python only. The model has not been trained on or evaluated against any other language.
- Misses security issues. v1 failed to identify SQL injection in evaluation. Do not rely on this model for security review.
- Shallow on multi-issue diffs. The model tends to surface one issue per review even when multiple exist.
- Small eval set. Three hand-crafted cases is not a benchmark. Real-world performance will vary.
- Training data bias. Inherits the biases of the `ronantakizawa/github-codereview` dataset: mostly open-source Python projects on GitHub. Code styles and review conventions from other ecosystems (enterprise, other languages, non-English projects) are underrepresented.
- Prototype only. Not validated at scale, not safety-reviewed, not aligned for adversarial inputs.
## Known Failure Modes (short list)
- SQL injection via f-string interpolation: missed entirely in eval
- Silent API migrations (e.g., JSON → YAML): flags imports but misses behavioral implications
- Echoes code back as "suggestion" without explaining issues
- Single-line suggestions even when multi-line refactors are needed
## Roadmap
A v2 adapter is in progress, targeting the v1 failure modes with:
- Training stack migration to Unsloth for a ~2× speedup at identical accuracy
- 15K samples × 2 epochs (up from 10K × 1) within the same compute budget
- Expanded 10-case eval set including security, error handling, mutability, and performance cases
- Honest v1 vs v2 comparison on the same eval set
See the project repo for progress.
## Citation
If you use this adapter, please cite the underlying dataset and base model:
```bibtex
@misc{codellama-7b-pr-review-lora-v1,
  title        = {CodeLlama-7B PR Review LoRA Adapter (v1)},
  author       = {Sherry Liu},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/zenlyst/codellama-7b-pr-review-lora-v1}},
  note         = {LoRA adapter fine-tuned via QLoRA on ronantakizawa/github-codereview}
}
```
Base model:
```bibtex
@misc{roziere2023code,
  title         = {Code Llama: Open Foundation Models for Code},
  author        = {Baptiste Rozière and Jonas Gehring and Fabian Gloeckle and others},
  year          = {2023},
  eprint        = {2308.12950},
  archivePrefix = {arXiv}
}
```
Dataset: `ronantakizawa/github-codereview` on the Hugging Face Hub.
## License
This adapter inherits the license of the base model: Llama 2 Community License. Review the base model's license before use.
## Acknowledgements
- Base model: Meta's CodeLlama-7B-Instruct
- Dataset: Ronan Takizawa's `github-codereview` dataset
- Training stack: Hugging Face Transformers + PEFT + TRL + bitsandbytes
- Compute: Google Colab Pro (T4)