CodeLlama-7B PR Review — LoRA Adapter (v1)

A LoRA adapter for CodeLlama-7B-Instruct, fine-tuned with QLoRA to generate inline code review comments on Python pull requests. This is the v1 prototype from a larger project that ships the model behind a FastAPI + GitHub App + Kubernetes deployment.

⚠️ This is a v1 prototype. It catches some code review patterns (e.g., missing context managers) but misses others (e.g., SQL injection). See the Evaluation section for honest failure modes. Not production-ready. A v2 trained with Unsloth on more data is planned — see the project repo.

Project repo: https://github.com/Zenlyst/PR_Review_AI


Model Details

| Field | Value |
|---|---|
| Base model | codellama/CodeLlama-7b-Instruct-hf |
| Adapter type | LoRA (via PEFT) |
| Quantization at training | 4-bit NF4 with double quantization (QLoRA) |
| Parameter count (trainable) | 0.06% of base (4M params) |
| Adapter size | ~20–30 MB |
| Task | Code review comment generation from before/after Python code pairs |
| Language scope | Python only |
| Intended use | Research / portfolio demonstration. Not production. |

LoRA Configuration

| Hyperparameter | Value |
|---|---|
| Rank (r) | 16 |
| Alpha | 32 |
| Dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj |
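In PEFT, the settings above correspond to a `LoraConfig` roughly as sketched below. This is illustrative only; `bias` and `task_type` are assumed defaults, not values confirmed by the training script.

```python
from peft import LoraConfig

# Sketch of the LoRA configuration from the table above.
# bias="none" and task_type="CAUSAL_LM" are assumptions for a
# causal-LM fine-tune; the released script may differ.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
```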

Intended Use

This adapter is intended for:

  • Research on code review generation with small open-source LLMs
  • Portfolio demonstration of the end-to-end QLoRA fine-tuning pipeline (data filtering → training → eval → adapter export → deployment)
  • Educational exploration of how fine-tuned 7B models compare to frontier APIs on a domain-specific task

It is not intended for:

  • Production code review (the model misses critical security issues like SQL injection — see evaluation below)
  • Languages other than Python
  • Replacing human code review
  • Any use where reviewer omissions could cause harm

Training Data

Dataset: ronantakizawa/github-codereview — 355K+ real human code review comments from public GitHub PRs.

Filtering applied

| Filter | Value | Rationale |
|---|---|---|
| Language | language == "Python" (file-level) | Focused scope for v1 |
| Quality | score >= 0.5 | Drop noisy low-quality reviews |
| Code length | 5–200 lines (both before and after) | Remove trivial and very large diffs |
| Token length | <= 2048 after prompt formatting | Fit T4 VRAM + training efficiency |
| Sample limit | 10,000 | Keep Colab T4 training under ~13h |

Both positive examples (real reviews) and negative examples (is_negative=True, "No issues found") were kept so the model learns when code is fine and doesn't need a comment.
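The filters above can be sketched as a plain predicate over dataset rows. Field names like `language`, `score`, `before_code`, and `after_code` are assumptions about the dataset schema, and the real pipeline also enforces the token-length cap with the tokenizer, which is omitted here.

```python
def passes_filters(row):
    """Approximate the v1 filtering rules (field names are assumed)."""
    if row.get("language") != "Python":
        return False
    if row.get("score", 0.0) < 0.5:
        return False
    # Both code versions must be 5-200 lines.
    for key in ("before_code", "after_code"):
        n_lines = len(row.get(key, "").splitlines())
        if not (5 <= n_lines <= 200):
            return False
    return True

# Toy rows: only the first should survive filtering.
rows = [
    {"language": "Python", "score": 0.9,
     "before_code": "\n".join(["x = 1"] * 10),
     "after_code": "\n".join(["x = 2"] * 10)},
    {"language": "Go", "score": 0.9,
     "before_code": "\n".join(["x = 1"] * 10),
     "after_code": "\n".join(["x = 2"] * 10)},
    {"language": "Python", "score": 0.2,
     "before_code": "\n".join(["x = 1"] * 10),
     "after_code": "\n".join(["x = 2"] * 10)},
]
kept = [r for r in rows if passes_filters(r)]
```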


Training Configuration

| Parameter | Value |
|---|---|
| Base model | CodeLlama-7B-Instruct |
| Method | QLoRA (4-bit NF4 + double quantization) |
| Training samples | 10,000 |
| Epochs | 1 |
| Max sequence length | 2,048 tokens |
| Per-device batch size | 1 |
| Gradient accumulation steps | 8 (effective batch size = 8) |
| Learning rate | 2e-4 |
| LR scheduler | Cosine, 50 warmup steps |
| Weight decay | 0.01 |
| Precision | FP16 |
| Packing | Enabled (multiple short samples per sequence) |
| Evaluation | Once per epoch (not per step, for speed) |
| GPU | Google Colab Pro T4 (16GB VRAM) |
| Training time | ~13 hours |
| Training stack | Transformers + PEFT + TRL (transformers==4.47.1, trl==0.17.0) |
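These values map onto TRL's SFTConfig roughly as follows. This is a sketch against trl==0.17; the argument names are assumptions from that API generation, not a copy of the actual training script.

```python
from trl import SFTConfig

# Sketch of the training arguments from the table above
# (trl==0.17 / transformers Trainer conventions; treat exact
# argument names as assumptions, not the released script).
training_args = SFTConfig(
    output_dir="codellama-7b-pr-review-lora-v1",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch size = 8
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_steps=50,
    weight_decay=0.01,
    fp16=True,
    max_seq_length=2048,
    packing=True,                    # pack short samples per sequence
    eval_strategy="epoch",           # evaluate once per epoch
)
```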

Prompt template

### Instruction:
You are a senior code reviewer. Compare the before and after versions of the code below. Identify potential issues and provide improvement suggestions.

### File: {file_path}

### Before:
{before_code}

### After:
{after_code}

### Review:
{reviewer_comment}

At inference, everything after ### Review: is generated.
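At the string level, building the prompt and slicing off the generated review can be sketched as below. The usage example in the How to Use section slices by token count instead; `build_prompt` and `extract_review` here are illustrative helpers, not part of the released code.

```python
TEMPLATE = """### Instruction:
You are a senior code reviewer. Compare the before and after versions of the code below. Identify potential issues and provide improvement suggestions.

### File: {file_path}

### Before:
{before_code}

### After:
{after_code}

### Review:
"""

def build_prompt(file_path, before_code, after_code):
    """Fill the training-time template for one before/after pair."""
    return TEMPLATE.format(file_path=file_path,
                           before_code=before_code,
                           after_code=after_code)

def extract_review(full_decoded_text):
    # Everything after the final "### Review:" marker is model output.
    return full_decoded_text.rsplit("### Review:", 1)[-1].strip()

prompt = build_prompt("utils/parser.py", "x = 1", "x = 2")
review = extract_review(prompt + "Consider adding a docstring.")
```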


How to Use

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

BASE_MODEL = "codellama/CodeLlama-7b-Instruct-hf"
ADAPTER_REPO = "zenlyst/codellama-7b-pr-review-lora-v1"  # replace after upload

# 4-bit load (matches training-time quantization)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, ADAPTER_REPO)
model.eval()

PROMPT = """### Instruction:
You are a senior code reviewer. Compare the before and after versions of the code below. Identify potential issues and provide improvement suggestions.

### File: {file_path}

### Before:
{before_code}

### After:
{after_code}

### Review:
"""

before = """def parse_config(path):
    with open(path) as f:
        data = json.load(f)
    return data"""

after = """def parse_config(path):
    with open(path) as f:
        data = yaml.safe_load(f)
    return data"""

prompt = PROMPT.format(file_path="utils/parser.py", before_code=before, after_code=after)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        repetition_penalty=1.1,
    )

review = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(review.strip())

Requirements: transformers, peft, bitsandbytes, accelerate, torch. Requires a CUDA GPU for 4-bit loading (bitsandbytes does not support macOS).


Evaluation

Evaluated on 3 hand-crafted Python code review cases covering error handling, security, and resource management. Scoring is qualitative (Good / Shallow / Failed) β€” this is a prototype eval, not a benchmark.

Summary

| Test | Category | Verdict |
|---|---|---|
| JSON → YAML migration (no error handling) | Imports / error handling | ❌ Shallow |
| SQL query without parameterization | 🔐 Security | ❌ Failed |
| Missing resource cleanup (open() without with) | Resource management | ✅ Good |

Overall: 1 / 3 useful reviews.

Case 1 — JSON to YAML migration (❌ Shallow)

Code switches json.load to yaml.safe_load without adding import yaml or yaml.YAMLError handling.

Model output:

import yaml

Analysis: Caught the missing import but missed the more important issues — no error handling for malformed YAML and no consideration of the silent format migration impact on callers.

Case 2 — SQL injection via f-string (❌ Failed)

Code uses cursor.execute(f'SELECT * FROM users WHERE id = {user_id}') — a textbook SQL injection vulnerability.

Model output:

    cursor.execute(f'SELECT * FROM users WHERE id = {user_id} AND active = 1')

Analysis: The model echoed the code back without identifying the SQL injection vulnerability. This is the most critical failure — security issues should be high-priority review items. Likely root causes: (1) SQL injection patterns are under-represented in 10K random samples, (2) one epoch may be insufficient for the model to consolidate rare security patterns.
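For context, the review the model should have produced points at parameterized queries. A minimal stdlib sketch of the vulnerable pattern and its fix (the `users` table and inputs are illustrative, not from the eval case):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")
conn.execute("INSERT INTO users VALUES (2, 'bob')")

user_id = "1 OR 1=1"  # attacker-controlled input

# Vulnerable pattern from the test case: f-string interpolation lets
# the input rewrite the query, so it returns every row.
vulnerable = conn.execute(
    f"SELECT * FROM users WHERE id = {user_id}"
).fetchall()

# Fix: a parameterized query treats the input as a value, not as SQL,
# so the malicious string matches nothing.
safe = conn.execute(
    "SELECT * FROM users WHERE id = ?", (user_id,)
).fetchall()
```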

Case 3 — Missing context manager on file handle (✅ Good)

Code replaces with open(dest, 'wb') as f: with a bare f = open(dest, 'wb'), losing automatic close on exception.

Model output:

The requests module allows streaming of responses using the stream parameter. This means that we can read the content in chunks without loading it all into memory at once. In this case, we could use a context manager to ensure the file is properly closed after writing. Here's an example of how to implement this:

with open(dest, 'wb') as f:
    for chunk in response.iter_content(8192):
        f.write(chunk)

Analysis: Correctly identified the missing context manager, explained why it matters, and provided the fix. This is the quality level the model aims for across all reviews.


Limitations and Biases

  • Python only. The model has not been trained on or evaluated against any other language.
  • Misses security issues. v1 failed to identify SQL injection in evaluation. Do not rely on this model for security review.
  • Shallow on multi-issue diffs. The model tends to surface one issue per review even when multiple exist.
  • Small eval set. Three hand-crafted cases is not a benchmark. Real-world performance will vary.
  • Training data bias. Inherits biases of the ronantakizawa/github-codereview dataset — mostly open-source Python projects on GitHub. Code styles and review conventions from other ecosystems (enterprise, other languages, non-English projects) are underrepresented.
  • Prototype only. Not validated at scale, not safety-reviewed, not aligned for adversarial inputs.

Known Failure Modes (short list)

  1. SQL injection via f-string interpolation — missed entirely in eval
  2. Silent API migrations (e.g., JSON→YAML) — flags imports but misses behavioral implications
  3. Echoes code back as "suggestion" without explaining issues
  4. Single-line suggestions even when multi-line refactors are needed

Roadmap

A v2 adapter is in progress, targeting the v1 failure modes with:

  • Training stack migration to Unsloth for ~2× speedup at identical accuracy
  • 15K samples × 2 epochs (up from 10K × 1) within the same compute budget
  • Expanded 10-case eval set including security, error handling, mutability, and performance cases
  • Honest v1 vs v2 comparison on the same eval set

See the project repo for progress.


Citation

If you use this adapter, please cite the underlying dataset and base model:

@misc{codellama-7b-pr-review-lora-v1,
  title        = {CodeLlama-7B PR Review LoRA Adapter (v1)},
  author       = {Sherry Liu},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/zenlyst/codellama-7b-pr-review-lora-v1}},
  note         = {LoRA adapter fine-tuned via QLoRA on ronantakizawa/github-codereview}
}

Base model:

@misc{roziere2023code,
  title        = {Code Llama: Open Foundation Models for Code},
  author       = {Baptiste Rozière and Jonas Gehring and Fabian Gloeckle and others},
  year         = {2023},
  eprint       = {2308.12950},
  archivePrefix= {arXiv}
}

Dataset:


License

This adapter inherits the license of the base model: Llama 2 Community License. Review the base model's license before use.


Acknowledgements

  • Base model: Meta's CodeLlama-7B-Instruct
  • Dataset: Ronan Takizawa's github-codereview dataset
  • Training stack: Hugging Face Transformers + PEFT + TRL + bitsandbytes
  • Compute: Google Colab Pro (T4)