# CodeLlama-7B PR Review LoRA Adapter (v1)
A QLoRA-fine-tuned LoRA adapter for CodeLlama-7B-Instruct, trained to generate inline code review comments on Python pull requests. This is the v1 prototype from a larger project that ships the model behind a FastAPI + GitHub App + Kubernetes deployment.
⚠️ This is a v1 prototype. It catches some code review patterns (e.g., missing context managers) but misses others (e.g., SQL injection). See the Evaluation section for honest failure modes. Not production-ready. A v2 trained with Unsloth on more data is planned; see the project repo.
Project repo: https://github.com/Zenlyst/PR_Review_AI
## Model Details
| Field | Value |
|---|---|
| Base model | codellama/CodeLlama-7b-Instruct-hf |
| Adapter type | LoRA (via PEFT) |
| Quantization at training | 4-bit NF4 with double quantization (QLoRA) |
| Parameter count (trainable) | |
| Adapter size | ~20–30 MB |
| Task | Code review comment generation from before/after Python code pairs |
| Language scope | Python only |
| Intended use | Research / portfolio demonstration. Not production. |
## LoRA Configuration
| Hyperparameter | Value |
|---|---|
| Rank (r) | 16 |
| Alpha | 32 |
| Dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj |
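This configuration maps onto a PEFT `LoraConfig` roughly as follows (a sketch: `bias` and `task_type` are common defaults for causal-LM LoRA fine-tuning, not values stated on this card):

```python
from peft import LoraConfig

# LoRA hyperparameters from the table above
lora_config = LoraConfig(
    r=16,                    # rank
    lora_alpha=32,           # scaling factor (alpha / r = 2.0)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",             # assumed default, not stated on this card
    task_type="CAUSAL_LM",   # assumed default, not stated on this card
)
```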
## Intended Use
This adapter is intended for:
- Research on code review generation with small open-source LLMs
- Portfolio demonstration of the end-to-end QLoRA fine-tuning pipeline (data filtering → training → eval → adapter export → deployment)
- Educational exploration of how fine-tuned 7B models compare to frontier APIs on a domain-specific task
It is not intended for:
- Production code review (the model misses critical security issues like SQL injection; see evaluation below)
- Languages other than Python
- Replacing human code review
- Any use where reviewer omissions could cause harm
## Training Data
Dataset: `ronantakizawa/github-codereview`, 355K+ real human code review comments from public GitHub PRs.
### Filtering applied
| Filter | Value | Rationale |
|---|---|---|
| Language | `language == "Python"` (file-level) | Focused scope for v1 |
| Quality score | >= 0.5 | Drop noisy low-quality reviews |
| Code length | 5–200 lines (both before and after) | Remove trivial and very large diffs |
| Token length | <= 2048 after prompt formatting | Fit T4 VRAM + training efficiency |
| Sample limit | 10,000 | Keep Colab T4 training under ~13 h |
Both positive examples (real reviews) and negative examples (`is_negative=True`, "No issues found") were kept so the model learns when code is fine and doesn't need a comment.
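As a sketch, the row-level filters above can be expressed as a single predicate and applied with `datasets.Dataset.filter`. The field names (`language`, `quality_score`, `before_code`, `after_code`) are assumptions about the dataset schema, and the token-length filter is applied separately after prompt formatting:

```python
def keep_example(ex):
    """Predicate implementing the v1 row-level filters from the table above.

    Field names are assumed; the <= 2048 token filter runs later, once the
    example has been rendered into the prompt template.
    """
    python_file = ex["language"] == "Python"
    good_quality = ex["quality_score"] >= 0.5
    sane_length = all(
        5 <= len(code.splitlines()) <= 200
        for code in (ex["before_code"], ex["after_code"])
    )
    return python_file and good_quality and sane_length

# Usage with Hugging Face datasets (sketch):
# from datasets import load_dataset
# ds = load_dataset("ronantakizawa/github-codereview", split="train")
# ds = ds.filter(keep_example).shuffle(seed=42).select(range(10_000))
```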
## Training Configuration
| Parameter | Value |
|---|---|
| Base model | CodeLlama-7B-Instruct |
| Method | QLoRA (4-bit NF4 + double quantization) |
| Training samples | 10,000 |
| Epochs | 1 |
| Max sequence length | 2,048 tokens |
| Per-device batch size | 1 |
| Gradient accumulation steps | 8 (effective batch size = 8) |
| Learning rate | 2e-4 |
| LR scheduler | Cosine, 50 warmup steps |
| Weight decay | 0.01 |
| Precision | FP16 |
| Packing | Enabled (multiple short samples per sequence) |
| Evaluation | Once per epoch (not per step, for speed) |
| GPU | Google Colab Pro T4 (16GB VRAM) |
| Training time | ~13 hours |
| Training stack | Transformers + PEFT + TRL (transformers==4.47.1, trl==0.17.0) |
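The table corresponds to a TRL `SFTConfig` roughly like the following (a sketch: the `output_dir` is made up here, and some argument names have shifted across TRL versions, so check the `trl==0.17.0` docs before reusing):

```python
from trl import SFTConfig

# Training arguments from the table above (sketch for trl==0.17.0)
args = SFTConfig(
    output_dir="codellama-pr-review-lora-v1",  # assumed path, not from the card
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch size = 8
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_steps=50,
    weight_decay=0.01,
    fp16=True,
    max_seq_length=2048,
    packing=True,                    # pack multiple short samples per sequence
    eval_strategy="epoch",           # evaluate once per epoch, not per step
)
# args is then passed to trl.SFTTrainer together with the 4-bit base model,
# the LoRA config, and the formatted train/eval datasets.
```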
### Prompt template

```
### Instruction:
You are a senior code reviewer. Compare the before and after versions of the code below. Identify potential issues and provide improvement suggestions.
### File: {file_path}
### Before:
{before_code}
### After:
{after_code}
### Review:
{reviewer_comment}
```

At inference, everything after `### Review:` is generated.
## How to Use
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

BASE_MODEL = "codellama/CodeLlama-7b-Instruct-hf"
ADAPTER_REPO = "zenlyst/codellama-7b-pr-review-lora-v1"  # replace after upload

# 4-bit load (matches training-time quantization)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, ADAPTER_REPO)
model.eval()

PROMPT = """### Instruction:
You are a senior code reviewer. Compare the before and after versions of the code below. Identify potential issues and provide improvement suggestions.
### File: {file_path}
### Before:
{before_code}
### After:
{after_code}
### Review:
"""

before = """def parse_config(path):
    with open(path) as f:
        data = json.load(f)
    return data"""

after = """def parse_config(path):
    with open(path) as f:
        data = yaml.safe_load(f)
    return data"""

prompt = PROMPT.format(file_path="utils/parser.py", before_code=before, after_code=after)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        repetition_penalty=1.1,
    )

review = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(review.strip())
```
Requirements: `transformers`, `peft`, `bitsandbytes`, `accelerate`, `torch`. A CUDA GPU is required for 4-bit loading (`bitsandbytes` does not support macOS).
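One practical note: because training used packing, the model can run past the end of a review and start a new `### ...` section at inference time. A small post-processing helper (an assumption about a failure mode that is common with this kind of template, not documented model behavior) trims the decoded output at the first header:

```python
def extract_review(generated: str) -> str:
    """Keep only the text before the next '###' section header, if any.

    Models fine-tuned on packed '### Section:' prompts sometimes run on
    into a new section instead of stopping; this trims that trailing text.
    """
    review, _, _ = generated.partition("###")
    return review.strip()
```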
## Evaluation

Evaluated on 3 hand-crafted Python code review cases covering error handling, security, and resource management. Scoring is qualitative (Good / Shallow / Failed); this is a prototype eval, not a benchmark.
### Summary
| Test | Category | Verdict |
|---|---|---|
| JSON → YAML migration (no error handling) | Imports / error handling | ⚠️ Shallow |
| SQL query without parameterization | Security | ❌ Failed |
| Missing resource cleanup (`open()` without `with`) | Resource management | ✅ Good |

Overall: 1 / 3 useful reviews.
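The three cases were run by hand, but the setup amounts to a loop like this (a sketch: `generate_review` stands in for the generation code in the How to Use section, the case names here are hypothetical labels, and the Good/Shallow/Failed verdicts are assigned manually afterwards):

```python
# Hypothetical minimal harness for the 3-case qualitative eval.
EVAL_CASES = [
    {"name": "json_to_yaml", "category": "imports / error handling"},
    {"name": "sql_injection", "category": "security"},
    {"name": "missing_context_manager", "category": "resource management"},
]

def run_eval(cases, generate_review):
    """Generate one review per case; verdicts are then assigned by hand,
    as in the summary table above."""
    return {case["name"]: generate_review(case) for case in cases}
```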
### Case 1: JSON to YAML migration (⚠️ Shallow)

The code switches `json.load` to `yaml.safe_load` without adding `import yaml` or handling `yaml.YAMLError`.

Model output:

```
import yaml
```

Analysis: Caught the missing import but missed the more important issues: no error handling for malformed YAML, and no consideration of the silent format migration's impact on callers.
### Case 2: SQL injection via f-string (❌ Failed)

The code uses `cursor.execute(f'SELECT * FROM users WHERE id = {user_id}')`, a textbook SQL injection vulnerability.

Model output:

```
cursor.execute(f'SELECT * FROM users WHERE id = {user_id} AND active = 1')
```

Analysis: The model echoed the code back without identifying the SQL injection vulnerability. This is the most critical failure: security issues should be high-priority review items. Likely root causes: (1) SQL injection patterns are under-represented in 10K random samples; (2) one epoch may be insufficient for the model to consolidate rare security patterns.
### Case 3: Missing context manager on file handle (✅ Good)

The code replaces `with open(dest, 'wb') as f:` with a bare `f = open(dest, 'wb')`, losing automatic close on exception.

Model output:

> The `requests` module allows streaming of responses using the `stream` parameter. This means that we can read the content in chunks without loading it all into memory at once. In this case, we could use a context manager to ensure the file is properly closed after writing. Here's an example of how to implement this:

```python
with open(dest, 'wb') as f:
    for chunk in response.iter_content(8192):
        f.write(chunk)
```

Analysis: Correctly identified the missing context manager, explained why it matters, and provided the fix. This is the quality level the model aims for across all reviews.
## Limitations and Biases
- Python only. The model has not been trained on or evaluated against any other language.
- Misses security issues. v1 failed to identify SQL injection in evaluation. Do not rely on this model for security review.
- Shallow on multi-issue diffs. The model tends to surface one issue per review even when multiple exist.
- Small eval set. Three hand-crafted cases is not a benchmark. Real-world performance will vary.
- Training data bias. Inherits the biases of the `ronantakizawa/github-codereview` dataset: mostly open-source Python projects on GitHub. Code styles and review conventions from other ecosystems (enterprise, other languages, non-English projects) are underrepresented.
- Prototype only. Not validated at scale, not safety-reviewed, not aligned for adversarial inputs.
## Known Failure Modes (short list)
- SQL injection via f-string interpolation: missed entirely in eval
- Silent API migrations (e.g., JSON → YAML): flags imports but misses behavioral implications
- Echoes code back as "suggestion" without explaining issues
- Single-line suggestions even when multi-line refactors are needed
## Roadmap
A v2 adapter is in progress, targeting the v1 failure modes with:
- Training stack migration to Unsloth for a ~2× speedup at identical accuracy
- 15K samples × 2 epochs (up from 10K × 1) within the same compute budget
- Expanded 10-case eval set including security, error handling, mutability, and performance cases
- Honest v1 vs v2 comparison on the same eval set
See the project repo for progress.
## Citation
If you use this adapter, please cite the underlying dataset and base model:
```bibtex
@misc{codellama-7b-pr-review-lora-v1,
  title        = {CodeLlama-7B PR Review LoRA Adapter (v1)},
  author       = {Sherry Liu},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/zenlyst/codellama-7b-pr-review-lora-v1}},
  note         = {LoRA adapter fine-tuned via QLoRA on ronantakizawa/github-codereview}
}
```
Base model:
```bibtex
@misc{roziere2023code,
  title         = {Code Llama: Open Foundation Models for Code},
  author        = {Baptiste Rozière and Jonas Gehring and Fabian Gloeckle and others},
  year          = {2023},
  eprint        = {2308.12950},
  archivePrefix = {arXiv}
}
```
Dataset: `ronantakizawa/github-codereview` on the Hugging Face Hub.
## License
This adapter inherits the license of the base model: Llama 2 Community License. Review the base model's license before use.
## Acknowledgements
- Base model: Meta's CodeLlama-7B-Instruct
- Dataset: Ronan Takizawa's `github-codereview` dataset
- Training stack: Hugging Face Transformers + PEFT + TRL + bitsandbytes
- Compute: Google Colab Pro (T4)