ThetaCPG v3 — Heterogeneous GNN for C++ Bug Detection
ThetaCPG (Heterogeneous Code Property Graph) is a Graph Neural Network model for detecting memory safety vulnerabilities in C++ code, with a focus on the Chromium codebase.
Architecture
| Component | Details |
|---|---|
| Graph type | Heterogeneous Code Property Graph (CPG) |
| GNN backbone | HGTConv (Heterogeneous Graph Transformer) |
| Node types | statement, variable, function, type, literal |
| Edge types | CFG, DFG, call graph, type edges, literal refs |
| Hidden dim | 256 |
| Attention heads | 8 |
| HGT layers | 6 |
| Total params | ~63.8M |
| Token encoder | CodeBERT (microsoft/codebert-base) |
| Training data | BigVul + D2A (CVE fix commits, C/C++) |
| AUROC (val) | ~0.87 |
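
For orientation, the hyperparameters in the table can be collected into a small configuration object. This is a minimal sketch rather than the repository's actual config: the class name `ThetaCPGConfig`, its field names, and the edge-type identifiers are hypothetical, and only the values come from the table. It also checks that the hidden dimension divides evenly across the attention heads, which multi-head layers such as PyG's HGTConv require.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThetaCPGConfig:
    """Hypothetical container for the hyperparameters listed in the table."""

    node_types: tuple = ("statement", "variable", "function", "type", "literal")
    # Illustrative identifiers for the table's edge kinds (names are assumptions).
    edge_types: tuple = ("cfg", "dfg", "call", "type_edge", "literal_ref")
    hidden_dim: int = 256
    num_heads: int = 8
    num_layers: int = 6

    def __post_init__(self):
        # Multi-head attention splits hidden_dim evenly across heads.
        if self.hidden_dim % self.num_heads != 0:
            raise ValueError("hidden_dim must be divisible by num_heads")

    @property
    def head_dim(self) -> int:
        """Width of each attention head."""
        return self.hidden_dim // self.num_heads


config = ThetaCPGConfig()
print(config.head_dim)  # 32
```

With the values from the table, each of the 8 heads works on a 32-dimensional slice of the 256-dimensional hidden state.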
What the model detects
The model was trained on BigVul and D2A, collections of CVE fix commits across C/C++ projects. It recognizes these patterns best:
- Use-After-Free (CWE-416) — object freed but dangling reference still used
- Heap Buffer Overflow (CWE-122) — write past the end of a heap allocation
- Out-of-Bounds Read (CWE-125) — array index not validated before read
- Integer Overflow → Under-allocation (CWE-190) — size wraps before being passed to malloc
- Double Free (CWE-415) — same pointer freed twice
- Null Dereference (CWE-476) — pointer used without null check after allocation failure
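To make the integer overflow pattern (CWE-190) concrete, the arithmetic behind an under-allocation can be simulated in Python. This sketch assumes the allocation size is computed in a 32-bit unsigned integer; the helper names `alloc_size_u32` and `multiply_overflows_u32` and the example values are illustrative only.

```python
UINT32_MAX = 0xFFFFFFFF


def alloc_size_u32(count: int, elem_size: int) -> int:
    """Return count * elem_size as a 32-bit unsigned multiply would (wraps modulo 2**32)."""
    return (count * elem_size) & UINT32_MAX


def multiply_overflows_u32(count: int, elem_size: int) -> bool:
    """True if the exact product does not fit in 32 unsigned bits."""
    return count * elem_size > UINT32_MAX


# The caller asks for 0x40000001 elements of 4 bytes each.
count, elem_size = 0x40000001, 4
wrapped = alloc_size_u32(count, elem_size)

print(wrapped)                                   # 4
print(multiply_overflows_u32(count, elem_size))  # True
```

The allocator receives a request for 4 bytes while the code that follows still believes it owns room for `count` elements, so the first loop iteration past the allocation becomes a heap overflow. A check like `multiply_overflows_u32` before the allocation is the kind of guard worth looking for during triage.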
Honest caveat: The model performs well on held-out benchmark data (~0.87 AUROC) but has a high false positive rate (70% at threshold 0.80) on modern Chromium production code. The cause is distribution shift: it was trained mostly on pre-2023 patterns that Chrome has already patched. Every candidate needs manual review. v4 (in progress) is being retrained on Chromium-specific CVE data from 2022–2026.
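If the false positive figure is read as the share of flagged candidates that turn out not to be bugs, it can be measured from manually triaged results as sketched below. The record layout (`probability`, `is_bug`), the helper name `precision_at_threshold`, and the sample numbers are all hypothetical; they are not measurements from ThetaCPG.

```python
def precision_at_threshold(records, threshold):
    """Fraction of candidates at or above the threshold that were confirmed as bugs.

    Returns None when nothing is flagged, since precision is undefined there.
    """
    flagged = [r for r in records if r["probability"] >= threshold]
    if not flagged:
        return None
    confirmed = sum(1 for r in flagged if r["is_bug"])
    return confirmed / len(flagged)


# Made-up triage outcomes, for illustration only.
triaged = [
    {"probability": 0.95, "is_bug": True},
    {"probability": 0.91, "is_bug": False},
    {"probability": 0.88, "is_bug": False},
    {"probability": 0.84, "is_bug": True},
    {"probability": 0.82, "is_bug": False},
    {"probability": 0.60, "is_bug": False},
]

precision = precision_at_threshold(triaged, 0.80)
false_positive_share = 1.0 - precision
print(f"precision={precision:.2f} false_positive_share={false_positive_share:.2f}")
```

Tracking this number per threshold on a triaged sample gives a direct way to compare model versions on production code rather than on benchmark data.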
Usage
Install dependencies
pip install torch==2.4.1 torch-geometric==2.6.1
pip install torch-scatter torch-sparse -f https://data.pyg.org/whl/torch-2.4.1+cu118.html
pip install transformers==4.40.0 tree-sitter==0.20.4
Quick inference
import torch
from graph_builder import GraphBuilder
from model import HetCPGModel
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
builder = GraphBuilder()
cpp_code = """
void ProcessData(char* input, int size) {
    char buffer[256];
    memcpy(buffer, input, size); // overflow if size > 256
    ProcessBuffer(buffer);
}
"""
graph = builder.build_graph(cpp_code)
# See inference.py for the full prediction pipeline
Scan a directory
python inference.py /path/to/file.cc 0.80 cuda
from scanner import ChromiumScanner
scanner = ChromiumScanner(model_path='best_model.pt', device='cuda')
results = scanner.scan_directory('/path/to/chromium/src', threshold=0.80, max_files=5000)
for r in results:
    if r['probability'] > 0.90:
        print(f"{r['file']} {r['func_name']} P={r['probability']:.3f}")
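Because every candidate needs manual review, it can help to rank the scan output and export it for triage. The sketch below assumes `results` is a list of dicts with the keys used above (`file`, `func_name`, `probability`); the helpers `top_candidates` and `to_csv` and the sample entries are hypothetical and not part of the scanner.

```python
import csv
import io


def top_candidates(results, min_probability=0.90, limit=50):
    """Keep results above min_probability, highest probability first."""
    hits = [r for r in results if r["probability"] > min_probability]
    hits.sort(key=lambda r: r["probability"], reverse=True)
    return hits[:limit]


def to_csv(rows):
    """Serialize candidates to CSV text, ignoring any extra keys in each row."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=["file", "func_name", "probability"],
        extrasaction="ignore",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Placeholder data standing in for scanner output.
sample_results = [
    {"file": "a.cc", "func_name": "Foo", "probability": 0.93},
    {"file": "b.cc", "func_name": "Bar", "probability": 0.85},
    {"file": "c.cc", "func_name": "Baz", "probability": 0.97},
]

ranked = top_candidates(sample_results)
print(to_csv(ranked))
```

Sorting by probability puts the strongest candidates at the top of the review queue, and the `limit` argument keeps a large scan from producing more rows than can realistically be triaged.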
Manual triage checklist
After the model flags a function, check these before drawing any conclusions:
□ Does input come from an untrusted source (renderer, network, file)?
□ Is there a bounds check before the suspicious operation?
□ Is object lifetime managed correctly (ref-counted, scoped, owned)?
□ Is this an intentional pattern with a documented safety gate?
  - setHTMLUnsafe() with prior CheckHTML() gate → not a bug
  - delete this after Cancel()+reset() → Chromium idiom, not UAF
  - Unretained() on same SequencedTaskRunner → lifetime guaranteed
  - UNSAFE_BUFFERS macro with audit comment → reviewed, not a bug
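One way to keep triage consistent across reviewers is to record the checklist answers per candidate. The following is a sketch only: the `TriageRecord` class, its field names, and especially the decision rule in `needs_deeper_review` are illustrative assumptions, not a policy defined by this project.

```python
from dataclasses import dataclass


@dataclass
class TriageRecord:
    """Answers to the four checklist questions for one flagged function."""

    untrusted_input: bool         # input reachable from renderer, network, or file
    bounds_check_present: bool    # check exists before the suspicious operation
    lifetime_managed: bool        # ref-counted, scoped, or clearly owned
    documented_safety_gate: bool  # intentional pattern with a documented gate

    def needs_deeper_review(self) -> bool:
        # Assumed rule: a documented gate closes the candidate; otherwise
        # escalate when untrusted input meets a missing safeguard.
        if self.documented_safety_gate:
            return False
        missing_safeguard = not (self.bounds_check_present and self.lifetime_managed)
        return self.untrusted_input and missing_safeguard


candidate = TriageRecord(
    untrusted_input=True,
    bounds_check_present=False,
    lifetime_managed=True,
    documented_safety_gate=False,
)
print(candidate.needs_deeper_review())  # True
```

The rule is deliberately conservative about closing candidates: only a documented safety gate, as in the examples above, dismisses one outright.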
Limitations
- High false positive rate on modern Chromium (distribution shift from pre-2023 training data)
- Cannot detect logic bugs, cryptographic misuse, or web-layer vulnerabilities
- No inter-procedural analysis — each function is analyzed in isolation
- Manual verification required for every candidate
Roadmap
- v4 (in progress): Fine-tuned on Chromium CVE commits 2022–2026, targeting Precision@0.80 ≥ 30%
- v4.1: Dynamic validation — ASAN/libFuzzer harness generation for flagged functions
- v5: Rust unsafe block support
Citation
@misc{thetacpg2026,
title={ThetaCPG: Heterogeneous Code Property Graph for C++ Vulnerability Detection},
author={hoanghai2110},
year={2026},
url={https://huggingface.co/hoanghai2110/thetacpg-v3}
}