ThetaCPG v3 — Heterogeneous GNN for C++ Bug Detection

ThetaCPG (Heterogeneous Code Property Graph) is a Graph Neural Network model for detecting memory safety vulnerabilities in C++ code, with a focus on the Chromium codebase.

Architecture

Component	Details
Graph type	Heterogeneous Code Property Graph (CPG)
GNN backbone	HGTConv (Heterogeneous Graph Transformer)
Node types	`statement`, `variable`, `function`, `type`, `literal`
Edge types	CFG, DFG, call graph, type edges, literal refs
Hidden dim	256
Attention heads	8
HGT layers	6
Total params	~63.8M
Token encoder	CodeBERT (microsoft/codebert-base)
Training data	BigVul + D2A (CVE fix commits, C/C++)
AUROC (val)	~0.87

What the model detects

Trained on BigVul and D2A — collections of CVE fix commits across C/C++ projects. Patterns the model recognizes best:

Use-After-Free (CWE-416) — object freed but dangling reference still used
Heap Buffer Overflow (CWE-122) — write past the end of a heap allocation
Out-of-Bounds Read (CWE-125) — array index not validated before read
Integer Overflow → Under-allocation (CWE-190) — size wraps before being passed to malloc
Double Free (CWE-415) — same pointer freed twice
Null Dereference (CWE-476) — pointer used without null check after allocation failure

Honest caveat: The model performs well on held-out benchmark data (~0.87 AUROC) but has a high false positive rate (~70% at threshold 0.80) on modern Chromium production code. The cause is distribution shift: it was trained mostly on pre-2023 patterns that Chrome has already patched. Every candidate needs manual review. v4 (in progress) is being retrained on Chromium-specific CVE data from 2022–2026.
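
To see how the false-positive figure holds up on your own code, one option is to log each manually reviewed candidate and compute precision at a threshold. The sketch below is illustrative only. `precision_at_threshold` and the `reviewed` sample are hypothetical and not part of this repo. It assumes the ~70% figure means the share of flagged candidates that turn out not to be bugs, and that a candidate counts as flagged when its probability is at or above the threshold.

```python
def precision_at_threshold(reviewed, threshold):
    """Share of flagged candidates (probability >= threshold) confirmed as real bugs.

    `reviewed` is a list of (probability, is_real_bug) pairs from manual review.
    Returns None when nothing is flagged, since precision is undefined there.
    """
    flagged = [is_bug for prob, is_bug in reviewed if prob >= threshold]
    if not flagged:
        return None
    return sum(flagged) / len(flagged)


# Hypothetical manual-review outcomes, made up for illustration.
reviewed = [
    (0.97, True), (0.93, False), (0.91, False), (0.88, True),
    (0.86, False), (0.84, False), (0.83, False), (0.82, False),
    (0.81, True), (0.80, False), (0.79, True), (0.65, False),
    (0.40, False),
]

precision = precision_at_threshold(reviewed, 0.80)
print(f"precision@0.80 = {precision:.2f}")
```

With this made-up sample, 3 of the 10 flagged candidates are real bugs, so precision is 0.30 and the remaining 70% are false alarms.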

Usage

Install dependencies

pip install torch==2.4.1 torch-geometric==2.6.1
pip install torch-scatter torch-sparse -f https://data.pyg.org/whl/torch-2.4.1+cu118.html
pip install transformers==4.40.0 tree-sitter==0.20.4

Quick inference

import torch
from graph_builder import GraphBuilder
from model import HetCPGModel

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
builder = GraphBuilder()

cpp_code = """
void ProcessData(char* input, int size) {
    char buffer[256];
    memcpy(buffer, input, size);  // overflow if size > 256
    ProcessBuffer(buffer);
}
"""

graph = builder.build_graph(cpp_code)
# See inference.py for the full prediction pipeline
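
To make the idea of a heterogeneous code property graph concrete, here is a hand-written toy graph for the `ProcessData` snippet, using the five node types listed under Architecture. It is not the output of `GraphBuilder`. The relation names (`cfg`, `dfg`, `calls`, `has_type`, `literal_ref`) and which node types each edge connects are assumptions chosen for illustration. The sketch groups edges by `(source type, relation, destination type)` triples, which is how heterogeneous GNN layers typically key their per-edge-type parameters.

```python
from collections import Counter

# Toy nodes per type for ProcessData (hypothetical, written by hand).
nodes = {
    "function": ["ProcessData", "memcpy", "ProcessBuffer"],
    "variable": ["input", "size", "buffer"],
    "type": ["char*", "int", "char[256]"],
    "literal": ["256"],
    "statement": [
        "char buffer[256];",
        "memcpy(buffer, input, size);",
        "ProcessBuffer(buffer);",
    ],
}

# (source type, source index, relation, destination type, destination index)
edges = [
    ("statement", 0, "cfg", "statement", 1),
    ("statement", 1, "cfg", "statement", 2),
    ("variable", 0, "dfg", "statement", 1),    # input flows into memcpy
    ("variable", 1, "dfg", "statement", 1),    # size flows into memcpy
    ("variable", 2, "dfg", "statement", 1),    # buffer flows into memcpy
    ("variable", 2, "dfg", "statement", 2),    # buffer flows into ProcessBuffer
    ("statement", 1, "calls", "function", 1),  # call to memcpy
    ("statement", 2, "calls", "function", 2),  # call to ProcessBuffer
    ("variable", 0, "has_type", "type", 0),
    ("variable", 1, "has_type", "type", 1),
    ("variable", 2, "has_type", "type", 2),
    ("statement", 0, "literal_ref", "literal", 0),
]


def edge_type_counts(nodes, edges):
    """Validate node indices and count edges per (src_type, relation, dst_type)."""
    counts = Counter()
    for src_type, src, rel, dst_type, dst in edges:
        if not 0 <= src < len(nodes[src_type]):
            raise ValueError(f"bad source index {src} for type {src_type}")
        if not 0 <= dst < len(nodes[dst_type]):
            raise ValueError(f"bad destination index {dst} for type {dst_type}")
        counts[(src_type, rel, dst_type)] += 1
    return counts


counts = edge_type_counts(nodes, edges)
for triple, n in sorted(counts.items()):
    print(triple, n)
```

In this toy graph, the overflow pattern shows up as `size` reaching the `memcpy` statement through a data-flow edge with no intervening check against the `256` literal.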

Scan a directory

python inference.py /path/to/file.cc 0.80 cuda

from scanner import ChromiumScanner

scanner = ChromiumScanner(model_path='best_model.pt', device='cuda')
results = scanner.scan_directory('/path/to/chromium/src', threshold=0.80, max_files=5000)

for r in results:
    if r['probability'] > 0.90:
        print(f"{r['file']}  {r['func_name']}  P={r['probability']:.3f}")
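
Since every candidate needs manual review, it can help to rank the output before triage. The helper below is a minimal sketch, not part of `scanner`. `rank_candidates` is a hypothetical name, and it assumes only that each result is a dict with the `file`, `func_name`, and `probability` keys read by the loop above. The sample data is made up.

```python
from collections import defaultdict


def rank_candidates(results, min_probability=0.90):
    """Keep results above min_probability, highest first, grouped by file.

    Files appear in order of their highest-scoring candidate.
    """
    kept = [r for r in results if r["probability"] > min_probability]
    kept.sort(key=lambda r: r["probability"], reverse=True)
    by_file = defaultdict(list)
    for r in kept:
        by_file[r["file"]].append(r)
    return dict(by_file)


# Hypothetical scan results, for illustration only.
sample = [
    {"file": "a.cc", "func_name": "Foo", "probability": 0.95},
    {"file": "b.cc", "func_name": "Bar", "probability": 0.99},
    {"file": "a.cc", "func_name": "Baz", "probability": 0.97},
    {"file": "c.cc", "func_name": "Qux", "probability": 0.85},
]

ranked = rank_candidates(sample)
for path, candidates in ranked.items():
    for r in candidates:
        print(f"{path}  {r['func_name']}  P={r['probability']:.3f}")
```

Grouping by file keeps related functions together, so a reviewer can load one translation unit and work through all of its flagged functions at once.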

Manual triage checklist

After the model flags a function, check these before drawing any conclusions:

□ Does input come from an untrusted source (renderer, network, file)?
□ Is there a bounds check before the suspicious operation?
□ Is object lifetime managed correctly (ref-counted, scoped, owned)?
□ Is this an intentional pattern with a documented safety gate?
   - setHTMLUnsafe() with prior CheckHTML() gate → not a bug
   - delete this after Cancel()+reset() → Chromium idiom, not UAF
   - Unretained() on same SequencedTaskRunner → lifetime guaranteed
   - UNSAFE_BUFFERS macro with audit comment → reviewed, not a bug
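
One way to keep triage consistent across reviewers is to record the checklist answers per flagged function. The sketch below is an assumption-laden illustration. `TriageRecord` and its escalation rule are hypothetical and not part of this repo, and the rule is only one reasonable reading of the checklist: a documented safety gate closes the candidate, otherwise untrusted input combined with a missing bounds check or unmanaged lifetime is escalated.

```python
from dataclasses import dataclass


@dataclass
class TriageRecord:
    """Checklist answers for one flagged function (hypothetical structure)."""
    file: str
    func_name: str
    untrusted_input: bool          # input from renderer, network, or file?
    bounds_checked: bool           # bounds check before the suspicious operation?
    lifetime_managed: bool         # ref-counted, scoped, or owned?
    documented_safety_gate: bool   # intentional pattern with a documented gate?
    notes: str = ""

    def needs_deeper_review(self) -> bool:
        """Assumed rule: escalate ungated code that takes untrusted input
        and fails the bounds or lifetime check."""
        if self.documented_safety_gate:
            return False
        return self.untrusted_input and not (
            self.bounds_checked and self.lifetime_managed
        )


# Made-up records for illustration.
gated = TriageRecord("x.cc", "SetHtml", True, False, True, True,
                     notes="setHTMLUnsafe() behind CheckHTML() gate")
suspect = TriageRecord("y.cc", "ProcessData", True, False, True, False)

for rec in (gated, suspect):
    print(rec.func_name, "escalate" if rec.needs_deeper_review() else "close")
```

The rule is deliberately conservative about closing candidates: a record is only dismissed automatically when the reviewer has confirmed a documented gate.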

Limitations

High false positive rate on modern Chromium (distribution shift from pre-2023 training data)
Cannot detect logic bugs, cryptographic misuse, or web-layer vulnerabilities
No inter-procedural analysis — each function is analyzed in isolation
Manual verification required for every candidate

Roadmap

v4 (in progress): Fine-tuned on Chromium CVE commits 2022–2026, targeting Precision@0.80 ≥ 30%
v4.1: Dynamic validation — ASAN/libFuzzer harness generation for flagged functions
v5: Rust unsafe block support

Citation

@misc{thetacpg2026,
  title={ThetaCPG: Heterogeneous Code Property Graph for C++ Vulnerability Detection},
  author={hoanghai2110},
  year={2026},
  url={https://huggingface.co/hoanghai2110/thetacpg-v3}
}
