ThetaCPG v3 — Heterogeneous GNN for C++ Bug Detection

ThetaCPG (Heterogeneous Code Property Graph) is a Graph Neural Network model for detecting memory safety vulnerabilities in C++ code, with a focus on the Chromium codebase.


Architecture

Component         Details
Graph type        Heterogeneous Code Property Graph (CPG)
GNN backbone      HGTConv (Heterogeneous Graph Transformer)
Node types        statement, variable, function, type, literal
Edge types        CFG, DFG, call graph, type edges, literal refs
Hidden dim        256
Attention heads   8
HGT layers        6
Total params      ~63.8M
Token encoder     CodeBERT (microsoft/codebert-base)
Training data     BigVul + D2A (CVE fix commits, C/C++)
AUROC (val)       ~0.87

What the model detects

The model is trained on BigVul and D2A, collections of CVE fix commits across C/C++ projects. It recognizes these patterns best:

  • Use-After-Free (CWE-416) — object freed but dangling reference still used
  • Heap Buffer Overflow (CWE-122) — write past the end of a heap allocation
  • Out-of-Bounds Read (CWE-125) — array index not validated before read
  • Integer Overflow → Under-allocation (CWE-190) — size wraps before being passed to malloc
  • Double Free (CWE-415) — same pointer freed twice
  • Null Dereference (CWE-476) — pointer used without null check after allocation failure
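
For concreteness, the sketch below gives one minimal C++ illustration of each pattern, stored as Python strings in the same form the Quick inference example passes to the graph builder. These are hand-written illustrations, not samples drawn from BigVul or D2A, and the dictionary name `CWE_EXAMPLES` is hypothetical.

```python
# Hypothetical, hand-written illustrations of each pattern (not training samples).
CWE_EXAMPLES = {
    "CWE-416": """
void UseAfterFree() {
    int* p = new int(42);
    delete p;
    *p = 7;  // p is dangling here
}
""",
    "CWE-122": """
void HeapOverflow(const char* src, size_t n) {
    char* buf = static_cast<char*>(malloc(16));
    memcpy(buf, src, n);  // writes past buf if n > 16
    free(buf);
}
""",
    "CWE-125": """
int OobRead(const int* table, size_t len, size_t idx) {
    return table[idx];  // idx is never compared against len
}
""",
    "CWE-190": """
void* IntOverflowAlloc(uint32_t count) {
    uint32_t bytes = count * 8u;  // wraps when count > UINT32_MAX / 8
    return malloc(bytes);         // allocation smaller than the caller expects
}
""",
    "CWE-415": """
void DoubleFree() {
    char* p = static_cast<char*>(malloc(32));
    free(p);
    free(p);  // second free of the same pointer
}
""",
    "CWE-476": """
void NullDeref(size_t n) {
    char* p = static_cast<char*>(malloc(n));
    p[0] = 0;  // malloc may have returned nullptr
}
""",
}

# Print each CWE id next to the signature line of its snippet.
for cwe, snippet in CWE_EXAMPLES.items():
    first_line = snippet.strip().splitlines()[0]
    print(cwe, first_line)
```

Any of these strings could be handed to `builder.build_graph(...)` as in the Quick inference example. Whether the model actually flags such small snippets is not something this card states.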

Honest caveat: The model performs well on held-out benchmark data (0.87 AUROC) but has a high false positive rate (70% at threshold 0.80) on modern Chromium production code. The cause is distribution shift: it was trained mostly on pre-2023 patterns that Chrome has already patched. Every candidate needs manual review. v4 (in progress) is being retrained on Chromium-specific CVE data from 2022–2026.


Usage

Install dependencies

pip install torch==2.4.1 torch-geometric==2.6.1
pip install torch-scatter torch-sparse -f https://data.pyg.org/whl/torch-2.4.1+cu118.html
pip install transformers==4.40.0 tree-sitter==0.20.4
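
Because the versions above are pinned, it may help to confirm what actually got installed before running inference. The following is a minimal sketch using only the standard library; the helper name `check_pins` is hypothetical and not part of this repository.

```python
from importlib import metadata

# Pins copied from the install commands above.
PINNED = {
    "torch": "2.4.1",
    "torch-geometric": "2.6.1",
    "transformers": "4.40.0",
    "tree-sitter": "0.20.4",
}

def check_pins(pins):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Compare the public version only, so "2.4.1+cu118" matches "2.4.1".
        if installed is None or installed.split("+")[0] != expected:
            mismatches[name] = (expected, installed)
    return mismatches

if __name__ == "__main__":
    for name, (expected, installed) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

An empty result means every pinned package is present at the expected version. torch-scatter and torch-sparse are left out because the install command above does not pin them.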

Quick inference

import torch
from graph_builder import GraphBuilder
from model import HetCPGModel

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
builder = GraphBuilder()

cpp_code = """
void ProcessData(char* input, int size) {
    char buffer[256];
    memcpy(buffer, input, size);  // overflow if size > 256
    ProcessBuffer(buffer);
}
"""

graph = builder.build_graph(cpp_code)
# See inference.py for the full prediction pipeline

Scan a directory

Single file, from the command line:

python inference.py /path/to/file.cc 0.80 cuda

Whole directory, from Python:

from scanner import ChromiumScanner

scanner = ChromiumScanner(model_path='best_model.pt', device='cuda')
results = scanner.scan_directory('/path/to/chromium/src', threshold=0.80, max_files=5000)

for r in results:
    if r['probability'] > 0.90:
        print(f"{r['file']}  {r['func_name']}  P={r['probability']:.3f}")
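
For a large scan, a flat printout can be hard to review. The sketch below groups high-probability hits by file and exports them as CSV. It assumes only what the loop above shows, namely that each result is a dict with `file`, `func_name`, and `probability` keys; the helper names `rank_by_file` and `to_csv` are hypothetical, and the sample records are made up.

```python
import csv
import io
from collections import defaultdict

def rank_by_file(results, min_probability=0.90):
    """Group flagged functions by file, highest probability first."""
    by_file = defaultdict(list)
    for r in results:
        if r["probability"] > min_probability:
            by_file[r["file"]].append(r)
    for hits in by_file.values():
        hits.sort(key=lambda r: r["probability"], reverse=True)
    # Order files by their single highest-scoring function.
    return sorted(by_file.items(), key=lambda kv: kv[1][0]["probability"], reverse=True)

def to_csv(results):
    """Serialize results to CSV text for a review spreadsheet."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=["file", "func_name", "probability"],
        extrasaction="ignore",  # tolerate extra keys the scanner may return
    )
    writer.writeheader()
    writer.writerows(results)
    return out.getvalue()

# Made-up records standing in for scanner output.
sample = [
    {"file": "a.cc", "func_name": "Foo", "probability": 0.93},
    {"file": "b.cc", "func_name": "Bar", "probability": 0.97},
    {"file": "a.cc", "func_name": "Baz", "probability": 0.85},
    {"file": "a.cc", "func_name": "Qux", "probability": 0.95},
]

for path, hits in rank_by_file(sample):
    print(path, [h["func_name"] for h in hits])
```

Sorting files by their top hit puts the most suspicious files first, which is one reasonable review order when every candidate needs manual inspection.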

Manual triage checklist

After the model flags a function, check these before drawing any conclusions:

□ Does input come from an untrusted source (renderer, network, file)?
□ Is there a bounds check before the suspicious operation?
□ Is object lifetime managed correctly (ref-counted, scoped, owned)?
□ Is this an intentional pattern with a documented safety gate?
   - setHTMLUnsafe() with prior CheckHTML() gate → not a bug
   - delete this after Cancel()+reset() → Chromium idiom, not UAF
   - Unretained() on same SequencedTaskRunner → lifetime guaranteed
   - UNSAFE_BUFFERS macro with audit comment → reviewed, not a bug
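
The intentional patterns listed above can be surfaced automatically as reminders for the reviewer. This is a minimal sketch, assuming the flagged function's source text is available as a string; the helper `triage_hints` and its regular expressions are hypothetical. A match only tells the reviewer what to check and never establishes that the code is safe.

```python
import re

# Marker -> reminder, taken from the checklist above. Hypothetical helper:
# a match is only a prompt for the reviewer, never proof the code is safe.
SAFE_IDIOM_HINTS = {
    r"\bsetHTMLUnsafe\s*\(": "check for a prior CheckHTML() gate",
    r"\bdelete\s+this\b": "check for Cancel() + reset() before the delete",
    r"\bUnretained\s*\(": "check the callback runs on the same SequencedTaskRunner",
    r"\bUNSAFE_BUFFERS\b": "check for an audit comment on the macro",
}

def triage_hints(function_source):
    """Return reminders for every known-idiom marker found in the source."""
    return [
        hint
        for pattern, hint in SAFE_IDIOM_HINTS.items()
        if re.search(pattern, function_source)
    ]

# Made-up flagged function used only to exercise the helper.
flagged = """
void Widget::OnDone() {
    timer_.Cancel();
    callback_.reset();
    delete this;
}
"""

for hint in triage_hints(flagged):
    print("-", hint)
```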

Limitations

  • High false positive rate on modern Chromium (distribution shift from pre-2023 training data)
  • Cannot detect logic bugs, cryptographic misuse, or web-layer vulnerabilities
  • No inter-procedural analysis — each function is analyzed in isolation
  • Manual verification required for every candidate

Roadmap

  • v4 (in progress): Fine-tuned on Chromium CVE commits 2022–2026, targeting Precision@0.80 ≥ 30%
  • v4.1: Dynamic validation — ASAN/libFuzzer harness generation for flagged functions
  • v5: Rust unsafe block support
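
The v4 target is stated as Precision@0.80. One plausible reading, assumed here, is the fraction of functions scored at or above 0.80 that manual review confirms as real bugs. The sketch below computes that quantity; the function name `precision_at`, the `>=` comparison, and the sample scores and labels are all assumptions for illustration.

```python
def precision_at(scores, labels, threshold=0.80):
    """Fraction of functions scored >= threshold that are truly vulnerable.

    Returns None when nothing is flagged, since precision is undefined then.
    """
    flagged = [label for score, label in zip(scores, labels) if score >= threshold]
    if not flagged:
        return None
    return sum(flagged) / len(flagged)

# Made-up scores and ground-truth labels (1 = confirmed bug, 0 = benign).
scores = [0.95, 0.91, 0.88, 0.84, 0.82, 0.81, 0.60, 0.40, 0.85, 0.83]
labels = [1,    0,    0,    1,    0,    0,    1,    0,    1,    0]

p = precision_at(scores, labels)
print(f"precision at 0.80: {p:.3f}")
```

In this made-up sample, 8 functions are flagged and 3 are confirmed, giving a precision of 0.375.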

Citation

@misc{thetacpg2026,
  title={ThetaCPG: Heterogeneous Code Property Graph for C++ Vulnerability Detection},
  author={hoanghai2110},
  year={2026},
  url={https://huggingface.co/hoanghai2110/thetacpg-v3}
}