A Residual Quantized Variational Autoencoder (RQ-VAE) trained to assign hierarchical semantic identifiers to product embeddings. The model compresses 1024-dimensional item embeddings into compact 3-level discrete codes (A, B, C), where each level draws from its own 256-entry codebook. Semantically similar products share common code prefixes.
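The core encoding idea can be sketched as greedy residual quantization: each level picks the nearest codebook entry and passes the remaining residual to the next level. This is a minimal NumPy illustration with toy dimensions (8-dim vectors instead of the model's 1024), not the thesis implementation:

```python
import numpy as np

def encode_residual(x, codebooks):
    """Greedy residual quantization: at each level, pick the nearest
    codebook entry, then quantize what remains of the residual."""
    ids = []
    residual = x.astype(np.float64)
    for cb in codebooks:                      # cb: [256, D] codebook per level
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))           # nearest code at this level
        ids.append(idx)
        residual = residual - cb[idx]         # next level sees the residual
    return ids                                # e.g. [A, B, C] for 3 levels

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 8)) for _ in range(3)]
x = rng.normal(size=8)
print(encode_residual(x, codebooks))  # three indices, each in [0, 256)
```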
| Metric | Value |
|---|---|
| Codebook utilization | 100% (0 dead codes, all levels) |
| Perplexity | 228–242 per level |
| Cosine similarity (orig → recon) | mean 0.80, p50 0.80 |
| NN overlap@5 | 0.236 |
| Unique (A,B,C) tuples | 55,647 / 63,319 (87.9%) |
| Prefix-1 coincidence (NN / random) | 0.745 / 0.005 |
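Two of the table's metrics are straightforward to compute from the `[N, 3]` ID matrix. The sketch below uses random toy IDs (names and sampling are illustrative, not the thesis code); note that for uniformly random 256-way codes the random-pair prefix-1 coincidence lands near 1/256 ≈ 0.004, which is why the 0.745 value for nearest-neighbor pairs indicates strong semantic clustering:

```python
import numpy as np

# Toy [N, 3] semantic-ID matrix standing in for the model output.
ids = np.random.default_rng(1).integers(0, 256, size=(1000, 3))

# Unique (A, B, C) tuples: how many items receive a distinct full code.
unique_tuples = len({tuple(row) for row in ids})

# Prefix-1 coincidence for random pairs: probability that two randomly
# drawn items share the same level-A code.
pairs = np.random.default_rng(2).integers(0, len(ids), size=(5000, 2))
coincidence = float(np.mean(ids[pairs[:, 0], 0] == ids[pairs[:, 1], 0]))
print(unique_tuples, round(coincidence, 3))
```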
```python
import torch

from mipt_master.src.rqvae.model import RQVAE, RQVAEConfig
from mipt_master.src.rqvae.train import load_checkpoint

# Restore the trained model from the saved checkpoint.
ckpt = load_checkpoint("best_model.pth")
cfg = RQVAEConfig(**ckpt.config)
model = RQVAE(cfg)
model.load_state_dict(ckpt.model_state_dict)
model.eval()

# Encode item embeddings [N, 1024] → semantic IDs [N, 3]
with torch.no_grad():
    semantic_ids = model.encode_to_semantic_ids(embeddings)
```
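Because similar products share code prefixes, the returned IDs can seed coarse candidate retrieval by bucketing items on a shared prefix. A minimal sketch, with a hypothetical toy ID matrix in place of real model output:

```python
from collections import defaultdict

import numpy as np

# Toy stand-in for an [N, 3] semantic-ID matrix from the encoder.
semantic_ids = np.array([[5, 17, 3], [5, 17, 9], [5, 2, 1], [8, 0, 0]])

# Bucket items by their (A, B) prefix: items sharing a prefix are the
# model's coarse semantic neighbors.
buckets = defaultdict(list)
for item_idx, (a, b, c) in enumerate(semantic_ids.tolist()):
    buckets[(a, b)].append(item_idx)

print(dict(buckets))  # {(5, 17): [0, 1], (5, 2): [2], (8, 0): [3]}
```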
Master's thesis, Moscow Institute of Physics and Technology (MIPT), 2026.