# Stage 4: Cross-Cloud Identity Embeddings
Part of the Trinetra multi-cloud threat detection pipeline (Group 24). Maps logically equivalent cloud identities from AWS, Azure, and GCP to nearby points in a shared 128-dimensional embedding space using contrastive learning.
## Problem Solved
The same person has different account identifiers across cloud providers:

- AWS: `user_alice`
- Azure: `user_alice_az`
- GCP: `user_alice_gcp`
Without identity linking, a cross-cloud pivot attack, in which stolen AWS credentials are reused on GCP, appears as two completely unrelated events to the downstream graph neural network. This model maps all three identifiers to nearby points so Stage 5's graph can connect them and Stage 6's RGCN can detect the pivot.
## Architecture
| Component | Detail |
|---|---|
| Base model | google/flan-t5-base (encoder only) |
| Fine-tuning | LoRA rank=16, alpha=32, target=q,k,v,o |
| Projection head | Linear(512→256→128) + ReLU + LayerNorm |
| Output dim | 128 (z_identity) |
| Training loss | Contrastive with margin=0.3 |
| Positive pairs | Same person, different cloud provider |
| Negative pairs | Different persons, any provider |
| Epochs | 50 × 40 steps |
| Hardware | Kaggle T4 x1 |
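The training loss row above names a contrastive objective with margin 0.3 but not its exact form. One common formulation over L2-normalised embeddings, shown here as a sketch (the `contrastive_loss` helper and its label convention are assumptions, not the repo's actual training code), pulls positive pairs together by cosine distance and penalises negative pairs only when they fall inside the margin:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, label, margin=0.3):
    """Pairwise contrastive loss on L2-normalised embeddings (illustrative sketch).

    label == 1: positive pair (same person, different cloud) -> pull together
    label == 0: negative pair (different persons)           -> push apart
    """
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    dist = 1.0 - (z1 * z2).sum(-1)             # cosine distance, in [0, 2]
    pos = label * dist                          # positives pay their full distance
    neg = (1 - label) * F.relu(margin - dist)   # negatives pay only inside the margin
    return (pos + neg).mean()
```

A perfectly aligned positive pair contributes zero loss; a negative pair contributes nothing once its cosine distance exceeds the margin, so optimisation effort focuses on hard negatives.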
## Files
| File | Description |
|---|---|
| `adapter/` | LoRA adapter weights (flan-t5-base encoder) |
| `proj_head.pt` | Projection head weights (512→128) |
| `config.json` | Training configuration |
| `z_identity.parquet` | Pre-computed embeddings for all 33 pipeline identities |
## Quick Start
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel
from huggingface_hub import hf_hub_download

REPO = "sohomn/stage4-identity-embeddings"
BASE = "google/flan-t5-base"

tokenizer = AutoTokenizer.from_pretrained(REPO)
base = AutoModel.from_pretrained(BASE)
# The LoRA adapter lives in the repo's adapter/ subfolder
encoder = PeftModel.from_pretrained(base, REPO, subfolder="adapter")
encoder.eval()

proj = nn.Sequential(
    nn.Linear(512, 256), nn.ReLU(), nn.LayerNorm(256), nn.Linear(256, 128)
)
proj.load_state_dict(torch.load(
    hf_hub_download(REPO, "proj_head.pt"), map_location="cpu"
))
proj.eval()

def embed(identity: str, provider: str) -> list:
    text = f"identity: {identity} provider: {provider}"
    inputs = tokenizer(text, return_tensors="pt", max_length=32,
                       truncation=True, padding=True)
    with torch.no_grad():
        out = encoder.encoder(**inputs)
        # Mean-pool over non-padding tokens, then project and L2-normalise
        mask = inputs["attention_mask"].unsqueeze(-1).float()
        emb = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
        z = F.normalize(proj(emb), dim=-1)
    return z[0].tolist()

print(embed("user_alice", "AWS"))       # 128-dim vector
print(embed("user_alice_az", "Azure"))  # should be close to the one above
```

## Output Contract
- Shape: `(128,)` per identity, L2-normalised
- Equivalent identities: cosine similarity > 0.8
- Non-equivalent identities: cosine similarity < 0.3
- Consumed by: Stage 5 graph construction (last 128 dims of the 514-dim node vector)
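Because the vectors are L2-normalised, cosine similarity reduces to a dot product, which makes the contract cheap to verify. A small sketch of such a check (the `cosine` helper is illustrative; it would be applied to outputs of the `embed` function from the Quick Start):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors (helper for contract checks)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# With embed() from the Quick Start, the contract would be checked as, e.g.:
#   cosine(embed("user_alice", "AWS"), embed("user_alice_az", "Azure")) > 0.8
#   cosine(embed("user_alice", "AWS"), embed("some_other_user", "GCP")) < 0.3
```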
## Training Details
| Setting | Value |
|---|---|
| Identity registry | 11 persons × 3 providers = 33 identities |
| Positive pairs | 33 (same person, different cloud) |
| Negative pairs | sampled each batch |
| Effective batch | 32 (16 pos + 16 neg) |
| Best loss | see config.json |
| Seed | 42 |
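The pair counts above follow directly from the registry shape: 11 persons × 3 providers gives 33 identities, and each person contributes the 3 cross-provider combinations of their accounts, for 33 positive pairs. A sketch of that enumeration (person names here are placeholders, not the real registry):

```python
from itertools import combinations

persons = [f"person_{i}" for i in range(11)]   # placeholder names
providers = ["AWS", "Azure", "GCP"]

# Every person has one identity per provider
identities = [(p, c) for p in persons for c in providers]

# Positive pairs: same person, different providers (C(3, 2) = 3 per person)
positive_pairs = [((p, a), (p, b))
                  for p in persons
                  for a, b in combinations(providers, 2)]

print(len(identities), len(positive_pairs))  # 33 33
```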