Stage 4 β€” Cross-Cloud Identity Embeddings

Part of the Trinetra multi-cloud threat detection pipeline (Group 24). Maps logically equivalent cloud identities from AWS, Azure, and GCP to nearby points in a shared 128-dimensional embedding space using contrastive learning.

Problem Solved

The same person has different account identifiers across cloud providers:

  • AWS: user_alice
  • Azure: user_alice_az
  • GCP: user_alice_gcp

Without identity linking, a cross-cloud pivot attack β€” where stolen AWS credentials are reused on GCP β€” appears as two completely unrelated events to the downstream graph neural network. This model maps all three to nearby points so Stage 5's graph can connect them and Stage 6's RGCN can detect the pivot.

Architecture

Component         Detail
Base model        google/flan-t5-base (encoder only)
Fine-tuning       LoRA, rank=16, alpha=32, target modules q,k,v,o
Projection head   Linear(512→256) → ReLU → LayerNorm → Linear(256→128)
Output dim        128 (z_identity)
Training loss     Contrastive, margin=0.3
Positive pairs    Same person, different cloud provider
Negative pairs    Different persons, any provider
Training length   50 epochs × 40 steps
Hardware          1× Kaggle T4 GPU
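The card states a contrastive loss with margin=0.3 but does not show the exact formulation. A common margin-based variant over L2-normalised embeddings looks like the sketch below (the function name and the use of cosine distance are assumptions, not confirmed by this card):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_a, z_b, label, margin=0.3):
    """Margin-based contrastive loss (illustrative sketch, not the verified
    training code). label=1 pulls a positive pair together; label=0 pushes a
    negative pair apart until its cosine distance exceeds `margin`."""
    d = 1.0 - F.cosine_similarity(z_a, z_b, dim=-1)   # cosine distance in [0, 2]
    pos = label * d.pow(2)                            # same person: minimise distance
    neg = (1 - label) * F.relu(margin - d).pow(2)     # different persons: enforce margin
    return (pos + neg).mean()
```

With margin=0.3, a negative pair stops contributing to the loss once its cosine similarity falls below 0.7.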

Files

File                 Description
adapter/             LoRA adapter weights (flan-t5-base encoder)
proj_head.pt         Projection head weights (512→128)
config.json          Training configuration
z_identity.parquet   Pre-computed embeddings for all 33 pipeline identities

Quick Start

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel
from huggingface_hub import hf_hub_download

REPO = "sohomn/stage4-identity-embeddings"
BASE = "google/flan-t5-base"

tokenizer = AutoTokenizer.from_pretrained(REPO)
base      = AutoModel.from_pretrained(BASE)
encoder   = PeftModel.from_pretrained(base, REPO, subfolder="adapter")
encoder.eval()

proj = nn.Sequential(
    nn.Linear(512, 256), nn.ReLU(), nn.LayerNorm(256), nn.Linear(256, 128)
)
proj.load_state_dict(torch.load(
    hf_hub_download(REPO, "proj_head.pt"), map_location="cpu"
))
proj.eval()

def embed(identity: str, provider: str) -> list:
    text   = f"identity: {identity} provider: {provider}"
    inputs = tokenizer(text, return_tensors="pt", max_length=32,
                       truncation=True, padding=True)
    with torch.no_grad():
        out  = encoder.encoder(**inputs)                  # run the T5 encoder only
        mask = inputs["attention_mask"].unsqueeze(-1).float()
        emb  = (out.last_hidden_state * mask).sum(1) / mask.sum(1)  # mean-pool non-padding tokens
        z    = proj(emb)                                  # project 512 -> 128
        z    = F.normalize(z, dim=-1)                     # L2-normalise
    return z[0].tolist()

print(embed("user_alice", "AWS"))      # 128-dim vector
print(embed("user_alice_az", "Azure")) # should be close to above

Output Contract

  • Shape: (128,) per identity, L2-normalised
  • Equivalent identities: cosine similarity > 0.8
  • Non-equivalent identities: cosine similarity < 0.3
  • Consumed by: Stage 5 graph construction (last 128 dims of 514-dim node vector)
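The thresholds above can be checked directly on any pair of embeddings. A small helper (the name check_contract is illustrative, not part of the pipeline):

```python
import torch
import torch.nn.functional as F

def check_contract(z_a: torch.Tensor, z_b: torch.Tensor, equivalent: bool) -> bool:
    """Return True if a pair of 128-dim embeddings meets the stage-4 contract:
    cosine similarity > 0.8 for equivalent identities, < 0.3 otherwise."""
    sim = F.cosine_similarity(z_a, z_b, dim=-1).item()
    return sim > 0.8 if equivalent else sim < 0.3

# Toy one-hot vectors standing in for real embed() outputs.
e0 = torch.zeros(128); e0[0] = 1.0
e1 = torch.zeros(128); e1[1] = 1.0
```

In practice z_a and z_b would come from embed() above, e.g. for user_alice on AWS and user_alice_az on Azure.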

Training Details

Setting             Value
Identity registry   11 persons × 3 providers = 33 identities
Positive pairs      33 (same person, different cloud)
Negative pairs      Sampled each batch
Effective batch     32 (16 positive + 16 negative)
Best loss           See config.json
Seed                42
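The pair counts above follow from the registry: each of the 11 persons yields C(3,2) = 3 cross-provider positives, giving 11 × 3 = 33. A sketch of the enumeration (person names are placeholders and the sampler is illustrative, not the actual training script):

```python
import random
from itertools import combinations

persons = [f"person_{i}" for i in range(11)]   # placeholder names
providers = ["AWS", "Azure", "GCP"]

# Positive pairs: same person on two different providers -> 11 * C(3,2) = 33.
positives = [
    ((p, a), (p, b))
    for p in persons
    for a, b in combinations(providers, 2)
]

def sample_negative(rng=random):
    """Negative pair: two distinct persons, any providers (resampled per batch)."""
    p1, p2 = rng.sample(persons, 2)
    return (p1, rng.choice(providers)), (p2, rng.choice(providers))
```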