# Stage 4: Cross-Cloud Identity Embeddings
Part of the Trinetra multi-cloud threat detection pipeline (Group 24). Maps logically equivalent cloud identities from AWS, Azure, and GCP to nearby points in a shared 128-dimensional embedding space using contrastive learning.
## Problem Solved
The same person has different account identifiers across cloud providers:

- AWS: `user_alice`
- Azure: `user_alice_az`
- GCP: `user_alice_gcp`
Without identity linking, a cross-cloud pivot attack, in which stolen AWS credentials are reused on GCP, appears as two completely unrelated events to the downstream graph neural network. This model maps all three identifiers to nearby points so Stage 5's graph can connect them and Stage 6's RGCN can detect the pivot.
## Architecture
| Component | Detail |
|---|---|
| Base model | google/flan-t5-base (encoder only) |
| Fine-tuning | LoRA rank=16, alpha=32, target=q,k,v,o |
| Projection head | Linear(512→256→128) + ReLU + LayerNorm |
| Output dim | 128 (z_identity) |
| Training loss | Contrastive with margin=0.3 |
| Positive pairs | Same person, different cloud provider |
| Negative pairs | Different persons, any provider |
| Epochs | 50 × 40 steps |
| Hardware | Kaggle T4 x1 |
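The training loss row above names a contrastive objective with margin 0.3 but not its exact form. One common formulation over L2-normalised embeddings, shown here as a sketch (the `contrastive_loss` helper and its label convention are assumptions, not the repo's actual training code), pulls positive pairs together by cosine distance and penalises negative pairs only when they fall inside the margin:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, label, margin=0.3):
    """Pairwise contrastive loss on L2-normalised embeddings (illustrative sketch).

    label == 1: positive pair (same person, different cloud) -> pull together
    label == 0: negative pair (different persons)           -> push apart
    """
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    dist = 1.0 - (z1 * z2).sum(-1)             # cosine distance, in [0, 2]
    pos = label * dist                          # positives pay their full distance
    neg = (1 - label) * F.relu(margin - dist)   # negatives pay only inside the margin
    return (pos + neg).mean()
```

A perfectly aligned positive pair contributes zero loss; a negative pair contributes nothing once its cosine distance exceeds the margin, so optimisation effort focuses on hard negatives.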
## Files
| File | Description |
|---|---|
| `adapter/` | LoRA adapter weights (flan-t5-base encoder) |
| `proj_head.pt` | Projection head weights (512→128) |
| `config.json` | Training configuration |
| `z_identity.parquet` | Pre-computed embeddings for all 33 pipeline identities |
## Quick Start
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel
from huggingface_hub import hf_hub_download

REPO = "sohomn/stage4-identity-embeddings"
BASE = "google/flan-t5-base"

tokenizer = AutoTokenizer.from_pretrained(REPO)
base = AutoModel.from_pretrained(BASE)
# The LoRA adapter lives in the repo's adapter/ subfolder
encoder = PeftModel.from_pretrained(base, REPO, subfolder="adapter")
encoder.eval()

proj = nn.Sequential(
    nn.Linear(512, 256), nn.ReLU(), nn.LayerNorm(256), nn.Linear(256, 128)
)
proj.load_state_dict(torch.load(
    hf_hub_download(REPO, "proj_head.pt"), map_location="cpu"
))
proj.eval()

def embed(identity: str, provider: str) -> list:
    text = f"identity: {identity} provider: {provider}"
    inputs = tokenizer(text, return_tensors="pt", max_length=32,
                       truncation=True, padding=True)
    with torch.no_grad():
        out = encoder.encoder(**inputs)
        # Mean-pool over non-padding tokens, then project and L2-normalise
        mask = inputs["attention_mask"].unsqueeze(-1).float()
        emb = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
        z = F.normalize(proj(emb), dim=-1)
    return z[0].tolist()

print(embed("user_alice", "AWS"))       # 128-dim vector
print(embed("user_alice_az", "Azure"))  # should be close to the one above
```

## Output Contract
- Shape: `(128,)` per identity, L2-normalised
- Equivalent identities: cosine similarity > 0.8
- Non-equivalent identities: cosine similarity < 0.3
- Consumed by: Stage 5 graph construction (last 128 dims of the 514-dim node vector)
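Because the vectors are L2-normalised, cosine similarity reduces to a dot product, which makes the contract cheap to verify. A small sketch of such a check (the `cosine` helper is illustrative; it would be applied to outputs of the `embed` function from the Quick Start):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors (helper for contract checks)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# With embed() from the Quick Start, the contract would be checked as, e.g.:
#   cosine(embed("user_alice", "AWS"), embed("user_alice_az", "Azure")) > 0.8
#   cosine(embed("user_alice", "AWS"), embed("some_other_user", "GCP")) < 0.3
```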
## Training Details
| Setting | Value |
|---|---|
| Identity registry | 11 persons × 3 providers = 33 identities |
| Positive pairs | 33 (same person, different cloud) |
| Negative pairs | sampled each batch |
| Effective batch | 32 (16 pos + 16 neg) |
| Best loss | see config.json |
| Seed | 42 |
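The pair counts above follow directly from the registry shape: 11 persons × 3 providers gives 33 identities, and each person contributes the 3 cross-provider combinations of their accounts, for 33 positive pairs. A sketch of that enumeration (person names here are placeholders, not the real registry):

```python
from itertools import combinations

persons = [f"person_{i}" for i in range(11)]   # placeholder names
providers = ["AWS", "Azure", "GCP"]

# Every person has one identity per provider
identities = [(p, c) for p in persons for c in providers]

# Positive pairs: same person, different providers (C(3, 2) = 3 per person)
positive_pairs = [((p, a), (p, b))
                  for p in persons
                  for a, b in combinations(providers, 2)]

print(len(identities), len(positive_pairs))  # 33 33
```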