SIEM Log Generator — LLaMA 3.1-8B (Stage 0b)

Fine-tuned LLaMA 3.1-8B-Instruct model that generates realistic, provider-native cloud security logs (SIEM events) from structured input events. Part of a multi-cloud threat detection research pipeline (Group 24, Final Year Project).

Given a structured security event (provider, action, entity IDs, attack phase, region, etc.), the model outputs a valid provider-native JSON log — AWS CloudTrail / GuardDuty, Azure Activity Log, or GCP Cloud Logging format — with a _pipeline_meta field preserving edge IDs and labels for downstream graph neural network stages.


Model Details

Citations

@article{dubey2024llama,
  title  = {The Llama 3 Herd of Models},
  author = {Dubey, Abhimanyu and others},
  year   = {2024},
  url    = {https://arxiv.org/abs/2407.21783}
}

@inproceedings{dettmers2023qlora,
  title     = {QLoRA: Efficient Finetuning of Quantized LLMs},
  author    = {Dettmers, Tim and Pagnoni, Artidoro and Farhadi, Ali and Zettlemoyer, Luke},
  booktitle = {NeurIPS},
  year      = {2023},
  url       = {https://arxiv.org/abs/2305.14314}
}

@inproceedings{hu2022lora,
  title     = {LoRA: Low-Rank Adaptation of Large Language Models},
  author    = {Hu, Edward J. and others},
  booktitle = {ICLR},
  year      = {2022},
  url       = {https://arxiv.org/abs/2106.09685}
}

Uses

Direct Use

Generate provider-native cloud security logs for research pipelines, dataset augmentation, and security simulation. Given a structured event dict, the model outputs a complete JSON log in the correct format for AWS CloudTrail, Azure Activity Log, or GCP Cloud Logging.

Downstream Use

This model is Stage 0b in a 10-stage multi-cloud threat detection pipeline:

Stage 0a (Attack Simulator) → Stage 0b (this model, log renderer)
  → Stage 1 (log ingestion) → Stage 2 (BGE-Large embeddings)
  → Stage 3a/3b (CVE extraction + risk scoring)
  → Stage 4 (identity embeddings) → Stage 5 (graph construction)
  → Stage 6 (RGCN) → Stage 7 (Temporal GNN)
  → Stage 8 (FT-Transformer) → Stage 9 (ensemble) → Stage 10 (explanation)

The _pipeline_meta field in every generated log preserves edge_id, scenario_id, t, malicious, and attack_phase labels — acting as a foreign key for all downstream stages.
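As a sketch of how a downstream stage can exploit this foreign key, the snippet below joins a generated log back to its ground-truth label via `_pipeline_meta.edge_id` (the log and event contents here are illustrative, modelled on the example event shown later in this card):

```python
import json

# Hypothetical generated log and its source event, keyed by edge_id.
generated_logs = [
    {"eventName": "AssumeRole",
     "_pipeline_meta": {"edge_id": "user_042__ASSUMES_ROLE__role_admin",
                        "scenario_id": "scenario_00001", "t": 5, "malicious": 1}},
]
source_events = {
    "user_042__ASSUMES_ROLE__role_admin": {
        "malicious": 1, "attack_phase": "privilege_escalation"},
}

# Join each log to its ground-truth label via the edge_id foreign key.
labelled = []
for log in generated_logs:
    event = source_events[log["_pipeline_meta"]["edge_id"]]
    labelled.append({"log": log, "label": event["malicious"]})

print(labelled[0]["label"])  # 1
```

Any downstream stage (graph construction, GNN training) can perform the same lookup to recover scenario, timestep, and label information without re-parsing the log body.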

Out-of-Scope Use

  • Not for production security monitoring — logs are synthetic and generated for research purposes only
  • Not a threat detector — this model renders logs, it does not classify them
  • Not suitable for generating real credentials, IPs, or account IDs — all identifiers are synthetic

Training Details

Training Data

Derived from Stage 0a of the pipeline — an attack chain simulator generating 1,000 multi-cloud scenarios across 4 attack templates:

Attack Template              Description
Privilege Escalation         IAM role abuse across AWS/Azure/GCP
Lateral Movement             VM-to-VM propagation within cloud VPCs
Cross-Cloud Identity Pivot   Credential exfiltration across cloud boundaries
CVE Exploitation             Known CVE exploitation against cloud-hosted VMs

  • Source data: 632,108 structured events across 1,000 scenarios, T=20 timesteps
  • Class balance: ~65% benign / ~35% malicious
  • Providers covered: AWS, Azure, GCP, AWS_GCP (cross-cloud), GCP_Azure (cross-cloud)
  • Actions covered: ASSUMES_ROLE, ACCESS, CONNECTS_TO, EXPLOITS, CROSS_CLOUD_ACCESS, VM_LIST, RESTART_VM, STOP_VM

Training pairs were built by rendering each structured event into a LLaMA chat template (system prompt + structured event → provider-native JSON log). The dataset was capped at 2,000 pairs for the 2k sample run.
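A minimal sketch of how one such pair might be assembled, assuming LLaMA-style chat messages (the exact system prompt, field set, and target-log renderer are pipeline-specific; the `render_training_pair` helper and placeholder target below are hypothetical):

```python
import json

def render_training_pair(event: dict) -> list:
    """Render one structured event into chat messages for supervised fine-tuning.

    The assistant turn holds the target provider-native log; a placeholder
    is used here since the real rendering logic lives in the pipeline.
    """
    system = ("You are a cloud security log renderer. Given a structured "
              "security event, output only the provider-native JSON log.")
    target_log = {  # placeholder target, not the actual renderer output
        "eventName": event["action"],
        "_pipeline_meta": {"edge_id": event["edge_id"]},
    }
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": json.dumps(event)},
        {"role": "assistant", "content": json.dumps(target_log)},
    ]

pair = render_training_pair({"action": "ASSUMES_ROLE",
                             "edge_id": "user_042__ASSUMES_ROLE__role_admin"})
print(pair[2]["role"])  # assistant
```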

Training Procedure

Preprocessing

  • Each structured event is converted to a LLaMA 3.1 chat-format prompt
  • System prompt instructs the model to output only a valid JSON log with no explanation
  • Sequences truncated to MAX_SEQ_LEN=768 tokens
  • Validation split: last 10% of scenarios held out

Training Hyperparameters

Hyperparameter           Value
Base model               meta-llama/Meta-Llama-3.1-8B-Instruct
Quantisation             4-bit NF4 (double quantisation enabled)
LoRA rank                16
LoRA alpha               32
LoRA dropout             0.05
LoRA target modules      q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Training regime          fp16 mixed precision (T4 Turing — no bfloat16)
Optimiser                paged_adamw_8bit
Learning rate            2e-4
LR scheduler             cosine
Epochs                   1
Per-device batch size    4
Gradient accumulation    4 (effective batch = 16)
Warmup steps             100
Max sequence length      768
NEFTune noise alpha      5
Seed                     42
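The table above corresponds roughly to the following quantisation and adapter configuration (a sketch only; argument names follow the current transformers/peft APIs and may differ from the actual training script):

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 with double quantisation; fp16 compute since the T4 lacks bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

# LoRA adapter matching the hyperparameters listed above.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```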

Hardware

  • Platform: Kaggle (Notebook)
  • GPU: NVIDIA Tesla T4 x1
  • VRAM: 16 GB
  • Fine-tuning method: QLoRA — full 8B model loaded in 4-bit with base weights frozen, only LoRA adapter weights updated (~1% of parameters trainable)

Evaluation

Testing Data

Held-out records from the last 10% of scenario IDs, excluded from the training split and evaluated post-training.

Metrics

Metric                    Description
JSON Validity %           % of generated outputs that parse as valid JSON
Schema Compliance %       % of outputs containing all required provider-specific fields
Edge ID Preservation %    % of outputs where _pipeline_meta.edge_id matches the source event
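These three metrics can be computed with a simple harness like the one below (a sketch; the `score_outputs` function and example strings are illustrative, and the required fields per provider are those listed in the Log Schema Coverage section):

```python
import json

def score_outputs(outputs, events, required_fields):
    """Compute JSON validity, schema compliance, and edge-id preservation rates.

    outputs         : raw model output strings
    events          : matching source event dicts (each with an 'edge_id')
    required_fields : provider-specific required field names
    """
    valid = compliant = preserved = 0
    for raw, event in zip(outputs, events):
        try:
            log = json.loads(raw)
        except json.JSONDecodeError:
            continue  # invalid JSON counts against all three metrics
        valid += 1
        if all(f in log for f in required_fields):
            compliant += 1
        if log.get("_pipeline_meta", {}).get("edge_id") == event["edge_id"]:
            preserved += 1
    n = len(outputs)
    return {"json_validity": valid / n,
            "schema_compliance": compliant / n,
            "edge_id_preservation": preserved / n}

scores = score_outputs(
    ['{"eventName": "AssumeRole", "_pipeline_meta": {"edge_id": "e1"}}',
     "not json"],
    [{"edge_id": "e1"}, {"edge_id": "e2"}],
    ["eventName"],
)
print(scores)  # {'json_validity': 0.5, 'schema_compliance': 0.5, 'edge_id_preservation': 0.5}
```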

Results

Results below are from the 2k sample run (1 epoch, 2000 training pairs). Full-scale results pending.

Metric                    Threshold    Result
JSON Validity %           ≥ 90%        pending full run
Schema Compliance %       ≥ 85%        pending full run
Edge ID Preservation %    ≥ 90%        pending full run

Technical Specifications

Model Architecture

  • Base: LLaMA 3.1-8B-Instruct (decoder-only transformer, 32 layers, 4096 hidden dim, 32 attention heads)
  • Adapter: LoRA rank-16 injected into all 7 projection matrices across all 32 layers
  • Quantisation: 4-bit NF4 via bitsandbytes — base weights frozen at 4-bit, LoRA adapters trained in fp16
  • Trainable parameters: 83M / 8B total (1.0%)

Log Schema Coverage

AWS (CloudTrail / GuardDuty)

Required fields: eventSource, eventName, awsRegion, userIdentity, sourceIPAddress, readOnly, resources, managementEvent, sessionContext, _pipeline_meta

Azure (Activity Log)

Required fields: time, operationName, correlationId, identity, properties, _pipeline_meta

GCP (Cloud Logging)

Required fields: protoPayload, resource, severity, timestamp, logName, _pipeline_meta
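A schema-compliance check over these three formats reduces to a mapping from provider to its required fields (field lists copied from the sections above; the `missing_fields` helper is illustrative):

```python
REQUIRED_FIELDS = {
    "AWS": ["eventSource", "eventName", "awsRegion", "userIdentity",
            "sourceIPAddress", "readOnly", "resources", "managementEvent",
            "sessionContext", "_pipeline_meta"],
    "Azure": ["time", "operationName", "correlationId", "identity",
              "properties", "_pipeline_meta"],
    "GCP": ["protoPayload", "resource", "severity", "timestamp",
            "logName", "_pipeline_meta"],
}

def missing_fields(log: dict, provider: str) -> list:
    """Return the required fields absent from a generated log."""
    return [f for f in REQUIRED_FIELDS[provider] if f not in log]

print(missing_fields({"time": "2024-01-01T00:00:00Z", "operationName": "op",
                      "correlationId": "c1", "identity": {}, "properties": {},
                      "_pipeline_meta": {}}, "Azure"))  # []
```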

Pipeline Meta Field

Every generated log contains:

"_pipeline_meta": {
  "edge_id":           "user_001__ASSUMES_ROLE__role_admin",
  "scenario_id":       "scenario_00042",
  "t":                 7,
  "malicious":         1,
  "attack_phase":      "privilege_escalation",
  "provider":          "AWS",
  "original_provider": "AWS",
  "is_cross_cloud":    false
}

How to Get Started

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch, json

base_id    = "meta-llama/Meta-Llama-3.1-8B-Instruct"
adapter_id = "Final-year-grp24/siem-log-generator-llama31-8b"

# 4-bit NF4 quantisation matching the training configuration.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load the quantised base model, then attach the LoRA adapter.
tokenizer = AutoTokenizer.from_pretrained(base_id)
base      = AutoModelForCausalLM.from_pretrained(base_id, quantization_config=bnb,
                                                  device_map="auto", torch_dtype=torch.float16)
model     = PeftModel.from_pretrained(base, adapter_id)
model.eval()

event = {
    "provider": "AWS", "action": "ASSUMES_ROLE",
    "entity_id": "user_042", "target_id": "role_admin",
    "region": "us-east-1", "cloud_account": "acc_aws_123456",
    "source_ip": "10.0.1.42", "status": "Success",
    "malicious": 1, "attack_phase": "privilege_escalation",
    "edge_id": "user_042__ASSUMES_ROLE__role_admin",
    "scenario_id": "scenario_00001", "t": 5,
}

system = ("You are a cloud security log renderer for a research pipeline. "
          "Given a structured security event, generate ONLY the corresponding "
          "cloud provider log as a valid JSON object. Output nothing except the JSON. "
          "No explanation. No markdown fences. "
          "The JSON must include a \"_pipeline_meta\" field preserving edge_id and labels.")

messages = [{"role": "system", "content": system},
            {"role": "user", "content": json.dumps(event)}]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding keeps the JSON output deterministic.
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=500, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)

# Decode only the newly generated tokens, then extract the JSON object.
response = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
log      = json.loads(response[response.find("{"):response.rfind("}") + 1])
print(json.dumps(log, indent=2))

Environmental Impact

  • Hardware: NVIDIA Tesla T4 (16GB VRAM)
  • Cloud provider: Google (Kaggle)
  • Training duration: ~1–2 hours (2k sample), ~9–11 hours (full 480-scenario run)
  • Carbon estimation: ML Impact Calculator