	---
	license: cc-by-nc-4.0
	library_name: pytorch
	tags:
	- cybersecurity
	- identity-security
	- insider-threat
	- ueba
	- user-risk-scoring
	- tabular-classification
	- synthetic-data
	- xgboost
	- baseline
	- leakage-diagnostic
	pipeline_tag: tabular-classification
	base_model: []
	datasets:
	- xpertsystems/cyb006-sample
	metrics:
	- accuracy
	- f1
	- roc_auc
	model-index:
	- name: cyb006-baseline-classifier
	  results:
	  - task:
	      type: tabular-classification
	      name: 3-class user risk tier classification
	    dataset:
	      type: xpertsystems/cyb006-sample
	      name: CYB006 Synthetic Login Activity Dataset (Sample)
	    metrics:
	    - type: roc_auc
	      value: 0.8017
	      name: Test macro ROC-AUC OvR (XGBoost, seed 42)
	    - type: accuracy
	      value: 0.6667
	      name: Test accuracy (XGBoost, seed 42)
	    - type: f1
	      value: 0.6454
	      name: Test macro-F1 (XGBoost, seed 42)
	    - type: accuracy
	      value: 0.700
	      name: Multi-seed accuracy mean ± 0.082 (XGBoost, 10 seeds)
	    - type: roc_auc
	      value: 0.812
	      name: Multi-seed ROC-AUC mean ± 0.048 (XGBoost, 10 seeds)
	    - type: roc_auc
	      value: 0.6974
	      name: Test macro ROC-AUC OvR (MLP, seed 42)
	    - type: accuracy
	      value: 0.6000
	      name: Test accuracy (MLP, seed 42)
	    - type: f1
	      value: 0.5914
	      name: Test macro-F1 (MLP, seed 42)
	---

	# CYB006 Baseline Classifier

	**User-risk-tier classifier trained on the CYB006 synthetic login
	activity sample. Predicts which of 3 risk tiers (`low` / `medium` /
	`high`) a user belongs to, from per-user identity aggregates and
	non-leaky session aggregates. The repo also ships a leakage diagnostic for
	the dataset README's headline use case (threat-actor tier classification).**

	> Read this first. This repo ships two artifacts: (1) a working
	> baseline classifier for `user_risk_tier` (the primary product), and
	> (2) a separate diagnostic file (`leakage_diagnostic.json`)
	> documenting why the README's stated headline use case — 4-class
	> threat-actor tier classification — is not a usable ML task on the
	> sample dataset. Both matter; the diagnostic is required reading for
	> anyone evaluating CYB006 for a threat-detection product.

	## Model overview

	\| Property \| Value \|
	\|---\|---\|
	\| Primary task \| 3-class user_risk_tier classification (`low`/`medium`/`high`) \|
	\| Secondary artifact \| `leakage_diagnostic.json` — audit of threat-actor detection on this sample \|
	\| Training data \| `xpertsystems/cyb006-sample` (200 users × 25 sessions = 5,000 sessions) \|
	\| Models \| XGBoost + PyTorch MLP \|
	\| Input features \| 34 (per-user aggregates + session aggregates + engineered) \|
	\| Split \| Stratified by user_risk_tier (this is a user-level task, n=200) \|
	\| Validation \| Single seed (artifact) + multi-seed aggregate across 10 seeds \|
	\| License \| CC-BY-NC-4.0 (matches dataset) \|
	\| Status \| Reference baseline + structural-leakage diagnostic \|

	## Why this task — and why not threat-actor classification?

	The CYB006 README's first suggested use case is "training **account
	takeover (ATO) detection models**" and its second is "**threat-actor tier
	classification**", described as 4-class with realistic class imbalance. We piloted
	the threat-actor target first and discovered that the sample dataset
	contains structural distributional non-overlap between threat-actor
	and legitimate session populations across at least six independent
	feature groups:

	\| Oracle feature \| Actor range / value \| Non-actor range / value \|
	\|---\|---\|---\|
	\| `velocity_anomaly_score` \| [0.52, 0.82] \| [0.00, 0.25] — zero overlap \|
	\| `session_timestamp_utc` \| [6,417, 1,440,062] \| [1,445,187, 18,000,137] — disjoint windows \|
	\| `credential_attempt_count` \| [1, 59] (mean 12.9) \| [1, 2] (mean 1.07) \|
	\| `login_outcome` \| `success_normal` only occurs for non-actors; `failure_account_locked` / `account_takeover_confirmed` / `session_hijacked` / `success_anomalous` only occur for actors \|
	\| `geo_country_code` \| `KP`, `XX`, `CN`, `BY` appear only for actors \|
	\| `device_trust_level` \| `trusted_managed` / `compliant_enrolled` appear only for non-actors \|

	As a consequence, **plain XGBoost achieves 100% test accuracy on
	threat-actor binary detection (any-actor vs none) across every random
	seed, and stays at 97% accuracy and AUC 0.99 even with all six
	oracle feature groups dropped** (37 columns excluded, 166 → 129 features). This is not a
	useful ML benchmark; it's a property of the synthetic generator. Real
	identity-security telemetry has substantial overlap between threat
	and legitimate behaviour, with state-of-the-art detection systems
	operating at AUC 0.7–0.9, not 1.0.
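
The range rows of the table above can be checked mechanically. The following is a minimal sketch that treats each numeric feature as a closed interval and reports whether the actor and non-actor ranges intersect. The interval values are copied from the table; the helper name `overlap` is illustrative and not part of the published pipeline.

```python
# Closed-interval overlap check for the numeric oracle features.
# Ranges are copied from the table above; `overlap` is a hypothetical helper.

def overlap(a, b):
    """Return the intersection of closed intervals a and b, or None if disjoint."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None

ranges = {
    # feature: (actor range, non-actor range)
    "velocity_anomaly_score": ((0.52, 0.82), (0.00, 0.25)),
    "session_timestamp_utc": ((6_417, 1_440_062), (1_445_187, 18_000_137)),
    "credential_attempt_count": ((1, 59), (1, 2)),
}

report = {name: overlap(actor, other) for name, (actor, other) in ranges.items()}
for name, shared in report.items():
    status = "disjoint" if shared is None else f"overlap on {shared}"
    print(f"{name}: {status}")
```

The first two features are strictly disjoint, so a single threshold on either one separates the classes. `credential_attempt_count` does share the range [1, 2], so it separates by distribution (mean 12.9 vs 1.07) rather than by range alone.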

	The diagnostic finding is documented quantitatively in
	[`leakage_diagnostic.json`](./leakage_diagnostic.json) and summarised
	in the [Leakage diagnostic](#leakage-diagnostic) section below.

	We therefore pivoted to **`user_risk_tier` (3-class user-level
	classification)** as the primary baseline target. This task:

	- Has overlapping per-tier feature distributions — no oracle features
	- Carries modest real signal (accuracy 0.67 and macro ROC-AUC 0.80 on seed 42, against a majority-class accuracy of 0.57)
	- Targets a legitimate use case (the README lists "Insider threat scoring with composite behavioral indicators")
	- Demonstrates honest ML rigor on the dataset

	Two model artifacts are published. They are designed to be used together — disagreement is a useful triage signal:

	- `model_xgb.json` — gradient-boosted trees, primary recommendation
	- `model_mlp.safetensors` — PyTorch MLP in SafeTensors format
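
One way to use disagreement as a triage signal is to compare the top predicted tier of the two models and flag the users where they differ. This is a sketch only: the probability rows are made-up placeholders rather than model output, and the class order in `LABELS` is an assumption (the real mapping is `INT_TO_LABEL` in `feature_engineering.py`).

```python
# Flag users where the two models disagree on the predicted tier.
# The probability rows below are made-up placeholders, not model output.

LABELS = ["low", "medium", "high"]  # assumed class order

def argmax(row):
    """Index of the largest value in a list of probabilities."""
    return max(range(len(row)), key=row.__getitem__)

def triage(xgb_proba, mlp_proba):
    """Return (row index, XGBoost label, MLP label) for every disagreement."""
    flagged = []
    for i, (p, q) in enumerate(zip(xgb_proba, mlp_proba)):
        a, b = argmax(p), argmax(q)
        if a != b:
            flagged.append((i, LABELS[a], LABELS[b]))
    return flagged

xgb_proba = [[0.70, 0.20, 0.10], [0.30, 0.45, 0.25], [0.05, 0.15, 0.80]]
mlp_proba = [[0.60, 0.30, 0.10], [0.50, 0.35, 0.15], [0.10, 0.20, 0.70]]
flagged = triage(xgb_proba, mlp_proba)
print(flagged)
```

Only the second user is flagged here, because XGBoost picks `medium` while the MLP picks `low`.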

	## Quick start

	```bash
	pip install xgboost torch safetensors pandas huggingface_hub
	```

	```python
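	import os, sys
	import numpy as np, pandas as pd, xgboost as xgb
	# torch and safetensors are only needed for the MLP weights
	import torch
	from safetensors.torch import load_file
	from huggingface_hub import hf_hub_download

	REPO = "xpertsystems/cyb006-baseline-classifier"
	DATA = "xpertsystems/cyb006-sample"

	# Download the model artifacts and the feature pipeline from the model repo
	paths = {n: hf_hub_download(REPO, n) for n in [
	    "model_xgb.json", "model_mlp.safetensors",
	    "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
	]}

	# Make the bundled feature_engineering.py importable
	sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
	from feature_engineering import (
	    transform_single, load_meta, INT_TO_LABEL,
	    compute_session_aggregates_for_user,
	)

	meta = load_meta(paths["feature_meta.json"])
	xgb_model = xgb.XGBClassifier()
	xgb_model.load_model(paths["model_xgb.json"])

	# Assumption: both CSVs sit at the root of the dataset repo and share a `user_id` column
	users = pd.read_csv(hf_hub_download(DATA, "user_risk_summary.csv", repo_type="dataset"))
	sessions = pd.read_csv(hf_hub_download(DATA, "login_sessions.csv", repo_type="dataset"))
	user_summary_row = users.iloc[0]
	user_sessions = sessions[sessions["user_id"] == user_summary_row["user_id"]]

	# Compose a per-user record from user_risk_summary row + session aggregates
	user_record = user_summary_row.to_dict()
	user_record.update(compute_session_aggregates_for_user(user_sessions))

	# Encode the record and print the most likely tier
	X = transform_single(user_record, meta)
	proba = xgb_model.predict_proba(X)[0]
	print(INT_TO_LABEL[int(np.argmax(proba))])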
	```

	See [`inference_example.ipynb`](./inference_example.ipynb) for the full
	copy-paste demo.

	## Training data

	Trained on the public sample of CYB006: 200 per-user rows from
	`user_risk_summary.csv` enriched with per-user session aggregates
	computed from `login_sessions.csv`:

	\| Tier \| Users \| Class share \|
	\|---\|---:\|---:\|
	\| `low` \| 114 \| 57% \|
	\| `medium` \| 47 \| 23.5% \|
	\| `high` \| 39 \| 19.5% \|

	The CYB006 README claims a 4-tier scheme (`low`/`medium`/`high`/`critical`).
	The sample data contains only 3 — there is no `critical` tier present.

	### Stratified split

	This is a user-level task (one row per user, 200 users total).
	Group-aware splitting does not apply since there is no
	many-rows-per-group structure to leak. We use
	StratifiedShuffleSplit (nested 70/15/15) to preserve the 3-tier
	class distribution across folds:

	\| Fold \| Users \|
	\|---\|---:\|
	\| Train \| 139 \|
	\| Validation \| 31 \|
	\| Test \| 30 \|
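
The fold sizes above follow from drawing each split proportionally from every tier. The sketch below is a dependency-free stand-in for scikit-learn's `StratifiedShuffleSplit`, not the same algorithm, so it will not reproduce the published user assignment. The helper `stratified_take` and the integer fold sizes (30 test, then 31 validation) are assumptions chosen to match the table.

```python
import random
from collections import defaultdict

def stratified_take(indices, labels, n_take, rng):
    """Split `indices` into (taken, rest), drawing `n_take` items with
    per-class counts proportional to class size (largest-remainder rounding)."""
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    total = len(indices)
    quotas = {c: len(m) * n_take / total for c, m in by_class.items()}
    counts = {c: int(q) for c, q in quotas.items()}
    # Hand leftover slots to the classes with the largest fractional part.
    leftover = n_take - sum(counts.values())
    ranked = sorted(quotas, key=lambda c: quotas[c] - counts[c], reverse=True)
    for c in ranked[:leftover]:
        counts[c] += 1
    taken, rest = [], []
    for c, members in by_class.items():
        rng.shuffle(members)
        taken += members[:counts[c]]
        rest += members[counts[c]:]
    return taken, rest

# Tier counts from the "Training data" table; one label per user.
labels = ["low"] * 114 + ["medium"] * 47 + ["high"] * 39
rng = random.Random(42)
test, remaining = stratified_take(range(200), labels, 30, rng)
val, train = stratified_take(remaining, labels, 31, rng)
print(len(train), len(val), len(test))  # 139 31 30
```

Carving the test fold out first and the validation fold from the remainder is what "nested" refers to here; every fold ends up containing all three tiers.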

	Class imbalance is addressed with balanced class weights (passed to
	XGBoost as per-row `sample_weight`) and weighted cross-entropy (MLP).
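
For reference, the usual "balanced" heuristic sets the weight of class c to N / (K · n_c), where N is the number of rows, K the number of classes and n_c the class count. The sketch below applies it to the full-sample tier counts purely for illustration; the published weights would presumably be computed on the 139-user training fold, whose exact counts are not stated here.

```python
# Balanced class weights: w_c = N / (K * n_c), the heuristic scikit-learn
# uses for class_weight="balanced". Counts are the full-sample tier counts
# (an illustrative assumption, not the training-fold counts).
counts = {"low": 114, "medium": 47, "high": 39}
n_total, n_classes = sum(counts.values()), len(counts)
class_weight = {c: n_total / (n_classes * n) for c, n in counts.items()}

# One weight per training row, the form XGBoost accepts as sample_weight.
y = ["low", "high", "medium", "low"]  # toy labels
sample_weight = [class_weight[label] for label in y]

for c, w in class_weight.items():
    print(f"{c}: {w:.3f}")
```

With these weights each tier contributes the same total weight (N / K) to the loss, which is why the minority `high` tier gets roughly three times the per-row weight of `low`.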

	## Feature pipeline

	The bundled `feature_engineering.py` is the canonical feature recipe.
	34 features survive after encoding (14 numeric + 6 one-hot + 8 session + 6 engineered), drawn from:

	- Per-user numeric (14, from `user_risk_summary.csv`): `total_login_attempts`, `successful_logins`, `failed_logins`, `mfa_failures`, `impossible_travel_events`, `lateral_hop_count`, `privilege_escalations`, `account_lockout_count`, `geo_dispersion_score`, `login_velocity_score`, `session_anomaly_rate`, `ueba_alert_count`, `overall_identity_risk_score`, `insider_threat_indicator_score`
	- Per-user categorical (1, one-hot): `peak_privilege_level_accessed` (6 values)
	- Session aggregates (8, derived from `login_sessions.csv`): `avg_session_duration_seconds`, `avg_mfa_response_latency_ms`, `avg_geo_anomaly_score`, `max_geo_anomaly_score`, `frac_impossible_travel`, `n_unique_countries`, `n_unique_devices`, `n_unique_applications`
	- Engineered (6): `failed_login_rate`, `mfa_failure_rate`, `ueba_alerts_per_session`, `hops_per_escalation`, `geo_velocity_composite`, `composite_anomaly_score`

	### Leakage exclusions

	Three columns from `user_risk_summary.csv` are dropped to avoid contamination:
	- `threat_actor_flag`: only high-tier users can be threat actors, so a set flag reveals `tier='high'` outright
	- `account_takeover_flag`: 2 positive cases out of 200 (1%); too sparse and oracle-prone
	- `credential_attack_victim_flag`: 1 positive case out of 200 (0.5%); same issue

	Four columns from `login_sessions.csv` are NOT aggregated into session
	features because they exhibited the structural non-overlap documented
	in [Leakage diagnostic](#leakage-diagnostic):
	- `velocity_anomaly_score`, `session_timestamp_utc`, `credential_attempt_count`, `login_outcome`

	## Evaluation

	### Test-set metrics, seed 42 (n = 30 disjoint users)

	XGBoost (the published `model_xgb.json` artifact)

	\| Metric \| Value \|
	\|---\|---:\|
	\| Macro ROC-AUC (OvR) \| 0.8017 \|
	\| Accuracy \| 0.6667 \|
	\| Macro-F1 \| 0.6454 \|
	\| Weighted-F1 \| 0.6606 \|

	MLP (the published `model_mlp.safetensors` artifact)

	\| Metric \| Value \|
	\|---\|---:\|
	\| Macro ROC-AUC (OvR) \| 0.6974 \|
	\| Accuracy \| 0.6000 \|
	\| Macro-F1 \| 0.5914 \|
	\| Weighted-F1 \| 0.6068 \|

	### Multi-seed robustness (XGBoost, 10 seeds)

	\| Metric \| Mean \| Std \| Min \| Max \|
	\|---\|---:\|---:\|---:\|---:\|
	\| Accuracy \| 0.700 \| 0.082 \| 0.533 \| 0.867 \|
	\| Macro-F1 \| 0.638 \| 0.093 \| 0.445 \| 0.814 \|
	\| Macro ROC-AUC OvR \| 0.812 \| 0.048 \| 0.738 \| 0.877 \|

	Full per-seed results in [`multi_seed_results.json`](./multi_seed_results.json).
	With only 30 test users per seed, single-seed accuracy varies materially
	(0.53–0.87 across seeds). **ROC-AUC 0.812 ± 0.048 is the more reliable
	performance estimate.** All 10 seeds yield all 3 tiers in the test
	fold thanks to stratification.
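
A quick back-of-the-envelope check shows that the seed-to-seed spread in accuracy is about what sampling noise alone would produce with 30 test users. The sketch assumes test users are independent and uses a normal approximation for the interval; both are simplifications.

```python
import math

n_test = 30        # test users per seed
mean_acc = 0.700   # multi-seed mean accuracy from the table

# Binomial standard error of an accuracy estimate on n_test users.
se = math.sqrt(mean_acc * (1 - mean_acc) / n_test)

# Rough 95% interval under a normal approximation.
low, high = mean_acc - 1.96 * se, mean_acc + 1.96 * se

print(f"binomial SE  = {se:.3f}")               # 0.084
print(f"approx. 95%  = [{low:.3f}, {high:.3f}]")  # [0.536, 0.864]
```

The computed standard error (0.084) is close to the observed standard deviation (0.082), and the approximate interval is close to the observed minimum and maximum (0.533 and 0.867).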

	### Per-class F1 (seed 42)

	\| Tier \| Class share \| XGBoost F1 \| MLP F1 \|
	\|---\|---:\|---:\|---:\|
	\| `low` \| 57% \| 0.727 \| 0.647 \|
	\| `medium` \| 23.5% \| 0.286 \| 0.400 \|
	\| `high` \| 19.5% \| 0.923 \| 0.727 \|

	The model performs best on `high` (the most behaviourally distinct
	tier — high failed-login rates, frequent impossible travel, elevated
	anomaly scores) and `low` (the majority class). The `medium` tier is
	hardest, which is the expected behaviour for a 3-tier ordinal task —
	mid-class samples sit between two boundaries and pick up confusion
	from both sides.
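
Macro-F1 is the unweighted mean of the per-class F1 scores, so the headline figures can be recovered from the table above up to rounding. A small check, using only the values printed in this card:

```python
# Per-class F1 values copied from the table above (rounded to 3 decimals).
per_class_f1 = {
    "xgboost": {"low": 0.727, "medium": 0.286, "high": 0.923},
    "mlp": {"low": 0.647, "medium": 0.400, "high": 0.727},
}

# Macro-F1 gives every tier equal weight regardless of class share.
macro = {model: sum(f1.values()) / len(f1) for model, f1 in per_class_f1.items()}

for model, value in macro.items():
    print(f"{model}: macro-F1 = {value:.3f}")
```

Because every tier counts equally, the weak `medium` score pulls XGBoost's macro-F1 well below its `high` and `low` scores.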

	### Ablation: which feature groups matter

	\| Configuration \| Accuracy \| Macro-F1 \| Δ accuracy \|
	\|---\|---:\|---:\|---:\|
	\| Full feature set (published) \| 0.6667 \| 0.6454 \| — \|
	\| No user aggregates (count features) \| 0.5333 \| 0.4586 \| −0.1333 \|
	\| No risk scores \| 0.5667 \| 0.5300 \| −0.1000 \|
	\| No engineered features \| 0.5667 \| 0.5444 \| −0.1000 \|
	\| No session aggregates \| 0.7000 \| 0.6130 \| +0.0333 \|

	Findings:

	1. User-level count features matter most (failed logins, lateral
	   hops, MFA failures). Dropping them costs 13 pp accuracy.
	2. Risk scores and engineered features each contribute ~10 pp.
	   With only 139 training users, the trees can't fully recover
	   engineered composites from raw inputs.
	3. Session aggregates slightly hurt accuracy in seed 42: dropping
	   them gains 3 pp (one test user out of 30), although macro-F1
	   falls from 0.645 to 0.613. With a dataset this small (n=200),
	   extra low-signal features can crowd out the informative ones;
	   the trees do better with fewer, information-dense signals.
	   Session aggregates are kept in the published pipeline because
	   they help on most other seeds.
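
Since the test fold has 30 users, each accuracy delta in the ablation table corresponds to a whole number of users. The conversion below uses only the accuracies from the table:

```python
n_test = 30
full_acc = 0.6667
ablations = {
    "no user aggregates": 0.5333,
    "no risk scores": 0.5667,
    "no engineered features": 0.5667,
    "no session aggregates": 0.7000,
}

# Convert accuracies to counts of correctly classified test users.
correct_full = round(full_acc * n_test)
delta_users = {
    name: round(acc * n_test) - correct_full for name, acc in ablations.items()
}

print(f"full feature set: {correct_full}/{n_test} correct")
for name, delta in delta_users.items():
    print(f"{name}: {delta:+d} users")
```

The deltas are 4, 3, 3 and 1 users respectively, so the single-seed ablation ranks feature groups only coarsely.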

	### Architecture

	XGBoost: multi-class gradient boosting (`multi:softprob`, 3 classes),
	`hist` tree method, class-balanced sample weights, early stopping on
	validation mlogloss.

	MLP: `34 → 128 → 64 → 3`, each hidden layer followed by `BatchNorm1d`
	→ `ReLU` → `Dropout(0.3)`, weighted cross-entropy loss, AdamW optimizer,
	early stopping on validation macro-F1.
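
The MLP description above maps onto a small `torch.nn.Sequential`. This is a sketch under assumptions: the layer order follows the text, but the module and parameter names are not known, so the published `model_mlp.safetensors` state dict may not load into it without renaming keys. `build_mlp` is a hypothetical helper.

```python
import torch
from torch import nn

def build_mlp(n_features=34, n_classes=3, p_drop=0.3):
    """34 -> 128 -> 64 -> 3, each hidden layer followed by BatchNorm1d, ReLU, Dropout."""
    return nn.Sequential(
        nn.Linear(n_features, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(p_drop),
        nn.Linear(128, 64), nn.BatchNorm1d(64), nn.ReLU(), nn.Dropout(p_drop),
        nn.Linear(64, n_classes),
    )

model = build_mlp()
n_params = sum(p.numel() for p in model.parameters())

# Inference mode: BatchNorm uses running statistics and Dropout is disabled.
model.eval()
with torch.no_grad():
    logits = model(torch.zeros(2, 34))   # two dummy, already-scaled users
    probs = torch.softmax(logits, dim=1)

print(n_params, tuple(probs.shape))
```

Inputs are expected to be standardised first (the repo ships `feature_scaler.json` for that purpose); the zeros above are placeholders, not real feature vectors.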

	Training hyperparameters are held internally by XpertSystems.

	## Leakage diagnostic

	This is the most important section of the model card. The full
	diagnostic is in [`leakage_diagnostic.json`](./leakage_diagnostic.json).
	Summary:

	Setup: Train an XGBoost binary classifier to predict
	`threat_actor_capability_tier != 'none'` from per-session features.
	Use group-aware split by `user_id` (15% test = 30 disjoint users).
	Cumulatively drop suspected oracle feature groups and re-evaluate.

	\| Configuration \| n_features \| Accuracy \| ROC-AUC \|
	\|---\|---:\|---:\|---:\|
	\| Full feature set \| 166 \| 1.0000 \| 1.0000 \|
	\| − behavioural oracles (velocity, timestamp, credential count) \| 163 \| 0.9991 \| 1.0000 \|
	\| − login_outcome \| 154 \| 0.9982 \| 1.0000 \|
	\| − geo_country_code \| 138 \| 0.9987 \| 1.0000 \|
	\| − device_trust_level \| 133 \| 0.9982 \| 0.9999 \|
	\| − user_risk_tier \| 130 \| 0.9978 \| 0.9996 \|
	\| − geo_anomaly_score \| 129 \| 0.9707 \| 0.9897 \|

	**Even after dropping six oracle feature groups (37 columns), the
	model still achieves 97% test accuracy and AUC 0.99.** The leakage
	is not localised to a few suspect features; it is distributed across
	the entire feature space because the synthetic generator produces
	threat-actor sessions that are anomalous on every dimension
	simultaneously, with no overlap into legitimate behaviour.
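
The column counts in the table can be turned into per-step drops to confirm the 37-column total. The feature counts are copied from the table; nothing else is assumed.

```python
# (configuration, n_features) rows copied from the diagnostic table.
steps = [
    ("full feature set", 166),
    ("behavioural oracles", 163),
    ("login_outcome", 154),
    ("geo_country_code", 138),
    ("device_trust_level", 133),
    ("user_risk_tier", 130),
    ("geo_anomaly_score", 129),
]

# Columns removed at each cumulative step.
dropped = {
    name: prev - cur for (_, prev), (name, cur) in zip(steps, steps[1:])
}
total_dropped = sum(dropped.values())

for name, n in dropped.items():
    print(f"{name}: {n} columns")
print(f"total: {total_dropped} columns")
```

The 9 columns removed for `login_outcome` and the 3 for `user_risk_tier` are consistent with one-hot encoding of their 9 and 3 observed values.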

	### Recommendation to dataset author

	For threat-actor detection to be a useful ML benchmark on this
	dataset, the next generator version should introduce **distributional
	overlap** between threat-actor and legitimate session populations
	across all anomaly indicators:

	- `velocity_anomaly_score`: extend non-actor distribution into [0.0, 0.5] and shrink actor to [0.3, 0.9] for substantial overlap in [0.3, 0.5]
	- `session_timestamp_utc`: interleave threat-actor and legitimate sessions across the same time window
	- `credential_attempt_count`: allow some non-actor users to exhibit elevated counts (mistyped passwords, MFA fatigue)
	- `login_outcome`: allow `failure_account_locked` and `success_anomalous` for some legitimate sessions
	- `geo_country_code`: include a baseline frequency of risky-country logins among legitimate users (business travel, contractors)
	- `device_trust_level`: allow threat actors to occasionally use compliant devices (token theft scenarios)
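
To get a feel for what the suggested `velocity_anomaly_score` ranges would do, the sketch below estimates the single-feature ROC-AUC for the current and the proposed ranges. It assumes scores are uniformly distributed within each range, which is an illustrative assumption only; the generator's actual distributions are not documented here.

```python
import random

def auc(pos, neg):
    """Probability that a random positive scores above a random negative
    (ties count half); this equals ROC-AUC for a single score."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

rng = random.Random(0)
n = 1000

# Current sample: actor [0.52, 0.82] vs non-actor [0.00, 0.25] (disjoint).
current = auc([rng.uniform(0.52, 0.82) for _ in range(n)],
              [rng.uniform(0.00, 0.25) for _ in range(n)])

# Proposed: actor [0.3, 0.9] vs non-actor [0.0, 0.5] (overlap on [0.3, 0.5]).
proposed = auc([rng.uniform(0.3, 0.9) for _ in range(n)],
               [rng.uniform(0.0, 0.5) for _ in range(n)])

print(f"current:  {current:.3f}")
print(f"proposed: {proposed:.3f}")
```

Under the uniform assumption the proposed ranges give a single-feature AUC of about 0.93 (analytically 1 − 0.2² / (2 · 0.6 · 0.5)), compared with exactly 1.0 today. Where a full model lands would depend on how much overlap each of the other features carries and how correlated they are.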

	Target operating regime: real-world detection AUC 0.7–0.9, not 1.0.

	### What this means for buyers

	If you're evaluating CYB006 for a threat-detection product, you should
	know that:

	- **The sample dataset cannot be used to honestly benchmark threat-actor
	detection models.** Even plain XGBoost scores 100% on it,
	which doesn't differentiate good detection systems from bad ones.
	- **The user-risk-tier task shipped in this baseline is a legitimate
	ML benchmark on the sample data.** It generalises modestly (AUC 0.81)
	and is the right starting point for evaluating insider-threat
	scoring on the sample.
	- **The full ~1.1M-row CYB006 product may or may not have the same
	structural property.** Confirm with XpertSystems before committing
	to a threat-detection use case.

	## Limitations

	This is a baseline reference, not a production identity-security system.

	1. Small held-out test fold (n=30). With only 30 test users per
	seed, single-seed metrics swing 0.53–0.87 in accuracy. The
	multi-seed ROC-AUC of 0.81 ± 0.05 is the reliable estimate. The
	full ~1.1M-row product would tighten the confidence interval
	substantially.

	2. The `medium` tier is harder than the others. F1 0.29 on
	`medium` (vs 0.92 on `high`) is expected — ordinal middle classes
	are typically the hardest under a flat-classification setup.

	3. MLP weaker than XGBoost. AUC 0.70 vs 0.80. With only 139
	training users, the MLP cannot match boosted trees on tabular data.

	4. Threat-actor detection task is not usable on this sample.
	See [Leakage diagnostic](#leakage-diagnostic) above.

	5. Synthetic-vs-real transfer. The dataset is synthetic and
	calibrated to identity-security benchmarks (Microsoft Digital
	Defense Report, Okta Customer Identity Trends, Verizon DBIR, CISA
	Joint Advisories, Mandiant M-Trends, MITRE ATT&CK Evaluations).
	Real identity telemetry has different noise characteristics; do
	not assume metrics transfer.

	6. 3 tiers, not 4. README lists `low`/`medium`/`high`/`critical`
	but the data contains only 3. If you need 4-class support, wait
	for a regenerated sample.

	## Notes on dataset schema

	The CYB006 sample dataset README describes some fields differently
	from the actual schema. The model was trained on the actual schema;
	this note helps buyers reconcile what they read with what they receive.

	\| What the README says \| What the data actually contains \|
	\|---\|---\|
	\| `session_phase` has 6 values \| All 5,000 rows have `session_phase = session_termination` — the field is constant. There is no usable session-phase target. \|
	\| `login_outcome` has 4 values (`success / failed / mfa_required / blocked`) \| 9 values: `success_normal`, `failure_bad_password`, `failure_account_locked`, `failure_mfa_rejected`, `failure_device_untrusted`, `failure_geo_blocked`, `success_anomalous`, `account_takeover_confirmed`, `session_hijacked` \|
	\| 4 actor tiers \| 5 values: 4 tier labels + `none` (92% of rows have `none`) \|
	\| `mfa_challenge_type` has 5 values \| 7: adds `authenticator_app`, `hardware_token`, `voice_call` \|
	\| `authentication_method` has 4 values \| 5: no `api_key`; adds `password_plus_mfa`, `phishing_resistant_fido2` \|
	\| `user_risk_tier` has 4 values (`low/medium/high/critical`) \| 3 values: no `critical` \|
	\| `session_timestamp_utc` is an ISO timestamp string \| It is an integer \|
	\| `user_risk_summary.csv` columns listed \| Adds `peak_privilege_level_accessed`, `credential_attack_victim_flag` (not in README) \|

	None of these affects model correctness — the feature pipeline uses
	the actual column names. If you build your own pipeline against the
	dataset, use the actual columns.
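
When building a pipeline against the sample, a small guard can catch code that was written against the README's schema instead of the actual one. The sketch below checks a few of the discrepancies from the table; `check_schema` and the toy rows are hypothetical, and the rows are shown as plain dicts rather than loaded from the CSV files.

```python
# Allowed values as listed in the table above.
EXPECTED_VALUES = {
    "login_outcome": {
        "success_normal", "failure_bad_password", "failure_account_locked",
        "failure_mfa_rejected", "failure_device_untrusted", "failure_geo_blocked",
        "success_anomalous", "account_takeover_confirmed", "session_hijacked",
    },
    "user_risk_tier": {"low", "medium", "high"},
}

def check_schema(rows):
    """Return a list of human-readable problems found in `rows` (list of dicts)."""
    problems = []
    for column, allowed in EXPECTED_VALUES.items():
        unexpected = {row[column] for row in rows} - allowed
        if unexpected:
            problems.append(f"{column}: unexpected values {sorted(unexpected)}")
    # The actual data stores the timestamp as an integer, not an ISO string.
    if any(not isinstance(row["session_timestamp_utc"], int) for row in rows):
        problems.append("session_timestamp_utc: expected integers")
    return problems

# Toy rows: one matching the actual schema, one written to the README's schema.
good = [{"login_outcome": "success_normal", "user_risk_tier": "low",
         "session_timestamp_utc": 1_445_187}]
bad = [{"login_outcome": "blocked", "user_risk_tier": "critical",
        "session_timestamp_utc": "2026-01-01T00:00:00Z"}]

print(check_schema(good))
print(check_schema(bad))
```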

	## Intended use

	- Evaluating fit of the CYB006 dataset for your insider-threat
	or user-risk-scoring research
	- Baseline reference for new model architectures
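	- Reference example of structural-leakage diagnostics in synthetic
	cybersecurity datasets: the cumulative feature-drop methodology
	summarised under [Leakage diagnostic](#leakage-diagnostic) is reusable
	(the `train_classifier.py` script that implements it is not among the
	files in this repo)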
	- Feature engineering reference for per-user identity aggregates

	## Out-of-scope use

	- Production identity-security detection on real telemetry
	- Threat-actor attribution (this baseline does not address that task; see why above)
	- Any operational security or law-enforcement decision

	## Reproducibility

	Outputs above were produced with `seed = 42` (published artifact),
	nested `StratifiedShuffleSplit` (70/15/15 by user_risk_tier), on the
	published sample (`xpertsystems/cyb006-sample`, version 1.0.0,
	generated 2026-05-16). The feature pipeline in `feature_engineering.py`
	is deterministic and the trained weights in this repo correspond
	exactly to the metrics above.

	Multi-seed results (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123, 200) in
	`multi_seed_results.json` show how the metrics vary across splits
	(accuracy 0.53–0.87, macro ROC-AUC 0.74–0.88).

	The training script itself is private to XpertSystems.

	## Files in this repo

	\| File \| Purpose \|
	\|---\|---\|
	\| `model_xgb.json` \| XGBoost weights (seed 42) \|
	\| `model_mlp.safetensors` \| PyTorch MLP weights (seed 42) \|
	\| `feature_engineering.py` \| Feature pipeline \|
	\| `feature_meta.json` \| Feature column order + categorical levels \|
	\| `feature_scaler.json` \| MLP input mean/std (XGBoost ignores) \|
	\| `validation_results.json` \| Per-class metrics, confusion matrix, architecture \|
	\| `ablation_results.json` \| Per-feature-group ablation \|
	\| `multi_seed_results.json` \| XGBoost metrics across 10 seeds \|
	\| `leakage_diagnostic.json` \| Structural-leakage audit on threat-actor detection \|
	\| `inference_example.ipynb` \| End-to-end inference demo notebook \|
	\| `README.md` \| This file \|

	## Contact and full product

	The full CYB006 dataset contains ~1.1 million rows across four
	files, with 12 calibrated benchmark validation tests drawn from
	authoritative identity security and threat intelligence sources
	(Microsoft Digital Defense Report, Okta Customer Identity Trends,
	Verizon DBIR, CISA Joint Advisories, Mandiant M-Trends, MITRE ATT&CK
	Evaluations). The full XpertSystems.ai synthetic data catalogue spans
	41 SKUs across Cybersecurity, Healthcare, Insurance & Risk, Oil & Gas,
	and Materials & Energy.

	- 📧 pradeep@xpertsystems.ai
	- 🌐 https://xpertsystems.ai
	- 🗂 Dataset: https://huggingface.co/datasets/xpertsystems/cyb006-sample
	- 🤖 Companion models:
	  - https://huggingface.co/xpertsystems/cyb001-baseline-classifier (network traffic)
	  - https://huggingface.co/xpertsystems/cyb002-baseline-classifier (ATT&CK kill-chain)
	  - https://huggingface.co/xpertsystems/cyb003-baseline-classifier (malware execution phase)
	  - https://huggingface.co/xpertsystems/cyb004-baseline-classifier (phishing campaign phase)
	  - https://huggingface.co/xpertsystems/cyb005-baseline-classifier (ransomware actor-tier attribution)

	## Citation

	```bibtex
	@misc{xpertsystems_cyb006_baseline_2026,
	  title  = {CYB006 Baseline Classifier: XGBoost and MLP for User Risk Tier Classification, with Structural-Leakage Diagnostic on Threat-Actor Detection},
	  author = {XpertSystems.ai},
	  year   = {2026},
	  url    = {https://huggingface.co/xpertsystems/cyb006-baseline-classifier},
	  note   = {Baseline reference model trained on xpertsystems/cyb006-sample}
	}
	```