Initial release: XGBoost + MLP for user-risk-tier classification, plus structural-leakage diagnostic on threat-actor detection


Files changed (11)

README.md +503 -0
ablation_results.json +209 -0
feature_engineering.py +374 -0
feature_meta.json +93 -0
feature_scaler.json +1 -0
inference_example.ipynb +322 -0
leakage_diagnostic.json +145 -0
model_mlp.safetensors +3 -0
model_xgb.json +0 -0
multi_seed_results.json +98 -0
validation_results.json +126 -0

README.md ADDED

	@@ -0,0 +1,503 @@

+---
+license: cc-by-nc-4.0
+library_name: pytorch
+tags:
+  - cybersecurity
+  - identity-security
+  - insider-threat
+  - ueba
+  - user-risk-scoring
+  - tabular-classification
+  - synthetic-data
+  - xgboost
+  - baseline
+  - leakage-diagnostic
+pipeline_tag: tabular-classification
+base_model: []
+datasets:
+  - xpertsystems/cyb006-sample
+metrics:
+  - accuracy
+  - f1
+  - roc_auc
+model-index:
+  - name: cyb006-baseline-classifier
+    results:
+      - task:
+          type: tabular-classification
+          name: 3-class user risk tier classification
+        dataset:
+          type: xpertsystems/cyb006-sample
+          name: CYB006 Synthetic Login Activity Dataset (Sample)
+        metrics:
+          - type: roc_auc
+            value: 0.8017
+            name: Test macro ROC-AUC OvR (XGBoost, seed 42)
+          - type: accuracy
+            value: 0.6667
+            name: Test accuracy (XGBoost, seed 42)
+          - type: f1
+            value: 0.6454
+            name: Test macro-F1 (XGBoost, seed 42)
+          - type: accuracy
+            value: 0.700
+            name: Multi-seed accuracy mean ± 0.082 (XGBoost, 10 seeds)
+          - type: roc_auc
+            value: 0.812
+            name: Multi-seed ROC-AUC mean ± 0.048 (XGBoost, 10 seeds)
+          - type: roc_auc
+            value: 0.6974
+            name: Test macro ROC-AUC OvR (MLP, seed 42)
+          - type: accuracy
+            value: 0.6000
+            name: Test accuracy (MLP, seed 42)
+          - type: f1
+            value: 0.5914
+            name: Test macro-F1 (MLP, seed 42)
+---
+# CYB006 Baseline Classifier
+**User-risk-tier classifier trained on the CYB006 synthetic login
+activity sample. Predicts which of 3 risk tiers (`low` / `medium` /
+`high`) a user belongs to, from per-user identity aggregates and
+non-leaky session aggregates. Also ships a leakage diagnostic for the
+CYB006 dataset README's stated headline use case (threat-actor tier
+classification).**
+> **Read this first.** This repo ships two artifacts: (1) a working
+> baseline classifier for `user_risk_tier` (the primary product), and
+> (2) a separate diagnostic file (`leakage_diagnostic.json`)
+> documenting why the CYB006 dataset README's stated headline use case
+> (4-class threat-actor tier classification) is not a usable ML task on the
+> sample dataset. Both matter; the diagnostic is required reading for
+> anyone evaluating CYB006 for a threat-detection product.
+## Model overview
+| Property | Value |
+|---|---|
+| Primary task | 3-class user_risk_tier classification (`low`/`medium`/`high`) |
+| Secondary artifact | `leakage_diagnostic.json` — audit of threat-actor detection on this sample |
+| Training data | `xpertsystems/cyb006-sample` (200 users × 25 sessions = 5,000 sessions) |
+| Models | XGBoost + PyTorch MLP |
+| Input features | 34 (per-user aggregates + session aggregates + engineered) |
+| Split | **Stratified by user_risk_tier** (this is a user-level task, n=200) |
+| Validation | Single seed (artifact) + multi-seed aggregate across 10 seeds |
+| License | CC-BY-NC-4.0 (matches dataset) |
+| Status | Reference baseline + structural-leakage diagnostic |
+## Why this task — and why not threat-actor classification?
+The CYB006 README's first suggested use case is "training **account
+takeover (ATO) detection** models" and second is "**threat-actor tier
+classification** — 4-class with realistic class imbalance". We piloted
+the threat-actor target first and discovered that the sample dataset
+contains **structural distributional non-overlap** between threat-actor
+and legitimate session populations across at least six independent
+feature groups:
+| Oracle feature | Actor range / value | Non-actor range / value |
+|---|---|---|
+| `velocity_anomaly_score` | [0.52, 0.82] | [0.00, 0.25] — **zero overlap** |
+| `session_timestamp_utc` | [6,417, 1,440,062] | [1,445,187, 18,000,137] — **disjoint windows** |
+| `credential_attempt_count` | [1, 59] (mean 12.9) | [1, 2] (mean 1.07) |
+| `login_outcome` | `failure_account_locked`, `account_takeover_confirmed`, `session_hijacked`, `success_anomalous` occur only for actors | `success_normal` occurs only for non-actors |
+| `geo_country_code` | `KP`, `XX`, `CN`, `BY` appear only for actors | never `KP`, `XX`, `CN`, or `BY` |
+| `device_trust_level` | never `trusted_managed` or `compliant_enrolled` | `trusted_managed`, `compliant_enrolled` appear only for non-actors |
+As a consequence, **plain XGBoost achieves 100% test accuracy on
+threat-actor binary detection (any-actor vs none) across every random
+seed**, and stays at **97% accuracy and AUC 0.99 even with all six
+oracle feature groups dropped** (37 columns excluded, 166 → 129
+features). This is not a
+useful ML benchmark; it's a property of the synthetic generator. Real
+identity-security telemetry has substantial overlap between threat
+and legitimate behaviour, with state-of-the-art detection systems
+operating at AUC 0.7–0.9, not 1.0.
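The non-overlap claim for the numeric columns can be checked directly from the ranges in the table above. The following is a minimal sketch, not the code behind `leakage_diagnostic.json`; the helper name `overlap` and the dictionary layout are illustrative assumptions, and only the three numeric rows of the table are included.

```python
# Ranges copied from the oracle-feature table: (actor_range, non_actor_range).
RANGES = {
    "velocity_anomaly_score": ((0.52, 0.82), (0.00, 0.25)),
    "session_timestamp_utc": ((6_417, 1_440_062), (1_445_187, 18_000_137)),
    "credential_attempt_count": ((1, 59), (1, 2)),
}


def overlap(a, b):
    """Return the closed interval shared by a and b, or None if they are disjoint."""
    lo = max(a[0], b[0])
    hi = min(a[1], b[1])
    return (lo, hi) if lo <= hi else None


# Hypothetical report structure: feature name -> shared interval (or None).
report = {name: overlap(actor, non_actor) for name, (actor, non_actor) in RANGES.items()}

for name, shared in report.items():
    if shared is None:
        print(f"{name}: disjoint ranges, a single threshold separates the classes")
    else:
        print(f"{name}: ranges share {shared}")
```

On these numbers the first two columns are fully disjoint, while `credential_attempt_count` shares the interval [1, 2], so that column separates the populations only through its upper tail and its mean.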
+The diagnostic finding is documented quantitatively in
+[`leakage_diagnostic.json`](./leakage_diagnostic.json) and summarised
+in the [Leakage diagnostic](#leakage-diagnostic) section below.
+We therefore pivoted to **`user_risk_tier` (3-class user-level
+classification)** as the primary baseline target. This task:
+- Has **overlapping per-tier feature distributions** — no oracle features
+- Carries **modest real signal** (seed-42 test accuracy 0.67 and macro ROC-AUC 0.80, against a majority-class accuracy of 0.57)
+- Targets a legitimate use case (the README lists "Insider threat scoring with composite behavioral indicators")
+- Demonstrates honest ML rigor on the dataset
+Two model artifacts are published. They are designed to be used together — disagreement is a useful triage signal:
+- `model_xgb.json` — gradient-boosted trees, primary recommendation
+- `model_mlp.safetensors` — PyTorch MLP in SafeTensors format
+## Quick start
+```bash
+pip install xgboost torch safetensors pandas huggingface_hub
+```
+```python
+from huggingface_hub import hf_hub_download
+import json, numpy as np, torch, xgboost as xgb
+from safetensors.torch import load_file
+REPO = "xpertsystems/cyb006-baseline-classifier"
+paths = {n: hf_hub_download(REPO, n) for n in [
+    "model_xgb.json", "model_mlp.safetensors",
+    "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
+]}
+import sys, os
+sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
+from feature_engineering import (
+    transform_single, load_meta, INT_TO_LABEL,
+    compute_session_aggregates_for_user
+)
+meta = load_meta(paths["feature_meta.json"])
+xgb_model = xgb.XGBClassifier(); xgb_model.load_model(paths["model_xgb.json"])
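+# Compose a per-user record from a user_risk_summary row + session aggregates.
+# `user_summary_row` (one row of user_risk_summary.csv, e.g. a pandas Series)
+# and `user_sessions` (that user's rows of login_sessions.csv) are not defined
+# in this snippet; load them from the dataset first.
+user_record = user_summary_row.to_dict()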
+user_record.update(compute_session_aggregates_for_user(user_sessions))
+X = transform_single(user_record, meta)
+proba = xgb_model.predict_proba(X)[0]
+print(INT_TO_LABEL[int(np.argmax(proba))])
+```
+See [`inference_example.ipynb`](./inference_example.ipynb) for the full
+copy-paste demo.
+## Training data
+Trained on the public sample of CYB006, 200 per-user rows from
+`user_risk_summary.csv` enriched with per-user session aggregates
+computed from `login_sessions.csv`:
+| Tier | Users | Class share |
+|---|---:|---:|
+| `low` | 114 | 57% |
+| `medium` | 47 | 23.5% |
+| `high` | 39 | 19.5% |
+The CYB006 README claims a 4-tier scheme (`low`/`medium`/`high`/`critical`).
+The sample data contains only 3 — there is no `critical` tier present.
+### Stratified split
+This is a **user-level** task (one row per user, 200 users total).
+Group-aware splitting does not apply since there is no
+many-rows-per-group structure to leak. We use
+**StratifiedShuffleSplit** (nested 70/15/15) to preserve the 3-tier
+class distribution across folds:
+| Fold | Users |
+|---|---:|
+| Train | 139 |
+| Validation | 31 |
+| Test | 30 |
+Class imbalance is addressed with balanced (inverse-frequency) class
+weights: passed to XGBoost as per-row `sample_weight`, and applied to
+the MLP through a weighted cross-entropy loss.
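The split and the balanced weights described above can be sketched with scikit-learn. This is an illustration under stated assumptions, not the private training script: the labels are synthesised from the class counts in the table, the holdout sizes are passed as integers (30, then 31) to match the fold table, and the helper name `stratified_holdout` is hypothetical.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.utils.class_weight import compute_sample_weight

# Stand-in labels built from the class counts above: low=114, medium=47, high=39.
y = np.array([0] * 114 + [1] * 47 + [2] * 39)
idx = np.arange(len(y))


def stratified_holdout(indices, labels, n_holdout, seed):
    """Split `indices` into (kept, held_out), preserving the class mix of `labels`."""
    sss = StratifiedShuffleSplit(n_splits=1, test_size=n_holdout, random_state=seed)
    keep_pos, held_pos = next(sss.split(indices.reshape(-1, 1), labels))
    return indices[keep_pos], indices[held_pos]


# Outer split carves off the test fold; the inner split carves validation from the rest.
train_val, test = stratified_holdout(idx, y, n_holdout=30, seed=42)
train, val = stratified_holdout(train_val, y[train_val], n_holdout=31, seed=42)

# Balanced per-row weights for the training fold, usable as `sample_weight` in fit().
sample_weight = compute_sample_weight("balanced", y[train])

print(len(train), len(val), len(test))
```

Passing integer holdout sizes avoids depending on how a fractional `test_size` is rounded on 200 and 170 rows.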
+## Feature pipeline
+The bundled `feature_engineering.py` is the canonical feature recipe.
+34 features survive after encoding, drawn from:
+- **Per-user numeric** (14, from `user_risk_summary.csv`): `total_login_attempts`, `successful_logins`, `failed_logins`, `mfa_failures`, `impossible_travel_events`, `lateral_hop_count`, `privilege_escalations`, `account_lockout_count`, `geo_dispersion_score`, `login_velocity_score`, `session_anomaly_rate`, `ueba_alert_count`, `overall_identity_risk_score`, `insider_threat_indicator_score`
+- **Per-user categorical** (1, one-hot): `peak_privilege_level_accessed` (6 values)
+- **Session aggregates** (8, derived from `login_sessions.csv`): `avg_session_duration_seconds`, `avg_mfa_response_latency_ms`, `avg_geo_anomaly_score`, `max_geo_anomaly_score`, `frac_impossible_travel`, `n_unique_countries`, `n_unique_devices`, `n_unique_applications`
+- **Engineered** (6): `failed_login_rate`, `mfa_failure_rate`, `ueba_alerts_per_session`, `hops_per_escalation`, `geo_velocity_composite`, `composite_anomaly_score`
+### Leakage exclusions
+Three columns from `user_risk_summary.csv` are dropped to avoid contamination:
+- `threat_actor_flag`: a positive flag implies `user_risk_tier = 'high'` (only high-tier users can be threat actors), so it acts as an oracle for part of the `high` class
+- `account_takeover_flag`: 2 positive cases out of 200 (1%); too sparse and oracle-prone
+- `credential_attack_victim_flag`: 1 positive case out of 200 (0.5%); same issue
+Four columns from `login_sessions.csv` are NOT aggregated into session
+features because they exhibited the structural non-overlap documented
+in [Leakage diagnostic](#leakage-diagnostic):
+- `velocity_anomaly_score`, `session_timestamp_utc`, `credential_attempt_count`, `login_outcome`
+## Evaluation
+### Test-set metrics, seed 42 (n = 30 disjoint users)
+**XGBoost** (the published `model_xgb.json` artifact)
+| Metric | Value |
+|---|---:|
+| Macro ROC-AUC (OvR) | **0.8017** |
+| Accuracy | **0.6667** |
+| Macro-F1 | 0.6454 |
+| Weighted-F1 | 0.6634 |
+**MLP** (the published `model_mlp.safetensors` artifact)
+| Metric | Value |
+|---|---:|
+| Macro ROC-AUC (OvR) | 0.6974 |
+| Accuracy | 0.6000 |
+| Macro-F1 | 0.5914 |
+| Weighted-F1 | 0.6068 |
+### Multi-seed robustness (XGBoost, 10 seeds)
+| Metric | Mean | Std | Min | Max |
+|---|---:|---:|---:|---:|
+| Accuracy | 0.700 | 0.082 | 0.533 | 0.867 |
+| Macro-F1 | 0.638 | 0.093 | 0.445 | 0.814 |
+| Macro ROC-AUC OvR | 0.812 | 0.048 | 0.738 | 0.877 |
+Full per-seed results in [`multi_seed_results.json`](./multi_seed_results.json).
+With only 30 test users per seed, single-seed accuracy varies materially
+(0.53–0.87 across seeds). **ROC-AUC 0.812 ± 0.048 is the more reliable
+performance estimate.** All 10 seeds yield all 3 tiers in the test
+fold thanks to stratification.
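A rough sanity check on that spread: if each of the 30 test users is treated as an independent Bernoulli trial (a simplifying assumption, since the seeds resample the same 200 users), the binomial standard error of accuracy is about the same size as the observed standard deviation across seeds. The function name below is illustrative.

```python
import math


def accuracy_standard_error(p, n):
    """Binomial standard error of an accuracy estimate p measured on n test rows."""
    return math.sqrt(p * (1.0 - p) / n)


n_test = 30
se = accuracy_standard_error(0.700, n_test)  # multi-seed mean accuracy from the table
one_user = 1 / n_test                         # accuracy change from a single test user

print(f"binomial SE at p=0.70, n=30: {se:.3f}")
print(f"one test user moves accuracy by {one_user:.3f}")
```

Under this approximation the standard error is roughly 0.084, close to the reported 0.082, which suggests that most of the seed-to-seed variation is explained by test-fold size alone.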
+### Per-class F1 (seed 42)
+| Tier | Class share | XGBoost F1 | MLP F1 |
+|---|---:|---:|---:|
+| `low` | 57% | 0.727 | 0.647 |
+| `medium` | 23.5% | 0.286 | 0.400 |
+| `high` | 19.5% | **0.923** | 0.727 |
+The model performs best on `high` (the most behaviourally distinct
+tier — high failed-login rates, frequent impossible travel, elevated
+anomaly scores) and `low` (the majority class). The `medium` tier is
+hardest, which is the expected behaviour for a 3-tier ordinal task —
+mid-class samples sit between two boundaries and pick up confusion
+from both sides.
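The seed-42 XGBoost figures can be re-derived from the confusion matrix stored under `full_model_metrics` in `ablation_results.json`. The sketch below uses plain Python and assumes the scikit-learn convention of rows as true labels and columns as predictions, which is consistent with the row sums (17 / 7 / 6 test users) matching the class shares.

```python
LABELS = ["low", "medium", "high"]
# Seed-42 XGBoost confusion matrix: rows = true tier, columns = predicted tier.
CM = [
    [12, 5, 0],
    [4, 2, 1],
    [0, 0, 6],
]


def f1_per_class(cm):
    """Per-class F1 computed as 2*TP / (2*TP + FP + FN)."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp
        fp = sum(cm[r][k] for r in range(n)) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


support = [sum(row) for row in CM]
total = sum(support)
per_class = f1_per_class(CM)

accuracy = sum(CM[k][k] for k in range(len(CM))) / total
macro_f1 = sum(per_class) / len(per_class)
weighted_f1 = sum(f * s for f, s in zip(per_class, support)) / total

for label, f1 in zip(LABELS, per_class):
    print(f"{label}: F1 = {f1:.3f}")
print(f"accuracy={accuracy:.4f} macro_f1={macro_f1:.4f} weighted_f1={weighted_f1:.4f}")
```

The `medium` row shows where the errors go: 4 of the 7 medium users are predicted `low` and 1 is predicted `high`.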
+### Ablation: which feature groups matter
+| Configuration | Accuracy | Macro-F1 | Δ accuracy (ablated − full) |
+|---|---:|---:|---:|
+| Full feature set (published) | 0.6667 | 0.6454 | — |
+| No user aggregates (count features) | 0.5333 | 0.4586 | **−0.1333** |
+| No risk scores | 0.5667 | 0.5300 | −0.1000 |
+| No engineered features | 0.5667 | 0.5444 | −0.1000 |
+| No session aggregates | 0.7000 | 0.6130 | +0.0333 |
+Findings:
+1. **User-level count features matter most** (failed logins, lateral
+   hops, MFA failures). Dropping them costs 13 pp accuracy.
+2. **Risk scores and engineered features each contribute ~10 pp.**
+   With only 139 training users, the trees can't fully recover
+   engineered composites from raw inputs.
+3. **Session aggregates give a mixed result** in seed 42. Dropping
+   them raises accuracy by 3 pp (one test user) but lowers macro-F1
+   from 0.645 to 0.613 and macro ROC-AUC from 0.802 to 0.763
+   (see `ablation_results.json`). With n=200, extra features can
+   add noise on a small sample. Session aggregates are kept in
+   the published pipeline because they help on most other seeds.
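Because the test fold holds 30 users, each accuracy delta in the table corresponds to a small whole number of users. The sketch below recomputes the deltas from the correct-prediction counts (the diagonal sums of the confusion matrices in `ablation_results.json`), using the table's sign convention of ablated minus full. The dictionary names are illustrative.

```python
N_TEST = 30
FULL_CORRECT = 20  # 12 + 2 + 6 on the diagonal of the full-model confusion matrix

# Correct predictions per ablation, summed from each confusion-matrix diagonal.
ABLATED_CORRECT = {
    "no_user_aggregates": 16,
    "no_risk_scores": 17,
    "no_engineered": 17,
    "no_session_aggregates": 21,
}

# Delta expressed both in test users and in accuracy (ablated minus full).
delta_users = {name: c - FULL_CORRECT for name, c in ABLATED_CORRECT.items()}
delta_accuracy = {name: round(d / N_TEST, 4) for name, d in delta_users.items()}

# Most harmful ablation first.
ranked = sorted(delta_users, key=lambda name: delta_users[name])

for name in ranked:
    print(f"{name}: {delta_users[name]:+d} users, {delta_accuracy[name]:+.4f} accuracy")
```

Note that `ablation_results.json` stores `delta_accuracy` with the opposite sign (full minus ablated), so a positive value in the JSON means the ablation hurt.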
+### Architecture
+**XGBoost:** multi-class gradient boosting (`multi:softprob`, 3 classes),
+`hist` tree method, class-balanced sample weights, early stopping on
+validation mlogloss.
+**MLP:** `34 → 128 → 64 → 3`, each hidden layer followed by `BatchNorm1d`
+→ `ReLU` → `Dropout(0.3)`, weighted cross-entropy loss, AdamW optimizer,
+early stopping on validation macro-F1.
+Training hyperparameters are held internally by XpertSystems.
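For readers who want to see the MLP layout concretely, the description above maps onto a small `nn.Sequential`. This is a sketch of the stated architecture only: the function name `build_mlp` is hypothetical, and the parameter names it produces are not guaranteed to match the keys stored in `model_mlp.safetensors`, so consult `inference_example.ipynb` for the actual loading code.

```python
import torch
from torch import nn


def build_mlp(n_features=34, hidden=(128, 64), n_classes=3, dropout=0.3):
    """Linear -> BatchNorm1d -> ReLU -> Dropout per hidden layer, then a linear head."""
    layers = []
    width = n_features
    for h in hidden:
        layers += [nn.Linear(width, h), nn.BatchNorm1d(h), nn.ReLU(), nn.Dropout(dropout)]
        width = h
    layers.append(nn.Linear(width, n_classes))
    return nn.Sequential(*layers)


# eval() disables dropout and makes BatchNorm use its running statistics.
model = build_mlp().eval()
with torch.no_grad():
    logits = model(torch.zeros(2, 34))  # two dummy, already-scaled feature rows
proba = torch.softmax(logits, dim=1)

print(proba.shape)
```

The weights here are randomly initialised, so the probabilities are meaningless; the example only confirms the input and output shapes.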
+## Leakage diagnostic
+This is the most important section of the model card. The full
+diagnostic is in [`leakage_diagnostic.json`](./leakage_diagnostic.json).
+Summary:
+**Setup:** Train an XGBoost binary classifier to predict
+`threat_actor_capability_tier != 'none'` from per-session features.
+Use group-aware split by `user_id` (15% test = 30 disjoint users).
+Cumulatively drop suspected oracle feature groups and re-evaluate.
+| Configuration | n_features | Accuracy | ROC-AUC |
+|---|---:|---:|---:|
+| Full feature set | 166 | **1.0000** | **1.0000** |
+| − behavioural oracles (velocity, timestamp, credential count) | 163 | 0.9991 | 1.0000 |
+| − login_outcome | 154 | 0.9982 | 1.0000 |
+| − geo_country_code | 138 | 0.9987 | 1.0000 |
+| − device_trust_level | 133 | 0.9982 | 0.9999 |
+| − user_risk_tier | 130 | 0.9978 | 0.9996 |
+| − geo_anomaly_score | 129 | 0.9707 | 0.9897 |
+**Even after dropping six oracle feature groups (37 columns), the
+model still achieves 97% test accuracy and AUC 0.99.** The leakage
+is not localised to a few suspect features; it is distributed across
+the entire feature space because the synthetic generator produces
+threat-actor sessions that are anomalous on every dimension
+simultaneously, with no overlap into legitimate behaviour.
+### Recommendation to dataset author
+For threat-actor detection to be a useful ML benchmark on this
+dataset, the next generator version should introduce **distributional
+overlap** between threat-actor and legitimate session populations
+across all anomaly indicators:
+- `velocity_anomaly_score`: extend non-actor distribution into [0.0, 0.5] and shrink actor to [0.3, 0.9] for substantial overlap in [0.3, 0.5]
+- `session_timestamp_utc`: interleave threat-actor and legitimate sessions across the same time window
+- `credential_attempt_count`: allow some non-actor users to exhibit elevated counts (mistyped passwords, MFA fatigue)
+- `login_outcome`: allow `failure_account_locked` and `success_anomalous` for some legitimate sessions
+- `geo_country_code`: include a baseline frequency of risky-country logins among legitimate users (business travel, contractors)
+- `device_trust_level`: allow threat actors to occasionally use compliant devices (token theft scenarios)
+Target operating regime: real-world detection AUC 0.7–0.9, not 1.0.
+### What this means for buyers
+If you're evaluating CYB006 for a threat-detection product, you should
+know that:
+- **The sample dataset cannot be used to honestly benchmark threat-actor
+  detection models.** Even a plain XGBoost model scores 100%, so the
+  benchmark cannot separate good detection systems from bad ones.
+- **The user-risk-tier task shipped in this baseline is a legitimate
+  ML benchmark on the sample data.** It generalises modestly (AUC 0.81)
+  and is the right starting point for evaluating insider-threat
+  scoring on the sample.
+- **The full ~1.1M-row CYB006 product may or may not have the same
+  structural property.** Confirm with XpertSystems before committing
+  to a threat-detection use case.
+## Limitations
+**This is a baseline reference, not a production identity-security system.**
+1. **Small held-out test fold (n=30).** With only 30 test users per
+   seed, single-seed metrics swing 0.53–0.87 in accuracy. The
+   multi-seed ROC-AUC of 0.81 ± 0.05 is the reliable estimate. The
+   full ~1.1M-row product would tighten the confidence interval
+   substantially.
+2. **The `medium` tier is harder than the others.** F1 0.29 on
+   `medium` (vs 0.92 on `high`) is expected — ordinal middle classes
+   are typically the hardest under a flat-classification setup.
+3. **MLP weaker than XGBoost.** AUC 0.70 vs 0.80. With only 139
+   training users, the MLP cannot match boosted trees on tabular data.
+4. **Threat-actor detection task is not usable on this sample.**
+   See [Leakage diagnostic](#leakage-diagnostic) above.
+5. **Synthetic-vs-real transfer.** The dataset is synthetic and
+   calibrated to identity-security benchmarks (Microsoft Digital
+   Defense Report, Okta Customer Identity Trends, Verizon DBIR, CISA
+   Joint Advisories, Mandiant M-Trends, MITRE ATT&CK Evaluations).
+   Real identity telemetry has different noise characteristics; do
+   not assume metrics transfer.
+6. **3 tiers, not 4.** README lists `low`/`medium`/`high`/`critical`
+   but the data contains only 3. If you need 4-class support, wait
+   for a regenerated sample.
+## Notes on dataset schema
+The CYB006 sample dataset README describes some fields differently
+from the actual schema. The model was trained on the actual schema;
+this note helps buyers reconcile what they read with what they receive.
+| What the README says | What the data actually contains |
+|---|---|
+| `session_phase` has 6 values | **All 5,000 rows have `session_phase = session_termination`** — the field is constant. There is no usable session-phase target. |
+| `login_outcome` has 4 values (`success / failed / mfa_required / blocked`) | 9 values: `success_normal`, `failure_bad_password`, `failure_account_locked`, `failure_mfa_rejected`, `failure_device_untrusted`, `failure_geo_blocked`, `success_anomalous`, `account_takeover_confirmed`, `session_hijacked` |
+| 4 actor tiers | 5 values: 4 tier labels + `none` (92% of rows have `none`) |
+| `mfa_challenge_type` has 5 values | 7: adds `authenticator_app`, `hardware_token`, `voice_call` |
+| `authentication_method` has 4 values | 5: no `api_key`; adds `password_plus_mfa`, `phishing_resistant_fido2` |
+| `user_risk_tier` has 4 values (`low/medium/high/critical`) | 3 values: no `critical` |
+| `session_timestamp_utc` is an ISO timestamp string | It is an integer |
+| `user_risk_summary.csv` columns listed | Adds `peak_privilege_level_accessed`, `credential_attack_victim_flag` (not in README) |
+None of these affects model correctness — the feature pipeline uses
+the actual column names. If you build your own pipeline against the
+dataset, use the actual columns.
+## Intended use
+- **Evaluating fit** of the CYB006 dataset for your insider-threat
+  or user-risk-scoring research
+- **Baseline reference** for new model architectures
+- **Reference example of structural-leakage diagnostics** in synthetic
+  cybersecurity datasets: the cumulative feature-drop methodology
+  described in the Leakage diagnostic section is reusable
+- **Feature engineering reference** for per-user identity aggregates
+## Out-of-scope use
+- Production identity-security detection on real telemetry
+- Threat-actor attribution (this baseline does not address that task; see why above)
+- Any operational security or law-enforcement decision
+## Reproducibility
+Outputs above were produced with `seed = 42` (published artifact),
+nested `StratifiedShuffleSplit` (70/15/15 by user_risk_tier), on the
+published sample (`xpertsystems/cyb006-sample`, version 1.0.0,
+generated 2026-05-16). The feature pipeline in `feature_engineering.py`
+is deterministic and the trained weights in this repo correspond
+exactly to the metrics above.
+Multi-seed results (seeds 42, 7, 13, 17, 23, 31, 45, 99, 123, 200) in
+`multi_seed_results.json` show the spread across splits (accuracy
+0.53–0.87, macro ROC-AUC 0.74–0.88).
+The training script itself is private to XpertSystems.
+## Files in this repo
+| File | Purpose |
+|---|---|
+| `model_xgb.json` | XGBoost weights (seed 42) |
+| `model_mlp.safetensors` | PyTorch MLP weights (seed 42) |
+| `feature_engineering.py` | Feature pipeline |
+| `feature_meta.json` | Feature column order + categorical levels |
+| `feature_scaler.json` | MLP input mean/std (XGBoost ignores) |
+| `validation_results.json` | Per-class metrics, confusion matrix, architecture |
+| `ablation_results.json` | Per-feature-group ablation |
+| `multi_seed_results.json` | XGBoost metrics across 10 seeds |
+| `leakage_diagnostic.json` | **Structural-leakage audit on threat-actor detection** |
+| `inference_example.ipynb` | End-to-end inference demo notebook |
+| `README.md` | This file |
+## Contact and full product
+The full **CYB006** dataset contains ~1.1 million rows across four
+files, with 12 calibrated benchmark validation tests drawn from
+authoritative identity security and threat intelligence sources
+(Microsoft Digital Defense Report, Okta Customer Identity Trends,
+Verizon DBIR, CISA Joint Advisories, Mandiant M-Trends, MITRE ATT&CK
+Evaluations). The full XpertSystems.ai synthetic data catalogue spans
+41 SKUs across Cybersecurity, Healthcare, Insurance & Risk, Oil & Gas,
+and Materials & Energy.
+- 📧 **pradeep@xpertsystems.ai**
+- 🌐 **https://xpertsystems.ai**
+- 🗂  Dataset: https://huggingface.co/datasets/xpertsystems/cyb006-sample
+- 🤖 Companion models:
+  - https://huggingface.co/xpertsystems/cyb001-baseline-classifier (network traffic)
+  - https://huggingface.co/xpertsystems/cyb002-baseline-classifier (ATT&CK kill-chain)
+  - https://huggingface.co/xpertsystems/cyb003-baseline-classifier (malware execution phase)
+  - https://huggingface.co/xpertsystems/cyb004-baseline-classifier (phishing campaign phase)
+  - https://huggingface.co/xpertsystems/cyb005-baseline-classifier (ransomware actor-tier attribution)
+## Citation
+```bibtex
+@misc{xpertsystems_cyb006_baseline_2026,
+  title  = {CYB006 Baseline Classifier: XGBoost and MLP for User Risk Tier Classification, with Structural-Leakage Diagnostic on Threat-Actor Detection},
+  author = {XpertSystems.ai},
+  year   = {2026},
+  url    = {https://huggingface.co/xpertsystems/cyb006-baseline-classifier},
+  note   = {Baseline reference model trained on xpertsystems/cyb006-sample}
+}
+```

ablation_results.json ADDED

	@@ -0,0 +1,209 @@

+{
+  "purpose": "Quantify how much each feature group contributes to the headline XGBoost score. Identical architecture, same stratified split, with one feature group dropped at a time.",
+  "full_model_metrics": {
+    "model": "xgboost",
+    "accuracy": 0.6666666666666666,
+    "macro_f1": 0.6453546453546454,
+    "weighted_f1": 0.6634032634032633,
+    "per_class_f1": {
+      "low": 0.7272727272727273,
+      "medium": 0.2857142857142857,
+      "high": 0.9230769230769231
+    },
+    "confusion_matrix": {
+      "labels": [
+        "low",
+        "medium",
+        "high"
+      ],
+      "matrix": [
+        [
+          12,
+          5,
+          0
+        ],
+        [
+          4,
+          2,
+          1
+        ],
+        [
+          0,
+          0,
+          6
+        ]
+      ]
+    },
+    "macro_roc_auc_ovr": 0.8016919142238835
+  },
+  "ablations": {
+    "no_session_aggregates": {
+      "n_features": 26,
+      "dropped_count": 8,
+      "metrics": {
+        "model": "xgboost_no_session_aggregates",
+        "accuracy": 0.7,
+        "macro_f1": 0.6129870129870131,
+        "weighted_f1": 0.6671861471861472,
+        "per_class_f1": {
+          "low": 0.8,
+          "medium": 0.18181818181818182,
+          "high": 0.8571428571428571
+        },
+        "confusion_matrix": {
+          "labels": [
+            "low",
+            "medium",
+            "high"
+          ],
+          "matrix": [
+            [
+              14,
+              3,
+              0
+            ],
+            [
+              4,
+              1,
+              2
+            ],
+            [
+              0,
+              0,
+              6
+            ]
+          ]
+        },
+        "macro_roc_auc_ovr": 0.7625392687732843
+      },
+      "delta_accuracy": -0.033333333333333326,
+      "delta_macro_f1": 0.03236763236763229
+    },
+    "no_user_aggregates": {
+      "n_features": 26,
+      "dropped_count": 8,
+      "metrics": {
+        "model": "xgboost_no_user_aggregates",
+        "accuracy": 0.5333333333333333,
+        "macro_f1": 0.45864045864045866,
+        "weighted_f1": 0.5130221130221131,
+        "per_class_f1": {
+          "low": 0.6486486486486487,
+          "medium": 0.0,
+          "high": 0.7272727272727273
+        },
+        "confusion_matrix": {
+          "labels": [
+            "low",
+            "medium",
+            "high"
+          ],
+          "matrix": [
+            [
+              12,
+              4,
+              1
+            ],
+            [
+              7,
+              0,
+              0
+            ],
+            [
+              1,
+              1,
+              4
+            ]
+          ]
+        },
+        "macro_roc_auc_ovr": 0.7042183744549474
+      },
+      "delta_accuracy": 0.1333333333333333,
+      "delta_macro_f1": 0.18671418671418671
+    },
+    "no_risk_scores": {
+      "n_features": 28,
+      "dropped_count": 6,
+      "metrics": {
+        "model": "xgboost_no_risk_scores",
+        "accuracy": 0.5666666666666667,
+        "macro_f1": 0.5300213675213675,
+        "weighted_f1": 0.5745405982905983,
+        "per_class_f1": {
+          "low": 0.6875,
+          "medium": 0.13333333333333333,
+          "high": 0.7692307692307693
+        },
+        "confusion_matrix": {
+          "labels": [
+            "low",
+            "medium",
+            "high"
+          ],
+          "matrix": [
+            [
+              11,
+              6,
+              0
+            ],
+            [
+              4,
+              1,
+              2
+            ],
+            [
+              0,
+              1,
+              5
+            ]
+          ]
+        },
+        "macro_roc_auc_ovr": 0.7397649416511309
+      },
+      "delta_accuracy": 0.09999999999999998,
+      "delta_macro_f1": 0.11533327783327785
+    },
+    "no_engineered": {
+      "n_features": 28,
+      "dropped_count": 6,
+      "metrics": {
+        "model": "xgboost_no_engineered",
+        "accuracy": 0.5666666666666667,
+        "macro_f1": 0.5444444444444444,
+        "weighted_f1": 0.5755555555555555,
+        "per_class_f1": {
+          "low": 0.6666666666666666,
+          "medium": 0.13333333333333333,
+          "high": 0.8333333333333334
+        },
+        "confusion_matrix": {
+          "labels": [
+            "low",
+            "medium",
+            "high"
+          ],
+          "matrix": [
+            [
+              11,
+              6,
+              0
+            ],
+            [
+              5,
+              1,
+              1
+            ],
+            [
+              0,
+              1,
+              5
+            ]
+          ]
+        },
+        "macro_roc_auc_ovr": 0.7972402822147068
+      },
+      "delta_accuracy": 0.09999999999999998,
+      "delta_macro_f1": 0.10091020091020098
+    }
+  }
+}

feature_engineering.py ADDED

	@@ -0,0 +1,374 @@

+"""
+feature_engineering.py
+======================
+Feature pipeline for the CYB006 baseline classifier.
+Predicts `user_risk_tier` (3-class: low / medium / high) from per-user
+identity aggregates on the CYB006 sample dataset.
+CSV inputs:
+    user_risk_summary.csv   (primary, per-user aggregates, 200 rows)
+    login_sessions.csv      (per-session telemetry, joined as
+                             per-user behavioural aggregates)
+    identity_topology.csv   (identity domain registry; reserved for
+                             future work - no direct user join key)
+    auth_events.csv         (discrete event log; reserved for
+                             future work)
+Target classes (3):
+    low, medium, high
+Why this task instead of threat_actor_capability_tier
+-----------------------------------------------------
+The CYB006 README lists "threat-actor tier classification (4-class)" as
+its primary suggested use case. We piloted that target first and found
+the sample dataset has STRUCTURAL DETERMINISM: every actor-tier signal
+in the data (velocity_anomaly_score, session_timestamp, credential
+attempt count, login outcome, geo country code, device trust level,
+user risk tier itself, geo anomaly score) carries non-overlapping
+distributions between threat and legitimate sessions. As a result, a
+plain XGBoost achieves 100% test accuracy on threat-actor binary
+classification across every random seed - and stays at 97-100%
+accuracy even with all six oracle feature groups removed.
+This is not a methodological failure; it's a property of how the
+sample was generated. Real-world identity telemetry has substantial
+overlap between threat-actor and legitimate behaviour. The model card
+documents this as a diagnostic finding for the dataset author and a
+caveat for buyers planning to train detection models on the sample.
+For a working baseline that demonstrates honest ML on the dataset, we
+shifted to predicting `user_risk_tier` from per-user aggregates. This
+task has overlapping per-tier feature distributions, no oracle features,
+and improves modestly on the majority-class baseline (accuracy 0.67 vs 0.57).
+Public API
+----------
+    build_features(user_risk_path, sessions_path) -> (X, y, ids, meta)
+    transform_single(record, meta) -> np.ndarray
+    save_meta(meta, path) / load_meta(path)
+License
+-------
+Ships with the public model on Hugging Face under CC-BY-NC-4.0,
+matching the dataset license. See README.md.
+"""
+from __future__ import annotations
+import json
+from pathlib import Path
+from typing import Any
+import numpy as np
+import pandas as pd
+# ---------------------------------------------------------------------------
+# Label space
+# ---------------------------------------------------------------------------
+# Ordered low -> high. Note: CYB006 README claims a 4th tier 'critical' but
+# the sample data contains only 3 (low, medium, high).
+LABEL_ORDER = ["low", "medium", "high"]
+LABEL_TO_INT = {lbl: i for i, lbl in enumerate(LABEL_ORDER)}
+INT_TO_LABEL = {i: lbl for lbl, i in LABEL_TO_INT.items()}
+# ---------------------------------------------------------------------------
+# Identifier and target columns
+# ---------------------------------------------------------------------------
+ID_COLUMNS = ["user_id"]
+TARGET_COLUMN = "user_risk_tier"
+# ---------------------------------------------------------------------------
+# Per-user numeric features from user_risk_summary.csv
+# ---------------------------------------------------------------------------
+# These are aggregate counts and continuous scores. They carry overlapping
+# distributions across tiers - not oracles.
+USER_NUMERIC_FEATURES = [
+    "total_login_attempts",
+    "successful_logins",
+    "failed_logins",
+    "mfa_failures",
+    "impossible_travel_events",
+    "lateral_hop_count",
+    "privilege_escalations",
+    "account_lockout_count",
+    "geo_dispersion_score",
+    "login_velocity_score",
+    "session_anomaly_rate",
+    "ueba_alert_count",
+    "overall_identity_risk_score",
+    "insider_threat_indicator_score",
+]
+USER_CATEGORICAL_FEATURES = [
+    "peak_privilege_level_accessed",   # 6 values
+]
+# Note: we intentionally exclude `threat_actor_flag`, `account_takeover_flag`,
+# and `credential_attack_victim_flag` from user_risk_summary as features.
+# threat_actor_flag is a one-way oracle for tier=high: only high-tier users
+# can be flagged as threat actors, so a set flag implies tier=high.
+# account_takeover_flag and credential_attack_victim_flag are extremely rare
+# (2/200 and 1/200), so they are not useful as features in the sample, and
+# using them risks the same kind of structural leakage we documented for
+# threat-actor classification.
+USER_LEAKY_COLUMNS = [
+    "threat_actor_flag",
+    "account_takeover_flag",
+    "credential_attack_victim_flag",
+]
+# ---------------------------------------------------------------------------
+# Per-session aggregates joined into the user-level row
+# ---------------------------------------------------------------------------
+# We compute these from login_sessions.csv aggregated by user_id. They add
+# behavioural color (avg session duration, fraction of sessions with
+# impossible travel, etc.) without introducing leakage. We explicitly
+# exclude session-level columns that exhibit non-overlap with threat actors
+# (velocity_anomaly_score, session_timestamp_utc, credential_attempt_count,
+# login_outcome) because those features create degenerate signal even when
+# aggregated, and would compromise the user_risk_tier evaluation by
+# enabling shortcuts via the threat_actor_flag-correlated structure.
+SESSION_AGGS_NUMERIC = [
+    "avg_session_duration_seconds",
+    "avg_mfa_response_latency_ms",
+    "avg_geo_anomaly_score",
+    "max_geo_anomaly_score",
+    "frac_impossible_travel",
+    "n_unique_countries",
+    "n_unique_devices",
+    "n_unique_applications",
+]
+def _aggregate_sessions(sessions: pd.DataFrame) -> pd.DataFrame:
+    """Compute per-user session aggregates without using leaky features."""
+    g = sessions.groupby("user_id")
+    aggs = pd.DataFrame({
+        "avg_session_duration_seconds": g["session_duration_seconds"].mean(),
+        "avg_mfa_response_latency_ms":  g["mfa_response_latency_ms"].mean(),
+        "avg_geo_anomaly_score":        g["geo_anomaly_score"].mean(),
+        "max_geo_anomaly_score":        g["geo_anomaly_score"].max(),
+        "frac_impossible_travel":       g["impossible_travel_flag"].mean(),
+        "n_unique_countries":           g["geo_country_code"].nunique(),
+        "n_unique_devices":             g["device_id_hash"].nunique(),
+        "n_unique_applications":        g["target_application_id"].nunique(),
+    }).reset_index()
+    return aggs
+# ---------------------------------------------------------------------------
+# Engineered features
+# ---------------------------------------------------------------------------
+def _add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
+    """
+    Six engineered features that combine the raw aggregates into
+    risk-discriminative composites. None encode the target directly.
+    """
+    df = df.copy()
+    # 1. Failed-login fraction. Common signal across all risk tiers but
+    #    high-tier users have systematically more failures.
+    denom = df["total_login_attempts"].clip(lower=1)
+    df["failed_login_rate"] = (df["failed_logins"] / denom).astype(float)
+    # 2. MFA failure rate per login.
+    df["mfa_failure_rate"] = (df["mfa_failures"] / denom).astype(float)
+    # 3. UEBA alerts per successful login (successful_logins is used as the
+    #    proxy for session volume; the denominator is floored at 1).
+    sess_denom = df["successful_logins"].clip(lower=1)
+    df["ueba_alerts_per_session"] = (df["ueba_alert_count"] / sess_denom).astype(float)
+    # 4. Lateral movement intensity (hops per privilege escalation).
+    pe_denom = df["privilege_escalations"].clip(lower=1)
+    df["hops_per_escalation"] = (df["lateral_hop_count"] / pe_denom).astype(float)
+    # 5. Geo-velocity composite: dispersion x velocity score (continuous).
+    df["geo_velocity_composite"] = (
+        df["geo_dispersion_score"] * df["login_velocity_score"]
+    ).astype(float)
+    # 6. Composite identity-anomaly score: average of risk + insider scores.
+    df["composite_anomaly_score"] = (
+        (df["overall_identity_risk_score"] + df["insider_threat_indicator_score"]) / 2.0
+    ).astype(float)
+    return df
+# ---------------------------------------------------------------------------
+# Public API
+# ---------------------------------------------------------------------------
+def build_features(
+    user_risk_path: str | Path,
+    sessions_path: str | Path,
+) -> tuple[pd.DataFrame, pd.Series, pd.Series, dict[str, Any]]:
+    """
+    Load user_risk_summary, join non-leaky session aggregates, engineer
+    features, one-hot encode, return (X, y, ids, meta).
+    `ids` is a Series of user_id values aligned with X (used for
+    deterministic predictions / round-tripping; not a group label since
+    this task is user-level, not session-level).
+    """
+    users = pd.read_csv(user_risk_path)
+    sessions = pd.read_csv(sessions_path)
+    y = users[TARGET_COLUMN].map(LABEL_TO_INT)
+    if y.isna().any():
+        bad = users.loc[y.isna(), TARGET_COLUMN].unique()
+        raise ValueError(f"Unknown user_risk_tier values: {bad}")
+    y = y.astype(int)
+    ids = users["user_id"].copy()
+    users = users.drop(
+        columns=ID_COLUMNS + [TARGET_COLUMN] + USER_LEAKY_COLUMNS,
+        errors="ignore",
+    )
+    session_aggs = _aggregate_sessions(sessions)
+    users["__user_id__"] = ids
+    users = users.merge(
+        session_aggs.rename(columns={"user_id": "__user_id__"}),
+        on="__user_id__", how="left",
+    ).drop(columns=["__user_id__"])
+    users = _add_engineered_features(users)
+    numeric_features = (
+        USER_NUMERIC_FEATURES
+        + SESSION_AGGS_NUMERIC
+        + [
+            "failed_login_rate", "mfa_failure_rate", "ueba_alerts_per_session",
+            "hops_per_escalation", "geo_velocity_composite", "composite_anomaly_score",
+        ]
+    )
+    numeric_features = [c for c in numeric_features if c in users.columns]
+    X_numeric = users[numeric_features].astype(float)
+    categorical_levels: dict[str, list[str]] = {}
+    blocks: list[pd.DataFrame] = []
+    for col in USER_CATEGORICAL_FEATURES:
+        if col not in users.columns:
+            continue
+        levels = sorted(users[col].dropna().unique().tolist())
+        categorical_levels[col] = levels
+        block = pd.get_dummies(
+            users[col].astype("category").cat.set_categories(levels),
+            prefix=col, dummy_na=False,
+        ).astype(int)
+        blocks.append(block)
+    X = pd.concat(
+        [X_numeric.reset_index(drop=True)]
+        + [b.reset_index(drop=True) for b in blocks],
+        axis=1,
+    ).fillna(0.0)
+    meta = {
+        "feature_names": X.columns.tolist(),
+        "numeric_features": numeric_features,
+        "categorical_levels": categorical_levels,
+        "label_to_int": LABEL_TO_INT,
+        "int_to_label": INT_TO_LABEL,
+        "user_leaky_excluded": USER_LEAKY_COLUMNS,
+    }
+    return X, y, ids, meta
+def transform_single(
+    record: dict | pd.DataFrame,
+    meta: dict[str, Any],
+) -> np.ndarray:
+    """Encode a single per-user record for inference.
+    Caller is responsible for computing session aggregates (the
+    SESSION_AGGS_NUMERIC fields) and passing them in record. See the
+    inference notebook for the standard pattern.
+    """
+    if isinstance(record, dict):
+        df = pd.DataFrame([record.copy()])
+    else:
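+        # reset_index returns a copy with a clean 0..n-1 index, so the numeric
+        # and one-hot blocks built below align row-for-row in pd.concat.
+        df = record.reset_index(drop=True)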
+    df = _add_engineered_features(df)
+    numeric = pd.DataFrame({
+        col: df.get(col, pd.Series([0.0] * len(df))).astype(float).values
+        for col in meta["numeric_features"]
+    })
+    blocks: list[pd.DataFrame] = [numeric]
+    for col, levels in meta["categorical_levels"].items():
+        val = df.get(col, pd.Series([None] * len(df)))
+        block = pd.get_dummies(
+            val.astype("category").cat.set_categories(levels),
+            prefix=col, dummy_na=False,
+        ).astype(int)
+        for lvl in levels:
+            cname = f"{col}_{lvl}"
+            if cname not in block.columns:
+                block[cname] = 0
+        block = block[[f"{col}_{lvl}" for lvl in levels]]
+        blocks.append(block)
+    X = pd.concat(blocks, axis=1).fillna(0.0)
+    X = X.reindex(columns=meta["feature_names"], fill_value=0.0)
+    return X.values.astype(np.float32)
+def save_meta(meta: dict[str, Any], path: str | Path) -> None:
+    """Write feature metadata to JSON; int_to_label keys are stored as strings."""
+    serializable = {
+        "feature_names": meta["feature_names"],
+        "numeric_features": meta["numeric_features"],
+        "categorical_levels": meta["categorical_levels"],
+        "label_to_int": meta["label_to_int"],
+        "int_to_label": {str(k): v for k, v in meta["int_to_label"].items()},
+        "user_leaky_excluded": meta.get("user_leaky_excluded", []),
+    }
+    with open(path, "w") as f:
+        json.dump(serializable, f, indent=2)
+def load_meta(path: str | Path) -> dict[str, Any]:
+    """Read feature metadata from JSON and restore integer keys in int_to_label."""
+    with open(path) as f:
+        meta = json.load(f)
+    meta["int_to_label"] = {int(k): v for k, v in meta["int_to_label"].items()}
+    return meta
+def compute_session_aggregates_for_user(
+    user_sessions: pd.DataFrame,
+) -> dict:
+    """Compute session aggregates for a single user (used at inference)."""
+    aggs = {
+        "avg_session_duration_seconds": float(user_sessions["session_duration_seconds"].mean()),
+        "avg_mfa_response_latency_ms":  float(user_sessions["mfa_response_latency_ms"].mean()),
+        "avg_geo_anomaly_score":        float(user_sessions["geo_anomaly_score"].mean()),
+        "max_geo_anomaly_score":        float(user_sessions["geo_anomaly_score"].max()),
+        "frac_impossible_travel":       float(user_sessions["impossible_travel_flag"].mean()),
+        "n_unique_countries":           int(user_sessions["geo_country_code"].nunique()),
+        "n_unique_devices":             int(user_sessions["device_id_hash"].nunique()),
+        "n_unique_applications":        int(user_sessions["target_application_id"].nunique()),
+    }
+    return aggs
+if __name__ == "__main__":
+    import sys
+    base = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("/mnt/user-data/uploads")
+    X, y, ids, meta = build_features(
+        base / "user_risk_summary.csv",
+        base / "login_sessions.csv",
+    )
+    print(f"X shape: {X.shape}")
+    print(f"y shape: {y.shape}")
+    print(f"n_features: {len(meta['feature_names'])}")
+    print(f"label distribution:\n{y.map(INT_TO_LABEL).value_counts()}")
+    print(f"X has NaN: {X.isnull().any().any()}")
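
The two conventions in `feature_engineering.py` that most affect inference are the clipped denominators in the ratio features and the fixed level order of the one-hot block. A minimal pure-Python sketch of both follows. `safe_rate` and `one_hot_fixed` are hypothetical helper names used only for illustration; they are not part of the module, which does the same work with pandas (`clip(lower=1)` and `get_dummies` over a fixed category list).

```python
def safe_rate(numerator: float, denominator: float) -> float:
    """Divide with the denominator floored at 1, mirroring clip(lower=1)."""
    return float(numerator) / max(float(denominator), 1.0)


def one_hot_fixed(value, levels, prefix):
    """One-hot encode `value` against a fixed, ordered list of levels.

    A value outside `levels` (or None) yields an all-zero block, so the
    output always has the same keys in the same order.
    """
    return {f"{prefix}_{lvl}": int(value == lvl) for lvl in levels}


# Counts taken from the example record in the inference notebook.
failed_login_rate = safe_rate(25, 98)        # 25 failures out of 98 attempts
ueba_alerts_per_session = safe_rate(0, 0)    # zero successful logins, no ZeroDivisionError

levels = ["admin_domain", "admin_local", "global_admin",
          "power_user", "service_account", "standard_user"]
known = one_hot_fixed("admin_domain", levels, "peak_privilege_level_accessed")
# "break_glass" is a made-up level that is absent from the training data.
unseen = one_hot_fixed("break_glass", levels, "peak_privilege_level_accessed")
```

One consequence of the floor worth keeping in mind: when the denominator is zero the ratio degrades to the raw numerator, so a user with no successful logins and 3 alerts gets 3.0 rather than NaN.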

feature_meta.json ADDED Viewed

	@@ -0,0 +1,93 @@

+{
+  "feature_names": [
+    "total_login_attempts",
+    "successful_logins",
+    "failed_logins",
+    "mfa_failures",
+    "impossible_travel_events",
+    "lateral_hop_count",
+    "privilege_escalations",
+    "account_lockout_count",
+    "geo_dispersion_score",
+    "login_velocity_score",
+    "session_anomaly_rate",
+    "ueba_alert_count",
+    "overall_identity_risk_score",
+    "insider_threat_indicator_score",
+    "avg_session_duration_seconds",
+    "avg_mfa_response_latency_ms",
+    "avg_geo_anomaly_score",
+    "max_geo_anomaly_score",
+    "frac_impossible_travel",
+    "n_unique_countries",
+    "n_unique_devices",
+    "n_unique_applications",
+    "failed_login_rate",
+    "mfa_failure_rate",
+    "ueba_alerts_per_session",
+    "hops_per_escalation",
+    "geo_velocity_composite",
+    "composite_anomaly_score",
+    "peak_privilege_level_accessed_admin_domain",
+    "peak_privilege_level_accessed_admin_local",
+    "peak_privilege_level_accessed_global_admin",
+    "peak_privilege_level_accessed_power_user",
+    "peak_privilege_level_accessed_service_account",
+    "peak_privilege_level_accessed_standard_user"
+  ],
+  "numeric_features": [
+    "total_login_attempts",
+    "successful_logins",
+    "failed_logins",
+    "mfa_failures",
+    "impossible_travel_events",
+    "lateral_hop_count",
+    "privilege_escalations",
+    "account_lockout_count",
+    "geo_dispersion_score",
+    "login_velocity_score",
+    "session_anomaly_rate",
+    "ueba_alert_count",
+    "overall_identity_risk_score",
+    "insider_threat_indicator_score",
+    "avg_session_duration_seconds",
+    "avg_mfa_response_latency_ms",
+    "avg_geo_anomaly_score",
+    "max_geo_anomaly_score",
+    "frac_impossible_travel",
+    "n_unique_countries",
+    "n_unique_devices",
+    "n_unique_applications",
+    "failed_login_rate",
+    "mfa_failure_rate",
+    "ueba_alerts_per_session",
+    "hops_per_escalation",
+    "geo_velocity_composite",
+    "composite_anomaly_score"
+  ],
+  "categorical_levels": {
+    "peak_privilege_level_accessed": [
+      "admin_domain",
+      "admin_local",
+      "global_admin",
+      "power_user",
+      "service_account",
+      "standard_user"
+    ]
+  },
+  "label_to_int": {
+    "low": 0,
+    "medium": 1,
+    "high": 2
+  },
+  "int_to_label": {
+    "0": "low",
+    "1": "medium",
+    "2": "high"
+  },
+  "user_leaky_excluded": [
+    "threat_actor_flag",
+    "account_takeover_flag",
+    "credential_attack_victim_flag"
+  ]
+}

feature_scaler.json ADDED Viewed

	@@ -0,0 +1 @@

+ {"mean": [47.618705035971225, 21.762589928057555, 3.237410071942446, 0.7410071942446043, 2.4244604316546763, 0.1079136690647482, 0.03597122302158273, 0.8848920863309353, 0.07807338129496402, 0.09575827338129496, 0.12949640287769784, 0.07913669064748201, 0.14728705035971226, 0.05653093525179856, 2652.875971223021, 4019.576460431655, 0.07807044604316546, 0.5725964028776979, 0.09697841726618706, 2.683453237410072, 25.0, 1.0, 0.06872126318882517, 0.027102828021972523, 0.045563549160671464, 0.05935251798561151, 0.03040965165467626, 0.1019089928057554, 0.3669064748201439, 0.10071942446043165, 0.460431654676259, 0.0, 0.02158273381294964, 0.050359712230215826], "std": [106.04424805776597, 6.126188419238651, 6.126188419238651, 0.9505004745510551, 2.425643250309251, 0.873860728150148, 0.3491204671960097, 3.6891520110635208, 0.15358191487335077, 0.15428793799740126, 0.24504753676954605, 0.6816938430652185, 0.11701980746154912, 0.025106938959211542, 728.8513772007428, 2612.6588768587844, 0.15358241474479778, 0.37022309870273346, 0.09702573001237004, 1.3353389149294461, 1.0, 1.0, 0.08962918241935523, 0.03507100441435125, 0.39153983731053577, 0.470682835200424, 0.10288120999068875, 0.05297033872352956, 0.48370377945817283, 0.30204529914550604, 0.5002345399237736, 1.0, 0.14584217692177975, 0.21947701365872982]}
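
The `mean` and `std` arrays each hold 34 values, one per entry of `feature_names` in `feature_meta.json`. Several `std` entries are exactly 1.0 where the matching mean looks constant (for example `n_unique_devices` at 25.0). That pattern is consistent with zero-variance columns having their std replaced by 1.0 at fit time, but the training script is not part of this diff, so treat that as an assumption. A sketch of applying such a scaler defensively follows; `standardize` is a hypothetical helper, not a function shipped in the repo.

```python
def standardize(x, mean, std):
    """Apply (x - mean) / std element-wise, guarding against zero std."""
    if not (len(x) == len(mean) == len(std)):
        raise ValueError(
            f"length mismatch: x={len(x)}, mean={len(mean)}, std={len(std)}"
        )
    # Assumption: a zero std marks a constant column, so divide by 1.0 instead.
    return [(xi - m) / (s if s != 0.0 else 1.0) for xi, m, s in zip(x, mean, std)]


# Toy three-feature example (illustrative values, not from the scaler file).
scaled = standardize([2.0, 25.0, 1.0], mean=[1.0, 25.0, 0.0], std=[2.0, 0.0, 0.5])
```

The length check is the more important half in practice: a feature vector built from a different `feature_meta.json` would otherwise be scaled silently against the wrong columns.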

inference_example.ipynb ADDED Viewed

	@@ -0,0 +1,322 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# CYB006 Baseline Classifier — Inference Example\n",
+    "\n",
+    "End-to-end demo: load the trained XGBoost and PyTorch MLP models from the Hugging Face repo and predict the **user risk tier** (`low` / `medium` / `high`) of an identity from per-user aggregates joined with non-leaky session aggregates.\n",
+    "\n",
+    "**This is a baseline reference model**, not a production identity-security platform. See the model card for full metrics and limitations — and importantly, see the **`leakage_diagnostic.json`** for why this baseline targets `user_risk_tier` rather than the README's stated headline use case of threat-actor tier attribution."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Install dependencies"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%pip install --quiet xgboost torch safetensors pandas numpy huggingface_hub"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Download model artifacts from Hugging Face"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from huggingface_hub import hf_hub_download\n",
+    "\n",
+    "REPO_ID = \"xpertsystems/cyb006-baseline-classifier\"\n",
+    "\n",
+    "files = {}\n",
+    "for name in [\"model_xgb.json\", \"model_mlp.safetensors\",\n",
+    "             \"feature_engineering.py\", \"feature_meta.json\",\n",
+    "             \"feature_scaler.json\"]:\n",
+    "    files[name] = hf_hub_download(repo_id=REPO_ID, filename=name)\n",
+    "    print(f\"  downloaded: {name}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys, os\n",
+    "fe_dir = os.path.dirname(files[\"feature_engineering.py\"])\n",
+    "if fe_dir not in sys.path:\n",
+    "    sys.path.insert(0, fe_dir)\n",
+    "\n",
+    "from feature_engineering import (\n",
+    "    transform_single, load_meta, INT_TO_LABEL,\n",
+    "    compute_session_aggregates_for_user\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Load models and metadata"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "import numpy as np\n",
+    "import torch\n",
+    "import torch.nn as nn\n",
+    "import xgboost as xgb\n",
+    "from safetensors.torch import load_file\n",
+    "\n",
+    "meta = load_meta(files[\"feature_meta.json\"])\n",
+    "with open(files[\"feature_scaler.json\"]) as f:\n",
+    "    scaler = json.load(f)\n",
+    "\n",
+    "N_FEATURES = len(meta[\"feature_names\"])\n",
+    "N_CLASSES = len(meta[\"int_to_label\"])\n",
+    "print(f\"feature count: {N_FEATURES}\")\n",
+    "print(f\"class count:   {N_CLASSES}\")\n",
+    "print(f\"label classes: {list(meta['int_to_label'].values())}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "xgb_model = xgb.XGBClassifier()\n",
+    "xgb_model.load_model(files[\"model_xgb.json\"])\n",
+    "\n",
+    "# MLP architecture (must match training)\n",
+    "class RiskTierMLP(nn.Module):\n",
+    "    def __init__(self, n_features, n_classes=3, hidden1=128, hidden2=64, dropout=0.3):\n",
+    "        super().__init__()\n",
+    "        self.net = nn.Sequential(\n",
+    "            nn.Linear(n_features, hidden1),\n",
+    "            nn.BatchNorm1d(hidden1),\n",
+    "            nn.ReLU(),\n",
+    "            nn.Dropout(dropout),\n",
+    "            nn.Linear(hidden1, hidden2),\n",
+    "            nn.BatchNorm1d(hidden2),\n",
+    "            nn.ReLU(),\n",
+    "            nn.Dropout(dropout),\n",
+    "            nn.Linear(hidden2, n_classes),\n",
+    "        )\n",
+    "    def forward(self, x):\n",
+    "        return self.net(x)\n",
+    "\n",
+    "mlp_model = RiskTierMLP(N_FEATURES, n_classes=N_CLASSES)\n",
+    "mlp_model.load_state_dict(load_file(files[\"model_mlp.safetensors\"]))\n",
+    "mlp_model.eval()\n",
+    "print(\"models loaded\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Prediction helper"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "MU = np.array(scaler[\"mean\"], dtype=np.float32)\n",
+    "SD = np.array(scaler[\"std\"],  dtype=np.float32)\n",
+    "\n",
+    "def predict_risk_tier(user_record: dict) -> dict:\n",
+    "    \"\"\"Predict the user risk tier from a per-user record.\n",
+    "\n",
+    "    The record should contain per-user aggregates (from user_risk_summary)\n",
+    "    PLUS the session aggregates produced by compute_session_aggregates_for_user.\n",
+    "    See the example record below.\n",
+    "    \"\"\"\n",
+    "    X = transform_single(user_record, meta)\n",
+    "\n",
+    "    xgb_proba = xgb_model.predict_proba(X)[0]\n",
+    "    xgb_label = INT_TO_LABEL[int(np.argmax(xgb_proba))]\n",
+    "\n",
+    "    Xs = ((X - MU) / SD).astype(np.float32)\n",
+    "    with torch.no_grad():\n",
+    "        logits = mlp_model(torch.tensor(Xs))\n",
+    "        mlp_proba = torch.softmax(logits, dim=1).numpy()[0]\n",
+    "    mlp_label = INT_TO_LABEL[int(np.argmax(mlp_proba))]\n",
+    "\n",
+    "    return {\n",
+    "        \"xgboost\": {\n",
+    "            \"label\": xgb_label,\n",
+    "            \"probabilities\": {INT_TO_LABEL[i]: float(p) for i, p in enumerate(xgb_proba)},\n",
+    "        },\n",
+    "        \"mlp\": {\n",
+    "            \"label\": mlp_label,\n",
+    "            \"probabilities\": {INT_TO_LABEL[i]: float(p) for i, p in enumerate(mlp_proba)},\n",
+    "        },\n",
+    "    }"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Run on an example record\n",
+    "\n",
+    "Real high-risk user from the sample dataset: 98 login attempts in window, 25 failures, 9 account lockouts, 9 impossible-travel events, 6 unique countries, peak privilege `admin_domain`. Both models should predict `high`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Real per-user record from the sample dataset (true tier: high)\n",
+    "example_record = {\n",
+    "    # Per-user aggregates (from user_risk_summary.csv)\n",
+    "    \"total_login_attempts\": 98,\n",
+    "    \"successful_logins\": 0,\n",
+    "    \"failed_logins\": 25,\n",
+    "    \"mfa_failures\": 0,\n",
+    "    \"impossible_travel_events\": 9,\n",
+    "    \"lateral_hop_count\": 1,\n",
+    "    \"privilege_escalations\": 1,\n",
+    "    \"account_lockout_count\": 9,\n",
+    "    \"geo_dispersion_score\": 0.6474,\n",
+    "    \"login_velocity_score\": 0.6387,\n",
+    "    \"session_anomaly_rate\": 1.0,\n",
+    "    \"ueba_alert_count\": 0,\n",
+    "    \"overall_identity_risk_score\": 0.3452,\n",
+    "    \"peak_privilege_level_accessed\": \"admin_domain\",\n",
+    "    \"insider_threat_indicator_score\": 0.0,\n",
+    "    # Session aggregates (computed via compute_session_aggregates_for_user)\n",
+    "    \"avg_session_duration_seconds\": 352.24,\n",
+    "    \"avg_mfa_response_latency_ms\": 26.67,\n",
+    "    \"avg_geo_anomaly_score\": 0.6474,\n",
+    "    \"max_geo_anomaly_score\": 1.0,\n",
+    "    \"frac_impossible_travel\": 0.36,\n",
+    "    \"n_unique_countries\": 6,\n",
+    "    \"n_unique_devices\": 25,\n",
+    "    \"n_unique_applications\": 1,\n",
+    "}\n",
+    "\n",
+    "result = predict_risk_tier(example_record)\n",
+    "\n",
+    "print(f\"XGBoost  ->  {result['xgboost']['label']}\")\n",
+    "for lbl, p in sorted(result['xgboost']['probabilities'].items(), key=lambda x: -x[1]):\n",
+    "    print(f\"    P({lbl:8s}) = {p:.4f}\")\n",
+    "\n",
+    "print(f\"\\nMLP      ->  {result['mlp']['label']}\")\n",
+    "for lbl, p in sorted(result['mlp']['probabilities'].items(), key=lambda x: -x[1]):\n",
+    "    print(f\"    P({lbl:8s}) = {p:.4f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 6. Batch prediction on the sample dataset\n",
+    "\n",
+    "Score every user in `user_risk_summary.csv` after joining their session aggregates from `login_sessions.csv`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from huggingface_hub import snapshot_download\n",
+    "import pandas as pd\n",
+    "\n",
+    "ds_path = snapshot_download(repo_id=\"xpertsystems/cyb006-sample\", repo_type=\"dataset\")\n",
+    "users = pd.read_csv(f\"{ds_path}/user_risk_summary.csv\")\n",
+    "sessions = pd.read_csv(f\"{ds_path}/login_sessions.csv\")\n",
+    "\n",
+    "preds = []\n",
+    "for _, row in users.head(50).iterrows():\n",
+    "    user_sessions = sessions[sessions[\"user_id\"] == row[\"user_id\"]]\n",
+    "    if len(user_sessions) == 0:\n",
+    "        continue\n",
+    "    rec = row.to_dict()\n",
+    "    rec.update(compute_session_aggregates_for_user(user_sessions))\n",
+    "    pred = predict_risk_tier(rec)\n",
+    "    preds.append({\n",
+    "        \"user_id\": row[\"user_id\"],\n",
+    "        \"true_tier\": row[\"user_risk_tier\"],\n",
+    "        \"xgb_pred\": pred[\"xgboost\"][\"label\"],\n",
+    "    })\n",
+    "\n",
+    "results = pd.DataFrame(preds)\n",
+    "ct = pd.crosstab(results[\"true_tier\"], results[\"xgb_pred\"],\n",
+    "                 rownames=[\"true\"], colnames=[\"pred\"])\n",
+    "print(\"Confusion on first 50 users (XGBoost):\")\n",
+    "print(ct)\n",
+    "acc = (results[\"true_tier\"] == results[\"xgb_pred\"]).mean()\n",
+    "print(f\"\\nbatch accuracy on first 50 users (in-distribution): {acc:.4f}\")\n",
+    "print(\"\\nNote: this includes training-set users. See validation_results.json\\n\"\n",
+    "      \"for proper held-out test metrics.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 7. Important: the leakage diagnostic\n",
+    "\n",
+    "Before using CYB006 sample data to train a threat-actor detector, read **`leakage_diagnostic.json`** in this repo. The README's stated headline use case (4-class threat-actor tier attribution) is not a representative ML task on the sample dataset — the synthetic generator produces threat-actor sessions with non-overlapping anomaly score distributions, so a plain XGBoost achieves 100% accuracy that doesn't reflect any real learning. The diagnostic documents which feature groups carry the leakage and what we recommend to dataset authors.\n",
+    "\n",
+    "This baseline ships `user_risk_tier` prediction instead, which has overlapping per-tier distributions and lifts ~10pp over majority baseline."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 8. Next steps\n",
+    "\n",
+    "- See `validation_results.json` for held-out test metrics (30 disjoint users).\n",
+    "- See `multi_seed_results.json` for the across-10-seeds picture (accuracy 0.700 ± 0.082, ROC-AUC 0.812 ± 0.048).\n",
+    "- See `ablation_results.json` for per-feature-group contribution. User aggregate counts (failed logins, lateral hops, etc.) carry the most signal.\n",
+    "- See **`leakage_diagnostic.json`** for the detailed audit on threat-actor detection.\n",
+    "- For the full ~1.1M-row CYB006 dataset and commercial licensing, contact **pradeep@xpertsystems.ai**."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.10"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
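
Section 7 of the notebook compares the model against a majority baseline. As a reference for what that number means, here is a small sketch that computes it from a list of true labels. `majority_baseline_accuracy` is a hypothetical helper, and the label counts are made up for illustration rather than taken from the sample dataset.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    if not labels:
        raise ValueError("labels must be non-empty")
    _, top_count = Counter(labels).most_common(1)[0]
    return top_count / len(labels)


# Illustrative label list (hypothetical counts).
toy_labels = ["low"] * 6 + ["medium"] * 3 + ["high"] * 1
baseline = majority_baseline_accuracy(toy_labels)

# Lift of a hypothetical 0.70-accuracy model, in percentage points.
lift_pp = (0.70 - baseline) * 100
```

On a test split as small as 30 users the baseline itself moves by a few points from seed to seed, so it is worth recomputing it per split instead of reusing a single figure.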

leakage_diagnostic.json ADDED Viewed

	@@ -0,0 +1,145 @@

+{
+  "purpose": "Document why threat_actor_capability_tier (the README's stated headline use case) was NOT shipped as the primary baseline. Every oracle feature group is independently sufficient for 100% test accuracy on threat-actor binary detection; even with all 6 groups dropped, accuracy stays >97%. This is a structural property of the sample's generator (non-overlapping anomaly distributions between threat and legitimate sessions), not a methodology failure. Real-world identity telemetry has substantial overlap; this sample dataset does not reproduce it.",
+  "target": "threat_actor_capability_tier != 'none' (binary)",
+  "split": "GroupShuffleSplit by user_id, 70/15/15 nested",
+  "non_overlapping_distributions": {
+    "velocity_anomaly_score": {
+      "actor_range": [
+        0.5213,
+        0.8181
+      ],
+      "non_actor_range": [
+        0.0,
+        0.2469
+      ],
+      "actor_mean": 0.651,
+      "non_actor_mean": 0.053
+    },
+    "session_timestamp_utc": {
+      "actor_range": [
+        6417,
+        1440062
+      ],
+      "non_actor_range": [
+        1445187,
+        18000137
+      ],
+      "note": "Actor sessions and non-actor sessions occupy disjoint time windows"
+    },
+    "credential_attempt_count": {
+      "actor_range": [
+        1,
+        59
+      ],
+      "non_actor_range": [
+        1,
+        2
+      ],
+      "actor_mean": 12.9,
+      "non_actor_mean": 1.07
+    },
+    "login_outcome": {
+      "actor_only_values": [
+        "failure_account_locked",
+        "account_takeover_confirmed",
+        "session_hijacked",
+        "success_anomalous"
+      ],
+      "non_actor_only_values": [
+        "success_normal"
+      ],
+      "note": "success_normal is 4306 non-actor / 0 actor rows; failure_account_locked is 0 non-actor / 186 actor rows."
+    }
+  },
+  "ablation_experiments": [
+    {
+      "config": "full features (all oracles intact)",
+      "n_features": 166,
+      "accuracy": 1.0,
+      "roc_auc": 1.0
+    },
+    {
+      "config": "cumulative drop through behavioural_oracles",
+      "dropped_so_far": [
+        "velocity_anomaly_score",
+        "credential_attempt_count",
+        "session_timestamp_utc"
+      ],
+      "n_features": 163,
+      "accuracy": 0.9991111111111112,
+      "roc_auc": 1.0
+    },
+    {
+      "config": "cumulative drop through outcome_oracle",
+      "dropped_so_far": [
+        "velocity_anomaly_score",
+        "credential_attempt_count",
+        "session_timestamp_utc",
+        "login_outcome"
+      ],
+      "n_features": 154,
+      "accuracy": 0.9982222222222222,
+      "roc_auc": 0.9999714285714285
+    },
+    {
+      "config": "cumulative drop through geo_oracle",
+      "dropped_so_far": [
+        "velocity_anomaly_score",
+        "credential_attempt_count",
+        "session_timestamp_utc",
+        "login_outcome",
+        "geo_country_code"
+      ],
+      "n_features": 138,
+      "accuracy": 0.9986666666666667,
+      "roc_auc": 0.9999619047619047
+    },
+    {
+      "config": "cumulative drop through device_oracle",
+      "dropped_so_far": [
+        "velocity_anomaly_score",
+        "credential_attempt_count",
+        "session_timestamp_utc",
+        "login_outcome",
+        "geo_country_code",
+        "device_trust_level"
+      ],
+      "n_features": 133,
+      "accuracy": 0.9982222222222222,
+      "roc_auc": 0.9999047619047619
+    },
+    {
+      "config": "cumulative drop through user_risk_oracle",
+      "dropped_so_far": [
+        "velocity_anomaly_score",
+        "credential_attempt_count",
+        "session_timestamp_utc",
+        "login_outcome",
+        "geo_country_code",
+        "device_trust_level",
+        "user_risk_tier"
+      ],
+      "n_features": 130,
+      "accuracy": 0.9977777777777778,
+      "roc_auc": 0.9996095238095238
+    },
+    {
+      "config": "cumulative drop through anomaly_signal",
+      "dropped_so_far": [
+        "velocity_anomaly_score",
+        "credential_attempt_count",
+        "session_timestamp_utc",
+        "login_outcome",
+        "geo_country_code",
+        "device_trust_level",
+        "user_risk_tier",
+        "geo_anomaly_score"
+      ],
+      "n_features": 129,
+      "accuracy": 0.9706666666666667,
+      "roc_auc": 0.9896857142857143
+    }
+  ],
+  "conclusion": "Even with all six oracle feature groups removed (37 features dropped, 166 down to 129), the residual feature set still yields 97% test accuracy and AUC 0.99 on threat-actor binary detection. The leakage is not localised \u2014 it is distributed across the entire feature space because the generator produces threat-actor sessions that are anomalous on every dimension simultaneously without overlap. A buyer planning to train a real detection model on this dataset should know that the sample's headline detection task is not a representative ML problem.",
+  "recommendation_to_dataset_author": "Increase distributional overlap between threat-actor and legitimate session populations across all anomaly indicators: velocity score, credential attempt count, geo anomaly score, geo country code frequency, device trust level, login outcome class. Real-world detection systems operate at AUC 0.7-0.9, not 1.0; the sample should reflect that operating regime."
+}

model_mlp.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:79e17347b967d6051de89f297028d1ad7097a1a48572b526e568affcf4ab22ed
+size 56020

model_xgb.json ADDED Viewed

The diff for this file is too large to render. See raw diff

multi_seed_results.json ADDED Viewed

	@@ -0,0 +1,98 @@

+{
+  "purpose": "Multi-seed evaluation across 10 stratified splits of the 200 user-level rows. With only n=30 test users, single-seed metrics are noisy; aggregating over 10 seeds gives a more reliable estimate.",
+  "seeds_evaluated": [
+    42,
+    7,
+    13,
+    17,
+    23,
+    31,
+    45,
+    99,
+    123,
+    200
+  ],
+  "per_seed": [
+    {
+      "seed": 42,
+      "test_n_classes": 3,
+      "accuracy": 0.6666666666666666,
+      "macro_f1": 0.6453546453546454,
+      "macro_roc_auc_ovr": 0.8016919142238835
+    },
+    {
+      "seed": 7,
+      "test_n_classes": 3,
+      "accuracy": 0.8666666666666667,
+      "macro_f1": 0.8139986139986141,
+      "macro_roc_auc_ovr": 0.877301738235242
+    },
+    {
+      "seed": 13,
+      "test_n_classes": 3,
+      "accuracy": 0.5333333333333333,
+      "macro_f1": 0.44536610343061955,
+      "macro_roc_auc_ovr": 0.737813083241472
+    },
+    {
+      "seed": 17,
+      "test_n_classes": 3,
+      "accuracy": 0.7333333333333333,
+      "macro_f1": 0.670995670995671,
+      "macro_roc_auc_ovr": 0.8726337896734316
+    },
+    {
+      "seed": 23,
+      "test_n_classes": 3,
+      "accuracy": 0.7,
+      "macro_f1": 0.6267942583732058,
+      "macro_roc_auc_ovr": 0.7978373158999758
+    },
+    {
+      "seed": 31,
+      "test_n_classes": 3,
+      "accuracy": 0.7666666666666667,
+      "macro_f1": 0.7068160597572363,
+      "macro_roc_auc_ovr": 0.8585702861598001
+    },
+    {
+      "seed": 45,
+      "test_n_classes": 3,
+      "accuracy": 0.6666666666666666,
+      "macro_f1": 0.6306595365418894,
+      "macro_roc_auc_ovr": 0.8429286802048951
+    },
+    {
+      "seed": 99,
+      "test_n_classes": 3,
+      "accuracy": 0.7333333333333333,
+      "macro_f1": 0.6844116844116844,
+      "macro_roc_auc_ovr": 0.7860817961521286
+    },
+    {
+      "seed": 123,
+      "test_n_classes": 3,
+      "accuracy": 0.6666666666666666,
+      "macro_f1": 0.6138888888888889,
+      "macro_roc_auc_ovr": 0.8116214620370631
+    },
+    {
+      "seed": 200,
+      "test_n_classes": 3,
+      "accuracy": 0.6666666666666666,
+      "macro_f1": 0.5367965367965367,
+      "macro_roc_auc_ovr": 0.738158799380027
+    }
+  ],
+  "aggregate": {
+    "accuracy_mean": 0.7,
+    "accuracy_std": 0.08164965809277261,
+    "accuracy_min": 0.5333333333333333,
+    "accuracy_max": 0.8666666666666667,
+    "macro_f1_mean": 0.6375081998548991,
+    "macro_f1_std": 0.09333613924888397,
+    "roc_auc_mean": 0.8124638865207918,
+    "roc_auc_std": 0.047957223370412666
+  },
+  "published_artifact_seed": 42
+}
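
The `aggregate` block above can be cross-checked from the `per_seed` entries. The following is a minimal sketch, not the author's evaluation script: it assumes each per-seed accuracy is a count of correct predictions out of 30 test users (the integer counts are inferred from the reported decimals), and it suggests that `accuracy_std` is a population standard deviation (ddof=0) rather than a sample one.

```python
from statistics import fmean, pstdev

# Per-seed test accuracies from multi_seed_results.json, expressed as
# (inferred) correct-prediction counts out of 30 test users, in seed order
# 42, 7, 13, 17, 23, 31, 45, 99, 123, 200.
correct_counts = [20, 26, 16, 22, 21, 23, 20, 22, 20, 20]
accuracies = [c / 30 for c in correct_counts]

accuracy_mean = fmean(accuracies)
# Population standard deviation (ddof=0); this matches the reported
# accuracy_std. A sample standard deviation (ddof=1) would be larger.
accuracy_std = pstdev(accuracies)

print(f"mean={accuracy_mean:.4f} std={accuracy_std:.4f} "
      f"min={min(accuracies):.4f} max={max(accuracies):.4f}")
```

With only 10 seeds, the ddof choice matters: reporting which convention was used makes the "± 0.082" figure in the model card easier to interpret.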

validation_results.json ADDED Viewed

	@@ -0,0 +1,126 @@

+{
+  "version": "1.0.0",
+  "dataset": "xpertsystems/cyb006-sample",
+  "task": "3-class user_risk_tier classification",
+  "baselines": {
+    "always_predict_majority_accuracy": 0.5666666666666667,
+    "majority_class": "low",
+    "random_guess_accuracy": 0.3333333333333333
+  },
+  "split": {
+    "strategy": "stratified (StratifiedShuffleSplit, nested 70/15/15)",
+    "rationale": "This is a USER-LEVEL task (one row per user, 200 users total). Group-aware splitting does not apply since there is no many-rows-per-group structure to leak. Stratified splitting ensures each fold preserves the 3-tier class distribution.",
+    "users_train": 139,
+    "users_val": 31,
+    "users_test": 30,
+    "seed": 42
+  },
+  "n_features": 34,
+  "label_classes": [
+    "low",
+    "medium",
+    "high"
+  ],
+  "class_distribution_train": {
+    "low": 79,
+    "medium": 33,
+    "high": 27
+  },
+  "class_distribution_test": {
+    "low": 17,
+    "medium": 7,
+    "high": 6
+  },
+  "leakage_excluded_features": [
+    "threat_actor_flag (perfect oracle for high tier)",
+    "account_takeover_flag (2/200 positives, oracle-prone)",
+    "credential_attack_victim_flag (1/200 positives)",
+    "velocity_anomaly_score (per-session, leaky for threat detection; aggregates of leaky per-session features are excluded from the session-aggregate fields)",
+    "session_timestamp_utc (per-session, leaky)",
+    "credential_attempt_count (per-session, leaky)",
+    "login_outcome (per-session, leaky)"
+  ],
+  "leakage_audit_note": "See leakage_diagnostic.json for the full audit on the abandoned threat-actor binary detection task. Features dropped from session aggregation reflect that audit.",
+  "models": {
+    "xgboost": {
+      "architecture": "Gradient-boosted decision trees, multi:softprob, 3 classes",
+      "framework": "xgboost",
+      "test_metrics": {
+        "model": "xgboost",
+        "accuracy": 0.6666666666666666,
+        "macro_f1": 0.6453546453546454,
+        "weighted_f1": 0.6634032634032633,
+        "per_class_f1": {
+          "low": 0.7272727272727273,
+          "medium": 0.2857142857142857,
+          "high": 0.9230769230769231
+        },
+        "confusion_matrix": {
+          "labels": [
+            "low",
+            "medium",
+            "high"
+          ],
+          "matrix": [
+            [
+              12,
+              5,
+              0
+            ],
+            [
+              4,
+              2,
+              1
+            ],
+            [
+              0,
+              0,
+              6
+            ]
+          ]
+        },
+        "macro_roc_auc_ovr": 0.8016919142238835
+      }
+    },
+    "mlp": {
+      "architecture": "PyTorch MLP, 34 -> 128 -> 64 -> 3, BatchNorm1d + ReLU + Dropout, weighted cross-entropy loss",
+      "framework": "pytorch",
+      "test_metrics": {
+        "model": "mlp",
+        "accuracy": 0.6,
+        "macro_f1": 0.5914438502673797,
+        "weighted_f1": 0.6054545454545455,
+        "per_class_f1": {
+          "low": 0.6470588235294118,
+          "medium": 0.4,
+          "high": 0.7272727272727273
+        },
+        "confusion_matrix": {
+          "labels": [
+            "low",
+            "medium",
+            "high"
+          ],
+          "matrix": [
+            [
+              11,
+              5,
+              1
+            ],
+            [
+              4,
+              3,
+              0
+            ],
+            [
+              2,
+              0,
+              4
+            ]
+          ]
+        },
+        "macro_roc_auc_ovr": 0.6973752247089843
+      }
+    }
+  }
+}
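
The reported `test_metrics` can be rederived from the confusion matrices alone. Below is a hedged sketch, assuming rows are true classes and columns are predicted classes (the row sums 17/7/6 match `class_distribution_test`, which supports that reading). The helper name `metrics_from_confusion` is hypothetical and is not part of the repository.

```python
def metrics_from_confusion(matrix):
    """Compute accuracy and F1 summaries from a square confusion matrix.

    Assumes rows are true classes and columns are predicted classes.
    """
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    support = [sum(row) for row in matrix]  # true count per class
    predicted = [sum(matrix[r][c] for r in range(n_classes))
                 for c in range(n_classes)]  # predicted count per class

    per_class_f1 = []
    for k in range(n_classes):
        # F1 = 2*TP / (2*TP + FP + FN) = 2*TP / (support + predicted)
        denom = support[k] + predicted[k]
        per_class_f1.append(2 * matrix[k][k] / denom if denom else 0.0)

    return {
        "accuracy": sum(matrix[k][k] for k in range(n_classes)) / total,
        "macro_f1": sum(per_class_f1) / n_classes,
        "weighted_f1": sum(s * f for s, f in zip(support, per_class_f1)) / total,
        "per_class_f1": per_class_f1,
        "majority_baseline": max(support) / total,
    }


# Confusion matrices copied from validation_results.json (labels: low, medium, high).
xgb = metrics_from_confusion([[12, 5, 0], [4, 2, 1], [0, 0, 6]])
mlp = metrics_from_confusion([[11, 5, 1], [4, 3, 0], [2, 0, 4]])

print(xgb["accuracy"], xgb["macro_f1"], xgb["majority_baseline"])
print(mlp["accuracy"], mlp["macro_f1"])
```

Read this way, XGBoost's accuracy of 20/30 sits about 10 points above the 17/30 majority-class baseline, and most of its errors are confusions between `low` and `medium`.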