| --- |
| license: apache-2.0 |
| library_name: pytorch |
| tags: |
| - cti |
| - attack-classification |
| - mitre-attack |
| - cybersecurity |
| - text-classification |
| - multi-label-classification |
| - asymmetric-loss |
| language: |
| - en |
| base_model: ibm-research/CTI-BERT |
| --- |
| |
| # CASSANDRA — ASL configuration on AnnoCTR (regression case) |
|
|
| Fine-tuned CTI-BERT models for extracting MITRE ATT&CK techniques from cyber threat intelligence (CTI) reports. This repository contains the **ASL configuration** of the CASSANDRA recipe trained on **AnnoCTR** (118 ATT&CK techniques), comprising **3 ensemble members** trained with seeds {42, 123, 456}. |
|
|
| > **Note:** Unlike on TRAM2, the ASL configuration *underperforms* the BCE configuration on AnnoCTR. This is the regression case from the paper's label-density transfer analysis (§3.2, RQ4 results). For deployment on AnnoCTR-like sparse benchmarks, use the BCE configuration: [`cassandra-bce-annoctr`](https://huggingface.co/cassandra-anon/cassandra-bce-annoctr). |
|
|
| > Anonymous artifact for double-blind peer review. Author information will be added after the review period. |
|
|
| ## Headline result |
|
|
| On the **AnnoCTR** test set (33 scored documents): |
|
|
| - **3-seed ensemble per-document F1 (τ=0.5): 60.17%** (a 3.36-point regression vs the BCE configuration) |
| - 3-seed ensemble per-document F1 (dev-tuned τ=0.69): 61.27% (a 2.26-point regression) |
| - BCE configuration on the same benchmark: **63.53%** ([`cassandra-bce-annoctr`](https://huggingface.co/cassandra-anon/cassandra-bce-annoctr)) — preferred for deployment |
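As an illustration of how a per-document score like the ones above could be computed, the sketch below assumes a set-based F1 per document, averaged over documents. It skips documents that have no in-vocabulary gold technique, mirroring the one excluded test report. The function names, the example technique IDs, and the averaging scheme are assumptions, not taken from the paper's evaluation code.

```python
from statistics import mean


def document_f1(predicted, gold):
    """Set-based F1 between predicted and gold technique IDs for one document."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def per_document_f1(predictions, references):
    """Mean document-level F1 over documents whose gold set is non-empty."""
    scores = [
        document_f1(pred, gold)
        for pred, gold in zip(predictions, references)
        if gold  # documents with empty gold labels are not scored
    ]
    return mean(scores)


# Hypothetical document-level label sets (illustrative IDs only).
predictions = [{"T1059", "T1547"}, {"T1566"}, {"T1027"}]
references = [{"T1059"}, {"T1566", "T1204"}, set()]
score = per_document_f1(predictions, references)
print(f"per-document F1: {score:.4f}")
```

In this toy case the third document is skipped because its gold set is empty, so the average runs over two documents only.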
|
|
| This regression is the label-density transfer effect analyzed in the paper (§3.2, RQ4 results): on AnnoCTR's 118-technique long tail, with a mean of 15.5 samples per train-present technique, ASL's aggressive down-weighting of easy negatives also starves genuinely rare positive techniques of training signal. |
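To make the easy-negative suppression concrete, here is a minimal per-label sketch of the asymmetric loss as formulated by Ridnik et al. (2021), using the hyperparameters listed under Architecture (γ_neg=4, γ_pos=0, clip=0.05). It is written in plain Python for readability and is not the training code from this repo; the function names are illustrative.

```python
import math


def asymmetric_loss(p, y, gamma_neg=4.0, gamma_pos=0.0, clip=0.05, eps=1e-8):
    """Asymmetric loss for one label with predicted probability p and target y."""
    if y == 1:
        # Positive term: focal-style weighting with gamma_pos (0 means plain log loss).
        return -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
    # Negative term: shift the probability down by `clip`, then apply focal weighting.
    p_shifted = max(p - clip, 0.0)
    return -(p_shifted ** gamma_neg) * math.log(max(1.0 - p_shifted, eps))


def bce_loss(p, y, eps=1e-8):
    """Binary cross-entropy for one label, for comparison."""
    return -math.log(max(p, eps)) if y == 1 else -math.log(max(1.0 - p, eps))


for p in (0.04, 0.3, 0.9):
    print(f"negative at p={p}: ASL={asymmetric_loss(p, 0):.5f}  BCE={bce_loss(p, 0):.5f}")
```

With these settings a negative predicted at p=0.04 contributes no loss at all, and one at p=0.3 contributes well under 1% of its BCE loss, while positives are weighted exactly as under BCE.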
|
|
| Full per-seed and ensemble metrics are in [`results.json`](./results.json). |
|
|
| ## Why include this configuration? |
|
|
| The CASSANDRA paper's central finding is that the same training recipe transfers across benchmarks **only when label density is sufficient**. ASL helps on TRAM2 (mean ~82 samples/technique) and hurts on AnnoCTR (mean 15.5/technique). Releasing the AnnoCTR ASL weights makes this regression directly verifiable rather than reported-only. |
|
|
| If you want a deployable AnnoCTR classifier, use the **BCE** configuration linked above. |
|
|
| ## Architecture |
|
|
| `LabelAttentionClassifier` with asymmetric loss training: |
|
|
| - Encoder: [`ibm-research/CTI-BERT`](https://huggingface.co/ibm-research/CTI-BERT) (110M params, 768 hidden) |
| - Head: 118 learned 768-dim label queries that attend over the encoder's `last_hidden_state`, followed by a shared 1-output linear layer applied per-label |
| - Loss: Asymmetric Loss (Ridnik et al. 2021) with γ_neg=4, γ_pos=0, clip=0.05 |
| - Regularization / training tricks: layer-wise learning rate decay (α=0.85), exponential moving average (β=0.999), stochastic weight averaging (last 25% of epochs), per-seed best-of-{base, EMA, SWA} selection on validation macro-F1, multi-seed probability averaging at inference |
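The layer-wise learning rate decay in the list above can be sketched as follows. This is a minimal illustration under stated assumptions: a 12-layer BERT-base-sized encoder, a hypothetical base learning rate of 2e-5 (the card does not state one), and the common convention that the head and the top encoder layer use the base rate while each lower layer is scaled by another factor of α. The group names are illustrative and may not match the parameter names in `modeling.py`.

```python
def layerwise_learning_rates(base_lr, num_layers=12, decay=0.85):
    """Map parameter-group names to learning rates under layer-wise decay."""
    rates = {"head": base_lr}
    for layer in range(num_layers):
        # The top layer (index num_layers - 1) keeps base_lr; lower layers decay.
        depth_from_top = num_layers - 1 - layer
        rates[f"encoder.layer.{layer}"] = base_lr * decay ** depth_from_top
    # Embeddings sit below the first encoder layer and get one more decay step.
    rates["embeddings"] = base_lr * decay ** num_layers
    return rates


rates = layerwise_learning_rates(base_lr=2e-5)  # 2e-5 is a hypothetical value
for name, lr in rates.items():
    print(f"{name:>18}: {lr:.2e}")
```

Each group could then be passed to an optimizer as its own parameter group with the corresponding `lr`.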
|
|
| The architecture is custom (not derived from `transformers.PreTrainedModel`), so loading requires the [`modeling.py`](./modeling.py) file shipped with this repo. |
|
|
| ## Training data |
|
|
| - **AnnoCTR**: 104 reports, 5,265 sentences, 118 canonical ATT&CK techniques (113 train-present, 5 unobserved at training but present in test). Mean of 15.5 deduplicated positive examples per train-present technique. 78 of 113 train-present techniques have fewer than 10 positive examples. |
| - **Splits**: report-level train/test split from Buchel et al. (2025) (70 train reports, 34 test reports — one test report excluded from per-document F1 due to empty in-vocabulary ground truth). |
| - **Validation**: 80:20 sentence-level random split within the training reports for early stopping and threshold selection. |
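Threshold selection on the validation split could look like the sketch below, which sweeps a grid of candidate τ values and keeps the one with the best validation score. The selection metric (micro-F1 here), the grid, and the toy probabilities are all assumptions for illustration; the card does not specify how the dev-tuned τ was chosen.

```python
def micro_f1(probs, labels, threshold):
    """Micro-averaged F1 over all (sentence, label) pairs at a given threshold."""
    tp = fp = fn = 0
    for row_probs, row_labels in zip(probs, labels):
        for p, y in zip(row_probs, row_labels):
            predicted = p >= threshold
            if predicted and y:
                tp += 1
            elif predicted:
                fp += 1
            elif y:
                fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def tune_threshold(probs, labels, grid=None):
    """Return the grid threshold with the highest micro-F1 (lowest one on ties)."""
    if grid is None:
        grid = [i / 100 for i in range(5, 96)]  # 0.05, 0.06, ..., 0.95
    return max(grid, key=lambda t: micro_f1(probs, labels, t))


# Toy validation data: 3 sentences x 3 labels.
dev_probs = [[0.9, 0.6, 0.1], [0.7, 0.2, 0.55], [0.3, 0.8, 0.65]]
dev_labels = [[1, 0, 0], [1, 0, 0], [0, 1, 0]]
tau = tune_threshold(dev_probs, dev_labels)
print(f"tuned threshold: {tau:.2f}, dev micro-F1: {micro_f1(dev_probs, dev_labels, tau):.3f}")
```

Because the threshold is tuned on sentences drawn from the training reports, it stays independent of the test reports.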
|
|
| ## Intended use |
|
|
| Primarily as a reproducibility artifact for the paper's ASL-on-AnnoCTR regression analysis. For practical AnnoCTR deployment, prefer the BCE configuration. |
|
|
| **Limitations:** |
| - ASL's easy-negative suppression is mistuned for AnnoCTR's sparsity; rare-technique predictions are noisier than under BCE training. |
| - The 118-label vocabulary is the canonical AnnoCTR set; sentences describing techniques outside this set cannot be labeled correctly and are expected to produce all-zero predictions. |
| - Trained on English-language CTI. |
|
|
| ## How to load and run |
|
|
| ```python |
| import glob |
| import os |
| |
| from modeling import load_ensemble, predict_ensemble  # modeling.py ships with this repo |
| |
| # Collect the per-seed checkpoint directories stored next to this script. |
| seed_dirs = sorted(glob.glob(os.path.join(os.path.dirname(__file__), "seeds", "seed-*"))) |
| seeds = load_ensemble(seed_dirs, device="cuda") |
| |
| sentences = [ |
|     "The malware uses Windows Command Shell to execute encoded scripts.", |
|     "After initial access, persistence was established via Registry Run Keys.", |
| ] |
| # Predict with the ensemble at the default threshold τ=0.5. |
| results = predict_ensemble(seeds, sentences, threshold=0.5) |
| for sentence, techniques in results: |
|     print(sentence, "->", techniques) |
| ``` |
|
|
| A complete CLI example is in [`inference_example.py`](./inference_example.py). |
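For intuition about the multi-seed probability averaging mentioned under Architecture, the sketch below averages per-label probabilities across ensemble members and then applies the threshold. It is a standalone illustration with made-up probabilities and an illustrative three-label vocabulary; it does not reproduce the internals of `predict_ensemble`, which are not shown in this card.

```python
def average_probabilities(member_probs):
    """Average probabilities over members.

    member_probs[m][i][k] is member m's probability for sentence i and label k.
    """
    num_members = len(member_probs)
    return [
        [sum(per_member) / num_members for per_member in zip(*rows)]
        for rows in zip(*member_probs)  # rows: one probability row per member
    ]


def decode(avg_probs, label_names, threshold=0.5):
    """Keep every label whose averaged probability reaches the threshold."""
    return [
        [name for name, p in zip(label_names, row) if p >= threshold]
        for row in avg_probs
    ]


# Illustrative label subset and made-up outputs from three members for one sentence.
label_names = ["T1059.003", "T1547.001", "T1566"]
member_probs = [
    [[0.90, 0.40, 0.10]],
    [[0.70, 0.60, 0.20]],
    [[0.80, 0.35, 0.00]],
]
avg_probs = average_probabilities(member_probs)
predicted = decode(avg_probs, label_names, threshold=0.5)
print(predicted)
```

Averaging probabilities before thresholding means a label that only one member scores above τ can still be rejected, as happens for the second label in this toy example.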
|
|
| ## Per-seed members |
|
|
| | Seed | Per-document F1 (τ=0.5) | Selected weights | |
| |---|---|---| |
| | 42 | 55.90% | base | |
| | 123 | 58.80% | base | |
| | 456 | 60.16% | base | |
| | **3-seed ensemble (τ=0.5)** | **60.17%** | — | |
| | **3-seed ensemble (dev-τ=0.69)** | **61.27%** | — | |
|
|
| Notably, all three seeds selected the `base` weights over EMA and SWA on validation macro-F1, so weight averaging brought no validation gain here; this is consistent with the ASL recipe's regularization being a poor fit for this label-density regime. |
|
|
| ## Citation |
|
|
| ```bibtex |
| @misc{cassandra2026, |
| title = {CASSANDRA: How Many Parameters Suffice to Automate TTP Extractions from CTI Reports---Pushing Towards the Lower Bound}, |
| author = {{Anonymous Authors}}, |
| year = {2026}, |
| note = {Anonymous submission under review} |
| } |
| ``` |
|
|
| Please also cite the AnnoCTR dataset, the CTI-BERT encoder, and the asymmetric-loss work (Ridnik et al. 2021). |
|
|
| ## License |
|
|
| Apache-2.0. These fine-tuned weights are derived from [`ibm-research/CTI-BERT`](https://huggingface.co/ibm-research/CTI-BERT). |
|
|