---
license: apache-2.0
library_name: pytorch
tags:
  - cti
  - attack-classification
  - mitre-attack
  - cybersecurity
  - text-classification
  - multi-label-classification
language:
  - en
base_model: ibm-research/CTI-BERT
---

# CASSANDRA — BCE configuration on AnnoCTR

Fine-tuned CTI-BERT models for extracting MITRE ATT&CK techniques from cyber threat intelligence (CTI) reports. This repository contains the **BCE configuration** of the CASSANDRA recipe trained on **AnnoCTR** (118 ATT&CK techniques), comprising **3 ensemble members** trained with seeds {42, 123, 456}.

This is the **headline configuration on AnnoCTR** in the paper. The asymmetric-loss (ASL) variant regresses on this benchmark because AnnoCTR has low label density (a mean of 15.5 positive examples per train-present class); BCE with `pos_weight` handles the rare classes more robustly.

> Anonymous artifact for double-blind peer review. Author information will be added after the review period.

## Headline result

On the **AnnoCTR** test set (33 scored documents):

- **3-seed ensemble per-document F1 (τ=0.5): 63.53%**
- Exceeds CySecBERT's 62.75% (Buchel et al. 2025) without CySecBERT's additional 4.3M cybersecurity pre-training texts.

The per-seed table below shows the individual seed F1s and the ensemble F1 of the live artifact. Deviations from the headline are small (≤0.3 F1 points) and reflect inference-time floating-point ordering on different hardware. Full per-seed and ensemble metrics are in [`results.json`](./results.json).
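
As a rough illustration of the metric, the sketch below computes an F1 score for each document from its predicted and gold technique sets and averages the scores over documents. Averaging per-document scores is an assumption about how "per-document F1" is defined here; the function names and technique IDs are illustrative and not taken from the repository. Documents with an empty gold set are skipped, mirroring the excluded test report described under Training data.

```python
def document_f1(predicted: set, gold: set) -> float:
    """F1 between the predicted and gold technique sets of one document."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def mean_document_f1(predictions: list, golds: list) -> float:
    """Average document_f1 over documents that have a non-empty gold set."""
    scores = [document_f1(p, g) for p, g in zip(predictions, golds) if g]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical technique IDs, for illustration only.
predictions = [{"T1059", "T1547"}, {"T1566"}, {"T1003"}]
golds = [{"T1059"}, {"T1566"}, set()]  # the third document is skipped
score = mean_document_f1(predictions, golds)
```

The authoritative numbers remain those in `results.json`; this sketch only makes the shape of the computation concrete.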

## Architecture

`LabelAttentionClassifier`: a 110M-parameter CTI-BERT encoder followed by a per-label attention head.

- Encoder: [`ibm-research/CTI-BERT`](https://huggingface.co/ibm-research/CTI-BERT) (110M params, 768 hidden)
- Head: 118 learned 768-dim label queries that attend over the encoder's `last_hidden_state`, followed by a shared 1-output linear layer applied per-label
- Loss: BCE with `pos_weight=5.0`
- Regularization / training tricks: layer-wise learning rate decay (α=0.85), exponential moving average (β=0.999), multi-seed probability averaging at inference
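
The loss and the two training-time tricks in the list above can be sketched without any framework. This is a minimal illustration, not the repository's training code: the function names, the base learning rate of `2e-5`, the 12-layer encoder depth, and the convention that the top layer receives the base rate are all assumptions. Only `pos_weight=5.0`, α=0.85, and β=0.999 come from this card.

```python
import math


def weighted_bce(logit: float, target: int, pos_weight: float = 5.0) -> float:
    """Binary cross-entropy on one logit, with the positive term scaled by pos_weight.

    Written for clarity, not numerical stability on extreme logits.
    """
    prob = 1.0 / (1.0 + math.exp(-logit))
    return -(pos_weight * target * math.log(prob) + (1 - target) * math.log(1.0 - prob))


def layerwise_lrs(base_lr: float, num_layers: int, decay: float = 0.85) -> list:
    """Learning rate per encoder layer, bottom to top (assumed: top layer gets base_lr)."""
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


def ema_update(shadow: list, weights: list, beta: float = 0.999) -> list:
    """One exponential-moving-average step over a flat list of weights."""
    return [beta * s + (1.0 - beta) * w for s, w in zip(shadow, weights)]


loss_pos = weighted_bce(0.0, 1)  # a missed positive costs 5x a missed negative
loss_neg = weighted_bce(0.0, 0)
lrs = layerwise_lrs(base_lr=2e-5, num_layers=12)  # hypothetical base rate and depth
shadow = ema_update([0.0], [1.0])
```

With `pos_weight=5.0`, an uncertain prediction on a positive example is penalized five times as heavily as the same prediction on a negative example, which is how the loss compensates for rare labels.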

The architecture is custom (not derived from `transformers.PreTrainedModel`), so loading requires the [`modeling.py`](./modeling.py) file shipped with this repo.
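
For orientation, here is a framework-free sketch of what a per-label attention head of this kind computes. It is not the code in `modeling.py`: the scaled dot-product scoring, the absence of key/value projections and padding masks, and all function names are assumptions made for illustration, and the toy sizes stand in for the 768 dimensions and 118 labels stated above.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_attention_head(hidden_states, label_queries, out_weight, out_bias):
    """Return one probability per label.

    hidden_states: list of token vectors (the encoder's last_hidden_state)
    label_queries: one learned query vector per label
    out_weight, out_bias: the shared 1-output linear layer
    """
    dim = len(out_weight)
    probs = []
    for query in label_queries:
        # Attention of this label over all tokens (the scaling is an assumption).
        weights = softmax([dot(query, h) / math.sqrt(dim) for h in hidden_states])
        # Label-specific summary: attention-weighted sum of token vectors.
        pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
        # The same linear layer scores every label's summary.
        logit = dot(out_weight, pooled) + out_bias
        probs.append(1.0 / (1.0 + math.exp(-logit)))
    return probs


rng = random.Random(0)
dim, num_tokens, num_labels = 8, 5, 3  # toy sizes for illustration
hidden = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_labels)]
weight = [rng.gauss(0, 1) for _ in range(dim)]
probs = label_attention_head(hidden, queries, weight, 0.0)
```

Each label reads the sentence through its own query, so two techniques can focus on different tokens of the same sentence while sharing one output layer.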

## Training data

- **AnnoCTR**: 104 reports, 5,265 sentences, 118 canonical ATT&CK techniques (113 train-present, 5 unobserved at training but present in test). Mean of 15.5 deduplicated positive examples per train-present class. 78 of 113 train-present classes have fewer than 10 positive examples.
- **Splits**: report-level train/test split from Buchel et al. (2025) "SoK: A Survey of Approaches for ATT&CK Classifier Construction" (70 train reports, 34 test reports — one test report excluded from per-document F1 due to empty in-vocabulary ground truth).
- **Validation**: 80:20 sentence-level random split within the training reports for early stopping and threshold selection.
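
A sentence-level random split like the one described for validation could look like the sketch below. The function name and the shuffle seed are hypothetical; the card does not say which seed or procedure produced the actual split.

```python
import random


def split_sentences(sentences: list, val_fraction: float = 0.2, seed: int = 42):
    """Shuffle sentences and hold out val_fraction of them for validation."""
    shuffled = list(sentences)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    num_val = round(len(shuffled) * val_fraction)
    return shuffled[num_val:], shuffled[:num_val]


train, val = split_sentences([f"sentence {i}" for i in range(100)])
```

Because the split is made at sentence level, sentences from the same training report can land on both sides of it; the test split stays report-level.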

## Intended use

Map free-text CTI sentences to ATT&CK techniques. The model takes a single sentence and outputs a probability for each of 118 techniques.

**Aggregation to document level (paper convention):** run inference on each sentence, take the per-class maximum probability across the sentences of a document, apply the threshold to those maxima, and report the resulting set of techniques for the document.
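
The aggregation convention can be sketched as follows. The function name, the three-label subset, and the probabilities are made up for illustration, and treating a probability equal to the threshold as a positive is an assumption.

```python
def aggregate_document(sentence_probs: list, labels: list, threshold: float = 0.5) -> set:
    """Per-class max over a document's sentences, then threshold."""
    doc_probs = [max(column) for column in zip(*sentence_probs)]
    return {label for label, p in zip(labels, doc_probs) if p >= threshold}


labels = ["T1059", "T1547", "T1566"]  # illustrative subset of the 118 labels
sentence_probs = [
    [0.91, 0.10, 0.05],  # sentence 1
    [0.20, 0.62, 0.07],  # sentence 2
]
predicted = aggregate_document(sentence_probs, labels)
```

Here each of the two sentences contributes one technique, and the third label stays below the threshold in every sentence, so it is not reported.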

**Limitations:**
- Trained on English-language CTI; behavior on other languages is not characterized.
- The 118-label vocabulary is the canonical AnnoCTR set; the model cannot predict techniques outside this set, so sentences describing such techniques should yield all-zero predictions.
- AnnoCTR's extreme sparsity (78 of 113 train-present techniques have fewer than 10 positives) means rare-technique predictions are noisier than common-technique predictions. Per-technique threshold tuning (provided as an option in `inference_example.py`) does not consistently help for these ultra-rare techniques — see paper §3.1 (per-technique thresholding excluded from the recommended configuration).

## How to load and run

```python
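import glob
import os

import torch

from modeling import load_ensemble, predict_ensemble

# Locate the per-seed checkpoint directories shipped next to this script.
repo_dir = os.path.dirname(os.path.abspath(__file__))
seed_dirs = sorted(glob.glob(os.path.join(repo_dir, "seeds", "seed-*")))

# Fall back to CPU when no GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
seeds = load_ensemble(seed_dirs, device=device)

sentences = [
    "The malware uses Windows Command Shell to execute encoded scripts.",
    "After initial access, persistence was established via Registry Run Keys.",
]
# Ensemble prediction at the paper's threshold (τ=0.5).
results = predict_ensemble(seeds, sentences, threshold=0.5)
for sentence, techniques in results:
    print(sentence, "->", techniques)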
```

A complete CLI example is in [`inference_example.py`](./inference_example.py):

```bash
pip install -r requirements.txt
python inference_example.py
```

## Per-seed members

| Seed | Per-document F1 (τ=0.5) | Selected weights |
|---|---|---|
| 42  | 59.82% | EMA |
| 123 | 61.29% | EMA |
| 456 | 63.57% | EMA |
| **3-seed ensemble** | **63.53%** | — |
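
The ensemble row comes from multi-seed probability averaging. A minimal sketch of that step, with invented probabilities and a hypothetical function name (the shipped `predict_ensemble` may differ in its details):

```python
def average_probabilities(per_seed_probs: list) -> list:
    """Element-wise mean of per-seed probability vectors."""
    num_seeds = len(per_seed_probs)
    return [sum(values) / num_seeds for values in zip(*per_seed_probs)]


# Invented probabilities for three labels on one sentence.
per_seed = [
    [0.70, 0.40, 0.10],  # seed 42
    [0.55, 0.45, 0.20],  # seed 123
    [0.60, 0.58, 0.00],  # seed 456
]
mean_probs = average_probabilities(per_seed)
decisions = [p >= 0.5 for p in mean_probs]
```

In this toy case seed 456 alone would predict the second label, but the averaged probability falls below 0.5. Averaging smooths out single-seed decisions, which is also why an ensemble need not beat its best member, as the table shows for seed 456.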

## Citation

```bibtex
@misc{cassandra2026,
  title  = {CASSANDRA: How Many Parameters Suffice to Automate TTP Extractions from CTI Reports---Pushing Towards the Lower Bound},
  author = {{Anonymous Authors}},
  year   = {2026},
  note   = {Anonymous submission under review}
}
```

Please also cite the AnnoCTR dataset and the CTI-BERT encoder.

## License

Apache-2.0. These fine-tuned weights are derived from [`ibm-research/CTI-BERT`](https://huggingface.co/ibm-research/CTI-BERT).