	---
	license: apache-2.0
	library_name: pytorch
	tags:
	- cti
	- attack-classification
	- mitre-attack
	- cybersecurity
	- text-classification
	- multi-label-classification
	- asymmetric-loss
	language:
	- en
	base_model: ibm-research/CTI-BERT
	---

	# CASSANDRA — ASL configuration on AnnoCTR (regression case)

	Fine-tuned CTI-BERT models for extracting MITRE ATT&CK techniques from cyber threat intelligence (CTI) reports. This repository contains the ASL configuration of the CASSANDRA recipe trained on AnnoCTR (118 ATT&CK techniques), comprising 3 ensemble members trained with seeds {42, 123, 456}.

	> Note: Unlike on TRAM2, the ASL configuration underperforms the BCE configuration on AnnoCTR. This is the regression case from the paper's label-density transfer analysis (§3.2, RQ4 results). For deployment on AnnoCTR-like sparse benchmarks, use the BCE configuration: [`cassandra-bce-annoctr`](https://huggingface.co/cassandra-anon/cassandra-bce-annoctr).

	> Anonymous artifact for double-blind peer review. Author information will be added after the review period.

	## Headline result

	On the AnnoCTR test set (33 scored documents):

	- 3-seed ensemble per-document F1 (τ=0.5): 60.17% (a 3.36-point regression vs the BCE configuration)
	- 3-seed ensemble per-document F1 (dev-tuned τ=0.69): 61.27% (a 2.26-point regression)
	- BCE configuration on the same benchmark: 63.53% ([`cassandra-bce-annoctr`](https://huggingface.co/cassandra-anon/cassandra-bce-annoctr)) — preferred for deployment
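
The regression figures in the list above are plain differences between the BCE score and each ASL score. A minimal check, using only the numbers reported in this card (the variable names are illustrative):

```python
# Per-document F1 scores (percent) as reported in this model card.
bce_f1 = 63.53
asl_f1_default = 60.17  # ASL ensemble at tau = 0.5
asl_f1_tuned = 61.27    # ASL ensemble at dev-tuned tau = 0.69

# Regression in percentage points relative to the BCE configuration.
regression_default = round(bce_f1 - asl_f1_default, 2)
regression_tuned = round(bce_f1 - asl_f1_tuned, 2)

print(regression_default, regression_tuned)
```

Threshold tuning recovers about 1.1 points, but the ASL ensemble stays below BCE at both thresholds.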

	This regression is the label-density transfer effect analyzed in the paper (§3.2, RQ4 results): on AnnoCTR's 118-technique long tail, with a mean of 15.5 samples per train-present technique, ASL's aggressive easy-negative suppression also starves genuinely rare positive techniques of training signal.

	Full per-seed and ensemble metrics are in [`results.json`](./results.json).

	## Why include this configuration?

	The CASSANDRA paper's central finding is that the same training recipe transfers across benchmarks only when label density is sufficient. ASL helps on TRAM2 (mean ~82 samples/technique) and hurts on AnnoCTR (mean 15.5/technique). Releasing the AnnoCTR ASL weights makes this regression directly verifiable rather than reported-only.

	If you want a deployable AnnoCTR classifier, use the BCE configuration linked above.

	## Architecture

	`LabelAttentionClassifier` with asymmetric loss training:

	- Encoder: [`ibm-research/CTI-BERT`](https://huggingface.co/ibm-research/CTI-BERT) (110M params, 768 hidden)
	- Head: 118 learned 768-dim label queries that attend over the encoder's `last_hidden_state`, followed by a shared 1-output linear layer applied per-label
	- Loss: Asymmetric Loss (Ridnik et al. 2021) with γ_neg=4, γ_pos=0, clip=0.05
	- Regularization / training tricks: layer-wise learning rate decay (α=0.85), exponential moving average (β=0.999), stochastic weight averaging (last 25% of epochs), per-seed best-of-{base, EMA, SWA} selection on validation macro-F1, multi-seed probability averaging at inference
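
To make the loss settings above concrete, here is a minimal pure-Python sketch of the per-label asymmetric loss as formulated by Ridnik et al. (2021), using this card's γ_neg=4, γ_pos=0, clip=0.05. It is an illustration only: the function name `asl_term` and the `eps` guard are hypothetical, and the training code in this repo may differ in details such as reduction and numerical clamping.

```python
import math

def asl_term(p, y, gamma_neg=4.0, gamma_pos=0.0, clip=0.05, eps=1e-8):
    """Asymmetric loss for one label with predicted probability p and target y in {0, 1}."""
    if y == 1:
        # Positive term: focal-style weighting; with gamma_pos=0 this is plain -log(p).
        return -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
    # Negative term: shift the probability down by the clip margin and floor it at 0,
    # so negatives with p <= clip contribute zero loss.
    p_m = max(p - clip, 0.0)
    return -(p_m ** gamma_neg) * math.log(max(1.0 - p_m, eps))

easy_negative = asl_term(0.04, 0)     # below the clip margin
hard_negative = asl_term(0.50, 0)     # down-weighted by p_m ** gamma_neg
positive = asl_term(0.50, 1)          # identical to BCE when gamma_pos=0
bce_negative = -math.log(1.0 - 0.50)  # BCE loss for the same negative, for comparison

print(easy_negative, hard_negative, positive, bce_negative)
```

In this toy case the negative at p=0.5 costs more than an order of magnitude less than it would under BCE, while the positive term is unchanged. That asymmetry is the easy-negative suppression this card refers to.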

	The architecture is custom (not derived from `transformers.PreTrainedModel`), so loading requires the [`modeling.py`](./modeling.py) file shipped with this repo.
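
For orientation, the head described above can be sketched in plain Python on toy dimensions. This assumes single-head scaled dot-product attention with no padding mask, which the card does not specify; `modeling.py` remains the authoritative implementation, and every name below is illustrative.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def label_attention_logits(hidden, queries, w, b):
    """hidden: T token vectors of size d; queries: one learned vector per label;
    w, b: a single linear layer shared by all labels. Returns one logit per label."""
    d = len(hidden[0])
    logits = []
    for q in queries:
        # Each label query scores every token, then pools the tokens by attention weight.
        scores = [sum(qi * hi for qi, hi in zip(q, h)) / math.sqrt(d) for h in hidden]
        attn = softmax(scores)
        pooled = [sum(a * h[j] for a, h in zip(attn, hidden)) for j in range(d)]
        # The same 1-output linear layer is applied to every label's pooled vector.
        logits.append(sum(wj * pj for wj, pj in zip(w, pooled)) + b)
    return logits

# Toy setup: 2 tokens, hidden size 2, 2 labels (the real head uses 768 dims and 118 labels).
hidden = [[1.0, 0.0], [0.0, 1.0]]
queries = [[0.0, 0.0], [100.0, 0.0]]
logits = label_attention_logits(hidden, queries, w=[1.0, -1.0], b=0.25)
probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
print(logits, probs)
```

The zero query attends uniformly and pools the mean token vector, while the second query locks onto the first token. Each label therefore reads a different summary of the same encoder output before the shared linear layer scores it.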

	## Training data

	- AnnoCTR: 104 reports, 5,265 sentences, 118 canonical ATT&CK techniques (113 train-present, 5 unobserved at training but present in test). Mean of 15.5 deduplicated positive examples per train-present technique. 78 of 113 train-present techniques have fewer than 10 positive examples.
	- Splits: report-level train/test split from Buchel et al. (2025) (70 train reports, 34 test reports — one test report excluded from per-document F1 due to empty in-vocabulary ground truth).
	- Validation: 80:20 sentence-level random split within the training reports for early stopping and threshold selection.
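
The exclusion noted in the Splits item can be illustrated with a sketch of per-document F1. The definition used here is an assumption: set-based F1 between the predicted and gold technique sets of each document, averaged over documents with non-empty gold. The paper's exact scoring may differ, and the document and technique IDs are placeholders.

```python
def per_document_f1(predicted, gold):
    """predicted, gold: dicts mapping document id -> set of technique ids.
    Returns (mean F1 over scored documents, number of scored documents)."""
    scores = []
    for doc_id, gold_set in gold.items():
        if not gold_set:
            # No in-vocabulary ground truth: the document is not scored.
            continue
        pred_set = predicted.get(doc_id, set())
        tp = len(pred_set & gold_set)
        precision = tp / len(pred_set) if pred_set else 0.0
        recall = tp / len(gold_set)
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        scores.append(f1)
    mean_f1 = sum(scores) / len(scores) if scores else 0.0
    return mean_f1, len(scores)

gold = {"doc1": {"T1"}, "doc2": {"T3"}, "doc3": set()}
predicted = {"doc1": {"T1", "T2"}, "doc2": {"T3"}, "doc3": {"T4"}}
mean_f1, n_scored = per_document_f1(predicted, gold)
print(mean_f1, n_scored)
```

Here `doc3` is skipped because its gold set is empty, so only two documents are scored. The same rule is how 34 test reports become the 33 scored documents reported in this card.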

	## Intended use

	Primarily as a reproducibility artifact for the paper's ASL-on-AnnoCTR regression analysis. For practical AnnoCTR deployment, prefer the BCE configuration.

	Limitations:
	- ASL's easy-negative suppression is mistuned for AnnoCTR's sparsity; rare-technique predictions are noisier than under BCE training.
	- The 118-label vocabulary is the canonical AnnoCTR set; the model cannot emit techniques outside it, so a sentence describing only out-of-vocabulary techniques should yield no labels rather than a correct one.
	- Trained on English-language CTI.

	## How to load and run

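	```python
	import glob
	import os

	from modeling import load_ensemble, predict_ensemble  # modeling.py ships with this repo

	# Locate the per-seed checkpoint directories (seeds/seed-*) next to this script.
	# __file__ is only defined when running as a script, not in a notebook or REPL.
	seed_dirs = sorted(glob.glob(os.path.join(os.path.dirname(__file__), "seeds", "seed-*")))
	seeds = load_ensemble(seed_dirs, device="cuda")

	sentences = [
	    "The malware uses Windows Command Shell to execute encoded scripts.",
	    "After initial access, persistence was established via Registry Run Keys.",
	]
	# Run all ensemble members and apply the 0.5 decision threshold.
	results = predict_ensemble(seeds, sentences, threshold=0.5)
	for sentence, techniques in results:
	    print(sentence, "->", techniques)
	```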

	A complete CLI example is in [`inference_example.py`](./inference_example.py).
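
Conceptually, the multi-seed probability averaging listed under Architecture, followed by thresholding, looks like the sketch below. This is not the code of `predict_ensemble`; the helper name, the label names, the probabilities, and the `>=` comparison are assumptions made for illustration.

```python
def average_and_threshold(member_probs, labels, threshold=0.5):
    """member_probs: one list of per-label probabilities per ensemble member.
    Returns (labels kept at the threshold, averaged probabilities)."""
    n = len(member_probs)
    # Average each label's probability across the ensemble members.
    mean_probs = [sum(column) / n for column in zip(*member_probs)]
    # Keep labels whose averaged probability reaches the threshold.
    kept = [label for label, p in zip(labels, mean_probs) if p >= threshold]
    return kept, mean_probs

labels = ["technique_a", "technique_b", "technique_c"]
member_probs = [
    [0.90, 0.60, 0.10],  # hypothetical output of the seed-42 member
    [0.80, 0.40, 0.20],  # seed 123
    [0.70, 0.65, 0.05],  # seed 456
]
at_default, mean_probs = average_and_threshold(member_probs, labels, threshold=0.5)
at_tuned, _ = average_and_threshold(member_probs, labels, threshold=0.69)
print(at_default, at_tuned)
```

Raising the threshold from 0.5 to the dev-tuned 0.69 drops the borderline `technique_b`, trading some recall for precision.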

	## Per-seed members

	| Seed | Per-document F1 (τ=0.5) | Selected weights |
	|---|---|---|
	| 42 | 55.90% | base |
	| 123 | 58.80% | base |
	| 456 | 60.16% | base |
	| 3-seed ensemble (τ=0.5) | 60.17% | n/a |
	| 3-seed ensemble (dev-τ=0.69) | 61.27% | n/a |

	Notable: all three seeds selected `base` weights over EMA and SWA on validation macro-F1, consistent with ASL's regularization being unsuitable for this label-density regime.
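
The selection step referred to above can be sketched as follows, together with the EMA update that produces one of the candidates. The validation scores are invented for illustration and the function names are hypothetical; only β=0.999 and the base/EMA/SWA candidates come from this card.

```python
def ema_update(ema_weights, weights, beta=0.999):
    """One exponential-moving-average step over a flat list of parameters."""
    return [beta * e + (1.0 - beta) * w for e, w in zip(ema_weights, weights)]

def select_variant(val_macro_f1):
    """Pick the weight variant with the highest validation macro-F1."""
    return max(val_macro_f1, key=val_macro_f1.get)

ema = ema_update([1.0, -2.0], [0.0, 0.0])
chosen = select_variant({"base": 0.41, "ema": 0.38, "swa": 0.40})  # made-up scores
print(ema, chosen)
```

With β=0.999 each step moves the averaged weights only 0.1% of the way toward the current weights, so the EMA copy lags the base model by design.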

	## Citation

	```bibtex
	@misc{cassandra2026,
	title = {CASSANDRA: How Many Parameters Suffice to Automate TTP Extractions from CTI Reports---Pushing Towards the Lower Bound},
	author = {{Anonymous Authors}},
	year = {2026},
	note = {Anonymous submission under review}
	}
	```

	Please also cite the AnnoCTR dataset, the CTI-BERT encoder, and the asymmetric-loss work (Ridnik et al. 2021).

	## License

	Apache-2.0. These fine-tuned weights are derived from [`ibm-research/CTI-BERT`](https://huggingface.co/ibm-research/CTI-BERT).