cassandra-anon/cassandra-bce-annoctr

@@ -25,11 +25,10 @@ This is the **headline configuration on AnnoCTR** in the paper. The asymmetric-l
 On the **AnnoCTR** test set (33 scored documents):
-- **3-seed ensemble per-document F1 (τ=0.5): 63.31%**
-- Paper reports 63.53% on the same configuration; the 0.22 F1 difference is within stochastic seed variance.
-- Exceeds CySecBERT's 62.75% reported in Buchel et al. (2025), without using CySecBERT's additional cybersecurity pre-training corpus.
-Full per-seed and ensemble metrics are in [`results.json`](./results.json).
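The ensemble figure above combines three seeds at a fixed threshold τ=0.5 and is scored per document. The following is a minimal sketch of one way such a number could be computed. It assumes the ensemble averages per-label probabilities across seeds, that sentence-level predictions are unioned into a per-document technique set, and that the final score is the mean of per-document F1. None of these details are confirmed by this README, and the function names (`ensemble_probs`, `per_document_f1`) are hypothetical, not taken from `inference_example.py`.

```python
from statistics import mean


def ensemble_probs(per_seed_probs):
    """Average per-label probabilities across seeds (assumed ensembling rule).

    per_seed_probs[s][i][j] = probability of label j for sentence i under seed s.
    """
    n_seeds = len(per_seed_probs)
    return [
        [sum(vals) / n_seeds for vals in zip(*sentence_rows)]
        for sentence_rows in zip(*per_seed_probs)
    ]


def predict_labels(probs, tau=0.5):
    """Return the set of label indices whose probability reaches tau."""
    return {j for j, p in enumerate(probs) if p >= tau}


def document_f1(pred, gold):
    """F1 between a predicted and a gold label set for one document."""
    if not pred and not gold:
        return 1.0  # assumption: an empty prediction on an empty gold set is a match
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def per_document_f1(sentence_probs, doc_ids, gold_by_doc, tau=0.5):
    """Union sentence predictions per document, then average document F1."""
    pred_by_doc = {d: set() for d in gold_by_doc}
    for probs, d in zip(sentence_probs, doc_ids):
        pred_by_doc.setdefault(d, set()).update(predict_labels(probs, tau))
    return mean(document_f1(pred_by_doc[d], gold_by_doc[d]) for d in gold_by_doc)


# Toy data: 2 seeds, 2 sentences (one per document), 3 labels.
seed_a = [[0.9, 0.2, 0.1], [0.4, 0.7, 0.1]]
seed_b = [[0.7, 0.4, 0.3], [0.2, 0.5, 0.6]]
avg = ensemble_probs([seed_a, seed_b])
score = per_document_f1(avg, ["d1", "d2"], {"d1": {0}, "d2": {1, 2}})
```

Averaging probabilities before thresholding, rather than voting on per-seed decisions, keeps a single τ meaningful for the ensemble. Whether the released artifact does this is not stated here.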
 ## Architecture
@@ -57,7 +56,7 @@ Map free-text CTI sentences to ATT&CK techniques. The model takes a single sente
 **Limitations:**
 - Trained on English-language CTI; behavior on other languages is not characterized.
 - The 118-label vocabulary is the canonical AnnoCTR set; sentences describing techniques outside this set will produce all-zero predictions.
-- AnnoCTR's extreme sparsity (78 of 113 train-present classes have <10 positives) means rare-class predictions are noisier than common-class predictions. Per-class threshold tuning (provided as an option in `inference_example.py`) does not consistently help for these ultra-rare classes — see paper §6.2.
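To illustrate why per-class threshold tuning is fragile on rare classes, here is a hedged sketch of a simple grid search over thresholds on a validation split. It is not the implementation in `inference_example.py`. The function name `tune_thresholds`, the grid, and the `min_positives` fallback to the default τ are all assumptions made for illustration.

```python
def binary_f1(pred, gold):
    """F1 for one label over a list of boolean predictions and 0/1 gold values."""
    tp = sum(1 for p, g in zip(pred, gold) if p and g)
    fp = sum(1 for p, g in zip(pred, gold) if p and not g)
    fn = sum(1 for p, g in zip(pred, gold) if not p and g)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tune_thresholds(val_probs, val_gold,
                    grid=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9),
                    min_positives=10, default_tau=0.5):
    """Pick one threshold per label by maximizing validation F1.

    Labels with fewer than `min_positives` validation positives keep
    `default_tau` (hypothetical safeguard): with so few examples the
    "best" threshold mostly reflects noise.
    """
    n_labels = len(val_probs[0])
    thresholds = []
    for j in range(n_labels):
        gold_j = [row[j] for row in val_gold]
        if sum(gold_j) < min_positives:
            thresholds.append(default_tau)
            continue
        probs_j = [row[j] for row in val_probs]
        # max() returns the first grid value that attains the best F1.
        best = max(grid, key=lambda t: binary_f1([p >= t for p in probs_j], gold_j))
        thresholds.append(best)
    return thresholds


# Toy validation split: 4 sentences, 2 labels; label 1 has a single positive.
val_probs = [[0.35, 0.6], [0.4, 0.1], [0.2, 0.2], [0.1, 0.3]]
val_gold = [[1, 1], [1, 0], [0, 0], [0, 0]]
thresholds = tune_thresholds(val_probs, val_gold, min_positives=2)
```

In this toy case the first label gets a tuned threshold while the second, with one positive, falls back to the default. With only a handful of positives per class, a tuned threshold can swing widely between validation splits.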
 ## How to load and run
@@ -91,13 +90,13 @@ python inference_example.py
 | 42  | 59.82% | EMA |
 | 123 | 61.29% | EMA |
 | 456 | 63.57% | EMA |
-| **3-seed ensemble** | **63.31%** | — |
 ## Citation
 ```bibtex
 @inproceedings{cassandra2026,
-  title  = {CASSANDRA: Why Training Recipe Matters More Than Model Size for ATT&CK Classification},
   author = {Anonymous},
   booktitle = {Proceedings of the 2026 ACM SIGSAC Conference on Computer and Communications Security (CCS)},
   year   = {2026},

 On the **AnnoCTR** test set (33 scored documents):
+- **3-seed ensemble per-document F1 (τ=0.5): 63.53%**
+- Exceeds CySecBERT's 62.75% (Buchel et al. 2025) without CySecBERT's additional 4.3M cybersecurity pre-training texts.
+The per-seed table below shows the live artifact's individual seed F1s and ensemble F1; small variance from the headline (≤0.3 F1) reflects inference-time floating-point ordering on different hardware. Full per-seed and ensemble metrics are in [`results.json`](./results.json).
 ## Architecture
 **Limitations:**
 - Trained on English-language CTI; behavior on other languages is not characterized.
 - The 118-label vocabulary is the canonical AnnoCTR set; sentences describing techniques outside this set will produce all-zero predictions.
+- AnnoCTR's extreme sparsity (78 of 113 train-present techniques have fewer than 10 positives) means rare-technique predictions are noisier than common-technique predictions. Per-technique threshold tuning (provided as an option in `inference_example.py`) does not consistently help for these ultra-rare techniques — see paper §3.1 (per-technique thresholding excluded from the recommended configuration).
 ## How to load and run
 | 42  | 59.82% | EMA |
 | 123 | 61.29% | EMA |
 | 456 | 63.57% | EMA |
+| **3-seed ensemble** | **63.53%** | — |
 ## Citation
 ```bibtex
 @inproceedings{cassandra2026,
+  title  = {CASSANDRA: How Many Parameters Suffice to Automate TTP Extractions from CTI Reports---Pushing Towards the Lower Bound},
   author = {Anonymous},
   booktitle = {Proceedings of the 2026 ACM SIGSAC Conference on Computer and Communications Security (CCS)},
   year   = {2026},