ree2raz commited on
Commit
3329f36
·
verified ·
1 Parent(s): ec12b47

Update eval with GGUF comparison; RCM 0.5814, MCQ 0.5921

Browse files
Files changed (1) hide show
  1. README.md +28 -18
README.md CHANGED
@@ -18,10 +18,6 @@ pipeline_tag: text-generation
18
 
19
  4-bit AWQ quantized version of [CyberSecQwen-4B](https://huggingface.co/lablab-ai-amd-developer-hackathon/CyberSecQwen-4B).
20
 
21
- ## Evaluation Infrastructure
22
-
23
- [GitHub repository](https://github.com/ree2raz/cyberSecQwen_4b_4bit) — Modal scripts for AWQ quantization + vLLM CTI-Bench evaluation.
24
-
25
  ## Quantization
26
 
27
  | Parameter | Value |
@@ -35,19 +31,20 @@ pipeline_tag: text-generation
35
 
36
  ## CTI-Bench Evaluation
37
 
38
- Evaluated under the [Foundation-Sec-8B protocol](https://arxiv.org/abs/2504.21039) (arXiv:2504.21039):
39
  - Temperature 0.3, max_tokens 512, concurrency 32
40
  - 5 independent trials, zero-shot (no system prompt)
41
  - vLLM v0.20.1 with awq_marlin kernel on Modal L4 GPU
42
 
43
- | Task | AWQ 4-bit | FP16 Reference | Delta |
44
  |---|---|---|---|
45
- | CTI-MCQ (2,500 items) | 0.5921 +/- 0.0083 | 0.5868 +/- 0.0029 | +0.0053 |
46
- | CTI-RCM (1,000 items) | 0.5814 +/- 0.0025 | 0.6664 +/- 0.0023 | -0.0850 |
47
 
48
  **Key findings:**
49
- - **CTI-MCQ**: AWQ 4-bit matches or slightly exceeds FP16 performance (+0.5 points). No measurable accuracy loss.
50
- - **CTI-RCM**: AWQ 4-bit degrades by 8.5 percentage points vs FP16. Parseable rate > 99.8% so answer extraction is working correctly. The model retains correct CWE identification in reasoning but sometimes diverges on final answers. This gap can likely be reduced with more calibration data.
 
51
 
52
  ## Trial results
53
 
@@ -62,13 +59,22 @@ Evaluated under the [Foundation-Sec-8B protocol](https://arxiv.org/abs/2504.2103
62
 
63
  ### CTI-RCM
64
  | Trial | Seed | Accuracy |
65
- |-------|------|----------|
66
  | 1 | 42 | 0.5790 |
67
  | 2 | 43 | 0.5830 |
68
  | 3 | 44 | 0.5790 |
69
  | 4 | 45 | 0.5840 |
70
  | 5 | 46 | 0.5820 |
71
 
 
 
 
 
 
 
 
 
 
72
  ## Usage with vLLM
73
 
74
  ```bash
@@ -85,11 +91,15 @@ vllm serve ree2raz/CyberSecQwen-4B-AWQ --quantization awq_marlin --dtype float16
85
  ## Citation
86
 
87
  ```bibtex
88
- @misc{cybersecqwen2026,
89
- title = {CyberSecQwen-4B: A Compact CTI Specialist Fine-Tuned from Qwen3-4B-Instruct-2507 on AMD MI300X},
90
- author = {Mulia, Samuel},
91
- year = {2026},
92
- publisher = {Hugging Face},
93
- url = {https://huggingface.co/athena129/CyberSecQwen-4B}
94
- }
95
  ```
 
 
 
 
 
18
 
19
  4-bit AWQ quantized version of [CyberSecQwen-4B](https://huggingface.co/lablab-ai-amd-developer-hackathon/CyberSecQwen-4B).
20
 
 
 
 
 
21
  ## Quantization
22
 
23
  | Parameter | Value |
 
31
 
32
  ## CTI-Bench Evaluation
33
 
34
+ Evaluated under the [Foundation-Sec-8B protocol](https://arxiv.org/abs/2504.21039):
35
  - Temperature 0.3, max_tokens 512, concurrency 32
36
  - 5 independent trials, zero-shot (no system prompt)
37
  - vLLM v0.20.1 with awq_marlin kernel on Modal L4 GPU
38
 
39
+ | Task | AWQ 4-bit | GGUF Q4_K_M | FP16 Reference |
40
  |---|---|---|---|
41
+ | CTI-MCQ (2,500 items) | **0.5921** ± 0.0083 | 0.5368 ± 0.0048 | 0.5868 ± 0.0029 |
42
+ | CTI-RCM (1,000 items) | 0.5814 ± 0.0025 | **0.6254 ± 0.0063** | 0.6664 ± 0.0023 |
43
 
44
  **Key findings:**
45
+ - **CTI-MCQ**: AWQ 4-bit matches or slightly exceeds FP16 performance (+0.5 points). Better than GGUF Q4_K_M.
46
+ - **CTI-RCM**: AWQ 4-bit degrades by 8.5 percentage points vs FP16. GGUF Q4_K_M does better on this task (-4.1 pts).
47
+ - AWQ is best for MCQ (general language), GGUF is best for RCM (task-specific classification).
48
 
49
  ## Trial results
50
 
 
59
 
60
  ### CTI-RCM
61
  | Trial | Seed | Accuracy |
62
+ |---|---|
63
  | 1 | 42 | 0.5790 |
64
  | 2 | 43 | 0.5830 |
65
  | 3 | 44 | 0.5790 |
66
  | 4 | 45 | 0.5840 |
67
  | 5 | 46 | 0.5820 |
68
 
69
+ ## Quantization variants
70
+
71
+ | Variant | CTI-MCQ | CTI-RCM | Size | Engine |
72
+ |---|---|---|---|---|
73
+ | [AWQ 4-bit](https://huggingface.co/ree2raz/CyberSecQwen-4B-AWQ) | 0.5921 | 0.5814 | 2.7 GB | vLLM |
74
+ | [GGUF Q4_K_M](https://huggingface.co/ree2raz/CyberSecQwen-4B-GGUF) | 0.5368 | 0.6254 | 2.5 GB | llama.cpp |
75
+
76
+ Choose AWQ for MCQ/general chat, GGUF for vulnerability classification.
77
+
78
  ## Usage with vLLM
79
 
80
  ```bash
 
91
  ## Citation
92
 
93
  ```bibtex
94
+ @misc{{cybersecqwen2026,
95
+ title = {{CyberSecQwen-4B: A Compact CTI Specialist Fine-Tuned from Qwen3-4B-Instruct-2507 on AMD MI300X}},
96
+ author = {{Mulia, Samuel}},
97
+ year = {{2026}},
98
+ publisher = {{Hugging Face}},
99
+ url = {{https://huggingface.co/athena129/CyberSecQwen-4B}}
100
+ }}
101
  ```
102
+
103
+ ## Evaluation Infrastructure
104
+
105
+ [GitHub repository](https://github.com/ree2raz/cyberSecQwen_4b_4bit) — Modal scripts for quantization + evaluation.