josephmayo committed on
Commit f4bcba3 · verified · 1 Parent(s): ace58cd

Update README.md

Files changed (1)
  1. README.md +22 -18
README.md CHANGED
@@ -1,14 +1,15 @@
 
1
  base_model: microsoft/Fara-7B
2
  library_name: transformers
3
  license: other
4
  pipeline_tag: text-generation
5
  tags:
6
- - abliteration
7
- - refusal-removal
8
- - uncensored
9
- - research
10
- - qwen2_5_vl
11
- - orthogonalization
12
  ---
13
 
14
  # Fara-7B Abliterated v2
@@ -23,7 +24,7 @@ Built using:
23
  Using harmful + harmless probe sets, residual-stream activations were extracted across layers 0–27 to identify the strongest refusal direction.
24
 
25
  Best layer:
26
- - Layer 13
27
 
28
  Orthogonalization was applied in fp32 to:
29
  - `embed_tokens`
@@ -34,22 +35,25 @@ Total modified tensors:
34
  - 57
35
 
36
  Formula:
 
37
  ```python
38
  W ← W - r rᵀ W
39
- Results
40
 
41
- Held-out harmful evaluation set:
42
 
43
- Original Fara-7B: 5/160 compliance (~3.1%)
44
- Abliterated v2: 158/160 compliance (~98.75%)
 
45
 
46
  Held-out refusal probe:
 
 
 
 
47
 
48
- Before: 160/160 refusals
49
- After: 2/160 refusals
50
- Notes
51
- fp32 surgery used to avoid the precision issues from v1
52
- edits applied only to the language tower
53
- held-out evaluation set was separate from the layer-selection probe set
54
 
55
- Research artifact only. Use responsibly and follow the upstream Fara/Qwen license terms.
 
1
+ ---
2
  base_model: microsoft/Fara-7B
3
  library_name: transformers
4
  license: other
5
  pipeline_tag: text-generation
6
  tags:
7
+ - abliteration
8
+ - refusal-removal
9
+ - uncensored
10
+ - research
11
+ - qwen2_5_vl
12
+ - orthogonalization
13
  ---
14
 
15
  # Fara-7B Abliterated v2
 
24
  Using harmful + harmless probe sets, residual-stream activations were extracted across layers 0–27 to identify the strongest refusal direction.
25
 
26
  Best layer:
27
+ - 13
28
 
29
  Orthogonalization was applied in fp32 to:
30
  - `embed_tokens`
 
35
  - 57
36
 
37
  Formula:
38
+
39
  ```python
40
  W ← W - r rᵀ W
41
+ ```
42
 
43
+ ## Results
44
 
45
+ Held-out harmful evaluation set:
46
+ - Original Fara-7B: 5/160 compliance (~3.1%)
47
+ - Abliterated v2: 158/160 compliance (~98.75%)
48
 
49
  Held-out refusal probe:
50
+ - Before: 160/160 refusals
51
+ - After: 2/160 refusals
52
+
53
+ ## Notes
54
 
55
+ - fp32 surgery used to avoid precision issues from v1
56
+ - edits applied only to the language tower
57
+ - held-out evaluation set was separate from the layer-selection probe set
 
 
 
58
 
59
+ Research artifact only. Use responsibly and follow upstream Fara/Qwen license terms.
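
For readers of this diff, the README's formula `W ← W - r rᵀ W` can be illustrated with a minimal NumPy sketch. It is not the author's code. It assumes `r` is normalized to unit length and that `W` has shape `(d_model, d_in)`, so that its columns are written into the residual stream; the function name `orthogonalize` and the toy dimensions are hypothetical. The cast to float32 mirrors the README's note that the edit was applied in fp32.

```python
import numpy as np


def orthogonalize(W: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Return W - r r^T W in float32, with r normalized to unit length.

    Assumption: W has shape (d_model, d_in) and r has shape (d_model,).
    """
    W32 = W.astype(np.float32)
    r32 = r.astype(np.float32)
    r32 = r32 / np.linalg.norm(r32)
    # r (r^T W): avoids building the full d_model x d_model projector r r^T.
    return W32 - np.outer(r32, r32 @ W32)


# Toy data standing in for a real weight matrix and refusal direction.
rng = np.random.default_rng(0)
d_model, d_in = 8, 5
W = rng.standard_normal((d_model, d_in))
r = rng.standard_normal(d_model)

W_new = orthogonalize(W, r)

# After the edit, the output of W_new has (numerically) no component along r.
r_hat = r / np.linalg.norm(r)
residual = float(np.abs(r_hat @ W_new).max())
```

Applying the function a second time leaves the matrix essentially unchanged, since the component along `r` has already been removed. For tensors stored with the residual dimension last, such as an embedding table of shape `(vocab, d_model)`, the equivalent projection acts from the right, `W - (W r) rᵀ`; the README does not say how the original edit handled each tensor layout.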