---
base_model: microsoft/Fara-7B
library_name: transformers
license: other
pipeline_tag: text-generation
tags:
  - abliteration
  - refusal-removal
  - uncensored
  - research
  - qwen2_5_vl
  - orthogonalization
---

# Fara-7B Abliterated v2

A refusal-direction-orthogonalized variant of `microsoft/Fara-7B` (Qwen2.5-VL based).

Built using:
- https://github.com/HOLYKEYZ/model-unfetter

## Method

Residual-stream activations were extracted across layers 0–27 for a harmful and a harmless probe set, and the layer with the strongest refusal direction was selected.

Best layer:
- 13
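The card does not say how the direction or the "strongest" layer was computed. As a minimal sketch, assuming the common difference-of-means estimator and a largest-norm selection criterion (both assumptions, as are the function names and the synthetic activations), the step could look like this:

```python
import numpy as np


def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of mean activations at one layer.

    Both inputs have shape (num_prompts, hidden_size).
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def best_layer(harmful_by_layer, harmless_by_layer):
    """Pick the layer whose mean difference has the largest norm (assumed criterion)."""
    scores = [
        np.linalg.norm(h.mean(axis=0) - b.mean(axis=0))
        for h, b in zip(harmful_by_layer, harmless_by_layer)
    ]
    return int(np.argmax(scores))


# Synthetic stand-in activations: 3 layers, 8 prompts each, hidden size 4.
rng = np.random.default_rng(0)
harmless = [rng.normal(size=(8, 4)) for _ in range(3)]
harmful = [x.copy() for x in harmless]
# Plant a strong offset at layer 1 so it becomes the selected layer.
harmful[1] = harmful[1] + np.array([5.0, 0.0, 0.0, 0.0])

layer = best_layer(harmful, harmless)
r = refusal_direction(harmful[layer], harmless[layer])
```

In a real run the per-layer arrays would come from hidden states captured on the probe prompts rather than from random data.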

Orthogonalization was applied in fp32 to:
- `embed_tokens`
- every `self_attn.o_proj`
- every `mlp.down_proj`

Total modified tensors:
- 57
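The count of 57 follows from the list above if the language tower has 28 decoder layers (layers 0–27): one embedding matrix plus two projections per layer. The tensor names below are abbreviated placeholders, not the exact keys in the checkpoint:

```python
NUM_LAYERS = 28  # layers 0-27, as stated in the Method section

# One embedding matrix, then o_proj and down_proj for every decoder layer.
targets = ["embed_tokens"]
for i in range(NUM_LAYERS):
    targets.append(f"layers.{i}.self_attn.o_proj")
    targets.append(f"layers.{i}.mlp.down_proj")

print(len(targets))  # 57
```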

Formula:

```text
W ← W - r rᵀ W
```
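A small NumPy sketch of this update, assuming `r` is normalized to unit length and `W` is stored as `(hidden_size, in_features)` so that its outputs land in the residual stream (the layout PyTorch uses for `nn.Linear` weights). The helper name and the random test matrices are illustrative, not taken from the model-unfetter code:

```python
import numpy as np


def orthogonalize_output(W, r):
    """Remove direction r from the output space of W.

    W has shape (hidden_size, in_features); r has shape (hidden_size,).
    The arithmetic is done in float32, mirroring the fp32 surgery noted in the card.
    """
    r = r / np.linalg.norm(r)          # the projection needs a unit vector
    W32 = W.astype(np.float32)
    return W32 - np.outer(r, r) @ W32  # W - r r^T W


rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4)).astype(np.float16)  # stand-in for a half-precision weight
r = rng.normal(size=6).astype(np.float32)

W_new = orthogonalize_output(W, r)
r_unit = r / np.linalg.norm(r)
# After the update, W_new can no longer write anything along r.
residual = float(np.abs(r_unit @ W_new).max())
```

For an embedding matrix stored as `(vocab_size, hidden_size)`, each row is a residual-stream vector, so the equivalent projection would presumably be applied from the right, `W - W r rᵀ`.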

## Results

Held-out harmful evaluation set:
- Original Fara-7B: 5/160 compliance (~3.1%)
- Abliterated v2: 158/160 compliance (98.75%)

Held-out refusal probe:
- Before: 155/160 refusals
- After: 2/160 refusals
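As a quick arithmetic check, the reported counts convert to percentages as follows. The variable names are illustrative, and the only inputs are the counts listed above:

```python
TOTAL = 160


def rate(count, total=TOTAL):
    """Percentage of prompts in a category."""
    return 100.0 * count / total


compliance_before = rate(5)    # 3.125
compliance_after = rate(158)   # 98.75
refusals_before = rate(155)    # 96.875
refusals_after = rate(2)       # 1.25
```

In both conditions the compliance and refusal counts sum to 160 (5 + 155 and 158 + 2).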

## Notes

- weight surgery performed in fp32 to avoid the precision issues seen in v1
- edits applied only to the language tower
- held-out evaluation set was separate from the layer-selection probe set

Research artifact only. Use responsibly and follow upstream Fara/Qwen license terms.