## Introduction

This repository provides the weight files required for computing sample-level **SQSD** scores based on **Qwen3-8B**, as used in the paper **"From Parameter Dynamics to Risk Scoring: Quantifying Sample-Level Safety Degradation in LLM Fine-tuning"**.

Two types of weights are needed to compute **SQSD**:

- **Parameter shift direction weights (Direction)**: These encode safety-relevant directions in the model's parameter space and are used to measure how individual fine-tuning samples affect model safety.
- **Model initialization weights (initial-state)**: These serve as the starting point for SQSD computation. **Note**: they are only required when computing **Danger-Projection**. For details, please refer to **Section 4.3 Parameter Initialization** of the paper.

## Links

- Paper: https://arxiv.org/abs/2605.04572
- GitHub: https://github.com/Jason-wx/SQSD

## Directory Structure

```
./
├── Direction/               # Parameter shift direction weights
│   ├── Ageis_Danger/        # Danger direction weights (Aegis)
│   ├── Beaver-Danger/       # Danger direction weights (BeaverTails)
│   └── PKURLHF-10K_Safety/  # Safety direction weights (PKU-RLHF)
├── initial-state/           # Model initialization weights
│   └── dolly_ckpt_5850/     # Initial checkpoint used for Danger-Projection
└── README.md
```

## Direction Folder

The Direction folder contains three sets of direction weights, each extracted from a different dataset, encoding either a safety or danger direction in parameter space:

| Name | Type | Description |
|------|------|-------------|
| Ageis_Danger | Danger | Danger direction weights extracted from the Aegis dataset |
| Beaver-Danger | Danger | Danger direction weights extracted from the BeaverTails dataset |
| PKURLHF-10K_Safety | Safety | Safety direction weights extracted from the PKU-RLHF dataset |

These direction weights encode safety-relevant parameter shift directions and are a core dependency for computing SQSD scores.
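This README does not spell out the scoring formula, so the sketch below only illustrates the general idea of measuring how a parameter update lines up with a stored direction, using a plain scalar projection and cosine similarity. It is not the paper's SQSD definition (see the paper and the GitHub repository for that). It assumes, hypothetically, that a direction is stored as a per-parameter delta sharing names and shapes with the update; `Params`, `scalar_projection`, and `cosine_similarity` are illustrative names, and tensors are modeled as flat Python lists to keep the example dependency-free.

```python
import math
from typing import Dict, List

# Hypothetical representation: each parameter tensor is flattened into a
# list of floats and keyed by its parameter name, like a state dict.
Params = Dict[str, List[float]]


def _flatten(params: Params) -> List[float]:
    """Concatenate all parameters in a fixed (sorted-name) order."""
    return [x for name in sorted(params) for x in params[name]]


def _check_compatible(a: Params, b: Params) -> None:
    """Both inputs must share parameter names and per-parameter lengths."""
    if a.keys() != b.keys():
        raise ValueError("update and direction must share parameter names")
    for name in a:
        if len(a[name]) != len(b[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")


def scalar_projection(update: Params, direction: Params) -> float:
    """Signed length of `update` along the unit vector of `direction`.

    Positive: the update moves parameters along the direction.
    Negative: the update moves parameters against it.
    """
    _check_compatible(update, direction)
    u, v = _flatten(update), _flatten(direction)
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_v == 0.0:
        raise ValueError("direction has zero norm")
    return sum(a * b for a, b in zip(u, v)) / norm_v


def cosine_similarity(update: Params, direction: Params) -> float:
    """Cosine of the angle between `update` and `direction` (in [-1, 1])."""
    _check_compatible(update, direction)
    u = _flatten(update)
    norm_u = math.sqrt(sum(x * x for x in u))
    if norm_u == 0.0:
        raise ValueError("update has zero norm")
    return scalar_projection(update, direction) / norm_u


if __name__ == "__main__":
    direction = {"w": [1.0, 0.0], "b": [0.0]}
    update = {"w": [3.0, 4.0], "b": [0.0]}
    print(scalar_projection(update, direction))  # 3.0
    print(cosine_similarity(update, direction))  # 0.6
```

The scalar projection keeps the magnitude of the update, while cosine similarity normalizes it away; which one is appropriate depends on whether update size should influence the score.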

## initial-state Folder

The weights in `initial-state` (`dolly_ckpt_5850`) are the initialization state used for the **Danger-Projection** method: a fine-tuning checkpoint θ_t that serves as θ_initial, the reference starting point for subsequent SQSD computation.

> ⚠️ The paper defines two initialization strategies depending on the projection direction (see **Section 4.3 Parameter Initialization**):
> - **Danger direction** (drift-enhanced sensitivity): θ_initial = θ_t, initialized from a fine-tuning checkpoint that exhibits high directional sensitivity. The weights provided here (`dolly_ckpt_5850`) serve this purpose.
> - **Safety direction** (linear-path sensitivity): θ_initial = θ_0 + α\*V_safety, initialized by interpolating from the base model along the safety direction vector. **No additional checkpoint is required** — only the base model weights and the safety direction weights from the `Direction` folder are needed.
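The two initialization strategies above can be sketched as a small selector. This is a minimal illustration, not the repository's code: `Params`, `safety_initial_state`, and `initial_state` are hypothetical names, parameters are modeled as flat lists keyed by name, and α is a hyperparameter whose value should be taken from the paper (the sketch deliberately has no default for it).

```python
from typing import Dict, List, Optional

# Hypothetical representation: flattened parameter tensors keyed by name.
Params = Dict[str, List[float]]


def safety_initial_state(theta_0: Params, v_safety: Params, alpha: float) -> Params:
    """Linear-path init: theta_initial = theta_0 + alpha * V_safety."""
    if theta_0.keys() != v_safety.keys():
        raise ValueError("base weights and safety direction must share parameter names")
    out: Params = {}
    for name, base in theta_0.items():
        direction = v_safety[name]
        if len(base) != len(direction):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        out[name] = [b + alpha * d for b, d in zip(base, direction)]
    return out


def initial_state(
    strategy: str,
    theta_0: Params,
    *,
    theta_t: Optional[Params] = None,
    v_safety: Optional[Params] = None,
    alpha: Optional[float] = None,
) -> Params:
    """Pick theta_initial according to the projection direction."""
    if strategy == "danger":
        # Drift-enhanced sensitivity: theta_initial = theta_t, i.e. a
        # fine-tuning checkpoint (such as dolly_ckpt_5850) used as-is.
        if theta_t is None:
            raise ValueError("danger strategy needs a fine-tuning checkpoint theta_t")
        return {name: list(values) for name, values in theta_t.items()}
    if strategy == "safety":
        # Linear-path sensitivity: only base weights and V_safety are needed.
        if v_safety is None or alpha is None:
            raise ValueError("safety strategy needs v_safety and alpha")
        return safety_initial_state(theta_0, v_safety, alpha)
    raise ValueError(f"unknown strategy: {strategy!r}")


if __name__ == "__main__":
    base = {"w": [1.0, 2.0]}
    v = {"w": [0.5, -1.0]}
    print(initial_state("safety", base, v_safety=v, alpha=2.0))  # {'w': [2.0, 0.0]}
    print(initial_state("danger", base, theta_t={"w": [9.0, 9.0]}))  # {'w': [9.0, 9.0]}
```

Returning a copy in the danger branch keeps later in-place updates from mutating the loaded checkpoint.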