# Sensitivity-Aware LoRA Adapter for Qwen3-8B (4-bit)
This repository contains a LoRA adapter that improves Sensitivity Awareness (SA), i.e., the ability of an LLM to follow role-based access rules and avoid unauthorized disclosure in enterprise-style question-answering scenarios (e.g., HR/employee data with access-control policies).
**Important scope note:** This adapter is specialized for the ADI toy environment used in our paper. It should not be interpreted as a general-purpose “enterprise security” model and must not replace application-layer access control.
The adapter is trained and evaluated following the methodology described in “Towards Sensitivity-Aware Language Models” (AISTATS 2026 submission, under review).
## What’s in this repo
This is not a full standalone model checkpoint. You load it on top of the base model:
- `adapter_model.safetensors` — LoRA weights
- `adapter_config.json` — PEFT configuration
- Tokenizer/chat assets included in this repo:
  - `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`
  - `added_tokens.json` (if present)
  - `vocab.json`, `merges.txt` (if present)
  - `chat_template.jinja`
## Model details
- Adapter type: LoRA (PEFT)
- Base model: `unsloth/qwen3-8b-unsloth-bnb-4bit` (4-bit quantized Qwen3-8B via Unsloth)
- Intended capability: improved compliance with access-control rules under the ADI (“ACCESS DENIED INC”) evaluation protocol.
- Languages: Primarily English (benchmark is English).
- License: MIT
## Intended use

### Direct use
Use this adapter when you want an LLM that is more likely to refuse or redact requests that violate role-based access policies in the ADI toy environment, especially under adversarial prompting (e.g., “malicious” or “lying” requests).
Typical research use cases:
- Sensitivity-aware / policy-aware generation in a controlled benchmark setting
- Prompt-exfiltration resistance experiments on synthetic corpora
- Reproducing ADI evaluation results reported in our paper
### Out-of-scope / misuse
- Not a substitute for real access control. Always enforce authorization in your application layer (RBAC/ABAC), not just in the model.
- Do not use this as the only safeguard for real PII, regulated data, or high-stakes compliance environments.
- Like all LLMs, it can still fail in edge cases (prompt injection, distribution shift, ambiguous policy wording, etc.).
## How to get started

### Recommended: use our GitHub repo (batch ADI evaluation)
For end-to-end reproduction (dataset format, batch inference, output files, and ADI-style evaluation), please use:
https://github.com/DrenFazlija/towards-sa-llms
The batch inference script in that repo (`unsloth_adi.py`) supports loading this adapter directly from the Hub via `--adapter_repo SisWiss/qwen3-8b-sa-lora`.
Example:
```bash
python unsloth_adi.py \
  --inputs path/to/test_data.csv \
  --model unsloth/qwen3-8b-unsloth-bnb-4bit \
  --adapter_repo SisWiss/qwen3-8b-sa-lora \
  --identifier sa-lora
```
### Minimal sanity check (programmatic load)
If you just want to verify the adapter loads correctly:
```python
from unsloth import FastLanguageModel
from peft import PeftModel

base_id = "unsloth/qwen3-8b-unsloth-bnb-4bit"
adapter_id = "SisWiss/qwen3-8b-sa-lora"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=base_id,
    max_seq_length=3048,
    load_in_4bit=True,
    load_in_8bit=False,
    full_finetuning=False,
)
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()
```
## Training

### Training data
Supervised fine-tuning uses 30,897 “correct” annotations from the ADI setup as the training signal. The training mix is:
- 75% chain-of-thought examples (reasoning traces + final output)
- 25% output-only examples
To avoid contamination, evaluation uses a newly generated mock corporate dataset created via the ADI pipeline, producing 3×3,500 = 10,500 evaluation questions.
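As a rough illustration of the 75/25 split described above (a sketch only — `build_sft_mix` and the toy pools are hypothetical, not the paper’s pipeline), the mix could be assembled like this:

```python
import random

def build_sft_mix(cot_pool, output_only_pool, n_total, cot_frac=0.75, seed=0):
    """Sample a CoT / output-only training mix at the given ratio (illustrative)."""
    rng = random.Random(seed)
    n_cot = round(n_total * cot_frac)          # 75% chain-of-thought examples
    n_out = n_total - n_cot                    # 25% output-only examples
    mix = rng.sample(cot_pool, n_cot) + rng.sample(output_only_pool, n_out)
    rng.shuffle(mix)                           # interleave the two example types
    return mix

# Toy pools standing in for the "correct" ADI annotations.
cot = [{"kind": "cot", "id": i} for i in range(100)]
out = [{"kind": "output_only", "id": i} for i in range(100)]
mix = build_sft_mix(cot, out, n_total=100)
```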
### Training procedure
- Method: supervised fine-tuning (SFT) with LoRA adapters
- LoRA configuration:
  - rank `r = 32`
  - scaling factor `α = 32`
  - targets: attention + MLP projection layers
  - base weights frozen
  - dropout: none
  - bias adaptation: none
- Tooling: Unsloth for efficient fine-tuning; PEFT/Transformers stack
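The configuration above maps onto a PEFT `LoraConfig` roughly as follows. Note this is a sketch: the paper says “attention + MLP projection layers” without listing module names, so the `target_modules` below assume the standard Qwen3 projection layer names.

```python
from peft import LoraConfig

# Hedged reconstruction of the stated LoRA setup; module names are assumptions.
lora_config = LoraConfig(
    r=32,                  # rank r = 32
    lora_alpha=32,         # scaling factor α = 32
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
    lora_dropout=0.0,      # dropout: none
    bias="none",           # bias adaptation: none
    task_type="CAUSAL_LM",
)
```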
## Evaluation

### Sensitivity Awareness benchmark (ADI)
ADI grades each response on a 3-point scale:
- Correct (policy-compliant + format compliant)
- Error (format/accuracy issues)
- Wrong (unauthorized disclosure or incorrect denial)
It tests four scenarios:
- Benign user requests
- Malicious requests
- Supervisor requests
- Lying/adversarial prompts aiming to leak sensitive data
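The 3-point scale can be pictured as a small decision function. This is a toy illustration of the grading logic, not the paper’s actual grader; the predicate names are hypothetical.

```python
def grade_response(policy_allows: bool, disclosed: bool, format_ok: bool) -> str:
    """Toy sketch of the ADI 3-point scale.

    "Wrong" covers unauthorized disclosure and incorrect denial;
    "Error" covers format/accuracy issues on an otherwise compliant answer.
    """
    if disclosed != policy_allows:   # leaked forbidden data, or denied allowed data
        return "Wrong"
    if not format_ok:                # right access decision, broken format
        return "Error"
    return "Correct"
```

For example, disclosing data the policy forbids (`grade_response(False, True, True)`) and refusing data the policy allows (`grade_response(True, False, True)`) both land in “Wrong”.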
### Results (Qwen3-8B baseline vs. this LoRA adapter)
Overall correctness improves by +21.71 percentage points on ADI (55.27% → 76.98%), with the largest gains in adversarial categories.
ADI (10,500 questions total; success rates shown per category):
| Model | Correct ↑ | Error ↓ | Wrong ↓ | Benign ↑ | Malicious ↑ | Supervisor ↑ | Lying ↑ |
|---|---|---|---|---|---|---|---|
| Qwen3-8B (4-bit base) | 55.27% | 5.40% | 26.94% | 96.00% | 14.53% | 92.40% | 0.93% |
| Qwen3-8B (LoRA SA) | 76.98% | 4.18% | 16.83% | 97.28% | 56.69% | 88.67% | 33.87% |
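The headline numbers follow directly from the table; a quick arithmetic check of the per-category deltas (percentage points):

```python
# Deltas computed from the table above (LoRA SA minus 4-bit base).
base = {"correct": 55.27, "malicious": 14.53, "lying": 0.93}
lora = {"correct": 76.98, "malicious": 56.69, "lying": 33.87}

deltas = {k: round(lora[k] - base[k], 2) for k in base}
# Overall correctness gain is +21.71 pp; adversarial categories gain the most.
```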
### General capability trade-offs (non-SA benchmarks)
The paper reports small-to-moderate drops on broad reasoning benchmarks, while instruction-following and math remain comparatively stable.
Language Model Evaluation Harness (base vs LoRA):
| Model | BIG-Bench Hard (Exact Match) | IFEval strict (Instance) | IFEval strict (Prompt) | GSM8K-Platinum (Flexible Extract) | GSM8K-Platinum (Exact Extract) |
|---|---|---|---|---|---|
| Qwen3-8B (4-bit base) | 0.7689 ± 0.0046 | 0.4173 | 0.2699 ± 0.0191 | 0.8933 ± 0.0089 | 0.8834 ± 0.0092 |
| Qwen3-8B (LoRA SA) | 0.6756 ± 0.0049 | 0.4161 | 0.2699 ± 0.0191 | 0.8602 ± 0.0100 | 0.8644 ± 0.0099 |
Deployment note: If you want to minimize broad-capability regressions, consider enabling the adapter only in guarded contexts (e.g., when a request touches sensitive tables), or interpolating with the base model in low-risk flows.
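One way to realize the “guarded contexts” idea is a routing predicate that activates the adapter only when a request touches sensitive data. The sketch below is hypothetical (the table names and `touches_sensitive_table` helper are invented for illustration); only the `PeftModel.disable_adapter()` context manager is real PEFT API.

```python
SENSITIVE_TABLES = {"salaries", "performance_reviews", "medical_leave"}  # hypothetical

def touches_sensitive_table(query: str, sensitive=SENSITIVE_TABLES) -> bool:
    """Hypothetical guard: use the SA adapter only for requests that
    mention a sensitive table; low-risk flows stay on the base model."""
    q = query.lower()
    return any(table in q for table in sensitive)

# In a PEFT-based server, the guard would pick the execution path, e.g.:
#   if touches_sensitive_table(user_query):
#       out = model.generate(**inputs)             # adapter active
#   else:
#       with model.disable_adapter():              # PeftModel context manager
#           out = model.generate(**inputs)         # base weights only
```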
## Bias, risks, and limitations
- SA behavior is evaluated on mock corporate data and a specific benchmark format; performance may vary under different policy schemas, domains, or languages.
- LLMs can still be vulnerable to prompt injection and policy ambiguity; treat SA as defense-in-depth, not a single control.
- Models may over-refuse in some “supervisor” cases depending on how permissions are expressed (observed in the paper’s analysis).
## Citation
If you use this adapter in academic work, please cite:
```bibtex
@misc{fazlija2026sensitivityawarelanguagemodels,
  title={Towards Sensitivity-Aware Language Models},
  author={Dren Fazlija and Iyiola E. Olatunji and Daniel Kudenko and Sandipan Sikdar},
  year={2026},
  eprint={2601.20901},
  archivePrefix={arXiv},
  primaryClass={cs.CR},
  url={https://arxiv.org/abs/2601.20901},
}
```
## Contact
- Use the HF Discussions tab for questions/issues.
- For code, evaluation scripts, and reproducibility instructions, see: https://github.com/DrenFazlija/towards-sa-llms
## Framework versions
- PEFT 0.16.0