Sensitivity-Aware LoRA Adapter for Qwen3-8B (4-bit)

This repository contains a LoRA adapter that improves Sensitivity Awareness (SA): the ability of an LLM to follow role-based access rules and avoid unauthorized disclosure in enterprise-style question-answering scenarios (e.g., HR/employee data governed by access-control policies).

Important scope note: This adapter is specialized for the ADI toy environment used in our paper. It should not be interpreted as a general-purpose “enterprise security” model and must not replace application-layer access control.

The adapter is trained and evaluated following the methodology described in “Towards Sensitivity-Aware Language Models” (AISTATS 2026 submission, under review).


What’s in this repo

This is not a full standalone model checkpoint. You load it on top of the base model:

  • adapter_model.safetensors — LoRA weights
  • adapter_config.json — PEFT configuration
  • Tokenizer/chat assets included in this repo:
    • tokenizer.json, tokenizer_config.json, special_tokens_map.json
    • added_tokens.json (if present)
    • vocab.json, merges.txt (if present)
    • chat_template.jinja

Model details

  • Adapter type: LoRA (PEFT)
  • Base model: unsloth/qwen3-8b-unsloth-bnb-4bit (4-bit quantized Qwen3-8B via Unsloth)
  • Intended capability: Improved compliance with access-control rules under the ADI (“ACCESS DENIED INC”) evaluation protocol.
  • Languages: Primarily English (benchmark is English).
  • License: MIT

Intended use

Direct use

Use this adapter when you want an LLM that is more likely to refuse or redact requests that violate role-based access policies in the ADI toy environment, especially under adversarial prompting (e.g., “malicious” or “lying” requests).

Typical research use cases:

  • Sensitivity-aware / policy-aware generation in a controlled benchmark setting
  • Prompt-exfiltration resistance experiments on synthetic corpora
  • Reproducing ADI evaluation results reported in our paper

Out-of-scope / misuse

  • Not a substitute for real access control. Always enforce authorization in your application layer (RBAC/ABAC), not just in the model.
  • Do not use this as the only safeguard for real PII, regulated data, or high-stakes compliance environments.
  • Like all LLMs, it can still fail in edge cases (prompt injection, distribution shift, ambiguous policy wording, etc.).
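As a minimal illustration of keeping authorization in the application layer, the calling code can filter records before the model ever sees them, so a model failure cannot leak anything the check already removed. The roles and field names below are purely hypothetical:

```python
# Hypothetical application-layer RBAC gate: only fields the caller's role
# may read are ever passed to the model. The role/field names are
# illustrative, not taken from the ADI environment.
ROLE_PERMISSIONS = {
    "hr_manager": {"salary", "address", "performance"},
    "employee": {"address"},
}

def visible_fields(role: str, record: dict) -> dict:
    """Return only the fields of `record` that `role` is allowed to read."""
    allowed = ROLE_PERMISSIONS.get(role, set())
    return {k: v for k, v in record.items() if k in allowed}

record = {"salary": 90000, "address": "12 Main St", "performance": "A"}
print(visible_fields("employee", record))  # {'address': '12 Main St'}
```

The adapter then acts as defense-in-depth on top of this gate, not instead of it.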

How to get started

Recommended: use our GitHub repo (batch ADI evaluation)

For end-to-end reproduction (dataset format, batch inference, output files, and ADI-style evaluation), please use:

https://github.com/DrenFazlija/towards-sa-llms

The batch inference script in that repo (unsloth_adi.py) supports loading this adapter directly from the Hub via --adapter_repo SisWiss/qwen3-8b-sa-lora.

Example:

python unsloth_adi.py \
  --inputs path/to/test_data.csv \
  --model unsloth/qwen3-8b-unsloth-bnb-4bit \
  --adapter_repo SisWiss/qwen3-8b-sa-lora \
  --identifier sa-lora

Minimal sanity check (programmatic load)

If you just want to verify the adapter loads correctly:

from unsloth import FastLanguageModel
from peft import PeftModel

base_id = "unsloth/qwen3-8b-unsloth-bnb-4bit"
adapter_id = "SisWiss/qwen3-8b-sa-lora"

# Load the 4-bit base model (the adapter weights stay separate)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=base_id,
    max_seq_length=3048,
    load_in_4bit=True,
    load_in_8bit=False,
    full_finetuning=False,
)

# Attach the LoRA adapter from the Hub on top of the frozen base weights
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()

Training

Training data

Supervised fine-tuning uses 30,897 “correct” annotations from the ADI setup as training signal. The training mix is:

  • 75% chain-of-thought examples (reasoning traces + final output)
  • 25% output-only examples

To avoid contamination, evaluation uses a newly generated mock corporate dataset created via the ADI pipeline, producing 3×3,500 = 10,500 evaluation questions.
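The 75/25 chain-of-thought vs. output-only mix above can be sketched as a simple deterministic split. This is not the paper's actual preprocessing code, and the example schema (`question`/`reasoning`/`answer` keys) is an assumption:

```python
import random

def build_sft_mix(examples, cot_frac=0.75, seed=0):
    """Split annotated examples into CoT and output-only training targets.

    `examples` is a list of dicts with 'question', 'reasoning', 'answer'
    keys -- a hypothetical schema; the real ADI annotation format may differ.
    """
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_cot = int(len(shuffled) * cot_frac)
    mix = []
    for i, ex in enumerate(shuffled):
        if i < n_cot:  # reasoning trace + final output
            target = f"{ex['reasoning']}\n{ex['answer']}"
        else:          # output-only
            target = ex["answer"]
        mix.append({"prompt": ex["question"], "target": target})
    return mix

data = [{"question": f"q{i}", "reasoning": f"r{i}", "answer": f"a{i}"}
        for i in range(100)]
mix = build_sft_mix(data)  # 75 CoT targets, 25 output-only targets
```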

Training procedure

  • Method: Supervised fine-tuning (SFT) with LoRA adapters

  • LoRA configuration:

    • rank r = 32
    • scaling factor α = 32
    • targets: attention + MLP projection layers
    • base weights frozen
    • dropout: none
    • bias adaptation: none
  • Tooling: Unsloth for efficient fine-tuning; PEFT/Transformers stack
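The hyperparameters above map directly onto PEFT's `LoraConfig` fields; a plain-dict sketch is shown below. The target module names assume Qwen/Llama-style layer naming and are an assumption, not read from this repo's `adapter_config.json`:

```python
# Sketch of the LoRA hyperparameters above, keyed by the corresponding
# peft.LoraConfig field names. Target module names are assumed
# (Qwen-style); check adapter_config.json for the authoritative list.
lora_config = {
    "r": 32,              # rank
    "lora_alpha": 32,     # scaling factor alpha
    "lora_dropout": 0.0,  # no dropout
    "bias": "none",       # no bias adaptation
    "target_modules": [
        # attention projections
        "q_proj", "k_proj", "v_proj", "o_proj",
        # MLP projections
        "gate_proj", "up_proj", "down_proj",
    ],
}
```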


Evaluation

Sensitivity Awareness benchmark (ADI)

ADI grades each response on a 3-point scale:

  1. Correct (policy-compliant and format-compliant)
  2. Error (format/accuracy issues)
  3. Wrong (unauthorized disclosure or incorrect denial)

It tests four scenarios:

  • Benign user requests
  • Malicious requests
  • Supervisor requests
  • Lying/adversarial prompts aiming to leak sensitive data
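Given per-response grades on this 3-point scale, the aggregate rates reported in the results can be tallied with a small helper. This is only a sketch; ADI's actual scorer lives in the linked GitHub repo:

```python
from collections import Counter

def tally_grades(grades):
    """Aggregate 3-point ADI grades ('correct', 'error', 'wrong')
    into per-class rates over all graded responses."""
    counts = Counter(grades)
    total = len(grades)
    return {g: counts.get(g, 0) / total for g in ("correct", "error", "wrong")}

grades = ["correct"] * 7 + ["error"] * 1 + ["wrong"] * 2
print(tally_grades(grades))  # {'correct': 0.7, 'error': 0.1, 'wrong': 0.2}
```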

Results (Qwen3-8B baseline vs this LoRA adapter)

Overall correctness improves by +21.71 percentage points on ADI (55.27% → 76.98%), with the largest gains in adversarial categories.

ADI (10,500 questions total; success rates shown per category):

| Model | Correct ↑ | Error ↓ | Wrong ↓ | Benign ↑ | Malicious ↑ | Supervisor ↑ | Lying ↑ |
|---|---|---|---|---|---|---|---|
| Qwen3-8B (4-bit base) | 55.27% | 5.40% | 26.94% | 96.00% | 14.53% | 92.40% | 0.93% |
| Qwen3-8B (LoRA SA) | 76.98% | 4.18% | 16.83% | 97.28% | 56.69% | 88.67% | 33.87% |

General capability trade-offs (non-SA benchmarks)

The paper reports small-to-moderate drops on broad reasoning benchmarks, while instruction-following and math remain comparatively stable.

Language Model Evaluation Harness (base vs LoRA):

| Model | BIG-Bench Hard (Exact Match) | IFEval strict (Instance) | IFEval strict (Prompt) | GSM8K-Platinum (Flexible Extract) | GSM8K-Platinum (Exact Extract) |
|---|---|---|---|---|---|
| Qwen3-8B (4-bit base) | 0.7689 ± 0.0046 | 0.4173 | 0.2699 ± 0.0191 | 0.8933 ± 0.0089 | 0.8834 ± 0.0092 |
| Qwen3-8B (LoRA SA) | 0.6756 ± 0.0049 | 0.4161 | 0.2699 ± 0.0191 | 0.8602 ± 0.0100 | 0.8644 ± 0.0099 |

Deployment note: If you want to minimize broad-capability regressions, consider enabling the adapter only in guarded contexts (e.g., when a request touches sensitive tables), or interpolating with the base model in low-risk flows.
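One way to realize the "guarded contexts" idea is a per-request router that decides whether the adapter should be active; with PEFT, the adapter can then be switched off for low-risk flows via `PeftModel`'s `disable_adapter()` context manager. The keyword list below is purely illustrative:

```python
# Illustrative router: activate the SA adapter only when a request appears
# to touch sensitive data. The marker list is an assumption, not part of
# the ADI benchmark or this repo.
SENSITIVE_MARKERS = {"salary", "ssn", "performance review", "home address"}

def request_is_sensitive(prompt: str) -> bool:
    """Heuristic check for whether a request touches sensitive fields."""
    text = prompt.lower()
    return any(marker in text for marker in SENSITIVE_MARKERS)

# Usage sketch with a loaded PeftModel `model`:
# if request_is_sensitive(prompt):
#     reply = generate(model, prompt)       # adapter active (default)
# else:
#     with model.disable_adapter():         # fall back to base weights
#         reply = generate(model, prompt)
```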


Bias, risks, and limitations

  • SA behavior is evaluated on mock corporate data and a specific benchmark format; performance may vary under different policy schemas, domains, or languages.
  • LLMs can still be vulnerable to prompt injection and policy ambiguity; treat SA as defense-in-depth, not a single control.
  • Models may over-refuse in some “supervisor” cases depending on how permissions are expressed (observed in the paper’s analysis).

Citation

If you use this adapter in academic work, please cite:

@misc{fazlija2026sensitivityawarelanguagemodels,
      title={Towards Sensitivity-Aware Language Models}, 
      author={Dren Fazlija and Iyiola E. Olatunji and Daniel Kudenko and Sandipan Sikdar},
      year={2026},
      eprint={2601.20901},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2601.20901}, 
}


Framework versions

  • PEFT 0.16.0