---
license: mit
base_model: microsoft/Phi-3.5-mini-instruct
library_name: transformers
language:
  - en
tags:
  - pruning
  - random
  - bias-evaluation
  - llm-compression
  - arxiv:2605.08137
  - research-only
---

# phi-3.5-mini-instruct — random pruning at 50% target sparsity

> ⚠️ **Research artifact only — not for production use.**
> This model was created to *study* fairness degradation under weight pruning. The companion paper (IEEE AIIoT 2026) demonstrates that random pruning at this sparsity level induces measurable bias amplification on the BBQ benchmark. Do not deploy this model in any user-facing or decision-making system.

## Paper

**Weight Pruning Amplifies Bias: A Multi-Method Study of Compressed LLMs for Edge AI**
Plawan Kumar Rath, Rahul Maliakkal. *IEEE AIIoT 2026.*

- arXiv: <https://arxiv.org/abs/2605.08137>
- Code: <https://github.com/plawanrath/pruning-impact-analysis>
- Base model: [microsoft/Phi-3.5-mini-instruct](https://huggingface.co/microsoft/Phi-3.5-mini-instruct)
- License: `mit` (inherited from base model — see [terms](https://opensource.org/licenses/MIT))

## Pruning configuration

- **Method**: `random`
- **Target sparsity**: 50%
- **Actual sparsity achieved**: 50.00%
- **Zeroed parameters**: 1,811,924,620 of 3,623,878,656 prunable (50.00%)
- **Prune wall time**: 0.3s
- **Pruning scope**: linear layers in transformer blocks (attention projections + MLP). Embeddings, LM head, and layer norms are untouched.
- **Calibration set**: none — random pruning requires no calibration data. (The paper's Wanda runs use 128 samples from C4 at sequence length 2048.)

**Method description.** Uniform random unstructured pruning. Acts as a control to test whether observed effects come from the *selection criterion* or from sparsity itself.
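As a rough illustration (a NumPy sketch, not the repo's actual pruning script), uniform random unstructured pruning simply zeroes a fixed fraction of entries chosen without regard to magnitude or activations:

```python
import numpy as np

def random_unstructured_prune(weight: np.ndarray, sparsity: float, seed: int = 0) -> np.ndarray:
    """Zero a uniformly random fraction of entries; the tensor stays dense."""
    rng = np.random.default_rng(seed)
    flat = weight.copy().ravel()
    n_zero = int(round(sparsity * flat.size))               # e.g. 50% of all entries
    idx = rng.choice(flat.size, size=n_zero, replace=False)  # selection ignores weight values
    flat[idx] = 0.0
    return flat.reshape(weight.shape)

w = np.random.default_rng(1).standard_normal((64, 64)).astype(np.float32)
pruned = random_unstructured_prune(w, 0.5)
achieved = float((pruned == 0.0).mean())  # actual sparsity, here exactly 0.50
```

In PyTorch the equivalent per-layer operation is `torch.nn.utils.prune.random_unstructured(module, name="weight", amount=0.5)`, applied to each linear layer in the pruning scope above.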

## Reported metrics (from the paper)

| Metric | Value | Baseline / note |
|---|---|---|
| Perplexity (Tulu-3 SFT mix, 256×512) | 47,924,927 | dense baseline 4.44 (+1,079,390,147.7%) |
| Stereotype Reliance Score (SRS) by category, 50% sparsity | Age: 0.318, Gender Identity: 0.322, Race/Ethnicity: 0.341, Religion: 0.313, SES: 0.324 | random-chance baseline ≈ 0.333 |
| Mean per-item inference latency (Apple Silicon, MLX) | 0.158s | **identical to the dense baseline**; unstructured pruning provides no latency benefit on dense GEMM kernels (paper §V.B) |

## Important caveats for IoT / edge deployment

- **No storage savings.** Unstructured pruning zeroes individual weights but keeps them in the dense float tensor. SafeTensors and GGUF do not exploit unstructured sparsity, so the on-disk size of this checkpoint is **identical** to the dense base model.
- **No latency savings.** Dense GEMM kernels do not skip zero entries. Inference latency on Apple Silicon (MLX) and the majority of consumer GPUs / mobile NPUs is **identical** to the dense baseline.
- **Bias amplification may be invisible to perplexity-based eval.** The paper's headline finding (the *Smart Pruning Paradox*): Wanda at 50% sparsity on Mistral-7B raises perplexity 3.5% but raises Stereotype Reliance Score 83.7% — a 24× disparity. Standard deployment validation based on perplexity alone provides false assurance.
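The storage caveat is easy to verify: a dense float buffer occupies the same number of bytes no matter how many entries are zero. A minimal NumPy sketch (standing in for the SafeTensors layout, which likewise stores tensors densely):

```python
import numpy as np

rng = np.random.default_rng(0)
dense = rng.standard_normal((256, 256)).astype(np.float32)
pruned = dense.copy()
pruned[rng.random(pruned.shape) < 0.5] = 0.0  # ~50% unstructured sparsity

# Both buffers occupy 256 * 256 * 4 = 262,144 bytes; zeros are stored like any other value.
print(dense.nbytes, pruned.nbytes)  # 262144 262144
```

Realizing actual storage or latency savings would require structured sparsity (e.g. 2:4 patterns with hardware support) or an explicitly sparse storage format, neither of which applies to this checkpoint.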

## Citation

```bibtex
@inproceedings{rath2026pruning,
  title         = {Weight Pruning Amplifies Bias: A Multi-Method Study of Compressed LLMs for Edge AI},
  author        = {Rath, Plawan Kumar and Maliakkal, Rahul},
  booktitle     = {Proc. IEEE AIIoT 2026},
  year          = {2026},
  eprint        = {2605.08137},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2605.08137}
}
```

## Reproducibility

- All pruning scripts, evaluation pipelines, and aggregated results: <https://github.com/plawanrath/pruning-impact-analysis>
- BBQ benchmark (ambiguous condition only): [`Elfsong/BBQ`](https://huggingface.co/datasets/Elfsong/BBQ)
- The pruning statistics on this card are generated from `pruning_meta.json` shipped in this repo (`actual_sparsity`, prune time, etc.).
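A quick sanity check against the shipped metadata can be sketched as follows. The payload below is hypothetical: only `actual_sparsity` is confirmed above, and the other field names are assumptions about the real `pruning_meta.json` schema.

```python
import json

# Hypothetical payload mirroring the fields this card references; the real
# pruning_meta.json ships in the repo and may use different key names.
meta = json.loads(
    '{"method": "random", "target_sparsity": 0.5,'
    ' "actual_sparsity": 0.5, "prune_wall_time_s": 0.3}'
)

# Achieved sparsity should sit within a small tolerance of the target.
assert abs(meta["actual_sparsity"] - meta["target_sparsity"]) <= 0.005
summary = f'{meta["method"]} pruning: {meta["actual_sparsity"]:.2%} sparsity in {meta["prune_wall_time_s"]}s'
print(summary)  # random pruning: 50.00% sparsity in 0.3s
```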