MaziyarPanahi committed
Commit 4c9836d · verified · 1 Parent(s): 428fc63

Update 8-bit Privacy Filter artifact with expert quantization

Files changed (2):
  1. README.md +4 -10
  2. weights.safetensors +2 -2
README.md CHANGED
@@ -38,7 +38,7 @@ After the model is downloaded once, inference runs locally. No document text is
 - Tokenizer: `o200k_base` / tiktoken-style BPE
 - Labels: `account_number`, `private_address`, `private_date`, `private_email`, `private_person`, `private_phone`, `private_url`, `secret`
 
-The standard MLX layers are quantized, including embeddings, attention projections, MoE gates, and the token-classification head. Custom sparse-MoE expert tensors remain stored in their normal precision until OpenMed adds a dedicated expert-tensor quantization kernel.
+This artifact uses expert-aware MLX quantization: embeddings, attention projections, MoE gates, sparse-MoE expert tensors, and the token-classification head are all stored in 8-bit packed form. The resulting `weights.safetensors` file is about 1.39 GiB, compared with about 2.49 GiB for the BF16 OpenMed MLX artifact.
 
 ## Quick Start: Python
 
@@ -75,7 +75,7 @@ Example output:
   "word": "alice.smith@example.com",
   "start": 39,
   "end": 62,
-  "score": 0.9600,
+  "score": 0.9998,
 }
 ```
 
@@ -109,19 +109,13 @@ For iOS, run on Apple Silicon hardware. The iOS Simulator is not the recommended
 
 ## Validation
 
-The 8-bit artifact was validated against the unquantized OpenMed MLX artifact with fixed text samples. In a sanity check containing a person name, phone number, email, and address, both artifacts returned the same four span types with close scores:
-
-| Span | bf16 score | q8 score |
-|---|---:|---:|
-| `private_person` | 1.0000 | 1.0000 |
-| `private_phone` | 0.9891 | 0.9881 |
-| `private_email` | 0.9662 | 0.9604 |
-| `private_address` | 0.9107 | 0.9051 |
+The 8-bit artifact was validated against the unquantized OpenMed MLX artifact with fixed text samples. BF16 and Q8 returned identical grouped spans for person, date, phone, email, address, and account-number examples.
 
 OpenMed also includes unit tests for:
 
 - q8 artifact loading
 - quantization metadata decoding
+- expert tensor packing and `.scales` coverage
 - finite logits from the q8 runtime
 - bf16/q8 shape and argmax-label coherence
 - BIOES/Viterbi span decoding
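The size reduction implied by the README change above can be sanity-checked with a little arithmetic. The sketch below assumes MLX-style affine quantization with a group size of 64 and a float16 scale and bias per group; those parameters are common MLX defaults, not something this commit states, so treat the "ideal" figure as an estimate. The observed ratio comes straight from the two LFS pointer sizes in this commit.

```python
# Sketch: expected vs. observed compression for 8-bit affine quantization,
# ASSUMING MLX-like storage (group_size=64, float16 scale + bias per group).
# These parameters are illustrative, not taken from the OpenMed artifact.

BF16_BYTES = 2          # bytes per weight in the bf16 artifact
GROUP_SIZE = 64         # assumed quantization group size

def q8_bytes_per_weight(group_size: int = GROUP_SIZE) -> float:
    """One byte per quantized weight plus a shared float16 scale and bias."""
    packed = group_size * 1        # 8-bit packed values
    overhead = 2 + 2               # float16 scale + float16 bias per group
    return (packed + overhead) / group_size

ideal_ratio = BF16_BYTES / q8_bytes_per_weight()

# Observed ratio from the git-LFS pointers in this commit:
observed_ratio = 2_668_486_393 / 1_488_841_579

print(f"ideal ~{ideal_ratio:.2f}x, observed ~{observed_ratio:.2f}x")
```

The observed ratio (about 1.79x) sits a little below the idealized 1.88x, which is consistent with some metadata and small tensors staying in higher precision.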
weights.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:fde03f50d02edefe911e511c012ccfb2302b0792014165d1add0ce9c7c798d65
-size 2668486393
+oid sha256:22c4e8323b5a39bd6ab42b8cf9f8920f7676158b7375759d28d053616fbd7d6e
+size 1488841579
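The `size` fields in the LFS pointers above are plain byte counts, so they can be converted to human-readable sizes directly and compared against the figures quoted in the README. A minimal conversion, assuming binary GiB (1024³ bytes):

```python
# Convert the git-LFS pointer sizes from this commit to GiB (1 GiB = 1024**3 bytes).
sizes = {
    "bf16 (old pointer)": 2_668_486_393,
    "q8 (new pointer)": 1_488_841_579,
}
GIB = 1024 ** 3

for name, n in sizes.items():
    print(f"{name}: {n / GIB:.2f} GiB")
```

The q8 pointer works out to roughly 1.39 GiB, matching the size stated for the quantized `weights.safetensors` in the README.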