MaziyarPanahi committed
Commit 4c9836d · verified · 1 Parent(s): 428fc63

Update 8-bit Privacy Filter artifact with expert quantization

Files changed (2):
  1. README.md +4 -10
  2. weights.safetensors +2 -2
README.md CHANGED
@@ -38,7 +38,7 @@ After the model is downloaded once, inference runs locally. No document text is
 - Tokenizer: `o200k_base` / tiktoken-style BPE
 - Labels: `account_number`, `private_address`, `private_date`, `private_email`, `private_person`, `private_phone`, `private_url`, `secret`
 
-The standard MLX layers are quantized, including embeddings, attention projections, MoE gates, and the token-classification head. Custom sparse-MoE expert tensors remain stored in their normal precision until OpenMed adds a dedicated expert-tensor quantization kernel.
+This artifact uses expert-aware MLX quantization: embeddings, attention projections, MoE gates, sparse-MoE expert tensors, and the token-classification head are all stored in 8-bit packed form. The resulting `weights.safetensors` file is about 1.39 GiB, compared with about 2.49 GiB for the BF16 OpenMed MLX artifact.
 
 ## Quick Start: Python
 
@@ -75,7 +75,7 @@ Example output:
   "word": "alice.smith@example.com",
   "start": 39,
   "end": 62,
-  "score": 0.9600,
+  "score": 0.9998,
 }
 ```
 
@@ -109,19 +109,13 @@ For iOS, run on Apple Silicon hardware. The iOS Simulator is not the recommended
 
 ## Validation
 
-The 8-bit artifact was validated against the unquantized OpenMed MLX artifact with fixed text samples. In a sanity check containing a person name, phone number, email, and address, both artifacts returned the same four span types with close scores:
-
-| Span | bf16 score | q8 score |
-|---|---:|---:|
-| `private_person` | 1.0000 | 1.0000 |
-| `private_phone` | 0.9891 | 0.9881 |
-| `private_email` | 0.9662 | 0.9604 |
-| `private_address` | 0.9107 | 0.9051 |
+The 8-bit artifact was validated against the unquantized OpenMed MLX artifact with fixed text samples. BF16 and Q8 returned identical grouped spans for person, date, phone, email, address, and account-number examples.
 
 OpenMed also includes unit tests for:
 
 - q8 artifact loading
 - quantization metadata decoding
+- expert tensor packing and `.scales` coverage
 - finite logits from the q8 runtime
 - bf16/q8 shape and argmax-label coherence
 - BIOES/Viterbi span decoding
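The size reduction implied by the README change above can be sanity-checked with a little arithmetic. The sketch below assumes MLX-style affine quantization with a group size of 64 and a float16 scale and bias per group; those parameters are common MLX defaults, not something this commit states, so treat the "ideal" figure as an estimate. The observed ratio comes straight from the two LFS pointer sizes in this commit.

```python
# Sketch: expected vs. observed compression for 8-bit affine quantization,
# ASSUMING MLX-like storage (group_size=64, float16 scale + bias per group).
# These parameters are illustrative, not taken from the OpenMed artifact.

BF16_BYTES = 2          # bytes per weight in the bf16 artifact
GROUP_SIZE = 64         # assumed quantization group size

def q8_bytes_per_weight(group_size: int = GROUP_SIZE) -> float:
    """One byte per quantized weight plus a shared float16 scale and bias."""
    packed = group_size * 1        # 8-bit packed values
    overhead = 2 + 2               # float16 scale + float16 bias per group
    return (packed + overhead) / group_size

ideal_ratio = BF16_BYTES / q8_bytes_per_weight()

# Observed ratio from the git-LFS pointers in this commit:
observed_ratio = 2_668_486_393 / 1_488_841_579

print(f"ideal ~{ideal_ratio:.2f}x, observed ~{observed_ratio:.2f}x")
```

The observed ratio (about 1.79x) sits a little below the idealized 1.88x, which is consistent with some metadata and small tensors staying in higher precision.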
weights.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:fde03f50d02edefe911e511c012ccfb2302b0792014165d1add0ce9c7c798d65
-size 2668486393
+oid sha256:22c4e8323b5a39bd6ab42b8cf9f8920f7676158b7375759d28d053616fbd7d6e
+size 1488841579
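The `size` fields in the LFS pointers above are plain byte counts, so they can be converted to human-readable sizes directly and compared against the figures quoted in the README. A minimal conversion, assuming binary GiB (1024³ bytes):

```python
# Convert the git-LFS pointer sizes from this commit to GiB (1 GiB = 1024**3 bytes).
sizes = {
    "bf16 (old pointer)": 2_668_486_393,
    "q8 (new pointer)": 1_488_841_579,
}
GIB = 1024 ** 3

for name, n in sizes.items():
    print(f"{name}: {n / GIB:.2f} GiB")
```

The q8 pointer works out to roughly 1.39 GiB, matching the size stated for the quantized `weights.safetensors` in the README.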