---
license: mit
language:
  - en
pipeline_tag: text-ranking
tags:
  - cross-encoder
  - reranker
  - core-ml
  - apple-silicon
  - ane
  - mteb
base_model: BAAI/bge-reranker-base
# MODEL_INDEX:auto-stamped - do not edit by hand
model-index:
  - name: bge-reranker-base-coreml
    results:
      - task:
          type: text-ranking
          name: Reranking
        dataset:
          type: mteb/scidocs-reranking
          name: SciDocs Reranking
          split: test
          revision: <not pinned>
        metrics:
          - type: ndcg_at_10
            name: nDCG@10
            value: 0.7415
          - type: map
            name: MAP
            value: 0.6743
# /MODEL_INDEX
---

# bge-reranker-base: Core ML (.mlpackage) for Apple Silicon

Core ML port of [`BAAI/bge-reranker-base`](https://huggingface.co/BAAI/bge-reranker-base) targeting the **Apple Neural Engine** on M-series Macs. Produced by the maintainer-side conversion tool at [github.com/tcashel/juice-bge-reranker-coreml](https://github.com/tcashel/juice-bge-reranker-coreml). Consumed by the Juice macOS app via [`swift-transformers`](https://github.com/huggingface/swift-transformers).

This card **is the integration contract**. The Swift consumer relies on every section below; do not change a tensor name, shape, or token ID without bumping the variant tag (which the consumer pins in any per-model cache key, see below).

## Requirements

- **Apple Silicon Mac** (M1 / M2 / M3 / M4 / later). The headline `-ane` build requires the Apple Neural Engine.
- **macOS 15.0 (Sequoia) or later.** This is the artifact's `minimum_deployment_target`; older macOS versions cannot load the `.mlpackage`.
- **Swift consumer:** [`swift-transformers`](https://github.com/huggingface/swift-transformers) ≥ 1.3.0 for `HubApi` (snapshot download) and `AutoTokenizer` (XLM-R Unigram path). Direct `MLModel` load via `CoreML` also works.

## Usage

End-to-end working examples live in the [GitHub repo's `examples/`](https://github.com/tcashel/juice-bge-reranker-coreml/tree/main/examples) directory. Both load the artifact, score one `(query, doc)` pair, and print the sigmoid-mapped relevance.

### Swift (`swift-transformers` + `CoreML`)

The canonical consumer pattern; mirrors what the Juice macOS app does. Full source at [`examples/swift/Sources/Predict/main.swift`](https://github.com/tcashel/juice-bge-reranker-coreml/blob/main/examples/swift/Sources/Predict/main.swift). Key steps:

```swift
import CoreML
import Hub
import Tokenizers

let repo = Hub.Repo(id: "tcashel/bge-reranker-base-coreml", type: .models)
let folder = try await HubApi.shared.snapshot(from: repo, revision: "v0.1-ane")
let tokenizer = try await AutoTokenizer.from(modelFolder: folder)

// XLM-R paired-input template (swift-transformers does not expose textPair for Unigram):
let bos: Int32 = 0, eos: Int32 = 2, pad: Int32 = 1
let q = tokenizer.encode(text: query, addSpecialTokens: false).map(Int32.init)
let d = tokenizer.encode(text: doc,   addSpecialTokens: false).map(Int32.init)
var ids: [Int32] = [bos] + q + [eos, eos] + d + [eos]
// ... pad to seq ∈ {128, 256, 512}, fill 20 batch rows with <pad>, then:

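let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine
// MLModel(contentsOf:) expects a compiled model (.mlmodelc), so compile the .mlpackage first.
let compiledURL = try await MLModel.compileModel(at: folder.appendingPathComponent("model.mlpackage"))
let model = try MLModel(contentsOf: compiledURL, configuration: config)
// `provider` is an MLFeatureProvider holding input_ids and attention_mask (built in the full source).
let prediction = try await model.prediction(from: provider)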
let logit = Double(truncating: prediction.featureValue(for: "logit")!.multiArrayValue![[0, 0]])
let score = 1.0 / (1.0 + exp(-logit))
```

Run:

```sh
cd examples/swift
swift run Predict --tag v0.1-ane --query "what is the capital of france?" --doc "Paris is the capital of France."
```

### Python (`coremltools` + `transformers` for tokenization)

For verifying the artifact end-to-end on macOS without a Swift toolchain. Full source at [`examples/predict.py`](https://github.com/tcashel/juice-bge-reranker-coreml/blob/main/examples/predict.py):

```python
import math, numpy as np
from coremltools.models import MLModel
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

folder = snapshot_download(repo_id="tcashel/bge-reranker-base-coreml", revision="v0.1-ane")
tokenizer = AutoTokenizer.from_pretrained(folder, use_fast=True)
model = MLModel(f"{folder}/model.mlpackage")

# Python's transformers builds the paired-input template internally:
enc = tokenizer(query, doc, padding="max_length", truncation=True, max_length=128, return_tensors="np")

# Pad up to the fixed batch=20 (read row 0 of the output, discard the rest):
ids  = np.full ((20, 1, 1, 128), 1, dtype=np.int32); ids [0, 0, 0, :] = enc["input_ids"][0]
mask = np.zeros((20, 1, 1, 128),    dtype=np.int32); mask[0, 0, 0, :] = enc["attention_mask"][0]

logit = float(model.predict({"input_ids": ids, "attention_mask": mask})["logit"][0, 0])
score = 1.0 / (1.0 + math.exp(-logit))
```

Run:

```sh
pixi run python examples/predict.py --source hub --tag v0.1-ane
```

## Identity

- **Source model:** `BAAI/bge-reranker-base` @ `<source_revision_sha>` (set by `convert.py`).
- **Conversion type:** PyTorch (FP32) → Core ML `.mlpackage` (FP16) with the [`apple/ml-ane-transformers`](https://github.com/apple/ml-ane-transformers) primitives (Conv2d 1×1 projections, BC1S layout, `LayerNormANE`) so the encoder lowers to the Apple Neural Engine. **This is a precision reduction (FP32 → FP16) and format conversion, not integer quantization**: there is no INT8/INT4 mapping. No fine-tuning, distillation, or weight pruning was applied; weights are bit-equivalent up to FP16 rounding.
- **Conversion stack:** see `<variant>_provenance.json` published alongside the artifact (records exact torch / transformers / coremltools versions and host machine).
- **License:** MIT (inherited from the upstream model).

## Variants

| Tag | Compute units | Intended use |
|---|---|---|
| `v{X}-ane` | `cpuAndNeuralEngine` | Headline build. The 12-layer encoder backbone (924 ops: einsum, conv, softmax, layer_norm, gelu, transpose, residual add/mul) runs on the Apple Neural Engine. 31 boundary ops (embedding gather over the 250k vocab, position-id arithmetic, mask construction, casts) dispatch to CPU; this is the Pareto frontier for XLM-RoBERTa-class models with very large vocabularies. `verify_ane.py` enforces this exact 924/31 residency fingerprint as a regression gate; any drift fails. M-series Macs only. |
| `v{X}-cpugpu` | `cpuAndGPU` | Known-good fallback: the same ANE port converted with `compute_units=CPU_AND_GPU`. Used by Swift if the `-ane` build fails to load (e.g. driver or macOS version mismatch). |

The Swift caller pins the tag in `HubApi.shared.snapshot(from: repo, revision: "<tag>")`, where `repo` is `tcashel/bge-reranker-base-coreml`, and embeds the same `<tag>` in any consumer-side cache key tied to model identity. Rotating the tag therefore invalidates downstream caches.

> **Repository layout.** This repo uses git **tags** (not subdirectories or sibling repos) to distinguish variants: `v{X}-ane` and `v{X}-cpugpu` point to different commits, each containing exactly one variant's files at the repo root (one `model.mlpackage`, one set of tokenizer files, one `provenance.json`). The `main` branch reflects whichever variant was published last, so consumers should always pin to a specific tag rather than reading from `main`. This layout optimizes for the Swift consumer: `HubApi.shared.snapshot(from:, revision: <tag>)` returns a flat ready-to-use directory.
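
To illustrate how embedding the tag in a cache key makes a tag rotation invalidate cached scores, here is a minimal sketch. The function name `rerank_cache_key`, the key layout, and the `v0.2-ane` tag are hypothetical; the actual consumer's cache scheme is not described in this card.

```python
import hashlib

def rerank_cache_key(tag: str, query: str, doc_id: str) -> str:
    """Build a cache key that changes whenever the pinned model tag changes.

    Hypothetical layout: the tag, query, and document id are joined with a
    unit-separator character and hashed with SHA-256.
    """
    payload = "\x1f".join([tag, query, doc_id])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Same (query, doc) pair under two tags ("v0.2-ane" is a made-up future tag).
key_old = rerank_cache_key("v0.1-ane", "capital of france", "doc-42")
key_new = rerank_cache_key("v0.2-ane", "capital of france", "doc-42")
print(key_old != key_new)
```

Because the tag is part of the hashed payload, entries written under the old tag are simply never looked up again after a rotation.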

## Architecture

> **Heads-up: XLM-RoBERTa, not BERT.** The encoder *geometry* is BERT-like (12L / 768H / 12 heads, GELU, post-LN), so a casual reader may pattern-match it as a BERT cross-encoder. It isn't. The upstream `config.json` declares `model_type: xlm-roberta`, `architectures: ["XLMRobertaForSequenceClassification"]`. The tokenizer and special-token IDs differ accordingly (see below); don't reach for `[CLS]`/`[SEP]`.

- 12 transformer encoder layers, hidden 768, 12 attention heads, intermediate FFN 3072.
- Single-segment model (`type_vocab_size = 1`).
- Classification head reads the `<s>` token (position 0): `dense(768→768) → tanh → out_proj(768→1)`. **No pooler.**
- Output: a single logit per pair. Apply `sigmoid` on the Swift side to get a relevance score in `[0, 1]`.

## Tokenizer

- **Class:** `XLMRobertaTokenizer` (SentencePiece-Unigram). Consumed in Swift via `swift-transformers`' `AutoTokenizer.from(modelFolder:)`, which dispatches to `UnigramTokenizer` for this `tokenizer_class`.
- **Files in this repo (under `tokenizer/`):** `tokenizer.json` (the fast-tokenizer file Swift consumes), `tokenizer_config.json`, `special_tokens_map.json`, `sentencepiece.bpe.model`. All four are required: `tokenizer.json` is the load path; the others are belt-and-braces.
- **Special tokens:**

  | Token | ID |
  |---|---|
  | `<s>` (BOS / CLS-equivalent) | 0 |
  | `<pad>` | 1 |
  | `</s>` (EOS / SEP-equivalent) | 2 |
  | `<unk>` | 3 |
  | `<mask>` | 250001 |

- **Padding:** right-side pad with `<pad>` (id 1).
- **Vocab size:** 250 002.
- **Max position embeddings:** 514 (= 512 max content tokens + `padding_idx + 1` offset).

## Paired-input template (must be constructed by the Swift consumer)

```
<s> {query} </s></s> {document} </s>
```

The doubled `</s></s>` separator is XLM-RoBERTa-specific (NOT the single BERT `[SEP]` you might expect from the encoder geometry). `swift-transformers` does **not** expose `encode(text:textPair:)` for the Unigram path, so the Swift consumer must concatenate the template string itself before calling `encode(text:)`. Do not pre-tokenize and concatenate token IDs; let the tokenizer handle the special-token IDs.

## Truncation policy

If the tokenized template exceeds the target sequence length `S`, truncate the **document side from the right**. Never truncate the query: query terms drive both lexical and semantic match in the cross-encoder. Reserve 4 token slots for the special tokens (`<s>`, `</s>`, `</s>`, `</s>`):

```
max_doc_tokens = S - len(query_tokens) - 4
```

If `max_doc_tokens <= 0`, the query alone fills the budget and no document tokens fit. Any score for such a pair is essentially noise, so the consumer should down-weight or skip the candidate at the orchestrator.
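
The budget rule above can be sketched in a few lines. This is an illustration only: `truncate_document` is a hypothetical helper, and the token lists are plain integers standing in for tokenizer output.

```python
SPECIAL_TOKEN_SLOTS = 4  # <s>, </s>, </s>, </s>

def truncate_document(query_tokens, doc_tokens, seq_len):
    """Keep the leftmost document tokens that fit into seq_len.

    Returns None when the query leaves no room for any document token,
    signalling that the candidate should be skipped or down-weighted.
    """
    max_doc_tokens = seq_len - len(query_tokens) - SPECIAL_TOKEN_SLOTS
    if max_doc_tokens <= 0:
        return None
    # Truncating "from the right" drops the tail of the document.
    return doc_tokens[:max_doc_tokens]

query = list(range(10))    # stand-in for 10 query token IDs
doc = list(range(200))     # stand-in for 200 document token IDs
kept = truncate_document(query, doc, seq_len=128)
print(len(kept), len(query) + len(kept) + SPECIAL_TOKEN_SLOTS)
```

The query list is never modified; only the document side shrinks.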

## Input tensors (Core ML)

Both variants share the same input shape contract: they are the same architecture (the ANE-friendly port) converted with different `compute_units`. The `(1, 1)` middle dims are constant on the cpuAndGPU path (no overhead) and required by ANE's BC1S layout on the ANE path.

| Name | Dtype | Shape | Notes |
|---|---|---|---|
| `input_ids` | `Int32` | `(20, 1, 1, S)` | `S ∈ {128, 256, 512}` via `EnumeratedShapes`. Token IDs in `[0, 250001]`. |
| `attention_mask` | `Int32` | `(20, 1, 1, S)` | `1` for real tokens, `0` for `<pad>`. |

There is **no `token_type_ids` input**: `type_vocab_size = 1`, so the token-type embedding is constant and folded internally.

Batch is fixed at 20 (sized for the consumer's typical post-RRF candidate pool). Smaller actual batches must be padded with `<pad>` rows on the Swift side; the corresponding `attention_mask` rows should be all-zeros. The classification head still emits 20 logits; the consumer reads the first `actual_batch` of them and discards the rest.
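
As a rough sketch of the padding contract, the snippet below builds `(20, 1, 1, S)` nested lists from a smaller set of already-templated rows. The helper names are hypothetical, and picking the smallest bucket that fits the longest row is an assumption; the card only requires that `S` be one of the enumerated shapes.

```python
BATCH = 20
SEQ_BUCKETS = (128, 256, 512)
PAD_ID = 1

def pick_seq_len(longest):
    """Smallest enumerated sequence length that fits (an assumed policy)."""
    for s in SEQ_BUCKETS:
        if longest <= s:
            return s
    raise ValueError("row exceeds 512 tokens; truncate before batching")

def build_batch(rows):
    """rows: 1..20 lists of token IDs, each already templated and truncated."""
    if not 0 < len(rows) <= BATCH:
        raise ValueError("expected between 1 and 20 rows")
    seq_len = pick_seq_len(max(len(r) for r in rows))
    input_ids, attention_mask = [], []
    for i in range(BATCH):
        row = rows[i] if i < len(rows) else []   # filler rows are all <pad>
        pad = seq_len - len(row)
        input_ids.append([[row + [PAD_ID] * pad]])               # (1, 1, S)
        attention_mask.append([[[1] * len(row) + [0] * pad]])    # (1, 1, S)
    return input_ids, attention_mask, seq_len

# One real row; the token IDs other than 0 and 2 are made up.
ids, mask, s = build_batch([[0, 5, 6, 2, 2, 7, 2]])
print(s, len(ids), len(ids[0][0][0]), sum(mask[0][0][0]), sum(mask[1][0][0]))
```

Rows 1 through 19 here carry only `<pad>` IDs with an all-zero mask, matching the contract for unused batch slots.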

## Output tensor

| Name | Dtype | Shape | Interpretation |
|---|---|---|---|
| `logit` | `Float32` | `(20, 1)` | Raw logit. Apply `sigmoid` to get relevance score in `[0, 1]`. |
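
A minimal sketch of the read-out step, assuming the `(20, 1)` output has already been converted to a nested Python list. `relevance_scores` is a hypothetical helper, and the logit values are invented for illustration.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def relevance_scores(logits, actual_batch):
    """Map the first actual_batch logits to [0, 1]; ignore the pad rows."""
    return [sigmoid(row[0]) for row in logits[:actual_batch]]

# 3 real pairs followed by 17 filler rows (all values are made up).
logits = [[2.0], [0.0], [-3.5]] + [[0.1]] * 17
scores = relevance_scores(logits, actual_batch=3)
print(len(scores), scores[1])
```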

## Position-ID computation (informational)

Position IDs inside the model are computed as:

```
position_ids[i] = (arange(S) + 2) * attention_mask + 1 * (1 - attention_mask)
```

i.e. real tokens get positions starting at 2 (= `pad_token_id + 1`), pad tokens get position 1 (= `pad_token_id`). This is bit-exact equivalent to HF's `create_position_ids_from_input_ids` when input is right-padded, and avoids `cumsum` (which doesn't lower cleanly to ANE). The Swift consumer **does not** pass position IDs as a model input.
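
The equivalence claim can be checked on a small right-padded example. The sketch below restates both formulas in plain Python; `position_ids_cumsum` is a paraphrase of the cumulative-sum approach rather than the library code itself, and the token IDs other than the special tokens are made up.

```python
PAD_ID = 1

def position_ids_closed_form(attention_mask):
    """(arange(S) + 2) * mask + 1 * (1 - mask), element by element."""
    return [(i + 2) * m + 1 * (1 - m) for i, m in enumerate(attention_mask)]

def position_ids_cumsum(input_ids, padding_idx=PAD_ID):
    """Cumulative count of non-pad tokens, zeroed on pads, plus padding_idx."""
    positions, running = [], 0
    for tok in input_ids:
        m = int(tok != padding_idx)
        running += m
        positions.append(running * m + padding_idx)
    return positions

# <s> q q </s></s> d </s> followed by three <pad> tokens.
input_ids = [0, 581, 7, 2, 2, 99, 2, 1, 1, 1]
attention_mask = [1] * 7 + [0] * 3

closed = position_ids_closed_form(attention_mask)
print(closed)
print(closed == position_ids_cumsum(input_ids))

# With S = 512 and no padding, the largest position is 513, inside the 514 rows.
print(max(position_ids_closed_form([1] * 512)))
```

The two only agree for right-padded input: with padding in the middle or on the left, the closed form would assign positions by absolute index rather than by count of real tokens.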

## Performance

Measured by `bench.py` on the maintainer's machine (recorded under `<variant>_provenance.json → machine`). 50 warmup + 100 timed iterations per cell. `per-pair p95 = p95 / batch`.

<!-- BENCH:ane -->
### Variant: `ane`

| batch | seq | p50 (ms) | p95 (ms) | per-pair p95 (ms) |
|------:|----:|---------:|---------:|------------------:|
| 1 | 128 | 50.45 | 52.47 | 52.47 |
| 4 | 128 | 50.34 | 51.63 | 12.91 |
| 10 | 128 | 50.53 | 51.95 | 5.19 |
| 20 | 128 | 51.24 | 52.46 | 2.62 |
| 1 | 256 | 127.76 | 128.99 | 128.99 |
| 4 | 256 | 128.50 | 129.16 | 32.29 |
| 10 | 256 | 129.70 | 131.15 | 13.12 |
| 20 | 256 | 129.46 | 130.74 | 6.54 |
| 1 | 512 | 344.20 | 346.74 | 346.74 |
| 4 | 512 | 343.03 | 346.89 | 86.72 |
| 10 | 512 | 343.46 | 345.43 | 34.54 |
| 20 | 512 | 346.01 | 348.64 | 17.43 |
<!-- /BENCH:ane -->

<!-- BENCH:cpugpu -->
### Variant: `cpugpu`

| batch | seq | p50 (ms) | p95 (ms) | per-pair p95 (ms) |
|------:|----:|---------:|---------:|------------------:|
| 1 | 128 | 122.79 | 123.10 | 123.10 |
| 4 | 128 | 123.01 | 123.34 | 30.83 |
| 10 | 128 | 123.13 | 123.46 | 12.35 |
| 20 | 128 | 122.69 | 138.34 | 6.92 |
| 1 | 256 | 242.07 | 242.87 | 242.87 |
| 4 | 256 | 241.94 | 242.82 | 60.70 |
| 10 | 256 | 242.10 | 243.17 | 24.32 |
| 20 | 256 | 242.16 | 243.11 | 12.16 |
| 1 | 512 | 503.81 | 504.98 | 504.98 |
| 4 | 512 | 503.97 | 506.10 | 126.53 |
| 10 | 512 | 503.95 | 504.87 | 50.49 |
| 20 | 512 | 504.06 | 504.82 | 25.24 |
<!-- /BENCH:cpugpu -->

**Pass criterion (ANE variant):** `p95(batch=20, seq=256) < 200 ms` AND `per-pair p95 < 15 ms`. Matches the consumer's reranker latency budget.
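
For reference, the pass criterion and the per-pair column reduce to the arithmetic below. `ane_latency_passes` is a hypothetical helper, not part of `bench.py`; the inputs are taken from the ANE table above.

```python
def ane_latency_passes(p95_ms, batch, p95_limit_ms=200.0, per_pair_limit_ms=15.0):
    """Return (passed, per_pair_p95_ms) for one benchmark cell."""
    per_pair = p95_ms / batch
    passed = p95_ms < p95_limit_ms and per_pair < per_pair_limit_ms
    return passed, per_pair

# ANE variant, batch=20, seq=256: p95 = 130.74 ms.
ok, per_pair = ane_latency_passes(130.74, 20)
print(ok, round(per_pair, 2))

# The seq=512 cell (p95 = 348.64 ms) would not meet the same thresholds.
ok_512, per_pair_512 = ane_latency_passes(348.64, 20)
print(ok_512, round(per_pair_512, 2))
```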

## Quality regression eval

Validates that the FP32 → FP16 + Core ML conversion preserved upstream behavior. Scored by `eval/quality_regression.py` against the [MTEB Reranking](https://huggingface.co/datasets?other=mteb&task_categories=task_categories%3Asentence-similarity) suite, the same benchmark family `BAAI/bge-reranker-base` is evaluated on. Pass criterion: `|Δ nDCG@10| < 0.005` per task vs the FP32 reference. Variant equivalence: scores apply to both `-ane` and `-cpugpu` (the FP16 weights inside each `.mlpackage` are bit-identical; only `compute_units` differs at load).

<!-- EVAL:reranking -->
### MTEB Reranking: FP32 reference vs Core ML FP16

_Variant equivalence: FP16 weights are bit-identical between `-ane` and `-cpugpu`; both inherit these numbers._

| Task | n queries | FP32 nDCG@10 | Core ML nDCG@10 | Δ nDCG@10 | FP32 MAP | Core ML MAP |
|---|---:|---:|---:|---:|---:|---:|
| scidocs-reranking | 3978 | 0.7410 | 0.7415 | +0.0005 | 0.6742 | 0.6743 |

**Pass criterion:** `|Δ nDCG@10| < 0.005` per task. FP32 baseline is `BAAI/bge-reranker-base` loaded with `attn_implementation="eager"`.

_Note on absolute scale:_ the nDCG@10 reported here (~0.74) reflects macro nDCG@10 over the test split's pre-ranked candidate pool (1 positive + ~29 negatives per query), which is structurally different from the full-corpus eval setup the BGE paper reports (~0.84). Δ vs the FP32 reference on the same setup is the meaningful regression signal; the absolute number is not directly comparable to the upstream paper.
<!-- /EVAL:reranking -->
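
The regression gate itself is a one-line comparison. The sketch below applies it to the scidocs-reranking row; `ndcg_regression_passes` is a hypothetical name and not the interface of `eval/quality_regression.py`.

```python
def ndcg_regression_passes(fp32_ndcg, coreml_ndcg, tol=0.005):
    """Return (passed, delta) where delta = Core ML score minus FP32 score."""
    delta = coreml_ndcg - fp32_ndcg
    return abs(delta) < tol, delta

ok, delta = ndcg_regression_passes(0.7410, 0.7415)
print(ok, f"{delta:+.4f}")
```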

## Failure modes the Swift consumer must handle

| Failure | Symptom | Recommended response |
|---|---|---|
| Download fails / hash mismatch | `Hub.snapshot` throws | Surface a one-line UI banner; reranker is bypassed; RRF order returned unchanged. |
| `MLModel` load fails (Intel Mac, missing ANE driver) | `MLModelLoadError` | Fall back to `-cpugpu` variant. If both fail, banner + RRF-only. |
| Per-query budget exceeded (>800 ms wall) | Cancel observed via `Task.cancel` | Return RRF order, log slow query. |
| Op fallback to CPU at runtime | Latency outliers in monitoring | Out of scope to detect from Swift; bench harness should catch this pre-publish via `verify_ane.py`. |
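
The load-failure rows above amount to an ordered fallback chain. The sketch below shows that control flow only, in Python for consistency with the other examples; the real consumer is Swift, and every name here (`load_with_fallback`, `rerank_or_passthrough`, the stub loaders) is hypothetical.

```python
def load_with_fallback(loaders):
    """Try each (tag, loader) in order; return (tag, model) or (None, None)."""
    for tag, load in loaders:
        try:
            return tag, load()
        except Exception as err:  # a real consumer would catch narrower errors
            print(f"load failed for {tag}: {err}")
    return None, None

def rerank_or_passthrough(model, rrf_order, score_fn):
    """Return the RRF order unchanged when no model could be loaded."""
    if model is None:
        return list(rrf_order)
    return sorted(rrf_order, key=score_fn, reverse=True)

def failing_ane_loader():
    raise RuntimeError("simulated ANE load failure")

def stub_cpugpu_loader():
    return "cpugpu-model"  # stand-in object for a loaded model

tag, model = load_with_fallback([
    ("v0.1-ane", failing_ane_loader),
    ("v0.1-cpugpu", stub_cpugpu_loader),
])
print(tag)

# With no usable model, the candidate order from RRF is returned as-is.
print(rerank_or_passthrough(None, ["d3", "d1", "d2"], score_fn=len))
```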

## Known limitations

- Apple Silicon only (`-ane` requires the Apple Neural Engine; Intel Macs must use `-cpugpu`).
- Fixed batch size 20. Smaller batches waste compute on pad rows; larger batches need a re-conversion.
- English-language reranking only (the upstream model is English; XLM-R's vocab supports more languages but the reranker has not been fine-tuned for them).
- FP16 internally on the ANE path: extreme inputs may show small numerical drift from the FP32 PyTorch reference. Tested within 1e-3 absolute tolerance on 16 fixed pairs; see `tests/test_numerical_equivalence.py`.

## References

- **Source model:** [`BAAI/bge-reranker-base`](https://huggingface.co/BAAI/bge-reranker-base)
- **BGE family papers:**
  - [C-Pack: Packed Resources For General Chinese Embeddings (Xiao et al., 2023)](https://arxiv.org/abs/2309.07597)
  - [Making Large Language Models A Better Foundation For Dense Retrieval (Li et al., 2023)](https://arxiv.org/abs/2312.15503)
- **Apple Neural Engine + Core ML conversion:**
  - [`apple/ml-ane-transformers`](https://github.com/apple/ml-ane-transformers): the reference primitives (LayerNormANE, Conv2d-projection MultiHeadAttention) we vendor for the ANE rewrite.
  - Apple Machine Learning Research: [Deploying Transformers on the Apple Neural Engine](https://machinelearning.apple.com/research/neural-engine-transformers).

## How to reproduce

```sh
git clone https://github.com/tcashel/juice-bge-reranker-coreml
cd juice-bge-reranker-coreml
pixi install
pixi run convert         # produces build/bge-reranker-base-{ane,cpugpu}.mlpackage
pixi run verify-ane build/bge-reranker-base-ane.mlpackage
pixi run bench --variants ane:build/bge-reranker-base-ane.mlpackage cpugpu:build/bge-reranker-base-cpugpu.mlpackage --update-model-card MODEL_CARD.md
pixi run test
```

Publishing requires `HUGGINGFACE_TOKEN` in env and `--confirm`:

```sh
pixi run python publish.py --variant both --tag v0.1 --confirm
```