# Changelog
All notable changes to this Core ML port are documented in this file.
## Unreleased

### Changed
- Switched the distribution layout to Hugging Face Hub delivery instead of GitHub Releases and SwiftPM binary targets.
- Standardized the published artifact layout to:
  - `Artifacts/stateless/zenz-stateless-fp16.mlpackage`
  - `Artifacts/stateless/zenz-stateless-8bit.mlpackage`
  - `Artifacts/stateful/zenz-stateful-fp16.mlpackage`
  - `Artifacts/stateful/zenz-stateful-8bit.mlpackage`
  - `tokenizer/*`
  - `hf_manifest.json`
- Kept the export strategy on a single stateful model instead of maintaining separate `prefill` and `decode` packages.
- Preserved the Core ML state layout as `keyCache`, `valueCache`, and `pastLen`.
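A minimal sketch of how a consumer might resolve artifact paths from `hf_manifest.json`. The manifest field names (`revision`, `artifacts`, the variant keys) are assumptions for illustration; the actual manifest schema may differ.

```python
import json

# Hypothetical hf_manifest.json contents; field names are illustrative,
# not a documented schema.
MANIFEST_TEXT = """
{
  "revision": "abc123",
  "artifacts": {
    "stateful-fp16": "Artifacts/stateful/zenz-stateful-fp16.mlpackage",
    "stateful-8bit": "Artifacts/stateful/zenz-stateful-8bit.mlpackage",
    "stateless-fp16": "Artifacts/stateless/zenz-stateless-fp16.mlpackage",
    "stateless-8bit": "Artifacts/stateless/zenz-stateless-8bit.mlpackage"
  }
}
"""

manifest = json.loads(MANIFEST_TEXT)

def artifact_path(variant: str) -> str:
    """Look up a published artifact path for a model variant."""
    return manifest["artifacts"][variant]

# artifact_path("stateful-fp16") -> "Artifacts/stateful/zenz-stateful-fp16.mlpackage"
```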
### Fixed
- Fixed a broken stateful cache-read path in the exporter. The previous exporter converted `pastLen` into a Python scalar during tracing, which froze the cache-slice logic and caused the generated Core ML model to behave like a broken last-token-only decode path.
- Fixed the stateful input contract so `attention_mask` can represent the total active context length instead of being tied to the current input chunk length.
- Fixed tokenizer special-token handling by treating `tokenizer.json` as the source of truth for `</s>`, `<s>`, `[PAD]`, and `[UNK]` instead of trusting the older `config.json` values alone.
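The "`tokenizer.json` as source of truth" rule can be sketched as follows, reading special tokens from the `added_tokens` section used by the Hugging Face tokenizers file format. The token IDs in the sample are made up for illustration and are not this model's real IDs.

```python
import json

# Illustrative tokenizer.json fragment using the Hugging Face tokenizers
# "added_tokens" layout; the IDs here are placeholders, not real values.
TOKENIZER_JSON = """
{
  "added_tokens": [
    {"id": 0, "content": "[PAD]", "special": true},
    {"id": 1, "content": "<s>",   "special": true},
    {"id": 2, "content": "</s>",  "special": true},
    {"id": 3, "content": "[UNK]", "special": true}
  ]
}
"""

def special_token_ids(tokenizer_json_text: str) -> dict:
    """Map special-token strings to IDs straight from tokenizer.json,
    rather than trusting potentially stale config.json metadata."""
    data = json.loads(tokenizer_json_text)
    return {
        tok["content"]: tok["id"]
        for tok in data["added_tokens"]
        if tok.get("special")
    }
```

Resolving EOS/BOS/PAD/UNK through this map keeps generation-stop logic aligned with the tokenizer assets actually shipped with the model.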
### Optimized
- Kept the single-model stateful path but optimized runtime usage around incremental decode. The intended call pattern is now:
  - one prompt prefill call with the full prompt
  - repeated one-token decode calls on the same Core ML state
- Reduced deployment churn by letting downstream app bootstrap logic cache compiled resources and invalidate them using the upstream Hugging Face commit SHA.
- Simplified downstream app integration by standardizing on bundled compiled resources with SHA-based cache invalidation instead of runtime model downloads.
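The prefill-then-decode call pattern can be sketched with a toy stand-in for the stateful model (plain Python, not the Core ML API; the class and method names are invented for illustration):

```python
class ToyStatefulModel:
    """Toy stand-in for a stateful Core ML model: one mutable state object
    plays the role of keyCache/valueCache/pastLen."""

    def __init__(self):
        self.past = []  # grows across prefill and every decode step

    def prefill(self, prompt_tokens):
        """One prefill call with the full prompt."""
        self.past.extend(prompt_tokens)
        return prompt_tokens[-1] + 1  # fake "next token" for the sketch

    def decode(self, token):
        """One-token decode step reusing the same state."""
        self.past.append(token)
        # Per the fixed input contract, attention_mask covers the total
        # active context length, not just the current one-token chunk.
        attention_mask = [1] * len(self.past)
        assert len(attention_mask) == len(self.past)
        return token + 1  # fake "next token"

model = ToyStatefulModel()
tok = model.prefill([10, 11, 12])  # single prefill with the full prompt
for _ in range(3):                 # repeated one-token decode calls
    tok = model.decode(tok)
```

The point of the pattern is that the prompt is processed exactly once; each decode step feeds only the newest token while the cache state carries the rest.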
### Compared With The Previous Stateful Export
- Previous behavior:
  - The app could emit a plausible first generated token and then quickly degrade into repeated invalid byte tokens rendered as `�`.
  - EOS handling could drift because token semantics were taken from stale config metadata instead of the tokenizer definition.
  - Stale compiled app caches could survive even after new HF artifacts were uploaded.
- Current behavior:
  - The exporter keeps the cache slice in tensor form so the traced graph can keep reading Core ML state dynamically.
  - The stateful model accepts a full-length `attention_mask` for decode steps, which matches GPT-2 cache usage more closely.
  - Tokenizer semantics are aligned to the actual tokenizer assets shipped with the model.
  - Downstream consumers can invalidate compiled caches automatically when the HF revision changes.
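The previous-vs-current exporter behavior can be illustrated with a toy tracer in plain Python (not coremltools or torch): collapsing the slice bound to a Python scalar at trace time bakes it into the "graph" as a constant, while keeping it symbolic lets every later call read the live `pastLen`.

```python
class Sym:
    """Toy symbolic value, resolved against runtime inputs on each call."""
    def __init__(self, name):
        self.name = name
    def resolve(self, env):
        return env[self.name]

def trace_frozen(sample_env):
    # Mimics the broken exporter: the bound is read as a Python scalar
    # during tracing, so the sample value is frozen into the traced fn.
    n = sample_env["past_len"]            # collapsed to a constant here
    return lambda env, cache: cache[:n]   # n never changes again

def trace_symbolic(sample_env):
    # Mimics the fixed exporter: the bound stays symbolic, so the
    # traced fn re-reads it from the runtime state every call.
    n = Sym("past_len")
    return lambda env, cache: cache[: n.resolve(env)]

cache = list(range(8))
frozen = trace_frozen({"past_len": 2})
dynamic = trace_symbolic({"past_len": 2})
frozen({"past_len": 5}, cache)   # -> [0, 1]  (stuck at the trace-time value)
dynamic({"past_len": 5}, cache)  # -> [0, 1, 2, 3, 4]
```

This is the same failure mode as the last-token-only decode path described above: the frozen variant keeps slicing the cache at the trace-time length no matter how much state has accumulated.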
### Notes For Consumers
- The tokenizer assets from the Hub are part of the runtime contract, not optional metadata.
- If a client app caches compiled models locally, it should invalidate that cache on HF revision changes.
- Benchmark numbers collected before this cache-handling fix should be treated as invalid for quality comparisons.
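A minimal sketch of SHA-based cache invalidation, assuming the manifest carries an upstream `revision` commit SHA and that compiled resources live in a cache directory next to a small marker file (both assumptions, not the app's actual layout):

```python
import json
import shutil
from pathlib import Path

def ensure_cache(cache_dir, manifest_text):
    """Drop the compiled-model cache when the HF revision changes.

    Returns True when the cache was (re)initialized and the caller should
    recompile or copy resources, False when the existing cache is current.
    """
    cache_dir = Path(cache_dir)
    revision = json.loads(manifest_text)["revision"]  # assumed field name
    marker = cache_dir / "revision.txt"

    if marker.exists() and marker.read_text() == revision:
        return False  # cache matches the upstream revision

    # Stale or missing: wipe and re-mark with the new revision SHA.
    shutil.rmtree(cache_dir, ignore_errors=True)
    cache_dir.mkdir(parents=True, exist_ok=True)
    marker.write_text(revision)
    return True
```

Calling this at app bootstrap makes the invalidation automatic: a new upstream commit SHA in the manifest forces one recompile, after which the cache is reused.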