# Changelog
All notable changes to this Core ML port are documented in this file.
## Unreleased

### Changed
- Switched the distribution layout to Hugging Face Hub delivery instead of GitHub Releases and SwiftPM binary targets.
- Standardized the published artifact layout to:
  - `Artifacts/stateless/zenz-stateless-fp16.mlpackage`
  - `Artifacts/stateless/zenz-stateless-8bit.mlpackage`
  - `Artifacts/stateful/zenz-stateful-fp16.mlpackage`
  - `Artifacts/stateful/zenz-stateful-8bit.mlpackage`
  - `tokenizer/*`
  - `hf_manifest.json`
- Kept the export strategy on a single stateful model instead of maintaining separate `prefill` and `decode` packages.
- Preserved the Core ML state layout as `keyCache`, `valueCache`, and `pastLen`.
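A minimal sketch of how a consumer might resolve artifact paths from `hf_manifest.json`. The manifest field names (`revision`, `artifacts`, the variant keys) are assumptions for illustration; the actual manifest schema may differ.

```python
import json

# Hypothetical hf_manifest.json contents; field names are illustrative,
# not a documented schema.
MANIFEST_TEXT = """
{
  "revision": "abc123",
  "artifacts": {
    "stateful-fp16": "Artifacts/stateful/zenz-stateful-fp16.mlpackage",
    "stateful-8bit": "Artifacts/stateful/zenz-stateful-8bit.mlpackage",
    "stateless-fp16": "Artifacts/stateless/zenz-stateless-fp16.mlpackage",
    "stateless-8bit": "Artifacts/stateless/zenz-stateless-8bit.mlpackage"
  }
}
"""

manifest = json.loads(MANIFEST_TEXT)

def artifact_path(variant: str) -> str:
    """Look up a published artifact path for a model variant."""
    return manifest["artifacts"][variant]

# artifact_path("stateful-fp16") -> "Artifacts/stateful/zenz-stateful-fp16.mlpackage"
```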
### Fixed
- Fixed a broken stateful cache-read path in the exporter. The previous exporter converted `pastLen` into a Python scalar during tracing, which froze the cache-slice logic and caused the generated Core ML model to behave like a broken last-token-only decode path.
- Fixed the stateful input contract so `attention_mask` can represent the total active context length instead of being tied to the current input chunk length.
- Fixed tokenizer special-token handling by treating `tokenizer.json` as the source of truth for `</s>`, `<s>`, `[PAD]`, and `[UNK]` instead of trusting the older `config.json` values alone.
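The "`tokenizer.json` as source of truth" rule can be sketched as follows, reading special tokens from the `added_tokens` section used by the Hugging Face tokenizers file format. The token IDs in the sample are made up for illustration and are not this model's real IDs.

```python
import json

# Illustrative tokenizer.json fragment using the Hugging Face tokenizers
# "added_tokens" layout; the IDs here are placeholders, not real values.
TOKENIZER_JSON = """
{
  "added_tokens": [
    {"id": 0, "content": "[PAD]", "special": true},
    {"id": 1, "content": "<s>",   "special": true},
    {"id": 2, "content": "</s>",  "special": true},
    {"id": 3, "content": "[UNK]", "special": true}
  ]
}
"""

def special_token_ids(tokenizer_json_text: str) -> dict:
    """Map special-token strings to IDs straight from tokenizer.json,
    rather than trusting potentially stale config.json metadata."""
    data = json.loads(tokenizer_json_text)
    return {
        tok["content"]: tok["id"]
        for tok in data["added_tokens"]
        if tok.get("special")
    }
```

Resolving EOS/BOS/PAD/UNK through this map keeps generation-stop logic aligned with the tokenizer assets actually shipped with the model.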
### Optimized
- Kept the single-model stateful path but optimized runtime usage around incremental decode. The intended call pattern is now:
  - one prompt prefill call with the full prompt
  - repeated one-token decode calls on the same Core ML state
- Reduced deployment churn by letting downstream app bootstrap logic cache compiled resources and invalidate them using the upstream Hugging Face commit SHA.
- Simplified downstream app integration by standardizing on bundled compiled resources with SHA-based cache invalidation instead of runtime model downloads.
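The prefill-then-decode call pattern can be sketched with a toy stand-in for the stateful model (plain Python, not the Core ML API; the class and method names are invented for illustration):

```python
class ToyStatefulModel:
    """Toy stand-in for a stateful Core ML model: one mutable state object
    plays the role of keyCache/valueCache/pastLen."""

    def __init__(self):
        self.past = []  # grows across prefill and every decode step

    def prefill(self, prompt_tokens):
        """One prefill call with the full prompt."""
        self.past.extend(prompt_tokens)
        return prompt_tokens[-1] + 1  # fake "next token" for the sketch

    def decode(self, token):
        """One-token decode step reusing the same state."""
        self.past.append(token)
        # Per the fixed input contract, attention_mask covers the total
        # active context length, not just the current one-token chunk.
        attention_mask = [1] * len(self.past)
        assert len(attention_mask) == len(self.past)
        return token + 1  # fake "next token"

model = ToyStatefulModel()
tok = model.prefill([10, 11, 12])  # single prefill with the full prompt
for _ in range(3):                 # repeated one-token decode calls
    tok = model.decode(tok)
```

The point of the pattern is that the prompt is processed exactly once; each decode step feeds only the newest token while the cache state carries the rest.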
### Compared With The Previous Stateful Export
- Previous behavior:
  - The app could emit a plausible first generated token and then quickly degrade into repeated invalid byte tokens rendered as `�`.
  - EOS handling could drift because token semantics were taken from stale config metadata instead of the tokenizer definition.
  - Stale compiled app caches could survive even after new HF artifacts were uploaded.
- Current behavior:
  - The exporter keeps the cache slice in tensor form so the traced graph can keep reading Core ML state dynamically.
  - The stateful model accepts a full-length `attention_mask` for decode steps, which matches GPT-2 cache usage more closely.
  - Tokenizer semantics are aligned to the actual tokenizer assets shipped with the model.
  - Downstream consumers can invalidate compiled caches automatically when the HF revision changes.
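The previous-vs-current exporter behavior can be illustrated with a toy tracer in plain Python (not coremltools or torch): collapsing the slice bound to a Python scalar at trace time bakes it into the "graph" as a constant, while keeping it symbolic lets every later call read the live `pastLen`.

```python
class Sym:
    """Toy symbolic value, resolved against runtime inputs on each call."""
    def __init__(self, name):
        self.name = name
    def resolve(self, env):
        return env[self.name]

def trace_frozen(sample_env):
    # Mimics the broken exporter: the bound is read as a Python scalar
    # during tracing, so the sample value is frozen into the traced fn.
    n = sample_env["past_len"]            # collapsed to a constant here
    return lambda env, cache: cache[:n]   # n never changes again

def trace_symbolic(sample_env):
    # Mimics the fixed exporter: the bound stays symbolic, so the
    # traced fn re-reads it from the runtime state every call.
    n = Sym("past_len")
    return lambda env, cache: cache[: n.resolve(env)]

cache = list(range(8))
frozen = trace_frozen({"past_len": 2})
dynamic = trace_symbolic({"past_len": 2})
frozen({"past_len": 5}, cache)   # -> [0, 1]  (stuck at the trace-time value)
dynamic({"past_len": 5}, cache)  # -> [0, 1, 2, 3, 4]
```

This is the same failure mode as the last-token-only decode path described above: the frozen variant keeps slicing the cache at the trace-time length no matter how much state has accumulated.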
### Notes For Consumers
- The tokenizer assets from the Hub are part of the runtime contract, not optional metadata.
- If a client app caches compiled models locally, it should invalidate that cache on HF revision changes.
- Benchmark numbers collected before this cache-handling fix should be treated as invalid for quality comparisons.
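A minimal sketch of SHA-based cache invalidation, assuming the manifest carries an upstream `revision` commit SHA and that compiled resources live in a cache directory next to a small marker file (both assumptions, not the app's actual layout):

```python
import json
import shutil
from pathlib import Path

def ensure_cache(cache_dir, manifest_text):
    """Drop the compiled-model cache when the HF revision changes.

    Returns True when the cache was (re)initialized and the caller should
    recompile or copy resources, False when the existing cache is current.
    """
    cache_dir = Path(cache_dir)
    revision = json.loads(manifest_text)["revision"]  # assumed field name
    marker = cache_dir / "revision.txt"

    if marker.exists() and marker.read_text() == revision:
        return False  # cache matches the upstream revision

    # Stale or missing: wipe and re-mark with the new revision SHA.
    shutil.rmtree(cache_dir, ignore_errors=True)
    cache_dir.mkdir(parents=True, exist_ok=True)
    marker.write_text(revision)
    return True
```

Calling this at app bootstrap makes the invalidation automatic: a new upstream commit SHA in the manifest forces one recompile, after which the cache is reused.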