zenz-coreml / CHANGELOG.md

Changelog

All notable changes to this Core ML port are documented in this file.

Unreleased

Changed

  • Switched the distribution layout to Hugging Face Hub delivery instead of GitHub Releases and SwiftPM binary targets.
  • Standardized the published artifact layout to:
    • Artifacts/stateless/zenz-stateless-fp16.mlpackage
    • Artifacts/stateless/zenz-stateless-8bit.mlpackage
    • Artifacts/stateful/zenz-stateful-fp16.mlpackage
    • Artifacts/stateful/zenz-stateful-8bit.mlpackage
    • tokenizer/*
    • hf_manifest.json
  • Kept the export strategy on a single stateful model instead of maintaining separate prefill and decode packages.
  • Preserved the Core ML state layout as keyCache, valueCache, and pastLen.
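The state layout above can be pictured with a small container type. This is a minimal pure-Python sketch of the contract, not the real implementation: the actual keyCache/valueCache/pastLen buffers live inside the MLState object that Core ML manages for the stateful model, and the shapes here are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class DecoderState:
    """Toy model of the state contract: keyCache, valueCache, pastLen."""
    max_context: int
    key_cache: list = field(default_factory=list)    # stands in for keyCache
    value_cache: list = field(default_factory=list)  # stands in for valueCache
    past_len: int = 0                                # stands in for pastLen

    def append(self, k, v) -> None:
        """Record one decoded step's key/value entries."""
        if self.past_len >= self.max_context:
            raise ValueError("context window exhausted")
        self.key_cache.append(k)
        self.value_cache.append(v)
        self.past_len += 1

    def reset(self) -> None:
        """Start a fresh generation (analogous to allocating a new state)."""
        self.key_cache.clear()
        self.value_cache.clear()
        self.past_len = 0

state = DecoderState(max_context=128)
state.append("k0", "v0")
state.append("k1", "v1")
print(state.past_len)  # 2
```

The point of the single shared state is that prefill and decode calls mutate the same container rather than shuttling caches through model inputs and outputs.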

Fixed

  • Fixed a broken stateful cache-read path in the exporter. The previous exporter converted pastLen into a Python scalar during tracing, which froze the cache-slice length into the traced graph and caused the generated Core ML model to ignore accumulated state, behaving like a last-token-only decode path.
  • Fixed the stateful input contract so attention_mask can represent the total active context length instead of being tied to the current input chunk length.
  • Fixed tokenizer special-token handling by treating tokenizer.json as the source of truth for </s>, <s>, [PAD], and [UNK] instead of trusting the older config.json values alone.
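The tokenizer fix can be sketched as a merge rule: tokenizer.json's `added_tokens` entries win, and config.json's `*_token_id` fields only fill tokens the tokenizer does not declare. The exact field names read here are the standard ones, but the function and toy inputs are illustrative, not the shipped code.

```python
def resolve_special_tokens(tokenizer_json: dict, config_json: dict) -> dict:
    """Prefer tokenizer.json's added_tokens over config.json's token ids."""
    # Start from the (possibly stale) config.json values.
    resolved = {
        "</s>": config_json.get("eos_token_id"),
        "<s>": config_json.get("bos_token_id"),
        "[PAD]": config_json.get("pad_token_id"),
        "[UNK]": config_json.get("unk_token_id"),
    }
    # Override with whatever tokenizer.json actually declares.
    for entry in tokenizer_json.get("added_tokens", []):
        if entry.get("special") and entry["content"] in resolved:
            resolved[entry["content"]] = entry["id"]
    return resolved

# Toy inputs: config.json carries a stale EOS id; tokenizer.json wins.
tokenizer_json = {"added_tokens": [
    {"id": 2, "content": "</s>", "special": True},
    {"id": 1, "content": "<s>", "special": True},
]}
config_json = {"eos_token_id": 50256, "bos_token_id": 1, "pad_token_id": 0}
print(resolve_special_tokens(tokenizer_json, config_json))
# {'</s>': 2, '<s>': 1, '[PAD]': 0, '[UNK]': None}
```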

Optimized

  • Kept the single-model stateful path but optimized runtime usage around incremental decode. The intended call pattern is now:
    • one prompt prefill call with the full prompt
    • repeated one-token decode calls on the same Core ML state
  • Reduced deployment churn: downstream app bootstrap logic can cache compiled resources and invalidate them when the upstream Hugging Face commit SHA changes.
  • Simplified downstream app integration by standardizing on bundled compiled resources with SHA-based cache invalidation instead of runtime model downloads.
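The intended prefill-then-decode call pattern, together with the fixed attention_mask contract, can be simulated in plain Python. Everything here is a stand-in: `predict`, the input names, and the toy next-token rule do not reflect the real Core ML interface; the sketch only demonstrates that the mask must cover the total active context, not just the current chunk.

```python
class ToyStatefulDecoder:
    """Stand-in for the stateful model: checks the mask covers the
    total active context (pastLen + current chunk), per the fixed contract."""

    def __init__(self):
        self.past_len = 0  # stands in for pastLen inside the Core ML state

    def predict(self, input_ids, attention_mask):
        expected = self.past_len + len(input_ids)
        assert len(attention_mask) == expected, (
            f"mask must cover total context ({expected}), "
            f"got {len(attention_mask)}"
        )
        self.past_len = expected
        # Toy next-token rule: return the context length so far.
        return self.past_len

def generate(model, prompt_ids, steps):
    # One prefill call with the full prompt...
    token = model.predict(prompt_ids, [1] * len(prompt_ids))
    out = [token]
    # ...then repeated one-token decode calls against the same state,
    # growing the mask to cover prompt + generated tokens.
    for _ in range(steps - 1):
        token = model.predict([token], [1] * (model.past_len + 1))
        out.append(token)
    return out

model = ToyStatefulDecoder()
print(generate(model, [10, 11, 12], steps=3))  # [3, 4, 5]
```

Under the old chunk-length contract, the decode calls would have passed a mask of length 1 and the assertion above would fail, which is the shape mismatch the fix removes.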

Compared With The Previous Stateful Export

  • Previous behavior:
    • The app could emit a plausible first generated token and then quickly degrade into repeated invalid byte tokens rendered as the replacement character (�).
    • EOS handling could drift because token semantics were taken from stale config metadata instead of the tokenizer definition.
    • Stale compiled app caches could survive even after new HF artifacts were uploaded.
  • Current behavior:
    • The exporter keeps the cache slice in tensor form so the traced graph can keep reading Core ML state dynamically.
    • The stateful model accepts a full-length attention_mask for decode steps, which matches GPT-2 cache usage more closely.
    • Tokenizer semantics are aligned to the actual tokenizer assets shipped with the model.
    • Downstream consumers can invalidate compiled caches automatically when the HF revision changes.

Notes For Consumers

  • The tokenizer assets from the Hub are part of the runtime contract, not optional metadata.
  • If a client app caches compiled models locally, it should invalidate that cache on HF revision changes.
  • Benchmark numbers collected before this cache-handling fix should be treated as invalid for quality comparisons.
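The cache-invalidation rule for consumers can be sketched as a small bootstrap check. The marker-file name and on-disk layout below are assumptions for illustration, not part of the published contract; the only rule carried over from the notes above is that a compiled cache is valid only while its recorded HF commit SHA matches the current revision.

```python
import tempfile
from pathlib import Path

REVISION_FILE = "cached_revision.txt"  # hypothetical marker file name

def ensure_compiled_cache(cache_dir: Path, hub_revision: str) -> bool:
    """Return True if the compiled cache was (re)built for hub_revision."""
    marker = cache_dir / REVISION_FILE
    if marker.exists() and marker.read_text().strip() == hub_revision:
        return False  # cache is current; nothing to do
    cache_dir.mkdir(parents=True, exist_ok=True)
    # ... recompile the .mlpackage artifacts into the cache here ...
    marker.write_text(hub_revision)
    return True

with tempfile.TemporaryDirectory() as d:
    cache = Path(d)
    print(ensure_compiled_cache(cache, "rev-a"))  # True  (first build)
    print(ensure_compiled_cache(cache, "rev-a"))  # False (SHA unchanged)
    print(ensure_compiled_cache(cache, "rev-b"))  # True  (new revision)
```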