qmd-query-expansion-1.7B ONNX KV Cache FP32
This repo contains a CPU-exported ONNX version of tobil/qmd-query-expansion-1.7B with flattened KV cache inputs and outputs for decoder generation.
What is included:
- Root tokenizer and config files
- `onnx/model.onnx`
- `onnx/model.onnx.data`
Status:
- KV-cache ONNX export is validated in ONNX Runtime
- Variable `seq_len` and `past_len` work, including `past_len=0`
- FP32 only
- Q4 export is still pending because the current `MatMulNBitsQuantizer` path does not rewrite this graph yet
Validation summary:
- Exported with `torch.onnx.export(..., dynamo=True)`
- Verified with ONNX Runtime for multiple `(seq_len, past_len)` pairs: `(2, 8)`, `(1, 6)`, `(3, 5)`, `(1, 0)`
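
To make the validated `(seq_len, past_len)` pairs concrete, here is a minimal sketch of the input shapes each pair would imply for a decoder with flattened KV cache inputs. Several things are assumptions and not stated in this README: the input names (`input_ids`, `attention_mask`, `past_key_values.{i}.key` / `.value`), the cache layout `(batch, num_kv_heads, past_len, head_dim)`, and the helper name `kv_feed_shapes`. The layer and head counts in the demo are placeholders; the real values should come from the config file in this repo.

```python
from typing import Dict, Tuple

Shape = Tuple[int, ...]


def kv_feed_shapes(
    seq_len: int,
    past_len: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    batch: int = 1,
) -> Dict[str, Shape]:
    """Return the expected shape of each input for one decoder call.

    The input names and the cache layout are ASSUMPTIONS for illustration;
    check the exported graph's actual input names before relying on them.
    """
    if seq_len < 1 or past_len < 0:
        raise ValueError("seq_len must be >= 1 and past_len must be >= 0")
    shapes: Dict[str, Shape] = {
        "input_ids": (batch, seq_len),
        # The mask covers the cached tokens plus the new tokens.
        "attention_mask": (batch, past_len + seq_len),
    }
    for i in range(num_layers):
        for kind in ("key", "value"):
            # past_len == 0 yields an empty cache tensor, as on a first call.
            shapes[f"past_key_values.{i}.{kind}"] = (
                batch,
                num_kv_heads,
                past_len,
                head_dim,
            )
    return shapes


# The (seq_len, past_len) pairs listed in the validation summary.
VALIDATED_PAIRS = [(2, 8), (1, 6), (3, 5), (1, 0)]

if __name__ == "__main__":
    # Placeholder dimensions, NOT the real model's; read them from the config.
    for seq_len, past_len in VALIDATED_PAIRS:
        shapes = kv_feed_shapes(
            seq_len, past_len, num_layers=2, num_kv_heads=4, head_dim=8
        )
        print((seq_len, past_len), shapes["past_key_values.0.key"])
```

If the assumed names match the exported graph, zero-filled arrays of these shapes could serve as a smoke-test feed for an ONNX Runtime `InferenceSession.run` call.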
Intended use:
- This repo is for testing Transformers.js generation with native KV cache support
- It is not the final quantized deployment artifact
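
For the generation use case named under "Intended use", the sketch below shows how `seq_len` and `past_len` typically evolve across decoder calls when a KV cache is used: one prefill call over the whole prompt with an empty cache, then single-token calls against a growing cache. This describes the general KV-cache decoding pattern, not verified Transformers.js internals, and `decode_schedule` is a hypothetical helper name.

```python
from typing import Iterator, Tuple


def decode_schedule(prompt_len: int, max_new_tokens: int) -> Iterator[Tuple[int, int]]:
    """Yield (seq_len, past_len) for each decoder call during generation."""
    if prompt_len < 1 or max_new_tokens < 1:
        raise ValueError("prompt_len and max_new_tokens must both be >= 1")
    # Prefill: the whole prompt is fed at once with an empty cache.
    yield (prompt_len, 0)
    past_len = prompt_len
    # The prefill call already produces the first new token, so only
    # max_new_tokens - 1 further single-token calls are needed.
    for _ in range(max_new_tokens - 1):
        yield (1, past_len)
        past_len += 1


if __name__ == "__main__":
    for seq_len, past_len in decode_schedule(prompt_len=5, max_new_tokens=4):
        print(seq_len, past_len)
```

Under this pattern, a graph that accepts `past_len=0` and variable `seq_len` can serve both the prefill and the decode steps.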