qmd-query-expansion-1.7B ONNX KV Cache FP32

This repo contains a CPU-exported ONNX version of tobil/qmd-query-expansion-1.7B with flattened KV cache inputs and outputs for decoder generation.
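As a rough illustration of what "flattened" means here, the sketch below turns a nested per-layer cache into one flat mapping with a named entry per tensor. The `past_key_values.{i}.key` / `past_key_values.{i}.value` naming is an assumption borrowed from a common export convention; the actual input names in onnx/model.onnx may differ and should be read from the graph.

```python
from typing import Dict, List, Tuple, TypeVar

T = TypeVar("T")


def flatten_kv(cache: List[Tuple[T, T]], prefix: str = "past_key_values") -> Dict[str, T]:
    """Turn [(key_0, value_0), (key_1, value_1), ...] into a flat name -> tensor dict.

    The name pattern is hypothetical; check the real graph inputs before relying on it.
    """
    flat: Dict[str, T] = {}
    for layer_idx, (key, value) in enumerate(cache):
        flat[f"{prefix}.{layer_idx}.key"] = key
        flat[f"{prefix}.{layer_idx}.value"] = value
    return flat


# Strings stand in for tensors so the sketch runs without any ML dependency.
example = flatten_kv([("k0", "v0"), ("k1", "v1")])
for name in example:
    print(name)
```

A flat mapping like this is convenient because ONNX graphs take a fixed list of named tensors rather than nested Python structures.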

What is included:

  • Root tokenizer and config files
  • onnx/model.onnx
  • onnx/model.onnx.data

Status:

  • KV-cache ONNX export is validated in ONNX Runtime
  • Variable seq_len and past_len work, including past_len=0
  • FP32 only
  • Q4 export is pending: the current MatMulNBitsQuantizer path does not yet rewrite this graph
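To make the past_len=0 case concrete, here is a minimal sketch of the input shapes one decoder step might expect. Everything in it is an assumption for illustration: the input names, the (batch, kv_heads, past_len, head_dim) cache layout, an attention mask that spans past plus current tokens, and the small placeholder values for layer count, head count, and head size. The real values belong to the model config and the exported graph.

```python
from typing import Dict, Tuple

Shape = Tuple[int, ...]


def step_input_shapes(
    seq_len: int,
    past_len: int,
    batch: int = 1,
    num_layers: int = 2,     # placeholder, not the real model depth
    num_kv_heads: int = 2,   # placeholder
    head_dim: int = 4,       # placeholder
) -> Dict[str, Shape]:
    """Return hypothetical input shapes for one decoder step."""
    shapes: Dict[str, Shape] = {
        "input_ids": (batch, seq_len),
        # Assumed convention: the mask covers cached tokens plus new tokens.
        "attention_mask": (batch, past_len + seq_len),
    }
    kv_shape = (batch, num_kv_heads, past_len, head_dim)
    for i in range(num_layers):
        shapes[f"past_key_values.{i}.key"] = kv_shape
        shapes[f"past_key_values.{i}.value"] = kv_shape
    return shapes


# First step with an empty cache: the past axis has length zero.
first_step = step_input_shapes(seq_len=1, past_len=0)
print(first_step["past_key_values.0.key"])
```

With past_len=0 the cache tensors still exist but have a zero-length past axis, which lets the same graph serve both the first step and later steps.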

Validation summary:

  • Exported with torch.onnx.export(..., dynamo=True)
  • Verified with ONNX Runtime for multiple (seq_len, past_len) pairs:
    • (2, 8)
    • (1, 6)
    • (3, 5)
    • (1, 0)
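A check over these pairs could look roughly like the sketch below. It only encodes the general KV-cache expectation that the returned cache is the past cache extended by the new tokens, and it assumes logits shaped (batch, seq_len, vocab) and a cache whose length sits on axis 2. The function name and the shapes passed in are hypothetical; this is not the script used to validate the export.

```python
from typing import List, Tuple

# The (seq_len, past_len) pairs listed above.
VALIDATED_PAIRS: List[Tuple[int, int]] = [(2, 8), (1, 6), (3, 5), (1, 0)]


def check_step(
    seq_len: int,
    past_len: int,
    logits_shape: Tuple[int, ...],
    present_key_shape: Tuple[int, ...],
) -> List[str]:
    """Return a list of shape problems for one step (empty means OK)."""
    problems: List[str] = []
    if logits_shape[1] != seq_len:
        problems.append(f"logits cover {logits_shape[1]} positions, expected {seq_len}")
    # Assumed layout: cache length is on axis 2.
    expected_len = past_len + seq_len
    if present_key_shape[2] != expected_len:
        problems.append(f"present length {present_key_shape[2]}, expected {expected_len}")
    return problems


for seq_len, past_len in VALIDATED_PAIRS:
    # Made-up output shapes that a correct export would be expected to produce.
    fake_logits = (1, seq_len, 100)
    fake_present = (1, 2, past_len + seq_len, 4)
    print((seq_len, past_len), check_step(seq_len, past_len, fake_logits, fake_present))
```

In a real check the shapes would come from the arrays returned by an ONNX Runtime session rather than from made-up tuples.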

Intended use:

  • This repo is for testing Transformers.js generation with native KV cache support
  • It is not the final quantized deployment artifact