shreyask/qmd-query-expansion-1.7B-ONNX-kvcache-fp32

qmd-query-expansion-1.7B ONNX KV Cache FP32

This repo contains a CPU-exported ONNX version of tobil/qmd-query-expansion-1.7B with flattened KV cache inputs and outputs for decoder generation.
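
To illustrate what "flattened" means here: a decoder cache is usually a nested, per-layer list of (key, value) pairs, while an ONNX graph takes a flat set of named tensors. The sketch below shows that mapping in plain Python. The `past_key_values.{i}.key` / `present.{i}.key` naming is an assumption borrowed from common Optimum-style exports, not something this card confirms; inspect onnx/model.onnx for the real input and output names.

```python
from typing import Any, Dict, List, Tuple


def flatten_kv(cache: List[Tuple[Any, Any]],
               prefix: str = "past_key_values") -> Dict[str, Any]:
    """Turn [(key, value), ...] per layer into a flat name -> tensor dict."""
    flat: Dict[str, Any] = {}
    for layer, (key, value) in enumerate(cache):
        # Naming scheme is an assumption; adjust to the exported graph.
        flat[f"{prefix}.{layer}.key"] = key
        flat[f"{prefix}.{layer}.value"] = value
    return flat


def unflatten_kv(flat: Dict[str, Any], num_layers: int,
                 prefix: str = "present") -> List[Tuple[Any, Any]]:
    """Rebuild the per-layer list from flat graph outputs."""
    return [
        (flat[f"{prefix}.{i}.key"], flat[f"{prefix}.{i}.value"])
        for i in range(num_layers)
    ]
```

In a generation loop, the flat outputs of one step would be renamed and fed back as the flat cache inputs of the next step.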

What is included:

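Tokenizer and config files at the repo root
onnx/model.onnx (the graph)
onnx/model.onnx.data (external weight data for model.onnx; keep both files in the same directory)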

Status:

KV-cache ONNX export is validated in ONNX Runtime
Variable seq_len and past_len work, including past_len=0
FP32 only
Q4 export is still pending: the current MatMulNBitsQuantizer path does not yet rewrite this graph

Validation summary:

Exported with torch.onnx.export(..., dynamo=True)
Verified with ONNX Runtime for multiple (seq_len, past_len) pairs:
- (2, 8)
- (1, 6)
- (3, 5)
- (1, 0)
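
As a rough guide to what these pairs exercise, the sketch below computes the tensor shapes one would typically expect for each (seq_len, past_len). It assumes the usual decoder layout: the attention mask spans past plus current tokens, and each cache tensor is [batch, kv_heads, length, head_dim]. The `num_kv_heads` and `head_dim` defaults are placeholder values for illustration, not values read from this repo; take the real ones from config.json.

```python
from typing import Dict, Tuple

# The (seq_len, past_len) pairs listed in the validation summary.
PAIRS = [(2, 8), (1, 6), (3, 5), (1, 0)]


def expected_shapes(seq_len: int, past_len: int, batch: int = 1,
                    num_kv_heads: int = 8,   # placeholder, see config.json
                    head_dim: int = 128      # placeholder, see config.json
                    ) -> Dict[str, Tuple[int, ...]]:
    """Shapes for one forward pass under the assumed layout."""
    total = past_len + seq_len
    return {
        "input_ids": (batch, seq_len),
        "attention_mask": (batch, total),
        # One such tensor per layer for keys and one for values.
        "past_kv": (batch, num_kv_heads, past_len, head_dim),
        "present_kv": (batch, num_kv_heads, total, head_dim),
    }


for seq_len, past_len in PAIRS:
    print((seq_len, past_len), expected_shapes(seq_len, past_len))
```

The (1, 0) pair is the interesting edge case: the past tensors have a zero-length sequence axis, which is what the first step of generation looks like when no cache exists yet.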

Intended use:

This repo is for testing Transformers.js generation with native KV cache support
It is not the final quantized deployment artifact
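
For context on why variable seq_len and past_len matter for generation, here is a minimal sketch of the (seq_len, past_len) sequence a cached decode loop would send to the model. It is a generic illustration of KV-cache decoding, not the Transformers.js implementation; `decode_schedule` is a hypothetical helper.

```python
from typing import Iterator, Tuple


def decode_schedule(prompt_len: int,
                    max_new_tokens: int) -> Iterator[Tuple[int, int]]:
    """Yield (seq_len, past_len) for each forward pass of cached generation."""
    if prompt_len < 1:
        raise ValueError("prompt_len must be at least 1")
    if max_new_tokens < 1:
        return
    # Prefill: the whole prompt goes in at once with an empty cache.
    yield prompt_len, 0
    past_len = prompt_len
    # Each later pass feeds only the newest token; the cache grows by one.
    for _ in range(max_new_tokens - 1):
        yield 1, past_len
        past_len += 1


print(list(decode_schedule(prompt_len=5, max_new_tokens=3)))
```

The first pass uses past_len=0 and every later pass uses seq_len=1, so a graph that handles both covers the whole loop.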
