Flan-T5-Small INT8 β€” ONNX Quantized

ONNX INT8 quantized version of google/flan-t5-small for efficient text embeddings via encoder representations.

Model Details

Property        Value
Base Model      google/flan-t5-small
Format          ONNX
Quantization    INT8 (dynamic quantization)
Parameters      ~60M
Quantized by    JustEmbed

What is this?

This is a quantized ONNX export of Flan-T5-Small, an instruction-finetuned version of T5-Small by Google. The encoder is used to generate text embeddings. The INT8 quantization reduces model size and improves inference speed.

Flan-T5 was instruction-finetuned on over 1,000 tasks, making its encoder representations broadly useful for diverse text understanding tasks.

Use Cases

  • General-purpose text embeddings
  • Instruction-aware text similarity
  • Cross-task text retrieval
  • Lightweight embedding model for resource-constrained environments
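Embeddings produced for these use cases are typically compared with cosine similarity. A minimal, dependency-free sketch (the `cosine` helper below is illustrative, not part of any library shipped with this model):

```python
import math

def cosine(a, b):
    # cosine similarity between two equal-length embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# vectors pointing the same way score 1.0, orthogonal vectors 0.0
print(cosine([1.0, 0.0], [2.0, 0.0]))  # 1.0
print(cosine([1.0, 0.0], [0.0, 5.0]))  # 0.0
```

For retrieval, rank candidate texts by their cosine score against the query embedding.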

Files

  • model.onnx β€” INT8 quantized ONNX model
  • tokenizer.json β€” Fast tokenizer
  • config.json β€” Model configuration

Usage with JustEmbed

from justembed import Embedder

embedder = Embedder("flan-t5-small-int8")
vectors = embedder.embed(["Summarize the key findings of this study"])

Usage with ONNX Runtime

import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(".")
session = ort.InferenceSession("model.onnx")

inputs = tokenizer("Summarize the key findings", return_tensors="np")
outputs = session.run(None, dict(inputs))
last_hidden_state = outputs[0]  # per-token encoder representations
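The encoder returns one hidden-state vector per token; a single sentence embedding is commonly obtained by mean-pooling those vectors over non-padding positions. A toy sketch of that pooling step, with plain Python lists standing in for the real numpy arrays:

```python
def mean_pool(hidden_states, attention_mask):
    # hidden_states: [seq_len][dim] token vectors
    # attention_mask: [seq_len] of 1 (real token) / 0 (padding)
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vec, mask in zip(hidden_states, attention_mask):
        if mask:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    return [t / count for t in total]

# the padding vector [9.0, 9.0] is excluded from the average
print(mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0]))  # [2.0, 3.0]
```

With the ONNX Runtime snippet above, you would apply this pooling to the first session output using the tokenizer's `attention_mask`.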

Quantization Details

  • Method: Dynamic INT8 quantization via ONNX Runtime
  • Source: Original PyTorch weights converted to ONNX, then quantized
  • Speed: ~2-3x faster inference than FP32
  • Size: ~4x smaller than FP32
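Dynamic quantization stores each weight tensor as INT8 plus a floating-point scale and dequantizes on the fly during inference, which is where the size and speed gains come from; activations remain in floating point. A simplified per-tensor symmetric scheme (illustrative only; ONNX Runtime's actual implementation differs in detail):

```python
def quantize_int8(weights):
    # per-tensor symmetric scale: map the largest magnitude to 127
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    # runtime reconstruction of approximate FP32 weights
    return [x * scale for x in q]

q, s = quantize_int8([0.5, -1.27, 0.02])
print(q)                  # [50, -127, 2]
print(dequantize(q, s))   # roughly recovers the original values
```

Each INT8 value takes 1 byte instead of 4, giving the ~4x size reduction noted above; the maximum rounding error per weight is half the scale.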

License

This model is a derivative work of google/flan-t5-small.

The original model is licensed under Apache License 2.0. This quantized version is distributed under the same license. See the LICENSE file for the full text.

Citation

@article{chung2022scaling,
  title={Scaling Instruction-Finetuned Language Models},
  author={Chung, Hyung Won and Hou, Le and Longpre, Shayne and others},
  journal={arXiv preprint arXiv:2210.11416},
  year={2022}
}
