Flan-T5-Small INT8 β€” ONNX Quantized

ONNX INT8 quantized version of google/flan-t5-small for efficient text embeddings via encoder representations.

Model Details

Property        Value
Base Model      google/flan-t5-small
Format          ONNX
Quantization    INT8 (dynamic quantization)
Parameters      ~60M
Quantized by    JustEmbed

What is this?

This is a quantized ONNX export of Flan-T5-Small, an instruction-finetuned version of T5-Small by Google. The encoder is used to generate text embeddings. The INT8 quantization reduces model size and improves inference speed.

Flan-T5 was instruction-finetuned on over 1,000 tasks, making its encoder representations broadly useful for diverse text understanding tasks.

Use Cases

  • General-purpose text embeddings
  • Instruction-aware text similarity
  • Cross-task text retrieval
  • Lightweight embedding model for resource-constrained environments
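Embeddings produced for these use cases are typically compared with cosine similarity. A minimal, dependency-free sketch (the `cosine` helper below is illustrative, not part of any library shipped with this model):

```python
import math

def cosine(a, b):
    # cosine similarity between two equal-length embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# vectors pointing the same way score 1.0, orthogonal vectors 0.0
print(cosine([1.0, 0.0], [2.0, 0.0]))  # 1.0
print(cosine([1.0, 0.0], [0.0, 5.0]))  # 0.0
```

For retrieval, rank candidate texts by their cosine score against the query embedding.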

Files

  • model.onnx β€” INT8 quantized ONNX model
  • tokenizer.json β€” Fast tokenizer
  • config.json β€” Model configuration

Usage with JustEmbed

from justembed import Embedder

embedder = Embedder("flan-t5-small-int8")
vectors = embedder.embed(["Summarize the key findings of this study"])

Usage with ONNX Runtime

import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(".")
session = ort.InferenceSession("model.onnx")

inputs = tokenizer("Summarize the key findings", return_tensors="np")
outputs = session.run(None, dict(inputs))
last_hidden_state = outputs[0]  # per-token encoder representations
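The encoder returns one hidden-state vector per token; a single sentence embedding is commonly obtained by mean-pooling those vectors over non-padding positions. A toy sketch of that pooling step, with plain Python lists standing in for the real numpy arrays:

```python
def mean_pool(hidden_states, attention_mask):
    # hidden_states: [seq_len][dim] token vectors
    # attention_mask: [seq_len] of 1 (real token) / 0 (padding)
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vec, mask in zip(hidden_states, attention_mask):
        if mask:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    return [t / count for t in total]

# the padding vector [9.0, 9.0] is excluded from the average
print(mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0]))  # [2.0, 3.0]
```

With the ONNX Runtime snippet above, you would apply this pooling to the first session output using the tokenizer's `attention_mask`.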

Quantization Details

  • Method: Dynamic INT8 quantization via ONNX Runtime
  • Source: Original PyTorch weights converted to ONNX, then quantized
  • Speed: ~2-3x faster inference than FP32
  • Size: ~4x smaller than FP32
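Dynamic quantization stores each weight tensor as INT8 plus a floating-point scale and dequantizes on the fly during inference, which is where the size and speed gains come from; activations remain in floating point. A simplified per-tensor symmetric scheme (illustrative only; ONNX Runtime's actual implementation differs in detail):

```python
def quantize_int8(weights):
    # per-tensor symmetric scale: map the largest magnitude to 127
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    # runtime reconstruction of approximate FP32 weights
    return [x * scale for x in q]

q, s = quantize_int8([0.5, -1.27, 0.02])
print(q)                  # [50, -127, 2]
print(dequantize(q, s))   # roughly recovers the original values
```

Each INT8 value takes 1 byte instead of 4, giving the ~4x size reduction noted above; the maximum rounding error per weight is half the scale.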

License

This model is a derivative work of google/flan-t5-small.

The original model is licensed under Apache License 2.0. This quantized version is distributed under the same license. See the LICENSE file for the full text.

Citation

@article{chung2022scaling,
  title={Scaling Instruction-Finetuned Language Models},
  author={Chung, Hyung Won and Hou, Le and Longpre, Shayne and others},
  journal={arXiv preprint arXiv:2210.11416},
  year={2022}
}
