Scaling Instruction-Finetuned Language Models
Paper: arXiv:2210.11416
ONNX INT8 quantized version of google/flan-t5-small for efficient text embeddings via encoder representations.
| Property | Value |
|---|---|
| Base Model | google/flan-t5-small |
| Format | ONNX |
| Quantization | INT8 (dynamic quantization) |
| Parameters | ~60M |
| Quantized by | JustEmbed |
This is a quantized ONNX export of Flan-T5-Small, an instruction-finetuned version of T5-Small by Google. The encoder is used to generate text embeddings. The INT8 quantization reduces model size and improves inference speed.
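As an illustration of the arithmetic behind INT8 weight quantization, here is a minimal numpy sketch of symmetric per-tensor quantization (an assumption for illustration only, not the actual export pipeline used to produce this model):

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor INT8 quantization: map floats to [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the INT8 values."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Rounding to the nearest step bounds the reconstruction error by scale / 2.
assert np.abs(w - w_hat).max() <= scale / 2 + 1e-6
```

Storing `q` (1 byte per weight) plus a single float scale instead of 4-byte floats is what yields the roughly 4x size reduction.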
Flan-T5 was instruction-finetuned on over 1,000 tasks, making its encoder representations broadly useful for diverse text understanding tasks.
- model.onnx – INT8 quantized ONNX model
- tokenizer.json – Fast tokenizer
- config.json – Model configuration

```python
from justembed import Embedder

embedder = Embedder("flan-t5-small-int8")
vectors = embedder.embed(["Summarize the key findings of this study"])
```
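Embedding vectors like these are typically compared with cosine similarity. A small self-contained sketch in plain numpy (independent of JustEmbed):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 0.0, 1.0])
c = np.array([0.0, 1.0, 0.0])
assert abs(cosine_similarity(a, b) - 1.0) < 1e-9  # identical direction
assert abs(cosine_similarity(a, c)) < 1e-9        # orthogonal vectors
```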
```python
import onnxruntime as ort
from transformers import AutoTokenizer

# Load the tokenizer from the model directory and the INT8 ONNX encoder.
tokenizer = AutoTokenizer.from_pretrained(".")
session = ort.InferenceSession("model.onnx")

# Tokenize to numpy arrays and run the encoder.
inputs = tokenizer("Summarize the key findings", return_tensors="np")
outputs = session.run(None, dict(inputs))
```
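The session output is a sequence of token-level hidden states; to get a single fixed-size embedding, one common approach is attention-mask-weighted mean pooling. A sketch, assuming the first output has shape `[batch, seq_len, hidden]`:

```python
import numpy as np

def mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average token embeddings over the sequence, ignoring padding positions."""
    mask = mask[..., None].astype(hidden.dtype)     # [batch, seq, 1]
    summed = (hidden * mask).sum(axis=1)            # [batch, hidden]
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid divide-by-zero
    return summed / counts

# Toy example: batch of 1, seq_len 3 (last position is padding), hidden size 2.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 0]])
emb = mean_pool(hidden, mask)
# → [[2.0, 3.0]] (the padded position is excluded from the average)
```

In the snippet above, `hidden` would be `outputs[0]` and `mask` would be `inputs["attention_mask"]`.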
This model is a derivative work of google/flan-t5-small.
The original model is licensed under Apache License 2.0. This quantized version is distributed under the same license. See the LICENSE file for the full text.
```bibtex
@article{chung2022scaling,
  title={Scaling Instruction-Finetuned Language Models},
  author={Chung, Hyung Won and Hou, Le and Longpre, Shayne and others},
  journal={arXiv preprint arXiv:2210.11416},
  year={2022}
}
```