Model Card for Distilled INT8 Quantized TinyBERT (SST-2)

Model Details

Model Description

This model is a distilled and INT8-quantized TinyBERT fine-tuned for binary sentiment classification on the SST-2 dataset.
The goal of this model is energy-efficient and low-latency inference, making it suitable for on-device and edge deployment.

Starting from huawei-noah/TinyBERT_General_4L_312D, the model undergoes:

  1. Knowledge Distillation to transfer task-specific knowledge
  2. Post-training INT8 quantization to reduce memory footprint and inference cost

The result is a compact Transformer model that preserves competitive accuracy while significantly improving efficiency.
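The memory-footprint claim from step 2 can be illustrated with a quick sketch: applying PyTorch post-training dynamic INT8 quantization to a small stand-in module and comparing serialized sizes. The layer dimensions below (312 hidden, 1200 intermediate) are illustrative stand-ins shaped like one TinyBERT feed-forward block, not this model's actual weights.

```python
import io

import torch
import torch.nn as nn

# Stand-in module roughly shaped like one TinyBERT FFN block;
# the real model would be quantized the same way after loading.
fp32_model = nn.Sequential(nn.Linear(312, 1200), nn.ReLU(), nn.Linear(1200, 312))

# Post-training dynamic quantization: Linear weights become INT8,
# activations are quantized on the fly at inference time.
int8_model = torch.ao.quantization.quantize_dynamic(
    fp32_model, {nn.Linear}, dtype=torch.qint8
)

def serialized_size(model: nn.Module) -> int:
    """Bytes needed to serialize the model's state dict."""
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    return buf.getbuffer().nbytes

fp32_bytes = serialized_size(fp32_model)
int8_bytes = serialized_size(int8_model)
print(f"FP32: {fp32_bytes} B, INT8: {int8_bytes} B, "
      f"~{fp32_bytes / int8_bytes:.1f}x smaller")
```

Since INT8 weights take one byte instead of four, the serialized size drops roughly 4x, which is where the "reduced memory footprint" in step 2 comes from.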

  • Developed by: hamadbijarani012
  • Model type: Transformer encoder (TinyBERT)
  • Task: Binary sentiment classification
  • Language: English
  • License: Apache 2.0
  • Finetuned from: huawei-noah/TinyBERT_General_4L_312D

Uses

Direct Use

  • Sentiment classification on short English text
  • Low-power or latency-sensitive inference scenarios
  • Benchmarking model compression and energy-aware NLP research

Out-of-Scope Use

  • Multilingual sentiment analysis
  • Long-document understanding
  • Generative text tasks
  • Safety-critical or high-stakes decision-making systems

Bias, Risks, and Limitations

  • The model inherits dataset biases present in SST-2 (movie-review–centric sentiment).
  • INT8 quantization may introduce minor numerical precision loss, particularly for out-of-distribution inputs.
  • Performance may degrade on domains substantially different from movie reviews.

This model is intended for research and efficiency benchmarking, not for deployment in sensitive real-world applications without further validation.


How to Get Started with the Model

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "hamadbijarani012/TinyBERT_General_4L_312D_distilled_BERT_base_uncased_SST2_int8"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The checkpoint is stored in FP32; apply post-training dynamic INT8
# quantization to the Linear layers at load time.
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
model.eval()

text = "The movie was surprisingly good!"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
prediction = outputs.logits.argmax(dim=-1).item()  # SST-2: 0 = negative, 1 = positive

Training Details

Training Data

  • Dataset: SST-2 (Stanford Sentiment Treebank v2)
  • Task: Binary sentiment classification (positive / negative)
  • Split: Standard train / validation split provided by the dataset

Training Procedure

  • Distillation: Knowledge distillation using soft targets from a larger teacher model
  • Fine-tuning: Supervised fine-tuning on SST-2 labels
  • Quantization: Post-training INT8 dynamic quantization using PyTorch
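The distillation objective above can be sketched as a temperature-scaled KL-divergence term on the teacher's soft targets combined with the cross-entropy task loss. The temperature and mixing weight below are illustrative assumptions, not the author's reported values.

```python
import torch
import torch.nn.functional as F

def distillation_loss(
    student_logits: torch.Tensor,
    teacher_logits: torch.Tensor,
    labels: torch.Tensor,
    temperature: float = 2.0,  # assumed value, for illustration
    alpha: float = 0.5,        # assumed weight on the hard-label task loss
) -> torch.Tensor:
    # Hard-label task loss: cross-entropy on the SST-2 labels
    ce = F.cross_entropy(student_logits, labels)
    # Soft-target loss: KL divergence between temperature-softened
    # teacher and student distributions, scaled by T^2 (standard KD scaling)
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * ce + (1.0 - alpha) * kd

# Toy batch: 4 examples, 2 classes (negative / positive)
student = torch.randn(4, 2)
teacher = torch.randn(4, 2)
labels = torch.tensor([0, 1, 1, 0])
loss = distillation_loss(student, teacher, labels)
```

When student and teacher logits agree, the KL term vanishes and only the cross-entropy task loss remains.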

Training Hyperparameters

  • Training regime: FP32 training → INT8 inference
  • Loss: Cross-entropy (task loss) + distillation loss (KL divergence)
  • Optimizer: AdamW
  • Batch size: 32
  • Epochs: 3–5

Evaluation

Metrics

  • Accuracy

Results

Model Variant                         Accuracy
------------------------------------  --------
Distilled INT8 TinyBERT (this model)  89.11%
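Accuracy here is the plain fraction of correct predictions over the SST-2 validation set. A minimal sketch (the predictions and labels below are made up for illustration, not the reported evaluation):

```python
def accuracy(predictions: list[int], labels: list[int]) -> float:
    """Fraction of examples where the predicted class matches the gold label."""
    assert len(predictions) == len(labels)
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Toy example: 0 = negative, 1 = positive
preds = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
gold  = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
print(f"accuracy = {accuracy(preds, gold):.2%}")  # 8 of 10 correct -> 80.00%
```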

Summary

INT8 quantization introduces negligible accuracy degradation while substantially reducing model size, inference latency, and energy consumption, making this model well suited to efficient NLP inference.


Technical Specifications

Model Architecture

  • 4 Transformer encoder layers
  • Hidden size: 312
  • Attention heads: 12
  • Vocabulary: BERT uncased (WordPiece)
  • Parameters: ~14.4M (stored as FP32 safetensors; INT8 applied at inference)
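The ~14.4M parameter figure can be sanity-checked from the architecture above. The FFN intermediate size (1200) and the BERT-uncased vocabulary size (30,522) are assumptions taken from the TinyBERT 4L-312D configuration, not stated in this card; the small classification head (~600 parameters) is omitted.

```python
# Back-of-the-envelope parameter count for TinyBERT 4L-312D.
hidden = 312
layers = 4
intermediate = 1200   # assumed TinyBERT 4L-312D FFN size
vocab = 30522         # BERT uncased WordPiece vocabulary
max_pos = 512
type_vocab = 2

# Token, position, and segment embeddings, plus embedding LayerNorm
embeddings = (vocab + max_pos + type_vocab) * hidden + 2 * hidden

per_layer = (
    4 * (hidden * hidden + hidden)            # Q, K, V, output projections
    + (hidden * intermediate + intermediate)  # FFN up-projection
    + (intermediate * hidden + hidden)        # FFN down-projection
    + 2 * (2 * hidden)                        # two LayerNorms
)

pooler = hidden * hidden + hidden
total = embeddings + layers * per_layer + pooler
print(f"~{total / 1e6:.1f}M parameters")  # close to the reported 14.4M
```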

Compute Infrastructure

Hardware

  • NVIDIA GPU (training)
  • CPU / GPU (INT8 inference)

Software

  • PyTorch
  • Hugging Face Transformers
  • PyTorch Quantization API

Model Card Contact

For questions related to this model or the associated research, please contact the model author via Hugging Face.
