Model Card for Distilled INT8 Quantized TinyBERT (SST-2)

Model Details

Model Description

This model is a distilled and INT8-quantized TinyBERT fine-tuned for binary sentiment classification on the SST-2 dataset.
The goal of this model is energy-efficient and low-latency inference, making it suitable for on-device and edge deployment.

Starting from huawei-noah/TinyBERT_General_4L_312D, the model undergoes:

  1. Knowledge Distillation to transfer task-specific knowledge
  2. Post-training INT8 quantization to reduce memory footprint and inference cost

The result is a compact Transformer model that preserves competitive accuracy while significantly improving efficiency.
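The memory-footprint claim from step 2 can be illustrated with a quick sketch: applying PyTorch post-training dynamic INT8 quantization to a small stand-in module and comparing serialized sizes. The layer dimensions below (312 hidden, 1200 intermediate) are illustrative stand-ins shaped like one TinyBERT feed-forward block, not this model's actual weights.

```python
import io

import torch
import torch.nn as nn

# Stand-in module roughly shaped like one TinyBERT FFN block;
# the real model would be quantized the same way after loading.
fp32_model = nn.Sequential(nn.Linear(312, 1200), nn.ReLU(), nn.Linear(1200, 312))

# Post-training dynamic quantization: Linear weights become INT8,
# activations are quantized on the fly at inference time.
int8_model = torch.ao.quantization.quantize_dynamic(
    fp32_model, {nn.Linear}, dtype=torch.qint8
)

def serialized_size(model: nn.Module) -> int:
    """Bytes needed to serialize the model's state dict."""
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    return buf.getbuffer().nbytes

fp32_bytes = serialized_size(fp32_model)
int8_bytes = serialized_size(int8_model)
print(f"FP32: {fp32_bytes} B, INT8: {int8_bytes} B, "
      f"~{fp32_bytes / int8_bytes:.1f}x smaller")
```

Since INT8 weights take one byte instead of four, the serialized size drops roughly 4x, which is where the "reduced memory footprint" in step 2 comes from.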

  • Developed by: hamadbijarani012
  • Model type: Transformer encoder (TinyBERT)
  • Task: Binary sentiment classification
  • Language: English
  • License: Apache 2.0
  • Finetuned from: huawei-noah/TinyBERT_General_4L_312D

Uses

Direct Use

  • Sentiment classification on short English text
  • Low-power or latency-sensitive inference scenarios
  • Benchmarking model compression and energy-aware NLP research

Out-of-Scope Use

  • Multilingual sentiment analysis
  • Long-document understanding
  • Generative text tasks
  • Safety-critical or high-stakes decision-making systems

Bias, Risks, and Limitations

  • The model inherits dataset biases present in SST-2 (movie-review–centric sentiment).
  • INT8 quantization may introduce minor numerical precision loss, particularly for out-of-distribution inputs.
  • Performance may degrade on domains substantially different from movie reviews.

This model is intended for research and efficiency benchmarking, not for deployment in sensitive real-world applications without further validation.


How to Get Started with the Model

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "hamadbijarani012/TinyBERT_General_4L_312D_distilled_BERT_base_uncased_SST2_int8"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The checkpoint is stored in FP32; apply post-training dynamic INT8
# quantization to the Linear layers at load time.
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
model.eval()

text = "The movie was surprisingly good!"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
prediction = outputs.logits.argmax(dim=-1).item()  # SST-2: 0 = negative, 1 = positive

Training Details

Training Data

  • Dataset: SST-2 (Stanford Sentiment Treebank v2)
  • Task: Binary sentiment classification (positive / negative)
  • Split: Standard train / validation split provided by the dataset

Training Procedure

  • Distillation: Knowledge distillation using soft targets from a larger teacher model
  • Fine-tuning: Supervised fine-tuning on SST-2 labels
  • Quantization: Post-training INT8 dynamic quantization using PyTorch
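The distillation objective above can be sketched as a temperature-scaled KL-divergence term on the teacher's soft targets combined with the cross-entropy task loss. The temperature and mixing weight below are illustrative assumptions, not the author's reported values.

```python
import torch
import torch.nn.functional as F

def distillation_loss(
    student_logits: torch.Tensor,
    teacher_logits: torch.Tensor,
    labels: torch.Tensor,
    temperature: float = 2.0,  # assumed value, for illustration
    alpha: float = 0.5,        # assumed weight on the hard-label task loss
) -> torch.Tensor:
    # Hard-label task loss: cross-entropy on the SST-2 labels
    ce = F.cross_entropy(student_logits, labels)
    # Soft-target loss: KL divergence between temperature-softened
    # teacher and student distributions, scaled by T^2 (standard KD scaling)
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * ce + (1.0 - alpha) * kd

# Toy batch: 4 examples, 2 classes (negative / positive)
student = torch.randn(4, 2)
teacher = torch.randn(4, 2)
labels = torch.tensor([0, 1, 1, 0])
loss = distillation_loss(student, teacher, labels)
```

When student and teacher logits agree, the KL term vanishes and only the cross-entropy task loss remains.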

Training Hyperparameters

  • Training regime: FP32 training → INT8 inference
  • Loss: Cross-entropy (task loss) + distillation loss (KL divergence)
  • Optimizer: AdamW
  • Batch size: 32
  • Epochs: 3–5

Evaluation

Metrics

  • Accuracy

Results

Model Variant                         Accuracy
------------------------------------  --------
Distilled INT8 TinyBERT (this model)  89.11%
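Accuracy here is the plain fraction of correct predictions over the SST-2 validation set. A minimal sketch (the predictions and labels below are made up for illustration, not the reported evaluation):

```python
def accuracy(predictions: list[int], labels: list[int]) -> float:
    """Fraction of examples where the predicted class matches the gold label."""
    assert len(predictions) == len(labels)
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Toy example: 0 = negative, 1 = positive
preds = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
gold  = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
print(f"accuracy = {accuracy(preds, gold):.2%}")  # 8 of 10 correct -> 80.00%
```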

Summary

INT8 quantization introduces negligible accuracy degradation while substantially reducing model size, inference latency, and energy consumption, making this model well suited to efficient NLP inference.


Technical Specifications

Model Architecture

  • 4 Transformer encoder layers
  • Hidden size: 312
  • Attention heads: 12
  • Vocabulary: BERT uncased (WordPiece)
  • Parameters: ~14.4M (stored as FP32 safetensors; INT8 applied at inference)
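The ~14.4M parameter figure can be sanity-checked from the architecture above. The FFN intermediate size (1200) and the BERT-uncased vocabulary size (30,522) are assumptions taken from the TinyBERT 4L-312D configuration, not stated in this card; the small classification head (~600 parameters) is omitted.

```python
# Back-of-the-envelope parameter count for TinyBERT 4L-312D.
hidden = 312
layers = 4
intermediate = 1200   # assumed TinyBERT 4L-312D FFN size
vocab = 30522         # BERT uncased WordPiece vocabulary
max_pos = 512
type_vocab = 2

# Token, position, and segment embeddings, plus embedding LayerNorm
embeddings = (vocab + max_pos + type_vocab) * hidden + 2 * hidden

per_layer = (
    4 * (hidden * hidden + hidden)            # Q, K, V, output projections
    + (hidden * intermediate + intermediate)  # FFN up-projection
    + (intermediate * hidden + hidden)        # FFN down-projection
    + 2 * (2 * hidden)                        # two LayerNorms
)

pooler = hidden * hidden + hidden
total = embeddings + layers * per_layer + pooler
print(f"~{total / 1e6:.1f}M parameters")  # close to the reported 14.4M
```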

Compute Infrastructure

Hardware

  • NVIDIA GPU (training)
  • CPU / GPU (INT8 inference)

Software

  • PyTorch
  • Hugging Face Transformers
  • PyTorch Quantization API

Model Card Contact

For questions related to this model or the associated research, please contact the model author via Hugging Face.
