Model Card for Distilled INT8 Quantized TinyBERT (SST-2)
Model Details
Model Description
This model is a distilled and INT8-quantized TinyBERT fine-tuned for binary sentiment classification on the SST-2 dataset.
The goal of this model is energy-efficient and low-latency inference, making it suitable for on-device and edge deployment.
Starting from huawei-noah/TinyBERT_General_4L_312D, the model undergoes:
- Knowledge Distillation to transfer task-specific knowledge
- Post-training INT8 quantization to reduce memory footprint and inference cost
The result is a compact Transformer model that preserves competitive accuracy while significantly improving efficiency.
- Developed by: hamadbijarani012
- Model type: Transformer encoder (TinyBERT)
- Task: Binary sentiment classification
- Language: English
- License: Apache 2.0
- Finetuned from: huawei-noah/TinyBERT_General_4L_312D
Uses
Direct Use
- Sentiment classification on short English text
- Low-power or latency-sensitive inference scenarios
- Benchmarking model compression and energy-aware NLP research
Out-of-Scope Use
- Multilingual sentiment analysis
- Long-document understanding
- Generative text tasks
- Safety-critical or high-stakes decision-making systems
Bias, Risks, and Limitations
- The model inherits dataset biases present in SST-2 (movie-review–centric sentiment).
- INT8 quantization may introduce minor numerical precision loss, particularly for out-of-distribution inputs.
- Performance may degrade on domains substantially different from movie reviews.
This model is intended for research and efficiency benchmarking, not for deployment in sensitive real-world applications without further validation.
How to Get Started with the Model
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("hamadbijarani012/<model-id>")
model = AutoModelForSequenceClassification.from_pretrained("hamadbijarani012/<model-id>")

# Apply post-training dynamic INT8 quantization to the Linear layers.
# (from_pretrained does not accept torch_dtype=torch.int8; dynamic
# quantization is applied after loading the FP32 weights.)
model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
model.eval()

text = "The movie was surprisingly good!"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
prediction = outputs.logits.argmax(dim=-1).item()  # 0 = negative, 1 = positive
```
Training Details
Training Data
- Dataset: SST-2 (Stanford Sentiment Treebank v2)
- Task: Binary sentiment classification (positive / negative)
- Split: Standard train / validation split provided by the dataset
Training Procedure
- Distillation: Knowledge distillation using soft targets from a larger teacher model
- Fine-tuning: Supervised fine-tuning on SST-2 labels
- Quantization: Post-training INT8 dynamic quantization using PyTorch
Training Hyperparameters
- Training regime: FP32 training → INT8 inference
- Loss: Cross-entropy (task loss) + distillation loss (KL divergence)
- Optimizer: AdamW
- Batch size: 32
- Epochs: 3–5
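The combined objective above (task cross-entropy plus a KL-divergence distillation term) can be sketched as follows. The temperature `T` and mixing weight `alpha` are illustrative assumptions, not values reported by the authors:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Mix the supervised task loss with a soft-target distillation loss.

    T (softmax temperature) and alpha (mixing weight) are hypothetical
    hyperparameters chosen for illustration.
    """
    # Hard-label task loss against the SST-2 labels
    ce = F.cross_entropy(student_logits, labels)
    # KL divergence between temperature-softened student and teacher distributions
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # T^2 rescaling keeps gradient magnitudes comparable
    return alpha * ce + (1 - alpha) * kl

# Toy example with random logits for a batch of 4 binary examples
student = torch.randn(4, 2)
teacher = torch.randn(4, 2)
labels = torch.tensor([0, 1, 1, 0])
loss = distillation_loss(student, teacher, labels)
```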
Evaluation
Metrics
- Accuracy
Results
| Model Variant | Accuracy |
|---|---|
| Distilled INT8 TinyBERT (this) | 89.11% |
Summary
INT8 quantization introduces negligible accuracy degradation while providing substantial benefits in model size, inference latency, and energy consumption, making it suitable for efficient NLP inference.
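The size reduction can be verified directly. The sketch below uses a small stand-in module rather than the released checkpoint, but the same before/after comparison applies to the full model:

```python
import os
import tempfile
import torch
import torch.nn as nn

# Stand-in module (not the actual TinyBERT checkpoint): two Linear layers
# sized roughly like a TinyBERT feed-forward block (hidden size 312).
model_fp32 = nn.Sequential(nn.Linear(312, 1200), nn.ReLU(), nn.Linear(1200, 312))
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8
)

def size_mb(m):
    """Serialize the state dict to a temp file and report its size in MB."""
    with tempfile.NamedTemporaryFile(suffix=".pt", delete=False) as f:
        torch.save(m.state_dict(), f.name)
        sz = os.path.getsize(f.name) / 1e6
    os.unlink(f.name)
    return sz

print(f"FP32: {size_mb(model_fp32):.2f} MB, INT8: {size_mb(model_int8):.2f} MB")
```

INT8 weights are stored as one byte per parameter instead of four, so the quantized state dict is roughly a quarter of the FP32 size for Linear-dominated models.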
Technical Specifications
Model Architecture
- 4 Transformer encoder layers
- Hidden size: 312
- Attention heads: 12
- Vocabulary: BERT uncased WordPiece (30,522 tokens)
Compute Infrastructure
Hardware
- NVIDIA GPU (training)
- CPU / GPU (INT8 inference)
Software
- PyTorch
- Hugging Face Transformers
- PyTorch Quantization API
Model Card Contact
For questions related to this model or the associated research, please contact the model author via Hugging Face.
Model tree for hamadbijarani012/TinyBERT_General_4L_312D_distilled_BERT_base_uncased_SST2_int8
- Base model: huawei-noah/TinyBERT_General_4L_312D