FP16 Optimized ONNX Model

This model is an ONNX-FP16 optimized version of cardiffnlp/twitter-xlm-roberta-base-sentiment. It runs exclusively on the GPU. Depending on the model, ONNX-FP16 versions can be 2-3x faster than the base PyTorch models.

For ONNX-FP16 benchmarks against plain ONNX and PyTorch, as well as the scripts used to generate this model and check its accuracy, see https://github.com/joaopn/encoder-optimization-guide.

On a test set of 10,000 Reddit comments, the label probability differences between this model and the FP32 model were:

Mean: 0.00075560
Std Dev: 0.00073272
Min: 0.00000095
Max: 0.01000583
Median: 0.00054353

Quantiles:
  25th percentile: 0.00024071
  50th percentile: 0.00054353
  75th percentile: 0.00103680
  90th percentile: 0.00168392
  95th percentile: 0.00217985
  99th percentile: 0.00333904
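Statistics of this kind can be reproduced with a short NumPy computation. A minimal sketch, where `p_fp32` and `p_fp16` are hypothetical stand-ins (simulated here, not the actual model outputs):

```python
import numpy as np

# Hypothetical stand-ins for the per-comment label probabilities of the
# FP32 and FP16 models; in practice these come from running both models
# on the same 10,000 comments.
rng = np.random.default_rng(0)
p_fp32 = rng.random((10_000, 3))
p_fp16 = p_fp32 + rng.normal(0.0, 7.5e-4, size=p_fp32.shape)

# Absolute per-label probability differences, flattened over all samples
diff = np.abs(p_fp32 - p_fp16).ravel()

stats = {
    "mean": diff.mean(),
    "std": diff.std(),
    "min": diff.min(),
    "max": diff.max(),
    "median": np.median(diff),
}
quantiles = {q: np.percentile(diff, q) for q in (25, 50, 75, 90, 95, 99)}
```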

Usage

The model was generated with:

```python
from optimum.onnxruntime import ORTOptimizer, ORTModelForSequenceClassification, AutoOptimizationConfig
from transformers import AutoTokenizer

model_id = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
save_dir = "./model-onnx-fp16"

# 1. Export the base model to ONNX
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 2. Set up the optimizer
optimizer = ORTOptimizer.from_pretrained(model)

# 3. Apply O4 optimization (GPU-only FP16)
optimization_config = AutoOptimizationConfig.O4()

optimizer.optimize(
    save_dir=save_dir,
    optimization_config=optimization_config
)

# 4. Save the tokenizer for a complete package
tokenizer.save_pretrained(save_dir)
```

You will need the GPU version of the ONNX Runtime. It can be installed with:

```shell
pip install "optimum[onnxruntime-gpu]" --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/
```

For convenience, you can use this environment.yml file to create a conda env with all the requirements. Below is an optimized, batched usage example:

```python
import pandas as pd
import torch
from tqdm import tqdm
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification

def sentiment_analysis_batched(df, batch_size, field_name):
    model_id = 'joaopn/twitter-xlm-roberta-base-sentiment-onnx-fp16'
    file_name = 'model.onnx'
    gpu_id = 0

    model = ORTModelForSequenceClassification.from_pretrained(
        model_id,
        file_name=file_name,
        provider="CUDAExecutionProvider",
        provider_options={'device_id': gpu_id},
    )
    device = torch.device(f"cuda:{gpu_id}")

    tokenizer = AutoTokenizer.from_pretrained(model_id)

    results = []

    # Precompute id2label mapping
    id2label = model.config.id2label

    total_samples = len(df)
    with tqdm(total=total_samples, desc="Processing samples") as pbar:
        for start_idx in range(0, total_samples, batch_size):
            end_idx = min(start_idx + batch_size, total_samples)
            texts = df[field_name].iloc[start_idx:end_idx].tolist()

            inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt", max_length=512)
            input_ids = inputs['input_ids'].to(device)
            attention_mask = inputs['attention_mask'].to(device)

            with torch.no_grad():
                outputs = model(input_ids, attention_mask=attention_mask)
            # Softmax: this is a single-label task (negative/neutral/positive)
            predictions = torch.softmax(outputs.logits, dim=-1)

            # Collect predictions on GPU
            results.append(predictions)

            pbar.update(end_idx - start_idx)

    # Concatenate all results on GPU, then move to CPU once
    all_predictions = torch.cat(results, dim=0).cpu().numpy()

    # Convert to DataFrame with one column per label
    predictions_df = pd.DataFrame(all_predictions, columns=[id2label[i] for i in range(all_predictions.shape[1])])

    # Add prediction columns to the original DataFrame
    combined_df = pd.concat([df.reset_index(drop=True), predictions_df], axis=1)

    return combined_df

df = pd.read_csv('https://github.com/joaopn/gpu_benchmark_goemotions/raw/main/data/random_sample_10k.csv.gz')
df = sentiment_analysis_batched(df, batch_size=8, field_name='body')
```
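The returned DataFrame gains one probability column per label, named from the model's id2label mapping. As a hedged follow-up sketch (assuming the sentiment labels are `negative`/`neutral`/`positive`), a hard label per row can be derived with `idxmax`:

```python
import pandas as pd

# Toy frame standing in for the output of sentiment_analysis_batched();
# the probability column names come from the model's id2label mapping.
df = pd.DataFrame({
    "body": ["great!", "meh"],
    "negative": [0.05, 0.30],
    "neutral": [0.10, 0.50],
    "positive": [0.85, 0.20],
})
label_cols = ["negative", "neutral", "positive"]

# idxmax over the label columns returns, per row, the column name with
# the highest probability.
df["predicted_label"] = df[label_cols].idxmax(axis=1)
```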