Qwen3-4B Instruct Quality Classifier

Qwen3-4B Instruct Quality Classifier is a Qwen3-4B-based model that judges the quality of a user-assistant conversation and can serve as a quality filter for instruction-following tasks. The model was trained on the Portuguese Instruct Quality Qwen Annotations dataset.

Details

To train the model, we added a classification head with a single regression output to Qwen/Qwen3-4B. During training, only the weights of the embedding layer were frozen; all other weights were updated.
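
The setup described above can be sketched roughly as follows. This is a minimal illustration, not the training code from the repository: `freeze_input_embeddings` and `build_classifier` are hypothetical helper names, and it assumes the single-output head is created by passing `num_labels=1` to `AutoModelForSequenceClassification`.

```python
from torch import nn


def freeze_input_embeddings(model: nn.Module) -> int:
    """Disable gradients for the input embedding matrix.

    Returns the number of frozen parameters. Assumes the model exposes
    get_input_embeddings(), as Hugging Face PreTrainedModel subclasses do.
    """
    frozen = 0
    for param in model.get_input_embeddings().parameters():
        param.requires_grad = False
        frozen += param.numel()
    return frozen


def build_classifier(base_model: str = "Qwen/Qwen3-4B") -> nn.Module:
    """Hypothetical helper: load the base model with a one-output head."""
    # Imported lazily so freeze_input_embeddings works without transformers.
    from transformers import AutoModelForSequenceClassification

    # Assumption: num_labels=1 yields a single-output head, which
    # transformers treats as a regression problem by default.
    model = AutoModelForSequenceClassification.from_pretrained(
        base_model, num_labels=1
    )
    freeze_input_embeddings(model)
    return model
```

Calling `build_classifier()` downloads the full base model, so the freezing helper is kept separate and can be checked on any small module that implements `get_input_embeddings()`.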

Dataset: Portuguese Instruct Quality Qwen Annotations
Language: Portuguese
Number of Training Epochs: 2
Batch size: 64
Optimizer: torch.optim.AdamW (cosine learning rate scheduler with 100 warmup steps)
Learning Rate: 5e-5
Eval Metric: f1-score
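
To make the learning rate settings concrete, here is a small sketch of a cosine schedule with linear warmup, using the peak rate (5e-5) and warmup length (100 steps) listed above. The total number of steps is not stated in this card, so `total_steps` is a hypothetical parameter, and decaying all the way to zero is an assumption.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float = 5e-5,
          warmup_steps: int = 100) -> float:
    """Learning rate at a given optimizer step.

    Linear warmup from 0 to peak_lr, then half-cosine decay to 0.
    """
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, capped at 1.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


if __name__ == "__main__":
    # total_steps=1000 is an illustrative value, not the real run length.
    for s in (0, 50, 100, 550, 1000):
        print(s, lr_at(s, total_steps=1000))
```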

This repository has the source code used to train this model.

Evaluation Results

Confusion Matrix

	1	2	3	4	5
1	153	35	4	1	2
2	17	204	91	11	4
3	2	60	578	143	7
4	0	1	99	2076	348
5	0	0	5	299	5860

Precision: 0.8248
Recall: 0.7821
F1 Macro: 0.8082
Accuracy: 0.894
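
For readers who want to see how such metrics derive from a confusion matrix, the sketch below computes accuracy and macro-averaged precision, recall, and F1 from the table above. It assumes rows are true labels and columns are predicted labels, which this card does not state. The values it yields may differ slightly from the reported figures, since the averaging and rounding used for those are not specified.

```python
# Confusion matrix copied from the table above (labels 1 to 5).
# Assumption: rows are true labels, columns are predicted labels.
CM = [
    [153, 35, 4, 1, 2],
    [17, 204, 91, 11, 4],
    [2, 60, 578, 143, 7],
    [0, 1, 99, 2076, 348],
    [0, 0, 5, 299, 5860],
]


def macro_metrics(cm):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))
    precisions, recalls, f1s = [], [], []
    for i in range(n):
        tp = cm[i][i]
        predicted = sum(cm[r][i] for r in range(n))  # column sum
        actual = sum(cm[i])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r) if (p + r) else 0.0)
    return {
        "accuracy": correct / total,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1_macro": sum(f1s) / n,
    }


if __name__ == "__main__":
    print(macro_metrics(CM))
```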

Usage

Here's an example of how to use Qwen3-4B Instruct Quality Classifier for scoring a conversation in Portuguese:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("Polygl0t/portuguese-qwen3-4b-instruct-quality-classifier")
model = AutoModelForSequenceClassification.from_pretrained("Polygl0t/portuguese-qwen3-4b-instruct-quality-classifier")
model.to(device)

good_messages = [
    {
        "role": "user",
        "content": "Qual é a capital de Portugal?"
    },
    {
        "role": "assistant",
        "content": "A capital de Portugal é Lisboa."
    }
]

bad_messages = [
    {
        "role": "user",
        "content": "Qual é a capital de Portugal?"
    },
    {
        "role": "assistant",
        "content": "Minha cor favorita é azul."
    }
]

for messages in [good_messages, bad_messages]:

    # Format the conversation into a single string (which is how the model was trained)
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
    )

    # This model was fine-tuned with sequences up to 6032 tokens long,
    # so longer inputs are truncated to that length
    inputs = tokenizer(
        text, return_tensors="pt", padding="longest", truncation=True, max_length=6032
    ).to(device)
    with torch.no_grad():
        outputs = model(**inputs)

    # The regression head emits one raw value per sequence, nominally in the range [0, 4]
    raw_score = outputs.logits.squeeze(-1).float().cpu().item()
    print({
        "text": text,
        # Add 1 to move the raw value from the range [0, 4] to the range [1, 5]
        "score": raw_score + 1,
        # Clip to [0, 4], round to the nearest integer, then add 1 to get a label in [1, 5]
        "int_score": int(round(max(0.0, min(raw_score, 4.0)))) + 1,
    })
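
Since the classifier is meant to act as a quality filter, the score conversion from the example above can be wrapped in a small helper and used to keep only high-scoring conversations. This is a sketch: `to_int_score` and `keep_high_quality` are hypothetical names, the `"raw"` key is an illustrative record layout, and the threshold of 4 is an illustrative choice rather than a value recommended by this card.

```python
def to_int_score(raw: float) -> int:
    """Map a raw model output (nominally in [0, 4]) to a label in [1, 5]."""
    clipped = max(0.0, min(float(raw), 4.0))
    return int(round(clipped)) + 1


def keep_high_quality(scored, min_int_score: int = 4):
    """Keep items whose integer score meets the (hypothetical) threshold.

    Each item is a dict with a "raw" key holding the raw model output.
    """
    return [item for item in scored if to_int_score(item["raw"]) >= min_int_score]


if __name__ == "__main__":
    # Made-up raw outputs, for illustration only.
    samples = [
        {"id": "a", "raw": 3.8},
        {"id": "b", "raw": 0.2},
        {"id": "c", "raw": 2.9},
    ]
    print([item["id"] for item in keep_high_quality(samples)])
```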

Cite as 🤗

@misc{correa2026tucano2cool,
      title={{Tucano 2 Cool: Better Open Source LLMs for Portuguese}}, 
      author={Nicholas Kluge Corr{\^e}a and Aniket Sen and Shiza Fatimah and Sophia Falk and Lennard Landgraf and Julia Kastner and Lucie Flek},
      year={2026},
      eprint={2603.03543},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.03543}, 
}

Acknowledgments

Polyglot is a project funded by the Federal Ministry of Education and Research (BMBF) and the Ministry of Culture and Science of the State of North Rhine-Westphalia (MWK) as part of TRA Sustainable Futures (University of Bonn) and the Excellence Strategy of the federal and state governments.

We also gratefully acknowledge access to the Marvin cluster, hosted by the University of Bonn, and the support provided by its High Performance Computing & Analytics Lab.