This model is a fine-tuned version of Qwen/Qwen3-1.7B, trained with TRL.

Quick start:
from transformers import pipeline

question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"

# Build a chat-style text-generation pipeline on the GPU
generator = pipeline("text-generation", model="MWilinski/dro-v-qwen3-1.7b-paperlike", device="cuda")

# Pass the question in chat format; return only the newly generated text
output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
print(output["generated_text"])
This model was trained with DRO, a method introduced in Offline Regularised Reinforcement Learning for Large Language Models Alignment.
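At its core, DRO fits the policy (and a learned value baseline) offline by regressing a squared residual between the reward advantage and the KL-regularised log-probability ratio against the reference model. A minimal sketch of that objective for a single (prompt, completion) pair, assuming summed completion log-probabilities `logp_pi` and `logp_ref`, a scalar `reward`, a learned baseline `value`, and regularisation strength `beta` (all names are illustrative, not the TRL API; consult the paper for the exact formulation):

```python
def dro_loss(logp_pi, logp_ref, reward, value, beta=1.0):
    """Squared-residual DRO-style objective for one (prompt, completion) pair.

    logp_pi, logp_ref: log-probabilities of the completion under the
    policy being trained and the frozen reference model.
    reward: scalar reward r(x, y); value: learned baseline V(x).
    """
    residual = reward - value - beta * (logp_pi - logp_ref)
    return 0.5 * residual ** 2

# Toy numbers: the completion is slightly less likely under the policy
# than under the reference model.
print(dro_loss(logp_pi=-12.0, logp_ref=-11.5, reward=1.0, value=0.2, beta=0.1))  # ≈ 0.361
```

In practice the log-probabilities come from forward passes of the policy and reference models, and the loss is minimised jointly over the policy and value parameters on a fixed offline dataset.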
Cite DRO as:
@inproceedings{richemond2024offline,
    title     = {{Offline Regularised Reinforcement Learning for Large Language Models Alignment}},
    author    = {Pierre Harvey Richemond and Shangmin Guo and Caglar Gulcehre and Daniele Calandriello and
                 Corrado Anselmi and Nikola Momchev and Olivier Bachem and Daniel Toyama and Zoe Stepleton and
                 Thomas Baines and Bilal Piot and Francesco Visin and Doina Precup and Rémi Munos},
    booktitle = {Advances in Neural Information Processing Systems},
    year      = {2024},
    eprint    = {arXiv:2405.19107},
}
Cite TRL as:
@software{vonwerra2020trl,
    title   = {{TRL: Transformers Reinforcement Learning}},
    author  = {von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
    license = {Apache-2.0},
    url     = {https://github.com/huggingface/trl},
    year    = {2020}
}