Model Card for Qwen2.5-0.5B-SFT-OpenOrca

This model is a fine-tuned version of Qwen/Qwen2.5-0.5B on a cleaned subset of the Open-Orca/OpenOrca dataset. It was trained to follow instructions and reason across various tasks.

Model Details

Model Description

  • Model type: Causal Language Model
  • Language(s) (NLP): English
  • License: Apache 2.0
  • Finetuned from model: Qwen/Qwen2.5-0.5B

Training Details

Training Data

The model was trained on a 300k sample subset of the OpenOrca dataset. We applied a rigorous Data Cleaning Pipeline to ensure high quality:

  1. Length & Refusal Filtering: Removed samples with very short questions (<20 chars) or responses (<120 chars), and filtered out "I cannot answer..." refusals.
  2. Language Filtering: Kept only English samples using a fast ASCII-ratio heuristic (>0.95) and langdetect fallback.
  3. Repetition Filtering: Removed samples with excessive word repetition (consecutive words >5) or high n-gram frequency.
  4. MinHash Deduplication: Applied 3-gram MinHash LSH (threshold 0.85) to remove near-duplicates.
  5. System Prompt Standardization: Unified system prompts to a standard "You are a helpful assistant..." format.
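The filtering steps above can be sketched in plain Python. This is an illustrative reconstruction, not the actual pipeline code: the function names and the refusal patterns are assumptions, while the thresholds (<20-char questions, <120-char responses, ASCII ratio > 0.95, >5 consecutive word repeats) come from the list above. The MinHash deduplication step would typically use a library such as datasketch and is omitted here.

```python
import re

# Illustrative refusal patterns; the real pipeline may match more phrasings.
REFUSAL_RE = re.compile(r"^\s*(I cannot|I can't|I'm sorry)", re.IGNORECASE)

def ascii_ratio(text: str) -> float:
    """Fraction of characters that are plain ASCII (fast English heuristic)."""
    if not text:
        return 0.0
    return sum(c.isascii() for c in text) / len(text)

def max_consecutive_repeats(text: str) -> int:
    """Longest run of the same word appearing back to back."""
    words = text.split()
    best = run = 1 if words else 0
    for prev, cur in zip(words, words[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

def keep_sample(question: str, response: str) -> bool:
    """Apply the length, refusal, language, and repetition filters."""
    if len(question) < 20 or len(response) < 120:
        return False                      # 1. length filter
    if REFUSAL_RE.match(response):
        return False                      # 1. refusal filter
    if ascii_ratio(response) <= 0.95:
        return False                      # 2. fast English heuristic
    if max_consecutive_repeats(response) > 5:
        return False                      # 3. repetition filter
    return True
```

In the card's description, samples failing the ASCII-ratio check fall back to langdetect before being dropped; that fallback is left out here for brevity.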

Training Procedure

Training Hyperparameters

  • Training regime: bf16 mixed precision
  • Optimizer: AdamW (paged_adamw_32bit)
  • Learning Rate: 8e-6 with cosine schedule (warmup ratio 0.03)
  • Epochs: 1
  • Packing: Enabled (packed short sequences for efficiency)
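The hyperparameters above map onto TRL's SFTConfig roughly as follows. This is a sketch, not the actual training script: output_dir and batch size are illustrative assumptions, while the precision, optimizer, schedule, epochs, and packing settings are taken from the list above.

```python
from trl import SFTConfig

# Illustrative mapping of the listed hyperparameters onto TRL's SFTConfig;
# output_dir and per_device_train_batch_size are assumptions.
config = SFTConfig(
    output_dir="qwen2.5-0.5b-sft-openorca",
    bf16=True,                        # bf16 mixed precision
    optim="paged_adamw_32bit",        # paged AdamW optimizer
    learning_rate=8e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    num_train_epochs=1,
    packing=True,                     # pack short sequences for efficiency
    per_device_train_batch_size=8,
)
```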

Speeds, Sizes, and Times

  • Hardware: Trained on a single NVIDIA L4 (24 GB) via Google Colab
  • Training Time: ~5 hours

Evaluation

Benchmarks

We evaluated the model on standard benchmarks using lm-evaluation-harness.
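An evaluation run with lm-evaluation-harness can be launched as shown below. This is a hedged example: the task list and batch size are assumptions, since the card does not state which benchmarks were used.

```shell
# Example lm-evaluation-harness invocation; tasks and batch size are
# illustrative, not the exact configuration used for this card.
lm_eval --model hf \
    --model_args pretrained=Sepolian/qwen2.5-0.5b-sft-openorca,dtype=bfloat16 \
    --tasks arc_easy,hellaswag,winogrande \
    --batch_size 8
```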

(full_eval: benchmark results figure)

How to Get Started with the Model

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Sepolian/qwen2.5-0.5b-sft-openorca"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum computing in simple terms."},
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)

generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Citation

@misc{openorca,
  title = {OpenOrca: An Open Dataset of GPT Augmented FLAN Reasoning Traces},
  author = {Wing Lian and Bleys Goodson and Eugene Pentland and Austin Cook and Chanvichet Vong and "Teknium"},
  year = {2023},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Open-Orca/OpenOrca}}
}