Model Card for Qwen2.5-0.5B-SFT-OpenOrca
This model is a fine-tuned version of Qwen/Qwen2.5-0.5B on a cleaned subset of the Open-Orca/OpenOrca dataset. It was trained to follow instructions and reason across various tasks.
Model Details
Model Description
- Model type: Causal Language Model
- Language(s) (NLP): English
- License: Apache 2.0
- Finetuned from model: Qwen/Qwen2.5-0.5B
Model Sources
- Repository: Sepolian/qwen2.5-0.5b-sft-openorca
- Dataset: Open-Orca/OpenOrca
Training Details
Training Data
The model was trained on a 300k-sample subset of the OpenOrca dataset. We applied a rigorous data-cleaning pipeline to ensure high quality:
- Length & Refusal Filtering: Removed samples with very short questions (<20 chars) or responses (<120 chars), and filtered out "I cannot answer..." refusals.
- Language Filtering: Kept only English samples using a fast ASCII-ratio heuristic (>0.95) with a `langdetect` fallback.
- Repetition Filtering: Removed samples with excessive word repetition (more than 5 consecutive identical words) or high n-gram frequency.
- MinHash Deduplication: Applied 3-gram MinHash LSH (threshold 0.85) to remove near-duplicates.
- System Prompt Standardization: Unified system prompts to a standard "You are a helpful assistant..." format.
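The filtering steps above can be sketched in plain Python. This is a minimal illustration using the thresholds stated in the card; the exact refusal patterns, the `langdetect` fallback, and the MinHash deduplication step (typically done with a library such as `datasketch`) are assumptions and are omitted or simplified here:

```python
import re

# Thresholds from the card.
MIN_Q_CHARS, MIN_A_CHARS = 20, 120
ASCII_RATIO_MIN = 0.95
MAX_CONSECUTIVE_REPEATS = 5

# Hypothetical refusal patterns; the actual list used is not documented.
REFUSAL_RE = re.compile(r"^\s*(i cannot answer|i'?m sorry)", re.IGNORECASE)


def ascii_ratio(text: str) -> float:
    """Fraction of characters that are plain ASCII (fast English heuristic)."""
    return sum(ch.isascii() for ch in text) / max(len(text), 1)


def has_consecutive_repeats(text: str, limit: int = MAX_CONSECUTIVE_REPEATS) -> bool:
    """True if any word repeats more than `limit` times in a row."""
    words = text.lower().split()
    run = 1
    for prev, cur in zip(words, words[1:]):
        run = run + 1 if cur == prev else 1
        if run > limit:
            return True
    return False


def keep_sample(question: str, response: str) -> bool:
    """Apply the length, refusal, language, and repetition filters in order."""
    if len(question) < MIN_Q_CHARS or len(response) < MIN_A_CHARS:
        return False
    if REFUSAL_RE.match(response):
        return False
    # ASCII-ratio heuristic; a langdetect fallback would run for borderline cases.
    if ascii_ratio(response) < ASCII_RATIO_MIN:
        return False
    if has_consecutive_repeats(response):
        return False
    return True
```

Surviving samples would then pass through MinHash LSH deduplication (3-grams, threshold 0.85) before system-prompt standardization.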
Training Procedure
Training Hyperparameters
- Training regime: bf16 mixed precision
- Optimizer: AdamW (`paged_adamw_32bit`)
- Learning Rate: 8e-6 with cosine schedule (warmup ratio 0.03)
- Epochs: 1
- Packing: Enabled (packed short sequences for efficiency)
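The card does not name the training framework; assuming TRL's `SFTTrainer` (a common choice for packed SFT runs), the hyperparameters above would map onto a config roughly like this sketch, where `output_dir` is a placeholder and everything not listed above is a default:

```python
from trl import SFTConfig

# Values taken from the hyperparameter list above; framework choice is an assumption.
config = SFTConfig(
    output_dir="qwen2.5-0.5b-sft-openorca",
    bf16=True,                  # bf16 mixed precision
    optim="paged_adamw_32bit",  # paged AdamW optimizer
    learning_rate=8e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    num_train_epochs=1,
    packing=True,               # pack short sequences into full-length blocks
)
```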
Speeds, Sizes, Times
- Hardware: trained on NVIDIA L4 (24GB) via Google Colab
- Training Time: ~5 hours
Evaluation
Benchmarks
The model was evaluated on standard benchmarks using `lm-evaluation-harness`.
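The card does not list the tasks or scores. As an illustration only, a typical `lm-evaluation-harness` invocation for this model looks like the following; the task selection here is an example, not a record of what was actually run:

```shell
pip install lm-eval
lm_eval --model hf \
  --model_args pretrained=Sepolian/qwen2.5-0.5b-sft-openorca,dtype=bfloat16 \
  --tasks arc_easy,hellaswag,winogrande \
  --batch_size 8
```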
How to Get Started with the Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Sepolian/qwen2.5-0.5b-sft-openorca"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum computing in simple terms."},
]

# Render the conversation with the model's chat template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512,
)
# Strip the prompt tokens so only the newly generated reply is decoded.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
Citation
@misc{OpenOrca,
  title = {OpenOrca: An Open Dataset of GPT Augmented FLAN Reasoning Traces},
  author = {Wing Lian and Bleys Goodson and Eugene Pentland and Austin Cook and Chanvichet Vong and "Teknium"},
  year = {2023},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Open-Orca/OpenOrca}}
}