Model Card for Qwen2.5-0.5B-SFT-OpenOrca
This model is a fine-tuned version of Qwen/Qwen2.5-0.5B on a cleaned subset of the Open-Orca/OpenOrca dataset. It was trained to follow instructions and reason across various tasks.
Model Details
Model Description
- Model type: Causal Language Model
- Language(s) (NLP): English
- License: Apache 2.0
- Finetuned from model: Qwen/Qwen2.5-0.5B
Model Sources
- Repository: Sepolian/qwen2.5-0.5b-sft-openorca
- Dataset: Open-Orca/OpenOrca
Training Details
Training Data
The model was trained on a 300k-sample subset of the OpenOrca dataset. We applied a rigorous data-cleaning pipeline to ensure high quality:
- Length & Refusal Filtering: Removed samples with very short questions (<20 chars) or responses (<120 chars), and filtered out "I cannot answer..." refusals.
- Language Filtering: Kept only English samples using a fast ASCII-ratio heuristic (>0.95) with a `langdetect` fallback.
- Repetition Filtering: Removed samples with excessive word repetition (more than 5 consecutive identical words) or high n-gram frequency.
- MinHash Deduplication: Applied 3-gram MinHash LSH (threshold 0.85) to remove near-duplicates.
- System Prompt Standardization: Unified system prompts to a standard "You are a helpful assistant..." format.
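The filtering steps above can be sketched in plain Python. This is a minimal illustration using the thresholds stated in the card; the exact refusal patterns, the `langdetect` fallback, and the MinHash deduplication step (typically done with a library such as `datasketch`) are assumptions and are omitted or simplified here:

```python
import re

# Thresholds from the card.
MIN_Q_CHARS, MIN_A_CHARS = 20, 120
ASCII_RATIO_MIN = 0.95
MAX_CONSECUTIVE_REPEATS = 5

# Hypothetical refusal patterns; the actual list used is not documented.
REFUSAL_RE = re.compile(r"^\s*(i cannot answer|i'?m sorry)", re.IGNORECASE)


def ascii_ratio(text: str) -> float:
    """Fraction of characters that are plain ASCII (fast English heuristic)."""
    return sum(ch.isascii() for ch in text) / max(len(text), 1)


def has_consecutive_repeats(text: str, limit: int = MAX_CONSECUTIVE_REPEATS) -> bool:
    """True if any word repeats more than `limit` times in a row."""
    words = text.lower().split()
    run = 1
    for prev, cur in zip(words, words[1:]):
        run = run + 1 if cur == prev else 1
        if run > limit:
            return True
    return False


def keep_sample(question: str, response: str) -> bool:
    """Apply the length, refusal, language, and repetition filters in order."""
    if len(question) < MIN_Q_CHARS or len(response) < MIN_A_CHARS:
        return False
    if REFUSAL_RE.match(response):
        return False
    # ASCII-ratio heuristic; a langdetect fallback would run for borderline cases.
    if ascii_ratio(response) < ASCII_RATIO_MIN:
        return False
    if has_consecutive_repeats(response):
        return False
    return True
```

Surviving samples would then pass through MinHash LSH deduplication (3-grams, threshold 0.85) before system-prompt standardization.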
Training Procedure
Training Hyperparameters
- Training regime: bf16 mixed precision
- Optimizer: AdamW (`paged_adamw_32bit`)
- Learning Rate: 8e-6 with cosine schedule (warmup ratio 0.03)
- Epochs: 1
- Packing: Enabled (packed short sequences for efficiency)
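The card does not name the training framework; assuming TRL's `SFTTrainer` (a common choice for packed SFT runs), the hyperparameters above would map onto a config roughly like this sketch, where `output_dir` is a placeholder and everything not listed above is a default:

```python
from trl import SFTConfig

# Values taken from the hyperparameter list above; framework choice is an assumption.
config = SFTConfig(
    output_dir="qwen2.5-0.5b-sft-openorca",
    bf16=True,                  # bf16 mixed precision
    optim="paged_adamw_32bit",  # paged AdamW optimizer
    learning_rate=8e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    num_train_epochs=1,
    packing=True,               # pack short sequences into full-length blocks
)
```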
Speeds, Sizes, Times
- Hardware: trained on NVIDIA L4 (24GB) via Google Colab
- Training Time: ~5 hours
Evaluation
Benchmarks
The model was evaluated on standard benchmarks using `lm-evaluation-harness`.
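The card does not list the tasks or scores. As an illustration only, a typical `lm-evaluation-harness` invocation for this model looks like the following; the task selection here is an example, not a record of what was actually run:

```shell
pip install lm-eval
lm_eval --model hf \
  --model_args pretrained=Sepolian/qwen2.5-0.5b-sft-openorca,dtype=bfloat16 \
  --tasks arc_easy,hellaswag,winogrande \
  --batch_size 8
```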
How to Get Started with the Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Sepolian/qwen2.5-0.5b-sft-openorca"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum computing in simple terms."},
]

# Render the conversation with the model's chat template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512,
)
# Strip the prompt tokens so only the newly generated reply is decoded.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
Citation
@misc{OpenOrca,
  title = {OpenOrca: An Open Dataset of GPT Augmented FLAN Reasoning Traces},
  author = {Wing Lian and Bleys Goodson and Eugene Pentland and Austin Cook and Chanvichet Vong and "Teknium"},
  year = {2023},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Open-Orca/OpenOrca}}
}