Qwen3-8B DFlash Draft Model (Distillation)

Base model: Qwen/Qwen3-8B
Training method: DFlash with teacher-student distillation loss
Checkpoint: epoch_1_step_50000
Dataset: nemotron + codealpaca (responses regenerated with greedy decoding)
Hyperparameters:
- batch_size: 2
- learning_rate: 3e-4
- loss_type: distill
- loss_decay_gamma: 7.0
- block_size: 16
- num_epochs: 2
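
The hyperparameters above can be captured in a small config object, as in the sketch below. This is an illustration, not the training code used for this checkpoint: the class name `DraftTrainingConfig` and the helper `position_weights` are hypothetical. The card does not say how `loss_decay_gamma` enters the loss; the sketch assumes it sets an exponential decay `exp(-k / gamma)` over the positions of each block of `block_size` tokens, so later positions in a block contribute less.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DraftTrainingConfig:
    """Hypothetical container for the hyperparameters listed in this card."""
    batch_size: int = 2
    learning_rate: float = 3e-4
    loss_type: str = "distill"
    loss_decay_gamma: float = 7.0
    block_size: int = 16
    num_epochs: int = 2


def position_weights(block_size: int, gamma: float) -> list[float]:
    """Per-position loss weights within one block.

    ASSUMPTION: weights decay as exp(-k / gamma) for k = 0 .. block_size - 1.
    The card does not confirm this formula.
    """
    return [math.exp(-k / gamma) for k in range(block_size)]


cfg = DraftTrainingConfig()
weights = position_weights(cfg.block_size, cfg.loss_decay_gamma)
```

Under this assumption the first position in a block has weight 1.0, and the weight falls to `exp(-1)` at offset 7, which is one way to read a gamma of 7.0 with a block size of 16.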

Model size: 1B params
Tensor type: BF16
Format: Safetensors
