Missing index file?

#1
by Nick-Awesome - opened

Hi, thank you for your work on this repo. I just want to confirm whether it might be missing a model.safetensors.index.json file. The base model you reference, unsloth/gemma-3-4b-it-unsloth-bnb-4bit, only contains a single model.safetensors file, but your repo has two. When I try to run the model from your repo, I get the following error:

OSError: Error no file named pytorch_model.bin, model.safetensors, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory /data/models/Grok-3-reasoning-gemma3-4B-distilled-HF.

I suspect this might be because the index file was not included when pushing the repo. Could you please confirm?
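For reference, if the index file really is missing, it can be reconstructed from the shard headers themselves. Below is a minimal sketch, assuming the standard safetensors layout (an 8-byte little-endian header length followed by a JSON header) and the usual `model-*.safetensors` shard naming; it is not the official Transformers tooling, just an illustration of what the index contains:

```python
import json
import struct
from pathlib import Path

def read_safetensors_header(path):
    """Read the JSON header of a .safetensors file.

    The file begins with an 8-byte little-endian unsigned integer giving
    the header length, followed by that many bytes of JSON metadata."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))

def build_index(model_dir):
    """Rebuild model.safetensors.index.json from sharded checkpoint files.

    Maps every tensor name to the shard that contains it and sums the
    tensor byte sizes, matching the index format Transformers expects."""
    model_dir = Path(model_dir)
    weight_map, total_size = {}, 0
    for shard in sorted(model_dir.glob("model-*.safetensors")):
        header = read_safetensors_header(shard)
        for name, info in header.items():
            if name == "__metadata__":  # optional format metadata, not a tensor
                continue
            weight_map[name] = shard.name
            start, end = info["data_offsets"]
            total_size += end - start
    index = {"metadata": {"total_size": total_size}, "weight_map": weight_map}
    out = model_dir / "model.safetensors.index.json"
    out.write_text(json.dumps(index, indent=2))
    return index
```

With the index in place, `from_pretrained` should find the shards instead of raising the `OSError` above.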

All set! Added it. Let me know if it works.

reedmayhew changed discussion status to closed

Thank you for your kind reply. It works well on my side.

In addition, I have noticed that many of your repositories—both for models and data—mention “Distill.” May I kindly ask whether you have a blog or any documentation explaining how such data are generated, and how they can be used for SFT in detail?

Most importantly, I would like to know whether these models have been evaluated: have you tested them on standard benchmarks, and if so, have they shown significant improvements in practical scenarios?

Thank you very much for your time and guidance.

Thanks for the question! I don't have any professional documentation at the moment, as these are more for my personal use. I leave them public for other people to use if they'd like.

As for the process, if you view the dataset associated with the model (which should be linked), I simply feed those questions to the API of whatever model I'm trying to distill from and use the model's output as the assistant response for the dataset.

It's similar to the process DeepSeek used to create distilled versions of their model by using other smaller open-source models as the foundation. They're basically just fine-tuning based on the output of their R1 model. I'm doing the same via API calls to whichever model is included in the name of the distilled model.
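In code, that loop is roughly the following. This is a simplified sketch: `ask_teacher` stands in for whatever API client is used to query the teacher model, and the chat-style record layout is one common SFT format, not necessarily the exact schema used in these datasets:

```python
import json

def build_sft_dataset(questions, ask_teacher, out_path="distilled.jsonl"):
    """Query the teacher model for each prompt and save chat-format SFT records.

    ask_teacher is any callable that takes a prompt string and returns the
    teacher model's reply, e.g. a thin wrapper around an API client."""
    records = []
    for q in questions:
        answer = ask_teacher(q)
        records.append({"messages": [
            {"role": "user", "content": q},
            {"role": "assistant", "content": answer},
        ]})
    # Write one JSON record per line, the usual format for SFT trainers.
    with open(out_path, "w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")
    return records
```

The resulting JSONL can then be loaded by most fine-tuning frameworks as a chat dataset, with the teacher's outputs serving as the assistant turns.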

Lastly, I'm not familiar with how to run benchmarks on them. But, I'd definitely be interested in doing so!

Thank you for your patient reply. Your open-sourced data and models have been of great help to the community, and I truly appreciate it.

I also noticed that the distilled datasets you provided for different models, such as Grok-3 or Claude-3.7, typically contain only about 100–200 samples. May I ask: when such a limited amount of data is used for fine-tuning (i.e., distillation) on models like Qwen-3-8B or LLaMA-3.1-8B, what can they actually learn? This is also why I raised my earlier question: whether such small-scale distillation can actually improve the model's performance.

In my own testing, I observed that the outputs do resemble those of a "reasoning" model, but I have not yet had the chance to evaluate whether performance is truly improved. If you have conducted such evaluations before, could you please share the specific results?

In addition, have you considered performing large-scale distillation (e.g., with 1K+ or even 10K+ samples)? Based on DeepSeek’s previous technical report, such larger-scale distillation would almost certainly yield further improvements.

Thank you once again for providing these datasets and models, which have been of great help to the entire community. If you ever conduct distillation with larger-scale datasets, I believe it will yield even more valuable insights.

No worries at all!

The reason for the smaller datasets is mostly due to time and cost constraints on my part. I would love to be able to create 1K+ or even 10K+ sample datasets, as you are correct that they would reflect the original model better. However, I'm paying for all of the API requests out of my own pocket and doing so in my limited free time between my full-time job and dealing with numerous health issues.

Additionally, Unsloth keeps breaking their fine-tuning notebooks, which completely disrupts my previously working setups, forcing me to spend a ton of time fixing them. The whole process of making these is a nightmare on my end.

Right now, these distilled models reflect the grammatical and response style of the models they are distilled from, rather than transferring their knowledge.

I've been wanting to do an updated run on Claude Opus 4.1, which has been one of my favorite models recently. However, I haven't been able to find the time. Hopefully, I'll be able to soon, and possibly with a larger dataset.
