Thanks for making this! How can we eval intelligence loss from FT?

#1
by BingoBird - opened

Isn't there somewhere we can submit finetunes to get a battery of current, relevant performance scores?

Or is there a docker or similar image to download with some benchmarks pre-configured?

With all these finetunes, users lack key information for choosing between them: "How much loss in intelligence can I expect? How much bias is present (UGI, etc.)?"

That is a good point. Hugging Face used to let you submit models and ran the evaluations for you on its own infrastructure. That older leaderboard has since been archived, so evaluations now have to be run on your own compute. For this specific release, my primary focus was the Cloudbjorn.com Infrastructure as Code (IaC) workflow. I wanted the environment for training and merging models during fine-tuning to be as smooth as possible, so that in upcoming fine-tunes I can focus on customizing datasets and evaluating behavior more efficiently and in a more standardized way.

I am planning to spin up compute in the near future and run the EleutherAI lm-evaluation-harness on this model.
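Once scores exist for both the base model and the fine-tune, the "intelligence loss" asked about above could be reported as a per-task difference. Here is a minimal sketch of that comparison, assuming the scores have already been collected into plain task-to-score dictionaries. The function names, task names, and numbers are hypothetical placeholders, not real results and not part of any evaluation tool.

```python
from typing import Dict

# Hypothetical shape: task name -> single headline score (higher is better).
Scores = Dict[str, float]


def score_deltas(base: Scores, finetuned: Scores) -> Dict[str, float]:
    """Return (finetuned - base) for every task present in both runs."""
    shared_tasks = sorted(set(base) & set(finetuned))
    return {task: finetuned[task] - base[task] for task in shared_tasks}


def format_report(base: Scores, finetuned: Scores) -> str:
    """Render one line per shared task; a negative delta means a loss."""
    lines = []
    for task, delta in score_deltas(base, finetuned).items():
        lines.append(
            f"{task:<12} base={base[task]:.3f} "
            f"ft={finetuned[task]:.3f} delta={delta:+.3f}"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder values for illustration only, not measured scores.
    base_scores = {"task_a": 0.70, "task_b": 0.55}
    finetuned_scores = {"task_a": 0.68, "task_b": 0.56}
    print(format_report(base_scores, finetuned_scores))
```

Tasks that appear in only one of the two runs are skipped, so the report compares like with like.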
