How much RAM and VRAM did this take?
What was your RAM and VRAM use when you made this? I was thinking about experimenting with running heretic on MiniMax.
8x B200! Massive
But you just need 800 GB or more; it was a 48-hour process with 25+ patches.
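For anyone wondering where a figure like 800 GB comes from, here is a rough back-of-envelope sketch. The parameter count and byte widths are assumptions for illustration, not measured numbers from this run:

```python
# Rough sketch: weights-only memory estimate for loading a large MoE model.
# The parameter count below is an assumption, not a measurement from this run.
def estimate_weights_gb(num_params_billion: float, bytes_per_param: float) -> float:
    """Return approximate GB needed just to hold the weights."""
    return num_params_billion * 1e9 * bytes_per_param / 1e9

# Example: a ~230B-parameter model upcast to BF16 (2 bytes/param) needs roughly
# 460 GB for weights alone, before KV cache, activations, and framework overhead,
# which is how a total budget can climb toward 800 GB or more.
print(estimate_weights_gb(230, 2.0))  # ~460.0
```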
Wow, thank you for your contribution; that was probably very expensive to do.
Thank you. It was, because of all the time spent troubleshooting the environment and the complexity of the unique post-training techniques they used to lay down the alignment the way they did. I'll share a paper if someone wants to repeat it. It's draft quality and I didn't include the last day of effort, but you will get enough to advance to the ablation runs. I'll share it in the org... I will compress back down to FP8, maybe this week if there is time.
That sounds great! I wish there were an efficient way to fine-tune it with LoRA, though. I think MiniMax would be a great model to fine-tune if it weren't so incredibly resource-intensive.
They trained in FP8, so upscaling via one of the three approaches I found would make it possible. Only one works, which is the one I selected, but it required three patches for Transformers 5.5. After that, and as described above, you could do it, but yes, it's massively slow with 48,234 total weight tensors and 256 experts per layer. By the end I could do it on 8x 6000, which makes it more reasonable from a cost point of view if you are leasing. It's just very, very slow using PCIe only between the cards. It will be interesting to see what they have cooked up under the hood for 2.7.
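In case someone wants to try the upcast themselves, here is a minimal sketch of converting FP8 safetensors shards to BF16. The directory names are hypothetical, it ignores any quantization scale tensors the checkpoint may carry, and it is not the exact patch set described above:

```python
# Minimal sketch, assuming plain FP8 safetensors shards with no separate
# dequantization scales. Paths are placeholders, not the real repo layout.
from pathlib import Path

import torch
from safetensors.torch import load_file, save_file

src = Path("MiniMax-FP8")    # assumed input directory of FP8 shards
dst = Path("MiniMax-BF16")   # assumed output directory
dst.mkdir(exist_ok=True)

for shard in sorted(src.glob("*.safetensors")):
    tensors = load_file(shard)
    upcast = {}
    for name, t in tensors.items():
        # Upcast FP8 weight tensors; leave other dtypes (norms, scales, etc.) untouched.
        if t.dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
            upcast[name] = t.to(torch.bfloat16)
        else:
            upcast[name] = t
    save_file(upcast, dst / shard.name)
```

This roughly doubles the on-disk and in-memory size of the weights, which is part of why the full upcast-then-fine-tune route is so resource-hungry on a 256-experts-per-layer MoE.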
I've been pretty excited to test 2.7 locally since its benchmarks look pretty promising. If it's good enough, I'd gladly switch from my expensive Claude Max subscription to using it locally, so fingers crossed. It won't be Opus, but if I can get 80% of Opus's intelligence and coding skills I would be more than happy.