how do you fine tune?
How do you fine-tune, in what precision, and on what GPU? What context window? I'd like to know about this as I also work on a similar project; could you please help me with that?
I also have a question about that
How I fine-tune is a big question. We put some docs together to help people replicate what we do, although things can vary depending on how much compute you have at your disposal.
https://docs.teichai.com
Note: the "Open in Colab" buttons don't work at the moment, as we haven't made the notebooks yet.
For tuning something like GLM Flash you want, at the very least, 64 GB of VRAM.
Well, what GPUs do you use?
Again, it depends on the model, but for models like this I use the 80 GB A100 on Colab.
What is the max I could fine-tune with my 5070 Ti?
Same GPU I use. Max of ~14B (depending on how well optimized the model is in Unsloth).
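As a rough sanity check on that 14B-on-a-16 GB-card claim, here's a back-of-the-envelope VRAM estimate for QLoRA-style fine-tuning. The 0.5 bytes/param (4-bit weights) and the flat overhead allowance are assumptions for illustration, not measurements; real usage varies with context length, batch size, and framework optimizations.

```python
def qlora_vram_gb(params_b: float, bytes_per_param: float = 0.5,
                  overhead_gb: float = 4.0) -> float:
    """Estimate VRAM (GB) for QLoRA fine-tuning.

    params_b: model size in billions of parameters.
    bytes_per_param: ~0.5 for 4-bit quantized base weights.
    overhead_gb: rough allowance for LoRA adapters, optimizer state,
        activations, and CUDA context (an assumed figure).
    """
    weights_gb = params_b * bytes_per_param
    return weights_gb + overhead_gb

# A 14B model in 4-bit is ~7 GB of weights plus overhead, which is why
# a 16 GB card like the 5070 Ti tops out around 14B.
print(round(qlora_vram_gb(14), 1))  # ~11.0 GB: fits in 16 GB
print(round(qlora_vram_gb(32), 1))  # ~20.0 GB: does not fit in 16 GB
```

Under these assumptions, anything much past ~20B starts spilling over a 16 GB card even at 4-bit.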
Is it worth it? If I fine-tune GLM 4.7 Flash on a 100k+ row dataset, with answers generated by Gemini 3 Flash, covering various domains and concepts, so it's really diverse. Could you please give advice?
But you know, I get much higher tokens per second with MoE at nearly the same VRAM consumption (even though it's 6 GB more, it only does the math on ~3B active parameters, making it much faster).
Exactly why I don't understand why people still use dense so often. Sub-10B, maybe, sure. Otherwise: nah.
Well, I use dense models under 32B for three reasons: (1) they are well supported, (2) they are easy to train, and (3) they are consistent baselines.
Above 32B, though, they become almost impossible to run.
Yeah, dense models are still kind of outstanding under a limited parameter budget (personally I prefer dense LLMs under 20B), but again it depends on hardware. But imagine if I fine-tune it perfectly and make a much better model than the original, and it achieves scores and feels like frontend models such as Gemini 3, GPT-5, and Claude Sonnet 4.5: then it's a clear advantage to have MoE with ~40% faster tokens per second over the 27B dense variant. Again, an MoE will be harder to fine-tune, but it will pay off with the speed advantage.
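The speed argument above can be put in rough numbers: decode throughput scales mostly with *active* parameters, while memory scales with *total* parameters. A quick sketch using the models from this thread (27B dense vs a 35B-A3B MoE), assuming cost per token scales linearly with active parameters, which is a simplification:

```python
def flops_per_token(active_params_b: float) -> float:
    # ~2 FLOPs per active parameter per generated token
    # (a standard rule of thumb for transformer decoding).
    return 2 * active_params_b * 1e9

dense_27b = flops_per_token(27)   # dense: all 27B parameters are active
moe_35b_a3b = flops_per_token(3)  # MoE: only ~3B parameters active per token

print(dense_27b / moe_35b_a3b)  # → 9.0, i.e. ~9x fewer FLOPs/token for the MoE
```

In practice memory bandwidth, routing overhead, and KV-cache reads eat into that ratio, which is why observed speedups (like the ~40% figure mentioned) are far smaller than the raw FLOPs gap.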
I'm not sure which model you'll end up going for, but if you use Qwen 3.5 35B-A3B or something (I'd recommend this one because of the Qwen 3.5 architecture), you can still improve the architecture and inference by using quantization-aware training and multi-token prediction. You could also try flash attention for better long-context performance. I mean, if you use Qwen 3.5 specifically, I would also hope for a better front end (Qwen models are garbage at it, genuinely unusable), but otherwise I'd already be happy with a tie with the baseline given the better inference optimization.
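To make the quantization-aware training (QAT) suggestion concrete: the core trick is fake quantization, where weights are snapped to the low-bit grid in the forward pass so the model learns to tolerate quantization error. A minimal symmetric-int4 sketch in pure Python (generic illustration, not any specific model's recipe):

```python
def fake_quantize(weights, bits=4):
    """Snap weights to a symmetric low-bit grid and dequantize.

    This is the forward pass of QAT: training against these rounded
    values teaches the network to be robust to quantization error.
    """
    qmax = 2 ** (bits - 1) - 1                 # 7 levels each side for int4
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]

# 0.7 and -0.3 already sit on the int4 grid (scale = 0.1 here);
# 0.34 gets rounded to ~0.3, illustrating the quantization error
# the model learns to absorb during training.
print(fake_quantize([0.7, -0.3, 0.34]))
```

In a real QAT setup this runs inside the training loop with a straight-through estimator so gradients flow through the rounding as if it were the identity.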
Oh no, I meant "frontend models" as in high-quality, best models (sorry for the confusion, I'm currently learning English).
I agree with you, thanks for the help. Also, for the front-end UI and coding stuff, I would include 5-10k rows of GLM 5 front-end design masterpieces.
Gemini is the best at front end; no need for GLM. The more models you use, the messier training can get.