how do you fine tune?
How do you fine-tune, in what precision, and on what GPU? What context window? I'd like to know about this as I also work on a similar project; could you please help me with that?
I also have a question about that
How I fine-tune is a big question. We put some docs together to help people replicate what we do, although things can vary depending on how much compute you have at your disposal.
https://docs.teichai.com
Note: the "Open in Colab" buttons don't work at the moment, as we haven't made the notebooks yet.
For tuning something like GLM Flash you want, at the very least, 64 GB of VRAM.
Well, what GPUs do you use?
Again, it depends on the model, but for models like this I use the 80 GB A100 on Colab.
What is the max I could fine-tune with my 5070 Ti?
Same GPU I use. Max of ~14B (depending on how well optimized the model is in Unsloth).
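As a rough sanity check on that 14B-on-a-16 GB-card claim, here's a back-of-the-envelope VRAM estimate for QLoRA-style fine-tuning. The 0.5 bytes/param (4-bit weights) and the flat overhead allowance are assumptions for illustration, not measurements; real usage varies with context length, batch size, and framework optimizations.

```python
def qlora_vram_gb(params_b: float, bytes_per_param: float = 0.5,
                  overhead_gb: float = 4.0) -> float:
    """Estimate VRAM (GB) for QLoRA fine-tuning.

    params_b: model size in billions of parameters.
    bytes_per_param: ~0.5 for 4-bit quantized base weights.
    overhead_gb: rough allowance for LoRA adapters, optimizer state,
        activations, and CUDA context (an assumed figure).
    """
    weights_gb = params_b * bytes_per_param
    return weights_gb + overhead_gb

# A 14B model in 4-bit is ~7 GB of weights plus overhead, which is why
# a 16 GB card like the 5070 Ti tops out around 14B.
print(round(qlora_vram_gb(14), 1))  # ~11.0 GB: fits in 16 GB
print(round(qlora_vram_gb(32), 1))  # ~20.0 GB: does not fit in 16 GB
```

Under these assumptions, anything much past ~20B starts spilling over a 16 GB card even at 4-bit.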
Is it worth it? If I fine-tune GLM 4.7 Flash on a 100k+ row dataset, with answers generated by Gemini 3 Flash, covering various domains and concepts, so it's really diverse. Could you please give advice?
But you know, I get much higher tokens per second with MoE at nearly the same VRAM consumption (even though it's 6 GB more, it only does the math on ~3B active parameters, making it much faster).
Exactly why I don't understand why people still use dense so often. Sub-10B, maybe, sure. Otherwise: nah.
Well, I use dense models under 32B for three reasons: (1) they are well supported, (2) they are easy to train, and (3) they are consistent baselines.
Above 32B, though, they become almost impossible to run.
Yeah, dense models are still kind of outstanding under a limited parameter budget (personally I prefer dense LLMs under 20B), but again it depends on hardware. But imagine if I fine-tune it perfectly and make a much better model than the original, and it achieves scores and feels like frontend models such as Gemini 3, GPT-5, and Claude Sonnet 4.5: then it's a clear advantage to have MoE with ~40% faster tokens per second over the 27B dense variant. Again, an MoE will be harder to fine-tune, but it will pay off with the speed advantage.
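The speed argument above can be put in rough numbers: decode throughput scales mostly with *active* parameters, while memory scales with *total* parameters. A quick sketch using the models from this thread (27B dense vs a 35B-A3B MoE), assuming cost per token scales linearly with active parameters, which is a simplification:

```python
def flops_per_token(active_params_b: float) -> float:
    # ~2 FLOPs per active parameter per generated token
    # (a standard rule of thumb for transformer decoding).
    return 2 * active_params_b * 1e9

dense_27b = flops_per_token(27)   # dense: all 27B parameters are active
moe_35b_a3b = flops_per_token(3)  # MoE: only ~3B parameters active per token

print(dense_27b / moe_35b_a3b)  # → 9.0, i.e. ~9x fewer FLOPs/token for the MoE
```

In practice memory bandwidth, routing overhead, and KV-cache reads eat into that ratio, which is why observed speedups (like the ~40% figure mentioned) are far smaller than the raw FLOPs gap.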
I'm not sure which model you'll end up going for, but if you use Qwen 3.5 35B-A3B or something (I'd recommend this one because of the Qwen 3.5 architecture), you can still improve the architecture and inference by using quantization-aware training and multi-token prediction. You could also try flash attention for better long-context performance. I mean, if you use Qwen 3.5 specifically, I would also hope for a better front end (Qwen models are garbage at it, genuinely unusable), but otherwise I'd already be happy with a tie with the baseline given the better inference optimization.
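To make the quantization-aware training (QAT) suggestion concrete: the core trick is fake quantization, where weights are snapped to the low-bit grid in the forward pass so the model learns to tolerate quantization error. A minimal symmetric-int4 sketch in pure Python (generic illustration, not any specific model's recipe):

```python
def fake_quantize(weights, bits=4):
    """Snap weights to a symmetric low-bit grid and dequantize.

    This is the forward pass of QAT: training against these rounded
    values teaches the network to be robust to quantization error.
    """
    qmax = 2 ** (bits - 1) - 1                 # 7 levels each side for int4
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]

# 0.7 and -0.3 already sit on the int4 grid (scale = 0.1 here);
# 0.34 gets rounded to ~0.3, illustrating the quantization error
# the model learns to absorb during training.
print(fake_quantize([0.7, -0.3, 0.34]))
```

In a real QAT setup this runs inside the training loop with a straight-through estimator so gradients flow through the rounding as if it were the identity.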
Oh no, I meant "frontend models" as in high-quality, best models (sorry for the confusion, I'm currently learning English).
I agree with you, thanks for the help. Also, for the front-end UI and coding stuff, I would include 5-10k rows of GLM 5 front-end design masterpieces.
Gemini is the best at front end; no need for GLM. The more models you use, the messier training can get.