40B to 80B dense LLM: A better/smarter model than 122B that can still fit
Since my 64 GB RAM + 12 GB VRAM gaming PC can't fit the recommended minimum Q4_K_M quant of the 122B, and since the dense 27B model performs really well and seems to surprise many (and not only on the official benchmarks), here is how an even more powerful model than the 122B could fit into the same 64 GB RAM + 12 GB VRAM: make a dense model in the, e.g., 40B to 80B range. At Q4_K_M (~4.5 bits per weight), the 80B model would be just 45-50 GB (80 GB at 8 bpw × 4.5/8 bpw ≈ 45 GB) and would still easily fit. Yes, it would run slower, but the big advantage is that it would give a much higher quality answer, and waiting longer for a much better answer is a tradeoff many are easily willing to take. (One could say that with RAM prices so high right now, we've come full circle.)
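The size arithmetic above can be sketched in a few lines. This is a back-of-the-envelope estimate only: `quant_size_gb` is a made-up helper name, the 4.5 bpw figure for Q4_K_M is an approximation, and real GGUF files vary a bit due to embeddings and metadata overhead.

```python
def quant_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough quantized model size in GB: parameters * bpw / 8 bits-per-byte.

    Ignores per-file overhead (tokenizer, metadata, mixed-precision layers),
    so treat the result as a lower-bound estimate.
    """
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9  # bytes -> GB

print(quant_size_gb(80, 4.5))   # 45.0 -- fits in 64 GB RAM + 12 GB VRAM
print(quant_size_gb(122, 4.5))  # 68.625 -- much tighter squeeze
```

Plugging in other quants (e.g. ~3.5 bpw for Q3_K_M) shows how quickly the feasible parameter count shifts with the chosen quantization.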
That's the thing: a dense model gives only marginally better output and is vastly less efficient to run.
The real answer is serial processing via iterative response generation. With multi-turn generation, the results vastly outstrip any single response from an 80B dense model or larger.
Doing this with a dense model is prohibitively slow, but not with an MoE. It can run 3 or 4 turns in the time a dense model takes for a single output, and the output will be drastically better, particularly for coding...
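The iterative scheme described here can be sketched as a simple refine loop: generate a draft, then repeatedly feed it back with a critique-and-improve prompt. Everything below is hypothetical scaffolding; `generate` is a stand-in stub for whatever local completion call you use (llama.cpp server, Ollama, etc.), and the function names are made up for illustration.

```python
def generate(prompt: str) -> str:
    # Placeholder stub: in practice this would call your local MoE model's
    # completion endpoint and return the generated text.
    return f"<model response to: {prompt[:40]}>"

def iterative_answer(task: str, passes: int = 3) -> str:
    """Multi-turn generation: draft once, then refine `passes` times.

    Each refinement turn sees the task plus the previous attempt, so the
    model can critique and improve its own output serially.
    """
    draft = generate(task)
    for _ in range(passes):
        draft = generate(
            f"Task: {task}\n\nPrevious attempt:\n{draft}\n\n"
            "Critique the attempt above and produce an improved version."
        )
    return draft
```

Because a fast MoE can run 3-4 of these turns in the wall-clock time a large dense model needs for one, the loop is only practical when per-turn latency is low.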