Thanks for making this!
What in your readme is supposed to come after
π Training Data
?
Curious about how you did it.
Hello, and thank you for your interest in this model. Nothing comes after that; the full line was "Primary Dataset: Magicoder-Evol-Instruct-110K π Training Data". Sorry for the confusion. The model didn't need much training data, just slight correction. I tried the NVIDIA Open Code datasets, but across several training sessions they degraded the model significantly.
Hi @jgebbeken ! Great work on this model.
I ran your gemma-4-coder through my LLM Reasoning Benchmark v10, a custom test suite designed to evaluate the logical reasoning capabilities of local models. Here are the results.
Benchmark Overview
- 30 tests across 10 categories (3 difficulty variants each)
- Categories: Arithmetic, Logic (constraint satisfaction), Speed/Time, Combinatorics, Age Algebra, Truth/Liars puzzles, Optimization, Probability, Graph pathfinding, Business problems
- All answers are validated programmatically against known correct solutions
- Models must output structured JSON with both reasoning and final answers
- Scoring v2.0 with partial credit and cascade error detection
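For context, here is a minimal sketch of what "validated programmatically with partial credit" could look like. The JSON field names (`reasoning`/`answer`) and the credit values are assumptions for illustration, not the benchmark's actual code:

```python
import json

# Hypothetical model output in the required structured-JSON format.
raw = '{"reasoning": "17 groups of 24 is 408, plus nothing left over.", "answer": 408}'

def grade(raw_response: str, expected, partial=None) -> float:
    """Score one structured JSON answer programmatically.

    Full credit for an exact match, half credit for answers in an
    optional near-miss set, zero for anything else (including
    unparseable output)."""
    try:
        parsed = json.loads(raw_response)
    except json.JSONDecodeError:
        return 0.0  # model failed to emit valid JSON
    answer = parsed.get("answer")
    if answer == expected:
        return 1.0
    if partial is not None and answer in partial:
        return 0.5  # partial credit for a near-miss
    return 0.0

print(grade(raw, expected=408))  # 1.0
```

A real harness would also inspect the `reasoning` field (e.g. for cascade errors, where one early slip propagates through otherwise-sound steps), but the exact-match core is the same idea.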
Results: gemma-4-coder vs 41 other models
| Model | Score | Perfect tests | Avg tokens | Total time |
|---|---|---|---|---|
| gemma-4-coder | 200/200 | 30/30 | ~835 | ~8 min |
| microsoft/phi-4-reasoning-plus [THINKING] | 200/200 | 30/30 | ~1,516 | ~43 min |
| qwen/qwen3-coder-30b | ~174/200 | 22/30 | n/a | n/a |
| gigachat3.1-10b-a1.8b | ~165/200 | 17/30 | n/a | n/a |
| qwen/qwen2.5-coder-14b | ~132/200 | 14/30 | n/a | n/a |
gemma-4-coder achieved a perfect score, the only non-thinking model to do so.
Compared to the other perfect scorer (phi-4-reasoning-plus):
- 1.8x fewer tokens per response
- 5.2x faster total benchmark time
- No [THINKING] mode required; it solves everything via direct generation
Environment
- Server: LM Studio (localhost)
- Hardware: local GPU inference
- Settings: max_tokens=8192, default sampling parameters
- Quantization: Q4_K_M (as provided in this repo)
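For anyone wanting to reproduce this setup, LM Studio exposes an OpenAI-compatible API on localhost. A minimal sketch of building one request with the settings above; the model identifier, prompt, and default port 1234 are assumptions, so adjust them to your installation:

```python
import json
import urllib.request

def build_request(prompt: str,
                  model: str = "gemma-4-coder",        # assumed identifier; check LM Studio's model list
                  max_tokens: int = 8192,              # matches the benchmark setting
                  url: str = "http://localhost:1234/v1/chat/completions"):
    """Build a chat-completion request for LM Studio's local server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        # no temperature/top_p overrides: default sampling, as in the run above
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_request("What is 17 * 24? Reply as JSON with 'reasoning' and 'answer'.")
# with urllib.request.urlopen(req) as resp:            # requires LM Studio running locally
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Quantization (Q4_K_M here) is chosen when loading the GGUF in LM Studio, not in the request itself.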
Key observations
Your model scored perfectly across all categories, including the hardest ones (Combinatorics, Graph pathfinding, Truth/Liars), where most other models fail. This is especially impressive given that the Magicoder fine-tune targets code tasks, yet the benchmark tests pure logical/mathematical reasoning.
This suggests the base Gemma 4 E4B architecture is exceptionally strong, and your fine-tune preserved (or slightly enhanced) its reasoning capabilities while adding code specialization.
Full benchmark is still running across all 42 models. I plan to share complete results on r/LocalLLaMA soon.
Thank you for releasing this model; it's a hidden gem that deserves more attention!
Wow, I am actually amazed by this. Thank you. I wasn't completely sure how my model would fare on benchmarks. If it's alright with you, I would like to post these results and maybe give the model the light it deserves.
At the link below you can find the source code for the tests themselves (I wrote them for myself to quickly test a large collection of local models), the raw responses from all the models, and the summary test results.
https://huggingface.co/Fortser/Flux_Krea/resolve/main/tests.zip
In terms of the speed-to-quality ratio, your model is the clear leader.