# loggenix-moe-0.4B-0.2A-sft-s3.1

## Model Description

This is a Mixture-of-Experts (MoE) language model fine-tuned for tasks including:
- Tool/Function Calling
- Code Generation
- Reasoning & Math
- Safety & Content Moderation
## Evaluation Results

Evaluation Date: 2026-01-28 10:55:22

### Standard Benchmarks (lm-evaluation-harness)
| Benchmark | Score |
|---|---|
| arc_challenge | 10.00% |
| arc_easy | 10.00% |
| boolq | 40.00% |
| gsm8k | 0.00% |
| hellaswag | 40.00% |
| mmlu | 25.26% |
| mmlu_humanities | 26.15% |
| mmlu_formal_logic | 20.00% |
| mmlu_high_school_european_history | 20.00% |
| mmlu_high_school_us_history | 30.00% |
| mmlu_high_school_world_history | 30.00% |
| mmlu_international_law | 30.00% |
| mmlu_jurisprudence | 40.00% |
| mmlu_logical_fallacies | 40.00% |
| mmlu_moral_disputes | 10.00% |
| mmlu_moral_scenarios | 20.00% |
| mmlu_philosophy | 10.00% |
| mmlu_prehistory | 30.00% |
| mmlu_professional_law | 20.00% |
| mmlu_world_religions | 40.00% |
| mmlu_other | 26.15% |
| mmlu_business_ethics | 10.00% |
| mmlu_clinical_knowledge | 40.00% |
| mmlu_college_medicine | 40.00% |
| mmlu_global_facts | 20.00% |
| mmlu_human_aging | 40.00% |
| mmlu_management | 0.00% |
| mmlu_marketing | 50.00% |
| mmlu_medical_genetics | 30.00% |
| mmlu_miscellaneous | 20.00% |
| mmlu_nutrition | 20.00% |
| mmlu_professional_accounting | 30.00% |
| mmlu_professional_medicine | 20.00% |
| mmlu_virology | 20.00% |
| mmlu_social_sciences | 27.50% |
| mmlu_econometrics | 40.00% |
| mmlu_high_school_geography | 20.00% |
| mmlu_high_school_government_and_politics | 30.00% |
| mmlu_high_school_macroeconomics | 0.00% |
| mmlu_high_school_microeconomics | 10.00% |
| mmlu_high_school_psychology | 50.00% |
| mmlu_human_sexuality | 20.00% |
| mmlu_professional_psychology | 40.00% |
| mmlu_public_relations | 20.00% |
| mmlu_security_studies | 30.00% |
| mmlu_sociology | 20.00% |
| mmlu_us_foreign_policy | 50.00% |
| mmlu_stem | 22.63% |
| mmlu_abstract_algebra | 20.00% |
| mmlu_anatomy | 20.00% |
| mmlu_astronomy | 20.00% |
| mmlu_college_biology | 30.00% |
| mmlu_college_chemistry | 10.00% |
| mmlu_college_computer_science | 20.00% |
| mmlu_college_mathematics | 20.00% |
| mmlu_college_physics | 30.00% |
| mmlu_computer_security | 50.00% |
| mmlu_conceptual_physics | 30.00% |
| mmlu_electrical_engineering | 40.00% |
| mmlu_elementary_mathematics | 0.00% |
| mmlu_high_school_biology | 30.00% |
| mmlu_high_school_chemistry | 30.00% |
| mmlu_high_school_computer_science | 20.00% |
| mmlu_high_school_mathematics | 20.00% |
| mmlu_high_school_physics | 10.00% |
| mmlu_high_school_statistics | 0.00% |
| mmlu_machine_learning | 30.00% |
| openbookqa | 40.00% |
| piqa | 70.00% |
| winogrande | 60.00% |
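The MMLU category rows appear to be unweighted means of their subtask rows; for example, averaging the 19 STEM subtask scores above reproduces the reported mmlu_stem figure of 22.63%. A quick sanity check:

```python
# STEM subtask scores from the table above (in %), in row order:
# abstract_algebra .. machine_learning (19 subtasks).
stem = [20, 20, 20, 30, 10, 20, 20, 30, 50, 30,
        40, 0, 30, 30, 20, 20, 10, 0, 30]

mmlu_stem = round(sum(stem) / len(stem), 2)
print(mmlu_stem)  # → 22.63, matching the table's mmlu_stem row
```

The same holds for mmlu_social_sciences: the plain average of its 12 subtask rows is 27.50%.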
### Synthetic Task Categories
| Category | Score | Tasks Evaluated |
|---|---|---|
| Tool Calling | 50.00% | 1 |
| SRE DevOps | 16.67% | 12 |
| Programming | 25.88% | 17 |
| Reasoning | 11.25% | 4 |
| LLM Evaluation | 22.00% | 5 |
| Safety Ethics | 20.00% | 8 |
| Financial | 23.75% | 8 |
| Customer Support | 16.88% | 8 |
| Observability | 26.25% | 8 |
| Content Generation | 13.33% | 3 |
| Core AI | 35.00% | 3 |
### Tool-Calling Performance
| Metric | Score |
|---|---|
| Format Accuracy | 20.00% |
| Function Name Accuracy | 0.00% |
| Parameter Accuracy | 0.00% |
| Overall Accuracy | 6.67% |
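The Overall Accuracy figure is consistent with an unweighted mean of the three component metrics: (20 + 0 + 0) / 3 ≈ 6.67%. A minimal sketch of such a scorer is below; the JSON call format, field names, and equal weighting are assumptions for illustration, not the actual evaluation harness:

```python
import json

def score_tool_call(output: str, expected_name: str, expected_params: dict) -> dict:
    """Score a model's tool-call output on format, name, and parameter accuracy."""
    scores = {"format": 0.0, "name": 0.0, "params": 0.0}
    try:
        call = json.loads(output)  # format check: output must be parseable JSON
        scores["format"] = 1.0
    except json.JSONDecodeError:
        return scores | {"overall": 0.0}
    if call.get("name") == expected_name:          # exact function-name match
        scores["name"] = 1.0
    if call.get("arguments") == expected_params:   # exact parameter match
        scores["params"] = 1.0
    scores["overall"] = (scores["format"] + scores["name"] + scores["params"]) / 3
    return scores

result = score_tool_call(
    '{"name": "get_weather", "arguments": {"city": "San Francisco"}}',
    "get_weather",
    {"city": "San Francisco"},
)
print(result["overall"])  # → 1.0 for a fully correct call
```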
### Code Generation Performance
| Metric | Score |
|---|---|
| Syntax Accuracy | 8.33% |
| Keyword Coverage | 29.38% |
| Completion Rate | 75.00% |
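Syntax Accuracy and Keyword Coverage plausibly correspond to checks like the following sketch — syntax via `ast.parse`, coverage as the fraction of expected identifiers present in the output. This is an illustration of the metric definitions, not the actual harness:

```python
import ast

def syntax_ok(code: str) -> bool:
    """Return True if the generated Python code parses without a SyntaxError."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def keyword_coverage(code: str, expected: list[str]) -> float:
    """Fraction of expected keywords/identifiers that appear in the output."""
    return sum(kw in code for kw in expected) / len(expected)

snippet = "def add(a, b):\n    return a + b\n"
ok = syntax_ok(snippet)                                       # parses: True
cov = keyword_coverage(snippet, ["def", "return", "yield"])   # 2 of 3 present
```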
## Overall Summary
- Synthetic Tasks Mean Score: 21.68%
- Total Tasks Evaluated: 170
- Task Coverage: 180.9%
## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("kshitijthakkar/loggenix-moe-0.4B-0.2A-sft-s3.1")
model = AutoModelForCausalLM.from_pretrained(
    "kshitijthakkar/loggenix-moe-0.4B-0.2A-sft-s3.1",
    trust_remote_code=True,
)

# For tool calling
messages = [
    {"role": "system", "content": "You are a helpful assistant with access to tools."},
    {"role": "user", "content": "What's the weather in San Francisco?"},
]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
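For tool calling, recent versions of transformers accept an OpenAI-style JSON-schema tool list via `apply_chat_template(tools=...)`, provided the model's chat template supports it. The schema below is a hypothetical example — the function name and parameters are illustrative, not part of this model card:

```python
import json

# Hypothetical tool schema in the JSON-schema style accepted by
# transformers' apply_chat_template(tools=...). Name/params are illustrative.
get_weather = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

# With a tools-aware chat template, this would be passed as:
# inputs = tokenizer.apply_chat_template(
#     messages, tools=[get_weather], add_generation_prompt=True, return_tensors="pt"
# )
print(json.dumps(get_weather, indent=2))
```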
## Training Data

The model was fine-tuned on a diverse mixture of datasets, including:
- Tool-calling datasets (Toucan, ToolACE, SmoLAgents)
- Safety datasets (HelpSteer3, Safety-Guard, Content-Safety-Reasoning)
- Math datasets (GSM8K, MetaMath, Big-Math-RL)
- Code datasets (Magicoder)
- Reasoning datasets (Reasoning-Gemini, Textbook-Reasoning)
## License

Apache 2.0