Cetvel benchmark context & inference observations

#2
by O96a - opened

Solid work on the Turkish fine-tune. The 32.54 QA score on Cetvel is interesting: I've been benchmarking smaller models on Arabic dialect tasks and see similar patterns, where generative QA struggles with domain-specific phrasing.

Curious: did you observe any degradation on translation tasks after fine-tuning on the QA/summarization mix? We found that multitask fine-tuning on low-resource languages often trades generative fluency for task-specific accuracy.

Also, have you tested inference latency on CPU-only setups? The 2.3B parameter count fits well within edge deployment constraints, and Mistral's KV-cache efficiency makes it practical for real-time applications if quantization holds.
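For reference, here's the kind of minimal timing harness I use for CPU-latency checks. The `dummy_generate` function is just a placeholder stand-in for the real quantized `model.generate` call (everything here is illustrative, not tied to your setup):

```python
import time

def measure_latency(generate_fn, prompt, n_runs=5):
    """Time a generation callable; return mean seconds per call."""
    # warm-up run so allocations and caches don't skew the numbers
    generate_fn(prompt)
    times = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        generate_fn(prompt)
        times.append(time.perf_counter() - t0)
    return sum(times) / len(times)

# placeholder for the actual quantized-model generate call
def dummy_generate(prompt):
    time.sleep(0.01)  # simulate ~10 ms of work
    return prompt + " ..."

mean_s = measure_latency(dummy_generate, "Merhaba")
print(f"mean latency: {mean_s * 1000:.1f} ms")
```

Swapping `dummy_generate` for the real call against int8/int4 variants makes it easy to see whether quantization actually holds up latency-wise on the target CPU.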
