Cetvel benchmark context & inference observations
#2
by O96a - opened
Solid work on the Turkish fine-tune. The 32.54 QA score on Cetvel is interesting; I've been benchmarking smaller models on Arabic dialect tasks and see similar patterns where generative QA struggles with domain-specific phrasing.
Curious: did you observe any degradation on translation tasks after fine-tuning on the QA/summarization mix? We found that multitask fine-tuning on low-resource languages often trades generative fluency for task-specific accuracy.
Also, have you tested inference latency on CPU-only setups? The 2.3B parameter count fits well within edge deployment constraints, and Mistral's KV-cache efficiency makes it practical for real-time applications if quantization holds.
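For reference, here's the kind of harness I've been using for the CPU latency question: a minimal sketch that times repeated generation calls and reports p50/p95 latency. The `measure_latency` helper and `dummy_generate` stand-in are illustrative; in practice you'd pass a closure around a real `generate()` call on the quantized checkpoint.

```python
import statistics
import time

def measure_latency(generate_fn, prompt, n_warmup=2, n_runs=10):
    """Time repeated calls to generate_fn(prompt); return p50/p95 in ms."""
    for _ in range(n_warmup):
        generate_fn(prompt)  # warm up caches / lazy initialization
    samples = []
    for _ in range(n_runs):
        start = time.perf_counter()
        generate_fn(prompt)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return {
        "p50_ms": statistics.median(samples),
        "p95_ms": samples[min(len(samples) - 1, int(0.95 * len(samples)))],
    }

if __name__ == "__main__":
    # Stand-in for a real model call (e.g. a transformers or llama.cpp
    # generate() on the quantized model) -- hypothetical placeholder.
    def dummy_generate(prompt):
        time.sleep(0.005)
        return prompt[::-1]

    print(measure_latency(dummy_generate, "Merhaba"))
```

Warmup runs matter on CPU since the first call often pays one-time allocation costs that would otherwise skew the tail percentiles.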