Roka: Turkish Tool-Calling Fine-Tune of Kara-Kumru 2B
Roka is a supervised fine-tune of AlicanKiraz0/Kara-Kumru-v1.0-2B that teaches a 2B-parameter Turkish language model to use five tools (web search, calculator, date/time, weather, URL reader) via a Hermes-style `<tool_call>…</tool_call>` output format.
This is a v0.2 research preview, released for reproducibility and community feedback. It is not a production-grade tool-calling agent and has known weaknesses (see Limitations).
The v0.2 training set is fully decontaminated against the evaluation set: no test-set query appears verbatim in train or validation.
Model at a glance
| Field | Value |
|---|---|
| Base model | AlicanKiraz0/Kara-Kumru-v1.0-2B (Mistral architecture, Llama-3 chat template, Turkish-pretrained) |
| Upstream base | vngrs-ai/Kumru-2B |
| Parameters | ~2.15B |
| Fine-tuning | Full fine-tuning, 3 epochs, LR 5e-5 linear, bf16, TRL SFTTrainer |
| Hardware | Single NVIDIA A6000 (~65 min total, ~22 min/epoch) |
| Languages | Primarily Turkish; ~13% of the training mix is English (Glaive-sourced synthetic tool-calling examples) |
| License | Apache 2.0 (inherited from base chain) |
Tool set
| Tool | Description |
|---|---|
| `web_search` | Internet search (DuckDuckGo) |
| `calculator` | Arithmetic expression evaluator |
| `datetime` | Date/time and calendar arithmetic (9 actions: `today`, `now`, `day_of_week`, `add_days`, `date_diff`, `days_until`, `day_of_year`, `end_of_month`, `days_until_weekday`) |
| `hava_durumu` | Weather query by city name |
| `sayfa_oku` | URL content reader |
The model is trained to emit tool calls as:
```
<tool_call>
{"name": "datetime", "arguments": {"action": "today"}}
</tool_call>
```
Tool results are fed back to the model wrapped in `<tool_response>…</tool_response>` inside a user turn, and the model synthesizes a final Turkish answer.
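The round-trip can be sketched in a few lines. This is a minimal illustration, not the repo's actual inference client; the helper names (`parse_tool_call`, `tool_response_turn`) are hypothetical:

```python
import json
import re

# Sketch of the tool-call round-trip (illustrative helper names, not the
# repo's client): extract a Hermes-style <tool_call> block from the model's
# output, then wrap the tool's result in a <tool_response> user turn for
# the next generation step.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_call(model_output: str):
    """Return {"name": ..., "arguments": ...} or None if no call was emitted."""
    m = TOOL_CALL_RE.search(model_output)
    return json.loads(m.group(1)) if m else None

def tool_response_turn(result) -> dict:
    """Wrap a tool result as the user-role turn the model was trained on."""
    return {
        "role": "user",
        "content": f"<tool_response>\n{json.dumps(result, ensure_ascii=False)}\n</tool_response>",
    }

call = parse_tool_call(
    '<tool_call>\n{"name": "datetime", "arguments": {"action": "today"}}\n</tool_call>'
)
# call == {"name": "datetime", "arguments": {"action": "today"}}
```

The executor then runs the named tool and appends `tool_response_turn(result)` to the conversation before generating the final answer.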
Evaluation
The test set contains 260 Turkish prompts spread over six categories (simple tool calls, fullflow multi-step, parallel, multiple tools, irrelevance, adversarial). Scoring uses an alignment-aware harness (`scripts/rescore_aligned.py`) that normalizes equivalent datetime actions and accepts semantically equivalent arithmetic expressions.
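The arithmetic-equivalence part of that scoring can be illustrated with a small sketch. The helper names below are hypothetical (the real logic lives in `scripts/rescore_aligned.py`): two `calculator` expressions count as matching if a restricted arithmetic evaluator gives them the same value, so `"25*4"` matches `"4 * 25"`.

```python
import ast
import operator as op

# Illustrative equivalence-aware comparison for calculator arguments
# (hypothetical helpers, not the released harness).
_OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul,
        ast.Div: op.truediv, ast.Pow: op.pow, ast.USub: op.neg}

def safe_eval(expr: str) -> float:
    """Evaluate + - * / ** expressions only; reject everything else."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"disallowed node: {node!r}")
    return walk(ast.parse(expr, mode="eval"))

def expressions_equivalent(a: str, b: str) -> bool:
    """Numeric equivalence where possible, exact string match as fallback."""
    try:
        return abs(safe_eval(a) - safe_eval(b)) < 1e-9
    except (ValueError, SyntaxError, ZeroDivisionError):
        return a.strip() == b.strip()
```

A plain string comparison would mark `"4 * 25"` wrong against a reference of `"25*4"`; the numeric check avoids penalizing such formatting differences.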
Overall results (Roka v0.2, April 2026)
| View | n | Full-Match | Tool-Call Acc. | Name Acc. | Arg Acc. |
|---|---|---|---|---|---|
| All test (held-out) | 260 | 73.5% | 93.1% | 71.9% | 60.6% |
Every test query was verified to be absent from both `data/train.jsonl` and `data/val.jsonl`, so the 73.5% number above is a genuinely held-out measurement. See Decontamination history below for why this is lower than an earlier, un-decontaminated run.
Per-subcategory results
| Subcategory | n | Full-Match |
|---|---|---|
| simple/web_search | 30 | 93.3% |
| simple/weather | 20 | 100.0% |
| simple/url_reader | 15 | 100.0% |
| simple/calculator | 20 | 70.0% |
| simple/datetime | 15 | 46.7% |
| fullflow | 35 | 80.0% |
| multiple | 45 | 64.4% |
| parallel | 15 | 0.0% |
| adversarial/turkish_special | 10 | 90.0% |
| adversarial/edge_case | 5 | 40.0% |
| adversarial/ambiguous | 15 | 26.7% |
| irrelevance/greeting | 15 | 100.0% |
| irrelevance/identity | 10 | 100.0% |
| irrelevance/opinion | 10 | 100.0% |
Parallel tool calls score 0% because the training mix does not contain parallel-call examples. This is a known gap, not a reproducibility failure.
Decontamination history
During preparation for this release we audited the training set and found that 44 of the 260 test queries appeared verbatim in train/val (8 in simple/datetime, 6 in simple/web_search, 7 in multiple, 19 in irrelevance/identity and irrelevance/greeting, and 4 scattered across other subcategories; see the full breakdown under Contamination verification). We removed all 76 matching train examples and 6 matching val examples, and retrained on the clean split. That retraining is the model reported above.
For transparency we also report the before-and-after numbers on the 216 test queries that were not affected by the decontamination (i.e., the genuinely held-out subset from the pre-cleanup model's perspective):
| Model | Training data | Clean-216 FM |
|---|---|---|
| v0.1 pre-clean | original (with 76 overlaps) | 78.2% |
| v0.2 (released) | decontaminated | 73.6% |
The ~4.6-point drop is informative: it is not contamination inflation, since these 216 queries were never in the training set to begin with. The removed training examples were pattern-providing (datetime variants, fullflow web-search turns, distractor augmentations of the same base queries), and losing them cost about 4.6 points of generalization even on held-out queries. The cost of honest decontamination was larger than a narrow definition of "memorization gain" would predict. We report the post-decontamination number because it is the only one defensible as a held-out measurement. A future v0.3 will attempt to recover the gap by adding clean synthetic replacements for the removed examples.
Development journey (brief)
Arriving at the final model involved a fair number of dead ends.
- Baseline (Run 10): 62.7% aligned FM with an earlier pipeline, before any of the spec-005 data work.
- Phase A v1–v4 collapse: four consecutive training runs where loss converged to near-zero but test-set Full-Match stayed at 0/260. All of them passed loss sanity checks, so the failure was invisible from inside the run.
- Root cause (TRL issue #3910): the `max_seq_length` argument was silently renamed to `max_length` (default 1024) in TRL 0.20+. Every assistant turn longer than 1024 tokens (≈75% of our fullflow examples) was being truncated before it contributed to the loss. The model trained to completion on fragments, not on full tool-calling traces. Fix: pass `max_length=4096` explicitly.
- Data iterations:
  - Removed the `unit` argument from all `hava_durumu` training examples (the test set does not supply it). `simple/weather` Full-Match rose from 10% to 100%.
  - Added 45 supplementary `datetime` examples covering `day_of_year`, `end_of_month`, and `days_until_weekday`, the test actions that were absent from the R10 training data.
  - Those supplementary examples caused a regression on `day_of_week` queries ("23 Nisan hangi güne denk geliyor?", "Which day of the week does 23 April fall on?", was mis-routed to `day_of_year`). A targeted set of 30 `day_of_week` contrast examples fixed it.
- Final v0.1 model: 4,778 training / 509 validation examples, 795 optimizer steps. Reported 76.9% all-test, 78.2% on the clean-216 subset.
- v0.2 decontamination: 76 train and 6 val examples whose first user turn matched a test query were removed, producing a 4,702 / 503 split. Retraining on this split gave the 73.5% number now reported above. The 4.6-point drop on the clean-216 subset between v0.1 and v0.2 is the cost of honest decontamination (see Decontamination history).
Total compute used across Phase A and v0.2: ~5 A6000-hours.
Limitations
- Multi-turn pattern lock-in. The SFT mix contains very few multi-turn tool-calling sequences. If the user opens with a chit-chat turn ("selam"), the model tends to stay in plain-chat mode on subsequent turns and skip the tool call. The provided `scripts/serve_ui.py` works around this by feeding only the current user message (without prior turns) into the tool-decision loop.
- Parallel tool calls: 0%. The training mix contains no parallel-call examples.
- `hava_durumu` has no temporal parameter. Queries like "yarın İstanbul'da hava" ("tomorrow's weather in İstanbul") still produce `{"city": "İstanbul"}` because that is all the schema allows. The fix is a schema change plus data regeneration, not a prompt change.
- Adversarial/ambiguous: 26.7%. The model is easily nudged off-task by ambiguous phrasing.
- Long-passage synthesis is brittle. When `sayfa_oku` returns several paragraphs, the synthesized summary sometimes fragments quotes in an unnatural way.
- Hermes parser coupling. Native OpenAI-style `tool_calls` parsing via `llama-server` requires the provided `training/roka_tool_template.jinja` chat template, and the client must pass the full list of 5 tools; passing a subset confuses llama.cpp's Hermes detector.
- Scoring discrepancy. The in-training `training/eval.py` scorer disagrees slightly with the alignment-aware rescorer; only the rescored numbers are reported above. Resolving the discrepancy is open work.
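The multi-turn workaround described above is simple to state in code. This is a sketch with a hypothetical helper name, not the actual `scripts/serve_ui.py` implementation:

```python
# Sketch of the serve_ui.py workaround (hypothetical helper name): the
# tool-decision step sees only the most recent user message, so earlier
# chit-chat turns cannot lock the model into plain-chat mode.
def tool_decision_input(history: list[dict]) -> list[dict]:
    last_user = next(m for m in reversed(history) if m["role"] == "user")
    return [last_user]

history = [
    {"role": "user", "content": "selam"},
    {"role": "assistant", "content": "Merhaba!"},
    {"role": "user", "content": "Ankara'da hava nasıl?"},  # "how is the weather in Ankara?"
]
```

The trade-off is obvious: the tool-decision loop loses all conversational context, which is acceptable for single-shot tool queries but rules out follow-up questions that depend on earlier turns.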
Training data
- 4,778 train / 509 validation examples, Hermes-format chat turns.
- ~72% Turkish, ~13% English, ~15% short/symbolic. The English fraction is Glaive-sourced synthetic tool-calling data retained for multi-tool pattern coverage.
- Deterministic generators for `calculator`, `datetime`, and `hava_durumu` (in `training/generators/`).
- Real DuckDuckGo search results cached in `data/ddg_cache.json` and used to construct `web_search` fullflow examples.
- PII scan: only two flagged matches in user-facing content, both false positives (embedded WSJ article IDs). No email addresses, Turkish ID numbers, credit cards, or IP addresses were found.
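A toy example conveys what "deterministic generator" means here. This is not the repo's actual generator code, just a sketch in the same spirit: a fixed seed makes the generated examples byte-for-byte reproducible.

```python
import random

def gen_calculator_examples(n: int, seed: int = 0) -> list[dict]:
    """Toy deterministic generator in the spirit of training/generators/
    (illustrative, not the repo's code). Same seed -> identical dataset."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        a, b = rng.randint(2, 99), rng.randint(2, 99)
        examples.append({
            "user": f"{a} çarpı {b} kaç eder?",  # "what is a times b?"
            "tool_call": {"name": "calculator",
                          "arguments": {"expression": f"{a}*{b}"}},
        })
    return examples
```

Because every generator is seeded like this, `python -m training.build_dataset` (see Reproducibility) regenerates the exact released training set rather than a statistically similar one.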
Contamination verification
The released v0.2 model is trained on a split where no test query appears verbatim in either train or validation. The decontamination script (`scripts/decontaminate.py`) normalizes whitespace and case before matching. The pre-decontamination overlap distribution (all removed in v0.2) was:
| Subcategory | Overlap (removed) |
|---|---|
| irrelevance/identity | 8 / 10 |
| irrelevance/greeting | 11 / 15 |
| simple/datetime | 8 / 15 |
| simple/web_search | 6 / 30 |
| multiple | 7 / 45 |
| adversarial/turkish_special | 1 / 10 |
| irrelevance/opinion | 1 / 10 |
| simple/weather | 1 / 20 |
| fullflow | 1 / 35 |
Because augmentation variants of each base query (masked/distractor versions) shared the same user turn, removing 44 unique queries deleted 76 train examples and 6 val examples in total. The remaining 4,702 / 503 split is what v0.2 was trained on.
This decontamination is exact-string, not fuzzy. Near-duplicates (paraphrases that return the same tool call) are still present. Closing the paraphrase loophole requires a more elaborate embedding-based deduplication pass, which is left for v0.3.
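The exact-string check is easy to characterize precisely. The sketch below uses hypothetical helper names (the real script is `scripts/decontaminate.py`): normalize by lowercasing and collapsing whitespace, then compare the first user turn of each training row against the set of normalized test queries.

```python
# Sketch of exact-string decontamination (hypothetical helpers, not the
# actual scripts/decontaminate.py): whitespace- and case-insensitive,
# but paraphrases still slip through.
def normalize(query: str) -> str:
    return " ".join(query.lower().split())

def first_user_turn(row: dict) -> str:
    return next(m["content"] for m in row["messages"] if m["role"] == "user")

def overlapping_rows(train_rows: list[dict], test_queries: list[str]) -> list[dict]:
    test_keys = {normalize(q) for q in test_queries}
    return [r for r in train_rows if normalize(first_user_turn(r)) in test_keys]
```

Under this definition, `"  What DAY is it?"` and `"what day is it?"` collide, but a paraphrase like `"which day is today?"` does not, which is exactly the loophole the planned embedding-based pass would close.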
Repository layout
```
src/                        Inference clients (transformers & llama-server)
training/
  tools.py                  Tool schemas + training system prompt
  train.py                  TRL SFTTrainer entry point
  eval.py                   Test-set scorer (in-training)
  roka_tool_template.jinja  llama-server chat template with Hermes detection hook
  generators/               Deterministic data generators per tool
scripts/
  work_pipeline.py          End-to-end pod orchestration
  pod_run_and_dump.py       On-pod training → prediction dump → HF upload
  rescore_aligned.py        Alignment-aware rescorer (authoritative numbers)
  serve_ui.py               FastAPI chat UI wrapping the agent
data/
  train.jsonl, val.jsonl, test_set.json
specs/005-post-run10-75/    Spec, plan, and task list for this iteration
```
GitHub: https://github.com/bilersan/roka
Reproducibility
- Clone the repo and install requirements: `pip install -r requirements.txt`
- Regenerate the training set (deterministic): `python -m training.build_dataset`
- Train (RunPod-hosted, ~1 GPU-hour on an A6000): `python -m scripts.work_pipeline`
- Rescore predictions with the alignment-aware harness: `python -m scripts.rescore_aligned --predictions .work/artifacts/predictions/<run_id>.json`
The training recipe is fully specified in `training/config.yaml`. The only unusually specific hyperparameter is `max_length: 4096` in `training/train.py`; removing it reproduces the Phase A v1–v4 collapse described above.
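As a fragment, the critical setting looks like this. This assumes the TRL 0.20+ `SFTConfig` API; the other fields simply restate the hyperparameters from the table at the top and are illustrative, not a full copy of `training/config.yaml`:

```python
from trl import SFTConfig

# In TRL 0.20+, the old `max_seq_length` argument became `max_length`
# (default 1024). Leaving it at the default silently truncates long
# assistant turns before they contribute to the loss.
config = SFTConfig(
    max_length=4096,              # the one non-negotiable setting
    num_train_epochs=3,           # from the model-card table (illustrative)
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    bf16=True,
)
```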
Intended use and out-of-scope use
Intended: Turkish-language tool-calling agents for well-defined tools, research on small-model function calling, educational demonstrations of the SFT pipeline.
Out of scope:
- Safety-critical applications. The model has not been evaluated for harmful-content refusal beyond what Kara-Kumru inherits from its base.
- Parallel / agentic planning over large tool catalogs.
- Multi-turn conversational agents that need to preserve long prior context.
- Any application that requires the model to use tools not present in the training schema.
License
This repository and the released weights are distributed under the Apache License 2.0, inherited from both AlicanKiraz0/Kara-Kumru-v1.0-2B and its upstream base vngrs-ai/Kumru-2B. See LICENSE.
Citation
If you use Roka in research, please cite both the base model and this work:
```bibtex
@misc{roka_2026,
  title  = {Roka: Turkish Tool-Calling Fine-Tune of Kara-Kumru 2B},
  author = {Bilik, Ersan},
  year   = {2026},
  url    = {https://huggingface.co/ersanbil/roka}
}

@misc{karakumru_2025,
  title  = {Kara-Kumru-v1.0-2B},
  author = {Kiraz, Alican},
  year   = {2025},
  url    = {https://huggingface.co/AlicanKiraz0/Kara-Kumru-v1.0-2B}
}
```
Acknowledgements
- vngrs-ai for the open Turkish base model `Kumru-2B`.
- Alican Kiraz for the Turkish-conversational fine-tune `Kara-Kumru-v1.0-2B`.
- Hugging Face TRL / Unsloth for the training stack.
- The Glaive function-calling dataset for the English portion of the multi-tool synthetic mix.
Contact and feedback
Issues and pull requests are welcome on the GitHub mirror. This is a research preview; please file bugs for any behavior that contradicts the documented limitations above, since those are the interesting cases.