Roka: Turkish Tool-Calling Fine-Tune of Kara-Kumru 2B

Roka is a supervised fine-tune of AlicanKiraz0/Kara-Kumru-v1.0-2B that teaches a 2B-parameter Turkish language model to use five tools (web search, calculator, date/time, weather, URL reader) via a Hermes-style <tool_call>…</tool_call> output format.

This is a v0.2 research preview, released for reproducibility and community feedback. It is not a production-grade tool-calling agent and has known weaknesses (see Limitations).

The v0.2 training set is fully decontaminated against the evaluation set: no test-set query appears verbatim in train or validation.

Model at a glance

| Field | Value |
|---|---|
| Base model | AlicanKiraz0/Kara-Kumru-v1.0-2B (Mistral architecture, Llama-3 chat template, Turkish-pretrained) |
| Upstream base | vngrs-ai/Kumru-2B |
| Parameters | ~2.15B |
| Fine-tuning | Full fine-tuning, 3 epochs, LR 5e-5 linear, bf16, TRL SFTTrainer |
| Hardware | Single NVIDIA A6000 (~65 min total, ~22 min per epoch) |
| Languages | Primarily Turkish; ~13% of the training mix is English (Glaive-sourced synthetic tool-calling examples) |
| License | Apache 2.0 (inherited from base chain) |

Tool set

| Tool | Description |
|---|---|
| web_search | Internet search (DuckDuckGo) |
| calculator | Arithmetic expression evaluator |
| datetime | Date/time and calendar arithmetic (9 actions: today, now, day_of_week, add_days, date_diff, days_until, day_of_year, end_of_month, days_until_weekday) |
| hava_durumu | Weather query by city name |
| sayfa_oku | URL content reader |
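As a concrete illustration of the datetime tool's action interface, here is a minimal, hypothetical sketch of three of its nine actions using the standard library. The real implementations live in training/generators/ and may differ; the "never today" semantics of days_until_weekday is an assumption for this sketch.

```python
from datetime import date

def datetime_tool(action: str, **args) -> str:
    """Hypothetical sketch of three of the nine `datetime` actions."""
    if action == "today":
        return date.today().isoformat()
    if action == "date_diff":
        # Signed number of days from `start` to `end` (ISO dates).
        d1 = date.fromisoformat(args["start"])
        d2 = date.fromisoformat(args["end"])
        return str((d2 - d1).days)
    if action == "days_until_weekday":
        # Days from `start` to the next occurrence of `weekday` (0 = Monday).
        start = date.fromisoformat(args["start"])
        delta = (args["weekday"] - start.weekday()) % 7
        return str(delta or 7)  # assumption: "next" never means today
    raise ValueError(f"unknown action: {action}")
```

The deterministic generators presumably enumerate such actions with randomized arguments to build training traces.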

The model is trained to emit tool calls as:

```
<tool_call>
{"name": "datetime", "arguments": {"action": "today"}}
</tool_call>
```

Tool results are fed back to the model wrapped in <tool_response>…</tool_response> inside a user turn, and the model synthesizes a final Turkish answer.
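A client-side round trip over this format can be sketched as follows; the helper names here are illustrative, not part of the repo's API:

```python
import json
import re

# Matches the first Hermes-style tool-call block in the model output.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_call(model_output: str):
    """Return (name, arguments) of the first <tool_call> block, or None."""
    m = TOOL_CALL_RE.search(model_output)
    if m is None:
        return None
    call = json.loads(m.group(1))
    return call["name"], call["arguments"]

def tool_response_turn(result: str) -> dict:
    """Wrap a tool result as the user-turn payload described above."""
    return {"role": "user",
            "content": f"<tool_response>\n{result}\n</tool_response>"}
```

The client would execute the parsed call, append the wrapped response turn, and query the model again for the final Turkish answer.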

Evaluation

The test set contains 260 Turkish prompts spread over six categories (simple tool calls, fullflow multi-step, parallel, multiple tools, irrelevance, adversarial). Scoring uses an alignment-aware harness (scripts/rescore_aligned.py) that normalizes equivalent datetime actions and accepts semantically equivalent arithmetic expressions.
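scripts/rescore_aligned.py is not reproduced here, but the normalization idea can be sketched as below; the alias table and the fallback rule are assumptions for illustration, not the harness's actual logic:

```python
# Hypothetical alias table: datetime actions the rescorer treats as equivalent.
DATETIME_ALIASES = {"now": "today"}  # assumption for illustration

def normalize_call(name, args):
    """Map a predicted tool call onto its canonical form before comparison."""
    if name == "datetime":
        action = args.get("action")
        args = dict(args, action=DATETIME_ALIASES.get(action, action))
    return name, args

def exprs_equivalent(a: str, b: str) -> bool:
    """Accept arithmetic expressions that evaluate to the same value.
    Illustrative only: a production scorer should use a proper safe
    evaluator rather than eval with a stripped namespace."""
    try:
        return eval(a, {"__builtins__": {}}) == eval(b, {"__builtins__": {}})
    except Exception:
        return a == b
```

Under this scheme, a prediction of `{"action": "now"}` scores as a match against a gold `{"action": "today"}`, and `2+3*4` matches `(2+(3*4))`.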

Overall results (Roka v0.2, April 2026)

| View | n | Full-Match | Tool-Call Acc. | Name Acc. | Arg Acc. |
|---|---|---|---|---|---|
| All test (held-out) | 260 | 73.5% | 93.1% | 71.9% | 60.6% |

Every test query was verified to be absent from both data/train.jsonl and data/val.jsonl, so the 73.5% number above is a genuinely held-out measurement. See Decontamination history below for why this is lower than an earlier, un-decontaminated run.

Per-subcategory results

| Subcategory | n | Full-Match |
|---|---|---|
| simple/web_search | 30 | 93.3% |
| simple/weather | 20 | 100.0% |
| simple/url_reader | 15 | 100.0% |
| simple/calculator | 20 | 70.0% |
| simple/datetime | 15 | 46.7% |
| fullflow | 35 | 80.0% |
| multiple | 45 | 64.4% |
| parallel | 15 | 0.0% |
| adversarial/turkish_special | 10 | 90.0% |
| adversarial/edge_case | 5 | 40.0% |
| adversarial/ambiguous | 15 | 26.7% |
| irrelevance/greeting | 15 | 100.0% |
| irrelevance/identity | 10 | 100.0% |
| irrelevance/opinion | 10 | 100.0% |

Parallel tool calls score 0% because the training mix does not contain parallel-call examples. This is a known gap, not a reproducibility failure.

Decontamination history

During preparation for this release we audited the training set and found that 44 of the 260 test queries appeared verbatim in train/val (8 in simple/datetime, 6 in simple/web_search, 7 in multiple, 19 in irrelevance/identity and irrelevance/greeting, and 4 spread over the remaining subcategories). We removed all 76 matching train examples and 6 matching val examples, and retrained on the clean split. That retraining produced the model reported above.

For transparency we also report the before-and-after numbers on the 216 test queries that were not affected by the decontamination (i.e., the genuinely held-out subset from the pre-cleanup model's perspective):

| Model | Training data | Clean-216 FM |
|---|---|---|
| v0.1 pre-clean | original (with 76 overlaps) | 78.2% |
| v0.2 (released) | decontaminated | 73.6% |

The ~4.6-point drop is informative: it is not mere contamination inflation. The removed training examples were pattern-providing (datetime variants, fullflow web-search turns, distractor augmentations of the same base queries), and losing them cost about 4.6 points of generalization even on held-out queries. The cost of honest decontamination was therefore larger than a narrow definition of "memorization gain" would predict. We report the post-decontamination number because it is the only one defensible as a held-out measurement. A future v0.3 will attempt to recover the gap by adding clean synthetic replacements for the removed examples.

Development journey (brief)

Arriving at the final model involved its share of dead ends.

  1. Baseline (Run 10): 62.7% aligned FM with an earlier pipeline, before any of the spec-005 data work.
  2. Phase A v1–v4 collapse: four consecutive training runs in which loss converged to near zero but test-set Full-Match stayed at 0/260. All of them passed loss sanity checks, so the failure was invisible from inside the run.
  3. Root cause: TRL issue #3910. The max_seq_length argument was silently renamed to max_length (default 1024) in TRL 0.20+. Every assistant turn longer than 1024 tokens (≈75% of our fullflow examples) was truncated before it contributed to the loss, so the model trained to completion on fragments, not on full tool-calling traces. Fix: pass max_length=4096 explicitly.
  4. Data iterations:
    • Removed the unit argument from all hava_durumu training examples (the test set does not supply it). simple/weather Full-Match rose from 10% to 100%.
    • Added 45 supplementary datetime examples covering day_of_year, end_of_month, and days_until_weekday, test actions that were absent from the R10 training data.
    • Those supplementary examples caused a regression on day_of_week queries: "23 Nisan hangi güne denk geliyor?" ("What day of the week does April 23 fall on?") was mis-routed to day_of_year. A targeted set of 30 day_of_week contrast examples fixed it.
  5. Final v0.1 model: 4,778 training / 509 validation examples, 795 optimizer steps. Reported 76.9% all-test, 78.2% on the clean-216 subset.
  6. v0.2 decontamination: 76 train and 6 val examples whose first user turn matched a test query were removed, producing a 4,702 / 503 split. Retraining on this split gave the 73.5% number now reported above. The 4.6-point drop on the clean-216 subset between v0.1 and v0.2 is the cost of honest decontamination (see Decontamination history).
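The truncation failure in step 3 is easy to reproduce in miniature: when an example is cut to max_length before the loss is computed, an assistant turn that sits entirely past the cutoff contributes nothing. A toy illustration (positions stand in for token indices; not TRL code):

```python
MAX_LENGTH = 1024  # TRL 0.20+ default when max_length is not passed explicitly

def visible_loss_tokens(token_ids, assistant_span, max_length=MAX_LENGTH):
    """Return the assistant-turn token positions that survive truncation."""
    start, end = assistant_span
    truncated_len = min(len(token_ids), max_length)
    return [i for i in range(start, end) if i < truncated_len]

# A ~1500-token fullflow example whose tool call occupies positions 1100-1200:
tokens = list(range(1500))

# With the 1024 default, nothing of the tool call reaches the loss.
assert visible_loss_tokens(tokens, (1100, 1200)) == []

# With max_length=4096 (the fix), the whole span survives.
assert visible_loss_tokens(tokens, (1100, 1200), max_length=4096) == list(range(1100, 1200))
```

This is why the collapse was invisible from inside the run: loss on the surviving prefix tokens still converged normally.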

Total compute used across Phase A and v0.2: ~5 A6000-hours.

Limitations

  • Multi-turn pattern lock-in. The SFT mix contains very few multi-turn tool-calling sequences. If the user starts with a chit-chat turn ("selam", "hi"), the model tends to stay in plain-chat mode on subsequent turns and skips the tool call. The provided scripts/serve_ui.py works around this by feeding only the current user message (without prior turns) into the tool-decision loop.
  • Parallel tool calls: 0%. Not trained.
  • hava_durumu has no temporal parameter. Queries like "yarın İstanbul'da hava" ("weather in İstanbul tomorrow") still produce {"city": "İstanbul"} because that is all the schema allows. The fix is a schema change plus data regeneration, not a prompt change.
  • Adversarial/ambiguous: 26.7%. The model is easily nudged off-task by ambiguous phrasing.
  • Long-passage synthesis is brittle. When sayfa_oku returns several paragraphs, the synthesized summary sometimes fragments quotes in an unnatural way.
  • Hermes parser coupling. Native OpenAI-style tool_calls parsing via llama-server requires the provided training/roka_tool_template.jinja chat template and requires the client to pass the full list of 5 tools. Passing a subset confuses llama.cpp's Hermes detector.
  • Scoring discrepancy. The in-training training/eval.py scorer disagrees slightly with the alignment-aware rescorer. Only the rescored numbers are reported above. Resolving the discrepancy is open work.
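For the hava_durumu limitation above, the required schema change might look like the following hypothetical sketch (an optional date field; the released model was not trained on any such schema, and the actual v0.3 design is undecided):

```python
# Hypothetical revised schema; the released v0.2 model was NOT trained on this.
HAVA_DURUMU_SCHEMA = {
    "name": "hava_durumu",
    "description": "Weather query by city name",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. İstanbul"},
            "date": {
                "type": "string",
                "description": "Optional ISO date (e.g. 2026-04-01); "
                               "omit for current weather",
            },
        },
        "required": ["city"],  # date stays optional for backward compatibility
    },
}
```

A schema change like this would also require regenerating the hava_durumu training examples so the model learns when to populate the new field.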

Training data

  • 4,778 train / 509 validation examples before decontamination (4,702 / 503 after removal of test-set overlaps), Hermes-format chat turns.
  • ~72% Turkish, ~13% English, ~15% short/symbolic. The English fraction is Glaive-sourced synthetic tool-calling data retained for multi-tool pattern coverage.
  • Deterministic generators for calculator, datetime, hava_durumu (in training/generators/).
  • Real DuckDuckGo search results cached in data/ddg_cache.json and used to construct web_search fullflow examples.
  • PII scan: only two flagged matches in user-facing content, both false positives (embedded WSJ article IDs). No email addresses, Turkish ID numbers, credit cards, or IP addresses found.
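A PII scan of the kind described above can be sketched with a few regexes; these patterns are illustrative and looser than what a real scanner would use (e.g. it would validate the T.C. ID checksum rather than match any 11-digit run):

```python
import re

# Illustrative PII patterns, matching the categories checked above.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "tr_id": re.compile(r"\b[1-9]\d{10}\b"),          # Turkish national ID: 11 digits
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def scan_pii(text: str) -> dict:
    """Return {pattern name: matches} for every pattern that fires on `text`."""
    hits = {name: pat.findall(text) for name, pat in PII_PATTERNS.items()}
    return {name: found for name, found in hits.items() if found}
```

False positives like embedded article IDs are exactly what loose digit patterns produce, which is why a manual review pass over flagged matches is still needed.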

Contamination verification

The released v0.2 model is trained on a split where no test query appears verbatim in either train or validation. The decontamination script (scripts/decontaminate.py) normalizes whitespace and case before matching. The pre-decontamination overlap distribution (all removed in v0.2) was:

| Subcategory | Overlap (removed) |
|---|---|
| irrelevance/identity | 8 / 10 |
| irrelevance/greeting | 11 / 15 |
| simple/datetime | 8 / 15 |
| simple/web_search | 6 / 30 |
| multiple | 7 / 45 |
| adversarial/turkish_special | 1 / 10 |
| irrelevance/opinion | 1 / 10 |
| simple/weather | 1 / 20 |
| fullflow | 1 / 35 |

Because augmentation variants of each base query (masked/distractor versions) shared the same user turn, removing 44 unique queries deleted 76 train examples and 6 val examples in total. The remaining 4,702 / 503 split is what v0.2 was trained on.

This decontamination is exact-string, not fuzzy. Near-duplicates (paraphrases that return the same tool call) are still present. Closing the paraphrase loophole requires a more elaborate embedding-based deduplication pass, which is left for v0.3.
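The exact-string matching described above reduces to a few lines; this sketch uses illustrative function names, not scripts/decontaminate.py's actual API:

```python
def normalize(query: str) -> str:
    """Collapse whitespace and case before matching, as described above."""
    return " ".join(query.lower().split())

def decontaminate(train_examples, test_queries):
    """Split examples into (kept, removed) by first-user-turn exact match."""
    banned = {normalize(q) for q in test_queries}
    kept, removed = [], []
    for ex in train_examples:
        first_user = next(
            (m["content"] for m in ex["messages"] if m["role"] == "user"), "")
        (removed if normalize(first_user) in banned else kept).append(ex)
    return kept, removed
```

Because matching keys on the normalized first user turn, augmentation variants that share that turn are all removed together, which is how 44 unique queries deleted 76 train examples.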

Repository layout

src/                    Inference clients (transformers & llama-server)
training/
  tools.py              Tool schemas + training system prompt
  train.py              TRL SFTTrainer entry point
  eval.py               Test-set scorer (in-training)
  roka_tool_template.jinja   llama-server chat template with Hermes detection hook
  generators/           Deterministic data generators per tool
scripts/
  work_pipeline.py      End-to-end pod orchestration
  pod_run_and_dump.py   On-pod training → prediction dump → HF upload
  rescore_aligned.py    Alignment-aware rescorer (authoritative numbers)
  serve_ui.py           FastAPI chat UI wrapping the agent
data/
  train.jsonl, val.jsonl, test_set.json
specs/005-post-run10-75/   Spec, plan, and task list for this iteration

GitHub: https://github.com/bilersan/roka

Reproducibility

  1. Clone the repo and install requirements:
    pip install -r requirements.txt
    
  2. Regenerate the training set (deterministic):
    python -m training.build_dataset
    
  3. Train (RunPod-hosted, ~1 GPU-hour on an A6000):
    python -m scripts.work_pipeline
    
  4. Rescore predictions with the alignment-aware harness:
    python -m scripts.rescore_aligned --predictions .work/artifacts/predictions/<run_id>.json
    

The training recipe is fully specified in training/config.yaml. The only hyperparameter that is unusually specific is max_length: 4096 in training/train.py โ€” removing it reproduces the Phase A v1โ€“v4 collapse described above.

Intended use and out-of-scope use

Intended: Turkish-language tool-calling agents for well-defined tools, research on small-model function calling, educational demonstrations of the SFT pipeline.

Out of scope:

  • Safety-critical applications. The model has not been evaluated for harmful-content refusal beyond what Kara-Kumru inherits from its base.
  • Parallel / agentic planning over large tool catalogs.
  • Multi-turn conversational agents that need to preserve long prior context.
  • Any application that requires the model to use tools not present in the training schema.

License

This repository and the released weights are distributed under the Apache License 2.0, inherited from both AlicanKiraz0/Kara-Kumru-v1.0-2B and its upstream base vngrs-ai/Kumru-2B. See LICENSE.

Citation

If you use Roka in research, please cite both the base model and this work:

@misc{roka_2026,
  title  = {Roka: Turkish Tool-Calling Fine-Tune of Kara-Kumru 2B},
  author = {Bilik, Ersan},
  year   = {2026},
  url    = {https://huggingface.co/ersanbil/roka}
}

@misc{karakumru_2025,
  title  = {Kara-Kumru-v1.0-2B},
  author = {Kiraz, Alican},
  year   = {2025},
  url    = {https://huggingface.co/AlicanKiraz0/Kara-Kumru-v1.0-2B}
}

Acknowledgements

  • vngrs-ai for the open Turkish base model Kumru-2B.
  • Alican Kiraz for the Turkish-conversational fine-tune Kara-Kumru-v1.0-2B.
  • Hugging Face TRL / Unsloth for the training stack.
  • Glaive-AI function-calling dataset for the English portion of the multi-tool synthetic mix.

Contact and feedback

Issues and pull requests are welcome on the GitHub mirror. This is a research preview โ€” please file bugs for any behavior that contradicts the documented limitations above; those are the interesting cases.
