Roka: Turkish Tool-Calling Fine-Tune of Kara-Kumru 2B

Roka is a supervised fine-tune of AlicanKiraz0/Kara-Kumru-v1.0-2B that teaches a 2B-parameter Turkish language model to use five tools (web search, calculator, date/time, weather, URL reader) via a Hermes-style <tool_call>…</tool_call> output format.

This is a v0.2 research preview, released for reproducibility and community feedback. It is not a production-grade tool-calling agent and has known weaknesses (see Limitations).

The v0.2 training set is fully decontaminated against the evaluation set: no test-set query appears verbatim in train or validation.

Model at a glance

| Field | Value |
|---|---|
| Base model | AlicanKiraz0/Kara-Kumru-v1.0-2B (Mistral architecture, Llama-3 chat template, Turkish-pretrained) |
| Upstream base | vngrs-ai/Kumru-2B |
| Parameters | ~2.15B |
| Fine-tuning | Full fine-tuning, 3 epochs, LR 5e-5 linear, bf16, TRL SFTTrainer |
| Hardware | Single NVIDIA A6000 (~65 min total, ~22 min per epoch) |
| Languages | Primarily Turkish; ~13% of the training mix is English (Glaive-sourced synthetic tool-calling examples) |
| License | Apache 2.0 (inherited from base chain) |

Tool set

| Tool | Description |
|---|---|
| web_search | Internet search (DuckDuckGo) |
| calculator | Arithmetic expression evaluator |
| datetime | Date/time and calendar arithmetic (9 actions: today, now, day_of_week, add_days, date_diff, days_until, day_of_year, end_of_month, days_until_weekday) |
| hava_durumu | Weather query by city name |
| sayfa_oku | URL content reader |
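As a concrete illustration of the datetime tool's action interface, here is a minimal, hypothetical sketch of three of its nine actions using the standard library. The real implementations live in training/generators/ and may differ; the "never today" semantics of days_until_weekday is an assumption for this sketch.

```python
from datetime import date

def datetime_tool(action: str, **args) -> str:
    """Hypothetical sketch of three of the nine `datetime` actions."""
    if action == "today":
        return date.today().isoformat()
    if action == "date_diff":
        # Signed number of days from `start` to `end` (ISO dates).
        d1 = date.fromisoformat(args["start"])
        d2 = date.fromisoformat(args["end"])
        return str((d2 - d1).days)
    if action == "days_until_weekday":
        # Days from `start` to the next occurrence of `weekday` (0 = Monday).
        start = date.fromisoformat(args["start"])
        delta = (args["weekday"] - start.weekday()) % 7
        return str(delta or 7)  # assumption: "next" never means today
    raise ValueError(f"unknown action: {action}")
```

The deterministic generators presumably enumerate such actions with randomized arguments to build training traces.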

The model is trained to emit tool calls as:

```
<tool_call>
{"name": "datetime", "arguments": {"action": "today"}}
</tool_call>
```

Tool results are fed back to the model wrapped in <tool_response>…</tool_response> inside a user turn, and the model synthesizes a final Turkish answer.
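A client-side round trip over this format can be sketched as follows; the helper names here are illustrative, not part of the repo's API:

```python
import json
import re

# Matches the first Hermes-style tool-call block in the model output.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_call(model_output: str):
    """Return (name, arguments) of the first <tool_call> block, or None."""
    m = TOOL_CALL_RE.search(model_output)
    if m is None:
        return None
    call = json.loads(m.group(1))
    return call["name"], call["arguments"]

def tool_response_turn(result: str) -> dict:
    """Wrap a tool result as the user-turn payload described above."""
    return {"role": "user",
            "content": f"<tool_response>\n{result}\n</tool_response>"}
```

The client would execute the parsed call, append the wrapped response turn, and query the model again for the final Turkish answer.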

Evaluation

The test set contains 260 Turkish prompts spread over six categories (simple tool calls, fullflow multi-step, parallel, multiple tools, irrelevance, adversarial). Scoring uses an alignment-aware harness (scripts/rescore_aligned.py) that normalizes equivalent datetime actions and accepts semantically equivalent arithmetic expressions.
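scripts/rescore_aligned.py is not reproduced here, but the normalization idea can be sketched as below; the alias table and the fallback rule are assumptions for illustration, not the harness's actual logic:

```python
# Hypothetical alias table: datetime actions the rescorer treats as equivalent.
DATETIME_ALIASES = {"now": "today"}  # assumption for illustration

def normalize_call(name, args):
    """Map a predicted tool call onto its canonical form before comparison."""
    if name == "datetime":
        action = args.get("action")
        args = dict(args, action=DATETIME_ALIASES.get(action, action))
    return name, args

def exprs_equivalent(a: str, b: str) -> bool:
    """Accept arithmetic expressions that evaluate to the same value.
    Illustrative only: a production scorer should use a proper safe
    evaluator rather than eval with a stripped namespace."""
    try:
        return eval(a, {"__builtins__": {}}) == eval(b, {"__builtins__": {}})
    except Exception:
        return a == b
```

Under this scheme, a prediction of `{"action": "now"}` scores as a match against a gold `{"action": "today"}`, and `2+3*4` matches `(2+(3*4))`.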

Overall results (Roka v0.2, April 2026)

| View | n | Full-Match | Tool-Call Acc. | Name Acc. | Arg Acc. |
|---|---|---|---|---|---|
| All test (held-out) | 260 | 73.5% | 93.1% | 71.9% | 60.6% |

Every test query was verified to be absent from both data/train.jsonl and data/val.jsonl, so the 73.5% number above is a genuinely held-out measurement. See Decontamination history below for why this is lower than an earlier, un-decontaminated run.

Per-subcategory results

| Subcategory | n | Full-Match |
|---|---|---|
| simple/web_search | 30 | 93.3% |
| simple/weather | 20 | 100.0% |
| simple/url_reader | 15 | 100.0% |
| simple/calculator | 20 | 70.0% |
| simple/datetime | 15 | 46.7% |
| fullflow | 35 | 80.0% |
| multiple | 45 | 64.4% |
| parallel | 15 | 0.0% |
| adversarial/turkish_special | 10 | 90.0% |
| adversarial/edge_case | 5 | 40.0% |
| adversarial/ambiguous | 15 | 26.7% |
| irrelevance/greeting | 15 | 100.0% |
| irrelevance/identity | 10 | 100.0% |
| irrelevance/opinion | 10 | 100.0% |

Parallel tool calls score 0% because the training mix does not contain parallel-call examples. This is a known gap, not a reproducibility failure.

Decontamination history

During preparation for this release we audited the training set and found that 44 of the 260 test queries appeared verbatim in train/val (8 in simple/datetime, 6 in simple/web_search, 7 in multiple, 19 in irrelevance/identity and irrelevance/greeting, and 4 spread over the remaining subcategories). We removed all 76 matching train examples and 6 matching val examples, and retrained on the clean split. That retraining produced the model reported above.

For transparency we also report the before-and-after numbers on the 216 test queries that were not affected by the decontamination (i.e., the genuinely held-out subset from the pre-cleanup model's perspective):

| Model | Training data | Clean-216 FM |
|---|---|---|
| v0.1 pre-clean | original (with 76 overlaps) | 78.2% |
| v0.2 (released) | decontaminated | 73.6% |

The ~4.6-point drop is informative: it is not mere contamination inflation. The removed training examples were pattern-providing (datetime variants, fullflow web-search turns, distractor augmentations of the same base queries), and losing them cost about 4.6 points of generalization even on held-out queries. The cost of honest decontamination was therefore larger than a narrow definition of "memorization gain" would predict. We report the post-decontamination number because it is the only one defensible as a held-out measurement. A future v0.3 will attempt to recover the gap by adding clean synthetic replacements for the removed examples.

Development journey (brief)

Arriving at the final model involved its share of dead ends.

  1. Baseline (Run 10): 62.7% aligned FM with an earlier pipeline, before any of the spec-005 data work.
  2. Phase A v1–v4 collapse: four consecutive training runs in which loss converged to near zero but test-set Full-Match stayed at 0/260. All of them passed loss sanity checks, so the failure was invisible from inside the run.
  3. Root cause: TRL issue #3910. The max_seq_length argument was silently renamed to max_length (default 1024) in TRL 0.20+. Every assistant turn longer than 1024 tokens (≈75% of our fullflow examples) was truncated before it contributed to the loss, so the model trained to completion on fragments, not on full tool-calling traces. Fix: pass max_length=4096 explicitly.
  4. Data iterations:
    • Removed the unit argument from all hava_durumu training examples (the test set does not supply it). simple/weather Full-Match rose from 10% to 100%.
    • Added 45 supplementary datetime examples covering day_of_year, end_of_month, and days_until_weekday, test actions that were absent from the R10 training data.
    • Those supplementary examples caused a regression on day_of_week queries: "23 Nisan hangi güne denk geliyor?" ("What day of the week does April 23 fall on?") was mis-routed to day_of_year. A targeted set of 30 day_of_week contrast examples fixed it.
  5. Final v0.1 model: 4,778 training / 509 validation examples, 795 optimizer steps. Reported 76.9% all-test, 78.2% on the clean-216 subset.
  6. v0.2 decontamination: 76 train and 6 val examples whose first user turn matched a test query were removed, producing a 4,702 / 503 split. Retraining on this split gave the 73.5% number now reported above. The 4.6-point drop on the clean-216 subset between v0.1 and v0.2 is the cost of honest decontamination (see Decontamination history).
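The truncation failure in step 3 is easy to reproduce in miniature: when an example is cut to max_length before the loss is computed, an assistant turn that sits entirely past the cutoff contributes nothing. A toy illustration (positions stand in for token indices; not TRL code):

```python
MAX_LENGTH = 1024  # TRL 0.20+ default when max_length is not passed explicitly

def visible_loss_tokens(token_ids, assistant_span, max_length=MAX_LENGTH):
    """Return the assistant-turn token positions that survive truncation."""
    start, end = assistant_span
    truncated_len = min(len(token_ids), max_length)
    return [i for i in range(start, end) if i < truncated_len]

# A ~1500-token fullflow example whose tool call occupies positions 1100-1200:
tokens = list(range(1500))

# With the 1024 default, nothing of the tool call reaches the loss.
assert visible_loss_tokens(tokens, (1100, 1200)) == []

# With max_length=4096 (the fix), the whole span survives.
assert visible_loss_tokens(tokens, (1100, 1200), max_length=4096) == list(range(1100, 1200))
```

This is why the collapse was invisible from inside the run: loss on the surviving prefix tokens still converged normally.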

Total compute used across Phase A and v0.2: ~5 A6000-hours.

Limitations

  • Multi-turn pattern lock-in. The SFT mix contains very few multi-turn tool-calling sequences. If the user starts with a chit-chat turn ("selam", "hi"), the model tends to stay in plain-chat mode on subsequent turns and skips the tool call. The provided scripts/serve_ui.py works around this by feeding only the current user message (without prior turns) into the tool-decision loop.
  • Parallel tool calls: 0%. Not trained.
  • hava_durumu has no temporal parameter. Queries like "yarın İstanbul'da hava" ("weather in İstanbul tomorrow") still produce {"city": "İstanbul"} because that is all the schema allows. The fix is a schema change plus data regeneration, not a prompt change.
  • Adversarial/ambiguous: 26.7%. The model is easily nudged off-task by ambiguous phrasing.
  • Long-passage synthesis is brittle. When sayfa_oku returns several paragraphs, the synthesized summary sometimes fragments quotes in an unnatural way.
  • Hermes parser coupling. Native OpenAI-style tool_calls parsing via llama-server requires the provided training/roka_tool_template.jinja chat template and requires the client to pass the full list of 5 tools. Passing a subset confuses llama.cpp's Hermes detector.
  • Scoring discrepancy. The in-training training/eval.py scorer disagrees slightly with the alignment-aware rescorer. Only the rescored numbers are reported above. Resolving the discrepancy is open work.
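For the hava_durumu limitation above, the required schema change might look like the following hypothetical sketch (an optional date field; the released model was not trained on any such schema, and the actual v0.3 design is undecided):

```python
# Hypothetical revised schema; the released v0.2 model was NOT trained on this.
HAVA_DURUMU_SCHEMA = {
    "name": "hava_durumu",
    "description": "Weather query by city name",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. İstanbul"},
            "date": {
                "type": "string",
                "description": "Optional ISO date (e.g. 2026-04-01); "
                               "omit for current weather",
            },
        },
        "required": ["city"],  # date stays optional for backward compatibility
    },
}
```

A schema change like this would also require regenerating the hava_durumu training examples so the model learns when to populate the new field.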

Training data

  • 4,778 train / 509 validation examples before decontamination (4,702 / 503 after removal of test-set overlaps), Hermes-format chat turns.
  • ~72% Turkish, ~13% English, ~15% short/symbolic. The English fraction is Glaive-sourced synthetic tool-calling data retained for multi-tool pattern coverage.
  • Deterministic generators for calculator, datetime, hava_durumu (in training/generators/).
  • Real DuckDuckGo search results cached in data/ddg_cache.json and used to construct web_search fullflow examples.
  • PII scan: only two flagged matches in user-facing content, both false positives (embedded WSJ article IDs). No email addresses, Turkish ID numbers, credit cards, or IP addresses found.
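A PII scan of the kind described above can be sketched with a few regexes; these patterns are illustrative and looser than what a real scanner would use (e.g. it would validate the T.C. ID checksum rather than match any 11-digit run):

```python
import re

# Illustrative PII patterns, matching the categories checked above.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "tr_id": re.compile(r"\b[1-9]\d{10}\b"),          # Turkish national ID: 11 digits
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def scan_pii(text: str) -> dict:
    """Return {pattern name: matches} for every pattern that fires on `text`."""
    hits = {name: pat.findall(text) for name, pat in PII_PATTERNS.items()}
    return {name: found for name, found in hits.items() if found}
```

False positives like embedded article IDs are exactly what loose digit patterns produce, which is why a manual review pass over flagged matches is still needed.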

Contamination verification

The released v0.2 model is trained on a split where no test query appears verbatim in either train or validation. The decontamination script (scripts/decontaminate.py) normalizes whitespace and case before matching. The pre-decontamination overlap distribution (all removed in v0.2) was:

| Subcategory | Overlap (removed) |
|---|---|
| irrelevance/identity | 8 / 10 |
| irrelevance/greeting | 11 / 15 |
| simple/datetime | 8 / 15 |
| simple/web_search | 6 / 30 |
| multiple | 7 / 45 |
| adversarial/turkish_special | 1 / 10 |
| irrelevance/opinion | 1 / 10 |
| simple/weather | 1 / 20 |
| fullflow | 1 / 35 |

Because augmentation variants of each base query (masked/distractor versions) shared the same user turn, removing 44 unique queries deleted 76 train examples and 6 val examples in total. The remaining 4,702 / 503 split is what v0.2 was trained on.

This decontamination is exact-string, not fuzzy. Near-duplicates (paraphrases that return the same tool call) are still present. Closing the paraphrase loophole requires a more elaborate embedding-based deduplication pass, which is left for v0.3.
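The exact-string matching described above reduces to a few lines; this sketch uses illustrative function names, not scripts/decontaminate.py's actual API:

```python
def normalize(query: str) -> str:
    """Collapse whitespace and case before matching, as described above."""
    return " ".join(query.lower().split())

def decontaminate(train_examples, test_queries):
    """Split examples into (kept, removed) by first-user-turn exact match."""
    banned = {normalize(q) for q in test_queries}
    kept, removed = [], []
    for ex in train_examples:
        first_user = next(
            (m["content"] for m in ex["messages"] if m["role"] == "user"), "")
        (removed if normalize(first_user) in banned else kept).append(ex)
    return kept, removed
```

Because matching keys on the normalized first user turn, augmentation variants that share that turn are all removed together, which is how 44 unique queries deleted 76 train examples.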

Repository layout

src/                    Inference clients (transformers & llama-server)
training/
  tools.py              Tool schemas + training system prompt
  train.py              TRL SFTTrainer entry point
  eval.py               Test-set scorer (in-training)
  roka_tool_template.jinja   llama-server chat template with Hermes detection hook
  generators/           Deterministic data generators per tool
scripts/
  work_pipeline.py      End-to-end pod orchestration
  pod_run_and_dump.py   On-pod training → prediction dump → HF upload
  rescore_aligned.py    Alignment-aware rescorer (authoritative numbers)
  serve_ui.py           FastAPI chat UI wrapping the agent
data/
  train.jsonl, val.jsonl, test_set.json
specs/005-post-run10-75/   Spec, plan, and task list for this iteration

GitHub: https://github.com/bilersan/roka

Reproducibility

  1. Clone the repo and install requirements:
    pip install -r requirements.txt
    
  2. Regenerate the training set (deterministic):
    python -m training.build_dataset
    
  3. Train (RunPod-hosted, ~1 GPU-hour on an A6000):
    python -m scripts.work_pipeline
    
  4. Rescore predictions with the alignment-aware harness:
    python -m scripts.rescore_aligned --predictions .work/artifacts/predictions/<run_id>.json
    

The training recipe is fully specified in training/config.yaml. The only hyperparameter that is unusually specific is max_length: 4096 in training/train.py โ€” removing it reproduces the Phase A v1โ€“v4 collapse described above.

Intended use and out-of-scope use

Intended: Turkish-language tool-calling agents for well-defined tools, research on small-model function calling, educational demonstrations of the SFT pipeline.

Out of scope:

  • Safety-critical applications. The model has not been evaluated for harmful-content refusal beyond what Kara-Kumru inherits from its base.
  • Parallel / agentic planning over large tool catalogs.
  • Multi-turn conversational agents that need to preserve long prior context.
  • Any application that requires the model to use tools not present in the training schema.

License

This repository and the released weights are distributed under the Apache License 2.0, inherited from both AlicanKiraz0/Kara-Kumru-v1.0-2B and its upstream base vngrs-ai/Kumru-2B. See LICENSE.

Citation

If you use Roka in research, please cite both the base model and this work:

@misc{roka_2026,
  title  = {Roka: Turkish Tool-Calling Fine-Tune of Kara-Kumru 2B},
  author = {Bilik, Ersan},
  year   = {2026},
  url    = {https://huggingface.co/ersanbil/roka}
}

@misc{karakumru_2025,
  title  = {Kara-Kumru-v1.0-2B},
  author = {Kiraz, Alican},
  year   = {2025},
  url    = {https://huggingface.co/AlicanKiraz0/Kara-Kumru-v1.0-2B}
}

Acknowledgements

  • vngrs-ai for the open Turkish base model Kumru-2B.
  • Alican Kiraz for the Turkish-conversational fine-tune Kara-Kumru-v1.0-2B.
  • Hugging Face TRL / Unsloth for the training stack.
  • Glaive-AI function-calling dataset for the English portion of the multi-tool synthetic mix.

Contact and feedback

Issues and pull requests are welcome on the GitHub mirror. This is a research preview โ€” please file bugs for any behavior that contradicts the documented limitations above; those are the interesting cases.
