Turkish-Gemma-4b-T1-Scout
Turkish-Gemma-4b-T1-Scout is a Turkish web-search agent model designed for multi-step information seeking, evidence-grounded answer generation, and tool-augmented reasoning. The model is built on top of an open base LLM and adapted using synthetic trajectory data through supervised fine-tuning (SFT) and GRPO-based reinforcement learning for better tool use, search decisions, and final answer synthesis.
The overall goal of the project is to reduce hallucinations on Turkish questions that require current, rare, or multi-step factual lookup by combining:
- synthetic trajectory generation,
- supervised fine-tuning (SFT),
- reinforcement learning with GRPO,
- explicit tool-use formatting,
- and benchmark-driven evaluation on Turkish web-search tasks.
This folder contains the model weights, tokenizer, generation config, a custom chat template, and a custom tool parser for vLLM. It does not include the full browsing runtime, benchmark files, search backend, or training pipeline code.
What Makes This Model Different
Compared with a standard instruction-tuned checkpoint, this model was trained around an agentic interaction format. The local files in this repository include:
- `cosmos_chat_template.jinja`: custom chat template with tool instructions
- `cosmos_gemma_tool_parser.py`: custom vLLM tool parser for extracting `<tool_call>...</tool_call>` blocks during agent execution
This model is intended for Turkish queries that involve:
- up-to-date information needs,
- rare or hard-to-memorize facts,
- multi-step search and evidence gathering,
- synthesizing a final answer after a limited number of tool calls.
Unlike a standard chat model, this checkpoint is trained for reasoning + acting behavior. During agent execution, it can generate intermediate tool calls and then produce a final natural language answer.
The template also supports `<think>` and `<tool_call>` tags, so deployment systems should be prepared to handle structured tool-use output.
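Because the model may emit `<think>` reasoning blocks before its visible answer, a deployment layer typically strips them before showing output to end users. A minimal sketch (the helper name is illustrative, not part of the shipped parser):

```python
import re

# Matches a <think>...</think> block, including newlines inside it.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_think(text: str) -> str:
    """Remove <think>...</think> reasoning blocks from model output."""
    return THINK_RE.sub("", text).strip()

raw = "<think>The user asks for the capital.</think>Ankara is the capital of Turkey."
print(strip_think(raw))  # -> Ankara is the capital of Turkey.
```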
Highlights
This model is focused specifically on the Turkish language and Turkish web ecosystem, and was trained with a multi-stage SFT + GRPO pipeline using trajectory-style supervision rather than plain QA alone. It was evaluated on a human-written Turkish Web Search Benchmark designed to measure agentic search behavior.
Training overview
Training was carried out in multiple stages. First, around 300,000 synthetic reasoning prompts were generated to improve structured reasoning behavior before search specialization. Then, around 100,000 synthetic web-search prompts were created, of which 1,463 were retained after filtering for cases that genuinely required web search. The final training pipeline consisted of reasoning-oriented SFT, web-search SFT, and GRPO-based reinforcement learning.
Evaluation benchmark
The project introduces a Turkish Web Search Benchmark consisting of 70 human-written questions, grouped by difficulty:
- Easy: generally solvable with a single search
- Medium: generally requires multiple search steps
- Hard: requires multi-step retrieval and evidence synthesis
Representative examples include:
- Easy: *1 Ocak 2026 tarihinde İstanbul belediye başkanı kim?* ("Who is the mayor of Istanbul as of January 1, 2026?")
- Medium: *Dağ 2 filminde, üsteğmen rütbesinde bir oyuncu vardı. Bu oyuncu, başka bir Amerikalı oyuncu ile tip olarak çok benziyordu. Bu özelliği taşıyan Dağ 2 filmindeki oyuncu kimdi?* ("In the film Dağ 2 there was an actor playing a first lieutenant. This actor looked very much like a certain American actor. Who was the actor in Dağ 2 with this trait?")
Reported results
The report compares base models and tool-augmented models.
| Model | Correct | Wrong | Not Attempted | Average tool calls |
|---|---|---|---|---|
| Turkish-Gemma-27b-T1-Scout | 87.14% (61/70) | 12.86% | 0.00% | 10.07 |
| Turkish-Gemma-4b-T1-Scout | 71.43% (50/70) | 22.86% | 5.71% | 4.44 |
| Gemma-3-27B | 10.00% (7/70) | 90.00% | 0.00% | — |
| Gemma-3-4B | 5.71% (4/70) | 94.29% | 0.00% | — |
Intended use
This model is suitable for Turkish web-search assistants, tool-using research agents, retrieval-augmented conversational systems, and academic or open-source agent experiments. It is not a complete production system by itself: reproducing the intended behavior still requires an external search or browsing backend, tool schemas exposed to the model, execution logic for emitted tool calls, post-processing of tool results, and safety or citation controls around final answers.
Out-of-scope use
This model is not intended for fully autonomous decision-making systems or for professional legal, medical, or financial advice. It should also not be treated as standalone browsing software or as a guaranteed source of truth without external verification. Human oversight, logging, and additional safeguards are recommended in high-stakes settings.
Agent behavior and expected runtime format
The training setup encourages outputs of the following general form:
- Tool step: `<think>...</think><tool_call>{...}</tool_call>`
- Final step: `<think>...</think>` followed by a natural-language answer
As a result, this model works best inside an agent runtime. A typical execution loop is:
- pass the user query to the model,
- parse the produced tool call,
- execute the tool,
- append the observation back into context,
- continue until the model gives a final answer or the tool budget is exhausted.
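The loop above can be sketched as follows. `call_model` and `execute_tool` are placeholders for the actual inference call (e.g. a vLLM endpoint) and the external tool backend, neither of which ships with this repository; the message roles and the budget rule are assumptions about a typical runtime, not the exact training harness:

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def run_agent(query, call_model, execute_tool, max_tool_calls=8):
    """Minimal agent loop: generate, parse a tool call, execute, repeat."""
    messages = [{"role": "user", "content": query}]
    for _ in range(max_tool_calls):
        output = call_model(messages)
        match = TOOL_CALL_RE.search(output)
        if match is None:
            return output  # no tool call: treat as the final answer
        call = json.loads(match.group(1))["function"]
        observation = execute_tool(call["name"], call["arguments"])
        # Feed the tool result back into context and continue the loop.
        messages.append({"role": "assistant", "content": output})
        messages.append({"role": "tool", "content": json.dumps(observation)})
    return call_model(messages)  # budget exhausted: ask for a final answer
```

With a stub model that first emits a `search` call and then an answer, the loop terminates after one tool execution.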
Tooling used in training
The agent was trained with a compact function-calling interface centered on web retrieval and evidence extraction. The main functions were:
- `search(query, max_results, recency_days, time_range)`: general web search
- `browse(url, max_chars, js_render, timeout_s)`: opening pages and extracting readable text
- `find_in_page(url, patterns, mode, within_chars, context_chars, case_sensitive, use_regex, max_results)`: locating evidence inside crawled page text
- `paper_search(query, limit, year_from, year_to, fields, venue, sort)`: academic search via Semantic Scholar
This interface was used during both trajectory generation and agent training. In deployment, equivalent external tools or APIs are required for full agent functionality.
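As a sketch of how this interface might be exposed to the model at deployment time, the functions above can be declared as OpenAI-style function schemas. The parameter types below are assumptions inferred from the argument names, not values taken from the original training configuration, and the remaining optional parameters of `find_in_page` and `paper_search` are omitted for brevity:

```python
# Hypothetical tool schemas; types are inferred, not from the training config.
TOOLS = [
    {"type": "function", "function": {
        "name": "search",
        "description": "General web search.",
        "parameters": {"type": "object", "properties": {
            "query": {"type": "string"},
            "max_results": {"type": "integer"},
            "recency_days": {"type": "integer"},
            "time_range": {"type": "string"},
        }, "required": ["query"]}}},
    {"type": "function", "function": {
        "name": "browse",
        "description": "Open a page and extract readable text.",
        "parameters": {"type": "object", "properties": {
            "url": {"type": "string"},
            "max_chars": {"type": "integer"},
            "js_render": {"type": "boolean"},
            "timeout_s": {"type": "number"},
        }, "required": ["url"]}}},
    {"type": "function", "function": {
        "name": "find_in_page",
        "description": "Locate evidence inside crawled page text.",
        "parameters": {"type": "object", "properties": {
            "url": {"type": "string"},
            "patterns": {"type": "array", "items": {"type": "string"}},
        }, "required": ["url", "patterns"]}}},
    {"type": "function", "function": {
        "name": "paper_search",
        "description": "Academic search via Semantic Scholar.",
        "parameters": {"type": "object", "properties": {
            "query": {"type": "string"},
            "limit": {"type": "integer"},
        }, "required": ["query"]}}},
]
```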
Tool-Use Deployment Notes
For agentic use, you should keep the custom formatting artifacts in the same repository:
- `cosmos_chat_template.jinja`
- `cosmos_gemma_tool_parser.py`
The parser expects tool calls in blocks like:
<tool_call>
{"type":"function","function":{"name":"browse","arguments":{"url":"..."}}}
</tool_call>
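The shipped `cosmos_gemma_tool_parser.py` plugs into vLLM's tool-parser plugin API; a minimal stand-in for its extraction step, showing only the block format, might look like this (the function name is illustrative):

```python
import json
import re

def parse_tool_calls(text):
    """Extract every <tool_call>{...}</tool_call> block as a parsed dict."""
    blocks = re.findall(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.DOTALL)
    return [json.loads(b) for b in blocks]

calls = parse_tool_calls(
    '<tool_call>{"type":"function","function":{"name":"search",'
    '"arguments":{"query":"Istanbul mayor 2026"}}}</tool_call>'
)
print(calls[0]["function"]["name"])  # -> search
```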
Tips
- Sampling: use `temperature=0.6`, `top_p=0.95`, `top_k=64`. Do not use greedy decoding, as it can lead to performance degradation and endless repetitions.
- Complex tasks: increase `max_new_tokens`. You can raise the `repetition_penalty` and also adjust the `presence_penalty` parameter (between 0 and 2) to reduce endless repetitions. However, higher values may occasionally cause language mixing and a slight drop in model performance.
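For a vLLM-style deployment, the recommendations above map onto sampling parameters roughly as follows. The penalty values here are illustrative starting points, not values prescribed by the model card:

```python
# Decoding settings from the tips above; penalty values are illustrative
# starting points, not values specified by the authors.
sampling_kwargs = {
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 64,
    "max_tokens": 4096,          # raise for complex multi-step tasks
    "repetition_penalty": 1.05,  # illustrative; tune to curb repetition
    "presence_penalty": 0.5,     # keep within the [0, 2] range noted above
}
# e.g. vllm.SamplingParams(**sampling_kwargs)
```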
Limitations
- This repository does not include the external browsing/search runtime.
- Performance depends heavily on the quality of the external search/runtime stack.
- The benchmark is relatively small (70 questions) and limited in domain coverage, so it may not fully represent all real-world domains.
- The model can still hallucinate, especially when tools are unavailable, misused, or budget-limited.
- Reported results reflect the project's internal evaluation setup and may not transfer unchanged to every real-world deployment.
- Since the model is derived from Gemma 3, downstream use should remain consistent with the base model's license and usage terms.
Acknowledgments
Thanks to Hugging Face for hosting models on S3 storage.
Compute resources were provided by the Barcelona Supercomputing Center.
Contact
COSMOS AI Research Group – Yildiz Technical University, Computer Engineering Department
🔗 https://cosmos.yildiz.edu.tr/
✉️ cosmos@yildiz.edu.tr
License: gemma3