Turkish-Gemma-4b-T1-Scout
Turkish-Gemma-4b-T1-Scout is a Turkish web-search agent model designed for multi-step information seeking, evidence-grounded answer generation, and tool-augmented reasoning. The model is built on top of an open base LLM and adapted using synthetic trajectory data through supervised fine-tuning (SFT) and GRPO-based reinforcement learning for better tool use, search decisions, and final answer synthesis.
The overall goal of the project is to reduce hallucinations on Turkish questions that require current, rare, or multi-step factual lookup by combining:
- synthetic trajectory generation,
- supervised fine-tuning (SFT),
- reinforcement learning with GRPO,
- explicit tool-use formatting,
- and benchmark-driven evaluation on Turkish web-search tasks.
This folder contains the model weights, tokenizer, generation config, a custom chat template, and a custom tool parser for vLLM. It does not include the full browsing runtime, benchmark files, search backend, or training pipeline code.
What Makes This Model Different
Compared with a standard instruction-tuned checkpoint, this model was trained around an agentic interaction format. The local files in this repository include:
- `cosmos_chat_template.jinja`: custom chat template with tool instructions
- `cosmos_gemma_tool_parser.py`: custom vLLM tool parser for extracting `<tool_call>...</tool_call>` blocks during agent execution
This model is intended for Turkish queries that involve:
- up-to-date information needs,
- rare or hard-to-memorize facts,
- multi-step search and evidence gathering,
- synthesizing a final answer after a limited number of tool calls.
Unlike a standard chat model, this checkpoint is trained for reasoning + acting behavior. During agent execution, it can generate intermediate tool calls and then produce a final natural language answer.
The template also supports `<think>` and `<tool_call>` tags, so deployment systems should be prepared to handle structured tool-use output.
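Because the model may emit `<think>` reasoning blocks before its visible answer, a deployment layer typically strips them before showing output to end users. A minimal sketch (the helper name is illustrative, not part of the shipped parser):

```python
import re

# Matches a <think>...</think> block, including newlines inside it.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_think(text: str) -> str:
    """Remove <think>...</think> reasoning blocks from model output."""
    return THINK_RE.sub("", text).strip()

raw = "<think>The user asks for the capital.</think>Ankara is the capital of Turkey."
print(strip_think(raw))  # -> Ankara is the capital of Turkey.
```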
Highlights
This model is focused specifically on the Turkish language and Turkish web ecosystem, and was trained with a multi-stage SFT + GRPO pipeline using trajectory-style supervision rather than plain QA alone. It was evaluated on a human-written Turkish Web Search Benchmark designed to measure agentic search behavior.
Training overview
Training was carried out in multiple stages. First, around 300,000 synthetic reasoning prompts were generated to improve structured reasoning behavior before search specialization. Then, around 100,000 synthetic web-search prompts were created, of which 1,463 were retained after filtering for cases that genuinely required web search. The final training pipeline consisted of reasoning-oriented SFT, web-search SFT, and GRPO-based reinforcement learning.
Evaluation benchmark
The project introduces a Turkish Web Search Benchmark consisting of 70 human-written questions, grouped by difficulty:
- Easy: generally solvable with a single search
- Medium: generally requires multiple search steps
- Hard: requires multi-step retrieval and evidence synthesis
Representative examples include:
- Easy: *1 Ocak 2026 tarihinde İstanbul belediye başkanı kim?* ("Who is the mayor of Istanbul as of January 1, 2026?")
- Medium: *Dağ 2 filminde, üsteğmen rütbesinde bir oyuncu vardı. Bu oyuncu, başka bir Amerikalı oyuncu ile tip olarak çok benziyordu. Bu özelliği taşıyan Dağ 2 filmindeki oyuncu kimdi?* ("In the film Dağ 2 there was an actor playing a first lieutenant. This actor looked very much like a certain American actor. Who was the actor in Dağ 2 with this trait?")
Reported results
The report compares base models and tool-augmented models.
| Model | Correct | Wrong | Not Attempted | Average tool calls |
|---|---|---|---|---|
| Turkish-Gemma-27b-T1-Scout | 87.14% (61/70) | 12.86% | 0.00% | 10.07 |
| Turkish-Gemma-4b-T1-Scout | 71.43% (50/70) | 22.86% | 5.71% | 4.44 |
| Gemma-3-27B | 10.00% (7/70) | 90.00% | 0.00% | — |
| Gemma-3-4B | 5.71% (4/70) | 94.29% | 0.00% | — |
Intended use
This model is suitable for Turkish web-search assistants, tool-using research agents, retrieval-augmented conversational systems, and academic or open-source agent experiments. It is not a complete production system by itself: reproducing the intended behavior still requires an external search or browsing backend, tool schemas exposed to the model, execution logic for emitted tool calls, post-processing of tool results, and safety or citation controls around final answers.
Out-of-scope use
This model is not intended for fully autonomous decision-making systems or for professional legal, medical, or financial advice. It should also not be treated as standalone browsing software or as a guaranteed source of truth without external verification. Human oversight, logging, and additional safeguards are recommended in high-stakes settings.
Agent behavior and expected runtime format
The training setup encourages outputs of the following general form:
- Tool step: `<think>...</think><tool_call>{...}</tool_call>`
- Final step: `<think>...</think>` followed by a natural-language answer
As a result, this model works best inside an agent runtime. A typical execution loop is:
- pass the user query to the model,
- parse the produced tool call,
- execute the tool,
- append the observation back into context,
- continue until the model gives a final answer or the tool budget is exhausted.
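The loop above can be sketched as follows. `call_model` and `execute_tool` are placeholders for the actual inference call (e.g. a vLLM endpoint) and the external tool backend, neither of which ships with this repository; the message roles and the budget rule are assumptions about a typical runtime, not the exact training harness:

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def run_agent(query, call_model, execute_tool, max_tool_calls=8):
    """Minimal agent loop: generate, parse a tool call, execute, repeat."""
    messages = [{"role": "user", "content": query}]
    for _ in range(max_tool_calls):
        output = call_model(messages)
        match = TOOL_CALL_RE.search(output)
        if match is None:
            return output  # no tool call: treat as the final answer
        call = json.loads(match.group(1))["function"]
        observation = execute_tool(call["name"], call["arguments"])
        # Feed the tool result back into context and continue the loop.
        messages.append({"role": "assistant", "content": output})
        messages.append({"role": "tool", "content": json.dumps(observation)})
    return call_model(messages)  # budget exhausted: ask for a final answer
```

With a stub model that first emits a `search` call and then an answer, the loop terminates after one tool execution.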
Tooling used in training
The agent was trained with a compact function-calling interface centered on web retrieval and evidence extraction. The main functions were:
- `search(query, max_results, recency_days, time_range)`: general web search
- `browse(url, max_chars, js_render, timeout_s)`: opening pages and extracting readable text
- `find_in_page(url, patterns, mode, within_chars, context_chars, case_sensitive, use_regex, max_results)`: locating evidence inside crawled page text
- `paper_search(query, limit, year_from, year_to, fields, venue, sort)`: academic search via Semantic Scholar
This interface was used during both trajectory generation and agent training. In deployment, equivalent external tools or APIs are required for full agent functionality.
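As a sketch of how this interface might be exposed to the model at deployment time, the functions above can be declared as OpenAI-style function schemas. The parameter types below are assumptions inferred from the argument names, not values taken from the original training configuration, and the remaining optional parameters of `find_in_page` and `paper_search` are omitted for brevity:

```python
# Hypothetical tool schemas; types are inferred, not from the training config.
TOOLS = [
    {"type": "function", "function": {
        "name": "search",
        "description": "General web search.",
        "parameters": {"type": "object", "properties": {
            "query": {"type": "string"},
            "max_results": {"type": "integer"},
            "recency_days": {"type": "integer"},
            "time_range": {"type": "string"},
        }, "required": ["query"]}}},
    {"type": "function", "function": {
        "name": "browse",
        "description": "Open a page and extract readable text.",
        "parameters": {"type": "object", "properties": {
            "url": {"type": "string"},
            "max_chars": {"type": "integer"},
            "js_render": {"type": "boolean"},
            "timeout_s": {"type": "number"},
        }, "required": ["url"]}}},
    {"type": "function", "function": {
        "name": "find_in_page",
        "description": "Locate evidence inside crawled page text.",
        "parameters": {"type": "object", "properties": {
            "url": {"type": "string"},
            "patterns": {"type": "array", "items": {"type": "string"}},
        }, "required": ["url", "patterns"]}}},
    {"type": "function", "function": {
        "name": "paper_search",
        "description": "Academic search via Semantic Scholar.",
        "parameters": {"type": "object", "properties": {
            "query": {"type": "string"},
            "limit": {"type": "integer"},
        }, "required": ["query"]}}},
]
```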
Tool-Use Deployment Notes
For agentic use, you should keep the custom formatting artifacts in the same repository:
- `cosmos_chat_template.jinja`
- `cosmos_gemma_tool_parser.py`
The parser expects tool calls in blocks like:
<tool_call>
{"type":"function","function":{"name":"browse","arguments":{"url":"..."}}}
</tool_call>
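The shipped `cosmos_gemma_tool_parser.py` plugs into vLLM's tool-parser plugin API; a minimal stand-in for its extraction step, showing only the block format, might look like this (the function name is illustrative):

```python
import json
import re

def parse_tool_calls(text):
    """Extract every <tool_call>{...}</tool_call> block as a parsed dict."""
    blocks = re.findall(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.DOTALL)
    return [json.loads(b) for b in blocks]

calls = parse_tool_calls(
    '<tool_call>{"type":"function","function":{"name":"search",'
    '"arguments":{"query":"Istanbul mayor 2026"}}}</tool_call>'
)
print(calls[0]["function"]["name"])  # -> search
```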
Tips
- Sampling: use `temperature=0.6`, `top_p=0.95`, `top_k=64`. Do not use greedy decoding, as it can lead to performance degradation and endless repetitions.
- Complex tasks: increase `max_new_tokens`. You can raise the `repetition_penalty` and also adjust the `presence_penalty` parameter (between 0 and 2) to reduce endless repetitions. However, higher values may occasionally cause language mixing and a slight drop in model performance.
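For a vLLM-style deployment, the recommendations above map onto sampling parameters roughly as follows. The penalty values here are illustrative starting points, not values prescribed by the model card:

```python
# Decoding settings from the tips above; penalty values are illustrative
# starting points, not values specified by the authors.
sampling_kwargs = {
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 64,
    "max_tokens": 4096,          # raise for complex multi-step tasks
    "repetition_penalty": 1.05,  # illustrative; tune to curb repetition
    "presence_penalty": 0.5,     # keep within the [0, 2] range noted above
}
# e.g. vllm.SamplingParams(**sampling_kwargs)
```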
Limitations
- This repository does not include the external browsing/search runtime.
- Performance depends heavily on the quality of the external search/runtime stack.
- The benchmark is relatively small (70 questions) and limited in domain coverage, so it may not fully represent all real-world domains.
- The model can still hallucinate, especially when tools are unavailable, misused, or budget-limited.
- Reported results reflect the project's internal evaluation setup and may not transfer unchanged to every real-world deployment.
- Since the model is derived from Gemma 3, downstream use should remain consistent with the base model's license and usage terms.
Acknowledgments
Thanks to Hugging Face for hosting models on S3 storage.
Compute resources were provided by the Barcelona Supercomputing Center.
Contact
COSMOS AI Research Group – Yildiz Technical University, Computer Engineering Department
🔗 https://cosmos.yildiz.edu.tr/
✉️ cosmos@yildiz.edu.tr
License: gemma3