Qwen3.5-FT-Japanese-CoT-4B

Overview

This model is a fine-tuned version of the Qwen3.5-4B base model, trained on Japanese Chain-of-Thought (CoT) reasoning data. Its key characteristic is that it reasons in the same language as the input: in Japanese when prompted in Japanese, in English when prompted in English, and so on.

Model Details

| Item | Detail |
|---|---|
| Base Model | Qwen3.5-4B (Base) |
| Model Size | 4B parameters |
| Fine-tuning Method | Supervised Fine-Tuning (SFT) |
| License | MIT |

Key Characteristics

Native Language Reasoning

Unlike most models that reason primarily in English or Chinese regardless of input language, this model adapts its internal reasoning language to match the input. This behavior is observable in the <think> tags of the model's output.

Example (Japanese input):

User: 明日の会議、準備できてる? (Are you ready for tomorrow's meeting?)
Think: ふむ、上司からの質問だな。丁寧に答える必要がある... (Hmm, a question from my boss. I need to answer politely...)

Example (Italian input):

User: Un mio amico è arrivato in ritardo. Come ti sentiresti? (A friend of mine arrived late. How would you feel?)
Think: Mm, l'utente chiede una reazione emotiva... (Hmm, the user is asking for an emotional reaction...)
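To verify which language the model reasoned in, the `<think>` span can be extracted from the raw output and inspected for script ranges. A minimal sketch (the helper names are ours; only the `<think>` tag convention comes from the model):

```python
import re

def extract_think(text: str) -> str:
    """Return the contents of the first <think>...</think> span, or ''."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    return m.group(1).strip() if m else ""

def contains_japanese(text: str) -> bool:
    """Rough heuristic: any Hiragana, Katakana, or CJK ideograph present."""
    return any(
        "\u3040" <= ch <= "\u30ff" or "\u4e00" <= ch <= "\u9fff"
        for ch in text
    )

raw = "<think>ふむ、上司からの質問だな。</think>はい、準備できています。"
print(contains_japanese(extract_think(raw)))  # → True
```

The same check applied to an English-prompted response should return `False`, confirming the language-matching behavior described above.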

Benchmark Results

Evaluated using lm-evaluation-harness with 100 samples per task.

| Task | Score | Shots | Notes |
|---|---|---|---|
| MMLU | 76.4% | 0-shot | Standard evaluation |
| GSM8K | 83.0% (strict-match) | 5-shot | Standard evaluation |
| ARC-Challenge | 57.0% | 0-shot | Typically evaluated at 25-shot |
| HellaSwag | 71.0% (acc_norm) | 0-shot | Typically evaluated at 10-shot |

Note: ARC-Challenge and HellaSwag were evaluated at 0-shot. Standard evaluations use 25-shot and 10-shot respectively, which typically yield higher scores.
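The setup above can be reproduced with the harness CLI. A sketch of one invocation (shot counts differ per task, so run each task separately with its own `--num_fewshot`; the `--limit 100` sampling matches the evaluation described above):

```shell
# EleutherAI lm-evaluation-harness (pip install lm-eval)
# 0-shot tasks; for GSM8K, run again with --tasks gsm8k --num_fewshot 5
lm_eval --model hf \
  --model_args pretrained=Aname-Tommy/Qwen3.5-FT-Japanese-CoT-4B,dtype=bfloat16 \
  --tasks mmlu,arc_challenge,hellaswag \
  --num_fewshot 0 \
  --limit 100
```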

MMLU Breakdown

| Category | Score |
|---|---|
| Social Sciences | 82.3% |
| Other | 77.4% |
| STEM | 73.8% |
| Humanities | 73.7% |
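As a quick sanity check, the unweighted mean of the four category scores can be compared with the overall MMLU figure; they differ slightly because the overall score is weighted by question count per category (a small sketch, with the scores hard-coded from the breakdown above):

```python
# MMLU category scores from the breakdown above
scores = {
    "Social Sciences": 82.3,
    "Other": 77.4,
    "STEM": 73.8,
    "Humanities": 73.7,
}

macro_avg = sum(scores.values()) / len(scores)
print(f"{macro_avg:.1f}")  # → 76.8 (vs. 76.4 overall, question-weighted)
```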

Comparison with Similar-Scale Models

| Model | MMLU | GSM8K | Parameters |
|---|---|---|---|
| Gemma 3 4B IT | ~70% | ~75% | 4B |
| Qwen3-4B (Official) | ~72-74% | ~67% | 4B |
| This model | 76.4% | 83.0% | 4B |
| GPT-4o mini | ~80% | ~87% | Undisclosed |

Note: Comparisons are approximate and may vary by evaluation setup.

Limitations

  • ARC / HellaSwag: Scores at 0-shot are below typical 4B baselines. These benchmarks measure intuitive commonsense reasoning, which may be less suited to deep CoT-style inference.
  • Benchmark contamination: Cannot be fully ruled out, as with any fine-tuned model.
  • Limited multilingual benchmarks: Full evaluation on multilingual benchmarks (e.g., MMMLU) has not yet been conducted.
  • Training data: Details of the fine-tuning dataset are not fully disclosed.

Intended Use

  • Japanese-language AI applications requiring culturally aware reasoning
  • Multilingual applications where native-language reasoning is preferred
  • Edge/local deployment requiring small model size with strong reasoning capability
  • Research into native-language CoT training approaches

Out-of-Scope Use

  • Safety-critical applications without additional evaluation
  • High-stakes medical, legal, or financial decisions without human oversight
  • Tasks requiring up-to-date world knowledge beyond training cutoff

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Aname-Tommy/Qwen3.5-FT-Japanese-CoT-4B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Prompt: "What is an AI that thinks in Japanese?"
inputs = tokenizer("日本語で考えるAIとは何ですか?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

GGUF (Local Inference)

A GGUF version is also available for local inference:

```shell
# Using llama.cpp or compatible runners
# (in recent llama.cpp builds the binary is named llama-cli rather than main)
./main -m Qwen3.5-FT-Japanese-CoT-4B.gguf -p "日本語で考えるとはどういうことですか?"
```

Training Details

  • Base model: Qwen3.5-4B (Base, not Instruct)
  • Method: Supervised Fine-Tuning on Japanese CoT data
  • Infrastructure: Cloud GPU

Citation

If you use this model in your research, please consider citing it:

```bibtex
@misc{qwen35-ft-japanese-cot-4b,
  author = {Aname-Tommy},
  title  = {Qwen3.5-FT-Japanese-CoT-4B},
  year   = {2026},
  url    = {https://huggingface.co/Aname-Tommy/Qwen3.5-FT-Japanese-CoT-4B}
}
```

Acknowledgements

Based on Qwen3.5 by Alibaba Cloud.
Evaluation conducted using lm-evaluation-harness by EleutherAI.
