Qwen3.5-FT-Japanese-CoT-4B

Overview

This model is a fine-tuned version of the Qwen3.5-4B base model, trained on Japanese Chain-of-Thought (CoT) reasoning data. Its key characteristic is that it reasons in the same language as the input: in Japanese when prompted in Japanese, in English when prompted in English, and so on.

Model Details

| Item | Detail |
|---|---|
| Base Model | Qwen3.5-4B (Base) |
| Model Size | 4B parameters |
| Fine-tuning Method | Supervised Fine-Tuning (SFT) |
| License | MIT |

Key Characteristics

Native Language Reasoning

Unlike most models that reason primarily in English or Chinese regardless of input language, this model adapts its internal reasoning language to match the input. This behavior is observable in the <think> tags of the model's output.

Example (Japanese input):

User: 明日の会議、準備できてる? (Are you ready for tomorrow's meeting?)
Think: ふむ、上司からの質問だな。丁寧に答える必要がある... (Hmm, a question from my boss. I need to answer politely...)

Example (Italian input):

User: Un mio amico è arrivato in ritardo. Come ti sentiresti? (A friend of mine arrived late. How would you feel?)
Think: Mm, l'utente chiede una reazione emotiva... (Hmm, the user is asking for an emotional reaction...)
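To verify which language the model reasoned in, the `<think>` span can be extracted from the raw output and inspected for script ranges. A minimal sketch (the helper names are ours; only the `<think>` tag convention comes from the model):

```python
import re

def extract_think(text: str) -> str:
    """Return the contents of the first <think>...</think> span, or ''."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    return m.group(1).strip() if m else ""

def contains_japanese(text: str) -> bool:
    """Rough heuristic: any Hiragana, Katakana, or CJK ideograph present."""
    return any(
        "\u3040" <= ch <= "\u30ff" or "\u4e00" <= ch <= "\u9fff"
        for ch in text
    )

raw = "<think>ふむ、上司からの質問だな。</think>はい、準備できています。"
print(contains_japanese(extract_think(raw)))  # → True
```

The same check applied to an English-prompted response should return `False`, confirming the language-matching behavior described above.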

Benchmark Results

Evaluated using lm-evaluation-harness with 100 samples per task.

| Task | Score | Shots | Notes |
|---|---|---|---|
| MMLU | 76.4% | 0-shot | Standard evaluation |
| GSM8K | 83.0% (strict-match) | 5-shot | Standard evaluation |
| ARC-Challenge | 57.0% | 0-shot | Typically evaluated at 25-shot |
| HellaSwag | 71.0% (acc_norm) | 0-shot | Typically evaluated at 10-shot |

Note: ARC-Challenge and HellaSwag were evaluated at 0-shot. Standard evaluations use 25-shot and 10-shot respectively, which typically yield higher scores.
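The setup above can be reproduced with the harness CLI. A sketch of one invocation (shot counts differ per task, so run each task separately with its own `--num_fewshot`; the `--limit 100` sampling matches the evaluation described above):

```shell
# EleutherAI lm-evaluation-harness (pip install lm-eval)
# 0-shot tasks; for GSM8K, run again with --tasks gsm8k --num_fewshot 5
lm_eval --model hf \
  --model_args pretrained=Aname-Tommy/Qwen3.5-FT-Japanese-CoT-4B,dtype=bfloat16 \
  --tasks mmlu,arc_challenge,hellaswag \
  --num_fewshot 0 \
  --limit 100
```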

MMLU Breakdown

| Category | Score |
|---|---|
| Social Sciences | 82.3% |
| Other | 77.4% |
| STEM | 73.8% |
| Humanities | 73.7% |
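As a quick sanity check, the unweighted mean of the four category scores can be compared with the overall MMLU figure; they differ slightly because the overall score is weighted by question count per category (a small sketch, with the scores hard-coded from the breakdown above):

```python
# MMLU category scores from the breakdown above
scores = {
    "Social Sciences": 82.3,
    "Other": 77.4,
    "STEM": 73.8,
    "Humanities": 73.7,
}

macro_avg = sum(scores.values()) / len(scores)
print(f"{macro_avg:.1f}")  # → 76.8 (vs. 76.4 overall, question-weighted)
```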

Comparison with Similar-Scale Models

| Model | MMLU | GSM8K | Parameters |
|---|---|---|---|
| Gemma 3 4B IT | ~70% | ~75% | 4B |
| Qwen3-4B (Official) | ~72-74% | ~67% | 4B |
| This model | 76.4% | 83.0% | 4B |
| GPT-4o mini | ~80% | ~87% | Undisclosed |

Note: Comparisons are approximate and may vary by evaluation setup.

Limitations

  • ARC / HellaSwag: Scores at 0-shot are below typical 4B baselines. These benchmarks measure intuitive commonsense reasoning, which may be less suited to deep CoT-style inference.
  • Benchmark contamination: Cannot be fully ruled out, as with any fine-tuned model.
  • Limited multilingual benchmarks: Full evaluation on multilingual benchmarks (e.g., MMMLU) has not yet been conducted.
  • Training data: Details of the fine-tuning dataset are not fully disclosed.

Intended Use

  • Japanese-language AI applications requiring culturally aware reasoning
  • Multilingual applications where native-language reasoning is preferred
  • Edge/local deployment requiring small model size with strong reasoning capability
  • Research into native-language CoT training approaches

Out-of-Scope Use

  • Safety-critical applications without additional evaluation
  • High-stakes medical, legal, or financial decisions without human oversight
  • Tasks requiring up-to-date world knowledge beyond training cutoff

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Aname-Tommy/Qwen3.5-FT-Japanese-CoT-4B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Prompt: "What is an AI that thinks in Japanese?"
inputs = tokenizer("日本語で考えるAIとは何ですか?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

GGUF (Local Inference)

A GGUF version is also available for local inference:

```shell
# Using llama.cpp or compatible runners
# (in recent llama.cpp builds the binary is named llama-cli rather than main)
./main -m Qwen3.5-FT-Japanese-CoT-4B.gguf -p "日本語で考えるとはどういうことですか?"
```

Training Details

  • Base model: Qwen3.5-4B (Base, not Instruct)
  • Method: Supervised Fine-Tuning on Japanese CoT data
  • Infrastructure: Cloud GPU

Citation

If you use this model in your research, please consider citing it:

```bibtex
@misc{qwen35-ft-japanese-cot-4b,
  author = {Aname-Tommy},
  title  = {Qwen3.5-FT-Japanese-CoT-4B},
  year   = {2026},
  url    = {https://huggingface.co/Aname-Tommy/Qwen3.5-FT-Japanese-CoT-4B}
}
```

Acknowledgements

Based on Qwen3.5 by Alibaba Cloud.
Evaluation conducted using lm-evaluation-harness by EleutherAI.
