# Qwen3.5-FT-Japanese-CoT-4B

## Overview

This model is a fine-tuned version of the Qwen3.5-4B base model, trained on Japanese Chain-of-Thought (CoT) reasoning data. Its key characteristic is that it reasons in the same language as the input: reasoning in Japanese when prompted in Japanese, in English when prompted in English, and so on.

## Model Details
| Item | Detail |
|---|---|
| Base Model | Qwen3.5-4B (Base) |
| Model Size | 4B parameters |
| Fine-tuning Method | Supervised Fine-Tuning (SFT) |
| License | MIT |
## Key Characteristics

### Native Language Reasoning

Unlike most models that reason primarily in English or Chinese regardless of input language, this model adapts its internal reasoning language to match the input. This behavior is observable in the `<think>` tags of the model's output.

Example (Japanese input):
```
User: 明日の会議、準備できてる? (Are you ready for tomorrow's meeting?)
Think: ふむ、上司からの質問だな。丁寧に答える必要がある... (Hmm, a question from my boss. I need to answer politely...)
```
Example (Italian input):

```
User: Un mio amico è arrivato in ritardo. Come ti sentiresti? (A friend of mine arrived late. How would you feel?)
Think: Mm, l'utente chiede una reazione emotiva... (Hmm, the user is asking for an emotional reaction...)
```
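Downstream code can separate the reasoning segment from the final answer by splitting on the `<think>` tags. A minimal sketch in Python, assuming the output wraps its chain of thought in `<think>...</think>` as in the examples above:

```python
import re

def split_reasoning(text: str) -> tuple:
    """Split raw model output into (reasoning, answer).

    Assumes the chain of thought is wrapped in <think>...</think>,
    matching the examples above; if no tag is found, the whole text
    is treated as the answer.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()          # no reasoning block found
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()  # everything after </think>
    return reasoning, answer

raw = "<think>ふむ、上司からの質問だな。</think>はい、準備できています。"
reasoning, answer = split_reasoning(raw)
print(reasoning)  # ふむ、上司からの質問だな。
print(answer)     # はい、準備できています。
```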
## Benchmark Results

Evaluated using lm-evaluation-harness with 100 samples per task.
| Task | Score | Shot | Notes |
|---|---|---|---|
| MMLU | 76.4% | 0-shot | Standard evaluation |
| GSM8K | 83.0% | 5-shot (strict-match) | Standard evaluation |
| ARC-Challenge | 57.0% | 0-shot | Typically evaluated at 25-shot |
| HellaSwag | 71.0% (acc_norm) | 0-shot | Typically evaluated at 10-shot |
Note: ARC-Challenge and HellaSwag were evaluated at 0-shot. Standard evaluations use 25-shot and 10-shot respectively, which typically yield higher scores.
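For context on the GSM8K "strict-match" figure: the metric counts a completion correct only if the final number extracted from it exactly matches the number extracted from the reference answer (GSM8K references end in `#### <number>`). A rough illustration of this scoring style, not the harness's exact regex:

```python
import re
from typing import Optional

def extract_final_number(completion: str) -> Optional[str]:
    """Pull the last number from a completion.

    Illustrative only: the actual lm-evaluation-harness filter for
    GSM8K strict-match may use a different pattern.
    """
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", completion)
    if not numbers:
        return None
    return numbers[-1].replace(",", "")  # normalize thousands separators

def strict_match(completion: str, reference: str) -> bool:
    """Exact match on the extracted final numbers."""
    return extract_final_number(completion) == extract_final_number(reference)

print(strict_match("so she earns 18 - 9 = 9 dollars. #### 9", "#### 9"))  # True
print(strict_match("the answer is about 10", "#### 9"))                   # False
```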
### MMLU Breakdown
| Category | Score |
|---|---|
| Social Sciences | 82.3% |
| Other | 77.4% |
| STEM | 73.8% |
| Humanities | 73.7% |
## Comparison with Similar-Scale Models
| Model | MMLU | GSM8K | Parameters |
|---|---|---|---|
| Gemma 3 4B IT | ~70% | ~75% | 4B |
| Qwen3-4B (Official) | ~72-74% | ~67% | 4B |
| This model | 76.4% | 83.0% | 4B |
| GPT-4o mini | ~80% | ~87% | Undisclosed |
Note: Comparisons are approximate and may vary by evaluation setup.

## Limitations
- ARC / HellaSwag: Scores at 0-shot are below typical 4B baselines. These benchmarks measure intuitive commonsense reasoning, which may be less suited to deep CoT-style inference.
- Benchmark contamination: Cannot be fully ruled out, as with any fine-tuned model.
- Limited multilingual benchmarks: Full evaluation on multilingual benchmarks (e.g., MMMLU) has not yet been conducted.
- Training data: Details of the fine-tuning dataset are not fully disclosed.
## Intended Use
- Japanese-language AI applications requiring culturally aware reasoning
- Multilingual applications where native-language reasoning is preferred
- Edge/local deployment requiring small model size with strong reasoning capability
- Research into native-language CoT training approaches
## Out-of-Scope Use
- Safety-critical applications without additional evaluation
- High-stakes medical, legal, or financial decisions without human oversight
- Tasks requiring up-to-date world knowledge beyond training cutoff
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Aname-Tommy/Qwen3.5-FT-Japanese-CoT-4B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Prompt: "What is an AI that thinks in Japanese?"
inputs = tokenizer("日本語で考えるAIとは何ですか?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### GGUF (Local Inference)

A GGUF version is also available for local inference:

```bash
# Using llama.cpp or compatible runners
# (recent llama.cpp releases name this binary llama-cli instead of main)
# Prompt: "What does it mean to think in Japanese?"
./main -m Qwen3.5-FT-Japanese-CoT-4B.gguf -p "日本語で考えるとはどういうことですか?"
```
## Training Details
- Base model: Qwen3.5-4B (Base, not Instruct)
- Method: Supervised Fine-Tuning on Japanese CoT data
- Infrastructure: Cloud GPU
## Citation

If you use this model in your research, please consider citing it:

```bibtex
@misc{qwen35-ft-japanese-cot-4b,
  author = {Aname-Tommy},
  title  = {Qwen3.5-FT-Japanese-CoT-4B},
  year   = {2026},
  url    = {https://huggingface.co/Aname-Tommy/Qwen3.5-FT-Japanese-CoT-4B}
}
```
## Acknowledgements

Based on Qwen3.5 by Alibaba Cloud. Evaluation conducted using lm-evaluation-harness by EleutherAI.