🌊 LFM2-1.2B-KoEn-MT-v4-100k

LFM2-1.2B-KoEn-MT-v4-100k is a model fine-tuned from LiquidAI's LFM2-1.2B on a high-quality parallel dataset of 100,000 samples to improve Korean-English translation.

It was trained with a pipeline optimized for a T4 GPU x 2 (DDP) environment and delivers efficient, respectable translation quality despite its lightweight 1.2B parameters. In particular, it is competitive with NLLB-600M, opening up possibilities for deployment on mobile and edge devices.

📊 Benchmarks

Evaluation results on the Flores-200 dataset (1,012 sentences), sorted by chrF++. A minimal scoring sketch follows the table.

| Rank | Model | chrF++ | BLEU | Notes |
|---|---|---|---|---|
| 1 | Google Translate | 39.27 | 18.18 | Commercial service (target) |
| 2 | Yanolja-4B-GGUF | 38.61 | 16.03 | Open-source model (SOTA) |
| 3 | NLLB-200 (3.3B) | 35.09 | 11.68 | 3.3B dedicated translation model |
| 4 | Gemma-3-4B-it-GGUF | 32.83 | 11.36 | Google's latest 4B model |
| 5 | NLLB-200-Distilled-600M | 31.97 | 10.32 | 600M dedicated translation model |
| 6 | LFM2-1.2B-KoEn-MT-v4-100k | 31.53 | 11.13 | This model (1.2B) |
| 7 | lfm2-mt-v1 | 30.85 | 11.17 | Trained on 100 samples |
| 8 | LFM2-1.2B | 27.23 | 6.43 | Baseline model |
| 9 | Qwen3-4B-GGUF | 25.62 | 7.46 | 4B base model |
| 10 | Gemma-3-1B-it-GGUF | 24.07 | 6.94 | 1B model |
| 11 | Qwen3-1.7B-GGUF | 21.19 | - | 1.7B base model |
| 12 | Qwen3-0.6B-GGUF | 13.48 | 1.98 | 0.6B base model |
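
For reference, scores like the ones above can be computed with the sacrebleu library. The sketch below is an assumed evaluation setup, not the exact script used for the table: file names are illustrative, and the BLEU tokenizer choice is left at the sacrebleu default.

# Scoring sketch using the sacrebleu library; file names are illustrative assumptions.
import sacrebleu

# One sentence per line: model outputs and the matching Flores-200 references.
with open("hypotheses.txt", encoding="utf-8") as f:
    hyps = [line.strip() for line in f]
with open("references.txt", encoding="utf-8") as f:
    refs = [line.strip() for line in f]

# chrF++ is chrF with word n-grams enabled (word_order=2).
chrf = sacrebleu.corpus_chrf(hyps, [refs], word_order=2)
# BLEU with sacrebleu's default "13a" tokenizer; the tokenizer used for the table
# above is not documented here, so treat this value as an approximation.
bleu = sacrebleu.corpus_bleu(hyps, [refs])

print(f"chrF++: {chrf.score:.2f}")
print(f"BLEU:   {bleu.score:.2f}")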

📈 Training Logs

Loss and learning-rate trends over approximately 6,188 training steps. The loss started around 3.5 and converged stably to a final value of 1.43. A sketch of the learning-rate schedule follows the hyperparameter list below.

| Step | Epoch | Training Loss (avg) | Learning Rate | Notes |
|---|---|---|---|---|
| 0 | 0.00 | 3.57 | 0 | Start |
| 500 | 0.08 | 1.59 | 8.06e-06 | Warmup in progress |
| 1000 | 0.16 | 1.57 | 9.88e-06 | Warmup complete, early stabilization |
| 2000 | 0.32 | 1.48 | 8.45e-06 | Loss drops below 1.5 |
| 3000 | 0.49 | 1.46 | 5.99e-06 | Mid-run convergence |
| 4000 | 0.65 | 1.45 | 3.21e-06 | Fine-tuning phase |
| 5000 | 0.81 | 1.44 | 1.08e-06 | Approaching best performance |
| 6000 | 0.98 | 1.43 | 6.30e-09 | Final convergence |
  • Optimizer: paged_adamw_8bit
  • LR Scheduler: Cosine Decay with Warmup (0.1 ratio)
  • Max LR: 1e-5
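
The logged learning rates are consistent with a standard linear-warmup + cosine-decay schedule. The sketch below assumes roughly 6,188 total steps and the 0.1 warmup ratio above; it will differ slightly from the exact transformers scheduler implementation.

import math

# Minimal sketch of the assumed linear-warmup + cosine-decay schedule.
max_lr = 1e-5
total_steps = 6188                      # approximate total steps reported above
warmup_steps = int(0.1 * total_steps)   # warmup_ratio = 0.1

def lr_at(step: int) -> float:
    if step < warmup_steps:
        # Linear warmup from 0 up to max_lr
        return max_lr * step / max(1, warmup_steps)
    # Cosine decay from max_lr down toward 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for s in (500, 1000, 2000, 4000, 6000):
    print(f"step {s}: lr = {lr_at(s):.2e}")
# Prints values close to the logged ones, e.g. ~8.1e-06 at step 500 and ~9.9e-06 at step 1000.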

🚀 Usage

이 λͺ¨λΈμ€ transformers 라이브러리λ₯Ό μ‚¬μš©ν•˜μ—¬ μ‰½κ²Œ λ‘œλ“œν•˜κ³  λ²ˆμ—­μ„ μˆ˜ν–‰ν•  수 μžˆμŠ΅λ‹ˆλ‹€.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# λͺ¨λΈ λ‘œλ“œ
model_id = "gyung/lfm2-1.2b-koen-mt-v4-100k"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16
)

# Sentence to translate
text = "The model is working correctly now."

# Apply the chat template (ChatML format recommended)
messages = [
    {"role": "system", "content": "Translate to Korean."},
    {"role": "user", "content": text}
]

# Tokenize the input
input_ids = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True
).to(model.device)

# Generate the translation
outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    pad_token_id=tokenizer.eos_token_id
)

# Decode the result
decoded = tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True)
print(f"Input: {text}")
print(f"Output: {decoded}")
# Output: λͺ¨λΈμ΄ μ •μƒμ μœΌλ‘œ μž‘λ™ν•˜κ³  μžˆμŠ΅λ‹ˆλ‹€.
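
The same call pattern can be wrapped in a small helper when many sentences need to be translated, for example when scoring the model on Flores-200. This is a sketch that reuses the model, tokenizer, and prompt defined above and translates one sentence per call (no real batching).

# Sketch: translate a list of English sentences one at a time, reusing the setup above.
def translate_all(sentences, max_new_tokens=256):
    results = []
    for sentence in sentences:
        messages = [
            {"role": "system", "content": "Translate to Korean."},
            {"role": "user", "content": sentence},
        ]
        ids = tokenizer.apply_chat_template(
            messages, return_tensors="pt", add_generation_prompt=True
        ).to(model.device)
        out = model.generate(
            ids, max_new_tokens=max_new_tokens, pad_token_id=tokenizer.eos_token_id
        )
        results.append(tokenizer.decode(out[0][ids.shape[1]:], skip_special_tokens=True))
    return results

print(translate_all(["Good morning.", "How are you today?"]))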

⚙️ Training Details

이 λͺ¨λΈμ€ Kaggle T4 x 2 ν™˜κ²½μ—μ„œ μ΅œμ ν™”λœ μ„€μ •μœΌλ‘œ ν•™μŠ΅λ˜μ—ˆμŠ΅λ‹ˆλ‹€.

Configuration

  • Base Model: LiquidAI/LFM2-1.2B
  • Dataset: dataset_100000.jsonl (English-Korean Parallel, 100k samples)
  • Hardware: NVIDIA T4 GPU x 2 (Data Parallelism, DDP)
  • Epochs: 1
  • Batch Size: 1 per device × gradient accumulation 16 × 2 GPUs → effective batch size 32
  • Optimizer: paged_adamw_8bit
  • Learning Rate: 1e-5 (Cosine Scheduler, Warmup 0.1)
  • Precision: Mixed precision (FP16), optimized for T4

Training Code Snippet

# SFTTrainer configuration used for v4
from trl import SFTConfig

sft_config = SFTConfig(
    output_dir="/kaggle/working/lfm2-mt-v4",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    gradient_checkpointing=True,
    optim="paged_adamw_8bit",
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    logging_steps=50,
    save_steps=500,
    eval_strategy="no",  # Optimized for speed
    dataset_text_field="messages",
    packing=False,
    ddp_find_unused_parameters=False,
)
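
For context, the sketch below shows one way this config could be wired into an SFTTrainer run. The dataset file name comes from the configuration above, but the loading code, the messages layout, and the trainer arguments are assumptions, not the exact training script.

# Sketch only: assumed wiring of the SFTConfig above into an SFTTrainer run.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer

base_id = "LiquidAI/LFM2-1.2B"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# Assumed JSONL layout: one {"messages": [{"role": ..., "content": ...}, ...]} per line.
dataset = load_dataset("json", data_files="dataset_100000.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    args=sft_config,             # the SFTConfig defined above
    train_dataset=dataset,
    processing_class=tokenizer,  # older trl versions take tokenizer= instead
)
trainer.train()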

⚠️ Limitations

  • 이 λͺ¨λΈμ€ 1.2B νŒŒλΌλ―Έν„°μ˜ μ†Œν˜• λͺ¨λΈλ‘œ, 맀우 λ³΅μž‘ν•˜κ±°λ‚˜ 전문적인 λ¬Έλ§₯μ—μ„œλŠ” λŒ€ν˜• λͺ¨λΈ(4B+)보닀 μ„±λŠ₯이 λ–¨μ–΄μ§ˆ 수 μžˆμŠ΅λ‹ˆλ‹€.
  • ν•™μŠ΅ 데이터에 ν¬ν•¨λ˜μ§€ μ•Šμ€ 희귀 λ‹¨μ–΄λ‚˜ 맀우 κΈ΄ λ¬Έμž₯에 λŒ€ν•΄μ„œλŠ” ν™˜κ°(Hallucination)이 λ°œμƒν•  수 μžˆμŠ΅λ‹ˆλ‹€.

📜 License

이 λͺ¨λΈμ€ Liquid AI LFM Open License v1.0을 λ”°λ¦…λ‹ˆλ‹€.

  • Permitted: Academic research and personal use without restriction.
  • Commercial use: Companies and individuals with annual revenue under USD 10 million (approx. KRW 14 billion) may use the model commercially free of charge.
  • Restriction: Companies with annual revenue above USD 10 million require a separate license agreement with Liquid AI. See the LICENSE file for details.

Citation

Model

@misc{lfm2-1.2b-koen-mt-v4-100k,
  author = {Gyung},
  title = {LFM2-1.2B Korean-English Machine Translation Model v4},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/gyung/lfm2-1.2b-koen-mt-v4-100k}}
}

Base Model (LiquidAI LFM2-1.2B)

@article{liquidai2025lfm2,
  title={LFM2 Technical Report},
  author={Liquid AI},
  journal={arXiv preprint arXiv:2511.23404},
  year={2025}
}

Evaluation Dataset (Flores-200)

@article{nllb2022,
  author = {NLLB Team and Costa-juss{\`a}, Marta R. and Cross, James and {\c{C}}elebi, Onur and et al.},
  title = {No Language Left Behind: Scaling Human-Centered Machine Translation},
  year = {2022},
  journal = {arXiv preprint arXiv:2207.04672}
}

Metrics

@inproceedings{popovic-2015-chrf,
    title = "chrF: character n-gram F-score for automatic MT evaluation",
    author = "Popovi{\'c}, Maja",
    booktitle = "Proceedings of the Tenth Workshop on Statistical Machine Translation",
    month = sep,
    year = "2015",
    address = "Lisbon, Portugal",
    publisher = "Association for Computational Linguistics",
    pages = "392--395",
}

@inproceedings{post-2018-call,
    title = "A Call for Clarity in Reporting BLEU Scores",
    author = "Post, Matt",
    booktitle = "Proceedings of the Third Conference on Machine Translation: Research Papers",
    month = oct,
    year = "2018",
    address = "Belgium, Brussels",
    publisher = "Association for Computational Linguistics",
    pages = "186--191",
}