sangmin6600/mamba2-400m-ko-sft

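This is a small Mamba2-based language model trained from scratch on Korean data.
It uses a custom-built BPE tokenizer. After pretraining, it was further trained with instruction tuning (IT) so that it is suited to question answering, instruction following, and simple conversation.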

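After developing this model, we attempted to scale the Mamba2 architecture to roughly 7B parameters.
However, activation memory usage grew so large during training that training was not feasible.
In particular, unlike Transformers, Mamba-family models use state-space operations based on a selective scan,
and they must retain more intermediate activations for the backward pass during training.
As a result, the activation memory overhead was larger than for a Transformer with the same parameter count.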

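Even with Fully Sharded Data Parallel (FSDP), activation checkpointing, an 8-bit optimizer, and an 8K input-length limit, the OOM problem was not resolved on 4 x 80 GB VRAM GPUs. We concluded that training large-scale Mamba models requires either additional system resources or architectural improvements.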
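
To put the 4 x 80 GB setup in perspective, the following is a rough back-of-envelope sketch of how much memory the sharded model state alone would need for a 7B-parameter model. The byte counts are assumptions, not measurements from the actual run: bf16 weights and gradients (2 bytes each), two 8-bit optimizer states (1 byte each), no fp32 master weights, and ideal full sharding across GPUs. The helper name `model_state_gb_per_gpu` is hypothetical.

```python
GB = 1e9  # decimal gigabytes; this is only a rough estimate


def model_state_gb_per_gpu(n_params, n_gpus,
                           weight_bytes=2,   # assumption: bf16 weights
                           grad_bytes=2,     # assumption: bf16 gradients
                           optim_bytes=2):   # assumption: two 8-bit optimizer states
    """Estimate per-GPU memory for weights, gradients and optimizer state
    when all three are evenly sharded across n_gpus.

    Activations, temporary all-gather buffers and allocator fragmentation
    are NOT included.
    """
    total_bytes = n_params * (weight_bytes + grad_bytes + optim_bytes)
    return total_bytes / n_gpus / GB


per_gpu = model_state_gb_per_gpu(n_params=7e9, n_gpus=4)
headroom = 80 - per_gpu  # what is left on an 80 GB GPU for everything else

print(f"sharded model state: {per_gpu:.1f} GB per GPU")
print(f"left for activations and buffers: {headroom:.1f} GB per GPU")
```

Under these assumptions the sharded model state takes only about 10.5 GB per GPU, so most of each 80 GB device remains available. That is consistent with the account above that activation memory, rather than parameters or optimizer state, was the limiting factor.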

Uses

Install Dependencies

Install the causal-conv1d and mamba_ssm builds that match your Python, CUDA, and torch versions.

pip install causal-conv1d
pip install mamba-ssm
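
Because both packages ship compiled CUDA extensions, it can help to confirm that they are visible to the interpreter before loading the model. The sketch below uses only the standard library; the helper name `find_missing` is hypothetical, and the check only confirms that the modules can be located, not that their CUDA kernels match your torch build.

```python
import importlib.util

# Import names for the required packages. The pip names use hyphens
# (causal-conv1d, mamba-ssm) while the import names use underscores.
REQUIRED_MODULES = ("torch", "transformers", "causal_conv1d", "mamba_ssm")


def find_missing(modules=REQUIRED_MODULES):
    """Return the names in `modules` that cannot be located for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages were found.")
```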

Direct Use

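import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "sangmin6600/mamba2-400m-ko-sft"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load the weights in bfloat16 and move the model to the GPU.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to("cuda")

# Chat messages; uncomment the system entry to add a system prompt.
inputs = [
    # {"role": "system", "content": "..."},
    {"role": "user", "content": "인공지능 언어 모델이 뭐야?"},  # "What is an AI language model?"
]
# Render the messages into a prompt string with the model's chat template.
text = tokenizer.apply_chat_template(inputs, tokenize=False, add_generation_prompt=True)

input_ids = tokenizer(text, return_tensors="pt")['input_ids'].to("cuda")
# Sample up to 2048 new tokens, then decode the full sequence (prompt included).
output_ids = model.generate(input_ids, max_new_tokens=2048, do_sample=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))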
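
For multi-turn use, the message list passed to `apply_chat_template` can be assembled with a small helper. This is only a sketch: `build_messages` is a hypothetical name, and whether the model's chat template handles a system role or earlier turns well is an assumption based on the commented-out system entry above.

```python
def build_messages(user_text, system_text=None, history=None):
    """Build a chat message list in the role/content format used above.

    history is an optional list of (user, assistant) string pairs from
    earlier turns, oldest first.
    """
    messages = []
    if system_text:
        messages.append({"role": "system", "content": system_text})
    for past_user, past_assistant in history or []:
        messages.append({"role": "user", "content": past_user})
        messages.append({"role": "assistant", "content": past_assistant})
    messages.append({"role": "user", "content": user_text})
    return messages


example = build_messages(
    "인공지능 언어 모델이 뭐야?",  # "What is an AI language model?"
    system_text="...",  # placeholder, as in the snippet above
)
print(example)
```

The returned list can be passed in place of `inputs` in the snippet above.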

