OtterMambaLM-Indonesian 🦦

Model bahasa Indonesia eksperimental berbasis arsitektur Mamba2 (State Space Model).

Experimental Indonesian language model based on Mamba2 (State Space Model) architecture.

Model ini dibuat untuk tujuan riset dan pembelajaran arsitektur SSM pada bahasa Indonesia. Bukan untuk produksi.

This model is built for research and learning purposes on SSM architecture for Indonesian language. Not for production use.


📊 Model Details / Detail Model

| Detail | Value |
|---|---|
| Architecture | Mamba2 + RMSNorm |
| Parameters | 38 million |
| Vocab Size | 30,521 (IndoBERT tokenizer) |
| Training Data | Indonesian Wikipedia (subset) |
| Training Steps | 3 epochs on 3.8% of the full dataset |
| Max Sequence | 256 tokens |
| Precision | FP16 (AMP) |
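As a rough sanity check on the 38M figure, the non-Mamba parameters can be counted directly from the shapes used in the usage code below (tied embedding/LM head, learned positional embeddings, final RMSNorm); the exact per-block Mamba2 count depends on mamba_ssm internals, so here it is only inferred as the remainder:

```python
# Hypothetical back-of-the-envelope parameter count, assuming the shapes
# from the usage snippet: d_model=768, vocab=30521, n_layer=4, max_seq=256.
d_model, vocab, n_layer, max_seq = 768, 30521, 4, 256

embedding = vocab * d_model      # 23,440,128 (shared with lm_head via weight tying)
pos_emb = max_seq * d_model      #    196,608
norm_f = d_model                 #        768
non_layer = embedding + pos_emb + norm_f

# The four Mamba2 blocks account for the remainder of the ~38M total:
mamba_total = 38_000_000 - non_layer
print(f"non-layer params: {non_layer:,}")              # 23,637,504
print(f"~ per Mamba2 block: {mamba_total // n_layer:,}")
```

Note how weight tying keeps the model small: the 23.4M embedding matrix is reused as the output head, so more than half of all parameters sit in the (tied) embedding.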

🚀 Usage / Penggunaan

Karena ini custom architecture, kamu perlu mendefinisikan class model-nya dulu sebelum load weights.

Since this is a custom architecture, you need to define the model class first before loading weights.

```python
import torch
from transformers import AutoTokenizer
from mamba_ssm import Mamba2

# 1. Load Tokenizer
# IndoBERT's tokenizer already defines a [PAD] token, so no extra pad-token setup is needed.
tokenizer = AutoTokenizer.from_pretrained("indobenchmark/indobert-base-p1")

# 2. Define Model Class (same as training) / Definisi Class Model (sama seperti training)
class RMSNorm(torch.nn.Module):
    def __init__(self, d_model, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = torch.nn.Parameter(torch.ones(d_model))

    def forward(self, x):
        norm = x.pow(2).mean(-1, keepdim=True)
        x_normed = x * torch.rsqrt(norm + self.eps)
        return self.weight * x_normed

class OtterMambaLM(torch.nn.Module):
    def __init__(self, vocab_size, d_model=768, n_layer=4, max_seq_len=256):
        super().__init__()
        self.embedding = torch.nn.Embedding(vocab_size, d_model)
        self.layers = torch.nn.ModuleList([
            Mamba2(d_model=d_model, d_state=64, d_conv=4, expand=2)
            for _ in range(n_layer)
        ])
        self.norm_f = RMSNorm(d_model)
        self.lm_head = torch.nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embedding.weight  # Weight tying
        self.pos_emb = torch.nn.Parameter(torch.zeros(1, max_seq_len, d_model))

    def forward(self, input_ids):
        x = self.embedding(input_ids) + self.pos_emb[:, :input_ids.shape[1], :]
        for layer in self.layers:
            out = layer(x)
            if isinstance(out, tuple):
                out = out[0]  # Some mamba_ssm versions return a tuple
            x = x + out  # Residual connection
        x = self.norm_f(x)
        return self.lm_head(x)

# 3. Load Weights / Muat Bobot
model = OtterMambaLM(vocab_size=30521, d_model=768, n_layer=4, max_seq_len=256)
model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"))
model.eval()

# 4. Generate Text / Hasilkan Teks (single greedy next-token prediction)
input_ids = tokenizer.encode("Indonesia adalah", return_tensors="pt")
with torch.no_grad():
    logits = model(input_ids)
    next_token = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)
    print(tokenizer.decode(next_token[0]))
```
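The snippet above predicts only a single next token. Longer generation just repeats the argmax step, feeding each new token back into the model; and since plain greedy decoding is prone to the word repetition noted in the limitations below, a common mitigation is a CTRL-style repetition penalty. A model-agnostic sketch (`next_logits` is a toy stand-in here; with OtterMambaLM you would return `model(input_ids)[:, -1, :]` as a score list):

```python
def greedy_generate(next_logits, prompt_ids, max_new_tokens=20, rep_penalty=1.0):
    """Greedy decoding loop. `next_logits(ids)` returns a vocab-sized score
    list for the next token. With rep_penalty > 1, positive scores of tokens
    already in the sequence are divided by the penalty (negative ones are
    multiplied), discouraging repeats (CTRL-style heuristic)."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        scores = list(next_logits(ids))
        if rep_penalty != 1.0:
            for t in set(ids):
                scores[t] = scores[t] / rep_penalty if scores[t] > 0 else scores[t] * rep_penalty
        ids.append(max(range(len(scores)), key=scores.__getitem__))
    return ids

# Toy stand-in "model" that always emits the same scores -> greedy repeats itself,
# while a repetition penalty breaks the loop:
rep = lambda ids: [2.0, 1.8, 0.5]
print(greedy_generate(rep, [0], max_new_tokens=2))                   # [0, 0, 0]
print(greedy_generate(rep, [0], max_new_tokens=2, rep_penalty=1.3))  # [0, 1, 0]
```

This is only an illustrative sketch, not the model's bundled generation code; for the real model you would also stop at an end-of-sequence token and cap the length at the 256-token context.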

⚠️ Limitations & Known Issues / Keterbatasan

Model ini masih sangat awal (3 epoch) dan memiliki keterbatasan:

This model is still at a very early stage (3 epochs) and has the following limitations:

- **Hallucination / Halusinasi**: Fakta mungkin tidak akurat (contoh: Indonesia disebut asteroid). / Facts may be inaccurate (e.g., Indonesia being called an asteroid).
- **Repetition / Pengulangan**: Cenderung mengulang kata (contoh: "yang paling yang paling"). / Tends to repeat words (e.g., "the most the most").
- **Logic / Logika**: Belum memahami logika sebab-akibat atau fakta dunia nyata. / Does not yet understand cause-effect logic or real-world facts.
- **Vocab / Kosakata**: Menggunakan tokenizer IndoBERT yang mungkin menghasilkan [UNK] pada kata langka. / Uses the IndoBERT tokenizer, which may produce [UNK] for rare words.

📈 Training Stats / Statistik Training

- Final Loss: ~5.6
- GPU: NVIDIA Tesla T4 (Kaggle Free Tier)
- Framework: PyTorch 2.x + Mamba2
- Optimizer: AdamW + Cosine LR schedule
- Precision: Mixed precision (AMP FP16)

🙏 Credits / Penghargaan

- Architecture: Mamba2 by State Spaces
- Tokenizer: IndoBERT by IndoBenchmark
- Dataset: Wikipedia ID by Indonesian NLP

📄 License / Lisensi

MIT License

Feel free to use, modify, and experiment! Just don't expect it to write your thesis yet. 😉

Silakan gunakan, modifikasi, dan eksperimen! Jangan harap bisa nulis skripsi dulu ya. 😉

Created with 🦦 and ☕ on Kaggle | Dibuat dengan 🦦 dan ☕ di Kaggle
