OtterMambaLM-Indonesian 🦦

Model bahasa Indonesia eksperimental berbasis arsitektur Mamba2 (State Space Model).

Experimental Indonesian language model based on Mamba2 (State Space Model) architecture.

Model ini dibuat untuk tujuan riset dan pembelajaran arsitektur SSM pada bahasa Indonesia. Bukan untuk produksi.

This model is built for research and learning purposes on SSM architecture for Indonesian language. Not for production use.


📊 Model Details / Detail Model

| Detail | Value |
|---|---|
| Architecture | Mamba2 + RMSNorm |
| Parameters | 38 million |
| Vocab Size | 30,521 (IndoBERT tokenizer) |
| Training Data | Indonesian Wikipedia (subset) |
| Training Steps | 3 epochs on 3.8% of the full dataset |
| Max Sequence | 256 tokens |
| Precision | FP16 (AMP) |
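As a rough sanity check on the 38M figure, the non-Mamba parameters can be counted directly from the shapes used in the usage code below (tied embedding/LM head, learned positional embeddings, final RMSNorm); the exact per-block Mamba2 count depends on mamba_ssm internals, so here it is only inferred as the remainder:

```python
# Hypothetical back-of-the-envelope parameter count, assuming the shapes
# from the usage snippet: d_model=768, vocab=30521, n_layer=4, max_seq=256.
d_model, vocab, n_layer, max_seq = 768, 30521, 4, 256

embedding = vocab * d_model      # 23,440,128 (shared with lm_head via weight tying)
pos_emb = max_seq * d_model      #    196,608
norm_f = d_model                 #        768
non_layer = embedding + pos_emb + norm_f

# The four Mamba2 blocks account for the remainder of the ~38M total:
mamba_total = 38_000_000 - non_layer
print(f"non-layer params: {non_layer:,}")              # 23,637,504
print(f"~ per Mamba2 block: {mamba_total // n_layer:,}")
```

Note how weight tying keeps the model small: the 23.4M embedding matrix is reused as the output head, so more than half of all parameters sit in the (tied) embedding.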

🚀 Usage / Penggunaan

Karena ini custom architecture, kamu perlu mendefinisikan class model-nya dulu sebelum load weights.

Since this is a custom architecture, you need to define the model class first before loading weights.

```python
import torch
from transformers import AutoTokenizer
from mamba_ssm import Mamba2

# 1. Load Tokenizer
# IndoBERT's tokenizer already defines a [PAD] token, so no extra pad-token setup is needed.
tokenizer = AutoTokenizer.from_pretrained("indobenchmark/indobert-base-p1")

# 2. Define Model Class (same as training) / Definisi Class Model (sama seperti training)
class RMSNorm(torch.nn.Module):
    def __init__(self, d_model, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = torch.nn.Parameter(torch.ones(d_model))

    def forward(self, x):
        norm = x.pow(2).mean(-1, keepdim=True)
        x_normed = x * torch.rsqrt(norm + self.eps)
        return self.weight * x_normed

class OtterMambaLM(torch.nn.Module):
    def __init__(self, vocab_size, d_model=768, n_layer=4, max_seq_len=256):
        super().__init__()
        self.embedding = torch.nn.Embedding(vocab_size, d_model)
        self.layers = torch.nn.ModuleList([
            Mamba2(d_model=d_model, d_state=64, d_conv=4, expand=2)
            for _ in range(n_layer)
        ])
        self.norm_f = RMSNorm(d_model)
        self.lm_head = torch.nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embedding.weight  # Weight tying
        self.pos_emb = torch.nn.Parameter(torch.zeros(1, max_seq_len, d_model))

    def forward(self, input_ids):
        x = self.embedding(input_ids) + self.pos_emb[:, :input_ids.shape[1], :]
        for layer in self.layers:
            out = layer(x)
            if isinstance(out, tuple):
                out = out[0]  # Some mamba_ssm versions return a tuple
            x = x + out  # Residual connection
        x = self.norm_f(x)
        return self.lm_head(x)

# 3. Load Weights / Muat Bobot
model = OtterMambaLM(vocab_size=30521, d_model=768, n_layer=4, max_seq_len=256)
model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"))
model.eval()

# 4. Generate Text / Hasilkan Teks (single greedy next-token prediction)
input_ids = tokenizer.encode("Indonesia adalah", return_tensors="pt")
with torch.no_grad():
    logits = model(input_ids)
    next_token = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)
    print(tokenizer.decode(next_token[0]))
```
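The snippet above predicts only a single next token. Longer generation just repeats the argmax step, feeding each new token back into the model; and since plain greedy decoding is prone to the word repetition noted in the limitations below, a common mitigation is a CTRL-style repetition penalty. A model-agnostic sketch (`next_logits` is a toy stand-in here; with OtterMambaLM you would return `model(input_ids)[:, -1, :]` as a score list):

```python
def greedy_generate(next_logits, prompt_ids, max_new_tokens=20, rep_penalty=1.0):
    """Greedy decoding loop. `next_logits(ids)` returns a vocab-sized score
    list for the next token. With rep_penalty > 1, positive scores of tokens
    already in the sequence are divided by the penalty (negative ones are
    multiplied), discouraging repeats (CTRL-style heuristic)."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        scores = list(next_logits(ids))
        if rep_penalty != 1.0:
            for t in set(ids):
                scores[t] = scores[t] / rep_penalty if scores[t] > 0 else scores[t] * rep_penalty
        ids.append(max(range(len(scores)), key=scores.__getitem__))
    return ids

# Toy stand-in "model" that always emits the same scores -> greedy repeats itself,
# while a repetition penalty breaks the loop:
rep = lambda ids: [2.0, 1.8, 0.5]
print(greedy_generate(rep, [0], max_new_tokens=2))                   # [0, 0, 0]
print(greedy_generate(rep, [0], max_new_tokens=2, rep_penalty=1.3))  # [0, 1, 0]
```

This is only an illustrative sketch, not the model's bundled generation code; for the real model you would also stop at an end-of-sequence token and cap the length at the 256-token context.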

⚠️ Limitations & Known Issues / Keterbatasan

Model ini masih sangat awal (3 epoch) dan memiliki keterbatasan:

This model is still at a very early stage (3 epochs) and has the following limitations:

- **Hallucination / Halusinasi**: Fakta mungkin tidak akurat (contoh: Indonesia disebut asteroid). / Facts may be inaccurate (e.g., Indonesia being called an asteroid).
- **Repetition / Pengulangan**: Cenderung mengulang kata (contoh: "yang paling yang paling"). / Tends to repeat words (e.g., "the most the most").
- **Logic / Logika**: Belum memahami logika sebab-akibat atau fakta dunia nyata. / Does not yet understand cause-effect logic or real-world facts.
- **Vocab / Kosakata**: Menggunakan tokenizer IndoBERT yang mungkin menghasilkan [UNK] pada kata langka. / Uses the IndoBERT tokenizer, which may produce [UNK] for rare words.

📈 Training Stats / Statistik Training

- Final Loss: ~5.6
- GPU: NVIDIA Tesla T4 (Kaggle Free Tier)
- Framework: PyTorch 2.x + Mamba2
- Optimizer: AdamW + Cosine LR schedule
- Precision: Mixed precision (AMP FP16)

🙏 Credits / Penghargaan

- Architecture: Mamba2 by State Spaces
- Tokenizer: IndoBERT by IndoBenchmark
- Dataset: Wikipedia ID by Indonesian NLP

📄 License / Lisensi

MIT License

Feel free to use, modify, and experiment! Just don't expect it to write your thesis yet. 😉

Silakan gunakan, modifikasi, dan eksperimen! Jangan harap bisa nulis skripsi dulu ya. 😉

Created with 🦦 and ☕ on Kaggle | Dibuat dengan 🦦 dan ☕ di Kaggle
