aitfindonesia committed on Dec 17, 2025

Commit

aeb6289

verified ·

1 Parent(s): bf0c7a2

Update README.md content

Files changed (1)

README.md +104 -68

@@ -1,76 +1,88 @@
-# Bakti-8B-Base
-- **library_name:** transformers
-- **base_model:** Qwen/Qwen3-8B
-- **tags:** qwen, qwen3, causal-lm, continued-pretraining, indonesian, id, prd, dtp
-- **license:** apache-2.0
-- **language:** id, en
 ---
-## 📌 Overview
-**Bakti-8B-Base** is an 8-billion-parameter Large Language Model (LLM) adapted specifically for Indonesia's strategic focus areas:
-* **Perlindungan Ruang Digital (PRD)** – Digital Space Protection
-* **Digital Talent Pool (DTP)** – Workforce and digital capability development
-This model is built through **Continued Pre‑training (CPT)** on the **Qwen‑3‑8B** base model using a curated Indonesian dataset.
 ---
-## 🧠 Model Details
-### Model Description
-* **Developed by:** *AITF Indonesia*
-* **Model Type:** Causal Language Model (Base)
-* **Base Model:** Qwen/Qwen3-8B
-* **Language:** Indonesian (Primary), English (Secondary)
-* **License:** Apache 2.0
-* **Training Method:** Continued Pre‑training (CPT)
-### 🎯 Goal
-To create a sovereign, domain‑specialized Indonesian foundation model with strong understanding of:
-* Digital policies (UU PDP, UU ITE)
-* Digital workforce & skill landscape (DTP)
----
-## 📚 Dataset Composition
-Total Dataset Size: **~214.2 Million Tokens**
-| Category         | Description                                                 | Token Count (M) | Percentage |
-| ---------------- | ----------------------------------------------------------- | --------------- | ---------- |
-| **DTP**          | Digital HR, tech syllabi, certifications, job trends        | 94.0            | ~43.9%     |
-| **PRD**          | Cybersecurity, PDP Law, content moderation, hoax prevention | 92.0            | ~42.9%     |
-| **Wikipedia ID** | General knowledge anchor & grammar stability                | 28.2            | ~13.2%     |
-| **Total**        | —                                                           | **214.2**       | **100%**   |
 ---
-## 🧩 Intended Use
-As a **Base Model**, Bakti‑8B outputs **text completions** and can be adapted into chat/instruct variants.
-### 1. PRD (Perlindungan Ruang Digital)
-* Policy sentiment analysis
-* Misinformation pattern detection
-* Understanding legal terminology (UU ITE, UU PDP)
-### 2. DTP (Digital Talent Pool)
-* Skill gap analysis
-* Curriculum drafting assistance
-* Job description & talent understanding
 ---
-## 🚀 How to Get Started
 Load the model using **HuggingFace Transformers**:
@@ -79,7 +91,7 @@ import torch
 from transformers import AutoTokenizer, AutoModelForCausalLM
 # 1. Configuration
-model_id = "YOUR_USERNAME/Bakti-8B-Base"  # Replace with your actual Hub ID
 # 2. Load Model
 # Use bfloat16 for A100/A10G, float16 for T4
@@ -106,41 +118,65 @@ with torch.no_grad():
 ---
-## ⚙️ Training Details
-### Training Procedure
-The model was continued‑pretrained with a **causal language modeling (CLM)** objective while preserving base reasoning capabilities.
-### Hardware & Environment
-* **GPU:** NVIDIA A100 80GB (Colab Pro+)
-* **Training Duration:** ~36 hours
-* **Frameworks:** PyTorch, Transformers, Accelerate
-### 🔧 Hyperparameters (Highlights)
-* Sequence Length: **4096**
-* Optimizer: **AdamW**
-* Scheduler: **Cosine Decay**
-* Precision: **bf16**
 ---
-## ⚠️ Limitations
-* **Base Model:** No SFT or RLHF; few‑shot prompting may be required.
-* **Web Data Bias:** May inherit biases from Indonesian web sources.
-* **Hallucinations:** Possible incorrect factual output.
 ---
-## ✅ Recommendations
-For production use, it is recommended to:
-* Perform **Supervised Fine‑Tuning (SFT)** for PRD/DTP domains
-* Add **high‑quality instruction datasets**
-* Apply **evaluation benchmarks** before deployment
----

+---
+base_model: aitfindonesia/Bakat-8B-Base
+library_name: peft
+pipeline_tag: text-generation
+language:
+  - id
+tags:
+  - base_model:Qwen/Qwen3-8B
+  - lora
+  - sft
+  - transformers
+  - trl
+  - lm-eval
+  - biawak
+  - indonesian
+license: apache-2.0
+datasets:
+  - internal-curated
 ---
+# Bakti-8B-Base
+## Model Details
+### Model Description
+**Bakti-8B-Base** is an Indonesian-language base model designed for **Continued Pre-Training (CPT)** on the domain of digital-space policy and oversight. It is derived from **Biawak-8B-Base** and built on the **Qwen3-8B** architecture, using **LoRA (Low-Rank Adaptation)** and **4-bit quantization** for memory and compute efficiency.
+* **Developed by**: Tim 1 AITF
+* **Model type**: Causal Language Model (LoRA Adapter)
+* **Base architecture**: Qwen3-8B
+* **Primary language**: Indonesian (id)
+* **License**: Apache-2.0
 ---
+## Training Data Composition
+| Category         | Components                                                                                                         | Token Count (M) | Percentage |
+| ---------------- | ------------------------------------------------------------------------------------------------------------------ | --------------- | ---------- |
+| **DTP**          | PON TIK occupations, job trends, competencies & human resources, DTP policy & regulation, digital talent technology | 94              | 43.9%      |
+| **PRD**          | Online gambling, hoaxes, child protection, educational content, PRD policy & regulation, societal violence         | 92              | 42.9%      |
+| **Wikipedia ID** | General knowledge & regional languages from across Indonesia                                                       | 28.2            | 13.2%      |
+| **Total**        | –                                                                                                                  | **214.2**       | **100%**   |
+---
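The percentages in the table above follow directly from the token counts. A minimal sketch, using only the figures listed in the table (the variable names are illustrative, not from the model card), recomputes each category's share:

```python
# Token counts in millions, copied from the table above.
token_counts_m = {"DTP": 94.0, "PRD": 92.0, "Wikipedia ID": 28.2}

# Total corpus size in millions of tokens.
total_m = sum(token_counts_m.values())

# Share of each category as a percentage of the total.
shares = {name: 100.0 * count / total_m for name, count in token_counts_m.items()}

for name, share in shares.items():
    print(f"{name}: {token_counts_m[name]:.1f}M tokens, {share:.2f}%")
print(f"Total: {total_m:.1f}M tokens")
```

Each recomputed share agrees with the table to within 0.1 percentage point.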
+## Intended Use
+### Direct Use (Recommended)
+This model is **intended for Continued Pre-Training**, specifically for:
+* Domain adaptation to public policy and digital regulation
+* Enriching Indonesia-specific knowledge
+* Pre-adaptation before Instruction Tuning or SFT
+### Out-of-Scope Use
+* **Long-context conversations** (not yet optimized)
+* **High-stakes decision making** (legal, medical, financial)
+* **Chat-oriented instruction following** without further fine-tuning
 ---
+## Bias, Risks, and Limitations
+* The dataset is dominated by the digital-space policy and oversight domain, so topical bias may appear on unrelated domains.
+* The model has not undergone preference alignment (RLHF/DPO).
+* Wikipedia content is used as a counterweight, but it does not guarantee full neutrality.
+Users are advised to run additional evaluation before production use.
+---
+## Recommendations
+* Use the **Qwen3 chat template** for the best generation results.
+* Perform **Instruction Fine-Tuning** or **Preference Tuning** before deploying to end users.
+* Verify model outputs for critical information.
 ---
+## How to Get Started
 Load the model using **HuggingFace Transformers**:
 from transformers import AutoTokenizer, AutoModelForCausalLM
 # 1. Configuration
+model_id = "aitfindonesia/Bakat-8B-Base"  # Replace with your actual Hub ID
 # 2. Load Model
 # Use bfloat16 for A100/A10G, float16 for T4
 ---
+## Training Details
+### Training Data
+* **Total size**: ~214M tokens
+* **Domains**: Digital Talent Policy (DTP), Pengawasan Ruang Digital (PRD), Wikipedia Indonesia
+* **Split**: Train (90%) / Validation (10%)
+### Training Procedure
+The model was trained using **Continued Pre-Training (CPT)** with LoRA on HuggingFace Transformers.
+#### Hyperparameters
+* **Precision**: bf16 (mixed precision)
+* **Quantization**: 4-bit (nf4)
+* **LoRA Rank (r)**: 8
+* **LoRA Alpha**: 16
+* **Target modules**: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
+* **Batch size**: 4 / device
+* **Gradient accumulation**: 16 (effective batch size = 32)
+* **Learning rate**: 2e-4 (linear schedule)
+* **Warmup ratio**: 0.03
+* **Epochs**: 1
+* **Optimizer**: adamw_8bit
 ---
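To make the hyperparameters above concrete, here is a rough sketch that derives three quantities from them: the LoRA scaling factor, the number of trainable adapter parameters, and the number of sequences per optimizer step on one device. The layer shapes are an assumption taken from the public Qwen3-8B configuration and are not stated in this model card. The scaling formula `alpha / r` is the standard LoRA convention. The model card does not specify the number of devices.

```python
# Hyperparameters listed in the model card.
lora_rank = 8
lora_alpha = 16
per_device_batch = 4
grad_accum_steps = 16

# ASSUMPTION: Qwen3-8B layer shapes; these are not given in the model card.
hidden_size = 4096
intermediate_size = 12288
num_layers = 36
q_out = 32 * 128   # attention heads * head_dim
kv_out = 8 * 128   # key/value heads * head_dim

# (in_features, out_features) for each targeted projection.
target_shapes = {
    "q_proj": (hidden_size, q_out),
    "k_proj": (hidden_size, kv_out),
    "v_proj": (hidden_size, kv_out),
    "o_proj": (q_out, hidden_size),
    "gate_proj": (hidden_size, intermediate_size),
    "up_proj": (hidden_size, intermediate_size),
    "down_proj": (intermediate_size, hidden_size),
}

# LoRA adds two matrices per adapted weight: A (r x in) and B (out x r).
params_per_layer = sum(lora_rank * (i + o) for i, o in target_shapes.values())
trainable_params = params_per_layer * num_layers

# Standard LoRA scaling applied to the low-rank update.
lora_scaling = lora_alpha / lora_rank

# Sequences that contribute to one optimizer step on a single device.
sequences_per_step_per_device = per_device_batch * grad_accum_steps

print(f"LoRA scaling (alpha / r): {lora_scaling}")
print(f"Trainable LoRA parameters: {trainable_params:,}")
print(f"Sequences per optimizer step per device: {sequences_per_step_per_device}")
```

Under these assumptions, a per-device batch of 4 with 16 accumulation steps gives 64 sequences per step on one device, which differs from the effective batch size of 32 stated in the list. The model card does not say which figure reflects the actual run.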
+## Evaluation
+### Results
+* **Final Training Loss**: ~1.2685
+* **Final Validation Loss**: ~1.264
+* **Training Perplexity**: ~3.56
+* **Validation Perplexity**: ~3.55
+### Benchmark (General)
+* **MMLU**: ~74.20
+* **IndoMMLU**: ~65.66
+* **XCOPA-ID**: ~75.80
 ---
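The perplexity figures above can be reproduced from the reported losses. A small sketch, assuming the losses are mean per-token cross-entropy in nats (the usual convention for causal language modeling in Transformers), in which case perplexity is `exp(loss)`:

```python
import math

# Final losses as reported in the Results list.
final_train_loss = 1.2685
final_val_loss = 1.264

# Perplexity is the exponential of the mean per-token cross-entropy (in nats).
train_ppl = math.exp(final_train_loss)
val_ppl = math.exp(final_val_loss)

print(f"Training perplexity:   {train_ppl:.2f}")
print(f"Validation perplexity: {val_ppl:.2f}")
```

Both values land within about 0.01 of the reported ~3.56 and ~3.55.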
+## Environmental Impact
+Carbon emission estimates follow the methodology of Lacoste et al. (2019).
+* **Hardware**: NVIDIA A100 80GB
+* **Training time**: ~36 hours
+* **Compute region**: Indonesia
+* **Infrastructure**: University / Private Server
+---
+## Framework Versions
+* Transformers: 4.x
+* PyTorch: 2.x
+* Datasets: 2.x
+* Tokenizers: 0.x