# Bakti-8B-Base

- **library_name:** transformers
- **base_model:** Qwen/Qwen3-8B
- **tags:** qwen, qwen3, causal-lm, continued-pretraining, indonesian, id, prd, dtp
- **license:** apache-2.0
- **language:** id, en

---

## 📌 Overview

**Bakti-8B-Base** is an 8-billion-parameter large language model (LLM) adapted for two strategic focus areas in Indonesia:

* **Perlindungan Ruang Digital (PRD)** – Digital Space Protection
* **Digital Talent Pool (DTP)** – Workforce and digital capability development

The model is produced by **continued pre-training (CPT)** of the **Qwen3-8B** model (`Qwen/Qwen3-8B`) on a curated Indonesian dataset.

---

## 🧠 Model Details

### Model Description

* **Developed by:** *AITF Indonesia*
* **Model Type:** Causal Language Model (Base)
* **Base Model:** Qwen/Qwen3-8B
* **Language:** Indonesian (Primary), English (Secondary)
* **License:** Apache 2.0
* **Training Method:** Continued Pre‑training (CPT)

### 🎯 Goal

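To create a sovereign, domain-specialized Indonesian foundation model with a strong understanding of:

* Digital policy and regulation, such as the Personal Data Protection Law (UU PDP) and the Electronic Information and Transactions Law (UU ITE)
* The digital workforce and skills landscape (DTP)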

---

## 📚 Dataset Composition

Total Dataset Size: **~214.2 Million Tokens**

| Category         | Description                                                 | Token Count (M) | Percentage |
| ---------------- | ----------------------------------------------------------- | --------------- | ---------- |
| **DTP**          | Digital HR, tech syllabi, certifications, job trends        | 94.0            | ~43.9%     |
| **PRD**          | Cybersecurity, PDP Law, content moderation, hoax prevention | 92.0            | ~42.9%     |
| **Wikipedia ID** | General knowledge anchor & grammar stability                | 28.2            | ~13.2%     |
| **Total**        | —                                                           | **214.2**       | **100%**   |
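The shares in the table follow directly from the token counts. The short sketch below recomputes them; the dictionary name `token_counts_m` is illustrative, and the numbers are copied from the table above.

```python
# Token counts in millions, copied from the dataset table
token_counts_m = {"DTP": 94.0, "PRD": 92.0, "Wikipedia ID": 28.2}

# Total corpus size in millions of tokens
total_m = sum(token_counts_m.values())

# Share of each category as a percentage of the total
shares = {
    name: 100.0 * count / total_m
    for name, count in token_counts_m.items()
}

for name, share in shares.items():
    print(f"{name}: {token_counts_m[name]:.1f}M tokens ({share:.1f}%)")
print(f"Total: {total_m:.1f}M tokens")
```

The same shares can serve as sampling weights if the three sources are mixed at training time, although the card does not state how the sources were interleaved.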

---

## 🧩 Intended Use

As a **base model**, Bakti-8B-Base produces **text completions** rather than following chat-style instructions; it can be fine-tuned into chat/instruct variants.

### 1. PRD (Perlindungan Ruang Digital)

* Policy sentiment analysis
* Misinformation pattern detection
* Understanding legal terminology (UU ITE, UU PDP)

### 2. DTP (Digital Talent Pool)

* Skill gap analysis
* Curriculum drafting assistance
* Job description & talent understanding

---

## 🚀 How to Get Started

Load the model using **Hugging Face Transformers**:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# 1. Configuration
model_id = "YOUR_USERNAME/Bakti-8B-Base"  # Replace with your actual Hub ID

# 2. Load Model
# Use bfloat16 for A100/A10G, float16 for T4
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# 3. Inference Example (Completion)
input_text = "Strategi utama untuk mengurangi gap talenta digital di Indonesia adalah"
# Move inputs to the model's device instead of hard-coding "cuda"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# Sample a continuation without tracking gradients
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        temperature=0.7
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
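The dtype comment in the snippet above can be turned into a small helper. This is a sketch, assuming that bf16 is natively supported from CUDA compute capability 8.0 onward (A100 is 8.0, A10G is 8.6) and that older GPUs such as the T4 (7.5) should fall back to float16. `pick_dtype_name` is a hypothetical helper, not part of Transformers.

```python
def pick_dtype_name(compute_capability):
    """Return the name of a torch dtype suited to the given GPU.

    compute_capability is a (major, minor) tuple, e.g. (8, 0) for an A100.
    Assumption: GPUs with major version >= 8 handle bfloat16 natively.
    """
    major, _minor = compute_capability
    return "bfloat16" if major >= 8 else "float16"


# Compute capabilities of the GPUs named in the loading snippet
print(pick_dtype_name((8, 0)))  # A100
print(pick_dtype_name((8, 6)))  # A10G
print(pick_dtype_name((7, 5)))  # T4
```

With PyTorch installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple for the current GPU, and `getattr(torch, pick_dtype_name(...))` yields the value to pass as `torch_dtype`.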

---

## ⚙️ Training Details

### Training Procedure

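Training continued from the base checkpoint with a **causal language modeling (CLM)** objective, i.e. next-token prediction. The Wikipedia ID portion of the dataset acts as a general-knowledge anchor, with the aim of preserving the base model's reasoning capabilities.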
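For readers unfamiliar with the CLM objective, the following is a minimal pure-Python sketch of next-token cross-entropy for a single sequence. It is illustrative only: the function name `clm_loss` and the toy inputs are assumptions, and the actual training code (not shown in this card) would compute the same quantity on GPU tensors.

```python
import math


def clm_loss(logits, token_ids):
    """Mean next-token cross-entropy for one sequence.

    logits[t] holds unnormalized scores over the vocabulary after the model
    has read token_ids[: t + 1]; it is scored against token_ids[t + 1].
    """
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form a prediction target")

    total = 0.0
    num_targets = len(token_ids) - 1
    for t in range(num_targets):
        scores = logits[t]
        target = token_ids[t + 1]
        # log-sum-exp with max subtraction for numerical stability
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # negative log-probability of the true next token
        total += log_z - scores[target]
    return total / num_targets


# Toy check: uniform scores over a 4-token vocabulary give a loss of ln(4)
toy_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
toy_tokens = [2, 0, 3]
loss = clm_loss(toy_logits, toy_tokens)
print(loss, math.log(4))
```

The key detail is the one-position shift: the scores at position `t` are compared with the token at position `t + 1`, so a sequence of `n` tokens yields `n - 1` training targets.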

### Hardware & Environment

* **GPU:** NVIDIA A100 80GB (Colab Pro+)
* **Training Duration:** ~36 hours
* **Frameworks:** PyTorch, Transformers, Accelerate

### 🔧 Hyperparameters (Highlights)

* Sequence Length: **4096**
* Optimizer: **AdamW**
* Scheduler: **Cosine Decay**
* Precision: **bf16**
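As an illustration of the cosine decay scheduler listed above, the sketch below computes the learning rate at a given step. The peak learning rate, minimum learning rate, warmup length, and step count are not stated in this card, so the values used here are hypothetical placeholders.

```python
import math


def cosine_decay_lr(step, total_steps, peak_lr, min_lr=0.0, warmup_steps=0):
    """Learning rate at `step` under linear warmup followed by cosine decay."""
    if step < warmup_steps:
        # Linear ramp from peak_lr / warmup_steps up to peak_lr
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1]
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # Half-cosine from peak_lr (progress = 0) down to min_lr (progress = 1)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical values, chosen only to show the shape of the schedule
PEAK_LR = 1e-4
TOTAL_STEPS = 1000
for step in (0, 250, 500, 750, 1000):
    print(step, cosine_decay_lr(step, TOTAL_STEPS, PEAK_LR))
```

The schedule starts at the peak rate, passes through half of it at the midpoint, and reaches the minimum at the final step.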

---

## ⚠️ Limitations

* **Base model:** Not trained with SFT or RLHF, so it does not reliably follow instructions; few-shot prompting may be required.
* **Web data bias:** May inherit biases from Indonesian web sources.
* **Hallucinations:** May generate plausible but factually incorrect text.
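Because the model only completes text, few-shot prompting usually means laying out a few solved examples and leaving the last answer open. The helper below is a minimal sketch of that pattern; `build_few_shot_prompt` and the `Pertanyaan:`/`Jawaban:` labels are illustrative choices, not a format the model was trained on.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Format (question, answer) pairs plus an open query as one prompt string."""
    parts = []
    if instruction:
        parts.append(instruction.strip())
    for question, answer in examples:
        parts.append(f"Pertanyaan: {question}\nJawaban: {answer}")
    # Leave the final answer empty so the model continues from here
    parts.append(f"Pertanyaan: {query}\nJawaban:")
    return "\n\n".join(parts)


# Illustrative examples using terms that appear in this card
examples = [
    ("Apa kepanjangan dari UU PDP?", "Undang-Undang Pelindungan Data Pribadi."),
    ("Apa kepanjangan dari UU ITE?", "Undang-Undang Informasi dan Transaksi Elektronik."),
]
prompt = build_few_shot_prompt(examples, "Apa kepanjangan dari DTP?")
print(prompt)
```

The resulting string can be passed to the tokenizer in place of `input_text` in the loading example. Keeping `max_new_tokens` small helps, since a base model may otherwise continue by inventing further question/answer pairs.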

---

## ✅ Recommendations

Before using the model in production:

* Perform **supervised fine-tuning (SFT)** on PRD/DTP domain data
* Train on **high-quality instruction datasets**
* Run **evaluation benchmarks** and review the results before deployment

---