# Bakti-8B-Base
- **library_name:** transformers
- **base_model:** Qwen/Qwen3-8B
- **tags:** qwen, qwen3, causal-lm, continued-pretraining, indonesian, id, prd, dtp
- **license:** apache-2.0
- **language:** id, en
---
## 📌 Overview
**Bakti-8B-Base** is an 8-billion-parameter Large Language Model (LLM) adapted specifically for Indonesia's strategic focus areas:
* **Perlindungan Ruang Digital (PRD)** – Digital Space Protection
* **Digital Talent Pool (DTP)** – Workforce and digital capability development
The model is built by **Continued Pre‑training (CPT)** of the **Qwen3‑8B** base model (`Qwen/Qwen3-8B`) on a curated Indonesian dataset.
---
## 🧠 Model Details
### Model Description
* **Developed by:** *AITF Indonesia*
* **Model Type:** Causal Language Model (Base)
* **Base Model:** Qwen/Qwen3-8B
* **Language:** Indonesian (Primary), English (Secondary)
* **License:** Apache 2.0
* **Training Method:** Continued Pre‑training (CPT)
### 🎯 Goal
To create a sovereign, domain‑specialized Indonesian foundation model with strong understanding of:
* Digital policies (UU PDP, UU ITE)
* Digital workforce & skill landscape (DTP)
---
## 📚 Dataset Composition
Total Dataset Size: **~214.2 Million Tokens**
| Category | Description | Token Count (M) | Percentage |
| ---------------- | ----------------------------------------------------------- | --------------- | ---------- |
| **DTP** | Digital HR, tech syllabi, certifications, job trends | 94.0 | ~43.9% |
| **PRD** | Cybersecurity, PDP Law, content moderation, hoax prevention | 92.0 | ~42.9% |
| **Wikipedia ID** | General knowledge anchor & grammar stability | 28.2 | ~13.2% |
| **Total** | — | **214.2** | **100%** |
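
The percentages in the table follow directly from the token counts. A minimal sketch that recomputes them, using only the figures listed above (the variable names are illustrative):

```python
# Token counts in millions, copied from the dataset table.
token_counts_m = {"DTP": 94.0, "PRD": 92.0, "Wikipedia ID": 28.2}

# Total corpus size and each category's share of it, in percent.
total_m = sum(token_counts_m.values())
shares = {name: 100 * count / total_m for name, count in token_counts_m.items()}

print(f"Total: {total_m:.1f}M tokens")
for name, share in shares.items():
    print(f"{name}: {share:.2f}%")
```

The two domain corpora (DTP and PRD) together make up roughly 87% of the mix, with Wikipedia ID as the remaining general-knowledge share.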
---
## 🧩 Intended Use
As a **Base Model**, Bakti‑8B‑Base produces **text completions** rather than chat-style responses. It can be adapted into chat/instruct variants through further fine-tuning.
### 1. PRD (Perlindungan Ruang Digital)
* Policy sentiment analysis
* Misinformation pattern detection
* Understanding legal terminology (UU ITE, UU PDP)
### 2. DTP (Digital Talent Pool)
* Skill gap analysis
* Curriculum drafting assistance
* Job description & talent understanding
---
## 🚀 How to Get Started
Load the model using **HuggingFace Transformers**:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# 1. Configuration
model_id = "YOUR_USERNAME/Bakti-8B-Base"  # Replace with your actual Hub ID

# 2. Load the tokenizer and model
# Use bfloat16 on A100/A10G; switch to torch.float16 on T4
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# 3. Inference example (completion)
input_text = "Strategi utama untuk mengurangi gap talenta digital di Indonesia adalah"
# Place inputs on the model's device instead of hard-coding "cuda"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        temperature=0.7,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
---
## ⚙️ Training Details
### Training Procedure
Continued pre‑training used a **causal language modeling (CLM)** objective (next-token prediction), with the aim of preserving the base model's reasoning capabilities.
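
To illustrate what a CLM objective operates on, the sketch below packs tokenized documents into fixed-length blocks and derives next-token targets. This is a common preprocessing pattern for continued pre‑training, not a description of the pipeline used for this model: the packing strategy, the `pack_into_blocks` and `shift_for_clm` helpers, and the EOS separator are all assumptions.

```python
from typing import Iterable, Iterator, List, Tuple


def pack_into_blocks(
    token_streams: Iterable[List[int]], block_size: int, eos_id: int
) -> Iterator[List[int]]:
    """Concatenate tokenized documents (separated by EOS) and cut fixed-size blocks."""
    buffer: List[int] = []
    for tokens in token_streams:
        buffer.extend(tokens)
        buffer.append(eos_id)  # mark the document boundary
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]
    # Any leftover tail shorter than block_size is dropped in this sketch.


def shift_for_clm(block: List[int]) -> Tuple[List[int], List[int]]:
    """Return (inputs, targets) where each target is the token after its input."""
    return block[:-1], block[1:]


# Toy token IDs with block_size=4 for readability; training would use 4096.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = list(pack_into_blocks(docs, block_size=4, eos_id=0))
inputs, targets = shift_for_clm(blocks[0])
print(blocks)
print(inputs, targets)
```

Note that `AutoModelForCausalLM` applies this one-position shift internally when `labels` are passed, so the explicit shift here is only for illustration.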
### Hardware & Environment
* **GPU:** NVIDIA A100 80GB (Colab Pro+)
* **Training Duration:** ~36 hours
* **Frameworks:** PyTorch, Transformers, Accelerate
### 🔧 Hyperparameters (Highlights)
* Sequence Length: **4096**
* Optimizer: **AdamW**
* Scheduler: **Cosine Decay**
* Precision: **bf16**
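
For reference, a cosine decay schedule lowers the learning rate from its peak to a minimum along half a cosine wave. The sketch below is a generic formulation. The peak learning rate, step count, and the absence of warmup are hypothetical, since this card does not state them.

```python
import math


def cosine_decay_lr(step: int, total_steps: int, peak_lr: float, min_lr: float = 0.0) -> float:
    """Learning rate at `step` under cosine decay from peak_lr to min_lr."""
    # Clamp training progress to [0, 1] so steps past the end stay at min_lr.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical values for illustration only.
peak_lr = 1e-4
total_steps = 1000
for step in (0, 250, 500, 750, 1000):
    print(step, f"{cosine_decay_lr(step, total_steps, peak_lr):.2e}")
```

The rate starts at `peak_lr`, passes through half of it at the midpoint, and reaches `min_lr` at the final step.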
---
## ⚠️ Limitations
* **Base Model:** No SFT or RLHF; few‑shot prompting may be required.
* **Web Data Bias:** May inherit biases from Indonesian web sources.
* **Hallucinations:** Possible incorrect factual output.
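
Because the model is not instruction-tuned, tasks are usually framed as a text pattern for it to continue. A minimal sketch of a few-shot prompt builder follows. The `build_few_shot_prompt` helper, the `Teks`/`Kategori` labels, and the example texts are hypothetical and not part of this model's training format.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    examples: List[Tuple[str, str]],
    query: str,
    input_label: str = "Teks",
    output_label: str = "Kategori",
) -> str:
    """Format labeled examples plus an open-ended query for a base model to complete."""
    parts = [f"{input_label}: {text}\n{output_label}: {label}" for text, label in examples]
    # The final entry leaves the output slot empty so the model fills it in.
    parts.append(f"{input_label}: {query}\n{output_label}:")
    return "\n\n".join(parts)


# Hypothetical examples for a misinformation-style classification prompt.
examples = [
    ("Pemerintah membagikan kuota internet gratis lewat tautan tidak resmi.", "hoaks"),
    ("Kominfo membuka pendaftaran pelatihan digital melalui situs resminya.", "valid"),
]
prompt = build_few_shot_prompt(examples, "Bank meminta PIN nasabah melalui pesan singkat.")
print(prompt)
```

The resulting string can be passed as `input_text` to the generation example in the getting-started section, ideally with a small `max_new_tokens` so the model stops after the label.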
---
## ✅ Recommendations
For production use, it is recommended to:
* Perform **Supervised Fine‑Tuning (SFT)** for PRD/DTP domains
* Add **high‑quality instruction datasets**
* Apply **evaluation benchmarks** before deployment
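
One simple intrinsic check that can complement task benchmarks is perplexity on held-out domain text. The sketch below computes it from per-token log-probabilities. The `perplexity` helper is illustrative, and this card does not report any perplexity figures.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Toy input: every token was assigned probability 0.5 by the model.
toy_logprobs = [math.log(0.5)] * 4
print(perplexity(toy_logprobs))
```

Lower values on held-out PRD/DTP text, compared with the base model on the same text, would suggest the domain adaptation took effect. Task-level benchmarks are still needed to judge downstream quality.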
---