# Bakti-8B-Base

- **library_name:** transformers
- **base_model:** Qwen/Qwen3-8B
- **tags:** qwen, qwen3, causal-lm, continued-pretraining, indonesian, id, prd, dtp
- **license:** apache-2.0
- **language:** id, en

---

## 📌 Overview

**Bakti-8B-Base** is an 8-billion-parameter Large Language Model (LLM) adapted for Indonesia's strategic focus areas:

* **Perlindungan Ruang Digital (PRD)** – Digital Space Protection
* **Digital Talent Pool (DTP)** – Workforce and digital capability development

The model is built by **Continued Pre-training (CPT)** of the **Qwen3-8B** base model on a curated Indonesian dataset.

---

## 🧠 Model Details

### Model Description

* **Developed by:** *AITF Indonesia*
* **Model Type:** Causal Language Model (Base)
* **Base Model:** Qwen/Qwen3-8B
* **Language:** Indonesian (Primary), English (Secondary)
* **License:** Apache 2.0
* **Training Method:** Continued Pre-training (CPT)

### 🎯 Goal

To create a sovereign, domain-specialized Indonesian foundation model with a strong understanding of:

* Digital policies (UU PDP, UU ITE)
* Digital workforce & skill landscape (DTP)

---

## 📚 Dataset Composition

Total dataset size: **~214.2 million tokens**

| Category         | Description                                                 | Tokens (millions) | Share of total |
| ---------------- | ----------------------------------------------------------- | ----------------- | -------------- |
| **DTP**          | Digital HR, tech syllabi, certifications, job trends        | 94.0              | ~43.9%         |
| **PRD**          | Cybersecurity, PDP Law, content moderation, hoax prevention | 92.0              | ~42.9%         |
| **Wikipedia ID** | General knowledge anchor & grammar stability                | 28.2              | ~13.2%         |
| **Total**        | -                                                           | **214.2**         | **100%**       |

---

## 🧩 Intended Use

As a **Base Model**, Bakti-8B-Base produces **text completions** rather than chat responses. It can be adapted into chat/instruct variants through further fine-tuning.

### 1. PRD (Perlindungan Ruang Digital)

* Policy sentiment analysis
* Misinformation pattern detection
* Understanding legal terminology (UU ITE, UU PDP)

### 2. DTP (Digital Talent Pool)

* Skill gap analysis
* Curriculum drafting assistance
* Job description & talent understanding

---

## 🚀 How to Get Started

Load the model using **Hugging Face Transformers**:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# 1. Configuration
model_id = "YOUR_USERNAME/Bakti-8B-Base"  # Replace with your actual Hub ID

# 2. Load Model
# Use bfloat16 for A100/A10G, float16 for T4
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# 3. Inference Example (Completion)
input_text = "Strategi utama untuk mengurangi gap talenta digital di Indonesia adalah"
# Send inputs to the device chosen by device_map instead of hard-coding "cuda"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        temperature=0.7
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## ⚙️ Training Details

### Training Procedure

Pre-training was continued from the base checkpoint with a **causal language modeling (CLM)** objective, with the aim of preserving the base model's reasoning capabilities.

### Hardware & Environment

* **GPU:** NVIDIA A100 80GB (Colab Pro+)
* **Training Duration:** ~36 hours
* **Frameworks:** PyTorch, Transformers, Accelerate

### 🔧 Hyperparameters (Highlights)

* Sequence Length: **4096**
* Optimizer: **AdamW**
* Scheduler: **Cosine Decay**
* Precision: **bf16**

---

## ⚠️ Limitations

* **Base Model:** No SFT or RLHF has been applied, so the model does not follow instructions reliably; few-shot prompting may be required.
* **Web Data Bias:** May inherit biases from Indonesian web sources.
* **Hallucinations:** May produce factually incorrect output.

---

## ✅ Recommendations

For production use, we recommend the following:

* Perform **Supervised Fine-Tuning (SFT)** for PRD/DTP domains
* Add **high-quality instruction datasets**
* Run **evaluation benchmarks** before deployment

---
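
Because the model has no instruction tuning, the few-shot prompting mentioned under Limitations is often the simplest way to steer a completion. The sketch below shows one possible way to lay out such a prompt. The helper `build_few_shot_prompt`, the `Pertanyaan`/`Jawaban` (question/answer) labels, and the example pairs are illustrative assumptions, not a format the model was trained on.

```python
def build_few_shot_prompt(examples, query, q_label="Pertanyaan", a_label="Jawaban"):
    """Join (question, answer) pairs and a final open question into one prompt.

    Hypothetical helper for illustration; the labels are an assumed format.
    """
    blocks = [f"{q_label}: {q}\n{a_label}: {a}" for q, a in examples]
    # Leave the last answer slot empty so the base model continues from it.
    blocks.append(f"{q_label}: {query}\n{a_label}:")
    return "\n\n".join(blocks)


# Illustrative demonstrations: expanding abbreviations used in this model card.
examples = [
    ("Apa kepanjangan dari UU PDP?", "Undang-Undang Pelindungan Data Pribadi."),
    ("Apa kepanjangan dari UU ITE?", "Undang-Undang Informasi dan Transaksi Elektronik."),
]

prompt = build_few_shot_prompt(examples, "Apa kepanjangan dari DTP?")
print(prompt)
```

The resulting string can be passed as `input_text` in the loading snippet above. Since a base model tends to keep generating further question/answer pairs, it may help to keep `max_new_tokens` small or to truncate the output at the first blank line.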