aitfindonesia committed on
Commit
aeb6289
·
verified ·
1 Parent(s): bf0c7a2

Update README.md content

Files changed (1)
  1. README.md +104 -68
README.md CHANGED
@@ -1,76 +1,88 @@
1
- # Bakti-8B-Base
2
-
3
- - **library_name:** transformers
4
- - **base_model:** Qwen/Qwen3-8B
5
- - **tags:** qwen, qwen3, causal-lm, continued-pretraining, indonesian, id, prd, dtp
6
- - **license:** apache-2.0
7
- - **language:** id, en
8
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
9
  ---
10
 
11
- ## 📌 Overview
 
 
12
 
13
- **Bakti-8B-Base** is an 8-billion-parameter Large Language Model (LLM) adapted specifically for Indonesia's strategic focus areas:
14
 
15
- * **Perlindungan Ruang Digital (PRD)** – Digital Space Protection
16
- * **Digital Talent Pool (DTP)** – Workforce and digital capability development
17
 
18
- This model is built through **Continued Pre‑training (CPT)** on the **Qwen‑3‑8B** base model using a curated Indonesian dataset.
 
 
 
 
19
 
20
  ---
21
 
22
- ## 🧠 Model Details
23
-
24
- ### Model Description
25
 
26
- * **Developed by:** *AITF Indonesia*
27
- * **Model Type:** Causal Language Model (Base)
28
- * **Base Model:** Qwen/Qwen3-8B
29
- * **Language:** Indonesian (Primary), English (Secondary)
30
- * **License:** Apache 2.0
31
- * **Training Method:** Continued Pre‑training (CPT)
32
 
33
- ### 🎯 Goal
34
 
35
- To create a sovereign, domain‑specialized Indonesian foundation model with strong understanding of:
36
 
37
- * Digital policies (UU PDP, UU ITE)
38
- * Digital workforce & skill landscape (DTP)
39
 
40
- ---
41
 
42
- ## 📚 Dataset Composition
 
 
43
 
44
- Total Dataset Size: **~214.2 Million Tokens**
45
 
46
- | Category | Description | Token Count (M) | Percentage |
47
- | ---------------- | ----------------------------------------------------------- | --------------- | ---------- |
48
- | **DTP** | Digital HR, tech syllabi, certifications, job trends | 94.0 | ~43.9% |
49
- | **PRD** | Cybersecurity, PDP Law, content moderation, hoax prevention | 92.0 | ~42.9% |
50
- | **Wikipedia ID** | General knowledge anchor & grammar stability | 28.2 | ~13.2% |
51
- | **Total** | — | **214.2** | **100%** |
52
 
53
  ---
54
 
55
- ## 🧩 Intended Use
56
 
57
- As a **Base Model**, Bakti‑8B outputs **text completions** and can be adapted into chat/instruct variants.
 
 
58
 
59
- ### 1. PRD (Perlindungan Ruang Digital)
60
 
61
- * Policy sentiment analysis
62
- * Misinformation pattern detection
63
- * Understanding legal terminology (UU ITE, UU PDP)
64
 
65
- ### 2. DTP (Digital Talent Pool)
66
 
67
- * Skill gap analysis
68
- * Curriculum drafting assistance
69
- * Job description & talent understanding
70
 
71
  ---
72
 
73
- ## 🚀 How to Get Started
74
 
75
  Load the model using **HuggingFace Transformers**:
76
 
@@ -79,7 +91,7 @@ import torch
79
  from transformers import AutoTokenizer, AutoModelForCausalLM
80
 
81
  # 1. Configuration
82
- model_id = "YOUR_USERNAME/Bakti-8B-Base" # Replace with your actual Hub ID
83
 
84
  # 2. Load Model
85
  # Use bfloat16 for A100/A10G, float16 for T4
@@ -106,41 +118,65 @@ with torch.no_grad():
106
 
107
  ---
108
 
109
- ## ⚙️ Training Details
110
 
111
- ### Training Procedure
112
 
113
- The model was continued‑pretrained with a **causal language modeling (CLM)** objective while preserving base reasoning capabilities.
 
 
114
 
115
- ### Hardware & Environment
116
 
117
- * **GPU:** NVIDIA A100 80GB (Colab Pro+)
118
- * **Training Duration:** ~36 hours
119
- * **Frameworks:** PyTorch, Transformers, Accelerate
120
 
121
- ### 🔧 Hyperparameters (Highlights)
122
 
123
- * Sequence Length: **4096**
124
- * Optimizer: **AdamW**
125
- * Scheduler: **Cosine Decay**
126
- * Precision: **bf16**
 
 
 
 
 
 
 
127
 
128
  ---
129
 
130
- ## ⚠️ Limitations
 
 
 
 
 
 
 
131
 
132
- * **Base Model:** No SFT or RLHF; few‑shot prompting may be required.
133
- * **Web Data Bias:** May inherit biases from Indonesian web sources.
134
- * **Hallucinations:** Possible incorrect factual output.
 
 
135
 
136
  ---
137
 
138
- ## Recommendations
 
 
139
 
140
- For production use, it is recommended to:
 
 
 
 
 
141
 
142
- * Perform **Supervised Fine‑Tuning (SFT)** for PRD/DTP domains
143
- * Add **high‑quality instruction datasets**
144
- * Apply **evaluation benchmarks** before deployment
145
 
146
- ---
 
 
 
 
1
+ ---
 
 
 
 
 
 
2
 
3
+ base_model: aitfindonesia/Bakat-8B-Base
4
+ library_name: peft
5
+ pipeline_tag: text-generation
6
+ language:
7
+ - id
8
+ tags:
9
+ - base_model:Qwen/Qwen3-8B
10
+ - lora
11
+ - sft
12
+ - transformers
13
+ - trl
14
+ - lm-eval
15
+ - biawak
16
+ - indonesian
17
+ license: apache-2.0
18
+ datasets:
19
+ - internal-curated
20
  ---
21
 
22
+ # Bakti-8B-Base
23
+
24
+ ## Model Details
25
 
26
+ ### Model Description
27
 
28
+ **Bakti-8B-Base** is an Indonesian-language base model designed for **Continued Pre-Training (CPT)** on the domain of digital-space policy and oversight. The model is derived from **Biawak-8B-Base** and built on the **Qwen3-8B** architecture, using **LoRA (Low-Rank Adaptation)** and **4-bit quantization** for memory and compute efficiency.
 
29
 
30
+ * **Developed by**: Tim 1 AITF
31
+ * **Model type**: Causal Language Model (LoRA Adapter)
32
+ * **Base architecture**: Qwen3-8B
33
+ * **Primary language**: Indonesian (id)
34
+ * **License**: Apache-2.0
35
 
36
  ---
37
 
38
+ ## Training Data Composition
 
 
39
 
40
+ | Category | Elements | Token Count (M) | Percentage |
41
+ | ---------------- | ----------------------------------------------------------------------------------------------------- | --------------- | ---------- |
42
+ | **DTP** | PON TIK occupations, job trends, competencies & human resources, DTP policy & regulation, digital talent technology | 94 | 43.9% |
43
+ | **PRD** | Online gambling, hoaxes, child protection, educational content, PRD policy & regulation, community violence | 92 | 42.9% |
44
+ | **Wikipedia ID** | General knowledge & regional languages from across Indonesia | 28.2 | 13.2% |
45
+ | **Total** | – | **214.2** | **100%** |
46
 
47
+ ---
48
 
49
+ ## Intended Use
50
 
51
+ ### Direct Use (Recommended)
 
52
 
53
+ This model is **intended for Continued Pre-Training**, in particular for:
54
 
55
+ * Domain adaptation for public policy and digital regulation
56
+ * Enrichment with Indonesia-specific knowledge
57
+ * Pre-adaptation before Instruction Tuning or SFT
58
 
59
+ ### Out-of-Scope Use
60
 
61
+ * **Long-context conversations** (not yet optimized)
62
+ * **High-stakes decision making** (legal, medical, financial)
63
+ * **Chat-oriented instruction following** without further fine-tuning
 
 
 
64
 
65
  ---
66
 
67
+ ## Bias, Risks, and Limitations
68
 
69
+ * The dataset is dominated by the digital-space policy and oversight domain, so topical bias may appear in unrelated domains.
70
+ * The model has not undergone preference alignment (RLHF/DPO).
71
+ * Wikipedia content is used as a counterbalance, but it does not guarantee full neutrality.
72
 
73
+ Users are advised to run additional evaluation before production use.
74
 
75
+ ---
 
 
76
 
77
+ ## Recommendations
78
 
79
+ * Use the **Qwen3 chat template** for the best generation results.
80
+ * Perform **Instruction Fine-Tuning** or **Preference Tuning** before deploying to end users.
81
+ * Verify model outputs for critical information.
82
 
83
  ---
84
 
85
+ ## How to Get Started
86
 
87
  Load the model using **HuggingFace Transformers**:
88
 
 
91
  from transformers import AutoTokenizer, AutoModelForCausalLM
92
 
93
  # 1. Configuration
94
+ model_id = "aitfindonesia/Bakat-8B-Base" # Replace with your actual Hub ID
95
 
96
  # 2. Load Model
97
  # Use bfloat16 for A100/A10G, float16 for T4
 
118
 
119
  ---
120
 
121
+ ## Training Details
122
 
123
+ ### Training Data
124
 
125
+ * **Total size**: ~214M tokens
126
+ * **Domains**: Digital Talent Policy (DTP), Pengawasan Ruang Digital (PRD), Wikipedia Indonesia
127
+ * **Split**: Train (90%) / Validation (10%)
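
The card states a 90% / 10% train/validation split but does not say how it was produced. A minimal sketch, assuming a seeded document-level shuffle (the function name, the seed, and the placeholder documents are hypothetical, not taken from the training code):

```python
import random


def split_train_validation(documents, validation_fraction=0.10, seed=42):
    """Shuffle documents reproducibly and hold out a fraction for validation."""
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)  # local RNG, so global state is untouched
    n_validation = int(round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


# Hypothetical corpus of 1,000 placeholder documents.
docs = [f"doc-{i}" for i in range(1000)]
train_docs, validation_docs = split_train_validation(docs)
print(len(train_docs), len(validation_docs))
```

Splitting at the document level rather than the token level keeps any single document from appearing on both sides of the split.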
128
 
129
+ ### Training Procedure
130
 
131
+ The model was trained using **Continued Pre-Training (CPT)** with LoRA on Hugging Face Transformers.
 
 
132
 
133
+ #### Hyperparameters
134
 
135
+ * **Precision**: bf16 (mixed precision)
136
+ * **Quantization**: 4-bit (nf4)
137
+ * **LoRA Rank (r)**: 8
138
+ * **LoRA Alpha**: 16
139
+ * **Target modules**: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
140
+ * **Batch size**: 4 / device
141
+ * **Gradient accumulation**: 16 (effective batch size = 32)
142
+ * **Learning rate**: 2e-4 (linear schedule)
143
+ * **Warmup ratio**: 0.03
144
+ * **Epochs**: 1
145
+ * **Optimizer**: adamw_8bit
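
To give a feel for what rank 8 and alpha 16 imply, the sketch below estimates the adapter size and the LoRA scaling factor. The rank, alpha, and target modules come from the list above; the projection shapes and the layer count are assumptions about the Qwen3-8B architecture and should be checked against the model's `config.json`.

```python
# Values quoted from the hyperparameter list above.
LORA_RANK = 8
LORA_ALPHA = 16
TARGET_MODULES = [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
]

# ASSUMPTION: (in_features, out_features) per projection in one decoder layer
# of Qwen3-8B, and the number of decoder layers. Verify against config.json.
ASSUMED_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 12288),
    "up_proj": (4096, 12288),
    "down_proj": (12288, 4096),
}
ASSUMED_NUM_LAYERS = 36


def lora_param_count(in_features, out_features, rank):
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(
    lora_param_count(*ASSUMED_SHAPES[name], LORA_RANK) for name in TARGET_MODULES
)
total_trainable = per_layer * ASSUMED_NUM_LAYERS

# In the standard LoRA formulation the low-rank update is scaled by alpha / r.
scaling = LORA_ALPHA / LORA_RANK

print(f"trainable LoRA parameters: {total_trainable:,}")
print(f"LoRA scaling factor: {scaling}")
```

Under these assumptions the adapter trains only a small fraction of the roughly 8 billion base parameters, which is what makes training on a single GPU with a 4-bit base model practical.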
146
 
147
  ---
148
 
149
+ ## Evaluation
150
+
151
+ ### Results
152
+
153
+ * **Final Training Loss**: ~1.2685
154
+ * **Final Validation Loss**: ~1.264
155
+ * **Training Perplexity**: ~3.56
156
+ * **Validation Perplexity**: ~3.55
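
The perplexity figures follow from the losses if the reported loss is the mean per-token cross-entropy in nats (an assumption; the card does not state the unit). A small sketch of that relation:

```python
import math


def perplexity(mean_cross_entropy_loss):
    """Perplexity is the exponential of the mean per-token cross-entropy (nats)."""
    return math.exp(mean_cross_entropy_loss)


train_ppl = perplexity(1.2685)  # final training loss from the list above
val_ppl = perplexity(1.264)     # final validation loss from the list above

print(f"training perplexity   ~ {train_ppl:.2f}")
print(f"validation perplexity ~ {val_ppl:.2f}")
```

Because the card reports rounded, approximate values, the recomputed numbers can differ from the listed ones in the last digit.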
157
 
158
+ ### Benchmark (General)
159
+
160
+ * **MMLU**: ~74.20
161
+ * **IndoMMLU**: ~65.66
162
+ * **XCOPA-ID**: ~75.80
163
 
164
  ---
165
 
166
+ ## Environmental Impact
167
+
168
+ Carbon emission estimates follow the methodology of Lacoste et al. (2019).
169
 
170
+ * **Hardware**: NVIDIA A100 80GB
171
+ * **Training time**: ~36 hours
172
+ * **Compute region**: Indonesia
173
+ * **Infrastructure**: University / Private Server
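
The card gives no emission figure. Estimates in the style of Lacoste et al. (2019) multiply energy use by the carbon intensity of the local grid; the sketch below shows that arithmetic. Only the 36-hour training time comes from the card. The GPU power draw and the carbon intensity are hypothetical placeholders, not measured values, and the function name is illustrative.

```python
def estimate_emissions_kg(power_kw, hours, carbon_intensity_kg_per_kwh, pue=1.0):
    """Estimate CO2-equivalent emissions as energy (kWh) times grid intensity.

    pue is the data-centre power usage effectiveness; 1.0 means no overhead.
    """
    energy_kwh = power_kw * hours * pue
    return energy_kwh * carbon_intensity_kg_per_kwh


TRAINING_HOURS = 36  # from the list above

# HYPOTHETICAL inputs: replace with measured power draw and the actual
# carbon intensity of the grid where training ran.
ASSUMED_GPU_POWER_KW = 0.4
ASSUMED_CARBON_INTENSITY_KG_PER_KWH = 0.5

estimate = estimate_emissions_kg(
    ASSUMED_GPU_POWER_KW, TRAINING_HOURS, ASSUMED_CARBON_INTENSITY_KG_PER_KWH
)
print(f"estimated emissions: {estimate:.1f} kg CO2eq (placeholder inputs)")
```

The result scales linearly with every input, so the placeholder intensity dominates the uncertainty of any figure produced this way.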
174
+
175
+ ---
176
 
177
+ ## Framework Versions
 
 
178
 
179
+ * Transformers: 4.x
180
+ * PyTorch: 2.x
181
+ * Datasets: 2.x
182
+ * Tokenizers: 0.x