Document BCAI finance data source
README.md
CHANGED
@@ -163,6 +163,7 @@ Completed and prepared datasets:
 | Korean Wikipedia raw full | 462.5M | 1.8G | prepared for later stages |
 | HF extra reasoning/agent/mm subset | 112.6M | 444M | prepared, limited weight |
 | Local terminal conversations | 9.39B | 36G | prepared for terminal-heavy later stages |
+| BCAI Finance Korean | 857.7M | 3.3G | prepared and uploaded for later Korean finance/domain stages |
 | SWE-ZERO prepared | 182.7M | 720M | pretraining and later SFT |
 | GLM reasoning prepared | 68.5M | 282M | pretraining and later SFT |
 
@@ -184,12 +185,16 @@ Major source groups and provenance:
 | DeepSeek agent traces | https://huggingface.co/datasets/TeichAI/DeepSeek-v4-Pro-Agent | limited agent/tool-use subset; license-sensitive |
 | structured Wikipedia | https://huggingface.co/datasets/wikimedia/structured-wikipedia | tokenizer/general text support |
 | Local terminal/code/math conversations | local `swe`, `code`, and `math` parquet conversations | `local_terminal_conversations_ctx9k_resp6k_v1` |
+| BCAI Finance Kor | https://huggingface.co/datasets/BCCard/BCAI-Finance-Kor-1862K | `sft_bcai_finance_kor_v1` |
 
 The full Korean legal/admin task upload is present in the dataset repository at:
 
 - `korean_legal_tasks_full_v1/`
 - `raw_jsonl/korean_legal_tasks_full_20260524.jsonl`
 - `LEGAL_FULL_TASKS_README.md`
+- `sft_bcai_finance_kor_v1/`
+- `raw_jsonl/bcai_finance_kor_hrm_20260524.jsonl`
+- `FINANCE_BCAI_README.md`
 
 Evaluation-like data is excluded from training where identified, including ToolBench eval, Terminal Bench 2 style data, and benchmark-oriented `chi-bench` data.
 
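The README's closing note about excluding evaluation-like data could be enforced with a small source-tag filter. The following is a minimal sketch only: the `source` field, the marker strings, and both function names are hypothetical and do not come from the repository, which does not show how the exclusion is actually implemented.

```python
# Hypothetical sketch: drop evaluation-like records before training.
# The "source" field and the marker strings below are assumptions made
# for illustration; they are not names taken from the repository.
EVAL_MARKERS = ("toolbench_eval", "terminal_bench_2", "chi-bench")


def is_eval_like(record: dict) -> bool:
    """Return True when the record's source tag contains an eval marker."""
    source = str(record.get("source", "")).lower()
    return any(marker in source for marker in EVAL_MARKERS)


def filter_training_records(records):
    """Keep only records that are not identified as evaluation-like."""
    return [r for r in records if not is_eval_like(r)]


sample = [
    {"source": "sft_bcai_finance_kor_v1", "text": "..."},
    {"source": "toolbench_eval", "text": "..."},
    {"source": "chi-bench/v1", "text": "..."},
]
kept = filter_training_records(sample)
print([r["source"] for r in kept])
```

Substring matching on a source tag only catches data that is already labeled, which matches the README's "where identified" caveat; unlabeled benchmark leakage would need a separate check.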