gyung committed on
Commit 862683a · verified · 1 Parent(s): f131b4d

Document BCAI finance data source

Files changed (1)
  1. README.md +5 -0
README.md CHANGED
@@ -163,6 +163,7 @@ Completed and prepared datasets:
  | Korean Wikipedia raw full | 462.5M | 1.8G | prepared for later stages |
  | HF extra reasoning/agent/mm subset | 112.6M | 444M | prepared, limited weight |
  | Local terminal conversations | 9.39B | 36G | prepared for terminal-heavy later stages |
+ | BCAI Finance Korean | 857.7M | 3.3G | prepared and uploaded for later Korean finance/domain stages |
  | SWE-ZERO prepared | 182.7M | 720M | pretraining and later SFT |
  | GLM reasoning prepared | 68.5M | 282M | pretraining and later SFT |
 
@@ -184,12 +185,16 @@ Major source groups and provenance:
  | DeepSeek agent traces | https://huggingface.co/datasets/TeichAI/DeepSeek-v4-Pro-Agent | limited agent/tool-use subset; license-sensitive |
  | structured Wikipedia | https://huggingface.co/datasets/wikimedia/structured-wikipedia | tokenizer/general text support |
  | Local terminal/code/math conversations | local `swe`, `code`, and `math` parquet conversations | `local_terminal_conversations_ctx9k_resp6k_v1` |
+ | BCAI Finance Kor | https://huggingface.co/datasets/BCCard/BCAI-Finance-Kor-1862K | `sft_bcai_finance_kor_v1` |
 
  The full Korean legal/admin task upload is present in the dataset repository at:
 
  - `korean_legal_tasks_full_v1/`
  - `raw_jsonl/korean_legal_tasks_full_20260524.jsonl`
  - `LEGAL_FULL_TASKS_README.md`
+ - `sft_bcai_finance_kor_v1/`
+ - `raw_jsonl/bcai_finance_kor_hrm_20260524.jsonl`
+ - `FINANCE_BCAI_README.md`
 
  Evaluation-like data is excluded from training where identified, including ToolBench eval, Terminal Bench 2 style data, and benchmark-oriented `chi-bench` data.
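
The README change lists a raw JSONL file, `raw_jsonl/bcai_finance_kor_hrm_20260524.jsonl`, for the BCAI finance source. A minimal sketch of how a downloaded copy might be inspected follows. It assumes the file holds one JSON object per line. The record schema is not documented in this diff, so the sketch only counts records and collects top-level key names. The helper names `iter_jsonl` and `summarize_jsonl` are hypothetical and are not part of the repository.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_jsonl(path: Path) -> Iterator[object]:
    """Yield one parsed JSON value per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines rather than failing on them
                yield json.loads(line)


def summarize_jsonl(path: Path) -> tuple[int, list[str]]:
    """Return the record count and the sorted top-level keys seen in dict records."""
    count = 0
    keys: set[str] = set()
    for record in iter_jsonl(path):
        count += 1
        if isinstance(record, dict):
            keys.update(record)
    return count, sorted(keys)
```

Calling `summarize_jsonl` with the local path of the downloaded file gives a quick check that the upload parses cleanly and shows which fields are present before the data is wired into a later training stage.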