Phase 2C: Pre-training pipeline — data pipeline, sequence packing, HF Trainer CLM, 124 total tests passing
Implements the pre-training framework:
- data_pipeline.py: tokenize_user_sequences, pack_sequences (run_clm.py pattern), prepare_clm_dataset
- pretrain.py: pretrain_domain_model with HF Trainer, DataCollatorForLanguageModeling, cosine schedule
- test_training.py: 19 tests covering tokenization, packing, collation, integration, 24-step smoke test
- All 124 tests passing (72 tokenizer + 33 model + 19 training)
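
For readers unfamiliar with the "run_clm.py pattern" mentioned above, the following is a minimal sketch of that packing technique: concatenate all tokenized sequences, drop the trailing remainder, and cut the result into fixed-size blocks. It is not the code from data_pipeline.py; the function name `pack_token_sequences`, its signature, and the `block_size` parameter are assumptions made for illustration only.

```python
from itertools import chain
from typing import Dict, List


def pack_token_sequences(
    examples: Dict[str, List[List[int]]], block_size: int
) -> Dict[str, List[List[int]]]:
    """Hypothetical helper: pack tokenized sequences into fixed-size blocks.

    Assumes `examples` has an "input_ids" column and that all columns
    have the same per-sequence lengths.
    """
    # Flatten each column (e.g. input_ids, attention_mask) into one long list.
    concatenated = {
        key: list(chain.from_iterable(values)) for key, values in examples.items()
    }
    total_length = len(concatenated["input_ids"])
    # Drop the trailing remainder so every block holds exactly block_size tokens.
    total_length = (total_length // block_size) * block_size
    packed = {
        key: [tokens[i : i + block_size] for i in range(0, total_length, block_size)]
        for key, tokens in concatenated.items()
    }
    # For causal LM, labels are a copy of the inputs; Hugging Face causal LM
    # models shift the labels internally when computing the loss.
    packed["labels"] = [block.copy() for block in packed["input_ids"]]
    return packed


batch = {"input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]]}
packed = pack_token_sequences(batch, block_size=4)
print(packed["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

In this sketch the final token (9) is discarded because it does not fill a whole block. Packing this way avoids padding, at the cost of blocks that can span sequence boundaries.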
src/domain_tokenizer/training/__init__.py
ADDED
@@ -0,0 +1,13 @@
+"""
+Training utilities for domainTokenizer.
+
+- data_pipeline: tokenize_user_sequences, pack_sequences, prepare_clm_dataset
+- pretrain: pretrain_domain_model
+"""
+
+from .data_pipeline import (
+    tokenize_user_sequences,
+    pack_sequences,
+    prepare_clm_dataset,
+)
+from .pretrain import pretrain_domain_model