rtferraz committed on
Commit 28118c7 · verified · 1 Parent(s): ab8a8b6

Phase 2C: Pre-training pipeline — data pipeline, sequence packing, HF Trainer CLM, 124 total tests passing


Implements the pre-training framework:
- data_pipeline.py: tokenize_user_sequences, pack_sequences (run_clm.py pattern), prepare_clm_dataset
- pretrain.py: pretrain_domain_model with HF Trainer, DataCollatorForLanguageModeling, cosine schedule
- test_training.py: 19 tests covering tokenization, packing, collation, integration, 24-step smoke test
- All 124 tests passing (72 tokenizer + 33 model + 19 training)
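
For readers unfamiliar with the run_clm.py packing pattern mentioned above, here is a minimal sketch of the idea, not the commit's actual pack_sequences. The function name pack_sequences_sketch, its signature, and the toy batch are assumptions made for illustration. It concatenates the tokenized sequences, drops the trailing remainder, and cuts the rest into fixed-size blocks whose labels are copies of input_ids.

```python
from itertools import chain


def pack_sequences_sketch(examples, block_size):
    """Hypothetical helper: pack variable-length sequences into fixed blocks.

    `examples` maps column names (e.g. "input_ids") to lists of token lists.
    """
    # Concatenate every sequence in each column into one long list.
    concatenated = {k: list(chain.from_iterable(v)) for k, v in examples.items()}
    total_length = len(concatenated["input_ids"])
    # Drop the trailing tokens that do not fill a whole block.
    total_length = (total_length // block_size) * block_size
    # Slice each column into consecutive blocks of `block_size` tokens.
    result = {
        k: [v[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, v in concatenated.items()
    }
    # For causal LM training the labels start as a copy of the inputs.
    result["labels"] = [list(block) for block in result["input_ids"]]
    return result


# Toy batch (made-up token ids): 9 tokens in total, so one token is dropped.
batch = {
    "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]],
    "attention_mask": [[1, 1, 1], [1, 1], [1, 1, 1, 1]],
}
packed = pack_sequences_sketch(batch, block_size=4)
```

With block_size=4 the nine toy tokens yield two blocks and the ninth token is discarded. Packing this way avoids padding, at the cost of blocks that can span more than one original sequence.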

src/domain_tokenizer/training/__init__.py ADDED
@@ -0,0 +1,13 @@
+ """
+ Training utilities for domainTokenizer.
+
+ - data_pipeline: tokenize_user_sequences, pack_sequences, prepare_clm_dataset
+ - pretrain: pretrain_domain_model
+ """
+
+ from .data_pipeline import (
+     tokenize_user_sequences,
+     pack_sequences,
+     prepare_clm_dataset,
+ )
+ from .pretrain import pretrain_domain_model