Add pretrain.py - pretrain_domain_model with HF Trainer, cosine schedule, DataCollatorForLanguageModeling 6ccb9e6 verified rtferraz committed 10 days ago
Add data_pipeline.py - tokenize_user_sequences, pack_sequences, prepare_clm_dataset 1dfd4e2 verified rtferraz committed 10 days ago
Phase 2C: Pre-training pipeline - data pipeline, sequence packing, HF Trainer CLM, 124 total tests passing 28118c7 verified rtferraz committed 10 days ago
Add model test suite - 33 tests covering config, model, PLR, DCNv2, joint fusion, integration ab8a8b6 verified rtferraz committed 10 days ago
Add DCNv2 + JointFusionModel (nuFormer-style Transformer + tabular fusion) e881ea3 verified rtferraz committed 10 days ago
Add DomainTransformerForCausalLM - GPT-style NoPE model with SDPA attention, weight tying, HF Trainer compatible 0dec8e4 verified rtferraz committed 10 days ago
Add DomainTransformerConfig with presets (24M/85M/330M) 15fbfea verified rtferraz committed 10 days ago
Phase 2B: Model architecture - DomainTransformerForCausalLM (NoPE, GPT-style), PLR embeddings, DCNv2 + JointFusion, 105 passing tests 2f5969e verified rtferraz committed 10 days ago
Add comprehensive test suite - 72 passing tests covering all components 8efa945 verified rtferraz committed 10 days ago
Add predefined schemas (FINANCE, ECOMMERCE, HEALTHCARE) c00ac2c verified rtferraz committed 10 days ago
Add domain_tokenizer.py - DomainTokenizerBuilder (core assembler, HF integration) 818a2e9 verified rtferraz committed 10 days ago
Add field_tokenizers.py - Sign, MagnitudeBucket, Calendar, Categorical, DiscreteNumerical tokenizers 511f3aa verified rtferraz committed 10 days ago
Add schema.py - DomainSchema, FieldSpec, FieldType definitions 1a9dad0 verified rtferraz committed 10 days ago
Phase 2A: Core tokenizer library - schema, field tokenizers, composite builder, predefined schemas, 72 passing tests 0c1ca58 verified rtferraz committed 10 days ago
Update README: add ADR reference, update documentation table and repo structure a239d6e verified rtferraz committed 10 days ago
Add ADR-001: Implementation framework decision with detailed roadmap 25a1093 verified rtferraz committed 10 days ago
Update README with Nubank case study and expanded repo structure e30a14d verified rtferraz committed 10 days ago
Add Nubank nuFormer reverse-engineering analysis - full pipeline reconstruction 51149fa verified rtferraz committed 10 days ago
Add comprehensive research report on domain-specific tokenization be86e60 verified rtferraz committed 10 days ago