# E-Commerce Pre-Training Report
> **Dataset:** REES46 Multi-Category Store (10M events subsampled from 110M)
> **Model:** DomainTransformer 24M (NoPE, GPT-style, d=512, 6L, 8H)
> **Hardware:** NVIDIA L4 (24GB VRAM), bf16, 5 min 44 sec wall time
> **Date:** May 5, 2026
> **Status:** ✅ Success. Model learns real sequential patterns, beats random baseline by 30%
---
## Training Configuration
| Parameter | Value |
|-----------|-------|
| Dataset | REES46 e-commerce (10M events, subsampled from 110M) |
| Users (10+ events) | 100,000 (capped) |
| Total events | 4,472,096 |
| Events per user | min=10, max=200, mean=44.7 |
| Unique categories | 2,767 |
| Unique brands | ~4,300 |
| Block size | 512 tokens |
| Training tokens | ~62.7M |
| Vocab size | ~4,000 (65 domain special + BPE) |
| UNK rate | ~0% (after ByteLevel → Whitespace fix) |
| Batch size | 32 × 4 = 128 effective |
| Epochs | 3 |
| Total steps | 690 |
| Learning rate | 3e-4 (cosine with 200-step warmup) |
| Precision | bf16 |
| Training time | 5 min 44 sec |
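
For quick reference, the table above can be captured in a single config object; a minimal sketch (the class and field names are illustrative, not taken from the actual training script):

```python
# Hypothetical config mirroring the table above; names are illustrative only.
from dataclasses import dataclass

@dataclass
class PretrainConfig:
    block_size: int = 512          # tokens per training sequence
    vocab_size: int = 4_000        # 65 domain special tokens + BPE merges
    d_model: int = 512             # hidden size (d=512)
    n_layers: int = 6              # 6 transformer layers
    n_heads: int = 8               # 8 attention heads
    micro_batch_size: int = 32     # per-step batch
    grad_accum_steps: int = 4      # 32 x 4 = 128 effective batch
    epochs: int = 3
    learning_rate: float = 3e-4    # cosine schedule
    warmup_steps: int = 200
    precision: str = "bf16"
```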
---
## Results
### Loss
```
Final loss: 5.80
Min loss: 5.75
Random chance loss: 8.29 (= ln(vocab_size))
Model vs random: ✅ 30% better than random
```
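The random-chance baseline is just the cross-entropy of a uniform distribution over the vocabulary, so it can be checked in a couple of lines (assuming a vocabulary of ~4,000 tokens):

```python
import math

vocab_size = 4_000
random_loss = math.log(vocab_size)  # cross-entropy of a uniform next-token distribution
print(f"{random_loss:.2f} nats")    # ~8.29, the random-chance baseline quoted above
print(f"{(random_loss - 5.80) / random_loss:.0%} better")  # ~30% improvement at final loss 5.80
```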
The loss curve showed continuous descent through all 3 epochs with **no plateau** (unlike the finance experiment, which plateaued at epoch 0.5).
### Loss Trajectory
```
Epoch 0.0: 33.23 (initial - learning token distribution)
Epoch 0.4:  9.98 (rapid descent - learning basic structure)
Epoch 0.9:  6.19 (below random - learning sequential patterns)
Epoch 2.0:  5.88 (still descending)
Epoch 3.0:  5.80 (still descending - not converged)
```
### Next-Token Predictions
Given a sequence ending with `electronics.tool.drill [TIMESTAMP_DOW_0] [TIMESTAMP_HOUR_14] [EOS]`:
| Rank | Token | Score | Interpretation |
|------|-------|-------|----------------|
| 1 | `[BOS]` | 12.00 | Correct: new sequence after EOS |
| 2 | `drill` | 2.47 | **Category stickiness**: drill browsers keep browsing drills |
| 3 | `[SEP_EVENT]` | 2.33 | Another event follows |
| 4 | `[TIMESTAMP_DOW_0]` | 2.23 | Learned temporal pattern |
| 5 | `[TIMESTAMP_HOUR_06]` | 2.11 | Shopping hour pattern |
The model learned that users who browse drills tend to continue browsing drills, a real e-commerce behavioral pattern.
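
A top-k readout like the table above can be produced by taking the logits at the final position; a minimal sketch, assuming the checkpoint and tokenizer load through standard Hugging Face interfaces (the exact loading code in the notebook may differ for the custom DomainTransformer):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumed loading paths; the notebook may instantiate the custom model differently.
tokenizer = AutoTokenizer.from_pretrained("./ecommerce_tokenizer")
model = AutoModelForCausalLM.from_pretrained("./ecommerce_pretrain_checkpoints/final")
model.eval()

prompt = "electronics.tool.drill [TIMESTAMP_DOW_0] [TIMESTAMP_HOUR_14] [EOS]"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]  # logits at the last position

top = torch.topk(next_token_logits, k=5)
for score, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{tokenizer.decode([idx]):>20}  {score:.2f}")
```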
### User Embeddings (t-SNE)
500 user embeddings projected to 2D, colored by purchase rate:
**Key findings:**
- **Buyers cluster together:** a distinct pocket of green/yellow dots (purchase rate 20-40%) in the bottom-right of the main cluster
- **Window-shoppers/bots form isolated islands:** 4 tight clusters on the far left, all dark pink (0% purchase rate)
- **The main cloud shows behavioral diversity:** not a uniform blob like the finance experiment
**This demonstrates:** the pre-trained model learned meaningful behavioral representations that separate user types, without any labels, purely from next-token prediction on domain token sequences.
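
A sketch of how the projection could be reproduced, assuming user embeddings are mean-pooled final hidden states (the pooling scheme, variable names, and t-SNE settings are assumptions, not taken from the notebook):

```python
import numpy as np
import torch
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def embed_user(model, tokenizer, sequence: str) -> np.ndarray:
    """Mean-pool the final hidden layer over a user's token sequence (assumed pooling scheme)."""
    inputs = tokenizer(sequence, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs, output_hidden_states=True).hidden_states[-1]
    return hidden[0].mean(dim=0).cpu().numpy()

# user_sequences (list of 500 token strings) and purchase_rates (array in [0, 1]) are
# placeholders for data loaded from ecommerce_artifacts.pkl; model/tokenizer as above.
embeddings = np.stack([embed_user(model, tokenizer, s) for s in user_sequences])
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], c=purchase_rates, cmap="viridis", s=8)
plt.colorbar(label="purchase rate")
plt.title("User embeddings (t-SNE), colored by purchase rate")
plt.show()
```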
---
## Comparison: Finance vs E-Commerce
| Dimension | Finance (❌ Failed) | E-Commerce (✅ Success) |
|-----------|--------------------|-----------------------|
| Final loss | 6.91 | 5.80 |
| Random baseline | 5.84 | 8.29 |
| vs. random | Worse (above baseline) | **30% better** (below baseline) |
| Loss trajectory | Plateaued at epoch 0.5 | Still descending at epoch 3 |
| Unique descriptions | 84 | 2,767 |
| Sequential dependencies | None | Strong (view→cart→purchase) |
| t-SNE | Uniform blob, no separation | Clear clusters, buyer pocket |
| Training time | 25 min | 5.7 min |
**Root cause of the difference:** The e-commerce dataset has real sequential structure (behavioral funnels, category stickiness, temporal patterns) that next-token prediction can learn. The finance dataset had only 84 templates drawn randomly: nothing sequential to learn.
---
## What the Model Learned
1. **Category stickiness:** Users browsing electronics keep browsing electronics. Users looking at drills predict more drill-related tokens.
2. **Event type transitions:** After `view`, the next event is most likely another `view` (96%), but `cart` (3%) is significantly more likely than random, and `purchase` after `cart` is 27% (vs. a 1.5% base rate); a counting sketch follows this list.
3. **Temporal patterns:** Shopping happens at certain hours and days. The model learned `[TIMESTAMP_DOW_0]` and specific hours as predictable patterns.
4. **Behavioral archetypes:** The t-SNE shows distinct user types (active buyers, window-shoppers, and bot-like patterns), all discovered without supervision.
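
The transition figures in item 2 are empirical conditional frequencies over consecutive events; a small counting sketch, with illustrative variable names:

```python
from collections import Counter, defaultdict

# event_sequences: per-user lists of event types, e.g. ["view", "view", "cart", "purchase", ...]
# (illustrative name; the real sequences live in ecommerce_artifacts.pkl)
transitions = defaultdict(Counter)
for events in event_sequences:
    for prev, nxt in zip(events, events[1:]):
        transitions[prev][nxt] += 1

for prev, counts in sorted(transitions.items()):
    total = sum(counts.values())
    probs = {nxt: f"{c / total:.1%}" for nxt, c in counts.most_common()}
    print(prev, "->", probs)  # e.g. P(purchase | cart) ~ 27% per the report
```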
---
## Critical Bug Fixed During This Run
**42% UNK rate bug:** The first attempt produced 42.77% UNK tokens because the `ByteLevel` pre-tokenizer split space-separated special tokens into byte fragments (`Ġ[`, `PRICE`, `_`, `16`, `]`) that weren't in the vocabulary.
**Fix:** Switched to the `Whitespace` pre-tokenizer in `domain_tokenizer.py`. Whitespace splits on spaces (preserving `[EVT_000]` as a whole unit), and BPE handles subword splitting within text fields (e.g., `electronics.smartphone` → `electronics`, `.`, `smartphone`).
**Result:** 0% UNK rate after fix.
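
A minimal sketch of the corrected setup with the `tokenizers` library (the corpus path and the particular special tokens listed here are placeholders; `domain_tokenizer.py` may differ in detail):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# A subset of the 65 domain special tokens, for illustration.
special_tokens = ["[UNK]", "[BOS]", "[EOS]", "[SEP_EVENT]", "[TIMESTAMP_DOW_0]"]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # no ByteLevel, so no stray "Ġ"-prefixed byte fragments

trainer = BpeTrainer(vocab_size=4_000, special_tokens=special_tokens)
tokenizer.train(files=["ecommerce_corpus.txt"], trainer=trainer)  # placeholder corpus path

# Registered special tokens are kept whole, so bracketed domain tokens are never fragmented.
print(tokenizer.encode("electronics.smartphone [TIMESTAMP_DOW_0] [EOS]").tokens)
```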
---
## Future Training Improvements
The model has **not converged**; the loss was still descending at epoch 3. The following levers are available for future runs:
### Immediate (same hardware)
| Lever | Current | Improvement | Expected Gain |
|-------|---------|-------------|---------------|
| **Epochs** | 3 | 10-15 | Loss hasn't plateaued; more epochs = lower loss. Estimated: 5.80 → 5.2-5.4 |
| **Block size** | 512 | 1024 or 2048 | Longer context = model sees full user journeys (100+ events). May improve category-stickiness learning |
| **Learning rate** | 3e-4 | Grid search [1e-4, 3e-4, 5e-4] | Potentially faster convergence or lower final loss |
### Medium (needs more hardware)
| Lever | Current | Improvement | Requirement |
|-------|---------|-------------|-------------|
| **Full dataset** | 10M events | 110M events (all users) | 64GB RAM machine |
| **More users** | 100K | 500K-1M | 64GB RAM + longer training |
| **Model size** | 24M (d=512, 6L) | 85M (d=768, 12L) | Same L4 GPU; uses more of its VRAM |
### Advanced (research-grade)
| Lever | Description | Reference |
|-------|-------------|-----------|
| **Longer context (2048)** | Nubank uses 2048 tokens (~146 transactions). We use 512 (~50 events). Longer context captures monthly/seasonal patterns | nuFormer paper |
| **330M model** | Nubank saw +0.21% AUC going from 24M to 330M | nuFormer Table 1 |
| **ActionPiece vocabulary** | BPE-like merging of cross-field patterns (e.g., `{electronics + $50-100}` β†’ composite token) | ActionPiece paper |
| **Multi-epoch with eval split** | Hold out 10% of users for validation, train until val loss stops improving | Standard practice |
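
For the eval-split row above, the important detail is splitting by user rather than by token chunk, so no user's history leaks across the split; a sketch with illustrative names:

```python
import random

# user_sequences: dict of user_id -> token string (illustrative; cf. ecommerce_artifacts.pkl)
user_ids = sorted(user_sequences)
random.seed(0)
random.shuffle(user_ids)

n_val = int(0.10 * len(user_ids))   # hold out 10% of users
val_ids = user_ids[:n_val]

train_texts = [user_sequences[u] for u in user_ids[n_val:]]
val_texts = [user_sequences[u] for u in val_ids]
# Train on train_texts and stop when loss on val_texts stops improving (early stopping).
```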
### Priority Order for Next Run
1. **10 epochs** (free; just run longer) → expected loss of 5.2-5.4
2. **Block size 1024** (minimal cost; slightly more VRAM) → better long-range patterns
3. **85M model** (still fits on the L4) → more capacity
4. **Full 110M dataset** (needs a 64GB RAM machine) → more diversity
---
## Artifacts
| File | Location | Description |
|------|----------|-------------|
| Pre-trained model | [huggingface.co/rtferraz/ecommerce-domain-24m](https://huggingface.co/rtferraz/ecommerce-domain-24m) | 20.9M params, pushed to Hub |
| Tokenizer | `./ecommerce_tokenizer/` | Fitted domain tokenizer (4000 vocab) |
| Model checkpoint | `./ecommerce_pretrain_checkpoints/final/` | Local copy |
| User data | `./ecommerce_artifacts.pkl` | 100K user sequences + IDs |
| Notebook | `notebooks/02_ecommerce_pretrain.ipynb` | Complete with outputs |
| wandb run | domainTokenizer/ecommerce-pretrain-24m-3ep | Loss curves, grad norms |
---
## Conclusion
**The domainTokenizer thesis is validated.** When domain data has genuine sequential structure:
- A 24M-param model trained on domain tokens (not text) learns meaningful behavioral representations
- Loss drops well below random chance (30% better)
- User embeddings show clear behavioral clusters without supervision
- Training takes under 6 minutes on a single L4 GPU
The next step is fine-tuning: use the pre-trained model's user embeddings for downstream prediction (next-purchase prediction, user segmentation).
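
A cheap first check before full fine-tuning is a linear probe on the frozen user embeddings; a sketch assuming the embeddings and per-user purchase labels from the t-SNE section (all names are placeholders):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# embeddings: (n_users, 512) array from the t-SNE sketch; purchased: 0/1 label per user.
# Both are placeholders, not artifacts shipped with this report.
X_train, X_test, y_train, y_test = train_test_split(
    embeddings, purchased, test_size=0.2, random_state=0, stratify=purchased
)

probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("purchase-prediction ROC-AUC:", roc_auc_score(y_test, probe.predict_proba(X_test)[:, 1]))
```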