OK, I think I see the real issue now. The weighted sampling balances which tokens the model sees, but the labels (the actual return values) are determined by the random `T_cutoff` within each token, not by the token's class. Even a class 5 token (100x return) only pumps in a tiny window of its lifetime. If a token has 1000 trades and the pump happens between trades 200-400, then:

- `T_cutoff` at trade 50 → returns might be +500% ✅
- `T_cutoff` at trade 500 → returns are -80% (post-pump bleed)
- `T_cutoff` at trade 700 → returns are -90%
- `T_cutoff` at trade 900 → returns are -95%

So even for class 5 tokens, 80%+ of the cached training samples have negative ground-truth labels. The model is correctly learning that at any random moment, even a "good" token is most likely going down. Class balancing doesn't change the fact that the actual Y labels are overwhelmingly negative across all classes. The model isn't broken: it learned exactly what the data showed it. The issue is that the training setup doesn't teach it to recognize the pre-pump moment specifically.

**Main Issue**

Your main problem was never just "bad checkpoint choice." The core issue is training/data misalignment:

- token `class_id` is token-level
- the prediction target is context-level, drawn from a random `T_cutoff`
- even a good token produces many bad windows
- so balanced token classes do not mean balanced future-return labels
- the model then learns an over-negative prior

A second major issue was cache construction:

- the cache was wasting disk space and time on overwhelming numbers of garbage-token samples
- later training weights cannot fix that upstream waste

**What We Figured Out**

- The model is not useless.
- Wallet signal is real: ablations showed that removing wallet features hurts predictions materially.
- OHLC matters, but mostly as a coarse summary, not real chart-pattern intelligence.
- No obvious future leakage was found in OHLC construction.
- Social looked basically unused.
- Graph looked weaker than expected.
- The movement head idea is valid, but only if labels are placed correctly in the pipeline.
- Movement labels should come from the data loader, not be derived later in the collator or training loop.
- Cache balancing should not depend on fragile movement thresholds.
- A single "movement class" for cache weighting was wrong because:
  - thresholds were unresolved
  - movement differs across horizons inside the same sample

**Where You Corrected the Direction**

You pushed back on several important bad assumptions:

- `return > 0` is too noisy as a label
- movement class names should be threshold-agnostic
- threshold-based movement balancing was premature
- SQL/global distribution threshold inference was conceptually wrong because labels depend on the sampled `T_cutoff`
- the cache should not be filtered by class map in a destructive way
- cache balancing must happen at cache generation time, not be delegated to training weights
- positive balancing should not be forced on garbage classes
- exact class sample counts matter more than approximate expected weighting
- `T_cutoff` does not need to be deterministic or pre-fixed
- if cache balancing uses movement-like signals, use threshold-free binary polarity first

Those corrections materially improved the design.

**Proposed Methods Over the Chat**

These were the main methods proposed, in order of evolution:

1. Forward-time validation and token-grouped splits
   - to reduce misleading val results and leakage risk
2. Auxiliary head ideas
   - first fixed pump heads
   - then an all-horizon direction head
   - then a movement-type multiclass head
   - final stable view: one multi-horizon movement head is reasonable, but labels must be created correctly
3. Runtime/loader-side label derivation
   - final agreed direction:
     - labels belong in the data loader
     - the collator should only stack them
     - the model should just consume them
4. Cache-time balancing instead of train-time rescue
   - because disk/time waste happens before training starts
   - so train weights alone are too late
5. Class-id-based cache expansion
   - proposed because class `0` dominates raw token counts
   - later refined because exact quotas matter more than soft weighting
6. Movement-class-based cache balancing
   - proposed, then correctly rejected
   - because it depended on unresolved thresholds and collapsed multi-horizon information incorrectly
7. Binary polarity cache balancing
   - the better final version:
     - use `positive` if the max valid horizon return in a sample is `> 0`
     - else `negative`
   - this is threshold-free and less brittle
8. Exact class quotas + class-conditional polarity quotas
   - the strongest final cache proposal:
     - an exact, equal sample budget per class
     - positive/negative balancing only for classes that can realistically produce positive contexts
     - keep class `0` mostly negative
     - keep `T_cutoff` random

**Current Best Design**

The strongest design we converged toward is:

- keep `T_cutoff` random
- make cache generation deterministic at the planning level
- assign exact sample budgets per token class
- for higher classes, request positive/negative context ratios
- for low garbage classes, do not force positive quotas
- build labels in the data loader
- keep the model's main task as future return prediction
- use the movement head only as auxiliary supervision

So:

- token-type balance is controlled explicitly
- context-type bias is controlled explicitly
- disk usage is controlled at cache creation time
- training does not need to rescue a broken corpus

**What We Learned About the Model**

- the backbone has useful signal
- wallet features matter
- OHLC is being used, but not in the rich pattern-detection way you hoped
- the current main failure mode is much more about data construction and sampling than about the backbone learning nothing

**High-Level Conclusions**

1. The main issue is data/sample construction, not just optimizer/checkpoint behavior.
2. Cache balancing is the right place to solve the waste and prior-bias problem.
3. Threshold-dependent movement balancing was too early and too brittle.
4. Binary short-horizon max-return polarity is a better first cache-balancing signal.
5. Exact per-class sample quotas plus class-conditional polarity balancing is the most coherent cache design we arrived at.

If you want, I can next turn this into a short final blueprint with only:

- `Main issue`
- `Accepted design`
- `Rejected ideas`
- `Next implementation order`
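To make the `T_cutoff` failure mode concrete, here is a toy simulation in plain Python. The price path and numbers are invented purely for illustration (a class 5 token with 1000 trades and a pump around trades 200-400, as in the example above); only the mechanism matters:

```python
import random

def future_return(prices, t_cutoff):
    """Return from the cutoff price to the token's final price."""
    return prices[-1] / prices[t_cutoff] - 1.0

# Hypothetical class-5 token: flat pre-pump, pump to ~100x between
# trades 200-400, then a long bleed. All numbers are illustrative.
prices = (
    [1.0] * 200                                   # flat pre-pump
    + [1.0 + 99.0 * i / 200 for i in range(200)]  # pump toward ~100x
    + [100.0 * (0.995 ** i) for i in range(600)]  # post-pump bleed
)

# Sample random cutoffs the way the cache builder would.
random.seed(0)
cutoffs = [random.randrange(0, 950) for _ in range(10_000)]
labels = [future_return(prices, t) for t in cutoffs]

neg_share = sum(l < 0 for l in labels) / len(labels)
print(f"negative-label share: {neg_share:.0%}")
```

Even with a 100x pump in the token's history, the large majority of randomly cut samples carry a negative label, which is exactly the over-negative prior described above.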
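The "labels belong in the data loader, the collator only stacks" split from item 3 could look roughly like this. Everything here is a hypothetical sketch (`TokenWindowDataset`, `HORIZONS`, the dict layout are placeholder names, plain Python standing in for the real dataset classes):

```python
import random

HORIZONS = [10, 50, 200]  # hypothetical horizon lengths, in trades

class TokenWindowDataset:
    """Yields one sample per (token, random T_cutoff) draw.

    Labels are computed here, at load time, so the collator and the
    model never need to re-derive them.
    """

    def __init__(self, tokens, rng=None):
        self.tokens = tokens  # list of dicts with "prices", "class_id"
        self.rng = rng or random.Random(0)

    def __getitem__(self, idx):
        token = self.tokens[idx]
        prices = token["prices"]
        # Random cutoff, leaving at least the shortest horizon of future.
        t_cutoff = self.rng.randrange(1, len(prices) - min(HORIZONS))
        labels = {}
        for h in HORIZONS:
            if t_cutoff + h < len(prices):  # horizon is still in range
                labels[h] = prices[t_cutoff + h] / prices[t_cutoff] - 1.0
            else:
                labels[h] = None            # masked in the loss later
        return {"context": prices[:t_cutoff],
                "class_id": token["class_id"],
                "labels": labels}

def collate(batch):
    """Collator only stacks what the loader produced; no label logic."""
    return {k: [sample[k] for sample in batch] for k in batch[0]}
```

The model then consumes `labels` directly; horizons that run past the end of the token stay `None` and are masked out of the auxiliary loss.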
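The threshold-free polarity rule from item 7 fits in a few lines. A sketch, assuming `returns_by_horizon` holds per-horizon future returns with `None` for horizons that run past the end of the token:

```python
def context_polarity(returns_by_horizon):
    """'positive' if the max valid horizon return is > 0, else 'negative'.

    Threshold-free by design: no magic pump cutoff, just the sign of the
    best achievable future return across the horizons still in range.
    """
    valid = [r for r in returns_by_horizon.values() if r is not None]
    if not valid:
        return None  # no valid horizon at all: sample is unusable
    return "positive" if max(valid) > 0 else "negative"

context_polarity({10: -0.2, 50: 0.8, 200: None})  # → "positive"
context_polarity({10: -0.2, 50: -0.1})            # → "negative"
```

Because only the sign of the best valid horizon is used, this stays stable even when the per-horizon thresholds are still unresolved.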
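And the cache plan from item 8 can be sketched as a pure planning function that the cache builder would execute against. Class ids, budgets, and ratios below are hypothetical placeholders:

```python
def plan_cache_quotas(per_class_budget, positive_ratio_by_class):
    """Exact per-class sample budgets, split into positive/negative quotas.

    Classes absent from `positive_ratio_by_class` (e.g. garbage class 0)
    get no forced positive quota: their polarity split is unconstrained.
    """
    plan = {}
    for class_id, budget in per_class_budget.items():
        ratio = positive_ratio_by_class.get(class_id)
        if ratio is None:
            plan[class_id] = {"total": budget,
                              "positive": None, "negative": None}
        else:
            pos = round(budget * ratio)
            plan[class_id] = {"total": budget,
                              "positive": pos, "negative": budget - pos}
    return plan

# Example: equal exact budgets, positive quotas only for classes 3-5.
plan = plan_cache_quotas(
    per_class_budget={0: 20_000, 3: 20_000, 4: 20_000, 5: 20_000},
    positive_ratio_by_class={3: 0.3, 4: 0.4, 5: 0.5},
)
```

Keeping the plan separate from sampling is what makes cache generation "deterministic at the planning level" while `T_cutoff` itself stays random inside each requested slot.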