OK, I think I see the real issue now. The weighted sampling balances which tokens the model sees, but the labels (the actual return values) are determined by the random `T_cutoff` within each token, not by the token's class. Even a class 5 token (100x return) only pumps in a tiny window of its lifetime. If a token has 1000 trades and the pump happens between trades 200-400, then:

- `T_cutoff` at trade 50 → returns might be +500% ✅
- `T_cutoff` at trade 500 → returns are -80% (post-pump bleed)
- `T_cutoff` at trade 700 → returns are -90%
- `T_cutoff` at trade 900 → returns are -95%

So even for class 5 tokens, 80%+ of the cached training samples have negative ground-truth labels. The model is correctly learning that at any random moment, even a "good" token is most likely going down. Class balancing doesn't change the fact that the actual Y labels are overwhelmingly negative across all classes. The model isn't broken: it learned exactly what the data showed it. The issue is that the training setup doesn't teach it to recognize the pre-pump moment specifically.

**Main Issue**

Your main problem was never just "bad checkpoint choice." The core issue is training/data misalignment:

- token `class_id` is token-level
- the prediction target is context-level, drawn from a random `T_cutoff`
- even a good token produces many bad windows
- so balanced token classes do not mean balanced future-return labels
- the model then learns an over-negative prior

A second major issue was cache construction:

- the cache was wasting disk space and time on overwhelming numbers of garbage-token samples
- later training weights cannot fix that upstream waste

**What We Figured Out**

- The model is not useless.
- Wallet signal is real: ablations showed that removing wallet features hurts predictions materially.
- OHLC matters, but mostly as a coarse summary, not real chart-pattern intelligence.
- No obvious future leakage was found in OHLC construction.
- Social looked basically unused.
- Graph looked weaker than expected.
- The movement head idea is valid, but only if labels are placed correctly in the pipeline.
- Movement labels should come from the data loader, not be derived later in the collator or training loop.
- Cache balancing should not depend on fragile movement thresholds.
- A single "movement class" for cache weighting was wrong because:
  - thresholds were unresolved
  - movement differs across horizons inside the same sample

**Where You Corrected the Direction**

You pushed back on several important bad assumptions:

- `return > 0` is too noisy as a label
- movement class names should be threshold-agnostic
- threshold-based movement balancing was premature
- SQL/global distribution threshold inference was conceptually wrong because labels depend on the sampled `T_cutoff`
- the cache should not be filtered by class map in a destructive way
- cache balancing must happen at cache generation time, not be delegated to training weights
- positive balancing should not be forced on garbage classes
- exact class sample counts matter more than approximate expected weighting
- `T_cutoff` does not need to be deterministic or pre-fixed
- if cache balancing uses movement-like signals, use threshold-free binary polarity first

Those corrections materially improved the design.

**Proposed Methods Over the Chat**

These were the main methods proposed, in order of evolution:

1. Forward-time validation and token-grouped splits
   - to reduce misleading val results and leakage risk
2. Auxiliary head ideas
   - first fixed pump heads
   - then an all-horizon direction head
   - then a movement-type multiclass head
   - final stable view: one multi-horizon movement head is reasonable, but labels must be created correctly
3. Runtime/loader-side label derivation
   - final agreed direction:
     - labels belong in the data loader
     - the collator should only stack them
     - the model should just consume them
4. Cache-time balancing instead of train-time rescue
   - because disk/time waste happens before training starts
   - so train weights alone are too late
5. Class-id-based cache expansion
   - proposed because class `0` dominates raw token counts
   - later refined because exact quotas matter more than soft weighting
6. Movement-class-based cache balancing
   - proposed, then correctly rejected
   - because it depended on unresolved thresholds and collapsed multi-horizon information incorrectly
7. Binary polarity cache balancing
   - the better final version:
     - use `positive` if the max valid horizon return in a sample is `> 0`
     - else `negative`
   - this is threshold-free and less brittle
8. Exact class quotas + class-conditional polarity quotas
   - the strongest final cache proposal:
     - an exact, equal sample budget per class
     - positive/negative balancing only for classes that can realistically produce positive contexts
     - keep class `0` mostly negative
     - keep `T_cutoff` random

**Current Best Design**

The strongest design we converged toward is:

- keep `T_cutoff` random
- make cache generation deterministic at the planning level
- assign exact sample budgets per token class
- for higher classes, request positive/negative context ratios
- for low garbage classes, do not force positive quotas
- build labels in the data loader
- keep the model's main task as future return prediction
- use the movement head only as auxiliary supervision

So:

- token-type balance is controlled explicitly
- context-type bias is controlled explicitly
- disk usage is controlled at cache creation time
- training does not need to rescue a broken corpus

**What We Learned About the Model**

- the backbone has useful signal
- wallet features matter
- OHLC is being used, but not in the rich pattern-detection way you hoped
- the current main failure mode is much more about data construction and sampling than about the backbone learning nothing

**High-Level Conclusions**

1. The main issue is data/sample construction, not just optimizer/checkpoint behavior.
2. Cache balancing is the right place to solve the waste and prior-bias problem.
3. Threshold-dependent movement balancing was too early and too brittle.
4. Binary short-horizon max-return polarity is a better first cache-balancing signal.
5. Exact per-class sample quotas plus class-conditional polarity balancing is the most coherent cache design we arrived at.

If you want, I can next turn this into a short final blueprint with only:

- `Main issue`
- `Accepted design`
- `Rejected ideas`
- `Next implementation order`
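To make the `T_cutoff` failure mode concrete, here is a toy simulation in plain Python. The price path and numbers are invented purely for illustration (a class 5 token with 1000 trades and a pump around trades 200-400, as in the example above); only the mechanism matters:

```python
import random

def future_return(prices, t_cutoff):
    """Return from the cutoff price to the token's final price."""
    return prices[-1] / prices[t_cutoff] - 1.0

# Hypothetical class-5 token: flat pre-pump, pump to ~100x between
# trades 200-400, then a long bleed. All numbers are illustrative.
prices = (
    [1.0] * 200                                   # flat pre-pump
    + [1.0 + 99.0 * i / 200 for i in range(200)]  # pump toward ~100x
    + [100.0 * (0.995 ** i) for i in range(600)]  # post-pump bleed
)

# Sample random cutoffs the way the cache builder would.
random.seed(0)
cutoffs = [random.randrange(0, 950) for _ in range(10_000)]
labels = [future_return(prices, t) for t in cutoffs]

neg_share = sum(l < 0 for l in labels) / len(labels)
print(f"negative-label share: {neg_share:.0%}")
```

Even with a 100x pump in the token's history, the large majority of randomly cut samples carry a negative label, which is exactly the over-negative prior described above.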
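The "labels belong in the data loader, the collator only stacks" split from item 3 could look roughly like this. Everything here is a hypothetical sketch (`TokenWindowDataset`, `HORIZONS`, the dict layout are placeholder names, plain Python standing in for the real dataset classes):

```python
import random

HORIZONS = [10, 50, 200]  # hypothetical horizon lengths, in trades

class TokenWindowDataset:
    """Yields one sample per (token, random T_cutoff) draw.

    Labels are computed here, at load time, so the collator and the
    model never need to re-derive them.
    """

    def __init__(self, tokens, rng=None):
        self.tokens = tokens  # list of dicts with "prices", "class_id"
        self.rng = rng or random.Random(0)

    def __getitem__(self, idx):
        token = self.tokens[idx]
        prices = token["prices"]
        # Random cutoff, leaving at least the shortest horizon of future.
        t_cutoff = self.rng.randrange(1, len(prices) - min(HORIZONS))
        labels = {}
        for h in HORIZONS:
            if t_cutoff + h < len(prices):  # horizon is still in range
                labels[h] = prices[t_cutoff + h] / prices[t_cutoff] - 1.0
            else:
                labels[h] = None            # masked in the loss later
        return {"context": prices[:t_cutoff],
                "class_id": token["class_id"],
                "labels": labels}

def collate(batch):
    """Collator only stacks what the loader produced; no label logic."""
    return {k: [sample[k] for sample in batch] for k in batch[0]}
```

The model then consumes `labels` directly; horizons that run past the end of the token stay `None` and are masked out of the auxiliary loss.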
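The threshold-free polarity rule from item 7 fits in a few lines. A sketch, assuming `returns_by_horizon` holds per-horizon future returns with `None` for horizons that run past the end of the token:

```python
def context_polarity(returns_by_horizon):
    """'positive' if the max valid horizon return is > 0, else 'negative'.

    Threshold-free by design: no magic pump cutoff, just the sign of the
    best achievable future return across the horizons still in range.
    """
    valid = [r for r in returns_by_horizon.values() if r is not None]
    if not valid:
        return None  # no valid horizon at all: sample is unusable
    return "positive" if max(valid) > 0 else "negative"

context_polarity({10: -0.2, 50: 0.8, 200: None})  # → "positive"
context_polarity({10: -0.2, 50: -0.1})            # → "negative"
```

Because only the sign of the best valid horizon is used, this stays stable even when the per-horizon thresholds are still unresolved.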
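And the cache plan from item 8 can be sketched as a pure planning function that the cache builder would execute against. Class ids, budgets, and ratios below are hypothetical placeholders:

```python
def plan_cache_quotas(per_class_budget, positive_ratio_by_class):
    """Exact per-class sample budgets, split into positive/negative quotas.

    Classes absent from `positive_ratio_by_class` (e.g. garbage class 0)
    get no forced positive quota: their polarity split is unconstrained.
    """
    plan = {}
    for class_id, budget in per_class_budget.items():
        ratio = positive_ratio_by_class.get(class_id)
        if ratio is None:
            plan[class_id] = {"total": budget,
                              "positive": None, "negative": None}
        else:
            pos = round(budget * ratio)
            plan[class_id] = {"total": budget,
                              "positive": pos, "negative": budget - pos}
    return plan

# Example: equal exact budgets, positive quotas only for classes 3-5.
plan = plan_cache_quotas(
    per_class_budget={0: 20_000, 3: 20_000, 4: 20_000, 5: 20_000},
    positive_ratio_by_class={3: 0.3, 4: 0.4, 5: 0.5},
)
```

Keeping the plan separate from sampling is what makes cache generation "deterministic at the planning level" while `T_cutoff` itself stays random inside each requested slot.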