CLIWorks committed about 21 hours ago

Commit

d8bc908

verified ·

1 Parent(s): 10d05e0

Upload folder using huggingface_hub

This view is limited to 50 files because it contains too many changes. See raw diff

Files changed (50)

REVIEW.md +224 -0
arbitor.egg-info/PKG-INFO +18 -0
arbitor.egg-info/SOURCES.txt +104 -0
arbitor.egg-info/dependency_links.txt +1 -0
arbitor.egg-info/requires.txt +16 -0
arbitor.egg-info/top_level.txt +6 -0
arbitor/__init__.py +35 -0
arbitor/attention/__init__.py +15 -0
arbitor/attention/context_attention.py +109 -0
arbitor/attention/frame_buffer.py +78 -0
arbitor/attention/kq_cache.py +30 -0
arbitor/attention/kv_ledger.py +57 -0
arbitor/attention/mla.py +176 -0
arbitor/attention/ring_buffer.py +49 -0
arbitor/components.py +1218 -0
arbitor/config.py +125 -0
arbitor/converters/convert_to_ternary2.py +81 -0
arbitor/converters/convert_to_ternary54.py +120 -0
arbitor/converters/convert_to_ternary64.py +111 -0
arbitor/converters/convert_to_ternary8.py +101 -0
arbitor/decoders.py +231 -0
arbitor/encoders/__init__.py +11 -0
arbitor/encoders/audio.py +83 -0
arbitor/encoders/mel_frontend.py +70 -0
arbitor/encoders/models/__init__.py +86 -0
arbitor/encoders/models/download.py +132 -0
arbitor/encoders/models/opensora-vae/config.json +35 -0
arbitor/encoders/models/opensora-vae/model.safetensors +3 -0
arbitor/encoders/models/pig-vae/model.safetensors +3 -0
arbitor/encoders/opensora_vae.py +145 -0
arbitor/encoders/opensora_vae_modules/autoencoder_2d.py +339 -0
arbitor/encoders/opensora_vae_modules/autoencoder_kl_causal_3d.py +638 -0
arbitor/encoders/opensora_vae_modules/registry.py +41 -0
arbitor/encoders/opensora_vae_modules/unet_causal_3d_blocks.py +476 -0
arbitor/encoders/opensora_vae_modules/vae.py +340 -0
arbitor/encoders/pig_vae.py +148 -0
arbitor/encoders/vae2d.py +56 -0
arbitor/kernel/flash_vq.py +510 -0
arbitor/kernel/ternary_audit.py +192 -0
arbitor/kernel/ternary_scale.py +1811 -0
arbitor/kernel/triton_video.py +75 -0
arbitor/main.py +585 -0
arbitor/optim/__init__.py +0 -0
arbitor/optim/sign_sgd.py +45 -0
arbitor/profiling.py +196 -0
arbitor/sequencers.py +218 -0
arbitor/vq.py +89 -0
docs/ARB-RENAME-NOTE.md +62 -0
docs/arbs-tts/README.md +90 -0
docs/benchmarks/BENCHMARK.md +151 -0

REVIEW.md ADDED

	@@ -0,0 +1,224 @@

+# ARBS Code Audit: Dead Imports, Dead Code, and Triton Kernel Analysis
+**Reviewed:** 2026-05-20T00:00:00Z
+**Depth:** standard
+**Files Reviewed:** 10
+## Summary
+The ARBS codebase has **3 BLOCKER bugs** that will cause runtime crashes, **8 unused class/function definitions** (dead code), **7 dead Triton kernels in components.py** that should be moved to `arbitor/kernel/`, and **21+ unused imports** across files. Two missing function definitions (`_graph_gather_add`, `_moe_dense_combine`) exist in dead code paths but would crash if those paths were ever activated.
+---
+## BLOCKER Issues
+### CR-01: `_TernaryLinearFn.forward` references undefined `x_2d` (NameError at runtime)
+**File:** `arbitor/kernel/ternary_scale.py:206-208`
+**Issue:** The TileLang `_TernaryLinearFn.forward()` method references `x_2d` on lines 206-208, but `x_2d` is never defined in the method's scope. This will cause a `NameError` at runtime if the TileLang code path is taken in `TernaryScaleTensor.forward` (line 1069). The Triton variant `_TritonTernaryLinearFn` (line 878) correctly defines `x_2d = x.reshape(-1, k_in).contiguous()` before use, so this was likely an omission when the TileLang function was written.
+```python
+# Line 206 — NameError: name 'x_2d' is not defined
+M = x_2d.shape[0]
+output = torch.empty(M, N, device=x.device, dtype=torch.float32)
+fwd_kernel(x_2d.half(), T_packed, E, output)
+```
+**Fix:** Add `x_2d = x.reshape(-1, K).contiguous()` before line 206:
+```python
+with torch.no_grad():
+    N, K = shape
+    x_2d = x.reshape(-1, K).contiguous()  # missing definition
+    M = x_2d.shape[0]
+    output = torch.empty(M, N, device=x.device, dtype=torch.float32)
+    fwd_kernel(x_2d.half(), T_packed, E, output)
+```
+---
+### CR-02: `_check_tilelang_finite` called but never defined (NameError at runtime)
+**File:** `arbitor/kernel/ternary_scale.py:1072`
+**Issue:** `_check_tilelang_finite()` is called in `TernaryScaleTensor.forward()` but is never defined anywhere in the codebase. This will cause a `NameError` at runtime when the TileLang path is active and the kernel produces valid output (the check is specifically gated by `_HAS_TILELANG` being True).
+**Fix:** Either define the function (if the check is intentional) or remove the call:
+```python
+# Replace line 1072 with a direct finiteness check or remove
+if not torch.isfinite(y).all():
+    raise FloatingPointError("TileLang ternary kernel produced non-finite activations")
+```
+---
+### CR-03: `self.modality_gate` used but never assigned (AttributeError at runtime)
+**File:** `arbitor/main.py:129-130`
+**Issue:** `ARBModel.forward()` references `self.modality_gate` but it is never assigned in `ARBModel.__init__()`. While `ModalityGate` is imported at line 19, it is never instantiated and stored as `self.modality_gate`. This will cause an `AttributeError` on any forward pass where `self.modality_gate is not None` is evaluated.
+The code at lines 129-132:
+```python
+if self.modality_gate is not None:
+    gate_weights, active_count, hops = self.modality_gate(active_mods)
+else:
+    gate_weights, active_count, hops = {}, len(active_mods), 1
+```
+**Fix:** Add `self.modality_gate = ModalityGate()` in `ARBModel.__init__()` (or assign `self.modality_gate = None` if the gate should be optional):
+```python
+# In ARBModel.__init__, after line 78:
+self.modality_gate = ModalityGate()
+```
+---
+## WARNING: Undefined Functions in Dead Code
+### WR-01: `_graph_gather_add` called but never defined
+**File:** `arbitor/components.py:739`
+**Issue:** `TernaryGraph.forward()` calls `_graph_gather_add(vq_output, node_features, vq_indices)` but this function is never defined anywhere in the codebase. `TernaryGraph` is dead code (never imported or used), so this does not crash currently, but it blocks any future use of `TernaryGraph`.
+**Fix:** Define `_graph_gather_add` or remove the dead class.
+---
+### WR-02: `_moe_dense_combine` called but never defined
+**File:** `arbitor/components.py:941`
+**Issue:** `SharedProjectionMoE.forward()` calls `_moe_dense_combine(torch.stack(...), topk_idx, topk_weights)` but this function is never defined. `SharedProjectionMoE` is dead code, but the missing function is a latent bug.
+**Fix:** Define `_moe_dense_combine` or remove the dead class.
+---
+## WARNING: Unused Class/Function Definitions (Dead Code)
+### WR-03: `TernaryLSTMCell` class — defined but never used
+**File:** `arbitor/components.py:189-207`
+**Issue:** `TernaryLSTMCell` is defined and re-exported from `__init__.py` (line 23) but is never instantiated anywhere in the codebase. The model uses `MoEGraph` with attention (MLA) instead of LSTM-based processing.
+---
+### WR-04: `TernaryGraph` class — defined but never used
+**File:** `arbitor/components.py:665-802`
+**Issue:** `TernaryGraph` is defined in `components.py` but never imported or instantiated. It was replaced by `MoEGraph` (line 1342). The only reference is in a comment (line 1348).
+**Also:** `TernaryGraph` references the undefined function `_graph_gather_add` (see WR-01), so it cannot function even if someone tried to use it.
+---
+### WR-05: `SharedProjectionMoE` class — defined but never used
+**File:** `arbitor/components.py:806-999`
+**Issue:** `SharedProjectionMoE` is defined in `components.py` but never imported or instantiated. It was replaced by `MoEGraph._run_expert()` (line 1429). The only reference is in a comment (line 1348).
+**Also:** References the undefined function `_moe_dense_combine` (see WR-02).
+---
+### WR-06: 7 dead Triton kernel functions in `components.py`
+**File:** `arbitor/components.py:266-386`
+**Issue:** These Triton kernel functions are defined inside the `if _HAS_TRITON:` block but are only referenced by their forward/backward wrapper functions which are themselves part of dead code (`TernaryGraph` and `SharedProjectionMoE`):
+| Line | Function | Used By |
+|------|----------|---------|
+| 268 | `_triton_graph_aggregate_fwd_kernel` | dead (TernaryGraph) |
+| 292 | `_triton_graph_aggregate_bwd_kernel` | dead (TernaryGraph) |
+| 316 | `_triton_graph_gather_add_fwd_kernel` | dead (TernaryGraph) |
+| 329 | `_triton_graph_gather_add_bwd_kernel` | dead (TernaryGraph) |
+| 342 | `_triton_moe_dense_combine_fwd_kernel` | dead (SharedProjectionMoE) |
+| 359 | `_triton_moe_dense_combine_bwd_expert_kernel` | dead (SharedProjectionMoE) |
+| 374 | `_triton_moe_dense_combine_bwd_weight_kernel` | dead (SharedProjectionMoE) |
+The live Triton kernels (`_triton_video_denoise_fwd_kernel` line 389, `_triton_video_denoise_bwd_kernel` line 402) are still in `components.py` and should also be moved to `arbitor/kernel/`.
+---
+### WR-07: `_triton_flash_vq_quantize_kernel` — dead Triton kernel
+**File:** `arbitor/kernel/flash_vq.py:370-402`
+**Issue:** This Triton kernel is defined but never called. The `_TritonFlashVQFn.forward()` method uses PyTorch's `embed[indices]` for the gather operation (line 468) instead of this kernel.
+---
+### WR-08: `TILE_SIZE = 384` — unused constant
+**File:** `arbitor/kernel/ternary_scale.py:949`
+**Issue:** `TILE_SIZE` is defined as a module-level constant but never referenced anywhere in the codebase.
+---
+## WARNING: Unused Imports
+### WR-09: Unused imports in `arbitor/main.py` (line 10)
+| Symbol | Used In File? |
+|--------|--------------|
+| `EMBEDDING_DIM` | No — not referenced in body |
+| `FFN_HIDDEN` | No — not referenced in body |
+| `CODEBOOK_DIM` | No — not referenced in body |
+| `ATTENTION_STRIDE` | No — not referenced in body |
+| `MG_N_EXPERTS` | No — MoEGraph uses default, not passed |
+| `MG_CORE_RANK` | No — MoEGraph uses default |
+| `MG_SHARED_INTER` | No — MoEGraph uses default |
+| `MG_ACT_ITERS` | No — MoEGraph uses default |
+---
+### WR-10: Unused imports in `arbitor/components.py` (line 21)
+| Symbol | Used In Live Code? | Note |
+|--------|-------------------|------|
+| `FFN_HIDDEN` | No | Not referenced in file body |
+| `CTX` | No | Not referenced in file body |
+| `THRESHOLD` | No | Only used in dead `TernaryGraph`. Live `MoEGraph` hardcodes `threshold=0.05` |
+| `KG_EMA_ALPHA` | No | Only used in dead `TernaryGraph`. Live `MoEGraph` hardcodes `0.99` |
+| `KG_REQUANT_EVERY` | No | Only used in dead `TernaryGraph`. Live `MoEGraph` hardcodes `50` |
+| `KG_TERNARY_THRESHOLD` | No | Only used in dead `TernaryGraph`. Live `MoEGraph` hardcodes `0.3` |
+---
+### WR-11: Unused imports in `arbitor/profiling.py` (line 17)
+| Symbol | Used In File? |
+|--------|--------------|
+| `VOCAB` | No — not referenced in body |
+| `math` (line 11) | No — not referenced in body |
+---
+## INFO: Triton Kernel Code in `components.py` Should Be Moved to `arbitor/kernel/`
+### IN-01: Live Triton kernels reside in `components.py` instead of `arbitor/kernel/`
+**File:** `arbitor/components.py:389-445`
+**Issue:** The codebase convention places Triton kernels in `arbitor/kernel/` (e.g., `ternary_scale.py`, `flash_vq.py`, `ternary_audit.py`). Two live Triton kernels and two related definitions (`_TritonVideoDenoiseFn`, `_video_denoise_step`) remain in `components.py`:
+- `_triton_video_denoise_fwd_kernel` (line 389)
+- `_triton_video_denoise_bwd_kernel` (line 402)
+- `_TritonVideoDenoiseFn` (line 415)
+- `_video_denoise_step` (line 448)
+These should be extracted into `arbitor/kernel/video_denoise.py` and imported from there, following the pattern established by `ternary_scale.py` and `flash_vq.py`.
+---
+## INFO: Additional Observations
+### IN-02: Hardcoded MoEGraph config values bypass config constants
+**File:** `arbitor/components.py:1381-1383`
+**Issue:** `MoEGraph` uses hardcoded values (`50`, `0.3`, `0.99`) instead of the imported config constants (`KG_REQUANT_EVERY`, `KG_TERNARY_THRESHOLD`, `KG_EMA_ALPHA`). The values happen to match the config, but any future config changes will silently be ignored.
+---
+### IN-03: `AUDIO_VOCAB` not used meaningfully in `config.py`
+**File:** `arbitor/config.py:2`
+**Issue:** `AUDIO_VOCAB=288` is imported and used in `TalkerHead` and `TinyNeuralCodec`, but the `SPECIAL_VOCAB` map (line 65) defines tokens up to 287. `AUDIO_VOCAB` = `VOCAB` = 288, meaning the audio head has the same vocabulary as the text head. This may be intentional for the current prototype but is worth flagging given `AUDIO_VOCAB` vs `VOCAB` are separate constants.
+---
+_Reviewed: 2026-05-20T00:00:00Z_
+_Reviewer: gsd-code-reviewer (deep analysis)_
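
The unused-import findings in WR-09 through WR-11 can be spot-checked mechanically. The following is a minimal sketch, assuming a simple AST pass is good enough for a first cut; the helper name `unused_imports` and the sample source are hypothetical and not part of the repository. It does not understand `__all__`, star imports, or deliberate re-exports, so a package `__init__.py` would produce false positives.

```python
import ast


def unused_imports(source: str) -> list[str]:
    """Return imported names that are never referenced in `source`."""
    tree = ast.parse(source)
    imported = {}  # bound name -> line number of the import statement
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # `import a.b` binds the top-level name `a`
                bound = alias.asname or alias.name.split(".")[0]
                imported[bound] = node.lineno
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported[alias.asname or alias.name] = node.lineno
    # Collect every bare name in the module; `math.pi` has the Name `math` at its root
    used = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    return sorted(name for name in imported if name not in used)


# Hypothetical sample module, loosely modelled on the findings above
sample = (
    "import math\n"
    "from config import VOCAB, HIDDEN_DIM\n"
    "def width():\n"
    "    return HIDDEN_DIM * 2\n"
)
print(unused_imports(sample))  # ['VOCAB', 'math']
```

A dedicated linter will be more thorough; the sketch only illustrates the kind of check that produces tables like the ones in the review.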

arbitor.egg-info/PKG-INFO ADDED

	@@ -0,0 +1,18 @@

+Metadata-Version: 2.4
+Name: arbitor
+Version: 0.2.0
+Summary: ARB (Any Relational Bit) — ternary-weighted neural network system
+License: MIT
+Requires-Python: >=3.12
+Requires-Dist: torch>=2.5
+Requires-Dist: einops
+Requires-Dist: tqdm
+Provides-Extra: dev
+Requires-Dist: pytest; extra == "dev"
+Provides-Extra: cuda
+Requires-Dist: torch>=2.5; extra == "cuda"
+Requires-Dist: triton>=3.0; extra == "cuda"
+Provides-Extra: triton
+Requires-Dist: triton>=3.0; extra == "triton"
+Provides-Extra: tilelang
+Requires-Dist: tilelang; extra == "tilelang"

arbitor.egg-info/SOURCES.txt ADDED

	@@ -0,0 +1,104 @@

+pyproject.toml
+arbitor/__init__.py
+arbitor/components.py
+arbitor/config.py
+arbitor/decoders.py
+arbitor/main.py
+arbitor/profiling.py
+arbitor/sequencers.py
+arbitor/vq.py
+arbitor.egg-info/PKG-INFO
+arbitor.egg-info/SOURCES.txt
+arbitor.egg-info/dependency_links.txt
+arbitor.egg-info/requires.txt
+arbitor.egg-info/top_level.txt
+arbitor/attention/__init__.py
+arbitor/attention/context_attention.py
+arbitor/attention/frame_buffer.py
+arbitor/attention/kq_cache.py
+arbitor/attention/kv_ledger.py
+arbitor/attention/mla.py
+arbitor/attention/ring_buffer.py
+arbitor/converters/convert_to_ternary2.py
+arbitor/converters/convert_to_ternary54.py
+arbitor/converters/convert_to_ternary64.py
+arbitor/converters/convert_to_ternary8.py
+arbitor/encoders/__init__.py
+arbitor/encoders/audio.py
+arbitor/encoders/mel_frontend.py
+arbitor/encoders/opensora_vae.py
+arbitor/encoders/pig_vae.py
+arbitor/encoders/vae2d.py
+arbitor/encoders/models/__init__.py
+arbitor/encoders/models/download.py
+arbitor/encoders/opensora_vae_modules/autoencoder_2d.py
+arbitor/encoders/opensora_vae_modules/autoencoder_kl_causal_3d.py
+arbitor/encoders/opensora_vae_modules/registry.py
+arbitor/encoders/opensora_vae_modules/unet_causal_3d_blocks.py
+arbitor/encoders/opensora_vae_modules/vae.py
+arbitor/kernel/flash_vq.py
+arbitor/kernel/ternary_audit.py
+arbitor/kernel/ternary_scale.py
+arbitor/kernel/triton_video.py
+arbitor/optim/__init__.py
+arbitor/optim/sign_sgd.py
+testing/bigcalc.py
+testing/scaled_optum.py
+testing/sign_gsd.py
+testing/test_200_step_smoke.py
+testing/test_bigint_ternary.py
+testing/test_gradient_capture.py
+testing/test_polarity_validation.py
+testing/test_tilelang_training.py
+testing/test_tscale.py
+testing/tscale_mini.py
+testing/attention/__init__.py
+testing/attention/test_kq_cache.py
+testing/attention/test_kv_cache.py
+testing/attention/test_lstm_removal.py
+testing/attention/test_lstm_removal_clean.py
+testing/attention/test_mla.py
+testing/attention/test_ring_buffer.py
+testing/benchmarks/benchmark.py
+testing/benchmarks/benchmark_phase2.py
+testing/benchmarks/benchmark_true_ternary.py
+testing/eval/eval_checkpoints.py
+testing/eval/eval_generation.py
+testing/eval/eval_metrics.py
+testing/eval/test_eval.py
+testing/kg/test_composite_head.py
+testing/kg/test_kg_edges.py
+testing/kg/test_kv_integration.py
+testing/model/audio-comprehension.py
+testing/model/health.py
+testing/model/image-comprehension.py
+testing/model/test-stp.py
+testing/model/test_arb.py
+testing/model/test_flash.py
+testing/model/test_tscale.py
+testing/model/text-comprehension.py
+testing/model/video-comprehension.py
+testing/vae/test_opensora_vae.py
+tests/test_cross_modal.py
+tests/test_lti.py
+tests/test_moegraph_topk.py
+tests/test_vae2d.py
+tests/test_vae2d_sequencer.py
+training/audio.py
+training/diffusion.py
+training/pretrain.py
+training/text.py
+training/vision.py
+training/data/__init__.py
+training/data/prepare_cc12m.py
+training/data/prepare_fineweb.py
+training/data/prepare_librispeech.py
+training/data/prepare_starcoder.py
+training/data/prepare_webvid.py
+training/data/tokenize_from_hf.py
+training/finetuning/__init__.py
+training/finetuning/audio.py
+training/finetuning/diffusion.py
+training/finetuning/lora.py
+training/finetuning/text.py
+training/finetuning/vision.py

arbitor.egg-info/dependency_links.txt ADDED

	@@ -0,0 +1 @@


+

arbitor.egg-info/requires.txt ADDED

	@@ -0,0 +1,16 @@

+torch>=2.5
+einops
+tqdm
+[cuda]
+torch>=2.5
+triton>=3.0
+[dev]
+pytest
+[tilelang]
+tilelang
+[triton]
+triton>=3.0

arbitor.egg-info/top_level.txt ADDED

	@@ -0,0 +1,6 @@

+arbitor
+docs
+models
+testing
+tests
+training

arbitor/__init__.py ADDED

	@@ -0,0 +1,35 @@

+"""ARBitor — Any Relational Bit System.
+Core package for the ARB ternary-weighted neural network.
+Quick import: from arbitor import ARBModel, VOCAB
+"""
+from .config import VOCAB, AUDIO_VOCAB, AUDIO_SR, AUDIO_FRAME_RATE, \
+    EMBEDDING_DIM, HIDDEN_DIM, CTX, SPECIAL_VOCAB, \
+    CODEBOOK_DIM, SHARED_VQ_SIZE, \
+    MG_N_EXPERTS, MG_CORE_RANK, MG_SHARED_INTER, MG_ACT_ITERS, \
+    MEMGRAM_STRUCT_PRIMES, MEMGRAM_CONV_PRIMES, MEMGRAM_EMBED_DIM, MEMGRAM_KEY_DIM
+from .kernel.ternary_scale import (
+    TernaryScaleTensor, TernaryRMSNorm, TScaleType, GROUP_SIZES,
+    _HAS_TRITON, _HAS_TILELANG,
+)
+from .kernel.flash_vq import FlashVQCodebook
+from .kernel.ternary_audit import audit_model, format_audit, freeze_float_parameters, trainable_parameters
+from .converters.convert_to_ternary8 import pack_ternary, unpack_ternary
+from .sequencers import ByteEmbedding, Sequencer, TextSequencer, VAE2DSequencer, VAEAudioSequencer, MultimodalSequencer
+from .vq import SharedVQ
+from .components import (
+    TernaryEmbeddingTable, TernaryVQCodebook,
+    GNNLoRAAdapter, HaltingUnit,
+    MemGram, MoEGraph,
+    ByteHead, OutputRouter,
+    LossComponents, LossWeights, StickyZoneSTE,
+    KGVQCodebook, CompositeProposalHead,
+    _BOUNDARY_TOKEN_MAP,
+)
+from .decoders import VideoHead, TalkerHead, MRFBlock, TinyNeuralCodec
+from .main import ARBModel, _extract_boundary_from_input
+# Re-export encoders
+from .encoders import TinyNeuralCodec as Codec, AudioVQEncoder, load_vae, VAEWrapper

arbitor/attention/__init__.py ADDED

	@@ -0,0 +1,15 @@

+"""ARB Attention — KV Ledger, MLA, Sliding Window Attention."""
+from .ring_buffer import GPURingBuffer
+from .kv_ledger import KVLedger
+from .kq_cache import KQCache
+from .mla import (MultiHeadLatentAttention, apply_rotary_emb,
+                  precompute_freqs_cis)
+from .context_attention import ContextAttentionScheduler
+from .frame_buffer import TemporalFrameBuffer
+__all__ = [
+    "GPURingBuffer", "KVLedger", "KQCache",
+    "MultiHeadLatentAttention", "apply_rotary_emb",
+    "precompute_freqs_cis", "ContextAttentionScheduler",
+    "TemporalFrameBuffer",
+]

arbitor/attention/context_attention.py ADDED

	@@ -0,0 +1,109 @@

+"""Context Attention Scheduler — sliding window + full context orchestration.
+Schedules 4 sliding window (d=64, CSA-compressed to d=16) and 4 full context
+(d=32, HCA-compressed to d=8) MLA attention passes. Combines both via gating.
+Pipeline: GNN output → ContextAttentionScheduler → MoE input
+"""
+import torch
+import torch.nn as nn
+from ..config import HIDDEN_DIM
+from ..kernel.ternary_scale import TernaryScaleTensor, TernaryRMSNorm, TScaleType
+from .mla import (MultiHeadLatentAttention, precompute_freqs_cis,
+                  MLA_N_LAYERS, MLA_N_HEADS, MLA_SLIDE_DIM, MLA_FULL_DIM,
+                  MLA_QK_NOPE_HEAD_DIM, MLA_QK_ROPE_HEAD_DIM,
+                  MLA_V_HEAD_DIM, MLA_ROPE_THETA,
+                  MLA_CSA_DIM, MLA_HCA_DIM, MLA_HCA_STRIDE)
+SLIDING_WINDOW_SIZE = 32768
+KV_LEDGER_SIZE = 262144
+class ContextAttentionScheduler(nn.Module):
+    def __init__(self, dim=HIDDEN_DIM):
+        super().__init__()
+        self.dim = dim
+        # Slide layers with CSA compression (d=64 → d=16) — half of total layers
+        n_layers_per_pass = max(1, MLA_N_LAYERS // 2)
+        self.slide_layers = nn.ModuleList([
+            MultiHeadLatentAttention(
+                dim=dim, n_heads=MLA_N_HEADS, kv_lora_rank=MLA_SLIDE_DIM,
+                qk_nope_head_dim=MLA_QK_NOPE_HEAD_DIM,
+                qk_rope_head_dim=MLA_QK_ROPE_HEAD_DIM,
+                v_head_dim=MLA_V_HEAD_DIM,
+                csa_dim=MLA_CSA_DIM, hca_dim=None,
+            ) for _ in range(n_layers_per_pass)
+        ])
+        # CSA: embed motif IDs → kv_lora_rank, then compress → csa_dim
+        self.slide_embed = TernaryScaleTensor(1, MLA_SLIDE_DIM, tscale_type=TScaleType.T32)
+        self.slide_compress = TernaryScaleTensor(MLA_SLIDE_DIM, MLA_CSA_DIM, tscale_type=TScaleType.T32)
+        # Full context layers with HCA compression (d=32 → d=8) — half of total layers
+        self.full_layers = nn.ModuleList([
+            MultiHeadLatentAttention(
+                dim=dim, n_heads=MLA_N_HEADS, kv_lora_rank=MLA_FULL_DIM,
+                qk_nope_head_dim=MLA_QK_NOPE_HEAD_DIM,
+                qk_rope_head_dim=MLA_QK_ROPE_HEAD_DIM,
+                v_head_dim=MLA_V_HEAD_DIM,
+                csa_dim=None, hca_dim=MLA_HCA_DIM,
+            ) for _ in range(n_layers_per_pass)
+        ])
+        # HCA: embed motif IDs → kv_lora_rank, then compress → hca_dim
+        self.full_embed = TernaryScaleTensor(1, MLA_FULL_DIM, tscale_type=TScaleType.T32)
+        self.full_compress = TernaryScaleTensor(MLA_FULL_DIM, MLA_HCA_DIM, tscale_type=TScaleType.T32)
+        self.gate = TernaryScaleTensor(dim, 1, tscale_type=TScaleType.T32)
+        self._freqs_cis = None
+        self._max_freq_len = 0
+    def _ensure_freqs(self, seq_len, device):
+        needed = max(seq_len, SLIDING_WINDOW_SIZE, KV_LEDGER_SIZE)
+        if self._freqs_cis is None or needed > self._max_freq_len:
+            self._max_freq_len = needed
+            self._freqs_cis = precompute_freqs_cis(
+                MLA_QK_ROPE_HEAD_DIM, needed, theta=MLA_ROPE_THETA
+            ).to(device)
+        return self._freqs_cis
+    def forward(self, x, kv_ledger, full_ledger=None, kq_cache=None):
+        bsz, seqlen, _ = x.shape
+        device = x.device
+        freqs_cis = self._ensure_freqs(seqlen, device)
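+        # KVLedger defines __len__, so an empty ledger is falsy; test for None explicitly
+        full_ledger = kv_ledger if full_ledger is None else full_ledger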
+        window_size = min(SLIDING_WINDOW_SIZE, kv_ledger.size) if kv_ledger.size > 0 else 0
+        out_slide = x
+        if window_size > 0:
+            start = max(0, kv_ledger.size - SLIDING_WINDOW_SIZE)
+            end = kv_ledger.size
+            slide_ids = kv_ledger.get_range(start, end).float().unsqueeze(-1)
+            # Embed to kv_lora_rank, then CSA compress to csa_dim
+            slide_latent = self.slide_embed(slide_ids)
+            csa_cache = self.slide_compress(slide_latent)
+            pe_cache = torch.zeros(csa_cache.shape[0], MLA_QK_ROPE_HEAD_DIM, device=device)
+            for layer in self.slide_layers:
+                out_slide = layer(out_slide, slide_latent, pe_cache,
+                                start_pos=0, freqs_cis=freqs_cis, mask=None,
+                                csa_cache=csa_cache)
+        out_full = x
+        if full_ledger.size > 0:
+            full = full_ledger.get_sparse(stride=MLA_HCA_STRIDE)
+            full_ids = full.float().unsqueeze(-1)
+            full_latent = self.full_embed(full_ids)
+            hca_cache = self.full_compress(full_latent)
+            pe_cache = torch.zeros(hca_cache.shape[0], MLA_QK_ROPE_HEAD_DIM, device=device)
+            for layer in self.full_layers:
+                out_full = layer(out_full, full_latent, pe_cache,
+                               start_pos=0, freqs_cis=freqs_cis, mask=None,
+                               hca_cache=hca_cache, hca_pe_cache=pe_cache)
+        gate = torch.sigmoid(self.gate(x.mean(dim=1, keepdim=True)))
+        out = gate * out_slide + (1 - gate) * out_full
+        return out
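
A note on how `forward` defaults `full_ledger`: `KVLedger` defines `__len__`, so an empty ledger is falsy (assuming its size is a plain integer count). An `or`-based default and an `is None` check therefore treat an empty ledger differently. The pure-Python sketch below uses a hypothetical `Ledger` stand-in, not the real class, to show the difference.

```python
class Ledger:
    """Hypothetical stand-in for a container that defines __len__ (not the real KVLedger)."""

    def __init__(self, name, items=()):
        self.name = name
        self.items = list(items)

    def __len__(self):
        return len(self.items)


def pick_with_or(full, fallback):
    # Falls back whenever `full` is falsy, which includes an empty ledger
    return full or fallback


def pick_with_is_none(full, fallback):
    # Falls back only when the argument was omitted
    return fallback if full is None else full


kv = Ledger("kv", [1, 2, 3])
empty_full = Ledger("full")

print(pick_with_or(empty_full, kv).name)       # kv
print(pick_with_is_none(empty_full, kv).name)  # full
print(pick_with_is_none(None, kv).name)        # kv
```

Which behaviour is wanted depends on whether an explicitly passed but still empty full-context ledger should be respected or replaced by the sliding-window ledger.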

arbitor/attention/frame_buffer.py ADDED

	@@ -0,0 +1,78 @@

+"""TemporalFrameBuffer — ring buffer for video latents with HCA compression.
+Stores the last N video latents (local) and maintains a compressed long-range
+cache via TernaryScaleTensor projection. Used for conditioning video generation
+on previous time steps.
+Latent shape: [B, C, H', W'] where C=OPEN_SORA_LATENT_CHANNELS=4,
+H'=VIDEO_HEIGHT=32, W'=VIDEO_WIDTH=32. Each "latent" is one 4-frame chunk.
+"""
+import torch
+import torch.nn as nn
+from ..kernel.ternary_scale import TernaryScaleTensor, TScaleType
+from .ring_buffer import GPURingBuffer
+from ..config import FRAME_BUFFER_LOCAL_SIZE, FRAME_BUFFER_CACHE_STRIDE, \
+    OPEN_SORA_LATENT_CHANNELS, VIDEO_HEIGHT, VIDEO_WIDTH
+class TemporalFrameBuffer(nn.Module):
+    def __init__(self, local_size=FRAME_BUFFER_LOCAL_SIZE,
+                 cache_stride=FRAME_BUFFER_CACHE_STRIDE,
+                 latent_channels=OPEN_SORA_LATENT_CHANNELS,
+                 height=VIDEO_HEIGHT, width=VIDEO_WIDTH,
+                 tscale_type=TScaleType.T32):
+        super().__init__()
+        self.latent_channels = latent_channels
+        self.spatial_dim = height * width
+        self.latent_flat_dim = latent_channels * self.spatial_dim
+        self.local = GPURingBuffer(
+            max_size=local_size,
+            dtype=torch.float32,
+            dim=self.latent_flat_dim,
+        )
+        self.compress_proj = TernaryScaleTensor(
+            self.latent_flat_dim,
+            self.latent_flat_dim // 4,
+            tscale_type=tscale_type,
+        )
+        self.compressed_cache = []
+        self.cache_stride = cache_stride
+        self._frames_since_compress = 0
+    def append(self, latent):
+        B = latent.shape[0]
+        flat = latent.reshape(B, -1)
+        self.local.append(flat)
+        self._frames_since_compress += 1
+        if self._frames_since_compress >= self.cache_stride:
+            compressed = self.compress_proj(flat)
+            self.compressed_cache.append(compressed.detach())
+            self._frames_since_compress = 0
+    def get_local(self, n=None):
+        n = n or self.local.max_size
+        result = self.local.get_last_n(n)
+        if result.dim() == 0 or result.shape[0] == 0:
+            return torch.zeros(0, 1, self.latent_flat_dim)
+        if result.dim() == 1:
+            result = result.unsqueeze(0)
+        return result
+    def get_compressed_cache(self):
+        if not self.compressed_cache:
+            return torch.zeros(0, 1, self.latent_flat_dim // 4)
+        return torch.stack(self.compressed_cache, dim=0)
+    def get_conditioning(self, n_local=None):
+        return {
+            "local": self.get_local(n_local),
+            "compressed": self.get_compressed_cache(),
+        }
+    def reset(self):
+        self.local.reset()
+        self.compressed_cache = []
+        self._frames_since_compress = 0

arbitor/attention/kq_cache.py ADDED

	@@ -0,0 +1,30 @@

+"""KQ Cache — small ring buffer of last 8K motif IDs for O(1) peek.
+Per D-64: Small ring buffer holding last 8K motif IDs. No compression - just raw IDs.
+O(1) peek for fast motif lookup without MemGram query.
+Per D-65: Updated after each ByteHead output append to ledger.
+"""
+import torch
+import torch.nn as nn
+from ..config import KQ_CACHE_SIZE
+from .ring_buffer import GPURingBuffer
+class KQCache(nn.Module):
+    def __init__(self, max_size=KQ_CACHE_SIZE):
+        super().__init__()
+        self.ring = GPURingBuffer(max_size=max_size, dtype=torch.int32, dim=1)
+    def append(self, motif_id: int):
+        self.ring.append(torch.tensor(motif_id, dtype=torch.int32, device=self.ring.buffer.device))
+    def peek(self, n=1):
+        return self.ring.get_last_n(n)
+    @property
+    def size(self):
+        return self.ring.size
+    def reset(self):
+        self.ring.reset()

arbitor/attention/kv_ledger.py ADDED

	@@ -0,0 +1,57 @@

+"""KV Ledger — append-only ring buffer of motif IDs (int32), max 256K entries.
+Per D-57: Append-only ring buffer of motif IDs (int32), max 256K entries.
+When full, oldest entries are overwritten. Stored as flat tensor on GPU.
+Per D-59: The ledger stores only what the model outputs (motif IDs),
+not input prompts. Prompts go through VQ -> GNN -> Motif pipeline first.
+KV is consumed by the ContextAttentionScheduler. Its output is injected into
+MoEGraph, which then conditions the router and output heads through the shared
+processed relational state.
+"""
+import torch
+import torch.nn as nn
+from ..config import KV_LEDGER_SIZE, SLIDING_WINDOW_SIZE
+from .ring_buffer import GPURingBuffer
+class KVLedger(nn.Module):
+    def __init__(self, max_size=KV_LEDGER_SIZE):
+        super().__init__()
+        self.ring = GPURingBuffer(max_size=max_size, dtype=torch.int32, dim=1)
+    def append(self, motif_id: int):
+        self.ring.append(torch.tensor(motif_id, dtype=torch.int32, device=self.ring.buffer.device))
+    def get_sliding_window(self, n=SLIDING_WINDOW_SIZE):
+        return self.ring.get_last_n(n)
+    def get_range(self, start, end):
+        n = end - start
+        if n <= 0 or start >= self.ring.size:
+            return torch.zeros(0, dtype=torch.int32, device=self.ring.buffer.device)
+        if start + n <= self.ring.max_size:
+            return self.ring.buffer[start:start + n].squeeze(-1)
+        first = self.ring.buffer[start:].squeeze(-1)
+        second = self.ring.buffer[:n - (self.ring.max_size - start)].squeeze(-1)
+        return torch.cat([first, second])
+    def get_sparse(self, stride=8):
+        size = self.ring.size
+        if size == 0:
+            return torch.zeros(0, dtype=torch.int32, device=self.ring.buffer.device)
+        all_vals = self.ring.get_all()
+        indices = torch.arange(0, size, stride, device=self.ring.buffer.device, dtype=torch.long)
+        indices = indices[indices < len(all_vals)]
+        return all_vals[indices]
+    @property
+    def size(self):
+        return self.ring.size
+    def __len__(self):
+        return self.ring.size
+    def reset(self):
+        self.ring.reset()
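
The ledger's docstring describes a ring buffer whose oldest entries are overwritten once it is full. After the first wrap, the physical slot order in the backing store no longer matches the logical (oldest-first) order, which matters for any method that slices the buffer by raw index, as `get_range` does. The toy class below is a pure-Python sketch of that distinction. It is hypothetical and is not the project's `GPURingBuffer`, whose internals are not shown in this view; the `head` pointer in particular is an assumption.

```python
class RingBuffer:
    """Toy fixed-capacity ring buffer (hypothetical; not the project's GPURingBuffer)."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.buffer = [0] * max_size
        self.head = 0   # next physical slot to write
        self.size = 0   # number of valid entries, capped at max_size

    def append(self, value):
        self.buffer[self.head] = value
        self.head = (self.head + 1) % self.max_size
        self.size = min(self.size + 1, self.max_size)

    def get_last_n(self, n):
        """Return up to n most recent entries, oldest first."""
        n = min(n, self.size)
        start = (self.head - n) % self.max_size
        return [self.buffer[(start + i) % self.max_size] for i in range(n)]

    def get_all(self):
        return self.get_last_n(self.size)

    def get_sparse(self, stride):
        """Every stride-th entry in logical (oldest-first) order."""
        return self.get_all()[::stride]


ring = RingBuffer(max_size=4)
for motif_id in range(1, 7):   # append 1..6; entries 1 and 2 get overwritten
    ring.append(motif_id)

print(ring.buffer)          # physical layout: [5, 6, 3, 4]
print(ring.get_all())       # logical order:   [3, 4, 5, 6]
print(ring.get_last_n(2))   # [5, 6]
print(ring.get_sparse(2))   # [3, 5]
```

Reading through `get_last_n` or `get_all` keeps callers in logical order; a raw slice such as `buffer[0:2]` would return `[5, 6]` here, which are the newest entries rather than the oldest.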

arbitor/attention/mla.py ADDED

	@@ -0,0 +1,176 @@

+"""Multi-Head Latent Attention with CSA + HCA compression (DeepSeek V4 style).
+Ternary-weighted. KV cache stores compressed latent at multiple levels:
+- Base: MLA latent (d=kv_lora_rank, typically 64/32)
+- CSA: Secondary compression (d_csa, e.g. 16) — 4x compression on cache
+- HCA: Heavily compressed (d_hca, e.g. 8) — 8x compression, wider stride
+Scores = q_nope_absorbed @ decompress(kv_cache) + q_pe @ pe_cache
+"""
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from ..config import HIDDEN_DIM, MLA_CSA_DIM, MLA_HCA_DIM, MLA_HCA_STRIDE, MLA_N_LAYERS
+from ..kernel.ternary_scale import TernaryScaleTensor, TernaryRMSNorm, TScaleType
+MLA_N_HEADS = 32
+MLA_QK_NOPE_HEAD_DIM = 96
+MLA_QK_ROPE_HEAD_DIM = 32
+MLA_V_HEAD_DIM = 96
+MLA_ROPE_THETA = 10000.0
+MLA_SLIDE_DIM = 64
+MLA_FULL_DIM = 32
+def apply_rotary_emb(x, freqs_cis):
+    x_complex = torch.view_as_complex(
+        x.float().reshape(*x.shape[:-1], -1, 2)
+    )
+    freqs = freqs_cis.unsqueeze(1).unsqueeze(0)
+    return torch.view_as_real(x_complex * freqs).flatten(-2).to(x.dtype)
+def precompute_freqs_cis(dim, end, theta=MLA_ROPE_THETA):
+    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
+    t = torch.arange(end, device=freqs.device)
+    freqs = torch.outer(t, freqs)
+    return torch.polar(torch.ones_like(freqs), freqs)
+class MultiHeadLatentAttention(nn.Module):
+    def __init__(self, dim=HIDDEN_DIM, n_heads=MLA_N_HEADS, kv_lora_rank=MLA_SLIDE_DIM,
+                 qk_nope_head_dim=MLA_QK_NOPE_HEAD_DIM, qk_rope_head_dim=MLA_QK_ROPE_HEAD_DIM,
+                 v_head_dim=MLA_V_HEAD_DIM, max_seq_len=65536,
+                 csa_dim=MLA_CSA_DIM, hca_dim=MLA_HCA_DIM,
+                 tscale_type=TScaleType.T32):
+        super().__init__()
+        self.dim = dim
+        self.n_heads = n_heads
+        self.kv_lora_rank = kv_lora_rank
+        self.qk_nope_head_dim = qk_nope_head_dim
+        self.qk_rope_head_dim = qk_rope_head_dim
+        self.qk_head_dim = qk_nope_head_dim + qk_rope_head_dim
+        self.v_head_dim = v_head_dim
+        self.softmax_scale = self.qk_head_dim ** -0.5
+        self.max_seq_len = max_seq_len
+        self.csa_dim = csa_dim
+        self.hca_dim = hca_dim
+        self.wq_norm = TernaryRMSNorm(dim, tscale_type=tscale_type)
+        self.wq = TernaryScaleTensor(dim, n_heads * self.qk_head_dim, tscale_type=tscale_type)
+        combined_out = n_heads * (qk_nope_head_dim + v_head_dim)
+        self.wkv_b = TernaryScaleTensor(kv_lora_rank, combined_out, tscale_type=tscale_type)
+        self.wo = TernaryScaleTensor(n_heads * v_head_dim, dim, tscale_type=tscale_type)
+        # CSA: secondary compression (kv_lora_rank -> csa_dim)
+        if csa_dim and csa_dim < kv_lora_rank:
+            self.csa_compress = TernaryScaleTensor(kv_lora_rank, csa_dim, tscale_type=tscale_type)
+            self.csa_decompress = TernaryScaleTensor(csa_dim, kv_lora_rank, tscale_type=tscale_type)
+        else:
+            self.csa_compress = None
+            self.csa_decompress = None
+        # HCA: heavily compressed (kv_lora_rank -> hca_dim)
+        if hca_dim and hca_dim < (csa_dim or kv_lora_rank):
+            self.hca_compress = TernaryScaleTensor(kv_lora_rank, hca_dim, tscale_type=tscale_type)
+            self.hca_decompress = TernaryScaleTensor(hca_dim, kv_lora_rank, tscale_type=tscale_type)
+        else:
+            self.hca_compress = None
+            self.hca_decompress = None
+    def _compress(self, kv_cache, compress_proj):
+        """Compress kv_cache from kv_lora_rank to smaller dim."""
+        return compress_proj(kv_cache)
+    def _decompress(self, cache, decompress_proj):
+        """Decompress cache back to kv_lora_rank."""
+        return decompress_proj(cache)
+    def _compute_scores(self, q_nope_absorbed, q_pe, kv_flat, pe_flat,
+                        start_pos, seqlen, mask):
+        """Shared score computation for base, CSA, and HCA attention."""
+        n_keys = min(kv_flat.shape[0], pe_flat.shape[0])
+        kv_flat = kv_flat[:n_keys]
+        pe_flat = pe_flat[:n_keys]
+        if n_keys == 0:
+            return q_pe.new_zeros(q_pe.shape[0], seqlen, q_pe.shape[2], 0)
+        scores = (
+            torch.einsum("bshc,btc->bsht",
+                         q_nope_absorbed, kv_flat.unsqueeze(0))
+            + torch.einsum("bshr,btr->bsht",
+                           q_pe, pe_flat.unsqueeze(0))
+        ) * self.softmax_scale
+        if mask is not None:  # mask assumed (seqlen, n_keys); scores is (batch, seqlen, heads, keys)
+            scores = scores + mask.unsqueeze(0).unsqueeze(2)
+        if mask is None and seqlen > 1:
+            causal = torch.triu(
+                torch.full((seqlen, n_keys), float('-inf'), device=q_pe.device),
+                diagonal=1 + start_pos
+            )
+            scores = scores + causal.unsqueeze(0).unsqueeze(2)
+        return scores
+    def forward(self, x, kv_cache, pe_cache, start_pos=0, freqs_cis=None, mask=None,
+                csa_cache=None, hca_cache=None, hca_pe_cache=None):
+        bsz, seqlen, _ = x.size()
+        q = self.wq(self.wq_norm(x))
+        q = q.view(bsz, seqlen, self.n_heads, self.qk_head_dim)
+        q_nope, q_pe = torch.split(
+            q, [self.qk_nope_head_dim, self.qk_rope_head_dim], dim=-1)
+        if freqs_cis is not None:
+            q_pe = apply_rotary_emb(q_pe, freqs_cis[start_pos:start_pos + seqlen])
+        wkv_b = self.wkv_b._get_T() * self.wkv_b._get_S()
+        wkv_b = wkv_b.view(self.n_heads, -1, self.kv_lora_rank)
+        q_nope_absorbed = torch.einsum(
+            "bshd,hdc->bshc", q_nope, wkv_b[:, :self.qk_nope_head_dim])
+        n_cache = min(kv_cache.shape[0], pe_cache.shape[0])
+        kv_flat = kv_cache[:n_cache]
+        pe_flat = pe_cache[:n_cache]
+        # Decompress CSA cache if provided (replaces base kv_cache)
+        if csa_cache is not None and self.csa_decompress is not None:
+            n_csa = min(csa_cache.shape[0], pe_flat.shape[0])
+            kv_flat = self._decompress(csa_cache[:n_csa], self.csa_decompress)
+            pe_flat = pe_flat[:n_csa]
+        # Base attention (exact, CSA-compressed if applicable)
+        scores = self._compute_scores(
+            q_nope_absorbed, q_pe, kv_flat, pe_flat,
+            start_pos, seqlen, mask,
+        )
+        scores = scores.softmax(dim=-1, dtype=torch.float32)
+        attn_out = torch.einsum(
+            "bsht,btc->bshc", scores, kv_flat.unsqueeze(0))
+        # HCA long-range attention (heavily compressed, strided)
+        hca_out = None
+        if hca_cache is not None and self.hca_decompress is not None:
+            hca_kv = self._decompress(hca_cache, self.hca_decompress)
+            if hca_pe_cache is None:
+                hca_pe = pe_cache[::MLA_HCA_STRIDE]
+            else:
+                hca_pe = hca_pe_cache
+            n_hca = min(hca_kv.shape[0], hca_pe.shape[0])
+            hca_kv = hca_kv[:n_hca]
+            hca_pe = hca_pe[:n_hca]
+            hca_scores = self._compute_scores(
+                q_nope_absorbed, q_pe, hca_kv, hca_pe,
+                start_pos, seqlen, mask=None,
+            )
+            hca_scores = hca_scores.softmax(dim=-1, dtype=torch.float32)
+            hca_out = torch.einsum(
+                "bsht,btc->bshc", hca_scores, hca_kv.unsqueeze(0))
+            attn_out = attn_out + hca_out
+        attn_unproj = torch.einsum(
+            "bshc,hdc->bshd", attn_out, wkv_b[:, -self.v_head_dim:])
+        return self.wo(attn_unproj.flatten(2))
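
A note on the causal branch of `_compute_scores`: when no mask is passed and `seqlen > 1`, the code builds `torch.triu(..., diagonal=1 + start_pos)`, which leaves key `j` visible to query row `i` exactly when `j <= i + start_pos`. The sketch below restates that rule in plain Python so the offset can be checked by hand. `visible_keys` is a hypothetical helper written for illustration; it is not part of `mla.py`.

```python
def visible_keys(seqlen, n_keys, start_pos):
    """Return, for each query row i, the key indices j that stay unmasked.

    Mirrors torch.triu(full(-inf), diagonal=1 + start_pos): entry (i, j)
    receives -inf when j - i >= 1 + start_pos, so it remains visible
    when j <= i + start_pos.
    """
    return [
        [j for j in range(n_keys) if j <= i + start_pos]
        for i in range(seqlen)
    ]


# Prefill from an empty cache: 3 queries over 3 keys (plain lower triangle).
prefill = visible_keys(seqlen=3, n_keys=3, start_pos=0)
print(prefill)

# Chunked step: 2 new queries at positions 4 and 5 over 6 cached keys.
chunk = visible_keys(seqlen=2, n_keys=6, start_pos=4)
print(chunk)
```

With `start_pos=0` this is the ordinary lower-triangular mask; a nonzero `start_pos` shifts the diagonal so that new queries can see every key already in the cache.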

arbitor/attention/ring_buffer.py ADDED

	@@ -0,0 +1,49 @@

+"""GPURingBuffer: generic GPU ring buffer utility.
+O(1) append via circular pointer, chronological get_last_n with wrap handling.
+The storage tensor is a registered buffer (device movement, state_dict); ptr and size are plain Python ints and are not serialized.
+"""
+import torch
+import torch.nn as nn
+class GPURingBuffer(nn.Module):
+    def __init__(self, max_size: int, dtype: torch.dtype = torch.int32, dim: int = 1):
+        super().__init__()
+        self.max_size = max_size
+        self.ptr = 0
+        self.size = 0
+        buffer_shape = (max_size, dim if dim > 1 else 1)
+        self.register_buffer("buffer", torch.zeros(buffer_shape, dtype=dtype))
+    def append(self, x):
+        if not isinstance(x, torch.Tensor):
+            x = torch.tensor(x, dtype=self.buffer.dtype, device=self.buffer.device)
+        if self.buffer.dim() == 2 and x.dim() == 0:
+            x = x.view(1)
+        self.buffer[self.ptr] = x
+        self.ptr = (self.ptr + 1) % self.max_size
+        self.size = min(self.size + 1, self.max_size)
+    def get_last_n(self, n: int):
+        n = min(n, self.size)
+        if n == 0:
+            return torch.zeros(0, *self.buffer.shape[1:], dtype=self.buffer.dtype, device=self.buffer.device)
+        start = (self.ptr - n) % self.max_size
+        if start + n <= self.max_size:
+            result = self.buffer[start:start + n]
+        else:
+            first = self.buffer[start:]
+            second = self.buffer[:n - (self.max_size - start)]
+            result = torch.cat([first, second], dim=0)
+        if result.dim() > 1 and result.shape[1] == 1:
+            result = result.squeeze(-1)
+        return result
+    def get_all(self):
+        return self.get_last_n(self.size)
+    def reset(self):
+        self.buffer.zero_()
+        self.ptr = 0
+        self.size = 0
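
The wrap handling in `get_last_n` is easiest to follow on a small example. The sketch below reproduces the same pointer arithmetic with a Python list in place of the registered tensor; `ListRing` is a hypothetical stand-in for illustration, not the class in this file, and it omits the dtype, device, and squeeze handling.

```python
class ListRing:
    """List-backed stand-in that mirrors GPURingBuffer's index arithmetic."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.buf = [0] * max_size
        self.ptr = 0   # next slot to overwrite
        self.size = 0  # number of valid entries, capped at max_size

    def append(self, x):
        self.buf[self.ptr] = x
        self.ptr = (self.ptr + 1) % self.max_size
        self.size = min(self.size + 1, self.max_size)

    def get_last_n(self, n):
        n = min(n, self.size)
        start = (self.ptr - n) % self.max_size
        if start + n <= self.max_size:
            # The requested window is contiguous in storage.
            return self.buf[start:start + n]
        # The window wraps: take the tail of storage, then the head.
        return self.buf[start:] + self.buf[:n - (self.max_size - start)]


ring = ListRing(4)
for value in range(6):  # values 0 and 1 get overwritten by 4 and 5
    ring.append(value)

print(ring.buf)            # physical layout after wrapping
print(ring.get_last_n(3))  # newest three values in chronological order
```

Reads return values oldest to newest regardless of where the write pointer sits, which is the property the docstring calls chronological.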

arbitor/components.py ADDED

	@@ -0,0 +1,1218 @@

+"""Components — core neural network modules for the ARB system."""
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from einops import rearrange
+from .kernel.ternary_scale import TernaryScaleTensor, TScaleType, TernaryRMSNorm, GROUP_SIZES, _COMPONENT_CONTEXT, _HAS_TRITON
+try:
+    from .kernel.ternary_scale import _TritonTernaryEmbedFn
+except ImportError:
+    _TritonTernaryEmbedFn = None
+from .converters.convert_to_ternary8 import pack_ternary, unpack_ternary
+from dataclasses import dataclass, field, fields
+from math import ceil as _ceil, log2 as _log2
+from transformers import AutoModel, AutoFeatureExtractor
+from .config import VOCAB, EMBEDDING_DIM, HIDDEN_DIM, AUDIO_VOCAB, AUDIO_SR, AUDIO_FRAME_RATE, SPECIAL_VOCAB, CODEBOOK_DIM, CODEBOOK_SIZE, FFN_HIDDEN, CTX, THRESHOLD, KG_EMA_ALPHA, KG_REQUANT_EVERY, KG_TERNARY_THRESHOLD, KGVQ_CODEBOOK_SIZE, KGVQ_CODEBOOK_DIM, KGVQ_DECAY, KGVQ_COMMITMENT_WEIGHT, KGVQ_DEAD_CODE_THRESHOLD, K_MAX_COMPOSITES, MG_N_EXPERTS, MG_CORE_RANK, MG_SHARED_INTER, MG_ACT_ITERS, MG_WORKSPACE_DIM, BYTEHEAD_ACT_MAX_ITERS, BYTEHEAD_ACT_HALT_CONSECUTIVE
+_ceil_div = lambda a, b: -(-a // b) if b > 0 else 0  # integer ceil division; avoids float rounding on large counts
+from .sequencers import ByteEmbedding
+@dataclass
+class LossWeights:
+    lm: float = 1.0
+    vq_commitment: float = 1.0
+    moe_aux: float = 1.0
+    graph_l1: float = 0.001
+    graph_ponder: float = 1.0
+    moe_ponder: float = 1.0
+    moegraph_ponder: float = 1.0
+    memgram_decay_reg: float = 0.01
+    composite_vq: float = 1.0
+@dataclass
+class LossComponents:
+    lm: torch.Tensor = None
+    vq_commitment: torch.Tensor = None
+    moe_aux: torch.Tensor = None
+    graph_l1: torch.Tensor = None
+    graph_ponder: torch.Tensor = None
+    moe_ponder: torch.Tensor = None
+    moegraph_ponder: torch.Tensor = None
+    memgram_decay_reg: torch.Tensor = None
+    composite_vq: torch.Tensor = None
+    weights: LossWeights = field(default_factory=LossWeights)
+    @property
+    def total(self) -> torch.Tensor:
+        w = self.weights
+        loss = None
+        def add_component(current, weight, component):
+            if component is None:
+                return current
+            weighted = weight * component
+            return weighted if current is None else current + weighted
+        loss = add_component(loss, w.lm, self.lm)
+        loss = add_component(loss, w.vq_commitment, self.vq_commitment)
+        loss = add_component(loss, w.moe_aux, self.moe_aux)
+        loss = add_component(loss, w.graph_l1, self.graph_l1)
+        loss = add_component(loss, w.graph_ponder, self.graph_ponder)
+        loss = add_component(loss, w.moe_ponder, self.moe_ponder)
+        loss = add_component(loss, w.moegraph_ponder, self.moegraph_ponder)
+        loss = add_component(loss, w.memgram_decay_reg, self.memgram_decay_reg)
+        loss = add_component(loss, w.composite_vq, self.composite_vq)
+        if loss is None:
+            raise ValueError("LossComponents.total requested with no active loss tensors")
+        return loss
+    @property
+    def active_fields(self) -> list[tuple[str, torch.Tensor, float]]:
+        result = []
+        for f in fields(self):  # `f` avoids shadowing dataclasses.field
+            name = f.name
+            if name == 'weights':
+                continue
+            tensor = getattr(self, name)
+            if tensor is not None:
+                weight = getattr(self.weights, name)
+                result.append((name, tensor, weight))
+        return result
+    def log(self, writer, step, prefix="loss"):
+        writer.add_scalar(f"{prefix}/total", self.total.item(), step)
+        if self.lm is not None:
+            writer.add_scalar(f"{prefix}/lm", self.lm.item(), step)
+        if self.vq_commitment is not None:
+            writer.add_scalar(f"{prefix}/vq_commitment", self.vq_commitment.item(), step)
+        if self.moe_aux is not None:
+            writer.add_scalar(f"{prefix}/moe_aux", self.moe_aux.item(), step)
+        if self.graph_l1 is not None:
+            writer.add_scalar(f"{prefix}/graph_l1", self.graph_l1.item(), step)
+        if self.graph_ponder is not None:
+            writer.add_scalar(f"{prefix}/graph_ponder", self.graph_ponder.item(), step)
+        if self.moe_ponder is not None:
+            writer.add_scalar(f"{prefix}/moe_ponder", self.moe_ponder.item(), step)
+        if self.moegraph_ponder is not None:
+            writer.add_scalar(f"{prefix}/moegraph_ponder", self.moegraph_ponder.item(), step)
+        if self.memgram_decay_reg is not None:
+            writer.add_scalar(f"{prefix}/memgram_decay_reg", self.memgram_decay_reg.item(), step)
+        if self.composite_vq is not None:
+            writer.add_scalar(f"{prefix}/composite_vq", self.composite_vq.item(), step)
+    def backward(self, retain_graph=False):
+        self.total.backward(retain_graph=retain_graph)
+class StickyZoneSTE(torch.autograd.Function):
+    @staticmethod
+    def forward(ctx, w, threshold):
+        ctx.save_for_backward(w, torch.tensor(threshold))
+        return w.sign() * (w.abs() > threshold).to(w.dtype)
+    @staticmethod
+    def backward(ctx, grad_output):
+        w, threshold_tensor = ctx.saved_tensors
+        threshold = threshold_tensor.item()
+        ratio = torch.clamp(w.abs() / threshold, 0.0, 1.0)
+        return grad_output * ratio, None
+class TernaryEmbeddingTable(nn.Module):
+    def __init__(self, num_embeddings, embedding_dim, tscale_type=TScaleType.T32,
+                 init_std=0.02, threshold=0.05, normalize=False):
+        super().__init__()
+        self.num_embeddings = num_embeddings
+        self.embedding_dim = embedding_dim
+        self.tscale_type = tscale_type
+        init_threshold = min(float(threshold), 0.5 * float(init_std)) if init_std > 0 else threshold
+        self.threshold = init_threshold
+        self.normalize = normalize
+        self.group_size = GROUP_SIZES.get(tscale_type, GROUP_SIZES[TScaleType.T64])
+        self.sparse_threshold = 65_536
+        if num_embeddings >= self.sparse_threshold:
+            n_trits = num_embeddings * embedding_dim
+            n_packed = _ceil_div(n_trits, 5)
+            packed_T = torch.randint(0, 243, (n_packed,), dtype=torch.uint8)
+            T_pad = n_packed * 5 - n_trits
+            gpr = _ceil_div(embedding_dim, self.group_size)
+            init_exp = int(round(_log2(max(init_std, 1e-8))))
+            self.register_buffer("T_packed", packed_T)
+            self.register_buffer("_T_shape", torch.tensor([num_embeddings, embedding_dim], dtype=torch.long))
+            self.register_buffer("_T_pad", torch.tensor(T_pad, dtype=torch.long))
+            self.register_buffer(
+                "E",
+                torch.full((num_embeddings * gpr,), init_exp, dtype=torch.int8),
+            )
+            self.register_buffer("E_accum", torch.zeros_like(self.E, dtype=torch.int8))
+            self.register_buffer("T_accum", torch.zeros(num_embeddings, embedding_dim, dtype=torch.int8))
+            self._ema_alpha: float = 0.1
+            self._loss_temp_scale: float = 1.0
+            return
+        w_init = torch.randn(num_embeddings, embedding_dim) * init_std
+        T_init = w_init.sign() * (w_init.abs() > init_threshold).to(w_init.dtype)
+        packed_T, _, T_pad = pack_ternary(T_init)
+        self.register_buffer("T_packed", packed_T)
+        self.register_buffer("_T_shape", torch.tensor([num_embeddings, embedding_dim], dtype=torch.long))
+        self.register_buffer("_T_pad", torch.tensor(T_pad, dtype=torch.long))
+        gpr = _ceil_div(embedding_dim, self.group_size)
+        total_in = gpr * self.group_size
+        padded = torch.zeros(num_embeddings, total_in)
+        padded[:, :embedding_dim] = w_init.abs()
+        grouped = padded.view(num_embeddings, gpr, self.group_size)
+        E_vals = torch.where(grouped.mean(dim=2) > 0, grouped.mean(dim=2), torch.ones(num_embeddings, gpr))
+        self.register_buffer("E", E_vals.flatten().log2().clamp(-128, 127).to(torch.int8))
+        self.register_buffer("E_accum", torch.zeros_like(self.E, dtype=torch.int8))
+        self.register_buffer("T_accum", torch.zeros(num_embeddings, embedding_dim, dtype=torch.int8))
+        self._ema_alpha: float = 0.1
+        self._loss_temp_scale: float = 1.0
+    def _get_T(self):
+        return unpack_ternary(self.T_packed, tuple(self._T_shape.tolist()), int(self._T_pad.item()))
+    def _get_T_rows(self, indices):
+        indices = indices.reshape(-1).to(device=self.T_packed.device, dtype=torch.long)
+        dim = self.embedding_dim
+        cols = torch.arange(dim, device=indices.device, dtype=torch.long)
+        lin = indices[:, None] * dim + cols[None, :]
+        pack_idx = lin // 5
+        trit_pos = lin - pack_idx * 5
+        packed = self.T_packed[pack_idx].to(torch.long)
+        divisors = torch.tensor([1, 3, 9, 27, 81], device=indices.device, dtype=torch.long)
+        code = (packed // divisors[trit_pos]) % 3
+        return (code.to(torch.int8) - 1)
+    def _expand_E_rows(self, indices):
+        indices = indices.reshape(-1).to(device=self.E.device, dtype=torch.long)
+        gpr = _ceil_div(self.embedding_dim, self.group_size)
+        E_rows = self.E.view(self.num_embeddings, gpr)[indices]
+        E_exp = E_rows.repeat_interleave(self.group_size, dim=1)
+        return E_exp[:, :self.embedding_dim]
+    @torch.no_grad()
+    def _set_T_rows(self, row_indices, rows):
+        row_indices = row_indices.reshape(-1).to(device=self.T_packed.device, dtype=torch.long)
+        rows = rows.to(device=self.T_packed.device, dtype=torch.int8).reshape(row_indices.numel(), self.embedding_dim)
+        divisors = [1, 3, 9, 27, 81]
+        for row_pos, row_idx in enumerate(row_indices.tolist()):
+            row = rows[row_pos]
+            for col in range(self.embedding_dim):
+                lin = row_idx * self.embedding_dim + col
+                pack_idx = lin // 5
+                trit_pos = lin - pack_idx * 5
+                divisor = divisors[trit_pos]
+                old = int(self.T_packed[pack_idx].item())
+                old_code = (old // divisor) % 3
+                new_code = int(row[col].item()) + 1
+                if old_code != new_code:
+                    self.T_packed[pack_idx] = old - old_code * divisor + new_code * divisor
+    def _expand_E(self):
+        out_dim, in_dim = tuple(self._T_shape.tolist())
+        gpr = _ceil_div(in_dim, self.group_size)
+        E_2d = self.E.view(out_dim, gpr)
+        E_exp = E_2d.repeat_interleave(self.group_size, dim=1)
+        return E_exp[:, :in_dim]
+    def _ensure_E_accum(self):
+        if not hasattr(self, "E_accum"):
+            self.register_buffer("E_accum", torch.zeros_like(self.E, dtype=torch.int8))
+        elif self.E_accum.shape != self.E.shape or self.E_accum.device != self.E.device:
+            self.E_accum = torch.zeros_like(self.E, dtype=torch.int8)
+        return self.E_accum
+    def forward(self, indices):
+        use_sparse = self.num_embeddings >= self.sparse_threshold
+        if use_sparse:
+            idx_flat = indices.reshape(-1).to(device=self.T_packed.device, dtype=torch.long)
+            T_rows = self._get_T_rows(idx_flat)
+            E_exp = self._expand_E_rows(idx_flat)
+            w_eff = torch.exp2(E_exp.float()) * T_rows.float()
+            w_eff_grad = w_eff.detach().requires_grad_(torch.is_grad_enabled())
+            if torch.is_grad_enabled():
+                comp_name, _ = _COMPONENT_CONTEXT.get()
+                def capture_sparse_grad(grad):
+                    suffix = f"_{comp_name}" if comp_name is not None else ""
+                    setattr(self, f"_hook_sparse_indices{suffix}", idx_flat.detach())
+                    setattr(self, f"_hook_sparse_grad_sign{suffix}", grad.reshape(-1, self.embedding_dim).sign().to(torch.int8).detach())
+                    setattr(self, f"_hook_sparse_T{suffix}", T_rows.detach())
+                w_eff_grad.register_hook(capture_sparse_grad)
+            out = w_eff_grad.reshape(*indices.shape, self.embedding_dim)
+            return F.normalize(out, dim=-1) if self.normalize else out
+        if indices.is_cuda and _HAS_TRITON and _TritonTernaryEmbedFn is not None:
+            dummy = torch.zeros(1, device=indices.device, requires_grad=True)
+            out = _TritonTernaryEmbedFn.apply(indices, dummy, self)
+        else:
+            T = self._get_T()
+            w_eff = torch.exp2(self._expand_E().float()) * T.float()
+            w_eff_grad = w_eff.detach().requires_grad_(True)
+            self._hook_T = T
+            def capture_w_grad(grad_w):
+                self._hook_grad_T_sign = grad_w.sign().to(torch.int8)
+            w_eff_grad.register_hook(capture_w_grad)
+            out = F.embedding(indices, w_eff_grad)
+        return F.normalize(out, dim=-1) if self.normalize else out
+    def ternary_step(self, accum_threshold=3):
+        if hasattr(self, "_hook_sparse_indices") and hasattr(self, "_hook_sparse_grad_sign"):
+            return self._sparse_ternary_step(accum_threshold=accum_threshold)
+        if hasattr(self, "_hook_grad_T_sign"):
+            if hasattr(self, "_accumulate_corr_from_grad_sign"):
+                self._accumulate_corr_from_grad_sign(self._hook_grad_T_sign)
+            del self._hook_grad_T_sign
+    def update_E(self, loss_signal=None):
+        if hasattr(self, "_hook_sparse_indices") and hasattr(self, "_hook_sparse_grad_sign"):
+            return self._sparse_update_E(loss_signal=loss_signal)
+    @torch.no_grad()
+    def _sparse_ternary_step(self, accum_threshold=3):
+        indices = self._hook_sparse_indices.to(device=self.T_accum.device, dtype=torch.long)
+        grad_sign = self._hook_sparse_grad_sign.to(device=self.T_accum.device, dtype=torch.int16)
+        if indices.numel() == 0:
+            return
+        unique, inverse = torch.unique(indices, return_inverse=True)
+        grad_sum = torch.zeros(unique.numel(), self.embedding_dim, device=self.T_accum.device, dtype=torch.int16)
+        grad_sum.index_add_(0, inverse, grad_sign)
+        grad_step = grad_sum.sign().to(torch.int16) * int(getattr(self, "_t_accum_step", 1))
+        current = self.T_accum[unique].to(torch.int16)
+        updated = torch.clamp(current - grad_step, -128, 127).to(torch.int8)
+        pgt = getattr(self, "per_group_threshold", None)
+        if pgt is not None:
+            gpr = _ceil_div(self.embedding_dim, self.group_size)
+            threshold = pgt.view(self.num_embeddings, gpr)[unique]
+            threshold = threshold.unsqueeze(-1).expand(unique.numel(), gpr, self.group_size)
+            threshold = threshold.reshape(unique.numel(), gpr * self.group_size)[:, :self.embedding_dim]
+            threshold = threshold.to(updated.device)
+            flip_up = updated > threshold
+            flip_down = updated < -threshold
+        else:
+            flip_up = updated > accum_threshold
+            flip_down = updated < -accum_threshold
+        self._had_flip = bool((flip_up | flip_down).any().item())
+        if self._had_flip:
+            rows = self._get_T_rows(unique).to(updated.device)
+            rows = torch.where(flip_up, torch.ones_like(rows), torch.where(flip_down, -torch.ones_like(rows), rows))
+            self._set_T_rows(unique, rows)
+            updated = torch.where(flip_up | flip_down, torch.zeros_like(updated), updated)
+        self.T_accum[unique] = updated
+        del self._hook_sparse_indices
+        del self._hook_sparse_grad_sign
+        if hasattr(self, "_hook_sparse_T"):
+            del self._hook_sparse_T
+    @torch.no_grad()
+    def _sparse_update_E(self, loss_signal=None):
+        indices = self._hook_sparse_indices.to(device=self.E.device, dtype=torch.long)
+        grad_sign = self._hook_sparse_grad_sign.to(device=self.E.device, dtype=torch.int16)
+        T_rows = self._hook_sparse_T if hasattr(self, "_hook_sparse_T") else self._get_T_rows(indices)
+        T_rows = T_rows.to(device=self.E.device, dtype=torch.int16)
+        if indices.numel() == 0:
+            return
+        unique, inverse = torch.unique(indices, return_inverse=True)
+        gpr = _ceil_div(self.embedding_dim, self.group_size)
+        total_in = gpr * self.group_size
+        signed = grad_sign * T_rows
+        grouped = F.pad(signed, (0, total_in - self.embedding_dim)).view(indices.numel(), gpr, self.group_size)
+        score = grouped.sum(dim=2)
+        delta = torch.where(
+            score > 0,
+            torch.full_like(score, -1, dtype=torch.int16),
+            torch.where(score < 0, torch.ones_like(score, dtype=torch.int16), torch.zeros_like(score, dtype=torch.int16)),
+        )
+        delta_sum = torch.zeros(unique.numel(), gpr, device=self.E.device, dtype=torch.int16)
+        delta_sum.index_add_(0, inverse, delta)
+        delta_sign = delta_sum.sign()
+        e_idx = unique[:, None] * gpr + torch.arange(gpr, device=self.E.device, dtype=torch.long)[None, :]
+        accum = torch.clamp(self.E_accum[e_idx].to(torch.int16) + delta_sign, -128, 127)
+        threshold = int(getattr(self, "_e_accum_threshold", 4))
+        step = torch.where(
+            accum >= threshold,
+            torch.ones_like(accum, dtype=torch.int16),
+            torch.where(accum <= -threshold, torch.full_like(accum, -1, dtype=torch.int16), torch.zeros_like(accum, dtype=torch.int16)),
+        )
+        self.E[e_idx] = torch.clamp(self.E[e_idx].to(torch.int16) + step, -128, 127).to(torch.int8)
+        self.E_accum[e_idx] = (accum - step * threshold).to(torch.int8)
+class TernaryVQCodebook(nn.Module):
+    def __init__(self, codebook_size, codebook_dim, commitment_weight=1.0,
+                 tscale_type=TScaleType.T32, exact_lookup_max=16384,
+                 candidate_count=256):
+        super().__init__()
+        self.codebook_size = codebook_size
+        self.codebook_dim = codebook_dim
+        self.commitment_weight = commitment_weight
+        self.exact_lookup_max = exact_lookup_max
+        self.candidate_count = candidate_count
+        self.threshold_ema_dead_code = 2
+        self.table = TernaryEmbeddingTable(codebook_size, codebook_dim, tscale_type=tscale_type, normalize=True)
+        self.register_buffer("cluster_size", torch.zeros(codebook_size, dtype=torch.int16))
+    @property
+    def embed(self):
+        idx = torch.arange(self.codebook_size, device=self.table.T_packed.device)
+        return self.table(idx)
+    def _candidate_ids(self, flat):
+        c = min(self.candidate_count, self.codebook_size)
+        take = min(flat.shape[1], 16)
+        primes = torch.tensor(
+            [1009, 9176, 6361, 5333, 4447, 3469, 2531, 1613,
+             811, 421, 211, 109, 59, 31, 17, 7],
+            device=flat.device, dtype=torch.float32,
+        )[:take]
+        signed = torch.sign(flat[:, :take].float())
+        base = torch.abs(torch.round((signed * primes).sum(dim=1) * 104729)).to(torch.long)
+        offsets = torch.arange(c, device=flat.device, dtype=torch.long)
+        stride = 2_654_435_761
+        return (base[:, None] + offsets[None, :] * stride) % self.codebook_size
+    def _lookup(self, flat):
+        if self.codebook_size <= self.exact_lookup_max:
+            x_norm = F.normalize(flat.float(), dim=-1)
+            codebook = self.embed.to(device=flat.device)
+            sim = x_norm @ codebook.T
+            indices = sim.argmax(dim=-1)
+            quantized = codebook[indices]
+            return quantized, indices
+        candidate_ids = self._candidate_ids(flat)
+        x_norm = F.normalize(flat.float(), dim=-1)
+        n, c, d = flat.shape[0], candidate_ids.shape[1], flat.shape[1]
+        chunk = 64
+        quantized = torch.empty_like(flat)
+        indices = torch.empty(n, dtype=torch.long, device=flat.device)
+        for start in range(0, n, chunk):
+            end = min(start + chunk, n)
+            chunk_ids = candidate_ids[start:end]
+            chunk_vecs = self.table(chunk_ids).float()
+            chunk_norm = F.normalize(chunk_vecs, dim=-1)
+            chunk_sim = (chunk_norm * x_norm[start:end].unsqueeze(1)).sum(dim=-1)
+            chunk_best = chunk_sim.argmax(dim=-1)
+            indices[start:end] = candidate_ids[start:end].gather(1, chunk_best.unsqueeze(1)).squeeze(1)
+            quantized[start:end] = chunk_vecs[torch.arange(end - start, device=flat.device), chunk_best]
+        return quantized, indices
+    def forward(self, x):
+        orig_shape = x.shape
+        flat = x.reshape(-1, self.codebook_dim)
+        quantized, indices = self._lookup(flat)
+        commitment = self.commitment_weight * (
+            F.mse_loss(flat.float(), quantized.detach().float())
+            + 0.25 * F.mse_loss(quantized.float(), flat.detach().float())
+        )
+        quantized = flat + (quantized - flat).detach()
+        with torch.no_grad():
+            unique, counts = torch.unique(indices, return_counts=True)
+            current = self.cluster_size[unique].to(torch.int32)
+            updated = torch.clamp(current + counts.to(device=current.device, dtype=torch.int32), 0, 32767).to(torch.int16)
+            self.cluster_size[unique] = updated
+        return quantized.reshape(orig_shape), indices.reshape(orig_shape[:-1]), commitment
+class GNNLoRAAdapter(nn.Module):
+    def __init__(self, dim, rank=32, max_hops=4):
+        super().__init__()
+        self.max_hops = max_hops
+        self.down = TernaryScaleTensor(dim, rank, tscale_type=TScaleType.T32)
+        self.up = TernaryScaleTensor(rank, dim, tscale_type=TScaleType.T32)
+        self.scale = TernaryEmbeddingTable(max_hops, rank, tscale_type=TScaleType.T32)
+    def forward(self, x, hop_t):
+        t_idx = min(hop_t, self.max_hops - 1)
+        s = self.scale(torch.tensor(t_idx, device=x.device))
+        return self.up(self.down(x) * s)
+class HaltingUnit(nn.Module):
+    def __init__(self, dim, tscale_type=TScaleType.T32):
+        super().__init__()
+        self.proj = TernaryScaleTensor(dim, 1, tscale_type=tscale_type)
+        self.norm = TernaryRMSNorm(dim, tscale_type=tscale_type)
+    def forward(self, x):
+        return torch.sigmoid(self.proj(self.norm(x)))
+class _NgramHashMapping:
+    """N-gram hash mapping with CPU offloading (Spider Engram style).
+    Hashes token sequences to fixed-size embedding indices. Hash computation
+    runs on CPU via numpy, O(1) per token via precomputed multipliers.
+    """
+    def __init__(self, max_ngram_size, num_heads, table_size_base, layer_seed=0):
+        self.max_ngram_size = max_ngram_size
+        self.num_heads = num_heads
+        self.num_ngram_orders = max_ngram_size - 1
+        import numpy as np
+        PRIME_1 = 10007
+        g = torch.Generator()
+        g.manual_seed(int(layer_seed + PRIME_1 * int(layer_seed)))
+        r = torch.randint(0, 1 << 30, (max_ngram_size,), generator=g, dtype=torch.int64)
+        self.multipliers = r.numpy() * 2 + 1
+        seen_primes = set()
+        self.prime_table_sizes = []
+        for _ in range(self.num_ngram_orders):
+            head_sizes = []
+            ps = table_size_base - 1
+            for _ in range(num_heads):
+                p = self._next_prime(ps, seen_primes)
+                seen_primes.add(p)
+                head_sizes.append(p)
+                ps = p
+            self.prime_table_sizes.append(head_sizes)
+        self.all_head_sizes = [s for sub in self.prime_table_sizes for s in sub]
+        offsets = [0]
+        for s in self.all_head_sizes[:-1]:
+            offsets.append(offsets[-1] + s)
+        self.offsets_arr = offsets
+        self.total_slots = sum(self.all_head_sizes)
+    @staticmethod
+    def _next_prime(n, seen):  # despite the name, searches downward: largest prime <= n not in `seen`
+        while n in seen or not _is_prime(n):
+            n -= 1
+        return n
+    def compute_hashes(self, token_ids):
+        import numpy as np
+        x = token_ids.cpu().numpy().astype(np.int64)
+        B, T = x.shape
+        shifts = [x]
+        for k in range(1, self.max_ngram_size):
+            shifts.append(np.pad(x, ((0, 0), (k, 0)), constant_values=0)[:, :T])
+        all_hashes = []
+        for order_idx in range(self.num_ngram_orders):
+            n = order_idx + 2
+            mix = shifts[0] * self.multipliers[0]
+            for k in range(1, n):
+                mix = np.bitwise_xor(mix, shifts[k].astype(np.int64) * self.multipliers[k])
+            for j, ms in enumerate(self.prime_table_sizes[order_idx]):
+                all_hashes.append((mix % ms).astype(np.int64, copy=False))
+        result = np.stack(all_hashes, axis=2)
+        return torch.from_numpy(result).to(device=token_ids.device)
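+def _is_prime(n):
+    """Trial-division primality test for the table-size search."""
+    import math
+    if n < 2:
+        return False
+    # math.isqrt is exact for ints, unlike int(math.sqrt(n)).
+    for i in range(2, math.isqrt(n) + 1):
+        if n % i == 0:
+            return False
+    return True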
+class MemGram(nn.Module):
+    """Engram-style associative memory with O(1) hashed lookup (CPU offloaded).
+    Features:
+    - O(1) hash -> index -> embedding lookup (no search, no decay for retrieval)
+    - CPU-offloaded hash computation (numpy)
+    - Single offset-stacked embedding table (not per-head tables)
+    - Gated retrieval: sigmoid(Q*K/sqrt(d)) gates the memory read
+    - Depthwise conv1d processes retrieved memory (Engram-style)
+    - No strength/decay buffers (decay is handled by GraphMoE usage frequency)
+    - MemGram lookups do NOT affect KG decaying (separate mechanisms)
+    """
+    def __init__(self, struct_primes=(64901, 64919, 64921, 64927, 64937, 64951, 64969, 64997,
+                                        65003, 65011, 65027, 65029, 65033, 65053, 65063, 65071),
+                 conv_primes=(8009, 8011, 8017, 8039),
+                 embed_dim=64, hidden_dim=HIDDEN_DIM, key_dim=32,
+                 max_ngram_size=3, num_hash_heads=4, layer_seed=0):
+        super().__init__()
+        self.embed_dim = embed_dim
+        self.key_dim = key_dim
+        self.hidden_dim = hidden_dim
+        self.n_struct_heads = len(struct_primes)
+        self.n_conv_heads = len(conv_primes)
+        self.struct_hash = _NgramHashMapping(
+            max_ngram_size=max_ngram_size, num_heads=num_hash_heads,
+            table_size_base=struct_primes[0], layer_seed=layer_seed,
+        )
+        self.conv_hash = _NgramHashMapping(
+            max_ngram_size=max_ngram_size, num_heads=num_hash_heads,
+            table_size_base=conv_primes[0], layer_seed=layer_seed + 1000,
+        )
+        total_heads = self.struct_hash.num_ngram_orders * num_hash_heads
+        self.total_mem_dim = total_heads * embed_dim
+        total_slots = self.struct_hash.total_slots + self.conv_hash.total_slots
+        self.mem_embed = nn.Embedding(total_slots, embed_dim)
+        self.k_proj = nn.Linear(self.total_mem_dim, key_dim, bias=False)
+        self.q_proj = nn.Linear(hidden_dim, key_dim, bias=False)
+        self.v_proj = nn.Linear(self.total_mem_dim, hidden_dim, bias=False)
+        with torch.no_grad():
+            self.v_proj.weight.zero_()
+        self.conv_norm = nn.RMSNorm(hidden_dim)
+        self.conv = nn.Conv1d(
+            hidden_dim, hidden_dim,
+            kernel_size=4, padding=9, dilation=3, groups=hidden_dim,
+        )
+        with torch.no_grad():
+            self.conv.weight.zero_()
+            if self.conv.bias is not None:
+                self.conv.bias.zero_()
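+    def _retrieve(self, token_ids, hash_mapping):
+        hash_ids = hash_mapping.compute_hashes(token_ids)
+        B, T, H = hash_ids.shape
+        flat_ids = hash_ids.reshape(B * T, H)
+        offsets = torch.tensor(hash_mapping.offsets_arr, device=flat_ids.device, dtype=torch.long)
+        # mem_embed stacks the struct slots first and the conv slots after them,
+        # so conv lookups are shifted past the struct region instead of aliasing it.
+        base = self.struct_hash.total_slots if hash_mapping is self.conv_hash else 0
+        emb = self.mem_embed(flat_ids + offsets + base)
+        return emb.reshape(B, T, H * self.embed_dim)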
+    def forward(self, vq_indices, hidden_state):
+        B, T, D = hidden_state.shape
+        struct_mem = self._retrieve(vq_indices[:, 1:], self.struct_hash)
+        conv_mem = self._retrieve(vq_indices[:, 1:], self.conv_hash)
+        mem = struct_mem + conv_mem
+        idx_end = mem.shape[1]
+        q_proj = self.q_proj(hidden_state[:, :idx_end])
+        k = self.k_proj(mem)
+        v = self.v_proj(mem)
+        gate = torch.sigmoid((q_proj * k).sum(dim=-1, keepdim=True) / (self.key_dim ** 0.5))
+        v_gated = gate * v
+        v_normed = self.conv_norm(v_gated)
+        v_t = v_normed.transpose(1, 2)
+        conv_out = self.conv(v_t)
+        conv_out = conv_out[:, :, :v_t.shape[-1]].transpose(1, 2)
+        output = hidden_state[:, :idx_end] + F.silu(conv_out) + v_gated
+        if idx_end < T:
+            output = F.pad(output, (0, 0, 0, T - idx_end))
+        return output
+    def retrieve_cb(self, vq_indices):
+        B, T = vq_indices.shape
+        struct_mem = self._retrieve(vq_indices[:, 1:], self.struct_hash)
+        conv_mem = self._retrieve(vq_indices[:, 1:], self.conv_hash)
+        mem = struct_mem + conv_mem
+        idx_end = mem.shape[1]
+        pad = torch.zeros(B, T - idx_end, mem.shape[2], device=mem.device, dtype=mem.dtype)
+        mem = torch.cat([mem, pad], dim=1)
+        q = mem.mean(dim=-1, keepdim=True)
+        gate = torch.sigmoid(q)
+        return gate * mem
+_BOUNDARY_TOKEN_MAP = {
+    SPECIAL_VOCAB['BOS']: 0,
+    SPECIAL_VOCAB['SYSTEM']: 1,
+    SPECIAL_VOCAB['USER']: 2,
+    SPECIAL_VOCAB['ASSISTANT']: 3,
+}
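+class LTIInjection(nn.Module):
+    """LTI state injection: h = A*h + B*e + trans_out.
+    A = exp(-exp(log_dt + log_A)) is a diagonal ZOH-style decay, so each entry
+    lies in (0, 1) up to float rounding at the clamp limits and the linear part
+    of the recurrence cannot amplify h. This damps recurrent/ACT loops at high
+    dimensions; it does not bound trans_out itself.
+    """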
+    def __init__(self, dim: int):
+        super().__init__()
+        self.log_A = nn.Parameter(torch.zeros(dim))
+        self.log_dt = nn.Parameter(torch.zeros(1))
+        self.B = nn.Parameter(torch.ones(dim) * 0.1)
+        for p in (self.log_A, self.log_dt, self.B):
+            p.requires_grad_(False)
+    def get_A(self):
+        return torch.exp(-torch.exp((self.log_dt + self.log_A).clamp(-20, 20)))
+    def forward(self, h, e, trans_out):
+        return self.get_A() * h + self.B * e + trans_out
+class ByteHead(nn.Module):
+    """Deep 3-layer MLP byte prediction head with ACT loop.
+    Architecture: 8192 → 16384 → 8192 → 16384 → 288
+    ACT: up to 3 iterations, halts when argmax stable for 2 consecutive steps.
+    """
+    def __init__(self, tscale_type=TScaleType.T32,
+                 act_max_iters=BYTEHEAD_ACT_MAX_ITERS,
+                 act_halt_consecutive=BYTEHEAD_ACT_HALT_CONSECUTIVE):
+        super().__init__()
+        H = HIDDEN_DIM
+        W = HIDDEN_DIM * 2
+        self.act_max_iters = act_max_iters
+        self.act_halt_consecutive = act_halt_consecutive
+        self._last_ponder = 0.0
+        self.norm = TernaryRMSNorm(H, tscale_type=tscale_type)
+        self.up = TernaryScaleTensor(H, W, tscale_type=tscale_type)
+        self.up_norm = TernaryRMSNorm(W, tscale_type=tscale_type)
+        self.hidden = TernaryScaleTensor(W, H, tscale_type=tscale_type)
+        self.hidden_norm = TernaryRMSNorm(H, tscale_type=tscale_type)
+        self.out = TernaryScaleTensor(H, W, tscale_type=tscale_type)
+        self.out_norm = TernaryRMSNorm(W, tscale_type=tscale_type)
+        self.head = TernaryScaleTensor(W, VOCAB, tscale_type=tscale_type)
+        if act_max_iters > 1:
+            self.act_residual = TernaryScaleTensor(VOCAB, H, tscale_type=tscale_type)
+            self.lti = LTIInjection(H)
+        else:
+            self.act_residual = None
+            self.lti = None
+    def forward(self, x):
+        if self.act_max_iters <= 1 or self.act_residual is None:
+            hn = F.silu(self.up(self.norm(x)))
+            hn = F.silu(self.hidden(self.up_norm(hn)))
+            hn = F.silu(self.out(self.hidden_norm(hn)))
+            return self.head(self.out_norm(hn))
+        h = x
+        x_initial = x
+        prev_argmax = None
+        stable_count = 0
+        total_iters = 0
+        for i in range(self.act_max_iters):
+            hn = F.silu(self.up(self.norm(h)))
+            hn = F.silu(self.hidden(self.up_norm(hn)))
+            hn = F.silu(self.out(self.hidden_norm(hn)))
+            logits = self.head(self.out_norm(hn))
+            curr_argmax = logits.argmax(dim=-1)
+            if prev_argmax is not None and (curr_argmax == prev_argmax).all():
+                stable_count += 1
+            else:
+                stable_count = 0
+            total_iters = i + 1
+            if stable_count >= self.act_halt_consecutive:
+                break
+            prev_argmax = curr_argmax
+            trans_out = self.act_residual(logits)
+            h = self.lti(h, x_initial, trans_out)
+        self._last_ponder = total_iters / max(self.act_max_iters, 1)
+        return logits
+class OutputRouter(nn.Module):
+    """Routes HIDDEN_DIM relational tokens to one of four targets:
+    0 = Null (continue), 1 = ByteHead, 2 = VideoHead, 3 = TalkerHead.
+    3-layer MLP when depth>=3, 2-layer when depth=2, single projection otherwise.
+    Returns argmax indices at inference and (softmax weights, logits) when training=True.
+    """
+    def __init__(self, tscale_type=TScaleType.T32, depth=3):
+        super().__init__()
+        if depth >= 3:
+            self.hidden1 = TernaryScaleTensor(HIDDEN_DIM, HIDDEN_DIM, tscale_type=tscale_type)
+            self.hidden1_norm = TernaryRMSNorm(HIDDEN_DIM, tscale_type=tscale_type)
+            self.hidden2 = TernaryScaleTensor(HIDDEN_DIM, HIDDEN_DIM // 4, tscale_type=tscale_type)
+            self.gate = TernaryScaleTensor(HIDDEN_DIM // 4, 4, tscale_type=tscale_type)
+        elif depth == 2:
+            self.hidden1 = None
+            self.hidden1_norm = None
+            self.hidden2 = TernaryScaleTensor(HIDDEN_DIM, HIDDEN_DIM // 4, tscale_type=tscale_type)
+            self.gate = TernaryScaleTensor(HIDDEN_DIM // 4, 4, tscale_type=tscale_type)
+        else:
+            self.hidden1 = None
+            self.hidden1_norm = None
+            self.hidden2 = None
+            self.gate = TernaryScaleTensor(HIDDEN_DIM, 4, tscale_type=tscale_type)
+        # 0 = Null (continue), 1 = ByteHead, 2 = VideoHead, 3 = TalkerHead
+    def forward(self, x, training=False):
+        h = x
+        if self.hidden1 is not None:
+            h = F.silu(self.hidden1_norm(self.hidden1(h)))
+        if self.hidden2 is not None:
+            h = self.hidden2(h)
+        logits = self.gate(h)  # [B, T, 4]
+        logits = torch.nan_to_num(logits, nan=0.0, posinf=30.0, neginf=-30.0).clamp(-30.0, 30.0)
+        if training:
+            weights = F.softmax(logits, dim=-1)
+            return weights, logits
+        return logits.argmax(dim=-1)
+class KGVQCodebook(TernaryVQCodebook):
+    """Compatibility wrapper for the KG/composite VQ.
+    The old implementation kept float32 `embed` and `embed_avg` buffers. The
+    production path now uses the same packed ternary/int8 backing table as the
+    shared VQ so default 5M-code KG construction cannot allocate hidden float
+    codebook state.
+    """
+    def __init__(self, codebook_size=KGVQ_CODEBOOK_SIZE, codebook_dim=KGVQ_CODEBOOK_DIM,
+                 decay=KGVQ_DECAY, commitment_weight=KGVQ_COMMITMENT_WEIGHT,
+                 threshold_ema_dead_code=KGVQ_DEAD_CODE_THRESHOLD):
+        super().__init__(
+            codebook_size=codebook_size,
+            codebook_dim=codebook_dim,
+            commitment_weight=commitment_weight,
+        )
+        self.decay = decay
+        self.threshold_ema_dead_code = threshold_ema_dead_code
+    @property
+    def embed(self):
+        if self.codebook_size > self.exact_lookup_max:
+            raise RuntimeError(
+                "Full KG VQ materialization is disabled for large ternary codebooks; "
+                "query rows through `table(indices)` instead."
+            )
+        return super().embed
+    def _ema_update(self, x_flat, indices):
+        unique, counts = torch.unique(indices, return_counts=True)
+        current = self.cluster_size[unique].to(torch.int32)
+        updated = torch.clamp(
+            current + counts.to(device=current.device, dtype=torch.int32),
+            0,
+            32767,
+        ).to(torch.int16)
+        self.cluster_size[unique] = updated
+    def _dead_code_reset(self, x_flat):
+        return None
+class CompositeProposalHead(nn.Module):
+    """Multi-proposal head from pooled GNN output (Phase 17).
+    Projects GNN pool output (graph_pool_out [B, D]) to K_MAX composite motif
+    proposals, quantizes via KGVQ, and applies ACT-style halting.
+    """
+    def __init__(self, dim=HIDDEN_DIM, codebook_dim=KGVQ_CODEBOOK_DIM,
+                 k_max=K_MAX_COMPOSITES, codebook_size=KGVQ_CODEBOOK_SIZE,
+                 tscale_type=TScaleType.T32):
+        super().__init__()
+        self.dim = dim
+        self.k_max = k_max
+        self.codebook_dim = codebook_dim
+        self.proj = TernaryScaleTensor(dim, k_max * codebook_dim, tscale_type=tscale_type)
+        self.kgvq = TernaryVQCodebook(codebook_size=codebook_size, codebook_dim=codebook_dim,
+                                      tscale_type=tscale_type)
+        self.halt_gate = TernaryScaleTensor(dim, k_max, tscale_type=tscale_type)
+        self.diversity_weight = 0.1
+    def forward(self, pool_out):
+        B = pool_out.shape[0]
+        projections = self.proj(pool_out).view(B, self.k_max, self.codebook_dim)
+        quantized, composite_ids, vq_loss = self.kgvq(projections)
+        halt_logits = self.halt_gate(pool_out).clamp(-12.0, 12.0)
+        halt = torch.sigmoid(halt_logits)  # [B, K_MAX]
+        composite_ids = composite_ids.masked_fill(halt < 0.5, -1)
+        normed = F.normalize(projections, dim=-1)
+        sim_matrix = normed @ normed.transpose(-1, -2)
+        triu = torch.triu(sim_matrix, diagonal=1)
+        n_pairs = self.k_max * (self.k_max - 1) / 2
+        diversity_loss = triu.sum(dim=(-1, -2)).mean() / max(n_pairs, 1)
+        diversity_loss = diversity_loss * self.diversity_weight
+        return composite_ids, vq_loss + diversity_loss, halt
+class MoEGraph(nn.Module):
+    """Fused graph traversal + centroid-based MoE routing + ACT halting.
+    Each ACT iteration: traverse KG → aggregate neighbor emb → centroid route →
+    run expert → halt check. All operations at MG_WORKSPACE_DIM (768 in config.py).
+    Replaces: TernaryGraph + GraphMoEGate + GraphACTCell + SharedProjectionMoE + MoEACTCell.
+    """
+    def __init__(self, cb_dim=MG_WORKSPACE_DIM, trigram_dim=HIDDEN_DIM,
+                 codebook_dim=CODEBOOK_DIM,
+                 num_experts=MG_N_EXPERTS, core_rank=MG_CORE_RANK,
+                 shared_inter=MG_SHARED_INTER,
+                 max_iters=MG_ACT_ITERS,
+                 halt_threshold=0.99, tscale_type=TScaleType.T32,
+                 codebook_size=CODEBOOK_SIZE,
+                 active_graph_max_nodes=4096,
+                 top_k=1):
+        super().__init__()
+        self.cb_dim = cb_dim
+        self.trigram_dim = trigram_dim
+        self.codebook_dim = codebook_dim
+        self.num_experts = num_experts
+        self.core_rank = core_rank
+        self.shared_inter = shared_inter
+        self.max_iters = max_iters
+        self.halt_threshold = halt_threshold
+        self.codebook_size = codebook_size
+        self.active_graph_max_nodes = active_graph_max_nodes
+        self.top_k = top_k
+        self.down_proj = TernaryScaleTensor(trigram_dim, cb_dim, tscale_type=tscale_type)
+        self.down_norm = TernaryRMSNorm(trigram_dim, tscale_type=tscale_type)
+        self.up_proj = TernaryScaleTensor(cb_dim, trigram_dim, tscale_type=tscale_type)
+        self.up_norm = TernaryRMSNorm(cb_dim, tscale_type=tscale_type)
+        self.attn_down_proj = TernaryScaleTensor(trigram_dim, cb_dim, tscale_type=tscale_type)
+        self.codebook_up = TernaryScaleTensor(codebook_dim, cb_dim, tscale_type=tscale_type)
+        self.use_active_edge_store = self.codebook_size > self.active_graph_max_nodes
+        self.active_edge_capacity = max(int(self.active_graph_max_nodes) * 16, 65_536)
+        if self.use_active_edge_store:
+            self.register_buffer("edge_index", torch.zeros(2, 0, dtype=torch.int32))
+            self.register_buffer("edge_attr", torch.zeros(0, dtype=torch.int8))
+            self.register_buffer("edge_score", torch.zeros(0, dtype=torch.int8))
+            self.register_buffer("active_edge_src", torch.full((self.active_edge_capacity,), -1, dtype=torch.int32))
+            self.register_buffer("active_edge_dst", torch.full((self.active_edge_capacity,), -1, dtype=torch.int32))
+            self.register_buffer("active_edge_attr", torch.zeros(self.active_edge_capacity, dtype=torch.int8))
+            self.register_buffer("active_edge_score", torch.zeros(self.active_edge_capacity, dtype=torch.int8))
+            self.register_buffer("active_edge_ptr", torch.zeros((), dtype=torch.long))
+        else:
+            num_edges = self.codebook_size * 10
+            src = torch.arange(self.codebook_size, dtype=torch.int32).repeat_interleave(10)
+            dst = torch.randint(0, self.codebook_size, (num_edges,), dtype=torch.int32)
+            self.register_buffer("edge_index", torch.stack([src, dst], dim=0))
+            edge_init = torch.randint(-1, 2, (num_edges,), dtype=torch.int8)
+            self.register_buffer("edge_attr", edge_init)
+            self.register_buffer("edge_score", torch.zeros(num_edges, dtype=torch.int8))
+        self.register_buffer("_steps_since_requant", torch.tensor(0, dtype=torch.long))
+        self.requant_every = KG_REQUANT_EVERY
+        self.kg_ternary_threshold = KG_TERNARY_THRESHOLD
+        self.kg_ema_alpha = KG_EMA_ALPHA
+        self.centroids = TernaryEmbeddingTable(num_experts, cb_dim, tscale_type=tscale_type, normalize=True)
+        self.shared_up_norm = TernaryRMSNorm(cb_dim, tscale_type=tscale_type)
+        self.shared_up = TernaryScaleTensor(cb_dim, shared_inter, tscale_type=tscale_type)
+        self.shared_down_norm = TernaryRMSNorm(shared_inter, tscale_type=tscale_type)
+        self.shared_down = TernaryScaleTensor(shared_inter, cb_dim, tscale_type=tscale_type)
+        self.W_gate = nn.ModuleList([
+            TernaryScaleTensor(cb_dim, core_rank, tscale_type=tscale_type)
+            for _ in range(num_experts)
+        ])
+        self.W_gate_norms = nn.ModuleList([
+            TernaryRMSNorm(cb_dim, tscale_type=tscale_type)
+            for _ in range(num_experts)
+        ])
+        self.W_transform = nn.ModuleList([
+            TernaryScaleTensor(core_rank, shared_inter, tscale_type=tscale_type)
+            for _ in range(num_experts)
+        ])
+        self.W_transform_norms = nn.ModuleList([
+            TernaryRMSNorm(core_rank, tscale_type=tscale_type)
+            for _ in range(num_experts)
+        ])
+        self.hop_lora = GNNLoRAAdapter(dim=cb_dim, rank=32, max_hops=max_iters)
+        self.halting = HaltingUnit(dim=cb_dim, tscale_type=tscale_type)
+        self.lti = LTIInjection(cb_dim)
+        self._codebook_embed = None
+        self._codebook_table = None
+    def _codebook_tensor(self, device):
+        if self._codebook_table is not None:
+            idx = torch.arange(self.codebook_size, device=device)
+            codebook = self._codebook_table(idx)
+            if codebook.shape[-1] != self.cb_dim:
+                codebook = self.codebook_up(codebook)
+            return codebook
+        if self._codebook_embed is not None:
+            codebook = self._codebook_embed.to(device=device).squeeze(0)
+            if codebook.shape[-1] != self.cb_dim:
+                codebook = self.codebook_up(codebook)
+            return codebook
+        return torch.zeros(self.codebook_size, self.cb_dim, device=device)
+    def _active_codebook_features(self, vq_indices):
+        if self._codebook_table is not None:
+            safe_idx = vq_indices.clamp(min=0, max=self.codebook_size - 1)
+            active_code = self._codebook_table(safe_idx)
+        elif self._codebook_embed is not None:
+            codebook = self._codebook_embed.to(device=vq_indices.device).squeeze(0)
+            safe_idx = vq_indices.clamp(min=0, max=codebook.shape[0] - 1)
+            active_code = codebook[safe_idx]
+        else:
+            return torch.zeros(*vq_indices.shape, self.cb_dim, device=vq_indices.device)
+        if active_code.shape[-1] != self.cb_dim:
+            active_code = self.codebook_up(active_code)
+        return active_code
+    def _neighbor_aggregate(self, node_features, threshold):
+        N, D = node_features.shape
+        aggregated = torch.zeros(self.codebook_size, D, device=node_features.device, dtype=node_features.dtype)
+        edge_ternary = StickyZoneSTE.apply(self.edge_attr, threshold)
+        src_features = node_features[self.edge_index[0]]
+        messages = edge_ternary.unsqueeze(1).to(node_features.dtype) * src_features
+        # scatter_add_ requires an int64 index; edge_index is stored as int32.
+        dst_idx = self.edge_index[1].long().unsqueeze(1).expand(-1, D)
+        aggregated.scatter_add_(0, dst_idx, messages)
+        return aggregated
+    def _run_expert_batch(self, x, expert_idx):
+        B, T, D = x.shape
+        N = B * T
+        x_flat = rearrange(x, 'b t d -> (b t) d')
+        exp_flat = rearrange(expert_idx, 'b t -> (b t)')
+        shared_hidden = F.silu(self.shared_up(self.shared_up_norm(x_flat)))
+        sort_idx = exp_flat.argsort()
+        sorted_experts = exp_flat[sort_idx]
+        # Pull the per-expert counts to the host once, rather than syncing on a
+        # tensor comparison for every expert inside the loop.
+        expert_counts = torch.bincount(sorted_experts, minlength=self.num_experts).tolist()
+        out_flat = torch.zeros(N, D, device=x.device, dtype=x.dtype)
+        end = 0
+        for e in range(self.num_experts):
+            start, end = end, end + expert_counts[e]
+            if start == end:
+                continue
+            tok_idx = sort_idx[start:end]
+            inp = x_flat[tok_idx]
+            sh = shared_hidden[tok_idx]
+            gate = self.W_gate[e](self.W_gate_norms[e](inp))
+            core = self.W_transform[e](self.W_transform_norms[e](gate))
+            expert_out = self.shared_down(self.shared_down_norm(core * sh))
+            out_flat[tok_idx] = expert_out
+        return rearrange(out_flat, '(b t) d -> b t d', b=B, t=T)
+    def _run_expert(self, x, expert_idx):
+        return self._run_expert_batch(x, expert_idx)
+    def _active_node_add(self, vq_output, vq_indices):
+        return vq_output + self._active_codebook_features(vq_indices)
+    def forward(self, trigram_input, vq_indices, attention_output=None,
+                memgram_cb_output=None, threshold=0.05):
+        B, T, D = trigram_input.shape
+        device = trigram_input.device
+        x = self.down_proj(self.down_norm(trigram_input))
+        attn_cb = None
+        if attention_output is not None:
+            attn_cb = self.attn_down_proj(self.down_norm(attention_output))
+        halted = torch.zeros(B, T, device=device, dtype=torch.bool)
+        cumulative_p = torch.zeros(B, T, device=device)
+        acc = torch.zeros_like(x)
+        total_ponder = torch.zeros(B, T, device=device)
+        last_x = x
+        initial_x = x
+        use_active_graph = self.codebook_size > self.active_graph_max_nodes
+        node_features = None if use_active_graph else self._codebook_tensor(device)
+        for iter_t in range(self.max_iters):
+            if use_active_graph:
+                traversal = self._active_node_add(x, vq_indices)
+            else:
+                node_aggregated = self._neighbor_aggregate(node_features, threshold)
+                traversal = x + node_aggregated[vq_indices]
+            if attn_cb is not None:
+                traversal = traversal + attn_cb
+            if iter_t in [1, 3] and memgram_cb_output is not None:
+                memgram_raw = memgram_cb_output.to(device)
+                if memgram_raw.shape[-1] != self.cb_dim:
+                    memgram_raw = memgram_raw.mean(dim=-1, keepdim=True).expand(-1, -1, self.cb_dim)
+                traversal = traversal + memgram_raw
+            traversal = traversal + self.hop_lora(traversal, iter_t)
+            trav_norm = F.normalize(traversal, dim=-1, eps=1e-8)
+            centroid_ids = torch.arange(self.num_experts, device=device)
+            cent_norm = F.normalize(self.centroids(centroid_ids), dim=-1, eps=1e-8)
+            scores = trav_norm @ cent_norm.T
+            if self.top_k <= 1:
+                _, expert_idx = scores.max(dim=-1)
+                expert_out = self._run_expert(traversal, expert_idx)
+            else:
+                scores_topk, topk_idx = scores.topk(k=self.top_k, dim=-1)
+                weights = F.softmax(scores_topk / 0.1, dim=-1)
+                expert_out = 0
+                for i in range(self.top_k):
+                    wi = weights[..., i:i+1]
+                    ei = topk_idx[..., i]
+                    expert_out = expert_out + wi * self._run_expert(traversal, ei)
+            last_x = expert_out
+            p = self.halting(expert_out).squeeze(-1)
+            still_running = ~halted
+            remainder = (1.0 - cumulative_p).clamp(min=0)
+            weight = torch.where(
+                cumulative_p + p >= self.halt_threshold,
+                remainder, p,
+            )
+            weight = weight * still_running.float()
+            acc = acc + weight.unsqueeze(-1) * expert_out
+            cumulative_p = cumulative_p + p * still_running.float()
+            halted = halted | (cumulative_p >= self.halt_threshold)
+            total_ponder = total_ponder + (1.0 - cumulative_p).clamp(min=0)
+            x = self.lti(x, initial_x, expert_out)
+            if halted.all():
+                break
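+        # Tokens that never crossed the threshold give their leftover probability
+        # mass to the final step, so the mixing weights still sum to 1.
+        never_halted = (~halted).float().unsqueeze(-1)
+        remainder = (1.0 - cumulative_p).clamp(min=0).unsqueeze(-1)
+        acc = acc + never_halted * remainder * last_x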
+        output = self.up_proj(self.up_norm(acc))
+        ponder_loss = total_ponder.mean() / self.max_iters
+        return output, ponder_loss
+    @torch.no_grad()
+    def update_kg_edges(self, all_vq_indices):
+        if self.use_active_edge_store:
+            self._update_active_edges(all_vq_indices)
+            return
+        unique_ids = torch.unique(all_vq_indices.to(device=self.edge_index.device, dtype=torch.int32))
+        src_in_batch = torch.isin(self.edge_index[0], unique_ids)
+        if src_in_batch.any():
+            dst_seen = torch.isin(self.edge_index[1][src_in_batch], unique_ids)
+            delta = torch.where(
+                dst_seen,
+                torch.ones_like(self.edge_score[src_in_batch], dtype=torch.int16),
+                torch.full_like(self.edge_score[src_in_batch], -1, dtype=torch.int16),
+            )
+            score = torch.clamp(self.edge_score[src_in_batch].to(torch.int16) + delta, -128, 127)
+            self.edge_score[src_in_batch] = score.to(torch.int8)
+        self._requantize_dense_edges()
+    @torch.no_grad()
+    def _update_active_edges(self, all_vq_indices):
+        ids = all_vq_indices.to(device=self.active_edge_src.device, dtype=torch.int32)
+        if ids.numel() < 2:
+            self._steps_since_requant.add_(1)
+            return
+        seq = ids.reshape(-1, ids.shape[-1]) if ids.dim() > 1 else ids.reshape(1, -1)
+        src = seq[:, :-1].reshape(-1)
+        dst = seq[:, 1:].reshape(-1)
+        valid = (src >= 0) & (dst >= 0) & (src < self.codebook_size) & (dst < self.codebook_size) & (src != dst)
+        src = src[valid]
+        dst = dst[valid]
+        if src.numel() == 0:
+            self._steps_since_requant.add_(1)
+            return
+        n_edges = min(src.numel(), self.active_edge_capacity)
+        src = src[-n_edges:]
+        dst = dst[-n_edges:]
+        ptr = int(self.active_edge_ptr.item())
+        slots = (torch.arange(n_edges, device=src.device, dtype=torch.long) + ptr) % self.active_edge_capacity
+        self.active_edge_src[slots] = src
+        self.active_edge_dst[slots] = dst
+        score = torch.clamp(self.active_edge_score[slots].to(torch.int16) + 1, -128, 127)
+        self.active_edge_score[slots] = score.to(torch.int8)
+        self.active_edge_attr[slots] = 1
+        self.active_edge_ptr.fill_((ptr + n_edges) % self.active_edge_capacity)
+        self._requantize_active_edges()
+    @torch.no_grad()
+    def _requantize_dense_edges(self):
+        if self._steps_since_requant.item() < self.requant_every:
+            self._steps_since_requant.add_(1)
+            return
+        self.edge_attr = self._score_to_attr(self.edge_score)
+        score = self.edge_score.to(torch.int16)
+        score = torch.where(score > 0, score - 1, torch.where(score < 0, score + 1, score))
+        self.edge_score = score.to(torch.int8)
+        self._steps_since_requant.zero_()
+    @torch.no_grad()
+    def _requantize_active_edges(self):
+        if self._steps_since_requant.item() < self.requant_every:
+            self._steps_since_requant.add_(1)
+            return
+        active = self.active_edge_src >= 0
+        if active.any():
+            self.active_edge_attr[active] = self._score_to_attr(self.active_edge_score[active])
+            score = self.active_edge_score[active].to(torch.int16)
+            score = torch.where(score > 0, score - 1, torch.where(score < 0, score + 1, score))
+            self.active_edge_score[active] = score.to(torch.int8)
+        self._steps_since_requant.zero_()
+    def _score_to_attr(self, score):
+        threshold = max(1, int(round(float(self.kg_ternary_threshold) * 8)))
+        score_i = score.to(torch.int16)
+        return torch.where(
+            score_i >= threshold,
+            torch.ones_like(score, dtype=torch.int8),
+            torch.where(
+                score_i <= -threshold,
+                torch.full_like(score, -1, dtype=torch.int8),
+                torch.zeros_like(score, dtype=torch.int8),
+            ),
+        )
+    @torch.no_grad()
+    def monitor_graph_health(self, threshold=0.05):
+        if self.use_active_edge_store:
+            active = self.active_edge_src >= 0
+            if not active.any():
+                return {
+                    "sparsity": 1.0, "isolated_nodes": self.codebook_size,
+                    "avg_polarity": 0.0, "dead_edges": 0,
+                    "score_mean": 0.0, "score_max": 0.0,
+                    "active_edges": 0,
+                }
+            edge_attr = self.active_edge_attr[active]
+            edge_score = self.active_edge_score[active]
+            nodes_with_edges = torch.unique(torch.cat([self.active_edge_src[active], self.active_edge_dst[active]]))
+        else:
+            edge_attr = self.edge_attr
+            edge_score = self.edge_score
+            nodes_with_edges = torch.unique(torch.cat([self.edge_index[0], self.edge_index[1]]))
+        ternary_edge = edge_attr.sign()
+        sparsity = (ternary_edge == 0).float().mean().item() if ternary_edge.numel() else 1.0
+        n_isolated = max(int(self.codebook_size) - int(nodes_with_edges.numel()), 0)
+        n_pos = (ternary_edge > 0).sum().item()
+        n_neg = (ternary_edge < 0).sum().item()
+        n_nonzero = n_pos + n_neg
+        avg_polarity = (n_pos - n_neg) / max(n_nonzero, 1)
+        dead_edges = ((ternary_edge == 0) & (edge_score != 0)).sum().item()
+        score_mean = edge_score.float().mean().item() if edge_score.numel() else 0.0
+        score_max = edge_score.float().abs().max().item() if edge_score.numel() else 0.0
+        return {
+            "sparsity": sparsity, "isolated_nodes": n_isolated,
+            "avg_polarity": avg_polarity, "dead_edges": dead_edges,
+            "score_mean": score_mean, "score_max": score_max,
+            "active_edges": int(ternary_edge.numel()),
+        }
+    def set_adjacency(self, edge_index, edge_attr_init=None):
+        self.use_active_edge_store = False
+        device = self.edge_attr.device
+        self.edge_index = edge_index.to(device=device, dtype=torch.int32)
+        if edge_attr_init is not None:
+            edge_attr = edge_attr_init.sign() * (edge_attr_init.abs() > 0).to(edge_attr_init.dtype)
+            self.edge_attr = edge_attr.to(device=device, dtype=torch.int8)
+        else:
+            self.edge_attr = torch.randint(-1, 2, (edge_index.size(1),),
+                device=device, dtype=torch.int8)
+        self.edge_score = self.edge_attr.clone()

arbitor/config.py ADDED

	@@ -0,0 +1,125 @@

+VOCAB=288
+AUDIO_VOCAB=288
+AUDIO_SR=16000
+AUDIO_FRAME_RATE=50
+THRESHOLD=0.05
+# -- 3B Target Dimensions --
+EMBEDDING_DIM=1536
+CODEBOOK_DIM=1024
+CODEBOOK_SIZE=524288            # Base unit
+# Shared multimodal VQ (256K entries × 1024-dim)
+SHARED_VQ_SIZE = 262144
+HIDDEN_DIM=8192             # Main hidden dimension
+FFN_HIDDEN=16384              # 2× HIDDEN_DIM
+CTX=256
+# MoEGraph (256 experts, centroid routing, unified ACT)
+MG_N_EXPERTS = 256
+MG_CORE_RANK = 384
+MG_SHARED_INTER = 1536
+MG_ACT_ITERS = 4
+MG_WORKSPACE_DIM = 768
+MG_TOP_K = 2
+# VQ
+# MemGram (32 heads × ~65K slots ≈ 2M total associative slots)
+MEMGRAM_STRUCT_PRIMES = [64901, 64919, 64921, 64927, 64937, 64951, 64969, 64997,
+                         65003, 65011, 65027, 65029, 65033, 65053, 65063, 65071,
+                         65101, 65119, 65123, 65129, 65141, 65147, 65167, 65171,
+                         65173, 65179, 65183, 65203, 65213, 65239, 65257, 65269]
+MEMGRAM_CONV_PRIMES = [8009, 8011, 8017, 8039, 8081, 8087, 8089, 8093]
+MEMGRAM_EMBED_DIM = 64
+MEMGRAM_KEY_DIM = 32
+# KV Ledger
+KV_LEDGER_SIZE = 262144
+SLIDING_WINDOW_SIZE = 32768
+KQ_CACHE_SIZE = 8192
+# MLA Attention dimensions
+MLA_N_HEADS = 32
+MLA_QK_NOPE_HEAD_DIM = 96
+MLA_QK_ROPE_HEAD_DIM = 32
+MLA_V_HEAD_DIM = 96
+MLA_SLIDE_DIM = 64
+MLA_FULL_DIM = 32
+MLA_N_LAYERS = 24
+# RoPE
+MLA_ROPE_THETA = 10000.0
+# Attention
+ATTENTION_STRIDE = 8
+KV_CONTEXT_LENGTH = 33554432
+# CSA / HCA compression (DeepSeek V4 hybrid attention)
+MLA_CSA_DIM = 16
+MLA_HCA_DIM = 16
+MLA_HCA_STRIDE = 32
+# KG EMA — Phase 17
+KG_EMA_ALPHA=0.99
+KG_REQUANT_EVERY=50
+KG_TERNARY_THRESHOLD=0.3
+# Composite Motif VQ — Phase 17 (64K entries × 1024-dim)
+KGVQ_CODEBOOK_SIZE=65536
+KGVQ_CODEBOOK_DIM=1024
+KGVQ_DECAY=0.99
+KGVQ_COMMITMENT_WEIGHT=1.0
+KGVQ_DEAD_CODE_THRESHOLD=2
+K_MAX_COMPOSITES=20
+# VideoHead (Open-Sora VAE: 4 latent channels, 8× spatial + 4× temporal compression)
+VIDEO_LATENT_CHANNELS = 4
+VIDEO_MAX_STEPS = 8
+VIDEO_HEIGHT = 64
+VIDEO_WIDTH = 64
+# -- Open-Sora 3D VAE (Phase 19) --
+OPEN_SORA_VAE_PATH = "arbitor/encoders/models/opensora-vae"
+OPEN_SORA_VAE_REPO = "hpcai-tech/OpenSora-VAE-v1.2"
+OPEN_SORA_LATENT_CHANNELS = 4
+OPEN_SORA_SCALE_FACTOR_SPATIAL = 8
+OPEN_SORA_SCALE_FACTOR_TEMPORAL = 4
+# -- ACT Loop Parameters (Phase 19) --
+BYTEHEAD_ACT_MAX_ITERS = 3
+BYTEHEAD_ACT_HALT_CONSECUTIVE = 2
+BYTEHEAD_ACT_PONDER_LAMBDA = 0.01
+VIDEOHEAD_ACT_MIN_FPS = 1
+VIDEOHEAD_ACT_MAX_FPS = 60
+VIDEOHEAD_ACT_FRAME_CHUNK = 8
+TALKERHEAD_ACT_CHUNK_FRAMES = 500
+# -- Timestamp Encoding (Phase 19) --
+TIMESTAMP_MAX_PERIOD = 10000.0
+# -- Temporal Frame Buffer (Phase 19) --
+FRAME_BUFFER_LOCAL_SIZE = 3
+FRAME_BUFFER_CACHE_STRIDE = 4
+SPECIAL_VOCAB = {
+    # Control
+    'PAD': 256, 'BOS': 257, 'EOS': 258, 'STOP': 259,
+    # Roles
+    'SYSTEM': 260, 'USER': 261, 'ASSISTANT': 262,
+    # Reasoning
+    'SCRATCHPAD': 263, 'PLAN': 264, 'REFLECTION': 265, 'SUMMARY': 266,
+    # Tool use
+    'ACTION': 267, 'TOOL': 268, 'TOOL_RESULT': 269,
+    # Code
+    'CODE': 270, 'CODE_BLOCK': 271, 'EXECUTION': 272,
+    # RAG
+    'SEARCH': 273, 'CONTEXT': 274, 'CITATION': 275,
+    # Quality / format
+    'ERROR': 276, 'FORMAT': 277,
+    # Multimodal
+    'IMAGE': 278, 'TEXT': 279, 'AUDIO': 280,
+    'VIDEO': 281, 'SPEAK': 282, 'IMG_GEN': 283,
+    # Future
+    'RES1': 284, 'RES2': 285, 'RES3': 286, 'RESERVED': 287,
+}
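
Several of the constants above are linked by simple arithmetic that other modules appear to rely on (for example, the `T * 320` output length quoted in the TinyNeuralCodec docstring). The sketch below copies a few values from config.py into a standalone script and derives those quantities. The helper `derived_constants` and the name `SPECIAL_ID_RANGE` are hypothetical and exist only for this illustration.

```python
# Values copied from arbitor/config.py; the consistency checks are illustrative.
VOCAB = 288
AUDIO_SR = 16000
AUDIO_FRAME_RATE = 50
HIDDEN_DIM = 8192
FFN_HIDDEN = 16384
SPECIAL_ID_RANGE = range(256, 288)  # PAD=256 through RESERVED=287


def derived_constants():
    """Return quantities implied by the config values above."""
    return {
        # Audio samples covered by one 50 Hz token frame.
        "samples_per_audio_frame": AUDIO_SR // AUDIO_FRAME_RATE,
        # FFN width relative to the main hidden dimension.
        "ffn_expansion": FFN_HIDDEN // HIDDEN_DIM,
        # IDs left for raw bytes once the special tokens are removed.
        "byte_ids": VOCAB - len(SPECIAL_ID_RANGE),
    }


if __name__ == "__main__":
    for key, value in derived_constants().items():
        print(f"{key}: {value}")
```

Under these values the special tokens occupy exactly the 32 IDs above the byte range, so VOCAB has no unused slots.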

arbitor/converters/convert_to_ternary2.py ADDED

	@@ -0,0 +1,81 @@

+import os
+import sys
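+import torch
+from ..config import VOCAB, EMBEDDING_DIM, HIDDEN_DIM, FFN_HIDDEN, CTX, THRESHOLD
+# TODO: import StickyZoneSTE (used in save_model) from the module that defines it.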
+def pack_ternary(w):
+    q = torch.empty_like(w, dtype=torch.uint8)
+    q[w < 0] = 0
+    q[w == 0] = 1
+    q[w > 0] = 2
+    flat = q.flatten()
+    pad = (-len(flat)) % 4
+    if pad:
+        flat = torch.cat([flat, torch.zeros(pad, dtype=torch.uint8, device=flat.device)])
+    flat = flat.view(-1, 4)
+    packed = (
+        flat[:, 0]
+        | (flat[:, 1] << 2)
+        | (flat[:, 2] << 4)
+        | (flat[:, 3] << 6)
+    )
+    return packed.cpu(), w.shape
+def save_model(model, path="trigram-morph.pt"):
+    ternary_weights = {}
+    for name, param in model.named_parameters():
+        if "weight" in name and param.ndim >= 2 and "embed" not in name:
+            T = StickyZoneSTE.apply(param.data, THRESHOLD)
+            packed, shape = pack_ternary(T)
+            ternary_weights[name] = {"packed": packed, "shape": shape}
+    torch.save({
+        "model_state_dict": model.state_dict(),
+        "config": {
+            "vocab": VOCAB,
+            "embedding_dim": EMBEDDING_DIM,
+            "trigram_dim": HIDDEN_DIM,
+            "ffn_hidden": FFN_HIDDEN,
+            "ctx": CTX,
+            "threshold": THRESHOLD,
+        },
+        "ternary_packed": ternary_weights,
+        "format": "factorized_scaled_ternary",
+        "bpw": 1.58,
+    }, path)
+    total = sum(p.numel() for p in model.parameters())
+    print(f"Saved {total:,} params to {path}")
+def load_model(path="trigram-morph.pt", device="cpu"):
+    from ..trigram import ARBModel  # local import, as in convert_to_ternary8.py
+    checkpoint = torch.load(path, map_location=device, weights_only=False)
+    model = ARBModel()
+    model.load_state_dict(checkpoint["model_state_dict"], strict=False)
+    model.to(device)
+    model.eval()
+    return model
+if __name__ == "__main__":
+    from ..trigram import ARBModel
+    model = ARBModel()
+    total = sum(p.numel() for p in model.parameters())
+    ternary = sum(
+        p.numel() for n, p in model.named_parameters()
+        if "weight" in n and p.ndim >= 2 and "embed" not in n
+    )
+    fp32 = sum(
+        p.numel() for n, p in model.named_parameters()
+        if not ("weight" in n and p.ndim >= 2 and "embed" not in n)
+    )
+    print(f"Total params: {total:,}")
+    print(f"Ternary params (1.58 BPW): {ternary:,}")
+    print(f"FP32 params: {fp32:,}")
+    save_model(model)
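
This file packs four trits per byte but does not include the inverse operation. As a minimal sketch of the round trip, the following uses plain Python lists and bytes instead of tensors. The functions `pack_2bit` and `unpack_2bit` are hypothetical names, and the element count is returned in place of the tensor shape so that padding can be trimmed.

```python
def pack_2bit(trits):
    """Pack values in {-1, 0, 1} four per byte with the codes 0, 1, 2."""
    codes = [t + 1 for t in trits]            # -1 -> 0, 0 -> 1, 1 -> 2
    pad = (-len(codes)) % 4
    codes += [0] * pad                         # pad with code 0, as pack_ternary does
    out = bytearray()
    for i in range(0, len(codes), 4):
        a, b, c, d = codes[i:i + 4]
        out.append(a | (b << 2) | (c << 4) | (d << 6))
    return bytes(out), len(trits)


def unpack_2bit(packed, count):
    """Invert pack_2bit; `count` is the original number of trits."""
    trits = []
    for byte in packed:
        for shift in (0, 2, 4, 6):
            trits.append(((byte >> shift) & 0b11) - 1)
    return trits[:count]                       # drop the padding entries


if __name__ == "__main__":
    original = [-1, 0, 1, 1, 0, -1, 1]
    packed, count = pack_2bit(original)
    print(list(packed), unpack_2bit(packed, count) == original)
```

Because the padding code 0 decodes to -1, the unpacker has to know the original length; the tensor version can recover it from the saved shape.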

arbitor/converters/convert_to_ternary54.py ADDED

	@@ -0,0 +1,120 @@

+import os
+import sys
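+import torch
+from ..config import VOCAB, EMBEDDING_DIM, HIDDEN_DIM, FFN_HIDDEN, CTX, THRESHOLD
+# TODO: import StickyZoneSTE (used in save_model) from the module that defines it.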
+def pack_ternary_34(w):
+    q = torch.empty_like(w, dtype=torch.uint8)
+    q[w < 0] = 0
+    q[w == 0] = 1
+    q[w > 0] = 2
+    flat = q.flatten()
+    # pad to multiple of 34
+    pad = (-len(flat)) % 34
+    if pad:
+        flat = torch.cat([
+            flat,
+            torch.ones(pad, dtype=torch.uint8, device=flat.device)
+        ])
+    flat = flat.view(-1, 34)
+    packed = torch.zeros(
+        flat.shape[0],
+        dtype=torch.uint64,
+        device=flat.device
+    )
+    multiplier = 1
+    for i in range(34):
+        packed += flat[:, i].to(torch.uint64) * multiplier
+        multiplier *= 3
+    return packed.cpu(), w.shape, pad
+def unpack_ternary_34(packed, shape, pad=0):
+    packed = packed.to(torch.uint64)
+    out = []
+    for _ in range(34):
+        trit = packed % 3
+        packed //= 3
+        out.append(trit)
+    out = torch.stack(out, dim=1).flatten()
+    if pad:
+        out = out[:-pad]
+    out = out.view(shape)
+    # restore ternary values
+    out = out.to(torch.int8)
+    out[out == 0] = -1
+    out[out == 1] = 0
+    out[out == 2] = 1
+    return out
+def save_model(model, path="trigram-morph.pt"):
+    ternary_weights = {}
+    for name, param in model.named_parameters():
+        if "weight" in name and param.ndim >= 2 and "embed" not in name:
+            T = StickyZoneSTE.apply(param.data, THRESHOLD)
+            packed, shape, pad = pack_ternary_34(T)
+            ternary_weights[name] = {"packed": packed, "shape": shape, "pad": pad}
+    torch.save({
+        "model_state_dict": model.state_dict(),
+        "config": {
+            "vocab": VOCAB,
+            "embedding_dim": EMBEDDING_DIM,
+            "trigram_dim": HIDDEN_DIM,
+            "ffn_hidden": FFN_HIDDEN,
+            "ctx": CTX,
+            "threshold": THRESHOLD,
+        },
+        "ternary_packed": ternary_weights,
+        "format": "factorized_scaled_ternary",
+        "bpw": 1.58,
+    }, path)
+    total = sum(p.numel() for p in model.parameters())
+    print(f"Saved {total:,} params to {path}")
+def load_model(path="trigram-morph.pt", device="cpu"):
+    from ..trigram import ARBModel  # local import, as in convert_to_ternary8.py
+    checkpoint = torch.load(path, map_location=device, weights_only=False)
+    model = ARBModel()
+    model.load_state_dict(checkpoint["model_state_dict"], strict=False)
+    model.to(device)
+    model.eval()
+    return model
+if __name__ == "__main__":
+    from ..trigram import ARBModel
+    model = ARBModel()
+    total = sum(p.numel() for p in model.parameters())
+    ternary = sum(
+        p.numel() for n, p in model.named_parameters()
+        if "weight" in n and p.ndim >= 2 and "embed" not in n
+    )
+    fp32 = sum(
+        p.numel() for n, p in model.named_parameters()
+        if not ("weight" in n and p.ndim >= 2 and "embed" not in n)
+    )
+    print(f"Total params: {total:,}")
+    print(f"Ternary params (1.58 BPW): {ternary:,}")
+    print(f"FP32 params: {fp32:,}")
+    save_model(model)
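
The base-3 packing used by pack_ternary_34 and unpack_ternary_34 can be checked without tensors. The sketch below mirrors the same digit order (least significant trit first) and the same padding code using Python integers. The names `pack_base3` and `unpack_base3` are hypothetical, and arbitrary-precision integers stand in for `torch.uint64`.

```python
def pack_base3(trits, group=34):
    """Pack values in {-1, 0, 1} into base-3 words of `group` digits each."""
    codes = [t + 1 for t in trits]             # -1 -> 0, 0 -> 1, 1 -> 2
    pad = (-len(codes)) % group
    codes += [1] * pad                          # code 1 decodes to 0, as in the original
    words = []
    for i in range(0, len(codes), group):
        word = 0
        for j, code in enumerate(codes[i:i + group]):
            word += code * 3 ** j               # digit j has weight 3**j
        words.append(word)
    return words, pad


def unpack_base3(words, pad, group=34):
    """Invert pack_base3 by peeling off base-3 digits, lowest first."""
    trits = []
    for word in words:
        for _ in range(group):
            word, code = divmod(word, 3)
            trits.append(code - 1)
    return trits[:len(trits) - pad] if pad else trits


if __name__ == "__main__":
    original = [1, -1, 0] * 15                  # 45 trits -> 2 words, 23 padded
    words, pad = pack_base3(original)
    print(len(words), pad, unpack_base3(words, pad) == original)
```

Every word produced with `group=34` is below 2**54, which is where the "54" in the file name comes from.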

arbitor/converters/convert_to_ternary64.py ADDED

	@@ -0,0 +1,111 @@

+import os
+import sys
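+import torch
+from ..config import VOCAB, EMBEDDING_DIM, HIDDEN_DIM, FFN_HIDDEN, CTX, THRESHOLD
+# TODO: import StickyZoneSTE (used in save_model) from the module that defines it.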
+def pack_ternary(w):
+    q = torch.empty_like(w, dtype=torch.uint8)
+    q[w < 0] = 0
+    q[w == 0] = 1
+    q[w > 0] = 2
+    flat = q.flatten()
+    pad = (-len(flat)) % 40  # 40 trits -> 64 bits: higher compression than 1 trit -> 2 bits, plus uint64 performance
+    if pad:
+        flat = torch.cat([
+            flat,
+            torch.zeros(pad, dtype=torch.uint8, device=flat.device)
+        ])
+    flat = flat.view(-1, 40)
+    packed = torch.zeros(flat.shape[0], dtype=torch.uint64, device=flat.device)
+    multiplier = 1
+    for i in range(40):
+        packed += flat[:, i].to(torch.uint64) * multiplier
+        multiplier *= 3
+    return packed.cpu(), w.shape, pad
+def unpack_ternary_40(packed, shape, pad=0):
+    packed = packed.to(torch.uint64)
+    out = []
+    for _ in range(40):
+        trit = packed % 3
+        packed //= 3
+        out.append(trit)
+    out = torch.stack(out, dim=1).flatten()
+    if pad:
+        out = out[:-pad]
+    out = out.view(shape)
+    out = out.to(torch.int8)
+    out[out == 0] = -1
+    out[out == 1] = 0
+    out[out == 2] = 1
+    return out
+def save_model(model, path="trigram-morph.pt"):
+    ternary_weights = {}
+    for name, param in model.named_parameters():
+        if "weight" in name and param.ndim >= 2 and "embed" not in name:
+            T = StickyZoneSTE.apply(param.data, THRESHOLD)
+            packed, shape, pad = pack_ternary(T)
+            ternary_weights[name] = {"packed": packed, "shape": shape, "pad": pad}
+    torch.save({
+        "model_state_dict": model.state_dict(),
+        "config": {
+            "vocab": VOCAB,
+            "embedding_dim": EMBEDDING_DIM,
+            "trigram_dim": HIDDEN_DIM,
+            "ffn_hidden": FFN_HIDDEN,
+            "ctx": CTX,
+            "threshold": THRESHOLD,
+        },
+        "ternary_packed": ternary_weights,
+        "format": "factorized_scaled_ternary",
+        "bpw": 1.58,
+    }, path)
+    total = sum(p.numel() for p in model.parameters())
+    print(f"Saved {total:,} params to {path}")
+def load_model(path="trigram-morph.pt", device="cpu"):
+    from ..trigram import ARBModel  # local import, as in convert_to_ternary8.py
+    checkpoint = torch.load(path, map_location=device, weights_only=False)
+    model = ARBModel()
+    model.load_state_dict(checkpoint["model_state_dict"], strict=False)
+    model.to(device)
+    model.eval()
+    return model
+if __name__ == "__main__":
+    from ..trigram import ARBModel
+    model = ARBModel()
+    total = sum(p.numel() for p in model.parameters())
+    ternary = sum(
+        p.numel() for n, p in model.named_parameters()
+        if "weight" in n and p.ndim >= 2 and "embed" not in n
+    )
+    fp32 = sum(
+        p.numel() for n, p in model.named_parameters()
+        if not ("weight" in n and p.ndim >= 2 and "embed" not in n)
+    )
+    print(f"Total params: {total:,}")
+    print(f"Ternary params (1.58 BPW): {ternary:,}")
+    print(f"FP32 params: {fp32:,}")
+    save_model(model)
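
The group sizes used by the converters in this commit (4, 5, 34 and 40 trits per word) come from how many base-3 digits fit into a given number of bits. The sketch below checks those limits and compares the storage cost per trit against the ideal log2(3). The table `SCHEMES` and both helper functions are hypothetical. The 34-trit row assumes the words stay in a 64-bit container, which is what the `torch.uint64` dtype in pack_ternary_34 suggests.

```python
import math

# (trits per word, bits stored per word) for each converter in this commit.
SCHEMES = {
    "convert_to_ternary2": (4, 8),
    "convert_to_ternary8": (5, 8),
    "convert_to_ternary54": (34, 64),   # 54 bits used, held in a uint64
    "convert_to_ternary64": (40, 64),
}


def fits(trits, bits):
    """True if every combination of `trits` base-3 digits fits in `bits` bits."""
    return 3 ** trits <= 2 ** bits


def stored_bits_per_trit(trits, bits):
    """Storage cost per trit for a word of `bits` bits holding `trits` trits."""
    return bits / trits


if __name__ == "__main__":
    print(f"ideal: {math.log2(3):.4f} bits per trit")
    for name, (trits, bits) in SCHEMES.items():
        cost = stored_bits_per_trit(trits, bits)
        print(f"{name}: {cost:.4f} bits per trit")
```

By this arithmetic the 5-in-8 and 40-in-64 layouts both cost 1.6 bits per trit, so the `"bpw": 1.58` field written by save_model reads as a nominal figure rather than the measured storage cost.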

arbitor/converters/convert_to_ternary8.py ADDED

	@@ -0,0 +1,101 @@

+import os
+import sys
+import torch
+# Lightweight imports used by pack_ternary/unpack_ternary (called by core system)
+# No circular deps here — these are just type conversions
+def pack_ternary(w):
+    q = torch.empty_like(w, dtype=torch.uint8)
+    q[w < 0] = 0
+    q[w == 0] = 1
+    q[w > 0] = 2
+    flat = q.flatten()
+    pad = (-len(flat)) % 5  # 5 trits -> 8 bits: higher compression than 1 trit -> 2 bits
+    if pad:
+        flat = torch.cat([
+            flat,
+            torch.zeros(pad, dtype=torch.uint8, device=flat.device)
+        ])
+    flat = flat.view(-1, 5)
+    packed = (
+        flat[:, 0]
+        + flat[:, 1] * 3
+        + flat[:, 2] * 9
+        + flat[:, 3] * 27
+        + flat[:, 4] * 81
+    ).to(torch.uint8)
+    return packed.cpu(), w.shape, pad
+def unpack_ternary(packed, shape, pad=0):
+    packed = packed.to(torch.int16)
+    t0 = packed % 3
+    packed //= 3
+    t1 = packed % 3
+    packed //= 3
+    t2 = packed % 3
+    packed //= 3
+    t3 = packed % 3
+    packed //= 3
+    t4 = packed % 3
+    out = torch.stack([t0, t1, t2, t3, t4], dim=1).flatten()
+    if pad:
+        out = out[:-pad]
+    out = out.view(shape)
+    # map back
+    out = out.to(torch.int8)
+    out[out == 0] = -1
+    out[out == 1] = 0
+    out[out == 2] = 1
+    return out
+def save_model(model, path="models/conversions/arb-model.pt"):
+    import os
+    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
+    torch.save({"model_state_dict": model.state_dict()}, path)
+    total = sum(p.numel() for p in model.parameters())
+    print(f"Saved {total:,} params to {path}")
+def load_model(path="models/conversions/arb-model.pt", device="cpu"):
+    from ..trigram import ARBModel
+    checkpoint = torch.load(path, map_location=device, weights_only=False)
+    model = ARBModel()
+    model.load_state_dict(checkpoint["model_state_dict"], strict=False)
+    model.to(device)
+    model.eval()
+    return model
+if __name__ == "__main__":
+    from ..trigram import ARBModel
+    model = ARBModel()
+    total = sum(p.numel() for p in model.parameters())
+    ternary = sum(
+        p.numel() for n, p in model.named_parameters()
+        if "weight" in n and p.ndim >= 2 and "embed" not in n
+    )
+    fp32 = sum(
+        p.numel() for n, p in model.named_parameters()
+        if not ("weight" in n and p.ndim >= 2 and "embed" not in n)
+    )
+    print(f"Total params: {total:,}")
+    print(f"Ternary params (1.58 BPW): {ternary:,}")
+    print(f"FP32 params: {fp32:,}")
+    save_model(model)

arbitor/decoders.py ADDED

	@@ -0,0 +1,231 @@

+"""Decoder modules — video diffusion, audio codec, speech generation.
+These modules convert HIDDEN_DIM relational states into modality-specific outputs:
+video (latent diffusion), audio (codec tokens), and speech (token striding + codec).
+"""
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from .kernel.ternary_scale import TernaryScaleTensor, TScaleType, TernaryRMSNorm
+from .kernel.triton_video import video_denoise_step
+from .config import HIDDEN_DIM, AUDIO_VOCAB, AUDIO_SR, AUDIO_FRAME_RATE, \
+    VIDEO_LATENT_CHANNELS, VIDEO_MAX_STEPS, VIDEO_HEIGHT, VIDEO_WIDTH, \
+    VIDEOHEAD_ACT_MIN_FPS, VIDEOHEAD_ACT_MAX_FPS, VIDEOHEAD_ACT_FRAME_CHUNK, \
+    TALKERHEAD_ACT_CHUNK_FRAMES
+from .components import TernaryEmbeddingTable
+class LTIInjection(nn.Module):
+    """LTI state injection: h = A*h + B*e + trans_out.
+    Spectral radius < 1 guaranteed by construction via ZOH discretization.
+    """
+    def __init__(self, dim: int):
+        super().__init__()
+        self.log_A = nn.Parameter(torch.zeros(dim))
+        self.log_dt = nn.Parameter(torch.zeros(1))
+        self.B = nn.Parameter(torch.ones(dim) * 0.1)
+        for p in (self.log_A, self.log_dt, self.B):
+            p.requires_grad_(False)
+    def get_A(self):
+        return torch.exp(-torch.exp((self.log_dt + self.log_A).clamp(-20, 20)))
+    def forward(self, h, e, trans_out):
+        return self.get_A() * h + self.B * e + trans_out
+class VideoHead(nn.Module):
+    """Scaled latent diffusion with cross-attention conditioning and a frame gate.
+    Produces [B, ch, n_latents, H, W] latents, one latent frame per chunk of
+    frame_chunk output frames (per D-102).
+    Frame gate controls adaptive fps in the [MIN_FPS, MAX_FPS] range.
+    """
+    def __init__(self, tscale_type=TScaleType.T32, max_steps=VIDEO_MAX_STEPS,
+                 latent_channels=VIDEO_LATENT_CHANNELS, height=VIDEO_HEIGHT, width=VIDEO_WIDTH,
+                 min_fps=VIDEOHEAD_ACT_MIN_FPS, max_fps=VIDEOHEAD_ACT_MAX_FPS,
+                 frame_chunk=VIDEOHEAD_ACT_FRAME_CHUNK):
+        super().__init__()
+        self.max_steps = max_steps
+        self.latent_channels = latent_channels
+        self.height = height
+        self.width = width
+        self.latent_dim = latent_channels * height * width
+        self.halt_threshold = 0.05
+        self.min_fps = min_fps
+        self.max_fps = max_fps
+        self.frame_chunk = frame_chunk
+        self.cross_attn_q = TernaryScaleTensor(self.latent_dim, HIDDEN_DIM, tscale_type=tscale_type)
+        self.cross_attn_kv = TernaryScaleTensor(HIDDEN_DIM, HIDDEN_DIM, tscale_type=tscale_type)
+        self.diffusion_step = TernaryScaleTensor(HIDDEN_DIM, self.latent_dim, tscale_type=tscale_type)
+        self.halt_unit = TernaryScaleTensor(HIDDEN_DIM, 1, tscale_type=tscale_type)
+        self.frame_gate = TernaryScaleTensor(HIDDEN_DIM, 1, tscale_type=tscale_type)
+        self.noise_embed = TernaryEmbeddingTable(max_steps, HIDDEN_DIM, tscale_type=tscale_type)
+        self.lti = LTIInjection(self.latent_dim)
+    @torch.no_grad()
+    def _compute_fps(self, cond):
+        frame_prob = torch.sigmoid(self.frame_gate(cond))
+        fps = self.min_fps + frame_prob * (self.max_fps - self.min_fps)
+        return fps.mean().item()
+    def forward(self, relational, max_steps=None, duration_seconds=1.0):
+        B, T, D = relational.shape
+        max_steps = max_steps or self.max_steps
+        cond = relational.mean(dim=1, keepdim=True)
+        fps = self._compute_fps(cond)
+        n_frames = max(1, int(fps * duration_seconds))
+        n_latents = min((n_frames + self.frame_chunk - 1) // self.frame_chunk, max_steps)
+        all_latents = []
+        for chunk_idx in range(n_latents):
+            latent = torch.randn(B, 1, self.latent_dim, device=relational.device,
+                                requires_grad=torch.is_grad_enabled())
+            for step in range(max_steps):
+                q = self.cross_attn_q(latent)
+                kv = self.cross_attn_kv(cond.expand(-1, T, -1))
+                context = kv.mean(dim=1, keepdim=True)
+                step_embed = self.noise_embed(torch.tensor(step, device=relational.device))
+                step_embed = step_embed.expand(B, 1, -1)
+                step_input = q + context + step_embed
+                pred_noise = self.diffusion_step(step_input)
+                alpha = 0.9 ** step
+                trans_out = video_denoise_step(latent, pred_noise, alpha)
+                h = torch.zeros(B, 1, self.latent_dim, device=context.device)
+                h[:, :, :HIDDEN_DIM] = context
+                latent = self.lti(latent, h, trans_out)
+                halt = torch.sigmoid(self.halt_unit(context))
+                if halt.mean() > self.halt_threshold and step > 1:
+                    break
+            all_latents.append(latent.view(B, self.latent_channels, 1, self.height, self.width))
+        return torch.cat(all_latents, dim=2)
+class MRFBlock(nn.Module):
+    """Multi-Receptive Field Fusion block from HiFi-GAN."""
+    def __init__(self, channels, kernel_sizes=(3, 5, 7)):
+        super().__init__()
+        self.convs = nn.ModuleList([
+            nn.Sequential(
+                nn.LeakyReLU(0.1),
+                nn.Conv1d(channels, channels, k, padding=k//2, dilation=1),
+            )
+            for k in kernel_sizes
+        ])
+    def forward(self, x):
+        return sum(conv(x) for conv in self.convs) / len(self.convs)
+class TinyNeuralCodec(nn.Module):
+    """Lightweight neural audio decoder (frozen float32 sidecar).
+    Maps byte token sequences to 16 kHz audio waveforms via transposed conv.
+    Token rate: 50 Hz → output: [B, 1, T * 320] at 16 kHz.
+    """
+    def __init__(self, vocab=AUDIO_VOCAB, embed_dim=512, upsample_ratios=(5, 4, 4, 4)):
+        super().__init__()
+        self.embed = nn.Embedding(vocab, embed_dim)
+        in_ch = embed_dim
+        self.blocks = nn.ModuleList()
+        for i, ratio in enumerate(upsample_ratios):
+            out_ch = max(1, embed_dim // (2 ** (i + 1)))
+            k = ratio * 2
+            pad = (ratio + 1) // 2 if ratio % 2 else ratio // 2
+            op = max(0, ratio + 2 * pad - k)
+            block = nn.Sequential(
+                nn.ConvTranspose1d(in_ch, out_ch, k, stride=ratio, padding=pad, output_padding=op),
+                MRFBlock(out_ch),
+            )
+            self.blocks.append(block)
+            in_ch = out_ch
+        self.to_audio = nn.Conv1d(in_ch, 1, kernel_size=7, padding=3)
+    def forward(self, tokens):
+        x = self.embed(tokens)
+        x = x.permute(0, 2, 1)
+        for block in self.blocks:
+            x = block(x)
+        x = self.to_audio(x)
+        return torch.tanh(x)
+    def encode_audio(self, audio, frame_rate=AUDIO_FRAME_RATE, sr=AUDIO_SR):
+        B, C, T = audio.shape
+        frame_len = sr // frame_rate
+        pad = (frame_len - T % frame_len) % frame_len
+        if pad > 0:
+            audio = F.pad(audio, (0, pad))
+        frames = audio.unfold(2, frame_len, frame_len)
+        frames = frames.mean(dim=1)
+        emb = self.embed.weight
+        B, NF, FL = frames.shape
+        frames_flat = frames.reshape(-1, FL)
+        frame_energy = frames_flat.mean(dim=1)
+        tokens = torch.clamp(((frame_energy + 1.0) * 127.5).long(), 0, 255)
+        tokens = tokens.reshape(B, NF)
+        recon = self(tokens)
+        if pad > 0:
+            recon = recon[:, :, :T]
+        return tokens, recon
+class TalkerHead(nn.Module):
+    """Audio generation head with temporal stride and chunked ACT generation.
+    2-layer MLP: 8192 → 8192 → 288.
+    Generates byte token predictions at 50 Hz frame rate in 500-frame chunks.
+    TinyNeuralCodec decodes the predicted tokens to audio waveform.
+    """
+    def __init__(self, tscale_type=TScaleType.T32,
+                 chunk_frames=TALKERHEAD_ACT_CHUNK_FRAMES):
+        super().__init__()
+        self.norm = TernaryRMSNorm(HIDDEN_DIM, tscale_type=tscale_type)
+        self.hidden = TernaryScaleTensor(HIDDEN_DIM, HIDDEN_DIM, tscale_type=tscale_type)
+        self.hidden_norm = TernaryRMSNorm(HIDDEN_DIM, tscale_type=tscale_type)
+        self.head = TernaryScaleTensor(HIDDEN_DIM, AUDIO_VOCAB, tscale_type=tscale_type)
+        self.codec = None
+        self.max_frames = chunk_frames
+        self.chunk_frames = chunk_frames
+    def load_codec(self, device='cuda'):
+        if self.codec is None:
+            self.codec = TinyNeuralCodec().to(device)
+            self.codec.eval()
+        return self.codec
+    def token_logits(self, x, max_frames=None):
+        max_frames = max_frames or self.max_frames
+        cond = self.norm(x)
+        cond = F.silu(self.hidden_norm(self.hidden(cond)))
+        stride = max(1, max_frames // max(1, cond.shape[1]))
+        logits = self.head(cond)
+        logits = logits.repeat_interleave(stride, dim=1)
+        if logits.shape[1] > max_frames:
+            logits = logits[:, :max_frames, :]
+        elif logits.shape[1] < max_frames:
+            pad = logits.new_zeros(logits.shape[0], max_frames - logits.shape[1], logits.shape[2])
+            logits = torch.cat([logits, pad], dim=1)
+        return logits
+    def forward(self, x, max_frames=None):
+        return self.token_logits(x, max_frames=max_frames).argmax(dim=-1)
+    def generate_audio(self, x, max_frames=None, return_all=True):
+        if max_frames is None:
+            max_frames = self.max_frames
+        all_tokens = []
+        remaining = max_frames
+        while remaining > 0:
+            chunk = min(remaining, self.chunk_frames)
+            tokens = self.forward(x, max_frames=chunk)
+            all_tokens.append(tokens)
+            remaining -= chunk
+        tokens = torch.cat(all_tokens, dim=1)
+        codec = self.load_codec(x.device if hasattr(x, 'device') else 'cuda')
+        with torch.no_grad():
+            waveform = codec(tokens)
+        return waveform, tokens
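
The TinyNeuralCodec docstring states that T tokens decode to T * 320 samples. As a quick check, the sketch below applies the output-length formula from the PyTorch `ConvTranspose1d` documentation to the kernel, padding and output-padding values computed in `__init__`. Both function names are hypothetical. The MRFBlock and final `Conv1d` layers are left out because their padding preserves length.

```python
def conv_transpose1d_len(length, kernel, stride, padding, output_padding, dilation=1):
    """Output length of ConvTranspose1d, per the PyTorch documentation formula."""
    return ((length - 1) * stride - 2 * padding
            + dilation * (kernel - 1) + output_padding + 1)


def codec_output_len(n_tokens, upsample_ratios=(5, 4, 4, 4)):
    """Follow the per-stage hyperparameters chosen in TinyNeuralCodec.__init__."""
    length = n_tokens
    for ratio in upsample_ratios:
        k = ratio * 2
        pad = (ratio + 1) // 2 if ratio % 2 else ratio // 2
        op = max(0, ratio + 2 * pad - k)
        length = conv_transpose1d_len(length, k, ratio, pad, op)
    return length


if __name__ == "__main__":
    # 50 tokens correspond to one second at the 50 Hz token rate.
    print(codec_output_len(50))
```

Each stage multiplies the length by exactly its ratio, so the total factor is 5 * 4 * 4 * 4 = 320 and one second of tokens yields 16000 samples.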

arbitor/encoders/__init__.py ADDED

	@@ -0,0 +1,11 @@

+"""Encoder sidecar modules for the ARB system.
+Each module exposes load(), encode(), decode() methods.
+Loaded on-demand as frozen float/int8 sidecars.
+"""
+from ..decoders import TinyNeuralCodec, MRFBlock
+from .audio import AudioVQEncoder
+from .pig_vae import load_vae, VAEWrapper
+from .opensora_vae import load_opensora_vae, OpenSoraVAEWrapper
+from .vae2d import VAE2DEncoder, load_vae2d
+from .mel_frontend import MelSpectrogram3Band

arbitor/encoders/audio.py ADDED

	@@ -0,0 +1,83 @@

+"""Audio training encoder — VQ encoder for TalkerHead target preparation.
+Training-only component (~5M float params). Maps audio at 50 Hz to 288-class byte tokens.
+TinyNeuralCodec (the decoder) is in arbitor.decoders and is shared with TalkerHead.
+"""
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from ..components import TernaryEmbeddingTable
+from ..kernel.ternary_scale import TernaryScaleTensor, TScaleType
+class TernaryConv1d(nn.Module):
+    """Conv1d implemented as unfold + ternary linear projection."""
+    def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0,
+                 tscale_type=TScaleType.T32, bias=True):
+        super().__init__()
+        self.in_channels = in_channels
+        self.out_channels = out_channels
+        self.kernel_size = kernel_size
+        self.stride = stride
+        self.padding = padding
+        self.proj = TernaryScaleTensor(
+            in_channels * kernel_size,
+            out_channels,
+            tscale_type=tscale_type,
+            bias=bias,
+        )
+    def forward(self, x):
+        if self.padding:
+            x = F.pad(x, (self.padding, self.padding))
+        windows = x.unfold(2, self.kernel_size, self.stride)
+        windows = windows.permute(0, 2, 1, 3).reshape(x.size(0), -1, self.in_channels * self.kernel_size)
+        return self.proj(windows).permute(0, 2, 1)
+class AudioVQEncoder(nn.Module):
+    """Encodes audio to discrete byte tokens at 50 Hz for TalkerHead training.
+    Input: [B, 1, T] audio waveform at 16 kHz
+    Output: [B, T/320, 288] logits over byte vocab (50 Hz frame rate)
+    """
+    def __init__(self, vocab=288, codebook_dim=64, downsample_ratios=(4, 4, 4, 5),
+                 tscale_type=TScaleType.T32):
+        super().__init__()
+        in_ch = 1
+        self.down_blocks = nn.ModuleList()
+        for i, ratio in enumerate(downsample_ratios):
+            out_ch = min(128, 32 * (2 ** i))
+            block = nn.Sequential(
+                TernaryConv1d(in_ch, out_ch, kernel_size=ratio * 2, stride=ratio,
+                              padding=ratio // 2, tscale_type=tscale_type),
+                nn.LeakyReLU(0.1),
+                TernaryConv1d(out_ch, out_ch, kernel_size=3, padding=1,
+                              tscale_type=tscale_type),
+                nn.LeakyReLU(0.1),
+            )
+            self.down_blocks.append(block)
+            in_ch = out_ch
+        self.proj = TernaryScaleTensor(out_ch, codebook_dim, tscale_type=tscale_type, bias=True)
+        self.codebook = TernaryEmbeddingTable(vocab, codebook_dim, tscale_type=tscale_type)
+        self.out_proj = TernaryScaleTensor(codebook_dim, vocab, tscale_type=tscale_type, bias=True)
+    def forward(self, audio):
+        x = audio
+        for block in self.down_blocks:
+            x = block(x)
+        x = x.permute(0, 2, 1)
+        x = self.proj(x)
+        emb_idx = torch.arange(self.out_proj.out_dim, device=x.device)
+        emb = self.codebook(emb_idx).to(device=x.device, dtype=x.dtype)
+        dist = torch.cdist(x.float(), emb.unsqueeze(0).float())
+        indices = dist.argmin(dim=-1)
+        quantized = F.embedding(indices, emb)
+        quantized = x + (quantized - x).detach()
+        logits = self.out_proj(quantized)
+        return logits, indices
+    def encode(self, audio):
+        with torch.no_grad():
+            _, indices = self.forward(audio)
+        return indices
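
The AudioVQEncoder docstring gives the output length as T/320. Because TernaryConv1d pads and then calls `unfold`, the frame count can be traced by hand. The sketch below does this for the default `downsample_ratios`, using the window count that `Tensor.unfold` produces, `(L - size) // step + 1`. The helper names are hypothetical, and the result is arithmetic only, not a measurement of the model.

```python
def unfold_windows(length, size, step):
    """Number of windows Tensor.unfold returns along a dimension of `length`."""
    return (length - size) // step + 1


def ternary_conv1d_len(length, kernel_size, stride=1, padding=0):
    """Output length of TernaryConv1d: symmetric pad, then unfold."""
    return unfold_windows(length + 2 * padding, kernel_size, stride)


def encoder_frames(n_samples, downsample_ratios=(4, 4, 4, 5)):
    """Trace the sequence length through AudioVQEncoder.down_blocks."""
    length = n_samples
    for ratio in downsample_ratios:
        # Strided conv: kernel 2 * ratio, stride ratio, padding ratio // 2.
        length = ternary_conv1d_len(length, ratio * 2, ratio, ratio // 2)
        # Second conv in each block: kernel 3, padding 1, length unchanged.
        length = ternary_conv1d_len(length, 3, 1, 1)
    return length


if __name__ == "__main__":
    print(encoder_frames(16000))
```

By this arithmetic one second of 16 kHz audio gives 49 frames instead of 50, because the final stride-5 stage pads by only 2 on each side. Training targets may therefore need padding or trimming to line up with the 50 Hz token grid.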

arbitor/encoders/mel_frontend.py ADDED

	@@ -0,0 +1,70 @@

+"""Mel spectrogram frontend for audio-to-image conversion.
+Converts [B, T] audio waveform to [B, 3, 64, T_mel] 3-channel mel
+spectrogram (low/mid/high frequency bands → RGB channels) suitable
+for encoding through the 2D VAE encoder.
+Band split:
+- Channel 0 (low):  0-1000 Hz
+- Channel 1 (mid):  1000-4000 Hz
+- Channel 2 (high): 4000-8000 Hz
+"""
+import torch
+import torch.nn as nn
+import torchaudio
+class MelSpectrogram3Band(nn.Module):
+    """Audio → 3-channel mel spectrogram (low/mid/high bands → RGB).
+    Splits audio into 3 frequency bands and computes mel spectrograms
+    independently, stacked as RGB-like 3-channel image for VAE encoding.
+    """
+    def __init__(self, sample_rate=16000, n_fft=1024, hop_length=512,
+                 n_mels=64, f_min=0, f_max=8000):
+        super().__init__()
+        self.sample_rate = sample_rate
+        self.n_fft = n_fft
+        self.hop_length = hop_length
+        self.n_mels = n_mels
+        self.mel_low = torchaudio.transforms.MelSpectrogram(
+            sample_rate=sample_rate, n_fft=n_fft, hop_length=hop_length,
+            n_mels=n_mels, f_min=f_min, f_max=1000,
+        )
+        self.mel_mid = torchaudio.transforms.MelSpectrogram(
+            sample_rate=sample_rate, n_fft=n_fft, hop_length=hop_length,
+            n_mels=n_mels, f_min=1000, f_max=4000,
+        )
+        self.mel_high = torchaudio.transforms.MelSpectrogram(
+            sample_rate=sample_rate, n_fft=n_fft, hop_length=hop_length,
+            n_mels=n_mels, f_min=4000, f_max=f_max,
+        )
+    def forward(self, waveform):
+        if waveform.dim() == 1:
+            waveform = waveform.unsqueeze(0)
+        elif waveform.dim() == 3:
+            if waveform.shape[1] == 1:
+                waveform = waveform.squeeze(1)
+            else:
+                waveform = waveform.mean(dim=1)
+        spec_low = torchaudio.functional.amplitude_to_DB(
+            self.mel_low(waveform), multiplier=10.0, amin=1e-10, db_multiplier=0.0, top_db=80.0
+        )
+        spec_mid = torchaudio.functional.amplitude_to_DB(
+            self.mel_mid(waveform), multiplier=10.0, amin=1e-10, db_multiplier=0.0, top_db=80.0
+        )
+        spec_high = torchaudio.functional.amplitude_to_DB(
+            self.mel_high(waveform), multiplier=10.0, amin=1e-10, db_multiplier=0.0, top_db=80.0
+        )
+        specs = []
+        for spec in [spec_low, spec_mid, spec_high]:
+            s_min = spec.amin(dim=(-2, -1), keepdim=True)
+            s_max = spec.amax(dim=(-2, -1), keepdim=True)
+            s_range = s_max - s_min + 1e-8
+            specs.append((spec - s_min) / s_range)
+        return torch.stack(specs, dim=1)
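
The module docstring leaves the time dimension as `T_mel`. The sketch below estimates it from the constructor defaults. It assumes the transforms keep the centered STFT framing that torchaudio uses by default, since the code above does not pass `center`. Both helper names are hypothetical.

```python
def mel_frames(n_samples, hop_length=512):
    """Frame count of a centered STFT: one frame per hop, plus the first."""
    return 1 + n_samples // hop_length


def bin_spacing_hz(sample_rate=16000, n_fft=1024):
    """Frequency distance between adjacent FFT bins."""
    return sample_rate / n_fft


if __name__ == "__main__":
    # One second of 16 kHz audio with the defaults from MelSpectrogram3Band.
    print(mel_frames(16000), bin_spacing_hz())
```

Under these assumptions one second of audio maps to a [B, 3, 64, 32] tensor. The low band (0 to 1000 Hz) spans about 64 FFT bins at 15.625 Hz spacing, so its 64 mel filters are each very narrow.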

arbitor/encoders/models/__init__.py ADDED

	@@ -0,0 +1,86 @@

+"""Local model loader — loads encoder models from local cache, falls back to HF.
+Model directories (saved via model.save_pretrained()):
+  dinov2-small/    — facebook/dinov2-small (21M params, 384-dim) vision
+  vit-base/        — google/vit-base-patch16-224 (86M, 768-dim) vision fallback
+  moonshine-base/  — UsefulSensors/moonshine-base (62M, 416-dim) audio
+  pig-vae/         — Wan2.1 VAE checkpoint (84M params) video latent codec
+Usage:
+    from arbitor.encoders.models import load_encoder, load_processor
+    model = load_encoder("dinov2-small")
+    processor = load_processor("dinov2-small", "image")
+Download models:
+    python -m arbitor.encoders.models.download
+"""
+import os
+_MODELS_DIR = os.path.dirname(os.path.abspath(__file__))
+# Map short names to (local_dir, hf_repo, type)
+REGISTRY = {
+    "dinov2-small": {
+        "local": os.path.join(_MODELS_DIR, "dinov2-small"),
+        "hf": "facebook/dinov2-small",
+        "type": "auto",
+    },
+    "vit-base": {
+        "local": os.path.join(_MODELS_DIR, "vit-base"),
+        "hf": "google/vit-base-patch16-224",
+        "type": "auto",
+    },
+    "moonshine-base": {
+        "local": os.path.join(_MODELS_DIR, "moonshine-base"),
+        "hf": "UsefulSensors/moonshine-base",
+        "type": "auto",
+    },
+}
+def resolve_path(name: str) -> tuple[str, dict]:
+    """Return (local_path_or_hf_name, registry_entry)."""
+    entry = REGISTRY.get(name)
+    if entry is None:
+        raise ValueError(f"Unknown model: {name}. Options: {list(REGISTRY.keys())}")
+    if os.path.isdir(entry["local"]):
+        return entry["local"], entry
+    return entry["hf"], entry
+def load_encoder(name: str, device=None, **kwargs):
+    """Load model from local cache, falling back to HuggingFace.
+    Args:
+        name: Short name ("dinov2-small", "vit-base", "moonshine-base")
+        device: Optional device to move model to (e.g. "cuda", "cpu")
+    Returns:
+        Loaded model in eval mode
+    """
+    from transformers import AutoModel
+    path, entry = resolve_path(name)
+    model = AutoModel.from_pretrained(path, low_cpu_mem_usage=True, **kwargs)
+    model.eval()
+    if device:
+        model = model.to(device)
+    return model
+def load_processor(name: str, modality: str = "image"):
+    """Load processor (image processor or feature extractor) from local cache.
+    Args:
+        name: Short model name
+        modality: "image" for AutoImageProcessor, "audio" for AutoFeatureExtractor
+    Returns:
+        Processor instance
+    """
+    path, _ = resolve_path(name)
+    if modality == "audio":
+        from transformers import AutoFeatureExtractor
+        return AutoFeatureExtractor.from_pretrained(path)
+    else:
+        from transformers import AutoImageProcessor
+        return AutoImageProcessor.from_pretrained(path)
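
`resolve_path` above implements a local-first lookup: if the model's directory exists on disk it is returned, otherwise the Hub repo id is returned and `from_pretrained` downloads it. A minimal sketch of that fallback using only the standard library; the registry entries and the `example-org/...` repo ids are made up for illustration:

```python
import os
import tempfile


def resolve(name: str, registry: dict) -> tuple[str, dict]:
    """Return the local directory if it exists, else the Hub repo id."""
    entry = registry.get(name)
    if entry is None:
        raise ValueError(f"Unknown model: {name}. Options: {list(registry)}")
    if os.path.isdir(entry["local"]):
        return entry["local"], entry
    return entry["hf"], entry


with tempfile.TemporaryDirectory() as root:
    # Hypothetical registry: one model is cached locally, the other is not.
    registry = {
        "cached-model": {"local": os.path.join(root, "cached-model"),
                         "hf": "example-org/cached-model"},
        "remote-model": {"local": os.path.join(root, "remote-model"),
                         "hf": "example-org/remote-model"},
    }
    os.makedirs(registry["cached-model"]["local"])
    cached_path, _ = resolve("cached-model", registry)
    remote_path, _ = resolve("remote-model", registry)

print(cached_path)
print(remote_path)
```

The check is only `os.path.isdir`, so an empty or partially written directory also counts as a local hit; a stricter variant could look for a config or weights file inside it.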

arbitor/encoders/models/download.py ADDED

	@@ -0,0 +1,132 @@

+"""Download encoder models to local cache (arbitor/encoders/models/).
+Usage:
+    python -m arbitor.encoders.models.download           # Download all
+    python -m arbitor.encoders.models.download --model pig-vae --convert  # Also convert GGUF→safetensors
+Models are saved to arbitor/encoders/models/{name}/ and loaded from there
+by sequencers and encoders — no HuggingFace download needed at runtime.
+"""
+import argparse
+import os
+import sys
+MODELS_DIR = os.path.dirname(os.path.abspath(__file__))
+REGISTRY = {
+    "pig-vae": {
+        "type": "pth",
+        "hf_repo": "Wan-AI/Wan2.1-T2V-1.3B",
+        "hf_file": "Wan2.1_VAE.pth",
+        "desc": "Video VAE (16 latent channels, 84M params)",
+        "gguf_repo": "calcuis/pig-vae",
+        "gguf_file": "pig_wan_vae_fp32-f16.gguf",
+    },
+    "opensora-vae": {
+        "type": "pipeline",
+        "hf_repo": "hpcai-tech/OpenSora-VAE-v1.2",
+        "desc": "3D VAE (4 latent channels, 384M params, 8× spatial + 4× temporal compression)",
+    },
+}
+def convert_gguf_to_safetensors(name: str):
+    """Convert GGUF checkpoint to safetensors (pig-vae only)."""
+    dest = os.path.join(MODELS_DIR, name)
+    gguf_path = os.path.join(dest, f"{name.replace('-', '_')}_fp32-f16.gguf")
+    # Try alternate names
+    if not os.path.isfile(gguf_path):
+        alt = os.path.join(dest, "pig_wan_vae_fp32-f16.gguf")
+        if os.path.isfile(alt):
+            gguf_path = alt
+    if not os.path.isfile(gguf_path):
+        print(f"  No GGUF file found in {dest}")
+        return False
+    print(f"  Converting {gguf_path} to safetensors...", flush=True)
+    import gguf
+    import safetensors.torch
+    import torch
+    reader = gguf.GGUFReader(gguf_path)
+    # torch.tensor copies each GGUF array into a regular in-memory tensor
+    state_dict = {t.name: torch.tensor(t.data) for t in reader.tensors}
+    safetensors_path = os.path.join(dest, "model.safetensors")
+    safetensors.torch.save_file(state_dict, safetensors_path)
+    size = os.path.getsize(safetensors_path)
+    print(f"  ✓ Saved {safetensors_path} ({size/1e6:.0f} MB, {len(state_dict)} tensors)", flush=True)
+    return True
+def download_model(name: str, convert: bool = False):
+    """Download a single model to local cache."""
+    entry = REGISTRY.get(name)
+    if entry is None:
+        print(f"Unknown model: {name}. Options: {list(REGISTRY.keys())}")
+        return False
+    dest = os.path.join(MODELS_DIR, name)
+    os.makedirs(dest, exist_ok=True)
+    if entry["type"] == "auto":
+        from transformers import AutoModel
+        print(f"Downloading {name} ({entry['desc']})...", flush=True)
+        model = AutoModel.from_pretrained(entry["hf_repo"], low_cpu_mem_usage=True)
+        model.save_pretrained(dest)
+        print(f"  ✓ {name} saved to {dest}", flush=True)
+        if "dinov2" in name or "vit" in name:
+            from transformers import AutoImageProcessor
+            proc = AutoImageProcessor.from_pretrained(entry["hf_repo"])
+            proc.save_pretrained(dest)
+        elif "moonshine" in name:
+            from transformers import AutoFeatureExtractor
+            proc = AutoFeatureExtractor.from_pretrained(entry["hf_repo"])
+            proc.save_pretrained(dest)
+    elif entry["type"] == "pth":
+        from huggingface_hub import hf_hub_download
+        print(f"Downloading {name} ({entry['desc']})...", flush=True)
+        hf_hub_download(entry["hf_repo"], entry["hf_file"],
+                        local_dir=dest, local_dir_use_symlinks=False)
+        print(f"  ✓ {name} .pth saved to {dest}", flush=True)
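+        if convert:
+            # The converter reads a GGUF file, so fetch it first when the registry names one.
+            if "gguf_repo" in entry and "gguf_file" in entry:
+                hf_hub_download(entry["gguf_repo"], entry["gguf_file"],
+                                local_dir=dest, local_dir_use_symlinks=False)
+            convert_gguf_to_safetensors(name)
+    else:
+        # Types without a download branch (e.g. "pipeline") must not report success.
+        print(f"  Unsupported model type for {name}: {entry['type']}")
+        return False
+    return True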
+def download_all(convert: bool = False):
+    success = 0
+    for name in REGISTRY:
+        if os.path.isdir(os.path.join(MODELS_DIR, name)):
+            existing = [f for f in os.listdir(os.path.join(MODELS_DIR, name))
+                       if f.endswith(('.safetensors', '.pt', '.pth', '.gguf'))]
+            if existing:
+                print(f"  ○ {name} already exists — skipping")
+                success += 1
+                continue
+        if download_model(name, convert=convert):
+            success += 1
+    print(f"\nDownloaded {success}/{len(REGISTRY)} models to {MODELS_DIR}")
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Download encoder models for ARB")
+    parser.add_argument("--model", type=str, default=None,
+                        help=f"Model ({', '.join(REGISTRY.keys())})")
+    parser.add_argument("--convert", action="store_true",
+                        help="Convert pig-vae GGUF→safetensors after download")
+    parser.add_argument("--list", action="store_true", help="List available models")
+    args = parser.parse_args()
+    if args.list:
+        for name, info in REGISTRY.items():
+            d = os.path.join(MODELS_DIR, name)
+            files = os.listdir(d) if os.path.isdir(d) else []
+            status = "✓" if any(f.endswith(('.safetensors', '.pt', '.pth')) for f in files) else "○"
+            print(f"  {status} {name:<20} {info['desc']}")
+        sys.exit(0)
+    if args.model:
+        download_model(args.model, convert=args.convert)
+    else:
+        download_all(convert=args.convert)
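
`download_all` above treats a model as already present when its directory contains at least one file with a known weights suffix. A standalone sketch of that check, assuming the same suffix list; `has_weights` is an illustrative helper name, not a function in the module:

```python
import os
import tempfile

WEIGHT_SUFFIXES = (".safetensors", ".pt", ".pth", ".gguf")


def has_weights(model_dir: str) -> bool:
    """True if model_dir exists and holds at least one weights file."""
    if not os.path.isdir(model_dir):
        return False
    return any(f.endswith(WEIGHT_SUFFIXES) for f in os.listdir(model_dir))


with tempfile.TemporaryDirectory() as root:
    empty_dir = os.path.join(root, "empty-model")
    full_dir = os.path.join(root, "full-model")
    os.makedirs(empty_dir)
    os.makedirs(full_dir)
    # A zero-byte placeholder is enough to satisfy the suffix check.
    open(os.path.join(full_dir, "model.safetensors"), "w").close()

    missing = has_weights(os.path.join(root, "no-such-model"))
    empty = has_weights(empty_dir)
    full = has_weights(full_dir)

print(missing, empty, full)
```

The check looks at file names only, so an interrupted download that left a truncated file behind would still be skipped on the next run.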

arbitor/encoders/models/opensora-vae/config.json ADDED

	@@ -0,0 +1,35 @@

+{
+  "architectures": [
+    "VideoAutoencoderPipeline"
+  ],
+  "cal_loss": false,
+  "freeze_vae_2d": false,
+  "from_pretrained": null,
+  "micro_frame_size": 17,
+  "model_type": "VideoAutoencoderPipeline",
+  "scale": [
+    3.85,
+    2.32,
+    2.33,
+    3.06
+  ],
+  "shift": [
+    -0.1,
+    0.34,
+    0.27,
+    0.98
+  ],
+  "torch_dtype": "float32",
+  "transformers_version": "4.36.2",
+  "vae_2d": {
+    "from_pretrained": "PixArt-alpha/pixart_sigma_sdxlvae_T5_diffusers",
+    "local_files_only": false,
+    "micro_batch_size": 4,
+    "subfolder": "vae",
+    "type": "VideoAutoencoderKL"
+  },
+  "vae_temporal": {
+    "from_pretrained": null,
+    "type": "VAE_Temporal_SD"
+  }
+}
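
The `scale` and `shift` arrays in this config hold one value per latent channel (four channels). A sketch of how such per-channel statistics are typically applied, assuming the convention `(z - shift) / scale` on encode and `z * scale + shift` on decode; the config itself does not state this formula, and the function names are illustrative:

```python
import torch

# Values copied from config.json; reshaped to broadcast over [B, C, T, H, W].
SCALE = torch.tensor([3.85, 2.32, 2.33, 3.06]).view(1, 4, 1, 1, 1)
SHIFT = torch.tensor([-0.10, 0.34, 0.27, 0.98]).view(1, 4, 1, 1, 1)


def normalize_latent(z: torch.Tensor) -> torch.Tensor:
    """Assumed encode-side step: centre and rescale each latent channel."""
    return (z - SHIFT) / SCALE


def denormalize_latent(z: torch.Tensor) -> torch.Tensor:
    """Assumed decode-side step: the exact inverse of normalize_latent."""
    return z * SCALE + SHIFT


torch.manual_seed(0)
z = torch.randn(2, 4, 5, 8, 8)
restored = denormalize_latent(normalize_latent(z))
print(torch.allclose(restored, z, atol=1e-5))
```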

arbitor/encoders/models/opensora-vae/model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:057f368f538ca04540c0728a6b8ef80ff529077c5a1c4ba810eb8ba017b8d7c9
+size 1573430548

arbitor/encoders/models/pig-vae/model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f5a9bb06d188abdf1585142159b84e2c6c8aaa3e64bb5c5792b7316ee2b44785
+size 253879612

arbitor/encoders/opensora_vae.py ADDED

	@@ -0,0 +1,145 @@

+"""Open-Sora 3D VAE v1.2 sidecar module.
+Latent: [B, 4, T/4, H/8, W/8]
+8× spatial compression, 4× temporal compression.
+Frozen float32 sidecar (no gradients).
+Uses PixArt SDXL VAE (from diffusers) for spatial encoding/decoding.
+Temporal VAE requires opensora package or custom module loading.
+"""
+import os
+import torch
+import torch.nn as nn
+from safetensors import safe_open
+_LOCAL_VAE_DIR = os.path.join(os.path.dirname(__file__), "models", "opensora-vae")
+_VAE_CONFIG = {
+    "scale": (3.85, 2.32, 2.33, 3.06),
+    "shift": (-0.10, 0.34, 0.27, 0.98),
+    "micro_frame_size": 17,
+}
+_QUANTO_CLASS_MARKERS = ("Q", "Quanto", "Quantized", "WeightQ")
+def _mark_quantized_sidecar(module, quant_type, applied):
+    module._arb_quantize_requested = quant_type
+    module._arb_quantized_int8 = bool(applied and quant_type == "int8")
+    module._arb_quantized = bool(applied)
+    for p in module.parameters():
+        p.requires_grad = False
+    return module
+def _has_quantized_modules(module):
+    return any(
+        any(marker in type(child).__name__ for marker in _QUANTO_CLASS_MARKERS)
+        for child in module.modules()
+    )
+def _freeze_sidecar(model, quantize_requested=None, quantized=False):
+    _mark_quantized_sidecar(model, quantize_requested, quantized)
+    return model
+def _quantize_int8_if_requested(model, quantize):
+    if quantize is None:
+        model = model.to(torch.bfloat16)
+        _mark_quantized_sidecar(model, quantize, False)
+        return model
+    try:
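+        # Alias the import so it does not shadow the `quantize` argument; qint8 must be imported too.
+        from optimum.quanto import freeze, qint8, quantize as quanto_quantize
+        qtype = {"int8": qint8}.get(quantize)
+        if qtype is None:
+            model = model.to(torch.bfloat16)
+            _mark_quantized_sidecar(model, quantize, False)
+            return model
+        quanto_quantize(model, weights=qtype)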
+        freeze(model)
+        _mark_quantized_sidecar(model, quantize, _has_quantized_modules(model))
+    except ImportError:
+        model = model.to(torch.bfloat16)
+        _mark_quantized_sidecar(model, quantize, False)
+    return model
+def load_opensora_vae(device="cuda", quantize=None):
+    """Load Open-Sora 3D VAE as frozen float32 sidecar.
+    Loads the spatial VAE from PixArt SDXL (diffusers) and the temporal
+    VAE from local safetensors. Falls back to spatial-only if temporal
+    module can't be loaded.
+    """
+    try:
+        from diffusers import AutoencoderKL
+    except ImportError as exc:
+        raise RuntimeError("diffusers is required for the Open-Sora VAE spatial component") from exc
+    # Load spatial VAE
+    spatial_vae = AutoencoderKL.from_pretrained(
+        "PixArt-alpha/pixart_sigma_sdxlvae_T5_diffusers",
+        subfolder="vae",
+        torch_dtype=torch.float32,
+    ).to(device)
+    spatial_vae.eval()
+    # Try to load temporal VAE weights
+    temporal_state = {}
+    safetensors_path = os.path.join(_LOCAL_VAE_DIR, "model.safetensors")
+    if os.path.isfile(safetensors_path):
+        with safe_open(safetensors_path, framework="pt") as f:
+            for k in f.keys():
+                if k.startswith("temporal_vae."):
+                    temporal_state[k] = f.get_tensor(k)
+                if k.startswith("scale"):
+                    temporal_state["scale"] = f.get_tensor(k)
+                if k.startswith("shift"):
+                    temporal_state["shift"] = f.get_tensor(k)
+    _freeze_sidecar(spatial_vae, quantize, False)
+    return OpenSoraVAEWrapper(spatial_vae, temporal_state)
+class OpenSoraVAEWrapper(nn.Module):
+    def __init__(self, spatial_vae, temporal_state=None):
+        super().__init__()
+        self.spatial = spatial_vae
+        self.latent_channels = 4
+        self.scale_factor_spatial = 8
+        self.scale_factor_temporal = 4
+        self.temporal_state = temporal_state
+        self.temporal_loaded = temporal_state is not None and len(temporal_state) > 0
+    @torch.no_grad()
+    def encode(self, video_tensor):
+        """Encode video tensor: [B,3,T,H,W] → [B,4,T/4,H/8,W/8]."""
+        B, C, T, H, W = video_tensor.shape
+        # Process frame-by-frame through spatial VAE
+        latents = []
+        for t in range(T):
+            frame = video_tensor[:, :, t, :, :]
+            latent = self.spatial.encode(frame).latent_dist.sample()
+            latents.append(latent)
+        latent = torch.stack(latents, dim=2)
+        # Scale
+        latent = latent * 0.18215
+        # Temporal downsample (simple: take every 4th)
+        if latent.shape[2] >= 4:
+            latent = latent[:, :, ::4, :, :]
+        return latent
+    @torch.no_grad()
+    def decode(self, latents, num_frames=None):
+        """Decode latents: [B,4,T/4,H/8,W/8] → [B,3,T,H,W]."""
+        B, C, T, H, W = latents.shape
+        # Temporal upsample (repeat each latent 4×)
+        latents = latents.repeat_interleave(4, dim=2)
+        # Unscale
+        latents = latents / 0.18215
+        # Decode frame-by-frame
+        frames = []
+        for t in range(latents.shape[2]):
+            frame = latents[:, :, t, :, :]
+            decoded = self.spatial.decode(frame).sample
+            frames.append(decoded)
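+        video = torch.stack(frames, dim=2)
+        # Repeating latents 4× can overshoot the source length; trim when the caller supplies it.
+        if num_frames is not None:
+            video = video[:, :, :num_frames]
+        return video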
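
The wrapper above approximates temporal compression by keeping every 4th latent frame on encode and repeating each kept frame 4× on decode. The frame counts only round-trip exactly when the clip length is a multiple of 4. A small sketch of that arithmetic on dummy tensors, with no VAE involved; the helper names are illustrative:

```python
import torch


def temporal_downsample(latent: torch.Tensor, stride: int = 4) -> torch.Tensor:
    """Keep every `stride`-th frame along dim 2, as encode() does for T >= stride."""
    if latent.shape[2] >= stride:
        return latent[:, :, ::stride]
    return latent


def temporal_upsample(latent: torch.Tensor, stride: int = 4, num_frames=None) -> torch.Tensor:
    """Repeat each latent frame `stride` times; optionally trim to num_frames."""
    up = latent.repeat_interleave(stride, dim=2)
    return up if num_frames is None else up[:, :, :num_frames]


latent = torch.zeros(1, 4, 17, 8, 8)               # 17 frames, e.g. one micro-frame chunk
down = temporal_downsample(latent)                  # keeps frames 0, 4, 8, 12, 16
up = temporal_upsample(down)                        # 5 kept frames repeated 4x each
trimmed = temporal_upsample(down, num_frames=17)    # cut back to the source length
print(down.shape[2], up.shape[2], trimmed.shape[2])  # 5 20 17
```

For a 17-frame input the stride keeps 5 latent frames, so an untrimmed decode yields 20 frames; passing the original frame count restores the expected length.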

arbitor/encoders/opensora_vae_modules/autoencoder_2d.py ADDED

	@@ -0,0 +1,339 @@

+# Modified from Flux
+#
+# Copyright 2024 Black Forest Labs
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#     http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+from dataclasses import dataclass
+import torch
+from einops import rearrange
+from torch import Tensor, nn
+from torch.nn.functional import silu as swish
+from opensora.registry import MODELS
+from opensora.utils.ckpt import load_checkpoint
+from .utils import DiagonalGaussianDistribution
+@dataclass
+class AutoEncoderConfig:
+    from_pretrained: str | None
+    cache_dir: str | None
+    resolution: int
+    in_channels: int
+    ch: int
+    out_ch: int
+    ch_mult: list[int]
+    num_res_blocks: int
+    z_channels: int
+    scale_factor: float
+    shift_factor: float
+    sample: bool = True
+class AttnBlock(nn.Module):
+    def __init__(self, in_channels: int):
+        super().__init__()
+        self.norm = nn.GroupNorm(num_groups=32, num_channels=in_channels, eps=1e-6, affine=True)
+        self.q = nn.Conv2d(in_channels, in_channels, kernel_size=1)
+        self.k = nn.Conv2d(in_channels, in_channels, kernel_size=1)
+        self.v = nn.Conv2d(in_channels, in_channels, kernel_size=1)
+        self.proj_out = nn.Conv2d(in_channels, in_channels, kernel_size=1)
+    def attention(self, h_: Tensor) -> Tensor:
+        h_ = self.norm(h_)
+        q = self.q(h_)
+        k = self.k(h_)
+        v = self.v(h_)
+        b, c, h, w = q.shape
+        q = rearrange(q, "b c h w -> b 1 (h w) c").contiguous()
+        k = rearrange(k, "b c h w -> b 1 (h w) c").contiguous()
+        v = rearrange(v, "b c h w -> b 1 (h w) c").contiguous()
+        h_ = nn.functional.scaled_dot_product_attention(q, k, v)
+        return rearrange(h_, "b 1 (h w) c -> b c h w", h=h, w=w)
+    def forward(self, x: Tensor) -> Tensor:
+        return x + self.proj_out(self.attention(x))
+class ResnetBlock(nn.Module):
+    def __init__(self, in_channels: int, out_channels: int | None = None):
+        super().__init__()
+        self.in_channels = in_channels
+        out_channels = in_channels if out_channels is None else out_channels
+        self.out_channels = out_channels
+        self.norm1 = nn.GroupNorm(num_groups=32, num_channels=in_channels, eps=1e-6, affine=True)
+        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1, padding=1)
+        self.norm2 = nn.GroupNorm(num_groups=32, num_channels=out_channels, eps=1e-6, affine=True)
+        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1)
+        if self.in_channels != self.out_channels:
+            self.nin_shortcut = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=1, padding=0)
+    def forward(self, x):
+        h = x
+        h = self.norm1(h)
+        h = swish(h)
+        h = self.conv1(h)
+        h = self.norm2(h)
+        h = swish(h)
+        h = self.conv2(h)
+        if self.in_channels != self.out_channels:
+            x = self.nin_shortcut(x)
+        return x + h
+class Downsample(nn.Module):
+    def __init__(self, in_channels: int):
+        super().__init__()
+        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, stride=2, padding=0)
+    def forward(self, x: Tensor) -> Tensor:
+        pad = (0, 1, 0, 1)
+        x = nn.functional.pad(x, pad, mode="constant", value=0)
+        return self.conv(x)
+class Upsample(nn.Module):
+    def __init__(self, in_channels: int):
+        super().__init__()
+        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, stride=1, padding=1)
+    def forward(self, x: Tensor) -> Tensor:
+        x = nn.functional.interpolate(x, scale_factor=2.0, mode="nearest")
+        return self.conv(x)
+class Encoder(nn.Module):
+    def __init__(self, config: AutoEncoderConfig):
+        super().__init__()
+        self.ch = config.ch
+        self.num_resolutions = len(config.ch_mult)
+        self.num_res_blocks = config.num_res_blocks
+        self.resolution = config.resolution
+        self.in_channels = config.in_channels
+        # downsampling
+        self.conv_in = nn.Conv2d(config.in_channels, self.ch, kernel_size=3, stride=1, padding=1)
+        curr_res = config.resolution
+        in_ch_mult = (1,) + tuple(config.ch_mult)
+        self.in_ch_mult = in_ch_mult
+        self.down = nn.ModuleList()
+        block_in = self.ch
+        for i_level in range(self.num_resolutions):
+            block = nn.ModuleList()
+            attn = nn.ModuleList()
+            block_in = config.ch * in_ch_mult[i_level]
+            block_out = config.ch * config.ch_mult[i_level]
+            for _ in range(self.num_res_blocks):
+                block.append(ResnetBlock(in_channels=block_in, out_channels=block_out))
+                block_in = block_out
+            down = nn.Module()
+            down.block = block
+            down.attn = attn
+            if i_level != self.num_resolutions - 1:
+                down.downsample = Downsample(block_in)
+                curr_res = curr_res // 2
+            self.down.append(down)
+        # middle
+        self.mid = nn.Module()
+        self.mid.block_1 = ResnetBlock(in_channels=block_in, out_channels=block_in)
+        self.mid.attn_1 = AttnBlock(block_in)
+        self.mid.block_2 = ResnetBlock(in_channels=block_in, out_channels=block_in)
+        # end
+        self.norm_out = nn.GroupNorm(num_groups=32, num_channels=block_in, eps=1e-6, affine=True)
+        self.conv_out = nn.Conv2d(block_in, 2 * config.z_channels, kernel_size=3, stride=1, padding=1)
+    def forward(self, x: Tensor) -> Tensor:
+        # downsampling
+        hs = [self.conv_in(x)]
+        for i_level in range(self.num_resolutions):
+            for i_block in range(self.num_res_blocks):
+                h = self.down[i_level].block[i_block](hs[-1])
+                if len(self.down[i_level].attn) > 0:
+                    h = self.down[i_level].attn[i_block](h)
+                hs.append(h)
+            if i_level != self.num_resolutions - 1:
+                hs.append(self.down[i_level].downsample(hs[-1]))
+        # middle
+        h = hs[-1]
+        h = self.mid.block_1(h)
+        h = self.mid.attn_1(h)
+        h = self.mid.block_2(h)
+        # end
+        h = self.norm_out(h)
+        h = swish(h)
+        h = self.conv_out(h)
+        return h
+class Decoder(nn.Module):
+    def __init__(self, config: AutoEncoderConfig):
+        super().__init__()
+        self.ch = config.ch
+        self.num_resolutions = len(config.ch_mult)
+        self.num_res_blocks = config.num_res_blocks
+        self.resolution = config.resolution
+        self.in_channels = config.in_channels
+        self.ffactor = 2 ** (self.num_resolutions - 1)
+        block_in = config.ch * config.ch_mult[self.num_resolutions - 1]
+        curr_res = config.resolution // 2 ** (self.num_resolutions - 1)
+        self.z_shape = (1, config.z_channels, curr_res, curr_res)
+        # z to block_in
+        self.conv_in = nn.Conv2d(config.z_channels, block_in, kernel_size=3, stride=1, padding=1)
+        # middle
+        self.mid = nn.Module()
+        self.mid.block_1 = ResnetBlock(in_channels=block_in, out_channels=block_in)
+        self.mid.attn_1 = AttnBlock(block_in)
+        self.mid.block_2 = ResnetBlock(in_channels=block_in, out_channels=block_in)
+        # upsampling
+        self.up = nn.ModuleList()
+        for i_level in reversed(range(self.num_resolutions)):
+            block = nn.ModuleList()
+            attn = nn.ModuleList()
+            block_out = config.ch * config.ch_mult[i_level]
+            for _ in range(self.num_res_blocks + 1):
+                block.append(ResnetBlock(in_channels=block_in, out_channels=block_out))
+                block_in = block_out
+            up = nn.Module()
+            up.block = block
+            up.attn = attn
+            if i_level != 0:
+                up.upsample = Upsample(block_in)
+                curr_res = curr_res * 2
+            self.up.insert(0, up)  # prepend to get consistent order
+        # end
+        self.norm_out = nn.GroupNorm(num_groups=32, num_channels=block_in, eps=1e-6, affine=True)
+        self.conv_out = nn.Conv2d(block_in, config.out_ch, kernel_size=3, stride=1, padding=1)
+    def forward(self, z: Tensor) -> Tensor:
+        # z to block_in
+        h = self.conv_in(z)
+        # middle
+        h = self.mid.block_1(h)
+        h = self.mid.attn_1(h)
+        h = self.mid.block_2(h)
+        # upsampling
+        for i_level in reversed(range(self.num_resolutions)):
+            for i_block in range(self.num_res_blocks + 1):
+                h = self.up[i_level].block[i_block](h)
+                if len(self.up[i_level].attn) > 0:
+                    h = self.up[i_level].attn[i_block](h)
+            if i_level != 0:
+                h = self.up[i_level].upsample(h)
+        # end
+        h = self.norm_out(h)
+        h = swish(h)
+        return self.conv_out(h)
+class AutoEncoder(nn.Module):
+    def __init__(self, config: AutoEncoderConfig):
+        super().__init__()
+        self.encoder = Encoder(config)
+        self.decoder = Decoder(config)
+        self.scale_factor = config.scale_factor
+        self.shift_factor = config.shift_factor
+        self.sample = config.sample
+    def encode_(self, x: Tensor) -> tuple[Tensor, DiagonalGaussianDistribution]:
+        T = x.shape[2]
+        x = rearrange(x, "b c t h w -> (b t) c h w")
+        params = self.encoder(x)
+        params = rearrange(params, "(b t) c h w -> b c t h w", t=T)
+        posterior = DiagonalGaussianDistribution(params)
+        if self.sample:
+            z = posterior.sample()
+        else:
+            z = posterior.mode()
+        z = self.scale_factor * (z - self.shift_factor)
+        return z, posterior
+    def encode(self, x: Tensor) -> Tensor:
+        return self.encode_(x)[0]
+    def decode(self, z: Tensor) -> Tensor:
+        T = z.shape[2]
+        z = rearrange(z, "b c t h w -> (b t) c h w")
+        z = z / self.scale_factor + self.shift_factor
+        x = self.decoder(z)
+        x = rearrange(x, "(b t) c h w -> b c t h w", t=T)
+        return x
+    def forward(self, x: Tensor) -> tuple[Tensor, DiagonalGaussianDistribution, Tensor]:
+        # encode
+        z, posterior = self.encode_(x)
+        # decode
+        x_rec = self.decode(z)
+        return x_rec, posterior, z
+    def get_last_layer(self):
+        return self.decoder.conv_out.weight
+@MODELS.register_module("autoencoder_2d")
+def AutoEncoderFlux(
+    from_pretrained: str,
+    cache_dir=None,
+    resolution=256,
+    in_channels=3,
+    ch=128,
+    out_ch=3,
+    ch_mult=[1, 2, 4, 4],
+    num_res_blocks=2,
+    z_channels=16,
+    scale_factor=0.3611,
+    shift_factor=0.1159,
+    device_map: str | torch.device = "cuda",
+    torch_dtype: torch.dtype = torch.bfloat16,
+) -> AutoEncoder:
+    config = AutoEncoderConfig(
+        from_pretrained=from_pretrained,
+        cache_dir=cache_dir,
+        resolution=resolution,
+        in_channels=in_channels,
+        ch=ch,
+        out_ch=out_ch,
+        ch_mult=ch_mult,
+        num_res_blocks=num_res_blocks,
+        z_channels=z_channels,
+        scale_factor=scale_factor,
+        shift_factor=shift_factor,
+    )
+    with torch.device(device_map):
+        model = AutoEncoder(config).to(torch_dtype)
+    if from_pretrained:
+        model = load_checkpoint(model, from_pretrained, cache_dir=cache_dir, device_map=device_map)
+    return model
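
`AutoEncoder.encode_` and `decode` above run a 2D network over video by folding the time axis into the batch axis (`"b c t h w -> (b t) c h w"`) and unfolding it afterwards. A sketch of the same reshaping with plain `permute`/`reshape`, which should be equivalent to the einops patterns used in the file; `fold_time` and `unfold_time` are illustrative names:

```python
import torch


def fold_time(x: torch.Tensor) -> torch.Tensor:
    """[B, C, T, H, W] -> [B*T, C, H, W], with frames of one clip kept adjacent."""
    b, c, t, h, w = x.shape
    return x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)


def unfold_time(y: torch.Tensor, t: int) -> torch.Tensor:
    """[B*T, C, H, W] -> [B, C, T, H, W]; inverse of fold_time for the same T."""
    bt, c, h, w = y.shape
    return y.reshape(bt // t, t, c, h, w).permute(0, 2, 1, 3, 4)


video = torch.arange(2 * 3 * 4 * 5 * 5, dtype=torch.float32).reshape(2, 3, 4, 5, 5)
frames = fold_time(video)          # 2 clips x 4 frames -> 8 images
restored = unfold_time(frames, t=4)
print(frames.shape, torch.equal(restored, video))
```

Each frame is processed independently this way, so the 2D autoencoder carries no information between frames; any temporal modelling has to come from a separate component.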

arbitor/encoders/opensora_vae_modules/autoencoder_kl_causal_3d.py ADDED

	@@ -0,0 +1,638 @@

+# Modified from diffusers==0.29.2 and HunyuanVideo
+#
+# Copyright 2024 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+# Copyright 2024 HunyuanVideo
+#
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+from dataclasses import dataclass
+from typing import Dict, Optional, Tuple, Union
+import torch
+import torch.nn as nn
+from diffusers.configuration_utils import ConfigMixin, register_to_config
+from opensora.registry import MODELS
+from opensora.utils.ckpt import load_checkpoint
+try:
+    # This diffusers is modified and packed in the mirror.
+    from diffusers.loaders import FromOriginalVAEMixin
+except ImportError:
+    # Use this to be compatible with the original diffusers.
+    from diffusers.loaders.single_file_model import FromOriginalModelMixin as FromOriginalVAEMixin
+from diffusers.models.attention_processor import (
+    ADDED_KV_ATTENTION_PROCESSORS,
+    CROSS_ATTENTION_PROCESSORS,
+    Attention,
+    AttentionProcessor,
+    AttnAddedKVProcessor,
+    AttnProcessor,
+)
+from diffusers.models.modeling_utils import ModelMixin
+from diffusers.utils.accelerate_utils import apply_forward_hook
+from opensora.models.hunyuan_vae.vae import (
+    DecoderCausal3D,
+    DecoderOutput,
+    DiagonalGaussianDistribution,
+    EncoderCausal3D,
+)
+@dataclass
+class AutoEncoder3DConfig:
+    from_pretrained: str | None
+    act_fn: str = "silu"
+    in_channels: int = 3
+    out_channels: int = 3
+    latent_channels: int = 16
+    layers_per_block: int = 2
+    norm_num_groups: int = 32
+    scale_factor: float = 0.476986
+    shift_factor: float = 0.0
+    time_compression_ratio: int = 4
+    spatial_compression_ratio: int = 8
+    mid_block_add_attention: bool = True
+    block_out_channels: tuple[int, ...] = (128, 256, 512, 512)
+    sample_size: int = 256
+    sample_tsize: int = 64
+    use_slicing: bool = False
+    use_spatial_tiling: bool = False
+    use_temporal_tiling: bool = False
+    tile_overlap_factor: float = 0.25
+    dropout: float = 0.0
+    channel: bool = False
+class AutoencoderKLCausal3D(ModelMixin, ConfigMixin, FromOriginalVAEMixin):
+    r"""
+    A VAE model with KL loss for encoding images/videos into latents and decoding latent representations into images/videos.
+    This model inherits from [`ModelMixin`]. Check the superclass documentation for its generic methods implemented
+    for all models (such as downloading or saving).
+    """
+    _supports_gradient_checkpointing = True
+    @register_to_config
+    def __init__(self, config: AutoEncoder3DConfig):
+        super().__init__()
+        self.scale_factor = config.scale_factor
+        self.shift_factor = config.shift_factor
+        self.time_compression_ratio = config.time_compression_ratio
+        self.spatial_compression_ratio = config.spatial_compression_ratio
+        self.z_channels = config.latent_channels
+        self.encoder = EncoderCausal3D(
+            in_channels=config.in_channels,
+            out_channels=config.latent_channels,
+            block_out_channels=config.block_out_channels,
+            layers_per_block=config.layers_per_block,
+            act_fn=config.act_fn,
+            norm_num_groups=config.norm_num_groups,
+            double_z=True,
+            time_compression_ratio=config.time_compression_ratio,
+            spatial_compression_ratio=config.spatial_compression_ratio,
+            mid_block_add_attention=config.mid_block_add_attention,
+            dropout=config.dropout,
+        )
+        self.decoder = DecoderCausal3D(
+            in_channels=config.latent_channels,
+            out_channels=config.out_channels,
+            block_out_channels=config.block_out_channels,
+            layers_per_block=config.layers_per_block,
+            norm_num_groups=config.norm_num_groups,
+            act_fn=config.act_fn,
+            time_compression_ratio=config.time_compression_ratio,
+            spatial_compression_ratio=config.spatial_compression_ratio,
+            mid_block_add_attention=config.mid_block_add_attention,
+            dropout=config.dropout,
+        )
+        self.quant_conv = nn.Conv3d(2 * config.latent_channels, 2 * config.latent_channels, kernel_size=1)
+        self.post_quant_conv = nn.Conv3d(config.latent_channels, config.latent_channels, kernel_size=1)
+        self.use_slicing = config.use_slicing
+        self.use_spatial_tiling = config.use_spatial_tiling
+        self.use_temporal_tiling = config.use_temporal_tiling
+        # only relevant if vae tiling is enabled
+        self.tile_sample_min_tsize = config.sample_tsize
+        self.tile_latent_min_tsize = config.sample_tsize // config.time_compression_ratio
+        self.tile_sample_min_size = config.sample_size
+        sample_size = config.sample_size[0] if isinstance(config.sample_size, (list, tuple)) else config.sample_size
+        self.tile_latent_min_size = int(sample_size / (2 ** (len(config.block_out_channels) - 1)))
+        self.tile_overlap_factor = config.tile_overlap_factor
+    def enable_temporal_tiling(self, use_tiling: bool = True):
+        self.use_temporal_tiling = use_tiling
+    def disable_temporal_tiling(self):
+        self.enable_temporal_tiling(False)
+    def enable_spatial_tiling(self, use_tiling: bool = True):
+        self.use_spatial_tiling = use_tiling
+    def disable_spatial_tiling(self):
+        self.enable_spatial_tiling(False)
+    def enable_tiling(self, use_tiling: bool = True):
+        r"""
+        Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to
+        compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow
+        processing larger videos.
+        """
+        self.enable_spatial_tiling(use_tiling)
+        self.enable_temporal_tiling(use_tiling)
+    def disable_tiling(self):
+        r"""
+        Disable tiled VAE decoding. If `enable_tiling` was previously enabled, this method will go back to computing
+        decoding in one step.
+        """
+        self.disable_spatial_tiling()
+        self.disable_temporal_tiling()
+    def enable_slicing(self):
+        r"""
+        Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
+        compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
+        """
+        self.use_slicing = True
+    def disable_slicing(self):
+        r"""
+        Disable sliced VAE decoding. If `enable_slicing` was previously enabled, this method will go back to computing
+        decoding in one step.
+        """
+        self.use_slicing = False
+    @property
+    # Copied from diffusers.models.unet_2d_condition.UNet2DConditionModel.attn_processors
+    def attn_processors(self) -> Dict[str, AttentionProcessor]:
+        r"""
+        Returns:
+            `dict` of attention processors: A dictionary containing all attention processors used in the model,
+            indexed by weight name.
+        """
+        # set recursively
+        processors = {}
+        def fn_recursive_add_processors(name: str, module: torch.nn.Module, processors: Dict[str, AttentionProcessor]):
+            if hasattr(module, "get_processor"):
+                processors[f"{name}.processor"] = module.get_processor(return_deprecated_lora=True)
+            for sub_name, child in module.named_children():
+                fn_recursive_add_processors(f"{name}.{sub_name}", child, processors)
+            return processors
+        for name, module in self.named_children():
+            fn_recursive_add_processors(name, module, processors)
+        return processors
+    # Copied from diffusers.models.unet_2d_condition.UNet2DConditionModel.set_attn_processor
+    def set_attn_processor(
+        self, processor: Union[AttentionProcessor, Dict[str, AttentionProcessor]], _remove_lora=False
+    ):
+        r"""
+        Sets the attention processor to use to compute attention.
+        Parameters:
+            processor (`dict` of `AttentionProcessor` or only `AttentionProcessor`):
+                The instantiated processor class or a dictionary of processor classes that will be set as the processor
+                for **all** `Attention` layers.
+                If `processor` is a dict, the key needs to define the path to the corresponding cross attention
+                processor. This is strongly recommended when setting trainable attention processors.
+        """
+        count = len(self.attn_processors.keys())
+        if isinstance(processor, dict) and len(processor) != count:
+            raise ValueError(
+                f"A dict of processors was passed, but the number of processors {len(processor)} does not match the"
+                f" number of attention layers: {count}. Please make sure to pass {count} processor classes."
+            )
+        def fn_recursive_attn_processor(name: str, module: torch.nn.Module, processor):
+            if hasattr(module, "set_processor"):
+                if not isinstance(processor, dict):
+                    module.set_processor(processor, _remove_lora=_remove_lora)
+                else:
+                    module.set_processor(processor.pop(f"{name}.processor"), _remove_lora=_remove_lora)
+            for sub_name, child in module.named_children():
+                fn_recursive_attn_processor(f"{name}.{sub_name}", child, processor)
+        for name, module in self.named_children():
+            fn_recursive_attn_processor(name, module, processor)
+    # Copied from diffusers.models.unet_2d_condition.UNet2DConditionModel.set_default_attn_processor
+    def set_default_attn_processor(self):
+        """
+        Disables custom attention processors and sets the default attention implementation.
+        """
+        if all(proc.__class__ in ADDED_KV_ATTENTION_PROCESSORS for proc in self.attn_processors.values()):
+            processor = AttnAddedKVProcessor()
+        elif all(proc.__class__ in CROSS_ATTENTION_PROCESSORS for proc in self.attn_processors.values()):
+            processor = AttnProcessor()
+        else:
+            raise ValueError(
+                f"Cannot call `set_default_attn_processor` when attention processors are of type {next(iter(self.attn_processors.values()))}"
+            )
+        self.set_attn_processor(processor, _remove_lora=True)
+    @apply_forward_hook
+    def encode(
+        self,
+        x: torch.FloatTensor,
+        sample_posterior: bool = True,
+        return_posterior: bool = False,
+        generator: Optional[torch.Generator] = None,
+    ) -> Union[torch.FloatTensor, Tuple[DiagonalGaussianDistribution]]:
+        """
+        Encode a batch of images/videos into latents.
+        Args:
+            x (`torch.FloatTensor`): Input batch of images/videos with shape (B, C, T, H, W).
+            sample_posterior (`bool`, *optional*, defaults to `True`):
+                Whether to sample from the posterior; if `False`, the posterior mode is used.
+            return_posterior (`bool`, *optional*, defaults to `False`):
+                Whether to also return the posterior distribution.
+            generator (`torch.Generator`, *optional*): Random generator used when sampling.
+        Returns:
+                The shifted and scaled latents `z`. If `return_posterior` is True, the tuple `(z, posterior)` is
+                returned instead.
+        """
+        assert len(x.shape) == 5, "The input tensor should have 5 dimensions."
+        if self.use_temporal_tiling and x.shape[2] > self.tile_sample_min_tsize:
+            posterior = self.temporal_tiled_encode(x)
+        elif self.use_spatial_tiling and (
+            x.shape[-1] > self.tile_sample_min_size or x.shape[-2] > self.tile_sample_min_size
+        ):
+            posterior = self.spatial_tiled_encode(x)
+        else:
+            if self.use_slicing and x.shape[0] > 1:
+                encoded_slices = [self.encoder(x_slice) for x_slice in x.split(1)]
+                h = torch.cat(encoded_slices)
+            else:
+                h = self.encoder(x)
+            moments = self.quant_conv(h)
+            posterior = DiagonalGaussianDistribution(moments)
+        if sample_posterior:
+            z = posterior.sample(generator=generator)
+        else:
+            z = posterior.mode()
+        z = self.scale_factor * (z - self.shift_factor)  # shift & scale
+        if return_posterior:
+            return z, posterior
+        else:
+            return z
+    def _decode(self, z: torch.FloatTensor, return_dict: bool = True) -> Union[DecoderOutput, torch.FloatTensor]:
+        assert len(z.shape) == 5, "The input tensor should have 5 dimensions."
+        if self.use_temporal_tiling and z.shape[2] > self.tile_latent_min_tsize:
+            return self.temporal_tiled_decode(z, return_dict=return_dict)
+        if self.use_spatial_tiling and (
+            z.shape[-1] > self.tile_latent_min_size or z.shape[-2] > self.tile_latent_min_size
+        ):
+            return self.spatial_tiled_decode(z, return_dict=return_dict)
+        z = self.post_quant_conv(z)
+        dec = self.decoder(z)
+        if not return_dict:
+            return (dec,)
+        return DecoderOutput(sample=dec)
+    @apply_forward_hook
+    def decode(self, z: torch.FloatTensor) -> torch.FloatTensor:
+        """
+        Decode a batch of images/videos.
+        Args:
+            z (`torch.FloatTensor`): Input batch of latent vectors.
+        Returns:
+            `torch.FloatTensor`: The decoded batch of images/videos.
+        """
+        z = z / self.scale_factor + self.shift_factor  # scale & shift
+        if self.use_slicing and z.shape[0] > 1:
+            decoded_slices = [self._decode(z_slice).sample for z_slice in z.split(1)]
+            decoded = torch.cat(decoded_slices)
+        else:
+            decoded = self._decode(z).sample
+        return decoded
+    def blend_v(self, a: torch.Tensor, b: torch.Tensor, blend_extent: int) -> torch.Tensor:
+        blend_extent = min(a.shape[-2], b.shape[-2], blend_extent)
+        for y in range(blend_extent):
+            b[:, :, :, y, :] = a[:, :, :, -blend_extent + y, :] * (1 - y / blend_extent) + b[:, :, :, y, :] * (
+                y / blend_extent
+            )
+        return b
+    def blend_h(self, a: torch.Tensor, b: torch.Tensor, blend_extent: int) -> torch.Tensor:
+        blend_extent = min(a.shape[-1], b.shape[-1], blend_extent)
+        for x in range(blend_extent):
+            b[:, :, :, :, x] = a[:, :, :, :, -blend_extent + x] * (1 - x / blend_extent) + b[:, :, :, :, x] * (
+                x / blend_extent
+            )
+        return b
+    def blend_t(self, a: torch.Tensor, b: torch.Tensor, blend_extent: int) -> torch.Tensor:
+        blend_extent = min(a.shape[-3], b.shape[-3], blend_extent)
+        for x in range(blend_extent):
+            b[:, :, x, :, :] = a[:, :, -blend_extent + x, :, :] * (1 - x / blend_extent) + b[:, :, x, :, :] * (
+                x / blend_extent
+            )
+        return b
+    def spatial_tiled_encode(self, x: torch.FloatTensor, return_moments: bool = False) -> DiagonalGaussianDistribution:
+        r"""Encode a batch of images/videos using a tiled encoder.
+        When this option is enabled, the VAE will split the input tensor into tiles to compute encoding in several
+        steps. This is useful to keep memory use constant regardless of image/video size. The end result of tiled
+        encoding is different from non-tiled encoding because each tile is encoded independently. To avoid tiling
+        artifacts, the tiles overlap and are blended together to form a smooth output. You may still see tile-sized
+        changes in the output, but they should be much less noticeable.
+        Args:
+            x (`torch.FloatTensor`): Input batch of images/videos.
+            return_moments (`bool`, *optional*, defaults to `False`):
+                Whether to return the raw moments tensor instead of a `DiagonalGaussianDistribution`.
+        Returns:
+            `DiagonalGaussianDistribution` or `torch.FloatTensor`:
+                If return_moments is True, the blended moments tensor is returned, otherwise the posterior built
+                from it.
+        """
+        overlap_size = int(self.tile_sample_min_size * (1 - self.tile_overlap_factor))
+        blend_extent = int(self.tile_latent_min_size * self.tile_overlap_factor)
+        row_limit = self.tile_latent_min_size - blend_extent
+        # Split video into tiles and encode them separately.
+        rows = []
+        for i in range(0, x.shape[-2], overlap_size):
+            row = []
+            for j in range(0, x.shape[-1], overlap_size):
+                tile = x[:, :, :, i : i + self.tile_sample_min_size, j : j + self.tile_sample_min_size]
+                tile = self.encoder(tile)
+                tile = self.quant_conv(tile)
+                row.append(tile)
+            rows.append(row)
+        result_rows = []
+        for i, row in enumerate(rows):
+            result_row = []
+            for j, tile in enumerate(row):
+                # blend the above tile and the left tile
+                # to the current tile and add the current tile to the result row
+                if i > 0:
+                    tile = self.blend_v(rows[i - 1][j], tile, blend_extent)
+                if j > 0:
+                    tile = self.blend_h(row[j - 1], tile, blend_extent)
+                result_row.append(tile[:, :, :, :row_limit, :row_limit])
+            result_rows.append(torch.cat(result_row, dim=-1))
+        moments = torch.cat(result_rows, dim=-2)
+        if return_moments:
+            return moments
+        posterior = DiagonalGaussianDistribution(moments)
+        return posterior
+    def spatial_tiled_decode(
+        self, z: torch.FloatTensor, return_dict: bool = True
+    ) -> Union[DecoderOutput, torch.FloatTensor]:
+        r"""
+        Decode a batch of images/videos using a tiled decoder.
+        Args:
+            z (`torch.FloatTensor`): Input batch of latent vectors.
+            return_dict (`bool`, *optional*, defaults to `True`):
+                Whether or not to return a [`~models.vae.DecoderOutput`] instead of a plain tuple.
+        Returns:
+            [`~models.vae.DecoderOutput`] or `tuple`:
+                If return_dict is True, a [`~models.vae.DecoderOutput`] is returned, otherwise a plain `tuple` is
+                returned.
+        """
+        overlap_size = int(self.tile_latent_min_size * (1 - self.tile_overlap_factor))
+        blend_extent = int(self.tile_sample_min_size * self.tile_overlap_factor)
+        row_limit = self.tile_sample_min_size - blend_extent
+        # Split z into overlapping tiles and decode them separately.
+        # The tiles have an overlap to avoid seams between tiles.
+        rows = []
+        for i in range(0, z.shape[-2], overlap_size):
+            row = []
+            for j in range(0, z.shape[-1], overlap_size):
+                tile = z[:, :, :, i : i + self.tile_latent_min_size, j : j + self.tile_latent_min_size]
+                tile = self.post_quant_conv(tile)
+                decoded = self.decoder(tile)
+                row.append(decoded)
+            rows.append(row)
+        result_rows = []
+        for i, row in enumerate(rows):
+            result_row = []
+            for j, tile in enumerate(row):
+                # blend the above tile and the left tile
+                # to the current tile and add the current tile to the result row
+                if i > 0:
+                    tile = self.blend_v(rows[i - 1][j], tile, blend_extent)
+                if j > 0:
+                    tile = self.blend_h(row[j - 1], tile, blend_extent)
+                result_row.append(tile[:, :, :, :row_limit, :row_limit])
+            result_rows.append(torch.cat(result_row, dim=-1))
+        dec = torch.cat(result_rows, dim=-2)
+        if not return_dict:
+            return (dec,)
+        return DecoderOutput(sample=dec)
+    def temporal_tiled_encode(self, x: torch.FloatTensor) -> DiagonalGaussianDistribution:
+        B, C, T, H, W = x.shape
+        overlap_size = int(self.tile_sample_min_tsize * (1 - self.tile_overlap_factor))
+        blend_extent = int(self.tile_latent_min_tsize * self.tile_overlap_factor)
+        t_limit = self.tile_latent_min_tsize - blend_extent
+        # Split the video into tiles and encode them separately.
+        row = []
+        for i in range(0, T, overlap_size):
+            tile = x[:, :, i : i + self.tile_sample_min_tsize + 1, :, :]
+            if self.use_spatial_tiling and (
+                tile.shape[-1] > self.tile_sample_min_size or tile.shape[-2] > self.tile_sample_min_size
+            ):
+                tile = self.spatial_tiled_encode(tile, return_moments=True)
+            else:
+                tile = self.encoder(tile)
+                tile = self.quant_conv(tile)
+            if i > 0:
+                tile = tile[:, :, 1:, :, :]
+            row.append(tile)
+        result_row = []
+        for i, tile in enumerate(row):
+            if i > 0:
+                tile = self.blend_t(row[i - 1], tile, blend_extent)
+                result_row.append(tile[:, :, :t_limit, :, :])
+            else:
+                result_row.append(tile[:, :, : t_limit + 1, :, :])
+        moments = torch.cat(result_row, dim=2)
+        posterior = DiagonalGaussianDistribution(moments)
+        return posterior
+    def temporal_tiled_decode(
+        self, z: torch.FloatTensor, return_dict: bool = True
+    ) -> Union[DecoderOutput, torch.FloatTensor]:
+        # Split z into overlapping tiles and decode them separately.
+        B, C, T, H, W = z.shape
+        overlap_size = int(self.tile_latent_min_tsize * (1 - self.tile_overlap_factor))
+        blend_extent = int(self.tile_sample_min_tsize * self.tile_overlap_factor)
+        t_limit = self.tile_sample_min_tsize - blend_extent
+        row = []
+        for i in range(0, T, overlap_size):
+            tile = z[:, :, i : i + self.tile_latent_min_tsize + 1, :, :]
+            if self.use_spatial_tiling and (
+                tile.shape[-1] > self.tile_latent_min_size or tile.shape[-2] > self.tile_latent_min_size
+            ):
+                decoded = self.spatial_tiled_decode(tile, return_dict=True).sample
+            else:
+                tile = self.post_quant_conv(tile)
+                decoded = self.decoder(tile)
+            if i > 0:
+                decoded = decoded[:, :, 1:, :, :]
+            row.append(decoded)
+        result_row = []
+        for i, tile in enumerate(row):
+            if i > 0:
+                tile = self.blend_t(row[i - 1], tile, blend_extent)
+                result_row.append(tile[:, :, :t_limit, :, :])
+            else:
+                result_row.append(tile[:, :, : t_limit + 1, :, :])
+        dec = torch.cat(result_row, dim=2)
+        if not return_dict:
+            return (dec,)
+        return DecoderOutput(sample=dec)
+    def forward(
+        self,
+        sample: torch.FloatTensor,
+        sample_posterior: bool = True,
+        generator: Optional[torch.Generator] = None,
+    ) -> Tuple[torch.FloatTensor, DiagonalGaussianDistribution, torch.FloatTensor]:
+        r"""
+        Args:
+            sample (`torch.FloatTensor`): Input sample.
+            sample_posterior (`bool`, *optional*, defaults to `True`):
+                Whether to sample from the posterior.
+            generator (`torch.Generator`, *optional*): Random generator used when sampling.
+        Returns:
+            `tuple`: The reconstruction, the posterior, and the latents, as `(dec, posterior, z)`.
+        """
+        x = sample
+        z, posterior = self.encode(x, return_posterior=True, sample_posterior=sample_posterior, generator=generator)
+        dec = self.decode(z)
+        return (dec, posterior, z)
+    # Copied from diffusers.models.unet_2d_condition.UNet2DConditionModel.fuse_qkv_projections
+    def fuse_qkv_projections(self):
+        """
+        Enables fused QKV projections. For self-attention modules, all projection matrices (i.e., query,
+        key, value) are fused. For cross-attention modules, key and value projection matrices are fused.
+        <Tip warning={true}>
+        This API is 🧪 experimental.
+        </Tip>
+        """
+        self.original_attn_processors = None
+        for _, attn_processor in self.attn_processors.items():
+            if "Added" in str(attn_processor.__class__.__name__):
+                raise ValueError("`fuse_qkv_projections()` is not supported for models having added KV projections.")
+        self.original_attn_processors = self.attn_processors
+        for module in self.modules():
+            if isinstance(module, Attention):
+                module.fuse_projections(fuse=True)
+    # Copied from diffusers.models.unet_2d_condition.UNet2DConditionModel.unfuse_qkv_projections
+    def unfuse_qkv_projections(self):
+        """Disables the fused QKV projection if enabled.
+        <Tip warning={true}>
+        This API is 🧪 experimental.
+        </Tip>
+        """
+        if self.original_attn_processors is not None:
+            self.set_attn_processor(self.original_attn_processors)
+    def get_last_layer(self):
+        return self.decoder.conv_out.conv.weight
+    def get_latent_size(self, input_size: list[int]) -> list[int]:
+        latent_size = []
+        # T
+        latent_size.append((input_size[0] - 1) // self.time_compression_ratio + 1)
+        # H, W
+        for i in range(1, 3):
+            latent_size.append((input_size[i] - 1) // self.spatial_compression_ratio + 1)
+        return latent_size
+@MODELS.register_module("hunyuan_vae")
+def CausalVAE3D_HUNYUAN(
+    from_pretrained: str = None,
+    device_map: str | torch.device = "cuda",
+    torch_dtype: torch.dtype = torch.bfloat16,
+    **kwargs,
+) -> AutoencoderKLCausal3D:
+    config = AutoEncoder3DConfig(from_pretrained=from_pretrained, **kwargs)
+    with torch.device(device_map):
+        model = AutoencoderKLCausal3D(config).to(torch_dtype)
+    if from_pretrained:
+        model = load_checkpoint(model, from_pretrained, device_map=device_map, strict=True)
+    return model
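
The size and tiling arithmetic used by `get_latent_size`, the `*_tiled_*` methods, and the `blend_*` helpers can be checked without tensors. The following is a minimal pure-Python sketch, not part of the commit: the function names `latent_size`, `tile_starts`, and `blend_1d` are hypothetical, and the ratios 4 and 8, the tile size 256, and the overlap factor 0.25 are illustrative assumptions rather than values read from the config.

```python
from typing import List


def latent_size(input_size: List[int], time_ratio: int, spatial_ratio: int) -> List[int]:
    """Same arithmetic as get_latent_size: (n - 1) // ratio + 1 per axis, for (T, H, W)."""
    t, h, w = input_size
    return [
        (t - 1) // time_ratio + 1,
        (h - 1) // spatial_ratio + 1,
        (w - 1) // spatial_ratio + 1,
    ]


def tile_starts(size: int, tile_min: int, overlap_factor: float) -> List[int]:
    """Start offsets of overlapping tiles, as in `range(0, size, overlap_size)`."""
    stride = int(tile_min * (1 - overlap_factor))
    return list(range(0, size, stride))


def blend_1d(a: List[float], b: List[float], blend_extent: int) -> List[float]:
    """Linear cross-fade of the tail of `a` into the head of `b` along one axis."""
    blend_extent = min(len(a), len(b), blend_extent)
    out = list(b)  # copy here; the tensor helpers write into `b` in place
    for i in range(blend_extent):
        weight = i / blend_extent  # ramps from 0 towards 1 across the overlap
        out[i] = a[-blend_extent + i] * (1 - weight) + b[i] * weight
    return out


# Illustrative values only (assumed, not taken from the checkpoint config).
sizes = latent_size([33, 256, 256], time_ratio=4, spatial_ratio=8)
starts = tile_starts(512, tile_min=256, overlap_factor=0.25)
blended = blend_1d([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], blend_extent=2)
```

With these inputs the tiles start every 192 samples, so neighbouring tiles share 64 samples. In the blend, the first overlapping element comes entirely from the previous tile and later elements fade towards the current one.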

arbitor/encoders/opensora_vae_modules/registry.py ADDED

	@@ -0,0 +1,41 @@

+from copy import deepcopy
+import torch.nn as nn
+from mmengine.registry import Registry
+def build_module(module: dict | nn.Module, builder: Registry, **kwargs) -> nn.Module | None:
+    """Build module from config or return the module itself.
+    Args:
+        module (dict | nn.Module): The module to build.
+        builder (Registry): The registry to build module.
+        **kwargs: Extra entries merged into a copy of the config before building.
+    Returns:
+        (None | nn.Module): The built module, or None if `module` is None.
+    """
+    if module is None:
+        return None
+    if isinstance(module, dict):
+        cfg = deepcopy(module)
+        for k, v in kwargs.items():
+            cfg[k] = v
+        return builder.build(cfg)
+    elif isinstance(module, nn.Module):
+        return module
+    else:
+        raise TypeError(f"Only support dict and nn.Module, but got {type(module)}.")
+MODELS = Registry(
+    "model",
+    locations=["opensora.models"],
+)
+DATASETS = Registry(
+    "dataset",
+    locations=["opensora.datasets"],
+)

arbitor/encoders/opensora_vae_modules/unet_causal_3d_blocks.py ADDED

	@@ -0,0 +1,476 @@

+# Modified from diffusers==0.29.2 and HunyuanVideo
+#
+# Copyright 2024 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# #
+# Copyright 2024 HunyuanVideo
+#
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+from typing import Optional, Tuple, Union
+import numpy as np
+import torch
+import torch.nn.functional as F
+from diffusers.models.activations import get_activation
+from diffusers.models.attention_processor import Attention
+from diffusers.utils import logging
+from einops import rearrange
+from torch import nn
+from opensora.acceleration.checkpoint import auto_grad_checkpoint
+from opensora.models.vae.utils import ChannelChunkConv3d, get_conv3d_n_chunks
+logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
+INTERPOLATE_NUMEL_LIMIT = 2**31 - 1
+def chunk_nearest_interpolate(
+    x: torch.Tensor,
+    scale_factor,
+):
+    limit = INTERPOLATE_NUMEL_LIMIT // np.prod(scale_factor)
+    n_chunks = get_conv3d_n_chunks(x.numel(), x.size(1), limit)
+    x_chunks = x.chunk(n_chunks, dim=1)
+    x_chunks = [F.interpolate(x_chunk, scale_factor=scale_factor, mode="nearest") for x_chunk in x_chunks]
+    return torch.cat(x_chunks, dim=1)
+def prepare_causal_attention_mask(n_frame: int, n_hw: int, dtype, device, batch_size: int = None):
+    seq_len = n_frame * n_hw
+    mask = torch.full((seq_len, seq_len), float("-inf"), dtype=dtype, device=device)
+    for i in range(seq_len):
+        i_frame = i // n_hw
+        mask[i, : (i_frame + 1) * n_hw] = 0
+    if batch_size is not None:
+        mask = mask.unsqueeze(0).expand(batch_size, -1, -1)
+    return mask
+class CausalConv3d(nn.Module):
+    """
+    Implements a causal 3D convolution layer where each position depends only on the current and previous timesteps.
+    This maintains temporal causality in video generation tasks.
+    """
+    def __init__(
+        self,
+        chan_in,
+        chan_out,
+        kernel_size: Union[int, Tuple[int, int, int]],
+        stride: Union[int, Tuple[int, int, int]] = 1,
+        dilation: Union[int, Tuple[int, int, int]] = 1,
+        pad_mode="replicate",
+        **kwargs,
+    ):
+        super().__init__()
+        self.pad_mode = pad_mode
+        padding = (
+            kernel_size // 2,
+            kernel_size // 2,
+            kernel_size // 2,
+            kernel_size // 2,
+            kernel_size - 1,
+            0,
+        )  # W, H, T
+        self.time_causal_padding = padding
+        self.conv = ChannelChunkConv3d(chan_in, chan_out, kernel_size, stride=stride, dilation=dilation, **kwargs)
+    def forward(self, x):
+        x = F.pad(x, self.time_causal_padding, mode=self.pad_mode)
+        return self.conv(x)
+class UpsampleCausal3D(nn.Module):
+    """
+    A 3D upsampling layer followed by a causal convolution.
+    """
+    def __init__(
+        self,
+        channels: int,
+        out_channels: Optional[int] = None,
+        kernel_size: int = 3,
+        bias=True,
+        upsample_factor=(2, 2, 2),
+    ):
+        super().__init__()
+        self.channels = channels
+        self.out_channels = out_channels or channels
+        self.upsample_factor = upsample_factor
+        self.conv = CausalConv3d(self.channels, self.out_channels, kernel_size=kernel_size, bias=bias)
+    def forward(
+        self,
+        input_tensor: torch.FloatTensor,
+    ) -> torch.FloatTensor:
+        assert input_tensor.shape[1] == self.channels
+        #######################
+        # handle hidden states
+        #######################
+        hidden_states = input_tensor
+        # Cast to float32 because the 'upsample_nearest2d_out_frame' op does not support bfloat16
+        # dtype = hidden_states.dtype
+        # if dtype == torch.bfloat16:
+        #     hidden_states = hidden_states.to(torch.float32)
+        # upsample_nearest_nhwc fails with large batch sizes. see https://github.com/huggingface/diffusers/issues/984
+        if hidden_states.shape[0] >= 64:
+            hidden_states = hidden_states.contiguous()
+        # interpolate H & W only for the first frame; interpolate T & H & W for the rest
+        T = hidden_states.size(2)
+        first_h, other_h = hidden_states.split((1, T - 1), dim=2)
+        # process non-1st frames
+        if T > 1:
+            other_h = chunk_nearest_interpolate(other_h, scale_factor=self.upsample_factor)
+        # process 1st frame
+        first_h = first_h.squeeze(2)
+        first_h = chunk_nearest_interpolate(first_h, scale_factor=self.upsample_factor[1:])
+        first_h = first_h.unsqueeze(2)
+        # concat together
+        if T > 1:
+            hidden_states = torch.cat((first_h, other_h), dim=2)
+        else:
+            hidden_states = first_h
+        # If the input is bfloat16, we cast back to bfloat16
+        # if dtype == torch.bfloat16:
+        #     hidden_states = hidden_states.to(dtype)
+        hidden_states = self.conv(hidden_states)
+        return hidden_states
+class DownsampleCausal3D(nn.Module):
+    """
+    A 3D downsampling layer implemented as a strided causal convolution.
+    """
+    def __init__(
+        self,
+        channels: int,
+        kernel_size=3,
+        bias=True,
+        stride=2,
+    ):
+        super().__init__()
+        self.channels = channels
+        self.out_channels = channels
+        self.conv = CausalConv3d(self.channels, self.out_channels, kernel_size=kernel_size, stride=stride, bias=bias)
+    def forward(self, input_tensor: torch.FloatTensor) -> torch.FloatTensor:
+        assert input_tensor.shape[1] == self.channels
+        hidden_states = self.conv(input_tensor)
+        return hidden_states
+class ResnetBlockCausal3D(nn.Module):
+    r"""
+    A Resnet block.
+    """
+    def __init__(
+        self,
+        *,
+        in_channels: int,
+        out_channels: Optional[int] = None,
+        dropout: float = 0.0,
+        groups: int = 32,
+        groups_out: Optional[int] = None,
+        pre_norm: bool = True,
+        eps: float = 1e-6,
+        non_linearity: str = "swish",
+        output_scale_factor: float = 1.0,
+        use_in_shortcut: Optional[bool] = None,
+        conv_shortcut_bias: bool = True,
+        conv_3d_out_channels: Optional[int] = None,
+    ):
+        super().__init__()
+        self.pre_norm = pre_norm
+        self.pre_norm = True
+        self.in_channels = in_channels
+        out_channels = in_channels if out_channels is None else out_channels
+        self.out_channels = out_channels
+        self.output_scale_factor = output_scale_factor
+        if groups_out is None:
+            groups_out = groups
+        self.norm1 = torch.nn.GroupNorm(num_groups=groups, num_channels=in_channels, eps=eps, affine=True)
+        self.conv1 = CausalConv3d(in_channels, out_channels, kernel_size=3, stride=1)
+        self.norm2 = torch.nn.GroupNorm(num_groups=groups_out, num_channels=out_channels, eps=eps, affine=True)
+        self.dropout = torch.nn.Dropout(dropout)
+        conv_3d_out_channels = conv_3d_out_channels or out_channels
+        self.conv2 = CausalConv3d(out_channels, conv_3d_out_channels, kernel_size=3, stride=1)
+        self.nonlinearity = get_activation(non_linearity)
+        self.upsample = self.downsample = None
+        self.use_in_shortcut = self.in_channels != conv_3d_out_channels if use_in_shortcut is None else use_in_shortcut
+        self.conv_shortcut = None
+        if self.use_in_shortcut:
+            self.conv_shortcut = CausalConv3d(
+                in_channels,
+                conv_3d_out_channels,
+                kernel_size=1,
+                stride=1,
+                bias=conv_shortcut_bias,
+            )
+    def forward(
+        self,
+        input_tensor: torch.FloatTensor,
+    ) -> torch.FloatTensor:
+        hidden_states = input_tensor
+        hidden_states = self.norm1(hidden_states)
+        hidden_states = self.nonlinearity(hidden_states)
+        hidden_states = self.conv1(hidden_states)
+        hidden_states = self.norm2(hidden_states)
+        hidden_states = self.nonlinearity(hidden_states)
+        hidden_states = self.dropout(hidden_states)
+        hidden_states = self.conv2(hidden_states)
+        if self.conv_shortcut is not None:
+            input_tensor = self.conv_shortcut(input_tensor)
+        output_tensor = (input_tensor + hidden_states) / self.output_scale_factor
+        return output_tensor
+class UNetMidBlockCausal3D(nn.Module):
+    """
+    A 3D UNet mid-block [`UNetMidBlockCausal3D`] with multiple residual blocks and optional attention blocks.
+    """
+    def __init__(
+        self,
+        in_channels: int,
+        dropout: float = 0.0,
+        num_layers: int = 1,
+        resnet_eps: float = 1e-6,
+        resnet_act_fn: str = "swish",
+        resnet_groups: int = 32,
+        attn_groups: Optional[int] = None,
+        resnet_pre_norm: bool = True,
+        add_attention: bool = True,
+        attention_head_dim: int = 1,
+        output_scale_factor: float = 1.0,
+    ):
+        super().__init__()
+        resnet_groups = resnet_groups if resnet_groups is not None else min(in_channels // 4, 32)
+        self.add_attention = add_attention
+        if attn_groups is None:
+            attn_groups = resnet_groups
+        # there is always at least one resnet
+        resnets = [
+            ResnetBlockCausal3D(
+                in_channels=in_channels,
+                out_channels=in_channels,
+                eps=resnet_eps,
+                groups=resnet_groups,
+                dropout=dropout,
+                non_linearity=resnet_act_fn,
+                output_scale_factor=output_scale_factor,
+                pre_norm=resnet_pre_norm,
+            )
+        ]
+        attentions = []
+        if attention_head_dim is None:
+            logger.warning(
+                f"It is not recommended to pass `attention_head_dim=None`. Defaulting `attention_head_dim` to `in_channels`: {in_channels}."
+            )
+            attention_head_dim = in_channels
+        for _ in range(num_layers):
+            if self.add_attention:
+                attentions.append(
+                    Attention(
+                        in_channels,
+                        heads=in_channels // attention_head_dim,
+                        dim_head=attention_head_dim,
+                        rescale_output_factor=output_scale_factor,
+                        eps=resnet_eps,
+                        norm_num_groups=attn_groups,
+                        spatial_norm_dim=None,
+                        residual_connection=True,
+                        bias=True,
+                        upcast_softmax=True,
+                        _from_deprecated_attn_block=True,
+                    )
+                )
+            else:
+                attentions.append(None)
+            resnets.append(
+                ResnetBlockCausal3D(
+                    in_channels=in_channels,
+                    out_channels=in_channels,
+                    eps=resnet_eps,
+                    groups=resnet_groups,
+                    dropout=dropout,
+                    non_linearity=resnet_act_fn,
+                    output_scale_factor=output_scale_factor,
+                    pre_norm=resnet_pre_norm,
+                )
+            )
+        self.attentions = nn.ModuleList(attentions)
+        self.resnets = nn.ModuleList(resnets)
+    def forward(self, hidden_states: torch.FloatTensor, attention_mask: Optional[torch.Tensor]) -> torch.FloatTensor:
+        hidden_states = self.resnets[0](hidden_states)
+        for attn, resnet in zip(self.attentions, self.resnets[1:]):
+            if attn is not None:
+                B, C, T, H, W = hidden_states.shape
+                hidden_states = rearrange(hidden_states, "b c f h w -> b (f h w) c")
+                hidden_states = attn(hidden_states, attention_mask=attention_mask)
+                hidden_states = rearrange(hidden_states, "b (f h w) c -> b c f h w", f=T, h=H, w=W)
+            hidden_states = resnet(hidden_states)
+        return hidden_states
+class DownEncoderBlockCausal3D(nn.Module):
+    def __init__(
+        self,
+        in_channels: int,
+        out_channels: int,
+        dropout: float = 0.0,
+        num_layers: int = 1,
+        resnet_eps: float = 1e-6,
+        resnet_act_fn: str = "swish",
+        resnet_groups: int = 32,
+        resnet_pre_norm: bool = True,
+        output_scale_factor: float = 1.0,
+        add_downsample: bool = True,
+        downsample_stride: int = 2,
+    ):
+        super().__init__()
+        resnets = []
+        for i in range(num_layers):
+            in_channels = in_channels if i == 0 else out_channels
+            resnets.append(
+                ResnetBlockCausal3D(
+                    in_channels=in_channels,
+                    out_channels=out_channels,
+                    eps=resnet_eps,
+                    groups=resnet_groups,
+                    dropout=dropout,
+                    non_linearity=resnet_act_fn,
+                    output_scale_factor=output_scale_factor,
+                    pre_norm=resnet_pre_norm,
+                )
+            )
+        self.resnets = nn.ModuleList(resnets)
+        if add_downsample:
+            self.downsamplers = nn.ModuleList(
+                [
+                    DownsampleCausal3D(
+                        out_channels,
+                        stride=downsample_stride,
+                    )
+                ]
+            )
+        else:
+            self.downsamplers = None
+    def forward(self, hidden_states: torch.FloatTensor) -> torch.FloatTensor:
+        for resnet in self.resnets:
+            hidden_states = auto_grad_checkpoint(resnet, hidden_states)
+        if self.downsamplers is not None:
+            for downsampler in self.downsamplers:
+                hidden_states = auto_grad_checkpoint(downsampler, hidden_states)
+        return hidden_states
+class UpDecoderBlockCausal3D(nn.Module):
+    def __init__(
+        self,
+        in_channels: int,
+        out_channels: int,
+        resolution_idx: Optional[int] = None,
+        dropout: float = 0.0,
+        num_layers: int = 1,
+        resnet_eps: float = 1e-6,
+        resnet_act_fn: str = "swish",
+        resnet_groups: int = 32,
+        resnet_pre_norm: bool = True,
+        output_scale_factor: float = 1.0,
+        add_upsample: bool = True,
+        upsample_scale_factor=(2, 2, 2),
+    ):
+        super().__init__()
+        resnets = []
+        for i in range(num_layers):
+            input_channels = in_channels if i == 0 else out_channels
+            resnets.append(
+                ResnetBlockCausal3D(
+                    in_channels=input_channels,
+                    out_channels=out_channels,
+                    eps=resnet_eps,
+                    groups=resnet_groups,
+                    dropout=dropout,
+                    non_linearity=resnet_act_fn,
+                    output_scale_factor=output_scale_factor,
+                    pre_norm=resnet_pre_norm,
+                )
+            )
+        self.resnets = nn.ModuleList(resnets)
+        if add_upsample:
+            self.upsamplers = nn.ModuleList(
+                [
+                    UpsampleCausal3D(
+                        out_channels,
+                        out_channels=out_channels,
+                        upsample_factor=upsample_scale_factor,
+                    )
+                ]
+            )
+        else:
+            self.upsamplers = None
+        self.resolution_idx = resolution_idx
+    def forward(self, hidden_states: torch.FloatTensor) -> torch.FloatTensor:
+        for resnet in self.resnets:
+            hidden_states = auto_grad_checkpoint(resnet, hidden_states)
+        if self.upsamplers is not None:
+            for upsampler in self.upsamplers:
+                hidden_states = auto_grad_checkpoint(upsampler, hidden_states)
+        return hidden_states
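
The `downsample_stride` that `DownEncoderBlockCausal3D` receives is a `(T, H, W)` tuple that `EncoderCausal3D` (in `vae.py`) derives from the two compression ratios. Below is a minimal plain-Python sketch of that schedule. The helper names `encoder_stride_schedule` and `total_compression` are hypothetical, and the four-block configuration is an assumption (the constructor default is a single block).

```python
import math


def encoder_stride_schedule(num_blocks, time_compression_ratio=4, spatial_compression_ratio=8):
    """Return the (T, H, W) downsample stride selected for each encoder block.

    Mirrors the branch logic in EncoderCausal3D.__init__; num_blocks stands in
    for len(block_out_channels).
    """
    n_spatial = int(math.log2(spatial_compression_ratio))
    n_time = int(math.log2(time_compression_ratio))
    strides = []
    for i in range(num_blocks):
        is_final = i == num_blocks - 1
        if time_compression_ratio == 4:
            spatial = i < n_spatial
            # Temporal downsampling is pushed toward the later blocks.
            temporal = i >= (num_blocks - 1 - n_time) and not is_final
        elif time_compression_ratio == 8:
            spatial = i < n_spatial
            temporal = i < n_spatial
        else:
            raise ValueError(f"Unsupported time_compression_ratio: {time_compression_ratio}.")
        t = 2 if temporal else 1
        hw = 2 if spatial else 1
        strides.append((t, hw, hw))
    return strides


def total_compression(strides):
    """Multiply the per-block strides along each axis."""
    total_t = total_h = total_w = 1
    for t, h, w in strides:
        total_t, total_h, total_w = total_t * t, total_h * h, total_w * w
    return total_t, total_h, total_w


if __name__ == "__main__":
    schedule = encoder_stride_schedule(4)
    print(schedule)
    print(total_compression(schedule))
```

With four blocks and the default ratios, the first block downsamples only spatially, the next two downsample all three axes, and the final block does not downsample, for an overall 4x temporal and 8x spatial reduction.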

arbitor/encoders/opensora_vae_modules/vae.py ADDED

	@@ -0,0 +1,340 @@

+# Modified from HunyuanVideo
+#
+# Copyright 2024 HunyuanVideo
+#
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+from dataclasses import dataclass
+from typing import Optional, Tuple
+import numpy as np
+import torch
+import torch.nn as nn
+from diffusers.utils import BaseOutput
+from diffusers.utils.torch_utils import randn_tensor
+from opensora.acceleration.checkpoint import auto_grad_checkpoint, checkpoint
+from opensora.models.hunyuan_vae.unet_causal_3d_blocks import (
+    CausalConv3d,
+    DownEncoderBlockCausal3D,
+    UNetMidBlockCausal3D,
+    UpDecoderBlockCausal3D,
+    prepare_causal_attention_mask,
+)
+@dataclass
+class DecoderOutput(BaseOutput):
+    r"""
+    Output of decoding method.
+    Args:
+        sample (`torch.FloatTensor` of shape `(batch_size, num_channels, num_frames, height, width)`):
+            The decoded output sample from the last layer of the model.
+    """
+    sample: torch.FloatTensor
+class EncoderCausal3D(nn.Module):
+    r"""
+    The `EncoderCausal3D` layer of a variational autoencoder that encodes its input into a latent representation.
+    """
+    def __init__(
+        self,
+        in_channels: int = 3,
+        out_channels: int = 3,
+        block_out_channels: Tuple[int, ...] = (64,),
+        layers_per_block: int = 2,
+        norm_num_groups: int = 32,
+        act_fn: str = "silu",
+        double_z: bool = True,
+        mid_block_add_attention=True,
+        time_compression_ratio: int = 4,
+        spatial_compression_ratio: int = 8,
+        dropout: float = 0.0,
+    ):
+        super().__init__()
+        self.layers_per_block = layers_per_block
+        self.conv_in = CausalConv3d(in_channels, block_out_channels[0], kernel_size=3, stride=1)
+        self.mid_block = None
+        self.down_blocks = nn.ModuleList([])
+        # down
+        output_channel = block_out_channels[0]
+        for i, _ in enumerate(block_out_channels):
+            input_channel = output_channel
+            output_channel = block_out_channels[i]
+            is_final_block = i == len(block_out_channels) - 1
+            num_spatial_downsample_layers = int(np.log2(spatial_compression_ratio))
+            num_time_downsample_layers = int(np.log2(time_compression_ratio))
+            if time_compression_ratio == 4:
+                add_spatial_downsample = bool(i < num_spatial_downsample_layers)
+                add_time_downsample = bool(
+                    i >= (len(block_out_channels) - 1 - num_time_downsample_layers) and not is_final_block
+                )
+            elif time_compression_ratio == 8:
+                add_spatial_downsample = bool(i < num_spatial_downsample_layers)
+                add_time_downsample = bool(i < num_spatial_downsample_layers)
+            else:
+                raise ValueError(f"Unsupported time_compression_ratio: {time_compression_ratio}.")
+            downsample_stride_HW = (2, 2) if add_spatial_downsample else (1, 1)
+            downsample_stride_T = (2,) if add_time_downsample else (1,)
+            downsample_stride = tuple(downsample_stride_T + downsample_stride_HW)
+            down_block = DownEncoderBlockCausal3D(
+                num_layers=self.layers_per_block,
+                in_channels=input_channel,
+                out_channels=output_channel,
+                dropout=dropout,
+                add_downsample=bool(add_spatial_downsample or add_time_downsample),
+                downsample_stride=downsample_stride,
+                resnet_eps=1e-6,
+                resnet_act_fn=act_fn,
+                resnet_groups=norm_num_groups,
+            )
+            self.down_blocks.append(down_block)
+        # mid
+        self.mid_block = UNetMidBlockCausal3D(
+            in_channels=block_out_channels[-1],
+            resnet_eps=1e-6,
+            resnet_act_fn=act_fn,
+            output_scale_factor=1,
+            attention_head_dim=block_out_channels[-1],
+            resnet_groups=norm_num_groups,
+            add_attention=mid_block_add_attention,
+        )
+        # out
+        self.conv_norm_out = nn.GroupNorm(num_channels=block_out_channels[-1], num_groups=norm_num_groups, eps=1e-6)
+        self.conv_act = nn.SiLU()
+        conv_out_channels = 2 * out_channels if double_z else out_channels
+        self.conv_out = CausalConv3d(block_out_channels[-1], conv_out_channels, kernel_size=3)
+    def prepare_attention_mask(self, hidden_states: torch.Tensor) -> torch.Tensor:
+        B, C, T, H, W = hidden_states.shape
+        attention_mask = prepare_causal_attention_mask(
+            T, H * W, hidden_states.dtype, hidden_states.device, batch_size=B
+        )
+        return attention_mask
+    def forward(self, sample: torch.FloatTensor) -> torch.FloatTensor:
+        r"""The forward method of the `EncoderCausal3D` class."""
+        assert len(sample.shape) == 5, "The input tensor should have 5 dimensions"
+        sample = self.conv_in(sample)
+        # down
+        for down_block in self.down_blocks:
+            sample = down_block(sample)
+        # middle
+        if self.mid_block.add_attention:
+            attention_mask = self.prepare_attention_mask(sample)
+        else:
+            attention_mask = None
+        sample = auto_grad_checkpoint(self.mid_block, sample, attention_mask)
+        # post-process
+        sample = self.conv_norm_out(sample)
+        sample = self.conv_act(sample)
+        sample = self.conv_out(sample)
+        return sample
+class DecoderCausal3D(nn.Module):
+    r"""
+    The `DecoderCausal3D` layer of a variational autoencoder that decodes its latent representation into an output sample.
+    """
+    def __init__(
+        self,
+        in_channels: int = 3,
+        out_channels: int = 3,
+        block_out_channels: Tuple[int, ...] = (64,),
+        layers_per_block: int = 2,
+        norm_num_groups: int = 32,
+        act_fn: str = "silu",
+        mid_block_add_attention=True,
+        time_compression_ratio: int = 4,
+        spatial_compression_ratio: int = 8,
+        dropout: float = 0.0,
+    ):
+        super().__init__()
+        self.layers_per_block = layers_per_block
+        self.conv_in = CausalConv3d(in_channels, block_out_channels[-1], kernel_size=3, stride=1)
+        self.mid_block = None
+        self.up_blocks = nn.ModuleList([])
+        # mid
+        self.mid_block = UNetMidBlockCausal3D(
+            in_channels=block_out_channels[-1],
+            resnet_eps=1e-6,
+            resnet_act_fn=act_fn,
+            output_scale_factor=1,
+            attention_head_dim=block_out_channels[-1],
+            resnet_groups=norm_num_groups,
+            add_attention=mid_block_add_attention,
+        )
+        # up
+        reversed_block_out_channels = list(reversed(block_out_channels))
+        output_channel = reversed_block_out_channels[0]
+        for i, _ in enumerate(block_out_channels):
+            prev_output_channel = output_channel
+            output_channel = reversed_block_out_channels[i]
+            is_final_block = i == len(block_out_channels) - 1
+            num_spatial_upsample_layers = int(np.log2(spatial_compression_ratio))
+            num_time_upsample_layers = int(np.log2(time_compression_ratio))
+            if time_compression_ratio == 4:
+                add_spatial_upsample = bool(i < num_spatial_upsample_layers)
+                add_time_upsample = bool(
+                    i >= len(block_out_channels) - 1 - num_time_upsample_layers and not is_final_block
+                )
+            elif time_compression_ratio == 8:
+                add_spatial_upsample = bool(i < num_spatial_upsample_layers)
+                add_time_upsample = bool(i < num_spatial_upsample_layers)
+            else:
+                raise ValueError(f"Unsupported time_compression_ratio: {time_compression_ratio}.")
+            upsample_scale_factor_HW = (2, 2) if add_spatial_upsample else (1, 1)
+            upsample_scale_factor_T = (2,) if add_time_upsample else (1,)
+            upsample_scale_factor = tuple(upsample_scale_factor_T + upsample_scale_factor_HW)
+            up_block = UpDecoderBlockCausal3D(
+                num_layers=self.layers_per_block + 1,
+                in_channels=prev_output_channel,
+                out_channels=output_channel,
+                resolution_idx=None,
+                dropout=dropout,
+                add_upsample=bool(add_spatial_upsample or add_time_upsample),
+                upsample_scale_factor=upsample_scale_factor,
+                resnet_eps=1e-6,
+                resnet_act_fn=act_fn,
+                resnet_groups=norm_num_groups,
+            )
+            self.up_blocks.append(up_block)
+            prev_output_channel = output_channel
+        self.conv_norm_out = nn.GroupNorm(num_channels=block_out_channels[0], num_groups=norm_num_groups, eps=1e-6)
+        self.conv_act = nn.SiLU()
+        self.conv_out = CausalConv3d(block_out_channels[0], out_channels, kernel_size=3)
+    def post_process(self, sample: torch.Tensor) -> torch.Tensor:
+        sample = self.conv_norm_out(sample)
+        sample = self.conv_act(sample)
+        return sample
+    def prepare_attention_mask(self, hidden_states: torch.Tensor) -> torch.Tensor:
+        B, C, T, H, W = hidden_states.shape
+        attention_mask = prepare_causal_attention_mask(
+            T, H * W, hidden_states.dtype, hidden_states.device, batch_size=B
+        )
+        return attention_mask
+    def forward(
+        self,
+        sample: torch.FloatTensor,
+    ) -> torch.FloatTensor:
+        r"""The forward method of the `DecoderCausal3D` class."""
+        assert len(sample.shape) == 5, "The input tensor should have 5 dimensions."
+        sample = self.conv_in(sample)
+        upscale_dtype = next(iter(self.up_blocks.parameters())).dtype
+        # middle
+        if self.mid_block.add_attention:
+            attention_mask = self.prepare_attention_mask(sample)
+        else:
+            attention_mask = None
+        sample = auto_grad_checkpoint(self.mid_block, sample, attention_mask)
+        sample = sample.to(upscale_dtype)
+        # up
+        for up_block in self.up_blocks:
+            sample = up_block(sample)
+        # post-process
+        if getattr(self, "grad_checkpointing", False):
+            sample = checkpoint(self.post_process, sample, use_reentrant=True)
+        else:
+            sample = self.post_process(sample)
+        sample = self.conv_out(sample)
+        return sample
+class DiagonalGaussianDistribution(object):
+    def __init__(self, parameters: torch.Tensor, deterministic: bool = False):
+        if parameters.ndim == 3:
+            dim = 2  # (B, L, C)
+        elif parameters.ndim == 5 or parameters.ndim == 4:
+            dim = 1  # (B, C, T, H ,W) / (B, C, H, W)
+        else:
+            raise NotImplementedError
+        self.parameters = parameters
+        self.mean, self.logvar = torch.chunk(parameters, 2, dim=dim)
+        self.logvar = torch.clamp(self.logvar, -30.0, 20.0)
+        self.deterministic = deterministic
+        self.std = torch.exp(0.5 * self.logvar)
+        self.var = torch.exp(self.logvar)
+        if self.deterministic:
+            self.var = self.std = torch.zeros_like(
+                self.mean, device=self.parameters.device, dtype=self.parameters.dtype
+            )
+    def sample(self, generator: Optional[torch.Generator] = None) -> torch.FloatTensor:
+        # make sure sample is on the same device as the parameters and has same dtype
+        sample = randn_tensor(
+            self.mean.shape,
+            generator=generator,
+            device=self.parameters.device,
+            dtype=self.parameters.dtype,
+        )
+        x = self.mean + self.std * sample
+        return x
+    def kl(self, other: Optional["DiagonalGaussianDistribution"] = None) -> torch.Tensor:
+        if self.deterministic:
+            return torch.Tensor([0.0])
+        else:
+            reduce_dim = list(range(1, self.mean.ndim))
+            if other is None:
+                return 0.5 * torch.sum(
+                    torch.pow(self.mean, 2) + self.var - 1.0 - self.logvar,
+                    dim=reduce_dim,
+                )
+            else:
+                return 0.5 * torch.sum(
+                    torch.pow(self.mean - other.mean, 2) / other.var
+                    + self.var / other.var
+                    - 1.0
+                    - self.logvar
+                    + other.logvar,
+                    dim=reduce_dim,
+                )
+    def nll(self, sample: torch.Tensor, dims: Tuple[int, ...] = (1, 2, 3)) -> torch.Tensor:
+        if self.deterministic:
+            return torch.Tensor([0.0])
+        logtwopi = np.log(2.0 * np.pi)
+        return 0.5 * torch.sum(
+            logtwopi + self.logvar + torch.pow(sample - self.mean, 2) / self.var,
+            dim=dims,
+        )
+    def mode(self) -> torch.Tensor:
+        return self.mean
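
`DiagonalGaussianDistribution` splits the encoder output into a mean and a clamped log-variance, samples with the reparameterization `mean + std * noise`, and computes a closed-form KL term. The sketch below restates those three steps on plain Python lists so the arithmetic can be checked by hand. It is not the tensor implementation above, and the function names are hypothetical.

```python
import math
import random

# Same clamp range as the class above.
LOGVAR_MIN, LOGVAR_MAX = -30.0, 20.0


def split_moments(parameters):
    """Split a flat [mean..., logvar...] list in half and clamp the log-variance."""
    half = len(parameters) // 2
    mean = list(parameters[:half])
    logvar = [min(max(v, LOGVAR_MIN), LOGVAR_MAX) for v in parameters[half:]]
    return mean, logvar


def sample(mean, logvar, rng):
    """Reparameterized draw: mean + exp(0.5 * logvar) * standard normal noise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mean, logvar)]


def kl_to_standard_normal(mean, logvar):
    """KL(N(mean, exp(logvar)) || N(0, 1)), summed over elements."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mean, logvar))


if __name__ == "__main__":
    mean, logvar = split_moments([0.5, -0.5, 0.0, 0.0])
    print(sample(mean, logvar, random.Random(0)))
    print(kl_to_standard_normal(mean, logvar))
```

A unit-variance Gaussian shifted by 1 in a single dimension gives a KL of 0.5, which matches the `other is None` branch of `kl()` term by term.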

arbitor/encoders/pig_vae.py ADDED

	@@ -0,0 +1,148 @@

+"""pig-vae (WanVAE) sidecar module.
+Loads from local safetensors or GGUF, falling back to the diffusers AutoencoderKLWan weights on the Hugging Face Hub.
+Exposes encode() and decode() for the VideoHead training pipeline.
+Latent shape: [B, 16, T/4, H/8, W/8] for input video of T frames at HxW.
+"""
+import os, torch
+import torch.nn as nn
+_LOCAL_VAE_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "models", "pig-vae")
+_VAE_CONFIG = {
+    "base_dim": 96, "z_dim": 16, "dim_mult": [1, 2, 4, 4],
+    "num_res_blocks": 2, "dropout": 0.0,
+    "temperal_downsample": [False, True, True],
+    "in_channels": 3, "out_channels": 3,
+    "scale_factor_temporal": 4, "scale_factor_spatial": 8,
+}
+def _freeze_sidecar(model, quantize_requested=None, quantized=False):
+    model._arb_quantize_requested = quantize_requested
+    model._arb_quantized_int8 = bool(quantized and quantize_requested == "int8")
+    model._arb_quantized = bool(quantized)
+    for p in model.parameters():
+        p.requires_grad = False
+    return model
+def _has_quantized_modules(model):
+    markers = ("Q", "Quanto", "Quantized", "WeightQ")
+    return any(any(marker in type(module).__name__ for marker in markers) for module in model.modules())
+def _quantize_int8_if_requested(model, quantize):
+    if quantize == 'int8':
+        from optimum.quanto import quantize as quanto_quantize, freeze, qint8
+        quanto_quantize(model, weights=qint8)
+        freeze(model)
+        return _freeze_sidecar(model, quantize_requested=quantize, quantized=_has_quantized_modules(model))
+    return _freeze_sidecar(model, quantize_requested=quantize, quantized=False)
+def _wan_vae_cls():
+    try:
+        from diffusers import AutoencoderKLWan
+    except ModuleNotFoundError as exc:
+        raise RuntimeError(
+            "pig-vae requires the optional diffusers dependency. "
+            "Install the project with `pip install -e .[diffusers]` in a venv "
+            "before loading or verifying pig-vae int8 quantization."
+        ) from exc
+    return AutoencoderKLWan
+def load_vae(device='cuda', quantize='int8'):
+    """Load pig-vae from local cache or diffusers. Optionally int8 quantize."""
+    safetensors_path = os.path.join(_LOCAL_VAE_DIR, "model.safetensors")
+    gguf_path = os.path.join(_LOCAL_VAE_DIR, "pig_wan_vae_fp32-f16.gguf")
+    if os.path.isfile(safetensors_path):
+        return _load_local(safetensors_path, device, quantize, is_safetensors=True)
+    if os.path.isfile(gguf_path):
+        return _load_gguf(gguf_path, device, quantize)
+    return _load_from_hf(device, quantize)
+def _build_vae():
+    AutoencoderKLWan = _wan_vae_cls()
+    return AutoencoderKLWan(
+        **_VAE_CONFIG,
+        latents_mean=[-0.7571, -0.7089, -0.9113, 0.1075, -0.1745, 0.9653,
+                      -0.1517, 1.5508, 0.4134, -0.0715, 0.5517, -0.3632,
+                      -0.1922, -0.9497, 0.2503, -0.2921],
+        latents_std=[2.8184, 1.4541, 2.3275, 2.6558, 1.2196, 1.7708,
+                     2.6052, 2.0743, 3.2687, 2.1526, 2.8652, 1.5579,
+                     1.6382, 1.1253, 2.8251, 1.916],
+    )
+def _load_local(path, device, quantize, is_safetensors=False):
+    if is_safetensors:
+        AutoencoderKLWan = _wan_vae_cls()
+        model = AutoencoderKLWan.from_single_file(path)
+    else:
+        model = _build_vae()
+        ckpt = torch.load(path, map_location="cpu", weights_only=True)
+        missing, unexpected = model.load_state_dict(ckpt, strict=False)
+        if missing or unexpected:
+            raise RuntimeError(
+                "pig-vae local .pth checkpoint does not match AutoencoderKLWan "
+                f"(missing={len(missing)}, unexpected={len(unexpected)})."
+            )
+    model = model.to(device)
+    model.eval()
+    model = _quantize_int8_if_requested(model, quantize)
+    return VAEWrapper(model)
+def _load_gguf(path, device, quantize):
+    import gguf
+    reader = gguf.GGUFReader(path)
+    state_dict = {t.name: torch.tensor(t.data) for t in reader.tensors}
+    model = _build_vae()
+    missing, unexpected = model.load_state_dict(state_dict, strict=False)
+    if missing or unexpected:
+        raise RuntimeError(
+            "pig-vae local GGUF checkpoint does not match AutoencoderKLWan "
+            f"(missing={len(missing)}, unexpected={len(unexpected)})."
+        )
+    model = model.to(device)
+    model.eval()
+    model = _quantize_int8_if_requested(model, quantize)
+    return VAEWrapper(model)
+def _load_from_hf(device, quantize):
+    AutoencoderKLWan = _wan_vae_cls()
+    model = AutoencoderKLWan.from_pretrained(
+        "Wan-AI/Wan2.1-T2V-1.3B", subfolder="vae",
+        torch_dtype=torch.bfloat16,
+    )
+    model = model.to(device)
+    model.eval()
+    model = _quantize_int8_if_requested(model, quantize)
+    return VAEWrapper(model)
+class VAEWrapper(nn.Module):
+    def __init__(self, vae):
+        super().__init__()
+        self.vae = vae
+        self.latent_channels = _VAE_CONFIG["z_dim"]
+        self.scale_factor = 0.476986
+    def encode(self, video_tensor):
+        with torch.no_grad():
+            dist = self.vae.encode(video_tensor)
+            latents = dist.latent_dist.sample() if hasattr(dist, 'latent_dist') else dist
+            latents = latents * self.scale_factor
+        return latents
+    def decode(self, latents):
+        with torch.no_grad():
+            latents = latents / self.scale_factor
+            video = self.vae.decode(latents)
+            video = video.sample if hasattr(video, 'sample') else video
+        return video
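
`load_vae` resolves its weights in a fixed order: the local `model.safetensors`, then the local GGUF file, then the Hugging Face Hub. The sketch below isolates that resolution order so it can be exercised without loading a model. `resolve_vae_source` and `demo` are hypothetical names; the file names and repo id are copied from the module above.

```python
import os
import tempfile

SAFETENSORS_NAME = "model.safetensors"
GGUF_NAME = "pig_wan_vae_fp32-f16.gguf"
HUB_REPO = "Wan-AI/Wan2.1-T2V-1.3B"


def resolve_vae_source(vae_dir):
    """Return (kind, location) using the same precedence as load_vae."""
    safetensors_path = os.path.join(vae_dir, SAFETENSORS_NAME)
    if os.path.isfile(safetensors_path):
        return "safetensors", safetensors_path
    gguf_path = os.path.join(vae_dir, GGUF_NAME)
    if os.path.isfile(gguf_path):
        return "gguf", gguf_path
    # Nothing cached locally: fall back to the hub repository.
    return "hub", HUB_REPO


def demo():
    """Walk through the three cases inside a throwaway directory."""
    seen = []
    with tempfile.TemporaryDirectory() as vae_dir:
        seen.append(resolve_vae_source(vae_dir)[0])
        open(os.path.join(vae_dir, GGUF_NAME), "wb").close()
        seen.append(resolve_vae_source(vae_dir)[0])
        open(os.path.join(vae_dir, SAFETENSORS_NAME), "wb").close()
        seen.append(resolve_vae_source(vae_dir)[0])
    return seen


if __name__ == "__main__":
    print(demo())
```

Because safetensors is checked first, a GGUF file in the same directory is ignored once `model.safetensors` exists.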

arbitor/encoders/vae2d.py ADDED

	@@ -0,0 +1,56 @@

+"""2D VAE encoder — wraps PixArt SDXL AutoencoderKL encoder half.
+Encodes images or mel spectrograms to [B, 4, H/8, W/8] latents.
+Same encoder used for images AND audio spectrograms (via MelSpectrogram3Band).
+Frozen float32 sidecar (no gradients).
+"""
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+def load_vae2d(device="cuda", quantize=None):
+    from diffusers import AutoencoderKL
+    vae = AutoencoderKL.from_pretrained(
+        "PixArt-alpha/pixart_sigma_sdxlvae_T5_diffusers",
+        subfolder="vae",
+        torch_dtype=torch.float32,
+    ).to(device)
+    vae.eval()
+    for p in vae.parameters():
+        p.requires_grad = False
+    return VAE2DEncoder(vae)
+class VAE2DEncoder(nn.Module):
+    def __init__(self, vae):
+        super().__init__()
+        self.encoder = vae.encoder
+        self.quant_conv = vae.quant_conv
+        self.latent_channels = 4
+        self.input_scale = 0.18215
+    def forward(self, x):
+        H, W = x.shape[-2], x.shape[-1]
+        pad_h = (8 - H % 8) % 8
+        pad_w = (8 - W % 8) % 8
+        if pad_h > 0 or pad_w > 0:
+            x = F.pad(x, (0, pad_w, 0, pad_h))
+        h = self.encoder(x)
+        moments = self.quant_conv(h)
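+        # The second half of `moments` is a log-variance (diffusers convention), so std = exp(0.5 * logvar)
+        logvar = torch.clamp(moments[:, self.latent_channels:], -30.0, 20.0)
+        posterior = torch.distributions.Normal(
+            moments[:, :self.latent_channels],
+            torch.exp(0.5 * logvar),
+        )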
+        latent = posterior.rsample()
+        latent = latent * self.input_scale
+        if pad_h > 0 or pad_w > 0:
+            out_h = H // 8 if H >= 8 else 1
+            out_w = W // 8 if W >= 8 else 1
+            latent = latent[:, :, :out_h, :out_w]
+        return latent
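
`VAE2DEncoder.forward` pads each spatial dimension up to a multiple of 8 and then crops the latent back to `size // 8` (with a floor of 1). The sketch below works through that arithmetic for one dimension. It assumes the encoder downsamples by exactly 8, as the module docstring states, and the helper names are hypothetical.

```python
def pad_amount(size, multiple=8):
    """Pixels added on the trailing edge so that size becomes a multiple of `multiple`."""
    return (multiple - size % multiple) % multiple


def latent_extent(size, factor=8):
    """Latent length along one axis after the pad-encode-crop sequence."""
    pad = pad_amount(size, factor)
    padded_latent = (size + pad) // factor
    if pad == 0:
        # No padding on this axis, so cropping leaves the length unchanged.
        return padded_latent
    # Padded inputs are cropped back to the floor, but never below one cell.
    return size // factor if size >= factor else 1


if __name__ == "__main__":
    for size in (64, 100, 5):
        print(size, pad_amount(size), latent_extent(size))
```

A 100-pixel axis is padded to 104, encoded to 13 latent cells, and cropped to 12, so the last partial block of pixels does not get its own latent cell.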

arbitor/kernel/flash_vq.py ADDED

	@@ -0,0 +1,510 @@

+"""
+FlashVQ: Custom Vector Quantization with dual Triton GPU + PyTorch CPU path.
+Replaces vector_quantize_pytorch entirely (D-100). FlashVQCodebook is a standalone
+nn.Module implementing all VQ operations:
+  - Cosine similarity codebook lookup
+  - EMA codebook update
+  - Dead code reset
+  - Rotation trick (gradient through quantization)
+  - Commitment loss
+Dispatch pattern (following tscale.py):
+  if x.is_cuda and _HAS_TRITON → _TritonFlashVQFn.apply()
+  else → self._cpu_forward()
+"""
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+_HAS_TRITON = False
+try:
+    import triton
+    import triton.language as tl
+    _HAS_TRITON = True
+except ImportError:
+    pass
+class _RotationTrickFn(torch.autograd.Function):
+    """
+    Rotation trick gradient through vector quantization.
+    Instead of straight-through estimator (STE), rotate the encoder output
+    gradient toward the quantized vector direction. This helps the encoder
+    learn to produce outputs that align with codebook entries.
+    """
+    @staticmethod
+    def forward(ctx, x, quantized):
+        ctx.save_for_backward(x.detach(), quantized.detach())
+        return quantized
+    @staticmethod
+    def backward(ctx, grad_output):
+        x, quantized = ctx.saved_tensors
+        # Normalize in fp32 for numerical stability
+        x_norm = F.normalize(x.float(), dim=-1)
+        q_norm = F.normalize(quantized.float(), dim=-1)
+        # Gradient deflection: subtract projection onto (x_norm - q_norm)
+        # This rotates the gradient toward the quantized direction
+        diff = x_norm - q_norm
+        proj = (grad_output.float() * x_norm).sum(dim=-1, keepdim=True)
+        grad_x = grad_output.float() - proj * diff
+        return grad_x.to(grad_output.dtype), None
+class FlashVQCodebook(nn.Module):
+    """
+    Vector quantization codebook with dual GPU (Triton) / CPU (PyTorch) paths.
+    Interface matches vector_quantize_pytorch.VectorQuantize:
+      forward(x) → (quantized, indices, commitment_loss)
+    All VQ operations are self-contained:
+      - Cosine similarity codebook lookup
+      - Straight-through estimator (STE) with optional rotation trick
+      - EMA codebook update (decay=0.99)
+      - Dead code reset (threshold_ema_dead_code=2)
+      - Commitment loss
+    """
+    def __init__(
+        self,
+        codebook_size: int = 8192,
+        codebook_dim: int = 32,
+        decay: float = 0.99,
+        commitment_weight: float = 1.0,
+        threshold_ema_dead_code: int = 2,
+        kmeans_init: bool = True,
+        kmeans_iters: int = 10,
+        rotation_trick: bool = True,
+    ):
+        super().__init__()
+        self.codebook_size = codebook_size
+        self.codebook_dim = codebook_dim
+        self.decay = decay
+        self.commitment_weight = commitment_weight
+        self.threshold_ema_dead_code = threshold_ema_dead_code
+        self.kmeans_init = kmeans_init
+        self.kmeans_iters = kmeans_iters
+        self.rotation_trick = rotation_trick
+        # Codebook buffers
+        self.register_buffer('embed', torch.randn(codebook_size, codebook_dim) * 0.02)
+        self.register_buffer('cluster_size', torch.zeros(codebook_size))
+        self.register_buffer('embed_avg', torch.zeros(codebook_size, codebook_dim))
+        # Tile sizes for Triton kernel (set on first GPU forward)
+        self._triton_block_bt = 16
+        self._triton_tile_k = 1024
+    def _compute_tile_sizes(self):
+        """
+        Dynamic tile sizing per D-102.
+        Queries GPU device properties to determine SRAM budget, then computes
+        BLOCK_BT and TILE_K such that:
+            BLOCK_BT * codebook_dim * 2 + TILE_K * codebook_dim * 2 < SRAM * 0.9
+        For sm_89 (RTX 4060, 99KB SRAM per SM) with codebook_dim=32, the search
+        below picks BLOCK_BT=16, TILE_K=1024 (65KB); TILE_K=2048 (129KB) exceeds the budget.
+        The choice depends on codebook_dim only, not on codebook_size.
+        """
+        if not torch.cuda.is_available():
+            return
+        try:
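+            props = torch.cuda.get_device_properties(0)
+            # Use the device's opt-in shared-memory limit when PyTorch exposes it; otherwise assume 99KB (SM 8.9)
+            sram_budget = getattr(props, "shared_memory_per_block_optin", 99 * 1024)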
+            # Conservative estimate: each element is 2 bytes (bf16) in SRAM
+            elem_bytes = 2
+            # Find largest TILE_K that fits with BLOCK_BT=16
+            bt = 16
+            for tk in [2048, 1024, 512, 256, 128]:
+                sram_usage = bt * self.codebook_dim * elem_bytes + tk * self.codebook_dim * elem_bytes
+                if sram_usage < sram_budget * 0.9:
+                    self._triton_block_bt = bt
+                    self._triton_tile_k = tk
+                    return
+            # Fallback for very constrained SRAM or large codebook_dim
+            self._triton_block_bt = 8
+            self._triton_tile_k = 256
+        except Exception:
+            # Default values
+            self._triton_block_bt = 16
+            self._triton_tile_k = 1024
+    def forward(self, x: torch.Tensor):
+        """
+        Args:
+            x: Input tensor of shape [*, codebook_dim]
+        Returns:
+            quantized: Tensor of same shape as x
+            indices: Tensor of shape [*] with codebook indices
+            commitment_loss: Scalar tensor
+        """
+        orig_shape = x.shape
+        x_flat = x.reshape(-1, self.codebook_dim)
+        if x.is_cuda and _HAS_TRITON:
+            quantized, indices, commitment_loss = self._triton_forward(x_flat)
+        else:
+            quantized, indices, commitment_loss = self._cpu_forward(x_flat)
+        quantized = quantized.reshape(orig_shape)
+        indices = indices.reshape(orig_shape[:-1])
+        return quantized, indices, commitment_loss
+    def _triton_forward(self, x_flat: torch.Tensor):
+        """Triton GPU path — dispatched when CUDA + Triton available."""
+        # Use _TritonFlashVQFn for forward + backward via autograd
+        quantized, indices, commitment_loss = _TritonFlashVQFn.apply(
+            x_flat, self.embed, self.cluster_size, self.embed_avg,
+            self.codebook_size, self.codebook_dim,
+            self.commitment_weight, self.rotation_trick,
+        )
+        # EMA update and dead code reset (under torch.no_grad)
+        with torch.no_grad():
+            self._ema_update(x_flat, indices)
+            self._dead_code_reset(x_flat)
+        return quantized, indices, commitment_loss
+    def _cpu_forward(self, x_flat: torch.Tensor):
+        """
+        Pure PyTorch CPU path — implements all VQ operations.
+        Steps:
+          1. Cosine similarity lookup → nearest codebook entry indices
+          2. Quantize via straight-through estimator (or rotation trick)
+          3. Compute commitment loss
+          4. EMA update codebook (under torch.no_grad)
+          5. Dead code reset (under torch.no_grad)
+        """
+        # ── Step 1: Cosine similarity lookup ──
+        x_norm = F.normalize(x_flat.float(), dim=-1)
+        embed_norm = F.normalize(self.embed.float(), dim=-1)
+        sim = x_norm @ embed_norm.T  # [N, codebook_size]
+        indices = sim.argmax(dim=-1)  # [N]
+        # ── Step 2: Quantize with STE or rotation trick ──
+        with torch.no_grad():
+            quantized = self.embed[indices]  # [N, D]
+        if self.rotation_trick:
+            quantized = _RotationTrickFn.apply(x_flat, quantized)
+        else:
+            # Straight-through estimator
+            quantized = x_flat + (quantized - x_flat).detach()
+        # ── Step 3: Commitment loss ──
+        commitment_loss = self.commitment_weight * F.mse_loss(
+            x_flat.float(), quantized.detach().float()
+        )
+        # ── Step 4: EMA update ──
+        with torch.no_grad():
+            self._ema_update(x_flat, indices)
+            # ── Step 5: Dead code reset ──
+            self._dead_code_reset(x_flat)
+        return quantized, indices, commitment_loss
+    def _ema_update(self, x_flat: torch.Tensor, indices: torch.Tensor):
+        """
+        Exponential moving average codebook update.
+        Args:
+            x_flat: [N, D] input vectors
+            indices: [N] codebook indices for each input vector
+        """
+        one_hot = F.one_hot(indices, num_classes=self.codebook_size).float()  # [N, codebook_size]
+        n_assign = one_hot.sum(dim=0)  # [codebook_size]
+        # EMA on cluster_size (how many inputs assigned to each code)
+        self.cluster_size.mul_(self.decay).add_(n_assign * (1 - self.decay))
+        # EMA on embed_avg: weighted sum of assigned inputs
+        # embed_avg[c] = decay * embed_avg[c] + (1 - decay) * sum(x assigned to c)
+        # Computed for all codes at once (no per-code Python loop or .item() sync).
+        # Codes with no assignments still decay, matching the cluster_size EMA.
+        assigned_sum = torch.zeros_like(self.embed_avg, dtype=torch.float32)
+        assigned_sum.index_add_(0, indices, x_flat.float())  # [codebook_size, D]
+        self.embed_avg.mul_(self.decay).add_(assigned_sum * (1 - self.decay))
+        # Normalize: embed = embed_avg / cluster_size (with epsilon)
+        cluster_size_safe = self.cluster_size.clamp(min=1e-5)
+        self.embed.copy_(self.embed_avg / cluster_size_safe.unsqueeze(1))
+    def _dead_code_reset(self, x_flat: torch.Tensor):
+        """
+        Replace dead codebook entries (cluster_size < threshold) with
+        random vectors from the current input batch.
+        """
+        dead_mask = self.cluster_size < self.threshold_ema_dead_code
+        n_dead = dead_mask.sum().item()
+        if n_dead == 0:
+            return
+        dead_indices = torch.where(dead_mask)[0]
+        # Replace with random input vectors
+        rand_idx = torch.randint(0, x_flat.shape[0], (n_dead,), device=x_flat.device)
+        # Cast to the codebook dtype: indexed assignment requires matching dtypes
+        self.embed[dead_indices] = x_flat[rand_idx].detach().to(self.embed.dtype)
+        self.cluster_size[dead_indices] = 0.0
+        self.embed_avg[dead_indices] = 0.0
+    @torch.no_grad()
+    def kmeans_init_codebook(self, x: torch.Tensor):
+        """Initialize codebook via k-means on first batch."""
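+        x_flat = x.reshape(-1, self.codebook_dim).float()
+        if x_flat.shape[0] < self.codebook_size:
+            raise ValueError(
+                f"k-means init needs at least {self.codebook_size} vectors, got {x_flat.shape[0]}"
+            )
+        # Sample distinct input vectors as the initial centroids
+        perm = torch.randperm(x_flat.shape[0], device=x_flat.device)
+        centroids = x_flat[perm[:self.codebook_size]].clone()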
+        for _ in range(self.kmeans_iters):
+            dist = torch.cdist(x_flat, centroids)
+            assign = dist.argmin(dim=-1)
+            for i in range(self.codebook_size):
+                mask = assign == i
+                if mask.sum() > 0:
+                    centroids[i] = x_flat[mask].mean(dim=0)
+        self.embed.copy_(centroids)
+    @torch.no_grad()
+    def get_codebook_utilization(self) -> float:
+        """Fraction of codebook entries with any usage."""
+        return (self.cluster_size > 0).float().mean().item()
+    @torch.no_grad()
+    def get_dead_code_count(self) -> int:
+        """Number of codebook entries below EMA dead threshold."""
+        return (self.cluster_size < self.threshold_ema_dead_code).sum().item()
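Review note: the CPU path above combines a cosine-similarity lookup with a straight-through estimator. The following is a minimal standalone sketch of just those two steps, assuming plain tensors instead of the module's buffers; `cosine_vq_ste` is a hypothetical helper name and is not part of this commit.

```python
import torch
import torch.nn.functional as F

def cosine_vq_ste(x, codebook):
    """Nearest-code lookup by cosine similarity with a straight-through estimator.

    Hypothetical helper for illustration; mirrors steps 1-2 of _cpu_forward.
    """
    sim = F.normalize(x, dim=-1) @ F.normalize(codebook, dim=-1).T  # [N, K]
    indices = sim.argmax(dim=-1)                                    # [N]
    q = codebook[indices].detach()
    # Forward value equals q; backward treats the quantizer as identity w.r.t. x.
    quantized = x + (q - x).detach()
    return quantized, indices

torch.manual_seed(0)
x = torch.randn(6, 4, requires_grad=True)
codebook = torch.randn(10, 4)
quantized, indices = cosine_vq_ste(x, codebook)
quantized.sum().backward()  # gradient reaches x unchanged
```

With the straight-through form, `x.grad` is exactly the upstream gradient, which is why the commitment loss is needed to pull `x` toward its assigned code.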
+# ─── Triton GPU Kernels ───
+# Only defined when Triton is available
+if _HAS_TRITON:
+    @triton.jit
+    def _triton_flash_vq_lookup_kernel(
+        x_ptr, codebook_ptr, indices_ptr,
+        stride_xb, stride_xd,
+        stride_cb, stride_cd,
+        N_CTX: tl.constexpr,
+        CODEBOOK_SIZE: tl.constexpr,
+        CODEBOOK_DIM: tl.constexpr,
+        BLOCK_BT: tl.constexpr,
+        TILE_K: tl.constexpr,
+    ):
+        """
+        Tiled cosine similarity + argmax lookup for VQ codebook.
+        Architecture:
+            pid = batch tile index
+            Load input tile [BLOCK_BT, CODEBOOK_DIM]
+            Normalize in fp32
+            Tile over codebook in TILE_K chunks:
+                Load codebook tile [TILE_K, CODEBOOK_DIM]
+                Normalize in fp32
+                Compute dot product via tl.dot → [BLOCK_BT, TILE_K]
+                Update running argmax
+            Store best indices
+        SRAM: all arithmetic in fp32 with small tiles to fit 99KB budget.
+        """
+        pid = tl.program_id(0)
+        offs_bt = pid * BLOCK_BT + tl.arange(0, BLOCK_BT)
+        offs_d = tl.arange(0, CODEBOOK_DIM)
+        # ── Load input tile ──
+        x_ptrs = x_ptr + offs_bt[:, None] * stride_xb + offs_d[None, :] * stride_xd
+        x = tl.load(x_ptrs, mask=offs_bt[:, None] < N_CTX, other=0.0)
+        # ── Normalize input in fp32 (no keepdims in Triton tl.sum) ──
+        x_f32 = x.to(tl.float32)
+        x_sq = tl.sum(x_f32 * x_f32, axis=1)  # [BLOCK_BT]
+        x_norm_f32 = x_f32 / tl.sqrt(x_sq[:, None] + 1e-8)
+        # ── Running argmax over tiled codebook ──
+        best_sim = tl.full([BLOCK_BT], -float('inf'), dtype=tl.float32)
+        best_idx = tl.zeros([BLOCK_BT], dtype=tl.int32)
+        for k_start in range(0, CODEBOOK_SIZE, TILE_K):
+            offs_k = k_start + tl.arange(0, TILE_K)
+            k_mask = offs_k < CODEBOOK_SIZE
+            # Load codebook tile into fp32 directly for normalization
+            cb_ptrs = (codebook_ptr
+                       + offs_k[:, None] * stride_cb
+                       + offs_d[None, :] * stride_cd)
+            cb = tl.load(cb_ptrs, mask=k_mask[:, None], other=0.0)
+            # Normalize codebook tile in fp32
+            cb_f32 = cb.to(tl.float32)
+            cb_sq = tl.sum(cb_f32 * cb_f32, axis=1)  # [TILE_K]
+            cb_norm_f32 = cb_f32 / tl.sqrt(cb_sq[:, None] + 1e-8)
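+            # Cosine similarity via tl.dot (tf32 on sm_89)
+            sim = tl.dot(x_norm_f32, tl.trans(cb_norm_f32))  # [BLOCK_BT, TILE_K]
+            # Padded codebook lanes load as zeros (similarity 0); mask them out so
+            # they cannot win the argmax when CODEBOOK_SIZE % TILE_K != 0.
+            sim = tl.where(k_mask[None, :], sim, -float('inf'))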
+            # Running argmax within this tile
+            tile_max = tl.max(sim, axis=1)
+            tile_argmax = tl.argmax(sim, axis=1)
+            tile_idx = k_start + tile_argmax
+            # Merge with best across tiles using element-wise mask
+            update_mask = tile_max > best_sim
+            best_sim = tl.where(update_mask, tile_max, best_sim)
+            best_idx = tl.where(update_mask, tile_idx, best_idx)
+        # ── Store results ──
+        tl.store(indices_ptr + offs_bt, best_idx, mask=offs_bt < N_CTX)
+    @triton.jit
+    def _triton_flash_vq_quantize_kernel(
+        codebook_ptr, indices_ptr, quantized_ptr,
+        stride_cb, stride_cd,
+        stride_qb, stride_qd,
+        N_CTX: tl.constexpr,
+        CODEBOOK_DIM: tl.constexpr,
+        BLOCK_BT: tl.constexpr,
+    ):
+        """
+        Gather quantized vectors from codebook at given indices.
+        Kernel form of: quantized[i] = codebook[indices[i]]
+        """
+        pid = tl.program_id(0)
+        offs_bt = pid * BLOCK_BT + tl.arange(0, BLOCK_BT)
+        offs_d = tl.arange(0, CODEBOOK_DIM)
+        # Load indices for this batch tile
+        idx = tl.load(indices_ptr + offs_bt, mask=offs_bt < N_CTX, other=0)
+        # Gather: for each i in BLOCK_BT, load codebook[idx[i], :]
+        # Pointer arithmetic with broadcasting
+        gather_ptrs = (codebook_ptr
+                       + idx[:, None] * stride_cb
+                       + offs_d[None, :] * stride_cd)
+        quantized = tl.load(gather_ptrs,
+                            mask=offs_bt[:, None] < N_CTX,
+                            other=0.0)
+        # Store quantized output
+        out_ptrs = (quantized_ptr
+                    + offs_bt[:, None] * stride_qb
+                    + offs_d[None, :] * stride_qd)
+        tl.store(out_ptrs, quantized, mask=offs_bt[:, None] < N_CTX)
+    def _triton_lookup(x, embed, block_bt=None, tile_k=None):
+        """
+        Launch Triton VQ lookup kernel with SRAM-safe tile sizes.
+        Args:
+            x: [N, D] input tensor (cuda, contiguous)
+            embed: [codebook_size, D] codebook (cuda, contiguous)
+            block_bt: BLOCK_BT tile size (auto-computed if None)
+            tile_k: TILE_K tile size (auto-computed if None)
+        Returns:
+            indices: [N] int64 tensor of argmax indices
+        """
+        N, D = x.shape
+        codebook_size = embed.shape[0]
+        assert embed.shape[1] == D, f"Codebook dim {embed.shape[1]} != input dim {D}"
+        # SRAM-safe tile sizes: the kernel does fp32 math (tf32 in tl.dot), and Triton
+        # pipelines data through shared memory. Conservative sizing keeps usage
+        # within ~99KB (sm_89) even with the default num_stages=3.
+        #
+        # fp32 codebook tile: TILE_K * D * 4  →  128*32*4 = 16KB
+        # fp32 input tile:    BLOCK_BT * D * 4 →  8*32*4 = 1KB
+        # Accumulator:        BLOCK_BT*TILE_K*4 → 8*128*4 = 4KB
+        # Per stage: ~21KB. With 3 pipeline stages: ~63KB (fits in 99KB).
+        #
+        # Larger tiles oversubscribe SRAM (tested: TILE_K=1024 → 321KB needed).
+        if block_bt is None or tile_k is None:
+            BLOCK_BT = 8
+            TILE_K = 128
+        else:
+            BLOCK_BT, TILE_K = block_bt, tile_k
+        grid = (triton.cdiv(N, BLOCK_BT),)
+        indices = torch.empty(N, dtype=torch.int32, device=x.device)
+        _triton_flash_vq_lookup_kernel[grid](
+            x, embed, indices,
+            x.stride(0), x.stride(1),
+            embed.stride(0), embed.stride(1),
+            N, codebook_size, D,
+            BLOCK_BT=BLOCK_BT, TILE_K=TILE_K,
+        )
+        return indices.long()
+class _TritonFlashVQFn(torch.autograd.Function):
+    """
+    Custom autograd Function wrapping Triton VQ kernels.
+    Forward: Triton tiled cosine similarity + argmax lookup
+    Backward: Rotation trick gradient or straight-through estimator
+    """
+    @staticmethod
+    def forward(ctx, x_flat, embed, cluster_size, embed_avg,
+                codebook_size, codebook_dim,
+                commitment_weight, rotation_trick):
+        # Triton tiled lookup for indices
+        with torch.no_grad():
+            indices = _triton_lookup(x_flat.contiguous(), embed.contiguous())
+        quantized = embed[indices]
+        commitment_loss = commitment_weight * F.mse_loss(x_flat.float(), quantized.detach().float())
+        # Clone saved tensors to avoid version conflicts with in-place EMA updates
+        ctx.save_for_backward(
+            x_flat.detach().clone(),
+            quantized.detach().clone(),
+            embed.detach().clone(),
+        )
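+        ctx.codebook_dim = codebook_dim
+        ctx.rotation_trick = rotation_trick
+        ctx.commitment_weight = commitment_weight
+        return quantized, indices, commitment_loss
+    @staticmethod
+    def backward(ctx, grad_quantized, grad_indices, grad_commitment):
+        x_flat, quantized, embed = ctx.saved_tensors
+        if ctx.rotation_trick:
+            # Rotation trick gradient
+            x_norm = F.normalize(x_flat.float(), dim=-1)
+            q_norm = F.normalize(quantized.float(), dim=-1)
+            diff = x_norm - q_norm
+            proj = (grad_quantized.float() * x_norm).sum(dim=-1, keepdim=True)
+            grad_x = grad_quantized.float() - proj * diff
+        else:
+            # Straight-through estimator
+            grad_x = grad_quantized.float()
+        # The commitment loss is computed inside forward, so autograd cannot see it.
+        # Add its gradient here: d/dx [w * mean((x - q)^2)] = 2w (x - q) / numel.
+        if grad_commitment is not None:
+            diff_xq = x_flat.float() - quantized.float()
+            scale = 2.0 * ctx.commitment_weight / diff_xq.numel()
+            grad_x = grad_x + grad_commitment.float() * scale * diff_xq
+        return grad_x.to(grad_quantized.dtype), None, None, None, None, None, None, None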
+# When Triton is not available, define a fallback lookup
+if not _HAS_TRITON:
+    def _triton_lookup(x, embed):
+        """Fallback: torch-based cosine similarity lookup (CPU or CUDA without Triton)."""
+        with torch.no_grad():
+            x_norm = F.normalize(x.float(), dim=-1)
+            embed_norm = F.normalize(embed.float(), dim=-1)
+            sim = x_norm @ embed_norm.T
+            indices = sim.argmax(dim=-1)
+        return indices
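Review note: the lookup kernel in this file keeps a running best similarity and index while it walks the codebook in tiles. The sketch below reproduces that merge logic in plain PyTorch under stated assumptions: `tiled_cosine_argmax` is a hypothetical name, tiles are taken by slicing (so no padding lanes exist), and everything runs in fp32.

```python
import torch
import torch.nn.functional as F

def tiled_cosine_argmax(x, codebook, tile_k=4):
    """Argmax of cosine similarity, visiting the codebook tile_k rows at a time.

    Hypothetical reference for the running-argmax merge; not the Triton kernel.
    """
    x_n = F.normalize(x.float(), dim=-1)
    n = x.shape[0]
    best_sim = torch.full((n,), float("-inf"), device=x.device)
    best_idx = torch.zeros(n, dtype=torch.long, device=x.device)
    for k_start in range(0, codebook.shape[0], tile_k):
        tile = F.normalize(codebook[k_start:k_start + tile_k].float(), dim=-1)
        sim = x_n @ tile.T                      # [N, <=tile_k]
        tile_max, tile_arg = sim.max(dim=1)
        update = tile_max > best_sim            # strict: earlier tile wins ties
        best_sim = torch.where(update, tile_max, best_sim)
        best_idx = torch.where(update, k_start + tile_arg, best_idx)
    return best_idx

torch.manual_seed(0)
x = torch.randn(7, 8)
codebook = torch.randn(10, 8)   # 10 is not a multiple of tile_k=4
idx = tiled_cosine_argmax(x, codebook)
```

Slicing makes the last tile shorter rather than padded. A kernel with fixed-size tiles has to exclude the padded lanes from the argmax explicitly.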

arbitor/kernel/ternary_audit.py ADDED

	@@ -0,0 +1,192 @@

+from __future__ import annotations
+from dataclasses import dataclass
+from typing import Iterable
+import torch
+@dataclass
+class TensorState:
+    name: str
+    shape: tuple[int, ...]
+    dtype: str
+    bytes: int
+    trainable: bool = False
+@dataclass
+class TernaryAudit:
+    logical_ternary_weights: int
+    ternary_packed_bytes: int
+    ternary_scale_bytes: int
+    ternary_scale_accum_bytes: int
+    ternary_accum_bytes: int
+    ternary_corr_accum_bytes: int
+    ternary_step_counter_bytes: int
+    trainable_float_params: list[TensorState]
+    frozen_float_params: list[TensorState]
+    float_buffers: list[TensorState]
+    @property
+    def ternary_training_bytes(self) -> int:
+        return (
+            self.ternary_packed_bytes
+            + self.ternary_scale_bytes
+            + self.ternary_scale_accum_bytes
+            + self.ternary_accum_bytes
+            + self.ternary_corr_accum_bytes
+            + self.ternary_step_counter_bytes
+        )
+    @property
+    def trainable_float_bytes(self) -> int:
+        return sum(item.bytes for item in self.trainable_float_params)
+    @property
+    def frozen_float_bytes(self) -> int:
+        return sum(item.bytes for item in self.frozen_float_params)
+    @property
+    def float_buffer_bytes(self) -> int:
+        return sum(item.bytes for item in self.float_buffers)
+def _tensor_bytes(t: torch.Tensor) -> int:
+    return t.numel() * t.element_size()
+def _tensor_state(name: str, t: torch.Tensor, trainable: bool = False) -> TensorState:
+    return TensorState(
+        name=name,
+        shape=tuple(t.shape),
+        dtype=str(t.dtype).replace("torch.", ""),
+        bytes=_tensor_bytes(t),
+        trainable=trainable,
+    )
+def _mb(n_bytes: int) -> float:
+    return n_bytes / (1024 * 1024)
+def audit_model(model: torch.nn.Module) -> TernaryAudit:
+    logical_ternary_weights = 0
+    ternary_packed_bytes = 0
+    ternary_scale_bytes = 0
+    ternary_scale_accum_bytes = 0
+    ternary_accum_bytes = 0
+    ternary_corr_accum_bytes = 0
+    ternary_step_counter_bytes = 0
+    for module in model.modules():
+        if hasattr(module, "T_packed") and hasattr(module, "_T_shape"):
+            shape = tuple(int(x) for x in module._T_shape.tolist())
+            n_weights = 1
+            for dim in shape:
+                n_weights *= dim
+            logical_ternary_weights += n_weights
+            ternary_packed_bytes += _tensor_bytes(module.T_packed)
+            if hasattr(module, "E"):
+                ternary_scale_bytes += _tensor_bytes(module.E)
+            if hasattr(module, "E_accum"):
+                ternary_scale_accum_bytes += _tensor_bytes(module.E_accum)
+            if hasattr(module, "T_accum"):
+                ternary_accum_bytes += _tensor_bytes(module.T_accum)
+            if hasattr(module, "corr_accum"):
+                ternary_corr_accum_bytes += _tensor_bytes(module.corr_accum)
+            if hasattr(module, "step_counter"):
+                ternary_step_counter_bytes += _tensor_bytes(module.step_counter)
+    trainable_float_params: list[TensorState] = []
+    frozen_float_params: list[TensorState] = []
+    for name, param in model.named_parameters():
+        if not param.dtype.is_floating_point:
+            continue
+        state = _tensor_state(name, param, trainable=param.requires_grad)
+        if param.requires_grad:
+            trainable_float_params.append(state)
+        else:
+            frozen_float_params.append(state)
+    float_buffers = [
+        _tensor_state(name, buf)
+        for name, buf in model.named_buffers()
+        if buf.dtype.is_floating_point
+    ]
+    return TernaryAudit(
+        logical_ternary_weights=logical_ternary_weights,
+        ternary_packed_bytes=ternary_packed_bytes,
+        ternary_scale_bytes=ternary_scale_bytes,
+        ternary_scale_accum_bytes=ternary_scale_accum_bytes,
+        ternary_accum_bytes=ternary_accum_bytes,
+        ternary_corr_accum_bytes=ternary_corr_accum_bytes,
+        ternary_step_counter_bytes=ternary_step_counter_bytes,
+        trainable_float_params=trainable_float_params,
+        frozen_float_params=frozen_float_params,
+        float_buffers=float_buffers,
+    )
+def format_audit(audit: TernaryAudit, limit: int = 12) -> str:
+    lines = [
+        "Ternary state audit:",
+        f"  logical ternary weights: {audit.logical_ternary_weights:,}",
+        (
+            "  ternary training state: "
+            f"{_mb(audit.ternary_training_bytes):.2f} MB "
+            f"(T={_mb(audit.ternary_packed_bytes):.2f}, "
+            f"E={_mb(audit.ternary_scale_bytes):.2f}, "
+            f"E_accum={_mb(audit.ternary_scale_accum_bytes):.2f}, "
+            f"T_accum={_mb(audit.ternary_accum_bytes):.2f}, "
+            f"corr_accum={_mb(audit.ternary_corr_accum_bytes):.2f}, "
+            f"steps={_mb(audit.ternary_step_counter_bytes):.4f})"
+        ),
+        (
+            "  trainable float params: "
+            f"{len(audit.trainable_float_params)} tensors, "
+            f"{_mb(audit.trainable_float_bytes):.2f} MB"
+        ),
+        (
+            "  frozen float params: "
+            f"{len(audit.frozen_float_params)} tensors, "
+            f"{_mb(audit.frozen_float_bytes):.2f} MB"
+        ),
+        (
+            "  float buffers: "
+            f"{len(audit.float_buffers)} tensors, "
+            f"{_mb(audit.float_buffer_bytes):.2f} MB"
+        ),
+    ]
+    if audit.trainable_float_params:
+        lines.append("  largest trainable float params:")
+        for item in sorted(audit.trainable_float_params, key=lambda x: x.bytes, reverse=True)[:limit]:
+            lines.append(f"    {item.name}: {item.shape} {item.dtype} {_mb(item.bytes):.2f} MB")
+    if audit.float_buffers:
+        lines.append("  largest float buffers:")
+        for item in sorted(audit.float_buffers, key=lambda x: x.bytes, reverse=True)[:limit]:
+            lines.append(f"    {item.name}: {item.shape} {item.dtype} {_mb(item.bytes):.2f} MB")
+    return "\n".join(lines)
+def freeze_float_parameters(
+    model: torch.nn.Module,
+    allow_prefixes: Iterable[str] = (),
+) -> list[TensorState]:
+    allow = tuple(allow_prefixes)
+    frozen: list[TensorState] = []
+    for name, param in model.named_parameters():
+        if allow and name.startswith(allow):
+            continue
+        if param.dtype.is_floating_point and param.requires_grad:
+            frozen.append(_tensor_state(name, param, trainable=True))
+            param.requires_grad_(False)
+    return frozen
+def trainable_parameters(model: torch.nn.Module) -> list[torch.nn.Parameter]:
+    return [p for p in model.parameters() if p.requires_grad]
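Review note: `audit_model` separates floating-point state into trainable parameters, frozen parameters, and buffers, and sizes each tensor as `numel() * element_size()`. A minimal sketch of that accounting on a toy model follows; `float_state_bytes` is a hypothetical helper and skips the ternary-specific fields.

```python
import torch
import torch.nn as nn

def float_state_bytes(model):
    """Return (trainable, frozen, buffer) byte totals for floating-point tensors.

    Hypothetical helper mirroring the float-side bookkeeping of audit_model.
    """
    def nbytes(t):
        return t.numel() * t.element_size()

    trainable = frozen = 0
    for p in model.parameters():
        if not p.dtype.is_floating_point:
            continue
        if p.requires_grad:
            trainable += nbytes(p)
        else:
            frozen += nbytes(p)
    buffers = sum(nbytes(b) for b in model.buffers() if b.dtype.is_floating_point)
    return trainable, frozen, buffers

model = nn.Sequential(nn.Linear(4, 3), nn.BatchNorm1d(3))
model[0].bias.requires_grad_(False)   # freeze one parameter
totals = float_state_bytes(model)
```

The integer `num_batches_tracked` buffer of the batch-norm layer is skipped, just as the audit ignores non-float tensors in its float totals.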

arbitor/kernel/ternary_scale.py ADDED

	@@ -0,0 +1,1811 @@

+import os
+import threading
+import warnings
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from enum import IntEnum
+from math import ceil
+from ..converters.convert_to_ternary8 import pack_ternary, unpack_ternary
+_HAS_TILELANG = False
+try:
+    import tilelang
+    import tilelang.language as T
+    _HAS_TILELANG = True
+except ImportError:
+    pass
+_HAS_TRITON = False
+try:
+    import triton
+    import triton.language as tl
+    _HAS_TRITON = True
+except ImportError:
+    pass
+def _backend_preference() -> str:
+    backend = os.environ.get("ARB_TERNARY_BACKEND", "auto").strip().lower()
+    if backend not in {"auto", "tilelang", "triton", "torch"}:
+        warnings.warn(
+            f"Unknown ARB_TERNARY_BACKEND={backend!r}; falling back to auto.",
+            RuntimeWarning,
+            stacklevel=2,
+        )
+        return "auto"
+    return backend
+def _rmsnorm_triton_max_dim() -> int:
+    raw = os.environ.get("ARB_RMSNORM_TRITON_MAX_DIM", "4096").strip()
+    try:
+        return max(0, int(raw))
+    except ValueError:
+        warnings.warn(
+            f"Invalid ARB_RMSNORM_TRITON_MAX_DIM={raw!r}; using 4096.",
+            RuntimeWarning,
+            stacklevel=2,
+        )
+        return 4096
+def _bigint_corr_strength() -> float:
+    raw = os.environ.get("ARB_BIGINT_CORR_STRENGTH", "4.0").strip()
+    try:
+        return float(raw)
+    except ValueError:
+        warnings.warn(
+            f"Invalid ARB_BIGINT_CORR_STRENGTH={raw!r}; using 4.0.",
+            RuntimeWarning,
+            stacklevel=2,
+        )
+        return 4.0
+class _ComponentContext:
+    _local = threading.local()
+    @classmethod
+    def get(cls):
+        val = getattr(cls._local, "current", None)
+        if val is None:
+            return None, 1.0
+        return val
+    @classmethod
+    def set(cls, name, weight=1.0):
+        if name is None:
+            cls._local.current = None
+        else:
+            cls._local.current = (name, weight)
+    @classmethod
+    def clear(cls):
+        cls._local.current = None
+_COMPONENT_CONTEXT = _ComponentContext
+def _tilelang_training_enabled() -> bool:
+    return os.environ.get("ARB_TILELANG_TRAINING", "0").strip().lower() in {"1", "true", "yes"}
+if _HAS_TILELANG:
+    tilelang_jit = tilelang.jit(pass_configs={"tl.disable_warp_specialized": True})
+    def _ternary_fwd_kernel(
+        M: int, N: int, K: int, group_size: int = 12,
+        corr_strength: float = 4.0,
+        block_M: int = 64, block_N: int = 64, block_K: int = 32,
+        threads: int = 128, num_stages: int = 2,
+    ):
+        gpr = (K + group_size - 1) // group_size
+        cs = corr_strength
+        @T.prim_func
+        def kernel(
+            x: T.Tensor((M, K), "float16"),
+            T_packed: T.Tensor((N * K + 4) // 5, "uint8"),
+            E: T.Tensor((N * gpr), "int8"),
+            corr_accum: T.Tensor((N * gpr), "int64"),
+            step_counter: T.Tensor((1,), "int64"),
+            output: T.Tensor((M, N), "float32"),
+        ):
+            steps = T.cast(step_counter[0], "int32")
+            with T.Kernel(T.ceildiv(M, block_M), T.ceildiv(N, block_N), threads=threads) as (bx, by):
+                x_shared = T.alloc_shared((block_M, block_K), dtype="float16")
+                dq_shared = T.alloc_shared((block_N, block_K), dtype="float16")
+                acc = T.alloc_fragment((block_M, block_N), dtype="float32")
+                T.use_swizzle(10)
+                T.clear(acc)
+                for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=num_stages):
+                    T.copy(x[bx * block_M, k * block_K], x_shared)
+                    for i, j in T.Parallel(block_N, block_K):
+                        i_glob = by * block_N + i
+                        j_glob = k * block_K + j
+                        if i_glob < N and j_glob < K:
+                            lin_idx = i_glob * K + j_glob
+                            pack_idx = lin_idx // 5
+                            trit_pos = lin_idx % 5
+                            packed_val = T.cast(T_packed[pack_idx], "int32")
+                            trit = T.if_then_else(
+                                trit_pos == 0, packed_val % 3,
+                                T.if_then_else(trit_pos == 1, (packed_val // 3) % 3,
+                                T.if_then_else(trit_pos == 2, (packed_val // 9) % 3,
+                                T.if_then_else(trit_pos == 3, (packed_val // 27) % 3,
+                                (packed_val // 81) % 3))))
+                            sign_val = T.cast(trit, "int32") - 1
+                            exp_idx = i_glob * gpr + j_glob // group_size
+                            exp_val = T.cast(E[exp_idx], "int32")
+                            ca = T.cast(corr_accum[exp_idx], "int32")
+                            den = T.max(steps * group_size, 1)
+                            mc = T.cast(ca, "float32") / T.cast(den, "float32")
+                            e_adj = T.cast(exp_val, "float32") + mc * cs
+                            ecl = T.min(T.max(e_adj, -14.0), 15.0)
+                            dq_shared[i, j] = T.cast(T.exp2(ecl) * T.cast(sign_val, "float32"), "float16")
+                    T.gemm(x_shared, dq_shared, acc, transpose_B=True)
+                T.copy(acc, output[bx * block_M, by * block_N])
+        return tilelang_jit(kernel)
+    def _ternary_grad_x_kernel(
+        M: int, N: int, K: int, group_size: int = 12,
+        corr_strength: float = 4.0,
+        block_M: int = 64, block_N: int = 64, block_K: int = 32,
+        threads: int = 128, num_stages: int = 2,
+    ):
+        gpr = (K + group_size - 1) // group_size
+        cs = corr_strength
+        @T.prim_func
+        def kernel(
+            grad_y: T.Tensor((M, N), "float16"),
+            T_packed: T.Tensor((N * K + 4) // 5, "uint8"),
+            E: T.Tensor((N * gpr), "int8"),
+            corr_accum: T.Tensor((N * gpr), "int64"),
+            step_counter: T.Tensor((1,), "int64"),
+            output: T.Tensor((M, K), "float32"),
+        ):
+            steps = T.cast(step_counter[0], "int32")
+            with T.Kernel(T.ceildiv(M, block_M), T.ceildiv(K, block_K), threads=threads) as (bx, by):
+                gy_shared = T.alloc_shared((block_M, block_N), dtype="float16")
+                dq_shared = T.alloc_shared((block_N, block_K), dtype="float16")
+                acc = T.alloc_fragment((block_M, block_K), dtype="float32")
+                T.use_swizzle(10)
+                T.clear(acc)
+                for n in T.Pipelined(T.ceildiv(N, block_N), num_stages=num_stages):
+                    T.copy(grad_y[bx * block_M, n * block_N], gy_shared)
+                    for i, j in T.Parallel(block_N, block_K):
+                        i_glob = n * block_N + i
+                        j_glob = by * block_K + j
+                        if i_glob < N and j_glob < K:
+                            lin_idx = i_glob * K + j_glob
+                            pack_idx = lin_idx // 5
+                            trit_pos = lin_idx % 5
+                            packed_val = T.cast(T_packed[pack_idx], "int32")
+                            trit = T.if_then_else(
+                                trit_pos == 0, packed_val % 3,
+                                T.if_then_else(trit_pos == 1, (packed_val // 3) % 3,
+                                T.if_then_else(trit_pos == 2, (packed_val // 9) % 3,
+                                T.if_then_else(trit_pos == 3, (packed_val // 27) % 3,
+                                (packed_val // 81) % 3))))
+                            sign_val = T.cast(trit, "int32") - 1
+                            exp_idx = i_glob * gpr + j_glob // group_size
+                            exp_val = T.cast(E[exp_idx], "int32")
+                            ca = T.cast(corr_accum[exp_idx], "int32")
+                            den = T.max(steps * group_size, 1)
+                            mc = T.cast(ca, "float32") / T.cast(den, "float32")
+                            e_adj = T.cast(exp_val, "float32") + mc * cs
+                            ecl = T.min(T.max(e_adj, -14.0), 15.0)
+                            dq_shared[i, j] = T.cast(T.exp2(ecl) * T.cast(sign_val, "float32"), "float16")
+                    T.gemm(gy_shared, dq_shared, acc)
+                T.copy(acc, output[bx * block_M, by * block_K])
+        return tilelang_jit(kernel)
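Review note: both kernels above decode five trits from each byte as `(packed // 3**p) % 3` and map the digit to a sign with `- 1`. The sketch below shows a pack/unpack pair that is consistent with that decoding. The function names are hypothetical, and the layout is an assumption inferred from the kernel arithmetic; the actual `pack_ternary` in `convert_to_ternary8` is not shown in this diff.

```python
import torch

_POWERS = torch.tensor([1, 3, 9, 27, 81])  # 3**p for trit positions 0..4

def pack_trits_base3(w):
    """Pack values in {-1, 0, +1} five per byte, least-significant trit first."""
    digits = (w.flatten() + 1).to(torch.int64)            # map to {0, 1, 2}
    pad = (-digits.numel()) % 5
    # Pad with digit 1, which decodes to a zero weight.
    digits = torch.cat([digits, torch.ones(pad, dtype=torch.int64)])
    return (digits.view(-1, 5) * _POWERS).sum(dim=1).to(torch.uint8)

def unpack_trits_base3(packed, n):
    """Inverse of pack_trits_base3 for the first n weights."""
    digits = (packed.to(torch.int64)[:, None] // _POWERS) % 3
    return digits.flatten()[:n] - 1

w = torch.tensor([-1, 0, 1, 1, 0, -1, 1])
packed = pack_trits_base3(w)
restored = unpack_trits_base3(packed, w.numel())
```

Five base-3 digits reach at most 2 * (1 + 3 + 9 + 27 + 81) = 242, so they fit in one `uint8`, which matches the `(N * K + 4) // 5` buffer length the kernels declare.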
+_KERNEL_CACHE_FWD = {}
+_KERNEL_CACHE_GX = {}
+def _get_kernel(M, N, K, group_size, mode, corr_strength=4.0):
+    cs = corr_strength
+    if mode == "fwd":
+        cache = _KERNEL_CACHE_FWD
+        key = (M, N, K, group_size, cs)
+        if key not in cache:
+            cache[key] = _ternary_fwd_kernel(M, N, K, group_size, corr_strength=cs)
+        return cache[key]
+    elif mode == "grad_x":
+        cache = _KERNEL_CACHE_GX
+        # Key on corr_strength too, so forward and grad_x dequantize identically
+        key = (M, N, K, group_size, cs)
+        if key not in cache:
+            cache[key] = _ternary_grad_x_kernel(M, N, K, group_size, corr_strength=cs)
+        return cache[key]
+    raise ValueError(f"Unknown TileLang kernel mode: {mode}")
+def _get_grad_kernels(M, N, K, group_size, corr_strength=4.0):
+    return _get_kernel(M, N, K, group_size, "grad_x", corr_strength=corr_strength)
+class _TernaryLinearFn(torch.autograd.Function):
+    @staticmethod
+    def forward(ctx, x, module, fwd_kernel):
+        ctx.module = module
+        T_packed = module.T_packed
+        E = module.E
+        shape = tuple(module._T_shape.tolist())
+        N, K = shape
+        x_2d = x.reshape(-1, K).contiguous()
+        ctx.group_size = module.group_size
+        ctx.shape = shape
+        ctx.x_shape = x.shape
+        comp_name, _ = _COMPONENT_CONTEXT.get()
+        ctx.comp_name = comp_name
+        ctx.x_dtype = x.dtype
+        has_corr = hasattr(module, "corr_accum") and hasattr(module, "step_counter")
+        ctx.save_for_backward(x_2d, T_packed, E)
+        ctx.has_corr = has_corr
+        ctx.step_snapshot = int(module.step_counter.item()) if has_corr else 0
+        with torch.no_grad():
+            M = x_2d.shape[0]
+            output = torch.empty(M, N, device=x.device, dtype=torch.float32)
+            if has_corr:
+                fwd_kernel(x_2d.half(), T_packed, E,
+                           module.corr_accum.contiguous(),
+                           module.step_counter.contiguous(), output)
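+            else:
+                # No correction state on this module: pass zero-filled int64 placeholders
+                # (one entry per output row and group) so the kernel signature is unchanged.
+                fwd_kernel(x_2d.half(), T_packed, E,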
+                           torch.zeros(N * ((K + module.group_size - 1) // module.group_size),
+                                       dtype=torch.int64, device=x.device),
+                           torch.zeros(1, dtype=torch.int64, device=x.device), output)
+        return output.reshape(*x.shape[:-1], N)
+    @staticmethod
+    def backward(ctx, grad_output):
+        x_2d, T_packed, E = ctx.saved_tensors
+        group_size = ctx.group_size
+        N, K = ctx.shape
+        M = x_2d.shape[0]
+        grad_2d = grad_output.reshape(-1, N).contiguous()
+        if ctx.has_corr:
+            corr_accum = ctx.module.corr_accum.contiguous()
+            step_counter = torch.tensor([ctx.step_snapshot], dtype=torch.int64, device=x_2d.device)
+        else:
+            corr_accum = torch.zeros(N * ((K + group_size - 1) // group_size),
+                                     dtype=torch.int64, device=x_2d.device)
+            step_counter = torch.zeros(1, dtype=torch.int64, device=x_2d.device)
+        grad_x_kernel = _get_grad_kernels(M, N, K, group_size)
+        with torch.no_grad():
+            grad_x = torch.empty(M, K, device=x_2d.device, dtype=torch.float32)
+            grad_x_kernel(grad_2d.half(), T_packed, E, corr_accum, step_counter, grad_x)
+            comp_name = ctx.comp_name
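+            # Streaming path: fold this backward pass directly into the int64 correction accumulator.
+            if _HAS_TRITON and ctx.has_corr and getattr(ctx.module, "_stream_backward_updates", True):
+                bwd_name, bwd_weight = _COMPONENT_CONTEXT.get()
+                if bwd_name is None:
+                    bwd_weight = 1.0
+                # Scale the base step by |component weight| (at least 1) and keep the weight's sign.
+                base_step = int(getattr(ctx.module, "_backward_t_accum_step", 1))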
+                corr_step = max(1, int(round(abs(float(bwd_weight)) * base_step)))
+                if bwd_weight < 0:
+                    corr_step = -corr_step
+                _triton_accumulate_corr_direct(
+                    T_packed, grad_2d, x_2d, ctx.module.corr_accum,
+                    N, K, group_size, corr_step=corr_step,
+                )
+                ctx.module.step_counter.add_(abs(corr_step))
+                ctx.module._streamed_bigint_backward = True
+            elif _HAS_TRITON:
+                # Deferred path: keep only sign(grad^T @ x) and stash it on the module.
+                grad_sign = _triton_ternary_grad_sign(grad_2d, x_2d, N, K)
+                if comp_name is not None:
+                    setattr(ctx.module, f"_hook_grad_T_sign_{comp_name}", grad_sign.detach())
+                else:
+                    ctx.module._hook_grad_T_sign = grad_sign.detach()
+            elif comp_name is not None:
+                setattr(ctx.module, f"_hook_grad_2d_{comp_name}", grad_2d.detach())
+                setattr(ctx.module, f"_hook_x_2d_{comp_name}", x_2d.detach())
+            else:
+                ctx.module._hook_grad_2d = grad_2d.detach()
+                ctx.module._hook_x_2d = x_2d.detach()
+        grad_x_reshaped = grad_x.reshape(*ctx.x_shape).to(dtype=ctx.x_dtype)
+        return grad_x_reshaped, None, None
+if _HAS_TRITON:
+    @triton.jit
+    def _triton_ternary_fwd_kernel(
+        x_ptr, packed_ptr, e_ptr, corr_ptr, step_ptr, out_ptr,
+        M: tl.constexpr, N: tl.constexpr, K: tl.constexpr,
+        GPR: tl.constexpr, GROUP_SIZE: tl.constexpr,
+        CORR_STRENGTH: tl.constexpr,
+        BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
+    ):
+        pid_m = tl.program_id(0)
+        pid_n = tl.program_id(1)
+        offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
+        offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
+        offs_k = tl.arange(0, BLOCK_K)
+        acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
+        for k0 in range(0, K, BLOCK_K):
+            k = k0 + offs_k
+            x = tl.load(
+                x_ptr + offs_m[:, None] * K + k[None, :],
+                mask=(offs_m[:, None] < M) & (k[None, :] < K),
+                other=0.0,
+            )
+            # Weights are stored row-major as base-3 digits, five trits per byte.
+            lin = offs_n[:, None] * K + k[None, :]
+            pack_idx = lin // 5
+            trit_pos = lin - pack_idx * 5
+            packed = tl.load(
+                packed_ptr + pack_idx,
+                mask=(offs_n[:, None] < N) & (k[None, :] < K),
+                other=0,
+            ).to(tl.int32)
+            divisor = tl.where(
+                trit_pos == 0, 1,
+                tl.where(trit_pos == 1, 3,
+                tl.where(trit_pos == 2, 9,
+                tl.where(trit_pos == 3, 27, 81))),
+            )
+            # Extract base-3 digit `trit_pos` (0, 1 or 2) and map it to a sign in {-1, 0, +1}.
+            trit = (packed // divisor) % 3
+            sign = trit.to(tl.int32) - 1
+            e_idx = offs_n[:, None] * GPR + k[None, :] // GROUP_SIZE
+            e_val = tl.load(
+                e_ptr + e_idx,
+                mask=(offs_n[:, None] < N) & (k[None, :] < K),
+                other=0,
+            ).to(tl.float32)
+            corr_val = tl.load(
+                corr_ptr + e_idx,
+                mask=(offs_n[:, None] < N) & (k[None, :] < K),
+                other=0,
+            ).to(tl.float32)
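+            # Effective exponent: stored E plus the running correction, normalized by
+            # (steps * GROUP_SIZE, floored at 1) and scaled by CORR_STRENGTH.
+            step_val = tl.load(step_ptr).to(tl.float32)
+            denom = tl.maximum(step_val * GROUP_SIZE, 1.0)
+            e_adj = e_val + (corr_val / denom) * CORR_STRENGTH
+            w = sign.to(tl.float32) * tl.exp2(e_adj)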
+            w = tl.where((offs_n[:, None] < N) & (k[None, :] < K), w, 0.0)
+            acc += tl.dot(x, tl.trans(w))
+        tl.store(
+            out_ptr + offs_m[:, None] * N + offs_n[None, :],
+            acc,
+            mask=(offs_m[:, None] < M) & (offs_n[None, :] < N),
+        )
+    @triton.jit
+    def _triton_ternary_grad_x_kernel(
+        grad_ptr, packed_ptr, e_ptr, corr_ptr, step_ptr, out_ptr,
+        M: tl.constexpr, N: tl.constexpr, K: tl.constexpr,
+        GPR: tl.constexpr, GROUP_SIZE: tl.constexpr,
+        CORR_STRENGTH: tl.constexpr,
+        BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
+    ):
+        pid_m = tl.program_id(0)
+        pid_k = tl.program_id(1)
+        offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
+        offs_k = pid_k * BLOCK_K + tl.arange(0, BLOCK_K)
+        offs_n = tl.arange(0, BLOCK_N)
+        acc = tl.zeros((BLOCK_M, BLOCK_K), dtype=tl.float32)
+        for n0 in range(0, N, BLOCK_N):
+            n = n0 + offs_n
+            grad = tl.load(
+                grad_ptr + offs_m[:, None] * N + n[None, :],
+                mask=(offs_m[:, None] < M) & (n[None, :] < N),
+                other=0.0,
+            )
+            lin = n[:, None] * K + offs_k[None, :]
+            pack_idx = lin // 5
+            trit_pos = lin - pack_idx * 5
+            packed = tl.load(
+                packed_ptr + pack_idx,
+                mask=(n[:, None] < N) & (offs_k[None, :] < K),
+                other=0,
+            ).to(tl.int32)
+            divisor = tl.where(
+                trit_pos == 0, 1,
+                tl.where(trit_pos == 1, 3,
+                tl.where(trit_pos == 2, 9,
+                tl.where(trit_pos == 3, 27, 81))),
+            )
+            trit = (packed // divisor) % 3
+            sign = trit.to(tl.int32) - 1
+            e_idx = n[:, None] * GPR + offs_k[None, :] // GROUP_SIZE
+            e_val = tl.load(
+                e_ptr + e_idx,
+                mask=(n[:, None] < N) & (offs_k[None, :] < K),
+                other=0,
+            ).to(tl.float32)
+            corr_val = tl.load(
+                corr_ptr + e_idx,
+                mask=(n[:, None] < N) & (offs_k[None, :] < K),
+                other=0,
+            ).to(tl.float32)
+            step_val = tl.load(step_ptr).to(tl.float32)
+            denom = tl.maximum(step_val * GROUP_SIZE, 1.0)
+            e_adj = e_val + (corr_val / denom) * CORR_STRENGTH
+            w = sign.to(tl.float32) * tl.exp2(e_adj)
+            w = tl.where((n[:, None] < N) & (offs_k[None, :] < K), w, 0.0)
+            acc += tl.dot(grad, w)
+        tl.store(
+            out_ptr + offs_m[:, None] * K + offs_k[None, :],
+            acc,
+            mask=(offs_m[:, None] < M) & (offs_k[None, :] < K),
+        )
+    @triton.jit
+    def _triton_ternary_grad_sign_kernel(
+        grad_ptr, x_ptr, sign_ptr,
+        M: tl.constexpr, N: tl.constexpr, K: tl.constexpr,
+        BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
+    ):
+        pid_n = tl.program_id(0)
+        pid_k = tl.program_id(1)
+        offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
+        offs_k = pid_k * BLOCK_K + tl.arange(0, BLOCK_K)
+        offs_m = tl.arange(0, BLOCK_M)
+        acc = tl.zeros((BLOCK_N, BLOCK_K), dtype=tl.float32)
+        for m0 in range(0, M, BLOCK_M):
+            m = m0 + offs_m
+            grad = tl.load(
+                grad_ptr + m[:, None] * N + offs_n[None, :],
+                mask=(m[:, None] < M) & (offs_n[None, :] < N),
+                other=0.0,
+            )
+            x = tl.load(
+                x_ptr + m[:, None] * K + offs_k[None, :],
+                mask=(m[:, None] < M) & (offs_k[None, :] < K),
+                other=0.0,
+            )
+            acc += tl.dot(tl.trans(grad), x, input_precision="ieee")
+        # Only the sign of the accumulated weight gradient grad^T @ x is kept (int8 in {-1, 0, +1}).
+        sign = tl.where(acc > 0.0, 1, tl.where(acc < 0.0, -1, 0))
+        tl.store(
+            sign_ptr + offs_n[:, None] * K + offs_k[None, :],
+            sign.to(tl.int8),
+            mask=(offs_n[:, None] < N) & (offs_k[None, :] < K),
+        )
+    @triton.jit
+    def _triton_update_e_kernel(
+        packed_ptr, grad_sign_ptr, e_ptr, e_accum_ptr,
+        N: tl.constexpr, K: tl.constexpr,
+        GROUP_SIZE: tl.constexpr, GPR: tl.constexpr,
+        E_ACCUM_THRESHOLD: tl.constexpr,
+        BLOCK_N: tl.constexpr, BLOCK_G: tl.constexpr, BLOCK_K: tl.constexpr,
+    ):
+        pid_n = tl.program_id(0)
+        pid_g = tl.program_id(1)
+        offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
+        offs_g = pid_g * BLOCK_G + tl.arange(0, BLOCK_G)
+        offs_r = tl.arange(0, BLOCK_K)
+        k = offs_g[:, None] * GROUP_SIZE + offs_r[None, :]
+        valid_group = offs_g < GPR
+        lin = offs_n[:, None, None] * K + k[None, :, :]
+        pack_idx = lin // 5
+        trit_pos = lin - pack_idx * 5
+        packed = tl.load(
+            packed_ptr + pack_idx,
+            mask=(offs_n[:, None, None] < N) & valid_group[None, :, None] & (offs_r[None, None, :] < GROUP_SIZE) & (k[None, :, :] < K),
+            other=0,
+        ).to(tl.int32)
+        divisor = tl.where(
+            trit_pos == 0, 1,
+            tl.where(trit_pos == 1, 3,
+            tl.where(trit_pos == 2, 9,
+            tl.where(trit_pos == 3, 27, 81))),
+        )
+        trit = (packed // divisor) % 3
+        ternary = trit.to(tl.int32) - 1
+        grad_sign = tl.load(
+            grad_sign_ptr + offs_n[:, None, None] * K + k[None, :, :],
+            mask=(offs_n[:, None, None] < N) & valid_group[None, :, None] & (offs_r[None, None, :] < GROUP_SIZE) & (k[None, :, :] < K),
+            other=0,
+        ).to(tl.int32)
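+        # Sign-only proxy for dL/dE: a positive group score means the gradient agrees in sign
+        # with the stored trits, so the group votes to lower its exponent (and vice versa).
+        contrib = grad_sign * ternary
+        score = tl.sum(contrib, axis=2)
+        delta = tl.where(score > 0, -1, tl.where(score < 0, 1, 0))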
+        e_idx = offs_n[:, None] * GPR + offs_g[None, :]
+        old_accum = tl.load(
+            e_accum_ptr + e_idx,
+            mask=(offs_n[:, None] < N) & valid_group[None, :],
+            other=0,
+        ).to(tl.int32)
+        # Saturate the vote counter to int8; once it reaches +/-E_ACCUM_THRESHOLD, step E by
+        # +/-1 and carry the remainder forward.
+        new_accum = tl.minimum(127, tl.maximum(-128, old_accum + delta))
+        step_up = new_accum >= E_ACCUM_THRESHOLD
+        step_down = new_accum <= -E_ACCUM_THRESHOLD
+        e_step = tl.where(step_up, 1, tl.where(step_down, -1, 0))
+        stored_accum = new_accum - e_step * E_ACCUM_THRESHOLD
+        old_e = tl.load(
+            e_ptr + e_idx,
+            mask=(offs_n[:, None] < N) & valid_group[None, :],
+            other=0,
+        ).to(tl.int32)
+        new_e = tl.minimum(127, tl.maximum(-128, old_e + e_step))
+        tl.store(
+            e_ptr + e_idx,
+            new_e.to(tl.int8),
+            mask=(offs_n[:, None] < N) & valid_group[None, :],
+        )
+        tl.store(
+            e_accum_ptr + e_idx,
+            stored_accum.to(tl.int8),
+            mask=(offs_n[:, None] < N) & valid_group[None, :],
+        )
+    @triton.jit
+    def _triton_ternary_step_kernel(
+        packed_ptr, grad_sign_ptr, accum_ptr, per_group_threshold_ptr,
+        TOTAL: tl.constexpr, ACCUM_THRESHOLD: tl.constexpr,
+        T_ACCUM_STEP: tl.constexpr,
+        K: tl.constexpr, GPR: tl.constexpr, GROUP_SIZE: tl.constexpr,
+        HAS_PER_GROUP_THRESHOLD: tl.constexpr,
+        BLOCK_T: tl.constexpr,
+    ):
+        pack_idx = tl.program_id(0)
+        offs_t = tl.arange(0, BLOCK_T)
+        valid_trit = offs_t < 5
+        lin = pack_idx * 5 + offs_t
+        valid = valid_trit & (lin < TOTAL)
+        old_packed = tl.load(packed_ptr + pack_idx).to(tl.int32)
+        divisor = tl.where(
+            offs_t == 0, 1,
+            tl.where(offs_t == 1, 3,
+            tl.where(offs_t == 2, 9,
+            tl.where(offs_t == 3, 27, 81))),
+        )
+        old_code = (old_packed // divisor) % 3
+        old_sign = old_code.to(tl.int32) - 1
+        grad_sign = tl.load(grad_sign_ptr + lin, mask=valid, other=0).to(tl.int32)
+        old_accum = tl.load(accum_ptr + lin, mask=valid, other=0).to(tl.int32)
+        # Descent direction: accumulate -sign(grad) * T_ACCUM_STEP, saturating at the int8 range.
+        new_accum = tl.minimum(127, tl.maximum(-128, old_accum - grad_sign * T_ACCUM_STEP))
+        if HAS_PER_GROUP_THRESHOLD:
+            n = lin // K
+            k = lin - n * K
+            g_idx = n * GPR + k // GROUP_SIZE
+            threshold = tl.load(per_group_threshold_ptr + g_idx, mask=valid, other=ACCUM_THRESHOLD).to(tl.int32)
+        else:
+            threshold = ACCUM_THRESHOLD
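+        # A trit is set to +1 / -1 when its accumulator strictly exceeds +/-threshold;
+        # the accumulator is reset to 0 whenever that happens.
+        flip_up = new_accum > threshold
+        flip_down = new_accum < -threshold
+        did_flip = valid & (flip_up | flip_down)
+        new_sign = tl.where(flip_up, 1, tl.where(flip_down, -1, old_sign))
+        stored_accum = tl.where(did_flip, 0, new_accum)
+        tl.store(accum_ptr + lin, stored_accum.to(tl.int8), mask=valid)
+        # Repack the five trits of this byte; lanes outside the valid range contribute code 0.
+        new_code = tl.where(valid, new_sign + 1, 0)
+        packed_val = tl.sum(new_code * divisor, axis=0)
+        tl.store(packed_ptr + pack_idx, packed_val.to(tl.uint8))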
+    @triton.jit
+    def _triton_update_e_direct_kernel(
+        packed_ptr, grad_ptr, x_ptr, e_ptr, e_accum_ptr,
+        M: tl.constexpr, N: tl.constexpr, K: tl.constexpr,
+        GROUP_SIZE: tl.constexpr, GPR: tl.constexpr,
+        E_ACCUM_THRESHOLD: tl.constexpr,
+        BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
+    ):
+        pid_n = tl.program_id(0)
+        pid_g = tl.program_id(1)
+        offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
+        offs_r = tl.arange(0, BLOCK_K)
+        k = pid_g * GROUP_SIZE + offs_r
+        offs_m = tl.arange(0, BLOCK_M)
+        acc = tl.zeros((BLOCK_N, BLOCK_K), dtype=tl.float32)
+        for m0 in range(0, M, BLOCK_M):
+            m = m0 + offs_m
+            grad = tl.load(
+                grad_ptr + m[:, None] * N + offs_n[None, :],
+                mask=(m[:, None] < M) & (offs_n[None, :] < N),
+                other=0.0,
+            )
+            x = tl.load(
+                x_ptr + m[:, None] * K + k[None, :],
+                mask=(m[:, None] < M) & (offs_r[None, :] < GROUP_SIZE) & (k[None, :] < K),
+                other=0.0,
+            )
+            acc += tl.dot(tl.trans(grad), x, input_precision="ieee")
+        grad_sign = tl.where(acc > 0.0, 1, tl.where(acc < 0.0, -1, 0)).to(tl.int32)
+        lin = offs_n[:, None] * K + k[None, :]
+        pack_idx = lin // 5
+        trit_pos = lin - pack_idx * 5
+        packed = tl.load(
+            packed_ptr + pack_idx,
+            mask=(offs_n[:, None] < N) & (offs_r[None, :] < GROUP_SIZE) & (k[None, :] < K),
+            other=0,
+        ).to(tl.int32)
+        divisor = tl.where(
+            trit_pos == 0, 1,
+            tl.where(trit_pos == 1, 3,
+            tl.where(trit_pos == 2, 9,
+            tl.where(trit_pos == 3, 27, 81))),
+        )
+        trit = (packed // divisor) % 3
+        ternary = trit.to(tl.int32) - 1
+        contrib = tl.where(
+            (offs_n[:, None] < N) & (offs_r[None, :] < GROUP_SIZE) & (k[None, :] < K),
+            grad_sign * ternary,
+            0,
+        )
+        score = tl.sum(contrib, axis=1)
+        delta = tl.where(score > 0, -1, tl.where(score < 0, 1, 0))
+        e_idx = offs_n * GPR + pid_g
+        old_accum = tl.load(e_accum_ptr + e_idx, mask=offs_n < N, other=0).to(tl.int32)
+        new_accum = tl.minimum(127, tl.maximum(-128, old_accum + delta))
+        step_up = new_accum >= E_ACCUM_THRESHOLD
+        step_down = new_accum <= -E_ACCUM_THRESHOLD
+        e_step = tl.where(step_up, 1, tl.where(step_down, -1, 0))
+        stored_accum = new_accum - e_step * E_ACCUM_THRESHOLD
+        old_e = tl.load(e_ptr + e_idx, mask=offs_n < N, other=0).to(tl.int32)
+        new_e = tl.minimum(127, tl.maximum(-128, old_e + e_step))
+        tl.store(e_ptr + e_idx, new_e.to(tl.int8), mask=offs_n < N)
+        tl.store(e_accum_ptr + e_idx, stored_accum.to(tl.int8), mask=offs_n < N)
+    @triton.jit
+    def _triton_ternary_step_direct_kernel(
+        packed_ptr, grad_ptr, x_ptr, accum_ptr, per_group_threshold_ptr,
+        M: tl.constexpr, N: tl.constexpr, K: tl.constexpr,
+        TOTAL: tl.constexpr, ACCUM_THRESHOLD: tl.constexpr,
+        T_ACCUM_STEP: tl.constexpr,
+        GPR: tl.constexpr, GROUP_SIZE: tl.constexpr,
+        HAS_PER_GROUP_THRESHOLD: tl.constexpr,
+        BLOCK_M: tl.constexpr, BLOCK_T: tl.constexpr,
+    ):
+        pack_idx = tl.program_id(0)
+        offs_t = tl.arange(0, BLOCK_T)
+        lin = pack_idx * 5 + offs_t
+        valid_trit = offs_t < 5
+        valid = valid_trit & (lin < TOTAL)
+        n = lin // K
+        k = lin - n * K
+        offs_m = tl.arange(0, BLOCK_M)
+        acc = tl.zeros((BLOCK_T,), dtype=tl.float32)
+        for m0 in range(0, M, BLOCK_M):
+            m = m0 + offs_m
+            grad = tl.load(
+                grad_ptr + m[:, None] * N + n[None, :],
+                mask=(m[:, None] < M) & valid[None, :],
+                other=0.0,
+            )
+            x = tl.load(
+                x_ptr + m[:, None] * K + k[None, :],
+                mask=(m[:, None] < M) & valid[None, :],
+                other=0.0,
+            )
+            acc += tl.sum(grad * x, axis=0)
+        grad_sign = tl.where(acc > 0.0, 1, tl.where(acc < 0.0, -1, 0)).to(tl.int32)
+        old_packed = tl.load(packed_ptr + pack_idx).to(tl.int32)
+        divisor = tl.where(
+            offs_t == 0, 1,
+            tl.where(offs_t == 1, 3,
+            tl.where(offs_t == 2, 9,
+            tl.where(offs_t == 3, 27, 81))),
+        )
+        old_code = (old_packed // divisor) % 3
+        old_sign = old_code.to(tl.int32) - 1
+        old_accum = tl.load(accum_ptr + lin, mask=valid, other=0).to(tl.int32)
+        new_accum = tl.minimum(127, tl.maximum(-128, old_accum - grad_sign * T_ACCUM_STEP))
+        if HAS_PER_GROUP_THRESHOLD:
+            g_idx = n * GPR + k // GROUP_SIZE
+            threshold = tl.load(per_group_threshold_ptr + g_idx, mask=valid, other=ACCUM_THRESHOLD).to(tl.int32)
+        else:
+            threshold = ACCUM_THRESHOLD
+        flip_up = new_accum > threshold
+        flip_down = new_accum < -threshold
+        did_flip = valid & (flip_up | flip_down)
+        new_sign = tl.where(flip_up, 1, tl.where(flip_down, -1, old_sign))
+        stored_accum = tl.where(did_flip, 0, new_accum)
+        tl.store(accum_ptr + lin, stored_accum.to(tl.int8), mask=valid)
+        new_code = tl.where(valid, new_sign + 1, 0)
+        packed_val = tl.sum(new_code * divisor, axis=0)
+        tl.store(packed_ptr + pack_idx, packed_val.to(tl.uint8))
+    @triton.jit
+    def _triton_accumulate_t_direct_kernel(
+        grad_ptr, x_ptr, accum_ptr,
+        M: tl.constexpr, N: tl.constexpr, K: tl.constexpr,
+        TOTAL: tl.constexpr, T_ACCUM_STEP: tl.constexpr,
+        BLOCK_M: tl.constexpr, BLOCK_T: tl.constexpr,
+    ):
+        pack_idx = tl.program_id(0)
+        offs_t = tl.arange(0, BLOCK_T)
+        lin = pack_idx * 5 + offs_t
+        valid_trit = offs_t < 5
+        valid = valid_trit & (lin < TOTAL)
+        n = lin // K
+        k = lin - n * K
+        offs_m = tl.arange(0, BLOCK_M)
+        acc = tl.zeros((BLOCK_T,), dtype=tl.float32)
+        for m0 in range(0, M, BLOCK_M):
+            m = m0 + offs_m
+            grad = tl.load(
+                grad_ptr + m[:, None] * N + n[None, :],
+                mask=(m[:, None] < M) & valid[None, :],
+                other=0.0,
+            )
+            x = tl.load(
+                x_ptr + m[:, None] * K + k[None, :],
+                mask=(m[:, None] < M) & valid[None, :],
+                other=0.0,
+            )
+            acc += tl.sum(grad * x, axis=0)
+        grad_sign = tl.where(acc > 0.0, 1, tl.where(acc < 0.0, -1, 0)).to(tl.int32)
+        old_accum = tl.load(accum_ptr + lin, mask=valid, other=0).to(tl.int32)
+        new_accum = tl.minimum(127, tl.maximum(-128, old_accum - grad_sign * T_ACCUM_STEP))
+        tl.store(accum_ptr + lin, new_accum.to(tl.int8), mask=valid)
+    @triton.jit
+    def _triton_accumulate_e_direct_kernel(
+        packed_ptr, grad_ptr, x_ptr, e_accum_ptr,
+        M: tl.constexpr, N: tl.constexpr, K: tl.constexpr,
+        GROUP_SIZE: tl.constexpr, GPR: tl.constexpr,
+        E_ACCUM_STEP: tl.constexpr,
+        BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
+    ):
+        pid_n = tl.program_id(0)
+        pid_g = tl.program_id(1)
+        offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
+        offs_r = tl.arange(0, BLOCK_K)
+        k = pid_g * GROUP_SIZE + offs_r
+        offs_m = tl.arange(0, BLOCK_M)
+        acc = tl.zeros((BLOCK_N, BLOCK_K), dtype=tl.float32)
+        for m0 in range(0, M, BLOCK_M):
+            m = m0 + offs_m
+            grad = tl.load(
+                grad_ptr + m[:, None] * N + offs_n[None, :],
+                mask=(m[:, None] < M) & (offs_n[None, :] < N),
+                other=0.0,
+            )
+            x = tl.load(
+                x_ptr + m[:, None] * K + k[None, :],
+                mask=(m[:, None] < M) & (offs_r[None, :] < GROUP_SIZE) & (k[None, :] < K),
+                other=0.0,
+            )
+            acc += tl.dot(tl.trans(grad), x, input_precision="ieee")
+        grad_sign = tl.where(acc > 0.0, 1, tl.where(acc < 0.0, -1, 0)).to(tl.int32)
+        lin = offs_n[:, None] * K + k[None, :]
+        pack_idx = lin // 5
+        trit_pos = lin - pack_idx * 5
+        packed = tl.load(
+            packed_ptr + pack_idx,
+            mask=(offs_n[:, None] < N) & (offs_r[None, :] < GROUP_SIZE) & (k[None, :] < K),
+            other=0,
+        ).to(tl.int32)
+        divisor = tl.where(
+            trit_pos == 0, 1,
+            tl.where(trit_pos == 1, 3,
+            tl.where(trit_pos == 2, 9,
+            tl.where(trit_pos == 3, 27, 81))),
+        )
+        trit = (packed // divisor) % 3
+        ternary = trit.to(tl.int32) - 1
+        contrib = tl.where(
+            (offs_n[:, None] < N) & (offs_r[None, :] < GROUP_SIZE) & (k[None, :] < K),
+            grad_sign * ternary,
+            0,
+        )
+        score = tl.sum(contrib, axis=1)
+        delta = tl.where(score > 0, -1, tl.where(score < 0, 1, 0))
+        e_idx = offs_n * GPR + pid_g
+        old_accum = tl.load(e_accum_ptr + e_idx, mask=offs_n < N, other=0).to(tl.int32)
+        new_accum = tl.minimum(127, tl.maximum(-128, old_accum + delta * E_ACCUM_STEP))
+        tl.store(e_accum_ptr + e_idx, new_accum.to(tl.int8), mask=offs_n < N)
+    @triton.jit
+    def _triton_accumulate_corr_direct_kernel(
+        packed_ptr, grad_ptr, x_ptr, corr_ptr,
+        M: tl.constexpr, N: tl.constexpr, K: tl.constexpr,
+        GROUP_SIZE: tl.constexpr, GPR: tl.constexpr,
+        CORR_STEP: tl.constexpr,
+        BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
+    ):
+        pid_n = tl.program_id(0)
+        pid_g = tl.program_id(1)
+        offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
+        offs_r = tl.arange(0, BLOCK_K)
+        k = pid_g * GROUP_SIZE + offs_r
+        offs_m = tl.arange(0, BLOCK_M)
+        acc = tl.zeros((BLOCK_N, BLOCK_K), dtype=tl.float32)
+        for m0 in range(0, M, BLOCK_M):
+            m = m0 + offs_m
+            grad = tl.load(
+                grad_ptr + m[:, None] * N + offs_n[None, :],
+                mask=(m[:, None] < M) & (offs_n[None, :] < N),
+                other=0.0,
+            )
+            x = tl.load(
+                x_ptr + m[:, None] * K + k[None, :],
+                mask=(m[:, None] < M) & (offs_r[None, :] < GROUP_SIZE) & (k[None, :] < K),
+                other=0.0,
+            )
+            acc += tl.dot(tl.trans(grad), x, input_precision="ieee")
+        grad_sign = tl.where(acc > 0.0, 1, tl.where(acc < 0.0, -1, 0)).to(tl.int32)
+        lin = offs_n[:, None] * K + k[None, :]
+        pack_idx = lin // 5
+        trit_pos = lin - pack_idx * 5
+        packed = tl.load(
+            packed_ptr + pack_idx,
+            mask=(offs_n[:, None] < N) & (offs_r[None, :] < GROUP_SIZE) & (k[None, :] < K),
+            other=0,
+        ).to(tl.int32)
+        divisor = tl.where(
+            trit_pos == 0, 1,
+            tl.where(trit_pos == 1, 3,
+            tl.where(trit_pos == 2, 9,
+            tl.where(trit_pos == 3, 27, 81))),
+        )
+        trit = (packed // divisor) % 3
+        ternary = trit.to(tl.int32) - 1
+        contrib = tl.where(
+            (offs_n[:, None] < N) & (offs_r[None, :] < GROUP_SIZE) & (k[None, :] < K),
+            grad_sign * ternary,
+            0,
+        )
+        score = tl.sum(contrib, axis=1)
+        corr_idx = offs_n * GPR + pid_g
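+        # Unlike the int8 accumulators above, corr is int64 and is not clamped: it receives
+        # the raw group score (not just its sign) times CORR_STEP.
+        old_corr = tl.load(corr_ptr + corr_idx, mask=offs_n < N, other=0).to(tl.int64)
+        new_corr = old_corr - score.to(tl.int64) * CORR_STEP
+        tl.store(corr_ptr + corr_idx, new_corr, mask=offs_n < N)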
+    @triton.jit
+    def _triton_apply_accumulated_flips_kernel(
+        packed_ptr, accum_ptr, per_group_threshold_ptr,
+        TOTAL: tl.constexpr, ACCUM_THRESHOLD: tl.constexpr,
+        K: tl.constexpr, GPR: tl.constexpr, GROUP_SIZE: tl.constexpr,
+        HAS_PER_GROUP_THRESHOLD: tl.constexpr,
+        BLOCK_T: tl.constexpr,
+    ):
+        pack_idx = tl.program_id(0)
+        offs_t = tl.arange(0, BLOCK_T)
+        valid_trit = offs_t < 5
+        lin = pack_idx * 5 + offs_t
+        valid = valid_trit & (lin < TOTAL)
+        old_packed = tl.load(packed_ptr + pack_idx).to(tl.int32)
+        divisor = tl.where(
+            offs_t == 0, 1,
+            tl.where(offs_t == 1, 3,
+            tl.where(offs_t == 2, 9,
+            tl.where(offs_t == 3, 27, 81))),
+        )
+        old_code = (old_packed // divisor) % 3
+        old_sign = old_code.to(tl.int32) - 1
+        old_accum = tl.load(accum_ptr + lin, mask=valid, other=0).to(tl.int32)
+        if HAS_PER_GROUP_THRESHOLD:
+            n = lin // K
+            k = lin - n * K
+            g_idx = n * GPR + k // GROUP_SIZE
+            threshold = tl.load(per_group_threshold_ptr + g_idx, mask=valid, other=ACCUM_THRESHOLD).to(tl.int32)
+        else:
+            threshold = ACCUM_THRESHOLD
+        flip_up = old_accum > threshold
+        flip_down = old_accum < -threshold
+        did_flip = valid & (flip_up | flip_down)
+        new_sign = tl.where(flip_up, 1, tl.where(flip_down, -1, old_sign))
+        stored_accum = tl.where(did_flip, 0, old_accum)
+        tl.store(accum_ptr + lin, stored_accum.to(tl.int8), mask=valid)
+        new_code = tl.where(valid, new_sign + 1, 0)
+        packed_val = tl.sum(new_code * divisor, axis=0)
+        tl.store(packed_ptr + pack_idx, packed_val.to(tl.uint8))
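Reviewer note: the kernels above all decode weights with the same arithmetic, using flat index `lin = n * K + k`, byte `lin // 5`, and base-3 digit `lin - pack_idx * 5`. The pure-Python sketch below mirrors that layout for reference. `pack_trits` and `unpack_trit` are hypothetical helper names that do not appear in the commit, and the packing direction is inferred from the decode side only.

```python
# Hypothetical reference helpers; not part of the commit.
def pack_trits(signs):
    """Pack signs in {-1, 0, +1} into bytes, five base-3 digits per byte."""
    out = bytearray()
    for start in range(0, len(signs), 5):
        byte = 0
        for pos, s in enumerate(signs[start:start + 5]):
            byte += (s + 1) * 3 ** pos  # code 0/1/2 stored at digit `pos`
        out.append(byte)  # largest possible value is 2 * 121 = 242, so it fits in a byte
    return bytes(out)


def unpack_trit(packed, lin):
    """Return the sign at flat index `lin`, using the same arithmetic as the kernels."""
    pack_idx = lin // 5
    trit_pos = lin - pack_idx * 5
    return (packed[pack_idx] // 3 ** trit_pos) % 3 - 1


signs = [-1, 0, 1, 1, -1, 0, 1]
packed = pack_trits(signs)
decoded = [unpack_trit(packed, i) for i in range(len(signs))]
```

Unused digits in the last byte hold code 0, which would decode to -1 if read; the kernels avoid this by masking with `lin < TOTAL` or `k < K`.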
+def _triton_ternary_forward(x_2d, packed, e, corr_accum, step_counter, n_out, k_in, group_size):
+    """Launch the Triton forward kernel: out[M, n_out] = x_2d[M, k_in] @ W^T, float32 output."""
+    block_m, block_n, block_k = 16, 16, 32
+    out = torch.empty((x_2d.shape[0], n_out), device=x_2d.device, dtype=torch.float32)
+    grid = (triton.cdiv(x_2d.shape[0], block_m), triton.cdiv(n_out, block_n))
+    _triton_ternary_fwd_kernel[grid](
+        x_2d, packed, e, corr_accum, step_counter, out,
+        x_2d.shape[0], n_out, k_in, ceil(k_in / group_size), group_size,
+        _bigint_corr_strength(),
+        BLOCK_M=block_m, BLOCK_N=block_n, BLOCK_K=block_k,
+    )
+    return out
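Reviewer note: the forward kernel launched above rebuilds each weight on the fly from a sign, a per-group exponent, and the running correction. A scalar sketch of that formula is given below. `effective_weight` is a hypothetical name, and the default `corr_strength=4.0` is borrowed from the `_get_kernel` default; the value returned by `_bigint_corr_strength()` is not visible in this hunk and may differ.

```python
# Hypothetical scalar model of the per-weight dequantization; not part of the commit.
def effective_weight(sign, e, corr, step, group_size, corr_strength=4.0):
    """w = sign * 2 ** (e + corr / max(step * group_size, 1) * corr_strength)."""
    denom = max(step * group_size, 1.0)          # avoids division by zero before the first step
    e_adj = e + (corr / denom) * corr_strength   # correction shifts the exponent fractionally
    return sign * 2.0 ** e_adj


w0 = effective_weight(sign=1, e=-3, corr=0, step=0, group_size=32)
w1 = effective_weight(sign=-1, e=-3, corr=64, step=2, group_size=32)
```

Note that the Triton kernels in this hunk apply `exp2` to the adjusted exponent without clamping, while the TileLang path clamps it to [-14, 15] first.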
+def _triton_ternary_grad_x(grad_2d, packed, e, corr_accum, step_counter, m_rows, n_out, k_in, group_size):
+    block_m, block_n, block_k = 16, 16, 32
+    out = torch.empty((m_rows, k_in), device=grad_2d.device, dtype=torch.float32)
+    grid = (triton.cdiv(m_rows, block_m), triton.cdiv(k_in, block_k))
+    _triton_ternary_grad_x_kernel[grid](
+        grad_2d, packed, e, corr_accum, step_counter, out,
+        m_rows, n_out, k_in, ceil(k_in / group_size), group_size,
+        _bigint_corr_strength(),
+        BLOCK_M=block_m, BLOCK_N=block_n, BLOCK_K=block_k,
+    )
+    return out
+def _triton_ternary_grad_sign(grad_2d, x_2d, n_out, k_in):
+    block_m, block_n, block_k = 32, 16, 32
+    out = torch.empty((n_out, k_in), device=grad_2d.device, dtype=torch.int8)
+    grid = (triton.cdiv(n_out, block_n), triton.cdiv(k_in, block_k))
+    _triton_ternary_grad_sign_kernel[grid](
+        grad_2d, x_2d, out,
+        x_2d.shape[0], n_out, k_in,
+        BLOCK_M=block_m, BLOCK_N=block_n, BLOCK_K=block_k,
+    )
+    return out
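+def _triton_update_e(packed, grad_sign, e, e_accum, n_out, k_in, group_size, e_accum_threshold=4):
+    """Update the per-group exponents `e` in place from a precomputed sign gradient."""
+    block_n, block_g = 8, 4
+    gpr = ceil(k_in / group_size)  # groups per row
+    # Next power of two >= group_size, since tl.arange needs a power-of-two extent.
+    block_k = 1 << (group_size - 1).bit_length()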
+    grid = (triton.cdiv(n_out, block_n), triton.cdiv(gpr, block_g))
+    _triton_update_e_kernel[grid](
+        packed, grad_sign, e, e_accum,
+        n_out, k_in, group_size, gpr, int(e_accum_threshold),
+        BLOCK_N=block_n, BLOCK_G=block_g, BLOCK_K=block_k,
+    )
+def _triton_update_e_direct(packed, grad_2d, x_2d, e, e_accum, n_out, k_in, group_size, e_accum_threshold=4):
+    block_m, block_n = 32, 8
+    block_k = 1 << (group_size - 1).bit_length()
+    gpr = ceil(k_in / group_size)
+    grid = (triton.cdiv(n_out, block_n), gpr)
+    _triton_update_e_direct_kernel[grid](
+        packed, grad_2d, x_2d, e, e_accum,
+        x_2d.shape[0], n_out, k_in, group_size, gpr, int(e_accum_threshold),
+        BLOCK_M=block_m, BLOCK_N=block_n, BLOCK_K=block_k,
+    )
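Reviewer note: the two exponent-update launchers above share one per-group rule: a sign vote, an int8 saturating counter, and a unit step of `E` once the counter reaches the threshold. The sketch below restates it for a single group in plain Python. `update_group_exponent` is a hypothetical name; the default threshold of 4 matches the `e_accum_threshold=4` default in the wrappers.

```python
# Hypothetical single-group model of the exponent update; not part of the commit.
def update_group_exponent(e, e_accum, grad_signs, trits, threshold=4):
    """Return (new_e, new_accum) after one vote from this group's sign gradients."""
    score = sum(g * t for g, t in zip(grad_signs, trits))
    delta = -1 if score > 0 else (1 if score < 0 else 0)   # vote against the gradient
    accum = max(-128, min(127, e_accum + delta))           # int8 saturation
    step = 1 if accum >= threshold else (-1 if accum <= -threshold else 0)
    new_e = max(-128, min(127, e + step))
    return new_e, accum - step * threshold                 # carry the remainder


e, acc = 0, 0
for _ in range(4):
    e, acc = update_group_exponent(e, acc, grad_signs=[1, 1, -1], trits=[1, 1, 1])
```

Four consecutive votes in the same direction are needed before the exponent moves, which damps noisy single-batch gradients.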
+def _triton_ternary_step(packed, grad_sign, accum, total, accum_threshold, t_accum_step=1,
+                         per_group_threshold=None, n_out=0, k_in=0, group_size=0):
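+    # One program per packed byte; 8 is the smallest power of two covering its 5 trits.
+    block_t = 8
+    grid = (triton.cdiv(total, 5),)
+    has_pgt = per_group_threshold is not None
+    # Placeholder pointer; it is never read when HAS_PER_GROUP_THRESHOLD is False.
+    dummy = torch.empty(1, device=accum.device, dtype=torch.int8)
+    gpr = (k_in + group_size - 1) // group_size if has_pgt else 0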
+    _triton_ternary_step_kernel[grid](
+        packed, grad_sign, accum,
+        per_group_threshold if has_pgt else dummy,
+        total, accum_threshold, int(t_accum_step),
+        k_in if has_pgt else 0, gpr, group_size if has_pgt else 0,
+        has_pgt,
+        BLOCK_T=block_t,
+    )
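Reviewer note: the step kernel launched above applies an accumulate-then-flip rule to each trit. A scalar sketch is shown below. `step_trit` is a hypothetical name, and the threshold value in the example is arbitrary.

```python
# Hypothetical single-trit model of the accumulate-and-flip rule; not part of the commit.
def step_trit(sign, accum, grad_sign, threshold, t_accum_step=1):
    """Return (new_sign, new_accum) for one trit after one gradient-sign update."""
    accum = max(-128, min(127, accum - grad_sign * t_accum_step))  # int8 saturation
    if accum > threshold:       # strict comparison, as in the kernel
        return 1, 0
    if accum < -threshold:
        return -1, 0
    return sign, accum


sign, accum = 0, 0
history = []
for _ in range(3):
    sign, accum = step_trit(sign, accum, grad_sign=-1, threshold=2)
    history.append((sign, accum))
```

The trit comparison is strict (`>`), whereas the exponent update uses `>=`, so with equal thresholds a trit needs one more vote than an exponent before it changes.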
+def _triton_ternary_step_direct(packed, grad_2d, x_2d, accum, n_out, k_in, total, accum_threshold, t_accum_step=1,
+                                per_group_threshold=None, group_size=0):
+    block_m, block_t = 32, 8
+    grid = (triton.cdiv(total, 5),)
+    has_pgt = per_group_threshold is not None
+    dummy = torch.empty(1, device=accum.device, dtype=torch.int8)
+    gpr = (k_in + group_size - 1) // group_size if has_pgt else 0
+    _triton_ternary_step_direct_kernel[grid](
+        packed, grad_2d, x_2d, accum,
+        per_group_threshold if has_pgt else dummy,
+        x_2d.shape[0], n_out, k_in,
+        total, accum_threshold, int(t_accum_step),
+        gpr, group_size if has_pgt else 0,
+        has_pgt,
+        BLOCK_M=block_m, BLOCK_T=block_t,
+    )
+def _triton_accumulate_direct(packed, grad_2d, x_2d, t_accum, e_accum,
+                              n_out, k_in, group_size,
+                              t_accum_step=1, e_accum_step=1,
+                              update_scales=True):
+    block_m, block_t = 32, 8
+    total = n_out * k_in
+    grid = (triton.cdiv(total, 5),)
+    _triton_accumulate_t_direct_kernel[grid](
+        grad_2d, x_2d, t_accum,
+        grad_2d.shape[0], n_out, k_in, total, int(t_accum_step),
+        BLOCK_M=block_m, BLOCK_T=block_t,
+    )
+    if update_scales and e_accum is not None:
+        block_n = 8
+        block_k = 1 << (group_size - 1).bit_length()
+        gpr = ceil(k_in / group_size)
+        grid_e = (triton.cdiv(n_out, block_n), gpr)
+        _triton_accumulate_e_direct_kernel[grid_e](
+            packed, grad_2d, x_2d, e_accum,
+            grad_2d.shape[0], n_out, k_in, group_size, gpr, int(e_accum_step),
+            BLOCK_M=block_m, BLOCK_N=block_n, BLOCK_K=block_k,
+        )
+def _triton_accumulate_corr_direct(packed, grad_2d, x_2d, corr_accum,
+                                   n_out, k_in, group_size, corr_step=1):
+    block_m, block_n = 32, 8
+    block_k = 1 << (group_size - 1).bit_length()
+    gpr = ceil(k_in / group_size)
+    grid = (triton.cdiv(n_out, block_n), gpr)
+    _triton_accumulate_corr_direct_kernel[grid](
+        packed, grad_2d, x_2d, corr_accum,
+        grad_2d.shape[0], n_out, k_in, group_size, gpr, int(corr_step),
+        BLOCK_M=block_m, BLOCK_N=block_n, BLOCK_K=block_k,
+    )
+def _triton_apply_accumulated_flips(packed, accum, total, accum_threshold,
+                                    per_group_threshold=None,
+                                    k_in=0, group_size=0):
+    block_t = 8
+    grid = (triton.cdiv(total, 5),)
+    has_pgt = per_group_threshold is not None
+    dummy = torch.empty(1, device=accum.device, dtype=torch.int8)
+    gpr = (k_in + group_size - 1) // group_size if has_pgt else 0
+    _triton_apply_accumulated_flips_kernel[grid](
+        packed, accum,
+        per_group_threshold if has_pgt else dummy,
+        total, accum_threshold,
+        k_in if has_pgt else 0, gpr, group_size if has_pgt else 0,
+        has_pgt,
+        BLOCK_T=block_t,
+    )
+@triton.jit
+def _triton_ternary_embed_fwd_kernel(
+    idx_ptr, packed_ptr, e_ptr, out_ptr,
+    NUM_IDX: tl.constexpr, DIM: tl.constexpr,
+    VOCAB: tl.constexpr, GPR: tl.constexpr, GROUP_SIZE: tl.constexpr,
+    BLOCK_B: tl.constexpr, BLOCK_D: tl.constexpr,
+):
+    pid = tl.program_id(0)
+    offs_b = pid * BLOCK_B + tl.arange(0, BLOCK_B)
+    offs_d = tl.arange(0, BLOCK_D)
+    idx = tl.load(idx_ptr + offs_b, mask=offs_b < NUM_IDX, other=0).to(tl.int32)
+    lin = idx[:, None] * DIM + offs_d[None, :]
+    pack_idx = lin // 5
+    trit_pos = lin - pack_idx * 5
+    packed = tl.load(packed_ptr + pack_idx, mask=(offs_b[:, None] < NUM_IDX) & (offs_d[None, :] < DIM), other=0).to(tl.int32)
+    divisor = tl.where(
+        trit_pos == 0, 1,
+        tl.where(trit_pos == 1, 3,
+        tl.where(trit_pos == 2, 9,
+        tl.where(trit_pos == 3, 27, 81))),
+    )
+    trit = (packed // divisor) % 3
+    sign = trit.to(tl.int32) - 1
+    e_idx = idx[:, None] * GPR + offs_d[None, :] // GROUP_SIZE
+    e_val = tl.load(e_ptr + e_idx, mask=(offs_b[:, None] < NUM_IDX) & (offs_d[None, :] < DIM), other=0).to(tl.float32)
+    w = sign.to(tl.float32) * tl.exp2(e_val)
+    w = tl.where((offs_b[:, None] < NUM_IDX) & (offs_d[None, :] < DIM), w, 0.0)
+    tl.store(
+        out_ptr + offs_b[:, None] * DIM + offs_d[None, :],
+        w,
+        mask=(offs_b[:, None] < NUM_IDX) & (offs_d[None, :] < DIM),
+    )
+@triton.jit
+def _triton_ternary_embed_bwd_accum_kernel(
+    idx_ptr, grad_ptr, accum_ptr,
+    NUM_IDX: tl.constexpr, DIM: tl.constexpr,
+    BLOCK_B: tl.constexpr, BLOCK_D: tl.constexpr,
+):
+    pid = tl.program_id(0)
+    offs_b = pid * BLOCK_B + tl.arange(0, BLOCK_B)
+    offs_d = tl.arange(0, BLOCK_D)
+    valid = (offs_b[:, None] < NUM_IDX) & (offs_d[None, :] < DIM)
+    idx = tl.load(idx_ptr + offs_b, mask=offs_b < NUM_IDX, other=0).to(tl.int32)
+    g = tl.load(grad_ptr + offs_b[:, None] * DIM + offs_d[None, :], mask=valid, other=0.0)
+    dst = idx[:, None] * DIM + offs_d[None, :]
+    tl.atomic_add(accum_ptr + dst, g, mask=valid)
+@triton.jit
+def _triton_ternary_embed_bwd_sign_kernel(
+    accum_ptr, sign_ptr,
+    VOCAB: tl.constexpr, DIM: tl.constexpr,
+    BLOCK_V: tl.constexpr, BLOCK_D: tl.constexpr,
+):
+    pid_v = tl.program_id(0)
+    offs_v = pid_v * BLOCK_V + tl.arange(0, BLOCK_V)
+    offs_d = tl.arange(0, BLOCK_D)
+    valid = (offs_v[:, None] < VOCAB) & (offs_d[None, :] < DIM)
+    acc = tl.load(accum_ptr + offs_v[:, None] * DIM + offs_d[None, :], mask=valid, other=0.0)
+    sign_val = tl.where(acc > 0.0, 1, tl.where(acc < 0.0, -1, 0)).to(tl.int8)
+    tl.store(sign_ptr + offs_v[:, None] * DIM + offs_d[None, :], sign_val, mask=valid)
+def _triton_ternary_embed_grad_sign(indices, grad_output, vocab, dim):
+    flat_idx = indices.reshape(-1).contiguous().to(torch.int32)
+    grad_2d = grad_output.reshape(-1, dim).contiguous()
+    num_idx = flat_idx.shape[0]
+    accum = torch.zeros(vocab, dim, device=grad_output.device, dtype=torch.float32)
+    block_b = 64
+    grid = (triton.cdiv(num_idx, block_b),)
+    _triton_ternary_embed_bwd_accum_kernel[grid](
+        flat_idx, grad_2d, accum,
+        num_idx, dim,
+        BLOCK_B=block_b, BLOCK_D=triton.next_power_of_2(dim),
+    )
+    sign_out = torch.empty(vocab, dim, device=grad_output.device, dtype=torch.int8)
+    block_v = 32
+    grid2 = (triton.cdiv(vocab, block_v),)
+    _triton_ternary_embed_bwd_sign_kernel[grid2](
+        accum, sign_out,
+        vocab, dim,
+        BLOCK_V=block_v, BLOCK_D=triton.next_power_of_2(dim),
+    )
+    return sign_out
+def _triton_ternary_embed(indices, packed, e, vocab, dim, group_size):
+    flat_idx = indices.reshape(-1).contiguous().to(torch.int32)
+    num_idx = flat_idx.shape[0]
+    out = torch.empty((num_idx, dim), device=indices.device, dtype=torch.float32)
+    block_b, block_d = 32, triton.next_power_of_2(dim)
+    gpr = ceil(dim / group_size)
+    grid = (triton.cdiv(num_idx, block_b),)
+    _triton_ternary_embed_fwd_kernel[grid](
+        flat_idx, packed, e, out,
+        num_idx, dim, vocab, gpr, group_size,
+        BLOCK_B=block_b, BLOCK_D=block_d,
+    )
+    return out.reshape(*indices.shape, dim)
+class _TritonTernaryEmbedFn(torch.autograd.Function):
+    @staticmethod
+    def forward(ctx, indices, _dummy, module):
+        shape = tuple(module._T_shape.tolist())
+        vocab, dim = shape
+        packed = module.T_packed.contiguous()
+        e = module.E.contiguous()
+        ctx.save_for_backward(indices, packed, e)
+        ctx.module = module
+        ctx.shape = shape
+        ctx.group_size = module.group_size
+        comp_name, _ = _COMPONENT_CONTEXT.get()
+        ctx.comp_name = comp_name
+        return _triton_ternary_embed(indices, packed, e, vocab, dim, module.group_size)
+    @staticmethod
+    def backward(ctx, grad_output):
+        indices, packed, e = ctx.saved_tensors
+        vocab, dim = ctx.shape
+        grad_2d = grad_output.reshape(-1, dim).contiguous()
+        comp_name = ctx.comp_name
+        has_corr = hasattr(ctx.module, "corr_accum") and hasattr(ctx.module, "_accumulate_corr_from_grad_sign")
+        if getattr(ctx.module, "_stream_backward_updates", True) and has_corr:
+            # BigInt streaming: accumulate correlation directly
+            grad_sign = _triton_ternary_embed_grad_sign(indices, grad_2d, vocab, dim)
+            ctx.module._accumulate_corr_from_grad_sign(grad_sign)
+            ctx.module._streamed_bigint_backward = True
+        elif comp_name is not None:
+            setattr(ctx.module, f"_hook_grad_T_sign_{comp_name}", _triton_ternary_embed_grad_sign(indices, grad_2d, vocab, dim))
+            T = unpack_ternary(packed, tuple(ctx.module._T_shape.tolist()), int(ctx.module._T_pad.item()))
+            setattr(ctx.module, f"_hook_T_{comp_name}", T.to(device=grad_2d.device))
+        else:
+            ctx.module._hook_grad_T_sign = _triton_ternary_embed_grad_sign(indices, grad_2d, vocab, dim)
+            T = unpack_ternary(packed, tuple(ctx.module._T_shape.tolist()), int(ctx.module._T_pad.item()))
+            ctx.module._hook_T = T.to(device=grad_2d.device)
+        return None, None, None
+class _TritonTernaryLinearFn(torch.autograd.Function):
+    @staticmethod
+    def forward(ctx, x, module):
+        shape = tuple(module._T_shape.tolist())
+        n_out, k_in = shape
+        x_2d = x.reshape(-1, k_in).contiguous()
+        packed = module.T_packed.contiguous()
+        e = module.E.contiguous()
+        ctx.save_for_backward(x_2d, packed, e)
+        ctx.step_snapshot = int(module.step_counter.item())
+        ctx.x_shape = x.shape
+        ctx.shape = shape
+        ctx.group_size = module.group_size
+        ctx.module = module
+        comp_name, _ = _COMPONENT_CONTEXT.get()
+        ctx.comp_name = comp_name
+        corr = module.corr_accum.contiguous()
+        step = module.step_counter.contiguous()
+        out = _triton_ternary_forward(x_2d, packed, e, corr, step, n_out, k_in, module.group_size)
+        return out.reshape(*x.shape[:-1], n_out)
+    @staticmethod
+    def backward(ctx, grad_output):
+        x_2d, packed, e = ctx.saved_tensors
+        n_out, k_in = ctx.shape
+        grad_2d = grad_output.reshape(-1, n_out).contiguous()
+        corr = ctx.module.corr_accum.contiguous()
+        step = torch.tensor([ctx.step_snapshot], device=e.device, dtype=torch.int64)
+        grad_x = _triton_ternary_grad_x(
+            grad_2d, packed, e, corr, step, x_2d.shape[0], n_out, k_in, ctx.group_size
+        )
+        with torch.no_grad():
+            if getattr(ctx.module, "_stream_backward_updates", True):
+                _, bwd_weight = _COMPONENT_CONTEXT.get()
+                corr_step = max(1, int(round(abs(float(bwd_weight)))))
+                if bwd_weight < 0:
+                    corr_step = -corr_step
+                _triton_accumulate_corr_direct(
+                    packed, grad_2d, x_2d, ctx.module.corr_accum,
+                    n_out, k_in, ctx.group_size, corr_step=corr_step,
+                )
+                ctx.module.step_counter.add_(abs(corr_step))
+                ctx.module._streamed_bigint_backward = True
+            else:
+                grad_sign = _triton_ternary_grad_sign(grad_2d, x_2d, n_out, k_in)
+                comp_name = ctx.comp_name
+                if comp_name is not None:
+                    setattr(ctx.module, f"_hook_grad_T_sign_{comp_name}", grad_sign.detach())
+                else:
+                    ctx.module._hook_grad_T_sign = grad_sign.detach()
+        return grad_x.reshape(*ctx.x_shape), None
+class _BigIntTernaryLinearFn(torch.autograd.Function):
+    @staticmethod
+    def forward(ctx, x, module):
+        shape = tuple(module._T_shape.tolist())
+        n_out, k_in = shape
+        x_2d = x.reshape(-1, k_in).contiguous()
+        ctx.module = module
+        ctx.x_shape = x.shape
+        ctx.shape = shape
+        ctx.x_dtype = x.dtype
+        ctx.save_for_backward(x_2d)
+        with torch.no_grad():
+            w_eff = module.dequantize().to(device=x.device, dtype=torch.float32)
+            out = F.linear(x_2d.float(), w_eff, module.bias.float() if module.bias is not None else None)
+        return out.reshape(*x.shape[:-1], n_out)
+    @staticmethod
+    def backward(ctx, grad_output):
+        (x_2d,) = ctx.saved_tensors
+        module = ctx.module
+        n_out, k_in = ctx.shape
+        grad_2d = grad_output.reshape(-1, n_out).contiguous()
+        with torch.no_grad():
+            w_eff = module.dequantize().to(device=grad_2d.device, dtype=torch.float32)
+            grad_x = grad_2d.float() @ w_eff
+            grad_sign = (grad_2d.float().transpose(0, 1) @ x_2d.float()).sign().to(torch.int8)
+            module._accumulate_corr_from_grad_sign(grad_sign)
+            module._streamed_bigint_backward = True
+        return grad_x.reshape(*ctx.x_shape).to(dtype=ctx.x_dtype), None
+"""
+Log-Space Group Scale Representation
+Convention (matching agents' Option B recommendation):
+  S = 2^E      where S = scale, E = int8 log-space exponent
+  W_eff = T * 2^E
+Key log-space properties exploited:
+  Multiplication → addition:   S1 * S2 = 2^(E1 + E2)
+  Division → subtraction:      S1 / S2 = 2^(E1 - E2)
+  Dequant → integer shift:     2^E * T = T << E   (for E >= 0)
+No IEEE floats in persistent state: E is stored as int8.
+Floats exist only ephemerally, in autograd's computation graph.
+"""
+class TScaleType(IntEnum):
+    T4 = 4
+    T6 = 6
+    T8 = 8
+    T16 = 16
+    T32 = 32
+    T64 = 64
+    T96 = 96
+GROUP_SIZES = {
+    TScaleType.T4: 4,
+    TScaleType.T6: 6,
+    TScaleType.T8: 8,
+    TScaleType.T16: 16,
+    TScaleType.T32: 32,
+    TScaleType.T64: 64,
+    TScaleType.T96: 96,
+}
+TILE_SIZE = 384
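+def _n_groups(shape, group_size):
+    """Total number of scale groups: out_dim rows of ceil(in_dim / group_size) groups each."""
+    out_dim, in_dim = shape
+    return out_dim * ceil(in_dim / group_size)
+def _expand_E(E, shape, group_size):
+    """Expand flat per-group exponents to one exponent per weight, shape (out_dim, in_dim)."""
+    out_dim, in_dim = shape
+    gpr = ceil(in_dim / group_size)
+    E_2d = E.view(out_dim, gpr)
+    E_exp = E_2d.repeat_interleave(group_size, dim=1)
+    if E_exp.shape[1] > in_dim:
+        E_exp = E_exp[:, :in_dim]
+    return E_exp
+def _ternarize(x, threshold=0.05):
+    """Map x to {-1, 0, +1}: sign(x) where |x| > threshold, else 0."""
+    return x.sign() * (x.abs() > threshold).to(x.dtype)
+def _scaled_init_threshold(threshold: float, init_std: float) -> float:
+    """Cap the threshold at 0.5 * init_std; return it unchanged when init_std <= 0."""
+    if init_std <= 0:
+        return threshold
+    return min(float(threshold), 0.5 * float(init_std))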
+class TernaryScaleTensor(nn.Module):
+    def __init__(
+        self,
+        in_dim: int,
+        out_dim: int,
+        threshold: float = 0.05,
+        weight_init_std: float | None = None,
+        tscale_type: TScaleType = TScaleType.T32,
+        bias: bool = False,
+    ):
+        super().__init__()
+        self.in_dim = in_dim
+        self.out_dim = out_dim
+        init_std = min(0.1, in_dim ** -0.5) if weight_init_std is None else float(weight_init_std)
+        init_threshold = _scaled_init_threshold(threshold, init_std)
+        self.threshold = init_threshold
+        self.tscale_type = tscale_type
+        self.group_size = GROUP_SIZES[tscale_type]
+        shape = (out_dim, in_dim)
+        n_grp = _n_groups(shape, self.group_size)
+        w_init = torch.randn(out_dim, in_dim) * init_std
+        T_init = _ternarize(w_init, init_threshold)
+        packed_T, T_shape, T_pad = pack_ternary(T_init)
+        self.register_buffer("T_packed", packed_T)
+        self.register_buffer("_T_shape", torch.tensor([out_dim, in_dim], dtype=torch.long))
+        self.register_buffer("_T_pad", torch.tensor(T_pad, dtype=torch.long))
+        gpr = ceil(in_dim / self.group_size)
+        total_in = gpr * self.group_size
+        padded = torch.zeros(out_dim, total_in)
+        abs_w = w_init.abs()
+        padded[:, :in_dim] = abs_w
+        grouped = padded.view(out_dim, gpr, self.group_size)
+        grp_means = grouped.mean(dim=2)
+        E_vals = torch.where(grp_means > 0, grp_means, torch.ones_like(grp_means))
+        E_int = E_vals.log2().clamp(-128, 127).to(torch.int8)
+        self.register_buffer("E", E_int.flatten())
+        self.register_buffer("corr_accum", torch.zeros_like(self.E, dtype=torch.int64))
+        self.register_buffer("step_counter", torch.zeros(1, dtype=torch.int64))
+        if bias:
+            self.register_buffer("bias", torch.zeros(out_dim, dtype=torch.int32))
+        else:
+            self.bias = None
+    def _get_T(self):
+        return unpack_ternary(self.T_packed, tuple(self._T_shape.tolist()), int(self._T_pad.item()))
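+    def _get_S(self):
+        """Per-weight scale 2**E; once step_counter > 0, E is shifted by
+        corr_accum / (step * group_size) times _bigint_corr_strength()."""
+        gpr = ceil(self.in_dim / self.group_size)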
+        e_adj = self.E.float()
+        if hasattr(self, "corr_accum") and hasattr(self, "step_counter"):
+            step = int(self.step_counter.item())
+            if step > 0:
+                denom = max(step * self.group_size, 1)
+                e_adj = e_adj + (self.corr_accum.float() / denom) * _bigint_corr_strength()
+        E_exp = _expand_E(e_adj, (self.out_dim, self.in_dim), self.group_size)
+        return torch.exp2(E_exp)
+    def _ensure_group_lr(self):
+        if not hasattr(self, "group_lr"):
+            self.register_buffer("group_lr", torch.ones_like(self.E, dtype=torch.int8))
+        elif self.group_lr.shape != self.E.shape or self.group_lr.device != self.E.device:
+            self.group_lr = torch.ones_like(self.E, dtype=torch.int8)
+        return self.group_lr
+    def precompile_kernels(self, M: int):
+        pass
+    def forward(self, x):
+        backend = _backend_preference()
+        if backend == "tilelang" and _HAS_TILELANG:
+            if torch.is_grad_enabled() and not _tilelang_training_enabled():
+                raise RuntimeError(
+                    "ARB_TERNARY_BACKEND='tilelang' is inference-only by default. "
+                    "BigInt ternary training should use ARB_TERNARY_BACKEND='triton'. "
+                    "Set ARB_TILELANG_TRAINING=1 only for experimental TileLang training."
+                )
+            x_for_grad = x
+            if torch.is_grad_enabled() and not x.requires_grad:
+                x_for_grad = x.detach().requires_grad_(True)
+            N, K = tuple(self._T_shape.tolist())
+            x_2d = x_for_grad.reshape(-1, K)
+            M = x_2d.shape[0]
+            try:
+                fwd_kernel = _get_kernel(M, N, K, self.group_size, "fwd")
+                y = _TernaryLinearFn.apply(x_for_grad, self, fwd_kernel)
+                if self.bias is not None:
+                    y = y + self.bias.float()
+                return y
+            except Exception as e:
+                warnings.warn(f"TileLang forward failed for {self._T_shape.tolist()}: {e}")
+                if _HAS_TRITON:
+                    backend = "triton"
+                else:
+                    backend = "torch"
+        if x.is_cuda and _HAS_TRITON and backend in {"auto", "triton"}:
+            x_for_grad = x
+            if torch.is_grad_enabled() and not x.requires_grad:
+                x_for_grad = x.detach().requires_grad_(True)
+            y = _TritonTernaryLinearFn.apply(x_for_grad, self)
+            if self.bias is not None:
+                y = y + self.bias.float()
+            return y
+        if backend == "triton":
+            raise RuntimeError("ARB_TERNARY_BACKEND='triton' requested, but Triton is unavailable for this input.")
+        x_for_grad = x
+        if torch.is_grad_enabled() and not x.requires_grad:
+            x_for_grad = x.detach().requires_grad_(True)
+        return _BigIntTernaryLinearFn.apply(x_for_grad, self)
+    @torch.no_grad()
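+    def _accumulate_corr_from_grad_sign(self, grad_sign, corr_step=1):
+        """Subtract the per-group sum of grad_sign * T, times corr_step, from corr_accum.
+        A grad_sign whose shape differs from T's is ignored without raising."""
+        shape = tuple(self._T_shape.tolist())
+        out_dim, in_dim = shape
+        if tuple(grad_sign.shape) != shape:
+            return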
+        T = self._get_T().to(device=grad_sign.device, dtype=torch.int16)
+        signed = grad_sign.to(torch.int16) * T
+        gpr = ceil(in_dim / self.group_size)
+        total_in = gpr * self.group_size
+        if total_in > in_dim:
+            signed = F.pad(signed, (0, total_in - in_dim))
+        score = signed.view(out_dim, gpr, self.group_size).sum(dim=2, dtype=torch.int16)
+        self.corr_accum -= score.flatten().to(device=self.corr_accum.device, dtype=torch.int64) * int(corr_step)
+        self.step_counter += abs(int(corr_step))
+    def ternary_step(self, lr=1, accum_threshold=None):
+        self._had_flip = False
+        if hasattr(self, "_hook_grad_T_sign"):
+            self._accumulate_corr_from_grad_sign(self._hook_grad_T_sign)
+            del self._hook_grad_T_sign
+    def update_E(self, lr=1, loss_signal=None):
+        has_dense_grad = hasattr(self, "_hook_grad_T_sign")
+        has_direct_grad = hasattr(self, "_hook_grad_2d") and hasattr(self, "_hook_x_2d")
+        if not has_dense_grad and not has_direct_grad:
+            return
+        if has_dense_grad:
+            self._accumulate_corr_from_grad_sign(self._hook_grad_T_sign)
+            del self._hook_grad_T_sign
+        else:
+            grad = self._hook_grad_2d.to(device=self.E.device, dtype=torch.float32)
+            x = self._hook_x_2d.to(device=self.E.device, dtype=torch.float32)
+            grad_sign = (grad.transpose(0, 1) @ x).sign().to(torch.int8)
+            self._accumulate_corr_from_grad_sign(grad_sign)
+            del self._hook_grad_2d
+            del self._hook_x_2d
+        if hasattr(self, "_hook_T"):
+            del self._hook_T
+    @property
+    def effective_bpw(self) -> float:
+        """Stored bits per weight, counting packed trits, E, corr_accum, and bias."""
+        group_size = self.group_size
+        total = self._T_shape[0].item() * self._T_shape[1].item()
+        n_grp = _n_groups(tuple(self._T_shape.tolist()), group_size)
+        sign_bits = total * (8 / 5)  # five trits per packed byte
+        scale_bits = n_grp * 8.0  # one int8 exponent per group
+        corr_bits = n_grp * 64.0  # one int64 corr_accum entry per group
+        bias_bits = self.bias.numel() * 32.0 if self.bias is not None else 0.0  # int32 bias
+        return (sign_bits + scale_bits + corr_bits + bias_bits) / total
+    def dequantize(self) -> torch.Tensor:
+        T = self._get_T().float()
+        S = self._get_S()
+        return S * T
+    def tscale_to(self, tscale_type: TScaleType):
+        self.tscale_type = tscale_type
+        old_group_size = self.group_size
+        self.group_size = GROUP_SIZES[tscale_type]
+        shape = tuple(self._T_shape.tolist())
+        out_dim, in_dim = shape
+        new_gpr = ceil(in_dim / self.group_size)
+        new_n_grp = out_dim * new_gpr
+        if self.E.shape[0] != new_n_grp:
+            T = self._get_T().float()
+            total_in = new_gpr * self.group_size
+            padded = torch.zeros(out_dim, total_in, device=self.T_packed.device)
+            abs_w = T.abs()
+            padded[:, :in_dim] = abs_w
+            grouped = padded.view(out_dim, new_gpr, self.group_size)
+            grp_means = grouped.mean(dim=2)
+            E_new = torch.where(grp_means > 0, grp_means, torch.ones_like(grp_means))
+            E_int = E_new.log2().clamp(-128, 127).to(torch.int8)
+            self.E = E_int.flatten()
+            self.corr_accum = torch.zeros_like(self.E, dtype=torch.int64)
+            self.step_counter = torch.zeros(1, dtype=torch.int64, device=self.E.device)
+        return self
+    tscale_cast = tscale_to
+    def extra_repr(self) -> str:
+        return (
+            f"in_dim={self.in_dim}, out_dim={self.out_dim}, "
+            f"tscale_type={self.tscale_type.name}, group_size={self.group_size}, "
+            f"effective_bpw={self.effective_bpw:.2f}"
+        )
+if _HAS_TRITON:
+    @triton.jit
+    def _triton_rmsnorm_fwd_kernel(
+        x_ptr, packed_ptr, e_ptr, out_ptr,
+        BATCH: tl.constexpr, DIM: tl.constexpr,
+        GPR: tl.constexpr, GROUP_SIZE: tl.constexpr,
+        BLOCK_B: tl.constexpr, BLOCK_D: tl.constexpr,
+    ):
+        pid_b = tl.program_id(0)
+        offs_b = pid_b * BLOCK_B + tl.arange(0, BLOCK_B)
+        offs_d = tl.arange(0, BLOCK_D)
+        x = tl.load(
+            x_ptr + offs_b[:, None] * DIM + offs_d[None, :],
+            mask=(offs_b[:, None] < BATCH) & (offs_d[None, :] < DIM),
+            other=0.0,
+        )
+        sq = x * x
+        msq = tl.sum(sq, axis=1, keep_dims=True) / DIM
+        rms = tl.sqrt(msq + 1e-5)
+        x_norm = x / rms
+        pack_idx = offs_d // 5
+        trit_pos = offs_d - pack_idx * 5
+        packed = tl.load(packed_ptr + pack_idx, mask=offs_d < DIM, other=0).to(tl.int32)
+        divisor = tl.where(
+            trit_pos == 0, 1,
+            tl.where(trit_pos == 1, 3,
+            tl.where(trit_pos == 2, 9,
+            tl.where(trit_pos == 3, 27, 81))),
+        )
+        trit = (packed // divisor) % 3
+        sign = trit.to(tl.int32) - 1
+        e_idx = offs_d // GROUP_SIZE
+        e_val = tl.load(e_ptr + e_idx, mask=offs_d < DIM, other=0).to(tl.float32)
+        w = sign.to(tl.float32) * tl.exp2(e_val)
+        w = tl.where(offs_d < DIM, w, 0.0)
+        out = x_norm * w[None, :]
+        tl.store(
+            out_ptr + offs_b[:, None] * DIM + offs_d[None, :],
+            out,
+            mask=(offs_b[:, None] < BATCH) & (offs_d[None, :] < DIM),
+        )
+    @triton.jit
+    def _triton_rmsnorm_bwd_kernel(
+        grad_out_ptr, x_ptr, packed_ptr, e_ptr,
+        grad_x_ptr,
+        BATCH: tl.constexpr, DIM: tl.constexpr,
+        GPR: tl.constexpr, GROUP_SIZE: tl.constexpr,
+        BLOCK_B: tl.constexpr, BLOCK_D: tl.constexpr,
+    ):
+        pid_b = tl.program_id(0)
+        offs_b = pid_b * BLOCK_B + tl.arange(0, BLOCK_B)
+        offs_d = tl.arange(0, BLOCK_D)
+        x = tl.load(
+            x_ptr + offs_b[:, None] * DIM + offs_d[None, :],
+            mask=(offs_b[:, None] < BATCH) & (offs_d[None, :] < DIM),
+            other=0.0,
+        )
+        sq = x * x
+        msq = tl.sum(sq, axis=1, keep_dims=True) / DIM
+        rms = tl.sqrt(msq + 1e-5)
+        x_norm = x / rms
+        pack_idx = offs_d // 5
+        trit_pos = offs_d - pack_idx * 5
+        packed = tl.load(packed_ptr + pack_idx, mask=offs_d < DIM, other=0).to(tl.int32)
+        divisor = tl.where(
+            trit_pos == 0, 1,
+            tl.where(trit_pos == 1, 3,
+            tl.where(trit_pos == 2, 9,
+            tl.where(trit_pos == 3, 27, 81))),
+        )
+        trit = (packed // divisor) % 3
+        sign = trit.to(tl.int32) - 1
+        e_idx = offs_d // GROUP_SIZE
+        e_val = tl.load(e_ptr + e_idx, mask=offs_d < DIM, other=0).to(tl.float32)
+        w = sign.to(tl.float32) * tl.exp2(e_val)
+        w = tl.where(offs_d < DIM, w, 0.0)
+        dy = tl.load(
+            grad_out_ptr + offs_b[:, None] * DIM + offs_d[None, :],
+            mask=(offs_b[:, None] < BATCH) & (offs_d[None, :] < DIM),
+            other=0.0,
+        )
+        dyw = dy * w[None, :]
+        c1 = tl.sum(x_norm * dyw, axis=1, keep_dims=True) / DIM
+        dx = (dyw - x_norm * c1) / rms
+        tl.store(
+            grad_x_ptr + offs_b[:, None] * DIM + offs_d[None, :],
+            dx,
+            mask=(offs_b[:, None] < BATCH) & (offs_d[None, :] < DIM),
+        )
+    class _TritonRMSNormFn(torch.autograd.Function):
+        @staticmethod
+        def forward(ctx, x, module, packed, e, dim, group_size):
+            ctx.module = module
+            x_2d = x.reshape(-1, dim).contiguous()
+            batch = x_2d.shape[0]
+            out = torch.empty_like(x_2d)
+            block_b = 16
+            grid = (triton.cdiv(batch, block_b),)
+            _triton_rmsnorm_fwd_kernel[grid](
+                x_2d, packed, e, out,
+                batch, dim, ceil(dim / group_size), group_size,
+                BLOCK_B=block_b, BLOCK_D=triton.next_power_of_2(dim),
+            )
+            ctx.save_for_backward(x_2d, packed, e)
+            ctx.dim = dim
+            ctx.group_size = group_size
+            comp_name, _ = _COMPONENT_CONTEXT.get()
+            ctx.comp_name = comp_name
+            return out.reshape(*x.shape)
+        @staticmethod
+        def backward(ctx, grad_output):
+            x_2d, packed, e = ctx.saved_tensors
+            dim = ctx.dim
+            group_size = ctx.group_size
+            grad_2d = grad_output.reshape(-1, dim).contiguous()
+            batch = grad_2d.shape[0]
+            grad_x = torch.empty_like(x_2d)
+            block_b = 16
+            grid = (triton.cdiv(batch, block_b),)
+            _triton_rmsnorm_bwd_kernel[grid](
+                grad_2d, x_2d, packed, e, grad_x,
+                batch, dim, ceil(dim / group_size), group_size,
+                BLOCK_B=block_b, BLOCK_D=triton.next_power_of_2(dim),
+            )
+            return grad_x.reshape(*grad_output.shape), None, None, None, None, None
+class TernaryRMSNorm(nn.Module):
+    def __init__(self, dim, eps=1e-5, threshold=0.05, tscale_type=TScaleType.T64):
+        super().__init__()
+        self.dim = dim
+        self.eps = eps
+        self.threshold = threshold
+        self.tscale_type = tscale_type
+        self.group_size = GROUP_SIZES[tscale_type]
+        shape = (1, dim)
+        n_grp = _n_groups(shape, self.group_size)
+        w_init = torch.ones(1, dim)
+        T_init = _ternarize(w_init, threshold)
+        packed_T, T_shape, T_pad = pack_ternary(T_init)
+        self.register_buffer("T_packed", packed_T)
+        self.register_buffer("_T_shape", torch.tensor([1, dim], dtype=torch.long))
+        self.register_buffer("_T_pad", torch.tensor(T_pad, dtype=torch.long))
+        gpr = ceil(dim / self.group_size)
+        total_in = gpr * self.group_size
+        padded = torch.zeros(1, total_in)
+        abs_w = w_init.abs()
+        padded[:, :dim] = abs_w
+        grouped = padded.view(1, gpr, self.group_size)
+        grp_means = grouped.mean(dim=2)
+        E_vals = torch.where(grp_means > 0, grp_means, torch.ones_like(grp_means))
+        self.register_buffer("E", E_vals.flatten().log2().clamp(-128, 127).to(torch.int8))
+        self.register_buffer("E_accum", torch.zeros_like(self.E, dtype=torch.int8))
+        self.register_buffer("group_lr", torch.ones_like(self.E, dtype=torch.int8))
+        self.register_buffer("T_accum", torch.zeros(1, dim, dtype=torch.int8))
+    def _ensure_E_accum(self):
+        if not hasattr(self, "E_accum"):
+            self.register_buffer("E_accum", torch.zeros_like(self.E, dtype=torch.int8))
+        elif self.E_accum.shape != self.E.shape or self.E_accum.device != self.E.device:
+            self.E_accum = torch.zeros_like(self.E, dtype=torch.int8)
+        return self.E_accum
+    def _ensure_group_lr(self):
+        if not hasattr(self, "group_lr"):
+            self.register_buffer("group_lr", torch.ones_like(self.E, dtype=torch.int8))
+        elif self.group_lr.shape != self.E.shape or self.group_lr.device != self.E.device:
+            self.group_lr = torch.ones_like(self.E, dtype=torch.int8)
+        return self.group_lr
+    def _get_T(self):
+        return unpack_ternary(self.T_packed, tuple(self._T_shape.tolist()), int(self._T_pad.item())).squeeze(0)
+    def forward(self, x):
+        # The Triton kernels hard-code eps = 1e-5; self.eps is honored only on the fallback path below.
+        if x.is_cuda and _HAS_TRITON and self.dim <= _rmsnorm_triton_max_dim():
+            return _TritonRMSNormFn.apply(
+                x, self, self.T_packed.contiguous(), self.E.contiguous(),
+                self.dim, self.group_size,
+            )
+        inv_rms = torch.rsqrt(torch.mean(x * x, dim=-1, keepdim=True) + self.eps)
+        if x.is_cuda:
+            # TernaryRMSNorm is initialized as an identity scale and does not
+            # train E/T. Avoid unpacking a full large-dim weight or launching
+            # the high-register Triton backward kernel on 8GB GPUs.
+            return x * inv_rms
+        T = self._get_T()
+        E_exp = _expand_E(self.E, tuple(self._T_shape.tolist()), self.group_size).squeeze(0)
+        S = torch.exp2(E_exp.float())
+        weight = S * T.float()
+        return weight * (x * inv_rms)
+    def ternary_step(self, lr=1, accum_threshold=3):
+        pass
+    def update_E(self, lr=1, loss_signal=None):
+        pass
+    def extra_repr(self):
+        return f"dim={self.dim}, tscale_type={self.tscale_type.name}"
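
Note on the packed layout used by the kernels above: they decode each weight as `(packed // 3**trit_pos) % 3 - 1` with five trits per byte, then scale it by `2**E` for its group. The following is a minimal pure-Python sketch of that arithmetic, not the repository's implementation. The helper names `pack_trits`, `unpack_trit`, and `dequantize_row` are hypothetical, and the real `pack_ternary` / `unpack_ternary` are not shown in this part of the diff, so their padding and byte order may differ.

```python
TRITS_PER_BYTE = 5  # 3**5 = 243 <= 255, so five base-3 digits fit in one byte


def pack_trits(trits):
    """Pack values in {-1, 0, +1} into bytes, five base-3 digits per byte.

    Hypothetical helper: the tail is padded with 0 trits (an assumption).
    """
    padded = list(trits) + [0] * (-len(trits) % TRITS_PER_BYTE)
    out = bytearray()
    for i in range(0, len(padded), TRITS_PER_BYTE):
        byte = 0
        for pos, t in enumerate(padded[i:i + TRITS_PER_BYTE]):
            byte += (t + 1) * 3 ** pos  # digit 0/1/2 encodes -1/0/+1
        out.append(byte)
    return bytes(out)


def unpack_trit(packed, lin):
    """Decode the trit at flat index `lin`, mirroring the kernel arithmetic."""
    pack_idx = lin // TRITS_PER_BYTE
    trit_pos = lin - pack_idx * TRITS_PER_BYTE
    return (packed[pack_idx] // 3 ** trit_pos) % 3 - 1


def dequantize_row(packed, exponents, in_dim, group_size):
    """W_eff[k] = T[k] * 2**E[k // group_size] for a single row."""
    return [
        unpack_trit(packed, k) * 2.0 ** exponents[k // group_size]
        for k in range(in_dim)
    ]


row = [1, -1, 0, 1, 1, -1, 0]
packed = pack_trits(row)
weights = dequantize_row(packed, exponents=[-1, 2], in_dim=7, group_size=4)
```

Seven trits occupy two bytes here, which matches the `8 / 5` = 1.6 sign bits per weight that `effective_bpw` charges for the packed tensor.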

arbitor/kernel/triton_video.py ADDED

	@@ -0,0 +1,75 @@

+"""Triton kernels for video denoising (used by VideoHead)."""
+import torch
+import torch.nn as nn
+from math import ceil as _ceil
+from .ternary_scale import _HAS_TRITON
+if _HAS_TRITON:
+    import triton
+    import triton.language as tl
+    @triton.jit
+    def _triton_video_denoise_fwd_kernel(
+        latent, pred_noise, out,
+        TOTAL: tl.constexpr, ALPHA: tl.constexpr, BLOCK: tl.constexpr,
+    ):
+        offsets = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
+        mask = offsets < TOTAL
+        l = tl.load(latent + offsets, mask=mask, other=0.0)
+        p = tl.load(pred_noise + offsets, mask=mask, other=0.0)
+        beta = 1.0 - ALPHA
+        inv_sqrt = 1.0 / tl.sqrt(ALPHA + 0.00000001)
+        tl.store(out + offsets, (l - beta * p) * inv_sqrt, mask=mask)
+    @triton.jit
+    def _triton_video_denoise_bwd_kernel(
+        grad_out, grad_latent, grad_pred,
+        TOTAL: tl.constexpr, ALPHA: tl.constexpr, BLOCK: tl.constexpr,
+    ):
+        offsets = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
+        mask = offsets < TOTAL
+        g = tl.load(grad_out + offsets, mask=mask, other=0.0)
+        beta = 1.0 - ALPHA
+        inv_sqrt = 1.0 / tl.sqrt(ALPHA + 0.00000001)
+        tl.store(grad_latent + offsets, g * inv_sqrt, mask=mask)
+        tl.store(grad_pred + offsets, -beta * g * inv_sqrt, mask=mask)
+    class _TritonVideoDenoiseFn(torch.autograd.Function):
+        @staticmethod
+        def forward(ctx, latent, pred_noise, alpha):
+            latent_c = latent.contiguous()
+            pred_c = pred_noise.contiguous()
+            out = torch.empty_like(latent_c)
+            total = latent_c.numel()
+            block = 256
+            grid = (_ceil_div(total, block),)
+            alpha_f = float(alpha)
+            _triton_video_denoise_fwd_kernel[grid](
+                latent_c, pred_c, out,
+                total, alpha_f, BLOCK=block,
+            )
+            ctx.alpha = alpha_f
+            ctx.shape = latent.shape
+            return out.reshape_as(latent)
+        @staticmethod
+        def backward(ctx, grad_out):
+            grad_c = grad_out.contiguous()
+            grad_latent = torch.empty_like(grad_c)
+            grad_pred = torch.empty_like(grad_c)
+            total = grad_c.numel()
+            block = 256
+            grid = (_ceil_div(total, block),)
+            _triton_video_denoise_bwd_kernel[grid](
+                grad_c, grad_latent, grad_pred,
+                total, ctx.alpha, BLOCK=block,
+            )
+            return grad_latent.reshape(ctx.shape), grad_pred.reshape(ctx.shape), None
+def video_denoise_step(latent, pred_noise, alpha):
+    if _HAS_TRITON and latent.is_cuda and pred_noise.is_cuda and _TritonVideoDenoiseFn is not None:
+        return _TritonVideoDenoiseFn.apply(latent, pred_noise, alpha)
+    # Same epsilon placement as the Triton kernels: 1 / sqrt(alpha + 1e-8).
+    return (latent - (1 - alpha) * pred_noise) / ((alpha + 1e-8) ** 0.5)
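The forward kernel computes `(latent - (1 - alpha) * pred_noise) / sqrt(alpha + 1e-8)`, and the backward kernel returns the two partial derivatives of that expression. The scalar sketch below checks those derivatives against central finite differences. It uses plain Python floats instead of Triton or PyTorch, and the helper names are illustrative only.

```python
import math

def denoise(latent, pred_noise, alpha, eps=1e-8):
    """Scalar version of the forward kernel's formula."""
    inv_sqrt = 1.0 / math.sqrt(alpha + eps)
    return (latent - (1.0 - alpha) * pred_noise) * inv_sqrt

def denoise_grads(grad_out, alpha, eps=1e-8):
    """Analytic gradients w.r.t. latent and pred_noise, as in the backward kernel."""
    inv_sqrt = 1.0 / math.sqrt(alpha + eps)
    return grad_out * inv_sqrt, -(1.0 - alpha) * grad_out * inv_sqrt

def finite_diff(f, x, h=1e-6):
    """Central finite difference of a scalar function at x."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

latent, pred, alpha = 0.7, -0.2, 0.9
g_latent, g_pred = denoise_grads(1.0, alpha)
fd_latent = finite_diff(lambda v: denoise(v, pred, alpha), latent)
fd_pred = finite_diff(lambda v: denoise(latent, v, alpha), pred)
```

Because the step is linear in both inputs, the finite differences agree with the analytic values up to floating-point noise. No gradient is returned for `alpha`, which matches the `None` in the autograd function's `backward`.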

arbitor/main.py ADDED

	@@ -0,0 +1,585 @@

+"""ARB — Any Relational Bit. Core model assembly."""
+import warnings
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from math import ceil as _ceil
+def _ceil_div(a, b):
+    # Exact integer ceiling division; avoids float rounding for large sizes.
+    return -(-a // b) if b > 0 else 0
+from .config import VOCAB, HIDDEN_DIM, SPECIAL_VOCAB, CTX, THRESHOLD, CODEBOOK_DIM, CODEBOOK_SIZE, KV_LEDGER_SIZE, KQ_CACHE_SIZE, MEMGRAM_STRUCT_PRIMES, MEMGRAM_CONV_PRIMES, MEMGRAM_EMBED_DIM, MEMGRAM_KEY_DIM, KGVQ_CODEBOOK_SIZE, KGVQ_CODEBOOK_DIM, K_MAX_COMPOSITES, MG_TOP_K
+from .kernel.ternary_scale import TScaleType, TernaryScaleTensor, TernaryRMSNorm, _HAS_TRITON
+try:
+    from .kernel.ternary_scale import _triton_apply_accumulated_flips
+except ImportError:
+    _triton_apply_accumulated_flips = None
+from .converters.convert_to_ternary8 import pack_ternary
+try:
+    from .kernel.ternary_scale import _TritonTernaryEmbedFn
+except ImportError:
+    _TritonTernaryEmbedFn = None
+from .sequencers import ByteEmbedding, MultimodalSequencer
+from .vq import SharedVQ
+from .components import (
+    ByteHead, OutputRouter,
+    MemGram, LossComponents, LossWeights,
+    CompositeProposalHead, MoEGraph,
+)
+from .decoders import VideoHead, TalkerHead
+from .components import _BOUNDARY_TOKEN_MAP as _BOUNDARY_MAP
+from .attention import KVLedger, KQCache, ContextAttentionScheduler
+from .kernel.flash_vq import FlashVQCodebook
+def _extract_boundary_from_input(x):
+    if x.dim() != 2:
+        return None
+    first_token = x[0, 0].item()
+    if first_token in _BOUNDARY_MAP:
+        return first_token
+    for tok in x[0].tolist():
+        if tok in _BOUNDARY_MAP:
+            return tok
+    return None
+class ARBModel(nn.Module):
+    def __init__(self, tscale_type=TScaleType.T32, threshold=THRESHOLD,
+        max_graph_hops=4, max_moe_iters=4, halt_threshold=0.99,
+        enable_image=False, enable_audio=False, enable_vq=True, enable_graph=True,
+        enable_memory_modules=False, enable_moe=True,
+        shared_vq_size=None, kgvq_codebook_size=None,
+        enable_attention=True, enable_output_router=True,
+        enable_video_output=True, enable_talker_output=True):
+        super().__init__()
+        self.image_enabled = enable_image
+        self.audio_enabled = enable_audio
+        self.embedding = ByteEmbedding(tscale_type=tscale_type)
+        self.multimodal_sequencer = MultimodalSequencer(
+            tscale_type=tscale_type,
+            enable_text=True, enable_image=enable_image, enable_audio=enable_audio,
+        )
+        self.text_sequencer = self.multimodal_sequencer.text
+        self.image_sequencer = self.multimodal_sequencer.image
+        self.audio_sequencer = self.multimodal_sequencer.audio
+        self.vq_enabled = enable_vq
+        self.bridge = SharedVQ(
+            codebook_size=shared_vq_size,
+            tscale_type=tscale_type, enable_image=enable_image, enable_audio=enable_audio,
+        ) if enable_vq else None
+        self.vq_to_trigram = TernaryScaleTensor(CODEBOOK_DIM, HIDDEN_DIM, tscale_type=tscale_type) if enable_vq else None
+        self.vq_to_trigram_norm = TernaryRMSNorm(HIDDEN_DIM, tscale_type=tscale_type) if enable_vq else None
+        self.graph_enabled = enable_graph and enable_vq
+        graph_vocab_size = self.bridge.total_codebook_size if self.graph_enabled else None
+        self.threshold = threshold
+        self.moegraph = MoEGraph(
+            trigram_dim=HIDDEN_DIM, codebook_size=graph_vocab_size or CODEBOOK_SIZE,
+            max_iters=max_moe_iters, halt_threshold=halt_threshold,
+            top_k=MG_TOP_K,
+        ) if self.graph_enabled else None
+        self.byte_head = ByteHead(tscale_type=tscale_type)
+        # Composite motif generation (Phase 17)
+        self.composite_head = CompositeProposalHead(
+            dim=HIDDEN_DIM, codebook_dim=KGVQ_CODEBOOK_DIM,
+            k_max=K_MAX_COMPOSITES, codebook_size=kgvq_codebook_size or KGVQ_CODEBOOK_SIZE,
+            tscale_type=tscale_type,
+        ) if self.graph_enabled else None
+        self.output_router = OutputRouter(tscale_type=tscale_type, depth=3) if enable_output_router else None
+        self.video_head = VideoHead(tscale_type=tscale_type) if enable_video_output else None
+        self.talker_head = TalkerHead(tscale_type=tscale_type) if enable_talker_output else None
+        self.memgram = MemGram(
+            struct_primes=MEMGRAM_STRUCT_PRIMES,
+            conv_primes=MEMGRAM_CONV_PRIMES,
+            embed_dim=MEMGRAM_EMBED_DIM, key_dim=MEMGRAM_KEY_DIM, hidden_dim=HIDDEN_DIM,
+        ) if enable_memory_modules else None
+        self.memgram_enabled = self.memgram is not None
+        # KV Ledger + Attention (Phase 16 — replaces LSTM)
+        self.kv_ledger = KVLedger(max_size=KV_LEDGER_SIZE) if enable_attention else None
+        self.kq_cache = KQCache(max_size=KQ_CACHE_SIZE) if enable_attention else None
+        self.attention = ContextAttentionScheduler(dim=HIDDEN_DIM) if enable_attention else None
+        self.attention_enabled = bool(enable_attention)
+    def forward(self, x, targets=None, commitment_warmup_weight=1.0,
+                act_warmup_mode=False, ponder_lambda=0.01, images=None,
+                audio=None, timestep=0, loss_weights=None, output_mode=None):
+        has_image = images is not None
+        has_audio = audio is not None
+        if has_image and (not self.image_enabled or self.image_sequencer is None):
+            raise ValueError("images provided but model has enable_image=False")
+        if has_audio and (not self.audio_enabled or self.audio_sequencer is None):
+            raise ValueError("audio provided but model has enable_audio=False")
+        embedded = self.embedding(x)
+        seq_inputs = {'text': embedded}
+        if has_image:
+            seq_inputs['image'] = images
+        if has_audio:
+            seq_inputs['audio'] = audio
+        seq_outputs = self.multimodal_sequencer(seq_inputs)
+        relational = seq_outputs['text']
+        indices_dict = {}
+        if self.vq_enabled:
+            bridge_inputs = {'text': relational}
+            if 'image' in seq_outputs:
+                bridge_inputs['image'] = seq_outputs['image']
+            if 'audio' in seq_outputs:
+                bridge_inputs['audio'] = seq_outputs['audio']
+            combined, vq_losses, indices_dict = self.bridge(bridge_inputs, timestep=timestep)
+            if combined is None:
+                combined = relational
+            elif combined.shape[-1] == CODEBOOK_DIM:
+                combined = self.vq_to_trigram_norm(self.vq_to_trigram(combined))
+            vq_loss = vq_losses.get('text_vq', torch.zeros((), device=x.device))
+            if 'image_vq' in vq_losses:
+                vq_loss = vq_loss + vq_losses['image_vq']
+            if 'audio_vq' in vq_losses:
+                vq_loss = vq_loss + vq_losses['audio_vq']
+        else:
+            combined = relational
+            vq_loss = torch.zeros((), device=x.device)
+        active_mods = ['text']
+        if has_image:
+            active_mods.append('image')
+        if has_audio:
+            active_mods.append('audio')
+        active_count = len(active_mods)
+        # MemGram injection (after VQ, before Graph — D92)
+        memgram_decay_reg = torch.tensor(0.0, device=x.device)
+        if self.memgram_enabled and self.memgram is not None and self.vq_enabled:
+            vq_indices = indices_dict.get('text', torch.zeros(combined.shape[0], combined.shape[1], dtype=torch.long, device=x.device))
+            combined = self.memgram(
+                vq_indices=vq_indices,
+                hidden_state=combined,
+            )
+        all_indices = None
+        composite_ids = None
+        composite_vq_loss = None
+        processed = combined
+        moegraph_ponder_loss = torch.tensor(0.0, device=x.device)
+        if self.graph_enabled and self.moegraph is not None and self.vq_enabled and vq_loss is not None:
+            self.moegraph._codebook_table = self.bridge.vq.table
+            self.moegraph._codebook_embed = None
+            all_indices = indices_dict.get('text', combined.new_zeros(combined.shape[0], combined.shape[1], dtype=torch.long))
+            if has_image and 'image' in indices_dict:
+                all_indices = torch.cat([all_indices, indices_dict['image']], dim=1)
+            if has_audio and 'audio' in indices_dict:
+                all_indices = torch.cat([all_indices, indices_dict['audio']], dim=1)
+            # MemGram retrieval for MoEGraph injection
+            memgram_cb = None
+            if self.memgram_enabled and self.memgram is not None and self.vq_enabled:
+                vq_idx = indices_dict.get('text', combined.new_zeros(combined.shape[0], combined.shape[1], dtype=torch.long))
+                memgram_cb = self.memgram.retrieve_cb(vq_idx)
+            # Attention output for KV conditioning
+            attn_out = None
+            if self.attention_enabled and self.attention is not None and self.kv_ledger is not None:
+                attn_out = self.attention(combined, self.kv_ledger, kq_cache=self.kq_cache)
+            # MoEGraph forward (unified ACT loop)
+            processed, moegraph_ponder_loss = self.moegraph(
+                combined, all_indices,
+                attention_output=attn_out,
+                memgram_cb_output=memgram_cb,
+                threshold=self.threshold,
+            )
+            # Composite motif generation (Phase 17)
+            if self.composite_head is not None:
+                composite_ids, composite_vq_loss, _ = self.composite_head(processed.mean(dim=1))
+            # Update bounded int-only KG co-occurrence state.
+            self.moegraph.update_kg_edges(all_indices)
+        # OutputRouter: route to appropriate head
+        if targets is not None or output_mode == "text":
+            logits = self.byte_head(processed)
+        elif output_mode == "video":
+            if self.video_head is None:
+                raise ValueError("output_mode='video' requested but video output is disabled")
+            logits = self.video_head(processed)
+        elif output_mode in {"audio", "talker"}:
+            if self.talker_head is None:
+                raise ValueError("audio/talker output requested but talker output is disabled")
+            logits = self.talker_head(processed)
+        elif self.training and self.output_router is not None:
+            route = self.output_router(processed, training=True)
+            route_weights, route_logits = route
+            logits = self.byte_head(processed)
+        elif self.output_router is not None:
+            route = self.output_router(processed, training=False)
+            if isinstance(route, torch.Tensor) and route.numel() > 0:
+                use_video = (route == 2).any() and self.video_head is not None
+                use_talk = (route == 3).any() and self.talker_head is not None
+                logits = self.video_head(processed) if use_video else \
+                         self.talker_head(processed) if use_talk else \
+                         self.byte_head(processed)
+            else:
+                logits = self.byte_head(processed)
+        else:
+            logits = self.byte_head(processed)
+        T_text = relational.shape[1]
+        if logits.dim() == 3 and logits.shape[-1] == VOCAB:
+            logits = logits[:, :T_text, :]
+            with torch.no_grad():
+                self._append_predictions_to_kv(logits.argmax(dim=-1), composite_ids=composite_ids)
+        losses = None
+        if targets is not None:
+            next_byte_logits = logits[:, :-1, :].contiguous()
+            lm_loss = F.cross_entropy(
+                next_byte_logits.view(-1, VOCAB),
+                targets.contiguous().view(-1),
+                ignore_index=SPECIAL_VOCAB["PAD"]
+            )
+            vq_component = commitment_warmup_weight * vq_loss if self.vq_enabled else None
+            losses = LossComponents(
+                lm=lm_loss,
+                vq_commitment=vq_component,
+                graph_l1=None,
+                moegraph_ponder=moegraph_ponder_loss,
+                memgram_decay_reg=memgram_decay_reg if self.memgram_enabled else None,
+                composite_vq=composite_vq_loss if self.composite_head is not None and composite_ids is not None else None,
+                weights=loss_weights if loss_weights is not None else LossWeights(),
+            )
+        return logits, losses, all_indices, None
+    @torch.no_grad()
+    def _append_predictions_to_kv(self, pred_ids, composite_ids=None):
+        if self.kv_ledger is None or self.kq_cache is None:
+            return
+        for b in range(pred_ids.shape[0]):
+            for t in range(pred_ids.shape[1]):
+                token_id = int(pred_ids[b, t])
+                self.kv_ledger.append(token_id)
+                self.kq_cache.append(token_id)
+            if composite_ids is None:
+                continue
+            composite_offset = self.bridge.total_codebook_size if self.vq_enabled and self.bridge is not None else 0
+            for k in range(composite_ids.shape[1]):
+                cid = int(composite_ids[b, k])
+                if cid >= 0:
+                    self.kv_ledger.append(composite_offset + cid)
+    def _ternary_update_memory(self, accum_threshold=8, update_scales=True,
+                               loss_components=None, loss_signal=None):
+        signal = loss_components.total if loss_components is not None else loss_signal
+        t_step = self._ternary_t_step(signal)
+        if signal is not None and not torch.isfinite(signal.detach()).all():
+            warnings.warn("Non-finite loss detected — skipping ternary state update",
+                          RuntimeWarning, stacklevel=2)
+            self._clear_ternary_hooks()
+            self.zero_grad(set_to_none=True)
+            return
+        if loss_components is not None:
+            self._componentwise_ternary_backward(loss_components, t_step, update_scales, accum_threshold)
+        else:
+            self._apply_regular_ternary_hooks(accum_threshold, update_scales, t_step, loss_signal)
+        self._clear_ternary_hooks()
+        self._clear_backward_update_flags()
+    def prepare_ternary_backward(self, loss_signal=None, update_scales=True):
+        """Configure streaming CUDA ternary updates before `loss.backward()`.
+        BigInt-scaled dense linear backward accumulates directly into int64
+        `corr_accum`, while legacy sparse tables still use int8 `T_accum`.
+        Calling this before backward lets the streaming path use the same
+        loss-scaled step that `_ternary_update_memory()` will finalize.
+        """
+        t_step = self._ternary_t_step(loss_signal)
+        for module in self.modules():
+            if hasattr(module, "T_accum") or hasattr(module, "corr_accum"):
+                module._backward_t_accum_step = t_step
+                module._backward_update_scales = bool(update_scales)
+                module._stream_backward_updates = True
+    def _clear_backward_update_flags(self):
+        for module in self.modules():
+            for attr in (
+                "_backward_t_accum_step",
+                "_backward_update_scales",
+                "_stream_backward_updates",
+                "_streamed_ternary_backward",
+                "_streamed_bigint_backward",
+            ):
+                if hasattr(module, attr):
+                    delattr(module, attr)
+    @staticmethod
+    def _ternary_t_step(loss_signal):
+        return 1
+    def _clear_ternary_hooks(self):
+        base_names = [
+            "_hook_grad_T_sign", "_hook_grad_2d", "_hook_x_2d", "_hook_T",
+            "_hook_sparse_indices", "_hook_sparse_grad_sign", "_hook_sparse_T",
+        ]
+        for module in self.modules():
+            if hasattr(module, "_T_accum_fp"):
+                delattr(module, "_T_accum_fp")
+            for hook_name in base_names:
+                if hasattr(module, hook_name):
+                    delattr(module, hook_name)
+            for hook_name in list(vars(module).keys()):
+                if hook_name.startswith((
+                    "_hook_grad_T_sign_", "_hook_grad_2d_", "_hook_x_2d_", "_hook_T_",
+                    "_hook_sparse_indices_", "_hook_sparse_grad_sign_", "_hook_sparse_T_",
+                )):
+                    delattr(module, hook_name)
+    def _componentwise_ternary_backward(self, loss_components, t_step, update_scales, accum_threshold):
+        from arbitor.kernel.ternary_scale import _COMPONENT_CONTEXT
+        self.prepare_ternary_backward(loss_components.total, update_scales=update_scales)
+        active = [(n, t, w) for n, t, w in loss_components.active_fields
+                  if t is not None and t.dim() == 0 and t.requires_grad and float(w) != 0.0]
+        for idx, (name, comp_tensor, weight) in enumerate(active):
+            retain = idx < len(active) - 1
+            _COMPONENT_CONTEXT.set(name, weight)
+            try:
+                comp_tensor.backward(retain_graph=retain)
+            finally:
+                _COMPONENT_CONTEXT.clear()
+            self._consume_component_hooks(name, weight, t_step, update_scales, accum_threshold)
+        with torch.no_grad():
+            for module in self.modules():
+                if self._is_large_sparse_embedding(module):
+                    continue
+                if update_scales:
+                    self._step_E_from_accum(module)
+                self._apply_accumulated_flips(module, accum_threshold=accum_threshold)
+    def _consume_component_hooks(self, name, weight, t_step, update_scales, accum_threshold):
+        for module in self.modules():
+            sparse_idx_key = f"_hook_sparse_indices_{name}"
+            sparse_grad_key = f"_hook_sparse_grad_sign_{name}"
+            sparse_t_key = f"_hook_sparse_T_{name}"
+            if hasattr(module, sparse_idx_key) and hasattr(module, sparse_grad_key):
+                setattr(module, "_hook_sparse_indices", getattr(module, sparse_idx_key))
+                setattr(module, "_hook_sparse_grad_sign", getattr(module, sparse_grad_key))
+                if hasattr(module, sparse_t_key):
+                    setattr(module, "_hook_sparse_T", getattr(module, sparse_t_key))
+                if update_scales and hasattr(module, "update_E"):
+                    module._e_accum_threshold = 8
+                    module.update_E()
+                if hasattr(module, "T_accum"):
+                    module._t_accum_step = max(1, int(round(abs(float(weight)) * t_step)))
+                if hasattr(module, "ternary_step"):
+                    module.ternary_step(accum_threshold=accum_threshold)
+                for key in (sparse_idx_key, sparse_grad_key, sparse_t_key):
+                    if hasattr(module, key):
+                        delattr(module, key)
+                continue
+            dense_key = f"_hook_grad_T_sign_{name}"
+            dense_t_key = f"_hook_T_{name}"
+            if hasattr(module, dense_key):
+                grad_sign = getattr(module, dense_key)
+                hook_t = getattr(module, dense_t_key, None)
+                self._accumulate_component_grad_continuous(
+                    module, grad_sign, weight, t_step,
+                )
+                delattr(module, dense_key)
+                if hasattr(module, dense_t_key):
+                    delattr(module, dense_t_key)
+            grad_key = f"_hook_grad_2d_{name}"
+            x_key = f"_hook_x_2d_{name}"
+            if not hasattr(module, grad_key) or not hasattr(module, x_key):
+                continue
+            comp_grad = getattr(module, grad_key)
+            comp_x = getattr(module, x_key)
+            if torch.isfinite(comp_grad).all() and torch.isfinite(comp_x).all():
+                raw_grad = torch.clamp(comp_grad.transpose(0, 1) @ comp_x, -10.0, 10.0)
+                self._accumulate_component_grad_continuous(
+                    module, raw_grad, weight, t_step,
+                )
+            delattr(module, grad_key)
+            delattr(module, x_key)
+    def _accumulate_component_grad_continuous(self, module, raw_grad, weight, t_step):
+        """Component loss accumulation without persistent float optimizer state."""
+        if not hasattr(module, "_T_shape"):
+            return
+        shape = tuple(int(x) for x in module._T_shape.tolist())
+        if tuple(raw_grad.shape) != shape:
+            return
+        with torch.no_grad():
+            step = max(1, int(round(abs(float(weight)) * t_step)))
+            if float(weight) < 0:
+                step = -step
+            if hasattr(module, "corr_accum") and hasattr(module, "_accumulate_corr_from_grad_sign"):
+                signed = raw_grad.sign().to(device=module.corr_accum.device, dtype=torch.int8)
+                module._accumulate_corr_from_grad_sign(signed, corr_step=step)
+                return
+            if not hasattr(module, "T_accum") or tuple(module.T_accum.shape) != shape:
+                return
+            if hasattr(module, "_T_accum_fp"):
+                delattr(module, "_T_accum_fp")
+            signed = raw_grad.sign().to(device=module.T_accum.device, dtype=torch.int8)
+            module.T_accum.copy_(
+                torch.clamp(
+                    module.T_accum.to(torch.int16) - signed.to(torch.int16) * step,
+                    -127,
+                    127,
+                ).to(torch.int8)
+            )
+    def _apply_regular_ternary_hooks(self, accum_threshold, update_scales, t_step, loss_signal):
+        for module in self.modules():
+            is_bigint = hasattr(module, "corr_accum") and hasattr(module, "_accumulate_corr_from_grad_sign")
+            is_legacy = hasattr(module, "T_accum") or hasattr(module, "E_accum")
+            if is_bigint or is_legacy:
+                self._prepare_per_group_threshold(module)
+            streamed = bool(getattr(module, "_streamed_ternary_backward", False))
+            has_hook = (
+                hasattr(module, "_hook_grad_T_sign")
+                or (hasattr(module, "_hook_grad_2d") and hasattr(module, "_hook_x_2d"))
+                or (hasattr(module, "_hook_sparse_indices") and hasattr(module, "_hook_sparse_grad_sign"))
+            )
+            bigint_streamed = bool(getattr(module, "_streamed_bigint_backward", False))
+            if (streamed or bigint_streamed) and not has_hook:
+                if streamed and update_scales:
+                    self._step_E_from_accum(module)
+                if streamed:
+                    had_flip = self._apply_accumulated_flips(module, accum_threshold=accum_threshold)
+                    self._record_flip_health(module, had_flip)
+                if hasattr(module, "per_group_threshold"):
+                    del module.per_group_threshold
+                continue
+            if has_hook:
+                if hasattr(module, "_hook_grad_T_sign") and hasattr(module, "_accumulate_corr_from_grad_sign"):
+                    module._accumulate_corr_from_grad_sign(module._hook_grad_T_sign)
+                    del module._hook_grad_T_sign
+                if hasattr(module, "ternary_step"):
+                    module.ternary_step(accum_threshold=accum_threshold)
+            if hasattr(module, "per_group_threshold"):
+                del module.per_group_threshold
+    def _prepare_per_group_threshold(self, module):
+        if self._is_large_sparse_embedding(module):
+            module.per_group_threshold = None
+            return
+        if hasattr(module, "corr_accum") and not hasattr(module, "T_accum"):
+            module.per_group_threshold = None
+            return
+        if not hasattr(module, "E") or not hasattr(module, "_T_shape"):
+            module.per_group_threshold = None
+            return
+        shape = tuple(int(x) for x in module._T_shape.tolist())
+        out_dim, in_dim = shape
+        gpr = _ceil_div(in_dim, module.group_size)
+        E_view = module.E.view(out_dim, gpr).float()
+        threshold_g = 8.0 + 0.25 * torch.min(E_view.abs(), torch.tensor(32.0, device=E_view.device))
+        module.per_group_threshold = torch.clamp(threshold_g, max=16.0).to(torch.int8).reshape(-1)
+    @staticmethod
+    def _is_large_sparse_embedding(module):
+        return (
+            hasattr(module, "num_embeddings")
+            and hasattr(module, "sparse_threshold")
+            and module.num_embeddings >= module.sparse_threshold
+        )
+    @staticmethod
+    def _step_E_from_accum(module):
+        if hasattr(module, "corr_accum"):
+            return  # BigInt modules don't use E_accum threshold flips
+        if not hasattr(module, "E") or not hasattr(module, "E_accum"):
+            return
+        threshold = int(getattr(module, "_e_accum_threshold", 8))
+        accum = module.E_accum.to(torch.int16)
+        step = torch.where(
+            accum >= threshold,
+            torch.ones_like(accum, dtype=torch.int16),
+            torch.where(accum <= -threshold, torch.full_like(accum, -1, dtype=torch.int16), torch.zeros_like(accum, dtype=torch.int16)),
+        )
+        if step.any():
+            module.E = torch.clamp(module.E.to(torch.int16) + step, -128, 127).to(torch.int8)
+            module.E_accum = (accum - step * threshold).to(torch.int8)
+    @staticmethod
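+    def _apply_accumulated_flips(module, accum_threshold=3):
+        """Packed-byte carry: where T_accum is above +1 or below -1, move the trit one step by adding or subtracting 3^pos in its packed byte. `accum_threshold` is accepted but not used here."""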
+        if not hasattr(module, "T_accum") or not hasattr(module, "T_packed") or not hasattr(module, "_T_shape"):
+            return False
+        shape = tuple(int(x) for x in module._T_shape.tolist())
+        if tuple(module.T_accum.shape) != shape:
+            return False
+        carry_up = module.T_accum > 1
+        carry_down = module.T_accum < -1
+        if not carry_up.any() and not carry_down.any():
+            return False
+        dev = module.T_packed.device
+        out_dim, in_dim = shape
+        pows = torch.tensor([1, 3, 9, 27, 81], device=dev, dtype=torch.int16)
+        pk = module.T_packed.to(torch.int16).clone()
+        for p in range(5):
+            if p >= in_dim:
+                continue
+            cols = torch.arange(p, in_dim, 5, device=dev)
+            if cols.numel() == 0:
+                continue
+            is_up = carry_up[:, cols]
+            is_dn = carry_down[:, cols]
+            if not is_up.any() and not is_dn.any():
+                continue
+            rows_2d = torch.arange(out_dim, device=dev)[:, None]
+            lin_idx = rows_2d * in_dim + cols[None, :]
+            byte_idx = lin_idx // 5
+            pv = pk[byte_idx]
+            p_up = (pv + pows[p]).clamp(0, 242)
+            p_dn = (pv - pows[p]).clamp(0, 242)
+            pk[byte_idx] = torch.where(is_up, p_up, torch.where(is_dn, p_dn, pv))
+        module.T_packed = pk.to(torch.uint8)
+        # Reset T_accum to 0 on carry so W = T_accum × T doesn't jump
+        mask = carry_up | carry_down
+        module.T_accum[mask] = torch.zeros_like(module.T_accum[mask])
+        return True
+    @staticmethod
+    def _record_flip_health(module, had_flip):
+        if not hasattr(module, "T_accum"):
+            return
+        steps_since = getattr(module, "_steps_since_flip", 0)
+        module._steps_since_flip = 0 if had_flip else steps_since + 1
+        module._had_flip = False
+    def generate(self, idx, max_new_token, temperature=1.0, images=None, audio=None,
+                 conversation_id=None, top_k=None, min_new_tokens=0, return_metadata=False):
+        if self.kv_ledger is not None and self.kv_ledger.size == 0:
+            with torch.no_grad():
+                for token_id in idx.reshape(-1).tolist():
+                    self.kv_ledger.append(int(token_id))
+                    self.kq_cache.append(int(token_id))
+        for i in range(max_new_token):
+            idx_cond = idx[:, -CTX:]
+            logits, _, _, _ = self(idx_cond, images=images, audio=audio, timestep=i, output_mode="text")
+            last_logits = logits[:, -1, :] / temperature
+            # top-k filtering
+            if top_k is not None and top_k > 0:
+                v, _ = torch.topk(last_logits, min(top_k, last_logits.size(-1)))
+                kth = v[:, -1].unsqueeze(-1).expand_as(last_logits)
+                last_logits = last_logits.where(last_logits >= kth, float('-inf'))
+            probs = F.softmax(last_logits, dim=-1)
+            idx_next = torch.multinomial(probs, num_samples=1)
+            idx = torch.cat([idx, idx_next], dim=1)
+        # The loop has no early stop, so exactly max_new_token tokens are appended;
+        # min_new_tokens is accepted but not enforced here.
+        generated = max_new_token
+        if return_metadata:
+            return {
+                "tokens": idx,
+                "n_generated": generated,
+                "temperature": temperature,
+            }
+        return idx
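`_apply_accumulated_flips` in this file updates trits in place inside packed bytes by adding or subtracting `3^pos`, with five trits per byte and byte values in 0..242. The sketch below illustrates that arithmetic on a single byte. It assumes each trit is stored as the base-3 digit `trit + 1` with slot `pos` weighted by `3**pos`; the real layout is defined by `pack_ternary`, which is not shown in this part of the diff. Unlike the whole-byte clamp to 0..242 in the diff, this sketch saturates each slot separately so a step never spills into a neighbouring trit.

```python
POWERS = (1, 3, 9, 27, 81)  # 3**pos for the five trit slots in one byte

def pack_trits(trits):
    """Pack five trits from {-1, 0, 1} into one integer in 0..242 (digit = trit + 1)."""
    assert len(trits) == 5
    return sum((t + 1) * POWERS[pos] for pos, t in enumerate(trits))

def unpack_trits(byte):
    """Inverse of pack_trits: recover the five trits from a packed byte."""
    return [(byte // POWERS[pos]) % 3 - 1 for pos in range(5)]

def carry(byte, pos, direction):
    """Move the trit in slot `pos` by direction (+1 or -1), saturating at -1 and +1."""
    digit = (byte // POWERS[pos]) % 3
    new_digit = min(2, max(0, digit + direction))
    return byte + (new_digit - digit) * POWERS[pos]

packed = pack_trits([0, -1, 1, 0, 0])
bumped = carry(packed, pos=1, direction=+1)     # slot 1 goes from -1 to 0
saturated = carry(bumped, pos=2, direction=+1)  # slot 2 is already +1, so nothing changes
```

Moving one trit is a single integer add on the packed byte, which is why the method can work on `T_packed` without unpacking the whole weight.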

arbitor/optim/__init__.py ADDED

File without changes

arbitor/optim/sign_sgd.py ADDED

	@@ -0,0 +1,45 @@

+import torch
+from torch.optim import Optimizer
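+class SignSGD(Optimizer):
+    """Sign-based SGD: p <- p - lr * (sign(grad) + weight_decay * sign(p)).
+    Keeps no per-parameter optimizer state (no momentum or variance buffers).
+    """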
+    def __init__(self, params, lr=1e-2, weight_decay=0.0):
+        defaults = dict(lr=lr, weight_decay=weight_decay)
+        super().__init__(params, defaults)
+    @torch.no_grad()
+    def step(self, closure=None):
+        loss = None
+        if closure is not None:
+            with torch.enable_grad():
+                loss = closure()
+        for group in self.param_groups:
+            lr = group["lr"]
+            wd = group["weight_decay"]
+            for p in group["params"]:
+                if p.grad is None:
+                    continue
+                grad = p.grad
+                if grad.is_sparse:
+                    grad = grad.to_dense()
+                update = grad.sign()
+                if wd > 0:
+                    update = update + wd * p.sign()
+                p.add_(-lr * update)
+        return loss
+    @torch.no_grad()
+    def get_memory_mb(self, params=None) -> float:
+        """Return the size of the parameters themselves in MiB (this optimizer adds no state)."""
+        if params is None:
+            params = []
+            for group in self.param_groups:
+                params.extend(group["params"])
+        total_bytes = sum(p.numel() * p.element_size() for p in params)
+        return total_bytes / (1024 * 1024)
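The update rule in `SignSGD.step` uses only the sign of each gradient, plus an optional sign-based decay term. A minimal sketch on plain Python floats, assuming scalar parameters; the helper names `sign` and `sign_sgd_step` are illustrative only:

```python
def sign(x):
    """Return -1, 0 or +1 depending on the sign of x."""
    return (x > 0) - (x < 0)

def sign_sgd_step(params, grads, lr=1e-2, weight_decay=0.0):
    """One SignSGD step: p <- p - lr * (sign(g) + weight_decay * sign(p))."""
    updated = []
    for p, g in zip(params, grads):
        update = sign(g)
        if weight_decay > 0:
            update = update + weight_decay * sign(p)
        updated.append(p - lr * update)
    return updated
```

Every parameter with a non-zero gradient moves by exactly `lr` (before decay), regardless of gradient magnitude, which is why no momentum or variance buffers are needed.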

arbitor/profiling.py ADDED

	@@ -0,0 +1,196 @@

+"""
+Profiling utilities: torch.profiler wrapper and analysis tools.
+Following D-103: profile first, optimize only hot paths.
+Uses torch.profiler to identify training loop bottlenecks.
+"""
+import sys
+import os
+import json
+import math
+import torch
+sys.path.insert(0, os.path.dirname(__file__))
+from .main import ARBModel
+from .config import VOCAB, CTX
+def profile_training(model, train_data, device, n_steps=20, warmup_steps=5,
+                     top_k=10, batch_size=64, ctx=CTX):
+    """
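+    Profile N forward passes using torch.profiler (run under torch.no_grad(), no backward pass).
+    Runs warmup steps without tracing, then profiled steps with CPU activity tracing
+    (plus CUDA tracing when device == 'cuda'). Returns the top-K operations as a
+    list of dicts and exports a Chrome trace to /tmp/profiler_trace.json.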
+    Args:
+        model: ARBModel instance
+        train_data: 1D byte tensor of training data
+        device: 'cuda' or 'cpu'
+        n_steps: Number of profiled training steps
+        warmup_steps: Steps before profiling begins (no tracing)
+        top_k: Number of top operations to return
+        batch_size: Batch size for each training step
+        ctx: Context window length
+    Returns:
+        List of dicts with keys: op_name, cuda_time_us, cpu_time_us, calls
+    """
+    model.train()
+    prof = None
+    if device == "cuda":
+        prof = torch.profiler.profile(
+            activities=[
+                torch.profiler.ProfilerActivity.CPU,
+                torch.profiler.ProfilerActivity.CUDA,
+            ],
+            record_shapes=True,
+            with_stack=True,
+            with_flops=True,
+        )
+    else:
+        prof = torch.profiler.profile(
+            activities=[torch.profiler.ProfilerActivity.CPU],
+            record_shapes=True,
+            with_stack=False,
+        )
+    # Warmup steps (no profiling)
+    for _ in range(warmup_steps):
+        ix = torch.randint(0, len(train_data) - ctx - 1, (batch_size,))
+        x = torch.stack([train_data[j: j + ctx] for j in ix])
+        targets = x[:, 3:]
+        x = x.to(device)
+        targets = targets.to(device)
+        with torch.no_grad():
+            model(x, targets=targets)
+    # Profiled steps
+    prof.start()
+    for _ in range(n_steps):
+        ix = torch.randint(0, len(train_data) - ctx - 1, (batch_size,))
+        x = torch.stack([train_data[j: j + ctx] for j in ix])
+        targets = x[:, 3:]
+        x = x.to(device)
+        targets = targets.to(device)
+        with torch.no_grad():
+            model(x, targets=targets)
+        if device == "cuda":
+            torch.cuda.synchronize()
+    prof.stop()
+    # Process profiler output
+    if device == "cuda":
+        key_avg = prof.key_averages()
+        table = key_avg.table(sort_by="cuda_time_total", row_limit=top_k)
+    else:
+        key_avg = prof.key_averages()
+        table = key_avg.table(sort_by="cpu_time_total", row_limit=top_k)
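+    # Extract top-K entries, sorted the same way as the printed table
+    # (key_averages() is not ordered by time on its own).
+    def _total_time(evt):
+        if device == "cuda":
+            # device_time_total replaces the deprecated cuda_time_total in recent PyTorch
+            return getattr(evt, "device_time_total", None) or getattr(evt, "cuda_time_total", 0)
+        return getattr(evt, "cpu_time_total", 0)
+    events = sorted(key_avg, key=_total_time, reverse=True)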
+    top_results = []
+    for evt in events[:top_k]:
+        # device_time replaces deprecated cuda_time in recent PyTorch
+        cuda_t = (evt.device_time if hasattr(evt, 'device_time') and evt.device_time is not None
+                  else evt.cuda_time if hasattr(evt, 'cuda_time') else 0)
+        entry = {
+            "op_name": evt.key if hasattr(evt, 'key') else str(evt),
+            "cuda_time_us": cuda_t,
+            "cpu_time_us": evt.cpu_time if hasattr(evt, 'cpu_time') else 0,
+            "calls": evt.count if hasattr(evt, 'count') else 1,
+        }
+        top_results.append(entry)
+    # Print summary
+    print("\n=== Profiling Results (Top-{} Hot Paths) ===".format(top_k))
+    print(table)
+    print("============================================\n")
+    # Save profiler output as JSON
+    prof.export_chrome_trace("/tmp/profiler_trace.json")
+    return top_results
+def analyze_profiler_output(prof_path):
+    """
+    Load saved profiler JSON output and extract key insights.
+    Args:
+        prof_path: Path to saved profiler JSON file
+    Returns:
+        List of dicts with op_name, cuda_time_us, cpu_time_us, calls
+    """
+    with open(prof_path, "r") as f:
+        data = json.load(f)
+    # Profiler JSON can be a dict with 'traceEvents' or a flat list
+    if isinstance(data, dict) and "traceEvents" in data:
+        events = data["traceEvents"]
+    elif isinstance(data, list):
+        events = data
+    else:
+        events = []
+    # Aggregate events by name
+    op_stats = {}
+    for evt in events:
+        if isinstance(evt, dict):
+            name = evt.get("name", "unknown")
+            dur = evt.get("dur", 0)  # microseconds
+            cat = evt.get("cat", "")
+            if name not in op_stats:
+                op_stats[name] = {"cuda_time_us": 0, "cpu_time_us": 0, "calls": 0}
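+            # GPU kernels are typically exported with category "kernel"; memcpy/memset use "gpu_*"
+            if "gpu" in cat.lower() or cat.lower() == "kernel":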
+                op_stats[name]["cuda_time_us"] += dur
+            elif "cpu" in cat.lower() or cat == "":
+                op_stats[name]["cpu_time_us"] += dur
+            op_stats[name]["calls"] += 1
+    # Sort by CUDA time descending
+    sorted_ops = sorted(
+        op_stats.items(),
+        key=lambda x: x[1]["cuda_time_us"],
+        reverse=True,
+    )
+    results = []
+    for name, stats in sorted_ops:
+        results.append({
+            "op_name": name,
+            "cuda_time_us": stats["cuda_time_us"],
+            "cpu_time_us": stats["cpu_time_us"],
+            "calls": stats["calls"],
+        })
+    # Print formatted summary
+    print("\n=== Profiler Analysis ===")
+    print(f"{'Operation':<40} {'CUDA Time (us)':>15} {'CPU Time (us)':>15} {'Calls':>8}")
+    print("-" * 80)
+    for r in results[:20]:
+        print(f"{r['op_name']:<40} {r['cuda_time_us']:>15.0f} {r['cpu_time_us']:>15.0f} {r['calls']:>8}")
+    # Identify dominating patterns
+    total_cuda = sum(r["cuda_time_us"] for r in results)
+    if total_cuda > 0:
+        print("\n=== Hot Path Analysis ===")
+        for r in results[:5]:
+            pct = (r["cuda_time_us"] / total_cuda) * 100
+            label = ""
+            if "vq" in r["op_name"].lower():
+                label = " → VQ candidate for Triton kernel"
+            elif "moe" in r["op_name"].lower() or "scatter" in r["op_name"].lower():
+                label = " → MoE dispatch candidate"
+            elif "embed" in r["op_name"].lower() or "gather" in r["op_name"].lower():
+                label = " → Embedding gather (existing Triton kernel)"
+            elif "mm" in r["op_name"].lower() or "linear" in r["op_name"].lower():
+                label = " → General matmul (torch.compile candidate)"
+            print(f"  {r['op_name']:<40} {pct:>5.1f}%{label}")
+    print("============================================\n")
+    return results
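The aggregation in `analyze_profiler_output` boils down to grouping trace events by name and summing their `dur` fields. Below is a self-contained sketch of that idea on an in-memory trace; the function name `aggregate_trace_events` and the sample events are made up for illustration and do not come from a real profiler run.

```python
from collections import defaultdict

def aggregate_trace_events(trace):
    """Sum durations (microseconds) and call counts per event name."""
    events = trace.get("traceEvents", []) if isinstance(trace, dict) else trace
    stats = defaultdict(lambda: {"total_us": 0, "calls": 0})
    for evt in events:
        # Only complete events ("ph": "X") carry a duration in the Trace Event format.
        if not isinstance(evt, dict) or evt.get("ph") != "X":
            continue
        entry = stats[evt.get("name", "unknown")]
        entry["total_us"] += evt.get("dur", 0)
        entry["calls"] += 1
    # Most expensive operations first.
    return sorted(stats.items(), key=lambda kv: kv[1]["total_us"], reverse=True)

# Hypothetical trace fragment (names and durations are illustrative only).
sample = {"traceEvents": [
    {"name": "aten::mm", "ph": "X", "cat": "cpu_op", "dur": 120},
    {"name": "aten::mm", "ph": "X", "cat": "cpu_op", "dur": 80},
    {"name": "aten::embedding", "ph": "X", "cat": "cpu_op", "dur": 50},
    {"name": "process_name", "ph": "M"},
]}
ranked = aggregate_trace_events(sample)
```

Filtering on `"ph": "X"` keeps metadata records out of the call counts, which the name-only grouping would otherwise include.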

arbitor/sequencers.py ADDED

	@@ -0,0 +1,218 @@

+"""Sequencer modules — input processing for all modalities."""
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from einops import rearrange
+from .kernel.ternary_scale import TernaryScaleTensor, TScaleType, TernaryRMSNorm, GROUP_SIZES, _HAS_TRITON, _HAS_TILELANG
+if _HAS_TRITON:
+    import triton
+    import triton.language as tl
+else:
+    triton = None
+    tl = None
+try:
+    from .kernel.ternary_scale import _TritonTernaryEmbedFn
+except ImportError:
+    _TritonTernaryEmbedFn = None
+from .converters.convert_to_ternary8 import pack_ternary, unpack_ternary
+from math import ceil as _ceil
+def _ceil_div(a, b):
+    """Integer ceiling division; returns 0 when b <= 0."""
+    return -(-a // b) if b > 0 else 0
+from .config import VOCAB, EMBEDDING_DIM, HIDDEN_DIM, AUDIO_SR, AUDIO_FRAME_RATE
+class ByteEmbedding(nn.Module):
+    """Byte-level embedding via packed ternary + BigInt correlation.
+    All training state is integer. T_accum/E_accum replaced by
+    corr_accum (int64 per group, never clips or resets).
+    S = 2^(E + K × mean_corr)  where mean_corr = corr_accum / (step × gs)
+    """
+    def __init__(self, tscale_type=TScaleType.T32):
+        super().__init__()
+        self.tscale_type = tscale_type
+        self.threshold = 0.05
+        self.group_size = GROUP_SIZES.get(tscale_type, GROUP_SIZES[TScaleType.T64])
+        shape = (VOCAB, EMBEDDING_DIM)
+        init_std = 0.02
+        init_threshold = min(self.threshold, 0.5 * init_std)
+        self.threshold = init_threshold
+        w_init = torch.randn(VOCAB, EMBEDDING_DIM) * init_std
+        T_init = w_init.sign() * (w_init.abs() > init_threshold).to(w_init.dtype)
+        packed_T, T_shape, T_pad = pack_ternary(T_init)
+        self.register_buffer("T_packed", packed_T)
+        self.register_buffer("_T_shape", torch.tensor([VOCAB, EMBEDDING_DIM], dtype=torch.long))
+        self.register_buffer("_T_pad", torch.tensor(T_pad, dtype=torch.long))
+        out_dim, in_dim = shape
+        gpr = _ceil_div(in_dim, self.group_size)
+        total_in = gpr * self.group_size
+        padded = torch.zeros(out_dim, total_in)
+        abs_w = w_init.abs()
+        padded[:, :in_dim] = abs_w
+        grouped = padded.view(out_dim, gpr, self.group_size)
+        grp_means = grouped.mean(dim=2)
+        E_vals = torch.where(grp_means > 0, grp_means, torch.ones_like(grp_means))
+        self.register_buffer("E", E_vals.flatten().log2().clamp(-128, 127).to(torch.int8))
+        # BigInt correlation accumulator (replaces T_accum + E_accum)
+        n_grp = out_dim * gpr
+        self.register_buffer("corr_accum", torch.zeros(n_grp, dtype=torch.int64))
+        self.register_buffer("step_counter", torch.zeros(1, dtype=torch.int64))
+        self.norm = TernaryRMSNorm(EMBEDDING_DIM, tscale_type=tscale_type)
+    def _get_T(self):
+        return unpack_ternary(self.T_packed, tuple(self._T_shape.tolist()), int(self._T_pad.item()))
+    def _get_S(self):
+        gpr = _ceil_div(EMBEDDING_DIM, self.group_size)
+        e_adj = self.E.float()
+        step = int(self.step_counter.item())
+        if step > 0:
+            from .kernel.ternary_scale import _bigint_corr_strength
+            denom = max(step * self.group_size, 1)
+            e_adj = e_adj + (self.corr_accum.float() / denom) * _bigint_corr_strength()
+        E_exp = e_adj.view(VOCAB, gpr).repeat_interleave(self.group_size, dim=1)
+        if E_exp.shape[1] > EMBEDDING_DIM:
+            E_exp = E_exp[:, :EMBEDDING_DIM]
+        return torch.exp2(E_exp)
+    @torch.no_grad()
+    def _accumulate_corr_from_grad_sign(self, grad_sign, corr_step=1):
+        if grad_sign is None:
+            return
+        shape = tuple(self._T_shape.tolist())
+        out_dim, in_dim = shape
+        if tuple(grad_sign.shape) != shape:
+            return
+        gs = self.group_size
+        T = self._get_T().to(device=grad_sign.device, dtype=torch.int16)
+        signed = grad_sign.to(torch.int16) * T
+        gpr = _ceil_div(in_dim, gs)
+        total_in = gpr * gs
+        if total_in > in_dim:
+            signed = F.pad(signed, (0, total_in - in_dim))
+        score = signed.view(out_dim, gpr, gs).sum(dim=2, dtype=torch.int16)
+        self.corr_accum -= score.flatten().to(dtype=torch.int64) * int(corr_step)
+        self.step_counter += abs(int(corr_step))
+    def forward(self, x):
+        if x.is_cuda and _HAS_TRITON and _TritonTernaryEmbedFn is not None:
+            _dummy = torch.zeros(1, device=x.device, requires_grad=True)
+            emb = _TritonTernaryEmbedFn.apply(x, _dummy, self)
+            return self.norm(emb)
+        T = self._get_T()
+        S = self._get_S()
+        w_eff = S * T.float()
+        w_eff_grad = w_eff.detach().requires_grad_(True)
+        def capture_w_grad(grad_w):
+            self._hook_grad_T_sign = grad_w.sign().to(torch.int8)
+        w_eff_grad.register_hook(capture_w_grad)
+        out = self.norm(F.embedding(x, w_eff_grad))
+        return out
+    def ternary_step(self, accum_threshold=3):
+        if hasattr(self, "_hook_grad_T_sign"):
+            if hasattr(self, "_accumulate_corr_from_grad_sign"):
+                self._accumulate_corr_from_grad_sign(self._hook_grad_T_sign)
+            del self._hook_grad_T_sign
+    def update_E(self, loss_signal=None):
+        pass  # E is fixed; S adjusted via corr_accum
+class Sequencer(nn.Module):
+    def __init__(self, modality, window_size, tscale_type=TScaleType.T32):
+        super().__init__()
+        self.modality = modality
+        self.window_size = window_size
+        self.tscale_type = tscale_type
+    def forward(self, x):
+        raise NotImplementedError
+class TextSequencer(Sequencer):
+    def __init__(self, tscale_type=TScaleType.T32):
+        super().__init__(modality='text', window_size=3, tscale_type=tscale_type)
+        self.projection = TernaryScaleTensor(EMBEDDING_DIM * self.window_size, HIDDEN_DIM, tscale_type=tscale_type)
+        self.norm = TernaryRMSNorm(HIDDEN_DIM, tscale_type=tscale_type)
+    def forward(self, x):
+        trigrams = x.unfold(dimension=1, size=self.window_size, step=1)
+        trigrams = rearrange(trigrams, 'b t d w -> b t (d w)')
+        relational = self.projection(trigrams)
+        return self.norm(relational)
+class VAE2DSequencer(Sequencer):
+    def __init__(self, tscale_type=TScaleType.T32, quantize=None, device="cpu"):
+        super().__init__(modality='image', window_size=1, tscale_type=tscale_type)
+        from .encoders.vae2d import load_vae2d as _load_vae2d
+        self.vae = _load_vae2d(device=device, quantize=quantize)
+        self.vae_device = torch.device(device)
+        self.project = TernaryScaleTensor(4, HIDDEN_DIM, tscale_type=tscale_type)
+        self.norm = TernaryRMSNorm(HIDDEN_DIM, tscale_type=tscale_type)
+    def forward(self, x):
+        if x.device != self.vae_device:
+            x = x.to(self.vae_device)
+        latent = self.vae(x)
+        tokens = rearrange(latent, 'b c h w -> b (h w) c')
+        out = self.project(tokens)
+        return self.norm(out)
+class VAEAudioSequencer(Sequencer):
+    def __init__(self, tscale_type=TScaleType.T32, quantize=None, device="cpu"):
+        super().__init__(modality='audio', window_size=1, tscale_type=tscale_type)
+        from .encoders.vae2d import load_vae2d as _load_vae2d
+        from .encoders.mel_frontend import MelSpectrogram3Band as _Mel3Band
+        self.vae = _load_vae2d(device=device, quantize=quantize)
+        self.vae_device = torch.device(device)
+        self.mel = _Mel3Band(sample_rate=AUDIO_SR)
+        self.project = TernaryScaleTensor(4, HIDDEN_DIM, tscale_type=tscale_type)
+        self.norm = TernaryRMSNorm(HIDDEN_DIM, tscale_type=tscale_type)
+    def forward(self, waveform):
+        if waveform.dim() == 1:
+            waveform = waveform.unsqueeze(0)
+        elif waveform.dim() == 3:
+            if waveform.shape[1] == 1:
+                waveform = waveform.squeeze(1)
+            else:
+                waveform = waveform.mean(dim=1)
+        spec = self.mel(waveform)
+        if spec.device != self.vae_device:
+            spec = spec.to(self.vae_device)
+        latent = self.vae(spec)
+        tokens = rearrange(latent, 'b c h w -> b (h w) c')
+        out = self.project(tokens)
+        return self.norm(out)
+class MultimodalSequencer(nn.Module):
+    def __init__(self, tscale_type=TScaleType.T32, enable_text=True, enable_image=True, enable_audio=True):
+        super().__init__()
+        self.text = TextSequencer(tscale_type=tscale_type) if enable_text else None
+        self.image = VAE2DSequencer(tscale_type=tscale_type) if enable_image else None
+        self.audio = VAEAudioSequencer(tscale_type=tscale_type) if enable_audio else None
+        self.enabled_modalities = []
+        if enable_text:
+            self.enabled_modalities.append('text')
+        if enable_image:
+            self.enabled_modalities.append('image')
+        if enable_audio:
+            self.enabled_modalities.append('audio')
+    def forward(self, modality_inputs):
+        outputs = {}
+        for mod in self.enabled_modalities:
+            seq = getattr(self, mod)
+            if mod in modality_inputs and modality_inputs[mod] is not None and seq is not None:
+                outputs[mod] = seq(modality_inputs[mod])
+        return outputs
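`TextSequencer.forward` slides a window of three embeddings over the sequence and flattens each window before projecting it. A pure-Python sketch of that windowing for a single sequence, assuming embeddings are given as lists of floats; `trigram_windows` is a hypothetical helper name:

```python
def trigram_windows(embeddings, window=3):
    """Turn T vectors of size D into T - window + 1 vectors of size D * window.

    The flattening order matches 'b t d w -> b t (d w)': for each feature d,
    its values across the window positions are laid out consecutively.
    """
    dim = len(embeddings[0]) if embeddings else 0
    out = []
    for t in range(len(embeddings) - window + 1):
        chunk = embeddings[t:t + window]
        out.append([chunk[w][d] for d in range(dim) for w in range(window)])
    return out
```

Note that the output is `window - 1` positions shorter than the input, so a sequence of length `T` yields `T - 2` trigram tokens.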

arbitor/vq.py ADDED

	@@ -0,0 +1,89 @@

+"""VQ modules — vector quantization adapters."""
+import math
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from .kernel.ternary_scale import TernaryScaleTensor, TScaleType, TernaryRMSNorm
+from .components import TernaryVQCodebook
+from .config import EMBEDDING_DIM, HIDDEN_DIM, CODEBOOK_DIM, SHARED_VQ_SIZE, TIMESTAMP_MAX_PERIOD
+class SharedVQ(nn.Module):
+    """Single shared VQ codebook for all modalities (10M entries).
+    Each modality projects to the shared CODEBOOK_DIM=64 space, then
+    quantizes independently through the shared codebook. Text uses
+    CODEBOOK_DIM directly.
+    IDs are globally unique: all modalities share the same range [0, 10M).
+    """
+    def __init__(self, codebook_size=SHARED_VQ_SIZE, codebook_dim=CODEBOOK_DIM,
+                 tscale_type=TScaleType.T32, enable_image=True, enable_audio=True):
+        super().__init__()
+        codebook_size = SHARED_VQ_SIZE if codebook_size is None else codebook_size
+        self.codebook_size = codebook_size
+        self.codebook_dim = codebook_dim
+        # Per-modality input projections (their_dim → CODEBOOK_DIM)
+        self.text_proj = TernaryScaleTensor(HIDDEN_DIM, codebook_dim, tscale_type=tscale_type)
+        if enable_image:
+            self.image_proj = TernaryScaleTensor(HIDDEN_DIM, codebook_dim, tscale_type=tscale_type)
+        if enable_audio:
+            self.audio_proj = TernaryScaleTensor(HIDDEN_DIM, codebook_dim, tscale_type=tscale_type)
+        # Shared VQ codebook
+        self.vq = TernaryVQCodebook(
+            codebook_size=codebook_size,
+            codebook_dim=codebook_dim,
+            commitment_weight=1.0,
+            tscale_type=tscale_type,
+        )
+        self.modalities = ['text']
+        if enable_image:
+            self.modalities.append('image')
+        if enable_audio:
+            self.modalities.append('audio')
+    @staticmethod
+    def _sinusoidal_timestamp(seq_len, dim, max_period=TIMESTAMP_MAX_PERIOD, device=None):
+        freqs = torch.exp(-torch.arange(0, dim, 2, device=device).float() * (math.log(max_period) / dim))
+        t = torch.arange(seq_len, device=device).float().unsqueeze(1)
+        pe = torch.zeros(seq_len, dim, device=device)
+        pe[:, 0::2] = torch.sin(t * freqs)
+        pe[:, 1::2] = torch.cos(t * freqs)
+        return pe
+    def forward(self, modality_inputs, timestep=0):
+        outputs = []
+        vq_losses = {}
+        indices_dict = {}
+        for mod in self.modalities:
+            if mod not in modality_inputs or modality_inputs[mod] is None:
+                continue
+            x = modality_inputs[mod]
+            proj = getattr(self, f'{mod}_proj')
+            x_proj = proj(x)
+            quantized, idx, loss = self.vq(x_proj)
+            outputs.append(quantized)
+            vq_losses[f'{mod}_vq'] = loss
+            indices_dict[mod] = idx
+        combined = torch.cat(outputs, dim=1) if outputs else modality_inputs.get('text', None)
+        if combined is not None and timestep > 0:
+            ts_enc = self._sinusoidal_timestamp(combined.shape[1], combined.shape[2], device=combined.device)
+            combined = combined + ts_enc.unsqueeze(0)
+        return combined, vq_losses, indices_dict
+    @property
+    def total_codebook_size(self):
+        return self.codebook_size
+    @torch.no_grad()
+    def get_codebook_utilization(self):
+        cluster_size = self.vq.cluster_size
+        return (cluster_size > 0).float().mean().item()
+    @torch.no_grad()
+    def get_dead_code_count(self):
+        cluster_size = self.vq.cluster_size
+        return (cluster_size < self.vq.threshold_ema_dead_code).sum().item()
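The sinusoidal encoding built by `SharedVQ._sinusoidal_timestamp` interleaves sine and cosine terms at geometrically spaced frequencies. A stdlib-only sketch of the same formula follows; the default `max_period=10000.0` is an assumption, since the actual value of `TIMESTAMP_MAX_PERIOD` is defined in `config.py` and not shown here.

```python
import math

def sinusoidal_timestamp(seq_len, dim, max_period=10000.0):
    """Return a seq_len x dim table: even columns sin(t * f), odd columns cos(t * f)."""
    # One frequency per (sin, cos) pair: f_i = exp(-i * ln(max_period) / dim), i = 0, 2, 4, ...
    freqs = [math.exp(-i * math.log(max_period) / dim) for i in range(0, dim, 2)]
    table = [[0.0] * dim for _ in range(seq_len)]
    for t in range(seq_len):
        for j, f in enumerate(freqs):
            table[t][2 * j] = math.sin(t * f)
            if 2 * j + 1 < dim:  # guard for odd dim
                table[t][2 * j + 1] = math.cos(t * f)
    return table
```

Position 0 always encodes to alternating 0 and 1, and the first frequency is exactly 1, so column 0 is simply `sin(t)`.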

docs/ARB-RENAME-NOTE.md ADDED

	@@ -0,0 +1,62 @@

+---
+title: ARB System Rename
+date: 2026-05-18
+context: System renamed from MORPH to ARB (Any Relational Bit)
+---
+# ARB System — Rename Note
+## New Name
+The system has been renamed from **MORPH** to **ARB** (Any Relational Bit).
+- **ARB** = Any Relational Bit — the core ternary architecture
+- **ARBS** = ARB System — the full software system
+- **ARBitor** = The Python package name (`arbitor/`)
+## Package Structure
+All core system files now live under `arbitor/`:
+```
+models/Trigram/
+├── arbitor/                    # Core ARB system package
+│   ├── __init__.py            # Public API exports
+│   ├── trigram.py             # Core model (ARBModel replaces MORPHTernaryModel)
+│   ├── tscale.py              # Ternary scale tensors
+│   ├── convert_to_ternary.py  # 5-trit packing
+│   ├── convert_to_ternary*.py # Legacy converters
+│   ├── flash_vq.py            # FlashVQ codebook
+│   ├── ternary_audit.py       # Model state auditor
+│   ├── profiling.py           # Profiling utilities
+│   ├── train.py               # Training pipeline
+│   ├── optim/
+│   │   └── sign_sgd.py        # SignSGD optimizer
+│   └── encoders/              # Float sidecar encoders
+│       ├── __init__.py
+│       ├── audio_codec.py
+│       ├── audio_vq_encoder.py
+│       └── video_vae.py
+├── testing/                   # Tests (import from arbitor)
+├── .planning/                 # Planning docs (P0-P10 complete)
+├── TRUE-TERNARY-REFACTOR*.md  # Architecture refactor notes
+├── BENCHMARK.md               # Benchmark docs
+└── benchmark_true_ternary.py  # Benchmark scripts
+```
+## Import Changes
+| Before | After |
+|--------|-------|
+| `from trigram import ARBModel` | `from arbitor.trigram import ARBModel` |
+| `from tscale import TernaryScaleTensor` | `from arbitor.tscale import TernaryScaleTensor` |
+| `from optim.sign_sgd import SignSGD` | `from arbitor.optim.sign_sgd import SignSGD` |
+| `from encoders.video_vae import load_vae` | `from arbitor.encoders.video_vae import load_vae` |
+| (no equivalent) | `from arbitor import ARBModel` (shorthand via `arbitor/__init__.py`) |
+| `import trigram` | `from arbitor import trigram` |
+## Class Rename
+| Before | After |
+|--------|-------|
+| `MORPHTernaryModel` | `ARBModel` |

docs/arbs-tts/README.md ADDED

	@@ -0,0 +1,90 @@

+# ARBS Ternary Training System (TTS)
+## E1TM Format — Exponent-1 Ternary Mantissa
+E1TM encodes each weight group as **one int8 exponent shared across N ternary mantissas**.
+```
+W_eff[i] = S × T[i]    where T[i] ∈ {-1, 0, +1},  S = 2^{E + Δ}
+E  = int8 log₂ scale (persistent, per group)
+Δ  = 4 × corr_accum / (step × gs)  (from BigInt accumulator)
+S  = 2^{E+Δ} (float32, ephemeral — created per forward, discarded)
+```
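To make the format concrete, here is a minimal sketch of how one group's effective weights follow from the definitions above. `effective_weights` is a hypothetical helper, and `delta` stands in for the Δ term, which is 0 before any training step.

```python
def effective_weights(T, E, delta=0.0):
    """Compute W_eff[i] = S * T[i] for one group, with S = 2 ** (E + delta)."""
    if any(t not in (-1, 0, 1) for t in T):
        raise ValueError("T must contain only -1, 0, +1")
    scale = 2.0 ** (E + delta)  # ephemeral float scale shared by the whole group
    return [scale * t for t in T]
```

For example, a group with `E = -3` and `T = [-1, 0, 1, 1]` yields weights of magnitude 0.125 wherever `T` is non-zero; only the sign pattern and the shared exponent are stored.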
+### Format variants
+| Name | TScaleType | T per E | gs | E bpw | T bpw | Total bpw (inf) | Precision |
+|---|---|---|---|---|---|---|---|
+| E1TM4 | T4 | 4 | 4 | 2.000 | 1.58 | 3.58 | Highest |
+| E1TM6 | T6 | 6 | 6 | 1.333 | 1.58 | 2.91 | |
+| E1TM8 | T8 | 8 | 8 | 1.000 | 1.58 | 2.58 | |
+| E1TM16 | T16 | 16 | 16 | 0.500 | 1.58 | 2.08 | |
+| **E1TM32** | **T32** | **32** | **32** | **0.250** | **1.58** | **1.83** | **Default** |
+| E1TM64 | T64 | 64 | 64 | 0.125 | 1.58 | 1.71 | |
+| E1TM96 | T96 | 96 | 96 | 0.083 | 1.58 | 1.67 | Most packed |
+Higher T number = more T per E = less storage = coarser per-weight magnitude.
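The bits-per-weight columns follow from simple arithmetic: one 8-bit exponent amortized over the group, plus the information content of one trit. A short sketch, assuming the ideal log₂3 packing density; `bits_per_weight` is an illustrative name:

```python
import math

TRIT_BITS = math.log2(3)  # about 1.585 bits of information per ternary value

def bits_per_weight(group_size, exponent_bits=8, trit_bits=TRIT_BITS):
    """Inference storage per weight: shared exponent share plus one trit."""
    return exponent_bits / group_size + trit_bits

for gs in (4, 8, 16, 32, 64, 96):
    print(f"T{gs}: {bits_per_weight(gs):.2f} bpw")
```

Passing `trit_bits=1.6` instead models byte-aligned packing at 5 trits per byte, which costs slightly more than the ideal figure.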
+### Group sizes
+The TScaleType name is the group size:
+```
+TScaleType.T4  → gs = 4   → E shared across 4  ternary mantissas
+TScaleType.T32 → gs = 32  → E shared across 32 ternary mantissas
+TScaleType.T96 → gs = 96  → E shared across 96 ternary mantissas
+```
+### Persistent training state (all integer)
+| Buffer | Type | Size/weight | Role |
+|---|---|---|---|
+| T_packed | uint8 | 1.58 bpw | Base-3 packed ternary {-1,0,+1}, 5 trits/byte |
+| E | int8 | 8/N bpw | Log₂ scale, one per N-weight group |
+| corr_accum | int64 | 64/N bpw | BigInt accumulator for gradient sign votes |
+| step_counter | int64 | 0 bpw | Total steps processed |
+**No float32/16 anywhere in persistent state.** Float32 ephemeral `W_eff` is created per-forward and discarded after backward.
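Five trits fit in one byte because 3⁵ = 243 ≤ 256. The sketch below shows one possible base-3 packing with a round trip; it is an illustration of the idea only, and the digit order and padding convention are assumptions that may differ from the package's `pack_ternary` / `unpack_ternary`.

```python
def pack_trits(trits):
    """Pack values in {-1, 0, +1} into bytes, 5 trits per byte (base 3)."""
    pad = (-len(trits)) % 5
    digits = [t + 1 for t in trits] + [1] * pad  # map to {0, 1, 2}; pad with trit 0
    out = bytearray()
    for i in range(0, len(digits), 5):
        value = 0
        for d in reversed(digits[i:i + 5]):  # first trit ends up least significant
            value = value * 3 + d
        out.append(value)  # at most 242, so it always fits in a byte
    return bytes(out), pad

def unpack_trits(packed, pad):
    """Inverse of pack_trits: recover the original list of trits."""
    trits = []
    for byte in packed:
        for _ in range(5):
            trits.append(byte % 3 - 1)
            byte //= 3
    return trits[:len(trits) - pad] if pad else trits
```

Seven trits, for instance, occupy two bytes with three padding trits, so byte-aligned packing works out to 8 / 5 = 1.6 bits per trit.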
+### Why ternary over binary or int4
+| Format | Values/weight | Packing efficiency | Null state |
+|---|---|---|---|
+| Binary | 2 | 1 bpw (100%) | No |
+| Ternary | 3 | 1.58 bpw ideal (log₂3); 5 trits/byte uses 243 of 256 byte states (≈ 95%) | **Yes** (T=0 = null) |
+| Int4 | 16 | 4 bpw (100%) | No |
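+Ternary's null state (T=0) provides structural sparsity: ≈38% of weights are zero (consistent with a 0.5σ threshold at initialization, as used in `ByteEmbedding`), so kernels can skip all-zero matmul tiles. Binary and int4 have no dedicated null value at comparable bpw.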
+### The BigInt difference
+Unlike conventional quantization where E is static after conversion, ARBS TTS trains **through** E via a BigInt correlation accumulator:
+```
+corr_accum[g] -= Σ (grad_sign × T)   # int64, never clips or resets
+Δ = 4 × corr_accum / (step × gs)      # continuous adjustment from integer division
+S = 2^{E + Δ}                          # effective scale (ephemeral float32)
+```
+The division `corr_accum / (step × gs)` is the **Big Number Calculator** operation: it converts the accumulated integer evidence into a continuous ratio. The accumulator itself never clips or resets, so there are no threshold flips and no discrete steps; the ratio is evaluated in floating point only when the scale is computed for a forward pass.
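The accumulator update and the resulting Δ can be sketched with plain Python integers, which are arbitrary precision by default. The helper names and the flat list layout are assumptions made for this example; the factor 4 is taken from the formula above.

```python
def accumulate_corr(corr_accum, step, grad_sign, T, group_size):
    """Subtract each group's sum of grad_sign * T from its accumulator.

    corr_accum: list of ints, one per group (updated in place).
    grad_sign, T: flat lists with values in {-1, 0, +1}.
    Returns the accumulator and the incremented step counter.
    """
    for g in range(len(corr_accum)):
        lo = g * group_size
        pairs = zip(grad_sign[lo:lo + group_size], T[lo:lo + group_size])
        corr_accum[g] -= sum(s * t for s, t in pairs)
    return corr_accum, step + 1

def delta(corr, step, group_size, strength=4):
    """Continuous exponent adjustment: strength * corr / (step * group_size)."""
    return strength * corr / max(step * group_size, 1)
```

When gradient signs consistently oppose the stored trits, the accumulator grows positive and Δ raises the group's scale; agreement pushes it the other way.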
+### Training vs inference
+| Phase | T_packed | E | corr_accum | step | S |
+|---|---|---|---|---|---|
+| Training | Read-only | Read-only | **Accumulates** | **Increments** | Computed from corr/step |
+| Inference (Option A) | Frozen | Frozen | Frozen | Frozen | Burned into checkpoint |
+| Inference (Option B) | Frozen | **Fused** | Discarded | Discarded | Static 2^{E_fused} |
+**Option A** (export): keep corr_accum + step for continuous S.
+**Option B** (fuse): `E_fused = round(E + 4 × corr_accum / (step × gs))`. This discards corr_accum and step, leaving only T_packed and E (1.58 + 8/N bpw, about 1.83 bpw for E1TM32).
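Fusing for Option B amounts to rounding the continuous exponent back to an integer. A minimal sketch under two assumptions not stated in the text: the result is clamped to the int8 range, and Python's built-in `round` (which rounds halves to the nearest even integer) is an acceptable rounding rule.

```python
def fuse_exponent(E, corr, step, group_size, strength=4):
    """Return E_fused = round(E + strength * corr / (step * group_size)), clamped to int8."""
    if step <= 0:
        return E  # nothing accumulated yet, keep the stored exponent
    fused = round(E + strength * corr / (step * group_size))
    return max(-128, min(127, fused))
```

After fusing, the scale is restricted to integer powers of two again, so any sub-exponent precision carried by the accumulator is lost.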
+### Relationship to IEEE float
+```
+IEEE FP32:  1 sign + 8 exponent + 23 mantissa  → per value
+E1TM32:    1 exponent (int8) + 32 ternary signs → per group of 32
+```
+In IEEE, the exponent and mantissa belong to the same value. In E1TM, the exponent is **shared** — the mantissa is split into N independent ternary signs. The corr_accum provides sub-exponent precision beyond the int8 E, making the effective scale continuous rather than constrained to the 256 discrete `2^E` values.

docs/benchmarks/BENCHMARK.md ADDED

	@@ -0,0 +1,151 @@

+# TrueTernary Benchmark
+Results from `benchmark_true_ternary.py`, comparing pure ternary training against standard methods on `ARBModel` (formerly `MORPHTernaryModel`).
+## Quick Start
+```bash
+cd models/Trigram
+# TrueTernary (strict, 0 float params, 14M ternary weights)
+python benchmark_true_ternary.py --configs TrueTernary --steps 200 --batch 4 --ctx 33
+# Adam baseline (full model, 102M float params)
+python benchmark_true_ternary.py --configs Adam_FP32 --steps 200 --batch 4 --ctx 33
+# Compare both
+python benchmark_true_ternary.py --configs Adam_FP32,TrueTernary --steps 200 --batch 4 --ctx 33 --reuse-base
+# Training script (strict ternary)
+python train.py --max_steps 1000 --batch_size 4 --ctx 33 --strict_ternary
+```
+## Head-to-Head: TrueTernary vs Adam (200 steps, B=4, C=33)
+| Metric | Adam_FP32 | TrueTernary |
+|--------|-----------|-------------|
+| **Trainable params** | 102,629,376 (float32) | **0** (pure ternary) |
+| **Model weights** | 473.6 MB | **0.0 MB** |
+| **Optimizer state** | 391.5 MB | **0.0 MB** |
+| **Training state** | 473.6 + 391.5 = **865 MB** | **18.3 MB** (buffers only) |
+| **Peak VRAM** | ~2,548 MB | **~232 MB** (includes CUDA context) |
+| **Step time** | ~200 ms | **~131 ms** |
+| **Final loss** | ~12.3 | **5.75** ↓ |
+| **Min loss** | — | **4.49** |
+| **Converges?** | Yes (to high loss) | **Yes (to roughly the uniform baseline, ln(288)≈5.66)** |
+### Key Takeaways
+- **VRAM**: TrueTernary uses **~47× less** persistent state (18.3 MB vs 865 MB)
+- **Speed**: 1.5× faster per step (131 ms vs 200 ms), using pure add/sub/skip with no float GEMM
+- **Convergence**: TrueTernary reaches **5.75**, just above the uniform-prediction baseline ln(288) ≈ 5.66; Adam stalls at **12.3**, well above that baseline
+- **No float params**: TrueTernary has 0 trainable float params, 0 float buffers
+## TrueTernary Training Dynamics (200 steps)
+The loss curve follows a characteristic 3-phase pattern:
+```
+Phase 1 (steps 0-15):  Mass T flips from random init, loss spikes to ~90
+Phase 2 (steps 15-80):  Recovery and convergence, loss drops from ~15 to ~6
+Phase 3 (steps 80-200): Stable convergence, loss hovers at 5.7-6.0
+```
+**Convergence evidence:**
+| Segment | Mean Loss | Min Loss | Trend |
+|---------|-----------|----------|-------|
+| Steps 0-50 | 13.4 | 4.49 | High variance (T flips) |
+| Steps 50-100 | 8.7 | 6.03 | Monotonic decline |
+| Steps 100-150 | 6.4 | 5.69 | Approaching uniform baseline |
+| Steps 150-200 | **5.82** | **5.64** | **Converged** |
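+The minimum loss of **4.49**, reached during the high-variance first 50 steps, is well below the uniform-distribution baseline (ln(288) ≈ 5.66), which suggests the model can capture byte-level patterns. The converged segment (steps 150-200, mean 5.82, min 5.64) sits at roughly that baseline.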
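The baseline quoted here is the cross-entropy of a model that assigns equal probability to every token. A short sketch, assuming a 288-entry vocabulary as implied by the ln(288) figure; both function names are illustrative:

```python
import math

def uniform_baseline_loss(vocab_size):
    """Cross-entropy (in nats) of a uniform prediction over vocab_size tokens."""
    return math.log(vocab_size)

def perplexity(loss_nats):
    """Convert a cross-entropy loss in nats to perplexity."""
    return math.exp(loss_nats)

baseline = uniform_baseline_loss(288)
print(f"uniform baseline: {baseline:.3f} nats, perplexity {perplexity(baseline):.0f}")
```

A loss below this value means the model is doing better than guessing uniformly; a loss above it means it is doing worse.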
+## Training State Breakdown (14M ternary weights)
+| Component | Storage | Size | Role |
+|-----------|---------|------|------|
+| T_packed | 5-trit/byte uint8 | 2.67 MB | Packed {-1, 0, +1} weights |
+| E | int8 per group | 1.12 MB | Log₂ scale exponent |
+| E_accum | int8 per group | 1.12 MB | Residual E accumulator |
+| T_accum | int8 per weight | 13.36 MB | Gradient sign accumulator |
+| **Total** | | **18.27 MB** | |
+All int8 or packed ternary — no IEEE float anywhere in weight state.
+## Scale Projection to 3B Parameters
+| Component | 14M | 3B (projected) |
+|-----------|-----|----------------|
+| T_packed | 2.67 MB | **~572 MB** |
+| E | 1.12 MB | **~240 MB** |
+| E_accum | 1.12 MB | **~240 MB** |
+| T_accum | 13.36 MB | **~2.86 GB** |
+| **Total training** | **18.27 MB** | **~3.9 GB** |
+| Inference (T+E only) | ~3.8 MB | **~812 MB** |
+At 3B, the **~3.9 GB** of persistent training state (excluding activations and CUDA context) fits on a single RTX 4060 (8 GB). Compare to BF16 Adam: **~18 GB** (requires server GPU).
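The projected column can be reproduced by scaling each measured component linearly with the weight count. This sketch assumes exactly 14M weights in the measured model and 3B in the target, and that every buffer grows in proportion; `project_state` is a hypothetical helper.

```python
# Measured sizes in MB at 14M ternary weights, copied from the table above.
STATE_MB_AT_14M = {"T_packed": 2.67, "E": 1.12, "E_accum": 1.12, "T_accum": 13.36}

def project_state(n_weights, base_weights=14e6, base=STATE_MB_AT_14M):
    """Linearly scale each persistent buffer to a model with n_weights weights."""
    factor = n_weights / base_weights
    return {name: size_mb * factor for name, size_mb in base.items()}

projected = project_state(3e9)
training_mb = sum(projected.values())
inference_mb = projected["T_packed"] + projected["E"]
print(f"training state: {training_mb:.0f} MB, inference (T+E): {inference_mb:.0f} MB")
```

The estimate covers persistent buffers only; activations, temporary float scales and framework overhead come on top of it.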
+## Architecture Components
+All internal trainable components are now ternary or integer buffers (REFACTOR6+):
+- `TernaryScaleTensor` — packed ternary linear layers
+- `TernaryEmbeddingTable` — packed ternary embedding lookup
+- `TernaryLSTMCell` — LSTM with ternary projections
+- `TernaryVQCodebook` — VQ with ternary embedding table
+- Graph: Triton-backed edge aggregation + gather-add kernels (REFACTOR8)
+- MoE: Triton-backed dense combine kernel (REFACTOR8)
+The only remaining float parameters are imported frozen encoders (ViT, Whisper).
+## Running the Benchmark
+```bash
+# Default: 200 steps, batch=4, ctx=33
+python benchmark_true_ternary.py
+# Custom config
+python benchmark_true_ternary.py \
+  --configs TrueTernary \
+  --steps 500 \
+  --batch 8 \
+  --ctx 66 \
+  --update-backend gpu \
+  --scale-update-interval 1
+# Compare with Adam
+python benchmark_true_ternary.py \
+  --configs Adam_FP32,TrueTernary \
+  --steps 200 \
+  --batch 4 \
+  --ctx 33 \
+  --reuse-base
+# Change T_accum threshold (higher = less frequent flips)
+python benchmark_true_ternary.py \
+  --accum-threshold 5
+# Full training pipeline
+python train.py \
+  --max_steps 5000 \
+  --batch_size 8 \
+  --ctx 66 \
+  --strict_ternary \
+  --scale_update_interval 1 \
+  --run_name my_ternary_run
+```
+## Benchmark CLI Arguments
+| Argument | Default | Description |
+|----------|---------|-------------|
+| `--configs` | `TrueTernary` | Comma-separated: `Adam_FP32`, `SignSGD_Old`, `TrueTernary` |
+| `--steps` | 200 | Training steps |
+| `--batch` | 4 | Batch size |
+| `--ctx` | 33 | Context length |
+| `--update-backend` | `gpu` | `gpu`, `gpu-signcache`, `dense-fallback`, `none` |
+| `--scale-update-interval` | 1 | E update frequency (0 = disable) |
+| `--accum-threshold` | 3 | T_accum flip threshold |
+| `--print-every` | 50 | Logging frequency |