omar-ah committed
Commit 59fd921 · verified · 1 Parent(s): 7e7f067

Update README with full documentation

Files changed (1):
  1. README.md (+107, -41)
README.md CHANGED
@@ -16,17 +16,20 @@ A lightweight single-object tracker (SOT) using Vision-LSTM (ViL) as backbone, d

### Core Design
- **Backbone**: Vision-LSTM (ViL-S) with 24 mLSTM blocks, bidirectional scanning
- - **Temporal Modulation**: FiLM (Feature-wise Linear Modulation) for temporal context
+ - **Temporal Modulation**: FiLM (Feature-wise Linear Modulation) integrated BETWEEN backbone blocks
- **Prediction Heads**: Center-based heatmap + size regression + offset refinement
- **Uncertainty**: Aleatoric uncertainty estimation for adaptive tracking
- **TMoE**: Temporal Mixture-of-Experts MLP in last 2 blocks
+ - **Online Tracking**: Kalman filter with uncertainty-adaptive noise + confidence-based template update

### Key Innovations
1. **LinearHeadwiseExpand Q/K/V projections**: Block-diagonal projections (192×4×4 = 3K params each vs 589K for full linear), matching the official NX-AI ViL-S architecture
2. **No separate MLP/FFN**: Following ViL-S, the gated output inside the mLSTM cell serves as the MLP (SwiGLU-style gating via proj_up → split → z-gate → proj_down)
3. **Bidirectional scanning**: Even blocks L→R, odd blocks R→L via `torch.flip`
- 4. **FiLM temporal modulation**: Replaces DTPTrack temporal tokens (broken in R→L scan) with channel-wise affine modulation
+ 4. **FiLM temporal modulation**: Replaces DTPTrack temporal tokens (broken in R→L scan) with channel-wise affine modulation, integrated between backbone blocks (not post-hoc)
5. **TMoE in last 2 blocks**: Dense routing with frozen shared expert + 4 specialized experts for temporal dynamics
+ 6. **ACL curriculum**: Progressive difficulty ramp-up (sample jitter + temporal gap + loss weighting)
+ 7. **8-state Kalman filter**: Chi-squared gating for outlier rejection, uncertainty-adaptive measurement noise

### Constraint Compliance

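A note on the parameter math in Key Innovation 1: 589K for a full linear layer implies an inner dimension of 768 (768 × 768 = 589,824), and a block size of 4 then gives 192 independent blocks of 4 × 4 weights, i.e. 3,072 parameters. A minimal sketch of such a block-diagonal projection (the class name, init scaling, and einsum layout are illustrative assumptions, not the NX-AI code):

```python
import torch
import torch.nn as nn

class BlockDiagonalProjection(nn.Module):
    """Illustrative LinearHeadwiseExpand-style projection (names assumed).

    The weight is block-diagonal: dim // block_size independent blocks of
    shape (block_size, block_size) instead of one dense (dim, dim) matrix.
    """

    def __init__(self, dim: int = 768, block_size: int = 4):
        super().__init__()
        assert dim % block_size == 0
        self.block_size = block_size
        self.num_blocks = dim // block_size
        # illustrative init; the official implementation may scale differently
        self.weight = nn.Parameter(
            torch.randn(self.num_blocks, block_size, block_size) / block_size**0.5
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, d = x.shape
        x = x.view(b, s, self.num_blocks, self.block_size)
        # one small matmul per block = a block-diagonal linear map overall
        x = torch.einsum("bsnc,nco->bsno", x, self.weight)
        return x.reshape(b, s, d)

proj = BlockDiagonalProjection()
print(sum(p.numel() for p in proj.parameters()))  # 3072 (192 x 4 x 4)
print(768 * 768)                                  # 589824 for a dense linear
```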
 
@@ -73,44 +76,61 @@ Input x (B, S, D=384)
```

### Training Pipeline
- - **Phase 1** (300 epochs): Full supervised training with focal + GIoU + size losses, ACL curriculum
- - **Phase 2** (100 epochs): Fine-tuning with frozen shared TMoE experts, contrastive loss
-
- ## File Structure
-
- ```
- vil_tracker/
- ├── models/
- │   ├── mlstm.py          # LinearHeadwiseExpand, mLSTMCell, mLSTMBlock, SwiGLUMLP
- │   ├── backbone.py       # ViLBackbone, PatchEmbed, TMoEMLP, mLSTMBlockWithTMoE
- │   ├── film_temporal.py  # FiLM modulation, TemporalReliabilityCalibrator
- │   ├── heads.py          # CenterHead, UncertaintyHead, decode_predictions
- │   └── tracker.py        # ViLTracker, build_tracker, get_default_config
- ├── training/
- │   ├── losses.py         # FocalLoss, GIoULoss, UncertaintyNLLLoss, CombinedTrackingLoss
- │   └── train.py          # Phase 1/2 training, ACL curriculum, AMP
- ├── data/
- │   └── dataset.py        # TrackingDataset with synthetic fallback, ACL difficulty
- ├── inference/
- │   ├── kalman.py         # 8-state Kalman filter with adaptive noise
- │   └── online_tracker.py # OnlineTracker inference pipeline
- ├── evaluation/
- │   └── evaluate.py       # BenchmarkEvaluator for LaSOT/UAV123/DTB70/VisDrone
- ├── utils/
- │   └── helpers.py        # count_parameters, estimate_flops, print_model_summary
- └── configs/
-     └── default.json      # Full configuration
- ```
+ - **Phase 1** (300 epochs): Full supervised training with focal + GIoU + size losses
+   - ACL curriculum: difficulty ramp 0→1 over 50 epochs (controls temporal gap, spatial jitter, loss weighting)
+   - FiLM temporal modulation activated after epoch 30
+   - Datasets: GOT-10k + LaSOT + TrackingNet + COCO (with synthetic fallback)
+ - **Phase 2** (100 epochs): Fine-tuning with frozen shared TMoE experts
+   - Contrastive loss on template/search temporal features
+   - Optional AFKD distillation from MCITrack-B256 teacher
+   - FiLM temporal modulation always active
+
+ ### Loss Functions
+ - **FocalLoss**: Center heatmap prediction (CornerNet-style, handles 1/256 positive ratio)
+ - **GIoULoss**: Bounding box regression
+ - **L1Loss**: Size regression
+ - **UncertaintyNLLLoss**: Uncertainty-aware regression
+ - **MemoryContrastiveLoss**: Temporal feature consistency (Phase 2)
+ - **AFKDDistillationLoss**: Attention-free knowledge distillation (optional teacher)
+ - **ADWLoss**: Adaptive dynamic weighting (homoscedastic uncertainty)
+
+ ### Inference Pipeline (OnlineTracker)
+ 1. Kalman filter predict → estimated position
+ 2. Crop search region (4x context) around prediction
+ 3. Model forward: template + search → heatmap + size + offset
+ 4. Decode predictions → candidate bounding box
+ 5. Map predictions back to frame coordinates
+ 6. Confidence check → update Kalman filter (with uncertainty-adaptive noise)
+ 7. Conditional template update (high confidence, every 10th frame)
+
+ ## Dataset Support
+
+ ### Training Datasets
+ - **GOT-10k**: `root/train/GOT-10k_Train_NNNNNN/` (10K sequences)
+ - **LaSOT**: `root/{category}/{seq_name}/img/` + `groundtruth.txt` (1120 sequences)
+ - **TrackingNet**: `root/TRAIN_N/frames/{video}/` + `anno/{video}.txt` (30K sequences)
+ - **COCO**: Pseudo-sequences from detection annotations (static pair pretraining)
+ - **Synthetic**: Colored rectangles on noise backgrounds (no external data needed)
+
+ ### Evaluation Datasets
+ - **LaSOT** (test): 280 sequences, AUC metric
+ - **UAV123**: 123 low-altitude UAV sequences
+ - **DTB70**: 70 drone tracking sequences
+ - **VisDrone-SOT**: Drone-perspective tracking

## Quick Start

+ ### Build and Inspect Model
```python
from vil_tracker.models.tracker import build_tracker
+ from vil_tracker.utils.helpers import print_model_summary

- # Build model with default config (36.33M params)
tracker = build_tracker()
+ print_model_summary(tracker)
+ ```

- # Forward pass
+ ### Forward Pass
+ ```python
import torch
template = torch.randn(1, 3, 128, 128)
search = torch.randn(1, 3, 256, 256)
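The FiLM change threaded through this commit is channel-wise affine conditioning: a temporal context vector produces a per-channel scale and shift applied to the token features between backbone blocks, which, unlike extra temporal tokens, is order-free and so survives the R→L scan of odd blocks. A minimal sketch (the module name, context dimensionality, and zero-init are assumptions; the repo's `film_temporal.py` will differ in detail):

```python
import torch
import torch.nn as nn

class FiLMModulation(nn.Module):
    """Sketch of FiLM (Perez et al.): per-channel scale and shift."""

    def __init__(self, dim: int = 384, ctx_dim: int = 384):
        super().__init__()
        self.to_gamma_beta = nn.Linear(ctx_dim, 2 * dim)
        nn.init.zeros_(self.to_gamma_beta.weight)  # start as identity: gamma=1, beta=0
        nn.init.zeros_(self.to_gamma_beta.bias)

    def forward(self, x: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        # x: (B, S, dim) token features; ctx: (B, ctx_dim) temporal context
        gamma, beta = self.to_gamma_beta(ctx).chunk(2, dim=-1)
        return (1 + gamma).unsqueeze(1) * x + beta.unsqueeze(1)

film = FiLMModulation()
x = torch.randn(2, 256, 384)
ctx = torch.randn(2, 384)
print(film(x, ctx).shape)  # torch.Size([2, 256, 384])
```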
@@ -120,19 +140,65 @@ print(output['boxes']) # (1, 4) predicted [cx, cy, w, h]
print(output['scores']) # (1,) confidence scores
```

- ## References
+ ### Online Tracking
+ ```python
+ from vil_tracker.inference.online_tracker import OnlineTracker

- ### Seed Papers
- - **UETrack**: arXiv:2603.01412 - Uncertainty-aware tracker
- - **SGLATrack**: arXiv:2503.06625 - Structure-guided attention tracking
- - **SUTrack**: arXiv:2412.19138 - Unified tracking framework
+ online = OnlineTracker(tracker, device='cuda')
+ online.initialize(first_frame, init_bbox)
+ for frame in video_frames[1:]:
+     bbox = online.track(frame)
+ ```
+
+ ### Training
+ ```python
+ from vil_tracker.models.tracker import build_tracker, get_default_config
+ from vil_tracker.data.dataset import build_tracking_dataset
+ from vil_tracker.training.train import train_phase1, train_phase2
+
+ config = get_default_config()
+ model = build_tracker(config)
+
+ dataset = build_tracking_dataset({
+     'got10k_root': '/data/GOT-10k',
+     'lasot_root': '/data/LaSOT',
+     'trackingnet_root': '/data/TrackingNet',
+ })
+
+ model = train_phase1(model, dataset, config, device='cuda',
+                      push_to_hub=True, hub_model_id='user/vil-tracker')
+ model = train_phase2(model, dataset, config, device='cuda',
+                      push_to_hub=True, hub_model_id='user/vil-tracker')
+ ```
+
+ ### Evaluation
+ ```python
+ from vil_tracker.inference.online_tracker import OnlineTracker
+ from vil_tracker.evaluation.evaluate import BenchmarkEvaluator
+
+ online = OnlineTracker(model, device='cuda')
+ evaluator = BenchmarkEvaluator(online)
+ results = evaluator.evaluate_dataset('/data/LaSOT', 'lasot')
+ print(f"LaSOT AUC: {results['mean_seq_auc']:.3f}")
+ ```
+
+ ## Tests
+
+ Run the full test suite (16 tests):
+ ```bash
+ python test_all.py
+ ```
+
+ ## References

- ### Architecture References
- **Vision-LSTM (ViL)**: Alkin et al., arXiv:2406.04303
- **xLSTM**: Beck et al., arXiv:2405.04517
- - **FiLM**: Perez et al., "FiLM: Visual Reasoning with a General Conditioning Layer"
- - **MCITrack**: Distillation teacher (B256 variant)
+ - **UETrack**: arXiv:2603.01412 (uncertainty-aware tracker)
+ - **SGLATrack**: arXiv:2503.06625 (structure-guided attention tracking)
+ - **SUTrack**: arXiv:2412.19138 (unified tracking framework)
+ - **FiLM**: Perez et al., "FiLM: Visual Reasoning with a General Conditioning Layer"
+ - **MCITrack**: Distillation teacher (B256 variant)

## License

- MIT
+ MIT
 
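## Implementation Sketches

The sketches below unpack techniques named in the updated README. Each is a minimal, self-contained illustration under stated assumptions, not the repository's implementation.

Key Innovation 3 alternates scan direction per block: even-indexed mLSTM blocks read the token sequence left-to-right, odd-indexed blocks read it reversed via `torch.flip` and are flipped back afterwards so outputs stay aligned. A toy version of the pattern (the `nn.Linear` stand-ins replace the real mLSTM blocks):

```python
import torch
import torch.nn as nn

def bidirectional_scan(blocks, x):
    """Alternate scan direction per block; x is (B, S, D)."""
    for i, block in enumerate(blocks):
        if i % 2 == 0:
            x = block(x)  # even block: L-to-R scan
        else:
            # odd block: reverse the sequence, process, reverse back
            x = torch.flip(block(torch.flip(x, dims=[1])), dims=[1])
    return x

blocks = [nn.Linear(384, 384) for _ in range(4)]  # stand-ins for mLSTM blocks
print(bidirectional_scan(blocks, torch.randn(2, 256, 384)).shape)  # (2, 256, 384)
```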
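The TMoE MLP (Key Innovation 5) combines a frozen shared expert with 4 specialized experts under dense routing: every token receives a softmax-weighted mix of all experts rather than a sparse top-k selection. A sketch with assumed hidden width and router design:

```python
import torch
import torch.nn as nn

class TMoEMLPSketch(nn.Module):
    """Sketch of TMoE dense routing: frozen shared expert + 4 specialists."""

    def __init__(self, dim: int = 384, hidden: int = 768, num_experts: int = 4):
        super().__init__()
        make_expert = lambda: nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )
        self.shared = make_expert()
        for p in self.shared.parameters():
            p.requires_grad = False  # frozen shared expert, as in the Phase 2 recipe
        self.experts = nn.ModuleList(make_expert() for _ in range(num_experts))
        self.router = nn.Linear(dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # dense routing: every token uses every expert, weighted by softmax
        weights = self.router(x).softmax(dim=-1)                        # (B, S, E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, S, D, E)
        mixed = (expert_out * weights.unsqueeze(2)).sum(dim=-1)
        return self.shared(x) + mixed

moe = TMoEMLPSketch()
print(moe(torch.randn(2, 256, 384)).shape)  # torch.Size([2, 256, 384])
```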
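Phase 1's ACL curriculum ramps difficulty from 0 to 1 over 50 epochs and couples it to the template-search temporal gap, the spatial jitter, and the loss weighting. A sketch; the maximum gap, jitter, and weighting rule are illustrative values, not the repo's configuration:

```python
def acl_difficulty(epoch: int, ramp_epochs: int = 50) -> float:
    """Difficulty ramps linearly 0 -> 1 over the first ramp_epochs, then holds."""
    return min(1.0, epoch / ramp_epochs)

def sampling_params(epoch: int, max_gap: int = 100, max_jitter: float = 0.3):
    # harder samples as training progresses: wider temporal gaps between
    # template and search frames, stronger spatial jitter on the crop,
    # and heavier weighting of hard examples in the loss
    d = acl_difficulty(epoch)
    return {
        "max_frame_gap": max(1, int(d * max_gap)),
        "spatial_jitter": d * max_jitter,
        "hard_loss_weight": 1.0 + d,  # illustrative re-weighting rule
    }

for epoch in (0, 25, 50, 100):
    print(epoch, sampling_params(epoch))
```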
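FocalLoss is CornerNet-style and built for the roughly 1/256 positive ratio of a center heatmap, which matches one positive cell on a 16x16 map (e.g. a 256-pixel search with stride 16). A sketch using the conventional alpha=2, beta=4 exponents (assumed; the README does not state them):

```python
import torch

def heatmap_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """CornerNet-style focal loss on a sigmoid heatmap vs a Gaussian target.

    Only the exact center is gt == 1; (1 - gt)^beta down-weights the easy
    negatives near the Gaussian bumps, and the sum is normalized by the
    number of positives rather than the number of pixels.
    """
    pos = gt.eq(1).float()
    neg = 1.0 - pos
    pos_loss = pos * torch.log(pred + eps) * (1 - pred) ** alpha
    neg_loss = neg * torch.log(1 - pred + eps) * pred ** alpha * (1 - gt) ** beta
    num_pos = pos.sum().clamp(min=1.0)
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos

pred = torch.rand(1, 1, 16, 16)
gt = torch.zeros(1, 1, 16, 16)
gt[0, 0, 8, 8] = 1.0  # a real target would splat a Gaussian around the center
print(heatmap_focal_loss(pred, gt))
```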
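GIoULoss penalizes 1 - GIoU, where GIoU discounts IoU by the share of the smallest enclosing box not covered by the union, so it still gives a gradient for non-overlapping boxes. A sketch for corner-format (x1, y1, x2, y2) boxes; since the heads emit [cx, cy, w, h], a format conversion would precede this:

```python
import torch

def giou_loss(pred, gt, eps=1e-7):
    """1 - GIoU for (N, 4) boxes in (x1, y1, x2, y2) format."""
    # intersection
    lt = torch.max(pred[:, :2], gt[:, :2])
    rb = torch.min(pred[:, 2:], gt[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    union = area_p + area_g - inter
    iou = inter / (union + eps)
    # smallest enclosing box
    lt_c = torch.min(pred[:, :2], gt[:, :2])
    rb_c = torch.max(pred[:, 2:], gt[:, 2:])
    area_c = (rb_c - lt_c).prod(dim=1)
    giou = iou - (area_c - union) / (area_c + eps)
    return (1 - giou).mean()

pred = torch.tensor([[10.0, 10.0, 50.0, 50.0]])
gt = torch.tensor([[12.0, 8.0, 52.0, 48.0]])
print(giou_loss(pred, gt))
```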
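UncertaintyNLLLoss follows the standard aleatoric-uncertainty recipe: the head predicts a log-variance alongside each regression target and minimizes the Gaussian negative log-likelihood, so the model can down-weight coordinates it is unsure about while the log-variance term penalizes blanket uncertainty. A sketch (the per-coordinate parameterization is an assumption):

```python
import torch

def uncertainty_nll_loss(pred_box, log_var, gt_box):
    """Gaussian NLL (up to constants): 0.5 * exp(-s) * err^2 + 0.5 * s."""
    se = (pred_box - gt_box) ** 2
    return (0.5 * torch.exp(-log_var) * se + 0.5 * log_var).mean()

pred = torch.tensor([[0.5, 0.5, 0.2, 0.3]])
log_var = torch.zeros(1, 4, requires_grad=True)  # predicted log sigma^2 per coord
gt = torch.tensor([[0.52, 0.48, 0.22, 0.28]])
print(uncertainty_nll_loss(pred, log_var, gt))
```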
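The online tracker's 8-state Kalman filter tracks [cx, cy, w, h] plus their velocities, rejects outlier measurements with a chi-squared gate on the innovation, and scales measurement noise by the model's uncertainty estimate. A compact constant-velocity sketch (state layout is from the README; noise magnitudes and the 95% gate are assumed values):

```python
import numpy as np

CHI2_GATE_4DOF = 9.4877  # 95th percentile of chi-squared with 4 dof

class ConstantVelocityKalman:
    """Sketch of an 8-state filter: [cx, cy, w, h] and their velocities."""

    def __init__(self, box):
        self.x = np.concatenate([np.asarray(box, float), np.zeros(4)])  # (8,)
        self.P = np.eye(8) * 10.0              # state covariance
        self.F = np.eye(8)                     # constant-velocity transition
        self.F[:4, 4:] = np.eye(4)
        self.H = np.eye(4, 8)                  # observe position and size only
        self.Q = np.eye(8) * 0.01              # process noise (assumed)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:4]

    def update(self, box, uncertainty=1.0):
        R = np.eye(4) * uncertainty            # uncertainty-adaptive measurement noise
        y = np.asarray(box, float) - self.H @ self.x       # innovation
        S = self.H @ self.P @ self.H.T + R
        if y @ np.linalg.solve(S, y) > CHI2_GATE_4DOF:     # chi-squared gate
            return self.x[:4]                  # reject outlier, keep prediction
        K = self.P @ self.H.T @ np.linalg.inv(S)           # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(8) - K @ self.H) @ self.P
        return self.x[:4]

kf = ConstantVelocityKalman([100.0, 100.0, 40.0, 40.0])
kf.predict()
print(kf.update([102.0, 101.0, 41.0, 40.0], uncertainty=0.5))
```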
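Step 2 of the inference pipeline crops a search region of 4x context around the Kalman prediction before resizing to the 256-pixel search input. One way that crop-and-resize could look (the square-crop rule and border padding are assumptions):

```python
import torch
import torch.nn.functional as F

def crop_search_region(frame, box, context=4.0, out_size=256):
    """Crop a square of `context` times the box scale around the predicted
    center, pad at frame borders, and resize. frame: (3, H, W); box: [cx, cy, w, h]."""
    _, H, W = frame.shape
    cx, cy, w, h = box
    size = int((w * h) ** 0.5 * context)   # square crop from the box's geometric scale
    x0, y0 = int(cx - size / 2), int(cy - size / 2)
    # pad so crops that spill over the border stay in range
    pad = max(0, -x0, -y0, x0 + size - W, y0 + size - H)
    padded = F.pad(frame, (pad, pad, pad, pad))
    crop = padded[:, y0 + pad : y0 + pad + size, x0 + pad : x0 + pad + size]
    return F.interpolate(crop[None].float(), size=(out_size, out_size),
                         mode='bilinear', align_corners=False)[0]

frame = torch.randint(0, 255, (3, 720, 1280))
print(crop_search_region(frame, [640.0, 360.0, 80.0, 60.0]).shape)  # (3, 256, 256)
```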