Parallax_VIDEO-AmongUs

Parallax_VIDEO-AmongUs is an experimental temporal autoencoder that extends geometric image reconstruction into the time dimension. Operating on top of the Parallax Vision 2.0 backbone, the model focuses on the sequential flow and structural evolution of a scene over multiple frames. It demonstrates that complex, 120-frame UI animations can be compressed into a highly constrained temporal bottleneck without losing their logical motion.

1. Technical Specs

  • Meta-AE Parameters: ~10,757,888 (10.7M; see the check below this list)
  • Input Format: 120-Frame Sequence (via 512-dim Vision Latents)
  • Temporal Bottleneck: 256 values per frame
  • Architecture: Scaled Sequence Meta-Autoencoder (Linear Stack)
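
The parameter count can be sanity-checked against the SequenceAE definition given in the inference script below (a quick check, assuming that class is in scope):

# Quick check of the ~10.7M figure, using the SequenceAE class from section 4
print(sum(p.numel() for p in SequenceAE().parameters()))  # 10,757,888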

2. The Facts: The 6075:1 Challenge

This pipeline operates at an extreme compression ratio by compounding the spatial compression of Vision 2.0 with temporal sequence bottlenecking.

In practical terms, an entire 120-frame sequence of 720 × 720 RGB video is reduced to a microscopic "temporal DNA" of just 30,720 values, effectively discarding 99.98% of the raw video data. Instead of suffering from severe temporal drift or "ghosting," the model retains the precise trajectory of moving elements and UI state changes across time.

| Metric | Value |
| --- | --- |
| Input Values (120 frames, 720 × 720 RGB) | 186,624,000 |
| Temporal Latent Values (120 × 256) | 30,720 |
| Total Compression Ratio | 6075 : 1 |
| Data Retained | ~0.016% |
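
These figures follow from the frame geometry used in the inference script (each frame is resized to 720 × 720 before encoding); a minimal arithmetic check:

frames, height, width, channels = 120, 720, 720, 3
temporal_dim = 256

input_values = frames * height * width * channels  # 186,624,000
latent_values = frames * temporal_dim              # 30,720

print(input_values / latent_values)                # 6075.0
print(100 * latent_values / input_values)          # ~0.016 (percent of data retained)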

3. The Story: Choreography Over Frames

Most video autoencoders fail at high ratios because they try to memorize the pixels of every individual frame. Parallax_VIDEO-AmongUs succeeds by prioritizing the differences and trajectories between frames.

Think of it as a motion capture system for digital environments. Rather than drawing the character running 120 times, the model remembers the starting position and the mathematical rules of the movement. By shifting the objective from "What does this frame look like?" to "How do these structural coordinates evolve over 120 steps?", the model can reconstruct smooth, logical video sequences from an incredibly dense seed.

It’s high-fidelity choreography hidden in a 256-dimensional temporal thread.

4. Usage (Inference Script)

This script demonstrates the full sequential reconstruction pipeline, using the pre-trained Vision 2.0 backbone for spatial encoding and decoding, and the 10.7M Meta-AE for temporal compression and reconstruction. It expects a local video file (test_sequence.mp4) with at least 120 frames.

import os
import cv2
import torch
import torch.nn as nn
from torchvision import transforms
import matplotlib.pyplot as plt

# --- CONFIGURATION ---
VISION_REPO = "Parallax-labs-1/parallax-VISION_amongus-2.0"
VIDEO_REPO = "Parallax-labs-1/parallax_VIDEO-amongus"
INPUT_VIDEO = "test_sequence.mp4" 
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
LATENT_DIM = 512
SEQ_LENGTH = 120
IMG_SIZE = 720

# --- 1. MODEL ARCHITECTURES ---
class ParallaxAutoencoder(nn.Module):
    # Spatial Backbone
    def __init__(self):
        super(ParallaxAutoencoder, self).__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(64 * 45 * 45, LATENT_DIM)  # 720 / 2**4 = 45 after four pooling stages
        )
        self.decoder = nn.Sequential(
            nn.Linear(LATENT_DIM, 64 * 45 * 45), nn.ReLU(),
            nn.Unflatten(1, (64, 45, 45)),
            nn.ConvTranspose2d(64, 64, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 2, stride=2), nn.Sigmoid()
        )
    def forward(self, x): return self.decoder(self.encoder(x))

class SequenceAE(nn.Module):
    # Temporal Meta-AE: compresses each frame's 512-dim latent to a 256-value bottleneck
    def __init__(self):
        super(SequenceAE, self).__init__()
        self.encoder = nn.Sequential(
            nn.Linear(LATENT_DIM, 1024), nn.ReLU(),
            nn.Linear(1024, 2048), nn.ReLU(),
            nn.Linear(2048, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256)
        )
        self.decoder = nn.Sequential(
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, 2048), nn.ReLU(),
            nn.Linear(2048, 1024), nn.ReLU(),
            nn.Linear(1024, LATENT_DIM)
        )
    def forward(self, x): return self.decoder(self.encoder(x))

# --- 2. LOAD & INITIALIZE ---
vision_model = ParallaxAutoencoder().to(DEVICE)
meta_ae = SequenceAE().to(DEVICE)

def download_weights(repo, filename):
    if not os.path.exists(filename):
        print(f"Downloading {filename}...")
        os.system(f"wget https://huggingface.co/{repo}/resolve/main/{filename} -O {filename}")

download_weights(VISION_REPO, "model.pth")
download_weights(VIDEO_REPO, "meta_ae_model.pth")

vision_model.load_state_dict(torch.load('model.pth', map_location=DEVICE))
meta_ae.load_state_dict(torch.load('meta_ae_model.pth', map_location=DEVICE))

vision_model.eval()
meta_ae.eval()

# --- 3. EXECUTE SEQUENCE RECONSTRUCTION ---
def run_video_reconstruction(video_path):
    if not os.path.exists(video_path):
        print(f"Error: {video_path} not found.")
        return

    cap = cv2.VideoCapture(video_path)
    preprocess = transforms.Compose([
        transforms.ToPILImage(),
        transforms.Resize((IMG_SIZE, IMG_SIZE)),
        transforms.ToTensor()
    ])
    
    frames = []
    print("Extracting frames...")
    while cap.isOpened() and len(frames) < SEQ_LENGTH:
        ret, frame = cap.read()
        if not ret: break
        frames.append(preprocess(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()

    if len(frames) < SEQ_LENGTH:
        print(f"Need at least {SEQ_LENGTH} frames.")
        return

    frames_t = torch.stack(frames).to(DEVICE) # [120, 3, 720, 720]
    
    print("Running through pipeline...")
    with torch.no_grad():
        # 1. Spatial Encoding (120 frames -> 120 x 512)
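        # Note: a single batch of 120 x 3 x 720 x 720 frames is memory-heavy;
        # if it does not fit in GPU memory, encode the sequence in smaller chunks.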
        spatial_latents = vision_model.encoder(frames_t)
        
        # 2. Temporal Encoding/Decoding (120 x 512 -> 120 x 256 -> 120 x 512)
        reconstructed_latents = meta_ae.decoder(meta_ae.encoder(spatial_latents))
        
        # 3. Spatial Decoding (Pick a frame to visualize, e.g., middle frame 60)
        frame_idx = 60
        original_frame = frames_t[frame_idx]
        recon_frame = vision_model.decoder(reconstructed_latents[frame_idx].unsqueeze(0))[0]

    # Plotting
    fig, ax = plt.subplots(1, 2, figsize=(15, 7))
    ax[0].imshow(original_frame.cpu().permute(1, 2, 0))
    ax[0].set_title(f"Original Frame {frame_idx}")
    ax[0].axis('off')
    
    ax[1].imshow(recon_frame.cpu().permute(1, 2, 0))
    ax[1].set_title(f"Reconstructed from Temporal Latents (Frame {frame_idx})")
    ax[1].axis('off')
    
    plt.show()

if __name__ == "__main__":
    run_video_reconstruction(INPUT_VIDEO)
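
If huggingface_hub is available, the wget-based helper above can be swapped for hf_hub_download; a minimal sketch reusing the same repo IDs and checkpoint filenames:

from huggingface_hub import hf_hub_download

# Alternative download path: resolve both checkpoints through huggingface_hub
vision_ckpt = hf_hub_download(repo_id=VISION_REPO, filename="model.pth")
meta_ckpt = hf_hub_download(repo_id=VIDEO_REPO, filename="meta_ae_model.pth")

vision_model.load_state_dict(torch.load(vision_ckpt, map_location=DEVICE))
meta_ae.load_state_dict(torch.load(meta_ckpt, map_location=DEVICE))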

5. License

Licensed under Apache 2.0.
