---
language: en
license: apache-2.0
base_model: black-forest-labs/FLUX.2-klein-base-4B
library_name: diffusers
tags:
- video-prediction
- dynamics
- lora
- flux2
- vision-banana
- arxiv:2604.20329
pipeline_tag: image-to-image
---
# moving-plantain
A LoRA adapter on FLUX.2 Klein (4B) for single-step future-frame prediction. It tests whether the latent physics priors of an image generator can be surfaced through the instruction-tuning recipe of *Image Generators are Generalist Vision Learners* (Gabeur et al., 2026; [arXiv:2604.20329](https://arxiv.org/abs/2604.20329)).
## Thesis
Vision Banana argues that image generation pretraining produces a generalist vision learner. moving-plantain extends that argument to dynamics. A model that can render a physically coherent t=1 frame conditioned on a t=0 frame and a free-form intervention prompt (for example "the ball rolls left", "the cup tips over", or "the cloth falls") implicitly carries a forward physics simulator in its weights. Recovering that simulator under parameter-efficient adaptation is the empirical test of whether generative vision pretraining encodes object permanence, gravity, contact dynamics, and other physical structure beyond static appearance.
## Method
Input: a single RGB frame at t=0 and an intervention prompt describing the change. Output: the predicted RGB frame at t=1. Training pairs are consecutive frames drawn from natural video datasets, with each intervention prompt derived from the optical flow or a motion description between the two frames. The loss is the diffusion objective on the t=1 target.
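As a rough illustration of the data recipe above, the sketch below pairs consecutive frames and turns a dense optical-flow field into a coarse intervention prompt. It is not the released training pipeline. The function names `consecutive_pairs` and `describe_motion`, the `min_magnitude` threshold, and the prompt wording are all hypothetical, and the flow field is assumed to come from an external estimator as an `(H, W, 2)` array of `(dx, dy)` pixel displacements.

```python
import numpy as np


def consecutive_pairs(frames, stride=1):
    """Yield (frame_t0, frame_t1) pairs from a video array of shape (T, H, W, 3).

    `stride` is a hypothetical knob for the gap between the two frames.
    """
    for t in range(len(frames) - stride):
        yield frames[t], frames[t + stride]


def describe_motion(flow, min_magnitude=1.0):
    """Map a dense flow field (H, W, 2) to a coarse, hypothetical prompt.

    Image convention is assumed: +x points right, +y points down.
    Pixels moving less than `min_magnitude` are treated as static.
    """
    magnitude = np.linalg.norm(flow, axis=-1)
    moving = magnitude >= min_magnitude
    if not moving.any():
        return "the scene stays still"
    # Average displacement over the moving pixels only.
    dx, dy = flow[moving].mean(axis=0)
    if abs(dx) >= abs(dy):
        direction = "right" if dx > 0 else "left"
    else:
        direction = "down" if dy > 0 else "up"
    return f"the object moves {direction}"


if __name__ == "__main__":
    video = np.zeros((5, 8, 8, 3), dtype=np.uint8)  # dummy 5-frame clip
    flow = np.zeros((8, 8, 2), dtype=np.float32)
    flow[2:5, 2:5, 0] = -3.0  # a small patch moving 3 px to the left
    for frame_t0, frame_t1 in consecutive_pairs(video):
        prompt = describe_motion(flow)
        print(frame_t0.shape, frame_t1.shape, prompt)
```

A real pipeline would likely need richer descriptions than a single dominant direction (per-object motion, contact events), so this only shows the shape of the (t=0 frame, prompt, t=1 frame) triple.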
## Status
Placeholder. Weights and training data forthcoming.
## License
Apache 2.0 — matches base FLUX.2 Klein 4B.
## References
- Gabeur, Long, Peng, et al. *Image Generators are Generalist Vision Learners.* [arXiv:2604.20329](https://arxiv.org/abs/2604.20329) (2026).
- Black Forest Labs. *FLUX.2 Klein.* https://bfl.ai/models/flux-2-klein (2025).