🛰️ Satellite Omni LISA

A multi-modal satellite model that understands AND generates images.

| Component | Model | Role |
|---|---|---|
| VLM | Qwen3.5-0.8B | Text + vision understanding |
| Mask Decoder | SlimSAM-77 | Image (mask) generation; about 10x lighter than SAM-ViT-B |
| Connector | MLP Projection (1024→256) | Bridges VLM → SAM |

Architecture

Based on LISA ("LISA: Reasoning Segmentation via Large Language Model"), adapted for satellite and remote-sensing imagery with multi-task training. For efficiency, it uses SlimSAM (9.7M params, 77% pruned) instead of SAM-ViT-B (93.7M).

┌─────────────────────────────────────────────────────────┐
│                    INPUT                                │
│  Satellite Image + Text Prompt                          │
│  e.g., "Segment the buildings in this aerial image"     │
└────────────────────┬────────────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────────────┐
│             Qwen3.5-0.8B VLM (LoRA fine-tuned)          │
│  ┌────────────┐  ┌───────────────────────────────────┐  │
│  │ ViT Vision │  │ Language Model (24 layers, 1024d) │  │
│  │  Encoder   │──│ Linear + Full Attention layers    │  │
│  │  (768d)    │  │ MRoPE positional encoding         │  │
│  └────────────┘  └──────────┬────────────────────────┘  │
└─────────────────────────────┼───────────────────────────┘
                              │
              ┌───────────────┴──────────────┐
              │                              │
     ┌────────▼──────────┐          ┌────────▼────────────┐
     │   Text Output     │          │  <SEG> Token        │
     │                   │          │  Hidden State       │
     │  "I identified    │          │  (1024-dim)         │
     │   5 buildings"    │          └────────┬────────────┘
     └───────────────────┘                   │
                                    ┌────────▼────────────┐
                                    │  MLP Projection     │
                                    │  1024 → 256         │
                                    │  (GELU activation)  │
                                    └────────┬────────────┘
                                             │
     ┌───────────────────────────────────────▼────────────┐
     │            SlimSAM-77                              │
     │  ┌──────────────┐  ┌────────────────────────────┐  │
     │  │ ViT Encoder  │  │ Mask Decoder (fine-tuned)  │  │
     │  │ (frozen)     │──│ Cross-attention + MLP      │  │
     │  │ ~5.7M params │  │ ~4M params                 │  │
     │  └──────────────┘  └──────────┬─────────────────┘  │
     └───────────────────────────────┼────────────────────┘
                                     │
                           ┌─────────▼───────────┐
                           │  Binary Mask Image  │
                           │  (256×256 → resize) │
                           └─────────────────────┘

Key Innovation

The model handles two output modalities from a single VLM:

  1. Text output: Standard autoregressive text generation for captioning, VQA, and classification
  2. Image output: When the model generates a special <SEG> token, its hidden representation is projected through an MLP and used as a prompt for SlimSAM's mask decoder to generate a segmentation mask image

This means the model learns when to generate an image (by outputting <SEG>) and what image to generate (the hidden state encodes the full visual context).
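
As a rough illustration of the routing described above, the sketch below scans generated token IDs for `<SEG>` and collects the matching hidden states, which are what the projection MLP would consume. The function name `split_outputs`, the token ID `SEG_ID`, and the toy 4-dimensional hidden states are all hypothetical; the real model uses 1024-dimensional states and whatever ID the tokenizer assigns to `<SEG>`.

```python
from typing import List, Sequence, Tuple

SEG_ID = 151700  # hypothetical ID; the real value comes from the tokenizer


def split_outputs(
    token_ids: Sequence[int],
    hidden_states: Sequence[Sequence[float]],
    seg_id: int = SEG_ID,
) -> Tuple[List[int], List[Sequence[float]]]:
    """Separate plain text tokens from <SEG> hidden states.

    Returns the token IDs with <SEG> removed (for text decoding) and one
    hidden-state vector per <SEG> occurrence (for mask decoding).
    """
    if len(token_ids) != len(hidden_states):
        raise ValueError("expected one hidden state per generated token")

    text_ids: List[int] = []
    seg_states: List[Sequence[float]] = []
    for tok, state in zip(token_ids, hidden_states):
        if tok == seg_id:
            seg_states.append(state)  # would later be projected 1024 -> 256
        else:
            text_ids.append(tok)
    return text_ids, seg_states


# Toy example: three generated tokens, the last one is <SEG>.
ids = [11, 42, SEG_ID]
states = [[0.0] * 4, [0.1] * 4, [0.9] * 4]
text_ids, seg_states = split_outputs(ids, states)
needs_mask = len(seg_states) > 0  # text-only tasks produce no <SEG>
```

When `needs_mask` is false, only the decoded text is returned and the mask decoder is never invoked.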

Tasks

Text Output Tasks

| Task | Example |
|---|---|
| Image Captioning | "Describe this satellite image" → "Green trees surround industrial buildings near a river..." |
| Visual QA | "What type of land cover is shown?" → "Agricultural farmland" |
| Scene Classification | "Classify this satellite image" → "Industrial Buildings" |
| Object Detection | "What objects are visible?" → "3 buildings and 2 roads" |
| Flood Assessment | "Is there flooding visible?" → "Yes, 2 flooded buildings visible" |

Image Output Tasks

| Task | Example |
|---|---|
| Building Segmentation | "Segment the buildings" → Text + Binary mask image |
| Instance Segmentation | "Detect and segment all structures" → Text + Per-instance masks |

Training Details

Loss Function

L = L_text + L_mask
L_text = CrossEntropy(predicted_tokens, target_tokens)
L_mask = 2.0 * BCE(predicted_mask, gt_mask) + 0.5 * DICE(predicted_mask, gt_mask)
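
To make the weighting in the formula above concrete, here is a dependency-free sketch that evaluates it on flattened mask probabilities. The clamping constant `EPS` and the exact soft Dice formulation are assumptions; the training script may smooth or reduce these terms differently.

```python
import math
from typing import Sequence

EPS = 1e-7  # numerical floor; an assumption, not taken from train.py


def bce_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean binary cross-entropy over flattened mask probabilities."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, EPS), 1.0 - EPS)  # clamp to keep log() finite
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(pred)


def dice_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Soft Dice loss: 1 - 2*sum(p*g) / (sum(p) + sum(g))."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + EPS) / (denom + EPS)


def total_loss(l_text: float, pred: Sequence[float], target: Sequence[float]) -> float:
    """L = L_text + 2.0 * BCE + 0.5 * DICE, using the documented weights."""
    l_mask = 2.0 * bce_loss(pred, target) + 0.5 * dice_loss(pred, target)
    return l_text + l_mask


gt = [1.0, 1.0, 0.0, 0.0]
good = [0.9, 0.8, 0.1, 0.2]   # mostly agrees with gt
bad = [0.2, 0.1, 0.9, 0.8]    # mostly disagrees with gt
loss_good = total_loss(0.0, good, gt)
loss_bad = total_loss(0.0, bad, gt)
```

The prediction that agrees with the ground truth yields the smaller loss, and the BCE term dominates because of its 2.0 weight.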

Hyperparameters

| Parameter | Value |
|---|---|
| Epochs | 2 |
| Effective batch size | 16 (1×16 or 2×8 grad accum, auto-detected) |
| Learning rate | 2e-4 |
| Scheduler | Cosine with 5% warmup |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| Max sequence length | 512 |
| Precision | FP16 (T4/Colab) or BF16 (A10G+), auto-detected |
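
The scheduler row can be read as the following sketch, which assumes linear warmup over the first 5% of optimizer steps followed by a half-cosine decay to zero. The function name `lr_at` and the step count derived from "~20K samples" are illustrative assumptions; the actual run may use a different step count or a nonzero floor.

```python
import math

BASE_LR = 2e-4          # learning rate from the table above
WARMUP_FRACTION = 0.05  # 5% warmup from the table above


def lr_at(step: int, total_steps: int, base_lr: float = BASE_LR) -> float:
    """Learning rate at a 0-indexed optimizer step.

    Assumes linear warmup followed by a half-cosine decay to zero.
    """
    warmup_steps = max(1, int(total_steps * WARMUP_FRACTION))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Roughly 20K samples, 2 epochs, effective batch size 16.
total_steps = (20_000 // 16) * 2
schedule = [lr_at(s, total_steps) for s in range(total_steps + 1)]
peak_step = max(range(len(schedule)), key=schedule.__getitem__)
```

Under these assumptions the rate peaks at 2e-4 when warmup ends and falls to zero at the final step.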

Parameter Budget

| Component | Params | Trainable? |
|---|---|---|
| Qwen3.5-0.8B VLM | ~873M | LoRA only (~5M) |
| SlimSAM-77 Encoder | ~5.7M | ❄️ Frozen |
| SlimSAM Mask Decoder | ~4M | ✅ Yes |
| Seg Projection MLP | ~1M | ✅ Yes |
| Total trainable | ~10M | ~1% of total |
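
The "~1% of total" figure can be checked with a few lines of arithmetic on the approximate counts in the table. The two-layer projector shape (1024 → 1024 → 256) at the end is purely hypothetical, included only to show that an MLP of that kind lands on the order of 1M parameters; the actual hidden width is not stated here.

```python
# Parameter counts in millions, copied from the table above (all approximate).
total_params_m = {
    "qwen_vlm": 873.0,
    "slimsam_encoder": 5.7,
    "slimsam_mask_decoder": 4.0,
    "seg_projection_mlp": 1.0,
}
trainable_params_m = {
    "lora_adapters": 5.0,
    "slimsam_mask_decoder": 4.0,
    "seg_projection_mlp": 1.0,
}

# LoRA adapters are added on top of the frozen VLM weights.
total_m = sum(total_params_m.values()) + trainable_params_m["lora_adapters"]
trainable_m = sum(trainable_params_m.values())
trainable_pct = 100.0 * trainable_m / total_m


def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


# Hypothetical two-layer projector: 1024 -> 1024 -> 256.
projector_params = linear_params(1024, 1024) + linear_params(1024, 256)

print(f"trainable: {trainable_m:.0f}M of {total_m:.1f}M ({trainable_pct:.2f}%)")
print(f"hypothetical projector: {projector_params:,} parameters")
```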

Dataset

Trained on rahuldshetty/satellite-multitask-omni:

  • ~20K training samples across 9 task types
  • Sources: EuroSAT, RSICD, PatternNet, FloodNet, building segmentation datasets
  • Segmentation masks from normalized polygon coordinates (0-1000 range)
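
For readers unfamiliar with normalized polygons, the sketch below shows one way a 0-1000 polygon could be turned into a binary mask. It assumes points are stored as (x, y) pairs, that 1000 maps to the full image width or height, and that pixels are tested at their centers; the dataset's own preprocessing may differ. All function names are illustrative.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def denormalize(points: Sequence[Point], width: int, height: int) -> List[Point]:
    """Map (x, y) pairs from the 0-1000 range to pixel coordinates."""
    return [(x / 1000.0 * width, y / 1000.0 * height) for x, y in points]


def point_in_polygon(x: float, y: float, poly: Sequence[Point]) -> bool:
    """Even-odd ray casting test against a closed polygon."""
    inside = False
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def rasterize(points: Sequence[Point], width: int, height: int) -> List[List[int]]:
    """Binary mask (rows of 0/1) with each pixel tested at its center."""
    poly = denormalize(points, width, height)
    return [
        [int(point_in_polygon(col + 0.5, row + 0.5, poly)) for col in range(width)]
        for row in range(height)
    ]


# A square covering the top-left quarter of an 8x8 image.
square = [(0, 0), (500, 0), (500, 500), (0, 500)]
mask = rasterize(square, width=8, height=8)
```

A production pipeline would more likely rasterize with an imaging library such as Pillow's `ImageDraw`; the pure-Python loop here is only meant to show the coordinate mapping.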

Usage

Installation

# Qwen3.5 requires transformers from git (>= 4.57.0)
pip install "transformers @ git+https://github.com/huggingface/transformers.git"
pip install torch torchvision peft datasets Pillow numpy huggingface_hub

# Or use the setup script:
bash setup.sh

Single-Image Inference

from inference import load_model, generate_text, generate_segmentation
from PIL import Image

# Load model
model = load_model("./model_dir", device="cuda")

# Text task
image = Image.open("satellite.jpg")
response = generate_text(model, image, "Describe this satellite image")
print(response)

# Segmentation task
text, mask = generate_segmentation(model, image, "Segment the buildings")
print(text)
mask.save("building_mask.png")

CLI Inference

# Text generation
python inference.py --model_dir ./model --image sat.jpg --prompt "What land cover type?"

# Segmentation (generates mask image)
python inference.py --model_dir ./model --image sat.jpg --prompt "Segment buildings" --output_mask mask.png

Dataset Evaluation

Run inference on sampled dataset examples across all task types and generate a structured report:

# Default: 3 samples per task from validation split
python infer_dataset.py --model_repo rahuldshetty/satellite-omni-lisa

# More samples, custom output
python infer_dataset.py --model_repo rahuldshetty/satellite-omni-lisa \
    --samples_per_task 5 \
    --split test \
    --output_dir ./my_eval

# Force CPU (slower but works without GPU)
python infer_dataset.py --device cpu --samples_per_task 2

This produces:

eval_output/
├── REPORT.md              ← Markdown report with all results
├── results.json           ← Machine-readable results
└── images/
    ├── sample_000_input.png
    ├── sample_003_gt_mask.png
    ├── sample_003_pred_mask.png
    └── ...

The REPORT.md includes:

  • Summary table: per-task sample counts, error counts, mean IoU for segmentation
  • Per-task sections, where each sample shows:
    • Input satellite image
    • Text tasks: expected vs model output in a comparison table
    • Segmentation tasks: GT mask vs predicted mask side-by-side with IoU score
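
The IoU score in the report compares a predicted mask with its ground truth. A minimal sketch of that metric on nested lists follows; the function name `mask_iou` and the convention that two empty masks score 1.0 are assumptions, and `infer_dataset.py` may handle the empty case differently.

```python
from typing import Sequence


def mask_iou(pred: Sequence[Sequence[int]], gt: Sequence[Sequence[int]]) -> float:
    """Intersection over union of two binary masks of equal shape."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            intersection += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumption: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


gt_mask = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 0],
]
pred_mask = [
    [1, 1, 1],
    [1, 0, 0],
    [0, 0, 0],
]
iou = mask_iou(pred_mask, gt_mask)  # 3 shared pixels / 5 pixels in the union
```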

Training

# Auto-detects GPU, sets batch size, bf16/fp16 automatically
python train.py

# Override settings via environment variables
BATCH_SIZE=1 GRAD_ACCUM=16 MAX_SAMPLES=100 python train.py

| Environment Variable | Default | Description |
|---|---|---|
| BATCH_SIZE | Auto (1 for T4, 2 for A10G+) | Per-device batch size |
| GRAD_ACCUM | 16 // BATCH_SIZE | Gradient accumulation steps |
| MAX_SAMPLES | 0 (all) | Limit dataset size for debugging |
| NUM_WORKERS | 0 | Dataloader workers (0 = main process, saves RAM) |
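
The defaults in this table interact: `GRAD_ACCUM` is derived from `BATCH_SIZE` so that the effective batch size stays at 16. The sketch below shows one way such resolution could work. `resolve_settings` is a hypothetical helper, not a function from `train.py`, and `auto_batch_size` stands in for the script's hardware detection.

```python
from typing import Mapping, Optional, Tuple

EFFECTIVE_BATCH_SIZE = 16  # from the hyperparameter table


def resolve_settings(
    env: Mapping[str, str], auto_batch_size: int
) -> Tuple[int, int, Optional[int], int]:
    """Return (batch_size, grad_accum, max_samples, num_workers).

    `auto_batch_size` stands in for hardware detection
    (1 on a T4, 2 on A10G or better, per the table).
    """
    batch_size = int(env.get("BATCH_SIZE", auto_batch_size))
    grad_accum = int(env.get("GRAD_ACCUM", EFFECTIVE_BATCH_SIZE // batch_size))
    max_samples = int(env.get("MAX_SAMPLES", 0)) or None  # 0 means "use all"
    num_workers = int(env.get("NUM_WORKERS", 0))
    return batch_size, grad_accum, max_samples, num_workers


# Mirrors the override command shown above.
overrides = {"BATCH_SIZE": "1", "GRAD_ACCUM": "16", "MAX_SAMPLES": "100"}
settings = resolve_settings(overrides, auto_batch_size=2)

# No overrides on an A10G-class GPU: batch size 2, so 16 // 2 = 8 accumulation steps.
defaults = resolve_settings({}, auto_batch_size=2)
```

Passing `os.environ` as `env` would read the real environment; a plain dict keeps the example deterministic.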

Files

| File | Description |
|---|---|
| train.py | Full training script with auto hardware detection |
| inference.py | Single-image inference (text + segmentation) |
| infer_dataset.py | Dataset evaluation: samples all tasks, generates REPORT.md |
| setup.sh | Auto-installs transformers from git if needed |
| requirements.txt | Pip dependencies with version pins |
| vlm_lora/ | LoRA adapter weights for Qwen3.5-0.8B |
| seg_projector.pt | MLP projection weights (1024→256) |
| sam_mask_decoder.pt | Fine-tuned SlimSAM mask decoder weights |
| tokenizer/ | Tokenizer with added <SEG> token |
| model_config.json | Model configuration (model IDs, dimensions, tokens) |
