boudiafA committed on
Commit 0a8ae4e · verified · 1 Parent(s): 5223b5a

Add CropVLM model card and code

.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ docs/figures/agri_semantics_coverage.png filter=lfs diff=lfs merge=lfs -text
37
+ docs/figures/cropvlm_framework.png filter=lfs diff=lfs merge=lfs -text
38
+ docs/figures/semantic_annotation_examples.png filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,10 @@
1
+ __pycache__/
2
+ *.py[cod]
3
+ .ipynb_checkpoints/
4
+ .DS_Store
5
+
6
+ models/*.pth
7
+ models/*.pt
8
+ models/*.ckpt
9
+ outputs/*
10
+ !outputs/.gitkeep
README.md CHANGED
@@ -1,3 +1,176 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ # CropVLM: A Domain-Adapted Vision-Language Model for Open-Set Crop Analysis
2
+
3
+ CropVLM is a CLIP-based zero-shot image classifier adapted for crop and fruit recognition. It compares one image embedding against text embeddings for candidate class names, then returns the class with the highest cosine similarity.
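+
+ As a minimal sketch of that scoring loop, the same comparison can be written directly against the plain OpenAI CLIP ViT-B/32 backbone (no CropVLM checkpoint loaded, illustrative label subset); the loader described below wraps the same steps and additionally applies the fine-tuned weights:
+
+ ```python
+ import clip
+ import torch
+ from PIL import Image
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ model, preprocess = clip.load("ViT-B/32", device=device)
+
+ class_names = ["apple", "banana", "wheat"]  # any candidate labels
+ tokens = clip.tokenize(class_names).to(device)
+ image = preprocess(Image.open("examples/cacao.png")).unsqueeze(0).to(device)
+
+ with torch.no_grad():
+     text_features = model.encode_text(tokens)
+     image_features = model.encode_image(image)
+
+ # After L2 normalization, cosine similarity is a plain dot product.
+ text_features = text_features / text_features.norm(dim=-1, keepdim=True)
+ image_features = image_features / image_features.norm(dim=-1, keepdim=True)
+ scores = (image_features @ text_features.T).squeeze(0)
+ print(class_names[scores.argmax().item()])
+ ```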
4
+
5
+ ![CropVLM framework overview](docs/figures/cropvlm_framework.png)
6
+
7
+ This repository contains:
8
+
9
+ - a simple CropVLM Python loader,
10
+ - a Gradio demo for classifying one image,
11
+ - a zero-shot evaluation script for ImageFolder-style datasets,
12
+ - five high-margin example images in `examples/`, selected as described below.
13
+
14
+ ## Agri-Semantics Data
15
+
16
+ CropVLM is adapted with dense agricultural image-text supervision. The Agri-Semantics dataset spans 37 crop classes across fruits, vegetables, grains, and industrial crops, with examples covering visual diversity such as ripeness levels, varieties, and growth stages.
17
+
18
+ ![Agri-Semantics crop coverage](docs/figures/agri_semantics_coverage.png)
19
+
20
+ The generated captions encode crop identity together with phenotypic cues such as ripeness, count, color, and spatial position.
21
+
22
+ ![Agri-Semantics annotation examples](docs/figures/semantic_annotation_examples.png)
23
+
24
+ ## Zero-Shot Classification Comparison
25
+
26
+ We evaluate CropVLM against CLIP-based baselines by encoding each crop class name once, encoding each test image, and assigning the class with the highest cosine similarity in the shared image-text embedding space. The table reports results on the held-out 37-class crop test split.
27
+
28
+ | Model | Overall Accuracy (%) | Per-Class Mean +/- Std (%) |
29
+ |---|---:|---:|
30
+ | SigLIP 2 | 3.43 | 3.43 +/- 16.91 |
31
+ | AgriCLIP | 4.04 | 4.04 +/- 14.61 |
32
+ | RemoteCLIP | 42.52 | 42.52 +/- 27.57 |
33
+ | BioCLIP | 48.33 | 48.34 +/- 34.95 |
34
+ | BioTrove-CLIP | 51.07 | 51.07 +/- 36.20 |
35
+ | BioCLIP 2 | 67.74 | 67.74 +/- 31.17 |
36
+ | OpenAI CLIP ViT-B/32 | 70.24 | 70.24 +/- 28.83 |
37
+ | **CropVLM** | **72.51** | **72.51 +/- 29.71** |
38
+
39
+ ## Installation
40
+
41
+ Create an environment and install the dependencies:
42
+
43
+ ```bash
44
+ conda create -n cropvlm python=3.10 -y
45
+ conda activate cropvlm
46
+ pip install -r requirements.txt
47
+ ```
48
+
49
+ For GPU inference, install the CUDA build of PyTorch that matches your system before installing the remaining dependencies. For example:
50
+
51
+ ```bash
52
+ pip install --index-url https://download.pytorch.org/whl/cu121 torch torchvision
53
+ pip install -r requirements.txt
54
+ ```
55
+
56
+ ## Checkpoint
57
+
58
+ This Hugging Face repository includes the CropVLM checkpoint:
59
+
60
+ ```text
61
+ models/CropCLIP_FullDataset_Acc_0.75.pth
62
+ ```
63
+
64
+ You can download it with `huggingface_hub`:
65
+
66
+ ```python
67
+ from huggingface_hub import hf_hub_download
68
+
69
+ checkpoint = hf_hub_download(
70
+ repo_id="boudiafA/CropVLM",
71
+ filename="models/CropCLIP_FullDataset_Acc_0.75.pth",
72
+ )
73
+ ```
74
+
75
+ You can also clone the repository and use the local checkpoint path with `--checkpoint` or `--cropvlm-checkpoint`.
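+
+ Alternatively, `snapshot_download` mirrors the whole repository (code, examples, and the LFS checkpoint) into the local Hugging Face cache; a small sketch using `huggingface_hub`:
+
+ ```python
+ from huggingface_hub import snapshot_download
+
+ local_dir = snapshot_download(repo_id="boudiafA/CropVLM")
+ # e.g. pass f"{local_dir}/models/CropCLIP_FullDataset_Acc_0.75.pth" to --checkpoint
+ print(local_dir)
+ ```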
76
+
77
+ ## Gradio Demo
78
+
79
+ Run:
80
+
81
+ ```bash
82
+ python scripts/gradio_demo.py \
83
+ --checkpoint models/CropCLIP_FullDataset_Acc_0.75.pth
84
+ ```
85
+
86
+ Then open:
87
+
88
+ ```text
89
+ http://127.0.0.1:7860
90
+ ```
91
+
92
+ The demo lets you upload any image and edit the candidate class names. The default class list is:
93
+
94
+ ```text
95
+ apple, avocado, banana, barley, bell pepper, broccoli, cacao, canola,
96
+ cauliflower, cherry, chilli, coconut, coffee, corn, cotton, cucumber,
97
+ eggplant, kiwi, lemon, mango, olive, orange, pear, peas, pineapple,
98
+ pomegranate, potato, pumpkin, rice, soyabean, strawberry, sugarcane,
99
+ sunflower, tea, tomato, watermelon, wheat
100
+ ```
101
+
102
+ The included examples are `cacao`, `olive`, `cauliflower`, `sugarcane`, and `sunflower`. They were selected from correct CropVLM predictions with a large cosine-similarity gap between the correct class and the second-best class. The selection details are in `examples/selection_metadata.json`.
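+
+ The recorded margins can be inspected directly, for example (field names as in `examples/selection_metadata.json`):
+
+ ```python
+ import json
+
+ with open("examples/selection_metadata.json") as f:
+     metadata = json.load(f)
+
+ for example in metadata["examples"]:
+     print(f'{example["class"]}: margin {example["margin"]:.3f} over {example["second_class"]}')
+ ```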
103
+
104
+ ## Use CropVLM In Python
105
+
106
+ ```python
107
+ from PIL import Image
108
+ from cropvlm import load_cropvlm
109
+
110
+ classifier = load_cropvlm("models/CropCLIP_FullDataset_Acc_0.75.pth")
111
+ image = Image.open("examples/cacao.png")
112
+
113
+ for label, score in classifier.predict(image, top_k=5):
114
+ print(label, score)
115
+ ```
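+
+ `predict` returns softmax probabilities over the current class list. To also see the raw cosine similarities, or to classify against a custom label set, the classifier exposes `predict_with_scores` and `set_classes` (continuing the snippet above):
+
+ ```python
+ classifier.set_classes(["cacao", "coffee", "olive"])
+ for label, probability, cosine in classifier.predict_with_scores(image, top_k=3):
+     print(f"{label}: p={probability:.3f}, cosine={cosine:.3f}")
+ ```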
116
+
117
+ ## Evaluate Zero-Shot Accuracy
118
+
119
+ The dataset should be arranged like `torchvision.datasets.ImageFolder`:
120
+
121
+ ```text
122
+ Crop_Dataset_testing/
123
+ apple/
124
+ image_001.png
125
+ banana/
126
+ image_001.png
127
+ ...
128
+ ```
129
+
130
+ Run CropVLM and the supported comparison CLIP models:
131
+
132
+ ```bash
133
+ python scripts/evaluate_zero_shot.py \
134
+ --dataset /path/to/Crop_Dataset_testing \
135
+ --cropvlm-checkpoint models/CropCLIP_FullDataset_Acc_0.75.pth \
136
+ --output outputs/zero_shot_results.json \
137
+ --batch-size 64
138
+ ```
139
+
140
+ By default, the script evaluates:
141
+
142
+ ```text
143
+ cropvlm
144
+ openai_clip_vit_b32
145
+ bioclip
146
+ bioclip2
147
+ biotrove_clip
148
+ remoteclip
149
+ siglip2
150
+ ```
151
+
152
+ You can choose a subset:
153
+
154
+ ```bash
155
+ python scripts/evaluate_zero_shot.py \
156
+ --dataset /path/to/test_dataset \
157
+ --models cropvlm openai_clip_vit_b32 bioclip2 \
158
+ --output outputs/subset_results.json
159
+ ```
160
+
161
+ The output JSON includes:
162
+
163
+ - `models`: compact per-model scores,
164
+ - `model_results`: full per-model details keyed by model name,
165
+ - `results`: full per-model details as a list,
166
+ - per-class accuracy,
167
+ - per-class accuracy mean and standard deviation,
168
+ - confusion matrix,
169
+ - optional per-image predictions when `--save-predictions` is used.
170
+
171
+ The mean and standard deviation are computed across per-class accuracies.
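+
+ For example, the compact per-model summary can be read back like this (keys as written by `scripts/evaluate_zero_shot.py`):
+
+ ```python
+ import json
+
+ with open("outputs/zero_shot_results.json") as f:
+     summary = json.load(f)
+
+ for model_key, scores in summary["models"].items():
+     if scores["status"] != "ok":
+         continue
+     mean = scores["per_class_accuracy_mean"]
+     std = scores["per_class_accuracy_std"]
+     print(f'{model_key}: accuracy={scores["accuracy"]:.4f}, per-class {mean:.4f} +/- {std:.4f}')
+ ```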
172
+
173
+ ## Notes
174
+
175
+ - BioCLIP, BioCLIP 2, BioTrove-CLIP, RemoteCLIP, and SigLIP 2 weights are downloaded automatically by their respective libraries on first use.
176
+ - The score used for classification is cosine similarity between normalized image and text embeddings.
cropvlm/__init__.py ADDED
@@ -0,0 +1,3 @@
1
+ from .model import CROP_CLASSES, CropVLMClassifier, load_cropvlm
2
+
3
+ __all__ = ["CROP_CLASSES", "CropVLMClassifier", "load_cropvlm"]
cropvlm/model.py ADDED
@@ -0,0 +1,139 @@
1
+ from pathlib import Path
2
+ from typing import Dict, Iterable, List, Sequence, Tuple
3
+
4
+ import torch
5
+ import torch.nn.functional as F
6
+ from PIL import Image
7
+
8
+
9
+ CROP_CLASSES = [
10
+ "apple",
11
+ "avocado",
12
+ "banana",
13
+ "barley",
14
+ "bell pepper",
15
+ "broccoli",
16
+ "cacao",
17
+ "canola",
18
+ "cauliflower",
19
+ "cherry",
20
+ "chilli",
21
+ "coconut",
22
+ "coffee",
23
+ "corn",
24
+ "cotton",
25
+ "cucumber",
26
+ "eggplant",
27
+ "kiwi",
28
+ "lemon",
29
+ "mango",
30
+ "olive",
31
+ "orange",
32
+ "pear",
33
+ "peas",
34
+ "pineapple",
35
+ "pomegranate",
36
+ "potato",
37
+ "pumpkin",
38
+ "rice",
39
+ "soyabean",
40
+ "strawberry",
41
+ "sugarcane",
42
+ "sunflower",
43
+ "tea",
44
+ "tomato",
45
+ "watermelon",
46
+ "wheat",
47
+ ]
48
+
49
+
50
+ def _normalize(features: torch.Tensor) -> torch.Tensor:
51
+ return F.normalize(features.float(), dim=-1)
52
+
53
+
54
+ class CropVLMClassifier:
55
+ """Small zero-shot wrapper around the CropVLM/OpenAI CLIP ViT-B/32 model."""
56
+
57
+ def __init__(
58
+ self,
59
+ checkpoint: str,
60
+ class_names: Sequence[str] = CROP_CLASSES,
61
+ device: str | None = None,
62
+ prompt_template: str = "{}",
63
+ ):
64
+ import clip
65
+
66
+ self.clip = clip
67
+ self.device = torch.device(device or ("cuda" if torch.cuda.is_available() else "cpu"))
68
+ self.prompt_template = prompt_template
69
+ self.class_names = list(class_names)
70
+
71
+ checkpoint_path = Path(checkpoint)
72
+ if not checkpoint_path.exists():
73
+ raise FileNotFoundError(f"CropVLM checkpoint not found: {checkpoint_path}")
74
+
75
+ self.model, self.preprocess = clip.load(
76
+ "ViT-B/32",
77
+ device=str(self.device),
78
+ download_root=str(Path.home() / ".cache" / "clip"),
79
+ )
80
+ ckpt = torch.load(checkpoint_path, map_location=self.device)
81
+ state = ckpt.get("model_state_dict", ckpt.get("state_dict", ckpt))
82
+ self.model.load_state_dict(state)
83
+ self.model.eval()
84
+ self.set_classes(self.class_names)
85
+
86
+ def set_classes(self, class_names: Sequence[str]) -> None:
87
+ self.class_names = [c.strip() for c in class_names if c.strip()]
88
+ prompts = [self.prompt_template.format(c) for c in self.class_names]
89
+ tokens = self.clip.tokenize(prompts, truncate=True).to(self.device)
90
+ with torch.no_grad():
91
+ self.text_features = _normalize(self.model.encode_text(tokens))
92
+
93
+ def encode_image(self, image: Image.Image) -> torch.Tensor:
94
+ image = image.convert("RGB")
95
+ batch = self.preprocess(image).unsqueeze(0).to(self.device)
96
+ with torch.no_grad():
97
+ return _normalize(self.model.encode_image(batch))
98
+
99
+ def predict(self, image: Image.Image, top_k: int = 5) -> List[Tuple[str, float]]:
100
+ return [(label, probability) for label, probability, _ in self.predict_with_scores(image, top_k=top_k)]
101
+
102
+ def predict_scores(self, image: Image.Image) -> Dict[str, float]:
103
+ image_features = self.encode_image(image)
104
+ logits = (image_features @ self.text_features.T).squeeze(0)
105
+ return {name: float(score) for name, score in zip(self.class_names, logits.tolist())}
106
+
107
+ def predict_with_scores(self, image: Image.Image, top_k: int = 5) -> List[Tuple[str, float, float]]:
108
+ image_features = self.encode_image(image)
109
+ cosine_scores = (image_features @ self.text_features.T).squeeze(0)
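+ # CLIP's learned temperature converts the cosine similarities into softmax probabilities;
+ # the clamp mirrors the maximum logit scale of 100 used during CLIP training.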
110
+ logit_scale = self.model.logit_scale.exp().clamp(max=100)
111
+ probabilities = (logit_scale * cosine_scores).softmax(dim=-1)
112
+ k = min(top_k, len(self.class_names))
113
+ probs, indices = probabilities.topk(k)
114
+ return [
115
+ (self.class_names[idx], float(prob), float(cosine_scores[idx]))
116
+ for prob, idx in zip(probs.tolist(), indices.tolist())
117
+ ]
118
+
119
+
120
+ def load_cropvlm(
121
+ checkpoint: str,
122
+ class_names: Sequence[str] = CROP_CLASSES,
123
+ device: str | None = None,
124
+ prompt_template: str = "{}",
125
+ ) -> CropVLMClassifier:
126
+ return CropVLMClassifier(
127
+ checkpoint=checkpoint,
128
+ class_names=class_names,
129
+ device=device,
130
+ prompt_template=prompt_template,
131
+ )
132
+
133
+
134
+ def parse_class_names(text: str | Iterable[str]) -> List[str]:
135
+ if isinstance(text, str):
136
+ raw = text.replace(",", "\n").splitlines()
137
+ else:
138
+ raw = list(text)
139
+ return [name.strip() for name in raw if name.strip()]
docs/figures/agri_semantics_coverage.png ADDED

Git LFS Details

  • SHA256: 41e76d65baf966b3b2af823a242fd4025ef554e4ded07b61322b6df41afa74e5
  • Pointer size: 132 Bytes
  • Size of remote file: 2.35 MB
docs/figures/cropvlm_framework.png ADDED

Git LFS Details

  • SHA256: dd9b27cd8dcccfb4eb311e50144dd6f4b905091a5a1ef0789f656e135192b5a6
  • Pointer size: 132 Bytes
  • Size of remote file: 1.62 MB
docs/figures/semantic_annotation_examples.png ADDED

Git LFS Details

  • SHA256: 6062ef8bbff1bd65d5492a3e9d73db5a5850e9cd43c3510affb449b8fbbddb39
  • Pointer size: 132 Bytes
  • Size of remote file: 1.34 MB
examples/cacao.png ADDED
examples/cauliflower.png ADDED
examples/olive.png ADDED
examples/selection_metadata.json ADDED
@@ -0,0 +1,51 @@
1
+ {
2
+ "selection_method": "CropVLM was run on the full testing dataset. Examples were selected from correct predictions by descending cosine margin between the correct class and the second-highest class, preferring less-common crops and keeping at most one image per class.",
3
+ "score_type": "cosine similarity after L2-normalizing image and text embeddings",
4
+ "examples": [
5
+ {
6
+ "file": "cacao.png",
7
+ "source": "/mnt/e/Desktop/Datasets/FruitDataset/Crop_Dataset_testing/cacao/cacao_01171.png",
8
+ "class": "cacao",
9
+ "top1_score": 0.336548,
10
+ "second_class": "mango",
11
+ "second_score": 0.218446,
12
+ "margin": 0.118102
13
+ },
14
+ {
15
+ "file": "olive.png",
16
+ "source": "/mnt/e/Desktop/Datasets/FruitDataset/Crop_Dataset_testing/olive/olive_01140.png",
17
+ "class": "olive",
18
+ "top1_score": 0.329435,
19
+ "second_class": "peas",
20
+ "second_score": 0.215531,
21
+ "margin": 0.113904
22
+ },
23
+ {
24
+ "file": "cauliflower.png",
25
+ "source": "/mnt/e/Desktop/Datasets/FruitDataset/Crop_Dataset_testing/cauliflower/cauliflower_01107.png",
26
+ "class": "cauliflower",
27
+ "top1_score": 0.355567,
28
+ "second_class": "cucumber",
29
+ "second_score": 0.246063,
30
+ "margin": 0.109503
31
+ },
32
+ {
33
+ "file": "sugarcane.png",
34
+ "source": "/mnt/e/Desktop/Datasets/FruitDataset/Crop_Dataset_testing/sugarcane/sugarcane_01134.png",
35
+ "class": "sugarcane",
36
+ "top1_score": 0.334631,
37
+ "second_class": "rice",
38
+ "second_score": 0.226767,
39
+ "margin": 0.107864
40
+ },
41
+ {
42
+ "file": "sunflower.png",
43
+ "source": "/mnt/e/Desktop/Datasets/FruitDataset/Crop_Dataset_testing/sunflower/sunflower_01425.png",
44
+ "class": "sunflower",
45
+ "top1_score": 0.309900,
46
+ "second_class": "pineapple",
47
+ "second_score": 0.208610,
48
+ "margin": 0.101291
49
+ }
50
+ ]
51
+ }
examples/sugarcane.png ADDED
examples/sunflower.png ADDED
models/.gitkeep ADDED
File without changes
outputs/.gitkeep ADDED
File without changes
requirements.txt ADDED
@@ -0,0 +1,13 @@
1
+ torch
2
+ torchvision
3
+ ftfy
4
+ regex
5
+ tqdm
6
+ Pillow
7
+ numpy
8
+ pandas
9
+ gradio
10
+ open_clip_torch
11
+ transformers
12
+ huggingface_hub
13
+ git+https://github.com/openai/CLIP.git
scripts/evaluate_zero_shot.py ADDED
@@ -0,0 +1,391 @@
1
+ import argparse
2
+ import json
3
+ import math
4
+ import time
5
+ import traceback
6
+ from datetime import datetime, timezone
7
+ from pathlib import Path
8
+ from typing import Any, Dict, List, Optional, Sequence, Tuple
9
+
10
+ import torch
11
+ import torch.nn.functional as F
12
+ from PIL import Image
13
+ from torch.utils.data import DataLoader, Dataset
14
+ from tqdm import tqdm
15
+
16
+
17
+ IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}
18
+ DEFAULT_MODELS = [
19
+ "cropvlm",
20
+ "openai_clip_vit_b32",
21
+ "bioclip",
22
+ "bioclip2",
23
+ "biotrove_clip",
24
+ "remoteclip",
25
+ "siglip2",
26
+ ]
27
+
28
+
29
+ class ImageFolderPaths(Dataset):
30
+ def __init__(self, root: str):
31
+ self.root = Path(root)
32
+ self.classes = sorted([p.name for p in self.root.iterdir() if p.is_dir()])
33
+ self.class_to_idx = {name: idx for idx, name in enumerate(self.classes)}
34
+ self.samples: List[Tuple[Path, int]] = []
35
+ for class_name in self.classes:
36
+ for path in sorted((self.root / class_name).iterdir()):
37
+ if path.is_file() and path.suffix.lower() in IMAGE_EXTS:
38
+ self.samples.append((path, self.class_to_idx[class_name]))
39
+
40
+ def __len__(self) -> int:
41
+ return len(self.samples)
42
+
43
+ def __getitem__(self, idx: int):
44
+ path, label = self.samples[idx]
45
+ return Image.open(path).convert("RGB"), label, str(path)
46
+
47
+
48
+ def pil_collate(batch):
49
+ images, labels, paths = zip(*batch)
50
+ return list(images), torch.tensor(labels, dtype=torch.long), list(paths)
51
+
52
+
53
+ def display_name(class_name: str) -> str:
54
+ return class_name.replace("_", " ")
55
+
56
+
57
+ def normalize(features: torch.Tensor) -> torch.Tensor:
58
+ if isinstance(features, (tuple, list)):
59
+ features = features[0]
60
+ return F.normalize(features.float(), dim=-1)
61
+
62
+
63
+ class Adapter:
64
+ name = ""
65
+ family = ""
66
+ checkpoint: Optional[str] = None
67
+ load_message: Optional[str] = None
68
+
69
+ def encode_text(self, prompts: Sequence[str]) -> torch.Tensor:
70
+ raise NotImplementedError
71
+
72
+ def encode_images(self, images: Sequence[Image.Image]) -> torch.Tensor:
73
+ raise NotImplementedError
74
+
75
+
76
+ class OpenAIClipAdapter(Adapter):
77
+ def __init__(self, device: torch.device, checkpoint: Optional[str] = None):
78
+ import clip
79
+
80
+ self.name = "CropVLM" if checkpoint else "OpenAI CLIP ViT-B/32"
81
+ self.family = "openai_clip"
82
+ self.device = device
83
+ self.clip = clip
84
+ self.model, self.preprocess = clip.load("ViT-B/32", device=str(device))
85
+ if checkpoint:
86
+ checkpoint_path = Path(checkpoint)
87
+ if not checkpoint_path.exists():
88
+ raise FileNotFoundError(f"CropVLM checkpoint not found: {checkpoint_path}")
89
+ ckpt = torch.load(checkpoint_path, map_location=device)
90
+ state = ckpt.get("model_state_dict", ckpt.get("state_dict", ckpt))
91
+ self.model.load_state_dict(state)
92
+ self.checkpoint = str(checkpoint_path)
93
+ self.model.eval()
94
+
95
+ def encode_text(self, prompts: Sequence[str]) -> torch.Tensor:
96
+ tokens = self.clip.tokenize(list(prompts), truncate=True).to(self.device)
97
+ with torch.no_grad():
98
+ return normalize(self.model.encode_text(tokens))
99
+
100
+ def encode_images(self, images: Sequence[Image.Image]) -> torch.Tensor:
101
+ batch = torch.stack([self.preprocess(image) for image in images]).to(self.device)
102
+ with torch.no_grad():
103
+ return normalize(self.model.encode_image(batch))
104
+
105
+
106
+ class OpenClipAdapter(Adapter):
107
+ def __init__(
108
+ self,
109
+ model_name: str,
110
+ pretrained: Optional[str],
111
+ device: torch.device,
112
+ hf_checkpoint: Optional[Tuple[str, str]] = None,
113
+ ):
114
+ import open_clip
115
+
116
+ self.name = model_name
117
+ self.family = "open_clip"
118
+ self.device = device
119
+ self.model_name = model_name
120
+ self.pretrained = pretrained
121
+ self.open_clip = open_clip
122
+
123
+ if hf_checkpoint:
124
+ from huggingface_hub import hf_hub_download
125
+
126
+ repo, filename = hf_checkpoint
127
+ checkpoint = hf_hub_download(repo, filename)
128
+ self.model, _, self.preprocess = open_clip.create_model_and_transforms(model_name, pretrained=None)
129
+ ckpt = torch.load(checkpoint, map_location="cpu")
130
+ state = ckpt.get("state_dict", ckpt.get("model_state_dict", ckpt)) if isinstance(ckpt, dict) else ckpt
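+ # Checkpoints saved under DataParallel/DistributedDataParallel prefix every key with "module."; strip it so the keys match.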
131
+ if any(key.startswith("module.") for key in state):
132
+ state = {key.removeprefix("module."): value for key, value in state.items()}
133
+ self.load_message = str(self.model.load_state_dict(state, strict=False))
134
+ self.checkpoint = checkpoint
135
+ else:
136
+ self.model, _, self.preprocess = open_clip.create_model_and_transforms(
137
+ model_name,
138
+ pretrained=pretrained,
139
+ )
140
+
141
+ self.tokenizer = open_clip.get_tokenizer(model_name)
142
+ self.model.to(device).eval()
143
+
144
+ def encode_text(self, prompts: Sequence[str]) -> torch.Tensor:
145
+ tokens = self.tokenizer(list(prompts)).to(self.device)
146
+ with torch.no_grad():
147
+ return normalize(self.model.encode_text(tokens))
148
+
149
+ def encode_images(self, images: Sequence[Image.Image]) -> torch.Tensor:
150
+ batch = torch.stack([self.preprocess(image) for image in images]).to(self.device)
151
+ with torch.no_grad():
152
+ return normalize(self.model.encode_image(batch))
153
+
154
+
155
+ class Siglip2Adapter(Adapter):
156
+ def __init__(self, device: torch.device):
157
+ from transformers import AutoModel, AutoProcessor
158
+
159
+ self.name = "google/siglip2-base-patch16-224"
160
+ self.family = "transformers_siglip2"
161
+ self.device = device
162
+ self.processor = AutoProcessor.from_pretrained(self.name)
163
+ self.model = AutoModel.from_pretrained(self.name).to(device).eval()
164
+
165
+ def encode_text(self, prompts: Sequence[str]) -> torch.Tensor:
166
+ inputs = self.processor(text=list(prompts), padding=True, return_tensors="pt").to(self.device)
167
+ with torch.no_grad():
168
+ if hasattr(self.model, "get_text_features"):
169
+ features = self.model.get_text_features(**inputs)
170
+ else:
171
+ features = self.model(**inputs).text_embeds
172
+ return normalize(features)
173
+
174
+ def encode_images(self, images: Sequence[Image.Image]) -> torch.Tensor:
175
+ inputs = self.processor(images=list(images), return_tensors="pt").to(self.device)
176
+ with torch.no_grad():
177
+ if hasattr(self.model, "get_image_features"):
178
+ features = self.model.get_image_features(**inputs)
179
+ else:
180
+ features = self.model(**inputs).image_embeds
181
+ return normalize(features)
182
+
183
+
184
+ def build_adapter(model_key: str, device: torch.device, cropvlm_checkpoint: str) -> Adapter:
185
+ if model_key == "cropvlm":
186
+ return OpenAIClipAdapter(device, checkpoint=cropvlm_checkpoint)
187
+ if model_key == "openai_clip_vit_b32":
188
+ return OpenAIClipAdapter(device)
189
+ if model_key == "bioclip":
190
+ return OpenClipAdapter("hf-hub:imageomics/bioclip", None, device)
191
+ if model_key == "bioclip2":
192
+ return OpenClipAdapter("hf-hub:imageomics/bioclip-2", None, device)
193
+ if model_key == "biotrove_clip":
194
+ return OpenClipAdapter(
195
+ "ViT-B-16",
196
+ None,
197
+ device,
198
+ hf_checkpoint=("BGLab/BioTrove-CLIP", "biotroveclip-vit-b-16-from-bioclip-epoch-8.pt"),
199
+ )
200
+ if model_key == "remoteclip":
201
+ return OpenClipAdapter(
202
+ "ViT-B-32",
203
+ None,
204
+ device,
205
+ hf_checkpoint=("chendelong/RemoteCLIP", "RemoteCLIP-ViT-B-32.pt"),
206
+ )
207
+ if model_key == "siglip2":
208
+ return Siglip2Adapter(device)
209
+ raise KeyError(
210
+ f"Unknown model '{model_key}'. Supported models: {', '.join(DEFAULT_MODELS)}. "
211
+ "TULIP, EVA-CLIP, and LongCLIP are intentionally excluded."
212
+ )
213
+
214
+
215
+ def per_class_stats(per_class: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
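+ # Aggregate per-class accuracies: sample std divides by N - 1, population std by N.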
216
+ values = [item["accuracy"] for item in per_class.values() if item.get("accuracy") is not None]
217
+ if not values:
218
+ return {
219
+ "per_class_accuracy_mean": None,
220
+ "per_class_accuracy_std": None,
221
+ "per_class_accuracy_std_population": None,
222
+ "num_classes_with_accuracy": 0,
223
+ }
224
+ mean = sum(values) / len(values)
225
+ sample_std = math.sqrt(sum((x - mean) ** 2 for x in values) / (len(values) - 1)) if len(values) > 1 else 0.0
226
+ population_std = math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))
227
+ return {
228
+ "per_class_accuracy_mean": mean,
229
+ "per_class_accuracy_std": sample_std,
230
+ "per_class_accuracy_std_population": population_std,
231
+ "num_classes_with_accuracy": len(values),
232
+ }
233
+
234
+
235
+ def evaluate_model(args: argparse.Namespace, dataset: ImageFolderPaths, model_key: str) -> Dict[str, Any]:
236
+ started_at = time.time()
237
+ device = torch.device(args.device or ("cuda" if torch.cuda.is_available() else "cpu"))
238
+ prompts = [args.prompt_template.format(display_name(class_name)) for class_name in dataset.classes]
239
+ result: Dict[str, Any] = {
240
+ "model_key": model_key,
241
+ "dataset": str(dataset.root),
242
+ "num_images": len(dataset),
243
+ "num_classes": len(dataset.classes),
244
+ "classes": dataset.classes,
245
+ "class_prompts": dict(zip(dataset.classes, prompts)),
246
+ "prompt_template": args.prompt_template,
247
+ "device": str(device),
248
+ "status": "started",
249
+ "started_at_unix": started_at,
250
+ }
251
+
252
+ try:
253
+ adapter = build_adapter(model_key, device, args.cropvlm_checkpoint)
254
+ result["model_name"] = adapter.name
255
+ result["family"] = adapter.family
256
+ result["checkpoint"] = adapter.checkpoint
257
+ result["load_message"] = adapter.load_message
258
+ text_features = adapter.encode_text(prompts).to(device)
259
+
260
+ loader = DataLoader(
261
+ dataset,
262
+ batch_size=args.batch_size,
263
+ shuffle=False,
264
+ num_workers=args.num_workers,
265
+ collate_fn=pil_collate,
266
+ )
267
+
268
+ class_total = [0 for _ in dataset.classes]
269
+ class_correct = [0 for _ in dataset.classes]
270
+ confusion = [[0 for _ in dataset.classes] for _ in dataset.classes]
271
+ predictions: List[Dict[str, Any]] = []
272
+ correct = 0
273
+
274
+ for images, labels, paths in tqdm(loader, desc=model_key):
275
+ image_features = adapter.encode_images(images)
276
+ logits = image_features @ text_features.T
277
+ pred = logits.argmax(dim=-1).detach().cpu()
278
+ scores = logits.max(dim=-1).values.detach().cpu()
279
+ for true_idx, pred_idx, score, path in zip(labels.tolist(), pred.tolist(), scores.tolist(), paths):
280
+ class_total[true_idx] += 1
281
+ class_correct[true_idx] += int(true_idx == pred_idx)
282
+ confusion[true_idx][pred_idx] += 1
283
+ correct += int(true_idx == pred_idx)
284
+ if args.save_predictions:
285
+ predictions.append(
286
+ {
287
+ "path": path,
288
+ "true_class": dataset.classes[true_idx],
289
+ "pred_class": dataset.classes[pred_idx],
290
+ "correct": true_idx == pred_idx,
291
+ "score": float(score),
292
+ }
293
+ )
294
+
295
+ per_class = {}
296
+ for idx, class_name in enumerate(dataset.classes):
297
+ total = class_total[idx]
298
+ per_class[class_name] = {
299
+ "correct": class_correct[idx],
300
+ "total": total,
301
+ "accuracy": class_correct[idx] / total if total else None,
302
+ }
303
+
304
+ result.update(
305
+ {
306
+ "status": "ok",
307
+ "accuracy": correct / len(dataset) if len(dataset) else None,
308
+ "correct": correct,
309
+ "per_class": per_class,
310
+ "confusion_matrix": confusion,
311
+ "predictions": predictions if args.save_predictions else None,
312
+ }
313
+ )
314
+ result.update(per_class_stats(per_class))
315
+ except Exception as exc:
316
+ result.update(
317
+ {
318
+ "status": "failed",
319
+ "error_type": type(exc).__name__,
320
+ "error": str(exc),
321
+ "traceback": traceback.format_exc(),
322
+ }
323
+ )
324
+
325
+ result["elapsed_seconds"] = time.time() - started_at
326
+ return result
327
+
328
+
329
+ def write_json(path: Path, data: Dict[str, Any]) -> None:
330
+ path.parent.mkdir(parents=True, exist_ok=True)
331
+ with open(path, "w") as f:
332
+ json.dump(data, f, indent=2)
333
+
334
+
335
+ def main():
336
+ parser = argparse.ArgumentParser()
337
+ parser.add_argument("--dataset", required=True, help="ImageFolder-style dataset root.")
338
+ parser.add_argument("--output", default="outputs/zero_shot_results.json")
339
+ parser.add_argument("--cropvlm-checkpoint", default="models/CropCLIP_FullDataset_Acc_0.75.pth")
340
+ parser.add_argument("--models", nargs="+", default=DEFAULT_MODELS)
341
+ parser.add_argument("--device", default=None)
342
+ parser.add_argument("--batch-size", type=int, default=64)
343
+ parser.add_argument("--num-workers", type=int, default=2)
344
+ parser.add_argument("--prompt-template", default="{}")
345
+ parser.add_argument("--save-predictions", action="store_true")
346
+ args = parser.parse_args()
347
+
348
+ excluded = {"tulip", "eva_clip", "eva_clip_official", "longclip"}
349
+ requested = [model for model in args.models if model not in excluded]
350
+ skipped = [model for model in args.models if model in excluded]
351
+
352
+ dataset = ImageFolderPaths(args.dataset)
353
+ results = [evaluate_model(args, dataset, model_key) for model_key in requested]
354
+ ok = [result for result in results if result.get("status") == "ok"]
355
+ failed = [result for result in results if result.get("status") != "ok"]
356
+ summary = {
357
+ "created_at": datetime.now(timezone.utc).isoformat(),
358
+ "dataset": str(dataset.root),
359
+ "num_images": len(dataset),
360
+ "num_classes": len(dataset.classes),
361
+ "classes": dataset.classes,
362
+ "requested_models": args.models,
363
+ "evaluated_models": requested,
364
+ "skipped_models": skipped,
365
+ "num_models": len(results),
366
+ "num_successful": len(ok),
367
+ "num_failed": len(failed),
368
+ "models": {
369
+ result["model_key"]: {
370
+ "status": result.get("status"),
371
+ "accuracy": result.get("accuracy"),
372
+ "correct": result.get("correct"),
373
+ "num_images": result.get("num_images"),
374
+ "per_class_accuracy_mean": result.get("per_class_accuracy_mean"),
375
+ "per_class_accuracy_std": result.get("per_class_accuracy_std"),
376
+ "per_class_accuracy_std_population": result.get("per_class_accuracy_std_population"),
377
+ "num_classes_with_accuracy": result.get("num_classes_with_accuracy"),
378
+ "elapsed_seconds": result.get("elapsed_seconds"),
379
+ "error": result.get("error"),
380
+ }
381
+ for result in results
382
+ },
383
+ "model_results": {result["model_key"]: result for result in results},
384
+ "results": results,
385
+ }
386
+ write_json(Path(args.output), summary)
387
+ print(Path(args.output))
388
+
389
+
390
+ if __name__ == "__main__":
391
+ main()
scripts/gradio_demo.py ADDED
@@ -0,0 +1,120 @@
1
+ from __future__ import annotations
2
+
3
+ import argparse
4
+ import sys
5
+ from threading import Lock
6
+ from pathlib import Path
7
+
8
+ from PIL import Image
9
+
10
+ sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
11
+
12
+ from cropvlm import CROP_CLASSES, load_cropvlm
13
+ from cropvlm.model import parse_class_names
14
+
15
+
16
+ DEFAULT_CLASSES_TEXT = "\n".join(CROP_CLASSES)
17
+
18
+
19
+ def build_demo(checkpoint: str, device: str | None, prompt_template: str, top_k: int) -> gr.Blocks:
20
+ import gradio as gr
21
+
22
+ classifier = load_cropvlm(
23
+ checkpoint=checkpoint,
24
+ class_names=CROP_CLASSES,
25
+ device=device,
26
+ prompt_template=prompt_template,
27
+ )
28
+ classifier_lock = Lock()
29
+ current_classes = tuple(CROP_CLASSES)
30
+
31
+ def classify(image: Image.Image, classes_text: str, top_k_value: int):
32
+ if image is None:
33
+ return {}, ""
34
+ nonlocal current_classes
35
+ requested_classes = tuple(parse_class_names(classes_text))
36
+ if not requested_classes:
37
+ return {}, ""
38
+ with classifier_lock:
39
+ if requested_classes != current_classes:
40
+ classifier.set_classes(requested_classes)
41
+ current_classes = requested_classes
42
+ predictions = classifier.predict_with_scores(image, top_k=int(top_k_value))
43
+
44
+ label_scores = {label: probability for label, probability, _ in predictions}
45
+ score_text = "\n".join(
46
+ f"{rank}. {label}: probability={probability:.6f}, cosine={cosine:.6f}"
47
+ for rank, (label, probability, cosine) in enumerate(predictions, start=1)
48
+ )
49
+ return label_scores, score_text
50
+
51
+ examples_dir = Path(__file__).resolve().parents[1] / "examples"
52
+ example_paths = [
53
+ str(examples_dir / name)
54
+ for name in ["cacao.png", "olive.png", "cauliflower.png", "sugarcane.png", "sunflower.png"]
55
+ if (examples_dir / name).exists()
56
+ ]
57
+
58
+ with gr.Blocks(title="CropVLM Zero-Shot Demo") as demo:
59
+ gr.Markdown("# CropVLM Zero-Shot Image Classification")
60
+ with gr.Row():
61
+ with gr.Column():
62
+ image = gr.Image(type="pil", label="Image")
63
+ classes = gr.Textbox(
64
+ value=DEFAULT_CLASSES_TEXT,
65
+ lines=12,
66
+ label="Class names",
67
+ )
68
+ top_k_slider = gr.Slider(
69
+ minimum=1,
70
+ maximum=10,
71
+ value=top_k,
72
+ step=1,
73
+ label="Top-k",
74
+ )
75
+ button = gr.Button("Classify", variant="primary")
76
+ with gr.Column():
77
+ label = gr.Label(num_top_classes=top_k, label="Predictions")
78
+ score_text = gr.Textbox(
79
+ label="Scores",
80
+ lines=8,
81
+ interactive=False,
82
+ )
83
+
84
+ outputs = [label, score_text]
85
+ button.click(classify, inputs=[image, classes, top_k_slider], outputs=outputs)
86
+ classes.change(lambda: ({}, ""), outputs=outputs)
87
+
88
+ if example_paths:
89
+ gr.Examples(
90
+ examples=[[path, DEFAULT_CLASSES_TEXT, top_k] for path in example_paths],
91
+ inputs=[image, classes, top_k_slider],
92
+ outputs=outputs,
93
+ fn=classify,
94
+ cache_examples=False,
95
+ )
96
+
97
+ return demo
98
+
99
+
100
+ def main():
101
+ parser = argparse.ArgumentParser()
102
+ parser.add_argument("--checkpoint", default="models/CropCLIP_FullDataset_Acc_0.75.pth")
103
+ parser.add_argument("--device", default=None)
104
+ parser.add_argument("--prompt-template", default="{}")
105
+ parser.add_argument("--top-k", type=int, default=5)
106
+ parser.add_argument("--server-name", default="127.0.0.1")
107
+ parser.add_argument("--server-port", type=int, default=7860)
108
+ args = parser.parse_args()
109
+
110
+ demo = build_demo(
111
+ checkpoint=args.checkpoint,
112
+ device=args.device,
113
+ prompt_template=args.prompt_template,
114
+ top_k=args.top_k,
115
+ )
116
+ demo.launch(server_name=args.server_name, server_port=args.server_port)
117
+
118
+
119
+ if __name__ == "__main__":
120
+ main()