# Stomata Keypoint Detection: Finetuned Model Checkpoints
This repository contains the finetuned model checkpoints used in our CVPR 2026 AgriVision Workshop paper:
*Towards Morphology Aware Stomata Keypoint Detection: Benchmarking Foundation Models Under Distribution Shift*
- Paper: Coming soon
- Dataset: stomata-keypoint-benchmark-cvpr-agrivision-2026
- Code: Coming soon on GitHub
All models were finetuned on KP-Train (344 field-collected maize images, 12,503 stomata) and evaluated across nine test splits covering shifts in location, environment, taxonomy, species, and sensor.
## Models
### YOLO26X-Pose

Single-stage keypoint detector from the Ultralytics YOLO26 family. Finetuned end-to-end to detect stomata and predict four keypoints per instance.

| File |
|---|
| best.pt |

Trained on 2× NVIDIA A100 (80 GB) for up to 400 epochs with AdamW.
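As a minimal sketch of loading this checkpoint with the Ultralytics API (the weights filename matches the table above; the image path is illustrative):

```python
def run_yolo_pose(weights_path="best.pt", image_path="leaf_image.jpg"):
    """Run stomata keypoint inference with the finetuned YOLO26X-Pose weights.

    Requires the `ultralytics` package; paths here are illustrative.
    """
    from ultralytics import YOLO  # deferred so the sketch imports without the package

    model = YOLO(weights_path)
    results = model.predict(image_path)
    # Each result carries a Keypoints object; .xy has shape
    # (num_instances, 4, 2) for the four per-stoma keypoints.
    return results[0].keypoints.xy
```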
### Grounding DINO – Swin-B (GDINO-SB)

Open-vocabulary bounding-box detector. Finetuned with the text prompt "a stomata ." using bounding-box supervision only. Initialized from the checkpoint pretrained on O365, GoldG, and Cap4M.

| File |
|---|
| model.safetensors |
| config.json, preprocessor_config.json, tokenizer.json, tokenizer_config.json, special_tokens_map.json, training_args.bin |

Trained on a single NVIDIA A100 (80 GB) for 120 epochs with gradient accumulation of 4 and cosine scheduling.
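A hedged sketch of box-only inference via HuggingFace Transformers (the checkpoint directory and image path are illustrative; threshold values are not from the paper, and the post-processing keyword names vary slightly across `transformers` versions):

```python
def detect_stomata(model_dir="path/to/gdino-sb", image_path="leaf_image.jpg"):
    """Stomata bounding-box detection with the finetuned Grounding DINO.

    Requires `torch`, `transformers`, and `Pillow`; paths are illustrative.
    """
    import torch
    from PIL import Image
    from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_dir)
    model = AutoModelForZeroShotObjectDetection.from_pretrained(model_dir)

    image = Image.open(image_path)
    # The text prompt should match the one used during finetuning.
    inputs = processor(images=image, text="a stomata .", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return processor.post_process_grounded_object_detection(
        outputs,
        inputs.input_ids,
        box_threshold=0.35,   # illustrative value, not from the paper
        text_threshold=0.25,  # illustrative value, not from the paper
        target_sizes=[image.size[::-1]],
    )
```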
### Keypoint R-CNN – ResNeXt-101 (KP-RCNN-X101)

Two-stage keypoint baseline from Detectron2. Uses a ResNeXt-101-32×8d backbone with FPN, initialized from COCO person-keypoint pretraining. The keypoint head produces four per-instance heatmaps.

| File |
|---|
| model_final.pth |

Trained on 2× NVIDIA A100 (80 GB) for 80k iterations with step LR decay.
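A sketch of building a predictor on this checkpoint, assuming the base config from the Detectron2 model zoo linked below; the single-class setting and weights path are assumptions for illustration:

```python
def build_kprcnn_predictor(weights_path="model_final.pth", num_keypoints=4):
    """Build a Detectron2 predictor from the finetuned KP-RCNN-X101 weights.

    Requires `detectron2`; the config name matches the repo's model-zoo link.
    """
    from detectron2 import model_zoo
    from detectron2.config import get_cfg
    from detectron2.engine import DefaultPredictor

    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file(
        "COCO-Keypoints/keypoint_rcnn_X_101_32x8d_FPN_3x.yaml"))
    cfg.MODEL.WEIGHTS = weights_path
    cfg.MODEL.ROI_KEYPOINT_HEAD.NUM_KEYPOINTS = num_keypoints  # 4 stomata keypoints
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1  # single "stomata" class (assumption)
    return DefaultPredictor(cfg)
```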
### ViTPose++ Huge – 4 Keypoints

Top-down keypoint localizer with a ViT-H backbone, evaluated under ground-truth box conditioning to isolate landmark regression from detector error. Predicts all four keypoints (two polar tips and two lateral endpoints).

| File |
|---|
| model.safetensors |
| config.json |

Trained on 2× NVIDIA H100 (80 GB) with tiered learning rates, cosine annealing, and a 5-epoch warmup.
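A hedged sketch of top-down inference with Transformers' ViTPose support (the checkpoint directory and paths are illustrative; the box format follows the ground-truth conditioning described above):

```python
def localize_keypoints(model_dir="path/to/vitpose-4kp",
                       image_path="leaf_image.jpg", gt_boxes=None):
    """Top-down keypoint inference conditioned on ground-truth stomata boxes.

    Requires `torch`, `transformers`, and `Pillow`; paths are illustrative.
    `gt_boxes` is a list of COCO-format (x, y, w, h) boxes for one image.
    """
    import torch
    from PIL import Image
    from transformers import AutoProcessor, VitPoseForPoseEstimation

    processor = AutoProcessor.from_pretrained(model_dir)
    model = VitPoseForPoseEstimation.from_pretrained(model_dir)

    image = Image.open(image_path)
    inputs = processor(image, boxes=[gt_boxes], return_tensors="pt")
    with torch.no_grad():
        # ViTPose++ (MoE) checkpoints may additionally expect a dataset_index.
        outputs = model(**inputs)
    return processor.post_process_pose_estimation(outputs, boxes=[gt_boxes])
```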
### ViTPose++ Huge – 2 Keypoints

Same architecture as above, but predicting only the two polar tips (length axis). This variant tests whether width-endpoint prediction is a dominant failure mode compared to length-only localization.

| File |
|---|
| model.safetensors |
| config.json |
### SAM 3

Segment Anything Model 3 from Meta, finetuned in detection-only mode (no mask supervision) to predict stomata bounding boxes conditioned on the text prompt "stomata".

| File |
|---|
| checkpoint.pt |

Trained on 2× NVIDIA H100 (80 GB) for up to 35 epochs.
## Annotation Format
All models were trained on stomata annotated with four COCO-format keypoints:
- p0, p1 – polar tips along the stomatal length axis
- p2, p3 – lateral endpoints along the stomatal width axis
Stomatal length and width are computed as Euclidean distances between the respective keypoint pairs.
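For concreteness, a minimal computation of length and width from a flat COCO keypoint list of (x, y, visibility) triplets; the coordinates below are made up for illustration:

```python
import math

# Illustrative COCO-style keypoint triplets (x, y, visibility) for one stoma.
keypoints = [
    0.0, 0.0, 2,   # p0: polar tip
    3.0, 4.0, 2,   # p1: polar tip
    1.0, 1.0, 2,   # p2: lateral endpoint
    1.0, 6.0, 2,   # p3: lateral endpoint
]

def stomatal_length_width(kps):
    """Return (length, width) from a flat 12-value COCO keypoint list."""
    pts = [(kps[i], kps[i + 1]) for i in range(0, 12, 3)]
    length = math.dist(pts[0], pts[1])  # p0-p1, length axis
    width = math.dist(pts[2], pts[3])   # p2-p3, width axis
    return length, width

length, width = stomatal_length_width(keypoints)
# length = 5.0 (p0 to p1), width = 5.0 (p2 to p3)
```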
## Usage Notes
These checkpoints are provided for reproducibility and downstream research. Each model directory contains everything needed to load the finetuned weights in its native framework:
- YOLO26X-Pose – Ultralytics - https://docs.ultralytics.com/models/yolo26/
- GDINO-SB – HuggingFace Transformers - https://huggingface.co/IDEA-Research/grounding-dino-base
- KP-RCNN-X101 – Detectron2 - https://github.com/facebookresearch/detectron2/blob/main/configs/COCO-Keypoints/keypoint_rcnn_X_101_32x8d_FPN_3x.yaml
- ViTPose++ Huge – HuggingFace Transformers - https://huggingface.co/usyd-community/vitpose-plus-huge
- SAM 3 – base weights sourced from HuggingFace, with training initialized via the SAM 3 GitHub repository - https://github.com/facebookresearch/sam3/blob/main/sam3/train/configs/odinw13/odinw_text_only_train.yaml
Inference code and evaluation scripts will be released alongside the paper on GitHub.
## License
| Component | License |
|---|---|
| Finetuned model weights | CC BY-NC 4.0 |
These weights are derived from publicly available pretrained models. Please also follow the original licensing terms of each base model when using these checkpoints.
## Citation

```bibtex
@inproceedings{gummi2026stomata,
  author    = {Gummi, S. R. and Pack, C. and Zhang, H. K. and Solanki, S. and Chang, Y.},
  title     = {Towards Morphology Aware Stomata Keypoint Detection: Benchmarking Foundation Models Under Distribution Shift},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2026},
  note      = {Accepted}
}
```
## Contact
Sainath Reddy Gummi, South Dakota State University. Email: gummisainath@gmail.com