# Stomata Keypoint Detection: Finetuned Model Checkpoints
This repository contains the finetuned model checkpoints used in our CVPR 2026 AgriVision Workshop paper:
*Towards Morphology Aware Stomata Keypoint Detection: Benchmarking Foundation Models Under Distribution Shift*
- Paper: Coming soon
- Dataset: stomata-keypoint-benchmark-cvpr-agrivision-2026
- Code: Coming soon on GitHub
All models were finetuned on KP-Train (344 field-collected maize images, 12,503 stomata) and evaluated across nine test splits covering shifts in location, environment, taxonomy, species, and sensor.
## Models
### YOLO26X-Pose

Single-stage keypoint detector from the Ultralytics YOLO26 family. Finetuned end-to-end to detect stomata and predict four keypoints per instance.

| File |
|---|
| best.pt |

Trained on 2× NVIDIA A100 (80 GB) for up to 400 epochs with AdamW.
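As a minimal sketch of loading this checkpoint with the Ultralytics API (the weights filename matches the table above; the image path is illustrative):

```python
def run_yolo_pose(weights_path="best.pt", image_path="leaf_image.jpg"):
    """Run stomata keypoint inference with the finetuned YOLO26X-Pose weights.

    Requires the `ultralytics` package; paths here are illustrative.
    """
    from ultralytics import YOLO  # deferred so the sketch imports without the package

    model = YOLO(weights_path)
    results = model.predict(image_path)
    # Each result carries a Keypoints object; .xy has shape
    # (num_instances, 4, 2) for the four per-stoma keypoints.
    return results[0].keypoints.xy
```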
### Grounding DINO – Swin-B (GDINO-SB)

Open-vocabulary bounding-box detector. Finetuned with the text prompt "a stomata ." using bounding-box supervision only. Initialized from the checkpoint pretrained on O365, GoldG, and Cap4M.

| File |
|---|
| model.safetensors |
| config.json, preprocessor_config.json, tokenizer.json, tokenizer_config.json, special_tokens_map.json, training_args.bin |

Trained on a single NVIDIA A100 (80 GB) for 120 epochs with gradient accumulation of 4 and cosine scheduling.
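A hedged sketch of box-only inference via HuggingFace Transformers (the checkpoint directory and image path are illustrative; threshold values are not from the paper, and the post-processing keyword names vary slightly across `transformers` versions):

```python
def detect_stomata(model_dir="path/to/gdino-sb", image_path="leaf_image.jpg"):
    """Stomata bounding-box detection with the finetuned Grounding DINO.

    Requires `torch`, `transformers`, and `Pillow`; paths are illustrative.
    """
    import torch
    from PIL import Image
    from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_dir)
    model = AutoModelForZeroShotObjectDetection.from_pretrained(model_dir)

    image = Image.open(image_path)
    # The text prompt should match the one used during finetuning.
    inputs = processor(images=image, text="a stomata .", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return processor.post_process_grounded_object_detection(
        outputs,
        inputs.input_ids,
        box_threshold=0.35,   # illustrative value, not from the paper
        text_threshold=0.25,  # illustrative value, not from the paper
        target_sizes=[image.size[::-1]],
    )
```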
### Keypoint R-CNN – ResNeXt-101 (KP-RCNN-X101)

Two-stage keypoint baseline from Detectron2. Uses a ResNeXt-101-32×8d backbone with FPN, initialized from COCO person-keypoint pretraining. The keypoint head produces four per-instance heatmaps.

| File |
|---|
| model_final.pth |

Trained on 2× NVIDIA A100 (80 GB) for 80k iterations with step LR decay.
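A sketch of building a predictor on this checkpoint, assuming the base config from the Detectron2 model zoo linked below; the single-class setting and weights path are assumptions for illustration:

```python
def build_kprcnn_predictor(weights_path="model_final.pth", num_keypoints=4):
    """Build a Detectron2 predictor from the finetuned KP-RCNN-X101 weights.

    Requires `detectron2`; the config name matches the repo's model-zoo link.
    """
    from detectron2 import model_zoo
    from detectron2.config import get_cfg
    from detectron2.engine import DefaultPredictor

    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file(
        "COCO-Keypoints/keypoint_rcnn_X_101_32x8d_FPN_3x.yaml"))
    cfg.MODEL.WEIGHTS = weights_path
    cfg.MODEL.ROI_KEYPOINT_HEAD.NUM_KEYPOINTS = num_keypoints  # 4 stomata keypoints
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1  # single "stomata" class (assumption)
    return DefaultPredictor(cfg)
```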
### ViTPose++ Huge – 4 Keypoints

Top-down keypoint localizer with a ViT-H backbone, evaluated under ground-truth box conditioning to isolate landmark regression from detector error. Predicts all four keypoints (two polar tips and two lateral endpoints).

| File |
|---|
| model.safetensors |
| config.json |

Trained on 2× NVIDIA H100 (80 GB) with tiered learning rates, cosine annealing, and a 5-epoch warmup.
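A hedged sketch of top-down inference with Transformers' ViTPose support (the checkpoint directory and paths are illustrative; the box format follows the ground-truth conditioning described above):

```python
def localize_keypoints(model_dir="path/to/vitpose-4kp",
                       image_path="leaf_image.jpg", gt_boxes=None):
    """Top-down keypoint inference conditioned on ground-truth stomata boxes.

    Requires `torch`, `transformers`, and `Pillow`; paths are illustrative.
    `gt_boxes` is a list of COCO-format (x, y, w, h) boxes for one image.
    """
    import torch
    from PIL import Image
    from transformers import AutoProcessor, VitPoseForPoseEstimation

    processor = AutoProcessor.from_pretrained(model_dir)
    model = VitPoseForPoseEstimation.from_pretrained(model_dir)

    image = Image.open(image_path)
    inputs = processor(image, boxes=[gt_boxes], return_tensors="pt")
    with torch.no_grad():
        # ViTPose++ (MoE) checkpoints may additionally expect a dataset_index.
        outputs = model(**inputs)
    return processor.post_process_pose_estimation(outputs, boxes=[gt_boxes])
```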
### ViTPose++ Huge – 2 Keypoints

Same architecture as above, but predicting only the two polar tips (length axis). This variant tests whether width-endpoint prediction is a dominant failure mode compared to length-only localization.

| File |
|---|
| model.safetensors |
| config.json |
### SAM 3

Segment Anything Model 3 from Meta, finetuned in detection-only mode (no mask supervision) to predict stomata bounding boxes conditioned on the text prompt "stomata".

| File |
|---|
| checkpoint.pt |

Trained on 2× NVIDIA H100 (80 GB) for up to 35 epochs.
## Annotation Format
All models were trained on stomata annotated with four COCO-format keypoints:
- p0, p1 – polar tips along the stomatal length axis
- p2, p3 – lateral endpoints along the stomatal width axis
Stomatal length and width are computed as Euclidean distances between the respective keypoint pairs.
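For concreteness, a minimal computation of length and width from a flat COCO keypoint list of (x, y, visibility) triplets; the coordinates below are made up for illustration:

```python
import math

# Illustrative COCO-style keypoint triplets (x, y, visibility) for one stoma.
keypoints = [
    0.0, 0.0, 2,   # p0: polar tip
    3.0, 4.0, 2,   # p1: polar tip
    1.0, 1.0, 2,   # p2: lateral endpoint
    1.0, 6.0, 2,   # p3: lateral endpoint
]

def stomatal_length_width(kps):
    """Return (length, width) from a flat 12-value COCO keypoint list."""
    pts = [(kps[i], kps[i + 1]) for i in range(0, 12, 3)]
    length = math.dist(pts[0], pts[1])  # p0-p1, length axis
    width = math.dist(pts[2], pts[3])   # p2-p3, width axis
    return length, width

length, width = stomatal_length_width(keypoints)
# length = 5.0 (p0 to p1), width = 5.0 (p2 to p3)
```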
## Usage Notes
These checkpoints are provided for reproducibility and downstream research. Each model directory contains everything needed to load the finetuned weights in its native framework:
- YOLO26X-Pose – Ultralytics - https://docs.ultralytics.com/models/yolo26/
- GDINO-SB – HuggingFace Transformers - https://huggingface.co/IDEA-Research/grounding-dino-base
- KP-RCNN-X101 – Detectron2 - https://github.com/facebookresearch/detectron2/blob/main/configs/COCO-Keypoints/keypoint_rcnn_X_101_32x8d_FPN_3x.yaml
- ViTPose++ Huge – HuggingFace Transformers - https://huggingface.co/usyd-community/vitpose-plus-huge
- SAM 3 – base weights sourced from HuggingFace, with training initialized via the SAM 3 GitHub repository - https://github.com/facebookresearch/sam3/blob/main/sam3/train/configs/odinw13/odinw_text_only_train.yaml
Inference code and evaluation scripts will be released alongside the paper on GitHub.
## License
| Component | License |
|---|---|
| Finetuned model weights | CC BY-NC 4.0 |
These weights are derived from publicly available pretrained models. Please also follow the original licensing terms of each base model when using these checkpoints.
## Citation

```bibtex
@inproceedings{gummi2026stomata,
  author    = {Gummi, S. R. and Pack, C. and Zhang, H. K. and Solanki, S. and Chang, Y.},
  title     = {Towards Morphology Aware Stomata Keypoint Detection: Benchmarking Foundation Models Under Distribution Shift},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2026},
  note      = {Accepted}
}
```
## Contact
Sainath Reddy Gummi, South Dakota State University. Email: gummisainath@gmail.com