Stomata Keypoint Detection: Finetuned Model Checkpoints

This repository contains the finetuned model checkpoints used in our CVPR 2026 AgriVision Workshop paper:

Towards Morphology Aware Stomata Keypoint Detection: Benchmarking Foundation Models Under Distribution Shift

All models were finetuned on KP-Train (344 field-collected maize images, 12,503 stomata) and evaluated across nine test splits covering location, environment, taxonomic, species, and sensor shift.


Models

YOLO26X-Pose

Single-stage keypoint detector from the Ultralytics YOLO26 family. Finetuned end-to-end to detect stomata and predict four keypoints per instance.
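The README does not say how the stomata labels were converted for Ultralytics training. Assuming the standard Ultralytics pose label format, each object is one line: class index, normalized box center and size, then x, y, and visibility for each keypoint (with `kpt_shape: [4, 3]` in the dataset YAML). Under that assumption, a COCO-style stomata annotation could be turned into a label line as sketched below. The helper `coco_to_yolo_pose` is hypothetical.

```python
# Hypothetical helper; assumes the Ultralytics pose label layout:
# "class cx cy w h x0 y0 v0 ... x3 y3 v3", coordinates normalized to [0, 1].

def coco_to_yolo_pose(bbox, keypoints, img_w, img_h, class_id=0):
    """bbox: COCO [x, y, w, h] in pixels.
    keypoints: flat COCO list [x0, y0, v0, ..., x3, y3, v3] in pixels."""
    x, y, w, h = bbox
    fields = [str(class_id)]
    # Box: COCO top-left corner + size -> normalized center + size
    for val in ((x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h):
        fields.append(f"{val:.6f}")
    # Keypoints: normalize x and y, keep the visibility flag as an integer
    for i in range(0, len(keypoints), 3):
        kx, ky, v = keypoints[i:i + 3]
        fields += [f"{kx / img_w:.6f}", f"{ky / img_h:.6f}", str(int(v))]
    return " ".join(fields)
```

For a 1000×500 image, `coco_to_yolo_pose([100, 50, 40, 20], [100, 60, 2, 140, 60, 2, 120, 50, 2, 120, 70, 2], 1000, 500)` yields a 17-field line beginning `0 0.120000 0.120000 0.040000 0.040000`.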

File
best.pt

Trained on 2× NVIDIA A100 (80 GB) for up to 400 epochs with AdamW.
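AdamW differs from Adam with L2 regularization in how it handles weight decay: it shrinks the parameters directly instead of adding the decay term to the gradient. Below is a minimal sketch of one update for a scalar parameter. The default hyperparameters are copied from `torch.optim.AdamW`; the values actually used to finetune YOLO26X-Pose are not given here.

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter; t is the 1-based step count.
    Returns the updated (theta, m, v)."""
    b1, b2 = betas
    m = b1 * m + (1 - b1) * grad          # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - b1 ** t)             # bias corrections
    v_hat = v / (1 - b2 ** t)
    # Decoupled weight decay: shrink theta directly, not via the gradient
    theta = theta - lr * weight_decay * theta
    # Adaptive gradient step
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v
```

With a zero gradient, a step only applies the decay (`1.0` becomes `0.99999`). With a unit gradient on the first step, the parameter moves by roughly `lr`.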

Grounding DINO, Swin-B (GDINO-SB)

Open-vocabulary bounding-box detector. Finetuned with the text prompt "a stomata ." using bounding-box supervision only. Initialized from the checkpoint pretrained on O365, GoldG, and Cap4M.
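With bounding-box supervision only, each stomata annotation reduces to its box. DETR-style detectors such as Grounding DINO predict boxes as normalized (center x, center y, width, height), so a COCO `[x, y, w, h]` pixel box would typically be converted before the box losses are computed. Below is a minimal sketch with hypothetical helper names.

```python
# Hypothetical helpers for converting between COCO pixel boxes and the
# normalized (cx, cy, w, h) boxes used by DETR-style detectors.

def coco_box_to_cxcywh(bbox, img_w, img_h):
    """COCO [x, y, w, h] in pixels -> normalized (cx, cy, w, h)."""
    x, y, w, h = bbox
    return ((x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h)

def cxcywh_to_xyxy(box, img_w, img_h):
    """Normalized (cx, cy, w, h) -> (x0, y0, x1, y1) in pixels."""
    cx, cy, w, h = box
    return ((cx - w / 2) * img_w, (cy - h / 2) * img_h,
            (cx + w / 2) * img_w, (cy + h / 2) * img_h)
```

Round-tripping `[100, 50, 40, 20]` on a 1000×500 image gives `(0.12, 0.12, 0.04, 0.04)` and back to corners near `(100, 50, 140, 70)`, up to floating-point rounding.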

File
model.safetensors
config.json, preprocessor_config.json, tokenizer.json, tokenizer_config.json, special_tokens_map.json, training_args.bin

Trained on a single NVIDIA A100 (80 GB) for 120 epochs with gradient accumulation of 4 and cosine scheduling.
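With gradient accumulation of 4, the gradients from four consecutive micro-batches are summed, each scaled by 1/4, before a single optimizer step. The effective batch on the single GPU is therefore four times the per-device batch, which is not stated here. The toy sketch below uses a one-parameter least-squares model to show that the accumulated gradient matches the full-batch gradient when the micro-batches are equal-sized.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to the scalar w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, accum_steps=4):
    """Split the batch into accum_steps equal micro-batches and sum their
    gradients scaled by 1/accum_steps, as done before one optimizer step."""
    assert len(xs) % accum_steps == 0, "toy example assumes equal micro-batches"
    n = len(xs) // accum_steps
    total = 0.0
    for k in range(accum_steps):
        mx, my = xs[k * n:(k + 1) * n], ys[k * n:(k + 1) * n]
        total += grad_mse(w, mx, my) / accum_steps
    return total
```

For example, with `xs = [1.0, ..., 8.0]` and `ys = [2 * x + 1 for x in xs]`, `accumulated_grad(0.5, xs, ys)` equals `grad_mse(0.5, xs, ys)` up to rounding.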

Keypoint R-CNN, ResNeXt-101 (KP-RCNN-X101)

Two-stage keypoint baseline from Detectron2. Uses a ResNeXt-101-32×8d backbone with FPN, initialized from COCO person-keypoint pretraining. The keypoint head produces four per-instance heatmaps.
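Each keypoint heatmap covers the detected box (56×56 cells by default in the standard Detectron2 keypoint head), and the keypoint location is read off the heatmap peak. The sketch below shows a simplified version of that decoding step for one instance. Detectron2's own `heatmaps_to_keypoints` also resizes each heatmap to the box size before taking the argmax, so this is only an approximation. `decode_heatmaps` is a hypothetical name.

```python
def decode_heatmaps(heatmaps, box):
    """heatmaps: list of K 2D lists (H rows x W cols) for one detected instance.
    box: (x0, y0, x1, y1) in image pixels.
    Returns K (x, y, score) tuples in image coordinates."""
    x0, y0, x1, y1 = box
    results = []
    for hm in heatmaps:
        h, w = len(hm), len(hm[0])
        # Peak cell of this keypoint's heatmap
        score, row, col = max(
            (val, r, c) for r, line in enumerate(hm) for c, val in enumerate(line)
        )
        # Map the cell center back into the box, then into the image
        x = x0 + (col + 0.5) / w * (x1 - x0)
        y = y0 + (row + 0.5) / h * (y1 - y0)
        results.append((x, y, score))
    return results
```

For a 4×4 heatmap peaking at row 1, column 2 inside the box `(0, 0, 40, 40)`, the decoded keypoint is `(25.0, 15.0)`.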

File
model_final.pth

Trained on 2× NVIDIA A100 (80 GB) for 80k iterations with step LR decay.
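Step decay multiplies the learning rate by a fixed factor at chosen iterations, which is what Detectron2's `SOLVER.STEPS` and `SOLVER.GAMMA` settings control. In the sketch below, the milestones and base rate are hypothetical, since the README gives only the 80k-iteration budget. The sketch also omits the linear warmup that Detectron2 applies by default.

```python
def step_lr(iteration, base_lr, steps, gamma=0.1):
    """Multiply base_lr by gamma once for every milestone already reached."""
    return base_lr * gamma ** sum(1 for s in steps if iteration >= s)

# Hypothetical schedule for an 80k-iteration run: decay at 60k and 70k.
MILESTONES = (60000, 70000)
```

With `base_lr=0.02`, the rate stays at 0.02 until iteration 60k, drops to about 0.002, and then to about 0.0002 after 70k.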

ViTPose++ Huge (4 Keypoints)

Top-down keypoint localizer with a ViT-H backbone. It is evaluated with ground-truth boxes as input, which isolates landmark regression from detector error. It predicts all four keypoints (two polar tips and two lateral endpoints).
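In a top-down setup, each box (here the ground-truth box) is enlarged, reshaped to the model's input aspect ratio, and cropped. Heatmap peaks are then mapped back through the same crop. The sketch below assumes ViTPose-style defaults that the README does not confirm for these checkpoints: a 256×192 input, a 64×48 heatmap, and 1.25× box padding. The function names are hypothetical.

```python
# Assumed ViTPose-style defaults (not confirmed for these checkpoints):
# input 256 (h) x 192 (w), heatmap 64 x 48, box padding 1.25.

def box_to_crop(box, aspect=192 / 256, padding=1.25):
    """Turn an (x0, y0, x1, y1) box into a (cx, cy, w, h) crop whose
    w/h equals the model input aspect ratio, enlarged by `padding`."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    w, h = x1 - x0, y1 - y0
    # Grow the shorter side so the crop matches the input aspect ratio
    if w > aspect * h:
        h = w / aspect
    else:
        w = h * aspect
    return cx, cy, w * padding, h * padding

def heatmap_to_image(col, row, crop, hm_w=48, hm_h=64):
    """Map a (possibly sub-pixel) heatmap location back to image pixels."""
    cx, cy, w, h = crop
    x = cx - w / 2 + (col + 0.5) / hm_w * w
    y = cy - h / 2 + (row + 0.5) / hm_h * h
    return x, y
```

The center of the heatmap, `(23.5, 31.5)`, maps back to the center of the original box. Any box, tall or wide, yields a crop with the 3:4 input aspect ratio.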

File
model.safetensors
config.json

Trained on 2× NVIDIA H100 (80 GB) with tiered learning rates, cosine annealing, and 5-epoch warmup.
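Tiered learning rates give different parameter groups their own base rates, for example the pretrained backbone and the newly initialized keypoint head. The schedule then scales every tier the same way. The sketch below shows a 5-epoch linear warmup followed by cosine annealing. The tier names, base rates, and epoch count are hypothetical, since the README does not list them.

```python
import math

def lr_at(epoch, base_lr, total_epochs, warmup_epochs=5, min_lr=0.0):
    """Linear warmup for warmup_epochs, then cosine annealing to min_lr."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Hypothetical tiers: a smaller rate for the pretrained backbone,
# a larger one for the freshly initialized head.
TIERS = {"backbone": 1e-5, "head": 1e-4}

def tiered_lrs(epoch, total_epochs):
    """Learning rate for every tier at a given epoch."""
    return {name: lr_at(epoch, lr, total_epochs) for name, lr in TIERS.items()}
```

Because every tier follows the same schedule, the backbone-to-head ratio stays fixed (0.1 here) throughout training.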

ViTPose++ Huge (2 Keypoints)

Same architecture as above, but predicting only the two polar tips (length axis). This variant tests whether width endpoint prediction is a dominant failure mode compared to length-only localization.
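Suppose, hypothetically, that the targets for this variant are derived from the 4-keypoint COCO annotations. One way to build them is to keep only the first two (x, y, v) triplets and recount the labeled keypoints. The category's keypoint name list would then shrink to p0 and p1. `to_length_only` is a hypothetical helper.

```python
def to_length_only(ann):
    """Return a copy of a 4-keypoint COCO annotation keeping only p0 and p1."""
    kps = ann["keypoints"][:6]  # (x, y, v) triplets for p0 and p1
    out = dict(ann)             # shallow copy; the original stays unchanged
    out["keypoints"] = kps
    # COCO's num_keypoints counts keypoints with visibility > 0
    out["num_keypoints"] = sum(1 for v in kps[2::3] if v > 0)
    return out
```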

File
model.safetensors
config.json

SAM 3

Segment Anything Model 3 from Meta, finetuned in detection-only mode (no mask supervision) to predict stomata bounding boxes conditioned on the text prompt "stomata".

File
checkpoint.pt

Trained on 2× NVIDIA H100 (80 GB) for up to 35 epochs.


Annotation Format

All models were trained on the same stomata annotations, each labeled with four COCO-format keypoints:

  • p0, p1: polar tips along the stomatal length axis
  • p2, p3: lateral endpoints along the stomatal width axis

Stomatal length and width are computed as Euclidean distances between the respective keypoint pairs.
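Given one COCO keypoint annotation, length and width follow directly from the two keypoint pairs. Below is a minimal sketch. It ignores visibility flags, and the optional micrometres-per-pixel factor is a hypothetical calibration input that the README does not specify.

```python
import math

def stomatal_length_width(keypoints, um_per_px=None):
    """keypoints: flat COCO list [x0, y0, v0, x1, y1, v1, x2, y2, v2, x3, y3, v3].
    Returns (length, width) in pixels, or in micrometres if um_per_px is given."""
    pts = [(keypoints[i], keypoints[i + 1]) for i in range(0, 12, 3)]
    length = math.dist(pts[0], pts[1])  # p0-p1: polar tips
    width = math.dist(pts[2], pts[3])   # p2-p3: lateral endpoints
    if um_per_px is not None:           # hypothetical pixel-to-micrometre scale
        length, width = length * um_per_px, width * um_per_px
    return length, width
```

For keypoints `[0, 0, 2, 30, 40, 2, 10, 10, 2, 16, 18, 2]`, this returns `(50.0, 10.0)` in pixels.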


Usage Notes

These checkpoints are provided for reproducibility and downstream research. Each model directory contains everything needed to load the finetuned weights in its native framework.

Inference code and evaluation scripts will be released alongside the paper on GitHub.


License

Component                  License
Finetuned model weights    CC BY-NC 4.0

These weights are derived from publicly available pretrained models. Please also follow the original licensing terms of each base model when using these checkpoints.


Citation

@inproceedings{gummi2026stomata,
  author    = {Gummi, S. R. and Pack, C. and Zhang, H. K. and Solanki, S. and Chang, Y.},
  title     = {Towards Morphology Aware Stomata Keypoint Detection: Benchmarking Foundation Models Under Distribution Shift},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2026},
  note      = {Accepted}
}

Contact

Sainath Reddy Gummi
South Dakota State University
Email: gummisainath@gmail.com
