SekoKuva Mobile 10M


A mobile-sized self-supervised visual backbone for image feature extraction, retrieval, and transfer learning.

Designed in Eurajoki, Finland, and trained on the LUMI supercomputer by BC Bertenex Oy

This is the first model built under BC Bertenex Oy's 50,000 GPU-hour EuroHPC AI Factory Fast Lane allocation on LUMI.

| Property | Value |
| --- | --- |
| Parameters | 9,902,504 (9.9M) |
| Feature dimension | 2,176 |
| Input | 320 x 320 x 3 (RGB) |
| Primary output | Dense image embeddings via forward_features(x) |
| Evaluation | 40.7% k-NN accuracy on a 65-class OpenImages-derived embedding benchmark |
| Model size | 40.0 MB (PyTorch backbone checkpoint) |
| Training data | ~7.2M unlabeled Common Catalog CC-BY images |
| Training method | SimCLR contrastive learning, 100 epochs |
| License | Apache 2.0 |

Why This Model?

Most small mobile vision models are released as fixed-label classifiers. SekoKuva Mobile 10M is different: it is trained first and foremost as a feature extractor. The goal is to produce reusable visual embeddings for retrieval, clustering, few-label transfer learning, and downstream fine-tuning - not just a single closed-set classifier.

This release is also much closer in spirit to the original SekoKuva Mobile 423K card: practical, open, and focused on what a user can actually do with the weights. The code and weights are released under Apache 2.0, the pretraining corpus comes from Common Catalog CC-BY, and the repository includes both the training pipeline and the post-training evaluation outputs.

Pre-trained Weights

The recommended default checkpoint is checkpoints/backbone_latest.pt.

| File | Purpose |
| --- | --- |
| checkpoints/backbone_latest.pt | Primary backbone release for feature extraction and downstream fine-tuning |
| checkpoints/backbone_ema.pt | Alternate EMA-smoothed backbone with near-identical quality |
| checkpoints/projector_latest.pt | SimCLR projection head used during pretraining; not needed for normal downstream use |
| checkpoints/checkpoint_latest.pt | Full resumable training state for continued pretraining |

Quick Start

Extract features

```python
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms

from sekokuva_mobile.model import SekoKuvaMobile

model = SekoKuvaMobile(num_classes=1000)
state = torch.load("checkpoints/backbone_latest.pt", map_location="cpu")
model.load_state_dict(state, strict=False)  # backbone checkpoint omits classifier weights
model.eval()

transform = transforms.Compose([
    transforms.Resize(366),
    transforms.CenterCrop(320),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])

image = Image.open("photo.jpg").convert("RGB")
x = transform(image).unsqueeze(0)

with torch.no_grad():
    features = model.forward_features(x)
    features = F.normalize(features, dim=1)

print(features.shape)  # [1, 2176]
```

Use the normalized feature vector for cosine similarity search, retrieval, clustering, or as the starting point for a small downstream classifier.
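Because the vectors are L2-normalized, cosine similarity reduces to a plain matrix product. A minimal retrieval sketch, with random stand-in vectors in place of real forward_features outputs:

```python
import torch
import torch.nn.functional as F

# Random stand-ins for precomputed, L2-normalized embeddings.
torch.manual_seed(0)
gallery = F.normalize(torch.randn(1000, 2176), dim=1)  # [N, D]
query = F.normalize(torch.randn(1, 2176), dim=1)       # [1, D]

# With unit-norm vectors, q @ g.T is exactly the cosine similarity.
scores = query @ gallery.T                             # [1, N]
topk = torch.topk(scores, k=5, dim=1)
print(topk.indices[0].tolist())  # indices of the 5 nearest gallery images
```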

Fine-tune on labeled data

```python
import torch
from sekokuva_mobile.model import SekoKuvaMobile

model = SekoKuvaMobile(num_classes=65)
state = torch.load("checkpoints/backbone_latest.pt", map_location="cpu")
model.load_state_dict(state, strict=False)

# Train only the classifier for a linear probe, or unfreeze the backbone
# for full fine-tuning. See train_finetune.py for the full pipeline.
```
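For the linear-probe case, the usual pattern is to freeze every parameter and re-enable gradients only on the head. The sketch below assumes the classifier is exposed as an attribute named `classifier` (a guess; check sekokuva_mobile/model.py for the real name) and uses a tiny stand-in module so it runs anywhere:

```python
import torch
import torch.nn as nn

# Stand-in with the same 2176-dim feature / 65-class head layout;
# illustrative only, not the SekoKuva architecture.
class TinyModel(nn.Module):
    def __init__(self, feat_dim=2176, num_classes=65):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(8, feat_dim), nn.ReLU())
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        return self.classifier(self.backbone(x))

model = TinyModel()

# Freeze everything, then re-enable gradients for the classifier only.
for p in model.parameters():
    p.requires_grad = False
for p in model.classifier.parameters():
    p.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-2)
```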

Evaluation Snapshot

The released checkpoints were evaluated on a 65-class OpenImages-derived dataset with 52,898 gallery images and 9,478 validation images at 320 x 320 resolution.

| Metric | backbone_latest.pt | backbone_ema.pt |
| --- | --- | --- |
| k-NN accuracy (k=20) | 40.70% | 40.34% |
| Top-1 nearest-neighbor accuracy | 45.34% | 45.29% |
| Centroid accuracy | 23.64% | 23.82% |
| Mean top-k similarity | 0.9252 | 0.9257 |
| Same-vs-different similarity gap | 0.05935 | 0.05915 |

backbone_latest.pt is the recommended public default because it is slightly better on the nearest-neighbor retrieval metrics that matter most for practical downstream use, while the EMA checkpoint remains a valid alternate.
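The repository's actual evaluator is embedding_eval/run_embedding_eval.py; purely as an illustration of what a cosine k-NN accuracy of this kind measures, here is a self-contained sketch on random data (majority vote over the k nearest gallery embeddings; everything except the dimensions is hypothetical):

```python
import torch
import torch.nn.functional as F

def knn_accuracy(gallery, gallery_labels, queries, query_labels, k=20):
    """Majority-vote k-NN accuracy over cosine similarity."""
    g = F.normalize(gallery, dim=1)
    q = F.normalize(queries, dim=1)
    sims = q @ g.T                         # [Q, N] cosine similarities
    idx = sims.topk(k, dim=1).indices      # [Q, k] nearest gallery indices
    votes = gallery_labels[idx]            # [Q, k] neighbor labels
    pred = votes.mode(dim=1).values        # majority label per query
    return (pred == query_labels).float().mean().item()

torch.manual_seed(0)
gallery = torch.randn(500, 2176)
labels = torch.randint(0, 65, (500,))
acc = knn_accuracy(gallery, labels, gallery[:50], labels[:50], k=20)
```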

Training also converged cleanly. Across the 100-epoch run, contrastive loss fell from 6.62 to 0.118 and effective rank rose from 59.5 to 93.3, which is consistent with a stable non-collapsed representation space.
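The card does not spell out how effective rank is computed; one common definition is the exponential of the Shannon entropy of the normalized singular values of the embedding matrix, and the sketch below assumes that definition (it may differ from what the training logs actually report):

```python
import torch

def effective_rank(embeddings):
    # exp(entropy) of the normalized singular-value distribution of the
    # centered embedding matrix; ~1 for collapsed representations, up to
    # min(N, D) for a well-spread representation space.
    z = embeddings - embeddings.mean(dim=0, keepdim=True)
    s = torch.linalg.svdvals(z)
    p = s / s.sum()
    entropy = -(p * (p + 1e-12).log()).sum()
    return entropy.exp().item()

torch.manual_seed(0)
er = effective_rank(torch.randn(256, 128))  # high for random Gaussian data
```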

Architecture

SekoKuva Mobile 10M uses a MobileNetV2-style backbone scaled with width_mult=1.7.

```
Input: 320 x 320 x 3 (RGB)
  -> Stem: Conv2d 3 -> 56, stride 2
  -> Stages 1-7: MobileNetV2-style inverted residual blocks
  -> Head: 1 x 1 Conv 544 -> 2176
  -> Global Average Pooling
  -> 2176-dim feature vector
  -> Optional: Linear classifier for downstream supervised tasks
```

The backbone uses depthwise separable convolutions, BatchNorm, and ReLU6 throughout. The primary output is the 2,176-dimensional feature vector from forward_features(x).
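The actual blocks live in sekokuva_mobile/blocks.py; as an illustration only, a generic MobileNetV2-style inverted residual block with depthwise separable convolutions, BatchNorm, and ReLU6 looks roughly like this (channel counts and expansion factor here are hypothetical):

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Generic MobileNetV2-style block: 1x1 expand -> 3x3 depthwise -> 1x1 project."""
    def __init__(self, in_ch, out_ch, stride=1, expand=6):
        super().__init__()
        hidden = in_ch * expand
        self.use_res = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),           # 1x1 expansion
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=stride,
                      padding=1, groups=hidden, bias=False),   # depthwise 3x3
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),          # linear projection
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_res else out  # residual only when shapes match

x = torch.randn(1, 56, 160, 160)
y = InvertedResidual(56, 56).eval()(x)  # same shape in, same shape out
```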

Training At A Glance

| Setting | Value |
| --- | --- |
| Pretraining method | SimCLR contrastive learning |
| Projection head | Linear(2176 -> 2048) -> ReLU -> Linear(2048 -> 128) |
| Pretraining data | ~7.2M unlabeled Common Catalog CC-BY images |
| Resolution schedule | 160 -> 240 -> 320 |
| Optimizer | AdamC with cosine decay |
| Hardware | 128 x AMD Instinct MI250X GPUs on LUMI |
| EuroHPC access | AI Factory Fast Lane allocation, 50,000 GPU hours |
| Evaluation data | OpenImages-derived 65-class local benchmark |
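The projection head above is fully specified; SimCLR's NT-Xent objective is standard, and the sketch below follows the textbook formulation (it is not necessarily identical to sekokuva_mobile/simclr.py, and the temperature value is an assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Projection head as specified on this card.
projector = nn.Sequential(
    nn.Linear(2176, 2048), nn.ReLU(inplace=True), nn.Linear(2048, 128)
)

def nt_xent(z1, z2, temperature=0.1):
    """Standard SimCLR NT-Xent loss over two batches of projected views."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2]), dim=1)        # [2n, d]
    sim = z @ z.T / temperature                        # [2n, 2n]
    mask = torch.eye(2 * n, dtype=torch.bool)
    sim = sim.masked_fill(mask, float("-inf"))         # exclude self-pairs
    # Row i's positive is its other augmented view.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, targets)

h1, h2 = torch.randn(8, 2176), torch.randn(8, 2176)   # backbone features of two views
loss = nt_xent(projector(h1), projector(h2))
```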

Intended Uses

  • Image feature extraction
  • Similarity search and visual retrieval
  • Clustering, deduplication, and dataset exploration
  • Low-label or few-label transfer learning
  • Initialization for downstream classification or other vision heads
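Several of these uses reduce to simple operations on the normalized embeddings. As one illustrative example, near-duplicate detection can be a thresholded pairwise cosine similarity; the 0.95 threshold below is a made-up starting point, not a calibrated value:

```python
import torch
import torch.nn.functional as F

def duplicate_pairs(embeddings, threshold=0.95):
    """Flag pairs whose cosine similarity exceeds the threshold."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.T
    i, j = torch.triu_indices(len(z), len(z), offset=1)  # unique unordered pairs
    keep = sim[i, j] > threshold
    return list(zip(i[keep].tolist(), j[keep].tolist()))

torch.manual_seed(0)
e = torch.randn(10, 2176)
e[3] = e[0] + 0.01 * torch.randn(2176)  # plant a near-duplicate of image 0
pairs = duplicate_pairs(e)              # -> [(0, 3)]
```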

Limitations

  • This is not a zero-shot image classifier and should not be presented as one.
  • The current public evaluation is a repository-local embedding benchmark, not a standardized public transfer-learning leaderboard.
  • Performance varies by class; the model is stronger on some object categories than on people-related and fine-grained categories.
  • The pretraining corpus is web-derived and may carry cultural, geographic, and temporal bias.
  • Any deployment involving people or high-stakes decisions needs task-specific evaluation and bias review.

Repository Highlights

Repository: https://github.com/BCBertenex/sekokuva-mobile-10m

| File | Role |
| --- | --- |
| sekokuva_mobile/model.py | Backbone definition |
| sekokuva_mobile/blocks.py | Inverted residual building blocks |
| sekokuva_mobile/simclr.py | Projection head and contrastive loss |
| train_simclr.py | Large-scale self-supervised pretraining |
| train_finetune.py | Supervised downstream fine-tuning |
| embedding_eval/run_embedding_eval.py | Post-training embedding evaluator |
| embedding_eval/output/latest_full/embedding_eval.json | Final evaluation for backbone_latest.pt |
| embedding_eval/output/ema_full/embedding_eval.json | Final evaluation for backbone_ema.pt |

Acknowledgement

SekoKuva Mobile 10M was trained on the LUMI supercomputer using a 50,000 GPU-hour allocation awarded to BC Bertenex Oy through the EuroHPC JU AI Factories Fast Lane call. This is the first model produced under that allocation.

We acknowledge EuroHPC JU for awarding Proposal ID EHPC-AIF-2026FL01-123 access to resources on the standard-g partition of LUMI hosted by CSC, Finland.

License

The released code and weights are provided under Apache 2.0.

Pretraining used Common Catalog CC-BY images. The post-training embedding evaluation in this repository uses a local OpenImages-derived dataset split.

Citation

```bibtex
@misc{sekokuva2026mobile10m,
  title     = {SekoKuva Mobile 10M: A Self-Supervised Mobile Visual Backbone for Feature Extraction and Transfer Learning},
  author    = {{BC Bertenex Oy}},
  year      = {2026},
  note      = {Apache 2.0 licensed release. Trained with SimCLR on approximately 7.2M unlabeled Common Catalog CC-BY images.}
}
```

Built by BC Bertenex Oy, Finland
