SekoKuva Mobile 10M
A mobile-sized self-supervised visual backbone for image feature extraction, retrieval, and transfer learning.
Designed in Eurajoki, Finland, and trained on the LUMI supercomputer by BC Bertenex Oy.
This is the first model built under BC Bertenex Oy's 50,000 GPU-hour EuroHPC AI Factory Fast Lane allocation on LUMI.
| Property | Value |
|---|---|
| Parameters | 9,902,504 (9.9M) |
| Feature Dimension | 2,176 |
| Input | 320 x 320 x 3 (RGB) |
| Primary Output | Dense image embeddings via forward_features(x) |
| Evaluation | 40.7% k-NN accuracy on a 65-class OpenImages-derived embedding benchmark |
| Model Size | 40.0 MB (PyTorch backbone checkpoint) |
| Training Data | ~7.2M unlabeled Common Catalog CC-BY images |
| Training Method | SimCLR contrastive learning, 100 epochs |
| License | Apache 2.0 |
Why This Model?
Most small mobile vision models are released as fixed-label classifiers. SekoKuva Mobile 10M is different: it is trained first and foremost as a feature extractor. The goal is to produce reusable visual embeddings for retrieval, clustering, few-label transfer learning, and downstream fine-tuning, not just a single closed-set classifier.
This release is also much closer in spirit to the original SekoKuva Mobile 423K card: practical, open, and focused on what a user can actually do with the weights. The code and weights are released under Apache 2.0, the pretraining corpus comes from Common Catalog CC-BY, and the repository includes both the training pipeline and the post-training evaluation outputs.
Pre-trained Weights
The recommended default checkpoint is checkpoints/backbone_latest.pt.
| File | Purpose |
|---|---|
| checkpoints/backbone_latest.pt | Primary backbone release for feature extraction and downstream fine-tuning |
| checkpoints/backbone_ema.pt | Alternate EMA-smoothed backbone with near-identical quality |
| checkpoints/projector_latest.pt | SimCLR projection head used during pretraining; not needed for normal downstream use |
| checkpoints/checkpoint_latest.pt | Full resumable training state for continued pretraining |
Quick Start
Extract features
```python
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms

from sekokuva_mobile.model import SekoKuvaMobile

model = SekoKuvaMobile(num_classes=1000)
state = torch.load("checkpoints/backbone_latest.pt", map_location="cpu")
model.load_state_dict(state, strict=False)  # backbone checkpoint omits classifier weights
model.eval()

transform = transforms.Compose([
    transforms.Resize(366),
    transforms.CenterCrop(320),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])

image = Image.open("photo.jpg").convert("RGB")
x = transform(image).unsqueeze(0)

with torch.no_grad():
    features = model.forward_features(x)
    features = F.normalize(features, dim=1)

print(features.shape)  # [1, 2176]
```
Use the normalized feature vector for cosine similarity search, retrieval, clustering, or as the starting point for a small downstream classifier.
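Retrieval against a gallery of precomputed embeddings reduces to a matrix product once everything is L2-normalized. A minimal sketch with random stand-in data (the gallery size and vectors here are placeholders, not the model's actual outputs):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in gallery: in practice, stack the L2-normalized forward_features
# outputs of your image collection (here 1,000 random 2,176-dim vectors).
gallery = F.normalize(torch.randn(1000, 2176), dim=1)
query = F.normalize(torch.randn(1, 2176), dim=1)

# On normalized vectors, cosine similarity is just a dot product.
scores = query @ gallery.T                # [1, 1000]
top5 = torch.topk(scores, k=5, dim=1)     # five most similar gallery images
print(top5.indices[0].tolist())
```

For large galleries the same dot-product search can be handed to an approximate nearest-neighbor index without changing how the embeddings are produced.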
Fine-tune on labeled data
```python
import torch

from sekokuva_mobile.model import SekoKuvaMobile

model = SekoKuvaMobile(num_classes=65)
state = torch.load("checkpoints/backbone_latest.pt", map_location="cpu")
model.load_state_dict(state, strict=False)  # loads backbone weights; the new classifier stays randomly initialized

# Train only the classifier for a linear probe, or unfreeze the backbone
# for full fine-tuning. See train_finetune.py for the full pipeline.
```
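The linear-probe option amounts to freezing every backbone parameter and optimizing only the classifier head. A minimal sketch of that pattern using a toy stand-in module, since the internals of SekoKuvaMobile are not reproduced here:

```python
import torch
import torch.nn as nn

# Toy stand-in for the pretrained model: any feature extractor plus a
# classifier head. With SekoKuvaMobile, `backbone` would be the loaded
# feature path and `classifier` its new 65-way linear head.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(32, 2176), nn.ReLU())
classifier = nn.Linear(2176, 65)

# Linear probe: freeze every backbone parameter, train only the classifier.
for p in backbone.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(classifier.parameters(), lr=1e-3)

x = torch.randn(8, 32)            # dummy batch of 8 "images"
y = torch.randint(0, 65, (8,))    # dummy labels
logits = classifier(backbone(x))
loss = nn.functional.cross_entropy(logits, y)
loss.backward()                   # gradients flow only into the classifier
optimizer.step()
```

For full fine-tuning, skip the freezing loop and give the backbone parameters their own (usually smaller) learning rate in the optimizer.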
Evaluation Snapshot
The released checkpoints were evaluated on a 65-class OpenImages-derived dataset with 52,898 gallery images and 9,478 validation images at 320 x 320 resolution.
| Metric | backbone_latest.pt | backbone_ema.pt |
|---|---|---|
| k-NN accuracy (k=20) | 40.70% | 40.34% |
| Top-1 nearest-neighbor accuracy | 45.34% | 45.29% |
| Centroid accuracy | 23.64% | 23.82% |
| Mean top-k similarity | 0.9252 | 0.9257 |
| Same-vs-different similarity gap | 0.05935 | 0.05915 |
backbone_latest.pt is the recommended public default because it is slightly better on the nearest-neighbor retrieval metrics that matter most for practical downstream use, while the EMA checkpoint remains a valid alternate.
Training also converged cleanly. Across the 100-epoch run, contrastive loss fell from 6.62 to 0.118 and effective rank rose from 59.5 to 93.3, which is consistent with a stable non-collapsed representation space.
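The k-NN metric above can be reproduced in a few lines: embed the gallery and query sets, take the k most similar gallery items per query, and majority-vote their labels. A sketch with random stand-in embeddings; the repository's actual evaluator is embedding_eval/run_embedding_eval.py:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in data: in the real benchmark these would be forward_features
# embeddings of the 52,898 gallery and 9,478 validation images.
gallery = F.normalize(torch.randn(500, 64), dim=1)
gallery_labels = torch.randint(0, 65, (500,))
queries = F.normalize(torch.randn(20, 64), dim=1)
query_labels = torch.randint(0, 65, (20,))

k = 20
sims = queries @ gallery.T                  # cosine similarities, [20, 500]
nn_idx = sims.topk(k, dim=1).indices        # k nearest gallery items per query
nn_labels = gallery_labels[nn_idx]          # their labels, [20, k]
pred = torch.mode(nn_labels, dim=1).values  # majority vote per query
acc = (pred == query_labels).float().mean().item()
print(f"k-NN accuracy on random data: {acc:.2%}")
```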
Architecture
SekoKuva Mobile 10M uses a MobileNetV2-style backbone scaled with width_mult=1.7.
```
Input: 320 x 320 x 3 (RGB)
  -> Stem: Conv2d 3 -> 56, stride 2
  -> Stages 1-7: MobileNetV2-style inverted residual blocks
  -> Head: 1 x 1 Conv 544 -> 2176
  -> Global Average Pooling
  -> 2176-dim feature vector
  -> Optional: Linear classifier for downstream supervised tasks
```
The backbone uses depthwise separable convolutions, BatchNorm, and ReLU6 throughout. The primary output is the 2,176-dimensional feature vector from forward_features(x).
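For readers unfamiliar with the block structure, an illustrative MobileNetV2-style inverted residual block looks roughly like this; the real blocks live in sekokuva_mobile/blocks.py, and the expansion factor and other details below are generic MobileNetV2 defaults, not the repository's exact values:

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Illustrative block: 1x1 expand -> 3x3 depthwise -> 1x1 project,
    with a residual path when stride and channel counts allow."""

    def __init__(self, c_in, c_out, stride=1, expand=6):
        super().__init__()
        hidden = c_in * expand
        self.use_residual = stride == 1 and c_in == c_out
        self.block = nn.Sequential(
            nn.Conv2d(c_in, hidden, 1, bias=False),            # expand
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),              # depthwise
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, c_out, 1, bias=False),           # project
            nn.BatchNorm2d(c_out),       # linear bottleneck: no activation
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out

x = torch.randn(1, 56, 80, 80)             # e.g. the post-stem feature map
print(InvertedResidual(56, 56)(x).shape)   # torch.Size([1, 56, 80, 80])
```

The depthwise convolution (`groups=hidden`) is what keeps the parameter count mobile-sized, and the activation-free projection is the "linear bottleneck" from the MobileNetV2 paper.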
Training At A Glance
| Setting | Value |
|---|---|
| Pretraining method | SimCLR contrastive learning |
| Projection head | Linear(2176 -> 2048) -> ReLU -> Linear(2048 -> 128) |
| Pretraining data | ~7.2M unlabeled Common Catalog CC-BY images |
| Resolution schedule | 160 -> 240 -> 320 |
| Optimizer | AdamC with cosine decay |
| Hardware | 128 x AMD Instinct MI250X GPUs on LUMI |
| EuroHPC access | AI Factory Fast Lane allocation, 50,000 GPU hours |
| Evaluation data | OpenImages-derived 65-class local benchmark |
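The projection head below matches the table above; the NT-Xent contrastive loss is the standard SimCLR formulation, and the temperature value is an assumption here, not necessarily the repository's setting (see sekokuva_mobile/simclr.py for the actual code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Projection head as listed in the table: 2176 -> 2048 -> ReLU -> 128.
projector = nn.Sequential(
    nn.Linear(2176, 2048),
    nn.ReLU(),
    nn.Linear(2048, 128),
)

def nt_xent(z1, z2, temperature=0.1):
    """Textbook NT-Xent (SimCLR) loss; temperature is an assumed value."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2]), dim=1)   # [2N, 128]
    sim = z @ z.T / temperature                   # pairwise cosine sims
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool), float("-inf"))
    # The positive for each view is its counterpart in the other view.
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)

# Two augmented views of a batch each pass through backbone then projector.
h1, h2 = torch.randn(8, 2176), torch.randn(8, 2176)
loss = nt_xent(projector(h1), projector(h2))
print(loss.item())
```

After pretraining, the projector is discarded and only the 2,176-dim backbone features are used downstream, which is why projector_latest.pt is shipped separately.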
Intended Uses
- Image feature extraction
- Similarity search and visual retrieval
- Clustering, deduplication, and dataset exploration
- Low-label or few-label transfer learning
- Initialization for downstream classification or other vision heads
Limitations
- This is not a zero-shot image classifier and should not be presented as one.
- The current public evaluation is a repository-local embedding benchmark, not a standardized public transfer-learning leaderboard.
- Performance varies by class; the model is stronger on some object categories than on people-related and fine-grained categories.
- The pretraining corpus is web-derived and may carry cultural, geographic, and temporal bias.
- Any deployment involving people or high-stakes decisions needs task-specific evaluation and bias review.
Repository Highlights
Repository: https://github.com/BCBertenex/sekokuva-mobile-10m
| File | Role |
|---|---|
| sekokuva_mobile/model.py | Backbone definition |
| sekokuva_mobile/blocks.py | Inverted residual building blocks |
| sekokuva_mobile/simclr.py | Projection head and contrastive loss |
| train_simclr.py | Large-scale self-supervised pretraining |
| train_finetune.py | Supervised downstream fine-tuning |
| embedding_eval/run_embedding_eval.py | Post-training embedding evaluator |
| embedding_eval/output/latest_full/embedding_eval.json | Final evaluation for backbone_latest.pt |
| embedding_eval/output/ema_full/embedding_eval.json | Final evaluation for backbone_ema.pt |
Acknowledgement
SekoKuva Mobile 10M was trained on the LUMI supercomputer using a 50,000 GPU-hour allocation awarded to BC Bertenex Oy through the EuroHPC JU AI Factories Fast Lane call. This is the first model produced under that allocation.
We acknowledge EuroHPC JU for awarding Proposal ID EHPC-AIF-2026FL01-123 access to resources on the standard-g partition of LUMI hosted by CSC, Finland.
License
The released code and weights are provided under Apache 2.0.
Pretraining used Common Catalog CC-BY images. The post-training embedding evaluation in this repository uses a local OpenImages-derived dataset split.
Citation
```bibtex
@misc{sekokuva2026mobile10m,
  title  = {SekoKuva Mobile 10M: A Self-Supervised Mobile Visual Backbone for Feature Extraction and Transfer Learning},
  author = {{BC Bertenex Oy}},
  year   = {2026},
  note   = {Apache 2.0 licensed release. Trained with SimCLR on approximately 7.2M unlabeled Common Catalog CC-BY images.}
}
```
Built by BC Bertenex Oy, Finland