SekoKuva Mobile 10M
A mobile-sized self-supervised visual backbone for image feature extraction, retrieval, and transfer learning.
Designed in Eurajoki, Finland, and trained on the LUMI supercomputer by BC Bertenex Oy.
This is the first model built under BC Bertenex Oy's 50,000 GPU-hour EuroHPC AI Factory Fast Lane allocation on LUMI.
| Property | Value |
|---|---|
| Parameters | 9,902,504 (9.9M) |
| Feature Dimension | 2,176 |
| Input | 320 x 320 x 3 (RGB) |
| Primary Output | Dense image embeddings via forward_features(x) |
| Evaluation | 40.7% k-NN accuracy on a 65-class OpenImages-derived embedding benchmark |
| Model Size | 40.0 MB (PyTorch backbone checkpoint) |
| Training Data | ~7.2M unlabeled Common Catalog CC-BY images |
| Training Method | SimCLR contrastive learning, 100 epochs |
| License | Apache 2.0 |
Why This Model?
Most small mobile vision models are released as fixed-label classifiers. SekoKuva Mobile 10M is different: it is trained first and foremost as a feature extractor. The goal is to produce reusable visual embeddings for retrieval, clustering, few-label transfer learning, and downstream fine-tuning, not just a single closed-set classifier.
This release is also much closer in spirit to the original SekoKuva Mobile 423K card: practical, open, and focused on what a user can actually do with the weights. The code and weights are released under Apache 2.0, the pretraining corpus comes from Common Catalog CC-BY, and the repository includes both the training pipeline and the post-training evaluation outputs.
Pre-trained Weights
The recommended default checkpoint is checkpoints/backbone_latest.pt.
| File | Purpose |
|---|---|
| checkpoints/backbone_latest.pt | Primary backbone release for feature extraction and downstream fine-tuning |
| checkpoints/backbone_ema.pt | Alternate EMA-smoothed backbone with near-identical quality |
| checkpoints/projector_latest.pt | SimCLR projection head used during pretraining; not needed for normal downstream use |
| checkpoints/checkpoint_latest.pt | Full resumable training state for continued pretraining |
Quick Start
Extract features
```python
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms

from sekokuva_mobile.model import SekoKuvaMobile

model = SekoKuvaMobile(num_classes=1000)
state = torch.load("checkpoints/backbone_latest.pt", map_location="cpu")
model.load_state_dict(state, strict=False)  # backbone checkpoint omits classifier weights
model.eval()

transform = transforms.Compose([
    transforms.Resize(366),
    transforms.CenterCrop(320),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])

image = Image.open("photo.jpg").convert("RGB")
x = transform(image).unsqueeze(0)

with torch.no_grad():
    features = model.forward_features(x)
    features = F.normalize(features, dim=1)

print(features.shape)  # [1, 2176]
```
Use the normalized feature vector for cosine similarity search, retrieval, clustering, or as the starting point for a small downstream classifier.
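Retrieval against a gallery of precomputed embeddings reduces to a matrix product once everything is L2-normalized. A minimal sketch with random stand-in data (the gallery size and vectors here are placeholders, not the model's actual outputs):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in gallery: in practice, stack the L2-normalized forward_features
# outputs of your image collection (here 1,000 random 2,176-dim vectors).
gallery = F.normalize(torch.randn(1000, 2176), dim=1)
query = F.normalize(torch.randn(1, 2176), dim=1)

# On normalized vectors, cosine similarity is just a dot product.
scores = query @ gallery.T                # [1, 1000]
top5 = torch.topk(scores, k=5, dim=1)     # five most similar gallery images
print(top5.indices[0].tolist())
```

For large galleries the same dot-product search can be handed to an approximate nearest-neighbor index without changing how the embeddings are produced.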
Fine-tune on labeled data
```python
import torch

from sekokuva_mobile.model import SekoKuvaMobile

model = SekoKuvaMobile(num_classes=65)
state = torch.load("checkpoints/backbone_latest.pt", map_location="cpu")
model.load_state_dict(state, strict=False)  # loads backbone weights; the new classifier stays randomly initialized

# Train only the classifier for a linear probe, or unfreeze the backbone
# for full fine-tuning. See train_finetune.py for the full pipeline.
```
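The linear-probe option amounts to freezing every backbone parameter and optimizing only the classifier head. A minimal sketch of that pattern using a toy stand-in module, since the internals of SekoKuvaMobile are not reproduced here:

```python
import torch
import torch.nn as nn

# Toy stand-in for the pretrained model: any feature extractor plus a
# classifier head. With SekoKuvaMobile, `backbone` would be the loaded
# feature path and `classifier` its new 65-way linear head.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(32, 2176), nn.ReLU())
classifier = nn.Linear(2176, 65)

# Linear probe: freeze every backbone parameter, train only the classifier.
for p in backbone.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(classifier.parameters(), lr=1e-3)

x = torch.randn(8, 32)            # dummy batch of 8 "images"
y = torch.randint(0, 65, (8,))    # dummy labels
logits = classifier(backbone(x))
loss = nn.functional.cross_entropy(logits, y)
loss.backward()                   # gradients flow only into the classifier
optimizer.step()
```

For full fine-tuning, skip the freezing loop and give the backbone parameters their own (usually smaller) learning rate in the optimizer.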
Evaluation Snapshot
The released checkpoints were evaluated on a 65-class OpenImages-derived dataset with 52,898 gallery images and 9,478 validation images at 320 x 320 resolution.
| Metric | backbone_latest.pt | backbone_ema.pt |
|---|---|---|
| k-NN accuracy (k=20) | 40.70% | 40.34% |
| Top-1 nearest-neighbor accuracy | 45.34% | 45.29% |
| Centroid accuracy | 23.64% | 23.82% |
| Mean top-k similarity | 0.9252 | 0.9257 |
| Same-vs-different similarity gap | 0.05935 | 0.05915 |
backbone_latest.pt is the recommended public default because it is slightly better on the nearest-neighbor retrieval metrics that matter most for practical downstream use, while the EMA checkpoint remains a valid alternate.
Training also converged cleanly. Across the 100-epoch run, contrastive loss fell from 6.62 to 0.118 and effective rank rose from 59.5 to 93.3, which is consistent with a stable non-collapsed representation space.
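The k-NN metric above can be reproduced in a few lines: embed the gallery and query sets, take the k most similar gallery items per query, and majority-vote their labels. A sketch with random stand-in embeddings; the repository's actual evaluator is embedding_eval/run_embedding_eval.py:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in data: in the real benchmark these would be forward_features
# embeddings of the 52,898 gallery and 9,478 validation images.
gallery = F.normalize(torch.randn(500, 64), dim=1)
gallery_labels = torch.randint(0, 65, (500,))
queries = F.normalize(torch.randn(20, 64), dim=1)
query_labels = torch.randint(0, 65, (20,))

k = 20
sims = queries @ gallery.T                  # cosine similarities, [20, 500]
nn_idx = sims.topk(k, dim=1).indices        # k nearest gallery items per query
nn_labels = gallery_labels[nn_idx]          # their labels, [20, k]
pred = torch.mode(nn_labels, dim=1).values  # majority vote per query
acc = (pred == query_labels).float().mean().item()
print(f"k-NN accuracy on random data: {acc:.2%}")
```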
Architecture
SekoKuva Mobile 10M uses a MobileNetV2-style backbone scaled with width_mult=1.7.
```
Input: 320 x 320 x 3 (RGB)
  -> Stem: Conv2d 3 -> 56, stride 2
  -> Stages 1-7: MobileNetV2-style inverted residual blocks
  -> Head: 1 x 1 Conv 544 -> 2176
  -> Global Average Pooling
  -> 2176-dim feature vector
  -> Optional: Linear classifier for downstream supervised tasks
```
The backbone uses depthwise separable convolutions, BatchNorm, and ReLU6 throughout. The primary output is the 2,176-dimensional feature vector from forward_features(x).
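For readers unfamiliar with the block structure, an illustrative MobileNetV2-style inverted residual block looks roughly like this; the real blocks live in sekokuva_mobile/blocks.py, and the expansion factor and other details below are generic MobileNetV2 defaults, not the repository's exact values:

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Illustrative block: 1x1 expand -> 3x3 depthwise -> 1x1 project,
    with a residual path when stride and channel counts allow."""

    def __init__(self, c_in, c_out, stride=1, expand=6):
        super().__init__()
        hidden = c_in * expand
        self.use_residual = stride == 1 and c_in == c_out
        self.block = nn.Sequential(
            nn.Conv2d(c_in, hidden, 1, bias=False),            # expand
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),              # depthwise
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, c_out, 1, bias=False),           # project
            nn.BatchNorm2d(c_out),       # linear bottleneck: no activation
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out

x = torch.randn(1, 56, 80, 80)             # e.g. the post-stem feature map
print(InvertedResidual(56, 56)(x).shape)   # torch.Size([1, 56, 80, 80])
```

The depthwise convolution (`groups=hidden`) is what keeps the parameter count mobile-sized, and the activation-free projection is the "linear bottleneck" from the MobileNetV2 paper.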
Training At A Glance
| Setting | Value |
|---|---|
| Pretraining method | SimCLR contrastive learning |
| Projection head | Linear(2176 -> 2048) -> ReLU -> Linear(2048 -> 128) |
| Pretraining data | ~7.2M unlabeled Common Catalog CC-BY images |
| Resolution schedule | 160 -> 240 -> 320 |
| Optimizer | AdamC with cosine decay |
| Hardware | 128 x AMD Instinct MI250X GPUs on LUMI |
| EuroHPC access | AI Factory Fast Lane allocation, 50,000 GPU hours |
| Evaluation data | OpenImages-derived 65-class local benchmark |
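The projection head below matches the table above; the NT-Xent contrastive loss is the standard SimCLR formulation, and the temperature value is an assumption here, not necessarily the repository's setting (see sekokuva_mobile/simclr.py for the actual code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Projection head as listed in the table: 2176 -> 2048 -> ReLU -> 128.
projector = nn.Sequential(
    nn.Linear(2176, 2048),
    nn.ReLU(),
    nn.Linear(2048, 128),
)

def nt_xent(z1, z2, temperature=0.1):
    """Textbook NT-Xent (SimCLR) loss; temperature is an assumed value."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2]), dim=1)   # [2N, 128]
    sim = z @ z.T / temperature                   # pairwise cosine sims
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool), float("-inf"))
    # The positive for each view is its counterpart in the other view.
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)

# Two augmented views of a batch each pass through backbone then projector.
h1, h2 = torch.randn(8, 2176), torch.randn(8, 2176)
loss = nt_xent(projector(h1), projector(h2))
print(loss.item())
```

After pretraining, the projector is discarded and only the 2,176-dim backbone features are used downstream, which is why projector_latest.pt is shipped separately.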
Intended Uses
- Image feature extraction
- Similarity search and visual retrieval
- Clustering, deduplication, and dataset exploration
- Low-label or few-label transfer learning
- Initialization for downstream classification or other vision heads
Limitations
- This is not a zero-shot image classifier and should not be presented as one.
- The current public evaluation is a repository-local embedding benchmark, not a standardized public transfer-learning leaderboard.
- Performance varies by class; the model is stronger on some object categories than on people-related and fine-grained categories.
- The pretraining corpus is web-derived and may carry cultural, geographic, and temporal bias.
- Any deployment involving people or high-stakes decisions needs task-specific evaluation and bias review.
Repository Highlights
Repository: https://github.com/BCBertenex/sekokuva-mobile-10m
| File | Role |
|---|---|
| sekokuva_mobile/model.py | Backbone definition |
| sekokuva_mobile/blocks.py | Inverted residual building blocks |
| sekokuva_mobile/simclr.py | Projection head and contrastive loss |
| train_simclr.py | Large-scale self-supervised pretraining |
| train_finetune.py | Supervised downstream fine-tuning |
| embedding_eval/run_embedding_eval.py | Post-training embedding evaluator |
| embedding_eval/output/latest_full/embedding_eval.json | Final evaluation for backbone_latest.pt |
| embedding_eval/output/ema_full/embedding_eval.json | Final evaluation for backbone_ema.pt |
Acknowledgement
SekoKuva Mobile 10M was trained on the LUMI supercomputer using a 50,000 GPU-hour allocation awarded to BC Bertenex Oy through the EuroHPC JU AI Factories Fast Lane call. This is the first model produced under that allocation.
We acknowledge EuroHPC JU for awarding Proposal ID EHPC-AIF-2026FL01-123 access to resources on the standard-g partition of LUMI hosted by CSC, Finland.
License
The released code and weights are provided under Apache 2.0.
Pretraining used Common Catalog CC-BY images. The post-training embedding evaluation in this repository uses a local OpenImages-derived dataset split.
Citation
```bibtex
@misc{sekokuva2026mobile10m,
  title  = {SekoKuva Mobile 10M: A Self-Supervised Mobile Visual Backbone for Feature Extraction and Transfer Learning},
  author = {{BC Bertenex Oy}},
  year   = {2026},
  note   = {Apache 2.0 licensed release. Trained with SimCLR on approximately 7.2M unlabeled Common Catalog CC-BY images.}
}
```
Built by BC Bertenex Oy, Finland