---
license: mit
library_name: pytorch
language:
- en
tags:
- soil
- soil-science
- earth-science
- environmental-science
- multimodal
- tabular
- transformer
- representation-learning
- masked-feature-modeling
- remote-sensing
- europe
datasets:
- earthroverprogram/lucas-mega
---

# SoilFormer

A multimodal tabular transformer trained on [LUCAS-MEGA](https://huggingface.co/datasets/earthroverprogram/lucas-mega).

[Manuscript](https://arxiv.org/abs/2605.04323)

## Introduction

SoilFormer is a multimodal transformer for representation learning in soil–environment systems. It is trained on
LUCAS-MEGA, a large-scale dataset built from European soil and environmental observations, with the LUCAS soil survey as
its backbone. LUCAS-MEGA integrates heterogeneous sources into a machine-learning-ready sample–feature table, covering
numerical, categorical, textual, and visual modalities across soil physical, chemical, hydrological, environmental, and
site-related properties.

SoilFormer learns from partially observed multimodal samples using masked feature modeling. During training, a subset of
observed categorical and numerical features is masked, and the model reconstructs them from the remaining tabular and
visual context. The architecture combines grouped categorical embedding, grouped numerical encoding/decoding, vision
feature extraction and compression, transformer layers, and heteroscedastic prediction heads for uncertainty-aware
reconstruction.
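
The heteroscedastic numeric heads can be read as predicting a mean and a variance for each masked numeric feature and
training with a Gaussian negative log-likelihood restricted to the actively masked positions. The sketch below only
illustrates that idea; the class and function names, shapes, and pooling are assumptions, not the repository's actual
modules:

```python
import torch
import torch.nn as nn

class HeteroscedasticHead(nn.Module):
    """Illustrative only: predict a mean and a log-variance per numeric feature."""

    def __init__(self, d_model: int, n_numeric: int):
        super().__init__()
        self.mean = nn.Linear(d_model, n_numeric)
        self.log_var = nn.Linear(d_model, n_numeric)

    def forward(self, h: torch.Tensor):  # h: (batch, d_model) pooled transformer state
        return self.mean(h), self.log_var(h)

def gaussian_nll(mean, log_var, target, mask):
    """Gaussian negative log-likelihood averaged over actively masked positions only."""
    nll = 0.5 * (log_var + (target - mean) ** 2 / log_var.exp())
    return (nll * mask).sum() / mask.sum().clamp(min=1)
```

Masked categorical features would be reconstructed analogously, e.g. with a cross-entropy loss over their masked
positions.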

<img src="resources/arch.png" alt="SoilFormer architecture" width="70%">

## Training

Train SoilFormer with:

```bash
python modelling/train.py
```

Main configuration files:

* `config/config_model.json`: model architecture parameters, including embedding sizes, transformer layer settings,
  decoder settings, dtype, and vision model configuration.
* `config/config_data.json`: data parameters, including CSV path, vocab paths, numeric statistics, photo mapping, image
  root, train/eval split, batch size, and masking ratios.
* `config/config_train.json`: training hyperparameters, including runtime device, seed, optimizer settings, scheduler
  settings, checkpoint behavior, loss options, logging, and output paths.

## Inference

Inference uses readable JSON input cards. The workflow is:

1. Create input cards from one dataset row.
2. Edit the masked card manually if desired.
3. Run model prediction from the edited card.
4. Optionally compare predictions against the unmasked answer card.

### 1. Create input cards

```bash
python create_input_card_from_dataset.py \
  --row_index 10 \
  --output example/input_card.json
```

This writes two files:

```text
example/input_card__unmasked.json
example/input_card__masked.json
```

The unmasked card contains the raw readable values from the CSV row. The masked card randomly replaces a fraction of
categorical and numeric values with `null`. Natural missing values remain as empty strings `""`, while active masks are
represented as `null`.

Default masking ratios are 0.15 for both categorical and numeric features:

```bash
python create_input_card_from_dataset.py \
  --row_index 10 \
  --output example/input_card.json \
  --cat_mask_ratio 0.15 \
  --num_mask_ratio 0.15 \
  --seed 42
```
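
The same convention can be reproduced on any unmasked card: naturally missing values (empty strings) are left alone,
and only observed values are eligible for active masking. A minimal sketch, assuming the card layout shown further
below (a hypothetical helper with a hypothetical output path, not the repository script):

```python
import json
import random

def mask_card(unmasked_path, masked_path, cat_ratio=0.15, num_ratio=0.15, seed=42):
    """Replace a fraction of observed values with None (null); keep "" untouched."""
    rng = random.Random(seed)
    with open(unmasked_path) as f:
        card = json.load(f)
    for section, ratio in (("categorical", cat_ratio), ("numeric", num_ratio)):
        for key, value in card[section].items():
            if value == "":           # natural missing value: leave as-is
                continue
            if rng.random() < ratio:  # active mask: serialized as null
                card[section][key] = None
    with open(masked_path, "w") as f:
        json.dump(card, f, indent=2, ensure_ascii=False)

mask_card("example/input_card__unmasked.json", "example/my_masked_card.json")
```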

The card format is intentionally simple and user-editable. Users can copy this card as a template, replace the values
with their own soil sample information, and set fields to `null` to indicate which ones should be predicted during
inference:

```json
{
  "categorical": {
    "land_site:land_cover_primary": "B16: Cropland => Cereals => Maize",
    "land_site:land_use_primary": null,
    "soil_type:WRB_soil_group": "Cambisol",
    "texture:ISSS_class": "silty clay",
    "...": "..."
  },
  "numeric": {
    "carbon:CaCO3_content (g/kg)": 7.0,
    "carbon:SOC_saturation_ratio": 0.3647958934307098,
    "geographic:latitude (deg)": 38.8513900000485,
    "geographic:longitude (deg)": -9.29050000007487,
    "mass_density:bulk_density (g/cm³)": null,
    "...": "..."
  },
  "vision": {
    "image_path_suffix": "relative/path/to/photo.jpg"
  }
}
```
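
Such a card can also be edited programmatically. The snippet below reuses field names from the example above and writes
to a hypothetical output path:

```python
import json

with open("example/input_card__masked.json") as f:
    card = json.load(f)

# Request a prediction for the WRB soil group and keep CaCO3 content as an observed input.
card["categorical"]["soil_type:WRB_soil_group"] = None   # null => to be predicted
card["numeric"]["carbon:CaCO3_content (g/kg)"] = 7.0     # observed value

with open("example/my_input_card.json", "w") as f:        # hypothetical output path
    json.dump(card, f, indent=2, ensure_ascii=False)
```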

### 2. Run prediction

```bash
python inference_predict_output_card.py \
  --checkpoint model_weights/soilformer_pretrain/hetero_epoch_200.pt \
  --input_card example/input_card__masked.json \
  --output example/output_card.json
```

This writes:

```text
example/output_card.json
```

`output_card.json` contains readable predictions:

* categorical outputs are decoded back to raw category labels;
* numeric outputs are converted from z-score space back to the original physical units (see the sketch below);
* vision input is read from `vision.image_path_suffix` together with `photo_root` in `config/config_data.json`.
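
The de-normalization in the second bullet is the usual inverse z-score transform, with the per-feature statistics
assumed to come from the numeric statistics referenced in `config/config_data.json`:

```python
# Inverse z-score transform applied per numeric feature:
#   value_in_original_units = z_prediction * feature_std + feature_mean
def denormalize(z_prediction: float, feature_mean: float, feature_std: float) -> float:
    return z_prediction * feature_std + feature_mean

# Illustrative numbers only: a predicted z-score of 0.8 with mean 1.2 and std 0.25
# maps back to 0.8 * 0.25 + 1.2 = 1.4 in the original physical units.
```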

### 3. Evaluation with an answer card

```bash
python inference_predict_output_card.py \
  --checkpoint model_weights/soilformer_pretrain/hetero_epoch_200.pt \
  --input_card example/input_card__masked.json \
  --answer_card example/input_card__unmasked.json \
  --output example/output_card.json
```

This additionally writes:

```text
example/output_card__acc.json
```

When `--answer_card` is provided, `output_card__acc.json` reports reconstruction metrics over fields that are `null` in
the masked input card:

* categorical accuracy for masked categorical fields;
* numeric MAE for masked numeric fields, measured in the original feature units (a sketch of both computations follows
  below).
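
The following sketch reproduces both metrics from the three cards. It assumes the output card mirrors the
`categorical`/`numeric` layout of the input cards, which is an assumption about the file format rather than something
stated above:

```python
import json

def load(path):
    with open(path) as f:
        return json.load(f)

masked = load("example/input_card__masked.json")
answer = load("example/input_card__unmasked.json")
output = load("example/output_card.json")

# Fields that were actively masked (null) in the input card.
cat_keys = [k for k, v in masked["categorical"].items() if v is None]
num_keys = [k for k, v in masked["numeric"].items() if v is None]

# Categorical accuracy: exact label match on masked categorical fields.
cat_acc = sum(output["categorical"][k] == answer["categorical"][k] for k in cat_keys) / max(len(cat_keys), 1)

# Numeric MAE: absolute error on masked numeric fields, in original units.
num_mae = sum(abs(float(output["numeric"][k]) - float(answer["numeric"][k])) for k in num_keys) / max(len(num_keys), 1)

print(f"categorical accuracy: {cat_acc:.3f}   numeric MAE: {num_mae:.3f}")
```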