---
license: mit
library_name: pytorch
language:
- en
tags:
- soil
- soil-science
- earth-science
- environmental-science
- multimodal
- tabular
- transformer
- representation-learning
- masked-feature-modeling
- remote-sensing
- europe
datasets:
- earthroverprogram/lucas-mega
---

# SoilFormer

A multimodal tabular transformer trained on [LUCAS-MEGA](https://huggingface.co/datasets/earthroverprogram/lucas-mega).

[Manuscript](https://arxiv.org/abs/2605.04323)

## Introduction

SoilFormer is a multimodal transformer for representation learning in soil–environment systems. It is trained on
LUCAS-MEGA, a large-scale dataset built from European soil and environmental observations, with the LUCAS soil survey as
its backbone. LUCAS-MEGA integrates heterogeneous sources into a machine-learning-ready sample–feature table, covering
numerical, categorical, textual, and visual modalities across soil physical, chemical, hydrological, environmental, and
site-related properties.

SoilFormer learns from partially observed multimodal samples using masked feature modeling. During training, a subset of
observed categorical and numerical features is masked, and the model reconstructs them from the remaining tabular and
visual context. The architecture combines grouped categorical embedding, grouped numerical encoding/decoding, vision
feature extraction and compression, transformer layers, and heteroscedastic prediction heads for uncertainty-aware
reconstruction.

<img src="resources/arch.png" alt="SoilFormer architecture" width="70%">
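
The heteroscedastic numeric heads can be illustrated with a minimal sketch: for each masked numeric feature the model
predicts both a mean and a (log-)variance, and the reconstruction loss is a Gaussian negative log-likelihood averaged
over the actively masked positions. The module below is an illustrative stand-in, not the repository's implementation;
the layer names and sizes are assumptions.

```python
import torch
import torch.nn as nn

class HeteroscedasticHead(nn.Module):
    """Illustrative uncertainty-aware head: one mean and one log-variance per numeric feature."""

    def __init__(self, d_model: int, n_numeric: int):
        super().__init__()
        self.mean = nn.Linear(d_model, n_numeric)
        self.log_var = nn.Linear(d_model, n_numeric)

    def forward(self, h: torch.Tensor):
        return self.mean(h), self.log_var(h)

def gaussian_nll(mean, log_var, target, mask):
    """Gaussian negative log-likelihood, averaged over actively masked positions only."""
    nll = 0.5 * (log_var + (target - mean) ** 2 / log_var.exp())
    return (nll * mask).sum() / mask.sum().clamp(min=1)
```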

## Training

Train SoilFormer with:

```bash
python modelling/train.py
```

Main configuration files:

* `config/config_model.json`: model architecture parameters, including embedding sizes, transformer layer settings,
  decoder settings, dtype, and vision model configuration.
* `config/config_data.json`: data parameters, including CSV path, vocab paths, numeric statistics, photo mapping, image
  root, train/eval split, batch size, and masking ratios.
* `config/config_train.json`: training hyperparameters, including runtime device, seed, optimizer settings, scheduler
  settings, checkpoint behavior, loss options, logging, and output paths.
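
Because all three configs are plain JSON, they can be inspected or tweaked programmatically before launching training.
The snippet below is only a sketch; the key name shown (`batch_size`) is illustrative and may differ from the shipped
files.

```python
import json
from pathlib import Path

# Load the three configuration files listed above.
cfg = {name: json.loads(Path(f"config/config_{name}.json").read_text())
       for name in ("model", "data", "train")}

# Example tweak before a run; the exact key name is an assumption, check the shipped files.
cfg["data"]["batch_size"] = 64
Path("config/config_data.json").write_text(json.dumps(cfg["data"], indent=2))
```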

## Inference

Inference uses readable JSON input cards. The workflow is:

1. Create input cards from one dataset row.
2. Edit the masked card manually if desired.
3. Run model prediction from the edited card.
4. Optionally compare predictions against the unmasked answer card.

### 1. Create input cards

```bash
python create_input_card_from_dataset.py \
    --row_index 10 \
    --output example/input_card.json
```

This writes two files:

```text
example/input_card__unmasked.json
example/input_card__masked.json
```

The unmasked card contains the raw, readable values from the CSV row. The masked card randomly replaces a fraction of
the categorical and numeric values with `null`. Naturally missing values remain as empty strings (`""`), while active
masks are represented as `null`.
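
After `json.load`, `null` becomes `None` in Python, so the two kinds of missingness can be told apart directly. A small
sketch (the file path matches the example above):

```python
import json

with open("example/input_card__masked.json") as f:
    card = json.load(f)

for group in ("categorical", "numeric"):
    for name, value in card[group].items():
        if value is None:      # active mask: the model should predict this field
            print(f"to predict: {name}")
        elif value == "":      # naturally missing in the source data
            print(f"missing:    {name}")
```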

Default masking ratios are 0.15 for both categorical and numeric features:

```bash
python create_input_card_from_dataset.py \
    --row_index 10 \
    --output example/input_card.json \
    --cat_mask_ratio 0.15 \
    --num_mask_ratio 0.15 \
    --seed 42
```

The card format is intentionally simple and user-editable. Users can copy this card as a template, replace the values
with their own soil sample information, and set fields to `null` to indicate which of them should be predicted during
inference:

```json
{
  "categorical": {
    "land_site:land_cover_primary": "B16: Cropland => Cereals => Maize",
    "land_site:land_use_primary": null,
    "soil_type:WRB_soil_group": "Cambisol",
    "texture:ISSS_class": "silty clay",
    "...": "..."
  },
  "numeric": {
    "carbon:CaCO3_content (g/kg)": 7.0,
    "carbon:SOC_saturation_ratio": 0.3647958934307098,
    "geographic:latitude (deg)": 38.8513900000485,
    "geographic:longitude (deg)": -9.29050000007487,
    "mass_density:bulk_density (g/cm³)": null,
    "...": "..."
  },
  "vision": {
    "image_path_suffix": "relative/path/to/photo.jpg"
  }
}
```
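
Editing a card programmatically follows the same convention. The sketch below loads the masked card, fills in a few
values for a user's own sample, and marks one field for prediction; the field names come from the example above, while
the coordinates are placeholder values.

```python
import json

with open("example/input_card__masked.json") as f:
    card = json.load(f)

# Overwrite known values with your own sample information (placeholder coordinates).
card["numeric"]["geographic:latitude (deg)"] = 48.2082
card["numeric"]["geographic:longitude (deg)"] = 16.3738

# Set a field to null (None) so the model predicts it.
card["numeric"]["mass_density:bulk_density (g/cm³)"] = None

with open("example/my_card.json", "w") as f:
    json.dump(card, f, ensure_ascii=False, indent=2)
```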

### 2. Run prediction

```bash
python inference_predict_output_card.py \
    --checkpoint model_weights/soilformer_pretrain/hetero_epoch_200.pt \
    --input_card example/input_card__masked.json \
    --output example/output_card.json
```

This writes:

```text
example/output_card.json
```

`output_card.json` contains readable predictions:

* categorical outputs are decoded back to raw category labels;
* numeric outputs are converted from z-score space back to the original physical units;
* vision input is read from `vision.image_path_suffix` together with `photo_root` in `config/config_data.json`.
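
The z-score conversion in the second point is the usual linear de-standardization, using the per-feature statistics
referenced in `config/config_data.json`. A minimal sketch with illustrative numbers:

```python
def denormalize(z: float, mean: float, std: float) -> float:
    """Map a model output from z-score space back to the original physical unit."""
    return z * std + mean

# e.g. a predicted z-score of 0.8 for bulk density, with illustrative statistics
# mean = 1.35 g/cm³ and std = 0.20 g/cm³:
bulk_density = denormalize(0.8, mean=1.35, std=0.20)  # 1.51 g/cm³
```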

### 3. Evaluation with an answer card

```bash
python inference_predict_output_card.py \
    --checkpoint model_weights/soilformer_pretrain/hetero_epoch_200.pt \
    --input_card example/input_card__masked.json \
    --answer_card example/input_card__unmasked.json \
    --output example/output_card.json
```

This additionally writes:

```text
example/output_card__acc.json
```

When `--answer_card` is provided, `output_card__acc.json` reports reconstruction metrics over the fields that are `null`
in the masked input card:

* categorical accuracy for masked categorical fields;
* numeric MAE for masked numeric fields, measured in the original feature units.
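
For reference, the reported metrics amount to the following computation over the actively masked fields. This is an
independent sketch of the definitions above, not the script's code, and it assumes the output card mirrors the input
card's `categorical`/`numeric` layout.

```python
import json

masked = json.load(open("example/input_card__masked.json"))
answer = json.load(open("example/input_card__unmasked.json"))
output = json.load(open("example/output_card.json"))

# Categorical accuracy over fields that were null in the masked card.
cat_fields = [k for k, v in masked["categorical"].items() if v is None]
cat_acc = sum(output["categorical"][k] == answer["categorical"][k] for k in cat_fields) / max(len(cat_fields), 1)

# Numeric MAE over masked fields, in the original feature units.
num_fields = [k for k, v in masked["numeric"].items() if v is None]
num_mae = sum(abs(output["numeric"][k] - answer["numeric"][k]) for k in num_fields) / max(len(num_fields), 1)

print(f"categorical accuracy: {cat_acc:.3f}, numeric MAE: {num_mae:.3f}")
```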