SoilFM Language Tower: Qwen2.5-14B Literature CPT

A domain-adapted large language model for soil science and soil microbiology, created by continued pretraining of Qwen2.5-14B-Instruct on 200,000 curated soil science text passages.

This model is the Language Tower component of SoilFM2, a multi-modal foundation model for soil microbiome analysis developed at Lawrence Berkeley National Laboratory.

Model Details

| Attribute | Value |
|---|---|
| Base model | Qwen/Qwen2.5-14B-Instruct (14.2B parameters) |
| Method | Continued pretraining via QLoRA (4-bit NF4) |
| Format | Full merged model (LoRA weights merged into base) |
| Precision | BF16 |
| Context length | 32,768 tokens |
| Size on disk | ~28 GB |
| LoRA adapter | Also available at northenlab/soilfm-qwen2.5-14b-qlora (263 MB) |

Intended Uses

  • Generating explanations of soil microbial processes, rhizosphere ecology, and plant-microbe interactions
  • Providing domain-grounded context within the SoilFM2 multi-modal pipeline (prebiotic recommendation, substrate preference prediction)
  • Serving as a soil-science-aware backbone for downstream fine-tuning or RAG systems
  • Research and educational applications in soil microbiology

Training Data

The training corpus was assembled from four sources of soil science domain knowledge, stratified-sampled to 200,000 training examples and 10,000 validation examples (seed = 42):

| Source | Description | Proportion | Train | Val |
|---|---|---|---|---|
| PubMed Central | Full-text soil microbiology papers (39,853 articles) | 55% | 110,000 | 5,500 |
| Wikipedia | Soil science articles | 20% | 40,000 | 2,000 |
| USDA Soil Survey Manual | Official USDA technical reference | 10% | 20,000 | 1,000 |
| Wikipedia General Biology | Broad biology context to prevent catastrophic forgetting | 15% | 30,000 | 1,500 |
| **Total** | | 100% | 200,000 | 10,000 |
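The sampling step above can be sketched as follows. This is a hypothetical reconstruction, not the project's actual code: the function name `stratified_sample` and the source keys in `PROPORTIONS` are illustrative, but the proportions and `seed=42` come from the table.

```python
import random

# Per-source sampling proportions from the table above.
# The key names are hypothetical; chunks are assumed to be dicts
# with "text" and "source" fields (see Preprocessing below).
PROPORTIONS = {
    "pubmed_central": 0.55,
    "wikipedia_soil": 0.20,
    "usda_soil_survey": 0.10,
    "wikipedia_biology": 0.15,
}

def stratified_sample(chunks, n_total, proportions, seed=42):
    """Sample n_total chunks, keeping the given per-source proportions."""
    rng = random.Random(seed)
    by_source = {}
    for chunk in chunks:
        by_source.setdefault(chunk["source"], []).append(chunk)
    sample = []
    for source, frac in proportions.items():
        pool = by_source.get(source, [])
        k = min(round(n_total * frac), len(pool))
        sample.extend(rng.sample(pool, k))
    rng.shuffle(sample)
    return sample
```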

Text was chunked to 1,024 tokens with 100-token overlap. The full corpus contained 329M tokens across 388,563 chunks; the 200K stratified subsample was used for this training run.
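The chunking scheme amounts to a sliding window over token ids, advancing 924 tokens at a time so consecutive chunks share 100 tokens. A minimal sketch (the function name `chunk_tokens` is illustrative; the real pipeline would operate on tokenizer output):

```python
def chunk_tokens(token_ids, chunk_size=1024, overlap=100):
    """Split a token-id sequence into fixed-size chunks with overlap.

    Consecutive chunks share `overlap` tokens, i.e. the window
    advances by chunk_size - overlap (924 tokens for the settings above).
    """
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break
    return chunks
```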

Preprocessing

  • PubMed Central articles retrieved via BioC JSON API, cleaned of XML artifacts
  • Soil Survey Manual cleaned of page headers, footers, and index content (57 of 435 chunks removed)
  • All sources standardized to JSONL format with text and source fields
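The standardized format in the last bullet is straightforward to produce; a sketch (the helper name `write_jsonl` is illustrative, but the `text`/`source` fields match the description above):

```python
import json

def write_jsonl(records, path):
    """Write cleaned (text, source) pairs as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for text, source in records:
            f.write(json.dumps({"text": text, "source": source}) + "\n")
```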

Training Procedure

Configuration

| Parameter | Value |
|---|---|
| Quantization | 4-bit NF4 (BitsAndBytes, double quantization) |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Rank-stabilized LoRA (rsLoRA) | Yes |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable parameters | ~263M (1.85% of total) |
| Optimizer | PagedAdamW8bit |
| Learning rate | 2e-5 (cosine schedule, 10% warmup) |
| Effective batch size | 128 (micro-batch 2, gradient accumulation 64) |
| Max gradient norm | 1.0 |
| Weight decay | 0.01 |
| Precision | BF16 mixed precision |
| Flash Attention 2 | Enabled |
| Epochs | 1 |
| Total steps | 1,500 |
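In `transformers`/`peft` terms, the quantization and LoRA settings above correspond roughly to the following configuration sketch. This is an approximation assuming current BitsAndBytes and PEFT APIs, not the project's actual training script:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization with double quantization, per the table above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-14B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    use_rslora=True,  # rank-stabilized LoRA
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # ~263M trainable (~1.85%)
```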

Infrastructure

  • GPU: NVIDIA A100 PCIe 80GB (RunPod)
  • Training time: ~42 hours
  • Peak VRAM: 54 GB (67% utilization)

Training Script

A manual PyTorch training loop was used (rather than the Hugging Face Trainer) for environment compatibility and fine-grained control. The script is available in the project repository as train_soilfm_cpt_MANUAL.py.
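The core pattern of such a loop (micro-batch forward pass, loss scaled by the accumulation factor, gradient clipping, optimizer step every N micro-batches) can be illustrated on a toy model. This is a didactic sketch, not the actual script; the real run applies the same structure to the quantized Qwen model with PagedAdamW8bit, a cosine schedule, and accumulation of 64:

```python
import torch

torch.manual_seed(0)
# Toy stand-in model and data for illustration only.
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2, weight_decay=0.01)
loss_fn = torch.nn.MSELoss()
accum_steps = 4        # 64 in the actual run
max_grad_norm = 1.0

x = torch.randn(64, 8)
y = x.sum(dim=1, keepdim=True)

initial_loss = loss_fn(model(x), y).item()
for step in range(64):
    i = (step * 2) % 64
    micro_loss = loss_fn(model(x[i:i + 2]), y[i:i + 2])
    # Scale so accumulated gradients average over micro-batches
    (micro_loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        optimizer.zero_grad()
final_loss = loss_fn(model(x), y).item()
```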

Results

Validation Loss

| Step | Validation Loss |
|---|---|
| 500 | 1.7369 |
| 1,000 | 1.6281 |
| 1,500 | 1.6130 |

Validation loss fell 7.1% between the first evaluation (step 500) and the final step, and was still decreasing at the end of training with no signs of overfitting. Gradient norms remained stable in the 0.2–0.8 range throughout.

Qualitative Evaluation

Prompt: "The role of root exudates in shaping rhizosphere microbial communities involves"

Result: the model produces coherent, technically accurate continuations using appropriate domain terminology (root exudates, rhizosphere, primary/secondary metabolites, phytohormones), demonstrating successful domain adaptation.

Usage

Direct Loading (Recommended)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "northenlab/soilfm-qwen2.5-14b-literature-cpt",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "northenlab/soilfm-qwen2.5-14b-literature-cpt"
)

prompt = "The role of mycorrhizal fungi in soil nutrient cycling"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

With vLLM (Production)

```bash
python3 -m vllm.entrypoints.openai.api_server \
  --model northenlab/soilfm-qwen2.5-14b-literature-cpt \
  --host 0.0.0.0 --port 8001
```
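Once the server is up, it exposes the standard OpenAI-compatible completions endpoint. A sketch of a local query (assumes the server above is running on localhost; the prompt text is illustrative):

```shell
curl http://localhost:8001/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "northenlab/soilfm-qwen2.5-14b-literature-cpt",
        "prompt": "Nitrogen cycling in the rhizosphere is driven by",
        "max_tokens": 128
      }'
```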

LoRA Adapter Only

If you prefer to load the adapter separately (e.g., for 4-bit inference):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# Quantize the base model to 4-bit NF4, then attach the adapter
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-14B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "northenlab/soilfm-qwen2.5-14b-qlora")
```

Part of SoilFM2

This Language Tower works alongside other SoilFM2 components:

| Component | Description | HuggingFace |
|---|---|---|
| Language Tower (this model) | Domain-adapted LLM | northenlab/soilfm-qwen2.5-14b-literature-cpt |
| Graph Tower | Heterogeneous GNN on 2.39M-node knowledge graph | northenlab/soilfm2-graph-tower-joint-v0.1 |
| BSPR | Bayesian substrate preference model (AUC 0.94) | not released |

Together these components power a prebiotic recommendation pipeline that takes 16S microbiome profiles as input and suggests soil amendments to steer community function.

Use Restrictions

This model is intended for research and non-commercial use only. The training corpus includes PubMed Central Open Access articles under various Creative Commons licenses, some of which may carry non-commercial (CC BY-NC) terms. Users should ensure their use complies with the underlying data licenses.

The base model (Qwen2.5-14B-Instruct) is released under the Apache 2.0 license.

Limitations

  • Trained for 1 epoch on a 200K subsample of the full 667K corpus; additional training may further improve performance
  • Domain adaptation was evaluated primarily via validation loss and qualitative generation; systematic benchmarking on soil science Q&A tasks is ongoing
  • The model inherits the base Qwen2.5-14B-Instruct capabilities and limitations
  • Not intended for medical, agricultural, or regulatory decision-making without expert review

Citation

```bibtex
@misc{soilfm-language-tower-2025,
  title={SoilFM Language Tower: Domain Adaptation of Qwen2.5-14B for Soil Science},
  author={Northen Lab, Lawrence Berkeley National Laboratory},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/northenlab/soilfm-qwen2.5-14b-literature-cpt},
  note={Continued pretraining on 200K soil science literature examples via QLoRA}
}
```

License

Apache 2.0 (inherited from the base Qwen2.5-14B-Instruct model). Training data includes PubMed Central Open Access articles under various CC licenses; see Use Restrictions above.
