# SoilFM Language Tower – Qwen2.5-14B Literature CPT
A domain-adapted large language model for soil science and soil microbiology, created by continued pretraining of Qwen2.5-14B-Instruct on 200,000 curated soil science text passages.
This model is the Language Tower component of SoilFM2, a multi-modal foundation model for soil microbiome analysis developed at Lawrence Berkeley National Laboratory.
## Model Details

| Attribute | Value |
|---|---|
| Base model | Qwen/Qwen2.5-14B-Instruct (14.2B parameters) |
| Method | Continued pretraining via QLoRA (4-bit NF4) |
| Format | Full merged model (LoRA weights merged into base) |
| Precision | BF16 |
| Context length | 32,768 tokens |
| Size on disk | ~28 GB |
| LoRA adapter | Also available at northenlab/soilfm-qwen2.5-14b-qlora (263 MB) |
## Intended Uses
- Generating explanations of soil microbial processes, rhizosphere ecology, and plant-microbe interactions
- Providing domain-grounded context within the SoilFM2 multi-modal pipeline (prebiotic recommendation, substrate preference prediction)
- Serving as a soil-science-aware backbone for downstream fine-tuning or RAG systems
- Research and educational applications in soil microbiology
## Training Data
The training corpus was assembled from four sources of soil science domain knowledge, stratified-sampled to 200,000 training examples and 10,000 validation examples (seed = 42):
| Source | Description | Proportion | Train | Val |
|---|---|---|---|---|
| PubMed Central | Full-text soil microbiology papers (39,853 articles) | 55% | 110,000 | 5,500 |
| Wikipedia | Soil science articles | 20% | 40,000 | 2,000 |
| USDA Soil Survey Manual | Official USDA technical reference | 10% | 20,000 | 1,000 |
| Wikipedia General Biology | Broad biology context to prevent catastrophic forgetting | 15% | 30,000 | 1,500 |
| Total | | 100% | 200,000 | 10,000 |
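The stratified draw can be sketched as follows. The pool names and helper function are illustrative, but the source proportions and the seed of 42 match the table and text above:

```python
import random

# Source proportions from the table above; the keys are illustrative labels.
PROPORTIONS = {
    "pubmed_central": 0.55,
    "wikipedia_soil": 0.20,
    "usda_ssm": 0.10,
    "wikipedia_biology": 0.15,
}

def stratified_sample(pools, n_total, seed=42):
    """Draw a fixed fraction of n_total from each source pool, then shuffle."""
    rng = random.Random(seed)
    sample = []
    for source, frac in PROPORTIONS.items():
        k = round(n_total * frac)  # e.g. 110,000 PMC examples for n_total=200,000
        sample.extend(rng.sample(pools[source], k))
    rng.shuffle(sample)
    return sample
```

With `n_total=200000` this yields the 110K/40K/20K/30K split shown in the table; a separate draw with `n_total=10000` would produce the validation split.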
Text was chunked to 1,024 tokens with 100-token overlap. The full corpus contained 329M tokens across 388,563 chunks; the 200K stratified subsample was used for this training run.
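The chunking scheme described above (fixed 1,024-token windows sharing a 100-token overlap) can be sketched over a pre-tokenized sequence; `chunk_tokens` is an illustrative helper, not the project's actual code:

```python
def chunk_tokens(token_ids, chunk_size=1024, overlap=100):
    """Split a token sequence into fixed-size chunks; consecutive chunks share `overlap` tokens."""
    stride = chunk_size - overlap  # 924-token step between chunk starts
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break  # final (possibly shorter) chunk reaches the end of the text
    return chunks
```

The overlap ensures that sentences straddling a chunk boundary appear intact in at least one training example.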
### Preprocessing
- PubMed Central articles retrieved via BioC JSON API, cleaned of XML artifacts
- Soil Survey Manual cleaned of page headers, footers, and index content (57 of 435 chunks removed)
- All sources standardized to JSONL format with `text` and `source` fields
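Writing records in that schema takes only the standard library; the helper name and file path here are illustrative:

```python
import json

def to_jsonl(chunks, source, path):
    """Write one {"text": ..., "source": ...} record per line."""
    with open(path, "w", encoding="utf-8") as f:
        for text in chunks:
            f.write(json.dumps({"text": text, "source": source}, ensure_ascii=False) + "\n")
```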
## Training Procedure

### Configuration
| Parameter | Value |
|---|---|
| Quantization | 4-bit NF4 (BitsAndBytes, double quantization) |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Rank-stabilized LoRA (rsLoRA) | Yes |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable parameters | ~263M (1.85% of total) |
| Optimizer | PagedAdamW8bit |
| Learning rate | 2e-5 (cosine schedule, 10% warmup) |
| Effective batch size | 128 (micro-batch 2, gradient accumulation 64) |
| Max gradient norm | 1.0 |
| Weight decay | 0.01 |
| Precision | BF16 mixed precision |
| Flash Attention 2 | Enabled |
| Epochs | 1 |
| Total steps | 1,500 |
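As a quick sanity check, the headline numbers in the table are mutually consistent; the rsLoRA row means the adapter update is scaled by alpha/sqrt(r) rather than the standard LoRA alpha/r:

```python
import math

micro_batch, grad_accum = 2, 64
effective_batch = micro_batch * grad_accum          # 128, as reported

trainable, total = 263e6, 14.2e9
trainable_frac = trainable / total                  # ~0.0185, i.e. ~1.85% of parameters

rank, alpha = 16, 32
lora_scale = alpha / rank                           # 2.0 with standard LoRA scaling
rslora_scale = alpha / math.sqrt(rank)              # 8.0 with rank-stabilized LoRA
```

The larger rsLoRA scale keeps the adapter's effective learning signal from shrinking as rank grows, which is the motivation for the rank-stabilized variant.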
### Infrastructure
- GPU: NVIDIA A100 PCIe 80GB (RunPod)
- Training time: ~42 hours
- Peak VRAM: 54 GB (67% utilization)
### Training Script

A custom manual PyTorch training loop was used (rather than the Hugging Face `Trainer`) for compatibility and control. The script is available in the project repository as `train_soilfm_cpt_MANUAL.py`.
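The core of such a loop, with gradient accumulation (to reach the 128-example effective batch) and gradient-norm clipping at 1.0, can be sketched on a toy model. The dimensions, data, and accumulation factor below are illustrative, not the script's actual values:

```python
import torch
from torch import nn

# Toy stand-ins: the real script trains the QLoRA-wrapped 14B model.
model = nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
grad_accum = 4  # 64 in the actual run

optimizer.zero_grad()
for step in range(1, 17):
    x, y = torch.randn(2, 8), torch.randn(2, 1)
    loss = nn.functional.mse_loss(model(x), y) / grad_accum  # scale loss for accumulation
    loss.backward()
    if step % grad_accum == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        optimizer.zero_grad()
```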
## Results

### Validation Loss
| Step | Validation Loss |
|---|---|
| 500 | 1.7369 |
| 1,000 | 1.6281 |
| 1,500 | 1.6130 |
Total improvement: 7.2% over the course of training. Loss was still decreasing at the end of training with no signs of overfitting. Gradient norms remained stable in the 0.2–0.8 range throughout.
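Relative to the first logged checkpoint, the tabulated losses give roughly a 7.1% drop; the reported 7.2% total presumably measures from the start of training, before the step-500 log:

```python
losses = {500: 1.7369, 1000: 1.6281, 1500: 1.6130}  # values from the table above
rel_drop = (losses[500] - losses[1500]) / losses[500]
print(f"{rel_drop:.1%}")  # 7.1% from step 500 to step 1500
```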
### Qualitative Evaluation
Prompt: "The role of root exudates in shaping rhizosphere microbial communities involves"
Output: The model produces coherent, technically accurate continuations using appropriate domain terminology (root exudates, rhizosphere, primary/secondary metabolites, phytohormones), demonstrating successful domain adaptation.
## Usage

### Direct Loading (Recommended)
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "northenlab/soilfm-qwen2.5-14b-literature-cpt",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "northenlab/soilfm-qwen2.5-14b-literature-cpt"
)

prompt = "The role of mycorrhizal fungi in soil nutrient cycling"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### With vLLM (Production)
```bash
python3 -m vllm.entrypoints.openai.api_server \
    --model northenlab/soilfm-qwen2.5-14b-literature-cpt \
    --host 0.0.0.0 --port 8001
```
### LoRA Adapter Only
If you prefer to load the adapter separately (e.g., for 4-bit inference):
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# Match the training-time quantization (4-bit NF4, double quantization).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-14B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "northenlab/soilfm-qwen2.5-14b-qlora")
```
## Part of SoilFM2
This Language Tower works alongside other SoilFM2 components:
| Component | Description | HuggingFace |
|---|---|---|
| Language Tower (this model) | Domain-adapted LLM | northenlab/soilfm-qwen2.5-14b-literature-cpt |
| Graph Tower | Heterogeneous GNN on 2.39M-node knowledge graph | northenlab/soilfm2-graph-tower-joint-v0.1 |
| BSPR | Bayesian substrate preference model (AUC 0.94) | N/A |
Together these components power a prebiotic recommendation pipeline that takes 16S microbiome profiles as input and suggests soil amendments to steer community function.
## Use Restrictions
This model is intended for research and non-commercial use only. The training corpus includes PubMed Central Open Access articles under various Creative Commons licenses, some of which may carry non-commercial (CC BY-NC) terms. Users should ensure their use complies with the underlying data licenses.
The base model (Qwen2.5-14B-Instruct) is released under the Apache 2.0 license.
## Limitations
- Trained for 1 epoch on a 200K subsample of the full 667K corpus; additional training may further improve performance
- Domain adaptation was evaluated primarily via validation loss and qualitative generation; systematic benchmarking on soil science Q&A tasks is ongoing
- The model inherits the base Qwen2.5-14B-Instruct capabilities and limitations
- Not intended for medical, agricultural, or regulatory decision-making without expert review
## Citation

```bibtex
@misc{soilfm-language-tower-2025,
  title={SoilFM Language Tower: Domain Adaptation of Qwen2.5-14B for Soil Science},
  author={Northen Lab, Lawrence Berkeley National Laboratory},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/northenlab/soilfm-qwen2.5-14b-literature-cpt},
  note={Continued pretraining on 200K soil science literature examples via QLoRA}
}
```
## License

Apache 2.0 (inherited from the base Qwen2.5-14B-Instruct model). Training data includes PubMed Central Open Access articles under various CC licenses; see Use Restrictions above.