# SoilFM Language Tower – Qwen2.5-14B Literature CPT
A domain-adapted large language model for soil science and soil microbiology, created by continued pretraining of Qwen2.5-14B-Instruct on 200,000 curated soil science text passages.
This model is the Language Tower component of SoilFM2, a multi-modal foundation model for soil microbiome analysis developed at Lawrence Berkeley National Laboratory.
## Model Details

| Attribute | Value |
|---|---|
| Base model | Qwen/Qwen2.5-14B-Instruct (14.2B parameters) |
| Method | Continued pretraining via QLoRA (4-bit NF4) |
| Format | Full merged model (LoRA weights merged into base) |
| Precision | BF16 |
| Context length | 32,768 tokens |
| Size on disk | ~28 GB |
| LoRA adapter | Also available at northenlab/soilfm-qwen2.5-14b-qlora (263 MB) |
## Intended Uses
- Generating explanations of soil microbial processes, rhizosphere ecology, and plant-microbe interactions
- Providing domain-grounded context within the SoilFM2 multi-modal pipeline (prebiotic recommendation, substrate preference prediction)
- Serving as a soil-science-aware backbone for downstream fine-tuning or RAG systems
- Research and educational applications in soil microbiology
## Training Data
The training corpus was assembled from four sources of soil science domain knowledge, stratified-sampled to 200,000 training examples and 10,000 validation examples (seed = 42):
| Source | Description | Proportion | Train | Val |
|---|---|---|---|---|
| PubMed Central | Full-text soil microbiology papers (39,853 articles) | 55% | 110,000 | 5,500 |
| Wikipedia | Soil science articles | 20% | 40,000 | 2,000 |
| USDA Soil Survey Manual | Official USDA technical reference | 10% | 20,000 | 1,000 |
| Wikipedia General Biology | Broad biology context to prevent catastrophic forgetting | 15% | 30,000 | 1,500 |
| Total | | 100% | 200,000 | 10,000 |
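The stratified draw can be sketched as follows. The pool names and helper function are illustrative, but the source proportions and the seed of 42 match the table and text above:

```python
import random

# Source proportions from the table above; the keys are illustrative labels.
PROPORTIONS = {
    "pubmed_central": 0.55,
    "wikipedia_soil": 0.20,
    "usda_ssm": 0.10,
    "wikipedia_biology": 0.15,
}

def stratified_sample(pools, n_total, seed=42):
    """Draw a fixed fraction of n_total from each source pool, then shuffle."""
    rng = random.Random(seed)
    sample = []
    for source, frac in PROPORTIONS.items():
        k = round(n_total * frac)  # e.g. 110,000 PMC examples for n_total=200,000
        sample.extend(rng.sample(pools[source], k))
    rng.shuffle(sample)
    return sample
```

With `n_total=200000` this yields the 110K/40K/20K/30K split shown in the table; a separate draw with `n_total=10000` would produce the validation split.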
Text was chunked to 1,024 tokens with 100-token overlap. The full corpus contained 329M tokens across 388,563 chunks; the 200K stratified subsample was used for this training run.
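The chunking scheme described above (fixed 1,024-token windows sharing a 100-token overlap) can be sketched over a pre-tokenized sequence; `chunk_tokens` is an illustrative helper, not the project's actual code:

```python
def chunk_tokens(token_ids, chunk_size=1024, overlap=100):
    """Split a token sequence into fixed-size chunks; consecutive chunks share `overlap` tokens."""
    stride = chunk_size - overlap  # 924-token step between chunk starts
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break  # final (possibly shorter) chunk reaches the end of the text
    return chunks
```

The overlap ensures that sentences straddling a chunk boundary appear intact in at least one training example.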
### Preprocessing
- PubMed Central articles retrieved via BioC JSON API, cleaned of XML artifacts
- Soil Survey Manual cleaned of page headers, footers, and index content (57 of 435 chunks removed)
- All sources standardized to JSONL format with `text` and `source` fields
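Writing records in that schema takes only the standard library; the helper name and file path here are illustrative:

```python
import json

def to_jsonl(chunks, source, path):
    """Write one {"text": ..., "source": ...} record per line."""
    with open(path, "w", encoding="utf-8") as f:
        for text in chunks:
            f.write(json.dumps({"text": text, "source": source}, ensure_ascii=False) + "\n")
```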
## Training Procedure

### Configuration
| Parameter | Value |
|---|---|
| Quantization | 4-bit NF4 (BitsAndBytes, double quantization) |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Rank-stabilized LoRA (rsLoRA) | Yes |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable parameters | ~263M (1.85% of total) |
| Optimizer | PagedAdamW8bit |
| Learning rate | 2e-5 (cosine schedule, 10% warmup) |
| Effective batch size | 128 (micro-batch 2, gradient accumulation 64) |
| Max gradient norm | 1.0 |
| Weight decay | 0.01 |
| Precision | BF16 mixed precision |
| Flash Attention 2 | Enabled |
| Epochs | 1 |
| Total steps | 1,500 |
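As a quick sanity check, the headline numbers in the table are mutually consistent; the rsLoRA row means the adapter update is scaled by alpha/sqrt(r) rather than the standard LoRA alpha/r:

```python
import math

micro_batch, grad_accum = 2, 64
effective_batch = micro_batch * grad_accum          # 128, as reported

trainable, total = 263e6, 14.2e9
trainable_frac = trainable / total                  # ~0.0185, i.e. ~1.85% of parameters

rank, alpha = 16, 32
lora_scale = alpha / rank                           # 2.0 with standard LoRA scaling
rslora_scale = alpha / math.sqrt(rank)              # 8.0 with rank-stabilized LoRA
```

The larger rsLoRA scale keeps the adapter's effective learning signal from shrinking as rank grows, which is the motivation for the rank-stabilized variant.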
### Infrastructure
- GPU: NVIDIA A100 PCIe 80GB (RunPod)
- Training time: ~42 hours
- Peak VRAM: 54 GB (67% utilization)
### Training Script

A custom manual PyTorch training loop was used (rather than the Hugging Face `Trainer`) for compatibility and control. The script is available in the project repository as `train_soilfm_cpt_MANUAL.py`.
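The core of such a loop, with gradient accumulation (to reach the 128-example effective batch) and gradient-norm clipping at 1.0, can be sketched on a toy model. The dimensions, data, and accumulation factor below are illustrative, not the script's actual values:

```python
import torch
from torch import nn

# Toy stand-ins: the real script trains the QLoRA-wrapped 14B model.
model = nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
grad_accum = 4  # 64 in the actual run

optimizer.zero_grad()
for step in range(1, 17):
    x, y = torch.randn(2, 8), torch.randn(2, 1)
    loss = nn.functional.mse_loss(model(x), y) / grad_accum  # scale loss for accumulation
    loss.backward()
    if step % grad_accum == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        optimizer.zero_grad()
```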
## Results

### Validation Loss
| Step | Validation Loss |
|---|---|
| 500 | 1.7369 |
| 1,000 | 1.6281 |
| 1,500 | 1.6130 |
Total improvement: 7.2% over the course of training. Loss was still decreasing at the end of training with no signs of overfitting. Gradient norms remained stable in the 0.2–0.8 range throughout.
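Relative to the first logged checkpoint, the tabulated losses give roughly a 7.1% drop; the reported 7.2% total presumably measures from the start of training, before the step-500 log:

```python
losses = {500: 1.7369, 1000: 1.6281, 1500: 1.6130}  # values from the table above
rel_drop = (losses[500] - losses[1500]) / losses[500]
print(f"{rel_drop:.1%}")  # 7.1% from step 500 to step 1500
```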
### Qualitative Evaluation
Prompt: "The role of root exudates in shaping rhizosphere microbial communities involves"
Output: The model produces coherent, technically accurate continuations using appropriate domain terminology (root exudates, rhizosphere, primary/secondary metabolites, phytohormones), demonstrating successful domain adaptation.
## Usage

### Direct Loading (Recommended)
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "northenlab/soilfm-qwen2.5-14b-literature-cpt",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "northenlab/soilfm-qwen2.5-14b-literature-cpt"
)

prompt = "The role of mycorrhizal fungi in soil nutrient cycling"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### With vLLM (Production)
```bash
python3 -m vllm.entrypoints.openai.api_server \
    --model northenlab/soilfm-qwen2.5-14b-literature-cpt \
    --host 0.0.0.0 --port 8001
```
### LoRA Adapter Only
If you prefer to load the adapter separately (e.g., for 4-bit inference):
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# Match the training-time quantization (4-bit NF4, double quantization).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-14B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "northenlab/soilfm-qwen2.5-14b-qlora")
```
## Part of SoilFM2
This Language Tower works alongside other SoilFM2 components:
| Component | Description | HuggingFace |
|---|---|---|
| Language Tower (this model) | Domain-adapted LLM | northenlab/soilfm-qwen2.5-14b-literature-cpt |
| Graph Tower | Heterogeneous GNN on 2.39M-node knowledge graph | northenlab/soilfm2-graph-tower-joint-v0.1 |
| BSPR | Bayesian substrate preference model (AUC 0.94) | N/A |
Together these components power a prebiotic recommendation pipeline that takes 16S microbiome profiles as input and suggests soil amendments to steer community function.
## Use Restrictions
This model is intended for research and non-commercial use only. The training corpus includes PubMed Central Open Access articles under various Creative Commons licenses, some of which may carry non-commercial (CC BY-NC) terms. Users should ensure their use complies with the underlying data licenses.
The base model (Qwen2.5-14B-Instruct) is released under the Apache 2.0 license.
## Limitations
- Trained for 1 epoch on a 200K subsample of the full 667K corpus; additional training may further improve performance
- Domain adaptation was evaluated primarily via validation loss and qualitative generation; systematic benchmarking on soil science Q&A tasks is ongoing
- The model inherits the base Qwen2.5-14B-Instruct capabilities and limitations
- Not intended for medical, agricultural, or regulatory decision-making without expert review
## Citation

```bibtex
@misc{soilfm-language-tower-2025,
  title={SoilFM Language Tower: Domain Adaptation of Qwen2.5-14B for Soil Science},
  author={Northen Lab, Lawrence Berkeley National Laboratory},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/northenlab/soilfm-qwen2.5-14b-literature-cpt},
  note={Continued pretraining on 200K soil science literature examples via QLoRA}
}
```
## License

Apache 2.0 (inherited from the base Qwen2.5-14B-Instruct model). Training data includes PubMed Central Open Access articles under various CC licenses; see Use Restrictions above.