# Biomarker Extraction Model v2.1 – Best Model (Qwen3.5-0.8B LoRA)

The best-performing model in the biomarker extraction series: a LoRA adapter fine-tuned on Qwen3.5-0.8B across 3 rounds on 3,000 total samples, reaching the lowest loss in the series (0.951). Returns structured JSON with abbreviations, full names, and all biomarkers separated.
## Output Format

```json
{
  "Biomarkers_abbreviations": ["HbA1c", "CRP"],
  "Biomarkers_full_name": ["Hemoglobin A1c", "C-Reactive Protein", "troponin I"],
  "All_biomarkers": ["HbA1c", "Hemoglobin A1c", "CRP", "C-Reactive Protein", "troponin I"]
}
```
- `Biomarkers_abbreviations`: biomarkers in abbreviated form (e.g., HbA1c, CRP, TSH)
- `Biomarkers_full_name`: biomarkers in full-name form (e.g., C-Reactive Protein, troponin I)
- `All_biomarkers`: every biomarker found, in both forms
- Returns empty arrays (`[]`) when no biomarkers are found
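Since downstream code depends on this exact schema, it can be worth validating parsed output before use. A minimal stdlib-only sketch (the `validate_biomarker_output` helper is illustrative, not part of the model's pipeline):

```python
import json

# Keys the model's JSON output is documented to contain (see format above).
EXPECTED_KEYS = {"Biomarkers_abbreviations", "Biomarkers_full_name", "All_biomarkers"}

def validate_biomarker_output(raw: str) -> dict:
    """Parse the model's raw text and check it matches the documented schema."""
    parsed = json.loads(raw)
    if set(parsed) != EXPECTED_KEYS:
        raise ValueError(f"unexpected keys: {sorted(parsed)}")
    for key, value in parsed.items():
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            raise ValueError(f"{key} must be a list of strings")
    return parsed

sample = ('{"Biomarkers_abbreviations": ["CRP"], '
          '"Biomarkers_full_name": ["C-Reactive Protein"], '
          '"All_biomarkers": ["CRP", "C-Reactive Protein"]}')
print(validate_biomarker_output(sample)["All_biomarkers"])  # ['CRP', 'C-Reactive Protein']
```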
## Model Lineage

```
Qwen/Qwen3.5-0.8B
├── v1   (1K biomarker NER samples)          → loss 1.645
├── v2   (1K GPT-120B-labeled samples, bf16) → loss 1.473
└── v2.1 (1K Nemotron-labeled, JSON output)  → loss 0.951 ← THIS MODEL (BEST)
```
## Training Details

| Round | Dataset | Samples | Labeler | Loss |
|---|---|---|---|---|
| 1 (v1) | Biomarker NER (Kaggle) | 1,000 | Ground truth (BIO tags) | 1.645 |
| 2 (v2) | Clinical trial outcomes | 1,000 | GPT-OSS-120B (NVIDIA API) | 1.473 |
| 3 (v2.1) | Kaggle random sample | 1,000 | Nemotron-3-Super-120B (OpenRouter) | 0.951 |
## Data Sources

- Round 1: shubhanshu789/biomarkers-ner-training-data (Kaggle, human-annotated NER)
- Round 2: Clinical trial outcomes labeled by GPT-OSS-120B via NVIDIA API
- Round 3: Random Kaggle samples labeled by Nemotron-3-Super-120B via OpenRouter with reasoning, producing structured JSON labels (abbreviations, full names, all biomarkers)
## Hyperparameters

- Method: LoRA (bf16, NOT 4-bit, per Unsloth Qwen3.5 guidelines)
- LoRA rank: 16, alpha: 16
- Learning rate: 1e-4 (cosine scheduler)
- Batch size: 8 (gradient accumulation 2, effective 16)
- Epochs: 3
- Optimizer: adamw_8bit
- Sequence length: 2048
- Framework: Unsloth + TRL SFTTrainer
- Hardware: NVIDIA RTX A6000 (48 GB), ~11 minutes per round
- Minimum inference: any GPU with 3 GB+ VRAM
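For readers reproducing the setup, the hyperparameters above map roughly onto a PEFT/TRL configuration. This is a hypothetical reconstruction: the actual run used Unsloth's wrappers, so argument names and defaults here are illustrative, not the exact training script.

```python
# Illustrative config fragment only; the real run went through Unsloth + TRL SFTTrainer.
from peft import LoraConfig
from trl import SFTConfig

lora_config = LoraConfig(
    r=16,                       # LoRA rank
    lora_alpha=16,              # LoRA alpha
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,  # effective batch size 16
    num_train_epochs=3,
    optim="adamw_8bit",
    max_seq_length=2048,
    bf16=True,                      # bf16, not 4-bit
)
```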
## Usage

### With Unsloth (recommended)

```python
import json

import torch
from unsloth import FastLanguageModel

# Load the LoRA adapter in bf16 (not 4-bit), matching how it was trained.
# from_pretrained already returns the matching tokenizer.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Shubh-0789/biomarker-qwen3.5-0.8b-lora-v2.1",
    max_seq_length=2048,
    load_in_4bit=False,
    dtype=torch.bfloat16,
)
FastLanguageModel.for_inference(model)
model.generation_config.pad_token_id = tokenizer.pad_token_id

clinical_text = (
    "The patient's HbA1c was 7.2%, C-Reactive Protein (CRP) levels elevated "
    "at 15mg/L, and troponin I was within normal range."
)
messages = [
    {"role": "user", "content": f"Extract all biomarker names from the following clinical text. Return ONLY a JSON.\nText: {clinical_text}"}
]
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_tensors="pt", return_dict=True,
).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.1, do_sample=True)

# Decode only the newly generated tokens, then parse the JSON.
result = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
biomarkers = json.loads(result)
print(json.dumps(biomarkers, indent=2))
```
Output:

```json
{
  "Biomarkers_abbreviations": ["HbA1c", "CRP"],
  "Biomarkers_full_name": ["Hemoglobin A1c", "C-Reactive Protein", "troponin I"],
  "All_biomarkers": ["HbA1c", "Hemoglobin A1c", "CRP", "C-Reactive Protein", "troponin I"]
}
```
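Small models occasionally wrap the JSON in extra prose despite the "Return ONLY a JSON" instruction. A hedged fallback (the `parse_model_json` helper is an assumed convenience, not part of the model card's pipeline) extracts the first `{...}` block before parsing:

```python
import json
import re

def parse_model_json(result: str) -> dict:
    """Parse model output; fall back to the first {...} span if extra text surrounds it."""
    try:
        return json.loads(result)
    except json.JSONDecodeError:
        match = re.search(r"\{.*\}", result, re.DOTALL)
        if match is None:
            raise
        return json.loads(match.group(0))

noisy = ('Sure, here is the JSON:\n'
         '{"Biomarkers_abbreviations": ["TSH"], "Biomarkers_full_name": [], '
         '"All_biomarkers": ["TSH"]}')
print(parse_model_json(noisy)["All_biomarkers"])  # ['TSH']
```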
### With PEFT/Transformers (no Unsloth)

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model in bf16, then attach the LoRA adapter.
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-0.8B", torch_dtype="bfloat16", device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "Shubh-0789/biomarker-qwen3.5-0.8b-lora-v2.1")
tokenizer = AutoTokenizer.from_pretrained("Shubh-0789/biomarker-qwen3.5-0.8b-lora-v2.1")
```
## Examples

| Clinical Text | Biomarkers_abbreviations | Biomarkers_full_name | All_biomarkers |
|---|---|---|---|
| "HbA1c 7.2%, C-Reactive Protein (CRP) elevated, troponin I normal" | HbA1c, CRP | Hemoglobin A1c, C-Reactive Protein, troponin I | HbA1c, Hemoglobin A1c, CRP, C-Reactive Protein, troponin I |
| "VEGF-D Serum Levels" | VEGF-D | – | VEGF-D |
| "FEV1. Forced expiratory volume in 1 second" | FEV1 | Forced expiratory volume in 1 second | FEV1, Forced expiratory volume in 1 second |
| "TP53 mutation status in tumor tissue" | TP53 | – | TP53 |
| "Placental Growth Factor, PIGF" | PIGF | Placental Growth Factor | PIGF, Placental Growth Factor |
| "Patient-Reported Outcome Measures (PROMs)" | – | – | (empty; no biomarkers) |
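Because `All_biomarkers` mixes abbreviations and full names, downstream matching against a reference list often benefits from case-insensitive deduplication. An illustrative post-processing helper (not part of the model):

```python
def normalize_biomarkers(biomarkers):
    """Case-insensitively deduplicate, keeping first-seen spelling and order."""
    seen = set()
    out = []
    for name in biomarkers:
        key = name.lower()
        if key not in seen:
            seen.add(key)
            out.append(name)
    return out

print(normalize_biomarkers(["HbA1c", "hba1c", "CRP"]))  # ['HbA1c', 'CRP']
```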
## All Models in This Series

| Model | Rounds | Samples | Loss | Output | Link |
|---|---|---|---|---|---|
| v1 | 1 | 1,000 | 1.645 | comma list | v1 |
| v1.1 | 2 | 2,000 | 1.051 | JSON | v1.1 |
| v2 | 2 | 2,000 | 1.473 | comma list | v2 |
| v2.1 | 3 | 3,000 | 0.951 | JSON | This model (Best) |
## Limitations
- Trained on English clinical trial text only
- May miss biomarkers in non-standard naming not seen during training
- Best suited for clinical trial outcome measures, lab values, and diagnostic markers
## Citation

```bibtex
@misc{biomarker-qwen3.5-0.8b-lora-v2.1,
  author    = {Shubh-0789},
  title     = {Biomarker Extraction Model v2.1 – Best (Qwen3.5-0.8B LoRA, JSON output)},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Shubh-0789/biomarker-qwen3.5-0.8b-lora-v2.1}
}
```