Biomarker Extraction Model v2.1: Best Model (Qwen3.5-0.8B LoRA)

The best-performing model in the biomarker extraction series: a LoRA adapter fine-tuned on Qwen3.5-0.8B across 3 rounds on 3,000 total samples, reaching the lowest loss in the series (0.951). Returns structured JSON with abbreviations, full names, and all biomarkers in separate arrays.

Output Format

{
  "Biomarkers_abbreviations": ["HbA1c", "CRP"],
  "Biomarkers_full_name": ["Hemoglobin A1c", "C-Reactive Protein", "troponin I"],
  "All_biomarkers": ["HbA1c", "Hemoglobin A1c", "CRP", "C-Reactive Protein", "troponin I"]
}
  • Biomarkers_abbreviations: Biomarkers in abbreviated form (e.g., HbA1c, CRP, TSH)
  • Biomarkers_full_name: Biomarkers in full-name form (e.g., C-Reactive Protein, troponin I)
  • All_biomarkers: Every biomarker found in both forms
  • Returns empty arrays [] when no biomarkers are found
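When consuming the model's output downstream, it can help to verify that a parsed response actually matches this schema. A minimal sketch; the `validate_biomarker_output` helper is hypothetical, not part of the model:

```python
EXPECTED_KEYS = ("Biomarkers_abbreviations", "Biomarkers_full_name", "All_biomarkers")

def validate_biomarker_output(obj):
    """Return True if obj matches the documented output schema:
    a dict with exactly the three expected keys, each a list of strings."""
    if not isinstance(obj, dict) or set(obj) != set(EXPECTED_KEYS):
        return False
    return all(
        isinstance(obj[k], list) and all(isinstance(x, str) for x in obj[k])
        for k in EXPECTED_KEYS
    )
```

A response with empty arrays for all three keys is still valid, per the "no biomarkers found" case above.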

Model Lineage

Qwen/Qwen3.5-0.8B
└── v1 (1K biomarker NER samples) → loss 1.645
    └── v2 (1K GPT-120B-labeled samples, bf16) → loss 1.473
        └── v2.1 (1K Nemotron-labeled, JSON output) → loss 0.951  ← THIS MODEL (BEST)

Training Details

| Round | Dataset | Samples | Labeler | Loss |
|---|---|---|---|---|
| 1 (v1) | Biomarker NER (Kaggle) | 1,000 | Ground truth (BIO tags) | 1.645 |
| 2 (v2) | Clinical trial outcomes | 1,000 | GPT-OSS-120B (NVIDIA API) | 1.473 |
| 3 (v2.1) | Kaggle random sample | 1,000 | Nemotron-3-Super-120B (OpenRouter) | 0.951 |

Data Sources

  • Round 1: shubhanshu789/biomarkers-ner-training-data (Kaggle, human-annotated NER)
  • Round 2: Clinical trial outcomes labeled by GPT-OSS-120B via NVIDIA API
  • Round 3: Random Kaggle samples labeled by Nemotron-3-Super-120B via OpenRouter with reasoning; structured JSON labels (abbreviations, full names, all biomarkers)

Hyperparameters

Method: LoRA (bf16, NOT 4-bit, per Unsloth Qwen3.5 guidelines)
LoRA rank: 16, alpha: 16
Learning rate: 1e-4 (cosine scheduler)
Batch size: 8 (gradient accumulation 2, effective 16)
Epochs: 3
Optimizer: adamw_8bit
Sequence length: 2048
Framework: Unsloth + TRL SFTTrainer
Hardware: NVIDIA RTX A6000 (48GB), ~11 minutes per round
Minimum inference: any GPU with 3GB+ VRAM
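For reference, the settings above map onto TRL `SFTConfig`-style keyword arguments roughly as follows. This is a sketch of the hyperparameters, not the published training script; the exact field names assume a recent `trl` version:

```python
# Sketch: the hyperparameters above expressed as TRL SFTConfig-style kwargs.
# The actual training script for this model is not published.
sft_kwargs = dict(
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,   # effective batch size: 8 * 2 = 16
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    num_train_epochs=3,
    optim="adamw_8bit",
    max_seq_length=2048,
    bf16=True,                       # bf16 LoRA, not 4-bit
)

effective_batch = (
    sft_kwargs["per_device_train_batch_size"]
    * sft_kwargs["gradient_accumulation_steps"]
)
```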

Usage

With Unsloth (recommended)

import json
import torch
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Shubh-0789/biomarker-qwen3.5-0.8b-lora-v2.1",
    max_seq_length=2048,
    load_in_4bit=False,  # bf16 LoRA, per the training setup above
    dtype=torch.bfloat16,
)
text_tokenizer = tokenizer  # FastLanguageModel already returns the tokenizer; no need to reload it
FastLanguageModel.for_inference(model)
model.generation_config.pad_token_id = text_tokenizer.pad_token_id

# Extract biomarkers
clinical_text = "The patient's HbA1c was 7.2%, C-Reactive Protein (CRP) levels elevated at 15mg/L, and troponin I was within normal range."

messages = [
    {"role": "user", "content": f"Extract all biomarker names from the following clinical text. Return ONLY a JSON.\nText: {clinical_text}"}
]

inputs = text_tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_tensors="pt", return_dict=True,
).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.1, do_sample=True)

result = text_tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
biomarkers = json.loads(result)
print(json.dumps(biomarkers, indent=2))

Output:

{
  "Biomarkers_abbreviations": ["HbA1c", "CRP"],
  "Biomarkers_full_name": ["Hemoglobin A1c", "C-Reactive Protein", "troponin I"],
  "All_biomarkers": ["HbA1c", "Hemoglobin A1c", "CRP", "C-Reactive Protein", "troponin I"]
}
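The `json.loads(result)` call above assumes the model emits clean JSON. Small instruction-tuned models occasionally wrap the JSON in code fences or add trailing text, so a defensive parse can be useful. A hypothetical helper, not part of the model's API:

```python
import json
import re

EMPTY = {"Biomarkers_abbreviations": [], "Biomarkers_full_name": [], "All_biomarkers": []}

def parse_biomarker_json(raw: str) -> dict:
    """Extract the first {...} span from raw model output and parse it,
    falling back to empty arrays if nothing parseable is found."""
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if not match:
        return dict(EMPTY)
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return dict(EMPTY)
```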

With PEFT/Transformers (no Unsloth)

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-0.8B", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "Shubh-0789/biomarker-qwen3.5-0.8b-lora-v2.1")
tokenizer = AutoTokenizer.from_pretrained("Shubh-0789/biomarker-qwen3.5-0.8b-lora-v2.1")

Examples

| Clinical Text | Biomarkers_abbreviations | Biomarkers_full_name | All_biomarkers |
|---|---|---|---|
| "HbA1c 7.2%, C-Reactive Protein (CRP) elevated, troponin I normal" | HbA1c, CRP | Hemoglobin A1c, C-Reactive Protein, troponin I | HbA1c, Hemoglobin A1c, CRP, C-Reactive Protein, troponin I |
| "VEGF-D Serum Levels" | VEGF-D | (none) | VEGF-D |
| "FEV1. Forced expiratory volume in 1 second" | FEV1 | Forced expiratory volume in 1 second | FEV1, Forced expiratory volume in 1 second |
| "TP53 mutation status in tumor tissue" | TP53 | (none) | TP53 |
| "Placental Growth Factor, PIGF" | PIGF | Placental Growth Factor | PIGF, Placental Growth Factor |
| "Patient-Reported Outcome Measures (PROMs)" | (none) | (none) | (empty, no biomarkers) |

All Models in This Series

| Model | Rounds | Samples | Loss | Output | Link |
|---|---|---|---|---|---|
| v1 | 1 | 1,000 | 1.645 | comma list | v1 |
| v1.1 | 2 | 2,000 | 1.051 | JSON | v1.1 |
| v2 | 2 | 2,000 | 1.473 | comma list | v2 |
| v2.1 | 3 | 3,000 | 0.951 | JSON | This model (Best) |

Limitations

  • Trained on English clinical trial text only
  • May miss biomarkers in non-standard naming not seen during training
  • Best suited for clinical trial outcome measures, lab values, and diagnostic markers

Citation

@misc{biomarker-qwen3.5-0.8b-lora-v2.1,
  author = {Shubh-0789},
  title = {Biomarker Extraction Model v2.1: Best (Qwen3.5-0.8B LoRA, JSON output)},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Shubh-0789/biomarker-qwen3.5-0.8b-lora-v2.1}
}