Biomarker Extraction Model v2.1: Best Model (Qwen3.5-0.8B LoRA)

The best-performing model in the biomarker extraction series: a LoRA adapter fine-tuned on Qwen3.5-0.8B across 3 rounds on 3,000 total samples, reaching the lowest loss in the series (0.951). Returns structured JSON with abbreviations, full names, and all biomarkers in separate arrays.

Output Format

{
  "Biomarkers_abbreviations": ["HbA1c", "CRP"],
  "Biomarkers_full_name": ["Hemoglobin A1c", "C-Reactive Protein", "troponin I"],
  "All_biomarkers": ["HbA1c", "Hemoglobin A1c", "CRP", "C-Reactive Protein", "troponin I"]
}
  • Biomarkers_abbreviations: Biomarkers in abbreviated form (e.g., HbA1c, CRP, TSH)
  • Biomarkers_full_name: Biomarkers in full-name form (e.g., C-Reactive Protein, troponin I)
  • All_biomarkers: Every biomarker found in both forms
  • Returns empty arrays [] when no biomarkers are found
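When consuming the model's output downstream, it can help to verify that a parsed response actually matches this schema. A minimal sketch; the `validate_biomarker_output` helper is hypothetical, not part of the model:

```python
EXPECTED_KEYS = ("Biomarkers_abbreviations", "Biomarkers_full_name", "All_biomarkers")

def validate_biomarker_output(obj):
    """Return True if obj matches the documented output schema:
    a dict with exactly the three expected keys, each a list of strings."""
    if not isinstance(obj, dict) or set(obj) != set(EXPECTED_KEYS):
        return False
    return all(
        isinstance(obj[k], list) and all(isinstance(x, str) for x in obj[k])
        for k in EXPECTED_KEYS
    )
```

A response with empty arrays for all three keys is still valid, per the "no biomarkers found" case above.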

Model Lineage

Qwen/Qwen3.5-0.8B
└── v1 (1K biomarker NER samples) → loss 1.645
    └── v2 (1K GPT-120B-labeled samples, bf16) → loss 1.473
        └── v2.1 (1K Nemotron-labeled, JSON output) → loss 0.951  ← THIS MODEL (BEST)

Training Details

| Round | Dataset | Samples | Labeler | Loss |
|---|---|---|---|---|
| 1 (v1) | Biomarker NER (Kaggle) | 1,000 | Ground truth (BIO tags) | 1.645 |
| 2 (v2) | Clinical trial outcomes | 1,000 | GPT-OSS-120B (NVIDIA API) | 1.473 |
| 3 (v2.1) | Kaggle random sample | 1,000 | Nemotron-3-Super-120B (OpenRouter) | 0.951 |

Data Sources

  • Round 1: shubhanshu789/biomarkers-ner-training-data (Kaggle, human-annotated NER)
  • Round 2: Clinical trial outcomes labeled by GPT-OSS-120B via NVIDIA API
  • Round 3: Random Kaggle samples labeled by Nemotron-3-Super-120B via OpenRouter with reasoning; structured JSON labels (abbreviations, full names, all biomarkers)

Hyperparameters

Method: LoRA (bf16, NOT 4-bit, per Unsloth Qwen3.5 guidelines)
LoRA rank: 16, alpha: 16
Learning rate: 1e-4 (cosine scheduler)
Batch size: 8 (gradient accumulation 2, effective 16)
Epochs: 3
Optimizer: adamw_8bit
Sequence length: 2048
Framework: Unsloth + TRL SFTTrainer
Hardware: NVIDIA RTX A6000 (48GB), ~11 minutes per round
Minimum inference: any GPU with 3GB+ VRAM
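For reference, the settings above map onto TRL `SFTConfig`-style keyword arguments roughly as follows. This is a sketch of the hyperparameters, not the published training script; the exact field names assume a recent `trl` version:

```python
# Sketch: the hyperparameters above expressed as TRL SFTConfig-style kwargs.
# The actual training script for this model is not published.
sft_kwargs = dict(
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,   # effective batch size: 8 * 2 = 16
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    num_train_epochs=3,
    optim="adamw_8bit",
    max_seq_length=2048,
    bf16=True,                       # bf16 LoRA, not 4-bit
)

effective_batch = (
    sft_kwargs["per_device_train_batch_size"]
    * sft_kwargs["gradient_accumulation_steps"]
)
```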

Usage

With Unsloth (recommended)

import json
import torch
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Shubh-0789/biomarker-qwen3.5-0.8b-lora-v2.1",
    max_seq_length=2048,
    load_in_4bit=False,  # bf16 LoRA, per the training setup above
    dtype=torch.bfloat16,
)
text_tokenizer = tokenizer  # FastLanguageModel already returns the tokenizer; no need to reload it
FastLanguageModel.for_inference(model)
model.generation_config.pad_token_id = text_tokenizer.pad_token_id

# Extract biomarkers
clinical_text = "The patient's HbA1c was 7.2%, C-Reactive Protein (CRP) levels elevated at 15mg/L, and troponin I was within normal range."

messages = [
    {"role": "user", "content": f"Extract all biomarker names from the following clinical text. Return ONLY a JSON.\nText: {clinical_text}"}
]

inputs = text_tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_tensors="pt", return_dict=True,
).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.1, do_sample=True)

result = text_tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
biomarkers = json.loads(result)
print(json.dumps(biomarkers, indent=2))

Output:

{
  "Biomarkers_abbreviations": ["HbA1c", "CRP"],
  "Biomarkers_full_name": ["Hemoglobin A1c", "C-Reactive Protein", "troponin I"],
  "All_biomarkers": ["HbA1c", "Hemoglobin A1c", "CRP", "C-Reactive Protein", "troponin I"]
}
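The `json.loads(result)` call above assumes the model emits clean JSON. Small instruction-tuned models occasionally wrap the JSON in code fences or add trailing text, so a defensive parse can be useful. A hypothetical helper, not part of the model's API:

```python
import json
import re

EMPTY = {"Biomarkers_abbreviations": [], "Biomarkers_full_name": [], "All_biomarkers": []}

def parse_biomarker_json(raw: str) -> dict:
    """Extract the first {...} span from raw model output and parse it,
    falling back to empty arrays if nothing parseable is found."""
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if not match:
        return dict(EMPTY)
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return dict(EMPTY)
```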

With PEFT/Transformers (no Unsloth)

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-0.8B", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "Shubh-0789/biomarker-qwen3.5-0.8b-lora-v2.1")
tokenizer = AutoTokenizer.from_pretrained("Shubh-0789/biomarker-qwen3.5-0.8b-lora-v2.1")

Examples

| Clinical Text | Biomarkers_abbreviations | Biomarkers_full_name | All_biomarkers |
|---|---|---|---|
| "HbA1c 7.2%, C-Reactive Protein (CRP) elevated, troponin I normal" | HbA1c, CRP | Hemoglobin A1c, C-Reactive Protein, troponin I | HbA1c, Hemoglobin A1c, CRP, C-Reactive Protein, troponin I |
| "VEGF-D Serum Levels" | VEGF-D | (none) | VEGF-D |
| "FEV1. Forced expiratory volume in 1 second" | FEV1 | Forced expiratory volume in 1 second | FEV1, Forced expiratory volume in 1 second |
| "TP53 mutation status in tumor tissue" | TP53 | (none) | TP53 |
| "Placental Growth Factor, PIGF" | PIGF | Placental Growth Factor | PIGF, Placental Growth Factor |
| "Patient-Reported Outcome Measures (PROMs)" | (none) | (none) | (empty, no biomarkers) |

All Models in This Series

| Model | Rounds | Samples | Loss | Output | Link |
|---|---|---|---|---|---|
| v1 | 1 | 1,000 | 1.645 | comma list | v1 |
| v1.1 | 2 | 2,000 | 1.051 | JSON | v1.1 |
| v2 | 2 | 2,000 | 1.473 | comma list | v2 |
| v2.1 | 3 | 3,000 | 0.951 | JSON | This model (Best) |

Limitations

  • Trained on English clinical trial text only
  • May miss biomarkers in non-standard naming not seen during training
  • Best suited for clinical trial outcome measures, lab values, and diagnostic markers

Citation

@misc{biomarker-qwen3.5-0.8b-lora-v2.1,
  author = {Shubh-0789},
  title = {Biomarker Extraction Model v2.1: Best (Qwen3.5-0.8B LoRA, JSON output)},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Shubh-0789/biomarker-qwen3.5-0.8b-lora-v2.1}
}