BioReview SFT β€” Qwen3.5-9B (all_nonfig)

QLoRA fine-tuned model for identifying scientific concerns in biomedical research papers. Trained on the peer-review-benchmark dataset.

F1 = 0.621 · Recall = 0.498 · Precision = 0.827 (test set, dedup+cap20 postprocessing). Precision exceeds GPT-4o-mini (0.827 vs 0.753). For maximum recall, use the 8B+9B union ensemble (F1 = 0.704).


Model Description

This model was trained with supervised fine-tuning (SFT) and QLoRA on the peer-review-benchmark v3 dataset. Given the full text of a biomedical paper, it generates a structured list of scientific concerns a peer reviewer might raise, each annotated with category and severity.

Base model Qwen/Qwen3.5-9B
Model class Qwen3_5ForConditionalGeneration (Vision-Language; text-only fine-tune)
Training method QLoRA β€” 4-bit NF4 quantization, LoRA rank=16, alpha=32
Training data 4,734 articles Β· 5 journal sources (eLife, F1000Research, PLOS, PeerJ, Nature)
Training duration ~35 h on 1Γ— NVIDIA A100 80 GB Β· 3 epochs Β· 1,773 steps
Framework Unsloth + TRL + PEFT
Training code jang1563/BioReview_Training

Performance

Evaluated on the peer-review-benchmark v3 test split (981 articles) using SPECTER2 semantic embeddings + Hungarian algorithm (matching threshold = 0.65).

Test Set

Postprocessing F1 Recall Precision
dedup+cap20 (recommended) 0.621 0.498 0.827
raw (no postprocessing) 0.514 0.496 0.533

Validation Set (838 articles)

Postprocessing F1 Recall Precision
dedup+cap20 0.625 0.501 0.832

No sign of overfitting: validation and test F1 differ by only 0.004.

Comparison

Model F1 Recall Precision Gate
GPT-4o-mini (val-only baseline) 0.696* 0.647* 0.753* βœ“
8B+9B Ensemble (union, dedup+cap20) 0.704 0.695 0.713 βœ“
This model (Qwen3.5-9B, dedup+cap20) 0.621 0.498 0.827 βœ“
Qwen3-8B SFT (dedup+cap20) 0.557 0.409 0.871 βœ—

* GPT-4o-mini baseline evaluated on validation split only; test results pending.

Precision note: Qwen3-8B achieves higher raw precision (0.871) but lower F1 and recall, and fails the evaluation gate. The 9B model offers the best single-model balance of F1, recall, and precision; the 8B+9B union ensemble is recommended for the highest F1.


Output Format

The model returns a JSON array of concern objects, each with three fields:

[
  {
    "text": "The sample size of n=12 per group is insufficient for the claimed statistical power.",
    "category": "statistical_methodology",
    "severity": "major"
  },
  {
    "text": "The authors do not compare their method to the current state-of-the-art baselines.",
    "category": "prior_art_novelty",
    "severity": "minor"
  }
]

Valid categories: design_flaw Β· statistical_methodology Β· missing_experiment Β· prior_art_novelty Β· writing_clarity Β· reagent_method_specificity Β· interpretation Β· other

Valid severities: major Β· minor Β· optional
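Parsed output can be sanity-checked against these vocabularies before downstream use. The helper below is illustrative and not part of the released pipeline; it simply drops any object whose fields fall outside the documented values:

```python
# Illustrative validator (not part of the released pipeline): keep only
# well-formed concern objects with documented category/severity values.
VALID_CATEGORIES = {
    "design_flaw", "statistical_methodology", "missing_experiment",
    "prior_art_novelty", "writing_clarity", "reagent_method_specificity",
    "interpretation", "other",
}
VALID_SEVERITIES = {"major", "minor", "optional"}


def validate_concerns(concerns: list[dict]) -> list[dict]:
    """Filter out malformed concern objects."""
    return [
        c for c in concerns
        if isinstance(c.get("text"), str)
        and c.get("category") in VALID_CATEGORIES
        and c.get("severity") in VALID_SEVERITIES
    ]
```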


How to Use

Note: Qwen3.5-9B is a Vision-Language model (Qwen3_5ForConditionalGeneration). Load via AutoProcessor and use its inner .tokenizer for text decoding. The adapter was saved via Unsloth β€” always pre-load the base model first (as shown below), rather than using AutoPeftModelForCausalLM.from_pretrained directly.

Installation

pip install transformers peft bitsandbytes accelerate torch sentencepiece protobuf

Inference

import json, re, torch
from transformers import AutoProcessor, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

ADAPTER_REPO = "jang1563/bioreview-qwen3.5-9b-sft"
BASE_MODEL   = "Qwen/Qwen3.5-9B"

# --- Load model ---
# Qwen3.5-9B is a VL model: AutoProcessor returns a multimodal processor.
# Extract the inner text tokenizer for encoding/decoding.
processor = AutoProcessor.from_pretrained(BASE_MODEL)
tokenizer = processor.tokenizer   # text-only tokenizer

# 4-bit NF4 quantization, matching the training configuration.
# (Passing load_in_4bit directly to from_pretrained is deprecated.)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # requires bitsandbytes
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=bnb_config,
)
model = PeftModel.from_pretrained(base_model, ADAPTER_REPO)
model.eval()

# --- Prompts (must match training exactly) ---
# Note: the base sentence says "strings" but OUTPUT FORMAT overrides it with "objects".
# This apparent contradiction is intentional β€” it matches the training prompt exactly.
# Changing either part will degrade output quality.
SYSTEM_PROMPT = (
    "You are an expert peer reviewer for biomedical research papers published in "
    "high-impact journals. Identify specific scientific concerns, weaknesses, and "
    "issues. Return only a JSON array of concern strings.\n\n"
    "OUTPUT FORMAT: Return a JSON array of concern objects, nothing else.\n"
    'Each item must be: {"text": string, "category": one of '
    "[design_flaw, statistical_methodology, missing_experiment, "
    "prior_art_novelty, writing_clarity, reagent_method_specificity, "
    'interpretation, other], "severity": one of [major, minor, optional]}.\n'
    "Do NOT return a JSON array of strings."
)

USER_PREFIX = (
    "Review the following biomedical article.\n\n"
    "Return ONLY a JSON array of objects with keys: `text`, `category`, `severity`.\n"
    "Allowed category values: design_flaw, statistical_methodology, "
    "missing_experiment, prior_art_novelty, writing_clarity, "
    "reagent_method_specificity, interpretation, other.\n"
    "Allowed severity values: major, minor, optional.\n\n"
)


def review_paper(paper_text: str) -> list[dict]:
    """Generate peer-review concerns for a biomedical paper.

    Returns a list of concern dicts: {text, category, severity}.
    Apply postprocess_concerns() for best evaluation results.
    """
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",   "content": USER_PREFIX + paper_text},
    ]
    # Use processor.apply_chat_template (handles VL model chat format)
    input_ids = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)

    with torch.no_grad():
        output_ids = model.generate(
            input_ids,
            max_new_tokens=4096,
            temperature=0.1,
            repetition_penalty=1.05,
            do_sample=True,
        )

    # Use inner text tokenizer to decode (not the multimodal processor)
    raw = tokenizer.decode(
        output_ids[0][input_ids.shape[1]:],
        skip_special_tokens=True,
    )

    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fallback: extract the first JSON array found in the output
        m = re.search(r"\[.*\]", raw, re.DOTALL)
        if not m:
            return []
        try:
            return json.loads(m.group())
        except json.JSONDecodeError:
            return []


concerns = review_paper(paper_text)
# [{"text": "...", "category": "design_flaw", "severity": "major"}, ...]

Postprocessing β€” dedup+cap20 (Recommended)

Raw output typically contains 30–60 concerns, many overlapping. Apply dedup+cap20 to remove near-duplicates and cap output at 20 concerns per article. This improves F1 by roughly +0.11 (0.514 → 0.621 on the test set).

pip install sentence-transformers scikit-learn

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity


def postprocess_concerns(
    concerns: list[dict],
    cap: int = 20,
    dedup_threshold: float = 0.95,
) -> list[dict]:
    """Remove near-duplicate concerns and cap at `cap` items.

    Uses SPECTER2 embeddings on the `text` field.
    Requires: sentence-transformers, scikit-learn
    """
    if len(concerns) <= 1:
        return concerns

    texts = [c["text"] for c in concerns]
    # For batch use, construct the SentenceTransformer once outside this function.
    embedder = SentenceTransformer("allenai/specter2_base")
    embeddings = embedder.encode(texts, normalize_embeddings=True)
    sim = cosine_similarity(embeddings)

    keep: list[int] = []
    for i in range(len(concerns)):
        if all(sim[i][j] < dedup_threshold for j in keep):
            keep.append(i)

    return [concerns[i] for i in keep][:cap]


concerns = postprocess_concerns(review_paper(paper_text))

See the BioReview Training repository for the full inference + evaluation pipeline including SPECTER2-based scoring.


Training Details

Data

Property Value
Corpus All non-figure concerns β€” peer-review-benchmark v3 train split
Total articles 4,734
Source breakdown eLife 1,304 Β· F1000 1,933 Β· PLOS 1,255 Β· PeerJ 176 Β· Nature 66
Avg concerns / article 14.1
Concern schema {text, category, severity} objects
Format ShareGPT (system / human / assistant turns)
Input truncation 15,000-token budget Β· section priority: methods > results > intro > …
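The section-priority truncation above can be sketched as follows. This is illustrative only: the section names beyond the documented "methods > results > intro" ordering are assumptions, and a whitespace split stands in for the model tokenizer's token count.

```python
# Sketch of section-priority truncation. Priority order beyond
# methods > results > intro is assumed; len(text.split()) is a rough
# stand-in for the tokenizer-based token count used in training.
SECTION_PRIORITY = ["methods", "results", "introduction", "discussion", "abstract"]


def truncate_sections(sections: dict[str, str], budget: int = 15_000) -> str:
    """Fill the token budget with whole sections in priority order."""
    kept, used = [], 0
    for name in SECTION_PRIORITY:
        text = sections.get(name, "")
        cost = len(text.split())
        if text and used + cost <= budget:
            kept.append(f"## {name}\n{text}")
            used += cost
    return "\n\n".join(kept)
```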

Hyperparameters

Parameter Value
LoRA rank (r) 16
LoRA alpha 32
LoRA dropout 0.05
LoRA target modules q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Quantization 4-bit NF4 (bitsandbytes)
Precision bfloat16
Learning rate 2e-4
LR scheduler cosine
Warmup ratio 0.03
Weight decay 0.01
Epochs 3
Per-device batch size 1
Gradient accumulation 8 (effective batch size = 8)
Max sequence length 16,384
Optimizer AdamW 8-bit
Seed 42

Hardware

GPU 1Γ— NVIDIA A100 80 GB
Cluster Cornell Cayuga HPC
Wall time ~35 hours
Steps 1,773 (3 epochs)

Evaluation Methodology

Scientific concerns are matched semantically, not by exact string:

  1. Encode all concerns (model-generated + human reviewer annotations) with SPECTER2 (allenai/specter2_base)
  2. Compute pairwise cosine similarity matrix
  3. Apply Hungarian algorithm for optimal 1-to-1 matching
  4. Threshold at 0.65 β€” pairs above this are counted as true positives
  5. Compute micro-averaged F1, Recall, Precision across all articles
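The matching steps above can be sketched with SciPy's Hungarian solver, assuming the cosine-similarity matrix (generated concerns as rows, human concerns as columns) has already been computed from SPECTER2 embeddings; per-article counts would then be micro-averaged as in step 5:

```python
# Sketch of steps 3-4: Hungarian 1-to-1 matching on a precomputed
# (generated x human) cosine-similarity matrix. Pairs at or above the
# threshold count as true positives.
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_concerns(sim: np.ndarray, threshold: float = 0.65):
    """Return (precision, recall, F1) for one article."""
    rows, cols = linear_sum_assignment(-sim)  # negate to maximize similarity
    tp = int(np.sum(sim[rows, cols] >= threshold))
    n_gen, n_human = sim.shape
    precision = tp / n_gen if n_gen else 0.0
    recall = tp / n_human if n_human else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```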

Postprocessing (dedup+cap20):

  • Remove near-duplicate model outputs (cosine similarity > 0.95 β†’ keep first)
  • Cap at 20 concerns per article

Important: SPECTER2 model weights must be downloaded for valid evaluation. Without them, the pipeline silently falls back to Jaccard similarity and produces scores near 0.03 instead of 0.62.


Limitations

  • Recall gap: Captures ~50% of human reviewer concerns (vs ~65% for GPT-4o-mini)
  • No figure analysis: Explicitly trained to skip figure/image-related concerns
  • Source bias: Performance is weakest on Nature articles (n=66 in training)
  • Context truncation: Papers exceeding 15K tokens are truncated; later sections may be missed
  • Parse failures: ~1–2% of articles produce malformed JSON; fallback regex parsing is recommended
  • Best as ensemble: Pair with the Qwen3-8B SFT model (union) to reach F1 = 0.704
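A minimal sketch of the union ensemble: concatenate both models' concern lists, then re-apply postprocessing. The exact-text fallback dedup below is a simplification for illustration; the reported ensemble numbers use the SPECTER2-based dedup+cap20 from the Postprocessing section.

```python
# Sketch of the 8B+9B union ensemble. Pass postprocess_concerns (from the
# Postprocessing section) as `postprocess` to reproduce dedup+cap20; the
# exact-text dedup used when none is supplied is a simplification.
def union_ensemble(concerns_9b: list[dict], concerns_8b: list[dict],
                   postprocess=None, cap: int = 20) -> list[dict]:
    """Union the two models' concern lists, then dedup and cap."""
    merged = concerns_9b + concerns_8b
    if postprocess is not None:
        return postprocess(merged)
    seen, out = set(), []
    for c in merged:
        key = c["text"].strip().lower()
        if key not in seen:
            seen.add(key)
            out.append(c)
    return out[:cap]
```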

Citation

@software{bioreview_training_2026,
  title  = {BioReview Training: QLoRA SFT Pipeline for Biomedical Peer-Review LLMs},
  author = {Kim, JangKeun},
  year   = {2026},
  url    = {https://github.com/jang1563/BioReview_Training}
}

Developed by JangKeun Kim (jak4013@med.cornell.edu), Weill Cornell Medicine
