
Onca 1.0 9B

Model Summary

Onca 1.0 is an open 9B language model for pancreatic cancer clinical tasks. It is designed for four PDAC-relevant task families:

  • clinical trial screening
  • case-specific clinical reasoning
  • structured pathology report extraction
  • molecular variant evidence reasoning

This release is the main FP16/BF16-compatible checkpoint intended as the reference Hugging Face release for the Onca 1.0 model family.

Base Model

Onca 1.0 is fine-tuned from Jackrong/Qwopus3.5-9B-v3, a Qwen3.5-derived 9B dense reasoning model. The released checkpoint reflects task-focused supervised fine-tuning for pancreatic cancer workflows while preserving the underlying Qwen3.5-class architecture and tokenizer setup.

Training Scope

The model was trained on 37,364 prepared rows from openly available sources. The multitask mixture covers:

  • trial eligibility screening
  • oncology clinical reasoning
  • CAP-aligned pathology abstraction
  • CIViC-style variant interpretation

The project was built around an open-data, open-weight, single-workstation pipeline so the workflow can be audited and reproduced without private institutional corpora.

Intended Use

Onca 1.0 is intended for:

  • research on oncology-focused language models
  • benchmarking PDAC-oriented clinical NLP workflows
  • prototyping structured extraction and screening pipelines
  • local experimentation in privacy-sensitive environments

Out-of-Scope Use

Onca 1.0 is not intended for:

  • direct clinical care
  • autonomous treatment recommendations
  • unsupervised patient-facing use
  • deployment as a validated medical device or diagnostic system

This is a research model and does not replace clinician judgment.

Evaluation Summary

In the companion manuscript, Onca 1.0 was evaluated across 11 panels against Woollie-7B, CancerLLM-7B, OpenBioLLM-8B, and the unfine-tuned Qwopus base. Headline results reported in the draft include:

  • Trial Screening: 81.6 F1
  • Clinical Reasoning: 14.1 composite
  • Pathology Extraction: 30.5 field exact-match
  • PubMedQA Cancer: 68.3 macro-F1
  • PubMedQA: 66.5 macro-F1

The strongest gains appear in workflow-proximal tasks such as trial review and pathology structuring. Variant evidence reasoning remains more difficult than the other task groups.
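This card does not define the metrics listed above. As a rough illustration only, the sketch below shows how macro-F1 and a per-field exact-match score are commonly computed. The function names and the lowercase/strip normalization are assumptions made for this example; the manuscript's scoring may differ in normalization, field weighting, or label handling.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p was wrong
            fn[g] += 1  # gold label g was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


def field_exact_match(gold_record, pred_record):
    """Fraction of gold fields whose predicted value matches after light normalization."""
    if not gold_record:
        return 0.0

    def norm(value):
        # Assumed normalization: stringify, trim whitespace, lowercase.
        return str(value).strip().lower()

    hits = sum(
        1
        for key, value in gold_record.items()
        if key in pred_record and norm(pred_record[key]) == norm(value)
    )
    return hits / len(gold_record)
```

Macro averaging gives every class the same weight regardless of how often it occurs, so a rare label can move the score noticeably on a small panel.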

Limitations

  • The model is specialized for pancreatic cancer and oncology-adjacent workflows rather than general medicine.
  • Training data come from openly available sources rather than private institutional notes, which improves reproducibility but does not fully capture real-world documentation style.
  • Benchmark sample sizes for several panels are deliberately limited, so results on those panels should be interpreted with care.
  • Performance is uneven across task families and does not imply broad medical competence.
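To make the sample-size caveat concrete, the sketch below computes a Wilson score interval for an accuracy-like proportion. The panel sizes used in the loop (50 and 500) are hypothetical and are not taken from the manuscript. The interval applies to simple proportions such as exact-match rates. It is not directly valid for F1, which usually calls for bootstrap resampling instead.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Hypothetical panel sizes, same observed rate of 80% correct.
    for n in (50, 500):
        low, high = wilson_interval(round(0.8 * n), n)
        print(f"n={n}: 95% interval = ({low:.3f}, {high:.3f})")
```

At the same observed rate, the smaller panel produces a much wider interval, which is why small differences between models on limited panels may not be meaningful.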

Usage

This repository contains the unquantized reference checkpoint (BF16 safetensors shards). A standard transformers loading pattern is:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Joesh1/onca-1.0-9B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

Inference formatting should follow the included tokenizer and chat template files in this repository.

Quick Chat Helper

def run_onca(prompt, system_prompt="You are Onca 1.0, a pancreatic-cancer clinical research assistant."):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt},
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
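        # Greedy decoding: with do_sample=False, sampling parameters such as
        # temperature are ignored (and trigger a warning), so none are passed.
        outputs = model.generate(
            **inputs,
            max_new_tokens=512,
            do_sample=False,
        )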
    completion = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(completion, skip_special_tokens=True)

Example 1: Trial Screening

prompt = """
Task: Trial eligibility screening for pancreatic cancer.

Patient summary:
- 63-year-old with metastatic PDAC
- Liver metastases present
- ECOG 1
- Prior gemcitabine plus nab-paclitaxel
- Total bilirubin 0.9 mg/dL
- ANC 2.4
- Platelets 188
- No active infection
- No brain metastases

Trial criteria:
- Histologically confirmed metastatic pancreatic adenocarcinoma
- ECOG 0-1
- Progression after 1 prior systemic regimen
- Adequate marrow and hepatic function
- Exclude uncontrolled infection or CNS metastases

Return:
1. Eligibility label: eligible / ineligible / unclear
2. Criterion-by-criterion reasoning
3. Missing information, if any
"""

print(run_onca(prompt))
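For pipeline use, the free-text answer usually has to be reduced to a single label. The helper below is a minimal sketch that assumes the model echoes the phrase "Eligibility label" requested in the prompt. That output format is not guaranteed, and `parse_eligibility_label` is a hypothetical name introduced for this example.

```python
import re

# "ineligible" is listed first so it is not shadowed by the shorter "eligible".
_LABEL_PATTERN = re.compile(
    r"eligibility label\s*[:\-]?\s*\**\s*(ineligible|eligible|unclear)\b",
    flags=re.IGNORECASE,
)


def parse_eligibility_label(text):
    """Return 'eligible', 'ineligible', or 'unclear', or None if no label is found."""
    match = _LABEL_PATTERN.search(text)
    return match.group(1).lower() if match else None


if __name__ == "__main__":
    # Stand-in for the string returned by run_onca(prompt).
    sample = "1. Eligibility label: Eligible\n2. Criterion-by-criterion reasoning: ..."
    print(parse_eligibility_label(sample))
```

A return value of `None` is worth routing to manual review rather than treating as "unclear", since it means the expected format was not followed.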

Example 2: Clinical Reasoning

prompt = """
Task: Pancreatic cancer clinical reasoning.

Case:
A 58-year-old patient has borderline resectable PDAC in the pancreatic head.
CA19-9 is elevated. ECOG is 0. Germline testing is pending. No distant metastases
are seen on imaging.

Please provide:
1. A concise assessment
2. A high-level management plan
3. Key factors that could change the plan
4. Important limitations or uncertainties

Do not present this as medical advice. Keep it research-oriented.
"""

print(run_onca(prompt))

Example 3: Pathology Extraction

prompt = """
Task: Structured pathology extraction.

Extract the report into JSON with the following fields:
specimen_type, primary_site, histology, tumor_grade, tumor_size_cm,
margin_status, lymphovascular_invasion, perineural_invasion,
lymph_nodes_examined, lymph_nodes_positive, pT, pN, pM,
ajcc_stage, treatment_effect, tumor_focality, additional_findings

Report:
Whipple resection specimen showing moderately differentiated pancreatic ductal
adenocarcinoma, 3.1 cm, centered in the pancreatic head. Tumor extends into
peripancreatic soft tissue. All margins are negative; closest margin is 0.4 cm
at the uncinate margin. Perineural invasion is present. Lymphovascular invasion
is present. Sixteen lymph nodes examined, 3 positive for metastatic carcinoma.
Pathologic stage: pT2 pN1. No distant metastasis identified in specimen.
"""

print(run_onca(prompt))
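Model output may wrap the requested JSON in extra prose or code fences, so downstream code generally needs a tolerant parser. The following sketch pulls out the first parseable JSON object and reports which requested fields are absent. `extract_json_object` and `missing_fields` are hypothetical helpers, and the field list simply mirrors the prompt above.

```python
import json

PATHOLOGY_FIELDS = [
    "specimen_type", "primary_site", "histology", "tumor_grade", "tumor_size_cm",
    "margin_status", "lymphovascular_invasion", "perineural_invasion",
    "lymph_nodes_examined", "lymph_nodes_positive", "pT", "pN", "pM",
    "ajcc_stage", "treatment_effect", "tumor_focality", "additional_findings",
]


def extract_json_object(text):
    """Return the first parseable JSON object embedded in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever trailing text follows it.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)
    return None


def missing_fields(record, expected=PATHOLOGY_FIELDS):
    """List the expected field names that the parsed record does not contain."""
    return [name for name in expected if name not in record]
```

A typical pattern would be `record = extract_json_object(run_onca(prompt))`, followed by a check that `record` is not `None` and that `missing_fields(record)` is empty before the record is stored.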

Example 4: Variant Evidence Interpretation

prompt = """
Task: Variant evidence reasoning for pancreatic cancer.

Variant:
- Gene: BRCA2
- Alteration: pathogenic loss-of-function variant
- Tumor type: pancreatic ductal adenocarcinoma

Return a JSON object with:
- gene
- alteration
- disease
- evidence_summary
- therapeutic_implication
- diagnostic_implication
- prognostic_implication
- evidence_direction
- confidence

Keep the answer concise and note uncertainty when evidence is incomplete.
"""

print(run_onca(prompt))

Prompting Tips

  • Ask for a specific output format such as bullet points or JSON.
  • For extraction tasks, list the exact fields you want returned.
  • For screening tasks, provide both the patient summary and the trial criteria.
  • For reasoning tasks, request uncertainties and missing data explicitly.
  • Treat outputs as research artifacts that require expert review.
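The screening tips above can be folded into a small prompt builder so that every request carries both inputs and the same output specification. This is a sketch only: `build_screening_prompt` is a hypothetical helper, and its wording copies the trial screening example in this card rather than any required format.

```python
def build_screening_prompt(patient_items, criteria_items):
    """Assemble a trial-screening prompt from patient facts and trial criteria."""
    if not patient_items or not criteria_items:
        raise ValueError("both a patient summary and trial criteria are required")
    patient = "\n".join(f"- {item}" for item in patient_items)
    criteria = "\n".join(f"- {item}" for item in criteria_items)
    return (
        "Task: Trial eligibility screening for pancreatic cancer.\n\n"
        f"Patient summary:\n{patient}\n\n"
        f"Trial criteria:\n{criteria}\n\n"
        "Return:\n"
        "1. Eligibility label: eligible / ineligible / unclear\n"
        "2. Criterion-by-criterion reasoning\n"
        "3. Missing information, if any\n"
    )


if __name__ == "__main__":
    print(build_screening_prompt(
        ["63-year-old with metastatic PDAC", "ECOG 1"],
        ["ECOG 0-1", "Exclude CNS metastases"],
    ))
```

Raising an error on an empty input makes it harder to send a request that omits the patient summary or the criteria.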

Files in This Repository

  • model-00001-of-00004.safetensors through model-00004-of-00004.safetensors: sharded model weights
  • model.safetensors.index.json: shard index
  • config.json: model architecture configuration
  • generation_config.json: default generation settings
  • tokenizer.json and tokenizer_config.json: tokenizer files
  • chat_template.jinja: chat formatting template
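After a manual download, one way to confirm that every shard is present is to compare the local directory against the shard index. The sketch below relies on the `weight_map` key that Hugging Face sharded safetensors indexes use to map tensor names to shard files. The helper names and the local directory path are assumptions for illustration.

```python
import json
from pathlib import Path


def shards_in_index(index_path):
    """Return the sorted, de-duplicated shard file names listed in the index."""
    index = json.loads(Path(index_path).read_text(encoding="utf-8"))
    return sorted(set(index["weight_map"].values()))


def missing_shards(model_dir):
    """Return shard files named in the index that are absent from model_dir."""
    model_dir = Path(model_dir)
    expected = shards_in_index(model_dir / "model.safetensors.index.json")
    return [name for name in expected if not (model_dir / name).is_file()]


if __name__ == "__main__":
    # Hypothetical local path; replace with the directory holding the download.
    local_dir = Path("./onca-1.0-9B")
    if local_dir.is_dir():
        print(missing_shards(local_dir) or "all shards present")
```

This checks only that the files exist. It does not verify their contents or checksums.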

Related Variants

Quantized releases are provided separately:

  • JosephKBS/onca-1.0-9B-Int8
  • JosephKBS/onca-1.0-9B-Int4

License

This release is provided under the Apache 2.0 license. Users should also review the license and usage terms of the upstream base model and any referenced datasets or benchmarks.

Citation

If you use Onca 1.0, please cite the accompanying manuscript once it is publicly available. Until then, a temporary reference is:

@misc{shim2026onca,
  title  = {Onca: An Open 9B Language Model for Pancreatic Cancer Clinical Tasks},
  author = {Shim, Kwan Bo},
  year   = {2026},
  note   = {Preprint in preparation}
}

Acknowledgments

This project builds on the work of the Qwen and Qwopus model developers, as well as the many institutions and open-data contributors who created and maintained the public datasets used in training and evaluation.
