---
license: apache-2.0
base_model:
- Jackrong/Qwopus3.5-9B-v3
tags:
- oncology
- pancreatic-cancer
- pdac
- clinical-nlp
- medical-llm
- text-generation
- research
language:
- en
pipeline_tag: text-generation
library_name: transformers
---
# Onca 1.0 9B
## Model Summary
Onca 1.0 is an open 9B-parameter language model for pancreatic cancer clinical tasks. It is designed for four task families relevant to pancreatic ductal adenocarcinoma (PDAC):
- clinical trial screening
- case-specific clinical reasoning
- structured pathology report extraction
- molecular variant evidence reasoning
This repository holds the main FP16/BF16-compatible checkpoint and serves as the reference Hugging Face release for the Onca 1.0 model family.
## Base Model
Onca 1.0 is fine-tuned from `Jackrong/Qwopus3.5-9B-v3`, a Qwen3.5-derived 9B dense reasoning model. The released checkpoint reflects task-focused supervised fine-tuning for pancreatic cancer workflows while preserving the underlying Qwen3.5-class architecture and tokenizer setup.
## Training Scope
The model was trained on 37,364 prepared rows from openly available sources. The multitask mixture covers:
- trial eligibility screening
- oncology clinical reasoning
- CAP-aligned pathology abstraction
- CIViC-style variant interpretation
The project was built around an open-data, open-weight, single-workstation pipeline so the workflow can be audited and reproduced without private institutional corpora.
## Intended Use
Onca 1.0 is intended for:
- research on oncology-focused language models
- benchmarking PDAC-oriented clinical NLP workflows
- prototyping structured extraction and screening pipelines
- local experimentation in privacy-sensitive environments
## Out-of-Scope Use
Onca 1.0 is not intended for:
- direct clinical care
- autonomous treatment recommendations
- unsupervised patient-facing use
- deployment as a validated medical device or diagnostic system
This is a research model and does not replace clinician judgment.
## Evaluation Summary
In the companion manuscript, Onca 1.0 was evaluated across 11 panels against Woollie-7B, CancerLLM-7B, OpenBioLLM-8B, and the unfine-tuned Qwopus base. Headline results reported in the draft include:
- Trial Screening: 81.6 F1
- Clinical Reasoning: 14.1 composite
- Pathology Extraction: 30.5 field exact-match
- PubMedQA Cancer: 68.3 macro-F1
- PubMedQA: 66.5 macro-F1
The strongest gains appear in workflow-proximal tasks such as trial review and pathology structuring. Variant evidence reasoning remains more difficult than the other task groups.
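For readers unfamiliar with field-level scoring, the sketch below shows one plausible way a "field exact-match" percentage could be computed for a single extracted record. The manuscript's actual scoring rules are not reproduced in this card, so the normalization (lowercasing, whitespace collapsing) and the function name `field_exact_match` are assumptions for illustration only.

```python
def field_exact_match(predicted, reference):
    """Percent of reference fields whose predicted value matches exactly.

    Illustrative only: normalization here is an assumption, not the
    manuscript's scoring rule.
    """
    def norm(value):
        # Lowercase and collapse whitespace before comparing.
        return " ".join(str(value).strip().lower().split())

    if not reference:
        return 0.0
    hits = sum(
        1
        for name, gold in reference.items()
        if name in predicted and norm(predicted[name]) == norm(gold)
    )
    return 100.0 * hits / len(reference)
```

Under this definition a missing field counts as a miss, so the score penalizes both wrong values and omitted fields.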
## Limitations
- The model is specialized for pancreatic cancer and oncology-adjacent workflows rather than general medicine.
- Training data come from openly available sources rather than private institutional notes, which improves reproducibility but does not fully capture real-world documentation style.
- Benchmark sample sizes for several panels are deliberately limited, so results on those panels should be interpreted with care.
- Performance is uneven across task families and does not imply broad medical competence.
## Usage
This repository contains the main unquantized, FP16/BF16-compatible checkpoint files. A standard `transformers` loading pattern is:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Joesh1/onca-1.0-9B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)
```
Format inference prompts with the tokenizer and chat template files included in this repository.
### Quick Chat Helper
```python
def run_onca(prompt, system_prompt="You are Onca 1.0, a pancreatic-cancer clinical research assistant."):
    """Run one chat turn and return only the newly generated text."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt},
    ]
    # Render the conversation with the repository's chat template.
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # Greedy decoding; temperature is omitted because it has no effect
        # when do_sample=False.
        outputs = model.generate(
            **inputs,
            max_new_tokens=512,
            do_sample=False,
        )
    # Drop the prompt tokens so only the completion is decoded.
    completion = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(completion, skip_special_tokens=True)
```
### Example 1: Trial Screening
```python
prompt = """
Task: Trial eligibility screening for pancreatic cancer.
Patient summary:
- 63-year-old with metastatic PDAC
- Liver metastases present
- ECOG 1
- Prior gemcitabine plus nab-paclitaxel
- Total bilirubin 0.9 mg/dL
- ANC 2.4 x 10^9/L
- Platelets 188 x 10^9/L
- No active infection
- No brain metastases
Trial criteria:
- Histologically confirmed metastatic pancreatic adenocarcinoma
- ECOG 0-1
- Progression after 1 prior systemic regimen
- Adequate marrow and hepatic function
- Exclude uncontrolled infection or CNS metastases
Return:
1. Eligibility label: eligible / ineligible / unclear
2. Criterion-by-criterion reasoning
3. Missing information, if any
"""
print(run_onca(prompt))
```
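If the screening output feeds a downstream pipeline, the eligibility label usually has to be pulled out of the free-text answer. The sketch below is one way to do that. `parse_eligibility_label` is a hypothetical helper, not part of this repository, and it assumes the model echoes the label wording requested in the prompt (`eligible`, `ineligible`, or `unclear`).

```python
import re

# Hypothetical helper (not shipped with Onca). Assumes the model repeats
# one of the label words requested in the prompt.
_LABEL_RE = re.compile(
    r"\b(not\s+eligible|ineligible|eligible|unclear)\b", re.IGNORECASE
)


def parse_eligibility_label(output_text):
    """Return 'eligible', 'ineligible', 'unclear', or None if no label is found."""
    match = _LABEL_RE.search(output_text)
    if match is None:
        return None
    label = " ".join(match.group(1).lower().split())
    # Treat the phrase "not eligible" as the 'ineligible' label.
    return "ineligible" if label == "not eligible" else label
```

The helper returns the first label it finds, so outputs that discuss several labels before committing to one still need manual review.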
### Example 2: Clinical Reasoning
```python
prompt = """
Task: Pancreatic cancer clinical reasoning.
Case:
A 58-year-old patient has borderline resectable PDAC in the pancreatic head.
CA19-9 is elevated. ECOG is 0. Germline testing is pending. No distant metastases
are seen on imaging.
Please provide:
1. A concise assessment
2. A high-level management plan
3. Key factors that could change the plan
4. Important limitations or uncertainties
Do not present this as medical advice. Keep it research-oriented.
"""
print(run_onca(prompt))
```
### Example 3: Pathology Extraction
```python
prompt = """
Task: Structured pathology extraction.
Extract the report into JSON with the following fields:
specimen_type, primary_site, histology, tumor_grade, tumor_size_cm,
margin_status, lymphovascular_invasion, perineural_invasion,
lymph_nodes_examined, lymph_nodes_positive, pT, pN, pM,
ajcc_stage, treatment_effect, tumor_focality, additional_findings
Report:
Whipple resection specimen showing moderately differentiated pancreatic ductal
adenocarcinoma, 3.1 cm, centered in the pancreatic head. Tumor extends into
peripancreatic soft tissue. All margins are negative; closest margin is 0.4 cm
at the uncinate margin. Perineural invasion is present. Lymphovascular invasion
is present. Sixteen lymph nodes examined, 3 positive for metastatic carcinoma.
Pathologic stage: pT2 pN1. No distant metastasis identified in specimen.
"""
print(run_onca(prompt))
```
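Model output for extraction tasks may wrap the JSON in extra text, so it is worth parsing defensively and checking which requested fields came back. The following is a minimal sketch under that assumption. `parse_pathology_json` is a hypothetical helper, and the field list is copied from the prompt above.

```python
import json

# Field names copied from the extraction prompt above.
PATHOLOGY_FIELDS = [
    "specimen_type", "primary_site", "histology", "tumor_grade", "tumor_size_cm",
    "margin_status", "lymphovascular_invasion", "perineural_invasion",
    "lymph_nodes_examined", "lymph_nodes_positive", "pT", "pN", "pM",
    "ajcc_stage", "treatment_effect", "tumor_focality", "additional_findings",
]


def parse_pathology_json(output_text, fields=PATHOLOGY_FIELDS):
    """Hypothetical helper: return (record, missing_fields) from raw model output."""
    start = output_text.find("{")
    if start == -1:
        raise ValueError("no JSON object found in model output")
    # raw_decode parses one JSON value starting at `start` and ignores
    # any text that follows it.
    record, _ = json.JSONDecoder().raw_decode(output_text, start)
    missing = [name for name in fields if name not in record]
    return record, missing
```

A typical call would be `record, missing = parse_pathology_json(run_onca(prompt))`. A non-empty `missing` list is a signal to re-prompt or route the report to manual review.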
### Example 4: Variant Evidence Interpretation
```python
prompt = """
Task: Variant evidence reasoning for pancreatic cancer.
Variant:
- Gene: BRCA2
- Alteration: pathogenic loss-of-function variant
- Tumor type: pancreatic ductal adenocarcinoma
Return a JSON object with:
- gene
- alteration
- disease
- evidence_summary
- therapeutic_implication
- diagnostic_implication
- prognostic_implication
- evidence_direction
- confidence
Keep the answer concise and note uncertainty when evidence is incomplete.
"""
print(run_onca(prompt))
```
### Prompting Tips
- Ask for a specific output format such as bullet points or JSON.
- For extraction tasks, list the exact fields you want returned.
- For screening tasks, provide both the patient summary and the trial criteria.
- For reasoning tasks, request uncertainties and missing data explicitly.
- Treat outputs as research artifacts that require expert review.
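As an illustration of the extraction tip above, the sketch below assembles a prompt that names every requested field explicitly. `build_extraction_prompt` is a hypothetical helper, and the instruction to use `null` for unstated fields is an assumption added here for illustration, not a documented requirement of the model.

```python
def build_extraction_prompt(report_text, fields):
    """Hypothetical helper: assemble an extraction prompt that names every field."""
    field_list = ", ".join(fields)
    return (
        "Task: Structured pathology extraction.\n"
        "Extract the report into JSON with the following fields:\n"
        f"{field_list}\n"
        # Assumption: asking for null keeps absent fields explicit.
        "Use null for any field the report does not state.\n"
        "Report:\n"
        f"{report_text.strip()}\n"
    )
```

The resulting string can be passed directly to `run_onca`.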
## Files in This Repository
- `model-00001-of-00004.safetensors` through `model-00004-of-00004.safetensors`: sharded model weights
- `model.safetensors.index.json`: shard index
- `config.json`: model architecture configuration
- `generation_config.json`: default generation settings
- `tokenizer.json` and `tokenizer_config.json`: tokenizer files
- `chat_template.jinja`: chat formatting template
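After a manual download, it can be useful to confirm that every shard named in the index is present. The sketch below reads the `weight_map` entry of `model.safetensors.index.json` and reports any shard file that is absent. The helper names are hypothetical, and `repo_dir` is assumed to be a local copy of this repository.

```python
import json
from pathlib import Path


def shards_in_index(index_path):
    """Return the sorted shard filenames referenced by a safetensors index file."""
    index = json.loads(Path(index_path).read_text())
    # weight_map maps each tensor name to the shard file that stores it.
    return sorted(set(index["weight_map"].values()))


def missing_shards(repo_dir):
    """List shard files named in the index that are absent from repo_dir."""
    repo_dir = Path(repo_dir)
    expected = shards_in_index(repo_dir / "model.safetensors.index.json")
    return [name for name in expected if not (repo_dir / name).exists()]
```

An empty list from `missing_shards` means every shard referenced by the index was found. It does not verify file integrity.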
## Related Variants
Quantized releases are provided separately:
- `JosephKBS/onca-1.0-9B-Int8`
- `JosephKBS/onca-1.0-9B-Int4`
## License
This release is provided under the Apache 2.0 license. Users should also review the license and usage terms of the upstream base model and any referenced datasets or benchmarks.
## Citation
If you use Onca 1.0, please cite the accompanying manuscript once it is publicly available. Until then, a temporary reference is:
```bibtex
@misc{shim2026onca,
title = {Onca: An Open 9B Language Model for Pancreatic Cancer Clinical Tasks},
author = {Shim, Kwan Bo},
year = {2026},
note = {Preprint in preparation}
}
```
## Acknowledgments
This project builds on the work of the Qwen and Qwopus model developers, as well as the many institutions and open-data contributors who created and maintained the public datasets used in training and evaluation.