---
license: apache-2.0
base_model:
- Jackrong/Qwopus3.5-9B-v3
tags:
- oncology
- pancreatic-cancer
- pdac
- clinical-nlp
- medical-llm
- text-generation
- research
language:
- en
pipeline_tag: text-generation
library_name: transformers
---

<p align="center">
  <img src="./assets/onca-logo-horizontal.svg" alt="Onca logo" width="520">
</p>

# Onca 1.0 9B

## Model Summary

Onca 1.0 is an open 9B language model for pancreatic cancer clinical tasks. It is designed for four PDAC-relevant task families:

- clinical trial screening
- case-specific clinical reasoning
- structured pathology report extraction
- molecular variant evidence reasoning

This release is the main FP16/BF16-compatible checkpoint and serves as the reference Hugging Face release for the Onca 1.0 model family.

## Base Model

Onca 1.0 is fine-tuned from `Jackrong/Qwopus3.5-9B-v3`, a Qwen3.5-derived 9B dense reasoning model. The released checkpoint reflects task-focused supervised fine-tuning for pancreatic cancer workflows while preserving the underlying Qwen3.5-class architecture and tokenizer setup.

## Training Scope

The model was trained on 37,364 prepared rows from openly available sources. The multitask mixture covers:

- trial eligibility screening
- oncology clinical reasoning
- CAP-aligned pathology abstraction
- CIViC-style variant interpretation

The project was built around an open-data, open-weight, single-workstation pipeline so the workflow can be audited and reproduced without private institutional corpora.

## Intended Use

Onca 1.0 is intended for:

- research on oncology-focused language models
- benchmarking PDAC-oriented clinical NLP workflows
- prototyping structured extraction and screening pipelines
- local experimentation in privacy-sensitive environments

## Out-of-Scope Use

Onca 1.0 is not intended for:

- direct clinical care
- autonomous treatment recommendations
- unsupervised patient-facing use
- deployment as a validated medical device or diagnostic system

This is a research model and does not replace clinician judgment.

## Evaluation Summary

In the companion manuscript, Onca 1.0 was evaluated across 11 panels against Woollie-7B, CancerLLM-7B, OpenBioLLM-8B, and the Qwopus base model without fine-tuning. Headline results reported in the draft include:

- Trial Screening: 81.6 F1
- Clinical Reasoning: 14.1 composite
- Pathology Extraction: 30.5 field exact-match
- PubMedQA Cancer: 68.3 macro-F1
- PubMedQA: 66.5 macro-F1

The strongest gains appear in workflow-proximal tasks such as trial review and pathology structuring. Variant evidence reasoning remains more difficult than the other task groups.
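
For readers unfamiliar with the pathology metric, the sketch below shows one plausible way to compute a field exact-match score. The scoring rules used in the manuscript are not described in this card, so the normalization (string cast, trim, lowercase) and the function name `field_exact_match` are assumptions for illustration only.

```python
def field_exact_match(predicted: dict, reference: dict) -> float:
    """Fraction of reference fields whose predicted value matches exactly.

    Assumption: values are compared as trimmed, lowercased strings.
    The manuscript's actual normalization may differ.
    """
    def norm(value):
        return "" if value is None else str(value).strip().lower()

    if not reference:
        return 0.0
    hits = sum(
        1 for key, gold in reference.items()
        if norm(predicted.get(key)) == norm(gold)
    )
    return hits / len(reference)


# Hand-written toy records, not model output.
reference = {"pT": "pT2", "pN": "pN1", "margin_status": "negative", "tumor_size_cm": 3.1}
predicted = {"pT": "pT2", "pN": "pN0", "margin_status": "Negative", "tumor_size_cm": "3.1"}

score = field_exact_match(predicted, reference)
print(f"{score:.2f}")
```

Under this definition a missing field counts as a miss, which keeps the score comparable across outputs that omit different fields.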

## Limitations

- The model is specialized for pancreatic cancer and oncology-adjacent workflows rather than general medicine.
- Training data come from openly available sources rather than private institutional notes, which improves reproducibility but does not fully capture real-world documentation style.
- Benchmark sample sizes for several panels are deliberately limited, so results on those panels should be interpreted with care.
- Performance is uneven across task families and does not imply broad medical competence.

## Usage

This repository contains the main full-precision checkpoint files. A standard `transformers` loading pattern is:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Joesh1/onca-1.0-9B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)
```

Format inference prompts with the tokenizer and chat template files included in this repository.

### Quick Chat Helper

```python
def run_onca(prompt, system_prompt="You are Onca 1.0, a pancreatic-cancer clinical research assistant."):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt},
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=512,
            do_sample=False,  # greedy decoding; temperature has no effect when sampling is off
        )
    completion = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(completion, skip_special_tokens=True)
```

### Example 1: Trial Screening

```python
prompt = """
Task: Trial eligibility screening for pancreatic cancer.

Patient summary:
- 63-year-old with metastatic PDAC
- Liver metastases present
- ECOG 1
- Prior gemcitabine plus nab-paclitaxel
- Total bilirubin 0.9 mg/dL
- ANC 2.4
- Platelets 188
- No active infection
- No brain metastases

Trial criteria:
- Histologically confirmed metastatic pancreatic adenocarcinoma
- ECOG 0-1
- Progression after 1 prior systemic regimen
- Adequate marrow and hepatic function
- Exclude uncontrolled infection or CNS metastases

Return:
1. Eligibility label: eligible / ineligible / unclear
2. Criterion-by-criterion reasoning
3. Missing information, if any
"""

print(run_onca(prompt))
```
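
When screening outputs feed a downstream pipeline, the free-text response usually has to be reduced to one of the three labels requested in the prompt. The helper below is a minimal sketch of that step. The name `parse_eligibility_label` and the choice to take the first label mentioned are assumptions, not part of the released code.

```python
import re

# "ineligible" is listed first for readability; the word boundaries already
# prevent "eligible" from matching inside "ineligible".
LABEL_PATTERN = re.compile(r"\b(ineligible|eligible|unclear)\b", re.IGNORECASE)


def parse_eligibility_label(output: str):
    """Return the first eligibility label found in the text, or None."""
    match = LABEL_PATTERN.search(output)
    return match.group(1).lower() if match else None


# Hand-written sample text, not real model output.
sample = "1. Eligibility label: Ineligible\n2. Criterion-by-criterion reasoning: ..."
print(parse_eligibility_label(sample))
```

Taking the first match is a simplification: if the model discusses several labels before committing to one, a stricter pattern anchored on the "Eligibility label:" line may be more reliable.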

### Example 2: Clinical Reasoning

```python
prompt = """
Task: Pancreatic cancer clinical reasoning.

Case:
A 58-year-old patient has borderline resectable PDAC in the pancreatic head.
CA19-9 is elevated. ECOG is 0. Germline testing is pending. No distant metastases
are seen on imaging.

Please provide:
1. A concise assessment
2. A high-level management plan
3. Key factors that could change the plan
4. Important limitations or uncertainties

Do not present this as medical advice. Keep it research-oriented.
"""

print(run_onca(prompt))
```

### Example 3: Pathology Extraction

```python
prompt = """
Task: Structured pathology extraction.

Extract the report into JSON with the following fields:
specimen_type, primary_site, histology, tumor_grade, tumor_size_cm,
margin_status, lymphovascular_invasion, perineural_invasion,
lymph_nodes_examined, lymph_nodes_positive, pT, pN, pM,
ajcc_stage, treatment_effect, tumor_focality, additional_findings

Report:
Whipple resection specimen showing moderately differentiated pancreatic ductal
adenocarcinoma, 3.1 cm, centered in the pancreatic head. Tumor extends into
peripancreatic soft tissue. All margins are negative; closest margin is 0.4 cm
at the uncinate margin. Perineural invasion is present. Lymphovascular invasion
is present. Sixteen lymph nodes examined, 3 positive for metastatic carcinoma.
Pathologic stage: pT2 pN1. No distant metastasis identified in specimen.
"""

print(run_onca(prompt))
```
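
Model responses may wrap the requested JSON in extra prose, so it can help to pull out the first parseable object before using it. The following is a minimal sketch using only the standard library. `extract_json_object` is a hypothetical helper name, and the sample string is hand-written rather than actual model output.

```python
import json


def extract_json_object(text: str):
    """Return the first parseable JSON object found in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at the given index
            obj, _ = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass
        start = text.find("{", start + 1)
    return None


sample = 'Extracted fields: {"pT": "pT2", "pN": "pN1", "lymph_nodes_positive": 3} End.'
record = extract_json_object(sample)
print(record)
```

Returning `None` instead of raising lets a batch pipeline log unparseable responses and continue.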

### Example 4: Variant Evidence Interpretation

```python
prompt = """
Task: Variant evidence reasoning for pancreatic cancer.

Variant:
- Gene: BRCA2
- Alteration: pathogenic loss-of-function variant
- Tumor type: pancreatic ductal adenocarcinoma

Return a JSON object with:
- gene
- alteration
- disease
- evidence_summary
- therapeutic_implication
- diagnostic_implication
- prognostic_implication
- evidence_direction
- confidence

Keep the answer concise and note uncertainty when evidence is incomplete.
"""

print(run_onca(prompt))
```
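
Because the prompt above asks for a fixed set of keys, a quick completeness check can flag responses that drop a field. This sketch assumes the response is already a bare JSON string. The helper name `missing_variant_fields` is hypothetical, and the sample record uses placeholder values rather than real evidence statements.

```python
import json

# Keys requested in the variant evidence prompt.
REQUIRED_KEYS = [
    "gene", "alteration", "disease", "evidence_summary",
    "therapeutic_implication", "diagnostic_implication",
    "prognostic_implication", "evidence_direction", "confidence",
]


def missing_variant_fields(raw_output: str) -> list:
    """List requested keys that are absent, empty, or null in the JSON output."""
    record = json.loads(raw_output)
    return [
        key for key in REQUIRED_KEYS
        if key not in record or record[key] in ("", None)
    ]


# Hand-written placeholder record, not model output.
sample = json.dumps({
    "gene": "BRCA2",
    "alteration": "pathogenic loss-of-function variant",
    "disease": "pancreatic ductal adenocarcinoma",
    "evidence_summary": "placeholder",
    "therapeutic_implication": "placeholder",
    "confidence": "",
})
print(missing_variant_fields(sample))
```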

### Prompting Tips

- Ask for a specific output format such as bullet points or JSON.
- For extraction tasks, list the exact fields you want returned.
- For screening tasks, provide both the patient summary and the trial criteria.
- For reasoning tasks, request uncertainties and missing data explicitly.
- Treat outputs as research artifacts that require expert review.
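
The tips above can be folded into a small prompt builder so that screening prompts keep the same layout across runs. This is a sketch only: `build_screening_prompt` is a hypothetical helper, and the layout simply mirrors Example 1 rather than any format the model is known to require.

```python
def build_screening_prompt(patient_facts, trial_criteria):
    """Assemble a trial screening prompt with both inputs and an explicit output format."""
    patient_block = "\n".join(f"- {fact}" for fact in patient_facts)
    criteria_block = "\n".join(f"- {criterion}" for criterion in trial_criteria)
    return (
        "Task: Trial eligibility screening for pancreatic cancer.\n\n"
        f"Patient summary:\n{patient_block}\n\n"
        f"Trial criteria:\n{criteria_block}\n\n"
        "Return:\n"
        "1. Eligibility label: eligible / ineligible / unclear\n"
        "2. Criterion-by-criterion reasoning\n"
        "3. Missing information, if any\n"
    )


prompt = build_screening_prompt(
    ["63-year-old with metastatic PDAC", "ECOG 1"],
    ["ECOG 0-1", "Exclude uncontrolled infection or CNS metastases"],
)
print(prompt)
```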

## Files in This Repository

- `model-00001-of-00004.safetensors` through `model-00004-of-00004.safetensors`: sharded model weights
- `model.safetensors.index.json`: shard index
- `config.json`: model architecture configuration
- `generation_config.json`: default generation settings
- `tokenizer.json` and `tokenizer_config.json`: tokenizer files
- `chat_template.jinja`: chat formatting template
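
To see how tensors are distributed across the shards, the shard index can be read directly. The sketch below assumes the usual `safetensors` index layout, with a top-level `weight_map` that maps tensor names to shard files. The tensor names in the toy index are made up for illustration and are not taken from this checkpoint.

```python
from collections import Counter


def tensors_per_shard(index: dict) -> dict:
    """Count how many tensors the index assigns to each shard file."""
    return dict(Counter(index["weight_map"].values()))


# Toy index with hypothetical tensor names. For the real file, load it with
# json.load(open("model.safetensors.index.json")) and pass the result in.
toy_index = {
    "metadata": {"total_size": 0},
    "weight_map": {
        "layer_a.weight": "model-00001-of-00004.safetensors",
        "layer_b.weight": "model-00001-of-00004.safetensors",
        "layer_c.weight": "model-00002-of-00004.safetensors",
    },
}
counts = tensors_per_shard(toy_index)
print(counts)
```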

## Related Variants

Quantized releases are provided separately:

- `JosephKBS/onca-1.0-9B-Int8`
- `JosephKBS/onca-1.0-9B-Int4`

## License

This release is provided under the Apache 2.0 license. Users should also review the license and usage terms of the upstream base model and any referenced datasets or benchmarks.

## Citation

If you use Onca 1.0, please cite the accompanying manuscript once it is publicly available. A temporary reference is:

```bibtex
@misc{shim2026onca,
  title  = {Onca: An Open 9B Language Model for Pancreatic Cancer Clinical Tasks},
  author = {Shim, Kwan Bo},
  year   = {2026},
  note   = {Preprint in preparation}
}
```

## Acknowledgments

This project builds on the work of the Qwen and Qwopus model developers, as well as the many institutions and open-data contributors who created and maintained the public datasets used in training and evaluation.