---
license: apache-2.0
base_model:
- Jackrong/Qwopus3.5-9B-v3
tags:
- oncology
- pancreatic-cancer
- pdac
- clinical-nlp
- medical-llm
- text-generation
- research
language:
- en
pipeline_tag: text-generation
library_name: transformers
---


# Onca 1.0 9B

## Model Summary

Onca 1.0 is an open 9B language model for pancreatic cancer clinical tasks. It is designed for four PDAC-relevant task families:

- clinical trial screening
- case-specific clinical reasoning
- structured pathology report extraction
- molecular variant evidence reasoning

This release is the main FP16/BF16-compatible checkpoint intended as the reference Hugging Face release for the Onca 1.0 model family.

## Base Model

Onca 1.0 is fine-tuned from `Jackrong/Qwopus3.5-9B-v3`, a Qwen3.5-derived 9B dense reasoning model. The released checkpoint reflects task-focused supervised fine-tuning for pancreatic cancer workflows while preserving the underlying Qwen3.5-class architecture and tokenizer setup.

## Training Scope

The model was trained on 37,364 prepared rows from openly available sources. The multitask mixture covers:

- trial eligibility screening
- oncology clinical reasoning
- CAP-aligned pathology abstraction
- CIViC-style variant interpretation

The project was built around an open-data, open-weight, single-workstation pipeline so the workflow can be audited and reproduced without private institutional corpora.

## Intended Use

Onca 1.0 is intended for:

- research on oncology-focused language models
- benchmarking PDAC-oriented clinical NLP workflows
- prototyping structured extraction and screening pipelines
- local experimentation in privacy-sensitive environments

## Out-of-Scope Use

Onca 1.0 is not intended for:

- direct clinical care
- autonomous treatment recommendations
- unsupervised patient-facing use
- deployment as a validated medical device or diagnostic system

This is a research model and does not replace clinician judgment.

## Evaluation Summary

In the companion manuscript, Onca 1.0 was evaluated across 11 panels against Woollie-7B, CancerLLM-7B, OpenBioLLM-8B, and the unfine-tuned Qwopus base.
Headline results reported in the draft include:

- Trial Screening: 81.6 F1
- Clinical Reasoning: 14.1 composite
- Pathology Extraction: 30.5 field exact-match
- PubMedQA Cancer: 68.3 macro-F1
- PubMedQA: 66.5 macro-F1

The strongest gains appear in workflow-proximal tasks such as trial review and pathology structuring. Variant evidence reasoning remains more difficult than the other task groups.

## Limitations

- The model is specialized for pancreatic cancer and oncology-adjacent workflows rather than general medicine.
- Training data come from openly available sources rather than private institutional notes, which improves reproducibility but does not fully capture real-world documentation style.
- Benchmark sample sizes for several panels are deliberately limited and should be interpreted with care.
- Performance is uneven across task families and does not imply broad medical competence.

## Usage

This repository contains the main full-precision checkpoint files. A standard `transformers` loading pattern is:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Joesh1/onca-1.0-9B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)
```

Inference formatting should follow the included tokenizer and chat template files in this repository.
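When sizing hardware for the loading pattern above, the weight footprint can be roughly estimated from the parameter count and the bytes stored per parameter. The sketch below assumes exactly 9 billion parameters (this card only states "9B") and counts weights only, ignoring activations, KV cache, and any quantization metadata. `estimate_weight_gib` is a hypothetical helper and is not part of this repository.

```python
# Bytes per parameter for common weight formats (int4 packs two weights per byte).
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def estimate_weight_gib(num_params: float, dtype: str) -> float:
    """Approximate size of the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 3)


ASSUMED_PARAMS = 9e9  # assumption: nominal "9B"; the true count may differ

for dtype in ("bf16", "int8", "int4"):
    print(f"{dtype}: ~{estimate_weight_gib(ASSUMED_PARAMS, dtype):.1f} GiB")
```

Actual memory use during generation will be higher than these figures, so treat them as a lower bound.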
### Quick Chat Helper

```python
def run_onca(prompt, system_prompt="You are Onca 1.0, a pancreatic-cancer clinical research assistant."):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt},
    ]
    # Render the conversation with the repository's chat template.
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=512,
            # Greedy decoding. `temperature` is omitted because it only
            # applies when do_sample=True.
            do_sample=False,
        )
    # Keep only the newly generated tokens, dropping the prompt.
    completion = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(completion, skip_special_tokens=True)
```

### Example 1: Trial Screening

```python
prompt = """
Task: Trial eligibility screening for pancreatic cancer.

Patient summary:
- 63-year-old with metastatic PDAC
- Liver metastases present
- ECOG 1
- Prior gemcitabine plus nab-paclitaxel
- Total bilirubin 0.9 mg/dL
- ANC 2.4
- Platelets 188
- No active infection
- No brain metastases

Trial criteria:
- Histologically confirmed metastatic pancreatic adenocarcinoma
- ECOG 0-1
- Progression after 1 prior systemic regimen
- Adequate marrow and hepatic function
- Exclude uncontrolled infection or CNS metastases

Return:
1. Eligibility label: eligible / ineligible / unclear
2. Criterion-by-criterion reasoning
3. Missing information, if any
"""

print(run_onca(prompt))
```

### Example 2: Clinical Reasoning

```python
prompt = """
Task: Pancreatic cancer clinical reasoning.

Case:
A 58-year-old patient has borderline resectable PDAC in the pancreatic head.
CA19-9 is elevated. ECOG is 0. Germline testing is pending.
No distant metastases are seen on imaging.

Please provide:
1. A concise assessment
2. A high-level management plan
3. Key factors that could change the plan
4. Important limitations or uncertainties

Do not present this as medical advice. Keep it research-oriented.
"""

print(run_onca(prompt))
```

### Example 3: Pathology Extraction

```python
prompt = """
Task: Structured pathology extraction.

Extract the report into JSON with the following fields:
specimen_type, primary_site, histology, tumor_grade, tumor_size_cm,
margin_status, lymphovascular_invasion, perineural_invasion,
lymph_nodes_examined, lymph_nodes_positive, pT, pN, pM, ajcc_stage,
treatment_effect, tumor_focality, additional_findings

Report:
Whipple resection specimen showing moderately differentiated pancreatic ductal adenocarcinoma, 3.1 cm, centered in the pancreatic head.
Tumor extends into peripancreatic soft tissue.
All margins are negative; closest margin is 0.4 cm at the uncinate margin.
Perineural invasion is present. Lymphovascular invasion is present.
Sixteen lymph nodes examined, 3 positive for metastatic carcinoma.
Pathologic stage: pT2 pN1.
No distant metastasis identified in specimen.
"""

print(run_onca(prompt))
```

### Example 4: Variant Evidence Interpretation

```python
prompt = """
Task: Variant evidence reasoning for pancreatic cancer.

Variant:
- Gene: BRCA2
- Alteration: pathogenic loss-of-function variant
- Tumor type: pancreatic ductal adenocarcinoma

Return a JSON object with:
- gene
- alteration
- disease
- evidence_summary
- therapeutic_implication
- diagnostic_implication
- prognostic_implication
- evidence_direction
- confidence

Keep the answer concise and note uncertainty when evidence is incomplete.
"""

print(run_onca(prompt))
```

### Prompting Tips

- Ask for a specific output format such as bullet points or JSON.
- For extraction tasks, list the exact fields you want returned.
- For screening tasks, provide both the patient summary and the trial criteria.
- For reasoning tasks, request uncertainties and missing data explicitly.
- Treat outputs as research artifacts that require expert review.
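The extraction and variant examples above request JSON, but a completion may surround the object with extra prose. One possible way to recover the first JSON object and check that the requested fields came back is sketched below. `extract_json` and `missing_fields` are hypothetical helpers, not part of this repository, and the sample completion is made up for illustration rather than real model output.

```python
import json


def extract_json(completion: str) -> dict:
    """Return the first JSON object found in the text, or raise ValueError."""
    decoder = json.JSONDecoder()
    start = completion.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # tolerates trailing text after it.
            obj, _ = decoder.raw_decode(completion, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = completion.find("{", start + 1)
    raise ValueError("no JSON object found in completion")


def missing_fields(record: dict, required: list[str]) -> list[str]:
    """List the requested field names that are absent from the record."""
    return [name for name in required if name not in record]


# Made-up completion for illustration only (assumption, not model output).
sample = 'Extraction result: {"histology": "PDAC", "pT": "pT2", "pN": "pN1"} End.'

record = extract_json(sample)
print(missing_fields(record, ["histology", "pT", "pN", "pM"]))  # ['pM']
```

A check like this only confirms that the output is well-formed; the extracted values still need expert review.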
## Files in This Repository

- `model-00001-of-00004.safetensors` through `model-00004-of-00004.safetensors`: sharded model weights
- `model.safetensors.index.json`: shard index
- `config.json`: model architecture configuration
- `generation_config.json`: default generation settings
- `tokenizer.json` and `tokenizer_config.json`: tokenizer files
- `chat_template.jinja`: chat formatting template

## Related Variants

Quantized releases are provided separately:

- `JosephKBS/onca-1.0-9B-Int8`
- `JosephKBS/onca-1.0-9B-Int4`

## License

This release is provided under the Apache 2.0 license. Users should also review the license and usage terms of the upstream base model and any referenced datasets or benchmarks.

## Citation

If you use Onca 1.0, please cite the accompanying manuscript when publicly available. A temporary reference is:

```bibtex
@misc{shim2026onca,
  title = {Onca: An Open 9B Language Model for Pancreatic Cancer Clinical Tasks},
  author = {Shim, Kwan Bo},
  year = {2026},
  note = {Preprint in preparation}
}
```

## Acknowledgments

This project builds on the work of the Qwen and Qwopus model developers, as well as the many institutions and open-data contributors who created and maintained the public datasets used in training and evaluation.