Davis426 committed · Commit 15f6e0c · verified · 1 Parent(s): e7c32ff

Add model card

Files changed (1)
  1. README.md +172 -1
README.md CHANGED
@@ -1,3 +1,174 @@
1
  ---
2
- license: mit
3
  ---
1
  ---
2
+ license: cc-by-nc-4.0
3
+ language:
4
+ - en
5
+ library_name: peft
6
+ pipeline_tag: text-generation
7
+ base_model: Qwen/Qwen2.5-1.5B-Instruct
8
+ tags:
9
+ - medical
10
+ - healthcare
11
+ - clinical
12
+ - qlora
13
+ - peft
14
+ - lora
15
+ - qwen
16
+ - qwen2.5
17
+ - ollama
18
+ - gguf
19
  ---
20
+
21
+ # Qwen2.5-1.5B Medical QA (QLoRA)
22
+
23
+ QLoRA fine-tune of [`Qwen/Qwen2.5-1.5B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) on a 9,000-pair mix of six public biomedical Q&A sources. Built as part of the COMP8420 (Macquarie University) main project on a healthcare NLP assistant. The fine-tuned model is served locally via Ollama and benchmarked head-to-head against GPT-5.5 in the parent GitHub repo.
24
+
25
+ **Companion code**: https://github.com/NhatNguyen3001/COMP8420-Healthcare-LLM-Assistant
26
+ (see the GitHub README for the full system: voice input, PII railguard, multi-agent RAG, evaluation notebooks.)
27
+
28
+ ## What is in this repo
29
+
30
+ | Path | Size | What |
31
+ |---|---|---|
32
+ | `qwen-medqa-adapter/` | ~82 MB | PEFT LoRA adapter (re-apply to base Qwen2.5-1.5B-Instruct with `peft`) |
33
+ | `qwen-medqa-gguf/model.Q4_K_M.gguf` | ~941 MB | Merged + Q4_K_M quantized model, ready for Ollama or llama.cpp |
34
+ | `qwen-medqa-gguf/Modelfile` | <1 KB | Ollama registration recipe |
35
+
36
+ The merged-but-unquantized `safetensors` checkpoint is intentionally not uploaded; it is redundant for end users, who can use either the GGUF with Ollama or the adapter with transformers + peft.
37
+
38
+ ## Training data
39
+
40
+ 9,000 question-answer pairs (train 8,100 / val 450 / test 450) drawn from six public sources, capped at 1,500 pairs per source for balance:
41
+
42
+ | Source | Pairs | Notes |
43
+ |---|---|---|
44
+ | BioASQ (subset of training14b) | ~1,500 | factoid / list / summary biomedical Q&A |
45
+ | MedQuAD | ~1,500 | consumer-facing medical questions |
46
+ | DrugBank `description` | ~1,500 | "What is X?" templates |
47
+ | DrugBank `indication` | ~1,500 | indication / contraindication |
48
+ | DrugBank `side_effects` | ~1,500 | side-effect summaries |
49
+ | DrugBank `mechanism_of_action` | ~1,500 | MoA explanations |
50
+
51
+ 90 / 5 / 5 random split with `seed=42`. The OpenAI messages format was used at JSONL level; the Qwen2.5 chat template is applied at training time, not stored in the JSONL.
52
+
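The split and record format described above can be sketched as follows. This is a minimal illustration, not the project's actual preprocessing script: the helper names `to_record` and `split_pairs` are hypothetical, the synthetic pairs stand in for the real data, and the exact shuffle implementation used in the parent repo is an assumption.

```python
import json
import random


def to_record(question, answer):
    """Wrap one Q&A pair in the OpenAI messages format (one JSONL line per record)."""
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
    }


def split_pairs(pairs, seed=42):
    """Shuffle a copy of `pairs` and cut it 90 / 5 / 5 into train / val / test."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids float rounding surprises at the cut points.
    n_train = len(shuffled) * 90 // 100
    n_val = len(shuffled) * 5 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Synthetic placeholder data; the real set has 9,000 curated pairs.
pairs = [(f"question {i}", f"answer {i}") for i in range(9000)]
train, val, test = split_pairs(pairs)
jsonl_line = json.dumps(to_record(*train[0]))
print(len(train), len(val), len(test))  # 8100 450 450
```

No chat template appears here: the records store only roles and content, which matches the note that the Qwen2.5 template is applied at training time.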
53
+ ## Training setup
54
+
55
+ | Hyperparameter | Value |
56
+ |---|---|
57
+ | Base | `Qwen/Qwen2.5-1.5B-Instruct` (4-bit NF4 via bitsandbytes) |
58
+ | LoRA rank `r` | 16 |
59
+ | LoRA alpha | 32 |
60
+ | LoRA target modules | all 7 projection layers (q, k, v, o, gate, up, down) |
61
+ | Max sequence length | 1024 |
62
+ | Per-device batch size | 2 |
63
+ | Gradient accumulation | 4 (effective batch = 8) |
64
+ | Epochs | 3 |
65
+ | Learning rate | 2e-4, cosine schedule |
66
+ | Optimizer | `adamw_8bit` |
67
+ | Seed | 42 |
68
+ | Hardware | RTX 4060 (8 GB, bf16) |
69
+ | Wall time | ~5,667 seconds (~95 minutes) |
70
+
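For readers who want to reproduce a similar setup with plain `peft` and `bitsandbytes`, the table maps roughly onto the configuration below. This is a sketch under assumptions: the card lists Unsloth among its dependencies, so the original run may have been configured through Unsloth's wrappers rather than these classes; the module names are the usual Qwen2 projection names; dropout and bias are not stated in the card and are left at library defaults.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization of the frozen base model, computing in bf16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA on all seven projection layers, rank 16, alpha 32.
# lora_dropout and bias are NOT given in the card; defaults are used here.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```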
71
+ Best validation loss: 1.5536 around epoch 1.98. The deployed checkpoint is the end-of-epoch-3 one rather than the best-validation checkpoint: it serves as the "what a full QLoRA run gives you" baseline, with no early stopping.
72
+
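The batch and wall-time figures in the table imply a rough step budget, worked through below. The helper `training_budget` is hypothetical, and the step count assumes a single GPU with the final partial accumulation window kept; the trainer actually used may round differently by a step per epoch.

```python
import math


def training_budget(n_train, per_device_batch, grad_accum, epochs, wall_seconds):
    """Derive effective batch size, optimizer steps, and seconds per step."""
    effective_batch = per_device_batch * grad_accum
    # Assumption: the last, partially filled window still triggers a step.
    steps_per_epoch = math.ceil(n_train / effective_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "seconds_per_step": wall_seconds / total_steps,
    }


budget = training_budget(
    n_train=8100, per_device_batch=2, grad_accum=4, epochs=3, wall_seconds=5667
)
print(budget["effective_batch"], budget["steps_per_epoch"], budget["total_steps"])
# 8 1013 3039
```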
73
+ ## Evaluation
74
+
75
+ Evaluated on the held-out 450-pair test set, with 100 stratified pairs (~17 per source) used as the common comparison sample across all evaluation notebooks.
76
+
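One way to draw a 100-pair sample stratified over six sources is sketched below. It is illustrative only: the `source` field name, the seed, and the rule for distributing the remainder (four sources get 17 pairs, two get 16) are assumptions, not details taken from the evaluation notebooks.

```python
import random
from collections import defaultdict


def stratified_sample(rows, total=100, seed=42, key="source"):
    """Pick `total` rows, spread as evenly as possible across the values of `key`."""
    by_source = defaultdict(list)
    for row in rows:
        by_source[row[key]].append(row)

    sources = sorted(by_source)
    base, extra = divmod(total, len(sources))
    rng = random.Random(seed)

    sample = []
    for i, source in enumerate(sources):
        # The first `extra` sources take one additional row each.
        quota = base + (1 if i < extra else 0)
        # rng.sample raises ValueError if a source has fewer rows than its quota.
        sample.extend(rng.sample(by_source[source], quota))
    return sample


# Synthetic stand-in for the 450-pair test set: 6 sources x 75 rows.
rows = [{"source": f"source_{s}", "id": i} for s in range(6) for i in range(75)]
sample = stratified_sample(rows)
print(len(sample))  # 100
```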
77
+ Two evaluation passes:
78
+
79
+ 1. **Surface metrics**: ROUGE-1/2/L + BERTScore-F1 (with the PubMedBERT backbone)
80
+ 2. **LLM-as-judge**: GPT-5.4 scoring blind on Accuracy / Completeness / Clarity / Safety (0-10), reference-aware
81
+
82
+ **Headline findings (vs GPT-5.5):**
83
+
84
+ - This QLoRA model leads on ROUGE-L by ~0.022 (~12% relative) and on BERTScore-F1 by ~0.0067 (~0.8% relative).
85
+ - The win is driven by **template substitution**, not factual improvement. The training set includes 71+ DrugBank entries sharing the skeleton "`{X}` pollen is the pollen of the `{X}` plant. `{X}` pollen is mainly used in allergenic testing." The fine-tune learns the template and slot-fills the entity at inference; ROUGE and BERTScore both reward this even when the substituted entity is wrong.
86
+ - Verified 0 / 450 literal Q+A pair overlap between train and test, so this is template generalization, not memorization.
87
+ - Under the LLM-as-judge Accuracy dimension, GPT-5.5 leads (judge results in the parent repo's `results/llm_judge_evaluation.csv`).
88
+
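The template-substitution effect described in the findings above can be reproduced with a toy ROUGE-L computation. This is a simplified, self-contained sketch (lowercased word tokens, no stemming), not the scoring code used in the parent repo, and the entities "Birch" and "Alder" are illustrative placeholders rather than actual test items.

```python
import re


def tokens(text):
    """Lowercase word tokens; a simplification of real ROUGE tokenization."""
    return re.findall(r"\w+", text.lower())


def lcs_length(a, b):
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    ref, cand = tokens(reference), tokens(candidate)
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


TEMPLATE = (
    "{x} pollen is the pollen of the {x} plant. "
    "{x} pollen is mainly used in allergenic testing."
)
reference = TEMPLATE.format(x="Birch")
wrong_entity = TEMPLATE.format(x="Alder")  # right skeleton, wrong entity

score = rouge_l_f1(reference, wrong_entity)
print(round(score, 3))  # 0.824
```

Here 14 of the 17 tokens match in order, so an answer naming the wrong plant still scores about 0.82, which is why the surface metrics are paired with an LLM-as-judge pass.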
89
+ Detailed numbers and charts live in the parent repo:
90
+
91
+ - `results/llm_generation_evaluation.csv` + `llm_generation_eval_chart.png` + `llm_generation_bertscore_chart.png`
92
+ - `results/llm_judge_evaluation.csv` + `llm_judge_eval_chart.png`
93
+ - `results/model_comparison.csv` + `model_comparison_chart.png`
94
+ - `results/qlora_loss_curve.png` + `results/qlora_source_mix.png`
95
+
96
+ ## How to use
97
+
98
+ ### Option 1 — Ollama (recommended for local serving)
99
+
100
+ ```bash
101
+ # Fetch the GGUF + Modelfile
102
+ huggingface-cli download Davis426/COMP8420-Healthcare-LLM-Assistant \
103
+ --include "qwen-medqa-gguf/*" \
104
+ --local-dir ./models
105
+
106
+ # Register with Ollama
107
+ cd ./models/qwen-medqa-gguf
108
+ ollama create medqa-qwen -f Modelfile
109
+
110
+ # Try it
111
+ ollama run medqa-qwen "What is amoxicillin used for?"
112
+ ```
113
+
114
+ ### Option 2 — transformers + peft (Python)
115
+
116
+ ```python
117
+ from peft import PeftModel
118
+ from transformers import AutoModelForCausalLM, AutoTokenizer
119
+
120
+ base_id = "Qwen/Qwen2.5-1.5B-Instruct"
121
+ adapter_id = "Davis426/COMP8420-Healthcare-LLM-Assistant"
122
+
123
+ tokenizer = AutoTokenizer.from_pretrained(base_id)
124
+ base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
125
+ model = PeftModel.from_pretrained(base, adapter_id, subfolder="qwen-medqa-adapter")
126
+
127
+ messages = [{"role": "user", "content": "What is amoxicillin used for?"}]
128
+ inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
129
+ out = model.generate(inputs, max_new_tokens=256)
130
+ print(tokenizer.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
131
+ ```
132
+
133
+ ### Option 3 — llama.cpp directly
134
+
135
+ ```bash
136
+ huggingface-cli download Davis426/COMP8420-Healthcare-LLM-Assistant \
137
+ --include "qwen-medqa-gguf/model.Q4_K_M.gguf" --local-dir .
138
+
139
+ ./llama-cli -m qwen-medqa-gguf/model.Q4_K_M.gguf -p "What is amoxicillin used for?" -n 256
140
+ ```
141
+
142
+ ## Limitations
143
+
144
+ This model is a teaching / research artifact. **Do not use for real clinical decisions.** Specifically:
145
+
146
+ - **Catastrophic forgetting on out-of-distribution prompts.** Fine-tuning on a narrow Q&A distribution at 1.5B parameter scale shifts the base model hard. Casual or non-medical questions are answered in the same medical Q&A style, and the base model's general conversational ability is degraded.
147
+ - **Weakened in-context grounding.** Every training pair has the shape `user_question -> answer`, with no retrieved-context block. As a result, the fine-tuned model partly loses the ability to read RAG passages in the prompt and tends to answer from parametric memory even when correct evidence is supplied. The parent repo's MASS-RAG pipeline retains GPT-5.5 for cases where grounded answers matter; this local model remains selectable from the sidebar for side-by-side comparison.
148
+ - **No factual safety net.** Both training data and evaluation rely on existing biomedical corpora; the model has no access to live knowledge or to an up-to-date drug-interaction database. The parent repo applies a regex-based PII railguard on user input, but the model output itself is not safety-filtered beyond what the base model already does.
149
+ - **English only.**
150
+
151
+ ## License
152
+
153
+ `cc-by-nc-4.0` — research and non-commercial use. The base model (Qwen2.5-1.5B-Instruct) is Apache-2.0. Downstream dataset licenses may impose additional restrictions; please consult each source (BioASQ, MedQuAD, DrugBank, MedRAG textbooks) before redistribution.
154
+
155
+ ## Citation
156
+
157
+ If you use or build on this work, please reference:
158
+
159
+ ```bibtex
160
+ @misc{comp8420-2026-medqa-qwen,
161
+ title = {Healthcare NLP Assistant: QLoRA-fine-tuned Qwen2.5-1.5B for medical Q&A},
162
+ author = {Davis426},
163
+ year = {2026},
164
+ howpublished = {\url{https://huggingface.co/Davis426/COMP8420-Healthcare-LLM-Assistant}}
165
+ }
166
+ ```
167
+
168
+ Built on top of:
169
+
170
+ - Qwen2.5 (Alibaba): https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct
171
+ - QLoRA (Dettmers et al., 2023): https://arxiv.org/abs/2305.14314
172
+ - MASS-RAG (Xiao, Huang, Liu, Xie, 2026): https://arxiv.org/abs/2604.18509 (used by the parent repo's retrieval pipeline that this model plugs into)
173
+ - Unsloth: https://github.com/unslothai/unsloth
174
+ - llama.cpp + Ollama for GGUF serving