Davis426 committed on
Commit
63b7fcc
·
verified ·
1 Parent(s): 355dfc6

Update combined model card

Files changed (1)
  1. README.md +75 -31
README.md CHANGED
@@ -4,7 +4,9 @@ language:
4
  - en
5
  library_name: peft
6
  pipeline_tag: text-generation
7
- base_model: Qwen/Qwen2.5-1.5B-Instruct
 
 
8
  tags:
9
  - medical
10
  - healthcare
@@ -14,26 +16,43 @@ tags:
14
  - lora
15
  - qwen
16
  - qwen2.5
 
 
17
  - ollama
18
  - gguf
19
  ---
20
 
21
- # Qwen2.5-1.5B Medical QA (QLoRA)
22
 
23
- QLoRA fine-tune of [`Qwen/Qwen2.5-1.5B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) on a 9,000-pair mix of six public biomedical Q&A sources. Built as part of the COMP8420 (Macquarie University) main project on a healthcare NLP assistant. The fine-tuned model is served locally via Ollama and benchmarked head-to-head against GPT-5.5 in the parent GitHub repo.
24
 
25
- **Companion code**: https://github.com/NhatNguyen3001/COMP8420-Healthcare-LLM-Assistant
 
 
 
 
 
 
 
26
  (see the GitHub README for the full system: voice input, PII railguard, multi-agent RAG, evaluation notebooks.)
27
 
28
  ## What is in this repo
29
 
30
- | Path | Size | What |
31
- |---|---|---|
32
- | `qwen-medqa-adapter/` | ~82 MB | PEFT LoRA adapter (re-apply to base Qwen2.5-1.5B-Instruct with `peft`) |
33
- | `qwen-medqa-gguf/model.Q4_K_M.gguf` | ~941 MB | Merged + Q4_K_M quantized model, ready for Ollama or llama.cpp |
34
- | `qwen-medqa-gguf/Modelfile` | <1 KB | Ollama registration recipe |
 
 
 
 
 
 
 
 
35
 
36
- The merged-but-unquantized `safetensors` is intentionally not uploaded; it is redundant for end users (use the GGUF for Ollama OR the adapter for transformers+peft).
37
 
38
  ## Training data
39
 
@@ -48,13 +67,14 @@ The merged-but-unquantized `safetensors` is intentionally not uploaded; it is re
48
  | DrugBank `side_effects` | ~1,500 | side-effect summaries |
49
  | DrugBank `mechanism_of_action` | ~1,500 | MoA explanations |
50
 
51
- 90 / 5 / 5 random split with `seed=42`. The OpenAI messages format was used at JSONL level; the Qwen2.5 chat template is applied at training time, not stored in the JSONL.
52
 
53
  ## Training setup
54
 
 
 
55
  | Hyperparameter | Value |
56
  |---|---|
57
- | Base | `Qwen/Qwen2.5-1.5B-Instruct` (4-bit NF4 via bitsandbytes) |
58
  | LoRA rank `r` | 16 |
59
  | LoRA alpha | 32 |
60
  | LoRA target modules | all 7 projection layers (q, k, v, o, gate, up, down) |
@@ -66,9 +86,17 @@ The merged-but-unquantized `safetensors` is intentionally not uploaded; it is re
66
  | Optimizer | `adamw_8bit` |
67
  | Seed | 42 |
68
  | Hardware | RTX 4060 (8 GB, bf16) |
69
- | Wall time | ~5,667 seconds (~95 minutes) |
70
 
71
- Best validation loss: 1.5536 around epoch 1.98. The deployed checkpoint is end-of-epoch-3 (the "what a full QLoRA run gives you" baseline, not early-stopped).
 
 
 
 
 
 
 
 
 
72
 
73
  ## Evaluation
74
 
@@ -79,13 +107,15 @@ Two evaluation passes:
79
  1. **Surface metrics**: ROUGE-1/2/L + BERTScore-F1 (with the PubMedBERT backbone)
80
  2. **LLM-as-judge**: GPT-5.4 scoring blind on Accuracy / Completeness / Clarity / Safety (0-10), reference-aware
81
 
82
- **Headline findings (vs GPT-5.5):**
83
 
84
- - This QLoRA model wins ROUGE-L by ~+0.022 (~+12% relative) and BERTScore-F1 by ~+0.0067 (~+0.8% relative)
85
  - The win is driven by **template substitution**, not factual improvement. The training set includes 71+ DrugBank entries sharing the skeleton "`{X}` pollen is the pollen of the `{X}` plant. `{X}` pollen is mainly used in allergenic testing." The fine-tune learns the template and slot-fills the entity at inference; ROUGE and BERTScore both reward this even when the substituted entity is wrong.
86
  - Verified 0 / 450 literal Q+A pair overlap between train and test, so this is template generalization, not memorization.
87
  - Under the LLM-as-judge Accuracy dimension, GPT-5.5 leads (judge results in the parent repo's `results/llm_judge_evaluation.csv`).
88
 
 
 
89
  Detailed numbers and charts live in the parent repo:
90
 
91
  - `results/llm_generation_evaluation.csv` + `llm_generation_eval_chart.png` + `llm_generation_bertscore_chart.png`
@@ -95,34 +125,45 @@ Detailed numbers and charts live in the parent repo:
95
 
96
  ## How to use
97
 
 
 
98
  ### Option 1 - Ollama (recommended for local serving)
99
 
100
  ```bash
101
- # Fetch the GGUF + Modelfile
102
  huggingface-cli download Davis426/COMP8420-Healthcare-LLM-Assistant \
103
- --include "qwen-medqa-gguf/*" \
104
  --local-dir ./models
105
 
106
  # Register with Ollama
107
- cd ./models/qwen-medqa-gguf
108
  ollama create medqa-qwen -f Modelfile
109
 
110
  # Try it
111
  ollama run medqa-qwen "What is amoxicillin used for?"
112
  ```
113
 
 
 
 
 
114
  ### Option 2 - transformers + peft (Python)
115
 
116
  ```python
117
  from peft import PeftModel
118
  from transformers import AutoModelForCausalLM, AutoTokenizer
119
 
120
- base_id = "Qwen/Qwen2.5-1.5B-Instruct"
 
 
 
 
 
121
  adapter_id = "Davis426/COMP8420-Healthcare-LLM-Assistant"
122
 
123
  tokenizer = AutoTokenizer.from_pretrained(base_id)
124
  base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
125
- model = PeftModel.from_pretrained(base, adapter_id, subfolder="qwen-medqa-adapter")
126
 
127
  messages = [{"role": "user", "content": "What is amoxicillin used for?"}]
128
  inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
@@ -134,31 +175,33 @@ print(tokenizer.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
134
 
135
  ```bash
136
  huggingface-cli download Davis426/COMP8420-Healthcare-LLM-Assistant \
137
- --include "qwen-medqa-gguf/model.Q4_K_M.gguf" --local-dir .
138
 
139
- ./llama-cli -m model.Q4_K_M.gguf -p "What is amoxicillin used for?" -n 256
 
140
  ```
141
 
142
  ## Limitations
143
 
144
- This model is a teaching / research artifact. **Do not use for real clinical decisions.** Specifically:
145
 
146
- - **Catastrophic forgetting on out-of-distribution prompts.** Fine-tuning on a narrow Q&A distribution at 1.5B parameter scale shifts the base model hard. Casual / non-medical questions get answered in MedQA-style; the base model's general conversational ability is degraded.
147
- - **Weakened in-context grounding.** Every training pair has shape `user_question -> answer`, with no retrieved-context block. As a result the fine-tuned model partly loses the ability to read RAG passages in the prompt and tends to answer from parametric memory even when correct evidence is supplied. The parent repo's MASS-RAG pipeline retains GPT-5.5 for cases where grounded answers matter; this local model is sidebar-selectable for the comparison experience.
148
- - **No factual safety net.** Both training data and evaluation rely on existing biomedical corpora; the model has no live knowledge cutoff or up-to-date drug-interaction database. The parent repo applies a regex-based PII railguard on user input, but the model output itself is not safety-filtered beyond what the base model already does.
149
  - **English only.**
 
150
 
151
  ## License
152
 
153
- `cc-by-nc-4.0` - research and non-commercial use. The base model (Qwen2.5-1.5B-Instruct) is Apache-2.0. Downstream dataset licenses may impose additional restrictions; please consult each source (BioASQ, MedQuAD, DrugBank, MedRAG textbooks) before redistribution.
154
 
155
  ## Citation
156
 
157
  If you use or build on this work, please reference:
158
 
159
  ```bibtex
160
- @misc{comp8420-2026-medqa-qwen,
161
- title = {Healthcare NLP Assistant: QLoRA-fine-tuned Qwen2.5-1.5B for medical Q&A},
162
  author = {Davis426},
163
  year = {2026},
164
  howpublished = {\url{https://huggingface.co/Davis426/COMP8420-Healthcare-LLM-Assistant}}
@@ -168,7 +211,8 @@ If you use or build on this work, please reference:
168
  Built on top of:
169
 
170
  - Qwen2.5 (Alibaba): https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct
 
171
  - QLoRA (Dettmers et al., 2023): https://arxiv.org/abs/2305.14314
172
- - MASS-RAG (Xiao, Huang, Liu, Xie, 2026): https://arxiv.org/abs/2604.18509 (used by the parent repo's retrieval pipeline that this model plugs into)
173
  - Unsloth: https://github.com/unslothai/unsloth
174
  - llama.cpp + Ollama for GGUF serving
 
4
  - en
5
  library_name: peft
6
  pipeline_tag: text-generation
7
+ base_model:
8
+ - Qwen/Qwen2.5-1.5B-Instruct
9
+ - meta-llama/Llama-3.2-1B-Instruct
10
  tags:
11
  - medical
12
  - healthcare
 
16
  - lora
17
  - qwen
18
  - qwen2.5
19
+ - llama
20
+ - llama-3.2
21
  - ollama
22
  - gguf
23
  ---
24
 
25
+ # Healthcare LLM Assistant - QLoRA fine-tunes
26
 
27
+ Two parallel QLoRA fine-tunes of small instruct models on the same 9,000-pair mix of public biomedical Q&A, served side-by-side in the parent project's Streamlit UI for a 3-way bake-off against GPT-5.5.
28
 
29
+ | Variant | Subfolder | Base | Adapter | GGUF (Q4_K_M) |
30
+ |---|---|---|---|---|
31
+ | **Qwen** | `qwen/` | [`Qwen/Qwen2.5-1.5B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) | `qwen/qwen-medqa-adapter/` (~82 MB) | `qwen/qwen-medqa-gguf/model.Q4_K_M.gguf` (~941 MB) |
32
+ | **Llama-3.2** | `llama32/` | [`meta-llama/Llama-3.2-1B-Instruct`](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) | `llama32/llama32-medqa-adapter/` (~50 MB) | `llama32/llama32-medqa-gguf/model.Q4_K_M.gguf` (~770 MB) |
33
+
34
+ Both variants were trained with the same dataset, the same LoRA shape (r=16, α=32, all 7 projection layers) and the same SFT recipe, so any quality gap isolates the base-model effect.
35
+
36
+ Built as part of the COMP8420 (Macquarie University) main project on a healthcare NLP assistant. Companion code: **https://github.com/NhatNguyen3001/COMP8420-Healthcare-LLM-Assistant**
37
  (see the GitHub README for the full system: voice input, PII railguard, multi-agent RAG, evaluation notebooks.)
38
 
39
  ## What is in this repo
40
 
41
+ ```
42
+ .
43
+ ├── qwen/
44
+ │   ├── qwen-medqa-adapter/ # PEFT LoRA adapter
45
+ │   └── qwen-medqa-gguf/
46
+ │       ├── model.Q4_K_M.gguf # Ollama-ready GGUF
47
+ │       └── Modelfile # Ollama registration recipe
48
+ └── llama32/
49
+     ├── llama32-medqa-adapter/ # PEFT LoRA adapter
50
+     └── llama32-medqa-gguf/
51
+         ├── model.Q4_K_M.gguf
52
+         └── Modelfile
53
+ ```
54
 
55
+ The merged-but-unquantized `safetensors` is intentionally not uploaded for either variant; it is redundant for end users (use the GGUF for Ollama OR the adapter for transformers+peft).
56
 
57
  ## Training data
58
 
 
67
  | DrugBank `side_effects` | ~1,500 | side-effect summaries |
68
  | DrugBank `mechanism_of_action` | ~1,500 | MoA explanations |
69
 
70
+ 90 / 5 / 5 random split with `seed=42`. The OpenAI messages format is used at JSONL level; each variant's chat template (Qwen2.5 or Llama-3.1) is applied at training time, not stored in the JSONL.
71
 
72
  ## Training setup
73
 
74
+ Same hyperparameters across both variants:
75
+
76
  | Hyperparameter | Value |
77
  |---|---|
 
78
  | LoRA rank `r` | 16 |
79
  | LoRA alpha | 32 |
80
  | LoRA target modules | all 7 projection layers (q, k, v, o, gate, up, down) |
 
86
  | Optimizer | `adamw_8bit` |
87
  | Seed | 42 |
88
  | Hardware | RTX 4060 (8 GB, bf16) |
 
89
 
90
+ Per-variant differences:
91
+
92
+ | | Qwen | Llama-3.2 |
93
+ |---|---|---|
94
+ | Base id | `Qwen/Qwen2.5-1.5B-Instruct` (4-bit NF4) | `meta-llama/Llama-3.2-1B-Instruct` (4-bit NF4) |
95
+ | Chat template | `qwen-2.5` | `llama-3.1` |
96
+ | Wall time (3 epochs) | ~95 min | ~30-45 min (smaller base) |
97
+ | Best val loss | 1.5536 (~epoch 1.98) | see `results/training_qlora_llama32.md` in parent repo |
98
+
99
+ Deployed checkpoints are end-of-epoch-3 for both (the "what a full QLoRA run gives you" baseline, not early-stopped).
100
 
101
  ## Evaluation
102
 
 
107
  1. **Surface metrics**: ROUGE-1/2/L + BERTScore-F1 (with the PubMedBERT backbone)
108
  2. **LLM-as-judge**: GPT-5.4 scoring blind on Accuracy / Completeness / Clarity / Safety (0-10), reference-aware
109
 
110
+ **Headline findings (vs GPT-5.5), Qwen variant:**
111
 
112
+ - The Qwen QLoRA model wins ROUGE-L by ~+0.022 (~+12% relative) and BERTScore-F1 by ~+0.0067 (~+0.8% relative) against GPT-5.5
113
  - The win is driven by **template substitution**, not factual improvement. The training set includes 71+ DrugBank entries sharing the skeleton "`{X}` pollen is the pollen of the `{X}` plant. `{X}` pollen is mainly used in allergenic testing." The fine-tune learns the template and slot-fills the entity at inference; ROUGE and BERTScore both reward this even when the substituted entity is wrong.
114
  - Verified 0 / 450 literal Q+A pair overlap between train and test, so this is template generalization, not memorization.
115
  - Under the LLM-as-judge Accuracy dimension, GPT-5.5 leads (judge results in the parent repo's `results/llm_judge_evaluation.csv`).
116
 
117
+ **Llama-3.2 variant:** see the 3-way numbers in the parent repo's `results/model_comparison.csv` (refreshed after the Llama run). The same template-substitution dynamic is expected on shared DrugBank slots; the contrast with the Qwen variant isolates the base-model contribution.
118
+
119
  Detailed numbers and charts live in the parent repo:
120
 
121
  - `results/llm_generation_evaluation.csv` + `llm_generation_eval_chart.png` + `llm_generation_bertscore_chart.png`
 
125
 
126
  ## How to use
127
 
128
+ The examples below use the `qwen` variant; substitute `llama32` in the paths to use the Llama variant.
129
+
130
  ### Option 1 - Ollama (recommended for local serving)
131
 
132
  ```bash
133
+ # Fetch one variant's GGUF + Modelfile
134
  huggingface-cli download Davis426/COMP8420-Healthcare-LLM-Assistant \
135
+ --include "qwen/qwen-medqa-gguf/*" \
136
  --local-dir ./models
137
 
138
  # Register with Ollama
139
+ cd ./models/qwen/qwen-medqa-gguf
140
  ollama create medqa-qwen -f Modelfile
141
 
142
  # Try it
143
  ollama run medqa-qwen "What is amoxicillin used for?"
144
  ```
145
 
146
+ For the Llama variant, swap every `qwen` for `llama32` (paths) and the Ollama tag to `medqa-llama32`.
147
+
148
+ You can register both side-by-side; one `ollama serve` daemon handles both tags concurrently (`OLLAMA_MAX_LOADED_MODELS` defaults to 3).
149
+
150
  ### Option 2 - transformers + peft (Python)
151
 
152
  ```python
153
  from peft import PeftModel
154
  from transformers import AutoModelForCausalLM, AutoTokenizer
155
 
156
+ # pick a variant
157
+ base_id = "Qwen/Qwen2.5-1.5B-Instruct"
158
+ subfolder = "qwen/qwen-medqa-adapter"
159
+ # or:
160
+ # base_id = "meta-llama/Llama-3.2-1B-Instruct"
161
+ # subfolder = "llama32/llama32-medqa-adapter"
162
  adapter_id = "Davis426/COMP8420-Healthcare-LLM-Assistant"
163
 
164
  tokenizer = AutoTokenizer.from_pretrained(base_id)
165
  base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
166
+ model = PeftModel.from_pretrained(base, adapter_id, subfolder=subfolder)
167
 
168
  messages = [{"role": "user", "content": "What is amoxicillin used for?"}]
169
  inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
 
175
 
176
  ```bash
177
  huggingface-cli download Davis426/COMP8420-Healthcare-LLM-Assistant \
178
+ --include "qwen/qwen-medqa-gguf/model.Q4_K_M.gguf" --local-dir .
179
 
180
+ ./llama-cli -m qwen/qwen-medqa-gguf/model.Q4_K_M.gguf \
181
+ -p "What is amoxicillin used for?" -n 256
182
  ```
183
 
184
  ## Limitations
185
 
186
+ Both models are teaching / research artifacts. **Do not use for real clinical decisions.** Specifically:
187
 
188
+ - **Catastrophic forgetting on out-of-distribution prompts.** Fine-tuning on a narrow Q&A distribution at the 1-1.5B parameter scale shifts each base model hard. Casual / non-medical questions get answered in MedQA-style; the base model's general conversational ability is degraded.
189
+ - **Weakened in-context grounding.** Every training pair has shape `user_question -> answer`, with no retrieved-context block. As a result both fine-tuned models partly lose the ability to read RAG passages in the prompt and tend to answer from parametric memory even when correct evidence is supplied. The parent repo's MASS-RAG pipeline retains GPT-5.5 for cases where grounded answers matter; the local models are sidebar-selectable for the comparison experience.
190
+ - **No factual safety net.** Both training data and evaluation rely on existing biomedical corpora; the models have no access to live knowledge updates or an up-to-date drug-interaction database. The parent repo applies a regex-based PII railguard on user input, but model output itself is not safety-filtered beyond what each base model already does.
191
  - **English only.**
192
+ - **Llama-3.2 base licence:** Llama-3.2 community licence applies to the Llama variant (acceptance via the gated HF repo); see the Meta licence for permitted uses.
193
 
194
  ## License
195
 
196
+ The fine-tuned adapters and GGUFs in this repo are released under `cc-by-nc-4.0` (research and non-commercial use). Base model licences override where stricter: Qwen2.5 is Apache-2.0; Llama-3.2 is under the Meta Llama 3.2 Community Licence. Downstream dataset licences may impose additional restrictions; please consult each source (BioASQ, MedQuAD, DrugBank, MedRAG textbooks) before redistribution.
197
 
198
  ## Citation
199
 
200
  If you use or build on this work, please reference:
201
 
202
  ```bibtex
203
+ @misc{comp8420-2026-medqa,
204
+ title = {Healthcare NLP Assistant: parallel QLoRA fine-tunes of Qwen2.5-1.5B and Llama-3.2-1B for medical Q&A},
205
  author = {Davis426},
206
  year = {2026},
207
  howpublished = {\url{https://huggingface.co/Davis426/COMP8420-Healthcare-LLM-Assistant}}
 
211
  Built on top of:
212
 
213
  - Qwen2.5 (Alibaba): https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct
214
+ - Llama-3.2 (Meta): https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct
215
  - QLoRA (Dettmers et al., 2023): https://arxiv.org/abs/2305.14314
216
+ - MASS-RAG (Xiao, Huang, Liu, Xie, 2026): https://arxiv.org/abs/2604.18509 (used by the parent repo's retrieval pipeline that these models plug into)
217
  - Unsloth: https://github.com/unslothai/unsloth
218
  - llama.cpp + Ollama for GGUF serving
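
The per-variant paths introduced by this commit follow one naming pattern, so swapping `qwen` for `llama32` can be done in one place rather than by hand. The sketch below is illustrative only: the `Variant` dataclass and the `variant_paths` helper are hypothetical names that do not exist in the repo, and the values are copied from the variant table and usage notes in the updated README.

```python
from dataclasses import dataclass

# Hugging Face repo id, as used in the README's download commands.
REPO_ID = "Davis426/COMP8420-Healthcare-LLM-Assistant"

# Base model ids per variant, as listed in the README's variant table.
BASE_IDS = {
    "qwen": "Qwen/Qwen2.5-1.5B-Instruct",
    "llama32": "meta-llama/Llama-3.2-1B-Instruct",
}


@dataclass(frozen=True)
class Variant:
    """Hypothetical container for one variant's locations (not part of the repo)."""

    base_id: str
    adapter_subfolder: str
    gguf_path: str
    ollama_tag: str


def variant_paths(name: str) -> Variant:
    """Return the base id, adapter subfolder, GGUF path and Ollama tag for a variant."""
    if name not in BASE_IDS:
        raise ValueError(f"unknown variant {name!r}; expected one of {sorted(BASE_IDS)}")
    return Variant(
        base_id=BASE_IDS[name],
        adapter_subfolder=f"{name}/{name}-medqa-adapter",
        gguf_path=f"{name}/{name}-medqa-gguf/model.Q4_K_M.gguf",
        ollama_tag=f"medqa-{name}",
    )


if __name__ == "__main__":
    for variant_name in BASE_IDS:
        print(variant_name, variant_paths(variant_name))
```

With this in place, the Option 2 snippet could pass `variant_paths("llama32").adapter_subfolder` as the `subfolder` argument instead of editing the string manually.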