# MedGemma 1.5 4B (GGUF Q5_K_M) two-stage (Stage1 base + Stage2 LoRA adapter)

This repo is the contest submission artifact set for a two-stage clinical extraction pipeline:

- Stage1: `medgemma-base-q5_k_m.gguf` (base model) produces structured Stage1 outputs (JSON -> Markdown).
- Stage2: the same base GGUF plus a GGUF LoRA adapter produces KVT4 fact lines.
## Files kept (submission-relevant)

- `medgemma-base-q5_k_m.gguf` (Stage1 base GGUF, Q5_K_M)
- `lora_stage2_all_hard200_20260207/lora_stage2_all_hard200_20260207-f16.gguf` (Stage2 LoRA adapter for llama.cpp)
- `manifest_q5_k_m_pair.json` (provenance + file inventory)
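The manifest can be used to sanity-check the artifact inventory before serving. A minimal sketch, assuming a hypothetical manifest layout with a `files` list of `name`/`sha256` entries (check `manifest_q5_k_m_pair.json` for the real schema):

```python
import hashlib
import json
from pathlib import Path

def verify_inventory(manifest_path: str) -> list[str]:
    """Return a list of problems; an empty list means every listed file
    exists and matches its recorded SHA-256.

    Assumes a hypothetical manifest layout:
    {"files": [{"name": "...", "sha256": "..."}]}
    """
    manifest = json.loads(Path(manifest_path).read_text())
    root = Path(manifest_path).parent
    problems = []
    for entry in manifest.get("files", []):
        path = root / entry["name"]
        if not path.exists():
            problems.append(f"missing: {entry['name']}")
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest != entry["sha256"]:
            problems.append(f"checksum mismatch: {entry['name']}")
    return problems
```

Running this once before starting the servers catches truncated downloads early, before they surface as garbled generations.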
## Recommended llama.cpp usage

Stage1 server (base only):

```sh
./llama-server \
  -m medgemma-base-q5_k_m.gguf \
  --host 127.0.0.1 --port 1245 \
  --alias medgemma-base-q5_k_m \
  -c 8192 -ngl 99 -t 4 --parallel 1
```
Stage2 server (base + LoRA adapter):

```sh
./llama-server \
  -m medgemma-base-q5_k_m.gguf \
  --lora lora_stage2_all_hard200_20260207/lora_stage2_all_hard200_20260207-f16.gguf \
  --host 127.0.0.1 --port 1246 \
  --alias medgemma-ft-lora-adapters-q5_k_m \
  -c 8192 -ngl 99 -t 4 --parallel 1 \
  --cache-prompt --cache-reuse 256
```
Notes:

- `--cache-prompt --cache-reuse 256` enables prompt KV cache reuse for Stage2 when the prompt prefix is byte-stable.
- Use deterministic generation (`temperature=0.0`) for strict artifact reproducibility.
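On the client side, a deterministic request against either server can be built like this (a minimal sketch using only the Python standard library; `/v1/chat/completions` is llama.cpp's OpenAI-compatible route, and the prompt text is a placeholder):

```python
import json
import urllib.request

def stage1_request(base_url: str, prompt: str) -> urllib.request.Request:
    """Build a deterministic chat-completion request for the Stage1 server."""
    payload = {
        "model": "medgemma-base-q5_k_m",   # must match the server --alias
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,                # deterministic decoding
        "max_tokens": 1536,                # matches --stage1-max-tokens below
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending (requires a running Stage1 server):
# with urllib.request.urlopen(stage1_request("http://127.0.0.1:1245", "...")) as r:
#     reply = json.load(r)
```

The same shape works for Stage2 with port `1246` and the `medgemma-ft-lora-adapters-q5_k_m` alias.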
## MedGemma runner mapping (aliases, ports, flags)

The MedGemma repo runners typically call two OpenAI-compatible llama.cpp servers:

- Stage1 URL: `http://127.0.0.1:1245` with model name (alias) `medgemma-base-q5_k_m`
- Stage2 URL: `http://127.0.0.1:1246` with model name (alias) `medgemma-ft-lora-adapters-q5_k_m`
The sequential runner (example invocation):

```sh
MEDGEMMA_STAGE2_PROFILE=curated10_tuning \
python3 scripts/run_two_stage_structured_sequential.py \
  --cohort-root <EHR_test_data_root> \
  --out-dir <out_dir> \
  --stage1-url http://127.0.0.1:1245 --stage1-model medgemma-base-q5_k_m \
  --stage1-profile sgr_v2 --schema-path schemas/readmission_domain_summary_sgr_v2.schema.json \
  --stage1-max-tokens 1536 --stage1-temperature 0.0 \
  --stage2-url http://127.0.0.1:1246 --stage2-model medgemma-ft-lora-adapters-q5_k_m \
  --stage2-scope all --stage2-max-tokens 768 --stage2-temperature 0.0 \
  --stage2-repetition-penalty 1.1
```
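The runner's sequential pattern can be sketched as follows. This is a rough illustration, not the actual script: `send` stands in for an HTTP call to a server, and feeding the Stage1 summary (rather than the raw note) into the Stage2 prompt is an assumption about the wiring.

```python
from typing import Callable

def run_two_stage(
    note_text: str,
    send: Callable[[str, str, str], str],  # (url, model_alias, prompt) -> completion text
) -> dict:
    """Stage1 produces a structured summary; Stage2 turns it into KVT4 fact lines."""
    stage1_out = send(
        "http://127.0.0.1:1245", "medgemma-base-q5_k_m", note_text
    )
    # Hypothetical wiring: Stage2 consumes the Stage1 summary.
    stage2_out = send(
        "http://127.0.0.1:1246", "medgemma-ft-lora-adapters-q5_k_m", stage1_out
    )
    return {"stage1": stage1_out, "stage2": stage2_out}
```

Keeping the orchestration a pure function of `send` makes it easy to dry-run the pipeline against a stub before pointing it at live servers.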
## Reproducibility checklist

- Use `temperature=0.0` for both stages.
- Keep the Stage2 prompt prefix byte-stable if using the llama.cpp prompt cache.
- Use a fixed `--ctx-size` (recommended `8192`) and `--parallel 1` for reproducible latency and retries.
- Keep the Stage2 profile consistent across runs (example: `MEDGEMMA_STAGE2_PROFILE=curated10_tuning`).
- Log and preserve `meta_stage1.json` and `meta_stage2.json` from your run output directory.
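One way to confirm the byte-stable-prefix assumption across runs is to fingerprint the rendered Stage2 prompt prefix (a minimal sketch; `prefix` is whatever fixed instruction block your Stage2 prompts share):

```python
import hashlib

def prefix_fingerprint(prefix: str) -> str:
    """SHA-256 over the exact UTF-8 bytes of the shared prompt prefix.

    Any change, even a single whitespace character, yields a different
    digest and invalidates llama.cpp prompt-cache reuse.
    """
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()

# Log the digest alongside meta_stage2.json and compare between runs:
# assert prefix_fingerprint(prefix) == previous_run_fingerprint
```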
## Provenance

- Base model: `google/medgemma-1.5-4b-it`
- Base revision: see `manifest_q5_k_m_pair.json`
## Licensing

This repository contains model artifacts (GGUF weights and a GGUF LoRA adapter) derived from `google/medgemma-1.5-4b-it`.

- Model artifacts are not released under Apache-2.0.
- Usage and redistribution must comply with the upstream MedGemma terms.

If you maintain a separate code repository (runners, evaluation scripts, docs), license the code and documentation separately (e.g. Apache-2.0 for code and CC BY 4.0 for docs) and keep clinical text out of git.
## Model tree

- Repo: `DocUA/medgemma-1.5-4b-it-gguf-q5-k-m-two-stage`
- Base model: `google/medgemma-1.5-4b-it`