MedGemma 1.5 4B (GGUF Q5_K_M) two-stage (Stage1 base + Stage2 LoRA adapter)

This repo is the contest submission artifact set for a two-stage clinical extraction pipeline:

  1. Stage1: medgemma-base-q5_k_m.gguf (base model) produces structured Stage1 outputs (JSON -> Markdown).
  2. Stage2: the same base GGUF plus a GGUF LoRA adapter produces KVT4 fact lines.
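The hand-off between the two stages can be sketched as below. Note this is an illustrative sketch only: the JSON field names ("domain", "findings") and the pipe-delimited KVT4 layout are assumptions, not the actual schema (which lives in schemas/readmission_domain_summary_sgr_v2.schema.json).

```python
import json

def stage1_json_to_markdown(stage1_json: str) -> str:
    """Render Stage1 structured output (JSON) as a Markdown section.

    The field names here ("domain", "findings") are hypothetical
    placeholders for whatever the real Stage1 schema defines.
    """
    doc = json.loads(stage1_json)
    lines = [f"## {doc['domain']}"]
    lines += [f"- {finding}" for finding in doc["findings"]]
    return "\n".join(lines)

def parse_kvt4_line(line: str) -> dict:
    """Split one Stage2 fact line into four fields.

    Assumes a pipe-delimited key/value/type/tag layout; the actual
    KVT4 format may differ.
    """
    key, value, vtype, tag = (part.strip() for part in line.split("|"))
    return {"key": key, "value": value, "type": vtype, "tag": tag}
```

Stage1's Markdown rendering is what the Stage2 prompt consumes, so keeping this conversion deterministic matters for cache reuse (see the notes below the server commands).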

Files kept (submission-relevant)

  • medgemma-base-q5_k_m.gguf (Stage1 base GGUF, Q5_K_M)
  • lora_stage2_all_hard200_20260207/lora_stage2_all_hard200_20260207-f16.gguf (Stage2 LoRA adapter for llama.cpp)
  • manifest_q5_k_m_pair.json (provenance + file inventory)

Recommended llama.cpp usage

Stage1 server (base only):

./llama-server \
  -m medgemma-base-q5_k_m.gguf \
  --host 127.0.0.1 --port 1245 \
  --alias medgemma-base-q5_k_m \
  -c 8192 -ngl 99 -t 4 --parallel 1

Stage2 server (base + LoRA adapter):

./llama-server \
  -m medgemma-base-q5_k_m.gguf \
  --lora lora_stage2_all_hard200_20260207/lora_stage2_all_hard200_20260207-f16.gguf \
  --host 127.0.0.1 --port 1246 \
  --alias medgemma-ft-lora-adapters-q5_k_m \
  -c 8192 -ngl 99 -t 4 --parallel 1 \
  --cache-prompt --cache-reuse 256

Notes:

  • --cache-prompt --cache-reuse 256 enables prompt KV cache reuse for Stage2 when the prompt prefix is byte-stable.
  • Use deterministic generation (temperature=0.0) for strict artifact reproducibility.
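What "byte-stable" means in practice: the instruction prefix sent to the Stage2 server must be identical byte-for-byte across requests, with only the per-note payload varying at the end. A minimal sketch (the prompt text here is a made-up placeholder, not the real Stage2 prompt):

```python
def build_stage2_prompt(system_prefix: str, stage1_markdown: str) -> str:
    """Concatenate a fixed instruction prefix with the per-note payload.

    For llama.cpp prompt-cache reuse (--cache-prompt --cache-reuse 256)
    to take effect, the prefix must be byte-identical across requests:
    no timestamps, request IDs, or locale-dependent formatting in it.
    """
    return system_prefix + "\n\n" + stage1_markdown

# Hypothetical fixed prefix; the real one comes from the Stage2 profile.
PREFIX = "Extract KVT4 fact lines from the clinical summary below."

a = build_stage2_prompt(PREFIX, "## Cardiology\n- EF 35%")
b = build_stage2_prompt(PREFIX, "## Nephrology\n- Cr 2.1")

# Shared prefix bytes are what the server's KV cache can reuse.
assert a.encode()[:len(PREFIX)] == b.encode()[:len(PREFIX)]
```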

MedGemma runner mapping (aliases, ports, flags)

The MedGemma repo runners typically call two OpenAI-compatible llama.cpp servers:

  • Stage1 URL: http://127.0.0.1:1245 with model name (alias) medgemma-base-q5_k_m
  • Stage2 URL: http://127.0.0.1:1246 with model name (alias) medgemma-ft-lora-adapters-q5_k_m
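Because both servers speak the OpenAI chat-completions API, a runner can target them with plain HTTP from the standard library. A minimal sketch, assuming only the standard /v1/chat/completions endpoint and response shape (the helper names are made up):

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str, max_tokens: int):
    """Build an OpenAI-compatible chat-completions request.

    temperature=0.0 matches the reproducibility guidance in this README.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,
        "max_tokens": max_tokens,
    }
    return f"{base_url}/v1/chat/completions", payload

def chat(base_url: str, model: str, prompt: str, max_tokens: int = 1536) -> str:
    """POST the request and return the assistant message text."""
    url, payload = build_chat_request(base_url, model, prompt, max_tokens)
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Usage would look like `chat("http://127.0.0.1:1245", "medgemma-base-q5_k_m", note_text)` for Stage1 and the port-1246 URL/alias for Stage2.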

The sequential runner (example invocation):

MEDGEMMA_STAGE2_PROFILE=curated10_tuning \
python3 scripts/run_two_stage_structured_sequential.py \
  --cohort-root <EHR_test_data_root> \
  --out-dir <out_dir> \
  --stage1-url http://127.0.0.1:1245 --stage1-model medgemma-base-q5_k_m \
  --stage1-profile sgr_v2 --schema-path schemas/readmission_domain_summary_sgr_v2.schema.json \
  --stage1-max-tokens 1536 --stage1-temperature 0.0 \
  --stage2-url http://127.0.0.1:1246 --stage2-model medgemma-ft-lora-adapters-q5_k_m \
  --stage2-scope all --stage2-max-tokens 768 --stage2-temperature 0.0 \
  --stage2-repetition-penalty 1.1

Reproducibility checklist

  • Use temperature=0.0 for both stages.
  • Keep Stage2 prompt prefix byte-stable if using llama.cpp prompt cache.
  • Use a fixed context size (-c 8192, as in the server commands above) and --parallel 1 for reproducible latency and retries.
  • Keep Stage2 profile consistent across runs (example: MEDGEMMA_STAGE2_PROFILE=curated10_tuning).
  • Log and preserve meta_stage1.json and meta_stage2.json from your run output directory.
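One way to verify that two runs were in fact byte-identical is to hash every artifact in each output directory and compare a single digest. This is a hedged sketch (the file pattern and directory layout are assumptions about your run output):

```python
import hashlib
from pathlib import Path

def digest_run(out_dir: str, pattern: str = "*") -> str:
    """Hash every matching file in a run directory, in sorted order,
    so two deterministic runs can be compared with one digest."""
    h = hashlib.sha256()
    for path in sorted(Path(out_dir).glob(pattern)):
        if path.is_file():
            h.update(path.name.encode())
            h.update(path.read_bytes())
    return h.hexdigest()
```

With temperature=0.0 and a byte-stable prompt prefix, `digest_run(run_a) == digest_run(run_b)` should hold across repeated runs on the same hardware and llama.cpp build.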

Provenance

  • Base model: google/medgemma-1.5-4b-it
  • Base revision: see manifest_q5_k_m_pair.json

Licensing

This repository contains model artifacts (GGUF weights and a GGUF LoRA adapter) derived from google/medgemma-1.5-4b-it.

If you maintain a separate code repository (runners, evaluation scripts, docs), license the code and documentation separately (e.g. Apache-2.0 for code and CC BY 4.0 for docs) and keep clinical text out of git.
