MedGemma 1.5 4B (GGUF Q5_K_M) two-stage (Stage1 base + Stage2 LoRA adapter)

This repo is the contest submission artifact set for a two-stage clinical extraction pipeline:

  1. Stage1: medgemma-base-q5_k_m.gguf (base model) produces structured Stage1 outputs (JSON -> Markdown).
  2. Stage2: the same base GGUF plus a GGUF LoRA adapter produces KVT4 fact lines.
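The hand-off between the two stages can be sketched as below. Note this is an illustrative sketch only: the JSON field names ("domain", "findings") and the pipe-delimited KVT4 layout are assumptions, not the actual schema (which lives in schemas/readmission_domain_summary_sgr_v2.schema.json).

```python
import json

def stage1_json_to_markdown(stage1_json: str) -> str:
    """Render Stage1 structured output (JSON) as a Markdown section.

    The field names here ("domain", "findings") are hypothetical
    placeholders for whatever the real Stage1 schema defines.
    """
    doc = json.loads(stage1_json)
    lines = [f"## {doc['domain']}"]
    lines += [f"- {finding}" for finding in doc["findings"]]
    return "\n".join(lines)

def parse_kvt4_line(line: str) -> dict:
    """Split one Stage2 fact line into four fields.

    Assumes a pipe-delimited key/value/type/tag layout; the actual
    KVT4 format may differ.
    """
    key, value, vtype, tag = (part.strip() for part in line.split("|"))
    return {"key": key, "value": value, "type": vtype, "tag": tag}
```

Stage1's Markdown rendering is what the Stage2 prompt consumes, so keeping this conversion deterministic matters for cache reuse (see the notes below the server commands).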

Files kept (submission-relevant)

  • medgemma-base-q5_k_m.gguf (Stage1 base GGUF, Q5_K_M)
  • lora_stage2_all_hard200_20260207/lora_stage2_all_hard200_20260207-f16.gguf (Stage2 LoRA adapter for llama.cpp)
  • manifest_q5_k_m_pair.json (provenance + file inventory)

Recommended llama.cpp usage

Stage1 server (base only):

./llama-server \
  -m medgemma-base-q5_k_m.gguf \
  --host 127.0.0.1 --port 1245 \
  --alias medgemma-base-q5_k_m \
  -c 8192 -ngl 99 -t 4 --parallel 1

Stage2 server (base + LoRA adapter):

./llama-server \
  -m medgemma-base-q5_k_m.gguf \
  --lora lora_stage2_all_hard200_20260207/lora_stage2_all_hard200_20260207-f16.gguf \
  --host 127.0.0.1 --port 1246 \
  --alias medgemma-ft-lora-adapters-q5_k_m \
  -c 8192 -ngl 99 -t 4 --parallel 1 \
  --cache-prompt --cache-reuse 256

Notes:

  • --cache-prompt --cache-reuse 256 enables prompt KV cache reuse for Stage2 when the prompt prefix is byte-stable.
  • Use deterministic generation (temperature=0.0) for strict artifact reproducibility.
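What "byte-stable" means in practice: the instruction prefix sent to the Stage2 server must be identical byte-for-byte across requests, with only the per-note payload varying at the end. A minimal sketch (the prompt text here is a made-up placeholder, not the real Stage2 prompt):

```python
def build_stage2_prompt(system_prefix: str, stage1_markdown: str) -> str:
    """Concatenate a fixed instruction prefix with the per-note payload.

    For llama.cpp prompt-cache reuse (--cache-prompt --cache-reuse 256)
    to take effect, the prefix must be byte-identical across requests:
    no timestamps, request IDs, or locale-dependent formatting in it.
    """
    return system_prefix + "\n\n" + stage1_markdown

# Hypothetical fixed prefix; the real one comes from the Stage2 profile.
PREFIX = "Extract KVT4 fact lines from the clinical summary below."

a = build_stage2_prompt(PREFIX, "## Cardiology\n- EF 35%")
b = build_stage2_prompt(PREFIX, "## Nephrology\n- Cr 2.1")

# Shared prefix bytes are what the server's KV cache can reuse.
assert a.encode()[:len(PREFIX)] == b.encode()[:len(PREFIX)]
```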

MedGemma runner mapping (aliases, ports, flags)

The MedGemma repo runners typically call two OpenAI-compatible llama.cpp servers:

  • Stage1 URL: http://127.0.0.1:1245 with model name (alias) medgemma-base-q5_k_m
  • Stage2 URL: http://127.0.0.1:1246 with model name (alias) medgemma-ft-lora-adapters-q5_k_m
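Because both servers speak the OpenAI chat-completions API, a runner can target them with plain HTTP from the standard library. A minimal sketch, assuming only the standard /v1/chat/completions endpoint and response shape (the helper names are made up):

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str, max_tokens: int):
    """Build an OpenAI-compatible chat-completions request.

    temperature=0.0 matches the reproducibility guidance in this README.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,
        "max_tokens": max_tokens,
    }
    return f"{base_url}/v1/chat/completions", payload

def chat(base_url: str, model: str, prompt: str, max_tokens: int = 1536) -> str:
    """POST the request and return the assistant message text."""
    url, payload = build_chat_request(base_url, model, prompt, max_tokens)
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Usage would look like `chat("http://127.0.0.1:1245", "medgemma-base-q5_k_m", note_text)` for Stage1 and the port-1246 URL/alias for Stage2.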

The sequential runner (example invocation):

MEDGEMMA_STAGE2_PROFILE=curated10_tuning \
python3 scripts/run_two_stage_structured_sequential.py \
  --cohort-root <EHR_test_data_root> \
  --out-dir <out_dir> \
  --stage1-url http://127.0.0.1:1245 --stage1-model medgemma-base-q5_k_m \
  --stage1-profile sgr_v2 --schema-path schemas/readmission_domain_summary_sgr_v2.schema.json \
  --stage1-max-tokens 1536 --stage1-temperature 0.0 \
  --stage2-url http://127.0.0.1:1246 --stage2-model medgemma-ft-lora-adapters-q5_k_m \
  --stage2-scope all --stage2-max-tokens 768 --stage2-temperature 0.0 \
  --stage2-repetition-penalty 1.1

Reproducibility checklist

  • Use temperature=0.0 for both stages.
  • Keep Stage2 prompt prefix byte-stable if using llama.cpp prompt cache.
  • Use a fixed context size (-c 8192, as in the server commands above) and --parallel 1 for reproducible latency and retries.
  • Keep Stage2 profile consistent across runs (example: MEDGEMMA_STAGE2_PROFILE=curated10_tuning).
  • Log and preserve meta_stage1.json and meta_stage2.json from your run output directory.
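One way to verify that two runs were in fact byte-identical is to hash every artifact in each output directory and compare a single digest. This is a hedged sketch (the file pattern and directory layout are assumptions about your run output):

```python
import hashlib
from pathlib import Path

def digest_run(out_dir: str, pattern: str = "*") -> str:
    """Hash every matching file in a run directory, in sorted order,
    so two deterministic runs can be compared with one digest."""
    h = hashlib.sha256()
    for path in sorted(Path(out_dir).glob(pattern)):
        if path.is_file():
            h.update(path.name.encode())
            h.update(path.read_bytes())
    return h.hexdigest()
```

With temperature=0.0 and a byte-stable prompt prefix, `digest_run(run_a) == digest_run(run_b)` should hold across repeated runs on the same hardware and llama.cpp build.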

Provenance

  • Base model: google/medgemma-1.5-4b-it
  • Base revision: see manifest_q5_k_m_pair.json

Licensing

This repository contains model artifacts (GGUF weights and a GGUF LoRA adapter) derived from google/medgemma-1.5-4b-it.

If you maintain a separate code repository (runners, evaluation scripts, docs), license the code and documentation separately (e.g. Apache-2.0 for code and CC BY 4.0 for docs) and keep clinical text out of git.
