# Medical Document Understanding (GGUF)

GGUF-quantised versions of `Sebukpor/medical-document-understanding-v2` for CPU inference via llama.cpp. Optimised for Hugging Face free-tier CPU Spaces (2 vCPU, 16 GB RAM).
## Available Quants

| File | Size | Use case |
|---|---|---|
| `model-Q4_K_M.gguf` | ~2.5 GB | **Recommended**: best quality/size for deployment |
| `model-Q8_0.gguf` | ~4.5 GB | Near-lossless, slower |
| `model-F16.gguf` | ~8.5 GB | Reference; too large for free tier |
## Quick Start (llama-cpp-python)

```python
import base64

from llama_cpp import Llama
from llama_cpp.llama_chat_format import Qwen2VLChatHandler

# Vision encoder (see below)
chat_handler = Qwen2VLChatHandler(
    clip_model_path="mmproj-model-f16.gguf"
)

llm = Llama(
    model_path="model-Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=2048,
    n_threads=2,  # HF free tier has 2 vCPU
    verbose=False,
)

# Encode the input image as a base64 data URL
with open("opd_form.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

response = llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are an expert Medical Transcription AI. Extract all information into structured JSON.",
        },
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
                {"type": "text", "text": "Extract all information from this medical OPD form into structured JSON."},
            ],
        },
    ],
    max_tokens=1024,
    temperature=0.0,
)

print(response["choices"][0]["message"]["content"])
```
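Even at `temperature=0.0`, models prompted for structured JSON sometimes wrap the output in Markdown code fences or add surrounding prose. A small, hypothetical helper like the one below (not part of this repo) can make downstream parsing more robust:

```python
import json
import re

def extract_json(text: str):
    """Pull the first JSON object out of a model reply.

    Strips Markdown code fences if present, then falls back to locating
    the outermost braces before parsing.
    """
    fenced = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fenced:
        text = fenced.group(1)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        start, end = text.find("{"), text.rfind("}")
        if start != -1 and end > start:
            return json.loads(text[start:end + 1])
        raise
```

For example, `extract_json(response["choices"][0]["message"]["content"])` would return a Python dict whether or not the model fenced its answer.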
## HF Space Deployment

See the `app.py` included in this repo for a ready-to-deploy Gradio app.
## Base Model

Fine-tuned from `Qwen/Qwen3.5-4B` on handwritten Indian medical OPD forms.