FetchMerck-AI-Demo / README.md
jeremygracey-ai's picture
Bump Gradio SDK to 6.14.0
f1a917c verified
metadata
title: FetchMerck AI Demo
emoji: 🩺
colorFrom: yellow
colorTo: purple
sdk: gradio
sdk_version: 6.14.0
app_file: app.py
pinned: false
hf_oauth: true
hf_oauth_scopes:
  - inference-api
license: apache-2.0
short_description: Lightweight RAG demo for clinical decision support

🩺 FetchMerck AI β€” Demo

A lightweight, public demonstration of a Retrieval-Augmented Generation (RAG) pipeline for clinical decision support.

Gradio License Status


⚠️ Medical Disclaimer

This Space is an educational prototype only. It is not a medical device and must not be used for diagnosis, treatment, triage, or any clinical decision-making. Outputs may be inaccurate or incomplete. Always consult a licensed clinician for medical questions.


✨ What it does

FetchMerck AI demonstrates a minimal, end-to-end RAG pipeline you can read in a single afternoon:

  • Embeds your question with sentence-transformers/all-MiniLM-L6-v2.
  • Retrieves the top-k most similar passages from a small corpus via cosine similarity over a NumPy matrix (no vector DB needed).
  • Asks a hosted instruction-tuned LLM via the Hugging Face Inference API to answer only from the retrieved context.
  • Surfaces the source topic names alongside every answer, plus the standing disclaimer.

It ships with two corpus modes:

Mode When it activates Source
MedlinePlus corpus When data/corpus.jsonl and data/embeddings.npy are present NIH MedlinePlus Health Topics (public domain)
Tiny sample corpus Fallback, in-memory Built-in, ensures the demo always boots

πŸ”¬ How it works

  1. The user enters a clinical question.
  2. The query is embedded with MiniLM and L2-normalized.
  3. Cosine similarity is computed against the corpus matrix; the top-k passages are selected.
  4. Those passages are concatenated into a grounded context window.
  5. A hosted LLM is prompted to answer only from that context.
  6. The answer is rendered with source topic names and the medical disclaimer.

πŸ› οΈ Build the MedlinePlus corpus locally

A local-only ingest script lives at scripts/ingest_medline.py. It downloads the latest MedlinePlus Health Topics XML (public domain), chunks each topic summary, and embeds the chunks with MiniLM.

python -m venv .venv && source .venv/bin/activate
pip install -U sentence-transformers numpy lxml
python scripts/ingest_medline.py

This produces:

  • data/corpus.jsonl β€” one chunk per line: {id, topic, section, url, text}
  • data/embeddings.npy β€” float32 matrix, L2-normalized, shape (N, 384)

Optional environment variables for the script:

Variable Purpose
MEDLINE_XML_URL Pin a specific snapshot (e.g. https://medlineplus.gov/xml/mplus_topics_YYYY-MM-DD.xml.zip)
EMBED_MODEL Override the embedding model
CHUNK_TOKENS Chunk size in tokens (default 300)
CHUNK_OVERLAP Chunk overlap in tokens (default 50)

Then drag data/corpus.jsonl and data/embeddings.npy into the Files tab of this Space (under a top-level data/ folder). The Space will pick them up on next restart.


βš™οΈ Configuration

Optional environment variables / Space secrets:

Variable Default Purpose
HF_TOKEN β€” Hugging Face token (needed for gated or private generation models)
GEN_MODEL meta-llama/Llama-3.1-8B-Instruct Override the hosted generation model
EMBED_MODEL sentence-transformers/all-MiniLM-L6-v2 Override the embedding model

πŸ—ΊοΈ Roadmap

  • Lightweight publishable v0 with sample corpus
  • MedlinePlus ingest script + auto-load when uploaded
  • Add additional public-domain / openly licensed corpora (CDC, NICE OGL, OpenStax)
  • Move retrieval to a persistent vector store (e.g. Chroma) once the corpus grows
  • Optional local GGUF inference on GPU hardware

🚫 What this Space deliberately does not do

  • It does not include or redistribute the Merck Manuals or any other restricted, paywalled, or copyrighted clinical reference content.
  • It does not provide medical advice.

πŸ“š Attribution

Health-topic content used by the prebuilt corpus is adapted from MedlinePlus, a service of the U.S. National Library of Medicine, National Institutes of Health. MedlinePlus content is in the public domain and free to reuse.

This project is not affiliated with, endorsed by, or sponsored by NLM, NIH, or HHS.


πŸ“„ License

Released under the Apache License 2.0.