FetchMerck-AI-Demo / README.md
jeremygracey-ai's picture
Bump Gradio SDK to 6.14.0
f1a917c verified
---
title: FetchMerck AI Demo
emoji: ๐Ÿฉบ
colorFrom: yellow
colorTo: purple
sdk: gradio
sdk_version: 6.14.0
app_file: app.py
pinned: false
hf_oauth: true
hf_oauth_scopes:
- inference-api
license: apache-2.0
short_description: Lightweight RAG demo for clinical decision support
---
# ๐Ÿฉบ FetchMerck AI โ€” Demo
> A lightweight, public demonstration of a **Retrieval-Augmented Generation (RAG)** pipeline for clinical decision support.
[![Gradio](https://img.shields.io/badge/Gradio-6.14.0-orange)](https://www.gradio.app/)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0)
[![Status](https://img.shields.io/badge/status-demo-yellow)]()
---
## โš ๏ธ Medical Disclaimer
> **This Space is an educational prototype only.**
> It is **not** a medical device and **must not** be used for diagnosis, treatment, triage, or any clinical decision-making. Outputs may be inaccurate or incomplete. **Always consult a licensed clinician** for medical questions.
---
## โœจ What it does
FetchMerck AI demonstrates a minimal, end-to-end RAG pipeline you can read in a single afternoon:
- Embeds your question with `sentence-transformers/all-MiniLM-L6-v2`.
- Retrieves the top-k most similar passages from a small corpus via **cosine similarity over a NumPy matrix** (no vector DB needed).
- Asks a hosted instruction-tuned LLM via the **Hugging Face Inference API** to answer **only** from the retrieved context.
- Surfaces the source topic names alongside every answer, plus the standing disclaimer.
It ships with two corpus modes:
| Mode | When it activates | Source |
| --- | --- | --- |
| **MedlinePlus corpus** | When `data/corpus.jsonl` and `data/embeddings.npy` are present | NIH MedlinePlus Health Topics (public domain) |
| **Tiny sample corpus** | Fallback, in-memory | Built-in, ensures the demo always boots |
---
## ๐Ÿ”ฌ How it works
1. The user enters a clinical question.
2. The query is embedded with MiniLM and L2-normalized.
3. Cosine similarity is computed against the corpus matrix; the top-k passages are selected.
4. Those passages are concatenated into a grounded context window.
5. A hosted LLM is prompted to answer **only** from that context.
6. The answer is rendered with source topic names and the medical disclaimer.
---
## ๐Ÿ› ๏ธ Build the MedlinePlus corpus locally
A local-only ingest script lives at `scripts/ingest_medline.py`. It downloads the latest **MedlinePlus Health Topics XML** (public domain), chunks each topic summary, and embeds the chunks with MiniLM.
```bash
python -m venv .venv && source .venv/bin/activate
pip install -U sentence-transformers numpy lxml
python scripts/ingest_medline.py
```
This produces:
- `data/corpus.jsonl` โ€” one chunk per line: `{id, topic, section, url, text}`
- `data/embeddings.npy` โ€” float32 matrix, L2-normalized, shape `(N, 384)`
Optional environment variables for the script:
| Variable | Purpose |
| --- | --- |
| `MEDLINE_XML_URL` | Pin a specific snapshot (e.g. `https://medlineplus.gov/xml/mplus_topics_YYYY-MM-DD.xml.zip`) |
| `EMBED_MODEL` | Override the embedding model |
| `CHUNK_TOKENS` | Chunk size in tokens (default `300`) |
| `CHUNK_OVERLAP` | Chunk overlap in tokens (default `50`) |
Then drag `data/corpus.jsonl` and `data/embeddings.npy` into the **Files** tab of this Space (under a top-level `data/` folder). The Space will pick them up on next restart.
---
## โš™๏ธ Configuration
Optional environment variables / Space secrets:
| Variable | Default | Purpose |
| --- | --- | --- |
| `HF_TOKEN` | โ€” | Hugging Face token (needed for gated or private generation models) |
| `GEN_MODEL` | `meta-llama/Llama-3.1-8B-Instruct` | Override the hosted generation model |
| `EMBED_MODEL` | `sentence-transformers/all-MiniLM-L6-v2` | Override the embedding model |
---
## ๐Ÿ—บ๏ธ Roadmap
- [x] Lightweight publishable v0 with sample corpus
- [x] MedlinePlus ingest script + auto-load when uploaded
- [ ] Add additional public-domain / openly licensed corpora (CDC, NICE OGL, OpenStax)
- [ ] Move retrieval to a persistent vector store (e.g. Chroma) once the corpus grows
- [ ] Optional local GGUF inference on GPU hardware
---
## ๐Ÿšซ What this Space deliberately does **not** do
- It does **not** include or redistribute the *Merck Manuals* or any other restricted, paywalled, or copyrighted clinical reference content.
- It does **not** provide medical advice.
---
## ๐Ÿ“š Attribution
Health-topic content used by the prebuilt corpus is adapted from **MedlinePlus**, a service of the U.S. National Library of Medicine, National Institutes of Health. MedlinePlus content is in the public domain and free to reuse.
This project is **not affiliated with, endorsed by, or sponsored by** NLM, NIH, or HHS.
---
## ๐Ÿ“„ License
Released under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).