---
title: FetchMerck AI Demo
emoji: 🩺
colorFrom: yellow
colorTo: purple
sdk: gradio
sdk_version: 6.14.0
app_file: app.py
pinned: false
hf_oauth: true
hf_oauth_scopes:
- inference-api
license: apache-2.0
short_description: Lightweight RAG demo for clinical decision support
---

# 🩺 FetchMerck AI — Demo

> A lightweight, public demonstration of a **Retrieval-Augmented Generation (RAG)** pipeline for clinical decision support.

[![Gradio](https://img.shields.io/badge/Gradio-6.14.0-orange)](https://www.gradio.app/)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0)
![Status](https://img.shields.io/badge/status-demo-yellow)

---

## ⚠️ Medical Disclaimer

> **This Space is an educational prototype only.**
> It is **not** a medical device and **must not** be used for diagnosis, treatment, triage, or any clinical decision-making. Outputs may be inaccurate or incomplete. **Always consult a licensed clinician** for medical questions.

---

## ✨ What it does

FetchMerck AI demonstrates a minimal, end-to-end RAG pipeline you can read in a single afternoon:

- Embeds your question with `sentence-transformers/all-MiniLM-L6-v2`.
- Retrieves the top-k most similar passages from a small corpus via **cosine similarity over a NumPy matrix** (no vector DB needed).
- Asks a hosted instruction-tuned LLM via the **Hugging Face Inference API** to answer **only** from the retrieved context.
- Surfaces the source topic names alongside every answer, plus the standing disclaimer.
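
The retrieval bullet above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code in `app.py`: the helper names `l2_normalize` and `top_k` are hypothetical, and toy 3-dimensional vectors stand in for real MiniLM embeddings.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row (or a single vector) to unit length."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)  # clip avoids division by zero


def top_k(query_vec: np.ndarray, corpus: np.ndarray, k: int = 3) -> list[tuple[int, float]]:
    """Return (row index, cosine score) pairs for the k best-matching rows.

    Assumes the rows of `corpus` are already L2-normalized, so a plain
    dot product equals cosine similarity.
    """
    q = l2_normalize(query_vec.astype(np.float32))
    scores = corpus @ q                  # shape (N,)
    k = min(k, scores.shape[0])
    best = np.argsort(-scores)[:k]       # highest score first
    return [(int(i), float(scores[i])) for i in best]


# Toy 3-dimensional vectors stand in for 384-dimensional MiniLM embeddings.
corpus = l2_normalize(np.array([[1.0, 0.0, 0.0],
                                [0.0, 1.0, 0.0],
                                [1.0, 1.0, 0.0]], dtype=np.float32))
hits = top_k(np.array([2.0, 0.1, 0.0]), corpus, k=2)
print(hits)
```

Because both sides are unit length, one matrix-vector product scores the whole corpus, which is why a small corpus does not need a vector database.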

It ships with two corpus modes:

| Mode | When it activates | Source |
| --- | --- | --- |
| **MedlinePlus corpus** | When `data/corpus.jsonl` and `data/embeddings.npy` are present | NIH MedlinePlus Health Topics (public domain) |
| **Tiny sample corpus** | Fallback, in-memory | Built-in, ensures the demo always boots |
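
One way to express the mode switch in the table is a simple file-existence check. The sketch below is an assumption about how such a fallback could look, not the actual loader: `select_corpus`, `SAMPLE_CORPUS`, and the mode labels are hypothetical names.

```python
import json
from pathlib import Path

# Hypothetical stand-in for the built-in sample corpus.
SAMPLE_CORPUS = [
    {"id": "sample-0", "topic": "Sample topic", "text": "Placeholder passage."},
]


def select_corpus(data_dir: Path) -> tuple[str, list[dict]]:
    """Return (mode, records); use the sample unless BOTH files exist."""
    corpus_path = data_dir / "corpus.jsonl"
    emb_path = data_dir / "embeddings.npy"
    if corpus_path.is_file() and emb_path.is_file():
        with corpus_path.open(encoding="utf-8") as fh:
            records = [json.loads(line) for line in fh if line.strip()]
        return "medlineplus", records
    return "sample", SAMPLE_CORPUS


mode, records = select_corpus(Path("data"))
print(mode, len(records))
```

Requiring both files guards against a half-finished upload, where the records and the embedding matrix would otherwise be out of sync.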

---

## 🔬 How it works

1. The user enters a clinical question.
2. The query is embedded with MiniLM and L2-normalized.
3. Cosine similarity is computed against the corpus matrix; the top-k passages are selected.
4. Those passages are concatenated into a grounded context window.
5. A hosted LLM is prompted to answer **only** from that context.
6. The answer is rendered with source topic names and the medical disclaimer.
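
Steps 4 to 6 can be sketched as plain string handling. The prompt wording, the `max_chars` budget, and the helper names `build_prompt` and `source_topics` are illustrative assumptions; the real prompt in `app.py` may differ. The passages are placeholders, not corpus content.

```python
def build_prompt(question: str, passages: list[dict], max_chars: int = 4000) -> str:
    """Join retrieved passages into a context block and wrap it in instructions."""
    blocks = []
    used = 0
    for p in passages:
        block = f"[{p['topic']}]\n{p['text']}"
        if used + len(block) > max_chars:
            break  # keep the context within a rough character budget
        blocks.append(block)
        used += len(block)
    context = "\n\n".join(blocks)
    return (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def source_topics(passages: list[dict]) -> list[str]:
    """Unique topic names in retrieval order, for display next to the answer."""
    return list(dict.fromkeys(p["topic"] for p in passages))


passages = [
    {"topic": "Asthma", "text": "Placeholder passage A."},
    {"topic": "Asthma", "text": "Placeholder passage B."},
    {"topic": "Allergy", "text": "Placeholder passage C."},
]
prompt = build_prompt("What triggers asthma?", passages)
print(source_topics(passages))  # ['Asthma', 'Allergy']
```

Deduplicating topics with `dict.fromkeys` keeps retrieval order, so the most relevant source is listed first.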

---

## 🛠️ Build the MedlinePlus corpus locally

A local-only ingest script lives at `scripts/ingest_medline.py`. It downloads the latest **MedlinePlus Health Topics XML** (public domain), chunks each topic summary, and embeds the chunks with MiniLM.

```bash
python -m venv .venv && source .venv/bin/activate
pip install -U sentence-transformers numpy lxml
python scripts/ingest_medline.py
```

This produces:

- `data/corpus.jsonl` — one chunk per line: `{id, topic, section, url, text}`
- `data/embeddings.npy` — float32 matrix, L2-normalized, shape `(N, 384)`
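
Before uploading, it can help to check that the two artifacts agree with the layout described above. The following is a hedged sketch: `validate_corpus` is a hypothetical helper, and the synthetic records and random matrix exist only to exercise it. For real files you would pass the lines of `data/corpus.jsonl` and the result of `np.load("data/embeddings.npy")`.

```python
import json

import numpy as np

REQUIRED_KEYS = {"id", "topic", "section", "url", "text"}


def validate_corpus(jsonl_lines: list[str], embeddings: np.ndarray, dim: int = 384) -> None:
    """Raise ValueError if the records and the matrix do not line up."""
    records = [json.loads(line) for line in jsonl_lines if line.strip()]
    if embeddings.shape != (len(records), dim):
        raise ValueError(f"expected shape ({len(records)}, {dim}), got {embeddings.shape}")
    if embeddings.dtype != np.float32:
        raise ValueError(f"expected float32, got {embeddings.dtype}")
    for i, rec in enumerate(records):
        missing = REQUIRED_KEYS - rec.keys()
        if missing:
            raise ValueError(f"record {i} is missing keys: {sorted(missing)}")
    norms = np.linalg.norm(embeddings, axis=1)
    if not np.allclose(norms, 1.0, atol=1e-3):
        raise ValueError("rows are not L2-normalized")


# Synthetic stand-ins for the real artifacts.
rng = np.random.default_rng(0)
emb = rng.normal(size=(2, 384)).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
lines = [
    json.dumps({"id": str(i), "topic": "T", "section": "summary",
                "url": "https://example.org", "text": "x"})
    for i in range(2)
]
validate_corpus(lines, emb)  # returns None when everything matches
```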

Optional environment variables for the script:

| Variable | Purpose |
| --- | --- |
| `MEDLINE_XML_URL` | Pin a specific snapshot (e.g. `https://medlineplus.gov/xml/mplus_topics_YYYY-MM-DD.xml.zip`) |
| `EMBED_MODEL` | Override the embedding model |
| `CHUNK_TOKENS` | Chunk size in tokens (default `300`) |
| `CHUNK_OVERLAP` | Chunk overlap in tokens (default `50`) |
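
To make `CHUNK_TOKENS` and `CHUNK_OVERLAP` concrete, here is a minimal sliding-window chunker. It is a sketch under assumptions: `chunk_tokens` is a hypothetical name, and the input is treated as a plain list of tokens, whereas the ingest script may tokenize differently.

```python
import os


def chunk_tokens(tokens: list[str], size: int = 300, overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end
    return chunks


# Reading the two variables with the defaults from the table above.
size = int(os.environ.get("CHUNK_TOKENS", "300"))
overlap = int(os.environ.get("CHUNK_OVERLAP", "50"))

words = [f"w{i}" for i in range(700)]
chunks = chunk_tokens(words, size=300, overlap=50)
print([len(c) for c in chunks])  # [300, 300, 200]
```

With the defaults, each window starts 250 tokens after the previous one, so adjacent chunks repeat 50 tokens and a sentence that straddles a boundary still appears whole in at least one chunk.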

Then upload `data/corpus.jsonl` and `data/embeddings.npy` through the **Files** tab of this Space, placing both inside a top-level `data/` folder. The Space picks them up on its next restart.

---

## ⚙️ Configuration

Optional environment variables / Space secrets:

| Variable | Default | Purpose |
| --- | --- | --- |
| `HF_TOKEN` | — | Hugging Face token (needed for gated or private generation models) |
| `GEN_MODEL` | `meta-llama/Llama-3.1-8B-Instruct` | Override the hosted generation model |
| `EMBED_MODEL` | `sentence-transformers/all-MiniLM-L6-v2` | Override the embedding model |
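
The table maps naturally onto a small settings object. This is one possible way to read the variables, not necessarily how `app.py` does it: `Settings` and `load_settings` are hypothetical names, and only the defaults come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    hf_token: Optional[str]
    gen_model: str
    embed_model: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read the three variables, treating empty strings as unset."""
    env = os.environ if env is None else env
    return Settings(
        hf_token=env.get("HF_TOKEN") or None,
        gen_model=env.get("GEN_MODEL") or "meta-llama/Llama-3.1-8B-Instruct",
        embed_model=env.get("EMBED_MODEL") or "sentence-transformers/all-MiniLM-L6-v2",
    )


settings = load_settings({"GEN_MODEL": "my-org/my-model"})  # placeholder model id
print(settings.gen_model, settings.embed_model)
```

Passing the environment in as a mapping keeps the function easy to test without touching the real process environment.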

---

## 🗺️ Roadmap

- [x] Lightweight publishable v0 with sample corpus
- [x] MedlinePlus ingest script + auto-load when uploaded
- [ ] Add additional public-domain / openly licensed corpora (CDC, NICE OGL, OpenStax)
- [ ] Move retrieval to a persistent vector store (e.g. Chroma) once the corpus grows
- [ ] Optional local GGUF inference on GPU hardware

---

## 🚫 What this Space deliberately does **not** do

- It does **not** include or redistribute the *Merck Manuals* or any other restricted, paywalled, or copyrighted clinical reference content.
- It does **not** provide medical advice.

---

## 📚 Attribution

Health-topic content used by the prebuilt corpus is adapted from **MedlinePlus**, a service of the U.S. National Library of Medicine, National Institutes of Health. The MedlinePlus Health Topics summaries produced by NLM are in the public domain and free to reuse.

This project is **not affiliated with, endorsed by, or sponsored by** NLM, NIH, or HHS.

---

## 📄 License

Released under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).