---
title: FetchMerck AI Demo
emoji: 🩺
colorFrom: yellow
colorTo: purple
sdk: gradio
sdk_version: 6.14.0
app_file: app.py
pinned: false
hf_oauth: true
hf_oauth_scopes:
- inference-api
license: apache-2.0
short_description: Lightweight RAG demo for clinical decision support
---

# 🩺 FetchMerck AI — Demo

> A lightweight, public demonstration of a **Retrieval-Augmented Generation (RAG)** pipeline for clinical decision support.

[![Gradio](https://img.shields.io/badge/Gradio-6.14.0-orange)](https://www.gradio.app/)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0)
![Status](https://img.shields.io/badge/status-demo-yellow)

---

## ⚠️ Medical Disclaimer

> **This Space is an educational prototype only.**
> It is **not** a medical device and **must not** be used for diagnosis, treatment, triage, or any clinical decision-making. Outputs may be inaccurate or incomplete. **Always consult a licensed clinician** for medical questions.

---

## ✨ What it does

FetchMerck AI demonstrates a minimal, end-to-end RAG pipeline you can read in a single afternoon:

- Embeds your question with `sentence-transformers/all-MiniLM-L6-v2`.
- Retrieves the top-k most similar passages from a small corpus via **cosine similarity over a NumPy matrix** (no vector DB needed).
- Asks a hosted instruction-tuned LLM via the **Hugging Face Inference API** to answer **only** from the retrieved context.
- Surfaces the source topic names alongside every answer, plus the standing disclaimer.
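The retrieval step can be sketched with NumPy alone. This is a minimal illustration, not the code in `app.py`: the helper names `l2_normalize` and `top_k` are hypothetical, and the toy 3-dimensional vectors stand in for 384-dimensional MiniLM embeddings.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length along the last axis (zero vectors stay zero)."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def top_k(query_vec: np.ndarray, corpus: np.ndarray, k: int = 3):
    """Return (indices, scores) of the k most similar corpus rows, best first.

    Assumes `corpus` rows are already L2-normalized, so a dot product
    equals cosine similarity.
    """
    q = l2_normalize(query_vec.astype(np.float32))
    scores = corpus @ q
    k = min(k, len(scores))
    idx = np.argsort(-scores)[:k]
    return idx, scores[idx]


# Toy vectors (assumption: real rows would come from the embedding model).
corpus = l2_normalize(
    np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]], dtype=np.float32)
)
indices, scores = top_k(np.array([1.0, 0.1, 0.0]), corpus, k=2)
print(indices, scores)
```

For a corpus of a few thousand chunks, a full `argsort` over one matrix-vector product is typically fast enough that no index structure is required.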

It ships with two corpus modes:

| Mode | When it activates | Source |
| --- | --- | --- |
| **MedlinePlus corpus** | When `data/corpus.jsonl` and `data/embeddings.npy` are present | NIH MedlinePlus Health Topics (public domain) |
| **Tiny sample corpus** | Fallback, in-memory | Built-in, ensures the demo always boots |
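One way the mode selection in the table could work is sketched below. The function name `load_corpus`, the placeholder `SAMPLE_CORPUS`, and the choice to return `None` for the sample embeddings (so they can be computed at startup) are assumptions for illustration; the actual logic in `app.py` may differ.

```python
import json
from pathlib import Path

import numpy as np

# Hypothetical stand-in for the built-in sample corpus.
SAMPLE_CORPUS = [
    {"topic": "Sample topic A", "text": "Placeholder passage A."},
    {"topic": "Sample topic B", "text": "Placeholder passage B."},
]


def load_corpus(data_dir: str = "data"):
    """Return (mode, records, embeddings); fall back to the sample corpus."""
    data = Path(data_dir)
    corpus_path = data / "corpus.jsonl"
    emb_path = data / "embeddings.npy"
    if corpus_path.is_file() and emb_path.is_file():
        with corpus_path.open(encoding="utf-8") as fh:
            records = [json.loads(line) for line in fh if line.strip()]
        embeddings = np.load(emb_path)
        # Only trust the files if rows and records line up one-to-one.
        if len(records) == embeddings.shape[0]:
            return "medlineplus", records, embeddings
    # Assumption: sample embeddings are computed later, at startup.
    return "sample", SAMPLE_CORPUS, None


mode, records, embeddings = load_corpus()
print(f"corpus mode: {mode}, {len(records)} records")
```

Requiring both files, and checking that their lengths agree, keeps a half-finished upload from silently pairing passages with the wrong vectors.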

---

## 🔬 How it works

1. The user enters a clinical question.
2. The query is embedded with MiniLM and L2-normalized.
3. Cosine similarity is computed against the corpus matrix; the top-k passages are selected.
4. Those passages are concatenated into a grounded context window.
5. A hosted LLM is prompted to answer **only** from that context.
6. The answer is rendered with source topic names and the medical disclaimer.
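Steps 4 to 6 can be illustrated with plain string handling. The prompt wording, the helper names `build_prompt` and `source_topics`, and the placeholder passages are all assumptions; the Space's real prompt template is not reproduced here.

```python
def build_prompt(question: str, passages: list[dict]) -> str:
    """Concatenate retrieved passages into a context block and wrap the question."""
    context = "\n\n".join(
        f"[{i}] {p['topic']}: {p['text']}" for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer the question using ONLY the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def source_topics(passages: list[dict]) -> list[str]:
    """Unique topic names in retrieval order, for display next to the answer."""
    seen: list[str] = []
    for p in passages:
        if p["topic"] not in seen:
            seen.append(p["topic"])
    return seen


# Placeholder passages (assumption: real ones come from the retrieval step).
passages = [
    {"topic": "Topic A", "text": "First placeholder passage."},
    {"topic": "Topic B", "text": "Second placeholder passage."},
    {"topic": "Topic A", "text": "Third placeholder passage."},
]
prompt = build_prompt("Example question?", passages)
print(prompt)
print("Sources:", ", ".join(source_topics(passages)))
```

Numbering the passages in the context makes it easier to inspect which chunk the model drew on when an answer looks wrong.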

---

## 🛠️ Build the MedlinePlus corpus locally

A local-only ingest script lives at `scripts/ingest_medline.py`. It downloads the latest **MedlinePlus Health Topics XML** (public domain), chunks each topic summary, and embeds the chunks with MiniLM.

```bash
python -m venv .venv && source .venv/bin/activate
pip install -U sentence-transformers numpy lxml
python scripts/ingest_medline.py
```

This produces:

- `data/corpus.jsonl` — one chunk per line: `{id, topic, section, url, text}`
- `data/embeddings.npy` — float32 matrix, L2-normalized, shape `(N, 384)`
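Before uploading, it may be worth checking that the two files match the format described above. The following is a sketch of such a check; `check_artifacts` is a hypothetical helper, not part of `scripts/ingest_medline.py`, and the tolerance is an arbitrary choice.

```python
import numpy as np

REQUIRED_KEYS = {"id", "topic", "section", "url", "text"}


def check_artifacts(records: list[dict], embeddings: np.ndarray,
                    dim: int = 384, tol: float = 1e-3) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if embeddings.dtype != np.float32:
        problems.append(f"expected float32, got {embeddings.dtype}")
    if embeddings.ndim != 2 or embeddings.shape[1] != dim:
        problems.append(f"expected shape (N, {dim}), got {embeddings.shape}")
    elif len(records) != embeddings.shape[0]:
        problems.append(
            f"{len(records)} records but {embeddings.shape[0]} embedding rows"
        )
    if embeddings.ndim == 2 and embeddings.size:
        norms = np.linalg.norm(embeddings, axis=1)
        if not np.allclose(norms, 1.0, atol=tol):
            problems.append("rows are not L2-normalized")
    for i, rec in enumerate(records):
        missing = REQUIRED_KEYS - set(rec)
        if missing:
            problems.append(f"record {i} is missing keys: {sorted(missing)}")
    return problems


# Synthetic data for demonstration only.
rng = np.random.default_rng(0)
emb = rng.normal(size=(2, 384)).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
recs = [{"id": str(i), "topic": "T", "section": "S", "url": "", "text": "x"}
        for i in range(2)]
print(check_artifacts(recs, emb) or "artifacts look consistent")
```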

Optional environment variables for the script:

| Variable | Purpose |
| --- | --- |
| `MEDLINE_XML_URL` | Pin a specific snapshot (e.g. `https://medlineplus.gov/xml/mplus_topics_YYYY-MM-DD.xml.zip`) |
| `EMBED_MODEL` | Override the embedding model |
| `CHUNK_TOKENS` | Chunk size in tokens (default `300`) |
| `CHUNK_OVERLAP` | Chunk overlap in tokens (default `50`) |

Then upload `data/corpus.jsonl` and `data/embeddings.npy` through the **Files** tab of this Space, placing them in a top-level `data/` folder. The Space picks them up on its next restart.
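To make the effect of `CHUNK_TOKENS` and `CHUNK_OVERLAP` concrete, here is a sketch of a sliding-window chunker. It assumes tokens are simply items in a list (for example, whitespace-split words); the ingest script may tokenize differently, and `chunk_tokens` is a hypothetical name.

```python
import os


def chunk_tokens(tokens: list, size: int = 300, overlap: int = 50) -> list[list]:
    """Split `tokens` into windows of `size` that share `overlap` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


# Defaults mirror the documented values; the environment can override them.
size = int(os.environ.get("CHUNK_TOKENS", "300"))
overlap = int(os.environ.get("CHUNK_OVERLAP", "50"))

demo = chunk_tokens(list(range(700)), size=300, overlap=50)
print([len(c) for c in demo])
```

With the defaults, each window starts 250 tokens after the previous one, so a sentence that straddles a boundary usually appears whole in at least one chunk.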

---

## ⚙️ Configuration

Optional environment variables / Space secrets:

| Variable | Default | Purpose |
| --- | --- | --- |
| `HF_TOKEN` | — | Hugging Face token (needed for gated or private generation models) |
| `GEN_MODEL` | `meta-llama/Llama-3.1-8B-Instruct` | Override the hosted generation model |
| `EMBED_MODEL` | `sentence-transformers/all-MiniLM-L6-v2` | Override the embedding model |
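Reading these variables with their documented defaults might look like the sketch below. The `Settings` dataclass and `load_settings` function are hypothetical names chosen for illustration; only the variable names and defaults come from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    hf_token: Optional[str]
    gen_model: str
    embed_model: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build settings from environment variables, applying documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        # Treat an empty string the same as an unset token.
        hf_token=env.get("HF_TOKEN") or None,
        gen_model=env.get("GEN_MODEL", "meta-llama/Llama-3.1-8B-Instruct"),
        embed_model=env.get("EMBED_MODEL", "sentence-transformers/all-MiniLM-L6-v2"),
    )


settings = load_settings()
print(settings.gen_model, settings.embed_model)
```

Passing a plain dictionary as `env` makes the function easy to exercise in tests without touching the real environment.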

---

## 🗺️ Roadmap

- [x] Lightweight publishable v0 with sample corpus
- [x] MedlinePlus ingest script + auto-load when uploaded
- [ ] Add additional public-domain / openly licensed corpora (CDC, NICE OGL, OpenStax)
- [ ] Move retrieval to a persistent vector store (e.g. Chroma) once the corpus grows
- [ ] Optional local GGUF inference on GPU hardware

---

## 🚫 What this Space deliberately does **not** do

- It does **not** include or redistribute the *Merck Manuals* or any other restricted, paywalled, or copyrighted clinical reference content.
- It does **not** provide medical advice.

---

## 📚 Attribution

Health-topic content used by the prebuilt corpus is adapted from **MedlinePlus**, a service of the U.S. National Library of Medicine, National Institutes of Health. MedlinePlus content is in the public domain and free to reuse.

This project is **not affiliated with, endorsed by, or sponsored by** NLM, NIH, or HHS.

---

## 📄 License

Released under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).