Document MedlinePlus ingest workflow + attribution; update roadmap
README.md
CHANGED
|
@@ -21,8 +21,9 @@ pipeline for clinical decision support.
|
|
| 21 |
|
| 22 |
This Space uses:
|
| 23 |
|
| 24 |
-
- A
|
| 25 |
-
|
|
|
|
| 26 |
- `sentence-transformers/all-MiniLM-L6-v2` for embeddings.
|
| 27 |
- Cosine-similarity retrieval over a NumPy matrix (no vector DB).
|
| 28 |
- A hosted generation model via the Hugging Face Inference API.
|
|
@@ -37,32 +38,68 @@ licensed clinician for medical questions.
|
|
| 37 |
## How it works
|
| 38 |
|
| 39 |
1. The user enters a clinical question.
|
| 40 |
-
2. The query is embedded and compared against the
|
| 41 |
3. The top-k passages are concatenated as grounded context.
|
| 42 |
4. A hosted instruction-tuned LLM is asked to answer **only** from that context.
|
| 43 |
-
5. The response is shown along with the source
|
| 44 |
|
| 45 |
## Configuration
|
| 46 |
|
| 47 |
Optional environment variables / Space secrets:
|
| 48 |
|
| 49 |
-
- `HF_TOKEN` — Hugging Face token (needed
|
| 50 |
- `GEN_MODEL` — override the generation model (default: `meta-llama/Llama-3.1-8B-Instruct`).
|
|
|
|
| 51 |
|
| 52 |
## Roadmap
|
| 53 |
|
| 54 |
-
|
| 55 |
-
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
2. Move retrieval to a persistent vector store (e.g., Chroma) once the corpus grows.
|
| 60 |
-
3. Pre-build and ship a vector index alongside the Space.
|
| 61 |
-
4. Optionally add local GGUF inference on GPU hardware.
|
| 62 |
|
| 63 |
## What this Space deliberately does **not** do
|
| 64 |
|
| 65 |
- It does **not** include or redistribute the Merck Manuals or any other
|
| 66 |
restricted, paywalled, or copyrighted clinical reference content.
|
| 67 |
-
- It does **not** persist user data; the in-memory index is rebuilt each cold start.
|
| 68 |
- It does **not** provide medical advice.
|
|
|
|
| 21 |
|
| 22 |
This Space uses:
|
| 23 |
|
| 24 |
+
- A prebuilt **MedlinePlus**-derived corpus when the files
|
| 25 |
+
`data/corpus.jsonl` and `data/embeddings.npy` are present in the repo.
|
| 26 |
+
- Otherwise, a tiny in-memory **sample corpus** so the demo always works.
|
| 27 |
- `sentence-transformers/all-MiniLM-L6-v2` for embeddings.
|
| 28 |
- Cosine-similarity retrieval over a NumPy matrix (no vector DB).
|
| 29 |
- A hosted generation model via the Hugging Face Inference API.
|
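The choice between the prebuilt corpus and the sample corpus can be sketched as a simple file-existence check. This is a minimal illustration, not the Space's actual code: the function name `load_corpus`, the `SAMPLE_CORPUS` placeholder, and the returned `source` label are all hypothetical.

```python
import json
from pathlib import Path

import numpy as np

# Hypothetical stand-in for the tiny in-memory sample corpus.
SAMPLE_CORPUS = [{"id": "sample-0", "topic": "Sample", "text": "Placeholder passage."}]


def load_corpus(data_dir: str = "data"):
    """Return (chunks, embeddings, source), preferring the prebuilt files."""
    corpus_path = Path(data_dir) / "corpus.jsonl"
    emb_path = Path(data_dir) / "embeddings.npy"
    if corpus_path.is_file() and emb_path.is_file():
        with corpus_path.open(encoding="utf-8") as fh:
            chunks = [json.loads(line) for line in fh if line.strip()]
        embeddings = np.load(emb_path)
        # Each row of the matrix must correspond to exactly one chunk.
        if len(chunks) != embeddings.shape[0]:
            raise ValueError("corpus.jsonl and embeddings.npy are out of sync")
        return chunks, embeddings, "prebuilt"
    # Fallback: no embeddings yet; the caller would embed the sample at startup.
    return SAMPLE_CORPUS, None, "sample"
```

Requiring both files before switching to the prebuilt corpus avoids a half-uploaded state in which passages and vectors no longer line up.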
|
|
|
| 38 |
## How it works
|
| 39 |
|
| 40 |
1. The user enters a clinical question.
|
| 41 |
+
2. The query is embedded and compared against the corpus by cosine similarity.
|
| 42 |
3. The top-k passages are concatenated as grounded context.
|
| 43 |
4. A hosted instruction-tuned LLM is asked to answer **only** from that context.
|
| 44 |
+
5. The response is shown along with the source topic names and a disclaimer.
|
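Steps 2 and 3 above can be illustrated with plain NumPy. The sketch below is an assumption about how such retrieval is typically written, not a copy of the Space's code; `top_k` is a hypothetical helper, and the toy 3-dimensional vectors stand in for real 384-dimensional MiniLM embeddings.

```python
import numpy as np


def top_k(query_vec: np.ndarray, embeddings: np.ndarray, k: int = 3) -> list[tuple[int, float]]:
    """Return (row index, cosine score) pairs for the k best-matching rows."""
    # Normalizing both sides makes the dot product equal to cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    m = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = m @ q
    k = min(k, len(scores))
    best = np.argsort(-scores)[:k]  # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in best]


# Toy data: three "passages" and one "query".
toy_embeddings = np.array(
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]], dtype=np.float32
)
toy_query = np.array([1.0, 0.1, 0.0], dtype=np.float32)
hits = top_k(toy_query, toy_embeddings, k=2)
print(hits)  # rows 0 and 2 rank highest for this query
```

If the stored matrix is already L2-normalized, the second normalization is redundant but harmless; it keeps the helper correct for unnormalized input as well.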
| 45 |
+
|
| 46 |
+
## Building the MedlinePlus corpus locally
|
| 47 |
+
|
| 48 |
+
A local-only ingest script is included at `scripts/ingest_medline.py`. It
|
| 49 |
+
downloads the latest MedlinePlus Health Topics XML (public domain), chunks
|
| 50 |
+
each topic summary, and embeds each chunk with MiniLM.
|
| 51 |
+
|
| 52 |
+
### Run locally
|
| 53 |
+
|
| 54 |
+
```bash
|
| 55 |
+
python -m venv .venv && source .venv/bin/activate
|
| 56 |
+
pip install -U sentence-transformers numpy lxml
|
| 57 |
+
python scripts/ingest_medline.py
|
| 58 |
+
```
|
| 59 |
+
|
| 60 |
+
This produces:
|
| 61 |
+
|
| 62 |
+
- `data/corpus.jsonl` — one chunk per line: `{id, topic, section, url, text}`
|
| 63 |
+
- `data/embeddings.npy` — float32 matrix, L2-normalized, shape `(N, 384)`
|
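Before uploading, it may help to confirm that the two files agree with the format described above. The following check is a sketch under that description only; `check_artifacts` and the demo values are hypothetical and not part of `scripts/ingest_medline.py`.

```python
import numpy as np

# Keys listed for each corpus.jsonl record.
REQUIRED_KEYS = {"id", "topic", "section", "url", "text"}


def check_artifacts(chunks: list[dict], embeddings: np.ndarray, dim: int = 384) -> None:
    """Raise ValueError if chunks and embeddings do not match the documented format."""
    if embeddings.dtype != np.float32:
        raise ValueError(f"expected float32, got {embeddings.dtype}")
    if embeddings.shape != (len(chunks), dim):
        raise ValueError(f"expected shape {(len(chunks), dim)}, got {embeddings.shape}")
    norms = np.linalg.norm(embeddings, axis=1)
    if not np.allclose(norms, 1.0, atol=1e-3):
        raise ValueError("rows are not L2-normalized")
    for i, chunk in enumerate(chunks):
        missing = REQUIRED_KEYS - chunk.keys()
        if missing:
            raise ValueError(f"chunk {i} is missing keys: {sorted(missing)}")


# Placeholder records and random unit vectors, for illustration only.
demo_chunks = [
    {"id": f"demo-{i}", "topic": "Demo", "section": "summary", "url": "", "text": "..."}
    for i in range(2)
]
rng = np.random.default_rng(0)
demo_embeddings = rng.normal(size=(2, 384)).astype(np.float32)
demo_embeddings /= np.linalg.norm(demo_embeddings, axis=1, keepdims=True)
check_artifacts(demo_chunks, demo_embeddings)  # passes silently
```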
| 64 |
+
|
| 65 |
+
Optional environment variables for the script:
|
| 66 |
+
|
| 67 |
+
- `MEDLINE_XML_URL` — pin a specific snapshot (e.g. `https://medlineplus.gov/xml/mplus_topics_YYYY-MM-DD.xml.zip`).
|
| 68 |
+
- `EMBED_MODEL` — override the embedding model.
|
| 69 |
+
- `CHUNK_TOKENS`, `CHUNK_OVERLAP` — tune chunk size (defaults: 300 / 50).
|
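To make the effect of `CHUNK_TOKENS` and `CHUNK_OVERLAP` concrete, here is a sliding-window chunker over whitespace-separated words. It is only an approximation: how the ingest script actually counts tokens is not stated here, and `chunk_words` is a hypothetical name.

```python
def chunk_words(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# A 700-word toy document with the default 300 / 50 settings.
demo_text = " ".join(f"w{i}" for i in range(700))
demo_chunks = chunk_words(demo_text, size=300, overlap=50)
print(len(demo_chunks))  # 3 windows, starting at words 0, 250, and 500
```

A larger overlap reduces the chance that a sentence relevant to a query is split across two chunks, at the cost of more rows in the embedding matrix.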
| 70 |
+
|
| 71 |
+
### Upload to the Space
|
| 72 |
+
|
| 73 |
+
Drag `data/corpus.jsonl` and `data/embeddings.npy` into the Files tab of this
|
| 74 |
+
Space (under a top-level `data/` folder). The Space will pick them up on the next
|
| 75 |
+
restart.
|
| 76 |
+
|
| 77 |
+
## MedlinePlus attribution
|
| 78 |
+
|
| 79 |
+
Health-topic content used by the prebuilt corpus is adapted from
|
| 80 |
+
**MedlinePlus**, a service of the U.S. National Library of Medicine, National
|
| 81 |
+
Institutes of Health. MedlinePlus content is in the public domain and free to
|
| 82 |
+
reuse. This project is not affiliated with, endorsed by, or sponsored by NLM,
|
| 83 |
+
NIH, or HHS.
|
| 84 |
|
| 85 |
## Configuration
|
| 86 |
|
| 87 |
Optional environment variables / Space secrets:
|
| 88 |
|
| 89 |
+
- `HF_TOKEN` — Hugging Face token (needed for gated or private generation models).
|
| 90 |
- `GEN_MODEL` — override the generation model (default: `meta-llama/Llama-3.1-8B-Instruct`).
|
| 91 |
+
- `EMBED_MODEL` — override the embedding model (default: `sentence-transformers/all-MiniLM-L6-v2`).
|
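One way an app might resolve these settings is to read each variable and fall back to the documented default. This is a sketch, not the Space's actual startup code; `read_settings` and the dictionary keys are hypothetical.

```python
import os
from typing import Mapping, Optional

# Defaults as documented in the list above.
DEFAULTS = {
    "GEN_MODEL": "meta-llama/Llama-3.1-8B-Instruct",
    "EMBED_MODEL": "sentence-transformers/all-MiniLM-L6-v2",
}


def read_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Resolve settings from the environment, treating empty values as unset."""
    env = os.environ if env is None else env
    return {
        "hf_token": env.get("HF_TOKEN") or None,  # optional; only for gated or private models
        "gen_model": env.get("GEN_MODEL") or DEFAULTS["GEN_MODEL"],
        "embed_model": env.get("EMBED_MODEL") or DEFAULTS["EMBED_MODEL"],
    }


settings = read_settings({"GEN_MODEL": "my-org/my-model"})  # hypothetical override
print(settings["gen_model"], settings["embed_model"])
```

Passing the mapping explicitly keeps the function easy to test without touching the real process environment.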
| 92 |
|
| 93 |
## Roadmap
|
| 94 |
|
| 95 |
+
1. ✅ Lightweight publishable v0 with sample corpus.
|
| 96 |
+
2. ✅ MedlinePlus ingest script + auto-load when uploaded.
|
| 97 |
+
3. ⏳ Add additional public-domain / openly licensed corpora (CDC, NICE OGL, OpenStax).
|
| 98 |
+
4. ⏳ Move retrieval to a persistent vector store (e.g. Chroma) once the corpus grows.
|
| 99 |
+
5. ⏳ Optionally add local GGUF inference on GPU hardware.
|
| 100 |
|
| 101 |
## What this Space deliberately does **not** do
|
| 102 |
|
| 103 |
- It does **not** include or redistribute the Merck Manuals or any other
|
| 104 |
restricted, paywalled, or copyrighted clinical reference content.
|
|
|
|
| 105 |
- It does **not** provide medical advice.
|