Space: jeremygracey-ai/FetchMerck-AI-Demo

jeremygracey-ai committed 28 days ago

Commit e2a6f29 (verified) · 1 parent: 5e5f986

Document MedlinePlus ingest workflow + attribution; update roadmap

Files changed (1)

README.md +51 -14

@@ -21,8 +21,9 @@ pipeline for clinical decision support.
 This Space uses:
-- A small in-memory **sample corpus** of original, paraphrased clinical
-  reference snippets (no copyrighted source material).
+- A prebuilt **MedlinePlus**-derived corpus when the files
+  `data/corpus.jsonl` and `data/embeddings.npy` are present in the repo.
+- Otherwise, a tiny in-memory **sample corpus** so the demo always works.
 - `sentence-transformers/all-MiniLM-L6-v2` for embeddings.
 - Cosine-similarity retrieval over a NumPy matrix (no vector DB).
 - A hosted generation model via the Hugging Face Inference API.
@@ -37,32 +38,68 @@ licensed clinician for medical questions.
 ## How it works
 1. The user enters a clinical question.
-2. The query is embedded and compared against the sample corpus by cosine similarity.
+2. The query is embedded and compared against the corpus by cosine similarity.
 3. The top-k passages are concatenated as grounded context.
 4. A hosted instruction-tuned LLM is asked to answer **only** from that context.
-5. The response is shown along with the source section names and a disclaimer.
+5. The response is shown along with the source topic names and a disclaimer.
+## Building the MedlinePlus corpus locally
+A local-only ingest script is included at `scripts/ingest_medline.py`. It
+downloads the latest MedlinePlus Health Topics XML (public domain), chunks
+each topic summary, and embeds each chunk with MiniLM.
+### Run locally
+```bash
+python -m venv .venv && source .venv/bin/activate
+pip install -U sentence-transformers numpy lxml
+python scripts/ingest_medline.py
+```
+This produces:
+- `data/corpus.jsonl` — one chunk per line: `{id, topic, section, url, text}`
+- `data/embeddings.npy` — float32 matrix, L2-normalized, shape `(N, 384)`
+Optional environment variables for the script:
+- `MEDLINE_XML_URL` — pin a specific snapshot (e.g. `https://medlineplus.gov/xml/mplus_topics_YYYY-MM-DD.xml.zip`).
+- `EMBED_MODEL` — override the embedding model.
+- `CHUNK_TOKENS`, `CHUNK_OVERLAP` — tune chunk size (defaults: 300 / 50).
+### Upload to the Space
+Drag `data/corpus.jsonl` and `data/embeddings.npy` into the Files tab of this
+Space (under a top-level `data/` folder). The Space will pick them up on next
+restart.
+## MedlinePlus attribution
+Health-topic content used by the prebuilt corpus is adapted from
+**MedlinePlus**, a service of the U.S. National Library of Medicine, National
+Institutes of Health. MedlinePlus content is in the public domain and free to
+reuse. This project is not affiliated with, endorsed by, or sponsored by NLM,
+NIH, or HHS.
 ## Configuration
 Optional environment variables / Space secrets:
-- `HF_TOKEN` — Hugging Face token (needed only for gated or private generation models).
+- `HF_TOKEN` — Hugging Face token (needed for gated or private generation models).
 - `GEN_MODEL` — override the generation model (default: `meta-llama/Llama-3.1-8B-Instruct`).
+- `EMBED_MODEL` — override the embedding model (default: `sentence-transformers/all-MiniLM-L6-v2`).
 ## Roadmap
-This is the v0 publishable baseline. Planned upgrades, in order:
-1. Replace the sample corpus with a **legally publishable** medical reference
-   corpus (e.g., openly licensed clinical guidelines, public-domain references,
-   or content the project is licensed to redistribute).
-2. Move retrieval to a persistent vector store (e.g., Chroma) once the corpus grows.
-3. Pre-build and ship a vector index alongside the Space.
-4. Optionally add local GGUF inference on GPU hardware.
+1. ✅ Lightweight publishable v0 with sample corpus.
+2. ✅ MedlinePlus ingest script + auto-load when uploaded.
+3. ⏳ Add additional public-domain / openly licensed corpora (CDC, NICE OGL, OpenStax).
+4. ⏳ Move retrieval to a persistent vector store (e.g. Chroma) once the corpus grows.
+5. ⏳ Optionally add local GGUF inference on GPU hardware.
 ## What this Space deliberately does **not** do
 - It does **not** include or redistribute the Merck Manuals or any other
   restricted, paywalled, or copyrighted clinical reference content.
-- It does **not** persist user data; the in-memory index is rebuilt each cold start.
 - It does **not** provide medical advice.
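
Note on the auto-load behaviour described in this change: the README says the Space uses `data/corpus.jsonl` and `data/embeddings.npy` when present, falls back to a sample corpus otherwise, and retrieves by cosine similarity over a NumPy matrix. The Space's own loading code is not part of this diff, so the following is only a minimal sketch of how that could look. The names `load_corpus`, `top_k`, and `SAMPLE_CORPUS`, and the placeholder sample entries, are hypothetical. It assumes the query has already been embedded and that the stored rows are L2-normalized, as the README states.

```python
import json
from pathlib import Path

import numpy as np

# Hypothetical fallback entries; the Space's real sample corpus is not shown in this diff.
SAMPLE_CORPUS = [
    {"id": "sample-0", "topic": "Sample topic A", "text": "Placeholder passage A."},
    {"id": "sample-1", "topic": "Sample topic B", "text": "Placeholder passage B."},
]


def load_corpus(data_dir="data"):
    """Return (records, embeddings).

    embeddings is None when the prebuilt files are missing, in which case the
    caller would have to embed the sample corpus itself at startup.
    """
    corpus_path = Path(data_dir) / "corpus.jsonl"
    emb_path = Path(data_dir) / "embeddings.npy"
    if not (corpus_path.exists() and emb_path.exists()):
        return SAMPLE_CORPUS, None
    with corpus_path.open(encoding="utf-8") as fh:
        records = [json.loads(line) for line in fh if line.strip()]
    embeddings = np.load(emb_path)
    # One embedding row per JSONL record, otherwise indices would not line up.
    if embeddings.shape[0] != len(records):
        raise ValueError("corpus.jsonl and embeddings.npy are out of sync")
    return records, embeddings


def top_k(query_vec, embeddings, k=3):
    """Return (indices, scores) of the k rows most similar to query_vec.

    With L2-normalized rows, the dot product with a normalized query equals
    cosine similarity. A zero-length query vector is not handled here.
    """
    q = np.asarray(query_vec, dtype=np.float32)
    q = q / np.linalg.norm(q)
    scores = embeddings @ q
    order = np.argsort(-scores)[:k]
    return order, scores[order]
```

In a real run the query vector would presumably come from the same model used at ingest time, for example `SentenceTransformer(...).encode(question, normalize_embeddings=True)`, so that query and corpus vectors live in the same space.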