jeremygracey-ai committed on
Commit
e2a6f29
·
verified ·
1 Parent(s): 5e5f986

Document MedlinePlus ingest workflow + attribution; update roadmap

Files changed (1)
  1. README.md +51 -14
README.md CHANGED
@@ -21,8 +21,9 @@ pipeline for clinical decision support.
21
 
22
  This Space uses:
23
 
24
- - A small in-memory **sample corpus** of original, paraphrased clinical
25
- reference snippets (no copyrighted source material).
 
26
  - `sentence-transformers/all-MiniLM-L6-v2` for embeddings.
27
  - Cosine-similarity retrieval over a NumPy matrix (no vector DB).
28
  - A hosted generation model via the Hugging Face Inference API.
@@ -37,32 +38,68 @@ licensed clinician for medical questions.
37
  ## How it works
38
 
39
  1. The user enters a clinical question.
40
- 2. The query is embedded and compared against the sample corpus by cosine similarity.
41
  3. The top-k passages are concatenated as grounded context.
42
  4. A hosted instruction-tuned LLM is asked to answer **only** from that context.
43
- 5. The response is shown along with the source section names and a disclaimer.
44
 
45
  ## Configuration
46
 
47
  Optional environment variables / Space secrets:
48
 
49
- - `HF_TOKEN` — Hugging Face token (needed only for gated or private generation models).
50
  - `GEN_MODEL` — override the generation model (default: `meta-llama/Llama-3.1-8B-Instruct`).
 
51
 
52
  ## Roadmap
53
 
54
- This is the v0 publishable baseline. Planned upgrades, in order:
55
-
56
- 1. Replace the sample corpus with a **legally publishable** medical reference
57
- corpus (e.g., openly licensed clinical guidelines, public-domain references,
58
- or content the project is licensed to redistribute).
59
- 2. Move retrieval to a persistent vector store (e.g., Chroma) once the corpus grows.
60
- 3. Pre-build and ship a vector index alongside the Space.
61
- 4. Optionally add local GGUF inference on GPU hardware.
62
 
63
  ## What this Space deliberately does **not** do
64
 
65
  - It does **not** include or redistribute the Merck Manuals or any other
66
  restricted, paywalled, or copyrighted clinical reference content.
67
- - It does **not** persist user data; the in-memory index is rebuilt each cold start.
68
  - It does **not** provide medical advice.
 
21
 
22
  This Space uses:
23
 
24
+ - A prebuilt **MedlinePlus**-derived corpus when the files
25
+ `data/corpus.jsonl` and `data/embeddings.npy` are present in the repo.
26
+ - Otherwise, a tiny in-memory **sample corpus** so the demo always works.
27
  - `sentence-transformers/all-MiniLM-L6-v2` for embeddings.
28
  - Cosine-similarity retrieval over a NumPy matrix (no vector DB).
29
  - A hosted generation model via the Hugging Face Inference API.
 
38
  ## How it works
39
 
40
  1. The user enters a clinical question.
41
+ 2. The query is embedded and compared against the corpus by cosine similarity.
42
  3. The top-k passages are concatenated as grounded context.
43
  4. A hosted instruction-tuned LLM is asked to answer **only** from that context.
44
+ 5. The response is shown along with the source topic names and a disclaimer.
45
+
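The retrieval step above (step 2) can be sketched as follows. This is a minimal illustration, not the Space's actual code: the function name `top_k_cosine` and the toy 3-dimensional vectors are assumptions made for the example, standing in for real 384-dimensional MiniLM embeddings.

```python
import numpy as np


def top_k_cosine(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 3):
    """Return (indices, scores) of the k rows most similar to query_vec."""
    # Normalize both sides so a plain dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    docs = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = docs @ q
    k = min(k, len(scores))
    # Sort by descending similarity and keep the first k indices.
    order = np.argsort(-scores)[:k]
    return order, scores[order]


# Hypothetical toy data: three "passages" and one "query".
docs = np.array(
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]], dtype=np.float32
)
query = np.array([1.0, 0.1, 0.0], dtype=np.float32)
idx, scores = top_k_cosine(query, docs, k=2)
```

If the stored matrix is already L2-normalized, the row normalization is redundant and only the query needs normalizing.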
46
+ ## Building the MedlinePlus corpus locally
47
+
48
+ A local-only ingest script is included at `scripts/ingest_medline.py`. It
49
+ downloads the latest MedlinePlus Health Topics XML (public domain), chunks
50
+ each topic summary, and embeds each chunk with MiniLM.
51
+
52
+ ### Run locally
53
+
54
+ ```bash
55
+ python -m venv .venv && source .venv/bin/activate
56
+ pip install -U sentence-transformers numpy lxml
57
+ python scripts/ingest_medline.py
58
+ ```
59
+
60
+ This produces:
61
+
62
+ - `data/corpus.jsonl` — one chunk per line: `{id, topic, section, url, text}`
63
+ - `data/embeddings.npy` — float32 matrix, L2-normalized, shape `(N, 384)`
64
+
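One way an app might load these two files and fall back to the sample corpus when they are absent is sketched below. The helper name `load_corpus` and the row-count check are assumptions for illustration; the Space's real loading code may differ.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def load_corpus(data_dir):
    """Return (records, embeddings), or None if either file is missing."""
    corpus_path = Path(data_dir) / "corpus.jsonl"
    emb_path = Path(data_dir) / "embeddings.npy"
    if not (corpus_path.exists() and emb_path.exists()):
        return None  # the caller would fall back to the sample corpus
    with corpus_path.open(encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    embeddings = np.load(emb_path)
    # Each JSONL line must correspond to exactly one embedding row.
    if embeddings.shape[0] != len(records):
        raise ValueError("corpus.jsonl and embeddings.npy have different lengths")
    return records, embeddings


# Demo with a throwaway directory and a single placeholder record.
with tempfile.TemporaryDirectory() as tmp:
    missing = load_corpus(tmp)  # files are absent at this point
    rec = {"id": "demo-0", "topic": "Demo", "section": "summary",
           "url": "", "text": "placeholder"}
    (Path(tmp) / "corpus.jsonl").write_text(json.dumps(rec) + "\n", encoding="utf-8")
    np.save(Path(tmp) / "embeddings.npy", np.zeros((1, 384), dtype=np.float32))
    records, embeddings = load_corpus(tmp)
```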
65
+ Optional environment variables for the script:
66
+
67
+ - `MEDLINE_XML_URL` — pin a specific snapshot (e.g. `https://medlineplus.gov/xml/mplus_topics_YYYY-MM-DD.xml.zip`).
68
+ - `EMBED_MODEL` — override the embedding model.
69
+ - `CHUNK_TOKENS`, `CHUNK_OVERLAP` — tune chunk size (defaults: 300 / 50).
70
+
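To illustrate how the two chunking settings interact, here is a rough sketch of a sliding window with overlap. It assumes whitespace-separated words as "tokens" and a hypothetical helper named `chunk_tokens`; the ingest script's actual tokenization and chunking logic are not shown in this README and may differ.

```python
def chunk_tokens(tokens, size=300, overlap=50):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end
    return chunks


# Hypothetical input: 700 placeholder "tokens".
words = [f"w{i}" for i in range(700)]
chunks = chunk_tokens(words)
chunk_lengths = [len(c) for c in chunks]
```

With the defaults, each window starts 250 tokens after the previous one, so consecutive chunks share 50 tokens of context.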
71
+ ### Upload to the Space
72
+
73
+ Drag `data/corpus.jsonl` and `data/embeddings.npy` into the Files tab of this
74
+ Space (under a top-level `data/` folder). The Space will pick them up on the next
75
+ restart.
76
+
77
+ ## MedlinePlus attribution
78
+
79
+ Health-topic content used by the prebuilt corpus is adapted from
80
+ **MedlinePlus**, a service of the U.S. National Library of Medicine, National
81
+ Institutes of Health. MedlinePlus content is in the public domain and free to
82
+ reuse. This project is not affiliated with, endorsed by, or sponsored by NLM,
83
+ NIH, or HHS.
84
 
85
  ## Configuration
86
 
87
  Optional environment variables / Space secrets:
88
 
89
+ - `HF_TOKEN` — Hugging Face token (needed for gated or private generation models).
90
  - `GEN_MODEL` — override the generation model (default: `meta-llama/Llama-3.1-8B-Instruct`).
91
+ - `EMBED_MODEL` — override the embedding model (default: `sentence-transformers/all-MiniLM-L6-v2`).
92
 
93
  ## Roadmap
94
 
95
+ 1. Lightweight publishable v0 with sample corpus.
96
+ 2. ✅ MedlinePlus ingest script + auto-load when uploaded.
97
+ 3. Add additional public-domain / openly licensed corpora (CDC, NICE OGL, OpenStax).
98
+ 4. ⏳ Move retrieval to a persistent vector store (e.g. Chroma) once the corpus grows.
99
+ 5. Optionally add local GGUF inference on GPU hardware.
100
 
101
  ## What this Space deliberately does **not** do
102
 
103
  - It does **not** include or redistribute the Merck Manuals or any other
104
  restricted, paywalled, or copyrighted clinical reference content.
 
105
  - It does **not** provide medical advice.