# RAG Knowledge Base

Drop reference documents here (`.md`, `.txt`, or `.pdf`). They are ingested by
`python -m src.rag.ingest` at Docker build time and surfaced to the orchestrator
agent via the `retrieve_context` tool. The container entrypoint also rebuilds
the index at startup when a mounted `data/` volume does not already contain
`data/processed/faiss_index/`.
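
The startup check described above could look roughly like the sketch below. It assumes the index counts as present when `data/processed/faiss_index/` exists and holds at least one file. The helper names `index_is_missing` and `rebuild_if_missing` are hypothetical and are not taken from the actual entrypoint.

```python
import subprocess
import sys
from pathlib import Path
from typing import Callable, Optional, Sequence


def index_is_missing(data_dir: Path) -> bool:
    """Return True when data_dir has no populated processed/faiss_index/ folder."""
    index_dir = data_dir / "processed" / "faiss_index"
    # Assumption: an empty directory is treated the same as a missing one.
    return not index_dir.is_dir() or not any(index_dir.iterdir())


def rebuild_if_missing(
    data_dir: Path,
    run: Optional[Callable[[Sequence[str]], object]] = None,
) -> bool:
    """Run the ingest module if the index is missing; return True if it ran."""
    if not index_is_missing(data_dir):
        return False
    command = [sys.executable, "-m", "src.rag.ingest"]
    if run is None:
        # Default path: invoke the same module the README documents.
        subprocess.run(command, check=True)
    else:
        # Injected runner, useful for dry runs and tests.
        run(command)
    return True
```

An entrypoint script might call `rebuild_if_missing(Path("data"))` once before starting the service.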

## Recommended seed set

For a clinical-ML / NeuroBridge demo:

- **BBB / molecules**: Lipinski's Rule of Five (1997, 2001), Pajouhesh & Lenz
  on medicinal chemical properties of successful CNS drugs (2005)
- **MRI / harmonization**: Fortin et al. ComBat for diffusion (2017),
  Fortin et al. ComBat for cortical thickness (2018), Johnson et al. original
  ComBat (2007, gene expression)
- **EEG / artifacts**: Hyvärinen ICA primer (1999), MNE-Python overview
  (Gramfort et al. 2013)

## Format notes

- PDFs work via `pypdf`. Image-only PDFs (scans with no text layer) won't
  yield any text; run OCR on them first.
- Markdown is preferred: full text plus headers chunk cleanly.
- Files are gitignored by default. Mount them via a Docker volume in
  production, or `COPY` them in via a sub-path before the `RUN` ingest line
  in the Dockerfile.
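
Before adding a PDF, it can help to check whether it has an extractable text layer at all. The sketch below relies on `pypdf`'s `PdfReader` and `extract_text()`. The helper names and the `min_chars` threshold of 20 are illustrative assumptions, not part of the ingest code.

```python
from pathlib import Path
from typing import Iterable, List


def pdf_page_texts(path: Path) -> List[str]:
    """Extract text per page with pypdf (imported lazily, so the heuristic below runs without it)."""
    from pypdf import PdfReader

    reader = PdfReader(str(path))
    # Pages without a text layer typically yield an empty string; guard against None too.
    return [page.extract_text() or "" for page in reader.pages]


def looks_scanned(page_texts: Iterable[str], min_chars: int = 20) -> bool:
    """Heuristic: treat the PDF as image-only if its pages yield almost no text."""
    total = sum(len(text.strip()) for text in page_texts)
    return total < min_chars
```

If `looks_scanned(pdf_page_texts(Path("paper.pdf")))` returns `True`, the file probably needs OCR before ingestion (`paper.pdf` is a placeholder name).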

## Re-indexing

After adding or removing files, re-run:

    python -m src.rag.ingest

This rewrites `data/processed/faiss_index/` from scratch. There is no
incremental update, since the index is small enough to rebuild in seconds.
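
A from-scratch rewrite of this kind can be sketched as follows. The `build` callback stands in for whatever `src.rag.ingest` actually does to write the index, and `rebuild_from_scratch` is a hypothetical name. Only the delete-then-recreate pattern is the point here.

```python
import shutil
from pathlib import Path
from typing import Callable


def rebuild_from_scratch(index_dir: Path, build: Callable[[Path], object]) -> None:
    """Delete any existing index directory, then build a fresh one in its place."""
    if index_dir.exists():
        # Remove stale entries, including those for documents no longer present.
        shutil.rmtree(index_dir)
    index_dir.mkdir(parents=True)
    # Placeholder for the real index-writing step (an assumption, not the actual ingest code).
    build(index_dir)
```

Because the old directory is removed first, documents deleted from this folder also disappear from the index on the next run.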