# RAG Knowledge Base

Drop reference documents here (`.md`, `.txt`, or `.pdf`). They are ingested by `python -m src.rag.ingest` at Docker build time and surfaced to the orchestrator agent via the `retrieve_context` tool. The container entrypoint also rebuilds the index at startup when a mounted `data/` volume does not already contain `data/processed/faiss_index/`.

## Recommended seed set

For a clinical-ML / NeuroBridge demo:

- **BBB / molecules**: Lipinski's Rule of Five (1997, 2001), Pajouhesh & Lenz on medicinal chemical properties of successful CNS drugs (2005)
- **MRI / harmonization**: Fortin et al. ComBat for diffusion (2017), Fortin et al. ComBat for cortical thickness (2018), Johnson et al. original ComBat (2007, gene expression)
- **EEG / artifacts**: Hyvärinen ICA primer (1999), MNE-Python overview (Gramfort 2013)

## Format notes

- PDFs are read with `pypdf`. Image-only PDFs (scans with no text layer) yield no text; run OCR on them before adding them here.
- Markdown is preferred: full text with headers chunks cleanly.
- Files are gitignored by default. In production, mount them via a Docker volume, or `COPY` them into the image from a sub-path before the `RUN` ingest line.

## Re-indexing

After adding or removing files, re-run `python -m src.rag.ingest`.

This rewrites `data/processed/faiss_index/` from scratch. There is no incremental update; the index is small enough to rebuild in seconds.
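
The startup behaviour described above (rebuild only when the index directory is not already present) could be sketched roughly as follows. This is an illustration, not the project's actual entrypoint, which may well be a shell script. The function names `index_is_missing` and `rebuild_if_missing` are hypothetical, and treating an empty directory as "missing" is an assumption made here.

```python
import subprocess
import sys
from pathlib import Path

# Index location as documented in this README.
DEFAULT_INDEX_DIR = Path("data/processed/faiss_index")


def index_is_missing(index_dir: Path) -> bool:
    """Return True when the index directory is absent or contains no files.

    Assumption: an empty directory counts as "no index". The real
    entrypoint may only test for the directory's existence.
    """
    return not index_dir.is_dir() or not any(index_dir.iterdir())


def rebuild_if_missing(index_dir: Path = DEFAULT_INDEX_DIR) -> bool:
    """Run the documented ingest command only if no index is present.

    Returns True if a rebuild was triggered, False if it was skipped.
    """
    if not index_is_missing(index_dir):
        return False
    # Same command as in the "Re-indexing" section; it rewrites the
    # whole index from scratch rather than updating it incrementally.
    subprocess.run([sys.executable, "-m", "src.rag.ingest"], check=True)
    return True
```

Calling `rebuild_if_missing()` from the repository root leaves an existing index untouched and otherwise runs the full ingest. To force a rebuild after adding or removing files, run the ingest command directly instead.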