# RAG Knowledge Base

Drop reference documents here (`.md`, `.txt`, or `.pdf`). They are ingested by
`python -m src.rag.ingest` at Docker build time and surfaced to the orchestrator
agent via the `retrieve_context` tool. The container entrypoint also rebuilds
the index at startup when a mounted `data/` volume does not already contain
`data/processed/faiss_index/`.
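
The startup check described above can be sketched roughly as follows. This is a minimal illustration, not the actual entrypoint: the names `index_missing` and `ensure_index` are hypothetical, and it assumes the index counts as present only when the directory exists and is non-empty.

```python
import subprocess
import sys
from pathlib import Path

# Path taken from this README; relative to the working directory.
INDEX_DIR = Path("data/processed/faiss_index")


def index_missing(index_dir: Path = INDEX_DIR) -> bool:
    """Return True when the index directory is absent or empty (assumed criterion)."""
    return not index_dir.is_dir() or not any(index_dir.iterdir())


def ensure_index(index_dir: Path = INDEX_DIR) -> bool:
    """Run the ingest module if the index is missing; return True if a rebuild ran."""
    if not index_missing(index_dir):
        return False
    # Same command this README uses for manual re-indexing.
    subprocess.run([sys.executable, "-m", "src.rag.ingest"], check=True)
    return True
```

Calling `ensure_index()` before the application starts would leave an existing index untouched and rebuild only on a fresh volume.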

## Recommended seed set

For a clinical-ML / NeuroBridge demo:

- **BBB / molecules**: Lipinski's Rule of Five (1997, 2001), Pajouhesh & Lenz
  on medicinal chemical properties of successful CNS drugs (2005)
- **MRI / harmonization**: Fortin et al. ComBat for diffusion (2017),
  Fortin et al. ComBat for cortical thickness (2018), Johnson et al. original
  ComBat (2007, gene expression)
- **EEG / artifacts**: Hyvärinen ICA primer (1999), MNE-Python overview
  (Gramfort 2013)

## Format notes

- PDFs are parsed with `pypdf`. Image-only PDFs (scans without a text layer)
  yield no text; run OCR on them first.
- Markdown is preferred — full text + headers chunk cleanly.
- Files in this directory are gitignored by default. In production, mount them
  as a Docker volume, or `COPY` them into the image before the `RUN` line that
  runs ingest.
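
To spot PDFs that will not yield text before ingesting them, a check along these lines may help. It is a sketch under stated assumptions: the helper names and the `min_chars` threshold are hypothetical and not part of the ingest code. It relies only on `pypdf`'s `PdfReader` and `page.extract_text()`.

```python
from pathlib import Path
from typing import Iterable, List


def looks_scanned(page_texts: Iterable[str], min_chars: int = 20) -> bool:
    """Heuristic: treat a PDF as image-only if no page yields meaningful text.

    `min_chars` is an arbitrary illustrative threshold, not a project setting.
    """
    return all(len((text or "").strip()) < min_chars for text in page_texts)


def extract_page_texts(pdf_path: Path) -> List[str]:
    """Extract text per page with pypdf (imported lazily so the heuristic works alone)."""
    from pypdf import PdfReader

    reader = PdfReader(str(pdf_path))
    # Image-only pages typically produce an empty string here.
    return [page.extract_text() or "" for page in reader.pages]


def find_scanned_pdfs(folder: Path) -> List[Path]:
    """List PDFs in `folder` that probably need OCR before ingestion."""
    return [
        pdf for pdf in sorted(folder.glob("*.pdf"))
        if looks_scanned(extract_page_texts(pdf))
    ]
```

Any path returned by `find_scanned_pdfs` is a candidate for OCR before the next ingest run.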

## Re-indexing

After adding/removing files, re-run:

    python -m src.rag.ingest

This rewrites `data/processed/faiss_index/` from scratch (no incremental
update — the index is small enough to rebuild in seconds).
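
The rebuild-from-scratch behavior can be pictured with the sketch below. It is illustrative only: `rebuild_from_scratch` and the `build` callback are hypothetical stand-ins for whatever `src.rag.ingest` does internally.

```python
import shutil
from pathlib import Path
from typing import Callable


def rebuild_from_scratch(index_dir: Path, build: Callable[[Path], None]) -> None:
    """Delete any existing index directory, then build a fresh one."""
    if index_dir.exists():
        # Discard the old index so entries for removed files cannot linger.
        shutil.rmtree(index_dir)
    index_dir.mkdir(parents=True)
    build(index_dir)
```

Wiping the directory first is what makes file removals take effect without any bookkeeping about which chunks came from which document.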