mekosotto committed · Commit 0db04e6 · 1 Parent(s): 4fc125d

feat(deploy): build RAG index at Docker build time + KB seed dir

Dockerfile CHANGED
@@ -43,6 +43,14 @@ RUN mkdir -p data/raw data/processed && \
  python -c "from pathlib import Path; from src.pipelines.eeg_pipeline import run_pipeline; run_pipeline(input_path=Path('tests/fixtures/eeg_sample.fif'), output_path=Path('data/processed/eeg_features.parquet'))" && \
  python -c "from pathlib import Path; from src.pipelines.mri_pipeline import run_pipeline; run_pipeline(input_dir=Path('tests/fixtures/mri_sample'), sites_csv=Path('tests/fixtures/mri_sample/sites.csv'), output_path=Path('data/processed/mri_features.parquet'))"
 
+ # --- RAG knowledge base ingest ---
+ # Build the FAISS index from any seed docs in tests/fixtures/kb_sample/
+ # (always present) plus data/knowledge_base/ (optional, user-supplied via
+ # additional COPY layer or volume mount). Empty KB → empty index, agent
+ # still functions, retrieve_context just returns no chunks.
+ COPY tests/fixtures/kb_sample/ ./data/knowledge_base/seed/
+ RUN python -m src.rag.ingest data/knowledge_base data/processed/faiss_index
+
  # --- HF Spaces convention ---
  EXPOSE 7860
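
The comment in this hunk states that an empty knowledge base produces an empty index and that `retrieve_context` then returns no chunks. The following is a minimal sketch of that contract, not the repository's implementation: `TinyIndex` and `retrieve_context_sketch` are hypothetical names, and a plain dot product over Python lists stands in for the FAISS search.

```python
from dataclasses import dataclass, field


@dataclass
class TinyIndex:
    """Hypothetical stand-in for the FAISS index: parallel vectors and chunk texts."""

    vectors: list[list[float]] = field(default_factory=list)
    chunks: list[str] = field(default_factory=list)


def retrieve_context_sketch(
    index: TinyIndex, query_vec: list[float], k: int = 3
) -> list[str]:
    """Return up to k chunk texts ranked by dot-product similarity.

    An empty index yields an empty list instead of raising, which matches
    the behaviour the Dockerfile comment describes for an empty KB.
    """
    if not index.chunks:
        return []
    scored = [
        (sum(q * v for q, v in zip(query_vec, vec)), text)
        for vec, text in zip(index.vectors, index.chunks)
    ]
    # Highest similarity first; only the score is used as the sort key.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]
```

Returning an empty list rather than raising lets the caller treat "no knowledge base" and "no relevant chunk" the same way, which is one way to keep the agent working when no documents were supplied at build time.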
Dockerfile.hf CHANGED
@@ -43,6 +43,14 @@ RUN mkdir -p data/raw data/processed && \
  python -c "from pathlib import Path; from src.pipelines.eeg_pipeline import run_pipeline; run_pipeline(input_path=Path('tests/fixtures/eeg_sample.fif'), output_path=Path('data/processed/eeg_features.parquet'))" && \
  python -c "from pathlib import Path; from src.pipelines.mri_pipeline import run_pipeline; run_pipeline(input_dir=Path('tests/fixtures/mri_sample'), sites_csv=Path('tests/fixtures/mri_sample/sites.csv'), output_path=Path('data/processed/mri_features.parquet'))"
 
+ # --- RAG knowledge base ingest ---
+ # Build the FAISS index from any seed docs in tests/fixtures/kb_sample/
+ # (always present) plus data/knowledge_base/ (optional, user-supplied via
+ # additional COPY layer or volume mount). Empty KB → empty index, agent
+ # still functions, retrieve_context just returns no chunks.
+ COPY tests/fixtures/kb_sample/ ./data/knowledge_base/seed/
+ RUN python -m src.rag.ingest data/knowledge_base data/processed/faiss_index
+
  # --- HF Spaces convention ---
  EXPOSE 7860
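
Because the `COPY` line places the seed documents under `data/knowledge_base/seed/`, one recursive scan of `data/knowledge_base` would pick up both the seed files and anything a user adds alongside them. The sketch below shows that discovery step under stated assumptions: `discover_documents` and `SUPPORTED_SUFFIXES` are hypothetical names, and the extension list is taken from the knowledge-base README added in this commit, not from `src.rag.ingest` itself.

```python
from pathlib import Path

# Extensions the knowledge-base README in this commit lists as supported.
SUPPORTED_SUFFIXES = {".md", ".txt", ".pdf"}


def discover_documents(kb_root: Path) -> list[Path]:
    """Recursively collect supported files under kb_root, sorted for determinism.

    A missing directory is treated like an empty knowledge base.
    """
    if not kb_root.is_dir():
        return []
    return sorted(
        p
        for p in kb_root.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )
```

Under this assumption the directory's own `README.md` would also match the `.md` filter, while `.gitkeep` would not, since it has no suffix.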
data/knowledge_base/.gitkeep ADDED
File without changes
data/knowledge_base/README.md ADDED
@@ -0,0 +1,34 @@
+ # RAG Knowledge Base
+
+ Drop reference documents here (`.md`, `.txt`, or `.pdf`). They will be
+ ingested by `python -m src.rag.ingest` at Docker build time and surfaced
+ to the orchestrator agent via the `retrieve_context` tool.
+
+ ## Recommended seed set
+
+ For a clinical-ML / NeuroBridge demo:
+
+ - **BBB / molecules**: Lipinski's Rule of Five (1997, 2001), Pajouhesh & Lenz
+   CNS multiparameter optimization (2005)
+ - **MRI / harmonization**: Fortin et al. ComBat for diffusion (2017),
+   Fortin et al. ComBat for cortical thickness (2018), Johnson et al. original ComBat
+   (2007, gene expression)
+ - **EEG / artifacts**: Hyvärinen ICA primer (1999), MNE-Python overview
+   (Gramfort 2013)
+
+ ## Format notes
+
+ - PDFs work via `pypdf`. OCR-only PDFs (scanned images) won't extract text;
+   pre-OCR them first.
+ - Markdown is preferred: full text + headers chunk cleanly.
+ - Files are gitignored by default. Mount them via Docker volume in
+   production, or COPY them in via a sub-path before the `RUN` ingest line.
+
+ ## Re-indexing
+
+ After adding/removing files, re-run:
+
+     python -m src.rag.ingest
+
+ This rewrites `data/processed/faiss_index/` from scratch (no incremental
+ update; the index is small enough to rebuild in seconds).
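
The README's re-indexing note describes a full rewrite of the index directory with no incremental update. A minimal sketch of that rebuild pattern follows, assuming a JSON manifest as a stand-in for the FAISS files and reading only the text formats; `rebuild_index` and `manifest.json` are hypothetical and not taken from `src.rag.ingest`.

```python
import json
import shutil
from pathlib import Path


def rebuild_index(kb_root: Path, index_dir: Path) -> int:
    """Delete index_dir and rewrite it from the current contents of kb_root.

    A JSON manifest stands in for the FAISS files, and only .md/.txt files
    are read here. Returns the number of documents indexed.
    """
    if index_dir.exists():
        shutil.rmtree(index_dir)  # drop entries for files that were removed
    index_dir.mkdir(parents=True)
    docs = {
        p.relative_to(kb_root).as_posix(): p.read_text(encoding="utf-8")
        for p in sorted(kb_root.rglob("*"))
        if p.is_file() and p.suffix.lower() in {".md", ".txt"}
    }
    (index_dir / "manifest.json").write_text(json.dumps(docs), encoding="utf-8")
    return len(docs)
```

Wiping the directory first means a deleted source file can never leave a stale entry behind, which is the main advantage of a from-scratch rebuild when the corpus is small.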