Spaces: mekosotto/hackathon
mekosotto committed 5 days ago

Commit 0db04e6

1 Parent(s): 4fc125d

feat(deploy): build RAG index at Docker build time + KB seed dir

Files changed (4)

Dockerfile CHANGED

@@ -43,6 +43,14 @@ RUN mkdir -p data/raw data/processed && \
     python -c "from pathlib import Path; from src.pipelines.eeg_pipeline import run_pipeline; run_pipeline(input_path=Path('tests/fixtures/eeg_sample.fif'), output_path=Path('data/processed/eeg_features.parquet'))" && \
     python -c "from pathlib import Path; from src.pipelines.mri_pipeline import run_pipeline; run_pipeline(input_dir=Path('tests/fixtures/mri_sample'), sites_csv=Path('tests/fixtures/mri_sample/sites.csv'), output_path=Path('data/processed/mri_features.parquet'))"
+# --- RAG knowledge base ingest ---
+# Build the FAISS index from any seed docs in tests/fixtures/kb_sample/
+# (always present) plus data/knowledge_base/ (optional, user-supplied via
+# additional COPY layer or volume mount). Empty KB → empty index, agent
+# still functions, retrieve_context just returns no chunks.
+COPY tests/fixtures/kb_sample/ ./data/knowledge_base/seed/
+RUN python -m src.rag.ingest data/knowledge_base data/processed/faiss_index
 # --- HF Spaces convention ---
 EXPOSE 7860
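
The ingest step above reads everything under data/knowledge_base, including the seed/ subfolder the COPY line creates. A minimal sketch of how that file discovery might look, assuming a recursive scan filtered to the extensions the KB README lists. `discover_documents` and `SUPPORTED_SUFFIXES` are hypothetical names, not taken from src.rag.ingest:

```python
from pathlib import Path

# Assumption: the supported extensions are the three named in the KB README.
SUPPORTED_SUFFIXES = {".md", ".txt", ".pdf"}


def discover_documents(kb_root: Path) -> list[Path]:
    """Return every supported file under kb_root, searching subfolders too.

    Hypothetical helper: a missing or empty directory yields an empty list,
    which corresponds to the "empty KB, empty index" case.
    """
    if not kb_root.is_dir():
        return []
    return sorted(
        path
        for path in kb_root.rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
    )


if __name__ == "__main__":
    for doc in discover_documents(Path("data/knowledge_base")):
        print(doc)
```

A scan like this would also pick up data/knowledge_base/README.md if that file ends up in the image. Whether the real ingest skips it is not shown in this commit.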

Dockerfile.hf CHANGED

@@ -43,6 +43,14 @@ RUN mkdir -p data/raw data/processed && \
     python -c "from pathlib import Path; from src.pipelines.eeg_pipeline import run_pipeline; run_pipeline(input_path=Path('tests/fixtures/eeg_sample.fif'), output_path=Path('data/processed/eeg_features.parquet'))" && \
     python -c "from pathlib import Path; from src.pipelines.mri_pipeline import run_pipeline; run_pipeline(input_dir=Path('tests/fixtures/mri_sample'), sites_csv=Path('tests/fixtures/mri_sample/sites.csv'), output_path=Path('data/processed/mri_features.parquet'))"
+# --- RAG knowledge base ingest ---
+# Build the FAISS index from any seed docs in tests/fixtures/kb_sample/
+# (always present) plus data/knowledge_base/ (optional, user-supplied via
+# additional COPY layer or volume mount). Empty KB → empty index, agent
+# still functions, retrieve_context just returns no chunks.
+COPY tests/fixtures/kb_sample/ ./data/knowledge_base/seed/
+RUN python -m src.rag.ingest data/knowledge_base data/processed/faiss_index
 # --- HF Spaces convention ---
 EXPOSE 7860

data/knowledge_base/.gitkeep ADDED

File without changes

data/knowledge_base/README.md ADDED

+# RAG Knowledge Base
+Drop reference documents here (`.md`, `.txt`, or `.pdf`). They will be
+ingested by `python -m src.rag.ingest` at Docker build time and surfaced
+to the orchestrator agent via the `retrieve_context` tool.
+## Recommended seed set
+For a clinical-ML / NeuroBridge demo:
+- **BBB / molecules**: Lipinski's Rule of Five (1997, 2001), Pajouhesh & Lenz
+  medicinal chemical properties of successful CNS drugs (2005)
+- **MRI / harmonization**: Fortin et al. ComBat for diffusion (2017),
+  Fortin et al. ComBat for cortical thickness (2018), Johnson et al. original
+  ComBat (2007, gene expression)
+- **EEG / artifacts**: Hyvärinen ICA primer (1999), MNE-Python overview
+  (Gramfort 2013)
+## Format notes
+- PDFs work via `pypdf`. OCR-only PDFs (scanned images) won't extract text;
+  pre-OCR them first.
+- Markdown is preferred — full text + headers chunk cleanly.
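+- Files are gitignored by default. To include them in the build-time index,
+  `COPY` them into a sub-path before the `RUN` ingest line. Files mounted via
+  a Docker volume in production are not visible at build time, so re-run the
+  ingest after mounting them.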
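
The format notes say Markdown headers chunk cleanly. One way that could work is to start a new chunk at every header line. This is a sketch under that assumption only: `chunk_markdown_by_headers` is a hypothetical name, the actual chunker in src.rag is not shown in this commit, and this version does not special-case `#` lines inside fenced code.

```python
import re

# An ATX header: one to six '#' characters, whitespace, then some text.
HEADER_RE = re.compile(r"^#{1,6}\s+\S")


def chunk_markdown_by_headers(text: str) -> list[str]:
    """Split Markdown into chunks that each begin at a header line.

    Text before the first header becomes its own chunk. Empty chunks
    are dropped.
    """
    chunks: list[str] = []
    current: list[str] = []
    for line in text.splitlines():
        if HEADER_RE.match(line) and current:
            # A new header closes the chunk collected so far.
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]


if __name__ == "__main__":
    sample = "intro\n# A\nbody a\n## B\nbody b\n"
    for chunk in chunk_markdown_by_headers(sample):
        print(repr(chunk))
```

Each chunk keeps its header as the first line, so a retrieved chunk carries its own section title.
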
+## Re-indexing
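+After adding/removing files, re-run the ingest with the same arguments the Dockerfile uses:
+    python -m src.rag.ingest data/knowledge_base data/processed/faiss_index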
+This rewrites `data/processed/faiss_index/` from scratch (no incremental
+update — the index is small enough to rebuild in seconds).
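
The "from scratch" behaviour described under Re-indexing can be pictured as: delete the output directory, then write a fresh one. The sketch below assumes that shape and uses a JSON file as a stand-in for the FAISS index files. `rebuild_index_dir` and `chunks.json` are hypothetical names, and none of this is the project's real ingest code.

```python
import json
import shutil
from pathlib import Path


def rebuild_index_dir(index_dir: Path, chunks: list[str]) -> Path:
    """Delete index_dir if it exists, then write a fresh placeholder index.

    Assumption: the real ingest would write FAISS files here; this sketch
    writes a JSON list of chunks instead so it runs without extra packages.
    """
    if index_dir.exists():
        shutil.rmtree(index_dir)  # no incremental update: drop everything
    index_dir.mkdir(parents=True)
    manifest = index_dir / "chunks.json"  # hypothetical file name
    manifest.write_text(json.dumps(chunks), encoding="utf-8")
    return manifest


if __name__ == "__main__":
    path = rebuild_index_dir(Path("data/processed/faiss_index_demo"), ["chunk one"])
    print(path.read_text(encoding="utf-8"))
```

Because the directory is removed first, chunks from deleted source files cannot linger in the index, and an empty knowledge base simply yields an empty list.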