# Datasets

The data pipeline builds a compact medication-safety training substrate that covers drug knowledge, synthetic patients, graded scenarios, retrieval evidence, and training corpora.

## Sources

- Local structured drug knowledge.
- Synthetic patients generated from simulator priors.
- Easy/medium/hard scenario files.
- Retrieval corpus and local evidence index.
- Optional Hugging Face instruction data (`tatsu-lab/alpaca`) for format warm start.
- Optional DDI API augmentation.
- Optional web fallback scraping through allowlisted public health domains.

## Generated Artifacts

- `data/processed/normalized_drugs.parquet`
- `data/processed/drug_classes.parquet`
- `data/processed/interactions.parquet`
- `data/processed/graph_edges.parquet`
- `data/processed/patients_synthetic.parquet`
- `data/processed/retrieval_corpus.jsonl`
- `data/scenarios/scenarios_easy.jsonl`
- `data/scenarios/scenarios_medium.jsonl`
- `data/scenarios/scenarios_hard.jsonl`
- `data/processed/training_corpus_sft.json(.jsonl)`
- `data/processed/training_corpus_grpo_prompts.jsonl`
- `data/processed/training_corpus_summary.json`
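
After a rebuild it can help to confirm that every artifact in the list above was actually written. The following is a minimal sketch, not part of the repository: `missing_artifacts` and `EXPECTED` are hypothetical names, and it assumes that `training_corpus_sft.json(.jsonl)` means either suffix may be present.

```python
from pathlib import Path

# Each entry lists the acceptable paths for one artifact, copied from the list above.
EXPECTED = [
    ("data/processed/normalized_drugs.parquet",),
    ("data/processed/drug_classes.parquet",),
    ("data/processed/interactions.parquet",),
    ("data/processed/graph_edges.parquet",),
    ("data/processed/patients_synthetic.parquet",),
    ("data/processed/retrieval_corpus.jsonl",),
    ("data/scenarios/scenarios_easy.jsonl",),
    ("data/scenarios/scenarios_medium.jsonl",),
    ("data/scenarios/scenarios_hard.jsonl",),
    # Assumption: the ".json(.jsonl)" notation means either suffix is acceptable.
    (
        "data/processed/training_corpus_sft.json",
        "data/processed/training_corpus_sft.jsonl",
    ),
    ("data/processed/training_corpus_grpo_prompts.jsonl",),
    ("data/processed/training_corpus_summary.json",),
]


def missing_artifacts(root):
    """Return the first listed path of every artifact with no file under root."""
    root = Path(root)
    return [
        alternatives[0]
        for alternatives in EXPECTED
        if not any((root / path).is_file() for path in alternatives)
    ]


if __name__ == "__main__":
    # Run from the repository root; prints one line per missing artifact.
    for path in missing_artifacts("."):
        print("missing:", path)
```

An empty result means all twelve artifacts are on disk; it says nothing about whether their contents are valid.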

## Rebuild

```bash
.venv/bin/python scripts/build_synthetic_patients.py
.venv/bin/python scripts/ingest_open_drug_sources.py
.venv/bin/python scripts/build_drug_knowledge.py
.venv/bin/python scripts/build_retrieval_index.py
.venv/bin/python scripts/build_scenarios.py
.venv/bin/python scripts/bootstrap_data.py
.venv/bin/python scripts/build_training_corpus.py --profile small --with-local --with-synthetic --with-hf
```
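
The same sequence can be driven from Python so that a failing step halts the rebuild instead of letting later steps run on stale data. This is a hedged sketch rather than a script shipped with the project: `rebuild` and `STEPS` are hypothetical names, and it assumes the order shown above is the intended execution order.

```python
import subprocess

PYTHON = ".venv/bin/python"

# Script names and flags copied from the Rebuild block, in the listed order.
STEPS = [
    ["scripts/build_synthetic_patients.py"],
    ["scripts/ingest_open_drug_sources.py"],
    ["scripts/build_drug_knowledge.py"],
    ["scripts/build_retrieval_index.py"],
    ["scripts/build_scenarios.py"],
    ["scripts/bootstrap_data.py"],
    [
        "scripts/build_training_corpus.py",
        "--profile", "small",
        "--with-local", "--with-synthetic", "--with-hf",
    ],
]


def rebuild(steps=STEPS, run=subprocess.run):
    """Run each step in order and return the names of the scripts that finished.

    With check=True, subprocess.run raises CalledProcessError on the first
    non-zero exit code, so later steps are never started after a failure.
    """
    completed = []
    for args in steps:
        cmd = [PYTHON, *args]
        print("running:", " ".join(cmd))
        run(cmd, check=True)
        completed.append(args[0])
    return completed


if __name__ == "__main__":
    rebuild()
```

The `run` parameter exists only so the sequence can be exercised with a stub in tests without launching the real scripts.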

Use `--enable-ddi-api` and `--enable-web-fallback` only when network access and provenance review are available.