Datasets
The data pipeline creates a compact but complete medication-safety training substrate.
Sources
- Local structured drug knowledge.
- Synthetic patients generated from simulator priors.
- Easy/medium/hard scenario files.
- Retrieval corpus and local evidence index.
- Optional Hugging Face instruction data (tatsu-lab/alpaca) for format warm start.
- Optional DDI API augmentation.
- Optional web fallback scraping through allowlisted public health domains.
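The allowlist idea behind the web fallback can be sketched as a host check that runs before any fetch. This is a minimal illustration, not the project's code: the function name `is_allowlisted`, the domains in `ALLOWED_DOMAINS`, and the HTTPS-only rule are all assumptions made for the example, since the real allowlist is not shown here.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist for illustration only; the pipeline's real list
# of public health domains is not given in this document.
ALLOWED_DOMAINS = {"cdc.gov", "nih.gov", "who.int"}


def is_allowlisted(url, allowed=ALLOWED_DOMAINS):
    """Return True if url is HTTPS and its host is an allowed domain or a subdomain of one."""
    parts = urlsplit(url)
    # Assumption of this sketch: plain HTTP is rejected outright.
    if parts.scheme != "https":
        return False
    host = (parts.hostname or "").lower()
    # Match the domain exactly or as a dot-separated suffix, so that
    # "cdc.gov.example.com" and "notcdc.gov" do not pass.
    return any(host == d or host.endswith("." + d) for d in allowed)
```

Matching on the parsed hostname rather than on the raw URL string avoids accepting look-alike hosts that merely contain an allowed name.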
Generated Artifacts
- data/processed/normalized_drugs.parquet
- data/processed/drug_classes.parquet
- data/processed/interactions.parquet
- data/processed/graph_edges.parquet
- data/processed/patients_synthetic.parquet
- data/processed/retrieval_corpus.jsonl
- data/scenarios/scenarios_easy.jsonl
- data/scenarios/scenarios_medium.jsonl
- data/scenarios/scenarios_hard.jsonl
- data/processed/training_corpus_sft.json(.jsonl)
- data/processed/training_corpus_grpo_prompts.jsonl
- data/processed/training_corpus_summary.json
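A quick way to confirm that a rebuild produced everything is to check the artifact paths on disk. The sketch below is a hypothetical helper, not part of the repository: the name `missing_artifacts` is invented here, and it assumes that `training_corpus_sft.json(.jsonl)` means either extension is acceptable.

```python
from pathlib import Path

# Artifact paths as listed above, relative to the repository root.
# Each entry is a tuple of acceptable alternatives. Assumption: the SFT
# corpus counts as present if either the .json or the .jsonl file exists.
EXPECTED_ARTIFACTS = [
    ("data/processed/normalized_drugs.parquet",),
    ("data/processed/drug_classes.parquet",),
    ("data/processed/interactions.parquet",),
    ("data/processed/graph_edges.parquet",),
    ("data/processed/patients_synthetic.parquet",),
    ("data/processed/retrieval_corpus.jsonl",),
    ("data/scenarios/scenarios_easy.jsonl",),
    ("data/scenarios/scenarios_medium.jsonl",),
    ("data/scenarios/scenarios_hard.jsonl",),
    ("data/processed/training_corpus_sft.json",
     "data/processed/training_corpus_sft.jsonl"),
    ("data/processed/training_corpus_grpo_prompts.jsonl",),
    ("data/processed/training_corpus_summary.json",),
]


def missing_artifacts(root="."):
    """Return the first listed path of every artifact that has no file under root."""
    root = Path(root)
    missing = []
    for alternatives in EXPECTED_ARTIFACTS:
        if not any((root / rel).is_file() for rel in alternatives):
            missing.append(alternatives[0])
    return missing
```

Calling `missing_artifacts()` from the repository root returns an empty list when every artifact is present.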
Rebuild
.venv/bin/python scripts/build_synthetic_patients.py
.venv/bin/python scripts/ingest_open_drug_sources.py
.venv/bin/python scripts/build_drug_knowledge.py
.venv/bin/python scripts/build_retrieval_index.py
.venv/bin/python scripts/build_scenarios.py
.venv/bin/python scripts/bootstrap_data.py
.venv/bin/python scripts/build_training_corpus.py --profile small --with-local --with-synthetic --with-hf
Use --enable-ddi-api and --enable-web-fallback only when network access and provenance review are available.
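The rebuild commands can also be driven from a single script that stops at the first failing step. This is a sketch under two assumptions: that the steps are meant to run in the order listed, and that the interpreter lives at `.venv/bin/python` as in the commands above. The names `rebuild_commands` and `run_rebuild` are hypothetical, and the optional network flags are deliberately left out.

```python
import subprocess

PYTHON = ".venv/bin/python"

# Script paths and arguments copied from the rebuild commands above.
REBUILD_STEPS = [
    ["scripts/build_synthetic_patients.py"],
    ["scripts/ingest_open_drug_sources.py"],
    ["scripts/build_drug_knowledge.py"],
    ["scripts/build_retrieval_index.py"],
    ["scripts/build_scenarios.py"],
    ["scripts/bootstrap_data.py"],
    ["scripts/build_training_corpus.py", "--profile", "small",
     "--with-local", "--with-synthetic", "--with-hf"],
]


def rebuild_commands(python=PYTHON):
    """Return the full argv list for each rebuild step, in order."""
    return [[python, *step] for step in REBUILD_STEPS]


def run_rebuild(python=PYTHON, runner=subprocess.run):
    """Run every step in order; check=True raises on the first non-zero exit."""
    for argv in rebuild_commands(python):
        print("running:", " ".join(argv))
        runner(argv, check=True)
```

Calling `run_rebuild()` from the repository root executes the seven steps; the `runner` parameter exists so the sequence can be inspected or tested without launching any script.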