# Datasets

The data pipeline creates a compact but complete medication-safety training substrate.

## Sources

- Local structured drug knowledge.
- Synthetic patients generated from simulator priors.
- Easy/medium/hard scenario files.
- Retrieval corpus and local evidence index.
- Optional Hugging Face instruction data (`tatsu-lab/alpaca`) for a format warm start.
- Optional DDI API augmentation.
- Optional web fallback scraping through allowlisted public health domains.
## Generated Artifacts

- `data/processed/normalized_drugs.parquet`
- `data/processed/drug_classes.parquet`
- `data/processed/interactions.parquet`
- `data/processed/graph_edges.parquet`
- `data/processed/patients_synthetic.parquet`
- `data/processed/retrieval_corpus.jsonl`
- `data/scenarios/scenarios_easy.jsonl`
- `data/scenarios/scenarios_medium.jsonl`
- `data/scenarios/scenarios_hard.jsonl`
- `data/processed/training_corpus_sft.json(.jsonl)`
- `data/processed/training_corpus_grpo_prompts.jsonl`
- `data/processed/training_corpus_summary.json`
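
After a rebuild it can be useful to confirm that every artifact listed above was actually written. The following is a minimal sketch, not part of the repository: the helper name `missing_artifacts` is hypothetical, and it assumes that `training_corpus_sft.json(.jsonl)` means either extension is acceptable.

```python
from pathlib import Path

# Paths copied from the artifact list, relative to the repository root.
EXPECTED_ARTIFACTS = [
    "data/processed/normalized_drugs.parquet",
    "data/processed/drug_classes.parquet",
    "data/processed/interactions.parquet",
    "data/processed/graph_edges.parquet",
    "data/processed/patients_synthetic.parquet",
    "data/processed/retrieval_corpus.jsonl",
    "data/scenarios/scenarios_easy.jsonl",
    "data/scenarios/scenarios_medium.jsonl",
    "data/scenarios/scenarios_hard.jsonl",
    "data/processed/training_corpus_grpo_prompts.jsonl",
    "data/processed/training_corpus_summary.json",
]

# Assumption: the SFT corpus may be written as .json or .jsonl,
# so finding either file counts as present.
SFT_CANDIDATES = [
    "data/processed/training_corpus_sft.json",
    "data/processed/training_corpus_sft.jsonl",
]


def missing_artifacts(root="."):
    """Return the listed artifact paths that do not exist under `root`."""
    root = Path(root)
    missing = [p for p in EXPECTED_ARTIFACTS if not (root / p).is_file()]
    if not any((root / p).is_file() for p in SFT_CANDIDATES):
        missing.append("data/processed/training_corpus_sft.json(.jsonl)")
    return missing


if __name__ == "__main__":
    for path in missing_artifacts():
        print(f"missing: {path}")
```

Run from the repository root, the script prints one line per missing file and nothing when all artifacts are present.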
## Rebuild

```bash
.venv/bin/python scripts/build_synthetic_patients.py
.venv/bin/python scripts/ingest_open_drug_sources.py
.venv/bin/python scripts/build_drug_knowledge.py
.venv/bin/python scripts/build_retrieval_index.py
.venv/bin/python scripts/build_scenarios.py
.venv/bin/python scripts/bootstrap_data.py
.venv/bin/python scripts/build_training_corpus.py --profile small --with-local --with-synthetic --with-hf
```

Use `--enable-ddi-api` and `--enable-web-fallback` only when network access and provenance review are available.
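
The rebuild commands can also be driven from a small wrapper that stops at the first failing step. This is a sketch under assumptions rather than a script shipped with the project: the function name `rebuild` and its `run` parameter are hypothetical, while the script paths and flags are taken verbatim from the commands above. The network flags are deliberately left out.

```python
import subprocess

# Interpreter path used by the rebuild commands.
PYTHON = ".venv/bin/python"

# One entry per rebuild step, in the documented order.
STEPS = [
    ["scripts/build_synthetic_patients.py"],
    ["scripts/ingest_open_drug_sources.py"],
    ["scripts/build_drug_knowledge.py"],
    ["scripts/build_retrieval_index.py"],
    ["scripts/build_scenarios.py"],
    ["scripts/bootstrap_data.py"],
    [
        "scripts/build_training_corpus.py",
        "--profile", "small",
        "--with-local", "--with-synthetic", "--with-hf",
    ],
]


def rebuild(steps=STEPS, python=PYTHON, run=subprocess.run):
    """Run each step in order; `check=True` aborts on a non-zero exit code."""
    for args in steps:
        cmd = [python, *args]
        print("running:", " ".join(cmd))
        run(cmd, check=True)  # raises subprocess.CalledProcessError on failure


if __name__ == "__main__":
    rebuild()
```

Passing `run` as a parameter keeps the wrapper easy to dry-run: supplying a function that only records the command lists the steps without executing them.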