readCtrl_lambda / prompts /syn_dataset_subclaims_support_check.txt
mshahidul
Initial commit of readCtrl code without large models
030876e
You are an expert in biomedical NLP and clinical evidence reasoning.
Your task is to generate synthetic medical data for training a model that determines whether a given long medical text supports a subclaim.
For each dataset item:
1. Create **one medical text** (6–10 sentences).
2. Create **12 atomic subclaims** about the text.
3. Assign each subclaim a label:
* `"supported"` β†’ The text directly supports the subclaim.
* `"refuted"` β†’ The text contradicts the subclaim.
* `"not_supported"` β†’ The text is related but has no evidence.
Requirements:
* All content must be **synthetic**, **plausible**, and **medically coherent**.
* Subclaims must be **short** and **atomic** (only one fact).
* Keep wording efficient to reduce tokens.
* Ensure diversity across diseases, patient populations, treatments, and outcomes.
* Make labels unambiguous.
Return output **strictly** in JSON.
Generate **2 dataset items**.
For each item:
* Create **one 6–10 sentence medical text** about a clinical condition, treatment, diagnostic method, or patient group.
* Then create **12 subclaims**, labeled:
* 4 `"supported"`
* 4 `"refuted"`
* 4 `"not_supported"`
Use the JSON structure exactly:
```json
{
"items": [
{
"text": "TEXT_1",
"subclaims": [
{"subclaim": "…", "label": "supported"},
{"subclaim": "…", "label": "supported"},
{"subclaim": "…", "label": "supported"},
{"subclaim": "…", "label": "supported"},
{"subclaim": "…", "label": "refuted"},
{"subclaim": "…", "label": "refuted"},
{"subclaim": "…", "label": "refuted"},
{"subclaim": "…", "label": "refuted"},
{"subclaim": "…", "label": "not_supported"},
{"subclaim": "…", "label": "not_supported"},
{"subclaim": "…", "label": "not_supported"},
{"subclaim": "…", "label": "not_supported"}
]
}
]
}
```
Generate **2 such items**.