cesarali commited on
Commit
5686f5b
·
verified ·
1 Parent(s): 269fdb7

manual runtime bundle push from load_and_push.ipynb

Browse files
This view is limited to 50 files because it contains too many changes.   See raw diff
Files changed (50) hide show
  1. LICENSE +21 -0
  2. README.md +131 -0
  3. config.json +445 -0
  4. configuration_sim_priors_pk.py +42 -0
  5. modeling_sim_priors_pk.py +123 -0
  6. pytorch_model.bin +3 -0
  7. requirements.txt +4 -0
  8. sim_priors_pk/.DS_Store +0 -0
  9. sim_priors_pk/__init__.py +43 -0
  10. sim_priors_pk/config_classes/__init__.py +0 -0
  11. sim_priors_pk/config_classes/data_config.py +375 -0
  12. sim_priors_pk/config_classes/diffusion_pk_config.py +327 -0
  13. sim_priors_pk/config_classes/flow_pk_config.py +534 -0
  14. sim_priors_pk/config_classes/node_pk_config.py +518 -0
  15. sim_priors_pk/config_classes/source_process_config.py +52 -0
  16. sim_priors_pk/config_classes/training_config.py +96 -0
  17. sim_priors_pk/config_classes/utils.py +14 -0
  18. sim_priors_pk/config_classes/yaml_fallback.py +143 -0
  19. sim_priors_pk/data/README.md +86 -0
  20. sim_priors_pk/data/__init__.py +12 -0
  21. sim_priors_pk/data/data_empirical/__init__.py +35 -0
  22. sim_priors_pk/data/data_empirical/builder.py +1139 -0
  23. sim_priors_pk/data/data_empirical/json_schema.py +372 -0
  24. sim_priors_pk/data/data_empirical/json_stats.py +201 -0
  25. sim_priors_pk/data/data_empirical/simulx_to_json.py +71 -0
  26. sim_priors_pk/data/data_generation/__init__.py +0 -0
  27. sim_priors_pk/data/data_generation/compartment_models.py +721 -0
  28. sim_priors_pk/data/data_generation/compartment_models_management.py +1338 -0
  29. sim_priors_pk/data/data_generation/dosing_models.py +0 -0
  30. sim_priors_pk/data/data_generation/observations_classes.py +1776 -0
  31. sim_priors_pk/data/data_generation/observations_functions.py +69 -0
  32. sim_priors_pk/data/data_generation/study_population_stats.py +185 -0
  33. sim_priors_pk/data/data_preprocessing/__init__.py +0 -0
  34. sim_priors_pk/data/data_preprocessing/data_preprocessing_utils.py +321 -0
  35. sim_priors_pk/data/data_preprocessing/raw_to_tensors_bundles.py +360 -0
  36. sim_priors_pk/data/data_preprocessing/tensors_to_databatch.py +72 -0
  37. sim_priors_pk/data/datasets/aicme_batch.py +167 -0
  38. sim_priors_pk/data/datasets/aicme_datasets.py +1874 -0
  39. sim_priors_pk/data/extra/compartment_models_vectorized.py +182 -0
  40. sim_priors_pk/data/extra/kernels.py +28 -0
  41. sim_priors_pk/hub_runtime/README.md +187 -0
  42. sim_priors_pk/hub_runtime/__init__.py +19 -0
  43. sim_priors_pk/hub_runtime/configuration_sim_priors_pk.py +42 -0
  44. sim_priors_pk/hub_runtime/modeling_sim_priors_pk.py +123 -0
  45. sim_priors_pk/hub_runtime/runtime_bundle.py +269 -0
  46. sim_priors_pk/hub_runtime/runtime_contract.py +662 -0
  47. sim_priors_pk/metrics/__init__.py +0 -0
  48. sim_priors_pk/metrics/pk_metrics.py +490 -0
  49. sim_priors_pk/metrics/quantiles_coverage.py +310 -0
  50. sim_priors_pk/metrics/sampling_quality.py +409 -0
LICENSE ADDED
@@ -0,0 +1,21 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ MIT License
2
+
3
+ Copyright (c) 2025 César A. Ojeda
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
README.md ADDED
@@ -0,0 +1,131 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ language:
3
+ - en
4
+ license: apache-2.0
5
+ library_name: generative-pk
6
+ datasets:
7
+ - simulated
8
+ metrics:
9
+ - rmse
10
+ - npde
11
+ tags:
12
+ - generative
13
+ - predictive
14
+ ---
15
+
16
+ # Hierarchical Neural Process for Pharmacokinetic Data
17
+
18
+ ## Overview
19
+ An Amortized Context Neural Process Generative model for Pharmacokinetic Modelling
20
+
21
+ **Model details:**
22
+ - **Authors:** César Ojeda (@cesarali)
23
+ - **License:** Apache 2.0
24
+
25
+ ## Intended use
26
+ Sample Drug Concentration Behavior and Sample and Prediction of New Points or new Individual
27
+ ## Runtime Bundle
28
+
29
+ This repository is the consumer-facing runtime bundle for this PK model.
30
+
31
+ - Runtime repo: `cesarali/AICME-runtime`
32
+ - Native training/artifact repo: `cesarali/AICMEPK_cluster`
33
+ - Supported tasks: `generate`, `predict`
34
+ - Default task: `generate`
35
+ - Load path: `AutoModel.from_pretrained(..., trust_remote_code=True)`
36
+
37
+ ### Installation
38
+
39
+ You do **not** need to install `sim_priors_pk` to use this runtime bundle.
40
+
41
+ `transformers` is the public loading entrypoint, but `transformers` alone is
42
+ not sufficient because this is a PyTorch model with custom runtime code. A
43
+ reliable consumer environment is:
44
+
45
+ ```bash
46
+ pip install torch transformers huggingface_hub lightning datasets pandas torchtyping gpytorch pot torchdiffeq torchsde ruamel.yaml pyyaml
47
+ ```
48
+
49
+ ### Python Usage
50
+
51
+ ```python
52
+ from transformers import AutoModel
53
+
54
+ model = AutoModel.from_pretrained("cesarali/AICME-runtime", trust_remote_code=True)
55
+
56
+ studies = [
57
+ {
58
+ "context": [
59
+ {
60
+ "name_id": "ctx_0",
61
+ "observations": [0.2, 0.5, 0.3],
62
+ "observation_times": [0.5, 1.0, 2.0],
63
+ "dosing": [1.0],
64
+ "dosing_type": ["oral"],
65
+ "dosing_times": [0.0],
66
+ "dosing_name": ["oral"],
67
+ }
68
+ ],
69
+ "target": [],
70
+ "meta_data": {"study_name": "demo", "substance_name": "drug_x"},
71
+ }
72
+ ]
73
+
74
+ outputs = model.run_task(
75
+ task="generate",
76
+ studies=studies,
77
+ num_samples=4,
78
+ )
79
+ print(outputs["results"][0]["samples"])
80
+ ```
81
+
82
+ ### Predictive Sampling
83
+
84
+ ```python
85
+ from transformers import AutoModel
86
+
87
+ model = AutoModel.from_pretrained("cesarali/AICME-runtime", trust_remote_code=True)
88
+
89
+ predict_studies = [
90
+ {
91
+ "context": [
92
+ {
93
+ "name_id": "ctx_0",
94
+ "observations": [0.2, 0.5, 0.3],
95
+ "observation_times": [0.5, 1.0, 2.0],
96
+ "dosing": [1.0],
97
+ "dosing_type": ["oral"],
98
+ "dosing_times": [0.0],
99
+ "dosing_name": ["oral"],
100
+ }
101
+ ],
102
+ "target": [
103
+ {
104
+ "name_id": "tgt_0",
105
+ "observations": [0.25, 0.31],
106
+ "observation_times": [0.5, 1.0],
107
+ "remaining": [0.0, 0.0, 0.0],
108
+ "remaining_times": [2.0, 4.0, 8.0],
109
+ "dosing": [1.0],
110
+ "dosing_type": ["oral"],
111
+ "dosing_times": [0.0],
112
+ "dosing_name": ["oral"],
113
+ }
114
+ ],
115
+ "meta_data": {"study_name": "demo", "substance_name": "drug_x"},
116
+ }
117
+ ]
118
+
119
+ outputs = model.run_task(
120
+ task="predict",
121
+ studies=predict_studies,
122
+ num_samples=4,
123
+ )
124
+ print(outputs["results"][0]["samples"][0]["target"][0]["prediction_samples"])
125
+ ```
126
+
127
+ ### Notes
128
+
129
+ - `trust_remote_code=True` is required because this model uses custom Hugging Face Hub runtime code.
130
+ - The consumer API is `transformers` + `run_task(...)`; the consumer does not need a local clone of this repository.
131
+ - This runtime bundle is intentionally separate from the native training export so you can evaluate both distribution paths in parallel.
config.json ADDED
@@ -0,0 +1,445 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "architecture_name": "AICMEPK",
3
+ "architectures": [
4
+ "PKHubModel"
5
+ ],
6
+ "auto_map": {
7
+ "AutoConfig": "configuration_sim_priors_pk.PKHubConfig",
8
+ "AutoModel": "modeling_sim_priors_pk.PKHubModel"
9
+ },
10
+ "builder_config": {
11
+ "max_context_individuals": 10,
12
+ "max_context_observations": 15,
13
+ "max_context_remaining": 15,
14
+ "max_target_individuals": 1,
15
+ "max_target_observations": 5,
16
+ "max_target_remaining": 12
17
+ },
18
+ "default_task": "generate",
19
+ "experiment_config": {
20
+ "comet_ai_key": null,
21
+ "context_observations": {
22
+ "add_rem": true,
23
+ "drop_time_zero_observations": false,
24
+ "empirical_number_of_obs": false,
25
+ "generative_bias": false,
26
+ "max_num_obs": 15,
27
+ "max_past": 5,
28
+ "min_past": 3,
29
+ "past_time_ratio": 0.1,
30
+ "split_past_future": false,
31
+ "type": "pk_peak_half_life"
32
+ },
33
+ "debug_test": false,
34
+ "dosing": {
35
+ "logdose_mean_range": [
36
+ -2.0,
37
+ 2.0
38
+ ],
39
+ "logdose_std_range": [
40
+ 0.1,
41
+ 0.5
42
+ ],
43
+ "num_individuals": 10,
44
+ "route_options": [
45
+ "oral",
46
+ "iv"
47
+ ],
48
+ "route_weights": [
49
+ 0.8,
50
+ 0.2
51
+ ],
52
+ "same_route": true,
53
+ "time": 0.0
54
+ },
55
+ "experiment_dir": "/work/ojedamarin/Projects/Pharma/Results/comet/uai/7195d8f55b5d4684a766a69d5a736d28",
56
+ "experiment_indentifier": null,
57
+ "experiment_name": "uai",
58
+ "experiment_type": "nodepk",
59
+ "hf_model_card_path": [
60
+ "hf_model_cards",
61
+ "AICME-PK_Readme.md"
62
+ ],
63
+ "hf_model_name": "AICMEPK_cluster",
64
+ "hugging_face_token": null,
65
+ "meta_study": {
66
+ "V_tmag_range": [
67
+ 0.001,
68
+ 0.001
69
+ ],
70
+ "V_tscl_range": [
71
+ 1,
72
+ 5
73
+ ],
74
+ "drug_id_options": [
75
+ "Drug_A",
76
+ "Drug_B",
77
+ "Drug_C"
78
+ ],
79
+ "k_1p_tmag_range": [
80
+ 0.01,
81
+ 0.02
82
+ ],
83
+ "k_1p_tscl_range": [
84
+ 1,
85
+ 5
86
+ ],
87
+ "k_a_tmag_range": [
88
+ 0.01,
89
+ 0.02
90
+ ],
91
+ "k_a_tscl_range": [
92
+ 1,
93
+ 5
94
+ ],
95
+ "k_e_tmag_range": [
96
+ 0.01,
97
+ 0.02
98
+ ],
99
+ "k_e_tscl_range": [
100
+ 1,
101
+ 5
102
+ ],
103
+ "k_p1_tmag_range": [
104
+ 0.01,
105
+ 0.02
106
+ ],
107
+ "k_p1_tscl_range": [
108
+ 1,
109
+ 5
110
+ ],
111
+ "log_V_mean_range": [
112
+ 2,
113
+ 8
114
+ ],
115
+ "log_V_std_range": [
116
+ 0.2,
117
+ 0.6
118
+ ],
119
+ "log_k_1p_mean_range": [
120
+ -4,
121
+ 0
122
+ ],
123
+ "log_k_1p_std_range": [
124
+ 0.2,
125
+ 0.6
126
+ ],
127
+ "log_k_a_mean_range": [
128
+ -1,
129
+ 2
130
+ ],
131
+ "log_k_a_std_range": [
132
+ 0.2,
133
+ 0.6
134
+ ],
135
+ "log_k_e_mean_range": [
136
+ -5,
137
+ 0
138
+ ],
139
+ "log_k_e_std_range": [
140
+ 0.2,
141
+ 0.6
142
+ ],
143
+ "log_k_p1_mean_range": [
144
+ -4,
145
+ -1
146
+ ],
147
+ "log_k_p1_std_range": [
148
+ 0.2,
149
+ 0.6
150
+ ],
151
+ "num_individuals_range": [
152
+ 5,
153
+ 10
154
+ ],
155
+ "num_peripherals_range": [
156
+ 1,
157
+ 3
158
+ ],
159
+ "rel_ruv_range": [
160
+ 0.001,
161
+ 0.01
162
+ ],
163
+ "solver_method": "rk4",
164
+ "time_num_steps": 100,
165
+ "time_start": 0.0,
166
+ "time_stop": 16.0
167
+ },
168
+ "mix_data": {
169
+ "evaluate_prediction_steps_past": 5,
170
+ "keep_tempfile": false,
171
+ "log_and_max": false,
172
+ "log_and_z": false,
173
+ "log_transform": false,
174
+ "n_of_databatches": null,
175
+ "n_of_permutations": 3,
176
+ "n_of_target_individuals": 1,
177
+ "normalize_by_max": true,
178
+ "normalize_time": true,
179
+ "recreate_tempfile": false,
180
+ "sample_size_for_generative_evaluation": null,
181
+ "sample_size_for_generative_evaluation_end_of_training": 500,
182
+ "sample_size_for_generative_evaluation_val": 10,
183
+ "store_in_tempfile": false,
184
+ "tempfile_path": [
185
+ "preprocessed",
186
+ "simulated_ou_as_rates"
187
+ ],
188
+ "test_empirical_datasets": [
189
+ "cesarali/lenuzza-2016",
190
+ "cesarali/Indometacin",
191
+ "cesarali/Theophylline"
192
+ ],
193
+ "test_size": 64,
194
+ "tqdm_progress": false,
195
+ "train_size": 12800,
196
+ "val_size": 256,
197
+ "z_score_normalization": false
198
+ },
199
+ "my_results_path": "/work/ojedamarin/Projects/Pharma/Results/",
200
+ "name_str": "AICMEPK",
201
+ "network": {
202
+ "activation": "ReLU",
203
+ "aggregator_num_heads": 8,
204
+ "aggregator_type": "mean",
205
+ "combine_latent_mode": "mlp",
206
+ "cov_proj_dim": 16,
207
+ "decoder_attention_layers": 2,
208
+ "decoder_hidden_dim": 512,
209
+ "decoder_name": "TransformerDecoder",
210
+ "decoder_num_layers": 4,
211
+ "decoder_rnn_hidden_dim": 256,
212
+ "drift_activation": "Tanh",
213
+ "drift_num_layers": 2,
214
+ "dropout": 0.1,
215
+ "encoder_rnn_hidden_dim": 256,
216
+ "exclusive_node_step": true,
217
+ "ignore_logvar": true,
218
+ "individual_encoder_name": "RNNContextEncoder",
219
+ "individual_encoder_number_of_heads": 4,
220
+ "init_hidden_num_layers": 2,
221
+ "input_encoding_hidden_dim": 128,
222
+ "kl_weight": 1.0,
223
+ "loss_name": "log_nll",
224
+ "node_step": true,
225
+ "norm": "layer",
226
+ "output_head_num_layers": 3,
227
+ "prediction_latent_deterministic": false,
228
+ "prediction_only": false,
229
+ "reconstruction_only": false,
230
+ "rnn_decoder_number_of_layers": 4,
231
+ "rnn_individual_encoder_number_of_layers": 4,
232
+ "scale_dosing_amounts": true,
233
+ "study_latent_deterministic": false,
234
+ "time_obs_encoder_hidden_dim": 256,
235
+ "time_obs_encoder_output_dim": 256,
236
+ "use_attention": true,
237
+ "use_invariance_loss": false,
238
+ "use_kl_i": true,
239
+ "use_kl_i_np": true,
240
+ "use_kl_init": true,
241
+ "use_kl_s": true,
242
+ "use_self_attention": true,
243
+ "use_time_deltas": true,
244
+ "zi_latent_dim": 128
245
+ },
246
+ "run_index": 0,
247
+ "tags": [
248
+ "AICME",
249
+ "AISTATS-2026",
250
+ "camera-ready"
251
+ ],
252
+ "target_observations": {
253
+ "add_rem": true,
254
+ "drop_time_zero_observations": false,
255
+ "empirical_number_of_obs": 2,
256
+ "generative_bias": false,
257
+ "max_num_obs": 15,
258
+ "max_past": 5,
259
+ "min_past": 3,
260
+ "past_time_ratio": 0.1,
261
+ "split_past_future": true,
262
+ "type": "pk_peak_half_life"
263
+ },
264
+ "train": {
265
+ "amsgrad": false,
266
+ "batch_size": 64,
267
+ "betas": [
268
+ 0.9,
269
+ 0.999
270
+ ],
271
+ "callbacks_scheduler": {
272
+ "checkpoint_used_in_end": [
273
+ "end",
274
+ "best",
275
+ "log_rmse"
276
+ ],
277
+ "include_end": true,
278
+ "keep_temp_files": false,
279
+ "max_samples_per_group": 500,
280
+ "percent_step": 0.1,
281
+ "skip_sanity_check": true,
282
+ "store_samples": true,
283
+ "task_during": [
284
+ {
285
+ "fn_key": "pk.predictive.images",
286
+ "log_prefix": "Synthetic",
287
+ "n_samples": 1,
288
+ "name": "synthetic/predictive_images",
289
+ "sample_source": "val_batch",
290
+ "save_to_disk": true,
291
+ "split": "val",
292
+ "task_cfg": {
293
+ "label": "Synthetic",
294
+ "milestone_stride": 1
295
+ }
296
+ },
297
+ {
298
+ "fn_key": "pk.generative.images",
299
+ "log_prefix": "Synthetic",
300
+ "n_samples": 10,
301
+ "name": "synthetic/new_individuals_images",
302
+ "sample_source": "val_batch",
303
+ "save_to_disk": true,
304
+ "split": "val",
305
+ "task_cfg": {
306
+ "label": "Synthetic",
307
+ "milestone_stride": 1
308
+ }
309
+ },
310
+ {
311
+ "fn_key": "pk.predictive.metrics",
312
+ "log_prefix": "Empirical",
313
+ "n_samples": 1,
314
+ "name": "empirical/predictive_metrics",
315
+ "sample_source": "empirical_set",
316
+ "save_to_disk": false,
317
+ "split": "empirical_heldout",
318
+ "task_cfg": {
319
+ "label": "Empirical",
320
+ "milestone_stride": 5
321
+ }
322
+ },
323
+ {
324
+ "checkpoint_metric": true,
325
+ "checkpoint_metric_name": "log_rmse",
326
+ "checkpoint_mode": "min",
327
+ "fn_key": "pk.empirical.summary",
328
+ "log_prefix": "Empirical",
329
+ "n_samples": 0,
330
+ "name": "empirical/summary",
331
+ "sample_source": "val_batch",
332
+ "save_to_disk": false,
333
+ "split": "val",
334
+ "task_cfg": {
335
+ "label": "Empirical",
336
+ "milestone_stride": 5,
337
+ "selected_summary_drugs": [
338
+ "paracetamol glucuronide",
339
+ "midazolam"
340
+ ],
341
+ "summary_metric": "log_rmse"
342
+ }
343
+ }
344
+ ],
345
+ "tasks_end": [
346
+ {
347
+ "fn_key": "pk.predictive.metrics",
348
+ "log_prefix": "Empirical",
349
+ "n_samples": 1,
350
+ "name": "empirical/predictive_metrics",
351
+ "sample_source": "empirical_set",
352
+ "save_to_disk": false,
353
+ "split": "empirical_heldout",
354
+ "task_cfg": {
355
+ "label": "Empirical"
356
+ }
357
+ },
358
+ {
359
+ "fn_key": "pk.predictive.images",
360
+ "log_prefix": "Empirical",
361
+ "n_samples": 1,
362
+ "name": "empirical/predictive_images",
363
+ "sample_source": "empirical_set",
364
+ "save_to_disk": true,
365
+ "split": "empirical_heldout",
366
+ "task_cfg": {
367
+ "label": "Empirical"
368
+ }
369
+ },
370
+ {
371
+ "fn_key": "pk.vpc.npde_pvalues",
372
+ "log_prefix": "Empirical",
373
+ "n_samples": 500,
374
+ "name": "empirical/vpc_npde_pvalues",
375
+ "sample_source": "empirical_set",
376
+ "save_to_disk": false,
377
+ "split": "empirical_no_heldout",
378
+ "task_cfg": {
379
+ "label": "Empirical"
380
+ }
381
+ },
382
+ {
383
+ "fn_key": "pk.vpc.images",
384
+ "log_prefix": "Empirical",
385
+ "n_samples": 500,
386
+ "name": "empirical/vpc_images",
387
+ "sample_source": "empirical_set",
388
+ "save_to_disk": true,
389
+ "split": "empirical_no_heldout",
390
+ "task_cfg": {
391
+ "label": "Empirical"
392
+ }
393
+ },
394
+ {
395
+ "fn_key": "pk.empirical.summary",
396
+ "log_prefix": "Empirical",
397
+ "n_samples": 0,
398
+ "name": "empirical/summary",
399
+ "sample_source": "val_batch",
400
+ "save_to_disk": false,
401
+ "split": "val",
402
+ "task_cfg": {
403
+ "label": "Empirical",
404
+ "selected_summary_drugs": [
405
+ "paracetamol glucuronide",
406
+ "midazolam"
407
+ ],
408
+ "summary_metric": "log_rmse"
409
+ }
410
+ }
411
+ ],
412
+ "tasks_validation": []
413
+ },
414
+ "epochs": 100,
415
+ "eps": 1e-08,
416
+ "gradient_clip_val": 0.5,
417
+ "learning_rate": 0.0001,
418
+ "log_interval": 1,
419
+ "num_batch_plot": 1,
420
+ "num_workers": 8,
421
+ "optimizer_name": "AdamW",
422
+ "persistent_workers": true,
423
+ "scheduler_name": "CosineAnnealingLR",
424
+ "scheduler_params": {
425
+ "T_max": 1000,
426
+ "eta_min": 5e-05,
427
+ "last_epoch": -1
428
+ },
429
+ "shuffle_val": true,
430
+ "weight_decay": 0.0001
431
+ },
432
+ "upload_to_hf_hub": true,
433
+ "verbose": false
434
+ },
435
+ "experiment_type": "nodepk",
436
+ "io_schema_version": "studyjson-v1",
437
+ "model_type": "sim_priors_pk",
438
+ "original_repo_id": "cesarali/AICMEPK_cluster",
439
+ "runtime_repo_id": "cesarali/AICME-runtime",
440
+ "supported_tasks": [
441
+ "generate",
442
+ "predict"
443
+ ],
444
+ "transformers_version": "4.52.4"
445
+ }
configuration_sim_priors_pk.py ADDED
@@ -0,0 +1,42 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Hugging Face configuration for self-contained PK runtime bundles."""
2
+
3
+ from __future__ import annotations
4
+
5
+ from typing import Any, Dict, List, Optional
6
+
7
+ from transformers import PretrainedConfig
8
+
9
+ from sim_priors_pk.hub_runtime.runtime_contract import STUDY_JSON_IO_VERSION
10
+
11
+
12
+ class PKHubConfig(PretrainedConfig):
13
+ """Public Hub config describing a consumer-facing PK runtime bundle."""
14
+
15
+ model_type = "sim_priors_pk"
16
+
17
+ def __init__(
18
+ self,
19
+ architecture_name: Optional[str] = None,
20
+ experiment_type: str = "nodepk",
21
+ experiment_config: Optional[Dict[str, Any]] = None,
22
+ builder_config: Optional[Dict[str, Any]] = None,
23
+ supported_tasks: Optional[List[str]] = None,
24
+ default_task: Optional[str] = None,
25
+ io_schema_version: str = STUDY_JSON_IO_VERSION,
26
+ original_repo_id: Optional[str] = None,
27
+ runtime_repo_id: Optional[str] = None,
28
+ **kwargs,
29
+ ) -> None:
30
+ super().__init__(**kwargs)
31
+ self.architecture_name = architecture_name
32
+ self.experiment_type = experiment_type
33
+ self.experiment_config = dict(experiment_config or {})
34
+ self.builder_config = dict(builder_config or {})
35
+ self.supported_tasks = list(supported_tasks or [])
36
+ self.default_task = default_task or (self.supported_tasks[0] if self.supported_tasks else None)
37
+ self.io_schema_version = io_schema_version
38
+ self.original_repo_id = original_repo_id
39
+ self.runtime_repo_id = runtime_repo_id
40
+
41
+
42
+ __all__ = ["PKHubConfig"]
modeling_sim_priors_pk.py ADDED
@@ -0,0 +1,123 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Hugging Face AutoModel wrapper for consumer-facing PK runtime bundles."""
2
+
3
+ from __future__ import annotations
4
+
5
+ from typing import Any, Dict, Optional, Sequence, Union
6
+
7
+ import torch
8
+ from transformers import PreTrainedModel
9
+
10
+ from sim_priors_pk.data.data_empirical.json_schema import StudyJSON
11
+ from sim_priors_pk.hub_runtime.configuration_sim_priors_pk import PKHubConfig
12
+ from sim_priors_pk.hub_runtime.runtime_contract import (
13
+ RuntimeBuilderConfig,
14
+ build_batch_from_studies,
15
+ infer_supported_tasks,
16
+ instantiate_backbone_from_hub_config,
17
+ normalize_studies_input,
18
+ split_runtime_samples,
19
+ validate_studies_for_task,
20
+ )
21
+ from sim_priors_pk.models.amortized_inference.generative_pk import (
22
+ NewGenerativeMixin,
23
+ NewPredictiveMixin,
24
+ )
25
+
26
+
27
class PKHubModel(PreTrainedModel):
    """Thin wrapper exposing a stable StudyJSON runtime API on top of PK models."""

    config_class = PKHubConfig
    base_model_prefix = "backbone"

    def __init__(self, config: PKHubConfig, backbone: Optional[torch.nn.Module] = None) -> None:
        super().__init__(config)
        if backbone is None:
            # No backbone supplied: rebuild it from the hub config payloads.
            backbone = instantiate_backbone_from_hub_config(config)
        self.backbone = backbone
        # Runtime bundles are inference-only, so the backbone stays in eval mode.
        self.backbone.eval()

    def forward(self, *args, **kwargs):
        """Delegate raw forward calls to the wrapped PK backbone."""

        return self.backbone(*args, **kwargs)

    @property
    def supported_tasks(self) -> Sequence[str]:
        """Tasks supported by this runtime model."""

        configured = getattr(self.config, "supported_tasks", [])
        # Prefer the explicit config list; otherwise infer from the backbone.
        return tuple(configured or infer_supported_tasks(self.backbone))

    @torch.inference_mode()
    def run_task(
        self,
        *,
        task: str,
        studies: Union[StudyJSON, Sequence[StudyJSON]],
        num_samples: int = 1,
        **kwargs: Any,
    ) -> Dict[str, Any]:
        """Run the public StudyJSON inference contract for the requested task."""

        supported_tasks = list(self.supported_tasks)
        if task not in supported_tasks:
            raise ValueError(
                f"Unsupported task {task!r}. Supported tasks: {supported_tasks or 'none'}."
            )
        sample_count = int(num_samples)
        if sample_count < 1:
            raise ValueError("num_samples must be >= 1.")

        # Canonicalize and validate the StudyJSON payload before batching.
        canonical_studies = normalize_studies_input(studies)
        builder_config = RuntimeBuilderConfig.from_dict(self.config.builder_config)
        validate_studies_for_task(canonical_studies, task=task, builder_config=builder_config)

        # Rebuild the meta-dosing object from the experiment config when a
        # dosing payload is present; otherwise reuse the backbone's own.
        experiment_config_payload = getattr(self.config, "experiment_config", {})
        meta_dosing_payload = experiment_config_payload.get("dosing", {})
        if meta_dosing_payload:
            meta_dosing = self.backbone.meta_dosing.__class__(**meta_dosing_payload)
        else:
            meta_dosing = self.backbone.meta_dosing

        batch = build_batch_from_studies(
            canonical_studies,
            builder_config=builder_config,
            meta_dosing=meta_dosing,
        ).to(self.device)

        if task == "generate":
            if not isinstance(self.backbone, NewGenerativeMixin):
                raise ValueError(f"Backbone {type(self.backbone).__name__} does not support generate.")
            output_studies = self.backbone.sample_new_individuals_to_studyjson(
                batch,
                sample_size=sample_count,
                num_steps=kwargs.get("num_steps"),
            )
        elif task == "predict":
            if not isinstance(self.backbone, NewPredictiveMixin):
                raise ValueError(f"Backbone {type(self.backbone).__name__} does not support predict.")
            # The predictive API is batch-list based; wrap and unwrap the batch.
            output_studies = self.backbone.sample_individual_prediction_from_batch_list_to_studyjson(
                [batch],
                sample_size=sample_count,
            )[0]
        else:
            raise ValueError(f"Unsupported task {task!r}.")

        results = []
        for index, study in enumerate(output_studies):
            results.append(
                {
                    "input_index": index,
                    "samples": split_runtime_samples(task, study),
                }
            )

        return {
            "task": task,
            "io_schema_version": self.config.io_schema_version,
            "model_info": {
                "architecture_name": self.config.architecture_name,
                "experiment_type": self.config.experiment_type,
                "supported_tasks": supported_tasks,
                "runtime_repo_id": self.config.runtime_repo_id,
                "original_repo_id": self.config.original_repo_id,
            },
            "results": results,
        }
121
+
122
+
123
+ __all__ = ["PKHubModel"]
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ec19d3a6970fcda03332a75ea0b12bb53e17e2d945088ef46e28c74f73195c84
3
+ size 37495779
requirements.txt ADDED
@@ -0,0 +1,4 @@
 
 
 
 
 
1
+ pytest==8.3.5
2
+ ipython==9.2.0
3
+ comet_ml==3.49.6
4
+ matplotlib==3.10.1 # plotting only — not required for inference
sim_priors_pk/.DS_Store ADDED
Binary file (6.15 kB). View file
 
sim_priors_pk/__init__.py ADDED
@@ -0,0 +1,43 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from pathlib import Path
2
+
3
+
4
+ def _load_key_file(path: Path) -> str | None:
5
+ """Return the contents of a key file if it exists, otherwise ``None``."""
6
+
7
+ try:
8
+ return path.read_text(encoding="utf-8").strip()
9
+ except FileNotFoundError:
10
+ return None
11
+ except OSError:
12
+ # If the file is unreadable we surface the issue by returning ``None``
13
+ # so callers can decide how to handle missing credentials.
14
+ return None
15
+
16
+
17
+ base_dir = Path(__file__).resolve().parent
18
+ project_dir = (base_dir / "..").resolve()
19
+ data_dir = project_dir / "data"
20
+ test_resources_dir = project_dir / "tests" / "resources"
21
+ results_dir = project_dir / "results"
22
+ reports_dir = project_dir / "reports"
23
+ config_dir = project_dir / "config_files"
24
+
25
+ comet_keys_file = project_dir / "COMET_KEYS.txt"
26
+ hf_keys_file = project_dir / "KEYS.txt"
27
+
28
+ COMET_KEY = _load_key_file(comet_keys_file)
29
+ HUGGINGFACE_KEY = _load_key_file(hf_keys_file)
30
+
31
+ __all__ = [
32
+ "COMET_KEY",
33
+ "HUGGINGFACE_KEY",
34
+ "base_dir",
35
+ "comet_keys_file",
36
+ "config_dir",
37
+ "data_dir",
38
+ "hf_keys_file",
39
+ "project_dir",
40
+ "reports_dir",
41
+ "results_dir",
42
+ "test_resources_dir",
43
+ ]
sim_priors_pk/config_classes/__init__.py ADDED
File without changes
sim_priors_pk/config_classes/data_config.py ADDED
@@ -0,0 +1,375 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ from dataclasses import dataclass, field
3
+ from typing import List, Dict, Tuple, Optional, Union
4
+ import warnings
5
+
6
+ try: # pragma: no cover - exercised indirectly via configuration loading
7
+ import yaml # type: ignore
8
+ except ModuleNotFoundError: # pragma: no cover - fallback for minimal environments
9
+ from sim_priors_pk.config_classes import yaml_fallback as yaml
10
+
11
+ try: # pragma: no cover - optional dependency for downstream modules
12
+ import torch # type: ignore
13
+ except ModuleNotFoundError: # pragma: no cover - torch is not required for configuration loading
14
+ torch = None # type: ignore
15
+
16
@dataclass
class SimpleMetaStudyConfig:
    """
    Minimal configuration for the synthetic (non-mechanistic) PK simulator.
    Used when `simple_mode=True` is detected in the YAML file.
    """

    simple_mode: bool = True

    # --- keep same naming as MetaStudyConfig for compatibility ---
    num_individuals: int = 16
    num_individuals_range: Tuple[int, int] = (16, 16)  # mirrors MetaStudyConfig to avoid downstream errors

    # Simulation time grid.
    time_start: float = 0.0
    time_stop: float = 24.0
    time_num_steps: int = 40

    # Shape parameters of the synthetic curves (sampling ranges).
    band_scale_range: Tuple[float, float] = (0.1, 0.3)
    baseline_range: Tuple[float, float] = (0.0, 0.1)
    decay_rate_range: Tuple[float, float] = (0.3, 0.6)
    # NOTE(review): presumably the probability of the exponential shape
    # (else the pulse); the original comment was inconsistent with the
    # default (0.5 vs "65%") — TODO confirm against the simulator.
    p1: float = 0.5
    num_peripherals_range: Tuple[int, int] = (1, 3)

    solver_method: str = "dummy"  # no mechanistic ODE solve in simple mode
    drug_id_options: List[str] = field(default_factory=lambda: ["DummyDrug"])

    @classmethod
    def from_yaml(cls, file_path: Union[str, os.PathLike]) -> "SimpleMetaStudyConfig":
        """Load the configuration from a YAML file.

        Keys may appear at the top level or nested under ``meta_study``.
        """
        with open(file_path, "r", encoding="utf-8") as handle:
            cfg = yaml.safe_load(handle) or {}
        cfg = cfg.get("meta_study", cfg)

        # Ensure backward compatibility if YAML only defines num_individuals
        if "num_individuals_range" not in cfg and "num_individuals" in cfg:
            n = cfg["num_individuals"]
            cfg["num_individuals_range"] = (n, n)

        return cls(**cfg)
54
+
55
+
56
@dataclass
class MetaStudyConfig:
    """
    This class contains the configuration for the compartment study.
    i.e. it specifies the parameters to sample the population which
    in turns will sample the individuals.
    """

    # Study-level choices.
    drug_id_options: List[str] = field(default_factory=lambda: ["Drug_A", "Drug_B", "Drug_C"])
    num_individuals_range: Tuple[int, int] = (20, 20)

    # Population-level PK parameter priors. Each rate/volume parameter gets a
    # range for the mean and std of its log-scale distribution, plus
    # "tmag"/"tscl" ranges (presumably time-magnitude / time-scale modifiers
    # consumed by the simulator — TODO confirm in compartment_models).
    num_peripherals_range: Tuple[int, int] = (1, 3)
    log_k_a_mean_range: Tuple[float, float] = (-1.5, 1.5)
    log_k_a_std_range: Tuple[float, float] = (0.1, 0.5)
    k_a_tmag_range: Tuple[float, float] = (0.01, 0.1)
    k_a_tscl_range: Tuple[float, float] = (1.0, 5.0)
    log_k_e_mean_range: Tuple[float, float] = (-1.5, 1.5)
    log_k_e_std_range: Tuple[float, float] = (0.1, 0.5)
    k_e_tmag_range: Tuple[float, float] = (0.01, 0.1)
    k_e_tscl_range: Tuple[float, float] = (1.0, 5.0)
    log_V_mean_range: Tuple[float, float] = (-1.5, 1.5)
    log_V_std_range: Tuple[float, float] = (0.1, 0.5)
    V_tmag_range: Tuple[float, float] = (0.01, 0.1)
    V_tscl_range: Tuple[float, float] = (1.0, 5.0)
    log_k_1p_mean_range: Tuple[float, float] = (-1.5, 1.5)
    log_k_1p_std_range: Tuple[float, float] = (0.1, 0.5)
    k_1p_tmag_range: Tuple[float, float] = (0.01, 0.1)
    k_1p_tscl_range: Tuple[float, float] = (1.0, 5.0)
    log_k_p1_mean_range: Tuple[float, float] = (-1.5, 1.5)
    log_k_p1_std_range: Tuple[float, float] = (0.1, 0.5)
    k_p1_tmag_range: Tuple[float, float] = (0.01, 0.1)
    k_p1_tscl_range: Tuple[float, float] = (1.0, 5.0)

    # Parameters for observation noise (relative residual unexplained variability).
    rel_ruv_range: Tuple[float, float] = (0.05, 0.3)

    # Parameters for generating time_points
    time_start: float = 0.0
    time_stop: float = 10.0
    time_num_steps: int = 100

    # parameters for solver
    solver_method: str = "rk4"

    @classmethod
    def from_yaml(cls, file_path: Union[str, os.PathLike]) -> "MetaStudyConfig":
        """Instantiate the meta-study configuration from a YAML file.

        Keys may appear at the top level or nested under a ``meta_study``
        section. Raises ``TypeError`` if the resolved section is not a mapping.
        """

        with open(file_path, "r", encoding="utf-8") as handle:
            config_dict = yaml.safe_load(handle) or {}

        if isinstance(config_dict, dict) and "meta_study" in config_dict:
            config_dict = config_dict.get("meta_study") or {}

        if not isinstance(config_dict, dict):
            raise TypeError("Expected 'meta_study' section in YAML to be a mapping.")

        return cls(**config_dict)
114
+
115
+
116
@dataclass
class ObservationsConfig:
    """High-level knobs describing an observation strategy."""

    # ``None`` (e.g. YAML ``type: null``) is treated as the legacy
    # ``pk_peak_half_life`` strategy by the observation factory.
    type: Optional[str] = "pk_peak_half_life"
    add_rem: bool = True  # forced to True by __post_init__ when split_past_future is set
    split_past_future: bool = False
    min_past: Optional[int] = None  # must be non-None when split_past_future=True
    max_past: Optional[int] = None  # must be non-None when split_past_future=True
    max_num_obs: int = 10
    empirical_number_of_obs: int = 2
    # When True, entries at non-positive times are excluded from sampled
    # observations (e.g. concentration at dosing time t=0).
    drop_time_zero_observations: bool = False

    # Strategy specific semantic controls (do not affect tensor shapes directly)
    past_time_ratio: float = 0.1  # Used by random strategies with fixed boundary
    # Sampling policy for split-past/future strategies:
    # - False: sample uniformly in [min_past, max_past]
    # - True: sample 0 with prob. 0.5, otherwise sample uniformly
    #   in [max(1, min_past), max_past]
    generative_bias: bool = False

    def __post_init__(self):
        """Validate field combinations and enforce cross-field invariants."""
        if not isinstance(self.generative_bias, bool):
            raise ValueError("generative_bias must be a boolean (true/false)")

        if self.split_past_future:
            if self.min_past is None or self.max_past is None:
                raise ValueError(
                    "min_past and max_past must be provided when split_past_future=True"
                )
            if self.min_past < 0:
                raise ValueError("min_past must be non-negative")
            if self.max_past < self.min_past:
                raise ValueError("max_past must be >= min_past")
            # Splitting past/future always requires the remainder flag.
            self.add_rem = True

    @classmethod
    def from_yaml(
        cls,
        file_path: Union[str, os.PathLike],
        section: Optional[str] = None,
    ) -> "ObservationsConfig":
        """Instantiate an observation configuration from a YAML file.

        When ``section`` is given, that top-level key is loaded (``KeyError``
        if missing). Otherwise a single ``context_observations`` /
        ``target_observations`` section is auto-detected; having both is
        ambiguous and raises ``ValueError``.
        """

        with open(file_path, "r", encoding="utf-8") as handle:
            config_dict = yaml.safe_load(handle) or {}

        if not isinstance(config_dict, dict):
            raise TypeError("Expected YAML content to be a mapping.")

        if section is not None:
            if section not in config_dict:
                raise KeyError(f"Section '{section}' not found in YAML file '{file_path}'.")
            config_dict = config_dict.get(section) or {}
        else:
            potential_sections = [
                key for key in ("context_observations", "target_observations") if key in config_dict
            ]
            if len(potential_sections) > 1:
                raise ValueError(
                    "Multiple observation sections found; specify which one to load using the 'section' argument."
                )
            if potential_sections:
                config_dict = config_dict.get(potential_sections[0]) or {}

        if not isinstance(config_dict, dict):
            raise TypeError("Expected observation configuration to be provided as a mapping.")

        return cls(**config_dict)
189
+
190
+
191
@dataclass
class MixDataConfig:
    """
    Specifies how the mixed databatch is constructed, i.e. whether the
    decoder variable is one full path or only the future steps of a path.
    """

    test_empirical_datasets: List[str] = field(default_factory=lambda: ["cesarali/lenuzza-2016"])
    # Deprecated fields removed (unused in current training flow):
    # pretraining_*, val_protocol, test_protocol, split_strategy, split_seed.
    evaluate_prediction_steps_past: int = 4  # length of past is kept fixed for evaluation
    sample_size_for_generative_evaluation_val: Optional[int] = None
    # Number of generative samples (S) used for validation-time callback
    # evaluation (new individuals and VPC/NPDE consumers). Defaults to 10.
    sample_size_for_generative_evaluation_end_of_training: Optional[int] = None
    # Number of generative samples (S) used for end-of-training callback
    # evaluation (empirical end hooks). Defaults to 500.
    sample_size_for_generative_evaluation: Optional[int] = None
    # Deprecated legacy alias for both values above. When set and the new
    # fields are not provided, the same value is applied to both stages.
    # Value/time normalization flags consumed by PKScaler.
    # Precedence for value scaling:
    # 1) log_and_z=True -> "log_and_z"
    # 2) log_and_max=True -> "log_and_max"
    # 3) log_transform=True -> "log"
    # 4) z_score_normalization=True -> "zscore"
    # 5) normalize_by_max=True -> "max"
    # 6) otherwise -> "none"
    z_score_normalization: bool = False
    # Explicit single switch for log + z-score scaling in PKScaler.
    log_and_z: bool = False
    # Explicit single switch for log + max scaling in PKScaler.
    log_and_max: bool = False
    normalize_by_max: bool = True
    normalize_time: bool = True

    n_of_permutations: int = 1
    n_of_databatches: Optional[int] = None  # deprecated alias for n_of_permutations
    n_of_target_individuals: int = 1  # ignored for LOO/NO_TARGET
    # Log-only transform flag consumed by PKScaler (value_method="log").
    # This is no longer handled in the dataset/datamodule path.
    log_transform: bool = False  # Matches node-pk-1804.yaml

    store_in_tempfile: bool = False  # When True dataset is generated and saved to a temporary file
    keep_tempfile: bool = False  # Don't delete the temporary file on cleanup
    recreate_tempfile: bool = False  # Regenerate file even if it already exists

    # (subdirectory, filename) pair for the generated temp file.
    tempfile_path: Tuple[str, str] = (
        "preprocessed",
        "simulated_ou_as_rates.tr",
    )

    tqdm_progress: bool = False  # Show progress bar when generating temp files
    # DATA SIZES
    train_size: int = 1000
    val_size: int = 100
    test_size: int = 100

    def __post_init__(self) -> None:
        """Resolve deprecated aliases, apply defaults, and validate sizes.

        Order matters here: legacy aliases are copied into the new fields
        before defaults are applied, and deprecation warnings are emitted
        after the values have been resolved.
        """
        # Legacy alias: n_of_databatches overrides n_of_permutations only
        # when the latter was left at its default.
        if self.n_of_databatches is not None and self.n_of_permutations == 1:
            self.n_of_permutations = self.n_of_databatches
            warnings.warn(
                "n_of_databatches is deprecated; use n_of_permutations",
                DeprecationWarning,
            )
        legacy_sample_size = self.sample_size_for_generative_evaluation
        if (
            self.sample_size_for_generative_evaluation_val is None
            and legacy_sample_size is not None
        ):
            self.sample_size_for_generative_evaluation_val = int(legacy_sample_size)
        if (
            self.sample_size_for_generative_evaluation_end_of_training is None
            and legacy_sample_size is not None
        ):
            self.sample_size_for_generative_evaluation_end_of_training = int(legacy_sample_size)

        # Stage-specific defaults (see field comments above).
        if self.sample_size_for_generative_evaluation_val is None:
            self.sample_size_for_generative_evaluation_val = 10
        if self.sample_size_for_generative_evaluation_end_of_training is None:
            self.sample_size_for_generative_evaluation_end_of_training = 500

        if int(self.sample_size_for_generative_evaluation_val) < 1:
            raise ValueError("sample_size_for_generative_evaluation_val must be >= 1")
        if int(self.sample_size_for_generative_evaluation_end_of_training) < 1:
            raise ValueError("sample_size_for_generative_evaluation_end_of_training must be >= 1")

        # Normalize to plain ints (YAML may deliver e.g. strings or floats).
        self.sample_size_for_generative_evaluation_val = int(
            self.sample_size_for_generative_evaluation_val
        )
        self.sample_size_for_generative_evaluation_end_of_training = int(
            self.sample_size_for_generative_evaluation_end_of_training
        )

        if legacy_sample_size is not None:
            warnings.warn(
                "sample_size_for_generative_evaluation is deprecated; use "
                "sample_size_for_generative_evaluation_val and "
                "sample_size_for_generative_evaluation_end_of_training",
                DeprecationWarning,
            )
        if self.n_of_permutations < 1:
            raise ValueError("n_of_permutations must be >= 1")

    @classmethod
    def from_yaml(cls, file_path: Union[str, os.PathLike]) -> "MixDataConfig":
        """Instantiate the mix-data configuration from a YAML file.

        Keys may appear at the top level or nested under ``mix_data`` /
        ``mix_data_config`` (first match wins).
        """

        with open(file_path, "r", encoding="utf-8") as handle:
            config_dict = yaml.safe_load(handle) or {}

        if isinstance(config_dict, dict):
            for key in ("mix_data", "mix_data_config"):
                if key in config_dict and isinstance(config_dict[key], dict):
                    config_dict = config_dict[key]
                    break

        if not isinstance(config_dict, dict):
            raise TypeError("Expected mix data configuration to be provided as a mapping.")

        return cls(**config_dict)
313
+
314
+
315
@dataclass
class MetaDosingConfig:
    """
    Config for specifying meta dosing information.
    """

    num_individuals: int = 10
    same_route: bool = True
    logdose_mean_range: Tuple[float, float] = (-2, 2)
    logdose_std_range: Tuple[float, float] = (0.1, 0.5)
    route_options: List[str] = field(default_factory=lambda: ["oral", "iv"])
    route_weights: List[float] = field(default_factory=lambda: [0.8, 0.2])
    time: float = 0.0

    @classmethod
    def from_yaml(cls, file_path: Union[str, os.PathLike]) -> "MetaDosingConfig":
        """Build a meta-dosing configuration from a YAML file.

        The dosing keys may live at the top level of the file or be nested
        under a ``dosing`` section; a non-mapping section raises TypeError.
        """
        with open(file_path, "r", encoding="utf-8") as handle:
            parsed = yaml.safe_load(handle)
        parsed = parsed or {}

        # Unwrap the optional "dosing" section (an empty section counts as {}).
        if isinstance(parsed, dict) and "dosing" in parsed:
            parsed = parsed.get("dosing") or {}

        if not isinstance(parsed, dict):
            raise TypeError("Expected 'dosing' section in YAML to be a mapping.")

        return cls(**parsed)
343
+
344
@dataclass
class MetaDosingWithDurationConfig(MetaDosingConfig):
    """
    Config for specifying meta dosing information including iv infusions.

    Extends :class:`MetaDosingConfig` with per-route infusion probabilities
    and an infusion-duration sampling range.
    """

    # Per-route probability of drawing an infusion duration:
    # no duration for oral, 50% chance of infusion for iv.
    route_duration_weights: Dict[str, float] = field(default_factory=lambda: {"oral": 0.0, "iv": 0.5})
    duration_range: Tuple[float, float] = (0.5, 2.0)  # Duration of infusion; 0.0 means bolus
352
+
353
@dataclass
class DosingConfig:
    """
    Config for a single dose event: the amount ``dose`` administered via
    ``route`` at time ``time`` (defaults describe one oral dose at t = 0).
    """

    dose: float = 1.0
    route: str = "oral"
    time: float = 0.0
363
+
364
+
365
@dataclass
class DosingWithDurationConfig:
    """
    Config for specifying dosing information. It holds the amount D of a dose
    given at time t = 0, optionally with an infusion duration.
    """

    dose: float = 1.0
    route: str = "oral"
    time: float = 0.0
    duration: float = 0.0  # Duration of infusion; 0.0 means bolus
sim_priors_pk/config_classes/diffusion_pk_config.py ADDED
@@ -0,0 +1,327 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ from copy import deepcopy
3
+ from dataclasses import dataclass, field
4
+ from typing import Any, Dict, List, Optional, Tuple, Union
5
+
6
+ import yaml # type: ignore
7
+
8
+ from sim_priors_pk.config_classes.data_config import (
9
+ MetaDosingConfig,
10
+ MetaStudyConfig,
11
+ MixDataConfig,
12
+ ObservationsConfig,
13
+ SimpleMetaStudyConfig,
14
+ )
15
+ from sim_priors_pk.config_classes.node_pk_config import EncoderDecoderNetworkConfig
16
+ from sim_priors_pk.config_classes.source_process_config import SourceProcessConfig
17
+ from sim_priors_pk.config_classes.training_config import TrainingConfig
18
+ from sim_priors_pk.config_classes.utils import TupleSafeLoader
19
+
20
+
21
@dataclass
class DiffusionPKExperimentConfig:
    """Experiment configuration dedicated to diffusion PK models.

    Aggregates network, source-process, data, observation, study, dosing and
    training sub-configurations, and knows how to assemble itself from a YAML
    file that may reference other YAML files (``data_config``,
    ``training_config``, ``model_config``) or inline the same sections.
    """

    experiment_type: str = "diffusionpk"
    name_str: str = "ContinuousDiffusionPK"
    diffusion_type: str = "continuous"  # "continuous" or "discrete"

    # Credentials / experiment bookkeeping (None means "not configured").
    comet_ai_key: Optional[str] = None
    experiment_name: str = "diffusion_pk_compartments"
    hugging_face_token: Optional[str] = None
    upload_to_hf_hub: bool = True
    hf_model_name: str = "DiffusionPK_test"
    # Path components of the HF model card (joined downstream). The default
    # has two components, so the annotation is variadic.
    hf_model_card_path: Tuple[str, ...] = ("hf_model_card", "DIFFUSION-PK_Readme.md")

    tags: List[str] = field(default_factory=lambda: ["diffusion-pk", "B-0"])
    # NOTE: "indentifier" is a historical misspelling kept because it is part
    # of the public YAML/config interface.
    experiment_indentifier: Optional[str] = None
    my_results_path: Optional[str] = None
    experiment_dir: Optional[str] = None
    verbose: bool = False
    run_index: int = 0
    debug_test: bool = False

    # Diffusion training knob: predict unit Gaussian noise or correlated noise.
    predict_gaussian_noise: bool = True

    network: EncoderDecoderNetworkConfig = field(default_factory=EncoderDecoderNetworkConfig)
    source_process: SourceProcessConfig = field(default_factory=SourceProcessConfig)
    mix_data: MixDataConfig = field(default_factory=MixDataConfig)

    context_observations: ObservationsConfig = field(default_factory=ObservationsConfig)
    target_observations: ObservationsConfig = field(default_factory=ObservationsConfig)

    # from_yaml may substitute a SimpleMetaStudyConfig when simple_mode is set.
    meta_study: Union[MetaStudyConfig, SimpleMetaStudyConfig] = field(default_factory=MetaStudyConfig)
    dosing: MetaDosingConfig = field(default_factory=MetaDosingConfig)

    train: TrainingConfig = field(default_factory=TrainingConfig)

    @staticmethod
    def from_yaml(file_path: str) -> "DiffusionPKExperimentConfig":
        """Initialize the experiment configuration from a YAML file.

        Resolution order for every sub-section: an explicitly referenced /
        nested section (``*_config`` keys, possibly pointing at other YAML
        files relative to ``file_path``) provides the base values, and
        top-level inline keys (``mix_data``, ``meta_study``, ...) override
        them.

        Raises:
            TypeError: if the YAML root or a referenced section is not a mapping.
            ValueError: if ``experiment_type`` is present but not 'diffusionpk'.
        """

        with open(file_path, "r") as file:
            config_dict = yaml.load(file, Loader=TupleSafeLoader) or {}

        if not isinstance(config_dict, dict):
            raise TypeError("Expected experiment YAML to be a mapping.")

        exp_type = config_dict.get("experiment_type")
        if exp_type is not None and str(exp_type).lower() != "diffusionpk":
            raise ValueError(
                "Expected experiment_type 'diffusionpk' for DiffusionPKExperimentConfig, "
                f"got {exp_type!r}."
            )

        # Referenced YAML files are resolved relative to the experiment file.
        base_dir = os.path.dirname(os.path.abspath(file_path))

        data_cfg_dict = (
            DiffusionPKExperimentConfig._load_ref_yaml(config_dict.get("data_config"), base_dir)
            or {}
        )
        training_cfg_dict = (
            DiffusionPKExperimentConfig._load_ref_yaml(config_dict.get("training_config"), base_dir)
            or {}
        )
        model_cfg_dict = (
            DiffusionPKExperimentConfig._load_ref_yaml(config_dict.get("model_config"), base_dir)
            or {}
        )

        # Observation sections: experiment-level takes precedence over the
        # data-config-level section; fall back to inline keys in data config.
        observations_section = DiffusionPKExperimentConfig._resolve_config_section(
            config_dict, base_dir, "observations_config"
        )
        if observations_section is None:
            observations_section = DiffusionPKExperimentConfig._resolve_config_section(
                data_cfg_dict, base_dir, "observations_config"
            )
        if observations_section is not None:
            context_observations_base = observations_section.get("context_observations")
            target_observations_base = observations_section.get("target_observations")
        else:
            context_observations_base = data_cfg_dict.get("context_observations")
            target_observations_base = data_cfg_dict.get("target_observations")

        mix_data_section = DiffusionPKExperimentConfig._resolve_config_section(
            config_dict, base_dir, "mix_data_config"
        )
        if mix_data_section is None:
            mix_data_section = DiffusionPKExperimentConfig._resolve_config_section(
                data_cfg_dict, base_dir, "mix_data_config"
            )
        if mix_data_section is None:
            mix_data_section = data_cfg_dict.get("mix_data")

        meta_study_section = DiffusionPKExperimentConfig._resolve_config_section(
            config_dict, base_dir, "meta_study_config"
        )
        if meta_study_section is None:
            meta_study_section = DiffusionPKExperimentConfig._resolve_config_section(
                data_cfg_dict, base_dir, "meta_study_config"
            )
        meta_dosing_section = DiffusionPKExperimentConfig._resolve_config_section(
            config_dict, base_dir, "meta_dosing_config"
        )
        if meta_dosing_section is None:
            meta_dosing_section = DiffusionPKExperimentConfig._resolve_config_section(
                data_cfg_dict, base_dir, "meta_dosing_config"
            )

        # The meta-study mapping may live in its own section, inside the
        # meta-dosing file, or inline in the data config.
        meta_study_base = DiffusionPKExperimentConfig._extract_config_mapping(
            meta_study_section, "meta_study"
        )
        if meta_study_base is None and meta_dosing_section is not None:
            meta_study_base = DiffusionPKExperimentConfig._extract_config_mapping(
                meta_dosing_section, "meta_study"
            )
        if meta_study_base is None:
            meta_study_base = data_cfg_dict.get("meta_study")

        dosing_base = DiffusionPKExperimentConfig._extract_config_mapping(
            meta_dosing_section, "dosing"
        )
        if dosing_base is None:
            dosing_base = data_cfg_dict.get("dosing")

        # Inline top-level keys override the resolved base sections.
        mix_data_cfg = DiffusionPKExperimentConfig._merge_dicts(
            mix_data_section, config_dict.get("mix_data")
        )
        context_obs_cfg = DiffusionPKExperimentConfig._merge_dicts(
            context_observations_base, config_dict.get("context_observations")
        )
        target_obs_cfg = DiffusionPKExperimentConfig._merge_dicts(
            target_observations_base, config_dict.get("target_observations")
        )
        meta_study_cfg = DiffusionPKExperimentConfig._merge_dicts(
            meta_study_base, config_dict.get("meta_study")
        )
        dosing_cfg = DiffusionPKExperimentConfig._merge_dicts(
            dosing_base, config_dict.get("dosing")
        )

        train_section = training_cfg_dict.get("train", training_cfg_dict)
        train_cfg = DiffusionPKExperimentConfig._merge_dicts(
            train_section, config_dict.get("train")
        )

        network_section = model_cfg_dict.get("network", model_cfg_dict)
        network_cfg = DiffusionPKExperimentConfig._merge_dicts(
            network_section, config_dict.get("network")
        )

        # Source process: experiment-level section, then model-config section,
        # then legacy "source_process"/"noise_model" inline keys.
        source_section = DiffusionPKExperimentConfig._resolve_config_section(
            config_dict, base_dir, "source_config"
        )
        if source_section is None:
            source_section = DiffusionPKExperimentConfig._resolve_config_section(
                model_cfg_dict, base_dir, "source_config"
            )
        if source_section is None:
            source_section = model_cfg_dict.get("source_process") or model_cfg_dict.get("noise_model")
        source_section = DiffusionPKExperimentConfig._extract_config_mapping(
            source_section, "source_process"
        )
        if isinstance(source_section, dict) and "noise_model" in source_section:
            source_section = source_section.get("noise_model")
        source_cfg = DiffusionPKExperimentConfig._merge_dicts(
            source_section, config_dict.get("source_process")
        )

        # "simple_mode" selects the lightweight synthetic simulator config.
        if meta_study_cfg.get("simple_mode", False):
            meta_study_instance = SimpleMetaStudyConfig(**meta_study_cfg)
        else:
            meta_study_instance = MetaStudyConfig(**meta_study_cfg)

        # Drop keys TrainingConfig does not understand (forward compatibility).
        train_cfg = TrainingConfig._filter_kwargs(train_cfg)

        return DiffusionPKExperimentConfig(
            experiment_type=str(config_dict.get("experiment_type", "diffusionpk")).lower(),
            name_str=config_dict.get("name_str", "ContinuousDiffusionPK"),
            diffusion_type=config_dict.get("diffusion_type", "continuous"),
            tags=config_dict.get("tags", ["diffusion-pk", "B-0"]),
            experiment_name=config_dict.get("experiment_name", "diffusion_pk_compartments"),
            experiment_indentifier=config_dict.get("experiment_indentifier", None),
            my_results_path=config_dict.get("my_results_path", None),
            experiment_dir=config_dict.get("experiment_dir", None),
            comet_ai_key=config_dict.get("comet_ai_key", None),
            hugging_face_token=config_dict.get("hugging_face_token", None),
            upload_to_hf_hub=config_dict.get("upload_to_hf_hub", True),
            hf_model_name=config_dict.get("hf_model_name", "DiffusionPK_test"),
            hf_model_card_path=tuple(
                config_dict.get(
                    "hf_model_card_path", ("hf_model_card", "DIFFUSION-PK_Readme.md")
                )
            ),
            debug_test=config_dict.get("debug_test", False),
            predict_gaussian_noise=bool(config_dict.get("predict_gaussian_noise", True)),
            network=EncoderDecoderNetworkConfig(**network_cfg),
            source_process=SourceProcessConfig(**source_cfg),
            mix_data=MixDataConfig(**mix_data_cfg),
            context_observations=ObservationsConfig(**context_obs_cfg),
            target_observations=ObservationsConfig(**target_obs_cfg),
            meta_study=meta_study_instance,
            dosing=MetaDosingConfig(**dosing_cfg),
            train=TrainingConfig(**train_cfg),
        )

    @staticmethod
    def _merge_dicts(
        base_dict: Optional[Dict[str, Any]], override_dict: Optional[Dict[str, Any]]
    ) -> Dict[str, Any]:
        """Merge two optional dictionaries returning a new dictionary.

        ``override_dict`` wins on key conflicts; ``base_dict`` is deep-copied
        so callers' inputs are never mutated.
        """

        merged: Dict[str, Any] = {}

        if base_dict:
            if not isinstance(base_dict, dict):
                raise TypeError(
                    "Expected base_dict to be a mapping when merging configuration sections."
                )
            merged = deepcopy(base_dict)

        if override_dict:
            if not isinstance(override_dict, dict):
                raise TypeError(
                    "Expected override_dict to be a mapping when merging configuration sections."
                )
            merged.update(override_dict)

        return merged

    @staticmethod
    def _extract_config_mapping(
        section: Optional[Dict[str, Any]], nested_key: str
    ) -> Optional[Dict[str, Any]]:
        """Return ``section[nested_key]`` if present, otherwise the section itself.

        An explicit ``nested_key: null`` is treated as "no configuration".
        """

        if section is None:
            return None

        if not isinstance(section, dict):
            raise TypeError(
                "Expected configuration section to be a mapping when extracting nested"
                f" '{nested_key}' values."
            )

        if nested_key in section:
            nested_value = section[nested_key]
            if nested_value is None:
                return None
            if not isinstance(nested_value, dict):
                raise TypeError(
                    f"Expected '{nested_key}' section to be a mapping when extracting configuration values."
                )
            return nested_value

        return section

    @staticmethod
    def _load_ref_yaml(
        ref: Optional[Union[str, Dict[str, Any]]], base_dir: str
    ) -> Optional[Dict[str, Any]]:
        """Load a referenced YAML block or return inline dictionaries as-is.

        String references are interpreted as paths, relative to ``base_dir``
        unless absolute.
        """

        if ref is None:
            return None

        if isinstance(ref, dict):
            return ref

        if isinstance(ref, str):
            ref_path = ref
            if not os.path.isabs(ref_path):
                ref_path = os.path.join(base_dir, ref_path)

            with open(ref_path, "r") as handle:
                return yaml.load(handle, Loader=TupleSafeLoader) or {}

        raise TypeError("Expected configuration reference to be a mapping or string path.")

    @staticmethod
    def _resolve_config_section(
        cfg_dict: Dict[str, Any], base_dir: str, key: str
    ) -> Optional[Dict[str, Any]]:
        """Resolve ``cfg_dict[key]`` which may be inline, a path, or a ``_ref``.

        Returns None when the key is absent or explicitly null.
        """

        if key not in cfg_dict:
            return None

        section = cfg_dict[key]

        if section is None:
            return None

        if isinstance(section, dict):
            ref_value = section.get("_ref") if "_ref" in section else None
            if ref_value is not None:
                loaded = DiffusionPKExperimentConfig._load_ref_yaml(ref_value, base_dir)
                return loaded or {}
            return section

        if isinstance(section, str):
            loaded = DiffusionPKExperimentConfig._load_ref_yaml(section, base_dir)
            return loaded or {}

        raise TypeError(
            f"Expected configuration section '{key}' to be a mapping or string reference."
        )
sim_priors_pk/config_classes/flow_pk_config.py ADDED
@@ -0,0 +1,534 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import math
2
+ import os
3
+ from copy import deepcopy
4
+ from dataclasses import asdict, dataclass, field, fields
5
+ from typing import Any, Dict, List, Optional, Tuple, Union
6
+
7
+ try: # pragma: no cover - exercised indirectly via configuration loading
8
+ import yaml # type: ignore
9
+ except ModuleNotFoundError: # pragma: no cover - fallback for minimal environments
10
+ from sim_priors_pk.config_classes import yaml_fallback as yaml
11
+
12
+ try: # pragma: no cover - optional dependency for HF integration
13
+ from transformers import PretrainedConfig # type: ignore
14
+ except ModuleNotFoundError: # pragma: no cover - allow configuration utilities without transformers
15
+
16
+ class PretrainedConfig: # type: ignore
17
+ def __init__(self, **kwargs):
18
+ super().__init__()
19
+
20
+ from sim_priors_pk.config_classes.data_config import (
21
+ MetaDosingConfig,
22
+ MetaStudyConfig,
23
+ MixDataConfig,
24
+ ObservationsConfig,
25
+ SimpleMetaStudyConfig,
26
+ )
27
+ from sim_priors_pk.config_classes.source_process_config import SourceProcessConfig
28
+ from sim_priors_pk.config_classes.training_config import TrainingConfig
29
+ from sim_priors_pk.config_classes.utils import TupleSafeLoader
30
+
31
+
32
+ def _to_float(x: Any) -> float:
33
+ try:
34
+ v = float(x)
35
+ except Exception:
36
+ return math.inf
37
+ # guard against NaN
38
+ if math.isnan(v):
39
+ return math.inf
40
+ return v
41
+
42
+
43
+ def _raise_flowpk_network_migration() -> None:
44
+ raise ValueError(
45
+ "FlowPK configs no longer accept a 'network' section. "
46
+ "Please rename 'network' to 'vector_field' and set 'experiment_type: flowpk' in your YAML."
47
+ )
48
+
49
+
50
@dataclass
class VectorFieldPKConfig:
    """Configuration for the transformer vector field used by FlowPK."""

    # --- transformer vector-field hyper-parameters ---
    hidden_dim: int = 64
    fourier_modes: int = 16
    use_spectral_qkv: bool = False
    time_fourier_max_freq: int = 64
    encoder_num_heads: int = 4
    decoder_num_heads: int = 4
    encoder_attention_layers: int = 2
    decoder_attention_layers: int = 2
    dropout: float = 0.0

    # --- latent / conditioning settings required by the vector field ---
    cov_proj_dim: int = 16  # p in the paper
    combine_latent_mode: str = "mlp"  # Options: "mlp", "sum"
    zi_latent_dim: int = 200

    @classmethod
    def from_yaml(cls, file_path: Union[str, os.PathLike]) -> "VectorFieldPKConfig":
        """Instantiate the vector field configuration from a YAML file."""
        with open(file_path, "r", encoding="utf-8") as handle:
            raw = yaml.safe_load(handle) or {}

        if isinstance(raw, dict):
            # Legacy layouts used a 'network' section; reject with a migration hint.
            if "network" in raw:
                _raise_flowpk_network_migration()
            # Accept either a bare mapping or one nested under 'vector_field'.
            if "vector_field" in raw:
                raw = raw.get("vector_field") or {}

        if not isinstance(raw, dict):
            raise TypeError("Expected 'vector_field' section in YAML to be a mapping.")

        return cls(**raw)
87
+
88
+
89
@dataclass
class FlowPKExperimentConfig:
    """Experiment configuration for FlowPK (vector field only).

    Aggregates experiment bookkeeping (naming, tracking keys, HF-hub upload
    options), the vector-field network settings, the data/observation/dosing
    sections, and the training configuration into one serializable dataclass.
    """

    # --- experiment identity / bookkeeping ---
    experiment_type: str = "flowpk"
    name_str: str = "FlowPK"
    comet_ai_key: Optional[str] = None  # Comet ML API key; None disables remote tracking
    experiment_name: str = "flow_pk_compartments"
    hugging_face_token: Optional[str] = None  # token used when uploading to the HF hub
    upload_to_hf_hub: bool = True
    hf_model_name: str = "FlowPK_test"
    # NOTE(review): annotated as a 3-tuple but the default carries two entries —
    # confirm the intended arity with callers before tightening the annotation.
    hf_model_card_path: Tuple[str, str, str] = ("hf_model_card", "CVAE_Readme.md")

    tags: List[str] = field(default_factory=lambda: ["flow-pk", "B-0"])
    experiment_indentifier: Optional[str] = None  # (sic) misspelling kept for backward compatibility
    my_results_path: Optional[str] = None
    experiment_dir: Optional[str] = None
    verbose: bool = False
    run_index: int = 0
    debug_test: bool = False
    # Default Euler integration steps used by FlowPK sampling when callers
    # do not provide ``num_steps`` explicitly (for example VPC callbacks).
    flow_num_steps: int = 50

    # --- nested configuration sections ---
    vector_field: VectorFieldPKConfig = field(default_factory=VectorFieldPKConfig)
    source_process: SourceProcessConfig = field(default_factory=SourceProcessConfig)
    mix_data: MixDataConfig = field(default_factory=MixDataConfig)

    context_observations: ObservationsConfig = field(default_factory=ObservationsConfig)
    target_observations: ObservationsConfig = field(default_factory=ObservationsConfig)

    meta_study: MetaStudyConfig = field(default_factory=MetaStudyConfig)
    dosing: MetaDosingConfig = field(default_factory=MetaDosingConfig)

    train: TrainingConfig = field(default_factory=TrainingConfig)

    @staticmethod
    def from_yaml(file_path: str) -> "FlowPKExperimentConfig":
        """Initializes the class from a YAML file.

        Supports both monolithic experiment YAML files as well as files that
        reference dedicated data, training, and model configuration YAMLs.
        Section lookup order is: experiment YAML first, then the referenced
        data/model/training config; inline experiment-level overrides win
        over referenced-file values via ``_merge_dicts``.
        """

        with open(file_path, "r") as file:
            config_dict = yaml.load(file, Loader=TupleSafeLoader) or {}

        if not isinstance(config_dict, dict):
            raise TypeError("Expected experiment YAML to be a mapping.")

        # Guard against loading a YAML meant for a different experiment family.
        exp_type = config_dict.get("experiment_type")
        if exp_type is not None and str(exp_type).lower() != "flowpk":
            raise ValueError(
                f"Expected experiment_type 'flowpk' for FlowPKExperimentConfig, got {exp_type!r}."
            )

        # Legacy 'network' sections are rejected with a migration hint.
        if "network" in config_dict:
            _raise_flowpk_network_migration()

        # Referenced YAML paths are resolved relative to the experiment file.
        base_dir = os.path.dirname(os.path.abspath(file_path))

        data_cfg_dict = (
            FlowPKExperimentConfig._load_ref_yaml(config_dict.get("data_config"), base_dir) or {}
        )
        training_cfg_dict = (
            FlowPKExperimentConfig._load_ref_yaml(config_dict.get("training_config"), base_dir)
            or {}
        )
        model_cfg_dict = (
            FlowPKExperimentConfig._load_ref_yaml(config_dict.get("model_config"), base_dir) or {}
        )

        if isinstance(model_cfg_dict, dict) and "network" in model_cfg_dict:
            _raise_flowpk_network_migration()

        # Observations: experiment-level section takes precedence, then the
        # data config section, then bare keys on the data config itself.
        observations_section = FlowPKExperimentConfig._resolve_config_section(
            config_dict, base_dir, "observations_config"
        )
        if observations_section is None:
            observations_section = FlowPKExperimentConfig._resolve_config_section(
                data_cfg_dict, base_dir, "observations_config"
            )
        if observations_section is not None:
            context_observations_base = observations_section.get("context_observations")
            target_observations_base = observations_section.get("target_observations")
        else:
            context_observations_base = data_cfg_dict.get("context_observations")
            target_observations_base = data_cfg_dict.get("target_observations")

        # Mix-data: same precedence chain, with a final fallback to an
        # inline 'mix_data' mapping in the data config.
        mix_data_section = FlowPKExperimentConfig._resolve_config_section(
            config_dict, base_dir, "mix_data_config"
        )
        if mix_data_section is None:
            mix_data_section = FlowPKExperimentConfig._resolve_config_section(
                data_cfg_dict, base_dir, "mix_data_config"
            )
        if mix_data_section is None:
            mix_data_section = data_cfg_dict.get("mix_data")

        meta_study_section = FlowPKExperimentConfig._resolve_config_section(
            config_dict, base_dir, "meta_study_config"
        )
        if meta_study_section is None:
            meta_study_section = FlowPKExperimentConfig._resolve_config_section(
                data_cfg_dict, base_dir, "meta_study_config"
            )
        meta_dosing_section = FlowPKExperimentConfig._resolve_config_section(
            config_dict, base_dir, "meta_dosing_config"
        )
        if meta_dosing_section is None:
            meta_dosing_section = FlowPKExperimentConfig._resolve_config_section(
                data_cfg_dict, base_dir, "meta_dosing_config"
            )

        # Meta-study may live in its own section, inside the dosing section,
        # or as a bare 'meta_study' key on the data config.
        meta_study_base = FlowPKExperimentConfig._extract_config_mapping(
            meta_study_section, "meta_study"
        )
        if meta_study_base is None and meta_dosing_section is not None:
            meta_study_base = FlowPKExperimentConfig._extract_config_mapping(
                meta_dosing_section, "meta_study"
            )
        if meta_study_base is None:
            meta_study_base = data_cfg_dict.get("meta_study")

        dosing_base = FlowPKExperimentConfig._extract_config_mapping(meta_dosing_section, "dosing")
        if dosing_base is None:
            dosing_base = data_cfg_dict.get("dosing")

        mix_data_inline = config_dict.get("mix_data")
        if mix_data_inline is not None and not isinstance(mix_data_inline, dict):
            raise TypeError("Expected 'mix_data' section in experiment YAML to be a mapping.")

        # Backward compatibility: allow mix-data keys at experiment top-level.
        # Nested `mix_data:` values take precedence over these legacy top-level keys.
        legacy_mix_data_inline = {
            field_meta.name: config_dict[field_meta.name]
            for field_meta in fields(MixDataConfig)
            if field_meta.name in config_dict
        }
        merged_mix_data_inline = dict(legacy_mix_data_inline)
        if isinstance(mix_data_inline, dict):
            merged_mix_data_inline.update(mix_data_inline)

        # Merge referenced-file sections with inline experiment overrides
        # (inline values win on key collisions).
        mix_data_cfg = FlowPKExperimentConfig._merge_dicts(
            mix_data_section, merged_mix_data_inline
        )
        context_obs_cfg = FlowPKExperimentConfig._merge_dicts(
            context_observations_base, config_dict.get("context_observations")
        )
        target_obs_cfg = FlowPKExperimentConfig._merge_dicts(
            target_observations_base, config_dict.get("target_observations")
        )
        meta_study_cfg = FlowPKExperimentConfig._merge_dicts(
            meta_study_base, config_dict.get("meta_study")
        )
        dosing_cfg = FlowPKExperimentConfig._merge_dicts(dosing_base, config_dict.get("dosing"))

        # Training YAML may either nest under 'train' or be the mapping itself.
        train_section = training_cfg_dict.get("train", training_cfg_dict)
        train_cfg = FlowPKExperimentConfig._merge_dicts(train_section, config_dict.get("train"))

        # Model YAML may either nest under 'vector_field' or be the mapping itself.
        vector_field_section = model_cfg_dict.get("vector_field", model_cfg_dict)
        if isinstance(vector_field_section, dict) and "network" in vector_field_section:
            _raise_flowpk_network_migration()
        vector_field_cfg = FlowPKExperimentConfig._merge_dicts(
            vector_field_section, config_dict.get("vector_field")
        )

        # Source process: 'source_config' section, then legacy
        # 'source_process' / 'noise_model' keys on the model config.
        source_section = FlowPKExperimentConfig._resolve_config_section(
            config_dict, base_dir, "source_config"
        )
        if source_section is None:
            source_section = FlowPKExperimentConfig._resolve_config_section(
                model_cfg_dict, base_dir, "source_config"
            )
        if source_section is None:
            source_section = model_cfg_dict.get("source_process") or model_cfg_dict.get("noise_model")
        source_section = FlowPKExperimentConfig._extract_config_mapping(
            source_section, "source_process"
        )
        if isinstance(source_section, dict) and "noise_model" in source_section:
            source_section = source_section.get("noise_model")

        source_cfg = FlowPKExperimentConfig._merge_dicts(
            source_section, config_dict.get("source_process")
        )

        # -----------------------------------------------------------------
        # Choose MetaStudy class dynamically (simple vs full)
        # -----------------------------------------------------------------
        if meta_study_cfg.get("simple_mode", False):
            meta_study_instance = SimpleMetaStudyConfig(**meta_study_cfg)
        else:
            meta_study_instance = MetaStudyConfig(**meta_study_cfg)

        # Drop any keys TrainingConfig does not accept before construction.
        train_cfg = TrainingConfig._filter_kwargs(train_cfg)

        return FlowPKExperimentConfig(
            experiment_type=str(config_dict.get("experiment_type", "flowpk")).lower(),
            name_str=config_dict.get("name_str", "FlowPK"),
            tags=config_dict.get("tags", ["flow-pk", "B-0"]),
            experiment_name=config_dict.get("experiment_name", "flow_pk_compartments"),
            experiment_indentifier=config_dict.get("experiment_indentifier", None),
            my_results_path=config_dict.get("my_results_path", None),
            experiment_dir=config_dict.get("experiment_dir", None),
            comet_ai_key=config_dict.get("comet_ai_key", None),
            hugging_face_token=config_dict.get("hugging_face_token", None),
            upload_to_hf_hub=config_dict.get("upload_to_hf_hub", True),
            hf_model_name=config_dict.get("hf_model_name", "FlowPK_test"),
            hf_model_card_path=tuple(
                config_dict.get("hf_model_card_path", ("hf_model_card", "CVAE_Readme.md"))
            ),
            debug_test=config_dict.get("debug_test", False),
            flow_num_steps=int(config_dict.get("flow_num_steps", 50)),
            vector_field=VectorFieldPKConfig(**vector_field_cfg),
            source_process=SourceProcessConfig(**source_cfg),
            mix_data=MixDataConfig(**mix_data_cfg),
            context_observations=ObservationsConfig(**context_obs_cfg),
            target_observations=ObservationsConfig(**target_obs_cfg),
            meta_study=meta_study_instance,
            dosing=MetaDosingConfig(**dosing_cfg),
            train=TrainingConfig(**train_cfg),
        )

    @staticmethod
    def _merge_dicts(
        base_dict: Optional[Dict[str, Any]], override_dict: Optional[Dict[str, Any]]
    ) -> Dict[str, Any]:
        """Merge two optional dictionaries returning a new dictionary.

        ``override_dict`` keys win. The base is deep-copied so callers'
        mappings are never mutated. Raises ``TypeError`` if either argument
        is truthy but not a mapping.
        """

        merged: Dict[str, Any] = {}

        if base_dict:
            if not isinstance(base_dict, dict):
                raise TypeError(
                    "Expected base_dict to be a mapping when merging configuration sections."
                )
            merged = deepcopy(base_dict)

        if override_dict:
            if not isinstance(override_dict, dict):
                raise TypeError(
                    "Expected override_dict to be a mapping when merging configuration sections."
                )
            merged.update(override_dict)

        return merged

    @staticmethod
    def _extract_config_mapping(
        section: Optional[Dict[str, Any]], nested_key: str
    ) -> Optional[Dict[str, Any]]:
        """Return a nested configuration mapping or the section itself.

        If ``section`` contains ``nested_key``, that sub-mapping is returned
        (or ``None`` when it is explicitly null); otherwise the whole section
        is assumed to already be the requested mapping.
        """

        if section is None:
            return None

        if not isinstance(section, dict):
            raise TypeError(
                "Expected configuration section to be a mapping when extracting nested"
                f" '{nested_key}' values."
            )

        if nested_key in section:
            nested_value = section[nested_key]
            if nested_value is None:
                return None
            if not isinstance(nested_value, dict):
                raise TypeError(
                    f"Expected '{nested_key}' section to be a mapping when extracting configuration values."
                )
            return nested_value

        return section

    @staticmethod
    def _load_ref_yaml(
        ref: Optional[Union[str, Dict[str, Any]]], base_dir: str
    ) -> Optional[Dict[str, Any]]:
        """Load a referenced YAML block or return inline dictionaries as-is.

        String references are treated as file paths, resolved relative to
        ``base_dir`` unless absolute.
        """

        if ref is None:
            return None

        if isinstance(ref, dict):
            return ref

        if isinstance(ref, str):
            ref_path = ref
            if not os.path.isabs(ref_path):
                ref_path = os.path.join(base_dir, ref_path)

            with open(ref_path, "r") as handle:
                return yaml.load(handle, Loader=TupleSafeLoader) or {}

        raise TypeError("Expected configuration reference to be a mapping or string path.")

    @staticmethod
    def _resolve_config_section(
        cfg_dict: Dict[str, Any], base_dir: str, key: str
    ) -> Optional[Dict[str, Any]]:
        """Resolve nested configuration references within a configuration block.

        A section may be an inline mapping, a mapping carrying a ``_ref``
        path to another YAML file, or a bare string path; string/``_ref``
        forms are loaded from disk.
        """

        if key not in cfg_dict:
            return None

        section = cfg_dict[key]

        if section is None:
            return None

        if isinstance(section, dict):
            ref_value = section.get("_ref") if "_ref" in section else None
            if ref_value is not None:
                loaded = FlowPKExperimentConfig._load_ref_yaml(ref_value, base_dir)
                return loaded or {}
            return section

        if isinstance(section, str):
            loaded = FlowPKExperimentConfig._load_ref_yaml(section, base_dir)
            return loaded or {}

        raise TypeError(
            f"Expected configuration section '{key}' to be a mapping or string reference."
        )

    def to_yaml(self, file_path: str) -> None:
        """Saves the class to a YAML file."""
        with open(file_path, "w") as file:
            yaml.dump(asdict(self), file, default_flow_style=False)
418
+
419
+
420
class HFFlowPKConfig(PretrainedConfig):
    """
    HF config wrapping FlowPKExperimentConfig plus tracked metrics.

    Canonical storage:
      self.tracking: dict with shape
        {
          "best": { "<metric_name>": {"value": float, "step": int|None, "epoch": int|None} },
          "meta": { ...optional... }
        }

    Backward compat:
      - Accepts legacy keys like best_val_loss / best_val_rmse.
      - Mirrors best["val_rmse"] to `best_val_loss` if you still use that elsewhere.
    """

    model_type = "flow_pk"

    def __init__(self, **kwargs):
        # --- extract tracking / legacy keys before super().__init__ ---
        tracking = kwargs.pop("tracking", None)

        # legacy keys (accept either; normalize into tracking)
        legacy_best_val_loss = kwargs.pop("best_val_loss", None)
        legacy_best_val_rmse = kwargs.pop("best_val_rmse", None)

        super().__init__(**kwargs)

        # copy remaining config fields
        # (every key left in kwargs is mirrored onto the instance so nested
        # FlowPK sections survive HF serialization round-trips)
        for k, v in kwargs.items():
            setattr(self, k, v)

        # initialize tracking; anything that is not a dict is discarded and
        # replaced with the canonical empty schema
        if tracking is None or not isinstance(tracking, dict):
            tracking = {"best": {}, "meta": {}}
        tracking.setdefault("best", {})
        tracking.setdefault("meta", {})
        self.tracking: Dict[str, Any] = tracking

        # fold legacy into canonical schema if present
        # (best_val_loss takes precedence over best_val_rmse when both exist)
        legacy = legacy_best_val_loss if legacy_best_val_loss is not None else legacy_best_val_rmse
        if legacy is not None:
            # choose a canonical metric name; I'd recommend "val_rmse" if that’s what it is.
            self.set_best("val_rmse", legacy)

        # optional alias for older codepaths
        self._sync_legacy_aliases()

    # --------- public API ----------
    def set_best(
        self,
        metric_name: str,
        value: Any,
        *,
        step: Optional[int] = None,
        epoch: Optional[int] = None,
    ) -> None:
        """Record ``value`` as the best seen for ``metric_name`` with provenance."""
        v = _to_float(value)
        self.tracking["best"][metric_name] = {"value": v, "step": step, "epoch": epoch}
        self._sync_legacy_aliases()

    def get_best(self, metric_name: str, default: float = math.inf) -> float:
        """Return the stored best value for ``metric_name``, or ``default``."""
        d = self.tracking.get("best", {}).get(metric_name)
        if not d:
            return float(default)
        return _to_float(d.get("value", default))

    def is_better(
        self,
        metric_name: str,
        candidate_value: Any,
        *,
        higher_is_better: bool = False,
    ) -> bool:
        """Return ``True`` when ``candidate_value`` strictly beats the stored best."""
        cand = _to_float(candidate_value)
        best = self.get_best(metric_name, default=(-math.inf if higher_is_better else math.inf))
        return cand > best if higher_is_better else cand < best

    def update_if_better(
        self,
        metric_name: str,
        candidate_value: Any,
        *,
        step: Optional[int] = None,
        epoch: Optional[int] = None,
        higher_is_better: bool = False,
    ) -> bool:
        """Store ``candidate_value`` if it beats the current best; report whether it did."""
        if self.is_better(metric_name, candidate_value, higher_is_better=higher_is_better):
            self.set_best(metric_name, candidate_value, step=step, epoch=epoch)
            return True
        return False

    # --------- construction ----------
    @classmethod
    def from_flowpk(cls, flowpk_cfg, **tracked_best: float) -> "HFFlowPKConfig":
        """
        tracked_best: e.g. val_rmse=..., val_nll=..., val_crps=...
        """
        # flowpk_cfg is expected to be a dataclass (FlowPKExperimentConfig)
        cfg_dict = asdict(flowpk_cfg)
        cfg = cls(**cfg_dict)
        for k, v in tracked_best.items():
            cfg.set_best(k, v)
        return cfg

    # --------- internal ----------
    def _sync_legacy_aliases(self) -> None:
        """
        Keep a legacy scalar field for older code that expects `best_val_loss`.
        Here we mirror it to `best["val_rmse"]` by convention.
        """
        # if val_rmse exists, mirror it; otherwise inf
        self.best_val_loss = self.get_best("val_rmse", default=math.inf)
532
+
533
+
534
# Backwards-compatible alias: older code imports ``FlowPKConfig``.
FlowPKConfig = FlowPKExperimentConfig
sim_priors_pk/config_classes/node_pk_config.py ADDED
@@ -0,0 +1,518 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import math
2
+ import os
3
+ from copy import deepcopy
4
+ from dataclasses import asdict, dataclass, field
5
+ from typing import Any, Dict, List, Optional, Tuple, Union
6
+
7
+ import yaml # type: ignore
8
+ from transformers import PretrainedConfig
9
+
10
+ from sim_priors_pk.config_classes.data_config import (
11
+ MetaDosingConfig,
12
+ MetaStudyConfig,
13
+ MixDataConfig,
14
+ ObservationsConfig,
15
+ SimpleMetaStudyConfig,
16
+ )
17
+ from sim_priors_pk.config_classes.training_config import TrainingConfig
18
+ from sim_priors_pk.config_classes.utils import TupleSafeLoader
19
+
20
+
21
+ def _to_float(x: Any) -> float:
22
+ try:
23
+ v = float(x)
24
+ except Exception:
25
+ return math.inf
26
+ # guard against NaN
27
+ if math.isnan(v):
28
+ return math.inf
29
+ return v
30
+
31
+
32
@dataclass
class EncoderDecoderNetworkConfig:
    """Configuration for the encoder-decoder network used by NodePK models."""

    # --- encoder ---
    individual_encoder_name: str = "RNNContextEncoder"
    time_obs_encoder_hidden_dim: int = 200
    time_obs_encoder_output_dim: int = 200
    rnn_individual_encoder_number_of_layers: int = 2
    individual_encoder_number_of_heads: int = 4
    encoder_rnn_hidden_dim: int = 128
    input_encoding_hidden_dim: int = 128
    zi_latent_dim: int = 200
    use_attention: bool = True
    use_self_attention: bool = False
    use_time_deltas: bool = True

    # --- decoder ---
    decoder_name: str = "RNNDecoder"
    decoder_num_layers: int = 2
    decoder_attention_layers: int = 2
    decoder_hidden_dim: int = 128
    decoder_rnn_hidden_dim: int = 200
    rnn_decoder_number_of_layers: int = 4
    node_step: bool = True
    exclusive_node_step: bool = False
    cov_proj_dim: int = 16  # p in the paper
    ignore_logvar: bool = True  # sampling

    # --- aggregator ---
    aggregator_type: str = "attention"  # attention, mean
    aggregator_num_heads: int = 8

    # --- loss selection: reconstruction vs prediction ---
    prediction_only: bool = False
    reconstruction_only: bool = False

    # --- latent behaviour flags ---
    study_latent_deterministic: bool = False  # disable sampling of the study latent
    prediction_latent_deterministic: bool = False  # deterministic individual latent for prediction
    combine_latent_mode: str = "mlp"  # Options: "mlp", "sum"

    # --- MLP settings (init_hidden, output heads, drift) ---
    init_hidden_num_layers: int = 2
    output_head_num_layers: int = 2
    drift_num_layers: int = 3
    dropout: float = 0.1
    activation: str = "ReLU"  # For init/logvar/mean
    drift_activation: str = "Tanh"
    norm: str = "layer"  # Options: "layer", "batch", None

    # --- loss ---
    loss_name: str = "nll"  # Options: "nll", "log_nll", "rmse", mv_nll

    # --- latent node pk ---
    kl_weight: float = 1.0

    # --- KL regularisation flags ---
    use_kl_s: bool = True
    use_kl_i: bool = True
    use_kl_i_np: bool = True
    use_kl_init: bool = True
    use_invariance_loss: bool = True

    # Optional scaling for dosing amount inputs (route types remain unscaled)
    scale_dosing_amounts: bool = True

    @classmethod
    def from_yaml(cls, file_path: Union[str, os.PathLike]) -> "EncoderDecoderNetworkConfig":
        """Instantiate the network configuration from a YAML file."""
        with open(file_path, "r", encoding="utf-8") as handle:
            raw = yaml.safe_load(handle) or {}

        # Accept either a bare mapping or one nested under 'network'.
        if isinstance(raw, dict) and "network" in raw:
            raw = raw.get("network") or {}

        if not isinstance(raw, dict):
            raise TypeError("Expected 'network' section in YAML to be a mapping.")

        return cls(**raw)
119
+
120
+
121
+ @dataclass
122
+ class NodePKExperimentConfig:
123
+ """Experiment configuration for NodePK-family models."""
124
+
125
+ experiment_type: str = "nodepk"
126
+ name_str: str = "NodePK"
127
+ comet_ai_key: str = None
128
+ experiment_name: str = "node_pk_compartments"
129
+ hugging_face_token: str = None
130
+ upload_to_hf_hub: bool = True
131
+ hf_model_name: str = "NodePK_test"
132
+ hf_model_card_path: Tuple[str, str, str] = ("hf_model_card", "CVAE_Readme.md")
133
+
134
+ tags: List[str] = field(default_factory=lambda: ["node-pk", "B-0"])
135
+ experiment_indentifier: str = None
136
+ my_results_path: str = None
137
+ experiment_dir: str = None
138
+ verbose: bool = False
139
+ run_index: int = 0
140
+ debug_test: bool = False
141
+
142
+ network: EncoderDecoderNetworkConfig = field(default_factory=EncoderDecoderNetworkConfig)
143
+ mix_data: MixDataConfig = field(default_factory=MixDataConfig)
144
+
145
+ context_observations: ObservationsConfig = field(default_factory=ObservationsConfig)
146
+ target_observations: ObservationsConfig = field(default_factory=ObservationsConfig)
147
+
148
+ meta_study: MetaStudyConfig = field(default_factory=MetaStudyConfig)
149
+ dosing: MetaDosingConfig = field(default_factory=MetaDosingConfig)
150
+
151
+ train: TrainingConfig = field(default_factory=TrainingConfig)
152
+
153
+ @staticmethod
154
+ def from_yaml(file_path: str) -> "NodePKExperimentConfig":
155
+ """Initializes the class from a YAML file.
156
+
157
+ Supports both monolithic experiment YAML files as well as files that
158
+ reference dedicated data, training, and model configuration YAMLs.
159
+ """
160
+
161
+ with open(file_path, "r") as file:
162
+ config_dict = yaml.load(file, Loader=TupleSafeLoader) or {}
163
+
164
+ exp_type = None
165
+ if isinstance(config_dict, dict):
166
+ exp_type = config_dict.get("experiment_type")
167
+ if exp_type is not None and str(exp_type).lower() != "nodepk":
168
+ raise ValueError(
169
+ f"Expected experiment_type 'nodepk' for NodePKExperimentConfig, got {exp_type!r}."
170
+ )
171
+
172
+ base_dir = os.path.dirname(os.path.abspath(file_path))
173
+
174
+ data_cfg_dict = (
175
+ NodePKExperimentConfig._load_ref_yaml(config_dict.get("data_config"), base_dir) or {}
176
+ )
177
+ training_cfg_dict = (
178
+ NodePKExperimentConfig._load_ref_yaml(config_dict.get("training_config"), base_dir)
179
+ or {}
180
+ )
181
+ model_cfg_dict = (
182
+ NodePKExperimentConfig._load_ref_yaml(config_dict.get("model_config"), base_dir) or {}
183
+ )
184
+
185
+
186
+ observations_section = NodePKExperimentConfig._resolve_config_section(
187
+ config_dict, base_dir, "observations_config"
188
+ )
189
+ if observations_section is None:
190
+ observations_section = NodePKExperimentConfig._resolve_config_section(
191
+ data_cfg_dict, base_dir, "observations_config"
192
+ )
193
+ if observations_section is not None:
194
+ context_observations_base = observations_section.get("context_observations")
195
+ target_observations_base = observations_section.get("target_observations")
196
+ else:
197
+ context_observations_base = data_cfg_dict.get("context_observations")
198
+ target_observations_base = data_cfg_dict.get("target_observations")
199
+
200
+ mix_data_section = NodePKExperimentConfig._resolve_config_section(
201
+ config_dict, base_dir, "mix_data_config"
202
+ )
203
+ if mix_data_section is None:
204
+ mix_data_section = NodePKExperimentConfig._resolve_config_section(
205
+ data_cfg_dict, base_dir, "mix_data_config"
206
+ )
207
+ if mix_data_section is None:
208
+ mix_data_section = data_cfg_dict.get("mix_data")
209
+
210
+ meta_study_section = NodePKExperimentConfig._resolve_config_section(
211
+ config_dict, base_dir, "meta_study_config"
212
+ )
213
+ if meta_study_section is None:
214
+ meta_study_section = NodePKExperimentConfig._resolve_config_section(
215
+ data_cfg_dict, base_dir, "meta_study_config"
216
+ )
217
+ meta_dosing_section = NodePKExperimentConfig._resolve_config_section(
218
+ config_dict, base_dir, "meta_dosing_config"
219
+ )
220
+ if meta_dosing_section is None:
221
+ meta_dosing_section = NodePKExperimentConfig._resolve_config_section(
222
+ data_cfg_dict, base_dir, "meta_dosing_config"
223
+ )
224
+
225
+ meta_study_base = NodePKExperimentConfig._extract_config_mapping(
226
+ meta_study_section, "meta_study"
227
+ )
228
+ if meta_study_base is None and meta_dosing_section is not None:
229
+ meta_study_base = NodePKExperimentConfig._extract_config_mapping(
230
+ meta_dosing_section, "meta_study"
231
+ )
232
+ if meta_study_base is None:
233
+ meta_study_base = data_cfg_dict.get("meta_study")
234
+
235
+ dosing_base = NodePKExperimentConfig._extract_config_mapping(meta_dosing_section, "dosing")
236
+ if dosing_base is None:
237
+ dosing_base = data_cfg_dict.get("dosing")
238
+
239
+ mix_data_cfg = NodePKExperimentConfig._merge_dicts(
240
+ mix_data_section, config_dict.get("mix_data")
241
+ )
242
+ context_obs_cfg = NodePKExperimentConfig._merge_dicts(
243
+ context_observations_base, config_dict.get("context_observations")
244
+ )
245
+ target_obs_cfg = NodePKExperimentConfig._merge_dicts(
246
+ target_observations_base, config_dict.get("target_observations")
247
+ )
248
+ meta_study_cfg = NodePKExperimentConfig._merge_dicts(
249
+ meta_study_base, config_dict.get("meta_study")
250
+ )
251
+ dosing_cfg = NodePKExperimentConfig._merge_dicts(dosing_base, config_dict.get("dosing"))
252
+
253
+ train_section = training_cfg_dict.get("train", training_cfg_dict)
254
+ train_cfg = NodePKExperimentConfig._merge_dicts(train_section, config_dict.get("train"))
255
+
256
+ network_section = model_cfg_dict.get("network", model_cfg_dict)
257
+ network_cfg = NodePKExperimentConfig._merge_dicts(
258
+ network_section, config_dict.get("network")
259
+ )
260
+
261
+ # -----------------------------------------------------------------
262
+ # Choose MetaStudy class dynamically (simple vs full)
263
+ # -----------------------------------------------------------------
264
+ if meta_study_cfg.get("simple_mode", False):
265
+ meta_study_instance = SimpleMetaStudyConfig(**meta_study_cfg)
266
+ else:
267
+ meta_study_instance = MetaStudyConfig(**meta_study_cfg)
268
+
269
+ train_cfg = TrainingConfig._filter_kwargs(train_cfg)
270
+
271
+ return NodePKExperimentConfig(
272
+ experiment_type=str(config_dict.get("experiment_type", "nodepk")).lower(),
273
+ name_str=config_dict.get("name_str", "ExampleModel"),
274
+ tags=config_dict.get("tags", ["node-pk", "B-0"]),
275
+ experiment_name=config_dict.get("experiment_name", "aicme_compartments"),
276
+ experiment_indentifier=config_dict.get("experiment_indentifier", None),
277
+ my_results_path=config_dict.get("my_results_path", None),
278
+ experiment_dir=config_dict.get("experiment_dir", None),
279
+ comet_ai_key=config_dict.get("comet_ai_key", None),
280
+ hugging_face_token=config_dict.get("hugging_face_token", None),
281
+ upload_to_hf_hub=config_dict.get("upload_to_hf_hub", True),
282
+ hf_model_name=config_dict.get("hf_model_name", "NodePK_test"),
283
+ hf_model_card_path=tuple(
284
+ config_dict.get("hf_model_card_path", ("hf_model_card", "CVAE_Readme.md"))
285
+ ),
286
+ debug_test=config_dict.get("debug_test", False),
287
+ network=EncoderDecoderNetworkConfig(**network_cfg),
288
+ mix_data=MixDataConfig(**mix_data_cfg),
289
+ context_observations=ObservationsConfig(**context_obs_cfg),
290
+ target_observations=ObservationsConfig(**target_obs_cfg),
291
+ meta_study=meta_study_instance,
292
+ dosing=MetaDosingConfig(**dosing_cfg),
293
+ train=TrainingConfig(**train_cfg),
294
+ )
295
+
296
+ @staticmethod
297
+ def _merge_dicts(
298
+ base_dict: Optional[Dict[str, Any]], override_dict: Optional[Dict[str, Any]]
299
+ ) -> Dict[str, Any]:
300
+ """Merge two optional dictionaries returning a new dictionary."""
301
+
302
+ merged: Dict[str, Any] = {}
303
+
304
+ if base_dict:
305
+ if not isinstance(base_dict, dict):
306
+ raise TypeError(
307
+ "Expected base_dict to be a mapping when merging configuration sections."
308
+ )
309
+ merged = deepcopy(base_dict)
310
+
311
+ if override_dict:
312
+ if not isinstance(override_dict, dict):
313
+ raise TypeError(
314
+ "Expected override_dict to be a mapping when merging configuration sections."
315
+ )
316
+ merged.update(override_dict)
317
+
318
+ return merged
319
+
320
+ @staticmethod
321
+ def _extract_config_mapping(
322
+ section: Optional[Dict[str, Any]], nested_key: str
323
+ ) -> Optional[Dict[str, Any]]:
324
+ """Return a nested configuration mapping or the section itself."""
325
+
326
+ if section is None:
327
+ return None
328
+
329
+ if not isinstance(section, dict):
330
+ raise TypeError(
331
+ "Expected configuration section to be a mapping when extracting nested"
332
+ f" '{nested_key}' values."
333
+ )
334
+
335
+ if nested_key in section:
336
+ nested_value = section[nested_key]
337
+ if nested_value is None:
338
+ return None
339
+ if not isinstance(nested_value, dict):
340
+ raise TypeError(
341
+ f"Expected '{nested_key}' section to be a mapping when extracting configuration values."
342
+ )
343
+ return nested_value
344
+
345
+ return section
346
+
347
+ @staticmethod
348
+ def _load_ref_yaml(
349
+ ref: Optional[Union[str, Dict[str, Any]]], base_dir: str
350
+ ) -> Optional[Dict[str, Any]]:
351
+ """Load a referenced YAML block or return inline dictionaries as-is."""
352
+
353
+ if ref is None:
354
+ return None
355
+
356
+ if isinstance(ref, dict):
357
+ return ref
358
+
359
+ if isinstance(ref, str):
360
+ ref_path = ref
361
+ if not os.path.isabs(ref_path):
362
+ ref_path = os.path.join(base_dir, ref_path)
363
+
364
+ with open(ref_path, "r") as handle:
365
+ return yaml.load(handle, Loader=TupleSafeLoader) or {}
366
+
367
+ raise TypeError("Expected configuration reference to be a mapping or string path.")
368
+
369
+ @staticmethod
370
+ def _resolve_config_section(
371
+ cfg_dict: Dict[str, Any], base_dir: str, key: str
372
+ ) -> Optional[Dict[str, Any]]:
373
+ """Resolve nested configuration references within a configuration block."""
374
+
375
+ if key not in cfg_dict:
376
+ return None
377
+
378
+ section = cfg_dict[key]
379
+
380
+ if section is None:
381
+ return None
382
+
383
+ if isinstance(section, dict):
384
+ ref_value = section.get("_ref") if "_ref" in section else None
385
+ if ref_value is not None:
386
+ loaded = NodePKExperimentConfig._load_ref_yaml(ref_value, base_dir)
387
+ return loaded or {}
388
+ return section
389
+
390
+ if isinstance(section, str):
391
+ loaded = NodePKExperimentConfig._load_ref_yaml(section, base_dir)
392
+ return loaded or {}
393
+
394
+ raise TypeError(
395
+ f"Expected configuration section '{key}' to be a mapping or string reference."
396
+ )
397
+
398
    def to_yaml(self, file_path: str):
        """Serialise this experiment configuration to ``file_path`` as YAML.

        The dataclass is converted with :func:`dataclasses.asdict`, so nested
        configuration dataclasses are written as plain mappings; block style
        (``default_flow_style=False``) keeps the output human-editable.
        """
        with open(file_path, "w") as file:
            yaml.dump(asdict(self), file, default_flow_style=False)
402
+
403
+
404
+ NodePKConfig = NodePKExperimentConfig
405
+
406
+
407
class HFNodePKConfig(PretrainedConfig):
    """
    HF config wrapping NodePKConfig plus tracked metrics.

    Canonical storage:
        self.tracking: dict with shape
          {
            "best": { "<metric_name>": {"value": float, "step": int|None, "epoch": int|None} },
            "meta": { ...optional... }
          }

    Backward compat:
        - Accepts legacy keys like best_val_loss / best_val_rmse.
        - Mirrors best["val_rmse"] to `best_val_loss` if you still use that elsewhere.
    """

    model_type = "node_pk"

    def __init__(self, **kwargs):
        """Build the config from arbitrary keyword fields plus tracking state.

        ``tracking`` (and the legacy ``best_val_loss`` / ``best_val_rmse``
        scalars) are popped out before ``super().__init__`` so the HF base
        class never sees them; everything else is forwarded and also set as
        an attribute on this instance.
        """
        # --- extract tracking / legacy keys before super().__init__ ---
        tracking = kwargs.pop("tracking", None)

        # legacy keys (accept either; normalize into tracking)
        legacy_best_val_loss = kwargs.pop("best_val_loss", None)
        legacy_best_val_rmse = kwargs.pop("best_val_rmse", None)

        super().__init__(**kwargs)

        # copy remaining config fields
        # NOTE(review): PretrainedConfig may already store unknown kwargs as
        # attributes depending on the transformers version — this loop makes
        # that explicit and unconditional.
        for k, v in kwargs.items():
            setattr(self, k, v)

        # initialize tracking; any non-dict value is discarded and replaced
        # by the empty canonical schema
        if tracking is None or not isinstance(tracking, dict):
            tracking = {"best": {}, "meta": {}}
        tracking.setdefault("best", {})
        tracking.setdefault("meta", {})
        self.tracking: Dict[str, Any] = tracking

        # fold legacy into canonical schema if present; best_val_loss wins
        # over best_val_rmse when both are supplied
        legacy = legacy_best_val_loss if legacy_best_val_loss is not None else legacy_best_val_rmse
        if legacy is not None:
            # canonical metric name for the legacy scalar
            self.set_best("val_rmse", legacy)

        # optional alias for older codepaths
        self._sync_legacy_aliases()

    # --------- public API ----------
    def set_best(
        self,
        metric_name: str,
        value: Any,
        *,
        step: Optional[int] = None,
        epoch: Optional[int] = None,
    ) -> None:
        """Record ``value`` as the best seen for ``metric_name`` (unconditionally)."""
        v = _to_float(value)
        self.tracking["best"][metric_name] = {"value": v, "step": step, "epoch": epoch}
        self._sync_legacy_aliases()

    def get_best(self, metric_name: str, default: float = math.inf) -> float:
        """Return the best recorded value for ``metric_name``, or ``default``."""
        d = self.tracking.get("best", {}).get(metric_name)
        if not d:
            return float(default)
        return _to_float(d.get("value", default))

    def is_better(
        self,
        metric_name: str,
        candidate_value: Any,
        *,
        higher_is_better: bool = False,
    ) -> bool:
        """Return True when ``candidate_value`` beats the stored best.

        With no stored best, the comparison baseline is +/-inf so any finite
        candidate wins. Ties are NOT considered better.
        """
        cand = _to_float(candidate_value)
        best = self.get_best(metric_name, default=(-math.inf if higher_is_better else math.inf))
        return cand > best if higher_is_better else cand < best

    def update_if_better(
        self,
        metric_name: str,
        candidate_value: Any,
        *,
        step: Optional[int] = None,
        epoch: Optional[int] = None,
        higher_is_better: bool = False,
    ) -> bool:
        """Store ``candidate_value`` if it improves on the best; return whether it did."""
        if self.is_better(metric_name, candidate_value, higher_is_better=higher_is_better):
            self.set_best(metric_name, candidate_value, step=step, epoch=epoch)
            return True
        return False

    # --------- construction ----------
    @classmethod
    def from_nodepk(cls, nodepk_cfg, **tracked_best: float) -> "HFNodePKConfig":
        """Build an HF config from a NodePK dataclass instance.

        tracked_best: e.g. val_rmse=..., val_nll=..., val_crps=...
        Each entry is recorded via :meth:`set_best`.
        """
        cfg_dict = asdict(nodepk_cfg)
        cfg = cls(**cfg_dict)
        for k, v in tracked_best.items():
            cfg.set_best(k, v)
        return cfg

    # --------- internal ----------
    def _sync_legacy_aliases(self) -> None:
        """
        Keep a legacy scalar field for older code that expects `best_val_loss`.
        Here we mirror it to `best["val_rmse"]` by convention.
        """
        # if val_rmse exists, mirror it; otherwise inf
        self.best_val_loss = self.get_best("val_rmse", default=math.inf)
sim_priors_pk/config_classes/source_process_config.py ADDED
@@ -0,0 +1,52 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ from dataclasses import dataclass
3
+ from typing import Optional, Union
4
+
5
+ try: # pragma: no cover - exercised indirectly via configuration loading
6
+ import yaml # type: ignore
7
+ except ModuleNotFoundError: # pragma: no cover - fallback for minimal environments
8
+ from sim_priors_pk.config_classes import yaml_fallback as yaml
9
+
10
+
11
@dataclass
class SourceProcessConfig:
    """Configuration of the stochastic source process for flow/diffusion PK models.

    Accepted ``source_type`` values (case-insensitive):
    ``"gaussian_process"``/``"gp"``, ``"ornstein_uhlenbeck"``/``"ou"``,
    ``"wiener"``, and ``"normal"``/``"gaussian"``.
    """

    source_type: str = "gaussian_process"

    # Gaussian-process hyper-parameters (RBF or OU kernels).
    gp_length_scale: float = 0.1
    gp_variance: float = 1.0
    gp_eps: float = 1e-8
    gp_transform: str = 'softplus'  # transformation applied to the sampled noise, e.g. 'softplus', 'exp'

    # Flow-matching additive noise settings (used only in FlowPK).
    flow_sigma: float = 1e-4
    flow_num_steps: int = 100
    use_OT_coupling: bool = False

    @classmethod
    def from_yaml(cls, file_path: Union[str, os.PathLike]) -> "SourceProcessConfig":
        """Build a :class:`SourceProcessConfig` from a YAML file.

        The document may carry the fields at the top level or nest them under
        one of the section names ``source_process``, ``source`` or
        ``noise_model`` — the first match wins and an empty section falls
        back to all defaults.
        """
        with open(file_path, "r", encoding="utf-8") as handle:
            raw = yaml.safe_load(handle) or {}

        if isinstance(raw, dict):
            section_key = next(
                (k for k in ("source_process", "source", "noise_model") if k in raw),
                None,
            )
            if section_key is not None:
                raw = raw.get(section_key) or {}

        if not isinstance(raw, dict):
            raise TypeError("Expected source process configuration to be a mapping.")

        return cls(**raw)
+ return cls(**config_dict)
sim_priors_pk/config_classes/training_config.py ADDED
@@ -0,0 +1,96 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ from dataclasses import dataclass, field, fields
3
+ from typing import Any, Dict, List, Optional, Union
4
+
5
+ import yaml
6
+
7
+
8
@dataclass
class SchedulerTaskConfig:
    """Typed configuration for one scheduler task.

    ``name`` identifies the task and ``fn_key`` selects which callback
    function runs it; the remaining fields control sampling, logging and
    metric-based checkpointing for that task. Free-form options go in
    ``task_cfg``.
    """

    name: str
    fn_key: str
    n_samples: int = 0
    sample_source: str = "unconditional"
    split: str = "val"
    empirical_name: Optional[str] = None
    save_to_disk: bool = True
    log_prefix: str = "val"
    use_ema: bool = False
    # Metric checkpointing: when enabled, checkpoint_metric_name names the
    # tracked metric and checkpoint_mode chooses min/max selection.
    checkpoint_metric: bool = False
    checkpoint_metric_name: Optional[str] = None
    checkpoint_mode: str = "min"
    # Arbitrary task-specific options passed through untouched.
    task_cfg: Dict[str, Any] = field(default_factory=dict)
25
+
26
+
27
@dataclass
class SchedulerConfig:
    """Typed configuration for scheduler-driven callback execution.

    Groups global execution/caching knobs with three task lists run at
    different points of training (validation, during training, and at the
    end). Defaults run every task once per full pass (``percent_step=1.0``)
    and include the end-of-training evaluation.
    """

    percent_step: float = 1.0
    include_end: bool = True
    skip_sanity_check: bool = True
    store_samples: bool = True
    max_samples_per_group: int = 32
    keep_temp_files: bool = False
    cache_dir: Optional[str] = None
    # Supported selectors are:
    # - ``end`` for in-memory train-end weights,
    # - ``last`` / ``best`` for experiment checkpoint callbacks,
    # - scheduler-managed metric checkpoint names emitted by tasks.
    checkpoint_used_in_end: List[str] = field(default_factory=lambda: ["end"])
    tasks_validation: List[SchedulerTaskConfig] = field(default_factory=list)
    task_during: List[SchedulerTaskConfig] = field(default_factory=list)
    tasks_end: List[SchedulerTaskConfig] = field(default_factory=list)
46
+
47
+
48
@dataclass
class TrainingConfig:
    """Optimiser, dataloader and scheduler settings for a training run."""

    epochs: int = 20
    batch_size: int = 8
    gradient_clip_val: float = 1.0
    optimizer_name: str = "AdamW"
    learning_rate: float = 0.0001
    weight_decay: float = 1.0e-4
    num_workers: int = 3
    persistent_workers: bool = True
    shuffle_val: bool = True

    num_batch_plot: int = 1
    log_interval: int = 1  # Frequency of logging and visualization

    # Scheduler-driven PK evaluation and visualization.
    callbacks_scheduler: Optional[Union[SchedulerConfig, Dict[str, Any]]] = None

    betas: List[float] = field(default_factory=lambda: [0.9, 0.999])
    eps: float = 1.0e-8
    amsgrad: bool = False
    scheduler_name: str = "CosineAnnealingLR"
    scheduler_params: Dict[str, Union[float, int]] = field(
        default_factory=lambda: {"T_max": 1000, "eta_min": 5.0e-5, "last_epoch": -1}
    )

    @classmethod
    def from_yaml(cls, file_path: Union[str, os.PathLike]) -> "TrainingConfig":
        """Create a :class:`TrainingConfig` from a YAML file.

        Accepts either a top-level mapping of training fields or a document
        carrying a ``train`` section; unknown keys are silently dropped via
        :meth:`_filter_kwargs`.
        """
        with open(file_path, "r", encoding="utf-8") as handle:
            raw = yaml.safe_load(handle) or {}

        if isinstance(raw, dict) and "train" in raw:
            raw = raw.get("train") or {}

        if not isinstance(raw, dict):
            raise TypeError("Expected 'train' section in YAML to be a mapping.")

        return cls(**cls._filter_kwargs(raw))

    @classmethod
    def _filter_kwargs(cls, raw: Dict[str, Any]) -> Dict[str, Any]:
        """Keep only keys matching dataclass fields (drops deprecated flags)."""
        if not isinstance(raw, dict):
            return {}
        known = {f.name for f in fields(cls)}
        return {name: raw[name] for name in raw if name in known}
sim_priors_pk/config_classes/utils.py ADDED
@@ -0,0 +1,14 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ try: # pragma: no cover - exercised indirectly via configuration loading
2
+ import yaml # type: ignore
3
+ from yaml import SafeLoader # type: ignore
4
+ except ModuleNotFoundError: # pragma: no cover - fallback for minimal environments
5
+ from sim_priors_pk.config_classes import yaml_fallback as yaml
6
+ SafeLoader = yaml.SafeLoader
7
+
8
class TupleSafeLoader(SafeLoader):
    """SafeLoader variant that deserialises ``!!python/tuple`` nodes as tuples.

    A plain ``SafeLoader`` does not handle the python-specific tuple tag;
    this subclass registers a constructor for it so tuples written by
    ``yaml.dump`` round-trip as tuples instead of failing to load.
    """

    def construct_python_tuple(self, node):
        # Convert the YAML sequence (e.g., [0.01, 0.1]) into a tuple
        return tuple(self.construct_sequence(node))

# Register the constructor for the fully qualified tag
TupleSafeLoader.add_constructor('tag:yaml.org,2002:python/tuple', TupleSafeLoader.construct_python_tuple)
sim_priors_pk/config_classes/yaml_fallback.py ADDED
@@ -0,0 +1,143 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Minimal YAML loader fallback used when PyYAML is unavailable."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import ast
6
+ import json
7
+ from typing import Any, Dict, List
8
+
9
+
10
class SafeLoader:
    """Stand-in for :class:`yaml.SafeLoader` when PyYAML is absent.

    Constructors are recorded for API compatibility only; the fallback
    parser in this module never consults them.
    """

    # Class-level registry shared via normal attribute lookup.
    _constructors: Dict[str, Any] = {}

    @classmethod
    def add_constructor(cls, tag: str, constructor: Any) -> None:
        """Remember ``constructor`` under ``tag``."""
        cls._constructors.update({tag: constructor})
18
+
19
+
20
+ Loader = SafeLoader
21
+
22
+
23
+ def _convert_scalar(value: str) -> Any:
24
+ lowered = value.lower()
25
+ if lowered in {"true", "yes"}:
26
+ return True
27
+ if lowered in {"false", "no"}:
28
+ return False
29
+ if lowered in {"null", "none", "~"}:
30
+ return None
31
+
32
+ if value.startswith("[") or value.startswith("{") or value.startswith("("):
33
+ try:
34
+ return ast.literal_eval(value)
35
+ except (SyntaxError, ValueError):
36
+ pass
37
+
38
+ if value.startswith("\"") and value.endswith("\""):
39
+ return value[1:-1]
40
+ if value.startswith("'") and value.endswith("'"):
41
+ return value[1:-1]
42
+
43
+ try:
44
+ if "." in value or "e" in lowered:
45
+ return float(value)
46
+ return int(value)
47
+ except ValueError:
48
+ pass
49
+
50
+ return value
51
+
52
+
53
def _parse_lines(lines: List[str], indent: int = 0) -> Any:
    """Recursively parse ``lines`` (consumed in place) at one indent level.

    Returns a dict for a mapping block or a list for a sequence block; an
    empty/exhausted block yields ``{}``. Mixing mapping entries and sequence
    items at the same level raises ``ValueError``. Nested blocks are assumed
    to be indented by two extra spaces relative to their parent.
    """
    mapping: Dict[str, Any] = {}
    sequence: List[Any] = []
    # Undecided until the first real entry is seen; then locked to list/map.
    is_list: bool | None = None

    while lines:
        line = lines[0]
        stripped = line.lstrip()

        # Skip blank lines and comment lines.
        if not stripped or stripped.startswith("#"):
            lines.pop(0)
            continue

        current_indent = len(line) - len(stripped)

        # A dedent ends this block — except for list items, which are allowed
        # at a shallower indent than their parent key (common YAML style).
        if current_indent < indent and not stripped.startswith("- "):
            break

        if stripped.startswith("- "):
            if is_list is False:
                raise ValueError("Mixed mapping and sequence at the same level is unsupported.")
            is_list = True

            lines.pop(0)
            item_value = stripped[2:].strip()

            # Bare "-": the item body is the following, more-indented block.
            if not item_value:
                sequence.append(_parse_lines(lines, current_indent + 2))
                continue

            # "- key:": single-key mapping item whose value is a nested block.
            if item_value.endswith(":"):
                key = item_value[:-1].strip()
                value = _parse_lines(lines, current_indent + 2)
                sequence.append({key: value})
                continue

            # "- scalar": plain sequence element.
            sequence.append(_convert_scalar(item_value))
            continue

        if is_list is True:
            raise ValueError("Mixed mapping and sequence at the same level is unsupported.")

        is_list = False

        lines.pop(0)
        if ":" not in stripped:
            raise ValueError(f"Invalid mapping entry: '{stripped}'.")

        # Split only on the first colon so values may contain colons.
        key, value_part = stripped.split(":", 1)
        key = key.strip()
        value_part = value_part.strip()

        if value_part:
            mapping[key] = _convert_scalar(value_part)
        else:
            # No inline value: the value is the following nested block.
            mapping[key] = _parse_lines(lines, current_indent + 2)

    if is_list:
        return sequence
    return mapping
113
+
114
+
115
def safe_load(stream: Any) -> Any:
    """Parse YAML from a string or text stream with the fallback parser.

    Mirrors :func:`yaml.safe_load` closely enough for this package: empty
    input yields ``None``; non-string content raises ``TypeError``.
    """
    content = stream.read() if hasattr(stream, "read") else stream

    if not isinstance(content, str):
        raise TypeError("YAML content must be a string or text stream.")

    split = content.splitlines()
    if not split:
        return None
    return _parse_lines(list(split))
128
+
129
+
130
def load(stream: Any, Loader: Any | None = None) -> Any:  # noqa: N803 - API compatibility
    """Fallback mirror of :func:`yaml.load`; the ``Loader`` argument is ignored."""
    del Loader  # accepted only for signature compatibility with PyYAML
    return safe_load(stream)
134
+
135
+
136
def dump(data: Any, stream: Any | None = None, default_flow_style: bool | None = None) -> str:
    """Serialise ``data`` as indented JSON (the fallback's YAML stand-in).

    When ``stream`` is provided the text is written to it and an empty
    string is returned; otherwise the serialised text is returned.
    ``default_flow_style`` is accepted only for API compatibility.
    """
    rendered = json.dumps(data, indent=2)
    if stream is None:
        return rendered
    stream.write(rendered)
    return ""
sim_priors_pk/data/README.md ADDED
@@ -0,0 +1,86 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # `sim_priors_pk.data` Package Guide
2
+
3
+ This guide documents the purpose of every subpackage that lives under `sim_priors_pk/data`. Use it as a quick reference when wiring new data pipelines or navigating the simulated pharmacokinetic (PK) workflow.
4
+
5
+ ## Configuration Preamble
6
+
7
+ Simulations are configured by combining reusable YAML files with the dataclasses that live in `sim_priors_pk.config_classes`. The YAML files populate those classes; configurations can be defined either directly from the dataclasses or by reading the files.
8
+
9
+ - **YAML files (`config_files/`)** – Ready-to-use experiment definitions grouped under `config_files/experiment_configs`. For example, the `node-pk` folder contains `base-homogeneous.*.yaml` files that describe meta-study, dosing, and observation settings.
10
+ - **Config dataclasses (`sim_priors_pk/config_classes/`)** – Python dataclasses (`MetaStudyConfig`, `MetaDosingConfig`, `ObservationsConfig`, and friends) that parse those YAML files or can be instantiated directly in code when you need programmatic overrides.
11
+
12
+ When you load configurations in tests or scripts, prefer `MetaStudyConfig.from_yaml(...)` and similar helpers. They keep the simulation code aligned with the canonical YAML layout while still allowing you to craft configurations in pure Python when necessary.
13
+
14
+ ## Top-Level Layout
15
+
16
+ These are the files that matter for the handling of simulations and data:
17
+
18
+ ```
19
+
20
+ sim_priors_pk/
21
+ ├── config_files/
22
+ │ └── experiment_configs/
23
+ ├── scripts/
24
+ ├── sim_priors_pk/
25
+ │ ├── config_classes/
26
+ │ └── data/
27
+ │ ├── data_empirical/
28
+ │ ├── data_generation/
29
+ │ ├── data_preprocessing/
30
+ │ ├── datasets/
31
+ │ └── extra/
32
+ └── tests/
33
+ └── data/
34
+ └── simulation_data/
35
+ └── test_simulations.py
36
+ ```
37
+
38
+ Each directory is described below together with the most important entry points it exposes.
39
+
40
+ ## `data_empirical`
41
+
42
+ Defines the data contracts used across the project.
43
+ These contracts specify the canonical JSON schema (StudyJSON, IndividualJSON) that standardizes how pharmacokinetic studies are represented — both empirical and simulated.
44
+ They serve as the interface between raw datasets, tensor batches, and model-ready data structures, ensuring a unified format throughout the pipeline. These helpers make it straightforward to load Hugging Face datasets or local JSON files, validate them, and materialise PyTorch-compatible batches.
45
+
46
+
47
+ ## `data_generation`
48
+
49
+ Simulation building blocks used to synthesise PK trajectories under configurable dosing and observation schemes.
50
+
51
+ * [`compartment_models.py`](data_generation/compartment_models.py) implements the stochastic sampling of population/individual PK parameters and the compartmental simulation loops.
52
+ * [`observations_classes.py`](data_generation/observations_classes.py) describe observation strategies (e.g. sparse vs. dense sampling) and utilities to realise them.
53
+ * [`compartment_models_management.py`](data_generation/compartment_models_management.py) orchestrates the full simulation workflow: it takes the meta-configuration, samples individual and dosing configurations, runs the compartmental simulations, applies the observation strategy, and assembles complete ensembles of studies in the data contracts.
54
+
55
+ Together these modules allow you to go from configuration dataclasses to simulated studies that mirror the empirical format.
56
+
57
+ ## `data_preprocessing`
58
+
59
+ Deprecated.
60
+
61
+ ## `datasets`
62
+
63
+ Lightning-ready dataset/dataloader factories.
64
+
65
+ - [`aicme_datasets.py`](datasets/aicme_datasets.py) defines `AICMECompartmentsDataBatch` and related PyTorch Lightning `DataModule` wrappers that harmonise both empirical and simulated studies for downstream training.
66
+
67
+
68
+ ## Putting It All Together
69
+
70
+ A typical workflow is:
71
+
72
+ 1. **Configure**: Use `sim_priors_pk.config_classes` to describe study, dosing, and observation priors.
73
+ 2. **Simulate**: Call into `data_generation` to sample synthetic studies or to augment empirical cohorts.
74
+ 3. **Serialise or load**: Store simulations as JSON, or load existing JSON/CSV with `data_empirical` and `data_preprocessing`.
75
+ 4. **Batch**: Wrap tensors using `datasets.AICMECompartmentsDataModule` for consumption by modules in `sim_priors_pk.models` and training scripts.
76
+
77
+ Refer back to this document whenever you onboard a new collaborator or reorganise data flows—the sections above stay aligned with the current code base.
78
+
79
+ ## Worked Examples and Tests
80
+
81
+ Integration-style tests in `tests/data/simulation_data/test_simulations.py` demonstrate how the configuration pieces fit together:
82
+
83
+ - `test_prepare_full_simulation_to_study_json` shows how YAML-driven configs from `config_files/experiment_configs/node-pk` feed into `prepare_full_simulation_to_study_json` and culminate in a canonical `StudyJSON`.
84
+ - `test_prepare_ensemble_of_simulations` builds on the same configuration files to generate an ensemble of studies and persists them to disk, illustrating how bulk simulations can be orchestrated.
85
+
86
+ Use these tests as executable documentation whenever you need to follow the end-to-end flow from configuration files to simulated study artefacts.
sim_priors_pk/data/__init__.py ADDED
@@ -0,0 +1,12 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Utility namespace for data-related modules.
2
+
3
+ This package groups empirical, generation, preprocessing, and dataset
4
+ helpers so they can be imported with the ``sim_priors_pk.data`` prefix.
5
+ """
6
+
7
+ __all__ = [
8
+ "data_empirical",
9
+ "data_generation",
10
+ "data_preprocessing",
11
+ "datasets",
12
+ ]
sim_priors_pk/data/data_empirical/__init__.py ADDED
@@ -0,0 +1,35 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Utilities for working with empirical JSON study data."""
2
+
3
+ try: # pragma: no cover - optional torch dependency
4
+ from .builder import (
5
+ JSON2AICMEBuilder,
6
+ EmpiricalBatchConfig,
7
+ held_out_ind_json,
8
+ held_out_list_json,
9
+ load_empirical_json_batches,
10
+ load_empirical_json_batches_as_dm,
11
+ load_empirical_hf_batches_as_dm,
12
+ databatch_to_study_jsons,
13
+ prediction_to_study_jsons,
14
+ )
15
+ except ModuleNotFoundError as exc: # pragma: no cover - allow missing torch
16
+ if exc.name != "torch":
17
+ raise
18
+ JSON2AICMEBuilder = EmpiricalBatchConfig = None # type: ignore
19
+ held_out_ind_json = held_out_list_json = None # type: ignore
20
+ load_empirical_json_batches = load_empirical_json_batches_as_dm = None # type: ignore
21
+ databatch_to_study_jsons = prediction_to_study_jsons = None # type: ignore
22
+
23
+ __all__ = [
24
+ "json_schema",
25
+ "JSON2AICMEBuilder",
26
+ "EmpiricalBatchConfig",
27
+ "held_out_ind_json",
28
+ "held_out_list_json",
29
+ "load_empirical_json_batches",
30
+ "load_empirical_json_batches_as_dm",
31
+ "load_empirical_hf_batches_as_dm",
32
+ "databatch_to_study_jsons",
33
+ "prediction_to_study_jsons",
34
+ "json_stats",
35
+ ]
sim_priors_pk/data/data_empirical/builder.py ADDED
@@ -0,0 +1,1139 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ import json
4
+ from dataclasses import dataclass
5
+ from pathlib import Path
6
+ from typing import TYPE_CHECKING, Dict, List, Optional, Union
7
+
8
+ import torch
9
+ from datasets import load_dataset
10
+ from torchtyping import TensorType as TT
11
+
12
+ from sim_priors_pk.config_classes.data_config import (
13
+ MetaDosingConfig,
14
+ )
15
+ from sim_priors_pk.data.datasets.aicme_batch import AICMECompartmentsDataBatch
16
+
17
+ if TYPE_CHECKING: # pragma: no cover - imported only for type hints
18
+ from sim_priors_pk.data.datasets.aicme_datasets import AICMECompartmentsDataModule
19
+
20
+ from .json_schema import IndividualJSON, StudyJSON, canonicalize_study
21
+ from .json_stats import EmpiricalJSONStats, compute_json_stats
22
+
23
+
24
@dataclass
class EmpiricalBatchConfig:
    """Configuration for empirical batch construction.

    Attributes
    ----------
    pad_value_time:
        Value used to pad time tensors.
    pad_value_obs:
        Value used to pad observation tensors.
    max_databatch_size:
        Maximum number of studies that can be stacked into a single batch.
    max_individuals:
        Maximum number of individuals per context or target block.
    max_observations:
        Maximum number of observation time points per individual.
    max_remaining:
        Maximum number of remaining time points per individual.
    max_context_individuals / max_target_individuals:
        Optional overrides specifying separate capacities for context and
        target individual counts.
    max_context_observations / max_target_observations:
        Optional overrides specifying per-block observation capacities.
    max_context_remaining / max_target_remaining:
        Optional overrides specifying per-block remaining simulation
        capacities.
    """

    # Padding fill values.
    pad_value_time: float = 0.0
    pad_value_obs: float = 0.0
    # Shared capacity defaults.
    max_databatch_size: int = 8
    max_individuals: int = 1
    max_observations: int = 0
    max_remaining: int = 0
    # Per-block overrides; ``None`` means "fall back to the shared default".
    max_context_individuals: Optional[int] = None
    max_target_individuals: Optional[int] = None
    max_context_observations: Optional[int] = None
    max_target_observations: Optional[int] = None
    max_context_remaining: Optional[int] = None
    max_target_remaining: Optional[int] = None
64
+
65
+
66
+ class JSON2AICMEBuilder:
67
+ """Convert empirical study JSON to :class:`AICMECompartmentsDataBatch`.
68
+
69
+ The builder pads context and target individuals to fixed sizes and
70
+ assembles the :class:`AICMECompartmentsDataBatch` expected by the models.
71
+ """
72
+
73
    def __init__(self, cfg: EmpiricalBatchConfig) -> None:
        """Store the padding/capacity configuration used by all build steps."""
        self.cfg = cfg
75
+
76
+ def _ctx_cap(self) -> int:
77
+ return (
78
+ self.cfg.max_context_individuals
79
+ if self.cfg.max_context_individuals is not None
80
+ else self.cfg.max_individuals
81
+ )
82
+
83
+ def _tgt_cap(self) -> int:
84
+ return (
85
+ self.cfg.max_target_individuals
86
+ if self.cfg.max_target_individuals is not None
87
+ else self.cfg.max_individuals
88
+ )
89
+
90
+ def _ctx_obs_cap(self) -> int:
91
+ return (
92
+ self.cfg.max_context_observations
93
+ if self.cfg.max_context_observations is not None
94
+ else self.cfg.max_observations
95
+ )
96
+
97
+ def _tgt_obs_cap(self) -> int:
98
+ return (
99
+ self.cfg.max_target_observations
100
+ if self.cfg.max_target_observations is not None
101
+ else self.cfg.max_observations
102
+ )
103
+
104
+ def _ctx_rem_cap(self) -> int:
105
+ return (
106
+ self.cfg.max_context_remaining
107
+ if self.cfg.max_context_remaining is not None
108
+ else self.cfg.max_remaining
109
+ )
110
+
111
+ def _tgt_rem_cap(self) -> int:
112
+ return (
113
+ self.cfg.max_target_remaining
114
+ if self.cfg.max_target_remaining is not None
115
+ else self.cfg.max_remaining
116
+ )
117
+
118
    def _block_from_inds(
        self,
        inds: List[IndividualJSON],
        *,
        max_individuals: int,
        obs_cap: int,
        rem_cap: int,
    ) -> Dict[str, TT]:
        """Assemble tensors for a list of individuals.

        Padding is applied so that each block has the same number of
        individuals (``max_individuals``) and time steps
        (``max_observations``/``max_remaining``). Individuals beyond the
        capacity are silently dropped and per-individual series longer than
        the capacity are truncated.

        Returns a dict with keys ``obs``/``time``/``mask`` of shape [I, ET]
        and ``rem``/``rem_time``/``rem_mask`` of shape [I, R] (R may be 0);
        masks are True exactly where real (non-padded) values were written.

        NOTE(review): assumes each individual dict carries parallel,
        equal-length lists under "observations"/"observation_times" (and
        optionally "remaining"/"remaining_times") — only the observation
        length is used to set the valid mask; confirm upstream validation.
        """

        # Clamp capacities so negative configuration values behave as zero.
        I_max = max(0, max_individuals)
        ET = max(0, obs_cap)
        R = max(0, rem_cap)

        # Pre-filled padded tensors; the loop below overwrites valid entries.
        obs_tensor = torch.full((I_max, ET), self.cfg.pad_value_obs)  # [I, ET]
        time_tensor = torch.full((I_max, ET), self.cfg.pad_value_time)  # [I, ET]
        mask_tensor = torch.zeros((I_max, ET), dtype=torch.bool)  # [I, ET]

        # With R == 0 the "remaining" tensors degenerate to zero-width views.
        rem_tensor = (
            torch.full((I_max, R), self.cfg.pad_value_obs) if R else torch.zeros(I_max, 0)
        )  # [I, R]
        rem_time_tensor = (
            torch.full((I_max, R), self.cfg.pad_value_time) if R else torch.zeros(I_max, 0)
        )  # [I, R]
        rem_mask_tensor = (
            torch.zeros((I_max, R), dtype=torch.bool)
            if R
            else torch.zeros(I_max, 0, dtype=torch.bool)
        )  # [I, R]

        for i, ind in enumerate(inds[:I_max]):
            obs = torch.tensor(ind.get("observations", []), dtype=torch.float32)  # [ET?]
            time = torch.tensor(ind.get("observation_times", []), dtype=torch.float32)  # [ET?]
            # Truncate to capacity; mask marks the copied prefix as valid.
            L = min(obs.shape[0], ET)
            obs_tensor[i, :L] = obs[:L]
            time_tensor[i, :L] = time[:L]
            mask_tensor[i, :L] = True

            rem = torch.tensor(ind.get("remaining", []), dtype=torch.float32)  # [R?]
            rem_t = torch.tensor(ind.get("remaining_times", []), dtype=torch.float32)  # [R?]
            Lr = min(rem.shape[0], R)
            if R:
                rem_tensor[i, :Lr] = rem[:Lr]
                rem_time_tensor[i, :Lr] = rem_t[:Lr]
                rem_mask_tensor[i, :Lr] = True

        return {
            "obs": obs_tensor,
            "time": time_tensor,
            "mask": mask_tensor,
            "rem": rem_tensor,
            "rem_time": rem_time_tensor,
            "rem_mask": rem_mask_tensor,
        }
177
+
178
+ def build_study_batch(
179
+ self, study: StudyJSON, meta_dosing: MetaDosingConfig
180
+ ) -> AICMECompartmentsDataBatch:
181
+ """Build a batch for a single study.
182
+ DOES NOT USES OBSERVATIONS STRATEGIESM,
183
+ takes the observation structure as given by the JSON data
184
+
185
+ Parameters
186
+ ----------
187
+ study:
188
+ Canonicalised representation of one study.
189
+ meta_dosing:
190
+ Global dosing configuration.
191
+
192
+ Returns
193
+ -------
194
+ AICMECompartmentsDataBatch
195
+ Batch with ``B=1``.
196
+ """
197
+
198
+ study = canonicalize_study(study)
199
+ ctx_cap = self._ctx_cap()
200
+ tgt_cap = self._tgt_cap()
201
+
202
+ ctx_block = self._block_from_inds(
203
+ study["context"],
204
+ max_individuals=ctx_cap,
205
+ obs_cap=self._ctx_obs_cap(),
206
+ rem_cap=self._ctx_rem_cap(),
207
+ )
208
+ tgt_block = self._block_from_inds(
209
+ study["target"],
210
+ max_individuals=tgt_cap,
211
+ obs_cap=self._tgt_obs_cap(),
212
+ rem_cap=self._tgt_rem_cap(),
213
+ )
214
+
215
+ route_vocab = {r: i for i, r in enumerate(meta_dosing.route_options)}
216
+
217
+ def _dose_route(inds: List[IndividualJSON], I_max: int):
218
+ amounts = torch.zeros(1, I_max, dtype=torch.float32) # [1, I]
219
+ routes = torch.zeros(1, I_max, dtype=torch.long) # [1, I]
220
+ for i, ind in enumerate(inds[:I_max]):
221
+ if ind.get("dosing"):
222
+ amounts[0, i] = ind["dosing"][0]
223
+ routes[0, i] = route_vocab.get(ind["dosing_type"][0], 0)
224
+ return amounts, routes
225
+
226
+ c_dose, c_route = _dose_route(study["context"], ctx_cap)
227
+ t_dose, t_route = _dose_route(study["target"], tgt_cap)
228
+
229
+ def _unsqueeze(block):
230
+ obs = block["obs"].unsqueeze(0).unsqueeze(-1) # [1, I, ET, 1]
231
+ time = block["time"].unsqueeze(0).unsqueeze(-1) # [1, I, ET, 1]
232
+ mask = block["mask"].unsqueeze(0) # [1, I, ET]
233
+ rem = block["rem"].unsqueeze(0).unsqueeze(-1) # [1, I, R, 1]
234
+ rem_time = block["rem_time"].unsqueeze(0).unsqueeze(-1) # [1, I, R, 1]
235
+ rem_mask = block["rem_mask"].unsqueeze(0) # [1, I, R]
236
+ return obs, time, mask, rem, rem_time, rem_mask
237
+
238
+ t_obs, t_time, t_mask, t_rem, t_rem_time, t_rem_mask = _unsqueeze(tgt_block)
239
+ c_obs, c_time, c_mask, c_rem, c_rem_time, c_rem_mask = _unsqueeze(ctx_block)
240
+
241
+ mask_ctx_inds = torch.zeros(1, ctx_cap, dtype=torch.bool) # [1, I]
242
+ mask_ctx_inds[0, : min(len(study["context"]), ctx_cap)] = True
243
+ mask_tgt_inds = torch.zeros(1, tgt_cap, dtype=torch.bool) # [1, I]
244
+ mask_tgt_inds[0, : min(len(study["target"]), tgt_cap)] = True
245
+
246
+ study_name = [study["meta_data"]["study_name"]]
247
+ substance_name = [study["meta_data"].get("substance_name", "")]
248
+
249
+ context_subject_name = [
250
+ [
251
+ study["context"][i].get("name_id", "") if i < len(study["context"]) else ""
252
+ for i in range(ctx_cap)
253
+ ]
254
+ ]
255
+ target_subject_name = [
256
+ [
257
+ study["target"][i].get("name_id", "") if i < len(study["target"]) else ""
258
+ for i in range(tgt_cap)
259
+ ]
260
+ ]
261
+
262
+ batch = AICMECompartmentsDataBatch(
263
+ target_obs=t_obs,
264
+ target_obs_time=t_time,
265
+ target_obs_mask=t_mask,
266
+ target_rem_sim=t_rem,
267
+ target_rem_sim_time=t_rem_time,
268
+ target_rem_sim_mask=t_rem_mask,
269
+ context_obs=c_obs,
270
+ context_obs_time=c_time,
271
+ context_obs_mask=c_mask,
272
+ context_rem_sim=c_rem,
273
+ context_rem_sim_time=c_rem_time,
274
+ context_rem_sim_mask=c_rem_mask,
275
+ target_dosing_amounts=t_dose,
276
+ target_dosing_route_types=t_route,
277
+ context_dosing_amounts=c_dose,
278
+ context_dosing_route_types=c_route,
279
+ mask_context_individuals=mask_ctx_inds,
280
+ mask_target_individuals=mask_tgt_inds,
281
+ study_name=study_name,
282
+ context_subject_name=context_subject_name,
283
+ target_subject_name=target_subject_name,
284
+ substance_name=substance_name,
285
+ time_scales=torch.tensor([[0.0, 0.0]]),
286
+ is_empirical=True,
287
+ )
288
+ return batch
289
+
290
+ @staticmethod
291
+ def _stack_B(
292
+ batches: List[AICMECompartmentsDataBatch],
293
+ ) -> AICMECompartmentsDataBatch:
294
+ """Concatenate ``batches`` along the batch dimension ``B``.
295
+
296
+ Each input batch must have ``B=1``; the returned batch will have
297
+ ``B=len(batches)`` with index order preserved.
298
+ """
299
+
300
+ if not batches:
301
+ raise ValueError("batches must not be empty")
302
+
303
+ stacked_fields = []
304
+ for values in zip(*batches):
305
+ first = values[0]
306
+ if isinstance(first, torch.Tensor):
307
+ stacked_fields.append(torch.cat(values, dim=0)) # [B, ...]
308
+ elif isinstance(first, list):
309
+ if first and isinstance(first[0], list):
310
+ merged_nested: List[List[str]] = []
311
+ for v in values:
312
+ merged_nested.extend(v)
313
+ stacked_fields.append(merged_nested)
314
+ else:
315
+ merged: List[str] = []
316
+ for v in values:
317
+ merged.extend(v)
318
+ stacked_fields.append(merged)
319
+ else:
320
+ stacked_fields.append(first)
321
+ return AICMECompartmentsDataBatch(*stacked_fields)
322
+
323
+ def build_one_aicmebatch(
324
+ self, studies: List[StudyJSON], meta_dosing: MetaDosingConfig
325
+ ) -> AICMECompartmentsDataBatch:
326
+ """Build a single batch from multiple studies.
327
+
328
+ Parameters
329
+ ----------
330
+ studies:
331
+ List of studies to combine. The resulting batch will have
332
+ ``B=len(studies)``.
333
+ meta_dosing:
334
+ Global dosing configuration shared across studies.
335
+
336
+ Returns
337
+ -------
338
+ AICMECompartmentsDataBatch
339
+ Combined batch with batch dimension indexing the supplied
340
+ studies in order.
341
+ """
342
+
343
+ per_study = [self.build_study_batch(s, meta_dosing) for s in studies]
344
+ return self._stack_B(per_study)
345
+
346
+ def build_one_aicmebatch_as_dataset(
347
+ self,
348
+ studies: List[StudyJSON],
349
+ context_strategy,
350
+ target_strategy,
351
+ meta_dosing: MetaDosingConfig,
352
+ *,
353
+ return_studies: bool = False, # ← debugging flag (default = True)
354
+ ) -> List[Union[AICMECompartmentsDataBatch, List[StudyJSON]]]:
355
+ """Create batches mirroring ``AICMECompartmentsDataset`` processing.
356
+
357
+ For each study we generate leave-one-out permutations using
358
+ :func:`held_out_ind_json`. The provided ``context_strategy`` and
359
+ ``target_strategy`` are then used to apply the same empirical splitting
360
+ between observed and remaining measurements as performed in
361
+ :class:`AICMECompartmentsDataset`. Each permutation across all studies is
362
+ stacked along the batch dimension ``B``.
363
+
364
+ Parameters
365
+ ----------
366
+ studies:
367
+ List of empirical studies. Each study is expected to contain only a
368
+ context block; target individuals are produced via leave-one-out
369
+ permutations.
370
+ context_strategy / target_strategy:
371
+ Observation strategies matching those used by
372
+ :class:`AICMECompartmentsDataset` for shaping context and target
373
+ data respectively.
374
+ meta_dosing:
375
+ Global dosing configuration.
376
+ return_studies:
377
+ If True (default), return the intermediate permuted study dicts
378
+ instead of building full ``AICMECompartmentsDataBatch`` objects.
379
+ Useful for debugging.
380
+
381
+ Returns
382
+ -------
383
+ List[Union[AICMECompartmentsDataBatch, List[StudyJSON]]]
384
+ If `return_studies` is True → list of permuted study dicts.
385
+ If `return_studies` is False → list of ``AICMECompartmentsDataBatch``.
386
+ """
387
+ canon_studies = [canonicalize_study(s, drop_tgt_too_few=False) for s in studies]
388
+ max_perm = max(len(s["context"]) for s in canon_studies)
389
+ per_study_perms = [held_out_ind_json(s, max_perm) for s in canon_studies]
390
+
391
+ batches = []
392
+ for perm_idx in range(max_perm):
393
+ permuted_studies = [
394
+ self._process_one_study_perm(
395
+ study_perms[perm_idx], context_strategy, target_strategy
396
+ )
397
+ for study_perms in per_study_perms
398
+ ]
399
+ if return_studies:
400
+ batches.append(permuted_studies) # debugging: raw dicts
401
+ else:
402
+ batches.append(self.build_one_aicmebatch(permuted_studies, meta_dosing))
403
+ return batches
404
+
405
+ def build_one_aicmebatch_as_dataset_no_heldout(
406
+ self,
407
+ studies: List[StudyJSON],
408
+ context_strategy,
409
+ target_strategy,
410
+ meta_dosing: MetaDosingConfig,
411
+ *,
412
+ return_studies: bool = False,
413
+ ) -> List[Union[AICMECompartmentsDataBatch, List[StudyJSON]]]:
414
+ """Create a single empirical batch without leave-one-out targets.
415
+
416
+ This method mirrors :meth:`build_one_aicmebatch_as_dataset` preprocessing
417
+ but does not move any individual from context to target. All individuals
418
+ remain in context and the returned list has a single element.
419
+
420
+ Parameters
421
+ ----------
422
+ studies:
423
+ List of empirical studies.
424
+ context_strategy / target_strategy:
425
+ Observation strategies matching those used by
426
+ :class:`AICMECompartmentsDataset`.
427
+ meta_dosing:
428
+ Global dosing configuration.
429
+ return_studies:
430
+ If ``True``, return the processed ``StudyJSON`` records instead of a
431
+ fully built :class:`AICMECompartmentsDataBatch`.
432
+
433
+ Returns
434
+ -------
435
+ List[Union[AICMECompartmentsDataBatch, List[StudyJSON]]]
436
+ A list with length one containing either processed studies or one
437
+ ``AICMECompartmentsDataBatch``.
438
+ """
439
+ canon_studies = [canonicalize_study(s, drop_tgt_too_few=False) for s in studies]
440
+ context_only_studies: List[StudyJSON] = []
441
+ for study in canon_studies:
442
+ all_inds = list(study.get("context", [])) + list(study.get("target", []))
443
+ context_only_studies.append(
444
+ {
445
+ "context": all_inds,
446
+ "target": [],
447
+ "meta_data": dict(study.get("meta_data", {})),
448
+ }
449
+ )
450
+ processed_studies = [
451
+ self._process_one_study_perm(study, context_strategy, target_strategy)
452
+ for study in context_only_studies
453
+ ]
454
+
455
+ if return_studies:
456
+ return [processed_studies]
457
+ return [self.build_one_aicmebatch(processed_studies, meta_dosing)]
458
+
459
+ def _process_one_study_perm(
460
+ self,
461
+ study: StudyJSON,
462
+ context_strategy,
463
+ target_strategy,
464
+ ) -> StudyJSON:
465
+ """Turn one permuted study into tensors and apply strategies."""
466
+ processed = {"context": [], "target": [], "meta_data": study["meta_data"]}
467
+
468
+ for block, inds, strat in (
469
+ ("context", study["context"], context_strategy),
470
+ ("target", study["target"], target_strategy),
471
+ ):
472
+ processed[block] = self._process_block(inds, strat)
473
+
474
+ return processed
475
+
476
+ def _process_block(
477
+ self,
478
+ inds: List[IndividualJSON],
479
+ strat,
480
+ ) -> List[IndividualJSON]:
481
+ """Convert a list of individuals into padded tensors, then apply strategy."""
482
+ if not inds:
483
+ return []
484
+
485
+ obs, times, mask = self._pack_individuals(inds)
486
+ obs_o, time_o, mask_o, rem_o, rem_t, rem_m = strat.generate_empirical(obs, times, mask)
487
+
488
+ return self._rebuild_individuals(inds, obs_o, time_o, mask_o, rem_o, rem_t, rem_m)
489
+
490
+ def _pack_individuals(
491
+ self,
492
+ inds: List[IndividualJSON],
493
+ ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
494
+ """Pad individuals into (obs, times, mask)."""
495
+ I = len(inds)
496
+ ET = max(len(ind["observations"]) for ind in inds)
497
+ obs = torch.full((I, ET), self.cfg.pad_value_obs)
498
+ times = torch.full((I, ET), self.cfg.pad_value_time)
499
+ mask = torch.zeros((I, ET), dtype=torch.bool)
500
+
501
+ for i, ind in enumerate(inds):
502
+ o = torch.tensor(ind["observations"], dtype=torch.float32)
503
+ t = torch.tensor(ind["observation_times"], dtype=torch.float32)
504
+ L = o.shape[0]
505
+ obs[i, :L], times[i, :L], mask[i, :L] = o, t, True
506
+ return obs, times, mask
507
+
508
+ def _rebuild_individuals(
509
+ self,
510
+ inds: List[IndividualJSON],
511
+ obs_o: torch.Tensor,
512
+ time_o: torch.Tensor,
513
+ mask_o: torch.Tensor,
514
+ rem_o: Optional[torch.Tensor],
515
+ rem_t: Optional[torch.Tensor],
516
+ rem_m: Optional[torch.Tensor],
517
+ ) -> List[IndividualJSON]:
518
+ """Convert tensors back to JSON-like dicts for each individual."""
519
+ block_inds = []
520
+ for i in range(obs_o.shape[0]):
521
+ ind_dict: IndividualJSON = {
522
+ "observations": obs_o[i][mask_o[i]].tolist(),
523
+ "observation_times": time_o[i][mask_o[i]].tolist(),
524
+ }
525
+ name_id = inds[i].get("name_id") if i < len(inds) else None
526
+ if name_id:
527
+ ind_dict["name_id"] = name_id
528
+ if rem_o is not None and rem_m is not None:
529
+ ind_dict["remaining"] = rem_o[i][rem_m[i]].tolist()
530
+ ind_dict["remaining_times"] = rem_t[i][rem_m[i]].tolist()
531
+ block_inds.append(ind_dict)
532
+ return block_inds
533
+
534
+
535
def databatch_to_study_jsons(
    batch: AICMECompartmentsDataBatch,
    meta_dosing: MetaDosingConfig,
) -> list[StudyJSON]:
    """Convert an ``AICMECompartmentsDataBatch`` back to ``StudyJSON`` records.

    Parameters
    ----------
    batch:
        Batch carrying tensors with a leading study dimension ``B``.
    meta_dosing:
        Dosing configuration used to decode route type indices.

    Returns
    -------
    List[StudyJSON]
        One study per element along ``B``. Missing ``study_name`` or
        ``substance_name`` entries are replaced by the fallback
        placeholders ``study_{b}`` and ``substance_{b}``.
    """
    route_options = meta_dosing.route_options

    def _block(
        b: int,
        obs,
        time,
        mask,
        rem,
        rem_time,
        rem_mask,
        doses,
        routes,
        ind_mask,
        names,
    ) -> list[IndividualJSON]:
        """Decode one context/target block of study ``b`` to dicts."""
        individuals: list[IndividualJSON] = []
        name_list = names[b] if b < len(names) else []
        for i in range(obs.shape[1]):
            if not ind_mask[b, i]:
                continue  # padded slot, no real individual here
            ind: IndividualJSON = {}
            if i < len(name_list) and name_list[i]:
                ind["name_id"] = name_list[i]
            valid = mask[b, i]  # [T]
            ind["observations"] = obs[b, i, :, 0][valid].tolist()
            ind["observation_times"] = time[b, i, :, 0][valid].tolist()
            valid_rem = rem_mask[b, i]  # [R]
            rem_vals = rem[b, i, :, 0][valid_rem].tolist()
            if rem_vals:
                ind["remaining"] = rem_vals
                ind["remaining_times"] = rem_time[b, i, :, 0][valid_rem].tolist()
            dose = float(doses[b, i].item())
            route_idx = int(routes[b, i].item())
            if dose or route_idx:
                # Indices outside the configured vocabulary are kept as
                # their string representation.
                route = (
                    route_options[route_idx]
                    if route_idx < len(route_options)
                    else str(route_idx)
                )
                ind["dosing"] = [dose]
                ind["dosing_type"] = [route]
                ind["dosing_times"] = [meta_dosing.time]
                ind["dosing_name"] = [route]
            individuals.append(ind)
        return individuals

    studies: list[StudyJSON] = []
    for b in range(batch.context_obs.shape[0]):
        meta = {
            "study_name": (
                batch.study_name[b]
                if b < len(batch.study_name) and batch.study_name[b]
                else f"study_{b}"
            ),
            "substance_name": (
                batch.substance_name[b]
                if b < len(batch.substance_name) and batch.substance_name[b]
                else f"substance_{b}"
            ),
        }
        ctx = _block(
            b,
            batch.context_obs,
            batch.context_obs_time,
            batch.context_obs_mask,
            batch.context_rem_sim,
            batch.context_rem_sim_time,
            batch.context_rem_sim_mask,
            batch.context_dosing_amounts,
            batch.context_dosing_route_types,
            batch.mask_context_individuals,
            batch.context_subject_name,
        )
        tgt = _block(
            b,
            batch.target_obs,
            batch.target_obs_time,
            batch.target_obs_mask,
            batch.target_rem_sim,
            batch.target_rem_sim_time,
            batch.target_rem_sim_mask,
            batch.target_dosing_amounts,
            batch.target_dosing_route_types,
            batch.mask_target_individuals,
            batch.target_subject_name,
        )
        studies.append({"context": ctx, "target": tgt, "meta_data": meta})
    return studies
643
+
644
+
645
def prediction_to_study_jsons(
    prediction_sample: TT["S", "B", "It", "Tr", 1],
    prediction_time: TT["S", "B", "It", "Tr", 1],
    batch: AICMECompartmentsDataBatch,
    meta_dosing: MetaDosingConfig,
) -> list[StudyJSON]:
    """Attach prediction samples to study records.

    Parameters
    ----------
    prediction_sample:
        Predicted trajectories with a leading sample dimension ``S``.
    prediction_time:
        Time points corresponding to ``prediction_sample``.
    batch:
        Original :class:`AICMECompartmentsDataBatch` used to generate the
        predictions.
    meta_dosing:
        Dosing configuration for route decoding.

    Returns
    -------
    list[StudyJSON]
        Studies with ``prediction_samples`` and ``prediction_times`` fields
        in each predicted target individual.

    Notes
    -----
    Some predictive samplers (for example FlowPK individual prediction) may
    return predictions for only a subset of target individuals compared
    with the original batch. In that case only the first ``It`` target
    entries are kept (``It`` inferred from ``prediction_sample``) so JSON
    plots and exported records stay aligned with the predicted tensors.
    """
    studies = databatch_to_study_jsons(batch, meta_dosing)
    _, n_studies, n_targets, _, _ = prediction_sample.shape  # [S, B, It, Tr, 1]
    for b in range(n_studies):
        # Keep studies aligned with the number of predicted targets.
        targets = studies[b]["target"][:n_targets]
        studies[b]["target"] = targets
        for i in range(min(n_targets, len(targets))):
            targets[i]["prediction_samples"] = prediction_sample[:, b, i, :, 0].tolist()  # [S, Tr]
            targets[i]["prediction_times"] = prediction_time[0, b, i, :, 0].tolist()  # [Tr]
    return studies
691
+
692
+
693
def simulation_obs_to_study_json(
    obs_out: torch.Tensor,
    time_out: torch.Tensor,
    mask_out: torch.Tensor,
    rem_sim: Optional[torch.Tensor],
    rem_time: Optional[torch.Tensor],
    rem_mask: Optional[torch.Tensor],
    dosing_config_array: list,
    dosing_amounts: torch.Tensor,
    study_config,
    idx: int,
) -> StudyJSON:
    """Convert processed simulation tensors into a :class:`StudyJSON` entry.

    Parameters
    ----------
    obs_out, time_out, mask_out:
        Observed concentrations and time points for the simulated
        individuals; ``mask_out`` identifies valid entries in the padded
        tensors.
    rem_sim, rem_time, rem_mask:
        Optional remaining (unobserved) trajectory tensors. When provided
        they must share the leading dimensions of ``obs_out`` with
        ``rem_mask`` marking valid entries.
    dosing_config_array:
        Sequence of per-individual dosing configuration objects.
    dosing_amounts:
        Tensor with the dosing amount per individual.
    study_config:
        Study configuration; only the ``drug_id`` attribute is read, if
        present.
    idx:
        Index used to label the generated study name.

    Returns
    -------
    StudyJSON
        JSON-compatible dictionary describing the context block of the
        simulation (``target`` is left empty).
    """
    has_remaining = rem_sim is not None and rem_time is not None and rem_mask is not None
    context: list[IndividualJSON] = []

    for ind_idx in range(obs_out.shape[0]):
        valid = mask_out[ind_idx].to(torch.bool)
        individual: IndividualJSON = {
            "name_id": f"context_{ind_idx}",
            "observations": obs_out[ind_idx][valid].tolist(),
            "observation_times": time_out[ind_idx][valid].tolist(),
        }

        if has_remaining:
            valid_rem = rem_mask[ind_idx].to(torch.bool)
            if valid_rem.any():
                individual["remaining"] = rem_sim[ind_idx][valid_rem].tolist()
                individual["remaining_times"] = rem_time[ind_idx][valid_rem].tolist()

        dosing_cfg = dosing_config_array[ind_idx]
        dose = float(dosing_amounts[ind_idx].item())
        route = getattr(dosing_cfg, "route", "")
        if dose or route:
            individual["dosing"] = [dose]
            individual["dosing_type"] = [route]
            individual["dosing_times"] = [float(getattr(dosing_cfg, "time", 0.0))]
            individual["dosing_name"] = [route]

        context.append(individual)

    return {
        "context": context,
        "target": [],
        "meta_data": {
            "study_name": f"simulated_study_{idx}",
            "substance_name": getattr(study_config, "drug_id", "simulated_substance"),
        },
    }
778
+
779
+
780
def held_out_ind_json(study: StudyJSON, max_held_out_individuals: int) -> List[StudyJSON]:
    """Create study permutations with one individual moved to target.

    Parameters
    ----------
    study:
        Study JSON containing only context individuals (``target`` must be
        empty).
    max_held_out_individuals:
        Maximum number of permutations to generate.

    Returns
    -------
    List[StudyJSON]
        ``max_held_out_individuals`` studies where each of the first
        ``len(context)`` entries moves one context individual into the
        target block. Remaining entries repeat the original study with an
        empty target.
    """
    context = list(study.get("context", []))
    meta = dict(study.get("meta_data", {}))
    n_perms = min(max_held_out_individuals, len(context))

    permutations: List[StudyJSON] = [
        {
            "context": context[:held_out] + context[held_out + 1 :],
            "target": [context[held_out]],
            "meta_data": meta,
        }
        for held_out in range(n_perms)
    ]
    # Pad with the unchanged study (empty target) up to the requested count.
    filler = {"context": context, "target": [], "meta_data": meta}
    permutations.extend(filler for _ in range(max_held_out_individuals - len(permutations)))
    return permutations
811
+
812
+
813
def held_out_list_json(
    builder: JSON2AICMEBuilder,
    studies: List[StudyJSON],
    meta_dosing: MetaDosingConfig,
    max_held_out_individuals: int,
) -> List[AICMECompartmentsDataBatch]:
    """Generate batches for leave-one-out permutations across studies.

    Parameters
    ----------
    builder:
        Instance used to convert studies to
        :class:`AICMECompartmentsDataBatch`.
    studies:
        Studies where only the context block is populated.
    meta_dosing:
        Global dosing configuration.
    max_held_out_individuals:
        Maximum number of held-out permutations per study.

    Returns
    -------
    List[AICMECompartmentsDataBatch]
        ``max_held_out_individuals`` batches; the ``i``-th batch stacks the
        ``i``-th permutation of every study along the batch dimension.
    """
    per_study = [held_out_ind_json(s, max_held_out_individuals) for s in studies]
    batches: List[AICMECompartmentsDataBatch] = []
    for perm_idx in range(max_held_out_individuals):
        perm_slice = [perms[perm_idx] for perms in per_study]
        batches.append(builder.build_one_aicmebatch(perm_slice, meta_dosing))
    return batches
846
+
847
+
848
def load_empirical_json_batches(
    json_path: Path,
    meta_dosing: Optional[MetaDosingConfig] = None,
    stats: Optional[EmpiricalJSONStats] = None,
    datamodule: Optional[AICMECompartmentsDataModule] = None,
) -> List[AICMECompartmentsDataBatch]:
    """Load an empirical study JSON file and build leave-one-out batches.

    All individuals are placed in the context block before the
    leave-one-out permutations are generated.

    Parameters
    ----------
    json_path:
        Path to a JSON file containing a list of :class:`StudyJSON` records.
    meta_dosing:
        Global dosing configuration. If ``None`` a default
        :class:`MetaDosingConfig` is used.
    stats:
        Pre-computed statistics describing the dataset. When ``None`` the
        statistics are calculated from ``json_path`` via
        :func:`compute_json_stats`.
    datamodule:
        Optional synthetic data module providing shape information via
        :meth:`AICMECompartmentsDataModule.obtain_shapes`. When given,
        these shapes override those inferred from ``stats``.

    Returns
    -------
    List[AICMECompartmentsDataBatch]
        Leave-one-out batches constructed from the studies in ``json_path``.

    Raises
    ------
    ValueError
        If the JSON payload is not a list or contains no studies.

    Notes
    -----
    The statistics determine the number of leave-one-out permutations,
    while the padding shapes ``(max_individuals, max_observations,
    max_remaining)`` come from ``datamodule.obtain_shapes()`` when a
    datamodule is supplied.
    """
    # The file is expected to contain a list of StudyJSON records.
    with json_path.open() as f:
        raw_studies = json.load(f)

    if not isinstance(raw_studies, list):
        raise ValueError("Expected JSON file to contain a list of StudyJSON records")

    # Ensure data quality.
    canon_studies: List[StudyJSON] = [
        canonicalize_study(s, drop_tgt_too_few=False) for s in raw_studies
    ]
    if not canon_studies:
        raise ValueError("No studies found in JSON file")

    # Move every individual into the context block.
    studies: List[StudyJSON] = [
        {
            "context": list(s.get("context", [])) + list(s.get("target", [])),
            "target": [],
            "meta_data": s.get("meta_data", {}),
        }
        for s in canon_studies
    ]

    # BUG FIX: ``stats`` is needed on BOTH branches below (it sets the
    # number of leave-one-out permutations at the end). The original only
    # computed it on the non-datamodule path, so passing a datamodule
    # without ``stats`` crashed with AttributeError on ``stats.max_...``.
    # A caller-provided ``stats`` is now also honored instead of being
    # silently recomputed.
    if stats is None:
        stats = compute_json_stats(canon_studies)

    # Define padding shapes.
    if datamodule is not None:
        max_inds, max_obs, max_rem = datamodule.obtain_shapes()  # (I, T, R)
        ctx_cap = getattr(datamodule.train_dataset, "max_context_individuals", max_inds)
        tgt_cap = getattr(datamodule.train_dataset, "n_of_target_individuals", max_inds)
    else:
        max_inds = stats.max_total_individuals
        max_obs = stats.max_observations
        max_rem = stats.max_remaining
        ctx_cap = max_inds
        tgt_cap = max_inds

    # The maximum batch size is chosen so all empirical studies fit at once.
    cfg = EmpiricalBatchConfig(
        max_databatch_size=len(studies),
        max_individuals=max_inds,
        max_observations=max_obs,
        max_remaining=max_rem,
        max_context_individuals=ctx_cap,
        max_target_individuals=tgt_cap,
    )
    builder = JSON2AICMEBuilder(cfg)
    meta = meta_dosing or MetaDosingConfig()

    return held_out_list_json(
        builder, studies, meta, max_held_out_individuals=stats.max_total_individuals
    )
942
+
943
+
944
def load_empirical_json_batches_as_dm(
    json_path: Optional[Path] = None,
    meta_dosing: Optional[MetaDosingConfig] = None,
    stats: Optional[EmpiricalJSONStats] = None,
    datamodule: Optional[AICMECompartmentsDataModule] = None,
    raw_studies: Optional[List[StudyJSON]] = None,
    *,
    held_out: bool = True,
) -> List[AICMECompartmentsDataBatch]:
    """Load empirical studies and build batches using datamodule strategies.

    Mirrors the empirical preprocessing performed by
    :class:`AICMECompartmentsDataset` by relying on the observation
    strategies of the provided :class:`AICMECompartmentsDataModule` and
    using :meth:`JSON2AICMEBuilder.build_one_aicmebatch_as_dataset`.

    Parameters
    ----------
    json_path:
        Path to a JSON file containing a list of :class:`StudyJSON`
        records. Ignored when ``raw_studies`` is supplied.
    meta_dosing:
        Global dosing configuration. If ``None`` a default
        :class:`MetaDosingConfig` is used.
    stats:
        Pre-computed statistics describing the dataset. When ``None`` they
        are calculated via :func:`compute_json_stats`.
    datamodule:
        Synthetic data module providing observation strategies and shape
        information via :meth:`AICMECompartmentsDataModule.obtain_shapes`.
        The module must be provided; its shapes override those inferred
        from ``stats``.
    raw_studies:
        Already-loaded ``StudyJSON`` records; takes precedence over
        ``json_path``.
    held_out:
        If ``True`` (default), build leave-one-out permutations (one
        empirical individual in target). If ``False``, keep all empirical
        individuals in context and return a single batch.

    Returns
    -------
    List[AICMECompartmentsDataBatch]
        Batches constructed from the studies using the datamodule's
        strategies.

    Raises
    ------
    ValueError
        If ``datamodule`` is missing, neither ``json_path`` nor
        ``raw_studies`` is given, the JSON payload is not a list, no
        studies are found, or the datamodule lacks strategies.
    """
    if datamodule is None:
        raise ValueError("datamodule must be provided to supply observation strategies")

    if raw_studies is None:
        # BUG FIX: fail with a clear message instead of an AttributeError
        # (``'NoneType' object has no attribute 'open'``) when neither a
        # path nor pre-loaded studies are supplied.
        if json_path is None:
            raise ValueError("either json_path or raw_studies must be provided")
        with json_path.open() as f:
            raw_studies = json.load(f)

    if not isinstance(raw_studies, list):
        raise ValueError("Expected JSON file to contain a list of StudyJSON records")

    canon_studies: List[StudyJSON] = [
        canonicalize_study(s, drop_tgt_too_few=False) for s in raw_studies
    ]

    if stats is None:
        stats = compute_json_stats(canon_studies)

    # Move every individual into the context block.
    studies: List[StudyJSON] = []
    for study in canon_studies:
        all_inds = list(study.get("context", [])) + list(study.get("target", []))
        studies.append({"context": all_inds, "target": [], "meta_data": study.get("meta_data", {})})

    if not studies:
        raise ValueError("No studies found in JSON file")

    max_inds, max_obs, max_rem = datamodule.obtain_shapes()  # (I, T, R)
    ctx_cap = getattr(datamodule.train_dataset, "max_context_individuals", max_inds)
    tgt_cap = getattr(datamodule.train_dataset, "n_of_target_individuals", max_inds)
    context_strategy = getattr(datamodule, "context_strategy", None)
    # For empirical targets we prefer the dedicated datamodule override
    # (legacy PK behavior + fixed capacities), falling back to target_strategy.
    target_strategy = getattr(datamodule, "empirical_target_strategy", None)
    if target_strategy is None:
        target_strategy = getattr(datamodule, "target_strategy", None)
    if context_strategy is None or target_strategy is None:
        raise ValueError("datamodule is missing context or target strategies")

    ctx_obs_cap, ctx_rem_cap = context_strategy.get_shapes()
    tgt_obs_cap, tgt_rem_cap = target_strategy.get_shapes()

    cfg = EmpiricalBatchConfig(
        max_databatch_size=len(studies),
        max_individuals=max_inds,
        max_observations=max_obs,
        max_remaining=max_rem,
        max_context_individuals=ctx_cap,
        max_target_individuals=tgt_cap,
        max_context_observations=ctx_obs_cap,
        max_target_observations=tgt_obs_cap,
        max_context_remaining=ctx_rem_cap,
        max_target_remaining=tgt_rem_cap,
    )
    builder = JSON2AICMEBuilder(cfg)
    meta = meta_dosing or MetaDosingConfig()

    if held_out:
        return builder.build_one_aicmebatch_as_dataset(
            studies, context_strategy, target_strategy, meta
        )
    return builder.build_one_aicmebatch_as_dataset_no_heldout(
        studies, context_strategy, target_strategy, meta
    )
1050
+
1051
+
1052
def load_empirical_hf_batches_as_dm(
    repo_id: str,
    split: str = "train",
    meta_dosing: Optional[MetaDosingConfig] = None,
    stats: Optional[EmpiricalJSONStats] = None,
    datamodule: Optional[AICMECompartmentsDataModule] = None,
    *,
    held_out: bool = True,
) -> List[AICMECompartmentsDataBatch]:
    """Load a StudyJSON dataset from Hugging Face Hub and build batches.

    Each dataset row is treated as one ``StudyJSON`` record. All individuals
    (context + target) of every study are merged into the context, and the
    datamodule's padding capacities and observation strategies are used to
    assemble ``AICMECompartmentsDataBatch`` objects.

    Parameters
    ----------
    repo_id:
        Hugging Face dataset id.
    split:
        Dataset split to load.
    meta_dosing:
        Dosing configuration; a default ``MetaDosingConfig`` is used when
        ``None``.
    stats:
        Optional precomputed dataset statistics.
    datamodule:
        Datamodule providing empirical shape and strategy information.
        Required; a ``ValueError`` is raised when missing.
    held_out:
        If ``True`` (default), build leave-one-out permutations. If ``False``,
        keep all empirical individuals in context and return a single batch.

    Returns
    -------
    List[AICMECompartmentsDataBatch]
        Batches built by ``JSON2AICMEBuilder`` from the loaded studies.

    Raises
    ------
    ValueError
        If ``datamodule`` is missing, the dataset is empty, or the datamodule
        lacks context/target strategies.
    """

    if datamodule is None:
        raise ValueError("datamodule must be provided to supply observation strategies")

    # Load from HF Hub; rows are dict-like mappings conforming to StudyJSON.
    ds = load_dataset(repo_id, split=split)
    raw_studies = [dict(study) for study in ds]  # Hugging Face rows are dict-like

    # Canonicalize without dropping sparse targets — empirical targets are
    # kept regardless of observation count.
    canon_studies: List[StudyJSON] = [
        canonicalize_study(s, drop_tgt_too_few=False) for s in raw_studies
    ]

    # NOTE(review): ``stats`` is computed here when not supplied but never
    # used afterwards — presumably kept for parity with the JSON-file loader;
    # confirm whether it can be removed or should be returned to the caller.
    if stats is None:
        stats = compute_json_stats(canon_studies)

    # Fold every individual (context + target) into the context so the
    # builder can re-split them according to the strategies below.
    studies: List[StudyJSON] = []
    for study in canon_studies:
        all_inds = list(study.get("context", [])) + list(study.get("target", []))
        studies.append({"context": all_inds, "target": [], "meta_data": study.get("meta_data", {})})

    if not studies:
        raise ValueError("No studies found in HF dataset")

    # Padding capacities: (individuals, observations, remaining) = (I, T, R).
    max_inds, max_obs, max_rem = datamodule.obtain_shapes()
    ctx_cap = getattr(datamodule.train_dataset, "max_context_individuals", max_inds)
    tgt_cap = getattr(datamodule.train_dataset, "n_of_target_individuals", max_inds)
    context_strategy = getattr(datamodule, "context_strategy", None)
    # For empirical targets we prefer the dedicated datamodule override
    # (legacy PK behavior + fixed capacities), falling back to target_strategy.
    target_strategy = getattr(datamodule, "empirical_target_strategy", None)
    if target_strategy is None:
        target_strategy = getattr(datamodule, "target_strategy", None)
    if context_strategy is None or target_strategy is None:
        raise ValueError("datamodule is missing context or target strategies")

    # Per-side observation/remaining capacities come from the strategies.
    ctx_obs_cap, ctx_rem_cap = context_strategy.get_shapes()
    tgt_obs_cap, tgt_rem_cap = target_strategy.get_shapes()

    cfg = EmpiricalBatchConfig(
        max_databatch_size=len(studies),
        max_individuals=max_inds,
        max_observations=max_obs,
        max_remaining=max_rem,
        max_context_individuals=ctx_cap,
        max_target_individuals=tgt_cap,
        max_context_observations=ctx_obs_cap,
        max_target_observations=tgt_obs_cap,
        max_context_remaining=ctx_rem_cap,
        max_target_remaining=tgt_rem_cap,
    )
    builder = JSON2AICMEBuilder(cfg)
    meta = meta_dosing or MetaDosingConfig()

    # ``held_out`` toggles between leave-one-out permutations and a single
    # all-in-context batch.
    if held_out:
        return builder.build_one_aicmebatch_as_dataset(
            studies, context_strategy, target_strategy, meta
        )
    return builder.build_one_aicmebatch_as_dataset_no_heldout(
        studies, context_strategy, target_strategy, meta
    )
sim_priors_pk/data/data_empirical/json_schema.py ADDED
@@ -0,0 +1,372 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """TypedDict schemas for empirical pharmacokinetic JSON inputs."""
2
+
3
+ from typing import TYPE_CHECKING, Dict, List, Optional, Sequence, TypedDict
4
+
5
+ try: # pragma: no cover - optional torch dependency
6
+ import torch
7
+ from torchtyping import TensorType as TT
8
+ except ModuleNotFoundError: # pragma: no cover - allow missing torch
9
+ torch = None # type: ignore
10
+ TT = object # type: ignore
11
+
12
+ if TYPE_CHECKING: # pragma: no cover - typing only
13
+ from sim_priors_pk.data.datasets.aicme_batch import AICMECompartmentsDataBatch
14
+
15
+
16
class IndividualJSON(TypedDict, total=False):
    """Schema for a single individual's PK data.

    All keys are optional (``total=False``); helpers in this module validate
    the combinations they require at runtime.

    Optional ``prediction_samples`` and ``prediction_times`` fields allow
    storing model forecasts for the individual's future trajectory.
    Each element in ``prediction_samples`` corresponds to a full simulated
    trajectory for the times listed in ``prediction_times``.
    ``prediction_mean`` and ``prediction_std`` hold per-time-point summaries
    of those samples, as produced by :func:`prediction_stats` and preserved
    by :func:`canonicalize_individual`.
    """

    name_id: str  # subject identifier
    observations: List[float]  # measured concentrations
    observation_times: List[float]  # times of the measurements above
    remaining: List[float]  # held-out values, disjoint from observation_times
    remaining_times: List[float]  # times of the held-out values
    dosing: List[float]  # dose amounts (all dosing_* lists share one length)
    dosing_type: List[str]  # administration route per dose
    dosing_times: List[float]  # time of each dosing event
    dosing_name: List[str]  # label of each dosing event
    prediction_samples: List[List[float]]  # sampled trajectories (S x T)
    prediction_times: List[float]  # decode times for the samples
    prediction_mean: List[float]  # per-time-point mean over samples
    prediction_std: List[float]  # per-time-point std (population) over samples
    covariates: Dict[str, object]  # free-form subject covariates
37
+
38
+
39
class StudyJSON(TypedDict):
    """Schema for a full study consisting of context and target individuals.

    All three keys are required. ``meta_data`` must carry non-empty
    ``study_name`` and ``substance_name`` entries to pass
    :func:`canonicalize_study` validation.
    """

    context: List[IndividualJSON]  # conditioning individuals
    target: List[IndividualJSON]  # individuals to predict
    meta_data: Dict[str, str]  # study-level metadata (study_name, substance_name)
45
+
46
+
47
# Default minimum number of observations an individual must have to be kept
# by the canonicalization helpers (0 means nobody is dropped).
MIN_OBS_DEFAULT = 0
48
+
49
+
50
class ValidationError(Exception):
    """Error raised when input records violate the :class:`StudyJSON` schema."""
54
+
55
+
56
def canonicalize_individual(
    ind: IndividualJSON,
    *,
    min_obs: int = MIN_OBS_DEFAULT,
    drop_if_too_few: bool = True,
) -> Optional[IndividualJSON]:
    """Return a cleaned-up copy of ``ind`` without mutating the input.

    The canonical form has observations sorted by ascending time with
    duplicate time stamps removed (first occurrence wins), remaining points
    disjoint from observation times, and dosing fields validated as an
    all-or-nothing group of equal length.

    Parameters
    ----------
    ind:
        Individual record to canonicalize; never modified in place.
    min_obs:
        Minimum number of observations required to keep the individual.
        Defaults to :data:`MIN_OBS_DEFAULT`.
    drop_if_too_few:
        When ``True`` and fewer than ``min_obs`` observations survive
        de-duplication, ``None`` is returned instead of a record.

    Returns
    -------
    Optional[IndividualJSON]
        The canonicalized record, or ``None`` when the individual is dropped.

    Raises
    ------
    ValidationError
        If required observation fields are missing, paired lists differ in
        length, or dosing fields are only partially provided.
    """

    # --- observations & times ---
    if "observations" not in ind or "observation_times" not in ind:
        raise ValidationError("observations and observation_times are required")

    values = list(ind["observations"])
    stamps = list(ind["observation_times"])
    if len(values) != len(stamps):
        raise ValidationError("observations and observation_times must match in length")

    # Stable sort by time, then keep only the first value seen at each stamp.
    ordered = sorted(zip(stamps, values), key=lambda pair: pair[0])
    kept_values: List[float] = []
    kept_times: List[float] = []
    known = set()
    for stamp, value in ordered:
        if stamp not in known:
            known.add(stamp)
            kept_times.append(stamp)
            kept_values.append(value)

    if drop_if_too_few and len(kept_values) < min_obs:
        return None

    result: IndividualJSON = {}
    if "name_id" in ind:
        result["name_id"] = ind["name_id"]
    result["observations"] = kept_values
    result["observation_times"] = kept_times

    # --- remaining: must come as a pair, and entries colliding with
    # observation times are silently dropped ---
    if "remaining" in ind or "remaining_times" in ind:
        if not ("remaining" in ind and "remaining_times" in ind):
            raise ValidationError(
                "both remaining and remaining_times required when one is provided"
            )
        rem_values = list(ind["remaining"])
        rem_stamps = list(ind["remaining_times"])
        if len(rem_values) != len(rem_stamps):
            raise ValidationError("remaining and remaining_times must match in length")
        taken = set(kept_times)
        surviving = [(t, r) for t, r in zip(rem_stamps, rem_values) if t not in taken]
        result["remaining"] = [r for _, r in surviving]
        result["remaining_times"] = [t for t, _ in surviving]

    # --- dosing: all-or-nothing group with equal lengths ---
    dosing_fields = ("dosing", "dosing_type", "dosing_times", "dosing_name")
    available = [key for key in dosing_fields if key in ind]
    if available:
        if len(available) != len(dosing_fields):
            raise ValidationError("all dosing fields must be present when dosing is provided")
        if len({len(ind[key]) for key in dosing_fields}) != 1:  # type: ignore[index]
            raise ValidationError("dosing fields must have equal lengths")
        for key in dosing_fields:
            result[key] = list(ind[key])  # type: ignore[index]

    # --- covariates: shallow copy ---
    if "covariates" in ind:
        result["covariates"] = dict(ind["covariates"])

    # --- prediction fields: copied verbatim (values are lists) ---
    if "prediction_samples" in ind:
        result["prediction_samples"] = [list(row) for row in ind["prediction_samples"]]
    if "prediction_times" in ind:
        result["prediction_times"] = list(ind["prediction_times"])
    if "prediction_mean" in ind:
        result["prediction_mean"] = list(ind["prediction_mean"])
    if "prediction_std" in ind:
        result["prediction_std"] = list(ind["prediction_std"])

    return result
176
+
177
+
178
def canonicalize_study(
    study: StudyJSON,
    *,
    min_obs_ctx: int = MIN_OBS_DEFAULT,
    min_obs_tgt: int = MIN_OBS_DEFAULT,
    drop_tgt_too_few: bool = True,
) -> StudyJSON:
    """Canonicalize every individual in ``study`` and validate its meta data.

    Context individuals are never dropped; target individuals may be dropped
    when they carry fewer than ``min_obs_tgt`` observations and
    ``drop_tgt_too_few`` is set.

    Raises
    ------
    ValidationError
        If ``meta_data`` lacks a non-empty ``study_name`` or
        ``substance_name``.
    """

    meta = study.get("meta_data", {})
    if not (meta.get("study_name") and meta.get("substance_name")):
        raise ValidationError("meta_data must include non-empty study_name and substance_name")

    cleaned_context = [
        person
        for person in (
            canonicalize_individual(raw, min_obs=min_obs_ctx, drop_if_too_few=False)
            for raw in study.get("context", [])
        )
        if person is not None
    ]
    cleaned_target = [
        person
        for person in (
            canonicalize_individual(raw, min_obs=min_obs_tgt, drop_if_too_few=drop_tgt_too_few)
            for raw in study.get("target", [])
        )
        if person is not None
    ]

    return {
        "context": cleaned_context,
        "target": cleaned_target,
        "meta_data": dict(meta),
    }
209
+
210
+
211
def studies_from_sampled_targets(
    *,
    db: "AICMECompartmentsDataBatch",
    samples: "TT['S', 'B', 'T', 1]",
    times: "TT['B', 'T', 1]",
    mask: "TT['B', 'T']",
    route_options: Sequence[str],
    dosing_time: float,
    name_prefix: str = "new_individual",
) -> List[StudyJSON]:
    """Convert sampled trajectories into :class:`StudyJSON` records.

    Parameters
    ----------
    db:
        Batch containing the conditioning study information. Only the fields
        accessed in this function are required, allowing reuse with compatible
        NamedTuple implementations used throughout the project.
    samples, times, mask:
        Output tensors from ``sample_new_individual`` where ``samples`` carries
        the simulated trajectories, ``times`` their corresponding decode times
        and ``mask`` selects valid entries along the temporal dimension.
    route_options:
        Lookup table translating dosing route indices into human readable
        labels. Indices outside the provided range are returned as their string
        representation.
    dosing_time:
        Absolute time at which the dosing event occurred. Used for both context
        and newly sampled target individuals when dosing information is
        present.
    name_prefix:
        Prefix for generated target individual identifiers. Defaults to
        ``"new_individual"``.

    Returns
    -------
    list[StudyJSON]
        One ``StudyJSON`` per batch element in ``db``.
    """

    # Guard: ``torch`` is an optional module-level import in this file.
    if torch is None:
        raise ValidationError("torch is required to build StudyJSON records from tensors")

    # samples is indexed [s, b, t, 0] below, so it is rank 4: (S, B, T, 1).
    S, B, _, _ = samples.shape
    studies: List[StudyJSON] = []

    for b in range(B):
        # Fall back to synthetic names when the batch carries none.
        study_name = (
            db.study_name[b] if b < len(db.study_name) and db.study_name[b] else f"study_{b}"
        )
        substance_name = (
            db.substance_name[b]
            if b < len(db.substance_name) and db.substance_name[b]
            else f"substance_{b}"
        )

        # --- context individuals: copy observed data out of the batch ---
        context_list: List[IndividualJSON] = []
        # db.context_obs is indexed [b, i, :, 0] below, i.e. (B, I, T, 1).
        I = db.context_obs.shape[1]
        for i in range(I):
            # Skip padded (inactive) individual slots.
            if not db.mask_context_individuals[b, i]:
                continue

            ind: IndividualJSON = {}
            if b < len(db.context_subject_name) and i < len(db.context_subject_name[b]):
                name = db.context_subject_name[b][i]
                if name:
                    ind["name_id"] = name

            # Boolean mask selects the valid observation entries.
            obs_i = db.context_obs[b, i, :, 0]
            time_i = db.context_obs_time[b, i, :, 0]
            mask_i = db.context_obs_mask[b, i]
            ind["observations"] = obs_i[mask_i].tolist()
            ind["observation_times"] = time_i[mask_i].tolist()

            # Remaining (held-out) points exist only when the R axis is non-empty.
            if db.context_rem_sim.shape[2] > 0:
                rem_i = db.context_rem_sim[b, i, :, 0]
                rem_t = db.context_rem_sim_time[b, i, :, 0]
                rem_m = db.context_rem_sim_mask[b, i]
                rem_vals = rem_i[rem_m].tolist()
                rem_times = rem_t[rem_m].tolist()
                if rem_vals:
                    ind["remaining"] = rem_vals
                    ind["remaining_times"] = rem_times

            # A single dosing event per individual; dose == 0 with route 0 is
            # treated as "no dosing information".
            dose = float(db.context_dosing_amounts[b, i].item())
            route_idx = int(db.context_dosing_route_types[b, i].item())
            if dose or route_idx:
                route = (
                    route_options[route_idx] if route_idx < len(route_options) else str(route_idx)
                )
                ind["dosing"] = [dose]
                ind["dosing_type"] = [route]
                ind["dosing_times"] = [dosing_time]
                ind["dosing_name"] = [route]

            context_list.append(ind)

        # --- target individuals: one per sampled trajectory ---
        target_list: List[IndividualJSON] = []
        valid_mask = mask[b]
        valid_times = times[b, valid_mask, 0].tolist()
        for s in range(S):
            traj = samples[s, b, valid_mask, 0].tolist()
            ind: IndividualJSON = {
                "name_id": f"{name_prefix}_{s}",
                "observations": traj,
                "observation_times": valid_times,
            }
            # All samples share the batch's first target dosing slot.
            # NOTE(review): index 0 assumes a single target dosing entry per
            # batch element — confirm against the batch layout.
            dose = float(db.target_dosing_amounts[b, 0].item())
            route_idx = int(db.target_dosing_route_types[b, 0].item())
            if dose or route_idx:
                route = (
                    route_options[route_idx] if route_idx < len(route_options) else str(route_idx)
                )
                ind["dosing"] = [dose]
                ind["dosing_type"] = [route]
                ind["dosing_times"] = [dosing_time]
                ind["dosing_name"] = [route]
            target_list.append(ind)

        studies.append(
            {
                "context": context_list,
                "target": target_list,
                "meta_data": {
                    "study_name": study_name,
                    "substance_name": substance_name,
                },
            }
        )

    return studies
342
+
343
+
344
def prediction_stats(study: StudyJSON) -> StudyJSON:
    """Attach per-time-point prediction summaries to target individuals.

    Every target individual carrying a non-empty ``prediction_samples`` list
    gains ``prediction_mean`` and ``prediction_std`` entries, computed across
    the sample axis (population standard deviation, ``unbiased=False``).

    Parameters
    ----------
    study:
        ``StudyJSON`` record containing prediction samples. The mapping is
        modified in place and returned for convenience.

    Returns
    -------
    StudyJSON
        The same study object with summary fields added to its targets.

    Raises
    ------
    ValidationError
        If prediction samples are present but torch is unavailable.
    """

    for person in study.get("target", []):
        raw_samples = person.get("prediction_samples")
        if not raw_samples:
            continue
        if torch is None:
            raise ValidationError("torch is required to compute prediction summaries")
        stacked = torch.tensor(raw_samples)  # (num_samples, num_times)
        person["prediction_mean"] = stacked.mean(dim=0).tolist()
        person["prediction_std"] = stacked.std(dim=0, unbiased=False).tolist()
    return study
sim_priors_pk/data/data_empirical/json_stats.py ADDED
@@ -0,0 +1,201 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """This is only used for checking the shapes of the empirical data that are passed to the Dataloader"""
2
+
3
+ from __future__ import annotations
4
+
5
+ from collections import defaultdict
6
+ from dataclasses import dataclass
7
+ from typing import Dict, List, Sequence, Set
8
+
9
+ from .json_schema import StudyJSON
10
+
11
+
12
@dataclass
class EmpiricalJSONStats:
    """Aggregate statistics describing a collection of empirical studies.

    Instances are produced by :func:`compute_json_stats` and expose:
    the ranges of context/target cohort sizes across studies
    (``min/max_context_individuals``, ``min/max_target_individuals``);
    the extremal observed values (``min/max_observation``); the sorted list
    of distinct ``substances``; the largest combined cohort
    (``max_total_individuals``); the longest per-individual observation and
    remaining sequences (``max_observations``, ``max_remaining``);
    ``substance_summaries`` mapping each substance to its individual count,
    min/max observation counts, and the sorted unique time steps seen
    (observation plus remaining times); and ``studies_by_substance`` mapping
    each substance to the studies that reference it.
    """

    min_context_individuals: int
    max_context_individuals: int
    min_target_individuals: int
    max_target_individuals: int
    min_observation: float
    max_observation: float
    substances: List[str]
    max_total_individuals: int
    max_observations: int
    max_remaining: int
    substance_summaries: Dict[str, Dict[str, object]]
    studies_by_substance: Dict[str, List[StudyJSON]]

    def studies_for_substance(self, substance: str) -> List[StudyJSON]:
        """Return a copy of the study list recorded for ``substance``.

        Unknown substances yield an empty list.
        """

        matching = self.studies_by_substance.get(substance, [])
        return list(matching)

    def get_substance_summary(self, substance: str) -> Dict[str, object]:
        """Return a copy of the per-substance summary dictionary.

        The summary contains ``individual_count``, ``min_observations``,
        ``max_observations`` and ``observation_time_steps``; an empty dict is
        returned for unknown substances.
        """

        summary = self.substance_summaries.get(substance)
        return {} if summary is None else dict(summary)
94
+
95
+
96
def compute_json_stats(studies: Sequence[StudyJSON]) -> EmpiricalJSONStats:
    """Aggregate summary statistics over empirical pharmacokinetic studies.

    Parameters
    ----------
    studies:
        Sequence of :class:`StudyJSON` records to aggregate.

    Returns
    -------
    EmpiricalJSONStats
        Global cohort-size and value ranges, sequence-length maxima,
        per-substance summaries, and a substance-to-studies grouping.
        ``min_observation``/``max_observation`` are NaN when no observation
        values exist at all.
    """

    ctx_sizes: List[int] = []
    tgt_sizes: List[int] = []
    value_min = float("inf")
    value_max = float("-inf")
    longest_obs = 0
    longest_rem = 0
    biggest_study = 0

    per_substance_counts: Dict[str, int] = defaultdict(int)
    per_substance_obs_lengths: Dict[str, List[int]] = defaultdict(list)
    per_substance_times: Dict[str, Set[float]] = defaultdict(set)
    grouped: Dict[str, List[StudyJSON]] = defaultdict(list)
    names: Set[str] = set()

    for study in studies:
        context = study.get("context", [])
        target = study.get("target", [])
        ctx_sizes.append(len(context))
        tgt_sizes.append(len(target))
        biggest_study = max(biggest_study, len(context) + len(target))

        substance = study.get("meta_data", {}).get("substance_name")
        if substance:
            names.add(substance)
            grouped[substance].append(study)

        for person in context + target:
            obs = person.get("observations", [])
            longest_obs = max(longest_obs, len(obs))
            longest_rem = max(longest_rem, len(person.get("remaining", [])))
            if obs:
                value_min = min(value_min, min(obs))
                value_max = max(value_max, max(obs))
            if substance:
                per_substance_counts[substance] += 1
                per_substance_obs_lengths[substance].append(len(obs))
                # Unique time grid per substance mixes observation and
                # remaining time stamps.
                per_substance_times[substance].update(person.get("observation_times", []))
                per_substance_times[substance].update(person.get("remaining_times", []))

    summaries: Dict[str, Dict[str, object]] = {}
    for name in sorted(names):
        lengths = per_substance_obs_lengths.get(name, [])
        summaries[name] = {
            "individual_count": per_substance_counts.get(name, 0),
            "min_observations": min(lengths) if lengths else 0,
            "max_observations": max(lengths) if lengths else 0,
            "observation_time_steps": sorted(per_substance_times.get(name, set())),
        }

    return EmpiricalJSONStats(
        min_context_individuals=int(min(ctx_sizes, default=0)),
        max_context_individuals=int(max(ctx_sizes, default=0)),
        min_target_individuals=int(min(tgt_sizes, default=0)),
        max_target_individuals=int(max(tgt_sizes, default=0)),
        min_observation=float(value_min) if value_min != float("inf") else float("nan"),
        max_observation=float(value_max) if value_max != float("-inf") else float("nan"),
        substances=sorted(names),
        max_total_individuals=int(biggest_study),
        max_observations=int(longest_obs),
        max_remaining=int(longest_rem),
        substance_summaries=summaries,
        studies_by_substance={name: list(group) for name, group in grouped.items()},
    )
sim_priors_pk/data/data_empirical/simulx_to_json.py ADDED
@@ -0,0 +1,71 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Tools for converting simulx output .csv files (simulation from an NLME model) to study JSON format
3
+ """
4
+
5
+ import csv
6
+ from collections import defaultdict
7
+ from typing import Sequence
8
+
9
+ from sim_priors_pk.data.data_empirical.json_schema import StudyJSON
10
+
11
+
12
def simulx_to_json(
    csv_path,
    study_name="simulated_study",
    substance_name="Drug_A",
    dosing_type="oral"
) -> Sequence[StudyJSON]:
    """Convert a Simulx simulation CSV into :class:`StudyJSON` records.

    The CSV must contain ``rep``, ``ID``, ``TIME``, ``value`` and ``AMOUNT``
    columns, with ``"."`` marking missing entries. Each replicate (``rep``)
    becomes one study whose individuals are placed in the ``context`` list;
    the ``target`` list is present but empty (the ``StudyJSON`` schema
    declares it as required, and downstream canonicalization re-splits
    individuals as needed).

    Parameters
    ----------
    csv_path:
        Path to the Simulx output CSV file.
    study_name:
        Base name for the generated studies; ``_rep{n}`` is appended.
    substance_name:
        Substance recorded in each study's ``meta_data``.
    dosing_type:
        Route label stored in both ``dosing_type`` and ``dosing_name``.

    Returns
    -------
    Sequence[StudyJSON]
        One study per replicate, ordered by replicate index.
    """

    # rep -> ID -> accumulated individual record
    reps = defaultdict(lambda: defaultdict(lambda: {
        "observations": [],
        "observation_times": [],
        "dosing": [],
        "dosing_type": [],
        "dosing_times": [],
        "dosing_name": []
    }))

    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        for row in reader:
            rep = int(row["rep"])
            id_ = row["ID"]
            time = float(row["TIME"])

            # Observations ("." marks a missing value).
            if row["value"] != ".":
                reps[rep][id_]["observations"].append(float(row["value"]))
                reps[rep][id_]["observation_times"].append(time)

            # Dosing events are assumed to occur at TIME == 0.
            if time == 0 and row["AMOUNT"] != ".":
                reps[rep][id_]["dosing"].append(float(row["AMOUNT"]))
                reps[rep][id_]["dosing_times"].append(0.0)
                reps[rep][id_]["dosing_type"].append(dosing_type)
                reps[rep][id_]["dosing_name"].append(dosing_type)

    # Build final output: one StudyJSON per replicate, in replicate order.
    output = []
    for rep, ids in sorted(reps.items()):
        contexts = [
            {"name_id": f"context_{id_}", **data} for id_, data in ids.items()
        ]

        # "target" is required by the StudyJSON schema; it starts out empty.
        study: StudyJSON = {
            "context": contexts,
            "target": [],
            "meta_data": {
                "study_name": f"{study_name}_rep{rep}",
                "substance_name": substance_name
            }
        }
        output.append(study)

    return output
68
+
69
# Manual smoke run: convert the bundled indometacin Simulx export to StudyJSON.
if __name__ == "__main__":
    output = simulx_to_json(csv_path="data/raw_nlme_simulx/indometacin-test-data.csv")
71
+
sim_priors_pk/data/data_generation/__init__.py ADDED
File without changes
sim_priors_pk/data/data_generation/compartment_models.py ADDED
@@ -0,0 +1,721 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import random
2
+ from dataclasses import dataclass, field
3
+ from typing import Callable, List, Optional, Tuple
4
+
5
+ import numpy as np
6
+ import torch
7
+ from torchdiffeq import odeint
8
+ from torchtyping import TensorType
9
+
10
+ from sim_priors_pk.config_classes.data_config import (
11
+ DosingConfig,
12
+ DosingWithDurationConfig,
13
+ MetaDosingConfig,
14
+ MetaDosingWithDurationConfig,
15
+ MetaStudyConfig,
16
+ )
17
+ from sim_priors_pk.config_classes.node_pk_config import NodePKExperimentConfig
18
+
19
+
20
@dataclass
class StudyConfig:
    """Configuration of a single simulated PK study.

    Rate constants and the central volume are parameterized on the log scale
    (``log_*_mean``/``log_*_std``), with ``*_tmag``/``*_tscl`` controlling the
    magnitude and scale of their time-dependent variation. Peripheral-related
    fields are lists with one entry per peripheral compartment.
    """

    drug_id: str  # identifier of the simulated drug
    num_individuals: int  # population size of the study
    num_peripherals: int  # number of peripheral compartments
    # Absorption rate constant k_a (gut -> central)
    log_k_a_mean: float  # mean (log scale)
    log_k_a_std: float  # std (log scale)
    k_a_tmag: float  # magnitude of time-dependent variation
    k_a_tscl: float  # scale of time-dependent variation
    # Elimination rate constant k_e (central compartment)
    log_k_e_mean: float  # mean (log scale)
    log_k_e_std: float  # std (log scale)
    k_e_tmag: float  # magnitude of time-dependent variation
    k_e_tscl: float  # scale of time-dependent variation
    # Volume V of the central compartment
    log_V_mean: float  # mean (log scale)
    log_V_std: float  # std (log scale)
    V_tmag: float  # magnitude of time-dependent variation
    V_tscl: float  # scale of time-dependent variation
    # Transfer rates central -> peripheral, one entry per peripheral
    log_k_1p_mean: List[float]  # means (log scale)
    log_k_1p_std: List[float]  # stds (log scale)
    k_1p_tmag: List[float]  # magnitudes of time-dependent variation
    k_1p_tscl: List[float]  # scales of time-dependent variation
    # Transfer rates peripheral -> central, one entry per peripheral
    log_k_p1_mean: List[float]  # means (log scale)
    log_k_p1_std: List[float]  # stds (log scale)
    k_p1_tmag: List[float]  # magnitudes of time-dependent variation
    k_p1_tscl: List[float]  # scales of time-dependent variation
    time_start: float  # start time of the study window
    time_stop: float  # end time of the study window
    rel_ruv: float  # relative residual unexplained variability for the study
52
+
53
+
54
@dataclass
class IndividualConfig:
    """Configuration of a single simulated individual.

    Rate constants and the central volume are time-dependent and therefore
    stored as callables ``t -> value``; every default is constant in time.
    """

    num_peripherals: int = 2  # number of peripheral compartments
    k_a: Callable[[float], float] = lambda t: 0.1  # absorption rate (gut -> central)
    k_e: Callable[[float], float] = lambda t: 0.05  # elimination rate (central)
    V: Callable[[float], float] = lambda t: 0.05  # volume of the central compartment
    # Transfer rates central -> peripheral, one callable per peripheral.
    k_1p: List[Callable[[float], float]] = field(
        default_factory=lambda: [lambda t: 0.01, lambda t: 0.01]
    )
    # Transfer rates peripheral -> central, one callable per peripheral.
    k_p1: List[Callable[[float], float]] = field(
        default_factory=lambda: [lambda t: 0.01, lambda t: 0.01]
    )
    rel_ruv: float = 0.1  # relative residual unexplained variability
71
+
72
+
73
def sample_study_config(config: MetaStudyConfig):
    """Draw one concrete StudyConfig from the ranges in a MetaStudyConfig.

    Scalar hyper-parameters are sampled uniformly from their configured
    (low, high) ranges; per-peripheral parameters are sampled once per
    drawn peripheral compartment.
    """
    drug_id = random.choice(config.drug_id_options)
    num_individuals = random.randint(*config.num_individuals_range)
    num_peripherals = random.randint(*config.num_peripherals_range)

    def draw(range_attr):
        # One uniform sample from a (low, high) range stored on the meta config.
        return random.uniform(*getattr(config, range_attr))

    def draw_per_peripheral(range_attr):
        # One uniform sample per sampled peripheral compartment.
        return [draw(range_attr) for _ in range(num_peripherals)]

    # Keyword arguments are evaluated in order, so the random-draw sequence
    # matches the parameter listing below.
    return StudyConfig(
        drug_id=drug_id,
        num_individuals=num_individuals,
        num_peripherals=num_peripherals,
        log_k_a_mean=draw("log_k_a_mean_range"),
        log_k_a_std=draw("log_k_a_std_range"),
        k_a_tmag=draw("k_a_tmag_range"),
        k_a_tscl=draw("k_a_tscl_range"),
        log_k_e_mean=draw("log_k_e_mean_range"),
        log_k_e_std=draw("log_k_e_std_range"),
        k_e_tmag=draw("k_e_tmag_range"),
        k_e_tscl=draw("k_e_tscl_range"),
        log_V_mean=draw("log_V_mean_range"),
        log_V_std=draw("log_V_std_range"),
        V_tmag=draw("V_tmag_range"),
        V_tscl=draw("V_tscl_range"),
        log_k_1p_mean=draw_per_peripheral("log_k_1p_mean_range"),
        log_k_1p_std=draw_per_peripheral("log_k_1p_std_range"),
        k_1p_tmag=draw_per_peripheral("k_1p_tmag_range"),
        k_1p_tscl=draw_per_peripheral("k_1p_tscl_range"),
        log_k_p1_mean=draw_per_peripheral("log_k_p1_mean_range"),
        log_k_p1_std=draw_per_peripheral("log_k_p1_std_range"),
        k_p1_tmag=draw_per_peripheral("k_p1_tmag_range"),
        k_p1_tscl=draw_per_peripheral("k_p1_tscl_range"),
        time_start=config.time_start,
        time_stop=config.time_stop,
        rel_ruv=draw("rel_ruv_range"),
    )
138
+
139
+
140
def sample_rate_function(mean_rate, variability, variability_type="sinusoidal"):
    """Build a time-dependent rate function around ``mean_rate``.

    :param mean_rate: Baseline rate constant.
    :param variability: Amplitude (sinusoidal) or decay rate (decaying)
        of the time modulation.
    :param variability_type: Either ``"sinusoidal"`` or ``"decaying"``.
    :return: Callable mapping time ``t`` to the modulated rate.
    :raises ValueError: If ``variability_type`` is not recognised.

    NOTE(review): ``t`` is fed to ``torch.sin``/``torch.exp``, so callers
    are presumably expected to pass torch tensors — confirm with call sites.
    """
    if variability_type == "sinusoidal":
        return lambda t: mean_rate + variability * torch.sin(t)
    if variability_type == "decaying":
        return lambda t: mean_rate * torch.exp(-variability * t)
    raise ValueError(f"Unknown variability_type: {variability_type}")
159
+
160
+
161
def simulate_ou_process(
    mu: float, sigma: float, theta: float, dt: float, T: float, seed: Optional[int] = None
) -> np.ndarray:
    """Euler-Maruyama simulation of a mean-reverting Ornstein-Uhlenbeck process.

    Integrates dX = theta * (mu - X) dt + sigma dW on [0, T] with step dt,
    starting from the stationary distribution N(mu, sigma^2 / (2 * theta)).
    Returns an array of int(T / dt) samples.
    """
    if seed is not None:
        np.random.seed(seed)

    n_steps = int(T / dt)
    path = np.zeros(n_steps)

    # Initial value drawn from the stationary distribution of the process.
    path[0] = np.random.normal(mu, np.sqrt(sigma**2 / (2 * theta)))

    sqrt_dt = np.sqrt(dt)  # Brownian increment scale, constant over the loop
    for step in range(1, n_steps):
        increment = np.random.normal(0, sqrt_dt)
        path[step] = path[step - 1] + theta * (mu - path[step - 1]) * dt + sigma * increment

    return path
179
+
180
+
181
def sample_individual_configs(study_config: StudyConfig, n: Optional[int] = None):
    """
    Samples parameters for a population of individuals.

    Baseline parameters are drawn per individual from lognormal
    distributions; time-dependent variability is added by modulating
    k_a, k_e and V with exponentiated Ornstein-Uhlenbeck paths and the
    peripheral exchange rates with a sinusoidal factor.

    Parameters
    ----------
    study_config : StudyConfig
        Configuration object with parameter distributions.
    n : int, optional
        Number of individuals to sample. If None, defaults to
        study_config.num_individuals.

    Returns
    -------
    List[IndividualConfig]
        A list of sampled individual configurations.
    """
    num_individuals = n if n is not None else study_config.num_individuals
    individual_configs = []

    dt = 0.1  # OU discretization step
    ou_horizon = study_config.time_stop - study_config.time_start

    for _ in range(num_individuals):
        # Baseline (time-constant) parameters from lognormal distributions.
        k_a = np.random.lognormal(study_config.log_k_a_mean, study_config.log_k_a_std)
        k_e = np.random.lognormal(study_config.log_k_e_mean, study_config.log_k_e_std)
        V = np.random.lognormal(study_config.log_V_mean, study_config.log_V_std)
        k_1p = [
            np.random.lognormal(mean, std)
            for mean, std in zip(study_config.log_k_1p_mean, study_config.log_k_1p_std)
        ]
        k_p1 = [
            np.random.lognormal(mean, std)
            for mean, std in zip(study_config.log_k_p1_mean, study_config.log_k_p1_std)
        ]

        # Ornstein-Uhlenbeck processes (applied on the log scale) for
        # time-dependent variability of k_a, k_e and V.
        ou_k_a = k_a * np.exp(
            simulate_ou_process(
                0,
                study_config.k_a_tmag * np.sqrt(2 * study_config.k_a_tscl),
                study_config.k_a_tmag,
                dt,
                ou_horizon,
            )
        )
        ou_k_e = k_e * np.exp(
            simulate_ou_process(
                0,
                study_config.k_e_tmag * np.sqrt(2 * study_config.k_e_tscl),
                study_config.k_e_tmag,
                dt,
                ou_horizon,
            )
        )
        ou_V = V * np.exp(
            simulate_ou_process(
                0,
                study_config.V_tmag * np.sqrt(2 * study_config.V_tscl),
                study_config.V_tmag,
                dt,
                ou_horizon,
            )
        )

        # BUGFIX: build the interpolation grid from the OU sample length.
        # np.arange(time_start, time_stop, dt) can be one element longer
        # than the int(T / dt) samples simulate_ou_process returns whenever
        # T / dt is not an exact integer, which made np.interp raise on a
        # length mismatch between xp and fp.
        ou_times = study_config.time_start + dt * np.arange(len(ou_k_a))

        # Time-dependent rate functions: piecewise-linear interpolation of
        # the OU paths. Defaults bind each individual's own arrays so the
        # closures are not affected by later loop iterations.
        def k_a_fn(t, ou_k_a=ou_k_a, ou_times=ou_times):
            return np.interp(t, ou_times, ou_k_a)

        def k_e_fn(t, ou_k_e=ou_k_e, ou_times=ou_times):
            return np.interp(t, ou_times, ou_k_e)

        def V_fn(t, ou_V=ou_V, ou_times=ou_times):
            return np.interp(t, ou_times, ou_V)

        # Peripheral exchange rates (sinusoidal modulation as placeholder).
        k_1p_fn = [
            lambda t,
            k_1p_i=k_1p[i],
            tmag_i=study_config.k_1p_tmag[i],
            tscl_i=study_config.k_1p_tscl[i]: k_1p_i * (1 + tmag_i * np.sin(t / tscl_i))
            for i in range(len(k_1p))
        ]
        k_p1_fn = [
            lambda t,
            k_p1_i=k_p1[i],
            tmag_i=study_config.k_p1_tmag[i],
            tscl_i=study_config.k_p1_tscl[i]: k_p1_i * (1 + tmag_i * np.sin(t / tscl_i))
            for i in range(len(k_p1))
        ]

        # Create config for this individual
        config = IndividualConfig(
            num_peripherals=study_config.num_peripherals,
            k_a=k_a_fn,
            k_e=k_e_fn,
            V=V_fn,
            k_1p=k_1p_fn,
            k_p1=k_p1_fn,
            rel_ruv=study_config.rel_ruv,
        )
        individual_configs.append(config)

    return individual_configs
285
+
286
+
287
def create_dynamic_ode_matrix(config: IndividualConfig, t: float):
    """Assemble the time-t system matrix A of the linear compartment model.

    State ordering is [gut, central, peripheral_1, ..., peripheral_P],
    so the dynamics are dy/dt = A @ y.

    :param config: IndividualConfig with time-dependent rate callables
    :param t: Current time
    :return: (2 + P, 2 + P) torch tensor
    """
    n_states = 2 + config.num_peripherals  # gut, central, and peripherals
    A = torch.zeros((n_states, n_states))

    k_a = config.k_a(t)
    k_e = config.k_e(t)

    # Gut loses drug via absorption; the central compartment gains it and
    # loses drug via elimination.
    A[0, 0] = -k_a
    A[1, 0] = k_a
    A[1, 1] = -k_e

    for p in range(config.num_peripherals):
        k_out = config.k_1p[p](t)  # central -> peripheral p
        k_in = config.k_p1[p](t)  # peripheral p -> central
        A[1, 1] = A[1, 1] - k_out  # distribution drain on the central compartment
        A[1, 2 + p] = k_in  # return flow into central
        A[2 + p, 1] = k_out  # inflow into peripheral p
        A[2 + p, 2 + p] = -k_in  # outflow from peripheral p

    return A
314
+
315
+
316
def create_dynamic_ode_matrix_batched(configs, t, num_peripherals):
    """
    Creates batched ODE matrices for multiple individuals.

    Parameters
    ----------
    configs : list
        List of IndividualConfig objects.
    t : float
        Current time point.
    num_peripherals : int
        Number of peripheral compartments (same for all individuals; only
        the first `num_peripherals` entries of each config's k_1p / k_p1
        are evaluated).

    Returns
    -------
    A_all : torch.Tensor
        Tensor of shape (N, M, M) with M = 2 + num_peripherals, containing
        one compartment-model system matrix per individual (dy/dt = A @ y,
        state order [gut, central, peripherals...]).
    """
    # (The redundant function-local `import torch` was removed; torch is
    # imported at module level.)
    N = len(configs)
    M = 2 + num_peripherals
    A_all = torch.zeros((N, M, M), dtype=torch.float32)

    # Evaluate every individual's time-dependent rates once at time t.
    k_a_all = torch.tensor([config.k_a(t) for config in configs], dtype=torch.float32)
    k_e_all = torch.tensor([config.k_e(t) for config in configs], dtype=torch.float32)
    k_1p_all = torch.tensor(
        [[config.k_1p[i](t) for i in range(num_peripherals)] for config in configs],
        dtype=torch.float32,
    )
    k_p1_all = torch.tensor(
        [[config.k_p1[i](t) for i in range(num_peripherals)] for config in configs],
        dtype=torch.float32,
    )

    # Populate the batched ODE matrices:
    A_all[:, 0, 0] = -k_a_all  # gut loses drug via absorption
    A_all[:, 1, 0] = k_a_all  # absorption into central
    A_all[:, 1, 1] = -k_e_all - k_1p_all.sum(dim=1)  # elimination + distribution drain
    A_all[:, 1, 2 : 2 + num_peripherals] = k_p1_all  # return flow peripherals -> central
    A_all[:, 2 : 2 + num_peripherals, 1] = k_1p_all  # inflow central -> peripherals
    for i in range(num_peripherals):
        A_all[:, 2 + i, 2 + i] = -k_p1_all[:, i]  # outflow from each peripheral

    return A_all
362
+
363
+
364
def sample_study(
    individual_config_array, dosing_config_array, t: torch.Tensor, solver_method: str = "rk4"
) -> Tuple[
    torch.Tensor,  # [N, T] concentration profiles
    torch.Tensor,  # [N, T] time points
    torch.Tensor,  # [N] dosing amounts
    torch.Tensor,  # [N] dosing route types (0 = oral, 1 = iv)
]:
    """
    Simulates the pharmacokinetic study for a group of individuals and returns
    concentration profiles, time points, and dosing metadata.

    Parameters
    ----------
    individual_config_array : list
        List of IndividualConfig objects for each individual.
    dosing_config_array : list
        List of DosingConfig objects for each individual.
    t : torch.Tensor
        A 1D tensor of time points [T].
    solver_method : str
        Solver name forwarded to odeint (default "rk4").

    Returns
    -------
    full_simulation : torch.Tensor
        Concentration profiles [N, T] (central amount / V, with
        multiplicative Gaussian residual error of scale rel_ruv).
    full_simulation_times : torch.Tensor
        Time points [N, T].
    dosing_amounts : torch.Tensor
        Dosing amounts [N].
    dosing_route_types : torch.Tensor
        Route types [N], 0 = oral, 1 = iv.

    Raises
    ------
    ValueError
        If the individual and dosing config counts differ, or a dosing
        route is neither "oral" nor "iv".
    """
    # Sanity check
    if len(individual_config_array) != len(dosing_config_array):
        raise ValueError("Number of individuals and dosing configurations must match.")

    N = len(individual_config_array)
    num_peripherals_list = [cfg.num_peripherals for cfg in individual_config_array]
    all_same_peripherals = all(n == num_peripherals_list[0] for n in num_peripherals_list)

    # Extract dosing info
    dosing_amounts = torch.tensor(
        [cfg.dose for cfg in dosing_config_array], dtype=torch.float32
    )  # [N]
    routes_str = [cfg.route for cfg in dosing_config_array]
    route_map = {"oral": 0, "iv": 1}
    dosing_route_types = torch.tensor([route_map[r] for r in routes_str], dtype=torch.int64)  # [N]

    if all_same_peripherals:
        # Fast path: every individual shares the state dimension, so the
        # whole study can be solved as one batched ODE system.
        P = num_peripherals_list[0]
        M = 2 + P
        y0 = torch.zeros((N, M), dtype=torch.float32)
        is_oral = dosing_route_types == 0
        is_iv = dosing_route_types == 1
        y0[is_oral, 0] = dosing_amounts[is_oral]  # oral dose starts in the gut
        y0[is_iv, 1] = dosing_amounts[is_iv]  # iv bolus starts in the central compartment

        def ode_func(t, y):
            A_all = create_dynamic_ode_matrix_batched(individual_config_array, t.item(), P)
            return torch.bmm(A_all, y.unsqueeze(-1)).squeeze(-1)

        y = odeint(ode_func, y0, t, method=solver_method)  # [T, N, M]
        V_all = torch.tensor(
            [[cfg.V(ti.item()) for ti in t] for cfg in individual_config_array], dtype=torch.float32
        )  # [N, T]
        full_simulation = y[:, :, 1].T / V_all  # central amount -> concentration [N, T]
        # BUGFIX: apply each individual's own rel_ruv (matching the
        # per-individual branch below); previously only individual 0's
        # rel_ruv was used for the whole batch. Identical behavior when all
        # individuals share one rel_ruv, as sample_individual_configs produces.
        rel_ruv_all = torch.tensor(
            [cfg.rel_ruv for cfg in individual_config_array], dtype=torch.float32
        ).unsqueeze(1)  # [N, 1]
        full_simulation *= 1 + torch.randn_like(full_simulation) * rel_ruv_all
    else:
        # Slow path: heterogeneous state dimensions force one solve per individual.
        full_simulation = []
        for config, dosing_config in zip(individual_config_array, dosing_config_array):
            P = config.num_peripherals
            M = 2 + P
            if dosing_config.route == "oral":
                y0 = torch.tensor([dosing_config.dose] + [0.0] * (M - 1), dtype=torch.float32)
            elif dosing_config.route == "iv":
                y0 = torch.tensor([0.0, dosing_config.dose] + [0.0] * (M - 2), dtype=torch.float32)
            else:
                raise ValueError(f"Unsupported route: {dosing_config.route}")

            def ode_func(t, y, config=config):
                # Default-bind config to avoid late-binding surprises.
                A = create_dynamic_ode_matrix(config, t.item())
                return torch.matmul(A, y)

            y = odeint(ode_func, y0, t, method=solver_method)  # [T, M]
            V = torch.tensor([config.V(ti.item()) for ti in t], dtype=torch.float32)  # [T]
            concentration = y[:, 1] / V
            concentration *= 1 + torch.randn_like(concentration) * config.rel_ruv
            full_simulation.append(concentration)

        full_simulation = torch.stack(full_simulation)  # [N, T]

    full_times = t.unsqueeze(0).repeat(N, 1)  # [N, T]

    return full_simulation, full_times, dosing_amounts, dosing_route_types
460
+
461
+
462
def sample_study_with_duration(
    individual_config_array,
    dosing_config_array: List[DosingWithDurationConfig],
    t: torch.Tensor,
    solver_method: str = "rk4",
) -> Tuple[
    torch.Tensor,  # [N, T] concentration profiles
    torch.Tensor,  # [N, T] time points
    torch.Tensor,  # [N] dosing amounts
    torch.Tensor,  # [N] dosing route types (0 = oral, 1 = iv)
]:
    """
    Simulates the pharmacokinetic study for a group of individuals and returns
    concentration profiles, time points, and dosing metadata.

    This is a parallel implementation to sample_study that additionally
    supports iv infusion dosing: a nonzero duration is modeled as a constant
    input rate dose/duration into the central compartment while
    t < duration. Once validated, the two can be merged.

    Parameters
    ----------
    individual_config_array : list
        List of IndividualConfig objects for each individual.
    dosing_config_array : list
        List of DosingWithDurationConfig objects for each individual.
    t : torch.Tensor
        A 1D tensor of time points [T].
    solver_method : str
        Solver name forwarded to odeint (default "rk4").

    Returns
    -------
    full_simulation : torch.Tensor
        Concentration profiles [N, T].
    full_simulation_times : torch.Tensor
        Time points [N, T].
    dosing_amounts : torch.Tensor
        Dosing amounts [N].
    dosing_route_types : torch.Tensor
        Route types [N], 0 = oral, 1 = iv.
    """
    # Sanity check
    if len(individual_config_array) != len(dosing_config_array):
        raise ValueError("Number of individuals and dosing configurations must match.")

    N = len(individual_config_array)
    num_peripherals_list = [cfg.num_peripherals for cfg in individual_config_array]
    all_same_peripherals = all(n == num_peripherals_list[0] for n in num_peripherals_list)

    # Extract dosing info
    dosing_amounts = torch.tensor(
        [cfg.dose for cfg in dosing_config_array], dtype=torch.float32
    )  # [N]
    routes_str = [cfg.route for cfg in dosing_config_array]
    route_map = {"oral": 0, "iv": 1}
    dosing_route_types = torch.tensor([route_map[r] for r in routes_str], dtype=torch.int64)  # [N]
    dosing_durations = torch.tensor(
        [cfg.duration for cfg in dosing_config_array], dtype=torch.float32
    )  # [N]

    if all_same_peripherals and all(dosing_durations == 0):
        # Fast path: homogeneous state dimension and no infusions, so the
        # whole study is a single batched bolus/oral ODE solve.
        P = num_peripherals_list[0]
        M = 2 + P  # gut, central, peripherals

        y0 = torch.zeros((N, M), dtype=torch.float32)
        is_oral = dosing_route_types == 0
        is_iv_bolus = dosing_route_types == 1
        y0[is_oral, 0] = dosing_amounts[is_oral]  # oral dose starts in the gut
        y0[is_iv_bolus, 1] = dosing_amounts[is_iv_bolus]  # iv bolus starts in central

        def ode_func(t, y):
            A_all = create_dynamic_ode_matrix_batched(individual_config_array, t.item(), P)
            return torch.bmm(A_all, y.unsqueeze(-1)).squeeze(-1)

        y = odeint(ode_func, y0, t, method=solver_method)  # [T, N, M]
        V_all = torch.tensor(
            [[cfg.V(ti.item()) for ti in t] for cfg in individual_config_array], dtype=torch.float32
        )  # [N, T]
        full_simulation = y[:, :, 1].T / V_all  # [N, T]
        # BUGFIX: apply each individual's own rel_ruv (matching the
        # per-individual branch below); previously only individual 0's
        # rel_ruv was used for the whole batch.
        rel_ruv_all = torch.tensor(
            [cfg.rel_ruv for cfg in individual_config_array], dtype=torch.float32
        ).unsqueeze(1)  # [N, 1]
        full_simulation *= 1 + torch.randn_like(full_simulation) * rel_ruv_all
    else:
        full_simulation = []
        for config, dosing_config in zip(individual_config_array, dosing_config_array):
            P = config.num_peripherals
            M = 2 + P  # gut, central, peripherals
            if dosing_config.route == "oral":
                assert dosing_config.duration == 0, "Oral dosing cannot have a duration."
                y0 = torch.tensor([dosing_config.dose] + [0.0] * (M - 1), dtype=torch.float32)
            elif dosing_config.route == "iv":
                if dosing_config.duration > 0:
                    # Infusion: the dose enters through the forcing term in
                    # ode_func, so all compartments start empty.
                    y0 = torch.zeros(M, dtype=torch.float32)
                else:
                    # Bolus: the full dose starts in the central compartment.
                    y0 = torch.tensor(
                        [0.0, dosing_config.dose] + [0.0] * (M - 2), dtype=torch.float32
                    )
            else:
                raise ValueError(f"Unsupported route: {dosing_config.route}")

            def ode_func(t, y, config=config, dosing_config=dosing_config):
                # Defaults bind the loop variables to avoid late-binding surprises.
                A = create_dynamic_ode_matrix(config, t.item())
                b = torch.zeros_like(y)
                if (
                    dosing_config.route == "iv"
                    and dosing_config.duration > 0
                    and t.item() < dosing_config.duration
                ):
                    # During infusion, add a constant rate into the central compartment.
                    b[1] = dosing_config.dose / dosing_config.duration
                return torch.matmul(A, y) + b

            y = odeint(ode_func, y0, t, method=solver_method)  # [T, M]
            V = torch.tensor([config.V(ti.item()) for ti in t], dtype=torch.float32)  # [T]
            concentration = y[:, 1] / V
            concentration *= 1 + torch.randn_like(concentration) * config.rel_ruv
            full_simulation.append(concentration)

        full_simulation = torch.stack(full_simulation)  # [N, T]

    full_times = t.unsqueeze(0).repeat(N, 1)  # [N, T]

    return full_simulation, full_times, dosing_amounts, dosing_route_types
588
+
589
+
590
def derive_timescale_parameters(config: StudyConfig, meta_config: MetaStudyConfig):
    """
    Derive peak time and terminal half life for typical parameters,
    which can then be used to inform a study-specific sampling schedule.

    Returns a torch.Tensor [tmax, t12], both clamped to lie inside the
    study's observation window.
    """
    k_a = np.exp(config.log_k_a_mean)
    k_e = np.exp(config.log_k_e_mean)

    # One-compartment-with-absorption peak time, ln(k_a/k_e)/(k_a - k_e),
    # written with both numerator and denominator negated.
    # NOTE(review): divides by zero if k_a == k_e — assumed never sampled
    # exactly equal; confirm if that can occur.
    tmax = (np.log(k_e) - np.log(k_a)) / (k_e - k_a)

    # Mean-residence-time approximation for the terminal half-life; the
    # peripheral-compartment contribution is intentionally not included yet.
    mean_residence_time = 1 / k_e
    t12 = np.log(2) * mean_residence_time

    # Keep both timescales inside the observation window, and tmax below t12.
    if t12 > meta_config.time_stop:
        t12 = float(meta_config.time_stop / 2.0)
    if tmax > t12:
        tmax = float(t12 * 0.5)

    return torch.Tensor([tmax, t12])
610
+
611
+
612
def sample_dosing_configs(config: MetaDosingConfig):
    """
    Sample per-individual dosing configurations from a meta dosing config.

    The administration route is either shared across the whole study or
    drawn per individual, according to ``config.same_route``. Doses are
    lognormal, with log-mean and log-std each drawn uniformly from their
    configured ranges; with logdose_std_range = (0, 0) every individual
    receives the identical dose.
    """
    n = config.num_individuals

    if config.same_route:
        shared_route = np.random.choice(config.route_options, p=config.route_weights)
        routes = np.repeat(shared_route, n)
    else:
        routes = np.random.choice(config.route_options, p=config.route_weights, size=n)

    # Draw lognormal distribution parameters, then one dose per individual.
    logdose_mean = np.random.uniform(*config.logdose_mean_range)
    logdose_std = np.random.uniform(*config.logdose_std_range)
    doses = np.random.lognormal(logdose_mean, logdose_std, size=n)

    return [
        DosingConfig(dose=doses[i], route=routes[i], time=config.time) for i in range(n)
    ]
642
+
643
+
644
def sample_dosing_with_duration_configs(config: MetaDosingWithDurationConfig):
    """
    Sample per-individual dosing configurations (with infusion durations)
    from a meta dosing config.

    Routes are shared or per-individual depending on ``config.same_route``.
    Doses are lognormal with uniformly drawn log-mean/log-std; with
    logdose_std_range = (0, 0) the dose is identical for all individuals.
    Each individual receives an infusion duration with probability
    ``config.route_duration_weights[route]``; otherwise the duration is 0
    (bolus/oral).
    """
    n = config.num_individuals

    if config.same_route:
        shared_route = np.random.choice(config.route_options, p=config.route_weights)
        routes = np.repeat(shared_route, n)
    else:
        routes = np.random.choice(config.route_options, p=config.route_weights, size=n)

    # Candidate infusion durations; zeroed out below for non-infusion individuals.
    duration_raw = np.random.uniform(
        config.duration_range[0], config.duration_range[1], size=n
    )

    # Draw lognormal distribution parameters, then one dose per individual.
    logdose_mean = np.random.uniform(*config.logdose_mean_range)
    logdose_std = np.random.uniform(*config.logdose_std_range)
    doses = np.random.lognormal(logdose_mean, logdose_std, size=n)

    dosing_configs = []
    for i in range(n):
        # Bernoulli draw: infusion (1) or instantaneous dose (0) for this route.
        duration_flag = np.random.binomial(1, config.route_duration_weights[routes[i]], size=1)[0]
        dosing_configs.append(
            DosingWithDurationConfig(
                dose=doses[i],
                route=routes[i],
                time=config.time,
                duration=duration_raw[i] * duration_flag,
            )
        )

    return dosing_configs
688
+
689
+
690
def get_random_simulation(
    model_config: NodePKExperimentConfig,
) -> Tuple[TensorType["I", "T"], TensorType["I", "T"]]:
    """
    Generates random simulation data based on the model configuration.

    Args:
        model_config (NodePKExperimentConfig): Configuration for the simulation.

    Returns:
        Tuple[TensorType["I", "T"], TensorType["I", "T"]]: random simulation
        points and the corresponding time steps, each of shape [I, T].
    """
    meta = model_config.meta_study
    num_individuals = meta.num_individuals_range[0]
    num_steps = meta.time_num_steps

    # One shared time grid over the study window, replicated per individual.
    grid = torch.linspace(meta.time_start, meta.time_stop, num_steps, dtype=torch.float32)
    time_steps = grid.unsqueeze(0).repeat(num_individuals, 1)  # [I, T]

    # Uniform random values scaled down by the study horizon.
    simulation_points = torch.rand(num_individuals, num_steps) / meta.time_stop  # [I, T]

    return simulation_points, time_steps
sim_priors_pk/data/data_generation/compartment_models_management.py ADDED
@@ -0,0 +1,1338 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # pyright: reportAssignmentType=false
2
+ # compartment_models_management.py
3
+ import json
4
+ import logging
5
+ from dataclasses import replace
6
+ from pathlib import Path
7
+ from typing import TYPE_CHECKING, Dict, Optional, Tuple
8
+
9
+ import numpy as np
10
+ import torch
11
+ from torchtyping import TensorType
12
+
13
+ from sim_priors_pk.config_classes.data_config import (
14
+ DosingConfig,
15
+ DosingWithDurationConfig,
16
+ MetaDosingConfig,
17
+ MetaDosingWithDurationConfig,
18
+ MetaStudyConfig,
19
+ ObservationsConfig,
20
+ )
21
+ from sim_priors_pk.data.data_empirical.json_schema import StudyJSON
22
+ from sim_priors_pk.data.data_generation.compartment_models import (
23
+ StudyConfig,
24
+ derive_timescale_parameters,
25
+ sample_dosing_configs,
26
+ sample_dosing_with_duration_configs,
27
+ sample_individual_configs,
28
+ sample_study,
29
+ sample_study_config,
30
+ sample_study_with_duration,
31
+ )
32
+ from sim_priors_pk.data.data_generation.observations_classes import ObservationStrategyFactory
33
+
34
+ logger = logging.getLogger(__name__)
35
+
36
if TYPE_CHECKING:  # pragma: no cover - typing only
    from sim_priors_pk.data.data_empirical.json_schema import IndividualJSON, StudyJSON
else:  # pragma: no cover - runtime fallback avoids heavy import cycle
    # NOTE(review): StudyJSON is also imported unconditionally at module top
    # (from json_schema), so this branch rebinds it to a plain dict alias at
    # runtime and shadows the real class — confirm whether the eager import
    # above is still needed or should be removed.
    IndividualJSON = Dict[str, object]
    StudyJSON = Dict[str, object]
41
+
42
+
43
def is_valid_simulation(sim: torch.Tensor) -> bool:
    """Return True if the simulation is numerically valid.

    Valid means every value is finite, non-negative, and strictly below 10.

    FIX: the result is wrapped in bool() so the function honors its
    ``-> bool`` annotation; previously it returned a 0-dim torch tensor.
    """
    return bool(torch.isfinite(sim).all() and (sim >= 0).all() and (sim < 10).all())
46
+
47
+
48
def sample_dosing_configs_repeated_target(config: MetaDosingConfig, n_targets: int):
    """
    Generate dosing configs where all target individuals share the same
    dose and route.

    Parameters
    ----------
    config : MetaDosingConfig
        Meta dosing configuration (its num_individuals field may be ignored).
    n_targets : int
        Number of target individuals to generate.

    Returns
    -------
    List[DosingConfig]
        The same dosing config repeated `n_targets` times.
    """
    # One shared route for every target.
    shared_route = np.random.choice(config.route_options, p=config.route_weights)

    # One shared lognormal dose, with log-mean/log-std drawn uniformly.
    mu = np.random.uniform(*config.logdose_mean_range)
    sigma = np.random.uniform(*config.logdose_std_range)
    shared_dose = float(np.random.lognormal(mu, sigma))

    return [
        DosingConfig(dose=shared_dose, route=shared_route, time=config.time)
        for _ in range(n_targets)
    ]
78
+
79
+
80
def sample_dosing_with_duration_configs_repeated_target(
    config: MetaDosingWithDurationConfig, n_targets: int
):
    """
    Generate dosing configs (with duration) where all target individuals
    share the same dose, route and infusion duration.

    Parameters
    ----------
    config : MetaDosingWithDurationConfig
        Meta dosing configuration with duration (its num_individuals field
        may be ignored).
    n_targets : int
        Number of target individuals to generate.

    Returns
    -------
    List[DosingWithDurationConfig]
        The same dosing config repeated `n_targets` times.
    """
    # One shared route for every target.
    shared_route = np.random.choice(config.route_options, p=config.route_weights)

    # NOTE(review): here the shared duration is the uniform sample *scaled*
    # by the route's duration weight, whereas sample_dosing_with_duration_configs
    # uses that weight as a Bernoulli probability for an infusion flag —
    # confirm which semantics is intended.
    weight = config.route_duration_weights[shared_route]
    duration_sample = np.random.uniform(*config.duration_range)
    shared_duration = weight * duration_sample

    # One shared lognormal dose, with log-mean/log-std drawn uniformly.
    mu = np.random.uniform(*config.logdose_mean_range)
    sigma = np.random.uniform(*config.logdose_std_range)
    shared_dose = float(np.random.lognormal(mu, sigma))

    return [
        DosingWithDurationConfig(
            dose=shared_dose, route=shared_route, time=config.time, duration=shared_duration
        )
        for _ in range(n_targets)
    ]
118
+
119
+
120
+ # ──────────────────────────────────────────────────────────────
121
+ # NEW: split where *all* individuals are target
122
+ # ──────────────────────────────────────────────────────────────
123
def split_context_only(
    full_simulation: torch.Tensor,
    full_simulation_times: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor, list[int]]:
    """Treat the entire cohort as context.

    No target split is performed: both tensors are passed through untouched
    and every individual index is reported as a context index.
    """
    context_indices = [*range(full_simulation.shape[0])]
    return full_simulation, full_simulation_times, context_indices
131
+
132
+
133
def split_simulations_repeated_target(
    full_simulation: torch.Tensor,
    full_simulation_times: torch.Tensor,
) -> Tuple[
    Optional[torch.Tensor],
    Optional[torch.Tensor],
    torch.Tensor,
    torch.Tensor,
    list[int],
    list[int],
]:
    """Treat every individual as a target; the context set is empty.

    Counterpart of :func:`split_context_only` — the tensors pass through
    unchanged in the *target* slots, while the context tensors are ``None``
    and the context index list is empty.

    Parameters
    ----------
    full_simulation : torch.Tensor [N, T]
    full_simulation_times : torch.Tensor [N, T]

    Returns
    -------
    context_simulation : None
    context_simulation_times : None
    target_simulation : torch.Tensor [N, T]
    target_simulation_times : torch.Tensor [N, T]
    context_indices : []
    target_indices : list[int] = [0,...,N-1]
    """
    all_indices = [*range(full_simulation.shape[0])]
    return None, None, full_simulation, full_simulation_times, [], all_indices
173
+
174
+
175
def _generate_full_simulation(
    meta_study_config: MetaStudyConfig,
    meta_dosing_config: MetaDosingConfig,
    *,
    retry_on_invalid: bool = True,
    idx: int = 0,
) -> Tuple[
    torch.Tensor,
    torch.Tensor,
    torch.Tensor,
    torch.Tensor,
    torch.Tensor,
    torch.Tensor,
    StudyConfig,
    list[DosingConfig],
    int,
]:
    """Internal helper returning the raw tensors alongside sampling metadata.

    Samples a study configuration, per-individual PK parameters, and a dosing
    regimen, then integrates the study on a uniform time grid. Invalid
    (numerically broken) simulations are re-sampled in a loop — the previous
    recursive retry could exhaust Python's recursion limit on a long unlucky
    streak.

    Parameters
    ----------
    meta_study_config:
        Population / solver sampling configuration.
    meta_dosing_config:
        Dosing-distribution configuration; its ``num_individuals`` field is
        overridden with the sampled study size.
    retry_on_invalid:
        When ``True`` (default), invalid simulations are re-sampled.
        Otherwise a :class:`RuntimeError` is raised immediately.
    idx:
        Starting attempt offset, kept for diagnostics and backwards
        compatibility with the earlier recursive implementation.

    Returns
    -------
    tuple
        ``(full_sim, full_times, dosing_amounts, dosing_routes, time_points,
        time_scales, study_config, dosing_config_array, failed_attempts)``
        where ``failed_attempts`` counts the invalid simulations discarded
        before the returned one.

    Raises
    ------
    RuntimeError
        If an invalid simulation is produced and ``retry_on_invalid`` is
        ``False``.
    """
    failed_attempts = 0
    attempt_depth = idx
    while True:
        study_config = sample_study_config(meta_study_config)
        indiv_config_array = sample_individual_configs(study_config)
        time_scales = derive_timescale_parameters(study_config, meta_study_config)

        time_points = torch.linspace(
            meta_study_config.time_start,
            meta_study_config.time_stop,
            meta_study_config.time_num_steps,
            dtype=torch.float32,
        )

        # The dosing sampler needs to know the sampled cohort size.
        local_meta_dosing = replace(
            meta_dosing_config, num_individuals=study_config.num_individuals
        )
        dosing_config_array = sample_dosing_configs(local_meta_dosing)

        full_sim, full_times, dosing_amounts, dosing_routes = sample_study(
            indiv_config_array,
            dosing_config_array,
            time_points,
            meta_study_config.solver_method,
        )

        if is_valid_simulation(full_sim):
            return (
                full_sim,
                full_times,
                dosing_amounts,
                dosing_routes,
                time_points,
                time_scales,
                study_config,
                dosing_config_array,
                failed_attempts,
            )

        # Invalid simulation: warn once the attempt count grows suspicious.
        attempt_number = attempt_depth + 1
        if attempt_number > 5:
            logger.warning(
                "Invalid simulation encountered during attempt %d (recursion depth %d); retry_on_invalid=%s.",
                attempt_number,
                attempt_depth,
                retry_on_invalid,
            )
        if not retry_on_invalid:
            raise RuntimeError("Invalid simulation")
        failed_attempts += 1
        attempt_depth += 1
264
+
265
+
266
def _generate_full_simulation_with_duration(
    meta_study_config: MetaStudyConfig,
    meta_dosing_config: MetaDosingWithDurationConfig,
    *,
    retry_on_invalid: bool = True,
    idx: int = 0,
) -> Tuple[
    torch.Tensor,
    torch.Tensor,
    torch.Tensor,
    torch.Tensor,
    torch.Tensor,
    torch.Tensor,
    StudyConfig,
    list[DosingConfig],
    int,
]:
    """Internal helper returning the raw tensors alongside sampling metadata.

    Parallel implementation of :func:`_generate_full_simulation` that supports
    dosing with duration; once validated the two can be merged. Invalid
    simulations are re-sampled in a loop — the previous recursive retry could
    exhaust Python's recursion limit on a long unlucky streak.

    Parameters
    ----------
    meta_study_config:
        Population / solver sampling configuration.
    meta_dosing_config:
        Dosing-with-duration configuration; its ``num_individuals`` field is
        overridden with the sampled study size.
    retry_on_invalid:
        When ``True`` (default), invalid simulations are re-sampled.
        Otherwise a :class:`RuntimeError` is raised immediately.
    idx:
        Starting attempt offset, kept for diagnostics and backwards
        compatibility with the earlier recursive implementation.

    Returns
    -------
    tuple
        ``(full_sim, full_times, dosing_amounts, dosing_routes, time_points,
        time_scales, study_config, dosing_config_array, failed_attempts)``.

    Raises
    ------
    RuntimeError
        If an invalid simulation is produced and ``retry_on_invalid`` is
        ``False``.
    """
    failed_attempts = 0
    attempt_depth = idx
    while True:
        study_config = sample_study_config(meta_study_config)
        indiv_config_array = sample_individual_configs(study_config)
        time_scales = derive_timescale_parameters(study_config, meta_study_config)

        time_points = torch.linspace(
            meta_study_config.time_start,
            meta_study_config.time_stop,
            meta_study_config.time_num_steps,
            dtype=torch.float32,
        )

        # The dosing sampler needs to know the sampled cohort size.
        local_meta_dosing = replace(
            meta_dosing_config, num_individuals=study_config.num_individuals
        )
        dosing_config_array = sample_dosing_with_duration_configs(local_meta_dosing)

        full_sim, full_times, dosing_amounts, dosing_routes = sample_study_with_duration(
            indiv_config_array,
            dosing_config_array,
            time_points,
            meta_study_config.solver_method,
        )

        if is_valid_simulation(full_sim):
            return (
                full_sim,
                full_times,
                dosing_amounts,
                dosing_routes,
                time_points,
                time_scales,
                study_config,
                dosing_config_array,
                failed_attempts,
            )

        # Invalid simulation: warn once the attempt count grows suspicious.
        attempt_number = attempt_depth + 1
        if attempt_number > 5:
            logger.warning(
                "Invalid simulation encountered during attempt %d (recursion depth %d); retry_on_invalid=%s.",
                attempt_number,
                attempt_depth,
                retry_on_invalid,
            )
        if not retry_on_invalid:
            raise RuntimeError("Invalid simulation")
        failed_attempts += 1
        attempt_depth += 1
359
+
360
+
361
def _generate_simple_exp_simulation(
    meta_study_config,
) -> Tuple[
    TensorType["N", "T"],  # full_simulation
    TensorType["N", "T"],  # full_simulation_times
    TensorType["N"],  # dosing_amounts
    TensorType["N"],  # dosing_route_types
    TensorType["T"],  # time_points
    TensorType[2],  # time_scales [tmax, t12]
]:
    """Minimal synthetic PK-like simulator based on a single exponential.

    One decay rate ``k ~ U(decay_rate_range)`` is drawn per RUN and shared by
    all individuals; per-individual variation enters only through a normally
    distributed intercept. Only ``band_scale_range``, ``baseline_range`` and
    ``decay_rate_range`` are consumed from the config (with defaults).

    Per-RUN derivations:
        baseline_run   ~ U(baseline_range)
        band_scale_run ~ U(band_scale_range)
        decay_rate     ~ U(decay_rate_range)
        intercept_mean = 1.0 + baseline_run
        intercept_std  = 0.5 * band_scale_run
    """
    # Hyperparameters, falling back to defaults when absent from the config.
    n_indiv: int = getattr(meta_study_config, "num_individuals", 16)
    n_steps: int = getattr(meta_study_config, "time_num_steps", 40)
    t_start: float = getattr(meta_study_config, "time_start", 0.0)
    t_stop: float = getattr(meta_study_config, "time_stop", 24.0)

    band_scale_range = getattr(meta_study_config, "band_scale_range", (0.1, 0.3))
    baseline_range = getattr(meta_study_config, "baseline_range", (0.0, 0.1))
    decay_rate_range = getattr(meta_study_config, "decay_rate_range", (0.3, 0.6))

    def _uniform_draw(lo, hi):
        # One unseeded uniform scalar in [lo, hi).
        return (torch.rand(1) * (hi - lo) + lo).item()

    # Per-RUN draws; the call order fixes the RNG stream.
    band_scale_run = _uniform_draw(*band_scale_range)
    baseline_run = _uniform_draw(*baseline_range)
    decay_k = _uniform_draw(*decay_rate_range)  # shared by all individuals

    intercept_mean = 1.0 + baseline_run
    intercept_std = 0.5 * band_scale_run

    # Shared single-exponential shape on the time grid; f(0) = 1.
    grid: TensorType["T", 1] = torch.linspace(t_start, t_stop, n_steps).unsqueeze(-1)
    shape_fn: TensorType["T", 1] = torch.exp(-decay_k * grid)

    # Per-individual intercepts, clamped to stay non-negative.
    intercepts: TensorType["N", 1, 1] = torch.normal(
        mean=float(intercept_mean),
        std=float(intercept_std),
        size=(n_indiv, 1, 1),
    ).clamp_min(0.0)

    # Scale the shared shape per individual and add the run-level baseline.
    traces: TensorType["N", "T", 1] = intercepts * shape_fn.unsqueeze(0) + baseline_run
    traces = traces.clamp_min(0.0)  # numerical safety

    # Dummy dosing tensors and heuristic time scales.
    dosing_amounts: TensorType["N"] = torch.zeros(n_indiv)
    dosing_routes: TensorType["N"] = torch.zeros(n_indiv)
    span = float(t_stop - t_start)
    time_scales: TensorType[2] = torch.tensor(
        [0.3 * span, 0.75 * span], dtype=torch.float32
    )

    return (
        traces.squeeze(-1),  # full_simulation [N, T]
        grid.expand(n_indiv, -1, -1).squeeze(-1),  # full_simulation_times [N, T]
        dosing_amounts,
        dosing_routes,
        grid.squeeze(-1),  # time_points [T]
        time_scales,
    )
455
+
456
+
457
def _generate_pulse_simulation(
    meta_study_config,
) -> Tuple[
    TensorType["N", "T"],  # full_simulation
    TensorType["N", "T"],  # full_simulation_times
    TensorType["N"],  # dosing_amounts
    TensorType["N"],  # dosing_route_types
    TensorType["T"],  # time_points
    TensorType[2],  # time_scales [t_peak, t_half_tail]
]:
    """Pulse-shaped synthetic simulator (rise -> peak -> decay).

    Optional config attributes (all with safe defaults):
        num_individuals, time_start, time_stop, time_num_steps
        band_scale_range: (lo, hi)  # intercept std via 0.5 * band_scale_run
        baseline_range:   (lo, hi)  # per-run vertical offset for all traces
        decay_rate_range: (lo, hi)  # per-run tail rate r; larger => faster decay

    Per-RUN construction:
        duration = time_stop - time_start
        t_peak   = 0.30 * duration
        r        ~ U(decay_rate_range)
        beta     = 1 / r
        alpha    = 1 + r * t_peak           # Gamma(alpha, beta) peaks near t_peak
        f(t)     = t^(alpha-1) * exp(-t/beta), normalized so max(f) = 1

        baseline_run   ~ U(baseline_range)
        band_scale_run ~ U(band_scale_range)
        intercept_mean = 1.0 + baseline_run
        intercept_std  = 0.5 * band_scale_run

    Per INDIVIDUAL:
        intercept_i ~ Normal(intercept_mean, intercept_std), clamped >= 0
        trace_i(t)  = intercept_i * f(t) + baseline_run, clamped >= 0
    """
    # Basic hyperparameters with safe fallbacks.
    n_indiv: int = getattr(meta_study_config, "num_individuals", 16)
    n_steps: int = getattr(meta_study_config, "time_num_steps", 40)
    t_start: float = getattr(meta_study_config, "time_start", 0.0)
    t_stop: float = getattr(meta_study_config, "time_stop", 24.0)

    band_scale_range = getattr(meta_study_config, "band_scale_range", (0.1, 0.3))
    baseline_range = getattr(meta_study_config, "baseline_range", (0.0, 0.1))
    decay_rate_range = getattr(meta_study_config, "decay_rate_range", (0.3, 0.6))

    def _uniform_draw(lo, hi):
        # One unseeded uniform scalar in [lo, hi).
        return (torch.rand(1) * (hi - lo) + lo).item()

    # Per-RUN draws; the call order fixes the RNG stream.
    band_scale_run = _uniform_draw(*band_scale_range)
    baseline_run = _uniform_draw(*baseline_range)
    tail_rate = _uniform_draw(*decay_rate_range)  # shared across this run

    window = float(t_stop - t_start)
    peak_time = 0.30 * window  # desired peak position
    gamma_scale = 1.0 / max(tail_rate, 1e-6)  # tail scale (beta)
    gamma_shape = 1.0 + tail_rate * peak_time  # alpha > 1 => rise then decay

    # Guardrail: keep alpha comfortably above 1 for a proper pulse shape.
    if gamma_shape <= 1.05:
        gamma_shape = 1.05

    # Gamma-shaped pulse on the time grid (unnormalized; 0 at t=0 when alpha>1).
    grid: TensorType["T"] = torch.linspace(t_start, t_stop, n_steps)  # [T]
    elapsed = grid - t_start  # time since study start
    pulse = (elapsed.clamp_min(0.0) ** (gamma_shape - 1.0)) * torch.exp(
        -elapsed / gamma_scale
    )
    # Normalize to max = 1 so that the intercept alone controls amplitude.
    pulse = pulse / torch.amax(pulse).clamp_min(1e-12)

    # Per-individual intercepts, clamped to stay non-negative.
    intercept_mean = 1.0 + baseline_run
    intercept_std = 0.5 * band_scale_run
    intercepts: TensorType["N", 1] = torch.normal(
        mean=float(intercept_mean),
        std=float(intercept_std),
        size=(n_indiv, 1),
    ).clamp_min(0.0)

    # Scale the shared pulse per individual, add the run baseline, clamp.
    traces: TensorType["N", "T"] = (intercepts * pulse.unsqueeze(0)) + baseline_run
    traces = traces.clamp_min(0.0)

    # Dummy dosing tensors.
    dosing_amounts: TensorType["N"] = torch.zeros(n_indiv)
    dosing_routes: TensorType["N"] = torch.zeros(n_indiv)

    # Time scales: peak position plus an approximate post-peak half-life.
    tail_half_life = peak_time + (
        torch.log(torch.tensor(2.0)) / max(tail_rate, 1e-6)
    ).item()
    time_scales: TensorType[2] = torch.tensor(
        [peak_time, tail_half_life], dtype=torch.float32
    )

    return (
        traces,  # full_simulation [N, T]
        grid.unsqueeze(0).expand(n_indiv, -1),  # full_simulation_times [N, T]
        dosing_amounts,
        dosing_routes,
        grid,  # time_points [T]
        time_scales,
    )
581
+
582
+
583
def _generate_simple_simulation(
    meta_study_config,
) -> Tuple[
    TensorType["N", "T"],
    TensorType["N", "T"],
    TensorType["N"],
    TensorType["N"],
    TensorType["T"],
    TensorType[2],
]:
    """Randomly dispatch between the exponential and pulse toy simulators.

    With probability ``p1`` (config attribute, default 0.5, clamped into
    [0, 1]) the single-exponential generator runs; otherwise the pulse
    generator runs.
    """
    mix_prob = float(getattr(meta_study_config, "p1", 0.5))
    mix_prob = min(max(mix_prob, 0.0), 1.0)  # clamp into [0, 1]

    use_exponential = torch.rand(1).item() < mix_prob
    generator = (
        _generate_simple_exp_simulation if use_exponential else _generate_pulse_simulation
    )
    return generator(meta_study_config)
609
+
610
+
611
def prepare_full_simulation(
    meta_study_config,
    meta_dosing_config,
    *,
    retry_on_invalid: bool = True,
    idx: int = 0,
) -> Tuple[
    TensorType["N", "T", 1],
    TensorType["N", "T"],
    TensorType["N"],
    TensorType["N"],
    TensorType["T"],
    TensorType[2],
]:
    """Generate a full INDIVIDUAL study simulation (before context/target split).

    Thin wrapper bundling the common steps shared across dataset generators.
    If ``meta_study_config.simple_mode`` is truthy, the simplified synthetic
    generator is used instead of the mechanistic one.
    """
    if getattr(meta_study_config, "simple_mode", False):
        return _generate_simple_simulation(meta_study_config)

    raw = _generate_full_simulation(
        meta_study_config,
        meta_dosing_config,
        retry_on_invalid=retry_on_invalid,
        idx=idx,
    )
    # Keep only the tensors; study config, dosing configs and the failure
    # counter (elements 6-8) are internal sampling metadata.
    return raw[:6]
653
+
654
+
655
def prepare_full_simulation_with_duration(
    meta_study_config,
    meta_dosing_config,
    *,
    retry_on_invalid: bool = True,
    idx: int = 0,
) -> Tuple[
    TensorType["N", "T", 1],
    TensorType["N", "T"],
    TensorType["N"],
    TensorType["N"],
    TensorType["T"],
    TensorType[2],
]:
    """Generate a full INDIVIDUAL study simulation (before context/target split).

    Thin wrapper bundling the common steps shared across dataset generators.
    If ``meta_study_config.simple_mode`` is truthy, the simplified synthetic
    generator is used instead of the mechanistic one.

    Parallel implementation of :func:`prepare_full_simulation` that supports
    dosing with duration; once validated the two can be merged.
    """
    if getattr(meta_study_config, "simple_mode", False):
        return _generate_simple_simulation(meta_study_config)

    raw = _generate_full_simulation_with_duration(
        meta_study_config,
        meta_dosing_config,
        retry_on_invalid=retry_on_invalid,
        idx=idx,
    )
    # Keep only the tensors; study config, dosing configs and the failure
    # counter (elements 6-8) are internal sampling metadata.
    return raw[:6]
700
+
701
+
702
+ def _ensure_strictly_increasing_observations(
703
+ obs_times: list[float], obs_vals: list[list[float]], *, individual_id: str
704
+ ) -> None:
705
+ """Validate that the provided observation times are strictly increasing.
706
+
707
+ Parameters
708
+ ----------
709
+ obs_times:
710
+ Sequence of observation timestamps extracted from the simulator.
711
+ obs_vals:
712
+ Sequence of observation values sampled at ``obs_times``.
713
+ individual_id:
714
+ Identifier of the individual being validated. Included in the
715
+ diagnostic error message to simplify debugging when duplicates are
716
+ detected in batched runs.
717
+ """
718
+
719
+ if len(obs_times) != len(obs_vals):
720
+ raise ValueError(
721
+ "Observation times must be sorted and match the number of observations. "
722
+ f"Received lengths times={len(obs_times)} and values={len(obs_vals)} for "
723
+ f"{individual_id}. Observations={obs_vals}, times={obs_times}."
724
+ )
725
+
726
+ for idx_time in range(len(obs_times) - 1):
727
+ if obs_times[idx_time] >= obs_times[idx_time + 1]:
728
+ raise ValueError(
729
+ "Observation times must be sorted and match the number of observations. "
730
+ f"Detected non-increasing times for {individual_id} at position {idx_time}. "
731
+ f"Observations={obs_vals}, times={obs_times}."
732
+ )
733
+
734
+
735
def prepare_full_simulation_to_study_json(
    meta_study_config: MetaStudyConfig,
    observation_config: ObservationsConfig,
    meta_dosing_config: MetaDosingConfig,
    *,
    retry_on_invalid: bool = True,
    idx: int = 0,
) -> tuple[StudyJSON, int]:
    """Generate a full simulation and convert it into a :class:`StudyJSON` record.

    Parameters
    ----------
    meta_study_config:
        Sampling configuration describing the population and numerical solver.
        If meta_study_config.simple_mode is True, uses simplified synthetic data.
    observation_config:
        Configuration for the observation strategy used to extract measurements
        from the raw simulation. All generated observations are stored under
        the ``context`` section of the returned study.
    meta_dosing_config:
        Configuration describing the dosing regimen for each simulated
        individual.
    retry_on_invalid:
        When ``True`` (default) the function retries simulation sampling if the
        generated trajectories are numerically invalid.
    idx:
        Internal recursion depth counter exposed for debugging and testing.

    Returns
    -------
    tuple[StudyJSON, int]
        Canonical JSON representation of the simulated study with all
        individuals stored in the ``context`` field and an empty ``target``
        list, alongside the number of failed attempts before obtaining the
        valid simulation.
    """
    if getattr(meta_study_config, "simple_mode", False):
        # Simplified synthetic generation: there is no mechanistic study
        # config, so the metadata below falls back to a default substance name.
        (
            full_sim,
            full_times,
            dosing_amounts,
            _dosing_routes,
            _time_points,
            time_scales,
        ) = _generate_simple_simulation(meta_study_config)
        study_config = None  # getattr(None, "drug_id", default) yields the default
        dosing_config_array = [
            DosingConfig(dose=float(d), route="", time=0.0) for d in dosing_amounts
        ]
        failed_attempts = 0
    else:
        # Mechanistic simulation path.
        (
            full_sim,
            full_times,
            dosing_amounts,
            _dosing_routes,
            _time_points,
            time_scales,
            study_config,
            dosing_config_array,
            failed_attempts,
        ) = _generate_full_simulation(
            meta_study_config,
            meta_dosing_config,
            retry_on_invalid=retry_on_invalid,
            idx=idx,
        )

    # Extract sparse observations from the dense simulation.
    observation_strategy = ObservationStrategyFactory.from_config(
        observation_config, meta_study_config
    )
    obs_out, time_out, mask_out, rem_sim, rem_time, rem_mask, _ = observation_strategy.generate(
        full_simulation=full_sim,
        full_simulation_times=full_times,
        time_scales=time_scales,
    )

    context: list[IndividualJSON] = []
    num_individuals = full_sim.shape[0]

    for ind_idx in range(num_individuals):
        # Keep only the masked (observed) entries for this individual.
        mask = mask_out[ind_idx].to(torch.bool)
        observations = obs_out[ind_idx][mask].tolist()
        observation_times = time_out[ind_idx][mask].tolist()

        _ensure_strictly_increasing_observations(
            observation_times,
            observations,
            individual_id=f"context_{ind_idx}",
        )

        individual: IndividualJSON = {
            "name_id": f"context_{ind_idx}",
            "observations": observations,
            "observation_times": observation_times,
        }

        # Optionally attach the unobserved remainder of the trajectory.
        if rem_sim is not None and rem_time is not None and rem_mask is not None:
            rem_mask_row = rem_mask[ind_idx].to(torch.bool)
            if rem_mask_row.any():
                individual["remaining"] = rem_sim[ind_idx][rem_mask_row].tolist()
                individual["remaining_times"] = rem_time[ind_idx][rem_mask_row].tolist()

        # Record dosing only when a non-trivial dose or route exists.
        dosing_cfg = dosing_config_array[ind_idx]
        dose = float(dosing_amounts[ind_idx].item())
        route = getattr(dosing_cfg, "route", "")
        dosing_time = float(getattr(dosing_cfg, "time", 0.0))

        if dose or route:
            individual["dosing"] = [dose]
            individual["dosing_type"] = [route]
            individual["dosing_times"] = [dosing_time]
            individual["dosing_name"] = [route]

        context.append(individual)

    study_json: StudyJSON = {
        "context": context,
        "target": [],
        "meta_data": {
            "study_name": f"simulated_study_{idx}",
            "substance_name": getattr(study_config, "drug_id", "simulated_substance"),
        },
    }

    return study_json, failed_attempts
863
+
864
+
865
def prepare_full_simulation_with_repeated_targets(
    meta_study_config: MetaStudyConfig,
    meta_dosing_config: MetaDosingConfig,
    n_targets: int,
    *,
    different_dosing: bool = False,
    retry_on_invalid: bool = True,
    idx: int = 0,
):
    """
    Generate a context simulation (normal dosing) plus a new set of target
    individuals drawn from the same study population.

    Parameters
    ----------
    meta_study_config:
        Sampling configuration for the PK population and solver.
    meta_dosing_config:
        Dosing-distribution configuration used for context and targets.
    n_targets:
        Number of target individuals to generate.
    different_dosing:
        If ``False`` (default), all target individuals share one repeated
        dosing configuration.
        If ``True``, each target individual gets an independent dosing sample
        from the same distribution used for context individuals.
    retry_on_invalid:
        When ``True`` (default), an invalid simulation triggers a full
        re-sample of the study; otherwise a :class:`RuntimeError` is raised.
    idx:
        Retry depth counter used for diagnostics.

    Returns
    -------
    context_sim, context_times,
    target_sim, target_times,
    dosing_amounts_ctx, dosing_routes_ctx,
    dosing_amounts_tgt, dosing_routes_tgt,
    time_points, time_scales
    """
    study_config = sample_study_config(meta_study_config)
    indiv_config_array = sample_individual_configs(study_config)
    time_scales = derive_timescale_parameters(study_config, meta_study_config)

    time_points = torch.linspace(
        meta_study_config.time_start,
        meta_study_config.time_stop,
        meta_study_config.time_num_steps,
        dtype=torch.float32,
    )

    # Context part: one independent dosing sample per context individual.
    local_meta_dosing_ctx = replace(
        meta_dosing_config, num_individuals=study_config.num_individuals
    )
    dosing_config_array_ctx = sample_dosing_configs(local_meta_dosing_ctx)

    full_sim, full_times, dosing_amounts_all, dosing_routes_all = sample_study(
        indiv_config_array,
        dosing_config_array_ctx,
        time_points,
        meta_study_config.solver_method,
    )
    if not is_valid_simulation(full_sim):
        if retry_on_invalid:
            # Re-sample the whole study; propagate retry_on_invalid so the
            # caller's choice also applies to subsequent attempts.
            return prepare_full_simulation_with_repeated_targets(
                meta_study_config,
                meta_dosing_config,
                n_targets,
                different_dosing=different_dosing,
                retry_on_invalid=retry_on_invalid,
                idx=idx + 1,
            )
        raise RuntimeError("Invalid context simulation")

    context_sim, context_times, ctx_idx = split_context_only(full_sim, full_times)
    dosing_amounts_ctx = dosing_amounts_all[ctx_idx]
    dosing_routes_ctx = dosing_routes_all[ctx_idx]

    # Target part: fresh individuals from the same study population.
    indiv_cfg_targets = sample_individual_configs(study_config, n=n_targets)
    local_meta_dosing_tgt = replace(meta_dosing_config, num_individuals=n_targets)
    if different_dosing:
        dosing_config_array_tgt = sample_dosing_configs(local_meta_dosing_tgt)
    else:
        dosing_config_array_tgt = sample_dosing_configs_repeated_target(
            local_meta_dosing_tgt, n_targets
        )

    full_sim_tgt, full_times_tgt, dosing_amounts_tgt, dosing_routes_tgt = sample_study(
        indiv_cfg_targets,
        dosing_config_array_tgt,
        time_points,
        meta_study_config.solver_method,
    )
    if not is_valid_simulation(full_sim_tgt):
        if retry_on_invalid:
            return prepare_full_simulation_with_repeated_targets(
                meta_study_config,
                meta_dosing_config,
                n_targets,
                different_dosing=different_dosing,
                retry_on_invalid=retry_on_invalid,
                idx=idx + 1,
            )
        raise RuntimeError("Invalid target simulation")

    _, _, target_sim, target_times, _, tgt_idx = split_simulations_repeated_target(
        full_sim_tgt, full_times_tgt
    )

    return (
        context_sim,
        context_times,
        target_sim,
        target_times,
        dosing_amounts_ctx,
        dosing_routes_ctx,
        dosing_amounts_tgt[tgt_idx],
        dosing_routes_tgt[tgt_idx],
        time_points,
        time_scales,
    )
978
+
979
+
980
def prepare_full_simulation_list_with_repeated_targets(
    meta_study_config: MetaStudyConfig,
    meta_dosing_config: MetaDosingConfig,
    n_targets: int,
    num_of_different_dosages: int,
    *,
    retry_on_invalid: bool = True,
    idx: int = 0,
):
    """Generate one shared context and ``L`` target sets with repeated dosing.

    Parameters
    ----------
    meta_study_config:
        Sampling configuration controlling PK population and solver behaviour.
    meta_dosing_config:
        Dosing-distribution configuration used for both context and targets.
    n_targets:
        Number of target individuals for each dosing condition.
    num_of_different_dosages:
        Number of target dosing conditions ``L``.
    retry_on_invalid:
        Whether to retry sampling when numerical invalid simulations are found.
    idx:
        Retry depth / attempt index used for diagnostics.

    Returns
    -------
    tuple
        ``(context_sim, context_times, dosing_amounts_ctx, dosing_routes_ctx,``
        ``target_simulations, target_times_list, target_dosing_amounts_list,``
        ``target_dosing_routes_list, time_points, time_scales)`` where each
        target list has length ``num_of_different_dosages``.

    Raises
    ------
    ValueError
        If ``num_of_different_dosages`` is negative.
    RuntimeError
        If an invalid simulation is produced and ``retry_on_invalid`` is
        ``False``.
    """

    if num_of_different_dosages < 0:
        raise ValueError("num_of_different_dosages must be non-negative")

    study_config = sample_study_config(meta_study_config)
    indiv_config_array = sample_individual_configs(study_config)
    time_scales = derive_timescale_parameters(study_config, meta_study_config)

    # [T]
    time_points = torch.linspace(
        meta_study_config.time_start,
        meta_study_config.time_stop,
        meta_study_config.time_num_steps,
        dtype=torch.float32,
    )

    # Context is sampled exactly once.
    local_meta_dosing_ctx = replace(
        meta_dosing_config, num_individuals=study_config.num_individuals
    )
    dosing_config_array_ctx = sample_dosing_configs(local_meta_dosing_ctx)
    full_sim, full_times, dosing_amounts_all, dosing_routes_all = sample_study(
        indiv_config_array,
        dosing_config_array_ctx,
        time_points,
        meta_study_config.solver_method,
    )
    if not is_valid_simulation(full_sim):
        if retry_on_invalid:
            # Propagate retry_on_invalid so the caller's choice also applies
            # to subsequent attempts.
            return prepare_full_simulation_list_with_repeated_targets(
                meta_study_config,
                meta_dosing_config,
                n_targets,
                num_of_different_dosages,
                retry_on_invalid=retry_on_invalid,
                idx=idx + 1,
            )
        raise RuntimeError("Invalid context simulation")

    # context_sim: [N_ctx, T], context_times: [N_ctx, T]
    context_sim, context_times, ctx_idx = split_context_only(full_sim, full_times)
    dosing_amounts_ctx = dosing_amounts_all[ctx_idx]
    dosing_routes_ctx = dosing_routes_all[ctx_idx]

    # Keep the same target PK individuals across all dosing conditions so that
    # only dosing changes across list elements.
    indiv_cfg_targets = sample_individual_configs(study_config, n=n_targets)
    local_meta_dosing_tgt = replace(meta_dosing_config, num_individuals=n_targets)

    target_simulations = []
    target_times_list = []
    target_dosing_amounts_list = []
    target_dosing_routes_list = []
    seen_dosing_signatures: set[tuple[str, float]] = set()

    for _ in range(num_of_different_dosages):
        attempts = 0
        while True:
            attempts += 1
            dosing_config_array_tgt = sample_dosing_configs_repeated_target(
                local_meta_dosing_tgt, n_targets
            )

            # Ensure distinct dosing regimens across list elements.
            dosing_signature = ("", 0.0)
            if n_targets > 0 and len(dosing_config_array_tgt) > 0:
                first_cfg = dosing_config_array_tgt[0]
                dosing_signature = (
                    str(getattr(first_cfg, "route", "")),
                    float(getattr(first_cfg, "dose", 0.0)),
                )
                if dosing_signature in seen_dosing_signatures and num_of_different_dosages > 1:
                    if attempts < 100:
                        continue
                    logger.warning(
                        "Could not sample a unique repeated target dosing signature after %d attempts.",
                        attempts,
                    )

            full_sim_tgt, full_times_tgt, dosing_amounts_tgt, dosing_routes_tgt = sample_study(
                indiv_cfg_targets,
                dosing_config_array_tgt,
                time_points,
                meta_study_config.solver_method,
            )
            if not is_valid_simulation(full_sim_tgt):
                # First retry locally (re-sample dosing only); after 100
                # attempts fall back to re-sampling the whole study.
                if retry_on_invalid and attempts < 100:
                    continue
                if retry_on_invalid:
                    return prepare_full_simulation_list_with_repeated_targets(
                        meta_study_config,
                        meta_dosing_config,
                        n_targets,
                        num_of_different_dosages,
                        retry_on_invalid=retry_on_invalid,
                        idx=idx + 1,
                    )
                raise RuntimeError("Invalid target simulation")

            _, _, target_sim, target_times, _, tgt_idx = split_simulations_repeated_target(
                full_sim_tgt, full_times_tgt
            )

            target_simulations.append(target_sim)
            target_times_list.append(target_times)
            target_dosing_amounts_list.append(dosing_amounts_tgt[tgt_idx])
            target_dosing_routes_list.append(dosing_routes_tgt[tgt_idx])
            if n_targets > 0:
                seen_dosing_signatures.add(dosing_signature)
            break

    return (
        context_sim,
        context_times,
        dosing_amounts_ctx,
        dosing_routes_ctx,
        target_simulations,
        target_times_list,
        target_dosing_amounts_list,
        target_dosing_routes_list,
        time_points,
        time_scales,
    )
1135
+
1136
+
1137
def prepare_ensemble_of_simulations(
    meta_study_config: MetaStudyConfig,
    observation_config: ObservationsConfig,
    meta_dosing_config: MetaDosingConfig,
    number_of_samples: int,
    file_name: Optional[str] = None,
    group_size: Optional[int] = None,
) -> tuple[list[StudyJSON] | list[list[StudyJSON]], float]:
    """Generate an ensemble of simulated studies.

    Repeatedly invokes :func:`prepare_full_simulation_to_study_json` to
    produce ``number_of_samples`` independent simulations. When ``file_name``
    is given, the (flat, ungrouped) ensemble is serialized to JSON for
    reproducibility and downstream processing.

    Parameters
    ----------
    meta_study_config:
        Sampling configuration controlling the pharmacokinetic population and
        solver settings.
    observation_config:
        Observation strategy applied to each generated simulation.
    meta_dosing_config:
        Configuration describing the dosing regimen per simulated individual.
    number_of_samples:
        Number of simulations to generate.
    file_name:
        Optional path used to persist the generated ensemble as a JSON file.
    group_size:
        Optional number of studies per group. When provided, the return value
        is a list of lists with ``group_size`` elements each; trailing studies
        that do not fill a complete group are dropped.

    Returns
    -------
    tuple[list[StudyJSON] | list[list[StudyJSON]], float]
        Ensemble of simulated studies (flat or grouped) and the proportion of
        failed simulation attempts encountered while generating the ensemble.
    """

    ensemble: list[StudyJSON] = []
    failures = 0
    for sample_idx in range(number_of_samples):
        study_json, n_failed = prepare_full_simulation_to_study_json(
            meta_study_config=meta_study_config,
            observation_config=observation_config,
            meta_dosing_config=meta_dosing_config,
            idx=sample_idx,
        )
        ensemble.append(study_json)
        failures += n_failed

    # Persist the raw (ungrouped) ensemble when a path was supplied.
    if file_name:
        Path(file_name).write_text(json.dumps(ensemble, indent=2))

    # Failure rate over all attempts (successful + failed); 0.0 when nothing ran.
    attempts = failures + len(ensemble)
    failure_rate = (failures / attempts) if attempts > 0 else 0.0

    if group_size and group_size > 0:
        # Only full groups are kept; the incomplete tail is discarded.
        usable = (len(ensemble) // group_size) * group_size
        grouped = [
            ensemble[start : start + group_size]
            for start in range(0, usable, group_size)
        ]
        return grouped, failure_rate

    return ensemble, failure_rate
1208
+
1209
+
1210
def prepare_full_simulation_to_study_json_context_target(
    meta_study_config: MetaStudyConfig,
    observation_config: ObservationsConfig,
    meta_dosing_config_context: MetaDosingConfig,
    meta_dosing_config_target: MetaDosingConfig,
    *,
    retry_on_invalid: bool = True,
    idx: int = 0,
) -> tuple[StudyJSON, int]:
    """Generate a full simulation and convert it into a :class:`StudyJSON` record.
    Different dosing regimens are used for context and target individuals.

    Parameters
    ----------
    meta_study_config:
        Sampling configuration describing the population and numerical solver.
        If meta_study_config.simple_mode is True, uses simplified synthetic data.
    observation_config:
        Configuration for the observation strategy used to extract measurements
        from the raw simulation.
    meta_dosing_config_context:
        Configuration describing the dosing regimen for each simulated
        individual in the context set.
    meta_dosing_config_target:
        Configuration describing the dosing regimen for each simulated
        individual in the target set.
    retry_on_invalid:
        When ``True`` (default) the function retries simulation sampling if the
        generated trajectories are numerically invalid.
    idx:
        Internal recursion depth counter exposed for debugging and testing.

    Returns
    -------
    tuple[StudyJSON, int]
        Canonical JSON representation of the simulated study with the context
        individuals stored under ``context`` and the target individuals under
        ``target``, alongside the total number of failed attempts accumulated
        while generating both sections.
    """

    def prepare_section(name, meta_dosing_config):
        # Run one full simulation for this dosing config and convert every
        # individual into an IndividualJSON dict tagged "<name>_<index>".
        (
            full_sim,
            full_times,
            dosing_amounts,
            _dosing_routes,
            _time_points,
            time_scales,
            study_config,
            dosing_config_array,
            failed_attempts,
        ) = _generate_full_simulation(
            meta_study_config,
            meta_dosing_config,
            retry_on_invalid=retry_on_invalid,
            idx=idx,
        )

        observation_strategy = ObservationStrategyFactory.from_config(
            observation_config, meta_study_config
        )
        obs_out, time_out, mask_out, rem_sim, rem_time, rem_mask, _ = observation_strategy.generate(
            full_simulation=full_sim,
            full_simulation_times=full_times,
            time_scales=time_scales,
        )

        section: list[IndividualJSON] = []
        num_individuals = full_sim.shape[0]

        for ind_idx in range(num_individuals):
            # Keep only the time steps the strategy marked as observed.
            mask = mask_out[ind_idx].to(torch.bool)
            observations = obs_out[ind_idx][mask].tolist()
            observation_times = time_out[ind_idx][mask].tolist()

            _ensure_strictly_increasing_observations(
                observation_times,
                observations,
                individual_id=f"{name}_{ind_idx}",
            )

            individual: IndividualJSON = {
                "name_id": f"{name}_{ind_idx}",
                "observations": observations,
                "observation_times": observation_times,
            }

            # Optional "remaining" block is only attached when the strategy
            # produced remainder tensors AND this row has at least one entry.
            if rem_sim is not None and rem_time is not None and rem_mask is not None:
                rem_mask_row = rem_mask[ind_idx].to(torch.bool)
                if rem_mask_row.any():
                    individual["remaining"] = rem_sim[ind_idx][rem_mask_row].tolist()
                    individual["remaining_times"] = rem_time[ind_idx][rem_mask_row].tolist()

            dosing_cfg = dosing_config_array[ind_idx]
            dose = float(dosing_amounts[ind_idx].item())
            route = getattr(dosing_cfg, "route", "")
            dosing_time = float(getattr(dosing_cfg, "time", 0.0))

            # A zero dose with an empty route means "no dosing event": the
            # dosing keys are omitted entirely for that individual.
            if dose or route:
                individual["dosing"] = [dose]
                individual["dosing_type"] = [route]
                individual["dosing_times"] = [dosing_time]
                individual["dosing_name"] = [route]

            section.append(individual)

        return section, study_config, failed_attempts

    # Set RNG to have the same study config for both context and target.
    # NOTE(review): this re-seeds the *global* torch RNG, clobbering any
    # caller-side RNG state — confirm this is acceptable for all call sites.
    torch.manual_seed(42)
    context, study_config, failed_attempts_context = prepare_section(
        "context", meta_dosing_config_context
    )
    torch.manual_seed(42)
    target, _, failed_attempts_target = prepare_section("target", meta_dosing_config_target)

    study_json: StudyJSON = {
        "context": context,
        "target": target,
        "meta_data": {
            "study_name": f"simulated_study_{idx}",
            "substance_name": getattr(study_config, "drug_id", "simulated_substance"),
        },
    }
    failed_attempts = failed_attempts_context + failed_attempts_target

    return study_json, failed_attempts
sim_priors_pk/data/data_generation/dosing_models.py ADDED
File without changes
sim_priors_pk/data/data_generation/observations_classes.py ADDED
@@ -0,0 +1,1776 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from abc import ABC, abstractmethod
2
+ from typing import Callable, Optional, Tuple, List
3
+
4
+ import torch
5
+ from torch import Tensor
6
+ from torchtyping import TensorType
7
+ from sim_priors_pk.config_classes.data_config import ObservationsConfig, MetaStudyConfig
8
+ from sim_priors_pk.data.data_generation.observations_functions import fix_past_time_random_selection
9
+
10
+
11
+ def _sample_past_count_with_bias(
12
+ low: int,
13
+ high: int,
14
+ *,
15
+ generative_bias: bool,
16
+ generator: torch.Generator,
17
+ device: torch.device,
18
+ ) -> int:
19
+ """Sample the number of past observations under the configured bias mode."""
20
+
21
+ if high <= 0:
22
+ return 0
23
+
24
+ if generative_bias:
25
+ sample_zero = int(torch.randint(0, 2, (1,), generator=generator, device=device).item()) == 0
26
+ if sample_zero:
27
+ return 0
28
+
29
+ rest_low = max(1, low)
30
+ if rest_low > high:
31
+ return 0
32
+ if rest_low == high:
33
+ return rest_low
34
+ return int(
35
+ torch.randint(
36
+ rest_low,
37
+ high + 1,
38
+ (1,),
39
+ generator=generator,
40
+ device=device,
41
+ ).item()
42
+ )
43
+
44
+ if low >= high:
45
+ return int(high)
46
+
47
+ return int(torch.randint(low, high + 1, (1,), generator=generator, device=device).item())
48
+
49
+
50
class ObservationStrategy(ABC):
    """Abstract base for strategies turning raw simulations into observations.

    Subclasses implement :meth:`_generate_raw` / :meth:`_get_shapes_raw`; the
    public :meth:`generate` / :meth:`get_shapes` wrappers apply the
    ``add_rem`` flag from the observations config on top of the raw results.
    """

    def __init__(self, observations_config: ObservationsConfig, meta_config: MetaStudyConfig):
        self.observations_config = observations_config
        self.meta_config = meta_config

    def _drop_non_positive_times_from_mask(self, times: Tensor, mask: Tensor) -> Tensor:
        """Optionally invalidate observations at non-positive timestamps.

        When ``drop_time_zero_observations=True`` in :class:`ObservationsConfig`,
        entries with ``time <= 0`` are excluded from downstream sampling.
        """
        drop_flag = getattr(self.observations_config, "drop_time_zero_observations", False)
        return mask & (times > 0) if drop_flag else mask

    def generate(
        self, full_simulation: Tensor, full_simulation_times: Tensor, **kwargs
    ) -> Tuple[Tensor, ...]:
        """Delegate to the subclass generator and honour the ``add_rem`` flag."""
        obs, obs_time, obs_mask, rem_sim, rem_time, rem_mask = self._generate_raw(
            full_simulation, full_simulation_times, **kwargs
        )
        # Remainder tensors are suppressed entirely when not requested.
        if not self.observations_config.add_rem:
            rem_sim, rem_time, rem_mask = None, None, None
        return obs, obs_time, obs_mask, rem_sim, rem_time, rem_mask, None

    @abstractmethod
    def _generate_raw(
        self, full_simulation: Tensor, full_simulation_times: Tensor, **kwargs
    ) -> Tuple[
        Tensor,
        TensorType["B", "T_obs"],
        TensorType["B", "T_obs"],
        TensorType["B", "T_rem"],
        TensorType["B", "T_rem"],
        TensorType["B", "T_rem"],
    ]:
        """Generate observations and remaining sims raw, regardless of add_rem"""
        pass

    def get_shapes(self) -> Tuple[int, int]:
        """Return ``(max_obs, max_rem)``, zeroing ``max_rem`` when add_rem is off."""
        max_obs, max_rem = self._get_shapes_raw()
        return (max_obs, max_rem) if self.observations_config.add_rem else (max_obs, 0)

    @abstractmethod
    def _get_shapes_raw(self) -> Tuple[int, int]:
        """Return max observations and max remaining assuming add_rem=True"""
        pass
103
+
104
+
105
+ class PKPeakHalfLifeStrategy(ObservationStrategy):
106
+ """Observation strategy tailored to pharmacokinetic (PK) curves.
107
+
108
+ The strategy samples observations around the absorption peak and along the
109
+ elimination phase of a PK simulation. It uses a canonical grid composed of
110
+ four segments:
111
+
112
+ 1. Several points before the peak that are proportional to the configured
113
+ peak time.
114
+ 2. The peak itself.
115
+ 3. Several points after the peak spaced by multiples of the provided
116
+ half-life.
117
+ 4. Optional remainder points that are handed back to the caller when
118
+ ``add_rem`` is enabled.
119
+
120
+ For **synthetic simulations**, the strategy still uses this canonical grid
121
+ and nearest-neighbour alignment.
122
+
123
+ For **empirical data**, measurements are treated as already canonical:
124
+
125
+ * No canonical time grid construction.
126
+ * No time normalisation or template matching.
127
+ * No interpolation or re-scaling to canonical coordinates.
128
+
129
+ Empirical sequences are only padded / truncated to the internal capacity
130
+ implied by :class:`ObservationsConfig` and :class:`MetaStudyConfig`, and
131
+ then passed through the same past/future splitting logic.
132
+
133
+ Past/future splitting
134
+ ----------------------
135
+ When ``split_past_future=True``, the canonical sequence for each row is
136
+ split into:
137
+
138
+ * a *past* observation block of fixed width (``max_obs``), and
139
+ * an optional *remainder* block of width (``max_rem``).
140
+
141
+ In the default mode (no fixed past selection), the number of past points
142
+ is sampled according to ``generative_bias``:
143
+
144
+ * ``False`` samples in ``[min_past, max_past]``.
145
+ * ``True`` samples exactly ``0`` with probability 0.5 and, otherwise,
146
+ samples uniformly in ``[max(1, min_past), max_past]``.
147
+
148
+ Under ``generative_bias=False``, **short sequences** receive a special treatment: when
149
+ the number of valid canonical points is less than or equal to the
150
+ observation capacity, *all* valid points are placed in the observation
151
+ block and none are shifted into the remainder.
152
+
153
+ Fixed past selection
154
+ --------------------
155
+ Calling :meth:`fix_past_selection(k)` activates a strict mode in which
156
+ the strategy tries to expose exactly ``k`` earliest valid timestamps as
157
+ "past" for each series, subject to the following structural limits:
158
+
159
+ 1. The number of real data points available in the series.
160
+ 2. The observation capacity dictated by :meth:`_get_shapes_raw`.
161
+
162
+ Concretely, for each row:
163
+
164
+ * Let ``k`` be the fixed past count.
165
+ * Let ``total_valid`` be the number of valid canonical points.
166
+ * Let ``past_required = min(k, total_valid)``.
167
+
168
+ The observation block receives
169
+
170
+ * ``obs_count = min(past_required, max_obs)`` earliest valid points.
171
+
172
+ If ``past_required > obs_count`` (for example because ``k`` exceeds the
173
+ number of observation slots), the remaining required past events
174
+ ``past_required - obs_count`` are the *first entries* in the remainder
175
+ block (subject to the remainder capacity). This guarantees that, as long
176
+ as data and shapes allow, the first ``k`` valid timestamps appear in
177
+ ``obs`` + ``rem`` before any later timestamps.
178
+
179
+ Calling :meth:`release_past_selection()` returns to the default stochastic
180
+ behaviour governed by ``min_past``/``max_past``.
181
+ """
182
+
183
+ _PEAK_PHASE_MULTIPLIERS = (0.1, 0.2, 0.5, 0.8)
184
+ _POST_PEAK_HALF_LIFE_MULTIPLIERS = (
185
+ 0.25,
186
+ 0.50,
187
+ 1.00,
188
+ 2.00,
189
+ 4.00,
190
+ 6.00,
191
+ 8.00,
192
+ 9.00,
193
+ 14.0,
194
+ 19.0,
195
+ 30.0,
196
+ )
197
+ _RAW_CANONICAL_POINTS = len(_PEAK_PHASE_MULTIPLIERS) + 1 + len(_POST_PEAK_HALF_LIFE_MULTIPLIERS)
198
+
199
    def __init__(
        self, observations_config: ObservationsConfig, meta_config: MetaStudyConfig
    ) -> None:
        """Cache the observation-sampling settings read from the configs.

        Parameters
        ----------
        observations_config:
            Supplies ``max_num_obs``, the past/future split flag, the
            ``min_past``/``max_past`` bounds and ``generative_bias`` used
            throughout the strategy.
        meta_config:
            Study-level configuration forwarded to the base class.
        """
        super().__init__(observations_config, meta_config)
        self.max_num_obs = observations_config.max_num_obs
        self.split_past_future = observations_config.split_past_future
        self.min_past = observations_config.min_past
        self.max_past = observations_config.max_past
        self.generative_bias = observations_config.generative_bias
        # None → default random selection. When set, the strategy enforces a
        # strict fixed-past semantics as documented above.
        self._fixed_past_obs_count: Optional[int] = None
211
+
212
+ def fix_past_selection(self, obs_count: int) -> None:
213
+ """Activate strict ``k``-past behaviour.
214
+
215
+ When this mode is active and ``split_past_future=True``, every call to
216
+ :meth:`generate` or :meth:`generate_empirical` will:
217
+
218
+ * expose up to ``obs_count`` earliest valid timestamps in the
219
+ observation block, bounded by the available data and the observation
220
+ capacity;
221
+ * place any additional required past events (when ``obs_count`` is
222
+ larger than the observation capacity) at the *front* of the remainder
223
+ block (when a remainder is present).
224
+
225
+ The strategy is allowed to allocate fewer than ``obs_count`` past
226
+ events only when:
227
+
228
+ * the series contains fewer real data points than ``obs_count``, or
229
+ * the observation/remainder shapes leave insufficient slots.
230
+
231
+ In all other cases the earliest valid timestamps are allocated in the
232
+ order: observation block first, then remainder.
233
+ """
234
+
235
+ if not self.split_past_future:
236
+ # No split → fixed past count is meaningless.
237
+ return
238
+
239
+ if obs_count < self.min_past or obs_count > self.max_past:
240
+ raise ValueError(
241
+ "Fixed past observation count must lie within the configured min/max bounds."
242
+ )
243
+ self._fixed_past_obs_count = int(obs_count)
244
+
245
    def release_past_selection(self) -> None:
        """Return to the default random past selection behaviour.

        Clears any count previously set via :meth:`fix_past_selection`; the
        number of past points is again sampled according to the configured
        ``min_past``/``max_past`` bounds and ``generative_bias``.
        """
        self._fixed_past_obs_count = None
248
+
249
+ @classmethod
250
+ def _build_canonical_grid(
251
+ cls,
252
+ *,
253
+ t_peak: float,
254
+ t_half: float,
255
+ device: torch.device,
256
+ dtype: torch.dtype,
257
+ ) -> Tensor:
258
+ """Construct the canonical grid for a single simulation.
259
+
260
+ The grid covers the pre-peak, peak and post-peak regime of the curve by
261
+ scaling two fundamental quantities supplied at runtime: the time of the
262
+ peak concentration ``t_peak`` and the half-life ``t_half``. Both values
263
+ are expected to be expressed in the same units as the simulation time
264
+ axis.
265
+ """
266
+ before_peak = [mult * t_peak for mult in cls._PEAK_PHASE_MULTIPLIERS]
267
+ after_peak = [t_peak + mult * t_half for mult in cls._POST_PEAK_HALF_LIFE_MULTIPLIERS]
268
+ values = before_peak + [t_peak] + after_peak
269
+ return torch.tensor(values, device=device, dtype=dtype)
270
+
271
+ def _canonical_grid_capacity(self) -> int:
272
+ """Return the number of canonical grid points available.
273
+
274
+ The capacity is the minimum between the simulator resolution and the
275
+ theoretical number of canonical points. This ensures that the
276
+ observation tensors never attempt to gather indices outside the
277
+ original simulation.
278
+ """
279
+ time_steps = getattr(self.meta_config, "time_num_steps", self.max_num_obs)
280
+ return max(
281
+ 0,
282
+ min(int(self.max_num_obs), int(time_steps), self._RAW_CANONICAL_POINTS),
283
+ )
284
+
285
+ def _get_shapes_raw(self) -> Tuple[int, int]:
286
+ """Compute the maximum number of observation and remainder slots.
287
+
288
+ Returns
289
+ -------
290
+ max_obs, max_rem : int, int
291
+ * ``max_obs`` – maximum number of observation time-steps.
292
+ * ``max_rem`` – maximum number of remainder time-steps when
293
+ ``add_rem`` is enabled.
294
+
295
+ Raises
296
+ ------
297
+ ValueError
298
+ If a past/future split is requested but the canonical capacity
299
+ cannot satisfy the configured ``min_past`` requirement.
300
+ """
301
+ canonical_cap = self._canonical_grid_capacity()
302
+ if canonical_cap == 0:
303
+ return 0, 0
304
+
305
+ if self.split_past_future:
306
+ if canonical_cap < self.min_past:
307
+ raise ValueError("Canonical grid capacity is smaller than the configured min_past")
308
+ max_obs = min(self.max_past, canonical_cap)
309
+ max_rem = max(0, canonical_cap - self.min_past)
310
+ else:
311
+ max_obs = canonical_cap
312
+ max_rem = canonical_cap
313
+
314
+ return max_obs, max_rem
315
+
316
+ @staticmethod
317
+ def _deduplicate_sorted_indices(
318
+ idx: Tensor, valid_mask: Optional[Tensor] = None
319
+ ) -> Tuple[Tensor, Tensor]:
320
+ """Collapse repeated gather indices while preserving alignment.
321
+
322
+ ``idx`` is expected to be monotonically increasing. Consecutive
323
+ duplicates are collapsed into a single entry at the front of the tensor
324
+ and the corresponding ``valid_mask`` entries are shifted accordingly.
325
+ """
326
+ if valid_mask is None:
327
+ valid_mask = torch.ones_like(idx, dtype=torch.bool)
328
+
329
+ if idx.numel() <= 1:
330
+ return idx, valid_mask
331
+
332
+ duplicate_mask = torch.zeros_like(idx, dtype=torch.bool)
333
+ duplicate_mask[1:] = idx[1:] == idx[:-1]
334
+
335
+ if not duplicate_mask.any():
336
+ return idx, valid_mask
337
+
338
+ unique_mask = ~duplicate_mask
339
+ kept_idx = idx[unique_mask]
340
+ duplicate_idx = idx[duplicate_mask]
341
+
342
+ padded_idx = torch.empty_like(idx)
343
+ padded_idx[: kept_idx.numel()] = kept_idx
344
+ padded_idx[kept_idx.numel() :] = duplicate_idx
345
+
346
+ kept_valid = valid_mask[unique_mask]
347
+ padded_mask = torch.zeros_like(valid_mask)
348
+ padded_mask[: kept_valid.numel()] = kept_valid
349
+
350
+ return padded_idx, padded_mask
351
+
352
    def _assemble_from_canonical(
        self,
        canonical_vals: Tensor,
        canonical_times: Tensor,
        canonical_mask: Tensor,
        *,
        generator: Optional[torch.Generator] = None,
    ) -> Tuple[Tensor, Tensor, Tensor, Optional[Tensor], Optional[Tensor], Optional[Tensor]]:
        """Convert canonical tensors into output observations.

        The canonical representation stores **all** admissible samples for a
        batch element. This helper slices the canonical tensors into the
        "past" observations that will be returned to the caller and (when
        requested) the "future" remainder.

        Allocation invariants
        ---------------------
        For each batch row:

        * Let ``valid_idx`` be the indices where ``canonical_mask`` is True,
          sorted in ascending order.
        * The observation block always receives the **earliest**
          ``obs_count`` indices from ``valid_idx``.
        * The remainder block (when present) receives later indices only; it
          never contains timestamps that precede those in the observation block.
        * Under ``generative_bias=False``, short sequences
          (``total_valid <= max_obs``) keep all valid points in the
          observation block and do not shift points to the remainder.

        When :meth:`fix_past_selection(k)` is active, we define::

            past_required = min(k, total_valid)

        and allocate:

        * ``obs_count = min(past_required, max_obs)`` to the observation
          block; and
        * any surplus past events ``past_required - obs_count`` at the **front**
          of the remainder block (subject to the remainder capacity), followed
          by any truly future points.

        Releasing the fixed selection returns to the stochastic behaviour
        controlled by ``generative_bias``.
        """
        max_obs, max_rem = self._get_shapes_raw()
        device = canonical_vals.device
        dtype = canonical_vals.dtype
        batch, _ = canonical_vals.shape

        # Output buffers are zero-filled; masks mark which slots were written.
        obs_out = torch.zeros(batch, max_obs, dtype=dtype, device=device)
        obs_time = torch.zeros_like(obs_out)
        obs_mask = torch.zeros(batch, max_obs, dtype=torch.bool, device=device)

        rem_sim = rem_time = rem_mask = None
        if max_rem > 0:
            rem_sim = torch.zeros(batch, max_rem, dtype=dtype, device=device)
            rem_time = torch.zeros_like(rem_sim)
            rem_mask = torch.zeros(batch, max_rem, dtype=torch.bool, device=device)

        # Fall back to the process-global generator so draws still respond to
        # torch.manual_seed when the caller does not supply a generator.
        gen = generator if generator is not None else torch.default_generator

        for row in range(batch):
            valid_idx = canonical_mask[row].nonzero(as_tuple=True)[0]
            total_valid = int(valid_idx.numel())
            if total_valid == 0:
                # Row has no admissible samples; its masks stay all-False.
                continue

            fixed_k = self._fixed_past_obs_count if self.split_past_future else None

            # ------------------------------------------------------------------
            # 1) Decide obs_count
            # ------------------------------------------------------------------
            if self.split_past_future and fixed_k is not None:
                # Strict fixed-past semantics. Structural limits:
                #   - real data (total_valid)
                #   - observation capacity (max_obs)
                past_required = min(fixed_k, total_valid)
                obs_capacity = min(max_obs, total_valid)
                obs_count = min(past_required, obs_capacity)
            else:
                # Default stochastic behaviour; the short-series fix is kept
                # for the non-biased mode only.
                if self.split_past_future:
                    low = min(self.min_past, total_valid)
                    high = min(self.max_past, total_valid)

                    # NOTE: this draw is made unconditionally so the RNG
                    # stream advances identically in both branches below.
                    sampled = _sample_past_count_with_bias(
                        low=low,
                        high=high,
                        generative_bias=self.generative_bias,
                        generator=gen,
                        device=device,
                    )

                    if (not self.generative_bias) and total_valid <= max_obs:
                        # Short-series fix: never push valid points into the
                        # remainder just to satisfy a random split.
                        obs_count = total_valid
                    else:
                        obs_count = min(sampled, max_obs)
                else:
                    obs_count = min(total_valid, max_obs)

            # Safety clamp.
            obs_count = max(0, min(obs_count, min(max_obs, total_valid)))

            # ------------------------------------------------------------------
            # 2) Fill observation block (earliest obs_count indices)
            # ------------------------------------------------------------------
            if obs_count > 0:
                take = valid_idx[:obs_count]
                obs_out[row, :obs_count] = canonical_vals[row, take]
                obs_time[row, :obs_count] = canonical_times[row, take]
                obs_mask[row, :obs_count] = True

            # ------------------------------------------------------------------
            # 3) Fill remainder block (if enabled)
            # ------------------------------------------------------------------
            if rem_sim is not None:
                if self.split_past_future and fixed_k is not None:
                    # Remaining required past events plus genuine future.
                    past_required = min(fixed_k, total_valid)
                    # indices that are still part of the fixed past window
                    # but did not fit into the observation block
                    extra_past_idx = valid_idx[obs_count:past_required]
                    future_idx = valid_idx[past_required:]

                    # Concatenate with extra-past first so surplus past events
                    # land at the front of the remainder, before any future.
                    candidates: List[Tensor] = []
                    if extra_past_idx.numel() > 0:
                        candidates.append(extra_past_idx)
                    if future_idx.numel() > 0:
                        candidates.append(future_idx)
                    if candidates:
                        remainder_candidates = torch.cat(candidates, dim=0)
                    else:
                        # Nothing left for the remainder: keep an empty index
                        # tensor so the count below resolves to zero.
                        remainder_candidates = valid_idx.new_empty((0,), dtype=valid_idx.dtype)
                else:
                    # Default behaviour: everything after the obs window.
                    remainder_candidates = valid_idx[obs_count:]

                rem_count = min(int(remainder_candidates.numel()), max_rem)
                if rem_count > 0:
                    rem_idx = remainder_candidates[:rem_count]
                    rem_sim[row, :rem_count] = canonical_vals[row, rem_idx]
                    rem_time[row, :rem_count] = canonical_times[row, rem_idx]
                    rem_mask[row, :rem_count] = True

        return obs_out, obs_time, obs_mask, rem_sim, rem_time, rem_mask
500
+
501
    def _align_simulation_to_canonical(
        self,
        full_simulation: Tensor,
        full_simulation_times: Tensor,
        *,
        time_scales: Tensor,
        num_obs_sampler: Optional[Callable[[int], Tensor]] = None,
    ) -> Tuple[Tensor, Tensor, Tensor, Tensor]:
        """Gather canonical samples from a simulated PK curve.

        Synthetic behaviour is unchanged compared to the original strategy:
        we build a canonical grid, snap it to the nearest simulation times and
        optionally subsample points via ``num_obs_sampler``.
        """
        device = full_simulation.device
        dtype = full_simulation.dtype
        batch, _ = full_simulation.shape
        time_steps = int(full_simulation_times.size(1))

        # DataLoader workers may receive empty row slices (B=0). In that case
        # there is no reference timeline to align against; return an empty
        # canonical block and let _assemble_from_canonical create [B, *] outputs.
        if batch == 0 or time_steps == 0:
            zero = torch.zeros(batch, 0, dtype=dtype, device=device)
            mask = torch.zeros(batch, 0, dtype=torch.bool, device=device)
            return zero, zero, mask, time_scales.clone()

        canonical_cap = self._canonical_grid_capacity()
        if canonical_cap == 0:
            # Capacity can collapse to zero when the simulator resolution or
            # max_num_obs is zero; same empty-output contract as above.
            zero = torch.zeros(batch, 0, dtype=dtype, device=device)
            mask = torch.zeros(batch, 0, dtype=torch.bool, device=device)
            return zero, zero, mask, time_scales.clone()

        # Canonical grid anchored at this batch's (t_peak, t_half) scales,
        # truncated to the available capacity.
        grid = self._build_canonical_grid(
            t_peak=time_scales[0].item(),
            t_half=time_scales[1].item(),
            device=device,
            dtype=dtype,
        )[:canonical_cap]

        # NOTE(review): row 0 is used as the reference time axis and the same
        # gather indices are reused for every row — this assumes all batch
        # rows share one time grid; confirm with callers.
        ref_times = full_simulation_times[0]
        min_time = ref_times.min()
        max_time = ref_times.max()
        # Grid points outside the simulated window are flagged invalid.
        grid_valid_mask = (grid >= min_time) & (grid <= max_time)

        # Snap each grid point to its nearest simulated time, then sort the
        # indices (carrying the validity flags along) and collapse duplicates
        # so the gather indices are strictly increasing.
        idx = torch.cdist(grid[:, None], ref_times[:, None]).argmin(dim=1)
        idx, order = idx.sort()
        grid_valid_mask = grid_valid_mask[order]
        idx, grid_valid_mask = self._deduplicate_sorted_indices(idx, grid_valid_mask)

        gather_idx = idx[None, :].expand(batch, -1)
        batch_idx = torch.arange(batch, device=device)[:, None]

        canonical_vals = full_simulation[batch_idx, gather_idx]
        canonical_times = full_simulation_times[batch_idx, gather_idx]

        # Zero out slots whose grid point was out of range or a duplicate so
        # that padding is unambiguous downstream.
        invalid_slots = ~grid_valid_mask
        if invalid_slots.any():
            canonical_vals[:, invalid_slots] = 0
            canonical_times[:, invalid_slots] = 0

        # Per-row number of canonical points to keep; defaults to full capacity.
        if num_obs_sampler is None:
            total_counts = torch.full((batch,), canonical_cap, dtype=torch.long, device=device)
        else:
            sampled = num_obs_sampler(batch).to(device=device).long()
            total_counts = sampled.clamp(min=0, max=canonical_cap)

        max_valid = int(grid_valid_mask.sum().item())
        if max_valid == 0:
            total_counts.zero_()
        else:
            total_counts.clamp_(max=max_valid)

        # Rank valid slots left-to-right (-1 for invalid) so each row keeps
        # exactly its first ``total_counts`` valid canonical points.
        valid_order = grid_valid_mask.long().cumsum(dim=0) - 1
        valid_order = torch.where(
            grid_valid_mask,
            valid_order,
            torch.full_like(valid_order, -1, dtype=valid_order.dtype),
        )
        canonical_mask = grid_valid_mask[None, :] & (valid_order[None, :] < total_counts[:, None])
        canonical_mask = self._drop_non_positive_times_from_mask(canonical_times, canonical_mask)

        return canonical_vals, canonical_times, canonical_mask, time_scales.clone()
584
+
585
    def _align_empirical_to_canonical(
        self,
        empirical_obs: Tensor,
        empirical_times: Tensor,
        empirical_mask: Tensor,
    ) -> Tuple[Tensor, Tensor, Tensor]:
        """(Legacy) Project empirical observations onto the canonical grid.

        This method is retained for backward compatibility but is **not** used
        by :meth:`generate_empirical`, which now treats empirical data as
        already canonical. New code should avoid calling this helper.
        """
        device = empirical_obs.device
        dtype = empirical_obs.dtype
        batch, _ = empirical_obs.shape
        canonical_cap = self._canonical_grid_capacity()

        canonical_vals = torch.zeros(batch, canonical_cap, dtype=dtype, device=device)
        canonical_times = torch.zeros_like(canonical_vals)
        canonical_mask = torch.zeros(batch, canonical_cap, dtype=torch.bool, device=device)

        if canonical_cap == 0:
            return canonical_vals, canonical_times, canonical_mask

        for row in range(batch):
            valid_idx = empirical_mask[row].nonzero(as_tuple=True)[0]
            if valid_idx.numel() == 0:
                continue

            obs_row = empirical_obs[row, valid_idx]
            time_row = empirical_times[row, valid_idx]
            # Normalise times to roughly [0, 1]; the floor of 1.0 guards
            # against division by a (near-)zero last timestamp.
            max_time = torch.maximum(time_row.max(), torch.tensor(1.0, device=device))
            norm_time = time_row / max_time

            # Estimate t_peak from the sample maximum and t_half from the
            # first post-peak sample at or below half the peak value.
            peak_idx = obs_row.argmax().item()
            t_peak = norm_time[peak_idx].item()
            post_times = norm_time[peak_idx:]
            post_obs = obs_row[peak_idx:]
            half_level = obs_row[peak_idx] / 2
            below_half = (post_obs <= half_level).nonzero(as_tuple=True)[0]
            if below_half.numel() == 0:
                # Curve never decays to half the peak within the record:
                # fall back to the last observed (normalised) time.
                half_time = post_times[-1].item()
            else:
                half_time = post_times[below_half[0]].item()
            t_half = max(half_time - t_peak, 1e-3)

            # Canonical grid in normalised time, clamped into the span [0, 1].
            grid = self._build_canonical_grid(
                t_peak=t_peak if t_peak > 0 else 1e-3,
                t_half=t_half,
                device=device,
                dtype=dtype,
            )[:canonical_cap].clamp(max=1.0)

            # Map the grid back to the original time axis and pick the
            # nearest real measurement for each canonical slot.
            actual_grid = grid * max_time
            distances = torch.cdist(actual_grid[:, None], time_row[:, None])
            nearest = distances.argmin(dim=1)

            usable = min(time_row.numel(), grid.numel())
            if usable == 0:
                continue

            canonical_vals[row, :usable] = obs_row[nearest[:usable]]
            canonical_times[row, :usable] = time_row[nearest[:usable]]
            canonical_mask[row, :usable] = True

        canonical_mask = self._drop_non_positive_times_from_mask(canonical_times, canonical_mask)

        return canonical_vals, canonical_times, canonical_mask
653
+
654
+ def _prepare_empirical_as_canonical(
655
+ self,
656
+ empirical_obs: Tensor,
657
+ empirical_times: Tensor,
658
+ empirical_mask: Tensor,
659
+ ) -> Tuple[Tensor, Tensor, Tensor]:
660
+ """Treat empirical observations as already canonical.
661
+
662
+ This helper:
663
+
664
+ * does **not** build any canonical grid;
665
+ * does **not** normalise or re-scale time;
666
+ * simply copies valid empirical points in their original order into
667
+ fixed-size tensors, padding with zeros / False as needed.
668
+
669
+ The resulting tensors have width equal to the canonical capacity so
670
+ that they can be passed to :meth:`_assemble_from_canonical`.
671
+ """
672
+ device = empirical_obs.device
673
+ dtype = empirical_obs.dtype
674
+ batch, _ = empirical_obs.shape
675
+ canonical_cap = self._canonical_grid_capacity()
676
+
677
+ canonical_vals = torch.zeros(batch, canonical_cap, dtype=dtype, device=device)
678
+ canonical_times = torch.zeros_like(canonical_vals)
679
+ canonical_mask = torch.zeros(batch, canonical_cap, dtype=torch.bool, device=device)
680
+
681
+ if canonical_cap == 0:
682
+ return canonical_vals, canonical_times, canonical_mask
683
+
684
+ for row in range(batch):
685
+ valid_idx = empirical_mask[row].nonzero(as_tuple=True)[0]
686
+ if valid_idx.numel() == 0:
687
+ continue
688
+
689
+ take_count = min(int(valid_idx.numel()), canonical_cap)
690
+ take_idx = valid_idx[:take_count]
691
+
692
+ canonical_vals[row, :take_count] = empirical_obs[row, take_idx]
693
+ canonical_times[row, :take_count] = empirical_times[row, take_idx]
694
+ canonical_mask[row, :take_count] = True
695
+
696
+ canonical_mask = self._drop_non_positive_times_from_mask(canonical_times, canonical_mask)
697
+
698
+ return canonical_vals, canonical_times, canonical_mask
699
+
700
+ def _generate_raw(
701
+ self, full_simulation: Tensor, full_simulation_times: Tensor, **kwargs
702
+ ) -> Tuple[
703
+ Tensor, Tensor, Tensor, Optional[Tensor], Optional[Tensor], Optional[Tensor], Tensor
704
+ ]:
705
+ """Deterministic canonical PK sampling for synthetic simulations."""
706
+ time_scales: Optional[Tensor] = kwargs.get("time_scales")
707
+ if time_scales is None:
708
+ raise ValueError("time_scales must be provided for PKPeakHalfLifeStrategy")
709
+
710
+ canonical_vals, canonical_times, canonical_mask, rescaled = (
711
+ self._align_simulation_to_canonical(
712
+ full_simulation,
713
+ full_simulation_times,
714
+ time_scales=time_scales,
715
+ num_obs_sampler=kwargs.get("num_obs_sampler"),
716
+ )
717
+ )
718
+
719
+ obs_out, obs_time, obs_mask, rem_sim, rem_time, rem_mask = self._assemble_from_canonical(
720
+ canonical_vals,
721
+ canonical_times,
722
+ canonical_mask,
723
+ generator=kwargs.get("generator"),
724
+ )
725
+
726
+ return obs_out, obs_time, obs_mask, rem_sim, rem_time, rem_mask, rescaled
727
+
728
+ def _generate_random(
729
+ self,
730
+ full_simulation: Tensor,
731
+ full_simulation_times: Tensor,
732
+ *,
733
+ time_scales: Tensor,
734
+ generator: Optional[torch.Generator] = None,
735
+ ) -> Tuple[
736
+ Tensor, Tensor, Tensor, Optional[Tensor], Optional[Tensor], Optional[Tensor], Tensor
737
+ ]:
738
+ """Randomised variant of canonical observation generation.
739
+
740
+ The pre- and post-peak segments are sampled from uniform distributions
741
+ bounded by the canonical limits. This keeps the semantic meaning of the
742
+ selected points while injecting stochasticity that can improve
743
+ robustness during training.
744
+ """
745
+ device, dtype = full_simulation.device, full_simulation.dtype
746
+ batch = full_simulation.size(0)
747
+ time_steps = int(full_simulation_times.size(1))
748
+ if batch == 0 or time_steps == 0:
749
+ canonical_vals = torch.zeros(batch, 0, dtype=dtype, device=device)
750
+ canonical_times = torch.zeros(batch, 0, dtype=dtype, device=device)
751
+ canonical_mask = torch.zeros(batch, 0, dtype=torch.bool, device=device)
752
+ obs_out, obs_time, obs_mask, rem_sim, rem_time, rem_mask = self._assemble_from_canonical(
753
+ canonical_vals, canonical_times, canonical_mask, generator=generator
754
+ )
755
+ return obs_out, obs_time, obs_mask, rem_sim, rem_time, rem_mask, time_scales.clone()
756
+ t_peak, t_half = time_scales[0].item(), time_scales[1].item()
757
+
758
+ n_pre = len(self._PEAK_PHASE_MULTIPLIERS)
759
+ n_post = len(self._POST_PEAK_HALF_LIFE_MULTIPLIERS)
760
+
761
+ # Uniform samples before peak
762
+ pre_times = torch.rand(n_pre, device=device, dtype=dtype) * t_peak
763
+ # Always include the peak
764
+ peak_time = torch.tensor([t_peak], device=device, dtype=dtype)
765
+ # Uniform samples after peak
766
+ post_times = []
767
+ for mult in self._POST_PEAK_HALF_LIFE_MULTIPLIERS:
768
+ t_end = t_peak + mult * t_half
769
+ t_rand = torch.empty(1, device=device, dtype=dtype).uniform_(t_peak, t_end)
770
+ post_times.append(t_rand)
771
+ post_times = torch.cat(post_times, dim=0)
772
+
773
+ # Truncate to canonical capacity
774
+ grid = torch.cat([pre_times, peak_time, post_times], dim=0)
775
+ canonical_cap = self._canonical_grid_capacity()
776
+ grid = grid[:canonical_cap]
777
+
778
+ # Map grid to nearest simulation points
779
+ ref_times = full_simulation_times[0]
780
+ idx = torch.cdist(grid[:, None], ref_times[:, None]).argmin(dim=1)
781
+ idx, _ = idx.sort()
782
+ valid_mask = torch.ones_like(idx, dtype=torch.bool)
783
+ idx, valid_mask = self._deduplicate_sorted_indices(idx, valid_mask)
784
+ gather_idx = idx[None, :].expand(batch, -1)
785
+ batch_idx = torch.arange(batch, device=device)[:, None]
786
+
787
+ canonical_vals = full_simulation[batch_idx, gather_idx]
788
+ canonical_times = full_simulation_times[batch_idx, gather_idx]
789
+ invalid_slots = ~valid_mask
790
+ if invalid_slots.any():
791
+ canonical_vals[:, invalid_slots] = 0
792
+ canonical_times[:, invalid_slots] = 0
793
+
794
+ canonical_mask = valid_mask[None, :].expand(batch, -1).clone()
795
+ canonical_mask = self._drop_non_positive_times_from_mask(canonical_times, canonical_mask)
796
+
797
+ obs_out, obs_time, obs_mask, rem_sim, rem_time, rem_mask = self._assemble_from_canonical(
798
+ canonical_vals, canonical_times, canonical_mask, generator=generator
799
+ )
800
+ return obs_out, obs_time, obs_mask, rem_sim, rem_time, rem_mask, time_scales.clone()
801
+
802
+ def generate(
803
+ self,
804
+ full_simulation: Tensor,
805
+ full_simulation_times: Tensor,
806
+ **kwargs,
807
+ ) -> Tuple[
808
+ Tensor, Tensor, Tensor, Optional[Tensor], Optional[Tensor], Optional[Tensor], Tensor
809
+ ]:
810
+ """Generate PK observations for synthetic simulations.
811
+
812
+ With probability ``randomize_prob`` (default 0.5) the method delegates
813
+ to :meth:`_generate_random`; otherwise the deterministic
814
+ :meth:`_generate_raw` path is taken. Setting ``deterministic_only=True``
815
+ forces the deterministic branch. Both paths require ``time_scales`` and
816
+ honour the ``add_rem`` flag.
817
+ """
818
+ time_scales: Optional[Tensor] = kwargs.get("time_scales")
819
+ if time_scales is None:
820
+ raise ValueError("time_scales must be provided for PKPeakHalfLifeStrategy")
821
+
822
+ deterministic_only = kwargs.pop("deterministic_only", False)
823
+
824
+ use_random = False
825
+ if not deterministic_only:
826
+ use_random = torch.rand(()) < getattr(self, "randomize_prob", 0.5)
827
+
828
+ if use_random:
829
+ obs, obs_time, obs_mask, rem_sim, rem_time, rem_mask, rescaled = self._generate_random(
830
+ full_simulation,
831
+ full_simulation_times,
832
+ time_scales=time_scales,
833
+ generator=kwargs.get("generator"),
834
+ )
835
+ else:
836
+ obs, obs_time, obs_mask, rem_sim, rem_time, rem_mask, rescaled = self._generate_raw(
837
+ full_simulation,
838
+ full_simulation_times,
839
+ **kwargs,
840
+ )
841
+
842
+ if not self.observations_config.add_rem:
843
+ rem_sim = rem_time = rem_mask = None
844
+
845
+ return obs, obs_time, obs_mask, rem_sim, rem_time, rem_mask, rescaled
846
+
847
+ def generate_empirical(
848
+ self,
849
+ empirical_obs: Tensor,
850
+ empirical_times: Tensor,
851
+ empirical_mask: Tensor,
852
+ *,
853
+ generator: Optional[torch.Generator] = None,
854
+ ) -> Tuple[Tensor, Tensor, Tensor, Optional[Tensor], Optional[Tensor], Optional[Tensor]]:
855
+ """Generate observations from empirical data.
856
+
857
+ Empirical measurements are assumed to already live on their correct
858
+ time grid. This routine:
859
+
860
+ * does **not** perform canonical alignment or time normalisation;
861
+ * only pads / truncates sequences to match the internal capacity;
862
+ * applies past/future splitting via :meth:`_assemble_from_canonical`
863
+ using the configuration in :class:`ObservationsConfig`.
864
+
865
+ Synthetic simulations keep using the canonical alignment path.
866
+ """
867
+ canonical_vals, canonical_times, canonical_mask = self._prepare_empirical_as_canonical(
868
+ empirical_obs,
869
+ empirical_times,
870
+ empirical_mask,
871
+ )
872
+
873
+ obs, obs_time, obs_mask, rem_sim, rem_time, rem_mask = self._assemble_from_canonical(
874
+ canonical_vals,
875
+ canonical_times,
876
+ canonical_mask,
877
+ generator=generator,
878
+ )
879
+
880
+ if not self.observations_config.add_rem:
881
+ rem_sim = rem_time = rem_mask = None
882
+
883
+ return obs, obs_time, obs_mask, rem_sim, rem_time, rem_mask
884
+
885
+
886
+ class PKPeakHalfLifeStrategyOld(ObservationStrategy):
887
+ """Observation strategy tailored to pharmacokinetic (PK) curves.
888
+
889
+ The strategy samples observations around the absorption peak and along the
890
+ elimination phase of a PK simulation. It uses a canonical grid composed of
891
+ four segments:
892
+
893
+ 1. Several points before the peak that are proportional to the configured
894
+ peak time.
895
+ 2. The peak itself.
896
+ 3. Several points after the peak spaced by multiples of the provided
897
+ half-life.
898
+ 4. Optional remainder points that are handed back to the caller when
899
+ ``add_rem`` is enabled.
900
+
901
+ The resulting observation tensor can be optionally split into "past" and
902
+ "future" observations according to :class:`ObservationsConfig`.
903
+
904
+ Parameters
905
+ ----------
906
+ observations_config:
907
+ Simulation-level configuration that defines sampling constraints such
908
+ as ``max_num_obs`` or the minimum/maximum number of "past" points when
909
+ a split is requested.
910
+ meta_config:
911
+ Meta-study configuration. Only the ``time_num_steps`` attribute is
912
+ used and allows clamping the canonical grid to the resolution of the
913
+ simulator.
914
+ """
915
+
916
+ _PEAK_PHASE_MULTIPLIERS = (0.1, 0.2, 0.5, 0.8)
917
+ _POST_PEAK_HALF_LIFE_MULTIPLIERS = (
918
+ 0.25,
919
+ 0.50,
920
+ 1.00,
921
+ 2.00,
922
+ 4.00,
923
+ 6.00,
924
+ 8.00,
925
+ 9.00,
926
+ 14.0,
927
+ 19.0,
928
+ 30.0,
929
+ )
930
+ _RAW_CANONICAL_POINTS = len(_PEAK_PHASE_MULTIPLIERS) + 1 + len(_POST_PEAK_HALF_LIFE_MULTIPLIERS)
931
+
932
+ def __init__(
933
+ self, observations_config: ObservationsConfig, meta_config: MetaStudyConfig
934
+ ) -> None:
935
+ super().__init__(observations_config, meta_config)
936
+ self.max_num_obs = observations_config.max_num_obs
937
+ self.split_past_future = observations_config.split_past_future
938
+ self.min_past = observations_config.min_past
939
+ self.max_past = observations_config.max_past
940
+ self.generative_bias = observations_config.generative_bias
941
+ # ``None`` indicates that the number of past observations should be
942
+ # sampled according to the standard strategy. When populated it forces
943
+ # :meth:`_assemble_from_canonical` to always select the provided number
944
+ # of past observations (within the valid range).
945
+ self._fixed_past_obs_count: Optional[int] = None
946
+
947
+ def fix_past_selection(self, obs_count: int) -> None:
948
+ """Force the past observation count to ``obs_count`` when splitting.
949
+
950
+ The override is only applied when ``split_past_future`` is enabled. The
951
+ provided ``obs_count`` must fall within ``[min_past, max_past]``.
952
+ """
953
+
954
+ if not self.split_past_future:
955
+ return
956
+
957
+ if obs_count < self.min_past or obs_count > self.max_past:
958
+ raise ValueError(
959
+ "Fixed past observation count must lie within the configured min/max bounds."
960
+ )
961
+ self._fixed_past_obs_count = int(obs_count)
962
+
963
+ def release_past_selection(self) -> None:
964
+ """Return to the default random past selection behaviour."""
965
+
966
+ self._fixed_past_obs_count = None
967
+
968
+ @classmethod
969
+ def _build_canonical_grid(
970
+ cls,
971
+ *,
972
+ t_peak: float,
973
+ t_half: float,
974
+ device: torch.device,
975
+ dtype: torch.dtype,
976
+ ) -> Tensor:
977
+ """Construct the canonical grid for a single simulation.
978
+
979
+ The grid covers the pre-peak, peak and post-peak regime of the curve by
980
+ scaling two fundamental quantities supplied at runtime: the time of the
981
+ peak concentration ``t_peak`` and the half-life ``t_half``. Both values
982
+ are expected to be expressed in the same units as the simulation time
983
+ axis.
984
+
985
+ Parameters
986
+ ----------
987
+ t_peak:
988
+ Estimated time of the concentration peak.
989
+ t_half:
990
+ Estimated half-life used to position post-peak points.
991
+ device, dtype:
992
+ Torch device and dtype for the returned tensor so that it matches
993
+ the simulation tensors that will be gathered later on.
994
+
995
+ Returns
996
+ -------
997
+ torch.Tensor
998
+ One-dimensional tensor containing monotonically increasing times
999
+ representing the canonical sampling grid.
1000
+ """
1001
+ before_peak = [mult * t_peak for mult in cls._PEAK_PHASE_MULTIPLIERS]
1002
+ after_peak = [t_peak + mult * t_half for mult in cls._POST_PEAK_HALF_LIFE_MULTIPLIERS]
1003
+ values = before_peak + [t_peak] + after_peak
1004
+ return torch.tensor(values, device=device, dtype=dtype)
1005
+
1006
+ def _canonical_grid_capacity(self) -> int:
1007
+ """Return the number of canonical grid points available.
1008
+
1009
+ The capacity is the minimum between the simulator resolution and the
1010
+ theoretical number of canonical points. This ensures that the
1011
+ observation tensors never attempt to gather indices outside the
1012
+ original simulation.
1013
+
1014
+ Returns
1015
+ -------
1016
+ int
1017
+ Maximum number of grid points that can be sampled for each
1018
+ simulation in the batch.
1019
+ """
1020
+ time_steps = getattr(self.meta_config, "time_num_steps", self.max_num_obs)
1021
+ return max(
1022
+ 0,
1023
+ min(int(self.max_num_obs), int(time_steps), self._RAW_CANONICAL_POINTS),
1024
+ )
1025
+
1026
+ def _get_shapes_raw(self) -> Tuple[int, int]:
1027
+ """Compute the maximum number of observation and remainder slots.
1028
+
1029
+ The method applies the canonical grid capacity alongside the
1030
+ ``split_past_future`` configuration to decide how many points can be
1031
+ surfaced directly as observations and how many should be exposed as
1032
+ "remaining" (future) points.
1033
+
1034
+ Returns
1035
+ -------
1036
+ tuple[int, int]
1037
+ The first entry is the maximum number of observations. The second
1038
+ entry is the maximum number of remaining observations when
1039
+ ``add_rem`` is enabled.
1040
+
1041
+ Raises
1042
+ ------
1043
+ ValueError
1044
+ If a past/future split is requested but the canonical capacity
1045
+ cannot satisfy the configured ``min_past`` requirement.
1046
+ """
1047
+ canonical_cap = self._canonical_grid_capacity()
1048
+ if canonical_cap == 0:
1049
+ return 0, 0
1050
+
1051
+ if self.split_past_future:
1052
+ if canonical_cap < self.min_past:
1053
+ raise ValueError("Canonical grid capacity is smaller than the configured min_past")
1054
+ max_obs = min(self.max_past, canonical_cap)
1055
+ max_rem = max(0, canonical_cap - self.min_past)
1056
+ else:
1057
+ max_obs = canonical_cap
1058
+ max_rem = canonical_cap
1059
+
1060
+ return max_obs, max_rem
1061
+
1062
+ @staticmethod
1063
+ def _deduplicate_sorted_indices(
1064
+ idx: Tensor, valid_mask: Optional[Tensor] = None
1065
+ ) -> Tuple[Tensor, Tensor]:
1066
+ """Collapse repeated gather indices while preserving alignment."""
1067
+
1068
+ if valid_mask is None:
1069
+ valid_mask = torch.ones_like(idx, dtype=torch.bool)
1070
+
1071
+ if idx.numel() <= 1:
1072
+ return idx, valid_mask
1073
+
1074
+ duplicate_mask = torch.zeros_like(idx, dtype=torch.bool)
1075
+ duplicate_mask[1:] = idx[1:] == idx[:-1]
1076
+
1077
+ if not duplicate_mask.any():
1078
+ return idx, valid_mask
1079
+
1080
+ unique_mask = ~duplicate_mask
1081
+ kept_idx = idx[unique_mask]
1082
+ duplicate_idx = idx[duplicate_mask]
1083
+
1084
+ padded_idx = torch.empty_like(idx)
1085
+ padded_idx[: kept_idx.numel()] = kept_idx
1086
+ padded_idx[kept_idx.numel() :] = duplicate_idx
1087
+
1088
+ kept_valid = valid_mask[unique_mask]
1089
+ padded_mask = torch.zeros_like(valid_mask)
1090
+ padded_mask[: kept_valid.numel()] = kept_valid
1091
+
1092
+ return padded_idx, padded_mask
1093
+
1094
    def _assemble_from_canonical(
        self,
        canonical_vals: Tensor,
        canonical_times: Tensor,
        canonical_mask: Tensor,
        *,
        generator: Optional[torch.Generator] = None,
    ) -> Tuple[Tensor, Tensor, Tensor, Optional[Tensor], Optional[Tensor], Optional[Tensor]]:
        """Convert canonical tensors into output observations.

        The canonical representation stores **all** admissible samples for a
        batch element. This helper slices the canonical tensors into the
        "past" observations that will be returned to the caller and (when
        requested) the "future" remainder. The selection proceeds row by row:

        1. ``canonical_mask`` is inspected to identify the indices that contain
           valid information. These are the only points that may be surfaced.
        2. When ``split_past_future`` is ``False`` every valid point is treated
           as part of the observation history up to the configured capacity.
        3. Otherwise we randomly draw ``obs_count`` between ``min_past`` and
           ``max_past`` (capped by the number of valid canonical entries). The
           first ``obs_count`` indices become past observations while the
           remaining valid points are placed in the remainder tensors.

        Parameters
        ----------
        canonical_vals, canonical_times:
            Tensors produced by aligning the simulation or empirical data to
            the canonical grid.
        canonical_mask:
            Boolean tensor marking valid entries for each batch element.
        generator:
            Optional random generator used when sampling ``obs_count`` in
            split-past/future mode.

        Returns
        -------
        tuple of tensors
            Observation and remaining tensors matching the shapes dictated by
            :meth:`_get_shapes_raw`. All tensors share the same device and
            dtype as the inputs. ``None`` is returned for remainder tensors
            when the capacity is zero.
        """
        max_obs, max_rem = self._get_shapes_raw()
        device = canonical_vals.device
        dtype = canonical_vals.dtype
        batch, _ = canonical_vals.shape

        # Zero-initialised output blocks; unfilled slots stay 0 / False.
        obs_out = torch.zeros(batch, max_obs, dtype=dtype, device=device)
        obs_time = torch.zeros_like(obs_out)
        obs_mask = torch.zeros(batch, max_obs, dtype=torch.bool, device=device)

        # Remainder ("future") blocks only exist when capacity allows.
        rem_sim = rem_time = rem_mask = None
        if max_rem > 0:
            rem_sim = torch.zeros(batch, max_rem, dtype=dtype, device=device)
            rem_time = torch.zeros_like(rem_sim)
            rem_mask = torch.zeros(batch, max_rem, dtype=torch.bool, device=device)

        gen = generator if generator is not None else torch.default_generator

        for row in range(batch):
            valid_idx = canonical_mask[row].nonzero(as_tuple=True)[0]
            total_valid = valid_idx.numel()
            if total_valid == 0:
                continue

            # Decide how many of the valid points become "past" observations.
            if self.split_past_future:
                low = min(self.min_past, total_valid)
                high = min(self.max_past, total_valid)
                if self._fixed_past_obs_count is not None:
                    obs_count = min(self._fixed_past_obs_count, total_valid)
                else:
                    # Module-level helper (defined elsewhere in this file)
                    # that draws a count in [low, high], biased by
                    # ``generative_bias``.
                    obs_count = _sample_past_count_with_bias(
                        low=low,
                        high=high,
                        generative_bias=self.generative_bias,
                        generator=gen,
                        device=device,
                    )
                obs_count = min(obs_count, max_obs)
            else:
                obs_count = min(total_valid, max_obs)

            # Earliest obs_count valid indices form the observation block.
            if obs_count > 0:
                take = valid_idx[:obs_count]
                obs_out[row, :obs_count] = canonical_vals[row, take]
                obs_time[row, :obs_count] = canonical_times[row, take]
                obs_mask[row, :obs_count] = True

            # Everything after the observation window feeds the remainder.
            if rem_sim is not None:
                rem_candidates = valid_idx[obs_count:]
                rem_count = min(rem_candidates.numel(), max_rem)
                if rem_count > 0:
                    rem_idx = rem_candidates[:rem_count]
                    rem_sim[row, :rem_count] = canonical_vals[row, rem_idx]
                    rem_time[row, :rem_count] = canonical_times[row, rem_idx]
                    rem_mask[row, :rem_count] = True

        return obs_out, obs_time, obs_mask, rem_sim, rem_time, rem_mask
1193
+
1194
    def _align_simulation_to_canonical(
        self,
        full_simulation: Tensor,
        full_simulation_times: Tensor,
        *,
        time_scales: Tensor,
        num_obs_sampler: Optional[Callable[[int], Tensor]] = None,
    ) -> Tuple[Tensor, Tensor, Tensor, Tensor]:
        """Gather the canonical samples from a simulated PK curve.

        The routine creates the canonical grid described in the configuration
        (using the provided ``time_scales``) and then performs a nearest-neighbour
        lookup on the simulated trajectory. Each grid location picks the
        closest time point from the reference simulation (the first batch row);
        the same indices are applied to every batch element so that values and
        times remain aligned across the batch. ``num_obs_sampler`` can further
        prune the resulting grid by specifying how many of those canonical
        points should remain valid for each row.

        Parameters
        ----------
        full_simulation, full_simulation_times:
            Batched tensors representing the simulated concentration curve and
            its time axis.
        time_scales:
            Two-element tensor with ``t_peak`` and ``t_half`` scaling factors.
        num_obs_sampler:
            Optional callable that samples how many canonical points should be
            retained for each batch element.

        Returns
        -------
        tuple of torch.Tensor
            The canonical values, their corresponding times, a boolean mask of
            valid entries and the (cloned) ``time_scales`` tensor. When the
            canonical capacity is zero, zero-sized tensors are returned for the
            first three entries.
        """
        device = full_simulation.device
        dtype = full_simulation.dtype
        batch, _ = full_simulation.shape
        time_steps = int(full_simulation_times.size(1))

        # Empty worker slices (B=0) and zero-step trajectories are valid edge
        # cases; return empty canonical tensors and keep shape assembly
        # delegated to _assemble_from_canonical.
        if batch == 0 or time_steps == 0:
            zero = torch.zeros(batch, 0, dtype=dtype, device=device)
            mask = torch.zeros(batch, 0, dtype=torch.bool, device=device)
            return zero, zero, mask, time_scales.clone()

        canonical_cap = self._canonical_grid_capacity()
        if canonical_cap == 0:
            # Zero capacity (e.g. max_num_obs == 0): same empty contract.
            zero = torch.zeros(batch, 0, dtype=dtype, device=device)
            mask = torch.zeros(batch, 0, dtype=torch.bool, device=device)
            return zero, zero, mask, time_scales.clone()

        # Canonical grid anchored at this batch's (t_peak, t_half) scales.
        grid = self._build_canonical_grid(
            t_peak=time_scales[0].item(),
            t_half=time_scales[1].item(),
            device=device,
            dtype=dtype,
        )[:canonical_cap]

        # Row 0 serves as the shared reference time axis for the whole batch.
        ref_times = full_simulation_times[0]
        min_time = ref_times.min()
        max_time = ref_times.max()
        # Grid points outside the simulated window are flagged invalid.
        grid_valid_mask = (grid >= min_time) & (grid <= max_time)

        # Snap grid points to nearest simulated times, sort, and collapse
        # duplicates so gather indices are strictly increasing.
        idx = torch.cdist(grid[:, None], ref_times[:, None]).argmin(dim=1)
        idx, order = idx.sort()
        grid_valid_mask = grid_valid_mask[order]
        idx, grid_valid_mask = self._deduplicate_sorted_indices(idx, grid_valid_mask)

        gather_idx = idx[None, :].expand(batch, -1)
        batch_idx = torch.arange(batch, device=device)[:, None]

        canonical_vals = full_simulation[batch_idx, gather_idx]
        canonical_times = full_simulation_times[batch_idx, gather_idx]

        # Zero out out-of-range / duplicate slots for unambiguous padding.
        invalid_slots = ~grid_valid_mask
        if invalid_slots.any():
            canonical_vals[:, invalid_slots] = 0
            canonical_times[:, invalid_slots] = 0

        # Per-row number of canonical points to keep; defaults to full cap.
        if num_obs_sampler is None:
            total_counts = torch.full((batch,), canonical_cap, dtype=torch.long, device=device)
        else:
            sampled = num_obs_sampler(batch).to(device=device).long()
            total_counts = sampled.clamp(min=0, max=canonical_cap)

        max_valid = int(grid_valid_mask.sum().item())
        if max_valid == 0:
            total_counts.zero_()
        else:
            total_counts.clamp_(max=max_valid)

        # Rank valid slots left-to-right (-1 for invalid) so each row keeps
        # exactly its first ``total_counts`` valid canonical points.
        valid_order = grid_valid_mask.long().cumsum(dim=0) - 1
        valid_order = torch.where(
            grid_valid_mask,
            valid_order,
            torch.full_like(valid_order, -1, dtype=valid_order.dtype),
        )
        canonical_mask = grid_valid_mask[None, :] & (valid_order[None, :] < total_counts[:, None])
        canonical_mask = self._drop_non_positive_times_from_mask(canonical_times, canonical_mask)

        return canonical_vals, canonical_times, canonical_mask, time_scales.clone()
1301
+
1302
    def _align_empirical_to_canonical(
        self,
        empirical_obs: Tensor,
        empirical_times: Tensor,
        empirical_mask: Tensor,
    ) -> Tuple[Tensor, Tensor, Tensor]:
        """Project empirical observations onto the canonical grid.

        The projection normalises the empirical time axis to estimate the peak
        and half-life from the data itself. This allows harmonising real
        measurements with the canonical layout used during simulation-driven
        training.

        Parameters
        ----------
        empirical_obs, empirical_times, empirical_mask:
            Batched tensors storing empirical observations, the corresponding
            time stamps and a mask of valid entries.

        Returns
        -------
        tuple[torch.Tensor, torch.Tensor, torch.Tensor]
            Canonical values, times and boolean masks aligned to the canonical
            sampling scheme.
        """
        device = empirical_obs.device
        dtype = empirical_obs.dtype
        batch, _ = empirical_obs.shape
        canonical_cap = self._canonical_grid_capacity()

        # Pre-allocate zero-padded outputs; rows with no valid data stay zero.
        canonical_vals = torch.zeros(batch, canonical_cap, dtype=dtype, device=device)
        canonical_times = torch.zeros_like(canonical_vals)
        canonical_mask = torch.zeros(batch, canonical_cap, dtype=torch.bool, device=device)

        if canonical_cap == 0:
            return canonical_vals, canonical_times, canonical_mask

        # Each row is processed independently because peak/half-life are
        # estimated per individual from that row's own measurements.
        for row in range(batch):
            valid_idx = empirical_mask[row].nonzero(as_tuple=True)[0]
            if valid_idx.numel() == 0:
                continue

            obs_row = empirical_obs[row, valid_idx]
            time_row = empirical_times[row, valid_idx]
            # Normalise times to [0, ~1]; the floor of 1.0 avoids division by
            # values < 1 blowing up the normalised axis (and by 0 entirely).
            max_time = torch.maximum(time_row.max(), torch.tensor(1.0, device=device))
            norm_time = time_row / max_time

            # Peak = largest observed value; half-life estimated as the first
            # post-peak time at which the signal drops to half the peak.
            peak_idx = obs_row.argmax().item()
            t_peak = norm_time[peak_idx].item()
            post_times = norm_time[peak_idx:]
            post_obs = obs_row[peak_idx:]
            half_level = obs_row[peak_idx] / 2
            below_half = (post_obs <= half_level).nonzero(as_tuple=True)[0]
            if below_half.numel() == 0:
                # Never decays below half within the record: fall back to the
                # last observed time as an upper bound.
                half_time = post_times[-1].item()
            else:
                half_time = post_times[below_half[0]].item()
            # Floor keeps the grid well-defined when peak and half-time coincide.
            t_half = max(half_time - t_peak, 1e-3)

            grid = self._build_canonical_grid(
                t_peak=t_peak if t_peak > 0 else 1e-3,
                t_half=t_half,
                device=device,
                dtype=dtype,
            )[:canonical_cap].clamp(max=1.0)

            # Snap each canonical (de-normalised) grid point to the nearest
            # real measurement; duplicates are possible for sparse rows.
            actual_grid = grid * max_time
            distances = torch.cdist(actual_grid[:, None], time_row[:, None])
            nearest = distances.argmin(dim=1)

            usable = min(time_row.numel(), grid.numel())
            if usable == 0:
                continue

            canonical_vals[row, :usable] = obs_row[nearest[:usable]]
            canonical_times[row, :usable] = time_row[nearest[:usable]]
            canonical_mask[row, :usable] = True

        # Zero/negative timestamps are treated as padding and unmasked.
        canonical_mask = self._drop_non_positive_times_from_mask(canonical_times, canonical_mask)

        return canonical_vals, canonical_times, canonical_mask
1383
+
1384
+ def _generate_raw(
1385
+ self, full_simulation: Tensor, full_simulation_times: Tensor, **kwargs
1386
+ ) -> Tuple[
1387
+ Tensor, Tensor, Tensor, Optional[Tensor], Optional[Tensor], Optional[Tensor], Tensor
1388
+ ]:
1389
+ time_scales: Optional[Tensor] = kwargs.get("time_scales")
1390
+ if time_scales is None:
1391
+ raise ValueError("time_scales must be provided for PKPeakHalfLifeStrategy")
1392
+
1393
+ canonical_vals, canonical_times, canonical_mask, rescaled = (
1394
+ self._align_simulation_to_canonical(
1395
+ full_simulation,
1396
+ full_simulation_times,
1397
+ time_scales=time_scales,
1398
+ num_obs_sampler=kwargs.get("num_obs_sampler"),
1399
+ )
1400
+ )
1401
+
1402
+ obs_out, obs_time, obs_mask, rem_sim, rem_time, rem_mask = self._assemble_from_canonical(
1403
+ canonical_vals,
1404
+ canonical_times,
1405
+ canonical_mask,
1406
+ generator=kwargs.get("generator"),
1407
+ )
1408
+
1409
+ return obs_out, obs_time, obs_mask, rem_sim, rem_time, rem_mask, rescaled
1410
+
1411
+ def _generate_random(
1412
+ self,
1413
+ full_simulation: Tensor,
1414
+ full_simulation_times: Tensor,
1415
+ *,
1416
+ time_scales: Tensor,
1417
+ generator: Optional[torch.Generator] = None,
1418
+ ) -> Tuple[
1419
+ Tensor, Tensor, Tensor, Optional[Tensor], Optional[Tensor], Optional[Tensor], Tensor
1420
+ ]:
1421
+ """Randomized variant of canonical observation generation.
1422
+
1423
+ Instead of fixed multipliers, the pre- and post-peak segments are
1424
+ sampled from uniform distributions bounded by the canonical limits.
1425
+ This keeps the semantic meaning of the selected points while injecting
1426
+ stochasticity that improves robustness when training amortised
1427
+ inference models.
1428
+ """
1429
+ device, dtype = full_simulation.device, full_simulation.dtype
1430
+ batch = full_simulation.size(0)
1431
+ time_steps = int(full_simulation_times.size(1))
1432
+ if batch == 0 or time_steps == 0:
1433
+ canonical_vals = torch.zeros(batch, 0, dtype=dtype, device=device)
1434
+ canonical_times = torch.zeros(batch, 0, dtype=dtype, device=device)
1435
+ canonical_mask = torch.zeros(batch, 0, dtype=torch.bool, device=device)
1436
+ obs_out, obs_time, obs_mask, rem_sim, rem_time, rem_mask = self._assemble_from_canonical(
1437
+ canonical_vals, canonical_times, canonical_mask, generator=generator
1438
+ )
1439
+ return obs_out, obs_time, obs_mask, rem_sim, rem_time, rem_mask, time_scales.clone()
1440
+ t_peak, t_half = time_scales[0].item(), time_scales[1].item()
1441
+
1442
+ n_pre = len(self._PEAK_PHASE_MULTIPLIERS)
1443
+ n_post = len(self._POST_PEAK_HALF_LIFE_MULTIPLIERS)
1444
+
1445
+ # Uniform samples before peak
1446
+ pre_times = torch.rand(n_pre, device=device, dtype=dtype) * t_peak
1447
+ # Always include the peak
1448
+ peak_time = torch.tensor([t_peak], device=device, dtype=dtype)
1449
+ # Uniform samples after peak
1450
+ post_times = []
1451
+ for mult in self._POST_PEAK_HALF_LIFE_MULTIPLIERS:
1452
+ t_end = t_peak + mult * t_half
1453
+ t_rand = torch.empty(1, device=device, dtype=dtype).uniform_(t_peak, t_end)
1454
+ post_times.append(t_rand)
1455
+ post_times = torch.cat(post_times, dim=0)
1456
+
1457
+ # Truncate to canonical capacity
1458
+ grid = torch.cat([pre_times, peak_time, post_times], dim=0)
1459
+ canonical_cap = self._canonical_grid_capacity()
1460
+ grid = grid[:canonical_cap]
1461
+
1462
+ # Map grid to nearest simulation points
1463
+ ref_times = full_simulation_times[0]
1464
+ idx = torch.cdist(grid[:, None], ref_times[:, None]).argmin(dim=1)
1465
+ idx, _ = idx.sort()
1466
+ valid_mask = torch.ones_like(idx, dtype=torch.bool)
1467
+ idx, valid_mask = self._deduplicate_sorted_indices(idx, valid_mask)
1468
+ gather_idx = idx[None, :].expand(batch, -1)
1469
+ batch_idx = torch.arange(batch, device=device)[:, None]
1470
+
1471
+ canonical_vals = full_simulation[batch_idx, gather_idx]
1472
+ canonical_times = full_simulation_times[batch_idx, gather_idx]
1473
+ invalid_slots = ~valid_mask
1474
+ if invalid_slots.any():
1475
+ canonical_vals[:, invalid_slots] = 0
1476
+ canonical_times[:, invalid_slots] = 0
1477
+
1478
+ canonical_mask = valid_mask[None, :].expand(batch, -1).clone()
1479
+ canonical_mask = self._drop_non_positive_times_from_mask(canonical_times, canonical_mask)
1480
+
1481
+ obs_out, obs_time, obs_mask, rem_sim, rem_time, rem_mask = self._assemble_from_canonical(
1482
+ canonical_vals, canonical_times, canonical_mask, generator=generator
1483
+ )
1484
+ return obs_out, obs_time, obs_mask, rem_sim, rem_time, rem_mask, time_scales.clone()
1485
+
1486
+ def generate(
1487
+ self,
1488
+ full_simulation: Tensor,
1489
+ full_simulation_times: Tensor,
1490
+ **kwargs,
1491
+ ) -> Tuple[
1492
+ Tensor, Tensor, Tensor, Optional[Tensor], Optional[Tensor], Optional[Tensor], Tensor
1493
+ ]:
1494
+ """Generate PK observations using canonical or randomized schedules.
1495
+
1496
+ With probability ``randomize_prob`` (default 0.5) the method delegates
1497
+ to :meth:`_generate_random`; otherwise the deterministic
1498
+ :meth:`_generate_raw` path is taken. Setting the keyword argument
1499
+ ``deterministic_only=True`` forces the deterministic branch regardless
1500
+ of the random draw. Both paths require the caller to provide
1501
+ ``time_scales`` specifying the peak and half-life. The method honours
1502
+ the ``add_rem`` flag by optionally returning remainder tensors.
1503
+ """
1504
+ time_scales: Optional[Tensor] = kwargs.get("time_scales")
1505
+ if time_scales is None:
1506
+ raise ValueError("time_scales must be provided for PKPeakHalfLifeStrategy")
1507
+
1508
+ deterministic_only = kwargs.pop("deterministic_only", False)
1509
+
1510
+ use_random = False
1511
+ if not deterministic_only:
1512
+ use_random = torch.rand(()) < getattr(self, "randomize_prob", 0.5)
1513
+
1514
+ if use_random:
1515
+ obs, obs_time, obs_mask, rem_sim, rem_time, rem_mask, rescaled = self._generate_random(
1516
+ full_simulation,
1517
+ full_simulation_times,
1518
+ time_scales=time_scales,
1519
+ generator=kwargs.get("generator"),
1520
+ )
1521
+ else:
1522
+ obs, obs_time, obs_mask, rem_sim, rem_time, rem_mask, rescaled = self._generate_raw(
1523
+ full_simulation,
1524
+ full_simulation_times,
1525
+ **kwargs,
1526
+ )
1527
+
1528
+ if not self.observations_config.add_rem:
1529
+ rem_sim = rem_time = rem_mask = None
1530
+
1531
+ return obs, obs_time, obs_mask, rem_sim, rem_time, rem_mask, rescaled
1532
+
1533
+ def generate_empirical(
1534
+ self,
1535
+ empirical_obs: Tensor,
1536
+ empirical_times: Tensor,
1537
+ empirical_mask: Tensor,
1538
+ *,
1539
+ generator: Optional[torch.Generator] = None,
1540
+ ) -> Tuple[Tensor, Tensor, Tensor, Optional[Tensor], Optional[Tensor], Optional[Tensor]]:
1541
+ canonical_vals, canonical_times, canonical_mask = self._align_empirical_to_canonical(
1542
+ empirical_obs,
1543
+ empirical_times,
1544
+ empirical_mask,
1545
+ )
1546
+
1547
+ obs, obs_time, obs_mask, rem_sim, rem_time, rem_mask = self._assemble_from_canonical(
1548
+ canonical_vals,
1549
+ canonical_times,
1550
+ canonical_mask,
1551
+ generator=generator,
1552
+ )
1553
+
1554
+ if not self.observations_config.add_rem:
1555
+ rem_sim = rem_time = rem_mask = None
1556
+
1557
+ return obs, obs_time, obs_mask, rem_sim, rem_time, rem_mask
1558
+
1559
+
1560
class FixPastTimeRandomSelectionStrategy(ObservationStrategy):
    """Randomly sample observations and split with fixed-capacity past/future slots.

    For ``split_past_future=True`` this strategy enforces the contract:
    ``obs_capacity=max_past`` and ``rem_capacity=max_num_obs-max_past``
    (subject to ``fixed_M_max=min(max_num_obs, time_num_steps)``).
    """

    def __init__(self, config: ObservationsConfig, meta_config: MetaStudyConfig):
        """Cache selection capacities and split parameters from the configs."""
        super().__init__(config, meta_config)
        # Hard cap on sampled points: cannot exceed the simulation grid length.
        time_steps = getattr(meta_config, "time_num_steps", config.max_num_obs)
        self.fixed_M_max = min(config.max_num_obs, time_steps)
        self.split_past_future = config.split_past_future
        self.max_past = config.max_past
        self.min_past = config.min_past
        self.generative_bias = config.generative_bias
        # Fraction of the study horizon used as the past/future boundary.
        self.boundary_ratio = getattr(config, "past_time_ratio", 0.1)

    def _generate_raw(self, full_simulation: Tensor, full_simulation_times: Tensor, **kwargs):
        # Delegate the uniform without-replacement sampling to the shared helper;
        # the remainder slots it returns are always None at this stage.
        return fix_past_time_random_selection(
            full_simulation=full_simulation,
            full_simulation_times=full_simulation_times,
            boundary_ratio=self.boundary_ratio,
            fixed_M_max=self.fixed_M_max,
            num_obs_sampler=kwargs.get("num_obs_sampler", None),
            generator=kwargs.get("generator", None),
        )

    def _get_shapes_raw(self) -> Tuple[int, int]:
        """Return fixed-capacity shapes for random split outputs.

        With ``split_past_future=True``:
        - ``max_obs`` is bounded by ``max_past``
        - ``max_rem`` is bounded by ``max_num_obs - max_past``

        Raises
        ------
        ValueError
            If the split is enabled without ``min_past``/``max_past``, or if
            ``fixed_M_max`` cannot accommodate ``min_past``.
        """
        if self.split_past_future:
            if self.min_past is None or self.max_past is None:
                raise ValueError(
                    "min_past and max_past must be specified when split_past_future=True"
                )
            if self.fixed_M_max < self.min_past:
                raise ValueError("fixed_M_max is smaller than the configured min_past")
            max_obs = min(self.max_past, self.fixed_M_max)
            max_rem = max(0, self.fixed_M_max - self.max_past)
        else:
            # Without a split, both capacities default to the sampling cap.
            max_obs = self.fixed_M_max
            max_rem = self.fixed_M_max

        return max_obs, max_rem

    def _split_by_boundary(
        self,
        obs: TensorType["B", "M"],
        obs_time: TensorType["B", "M"],
        obs_mask: TensorType["B", "M"],
        *,
        generator: Optional[torch.Generator] = None,
    ) -> Tuple[Tensor, Tensor, Tensor, Tensor, Tensor, Tensor]:
        """Split sampled observations into strict past and future blocks.

        The split is boundary-based and strict:
        - Past block samples ``k`` points from ``time <= boundary`` candidates,
          where ``k`` follows ``min_past``/``max_past`` (and ``generative_bias``),
          capped by available candidates and ``K_max``.
        - When ``k > 0``, remainder receives up to ``R_cap`` points sampled
          from ``time > boundary`` only (strict future).
        - When ``k == 0``, boundary splitting is ignored for remainder and
          points are sampled from all valid candidates.

        Extra past/future candidates are ignored, and missing entries are
        padded by zeros with mask=False.
        """
        B, M = obs.shape
        # K_max: capacity of the past block [B, K_max]
        K_max = min(int(self.max_past), int(M))
        K_min = min(int(self.min_past), K_max)
        # R_cap: fixed capacity of the remainder block [B, R_cap]
        R_cap = max(0, int(M) - K_max)

        boundary = self.meta_config.time_stop * self.boundary_ratio
        # NOTE(review): torch.default_generator is the CPU default generator;
        # if ``obs`` ever lives on CUDA, randperm(generator=gen, device=...)
        # may mismatch — confirm tensors are CPU-resident here.
        gen = generator if generator is not None else torch.default_generator

        # Zero-padded fixed-capacity outputs; mask=False marks padding.
        past_obs = torch.zeros(B, K_max, dtype=obs.dtype, device=obs.device)
        past_time = torch.zeros_like(past_obs)
        past_mask = torch.zeros(B, K_max, dtype=torch.bool, device=obs.device)

        rem_obs = torch.zeros(B, R_cap, dtype=obs.dtype, device=obs.device)
        rem_time = torch.zeros_like(rem_obs)
        rem_mask = torch.zeros(B, R_cap, dtype=torch.bool, device=obs.device)

        for b in range(B):
            valid_idx = obs_mask[b].nonzero(as_tuple=True)[0]
            past_candidates = valid_idx[obs_time[b, valid_idx] <= boundary]
            future_candidates = valid_idx[obs_time[b, valid_idx] > boundary]

            # Keep candidate pools chronologically ordered before sampling.
            if past_candidates.numel() > 1:
                order = torch.argsort(obs_time[b, past_candidates])
                past_candidates = past_candidates[order]
            if future_candidates.numel() > 1:
                order = torch.argsort(obs_time[b, future_candidates])
                future_candidates = future_candidates[order]

            # Past is sampled uniformly without replacement from pre-boundary points.
            k_high = min(K_max, int(past_candidates.numel()))
            k_low = min(K_min, k_high)
            k = _sample_past_count_with_bias(
                low=int(k_low),
                high=int(k_high),
                generative_bias=self.generative_bias,
                generator=gen,
                device=obs.device,
            )
            if k > 0 and past_candidates.numel() > 0:
                chosen_offsets = torch.randperm(
                    past_candidates.numel(),
                    generator=gen,
                    device=obs.device,
                )[:k]
                chosen_past = past_candidates[chosen_offsets]
                # Re-sort the chosen subset by time for stable packing.
                chosen_order = torch.argsort(obs_time[b, chosen_past])
                chosen_past = chosen_past[chosen_order]
            else:
                # Empty selection of the same dtype/device as the candidates.
                chosen_past = past_candidates[:0]

            num_past = chosen_past.numel()
            if num_past > 0:
                past_obs[b, :num_past] = obs[b, chosen_past]
                past_time[b, :num_past] = obs_time[b, chosen_past]
                past_mask[b, :num_past] = True

            # If no past point is selected, allow remainder sampling across the
            # whole valid domain. Otherwise keep strict future-only remainder.
            rem_pool = valid_idx if num_past == 0 else future_candidates
            if rem_pool.numel() > 1:
                order = torch.argsort(obs_time[b, rem_pool])
                rem_pool = rem_pool[order]

            if R_cap <= 0 or rem_pool.numel() == 0:
                chosen_rem = rem_pool[:0]
            elif rem_pool.numel() <= R_cap:
                # Pool fits entirely: take everything, no draw needed.
                chosen_rem = rem_pool
            else:
                chosen_offsets = torch.randperm(
                    rem_pool.numel(),
                    generator=gen,
                    device=obs.device,
                )[:R_cap]
                chosen_rem = rem_pool[chosen_offsets]
                chosen_order = torch.argsort(obs_time[b, chosen_rem])
                chosen_rem = chosen_rem[chosen_order]

            r = chosen_rem.numel()
            if r > 0:
                rem_obs[b, :r] = obs[b, chosen_rem]
                rem_time[b, :r] = obs_time[b, chosen_rem]
                rem_mask[b, :r] = True

        return past_obs, past_time, past_mask, rem_obs, rem_time, rem_mask

    def generate(
        self, full_simulation: Tensor, full_simulation_times: Tensor, **kwargs
    ) -> Tuple[Tensor, ...]:
        """Sample observations and optionally split them at the time boundary.

        Returns a 7-tuple; the trailing ``None`` fills the time-scales slot
        that other strategies (e.g. the PK strategy) populate.
        """
        obs, obs_time, obs_mask, _, _, _ = self._generate_raw(
            full_simulation, full_simulation_times, **kwargs
        )
        # Zero/negative timestamps count as padding and are unmasked.
        obs_mask = self._drop_non_positive_times_from_mask(obs_time, obs_mask)

        if self.split_past_future:
            out = self._split_by_boundary(
                obs,
                obs_time,
                obs_mask,
                generator=kwargs.get("generator", None),
            )
        else:
            past_obs, past_time, past_mask = obs, obs_time, obs_mask
            rem_obs = rem_time = rem_mask = None
            out = (past_obs, past_time, past_mask, rem_obs, rem_time, rem_mask)

        if not self.observations_config.add_rem:
            out = out[:3] + (None, None, None)

        return (*out, None)
1743
+
1744
+
1745
class ObservationStrategyFactory:
    @staticmethod
    def from_config(
        obs_config: ObservationsConfig, meta_config: MetaStudyConfig
    ) -> ObservationStrategy:
        """Instantiate the observation strategy named in ``obs_config``.

        Legacy compatibility: an omitted ``type`` (covered by the dataclass
        default) and an explicit YAML ``type: null`` — loaded as ``None`` —
        both resolve to ``pk_peak_half_life``.
        """
        strategy_type = getattr(obs_config, "type", None)

        if strategy_type is None:
            key = "pk_peak_half_life"
        elif isinstance(strategy_type, str):
            trimmed = strategy_type.strip()
            if not trimmed or trimmed.lower() in {"null", "none"}:
                key = "pk_peak_half_life"
            else:
                key = trimmed.lower()
        else:
            key = str(strategy_type).strip().lower()

        pk_aliases = {"observations_pk_peak_halflife", "pk_peak_half_life"}
        random_aliases = {"fix_past_time_random_selection", "random"}

        if key in pk_aliases:
            return PKPeakHalfLifeStrategy(obs_config, meta_config)
        if key in random_aliases:
            return FixPastTimeRandomSelectionStrategy(obs_config, meta_config)
        raise ValueError(f"Unknown observation type: {strategy_type}")
sim_priors_pk/data/data_generation/observations_functions.py ADDED
@@ -0,0 +1,69 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ This file contains the observation functions that create the separation
3
+ between observations and remainders, the reminder can be either future
4
+ or selected from random in betweens, or None
5
+
6
+ """
7
+ import torch
8
+ from typing import Callable, Optional, Tuple
9
+ from torchtyping import TensorType
10
+
11
def fix_past_time_random_selection(
    full_simulation: torch.Tensor,
    full_simulation_times: torch.Tensor,
    *,
    boundary_ratio: float = 0.1,
    fixed_M_max: int,
    num_obs_sampler: Optional[Callable[[int], torch.Tensor]] = None,
    generator: Optional[torch.Generator] = None,
    **kwargs,
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, None, None, None]:
    """Select observation time-points uniformly without replacement.

    Each row samples indices from the simulation grid independently and
    uniformly (no replacement), then sorts the selected points by sampled
    timestamps to keep chronological ordering in the output tensors.

    Parameters
    ----------
    full_simulation, full_simulation_times:
        ``[N, S]`` tensors with simulated values and their timestamps.
    boundary_ratio:
        Unused here; accepted for signature compatibility with callers that
        forward split parameters.
    fixed_M_max:
        Number of observation slots ``M`` in the packed output (clamped to
        be non-negative).
    num_obs_sampler:
        Optional callable returning per-row observation counts; counts are
        clamped to ``[1, min(M, S)]``. When ``None`` every row is filled to
        the cap.
    generator:
        Optional RNG forwarded to ``torch.randperm``.

    Returns
    -------
    tuple
        ``(observations, observation_times, obs_mask, None, None, None)``
        where the three tensors are ``[N, M]``; rows are chronologically
        sorted and zero-padded (mask False) beyond their sampled count.
    """
    if full_simulation is None:
        return (None,) * 6

    device = full_simulation.device
    N, S = full_simulation.shape
    M = int(max(0, fixed_M_max))

    observations = torch.zeros(N, M, device=device, dtype=full_simulation.dtype)
    observation_times = torch.zeros(N, M, device=device, dtype=full_simulation_times.dtype)
    obs_mask = torch.zeros(N, M, dtype=torch.bool, device=device)

    sample_cap = min(M, S)
    if sample_cap == 0:
        return observations, observation_times, obs_mask, None, None, None

    if num_obs_sampler is None:
        num_obs = torch.full((N,), sample_cap, dtype=torch.long, device=device)
    else:
        num_obs = num_obs_sampler(N).to(device=device, dtype=torch.long).clamp(1, sample_cap)

    # Per-row sampling keeps selection uniform without replacement.
    for row in range(N):
        row_count = int(num_obs[row].item())
        if row_count <= 0:
            continue
        # BUGFIX: forward ``generator`` as-is (possibly None) instead of
        # substituting the CPU ``torch.default_generator``, which raised for
        # CUDA tensors and bypassed the per-device default RNG.
        selected = torch.randperm(S, generator=generator, device=device)[:row_count]
        if row_count > 1:
            # Order chosen simulation indices by sampled time for stable packing.
            order = torch.argsort(full_simulation_times[row, selected])
            selected = selected[order]
        observations[row, :row_count] = full_simulation[row, selected]
        observation_times[row, :row_count] = full_simulation_times[row, selected]
        obs_mask[row, :row_count] = True

    return observations, observation_times, obs_mask, None, None, None
sim_priors_pk/data/data_generation/study_population_stats.py ADDED
@@ -0,0 +1,185 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """This is used for calculating summary statistics over ensembles of StudyJSONs to check that
2
+ the distribution of simulated data matches empirical data."""
3
+
4
+ from abc import ABC, abstractmethod
5
+ from typing import Dict, List
6
+
7
+ import numpy as np
8
+
9
+ from sim_priors_pk.data.data_empirical.json_schema import IndividualJSON, StudyJSON
10
+
11
+
12
class StudyPopulationStats(ABC):
    """Base interface for summary statistics over ensembles of StudyJSONs.

    Concrete subclasses supply per-individual and per-study metrics together
    with a cross-study aggregation step; the template method below wires the
    per-study and aggregation stages together.
    """

    @abstractmethod
    def compute_per_individual(self, ind: IndividualJSON) -> Dict[str, float]:
        """Return statistics for a single individual (e.g. extrema, counts)."""

    @abstractmethod
    def compute_per_study(self, study: StudyJSON) -> Dict[str, float]:
        """Return statistics for a single study (e.g. extrema, counts)."""

    @abstractmethod
    def aggregate(
        self,
        per_study: List[Dict[str, float]],
    ) -> Dict[str, object]:
        """Combine per-study statistics into an ensemble-level summary."""

    def compute_study_population_statistics(
        self,
        studies: List[StudyJSON],
    ) -> Dict[str, object]:
        """Run the full pipeline: per-study metrics, then aggregation."""
        study_level = list(map(self.compute_per_study, studies))
        return self.aggregate(study_level)
37
+
38
+
39
class BasicObservationStats(StudyPopulationStats):
    """Compute descriptive statistics for observation values across individuals.

    For each individual, computes:
    - nAUC: Area Under the Curve (AUC), normalized by dose, using trapezoidal rule.
    - nCmax: Maximum observed concentration, normalized by dose.
    - Tmax: Time at which Cmax occurs.
    - Nobs: Number of observations.
    - Duration: Duration of the observation period (max observation time).
    For each study, computes:
    - Mean and standard deviation of nAUC, nCmax, Tmax across individuals.
    - Mean and total number of observations (Nobs) across all individuals.
    - Total study duration (max Duration across individuals).
    Aggregates across studies to provide percentiles of each study-level statistic.
    """

    def __init__(self, alpha=0.1):
        # Kept for API compatibility with subclasses/callers; not used by the
        # computations in this class.
        self.alpha = alpha

    def compute_per_individual(self, ind: IndividualJSON) -> Dict[str, float]:
        """Return per-individual PK summary statistics.

        Raises
        ------
        ValueError
            If observation times are unsorted or mismatched, if dosing is not
            a single positive pre-observation dose, or if the route is not
            'oral' or 'iv'.
        """
        obs_vals = ind.get("observations", [])
        obs_times = ind.get("observation_times", [])
        dose = ind.get("dosing", [])
        dosing_time = ind.get("dosing_times", [])
        route = ind.get("dosing_type", [])

        if not obs_vals:
            return {"nAUC": np.nan, "nCmax": np.nan, "Tmax": np.nan, "Nobs": 0, "Duration": np.nan}

        # Check that input times are strictly increasing and match the number
        # of observations.
        if len(obs_times) != len(obs_vals) or any(
            obs_times[i] >= obs_times[i + 1] for i in range(len(obs_times) - 1)
        ):
            raise ValueError(
                "Observation times must be sorted and match the number of observations."
            )

        # Check that there is only a single positive dose
        if len(dose) != 1 or len(dosing_time) != 1 or len(route) != 1:
            raise ValueError("Only single dosing is supported in this statistic.")
        # BUGFIX: test the scalar dose value — ``np.isnan(dose)`` on a list
        # produced a 1-element array whose implicit truthiness happened to
        # work but is fragile.
        if dose[0] <= 0 or np.isnan(dose[0]) or np.isnan(dosing_time[0]):
            raise ValueError("Dose must be positive.")

        # Check that dose precedes observations
        if any(t < dosing_time[0] for t in obs_times):
            raise ValueError("Dosing time must precede observation times.")

        # calculate AUC using the trapezoidal rule:
        # - for oral dosing, add a value of 0 at dosing time
        # - for iv bolus, add the first observation at dosing time
        obs_times_trapz = dosing_time + obs_times
        if route[0] == "oral":
            obs_vals_trapz = [0.0] + obs_vals
        elif route[0] == "iv":
            obs_vals_trapz = [obs_vals[0]] + obs_vals
        else:
            raise ValueError("Only 'oral' and 'iv' dosing types are supported.")

        auc = np.trapezoid(obs_vals_trapz, obs_times_trapz) if len(obs_vals) > 0 else np.nan
        auc /= dose[0]

        # Calculate Cmax and Tmax (dose-normalized Cmax)
        Cmax_idx = np.argmax(obs_vals)
        Cmax = obs_vals[Cmax_idx]
        Tmax = obs_times[Cmax_idx]
        Cmax /= dose[0]

        return {
            "nAUC": float(auc),
            "nCmax": float(Cmax),
            "Tmax": float(Tmax),
            "Nobs": len(obs_vals),
            "Duration": np.max(obs_times),
        }

    def compute_per_study(self, study: StudyJSON) -> Dict[str, float]:
        """Aggregate per-individual statistics to study level."""
        ind_stats = [
            self.compute_per_individual(ind)
            for block in ("context", "target")
            for ind in study.get(block, [])
        ]

        # Calculate statistics (maybe a bit too much, can be simplified later)
        metrics = {
            "nAUC_mean": ("nAUC", np.mean),
            "nAUC_sd": ("nAUC", np.std),
            "nAUC_cv": ("nAUC", lambda x: np.std(x) / np.mean(x) * 100 if np.mean(x) != 0 else np.nan),
            "nCmax_mean": ("nCmax", np.mean),
            "nCmax_sd": ("nCmax", np.std),
            "nCmax_cv": ("nCmax", lambda x: np.std(x) / np.mean(x) * 100 if np.mean(x) != 0 else np.nan),
            "Tmax_mean": ("Tmax", np.mean),
            "Tmax_sd": ("Tmax", np.std),
            "Tmax_cv": ("Tmax", lambda x: np.std(x) / np.mean(x) * 100 if np.mean(x) != 0 else np.nan),
            "Nobs_mean": ("Nobs", np.mean),
            "Nobs_total": ("Nobs", np.sum),
            "Duration_max": ("Duration", np.max),
            "nID": ("Nobs", lambda x: len(x)),
        }

        # BUGFIX: the empty-study fallback previously returned unrelated keys
        # ("max_obs", ...), which broke downstream aggregation that assumes a
        # homogeneous key set across studies. Use the metric keys instead.
        if not ind_stats:
            fallback = {name: float("nan") for name in metrics}
            fallback["Nobs_total"] = 0.0
            fallback["nID"] = 0.0
            return fallback

        results = {name: func([d[key] for d in ind_stats]) for name, (key, func) in metrics.items()}

        # Ensure all values are floats for JSON-friendliness or downstream compatibility
        return {k: float(v) for k, v in results.items()}

    def aggregate(
        self,
        per_study: List[Dict[str, float]],
    ) -> Dict[str, object]:
        """Aggregate statistics across studies."""
        # Guard: an empty ensemble has no keys to enumerate (per_study[0]
        # would raise IndexError).
        if not per_study:
            return {"Nstudy": 0}

        # Calculate percentiles of study-level statistics
        percentiles = [5, 50, 95]
        summary: Dict[str, object] = {}
        for key in per_study[0].keys():
            # NaN values (e.g. empty-study placeholders) are excluded from
            # percentile computation.
            values = [s[key] for s in per_study if not np.isnan(s[key])]
            if values:
                summary[f"{key}_percentiles"] = {
                    f"P{p}": float(np.percentile(values, p)) for p in percentiles
                }
            else:
                summary[f"{key}_percentiles"] = {f"P{p}": np.nan for p in percentiles}
        summary["Nstudy"] = len(per_study)

        return summary
164
+
165
+
166
class ListedObservationStats(BasicObservationStats):
    """Variant of BasicObservationStats that returns lists of study-level
    statistics instead of percentiles.

    Useful for more detailed analyses or visualizations of the distribution
    of study-level statistics. The redundant ``__init__`` duplicating the
    parent was removed; ``alpha`` handling is inherited.
    """

    def aggregate(
        self,
        per_study: List[Dict[str, float]],
    ) -> Dict[str, object]:
        """Aggregate statistics across studies as raw lists."""
        summary: Dict[str, object] = {}
        # Guard: an empty ensemble has no keys to enumerate (per_study[0]
        # would raise IndexError).
        if per_study:
            # Collect lists of study-level statistics
            for key in per_study[0].keys():
                summary[f"{key}_list"] = [float(s[key]) for s in per_study]
        summary["Nstudy"] = len(per_study)

        return summary
sim_priors_pk/data/data_preprocessing/__init__.py ADDED
File without changes
sim_priors_pk/data/data_preprocessing/data_preprocessing_utils.py ADDED
@@ -0,0 +1,321 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import matplotlib.pyplot as plt
2
+ import numpy as np
3
+ import torch
4
+ from torchtyping import TensorType
5
+
6
+ import torch
7
+ from torchtyping import TensorType
8
+ from typing import List,Tuple,Optional
9
+ import numpy as np
10
+
11
+ from sim_priors_pk.data.data_preprocessing.raw_to_tensors_bundles import substance_cvs_to_tensors_bundle,substances_csv_to_tensors
12
+
13
+ from typing import NamedTuple
14
+ import torch
15
+ from torchtyping import TensorType
16
+
17
class SubstanceTensorGroup(NamedTuple):
    """Tensors of a single substance, kept with a leading batch axis of size 1.

    Returned by :func:`get_substance_tensors_by_label`. Axes: I = subjects
    (padded), T = time steps (padded).
    """

    # Padded concentration values, shape [1, I, T].
    observations: TensorType[1, "I", "T"]
    # Padded observation times, shape [1, I, T].
    times: TensorType[1, "I", "T"]
    # Boolean mask, True where a time step holds a real observation.
    mask: TensorType[1, "I", "T"]
    # Boolean mask, True for real subjects, False for padded rows.
    subject_mask: TensorType[1, "I"]
22
+
23
def apply_timescale_filter(
    observations: TensorType["S", "I", "T"],
    times: TensorType["S", "I", "T"],
    masks: TensorType["S", "I", "T"],
    subject_mask: TensorType["S", "I"],
    *,
    strategy: str = "log_zscore",  # "log_zscore" | "median_fraction" | "none"
    max_abs_z: float = 2.0,        # for "log_zscore"
    tau: float = 0.4,              # for "median_fraction" (~ ln 1.5)
) -> Tuple[
    TensorType["S", "I", "T"],  # filtered observations
    TensorType["S", "I", "T"],  # filtered times
    TensorType["S", "I", "T"],  # filtered masks
    TensorType["S", "I"],       # filtered subject_mask
]:
    """Zero out and un-mask subjects whose time span is an outlier.

    Outliers are judged against the other subjects of the *same* substance:

    * ``strategy="log_zscore"``: keep subjects with |z| <= ``max_abs_z`` in
      log-span.
    * ``strategy="median_fraction"``: keep subjects within +-``tau`` of the
      median log-span.
    * ``strategy="none"`` (or any unrecognised value): return the inputs
      unchanged.

    Returns the four tensors with dropped subjects zeroed and masked out;
    the inputs themselves are never mutated.
    """
    if strategy == "none":
        return observations, times, masks, subject_mask

    # A time point is usable only if it is unpadded AND belongs to a real
    # subject.
    usable = masks.bool() & subject_mask.unsqueeze(-1)

    # Per-subject observation window, on a log scale.
    latest = times.masked_fill(~usable, float("-inf")).max(dim=2).values   # [S, I]
    earliest = times.masked_fill(~usable, float("inf")).min(dim=2).values  # [S, I]
    log_span = (latest - earliest).clamp(min=1e-12).log()                  # [S, I]

    # Decide which subjects survive.
    if strategy == "log_zscore":
        centred = log_span - log_span.mean(dim=1, keepdim=True)
        scale = log_span.std(dim=1, keepdim=True).clamp(min=1e-6)
        keep = (centred / scale).abs() <= max_abs_z                        # [S, I]
    elif strategy == "median_fraction":
        med = log_span.median(dim=1, keepdim=True).values                  # [S, 1]
        keep = (log_span >= med - tau) & (log_span <= med + tau)           # [S, I]
    else:
        # Unknown strategy: deliberately a no-op, matching "none".
        return observations, times, masks, subject_mask

    # Apply the filter on copies so the callers' tensors stay intact.
    filtered_obs = observations.clone()
    filtered_times = times.clone()
    filtered_masks = masks.clone()
    filtered_subjects = subject_mask.clone()

    dropped = ~keep & filtered_subjects.bool()
    filtered_subjects[dropped] = False
    filtered_masks[dropped] = False
    filtered_obs[dropped] = 0.0
    filtered_times[dropped] = 0.0

    return filtered_obs, filtered_times, filtered_masks, filtered_subjects
87
+
88
def plot_subjects_for_substance(
    drug_data_frame,
    substance_label: str,
    *,
    z_score_normalization: bool = False,
    normalize_by_max: bool = False,
    time_strategy: str = "log_zscore",  # "log_zscore" | "median_fraction" | "none"
    max_abs_z: float = 2.,
    x_scale: str = "linear",            # "linear" (default) | "log"
    y_scale: str = "linear",            # "linear" (default) | "log"
    alpha: float = 1.0,                 # 0 <= alpha <= 1
    legend_outside: bool = True,        # park legend to the right
    figsize: Tuple[float, float] = (10, 5),  # default width x height
    save_dir: Optional[str] = None,     # if set, saves the figure here
) -> None:
    """
    Draw every subject trajectory (points + line) for *one* substance.

    Parameters
    ----------
    drug_data_frame : pandas.DataFrame
        Long-format dataframe understood by ``substance_cvs_to_tensors_bundle``.
    substance_label : str
        Substance to plot; must occur in ``drug_data_frame["substance_label"]``.
    z_score_normalization, normalize_by_max : bool, optional
        Forwarded to ``substance_cvs_to_tensors_bundle``.
    time_strategy : {"log_zscore", "median_fraction", "none"}, optional
        Outlier strategy forwarded to ``apply_timescale_filter``.
    max_abs_z : float, optional
        Z-score cutoff used when ``time_strategy="log_zscore"``.
    x_scale, y_scale : {"linear", "log"}, optional
        Axis scaling. If you pick "log", make sure data are strictly > 0
        on that axis or Matplotlib will complain.
    alpha : float in [0, 1], optional
        Transparency applied to both the line and the markers.
    legend_outside : bool, optional
        True -> legend in a separate column to the right;
        False -> legend inside plot.
    figsize : (float, float), optional
        Figure size in inches.
    save_dir : str, optional
        Directory where a PNG of the figure is written (created if missing).

    Raises
    ------
    ValueError
        If ``substance_label`` is not present in the dataframe.
    """
    # -- 1. Pull tensors -------------------------------------------------
    # BUG FIX: the two normalization flags used to be ignored in favour of a
    # hard-coded ``normalize_by_max=True``; they are now forwarded.
    data_bundle = substance_cvs_to_tensors_bundle(
        drug_data_frame,
        z_score_normalization=z_score_normalization,
        normalize_by_max=normalize_by_max,
    )

    all_obs = data_bundle.observations            # [S, I, T]
    all_times = data_bundle.times                 # [S, I, T]
    all_masks = data_bundle.masks                 # [S, I, T]
    all_subj_mask = data_bundle.individuals_mask  # [S, I]
    substance_labels = list(data_bundle.substance_names)  # [S]
    mapping = data_bundle.mapping

    # -- 2. Find substance row -------------------------------------------
    # BUG FIX: the previous ``np.where`` comparison against a plain Python
    # list never matched (list == str is a scalar False), so the lookup
    # always raised. A straightforward list lookup works.
    try:
        s_idx = substance_labels.index(substance_label)
    except ValueError:
        raise ValueError(f"Substance '{substance_label}' not found.") from None

    obs = all_obs[s_idx]                     # [I, T]
    times = all_times[s_idx]                 # [I, T]
    step_mask = all_masks[s_idx].bool()      # [I, T]
    subj_mask = all_subj_mask[s_idx].bool()  # [I]

    # -- 3. Filter time series -------------------------------------------
    # apply_timescale_filter expects batched [S, I, T] / [S, I] inputs.
    obs_b, times_b, step_mask_b, subj_mask_b = apply_timescale_filter(
        observations=obs.unsqueeze(0),        # [1, I, T]
        times=times.unsqueeze(0),             # [1, I, T]
        masks=step_mask.unsqueeze(0),         # [1, I, T]
        subject_mask=subj_mask.unsqueeze(0),  # [1, I]
        strategy=time_strategy,
        max_abs_z=max_abs_z,
        tau=0.4,
    )
    # Remove the batch dim again.
    obs, times = obs_b[0], times_b[0]
    step_mask, subj_mask = step_mask_b[0], subj_mask_b[0]

    # -- 4. Plot one line per *real* subject -----------------------------
    fig, ax = plt.subplots(figsize=figsize)
    for i in range(obs.shape[0]):  # iterate subjects (I)
        if not subj_mask[i]:
            continue  # skip padded rows

        valid = step_mask[i]  # True -> real sample
        ax.plot(
            times[i][valid].cpu(),
            obs[i][valid].cpu(),
            marker="o",
            alpha=alpha,
            label=f"subject {i}",
        )

    # -- 5. Styling ------------------------------------------------------
    ax.set_title(f"All subjects – {substance_label}")
    ax.set_xlabel("Time (normalised per substance)")
    ax.set_ylabel("Observation")

    # Axis scales
    ax.set_xscale(x_scale)
    ax.set_yscale(y_scale)

    # Legend placement
    if legend_outside:
        # ncol=1 -> vertical list; bbox_to_anchor shifts legend fully outside
        ax.legend(
            loc="center left",
            bbox_to_anchor=(1.02, 0.5),
            borderaxespad=0.0,
            frameon=False,
        )
        plt.tight_layout(rect=[0, 0, 0.82, 1])  # leave room on the right
    else:
        ax.legend(frameon=False)
        plt.tight_layout()

    # Save figure if path is given
    if save_dir is not None:
        from pathlib import Path
        study_name = mapping[substance_label]["study_name"]
        index = mapping[substance_label]["index"]
        Path(save_dir).mkdir(parents=True, exist_ok=True)
        filepath = Path(save_dir) / f"{study_name}_{substance_label}_{index}.png"
        fig.savefig(filepath, bbox_inches="tight", dpi=300)

    plt.show()
217
+
218
def substances_with_min_timesteps(
    drug_data_frame,
    min_timesteps: int = 140,
    *,
    z_score_normalization: bool = False,
    normalize_by_max: bool = False,
) -> List[str]:
    """
    Return the list of substance labels whose **best** subject has
    >= `min_timesteps` valid observations.

    Parameters
    ----------
    drug_data_frame : pandas.DataFrame
        Same dataframe you already pass to ``substance_cvs_to_tensors_bundle``.
    min_timesteps : int, default = 140
        Threshold on the number of valid (unpadded) time points.
    z_score_normalization, normalize_by_max : bool, default = False
        Forwarded to ``substance_cvs_to_tensors_bundle`` (the count of valid
        points is unaffected by value normalization).

    Returns
    -------
    List[str]
        Substance labels that satisfy the criterion.
    """
    # BUG FIX: ``substance_cvs_to_tensors_bundle`` returns an
    # ``EmpiricalSubstanceTensorBundle`` with ten fields; the old 6-name
    # tuple unpacking raised "too many values to unpack" on every call.
    # Use attribute access instead.
    data_bundle = substance_cvs_to_tensors_bundle(
        drug_data_frame,
        z_score_normalization=z_score_normalization,
        normalize_by_max=normalize_by_max,
    )

    all_masks = data_bundle.masks                     # [S, I, T], 1 = real step
    all_subjects_mask = data_bundle.individuals_mask  # [S, I],    1 = real subject
    substance_labels = data_bundle.substance_names    # list of S labels

    # Zero out padded subjects, then count valid steps per subject:
    # counts[s, i] = number of valid time points of subject i in substance s.
    valid_masks = all_masks.bool() & all_subjects_mask.bool().unsqueeze(-1)
    counts = valid_masks.sum(dim=2)  # [S, I]

    # Best subject per substance vs. the threshold.
    max_counts = counts.max(dim=1).values       # [S]
    qualifying = max_counts >= min_timesteps    # [S] bool

    # BUG FIX: ``substance_names`` is already a Python list, so no
    # ``.tolist()`` call on it.
    return [
        label
        for label, keep in zip(substance_labels, qualifying.tolist())
        if keep
    ]
280
+
281
def get_substance_tensors_by_label(
    drug_data_frame,
    substance_label: str,
    *,
    z_score_normalization: bool = False,
    normalize_by_max: bool = False,
) -> SubstanceTensorGroup:
    """
    Return the tensors of one substance while preserving an S=1 batch shape.

    Shapes:
        observations : [1, I, T]
        times        : [1, I, T]
        mask         : [1, I, T]
        subject_mask : [1, I]

    Raises ``ValueError`` when the label is unknown.
    """
    bundle = substance_cvs_to_tensors_bundle(
        drug_data_frame,
        z_score_normalization=z_score_normalization,
        normalize_by_max=normalize_by_max,
    )

    # Locate the requested substance among the bundle's labels.
    try:
        s_idx = list(bundle.substance_names).index(substance_label)
    except ValueError:
        raise ValueError(f"Substance label '{substance_label}' not found.")

    # Slice out the substance and re-add the batch dimension.
    return SubstanceTensorGroup(
        observations=bundle.observations[s_idx].unsqueeze(0),         # [1, I, T]
        times=bundle.times[s_idx].unsqueeze(0),                       # [1, I, T]
        mask=bundle.masks[s_idx].unsqueeze(0).bool(),                 # [1, I, T]
        subject_mask=bundle.individuals_mask[s_idx].unsqueeze(0).bool(),  # [1, I]
    )
321
+
sim_priors_pk/data/data_preprocessing/raw_to_tensors_bundles.py ADDED
@@ -0,0 +1,360 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Here we define the functions requiered to process the data
3
+
4
+ https://pk-db.com/
5
+
6
+ """
7
+ import torch
8
+ import numpy as np
9
+ import pandas as pd
10
+ from typing import Tuple
11
+ from torchtyping import TensorType
12
+ from typing import NamedTuple, List, Dict
13
+ from torchtyping import TensorType
14
+ from typing import Dict, Tuple, List
15
+ import numpy as np
16
+ import torch
17
+ from torchtyping import TensorType
18
+
19
+ from typing import Dict, Tuple, List, Optional
20
+ import numpy as np
21
+ import torch
22
+ from torchtyping import TensorType
23
+
24
# Fixed per-substance dose lookup used when the dataframe carries no explicit
# dosing information. Units are mg per g per the variable name — presumably
# mg of drug per g of body weight from the Lenuzza dosing protocol;
# TODO(review): confirm units against the original source.
lenuzza_doses_mg_per_g = {
    "memantine": 0.005,
    "omeprazole": 0.010,
    "repaglinide": 0.00025,
    "rosuvastatin": 0.005,
    "tolbutamide": 0.010,
    "dextromethorphan": 0.018,
    "digoxin": 0.00025,
    "paracetamol": 0.060,
    "caffeine": 0.073,
    "midazolam": 0.004,
    "paraxanthine": 0.073,
    "dextrorphan": 0.018,
}
38
+
39
class EmpiricalSubstanceTensorBundle(NamedTuple):
    """Padded per-substance tensors plus metadata.

    Built by :func:`substance_cvs_to_tensors_bundle`. Axis meanings:
    S = substances, I = subjects (padded to the max per substance),
    T = time steps (padded to the max per subject).
    """

    observations: TensorType["S", "I", "T"]  # padded concentration values (padding = 0)
    # Padded observation times (padding = 0).
    # NOTE(review): older comments claimed these are normalized to [0, 1],
    # but the builder applies no time normalization — confirm downstream use.
    times: TensorType["S", "I", "T"]
    masks: TensorType["S", "I", "T"]  # 1 = observed, 0 = missing or padded
    individuals_mask: TensorType["S", "I"]  # 1 = real subject, 0 = padded row
    study_names: List[str]  # [S] -> one study name per substance
    individuals_names: List[List[str]]  # [S][I] -> subject name per padded subject ("" for padding)
    substance_names: List[str]  # [S] substance_label entries
    mapping: Dict[str, Dict[str, object]]  # substance_label -> {"index", "study_name"}
    dosing_amounts: TensorType["S", "I"]  # dose mg/g per subject
    dosing_route_types: TensorType["S", "I"]  # route type index per subject
50
+
51
def map_substance_to_index_and_study(
    drug_data_frame
) -> dict[str, dict[str, object]]:
    """
    Map each substance_label to its index (in ``np.unique`` order) and the
    study_name of the first row where that label appears.

    Returns
    -------
    dict: {
        "substance_label": {
            "index": int,
            "study_name": str
        },
        ...
    }
    """
    unique_labels = np.unique(drug_data_frame["substance_label"].values)

    lookup: dict[str, dict[str, object]] = {}
    for position, substance in enumerate(unique_labels):
        matching_rows = drug_data_frame[drug_data_frame["substance_label"] == substance]
        lookup[substance] = {
            "index": position,
            "study_name": matching_rows["study_name"].iloc[0],
        }

    return lookup
81
+
82
def substances_csv_to_tensors(drug_data_frame, substance_label='omeprazole'):
    """
    Extract per-subject time series for one substance, pad them to a common
    length, and return them as tensors.

    Params:
        drug_data_frame (pd.DataFrame): Input DataFrame with columns
            'substance_label', 'subject_name', 'time', 'value'
            (optionally 'substance_name').
        substance_label (str): The substance label to filter by. Defaults to 'omeprazole'.

    Returns:
        observations (torch.Tensor): Padded observation values, shape [num_subjects, max_time].
        observations_times (torch.Tensor): Padded time points, shape [num_subjects, max_time].
        observations_mask (torch.Tensor): 1 for real data, 0 for padding, shape [num_subjects, max_time].
        dosing_amounts (torch.Tensor): Dose amount per subject [num_subjects].
        dosing_route_types (torch.Tensor): Route type index per subject [num_subjects].
    """
    # Restrict to the rows of the requested substance.
    subset = drug_data_frame[drug_data_frame['substance_label'] == substance_label]

    per_subject_times = []
    per_subject_values = []
    dose_per_subject = []
    route_per_subject = []

    for _, subject_rows in subset.groupby('subject_name'):
        # Chronological order within each subject.
        ordered = subject_rows.sort_values('time')
        per_subject_times.append(ordered['time'].values.astype(np.float32))
        per_subject_values.append(ordered['value'].values.astype(np.float32))

        # Dose lookup key: the explicit substance name when present,
        # otherwise the label itself.
        if 'substance_name' in subject_rows.columns:
            name_lower = str(subject_rows['substance_name'].iloc[0]).lower()
        else:
            name_lower = str(substance_label).lower()

        # First dictionary entry whose key occurs in the name wins;
        # 0.5 is the fallback dose.
        dose = next(
            (amount for drug, amount in lenuzza_doses_mg_per_g.items() if drug in name_lower),
            0.5,
        )
        dose_per_subject.append(dose)
        route_per_subject.append(0)  # oral

    # Common padded length across subjects.
    longest = max((len(t) for t in per_subject_times), default=0)

    padded_times, padded_values, row_masks = [], [], []
    for t_arr, v_arr in zip(per_subject_times, per_subject_values):
        n_real = len(t_arr)
        tail = longest - n_real

        # Zero-pad to the common length.
        padded_times.append(np.pad(t_arr, (0, tail), mode='constant', constant_values=0))
        padded_values.append(np.pad(v_arr, (0, tail), mode='constant', constant_values=0))

        # 1 for real samples, 0 for padding.
        valid = np.zeros(longest, dtype=np.float32)
        valid[:n_real] = 1
        row_masks.append(valid)

    observations = torch.tensor(padded_values, dtype=torch.float32)       # [P, T]
    observations_times = torch.tensor(padded_times, dtype=torch.float32)  # [P, T]
    observations_mask = torch.tensor(row_masks, dtype=torch.float32)      # [P, T]

    dosing_amounts = torch.tensor(dose_per_subject, dtype=torch.float32)  # [P]
    dosing_route_types = torch.tensor(route_per_subject, dtype=torch.long)  # [P]

    return observations, observations_times, observations_mask, dosing_amounts, dosing_route_types
163
+
164
def substance_dict_to_tensors(
    selected_series: Optional[Dict[str, Dict[str, List[float]]]],
    hidden_series: Optional[Dict[str, Dict[str, List[float]]]],
) -> Tuple[
    Optional[TensorType["N_sel", "T"]], Optional[TensorType["N_sel", "T"]], Optional[TensorType["N_sel", "T"]],
    Optional[TensorType["N_hid", "T"]], Optional[TensorType["N_hid", "T"]], Optional[TensorType["N_hid", "T"]],
]:
    """
    Convert two dictionaries of time series (typically coming from the
    frontend payload) into padded tensors that share one maximum length.

    Args:
        selected_series: Mapping subject_name -> {'timepoints': [...], 'values': [...]}.
        hidden_series: Mapping subject_name -> {'timepoints': [...], 'values': [...]}.

    Returns:
        sel_obs, sel_times, sel_mask: [N_sel, T] or None when the input is empty.
        hid_obs, hid_times, hid_mask: [N_hid, T] or None when the input is empty.
    """

    def _sorted_arrays(series: Dict[str, Dict[str, List[float]]]) -> Tuple[List[np.ndarray], List[np.ndarray]]:
        # Per subject: float32 arrays sorted by time.
        ts, vs = [], []
        for payload in series.values():
            t = np.array(payload['timepoints'], dtype=np.float32)
            v = np.array(payload['values'], dtype=np.float32)
            order = np.argsort(t)
            ts.append(t[order])
            vs.append(v[order])
        return ts, vs

    def _pad_stack(ts: List[np.ndarray], vs: List[np.ndarray], width: int) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        # Zero-pad every row to `width` and build a validity mask.
        obs_rows, time_rows, mask_rows = [], [], []
        for t, v in zip(ts, vs):
            tail = width - len(t)
            time_rows.append(np.pad(t, (0, tail), mode='constant', constant_values=0))
            obs_rows.append(np.pad(v, (0, tail), mode='constant', constant_values=0))
            row_mask = np.ones(width, dtype=np.float32)
            row_mask[len(t):] = 0
            mask_rows.append(row_mask)
        return (
            torch.tensor(obs_rows, dtype=torch.float32),   # [N, T]
            torch.tensor(time_rows, dtype=torch.float32),  # [N, T]
            torch.tensor(mask_rows, dtype=torch.float32),  # [N, T]
        )

    sel_ts, sel_vs = _sorted_arrays(selected_series) if selected_series else ([], [])
    hid_ts, hid_vs = _sorted_arrays(hidden_series) if hidden_series else ([], [])

    # Both groups are padded to one shared maximum length.
    shared_len = max(
        max((len(t) for t in sel_ts), default=0),
        max((len(t) for t in hid_ts), default=0),
    )

    sel_out = _pad_stack(sel_ts, sel_vs, shared_len) if sel_ts else (None, None, None)
    hid_out = _pad_stack(hid_ts, hid_vs, shared_len) if hid_ts else (None, None, None)

    return (*sel_out, *hid_out)
241
+
242
def substance_cvs_to_tensors_bundle(
    drug_data_frame: pd.DataFrame,
    **kwargs
) -> EmpiricalSubstanceTensorBundle:
    """
    Build one padded :class:`EmpiricalSubstanceTensorBundle` from a
    long-format dataframe, grouping rows by ``substance_label``.

    Per substance, the per-subject series come from
    ``substances_csv_to_tensors`` ([P, T] tensors); NaN observations are
    zeroed and removed from the mask; finally every substance is padded to
    the common ``[S, I, T]`` shape (I = max subjects, T = max time steps).

    NOTE(review): ``**kwargs`` (callers pass e.g. ``z_score_normalization``
    and ``normalize_by_max``) are accepted but never used — no value or
    time normalization happens in this function, despite what an older
    docstring claimed. Confirm whether that step was dropped intentionally.

    Returns
    -------
    EmpiricalSubstanceTensorBundle
        observations / times / masks : [S, I, T] (zero-padded),
        individuals_mask : [S, I] (1 = real subject),
        substance_names, study_names, individuals_names, mapping,
        dosing_amounts : [S, I], dosing_route_types : [S, I].
    """
    # Local re-imports kept from the original code; the module already
    # imports numpy and torch at the top level.
    import numpy as np
    import torch
    import torch.nn.functional as F

    substance_labels = np.unique(drug_data_frame["substance_label"].values)
    mapping = map_substance_to_index_and_study(drug_data_frame)

    substance_observations = []
    substance_times = []
    substance_masks = []
    subject_masks = []
    substance_doses = []
    substance_routes = []

    study_names_per_substance = []
    subject_names_per_substance = []

    # Global padding targets across all substances.
    max_time_steps = 0
    max_subjects = 0

    for substance_label in substance_labels:
        # df_sub is only used for the metadata below; the full frame is
        # handed to substances_csv_to_tensors, which re-filters internally.
        df_sub = drug_data_frame[drug_data_frame["substance_label"] == substance_label]
        obs, times, masks, doses, routes = substances_csv_to_tensors(
            drug_data_frame, substance_label=substance_label
        )
        # obs, times, masks: [P, T]

        # Invalid (NaN) observations: drop from the mask, then zero the value.
        valid_obs_mask = ~torch.isnan(obs)
        masks = masks.bool() & valid_obs_mask
        obs = obs.nan_to_num(nan=0.0)

        max_time_steps = max(max_time_steps, obs.shape[1])
        max_subjects = max(max_subjects, obs.shape[0])

        # --- Metadata collection ---
        # groupby(...).first() yields one row per subject; its index order
        # matches the subject order produced by substances_csv_to_tensors.
        grouped = df_sub.groupby("subject_name").first()
        subject_names = list(grouped.index)
        study_name = grouped["study_name"].iloc[0] if len(grouped) > 0 else ""

        study_names_per_substance.append(study_name)
        subject_names_per_substance.append(subject_names)

        substance_observations.append(obs)
        substance_times.append(times)
        substance_masks.append(masks)
        subject_masks.append(torch.ones(obs.shape[0], dtype=torch.float32))  # [P]
        substance_doses.append(doses)
        substance_routes.append(routes)

    # Padding pass: bring every substance to [max_subjects, max_time_steps].
    all_observations, all_times, all_masks, all_subjects_mask = [], [], [], []
    all_doses, all_routes = [], []

    for obs, time, mask, subj_mask, subj_names, doses, routes in zip(
        substance_observations,
        substance_times,
        substance_masks,
        subject_masks,
        subject_names_per_substance,
        substance_doses,
        substance_routes,
    ):
        pad_subjects = max_subjects - obs.shape[0]
        pad_timesteps = max_time_steps - obs.shape[1]

        # F.pad pads the last dim first: (left, right) for T, then for I.
        obs_padded = F.pad(obs, (0, pad_timesteps, 0, pad_subjects))   # [I, T]
        time_padded = F.pad(time, (0, pad_timesteps, 0, pad_subjects))  # [I, T]
        mask_padded = F.pad(mask, (0, pad_timesteps, 0, pad_subjects))  # [I, T]
        subj_mask_padded = F.pad(subj_mask, (0, pad_subjects))  # [I]
        dose_padded = F.pad(doses, (0, pad_subjects))  # [I]
        route_padded = F.pad(routes, (0, pad_subjects))  # [I]
        # In-place extension: this mutates the list already stored in
        # subject_names_per_substance, which is what gets returned below.
        subj_names += [""] * pad_subjects  # [I] -> pad with ""

        all_observations.append(obs_padded)
        all_times.append(time_padded)
        all_masks.append(mask_padded)
        all_subjects_mask.append(subj_mask_padded)
        all_doses.append(dose_padded)
        all_routes.append(route_padded)

    return EmpiricalSubstanceTensorBundle(
        observations=torch.stack(all_observations),  # [S, I, T]
        times=torch.stack(all_times),  # [S, I, T]
        masks=torch.stack(all_masks),  # [S, I, T]
        individuals_mask=torch.stack(all_subjects_mask),  # [S, I]
        substance_names=list(substance_labels),  # [S]
        mapping=mapping,
        study_names=study_names_per_substance,  # [S]
        individuals_names=subject_names_per_substance,  # [S][I]
        dosing_amounts=torch.stack(all_doses),  # [S, I]
        dosing_route_types=torch.stack(all_routes)  # [S, I]
    )
sim_priors_pk/data/data_preprocessing/tensors_to_databatch.py ADDED
@@ -0,0 +1,72 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Utility for initializing :class:`AICMECompartmentsDataBatch` objects.
2
+
3
+ This small helper is primarily used in older preprocessing scripts. It takes
4
+ precomputed observation tensors and wraps them into a minimal
5
+ ``AICMECompartmentsDataBatch`` where only the context fields are populated.
6
+ All other entries are set to empty tensors or placeholders so that the
7
+ resulting object conforms to the new metadata interface.
8
+ """
9
+
10
+ from __future__ import annotations
11
+
12
+ import torch
13
+
14
+ from sim_priors_pk.data.datasets.aicme_batch import AICMECompartmentsDataBatch
15
+
16
+
17
def initialize_aicme_batch(
    observations: torch.Tensor,
    observations_times: torch.Tensor,
    observations_mask: torch.Tensor,
) -> AICMECompartmentsDataBatch:
    """Wrap raw ``[I, T]`` tensors into a minimal :class:`AICMECompartmentsDataBatch`.

    Only the context fields are filled; every target field is ``None`` or an
    empty placeholder so the result still conforms to the metadata interface.

    Parameters
    ----------
    observations:
        Tensor of shape ``[I, T]`` containing concentration values.
    observations_times:
        Tensor of shape ``[I, T]`` with the corresponding time points.
    observations_mask:
        Boolean tensor of shape ``[I, T]`` indicating valid entries.

    Returns
    -------
    AICMECompartmentsDataBatch
        Batch with ``B=1`` where all context fields are populated and the
        remaining fields are placeholders (zeros or empty strings).
    """
    n_subjects = observations.shape[0]

    # Lift to batch size B=1; observations and times additionally gain a
    # trailing feature dimension of size 1.
    ctx_obs = observations[None, ..., None]             # [1, I, T, 1]
    ctx_obs_time = observations_times[None, ..., None]  # [1, I, T, 1]
    ctx_obs_mask = observations_mask[None, ...]         # [1, I, T]

    return AICMECompartmentsDataBatch(
        target_obs=None,
        target_obs_time=None,
        target_obs_mask=None,
        target_rem_sim=None,
        target_rem_sim_time=None,
        target_rem_sim_mask=None,
        target_dosing_amounts=torch.zeros(1, 0),
        target_dosing_route_types=torch.zeros(1, 0, dtype=torch.long),
        context_obs=ctx_obs,
        context_obs_time=ctx_obs_time,
        context_obs_mask=ctx_obs_mask,
        context_rem_sim=None,
        context_rem_sim_time=None,
        context_rem_sim_mask=None,
        context_dosing_amounts=torch.zeros(1, n_subjects),
        context_dosing_route_types=torch.zeros(1, n_subjects, dtype=torch.long),
        study_name=[""],
        context_subject_name=[[""] * n_subjects],
        target_subject_name=[[]],
        substance_name=[""],
        time_scales=None,
        is_empirical=False,
    )
72
+
sim_priors_pk/data/datasets/aicme_batch.py ADDED
@@ -0,0 +1,167 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Batch structures shared between synthetic and empirical pipelines."""
2
+
3
+ from __future__ import annotations
4
+
5
+ from collections import namedtuple
6
+ from typing import List, NamedTuple
7
+
8
+ import torch
9
+ from torchtyping import TensorType
10
+
11
# Lightweight record of the tensor shapes making up one batch: batch size,
# context/target individual counts and the kept/remaining observation counts
# on each side.
ShapeConfig = namedtuple(
    "ShapeConfig",
    "batch_size c_individuals num_obs_c remaining_obs_c "
    "t_individuals num_obs_t remaining_obs_t",
)
23
+
24
+
25
class AICMECompartmentsDataBatch(NamedTuple):
    """Container aggregating context and target trajectories.

    The tuple carries tensors describing observed measurements, simulated
    remainders, dosing metadata and masking utilities used across both the
    synthetic simulation pipeline and the empirical JSON tooling.

    Field order is significant: callers construct instances positionally
    (e.g. ``AICMECompartmentsDataBatch(*values)``), so never reorder fields.
    Shape legend: ``B`` batch, ``c_ind``/``t_ind`` context/target
    individuals, ``num_obs_*`` kept observations, ``rem_obs_*`` remaining
    (held-out) observations.
    """

    # max_num_individuals-max_n_new_individuals = n_c_individuals
    # --- target side: observed (past) segment -------------------------------
    target_obs: TensorType["B", "t_ind", "num_obs_t", 1]
    target_obs_time: TensorType["B", "t_ind", "num_obs_t", 1]
    target_obs_mask: TensorType["B", "t_ind", "num_obs_t"]

    # --- target side: simulated remainder (future) segment ------------------
    target_rem_sim: TensorType["B", "t_ind", "rem_obs_t", 1]
    target_rem_sim_time: TensorType["B", "t_ind", "rem_obs_t", 1]
    target_rem_sim_mask: TensorType["B", "t_ind", "rem_obs_t"]

    # --- context side: observed segment -------------------------------------
    context_obs: TensorType["B", "c_ind", "num_obs_c", 1]
    context_obs_time: TensorType["B", "c_ind", "num_obs_c", 1]
    context_obs_mask: TensorType["B", "c_ind", "num_obs_c"]

    # --- context side: simulated remainder segment --------------------------
    context_rem_sim: TensorType["B", "c_ind", "rem_obs_c", 1]
    context_rem_sim_time: TensorType["B", "c_ind", "rem_obs_c", 1]
    context_rem_sim_mask: TensorType["B", "c_ind", "rem_obs_c"]

    # Dosing information (one scalar amount and one route id per individual)
    target_dosing_amounts: TensorType["B", "t_ind"]
    target_dosing_route_types: TensorType["B", "t_ind"]
    context_dosing_amounts: TensorType["B", "c_ind"]
    context_dosing_route_types: TensorType["B", "c_ind"]

    # Masks over padded individuals (1 = real individual, 0 = padding slot)
    mask_context_individuals: TensorType["B", "c_ind"]
    mask_target_individuals: TensorType["B", "t_ind"]

    # Tracking metadata for provenance / debugging
    study_name: List[str]
    """Study identifier for each element in the batch (length ``B``)."""
    context_subject_name: List[List[str]]
    """Names of context individuals: shape ``[B][c_ind]``."""
    target_subject_name: List[List[str]]
    """Names of target individuals: shape ``[B][t_ind]``."""
    substance_name: List[str]
    """Drug or compound names corresponding to each study (length ``B``)."""

    # Meta information
    time_scales: TensorType["B", 2]  # shape : [B,2]
    is_empirical: bool = False  # True ⇢ empirical CSV, False ⇢ simulation

    @property
    def mask_individuals(self) -> TensorType["B", "c_ind"]:
        """Alias for backward compatibility; returns ``mask_context_individuals``."""

        return self.mask_context_individuals

    def detach_all(self) -> "AICMECompartmentsDataBatch":
        """Detaches all tensor fields from the computation graph.

        Non-tensor fields (strings, bools) are passed through unchanged.
        """

        return AICMECompartmentsDataBatch(
            *(t.detach() if isinstance(t, torch.Tensor) else t for t in self)
        )

    def log_transform(self) -> "AICMECompartmentsDataBatch":
        """Applies log transformation to observation and remainder tensors.

        Deprecated for training: log scaling is now expected to be handled by
        ``PKScaler`` (for example via ``value_method="log"`` or
        ``value_method="log_and_max"``).
        Kept for backward compatibility with older utilities.
        """

        transformed_tensors = []
        # Only the value tensors are log-transformed; times, masks, dosing
        # and metadata fields are copied as-is.
        for name, tensor in zip(self._fields, self):
            if name in [
                "target_obs",
                "target_rem_sim",
                "context_obs",
                "context_rem_sim",
            ]:
                # 1e-6 epsilon avoids log(0) on exact-zero concentrations.
                transformed_tensors.append(torch.log(tensor + 1e-6))
            else:
                transformed_tensors.append(tensor)
        return AICMECompartmentsDataBatch(*transformed_tensors)

    def to_device(self, device: torch.device) -> "AICMECompartmentsDataBatch":
        """Moves all tensor fields to the specified device (leaves strings untouched)."""

        return AICMECompartmentsDataBatch(
            *(t.to(device) if isinstance(t, torch.Tensor) else t for t in self)
        )

    def to(self, device: torch.device | str) -> "AICMECompartmentsDataBatch":
        """PyTorch-style alias delegating to :meth:`to_device`.

        Several generic utilities expect batch-like objects to implement
        ``.to(device)``. Exposing this alias keeps the explicit
        ``to_device(...)`` API while allowing those utilities to move the full
        databatch onto the target device safely.
        """

        return self.to_device(torch.device(device))

    def to_reconstruct_type(self) -> "AICMECompartmentsDataBatch":
        """
        Return a new databatch where the target trajectories are reconstructed
        by concatenating observed and remainder segments, then right-padding
        so that the target has the same time dimension as the context.
        The context is left untouched.

        NOTE(review): assumes observed entries occupy the first ``o_len``
        positions of each target row (i.e. masks are left-packed) — confirm
        against the observation strategies that build these batches.
        """

        B, Ic, Tc, _ = self.context_obs.shape  # context time dimension is reference
        _, It, _, _ = self.target_obs.shape

        T_max = Tc  # max length for padding

        # allocate reconstructed tensors
        Xt_full = torch.zeros(
            B, It, T_max, 1, dtype=self.target_obs.dtype, device=self.target_obs.device
        )
        Tt_full = torch.zeros(
            B, It, T_max, 1, dtype=self.target_obs_time.dtype, device=self.target_obs_time.device
        )
        Mt_full = torch.zeros(B, It, T_max, dtype=torch.bool, device=self.target_obs_mask.device)

        # fill with observed + remainder segments
        for b in range(B):
            for i in range(It):
                o_len = int(self.target_obs_mask[b, i].sum().item())
                r_len = int(self.target_rem_sim_mask[b, i].sum().item())
                total = o_len + r_len
                if total == 0:
                    continue
                Xt_full[b, i, :o_len] = self.target_obs[b, i, :o_len]
                Xt_full[b, i, o_len:total] = self.target_rem_sim[b, i, :r_len]
                Tt_full[b, i, :o_len] = self.target_obs_time[b, i, :o_len]
                Tt_full[b, i, o_len:total] = self.target_rem_sim_time[b, i, :r_len]
                Mt_full[b, i, :total] = True

        return self._replace(
            target_obs=Xt_full,
            target_obs_time=Tt_full,
            target_obs_mask=Mt_full,
        )
sim_priors_pk/data/datasets/aicme_datasets.py ADDED
@@ -0,0 +1,1874 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ import random
3
+ import tempfile
4
+ import warnings
5
+ from dataclasses import replace
6
+ from pathlib import Path
7
+ from typing import Dict, List, Optional, Sequence, Tuple
8
+
9
+ import lightning.pytorch as pl
10
+ import torch
11
+ from torch import Tensor
12
+ from torch.utils.data import DataLoader, Dataset
13
+ from torch.utils.data.dataloader import default_collate
14
+
15
+ from sim_priors_pk import data_dir
16
+ from sim_priors_pk.config_classes.node_pk_config import NodePKExperimentConfig
17
+ from sim_priors_pk.data.data_generation.compartment_models_management import (
18
+ prepare_full_simulation,
19
+ prepare_full_simulation_list_with_repeated_targets as prepare_full_simulation_list_with_repeated_targets_backend,
20
+ prepare_full_simulation_with_repeated_targets,
21
+ )
22
+ from sim_priors_pk.data.data_generation.observations_classes import (
23
+ ObservationStrategyFactory,
24
+ )
25
+ from sim_priors_pk.data.datasets.aicme_batch import (
26
+ AICMECompartmentsDataBatch,
27
+ )
28
+ from sim_priors_pk.utils.tensors_operations import ensure_mask_or_empty, ensure_tensor_or_empty
29
+
30
+
31
def ensure_min_valid(mask, min_length):
    """Guarantee at least ``min_length`` ones along the last axis of ``mask``.

    Rows already containing ``min_length`` or more valid entries are left
    untouched. Deficient rows are rebuilt: existing valid positions are kept
    and extra positions are promoted (chosen via a small random jitter) until
    exactly ``min_length`` entries are set.

    NOTE(review): assumes a floating-point mask — ``torch.rand_like`` fails on
    boolean tensors; confirm callers pass float masks.
    """
    counts = mask.sum(dim=-1, keepdim=True)
    deficient = counts < min_length

    if not deficient.any():
        return mask

    # The jitter (< 1) breaks ties among the zero entries so the extra valid
    # slots are picked at random, while existing ones (value >= 1) always win.
    scores = mask + torch.rand_like(mask) * 0.01
    _, chosen = torch.topk(scores, k=min_length, dim=-1, sorted=True)

    rebuilt = torch.zeros_like(mask)
    rebuilt.scatter_(-1, chosen, 1.0)

    # Only deficient rows adopt the rebuilt mask; the rest keep the original.
    return torch.where(deficient, rebuilt, mask)
52
+
53
+
54
def is_valid_simulation(sim: torch.Tensor, upper: float = 10.0) -> bool:
    """Check that a simulated trajectory is numerically usable.

    A simulation is valid when every value is finite, non-negative and
    strictly below ``upper`` (default ``10.0``, preserving the original
    hard-coded threshold while allowing other ranges).

    The explicit ``bool(...)`` conversion fixes the original annotation
    mismatch: chaining tensor ``.all()`` results with ``and`` returned a
    0-dim tensor, not a Python ``bool``.
    """
    return bool(
        torch.isfinite(sim).all() and (sim >= 0).all() and (sim < upper).all()
    )
57
+
58
+
59
def _stack_one_perm(
    batches: Sequence["AICMECompartmentsDataBatch"],
) -> "AICMECompartmentsDataBatch":
    """Merge several per-sample batches into one batch, field by field.

    String metadata fields are concatenated as plain Python lists; every
    other field is stacked along a new batch dimension via
    :func:`default_collate`.
    """
    flat_string_fields = {"study_name", "substance_name"}
    nested_string_fields = {"context_subject_name", "target_subject_name"}

    collated = []
    for field_name in AICMECompartmentsDataBatch._fields:
        values = [getattr(batch, field_name) for batch in batches]

        if field_name in flat_string_fields:
            # Flatten per-batch name lists into one list of strings.
            flattened = []
            for value in values:
                if isinstance(value, (list, tuple)):
                    flattened.extend(map(str, value))
                elif isinstance(value, str):
                    flattened.append(value)
                else:
                    raise TypeError(f"Unexpected type for {field_name}: {type(value)}")
            collated.append(flattened)
        elif field_name in nested_string_fields:
            # Keep one inner list per individual group, copied defensively.
            nested = []
            for value in values:
                if not isinstance(value, (list, tuple)):
                    raise TypeError(f"Unexpected type for {field_name}: {type(value)}")
                nested.extend(list(inner) for inner in value)
            collated.append(nested)
        else:
            collated.append(default_collate(values))

    return AICMECompartmentsDataBatch(*collated)
91
+
92
+
93
+ def _collate_aicme_batches(batch_list):
94
+ """
95
+ Handles:
96
+ - [B] of AICMECompartmentsDataBatch → returns one collated batch
97
+ - [B][P] of AICMECompartmentsDataBatch → returns list of P collated batches
98
+ """
99
+ if not batch_list:
100
+ return batch_list
101
+
102
+ first = batch_list[0]
103
+
104
+ # Case 1: flat list of AICME batches
105
+ if hasattr(first, "_fields"): # NamedTuple-like
106
+ return _stack_one_perm(batch_list)
107
+
108
+ # Case 2: nested [B][P]
109
+ if isinstance(first, (list, tuple)) and hasattr(first[0], "_fields"):
110
+ # transpose [B][P] -> [P][B]
111
+ transposed = list(zip(*batch_list))
112
+ return [_stack_one_perm(list(group)) for group in transposed]
113
+
114
+ # If we reach here and elements are Tensors, do NOT recurse further.
115
+ if torch.is_tensor(first):
116
+ raise TypeError(
117
+ "Got a list of tensors instead of AICMECompartmentsDataBatch. "
118
+ "Check that your Dataset returns AICMECompartmentsDataBatch, not raw tensors."
119
+ )
120
+
121
+ raise TypeError(
122
+ f"Unexpected element type in batch_list: {type(first)}. "
123
+ "Expected AICMECompartmentsDataBatch or list thereof."
124
+ )
125
+
126
+
127
def split_individuals_tensor_batch(
    full_tensor_a: torch.Tensor,
    full_tensor_b: torch.Tensor,
    full_tensor_c: Optional[torch.Tensor],
    n_of_target_individuals: int,
    seed: Optional[int] = None,
) -> Tuple[
    torch.Tensor,
    torch.Tensor,
    Optional[torch.Tensor],
    Optional[torch.Tensor],
    Optional[torch.Tensor],
    Optional[torch.Tensor],
]:
    """Randomly split individuals (dim 0) into context and target groups.

    Parameters
    ----------
    full_tensor_a, full_tensor_b:
        Tensors indexed by individual along dim 0 (e.g. values and times).
    full_tensor_c:
        Optional third tensor split identically (e.g. a mask); may be None.
    n_of_target_individuals:
        Number of individuals sampled without replacement as targets.
    seed:
        Optional seed. NOTE: this reseeds the *global* ``random`` module,
        which callers relying on its state should be aware of.

    Returns
    -------
    ``(context_a, context_b, context_c, target_a, target_b, target_c)``.
    When ``n_of_target_individuals == 0`` the three target entries are
    ``None`` (the original return annotation wrongly declared them
    non-optional; fixed here).
    """
    num_individuals = full_tensor_a.shape[0]
    if seed is not None:
        random.seed(seed)

    if n_of_target_individuals == 0:
        return full_tensor_a, full_tensor_b, full_tensor_c, None, None, None

    all_indices = list(range(num_individuals))
    target_indices = random.sample(all_indices, n_of_target_individuals)
    context_indices = [i for i in all_indices if i not in target_indices]

    context_a = full_tensor_a[context_indices]
    context_b = full_tensor_b[context_indices]
    context_c = full_tensor_c[context_indices] if full_tensor_c is not None else None

    target_a = full_tensor_a[target_indices]
    target_b = full_tensor_b[target_indices]
    target_c = full_tensor_c[target_indices] if full_tensor_c is not None else None

    return context_a, context_b, context_c, target_a, target_b, target_c
161
+
162
+
163
def list_of_databath_to_device(
    batch_list: List[AICMECompartmentsDataBatch],
    device: torch.device | str,
) -> List[AICMECompartmentsDataBatch]:
    """Move every batch in ``batch_list`` to ``device``.

    Parameters
    ----------
    batch_list:
        List of :class:`AICMECompartmentsDataBatch` objects.
    device:
        Target device.

    Note: the name ("databath") is a historical typo kept for backward
    compatibility with existing callers.
    """
    moved: List[AICMECompartmentsDataBatch] = []
    for databatch in batch_list:
        moved.append(databatch.to_device(device))
    return moved
177
+
178
+
179
def build_reconstruction_db(
    db: AICMECompartmentsDataBatch,
) -> AICMECompartmentsDataBatch:
    """
    Reconstruct the target trajectories by concatenating observed and remainder
    segments, then right-padding so that the target has the same time dimension
    as the context. The context is left untouched.

    Returns a new AICMECompartmentsDataBatch.
    """
    # The original body was a line-for-line duplicate of
    # ``AICMECompartmentsDataBatch.to_reconstruct_type``; delegate to the
    # method so the two implementations cannot drift apart.
    return db.to_reconstruct_type()
222
+
223
+
224
+ class AICMECompartmentsDataset(Dataset):
225
+ """Dataset generating synthetic PK batches for AICME models.
226
+
227
+ Target observation strategies should already divide past and future
228
+ observations (``split_past_future=True``).
229
+ """
230
+
231
+ def __init__(
232
+ self,
233
+ model_config: NodePKExperimentConfig,
234
+ ctx_fn,
235
+ tgt_fn,
236
+ number_of_process=1000,
237
+ *,
238
+ store_in_tempfile: bool = False,
239
+ keep_tempfile: bool = False,
240
+ recreate_tempfile: bool = False,
241
+ tempfile_path: str | None = None,
242
+ show_progress: bool = True,
243
+ split: str = "",
244
+ use_shared_target_dosing: bool = False,
245
+ shared_target_n_targets: int = 100,
246
+ ):
247
+ self.mix_data_config = model_config.mix_data
248
+ self.meta_study_config = model_config.meta_study
249
+ self.meta_dosing_config = model_config.dosing
250
+ self.number_of_process = number_of_process
251
+ # ``n_of_permutations`` specifies how many shuffled versions of the
252
+ # context/target split are generated for a single simulation.
253
+ # ``n_of_databatches`` is a deprecated alias kept for backward
254
+ # compatibility and mirrors ``n_of_permutations``.
255
+ self.n_of_permutations = model_config.mix_data.n_of_permutations
256
+ self.n_of_databatches = self.n_of_permutations # deprecated alias
257
+ self.n_of_target_individuals = int(model_config.mix_data.n_of_target_individuals)
258
+ if self.n_of_target_individuals < 0:
259
+ raise ValueError("n_of_target_individuals must be >= 0")
260
+
261
+ # `num_individuals_range` controls context individuals only.
262
+ self.min_context_individuals = int(self.meta_study_config.num_individuals_range[0])
263
+ self.max_context_individuals = int(self.meta_study_config.num_individuals_range[-1])
264
+ if self.min_context_individuals < 0:
265
+ raise ValueError("meta_study.num_individuals_range minimum must be >= 0")
266
+ if self.max_context_individuals < self.min_context_individuals:
267
+ raise ValueError("meta_study.num_individuals_range must satisfy max >= min")
268
+
269
+ # Fixed total capacity used by downstream consumers.
270
+ self.max_individuals = self.max_context_individuals + self.n_of_target_individuals
271
+
272
+ self.context_fn = ctx_fn
273
+ self.target_fn = tgt_fn
274
+ self.store_in_tempfile = store_in_tempfile
275
+ self.keep_tempfile = keep_tempfile
276
+ self.recreate_tempfile = recreate_tempfile
277
+ self.show_progress = True
278
+ self._tmpfile_path: List[str] | None = None
279
+ self._loaded_data = None
280
+ self.run_id = getattr(model_config, "run_index", 0)
281
+ self.model_name = model_config.name_str
282
+
283
+ if self.store_in_tempfile:
284
+ self._prepare_tempfile_data(tempfile_path=tempfile_path, split=split)
285
+
286
+ self.use_shared_target_dosing = use_shared_target_dosing
287
+ self.shared_target_n_targets = shared_target_n_targets
288
+
289
+ def __del__(self):
290
+ if (
291
+ self.store_in_tempfile
292
+ and not self.keep_tempfile
293
+ and self._tmpfile_path
294
+ and os.path.exists(self._tmpfile_path)
295
+ ):
296
+ os.remove(self._tmpfile_path)
297
+
298
    def __len__(self):
        # ``number_of_process`` caps how many synthetic simulations the
        # DataLoader will request; it effectively sets the epoch length for
        # this on-the-fly generator.
        return self.number_of_process  # Arbitrary large number to simulate infinite data
300
+
301
    def _prepare_tempfile_data(self, *, tempfile_path: str | None, split: str) -> None:
        """Handle creation and (re)generation of the temporary data file.

        With ``tempfile_path=None`` an anonymous ``.pt`` tempfile is created;
        otherwise the path (a string, or a tuple of path parts relative to
        ``data_dir``) is decorated with the model name, split and run id.
        The dataset is (re)generated and saved whenever ``recreate_tempfile``
        is set or the file does not yet exist.
        """
        if tempfile_path is None:
            # delete=False keeps the file alive after close; cleanup happens
            # in ``__del__`` depending on ``keep_tempfile``.
            tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".pt")
            self._tmpfile_path = tmp.name
            tmp.close()
        else:
            # Allow both Tuple paths from YAML and plain strings
            if isinstance(tempfile_path, (tuple, list)):
                base_path = os.path.join(data_dir, *tempfile_path)
            else:
                base_path = tempfile_path

            dirname = os.path.dirname(base_path)
            basename = os.path.basename(base_path)

            # Encode model, split and run id into the cache filename so
            # different configurations never collide on disk.
            suffix = f"_{self.model_name}_{split}"
            if self.run_id is not None:
                suffix += f"_run{self.run_id}"
            new_basename = basename + suffix + ".tr"

            self._tmpfile_path = os.path.join(dirname, new_basename)

        if self.recreate_tempfile or not os.path.exists(self._tmpfile_path):
            print("RECREATING DATASET!")
            iterator = range(self.number_of_process)
            if self.show_progress:
                # Lazy import: tqdm is only needed when a progress bar is shown.
                from tqdm.auto import tqdm

                iterator = tqdm(iterator, desc="Generating AICME data")
            data = [self._generate_item(i) for i in iterator]
            torch.save(data, self._tmpfile_path)
333
+
334
+ def split_simulations(
335
+ self, full_simulation, full_simulation_times
336
+ ) -> Tuple[
337
+ torch.Tensor,
338
+ torch.Tensor,
339
+ Optional[torch.Tensor],
340
+ Optional[torch.Tensor],
341
+ list[int],
342
+ list[int],
343
+ ]:
344
+ """
345
+ From the full simulation, randomly select `n_of_target_individuals` as targets and keep the rest as context.
346
+ If `n_of_target_individuals == 0`, returns None for the target fields.
347
+ """
348
+ n_of_target_individuals = self.n_of_target_individuals
349
+ num_individuals = full_simulation.shape[0]
350
+
351
+ if n_of_target_individuals == 0:
352
+ context_simulation = full_simulation
353
+ context_simulation_times = full_simulation_times
354
+ return (
355
+ context_simulation,
356
+ context_simulation_times,
357
+ None,
358
+ None,
359
+ list(range(num_individuals)),
360
+ [],
361
+ )
362
+
363
+ if num_individuals < n_of_target_individuals:
364
+ raise ValueError(
365
+ "Simulation contains fewer individuals than requested targets: "
366
+ f"num_individuals={num_individuals}, "
367
+ f"n_of_target_individuals={n_of_target_individuals}."
368
+ )
369
+
370
+ # Randomly select indices for target individuals
371
+ target_indices = random.sample(range(num_individuals), n_of_target_individuals)
372
+ context_indices = [i for i in range(num_individuals) if i not in target_indices]
373
+
374
+ # Split the simulations, times, and masks
375
+ target_simulation = full_simulation[target_indices]
376
+ target_simulation_times = full_simulation_times[target_indices]
377
+ context_simulation = full_simulation[context_indices]
378
+ context_simulation_times = full_simulation_times[context_indices]
379
+
380
+ return (
381
+ context_simulation,
382
+ context_simulation_times,
383
+ target_simulation,
384
+ target_simulation_times,
385
+ context_indices,
386
+ target_indices,
387
+ )
388
+
389
+ def _build_generation_meta_study_config(self):
390
+ """Return a meta-study config where totals include fixed target individuals.
391
+
392
+ The user-facing ``meta_study.num_individuals_range`` represents context
393
+ individuals only. For raw simulation generation, we therefore sample
394
+ ``context + n_of_target_individuals`` total individuals.
395
+ """
396
+ total_min = self.min_context_individuals + self.n_of_target_individuals
397
+ total_max = self.max_context_individuals + self.n_of_target_individuals
398
+
399
+ if getattr(self.meta_study_config, "simple_mode", False):
400
+ total_individuals = random.randint(total_min, total_max)
401
+ return replace(
402
+ self.meta_study_config,
403
+ num_individuals=total_individuals,
404
+ num_individuals_range=(total_individuals, total_individuals),
405
+ )
406
+
407
+ return replace(
408
+ self.meta_study_config,
409
+ num_individuals_range=(total_min, total_max),
410
+ )
411
+
412
    def __getitem__(self, idx):
        """Return one item: either cached data or a freshly generated batch list.

        In tempfile mode the whole cache is lazily loaded on first access and
        indices are re-mapped per distributed rank so ranks see disjoint
        (wrapped) slices. Otherwise a batch is generated on the fly via
        ``_generate_item`` or the shared-target variant.
        """
        if self.store_in_tempfile:
            if self._loaded_data is None:
                # Lazy load so DataLoader workers each read the file once.
                self._loaded_data = torch.load(self._tmpfile_path, weights_only=False)
            # If in distributed mode, adjust the index based on process rank/world size
            if torch.distributed.is_initialized():
                rank = torch.distributed.get_rank()
                world_size = torch.distributed.get_world_size()
                total_len = len(self._loaded_data)
                # Compute adjusted indices for this rank
                adjusted_idx = idx * world_size + rank
                if adjusted_idx >= total_len:
                    # If we would go out of bounds, wrap around to get a valid index
                    adjusted_idx = adjusted_idx % total_len
                return self._loaded_data[adjusted_idx]
            return self._loaded_data[idx]

        if self.use_shared_target_dosing:
            return self._generate_item_sample_target_dosing(
                idx, n_targets=self.shared_target_n_targets
            )
        return self._generate_item(idx)
434
+
435
    def _generate_item(self, idx) -> List[AICMECompartmentsDataBatch]:
        """Generate a list of ``AICMECompartmentsDataBatch`` objects.

        Each element corresponds to one permutation of the context/target split.
        Target observations are generated using ``target_fn``, which is expected
        to divide past and future observations.

        Note: ``idx`` is not used to seed the simulation here; every call
        produces an independent random simulation.
        """
        # One raw simulation is shared by all permutations below; only the
        # context/target split and the sampled observations differ per loop.
        (
            full_simulation,
            full_simulation_times,
            dosing_amounts,
            dosing_routes,
            time_points,
            time_scales,
        ) = prepare_full_simulation(
            self._build_generation_meta_study_config(),
            self.meta_dosing_config,
        )

        list_of_databatches: List[AICMECompartmentsDataBatch] = []
        for _ in range(self.n_of_permutations):
            # Split into context and target
            (
                context_simulation,
                context_simulation_times,
                target_simulation,
                target_simulation_times,
                context_indices,
                target_indices,
            ) = self.split_simulations(full_simulation, full_simulation_times)

            context_observations = self._safe_generate(
                self.context_fn,
                context_simulation,
                context_simulation_times,
                time_scales=time_scales,
            )

            target_observations = self._safe_generate(
                self.target_fn,
                target_simulation,
                target_simulation_times,
                time_scales=time_scales,
            )

            (
                context_obs,  # [c_ind, num_obs_c, 1]
                context_obs_time,  # [c_ind, num_obs_c, 1]
                context_obs_mask,  # [c_ind, num_obs_c]
                context_rem_sim,  # [c_ind, rem_obs_c, 1]
                context_rem_sim_time,  # [c_ind, rem_obs_c, 1]
                context_rem_sim_mask,  # [c_ind, rem_obs_c]
                context_time_scales,
            ) = context_observations

            (
                target_obs,  # [t_ind, num_obs_t, 1]
                target_obs_time,  # [t_ind, num_obs_t, 1]
                target_obs_mask,  # [t_ind, num_obs_t]
                target_rem_sim,  # [t_ind, rem_obs_t, 1]
                target_rem_sim_time,  # [t_ind, rem_obs_t, 1]
                target_rem_sim_mask,  # [t_ind, rem_obs_t]
                target_time_scales,
            ) = target_observations

            # Use provided time scales or fall back to simulation defaults
            # (context scales win over target scales, which win over the raw
            # simulation's scales).
            ts = (
                context_time_scales
                if context_time_scales is not None
                else target_time_scales
                if target_time_scales is not None
                else time_scales
            )

            batch = self._build_padded_batch(
                context_obs,
                context_obs_time,
                context_obs_mask,
                context_rem_sim,
                context_rem_sim_time,
                context_rem_sim_mask,
                dosing_amounts[context_indices],
                dosing_routes[context_indices],
                target_obs,
                target_obs_time,
                target_obs_mask,
                target_rem_sim,
                target_rem_sim_time,
                target_rem_sim_mask,
                # Dosing is only indexable when targets exist.
                dosing_amounts[target_indices] if len(target_indices) > 0 else None,
                dosing_routes[target_indices] if len(target_indices) > 0 else None,
                ts,
            )
            list_of_databatches.append(batch)

        return list_of_databatches
531
+
532
    def _generate_item_sample_target_dosing(
        self,
        idx: int,
        n_targets: int = 100,
        different_dosing: bool = False,
    ):
        """Generate one batch where targets share (or vary) a dosing regimen.

        Uses the repeated-targets simulation backend, so the same underlying
        subject dynamics are replicated ``n_targets`` times; with
        ``different_dosing=True`` each replica gets its own dosing.
        Returns a single-element list to match ``_generate_item``'s
        list-of-permutations contract.
        """
        (
            context_sim,
            context_times,
            target_sim,
            target_times,
            dosing_amounts_ctx,
            dosing_routes_ctx,
            dosing_amounts_tgt,
            dosing_routes_tgt,
            time_points,
            time_scales,
        ) = prepare_full_simulation_with_repeated_targets(
            self.meta_study_config,
            self.meta_dosing_config,
            n_targets,
            different_dosing=different_dosing,
            idx=idx,
        )

        # Observations
        context_obs_pack = self._safe_generate(
            self.context_fn, context_sim, context_times, time_scales=time_scales
        )
        target_obs_pack = self._safe_generate(
            self.target_fn, target_sim, target_times, time_scales=time_scales
        )

        (
            context_obs,
            context_obs_time,
            context_obs_mask,
            context_rem_sim,
            context_rem_sim_time,
            context_rem_sim_mask,
            context_time_scales,
        ) = context_obs_pack

        (
            target_obs,
            target_obs_time,
            target_obs_mask,
            target_rem_sim,
            target_rem_sim_time,
            target_rem_sim_mask,
            target_time_scales,
        ) = target_obs_pack

        # Prefer context scales, then target scales, then simulation defaults.
        ts = (
            context_time_scales
            if context_time_scales is not None
            else (target_time_scales or time_scales)
        )

        # Build batch
        batch = self._build_padded_batch(
            # context
            context_obs,
            context_obs_time,
            context_obs_mask,
            context_rem_sim,
            context_rem_sim_time,
            context_rem_sim_mask,
            dosing_amounts_ctx,
            dosing_routes_ctx,
            # target
            target_obs,
            target_obs_time,
            target_obs_mask,
            target_rem_sim,
            target_rem_sim_time,
            target_rem_sim_mask,
            dosing_amounts_tgt,
            dosing_routes_tgt,
            # time scales
            ts=ts,
            # target capacity must cover the replicated targets, not the
            # default n_of_target_individuals.
            target_capacity=n_targets,
        )

        return [batch]
617
+
618
+ # ------------------------------------------------------------------ #
619
+ # utilities
620
+ # ------------------------------------------------------------------ #
621
+
622
    def _build_padded_batch(
        self,
        ctx_obs: Tensor,  # [c_ind, num_obs_c]
        ctx_time: Tensor,  # [c_ind, num_obs_c]
        ctx_mask: Tensor,  # [c_ind, num_obs_c]
        ctx_rem: Optional[Tensor],  # [c_ind, rem_obs_c] | None
        ctx_rem_time: Optional[Tensor],  # [c_ind, rem_obs_c] | None
        ctx_rem_mask: Optional[Tensor],  # [c_ind, rem_obs_c] | None
        ctx_dose: Tensor,  # [c_ind]
        ctx_route: Tensor,  # [c_ind]
        tgt_obs: Optional[Tensor],  # [t_ind, num_obs_t] | None
        tgt_time: Optional[Tensor],  # [t_ind, num_obs_t] | None
        tgt_mask: Optional[Tensor],  # [t_ind, num_obs_t] | None
        tgt_rem: Optional[Tensor],  # [t_ind, rem_obs_t] | None
        tgt_rem_time: Optional[Tensor],  # [t_ind, rem_obs_t] | None
        tgt_rem_mask: Optional[Tensor],  # [t_ind, rem_obs_t] | None
        tgt_dose: Optional[Tensor],  # [t_ind] | None
        tgt_route: Optional[Tensor],  # [t_ind] | None
        ts: Tensor,  # [B(=1), 2]
        *,
        target_capacity: Optional[
            int
        ] = None,  # ← NEW (optional). If None, use self.n_of_target_individuals
    ) -> AICMECompartmentsDataBatch:
        """Pad context and target tensors then pack them into a batch.

        Every input is padded (or truncated) along the individuals axis via
        ``_pad_first_dim`` so that context tensors have exactly
        ``self.max_context_individuals`` rows and target tensors have
        ``target_capacity`` rows (falling back to
        ``self.n_of_target_individuals`` when ``target_capacity`` is None).
        ``None`` inputs are replaced by empty placeholder tensors/masks through
        ``ensure_tensor_or_empty`` / ``ensure_mask_or_empty``.

        Boolean per-individual validity masks (``mask_context_individuals`` /
        ``mask_target_individuals``) mark the rows that held real data before
        padding. The batch is flagged ``is_empirical=False`` with empty
        study/subject/substance names — this builder is for synthetic data.
        """

        max_c = self.max_context_individuals  # (unchanged)
        max_t = (
            target_capacity if target_capacity is not None else self.n_of_target_individuals
        )  # ← ONLY CHANGE

        # ── target padding (unchanged) ─────────────────────────────────────────
        # Observations/times get a trailing feature axis of size 1 before
        # padding; masks stay 2-D.
        t_obs_p = self._pad_first_dim(
            ensure_tensor_or_empty(
                tgt_obs.unsqueeze(-1) if tgt_obs is not None else None, (1, 1, 1)
            ),  # to [t_ind, Tt, 1]
            max_t,
        )
        t_time_p = self._pad_first_dim(
            ensure_tensor_or_empty(
                tgt_time.unsqueeze(-1) if tgt_time is not None else None, (1, 1, 1)
            ),  # to [t_ind, Tt, 1]
            max_t,
        )
        t_mask_p = self._pad_first_dim(
            ensure_mask_or_empty(
                tgt_mask if tgt_mask is not None else None, (1, 1)
            ),  # to [t_ind, Tt]
            max_t,
        )
        # Remaining-simulation placeholders reuse the observed first-dim size
        # so shapes stay consistent when only the "rem" pieces are missing.
        t_rem_p = self._pad_first_dim(
            ensure_tensor_or_empty(
                tgt_rem.unsqueeze(-1) if tgt_rem is not None else None, (t_obs_p.size(0), 1, 1)
            ),  # [t_ind, Rt,1]
            max_t,
        )
        t_rem_time_p = self._pad_first_dim(
            ensure_tensor_or_empty(
                tgt_rem_time.unsqueeze(-1) if tgt_rem_time is not None else None,
                (t_obs_p.size(0), 1, 1),
            ),
            max_t,
        )
        t_rem_mask_p = self._pad_first_dim(
            ensure_mask_or_empty(
                tgt_rem_mask if tgt_rem_mask is not None else None, (t_obs_p.size(0), 1)
            ),
            max_t,
        )
        t_dose_p = self._pad_first_dim(
            ensure_tensor_or_empty(tgt_dose if tgt_dose is not None else None, (1,)),  # [t_ind]
            max_t,
        )
        # Route types are categorical indices, hence the .long() cast.
        t_route_p = self._pad_first_dim(
            ensure_tensor_or_empty(tgt_route if tgt_route is not None else None, (1,)),  # [t_ind]
            max_t,
        ).long()

        # ── context padding (unchanged) ────────────────────────────────────────
        # Context tensors are assumed non-None, so no ensure_* fallback here.
        c_obs_p = self._pad_first_dim(ctx_obs, max_c).unsqueeze(-1)  # [c_ind, Tc, 1]
        c_time_p = self._pad_first_dim(ctx_time, max_c).unsqueeze(-1)  # [c_ind, Tc, 1]
        c_mask_p = self._pad_first_dim(ctx_mask, max_c)  # [c_ind, Tc]
        c_rem_p = self._pad_first_dim(
            ensure_tensor_or_empty(
                ctx_rem.unsqueeze(-1) if ctx_rem is not None else None, (ctx_obs.size(0), 1, 1)
            ),
            max_c,
        )
        c_rem_time_p = self._pad_first_dim(
            ensure_tensor_or_empty(
                ctx_rem_time.unsqueeze(-1) if ctx_rem_time is not None else None,
                (ctx_obs.size(0), 1, 1),
            ),
            max_c,
        )
        c_rem_mask_p = self._pad_first_dim(
            ensure_mask_or_empty(
                ctx_rem_mask if ctx_rem_mask is not None else None, (ctx_obs.size(0), 1)
            ),
            max_c,
        )
        c_dose_p = self._pad_first_dim(ctx_dose, max_c)  # [c_ind]
        c_route_p = self._pad_first_dim(ctx_route, max_c).long()  # [c_ind]

        # Per-individual validity masks: True for rows that carried real data.
        total_c = ctx_obs.size(0)
        mask_c_inds = torch.zeros(self.max_context_individuals, dtype=torch.bool)
        mask_c_inds[:total_c] = True

        total_t = tgt_obs.size(0) if tgt_obs is not None else 0
        mask_t_inds = torch.zeros(
            max_t, dtype=torch.bool
        )  # ← use max_t here (unchanged logic, just variable)
        mask_t_inds[:total_t] = True

        return AICMECompartmentsDataBatch(
            target_obs=t_obs_p,
            target_obs_time=t_time_p,
            target_obs_mask=t_mask_p,
            target_rem_sim=t_rem_p,
            target_rem_sim_time=t_rem_time_p,
            target_rem_sim_mask=t_rem_mask_p,
            target_dosing_amounts=t_dose_p,
            target_dosing_route_types=t_route_p,
            context_obs=c_obs_p,
            context_obs_time=c_time_p,
            context_obs_mask=c_mask_p,
            context_rem_sim=c_rem_p,
            context_rem_sim_time=c_rem_time_p,
            context_rem_sim_mask=c_rem_mask_p,
            context_dosing_amounts=c_dose_p,
            context_dosing_route_types=c_route_p,
            mask_context_individuals=mask_c_inds,
            mask_target_individuals=mask_t_inds,
            study_name=[""],
            context_subject_name=[[""] * max_c],
            target_subject_name=[[""] * max_t],  # ← still uses max_t
            substance_name=[""],
            time_scales=ts,
            is_empirical=False,
        )
762
+
763
+ @staticmethod
764
+ def _safe_generate(strategy, sim, times, **kw):
765
+ """
766
+ Call ObservationStrategy.generate() only when `sim` is not None.
767
+ Returns a 7-tuple of Nones otherwise.
768
+ """
769
+ if sim is None:
770
+ return (None, None, None, None, None, None, None)
771
+
772
+ for _ in range(10): # retries, like old manager
773
+ out = strategy.generate(sim, times, **kw)
774
+ if out[0] is not None: # got a non-empty slice
775
+ return out
776
+ raise RuntimeError(
777
+ "Unable to generate non-empty observations "
778
+ "after 10 attempts – check strategy parameters."
779
+ )
780
+
781
+ @staticmethod
782
+ def _pad_first_dim(t: torch.Tensor, size: int) -> torch.Tensor:
783
+ """Pad tensor along the first dimension up to ``size``.
784
+
785
+ Parameters
786
+ ----------
787
+ t : TensorType["I", *Ts]
788
+ Input tensor where ``I`` may be smaller than ``size``.
789
+ size : int
790
+ Desired first-dimension size after padding.
791
+
792
+ Returns
793
+ -------
794
+ TensorType["size", *Ts]
795
+ Tensor padded with zeros (or ``False`` for bool tensors) so that the
796
+ first dimension equals ``size``. If ``t`` already has ``size`` or
797
+ more elements along the first dimension, it is truncated.
798
+ """
799
+
800
+ current = t.size(0)
801
+ if current >= size:
802
+ return t[:size]
803
+
804
+ pad_shape = (size - current, *t.shape[1:])
805
+ pad_value = False if t.dtype == torch.bool else 0.0
806
+ padding = torch.full(pad_shape, pad_value, dtype=t.dtype, device=t.device)
807
+ return torch.cat([t, padding], dim=0)
808
+
809
+
810
+ class AICMECompartmentsDataModule(pl.LightningDataModule):
811
+ """LightningDataModule for synthetic PK simulation data."""
812
+
813
+ # Empirical target batches always use the legacy PK observation strategy
814
+ # with a fixed capacity profile, independent from synthetic target config.
815
+ _EMPIRICAL_TARGET_MAX_NUM_OBS = 15
816
+ _EMPIRICAL_TARGET_MIN_PAST = 0
817
+ _EMPIRICAL_TARGET_MAX_PAST = 5
818
+
819
+ def __init__(
820
+ self,
821
+ model_config: NodePKExperimentConfig,
822
+ ):
823
+ super().__init__()
824
+ self.model_config = model_config
825
+ self.context_config = model_config.context_observations
826
+ self.target_config = model_config.target_observations
827
+ self.meta_config = model_config.meta_study
828
+ self.data_config = model_config.mix_data
829
+ self.study_config = model_config.meta_study
830
+ self.num_workers = model_config.train.num_workers
831
+ self.persistent_workers = model_config.train.persistent_workers
832
+ self.shuffle_val = getattr(model_config.train, "shuffle_val", True)
833
+ self.train_size = self.data_config.train_size
834
+ self.val_size = self.data_config.val_size
835
+ self.test_size = self.data_config.test_size
836
+ self.batch_size = model_config.train.batch_size
837
+ self._prepared = False
838
+ # Cached shape parameters for empirical batch builders
839
+ self.max_individuals: int | None = None
840
+ self.max_observations: int | None = None
841
+ self.max_remaining: int | None = None
842
+ self.empirical_target_config = None
843
+ self.empirical_target_strategy = None
844
+ self.empirical_test_batches: Dict[str, List["AICMECompartmentsDataBatch"]] = {}
845
+ self.empirical_test_batches_no_heldout: Dict[str, List["AICMECompartmentsDataBatch"]] = {}
846
+
847
+ def prepare_data(self):
848
+ # Use this method to download or prepare data if needed.
849
+ # This is called only once and on a single GPU.
850
+ # Here the Observation Manager Also Handles Empirical Data
851
+ tempfile_path = getattr(self.data_config, "tempfile_path", None)
852
+ if tempfile_path:
853
+ temp_dir = Path(data_dir).joinpath(*tempfile_path)
854
+ else:
855
+ temp_dir = Path(data_dir) / "preprocessed"
856
+ temp_dir.mkdir(parents=True, exist_ok=True)
857
+
858
+ self.context_strategy = ObservationStrategyFactory.from_config(
859
+ self.context_config,
860
+ self.meta_config,
861
+ )
862
+ self.target_strategy = ObservationStrategyFactory.from_config(
863
+ self.target_config,
864
+ self.meta_config,
865
+ )
866
+ # Empirical target path: enforce legacy PK strategy and fixed capacities.
867
+ # This is intentionally decoupled from synthetic target strategy settings.
868
+ self.empirical_target_config = replace(
869
+ self.target_config,
870
+ type=None,
871
+ split_past_future=True,
872
+ max_num_obs=self._EMPIRICAL_TARGET_MAX_NUM_OBS,
873
+ min_past=self._EMPIRICAL_TARGET_MIN_PAST,
874
+ max_past=self._EMPIRICAL_TARGET_MAX_PAST,
875
+ )
876
+ self.empirical_target_strategy = ObservationStrategyFactory.from_config(
877
+ self.empirical_target_config,
878
+ self.meta_config,
879
+ )
880
+ self.train_dataset = AICMECompartmentsDataset(
881
+ self.model_config,
882
+ ctx_fn=self.context_strategy,
883
+ tgt_fn=self.target_strategy,
884
+ number_of_process=self.train_size,
885
+ store_in_tempfile=self.data_config.store_in_tempfile,
886
+ keep_tempfile=self.data_config.keep_tempfile,
887
+ recreate_tempfile=self.data_config.recreate_tempfile,
888
+ tempfile_path=self.data_config.tempfile_path,
889
+ show_progress=self.data_config.tqdm_progress,
890
+ split="train",
891
+ )
892
+ self.val_dataset = AICMECompartmentsDataset(
893
+ self.model_config,
894
+ ctx_fn=self.context_strategy,
895
+ tgt_fn=self.target_strategy,
896
+ number_of_process=self.val_size,
897
+ store_in_tempfile=self.data_config.store_in_tempfile,
898
+ keep_tempfile=self.data_config.keep_tempfile,
899
+ recreate_tempfile=self.data_config.recreate_tempfile,
900
+ tempfile_path=self.data_config.tempfile_path,
901
+ show_progress=self.data_config.tqdm_progress,
902
+ split="val",
903
+ )
904
+ self.test_dataset = AICMECompartmentsDataset(
905
+ self.model_config,
906
+ ctx_fn=self.context_strategy,
907
+ tgt_fn=self.target_strategy,
908
+ number_of_process=self.test_size,
909
+ store_in_tempfile=self.data_config.store_in_tempfile,
910
+ keep_tempfile=self.data_config.keep_tempfile,
911
+ recreate_tempfile=self.data_config.recreate_tempfile,
912
+ tempfile_path=self.data_config.tempfile_path,
913
+ show_progress=self.data_config.tqdm_progress,
914
+ split="test",
915
+ )
916
+ # Record shapes for empirical builders
917
+ ctx_obs, ctx_rem = self.context_strategy.get_shapes()
918
+ tgt_obs, tgt_rem = self.target_strategy.get_shapes()
919
+ self.max_observations = max(ctx_obs, tgt_obs)
920
+ self.max_remaining = max(ctx_rem, tgt_rem)
921
+ self.max_individuals = max(
922
+ self.train_dataset.max_context_individuals,
923
+ self.train_dataset.n_of_target_individuals,
924
+ )
925
+ self._prepared = True
926
+ self._empirical_loaded = False
927
+
928
+ # Preload empirical datasets during prepare_data so they are available
929
+ # before training callbacks query them.
930
+ # In DDP, keep network/download activity on rank 0 only.
931
+ if self._is_global_zero_process():
932
+ self._load_empirical_test_batches()
933
+ self._empirical_loaded = True
934
+
935
+ def setup(self, stage=None):
936
+ # Use this method to split data into train, validation, and test sets.
937
+ # This is called on every GPU.
938
+ if not self._prepared:
939
+ self.prepare_data()
940
+
941
+ def train_dataloader(self):
942
+ # Returns the training dataloader.
943
+ num_workers, persistent_workers = self._resolve_dataloader_workers()
944
+ return DataLoader(
945
+ self.train_dataset,
946
+ batch_size=self.batch_size,
947
+ shuffle=True,
948
+ num_workers=num_workers,
949
+ persistent_workers=persistent_workers,
950
+ collate_fn=_collate_aicme_batches,
951
+ )
952
+
953
+ def val_dataloader(self):
954
+ # Returns the validation dataloader.
955
+ num_workers, persistent_workers = self._resolve_dataloader_workers()
956
+ return DataLoader(
957
+ self.val_dataset,
958
+ batch_size=self.batch_size,
959
+ shuffle=self.shuffle_val,
960
+ num_workers=num_workers,
961
+ persistent_workers=persistent_workers,
962
+ collate_fn=_collate_aicme_batches,
963
+ )
964
+
965
+ def test_dataloader(self):
966
+ # Optional: Returns the test dataloader.
967
+ # If you don't have a test set, you can omit this method.
968
+ num_workers, persistent_workers = self._resolve_dataloader_workers()
969
+ return DataLoader(
970
+ self.test_dataset,
971
+ batch_size=self.batch_size,
972
+ shuffle=False,
973
+ num_workers=num_workers,
974
+ persistent_workers=persistent_workers,
975
+ collate_fn=_collate_aicme_batches,
976
+ )
977
+
978
+ def obtain_shapes(self) -> Tuple[int, int, int]:
979
+ """Expose dataset shape parameters for empirical batching.
980
+
981
+ Returns
982
+ -------
983
+ Tuple[int, int, int]
984
+ ``(max_individuals, max_observations, max_remaining)`` as used by
985
+ :class:`AICMECompartmentsDataset`.
986
+ """
987
+
988
+ if not self._prepared:
989
+ self.prepare_data()
990
+
991
+ assert self.max_individuals is not None
992
+ assert self.max_observations is not None
993
+ assert self.max_remaining is not None
994
+ return (
995
+ self.max_individuals,
996
+ self.max_observations,
997
+ self.max_remaining,
998
+ )
999
+
1000
+ def _resolve_dataloader_workers(self) -> Tuple[int, bool]:
1001
+ """Return DataLoader worker settings that are safe for single-process runs."""
1002
+ num_workers = max(0, int(self.num_workers))
1003
+ persistent_workers = self.persistent_workers and num_workers > 0
1004
+ return num_workers, persistent_workers
1005
+
1006
+ @staticmethod
1007
+ def _is_global_zero_process() -> bool:
1008
+ """Return True for rank 0 (or single-process execution)."""
1009
+
1010
+ if torch.distributed.is_available() and torch.distributed.is_initialized():
1011
+ return torch.distributed.get_rank() == 0
1012
+ return True
1013
+
1014
+ def _load_empirical_test_batches(self) -> None:
1015
+ """Download and cache empirical Hugging Face datasets for evaluation."""
1016
+
1017
+ from sim_priors_pk.data.data_empirical import load_empirical_hf_batches_as_dm
1018
+
1019
+ datasets = getattr(self.data_config, "test_empirical_datasets", [])
1020
+ self.empirical_test_batches = {}
1021
+ self.empirical_test_batches_no_heldout = {}
1022
+ if not datasets:
1023
+ return
1024
+
1025
+ for repo_id in datasets:
1026
+ try:
1027
+ batches = load_empirical_hf_batches_as_dm(
1028
+ repo_id,
1029
+ meta_dosing=self.model_config.dosing,
1030
+ datamodule=self,
1031
+ held_out=True,
1032
+ )
1033
+ except Exception as exc: # noqa: BLE001 - surface download issues
1034
+ warnings.warn(
1035
+ f"Failed to load empirical dataset '{repo_id}': {exc}",
1036
+ stacklevel=2,
1037
+ )
1038
+ continue
1039
+
1040
+ if not batches:
1041
+ warnings.warn(
1042
+ f"No empirical batches returned for dataset '{repo_id}'",
1043
+ stacklevel=2,
1044
+ )
1045
+ continue
1046
+
1047
+ self.empirical_test_batches[repo_id] = batches
1048
+ try:
1049
+ no_heldout_batches = load_empirical_hf_batches_as_dm(
1050
+ repo_id,
1051
+ meta_dosing=self.model_config.dosing,
1052
+ datamodule=self,
1053
+ held_out=False,
1054
+ )
1055
+ except Exception as exc: # noqa: BLE001 - surface download issues
1056
+ warnings.warn(
1057
+ f"Failed to load no-heldout empirical dataset '{repo_id}': {exc}",
1058
+ stacklevel=2,
1059
+ )
1060
+ continue
1061
+
1062
+ if not no_heldout_batches:
1063
+ warnings.warn(
1064
+ f"No no-heldout empirical batches returned for dataset '{repo_id}'",
1065
+ stacklevel=2,
1066
+ )
1067
+ continue
1068
+
1069
+ self.empirical_test_batches_no_heldout[repo_id] = no_heldout_batches
1070
+
1071
+ def get_empirical_test_batches(
1072
+ self,
1073
+ *,
1074
+ no_heldout: bool = False,
1075
+ device: Optional[torch.device | str] = None,
1076
+ ) -> Dict[str, List["AICMECompartmentsDataBatch"]]:
1077
+ """Return cached empirical batches keyed by Hugging Face dataset id.
1078
+
1079
+ Parameters
1080
+ ----------
1081
+ no_heldout:
1082
+ If ``True``, return batches where all empirical individuals remain
1083
+ in context (no held-out target). If ``False`` (default), return the
1084
+ leave-one-out batches.
1085
+ device:
1086
+ Optional device where returned batches should live. When provided,
1087
+ returned batches are moved to ``device`` without mutating the
1088
+ internal cache.
1089
+ """
1090
+
1091
+ # Safety fallback for direct/manual datamodule usage.
1092
+ if not getattr(self, "_empirical_loaded", False):
1093
+ if not self._prepared:
1094
+ self.prepare_data()
1095
+ elif self._is_global_zero_process():
1096
+ self._load_empirical_test_batches()
1097
+ self._empirical_loaded = True
1098
+
1099
+ batch_map = (
1100
+ self.empirical_test_batches_no_heldout if no_heldout else self.empirical_test_batches
1101
+ )
1102
+
1103
+ if device is None:
1104
+ return batch_map
1105
+
1106
+ return {
1107
+ repo_id: list_of_databath_to_device(batch_list, device)
1108
+ for repo_id, batch_list in batch_map.items()
1109
+ }
1110
+
1111
+ def get_empirical_batches(
1112
+ self,
1113
+ *,
1114
+ split: str,
1115
+ empirical_name: Optional[str],
1116
+ device: Optional[torch.device | str] = None,
1117
+ ) -> List["AICMECompartmentsDataBatch"]:
1118
+ """Return one empirical batch list using scheduler-oriented split aliases.
1119
+
1120
+ Supported split aliases:
1121
+ - ``empirical_heldout``: leave-one-out empirical targets
1122
+ - ``empirical_no_heldout``: all empirical individuals remain in context
1123
+ """
1124
+
1125
+ normalized_split = str(split).strip().lower()
1126
+ if normalized_split == "empirical_heldout":
1127
+ batch_map = self.get_empirical_test_batches(no_heldout=False, device=device)
1128
+ elif normalized_split == "empirical_no_heldout":
1129
+ batch_map = self.get_empirical_test_batches(no_heldout=True, device=device)
1130
+ else:
1131
+ raise ValueError(
1132
+ f"Unsupported empirical split alias '{split}'. "
1133
+ "Expected 'empirical_heldout' or 'empirical_no_heldout'."
1134
+ )
1135
+
1136
+ if empirical_name is None:
1137
+ raise ValueError("`empirical_name` must be provided for empirical scheduler tasks.")
1138
+ try:
1139
+ return batch_map[str(empirical_name)]
1140
+ except KeyError as exc:
1141
+ raise ValueError(
1142
+ f"No empirical batches found for split='{split}' and empirical_name='{empirical_name}'."
1143
+ ) from exc
1144
+
1145
+ @staticmethod
1146
+ def _normalize_substance_name(name: object) -> str:
1147
+ """Normalize substance names for robust matching."""
1148
+
1149
+ return "".join(ch.lower() for ch in str(name) if ch.isalnum())
1150
+
1151
+ def select_empirical_batch_list(
1152
+ self,
1153
+ dataset_key: Optional[str] = None,
1154
+ *,
1155
+ no_heldout: bool = False,
1156
+ ) -> Tuple[Optional[str], List["AICMECompartmentsDataBatch"]]:
1157
+ """Select one empirical dataset batch list for plotting/evaluation.
1158
+
1159
+ Parameters
1160
+ ----------
1161
+ dataset_key:
1162
+ Explicit dataset key to use. If missing or unknown, the first
1163
+ non-empty dataset in cache is selected.
1164
+ no_heldout:
1165
+ Whether to read from the no-heldout cache.
1166
+
1167
+ Returns
1168
+ -------
1169
+ Tuple[Optional[str], List[AICMECompartmentsDataBatch]]
1170
+ Selected dataset key (or ``None`` if unavailable) and batch list.
1171
+ """
1172
+
1173
+ empirical_batches = self.get_empirical_test_batches(no_heldout=no_heldout)
1174
+ if dataset_key is not None and dataset_key in empirical_batches:
1175
+ selected_key = dataset_key
1176
+ batch_list = empirical_batches[dataset_key]
1177
+ else:
1178
+ selected_key = None
1179
+ batch_list = None
1180
+ for repo_id, batches in empirical_batches.items():
1181
+ if batches:
1182
+ selected_key = repo_id
1183
+ batch_list = batches
1184
+ break
1185
+
1186
+ if not batch_list:
1187
+ label = "no-heldout" if no_heldout else "heldout"
1188
+ raise RuntimeError(f"No empirical {label} batches available for predictive plotting.")
1189
+
1190
+ return selected_key, batch_list
1191
+
1192
+ def describe_empirical_test_batches(
1193
+ self,
1194
+ empirical_batches: Optional[Dict[str, List["AICMECompartmentsDataBatch"]]] = None,
1195
+ *,
1196
+ no_heldout: bool = False,
1197
+ batch_index: int = 0,
1198
+ print_available: bool = True,
1199
+ ) -> Tuple[List[str], List[str]]:
1200
+ """Describe empirical test batches and return available studies/drugs.
1201
+
1202
+ This helper is designed to be called after
1203
+ :meth:`get_empirical_test_batches` in notebook/script workflows.
1204
+
1205
+ Parameters
1206
+ ----------
1207
+ empirical_batches:
1208
+ Optional pre-fetched empirical batches (typically from
1209
+ :meth:`get_empirical_test_batches`). If ``None``, batches are
1210
+ fetched internally.
1211
+ no_heldout:
1212
+ Whether to describe no-heldout batches.
1213
+ batch_index:
1214
+ Batch index to inspect within each dataset. Default is ``0``.
1215
+ print_available:
1216
+ If ``True``, print available datasets/studies/drugs.
1217
+
1218
+ Returns
1219
+ -------
1220
+ Tuple[List[str], List[str]]
1221
+ Unique available study names and drug names from the selected
1222
+ ``batch_index`` across datasets.
1223
+ """
1224
+
1225
+ batch_map = empirical_batches
1226
+ if batch_map is None:
1227
+ batch_map = self.get_empirical_test_batches(no_heldout=no_heldout)
1228
+
1229
+ if batch_index < 0:
1230
+ raise ValueError("batch_index must be non-negative")
1231
+
1232
+ available_studies: List[str] = []
1233
+ available_drugs: List[str] = []
1234
+ seen_studies: set[str] = set()
1235
+ seen_drugs: set[str] = set()
1236
+
1237
+ if print_available:
1238
+ label = "no_heldout=True" if no_heldout else "heldout"
1239
+ print(f"Available empirical datasets ({label}):", list(batch_map.keys()))
1240
+
1241
+ for repo_id, batch_list in batch_map.items():
1242
+ if print_available:
1243
+ print(f"Dataset '{repo_id}' contains {len(batch_list)} empirical batch(es).")
1244
+ if batch_index >= len(batch_list):
1245
+ if print_available:
1246
+ print(
1247
+ f" Skipping dataset '{repo_id}': batch_index={batch_index} "
1248
+ f"is out of range."
1249
+ )
1250
+ continue
1251
+
1252
+ batch = batch_list[batch_index]
1253
+ studies, drugs = self.describe_empirical_batch(batch, print_available=False)
1254
+ for study in studies:
1255
+ if study not in seen_studies:
1256
+ seen_studies.add(study)
1257
+ available_studies.append(study)
1258
+ for drug in drugs:
1259
+ if drug not in seen_drugs:
1260
+ seen_drugs.add(drug)
1261
+ available_drugs.append(drug)
1262
+
1263
+ if print_available:
1264
+ print(f" Batch {batch_index} studies:", studies)
1265
+ print(f" Batch {batch_index} drugs:", drugs)
1266
+
1267
+ if print_available:
1268
+ print("Available studies:", available_studies)
1269
+ print("Available drugs:", available_drugs)
1270
+
1271
+ return available_studies, available_drugs
1272
+
1273
+ @staticmethod
1274
+ def describe_empirical_batch(
1275
+ batch: "AICMECompartmentsDataBatch",
1276
+ *,
1277
+ print_available: bool = True,
1278
+ ) -> Tuple[List[str], List[str]]:
1279
+ """Return display-ready study and substance names for a batch.
1280
+
1281
+ Parameters
1282
+ ----------
1283
+ batch:
1284
+ Empirical batch to inspect.
1285
+ print_available:
1286
+ If ``True``, print available studies and drugs to stdout.
1287
+ """
1288
+
1289
+ studies = [str(name) if name else f"study_{i}" for i, name in enumerate(batch.study_name)]
1290
+ drugs = [
1291
+ str(name) if name else f"substance_{i}" for i, name in enumerate(batch.substance_name)
1292
+ ]
1293
+
1294
+ if print_available:
1295
+ print("Available studies in selected batch:", studies)
1296
+ print("Available drugs in selected batch:", drugs)
1297
+
1298
+ return studies, drugs
1299
+
1300
+ @staticmethod
1301
+ def slice_single_substance_batch(
1302
+ batch: "AICMECompartmentsDataBatch",
1303
+ b_idx: int,
1304
+ ) -> "AICMECompartmentsDataBatch":
1305
+ """Extract one substance entry from a multi-substance batch.
1306
+
1307
+ Parameters
1308
+ ----------
1309
+ batch:
1310
+ Batch with leading batch dimension ``B``.
1311
+ b_idx:
1312
+ Substance index along ``B``.
1313
+
1314
+ Returns
1315
+ -------
1316
+ AICMECompartmentsDataBatch
1317
+ Single-substance batch with tensors sliced to ``B=1``.
1318
+ """
1319
+
1320
+ if b_idx < 0 or b_idx >= len(batch.substance_name):
1321
+ raise IndexError(
1322
+ f"Substance index {b_idx} is out of range for batch size "
1323
+ f"{len(batch.substance_name)}."
1324
+ )
1325
+
1326
+ values = []
1327
+ for field_name in batch._fields:
1328
+ value = getattr(batch, field_name)
1329
+ if isinstance(value, torch.Tensor):
1330
+ # Keep tensor rank stable by preserving a singleton leading B axis.
1331
+ values.append(value[b_idx : b_idx + 1])
1332
+ elif field_name in {
1333
+ "study_name",
1334
+ "substance_name",
1335
+ "context_subject_name",
1336
+ "target_subject_name",
1337
+ }:
1338
+ values.append([value[b_idx]])
1339
+ else:
1340
+ values.append(value)
1341
+ return batch.__class__(*values)
1342
+
1343
+ @classmethod
1344
+ def slice_single_substance_batch_by_name(
1345
+ cls,
1346
+ batch: "AICMECompartmentsDataBatch",
1347
+ substance_name: str,
1348
+ ) -> "AICMECompartmentsDataBatch":
1349
+ """Extract one substance entry by matching drug name."""
1350
+
1351
+ _, available_drugs = cls.describe_empirical_batch(batch, print_available=False)
1352
+ norm_target = cls._normalize_substance_name(substance_name)
1353
+ matches = [
1354
+ i
1355
+ for i, name in enumerate(available_drugs)
1356
+ if cls._normalize_substance_name(name) == norm_target
1357
+ ]
1358
+ if not matches:
1359
+ raise ValueError(
1360
+ f"Selected drug '{substance_name}' not found in heldout batch. "
1361
+ f"Choose from: {available_drugs}"
1362
+ )
1363
+ return cls.slice_single_substance_batch(batch, matches[0])
1364
+
1365
    def select_empirical_drug_batch(
        self,
        empirical_batches: Dict[str, List["AICMECompartmentsDataBatch"]],
        selected_drug: str,
        *,
        permutation_indexes: Optional[int | Sequence[int]] = None,
        print_selection: bool = True,
    ) -> Tuple[
        "AICMECompartmentsDataBatch | List[AICMECompartmentsDataBatch]",
        str,
        str,
    ]:
        """Select one drug from empirical batches, optionally across permutations.

        Scans datasets/batches in order and stops at the first batch that
        contains ``selected_drug`` (after name normalization); all drugs seen
        up to that point are accumulated for the error message when nothing
        matches.

        Parameters
        ----------
        empirical_batches:
            Mapping returned by :meth:`get_empirical_test_batches`.
        selected_drug:
            Drug name to match across all empirical batches.
        permutation_indexes:
            Optional permutation index or list of permutation indices within
            the selected empirical dataset's batch list. When ``None``
            (default), legacy behaviour applies and the first matching
            single-substance batch is returned. When a list/tuple is given, a
            list of single-substance batches in the requested permutation
            order is returned instead.
        print_selection:
            If ``True``, print where the match was found.

        Returns
        -------
        Tuple
            ``(batch_or_batches, selected_study, selected_drug_name)`` —
            a single batch when ``permutation_indexes`` is ``None``/int, a
            list of batches when it is a list/tuple.

        Raises
        ------
        ValueError
            If ``permutation_indexes`` is empty or contains duplicates, if
            the drug is absent from a requested permutation, or if the drug
            is not found at all.
        IndexError
            If a requested permutation index is out of range.
        """

        norm_target = self._normalize_substance_name(selected_drug)
        requested_permutations: Optional[List[int]]
        # A list/tuple request changes the return type to a list of batches.
        return_many = isinstance(permutation_indexes, (list, tuple))
        if permutation_indexes is None:
            requested_permutations = None
        elif return_many:
            if len(permutation_indexes) == 0:
                raise ValueError("'permutation_indexes' must not be empty.")
            requested_permutations = [int(idx) for idx in permutation_indexes]
            if len(set(requested_permutations)) != len(requested_permutations):
                raise ValueError("'permutation_indexes' must contain unique indices.")
        else:
            # Single int request: still resolved via the permutation path.
            requested_permutations = [int(permutation_indexes)]

        # Ordered accumulator of every drug seen, used in the final error.
        all_available_drugs: List[str] = []
        seen_drugs: set[str] = set()

        for repo_id, batch_list in empirical_batches.items():
            for batch_index, batch in enumerate(batch_list):
                _, available_drugs = self.describe_empirical_batch(batch, print_available=False)
                for drug in available_drugs:
                    if drug not in seen_drugs:
                        seen_drugs.add(drug)
                        all_available_drugs.append(drug)

                matches = [
                    i
                    for i, name in enumerate(available_drugs)
                    if self._normalize_substance_name(name) == norm_target
                ]
                if matches:
                    if requested_permutations is None:
                        # Legacy path: first matching batch wins.
                        selected_batches: List[AICMECompartmentsDataBatch] = [
                            self.slice_single_substance_batch(batch, matches[0])
                        ]
                        chosen_permutations = [batch_index]
                    else:
                        # Explicit permutations: re-resolve the drug in each
                        # requested permutation of THIS dataset.
                        selected_batches = []
                        chosen_permutations = requested_permutations
                        for permutation_index in requested_permutations:
                            if permutation_index < 0 or permutation_index >= len(batch_list):
                                raise IndexError(
                                    f"Permutation index {permutation_index} is out of range for "
                                    f"dataset '{repo_id}' with {len(batch_list)} permutations."
                                )

                            perm_batch = batch_list[permutation_index]
                            _, perm_drugs = self.describe_empirical_batch(
                                perm_batch, print_available=False
                            )
                            perm_matches = [
                                i
                                for i, name in enumerate(perm_drugs)
                                if self._normalize_substance_name(name) == norm_target
                            ]
                            if not perm_matches:
                                raise ValueError(
                                    f"Selected drug '{selected_drug}' was not found in dataset "
                                    f"'{repo_id}' at permutation index {permutation_index}."
                                )
                            selected_batches.append(
                                self.slice_single_substance_batch(perm_batch, perm_matches[0])
                            )

                    # Report study/drug names from the first selected batch.
                    studies, drugs = self.describe_empirical_batch(
                        selected_batches[0], print_available=False
                    )
                    selected_study = studies[0]
                    selected_name = drugs[0]
                    if print_selection:
                        print("Selected empirical dataset key:", repo_id)
                        if len(chosen_permutations) == 1:
                            print("Selected empirical batch index:", chosen_permutations[0])
                        else:
                            print("Selected empirical batch indexes:", chosen_permutations)
                        print("Selected study:", selected_study)
                        print("Selected drug:", selected_name)
                    if return_many:
                        return selected_batches, selected_study, selected_name
                    return selected_batches[0], selected_study, selected_name

        raise ValueError(
            f"Selected drug '{selected_drug}' was not found in empirical batches. "
            f"Choose from: {all_available_drugs}"
        )
1480
+
1481
+ def _select_strategy(self, who: str):
1482
+ """Return the observation strategy requested via ``who``.
1483
+
1484
+ Parameters
1485
+ ----------
1486
+ who:
1487
+ Either ``"target"`` or ``"context"``.
1488
+
1489
+ Returns
1490
+ -------
1491
+ ObservationStrategy
1492
+ The strategy matching the requested role.
1493
+ """
1494
+
1495
+ if who == "target":
1496
+ return self.target_strategy
1497
+ if who == "context":
1498
+ return self.context_strategy
1499
+ raise ValueError("'who' must be either 'target' or 'context'.")
1500
+
1501
+ def _select_strategies(self, who: str) -> List[object]:
1502
+ """Return strategy list for the requested role.
1503
+
1504
+ For ``who='target'`` this includes both synthetic and empirical target
1505
+ strategies so past-selection overrides remain consistent when empirical
1506
+ batches are generated from the datamodule.
1507
+ """
1508
+
1509
+ if who == "context":
1510
+ return [self.context_strategy]
1511
+ if who == "target":
1512
+ strategies: List[object] = [self.target_strategy]
1513
+ empirical_target_strategy = getattr(self, "empirical_target_strategy", None)
1514
+ if empirical_target_strategy is not None:
1515
+ strategies.append(empirical_target_strategy)
1516
+ # Keep order stable while avoiding duplicate objects.
1517
+ deduped: List[object] = []
1518
+ seen_ids: set[int] = set()
1519
+ for strategy in strategies:
1520
+ strategy_id = id(strategy)
1521
+ if strategy_id in seen_ids:
1522
+ continue
1523
+ seen_ids.add(strategy_id)
1524
+ deduped.append(strategy)
1525
+ return deduped
1526
+ raise ValueError("'who' must be either 'target' or 'context'.")
1527
+
1528
+ def fix_past_selection(self, fix_past_value: int, *, who: str = "target") -> None:
1529
+ """Force a fixed number of past observations for the selected strategy.
1530
+
1531
+ The override is only applied for strategies with ``split_past_future``
1532
+ enabled; for others the call is ignored.
1533
+ """
1534
+
1535
+ if not self._prepared:
1536
+ self.prepare_data()
1537
+
1538
+ for strategy in self._select_strategies(who):
1539
+ if hasattr(strategy, "fix_past_selection"):
1540
+ strategy.fix_past_selection(fix_past_value)
1541
+ # Reset lazy-load flag so empirical data is reloaded with new strategy settings
1542
+ self._empirical_loaded = False
1543
+
1544
+ def release_past_selection(self, *, who: str = "target") -> None:
1545
+ """Restore the default past sampling behaviour for the given strategy."""
1546
+
1547
+ if not self._prepared:
1548
+ self.prepare_data()
1549
+
1550
+ for strategy in self._select_strategies(who):
1551
+ if hasattr(strategy, "release_past_selection"):
1552
+ strategy.release_past_selection()
1553
+ # Reset lazy-load flag so empirical data is reloaded with restored strategy settings
1554
+ self._empirical_loaded = False
1555
+
1556
+ def set_shared_target_dosing(self, enable: bool = True, n_targets: int = 100) -> None:
1557
+ """Enable/disable shared target dosing across all datasets.
1558
+
1559
+ Parameters
1560
+ ----------
1561
+ enable : bool
1562
+ Whether to enable shared-target dosing.
1563
+ n_targets : int
1564
+ Number of target individuals to sample when enabled.
1565
+ """
1566
+ self.use_shared_target_dosing = enable
1567
+ self.shared_target_n_targets = n_targets
1568
+
1569
+ for ds in (
1570
+ getattr(self, "train_dataset", None),
1571
+ getattr(self, "val_dataset", None),
1572
+ getattr(self, "test_dataset", None),
1573
+ ):
1574
+ if ds is not None:
1575
+ ds.use_shared_target_dosing = enable
1576
+ ds.shared_target_n_targets = n_targets
1577
+
1578
+ def unset_shared_target_dosing(self) -> None:
1579
+ """Disable shared target dosing and restore default behaviour."""
1580
+ self.set_shared_target_dosing(False)
1581
+
1582
+ @staticmethod
1583
+ def _add_batch_dim_to_synthetic_batch(
1584
+ batch: AICMECompartmentsDataBatch,
1585
+ ) -> AICMECompartmentsDataBatch:
1586
+ """Add a leading ``B=1`` axis to tensor fields missing batch dimension."""
1587
+
1588
+ values: list = []
1589
+ for name, value in zip(batch._fields, batch):
1590
+ if not isinstance(value, torch.Tensor):
1591
+ values.append(value)
1592
+ continue
1593
+
1594
+ if name == "time_scales":
1595
+ # ``time_scales`` is often already [B, 2] while other fields are
1596
+ # emitted as [I, ...] by dataset-level generation.
1597
+ values.append(value.unsqueeze(0) if value.dim() == 1 else value)
1598
+ continue
1599
+
1600
+ values.append(value.unsqueeze(0))
1601
+
1602
+ return AICMECompartmentsDataBatch(*values)
1603
+
1604
+ def _generate_synthetic_list_with_repeated_target(
1605
+ self,
1606
+ *,
1607
+ shared_context_pack: Tuple[Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor],
1608
+ target_sim: Tensor,
1609
+ target_times: Tensor,
1610
+ target_dosing_amounts: Tensor,
1611
+ target_dosing_routes: Tensor,
1612
+ base_time_scales: Tensor,
1613
+ num_targets: int,
1614
+ ) -> AICMECompartmentsDataBatch:
1615
+ """Package one list element from shared context and dosing-specific targets.
1616
+
1617
+ Parameters
1618
+ ----------
1619
+ shared_context_pack:
1620
+ Context tensors generated once and reused across all list elements.
1621
+ target_sim:
1622
+ Target simulation for one dosing condition with shape ``[n_targets, T]``.
1623
+ target_times:
1624
+ Target simulation times with shape ``[n_targets, T]``.
1625
+ target_dosing_amounts:
1626
+ Target dosing amounts with shape ``[n_targets]``.
1627
+ target_dosing_routes:
1628
+ Target dosing route types with shape ``[n_targets]``.
1629
+ base_time_scales:
1630
+ Simulation-level time scales from the sampler.
1631
+ num_targets:
1632
+ Number of target individuals capacity for this synthetic sample.
1633
+ """
1634
+ (
1635
+ context_obs,
1636
+ context_obs_time,
1637
+ context_obs_mask,
1638
+ context_rem_sim,
1639
+ context_rem_sim_time,
1640
+ context_rem_sim_mask,
1641
+ dosing_amounts_ctx,
1642
+ dosing_routes_ctx,
1643
+ ) = shared_context_pack
1644
+
1645
+ target_obs_pack = self.train_dataset._safe_generate(
1646
+ self.train_dataset.target_fn,
1647
+ target_sim,
1648
+ target_times,
1649
+ time_scales=base_time_scales,
1650
+ )
1651
+ (
1652
+ target_obs,
1653
+ target_obs_time,
1654
+ target_obs_mask,
1655
+ target_rem_sim,
1656
+ target_rem_sim_time,
1657
+ target_rem_sim_mask,
1658
+ _target_time_scales,
1659
+ ) = target_obs_pack
1660
+
1661
+ # Keep the time-scale metadata aligned with the shared context payload.
1662
+ ts = base_time_scales
1663
+
1664
+ return self.train_dataset._build_padded_batch(
1665
+ # context (shared across all list elements)
1666
+ context_obs,
1667
+ context_obs_time,
1668
+ context_obs_mask,
1669
+ context_rem_sim,
1670
+ context_rem_sim_time,
1671
+ context_rem_sim_mask,
1672
+ dosing_amounts_ctx,
1673
+ dosing_routes_ctx,
1674
+ # target (specific to one repeated-dosing condition)
1675
+ target_obs,
1676
+ target_obs_time,
1677
+ target_obs_mask,
1678
+ target_rem_sim,
1679
+ target_rem_sim_time,
1680
+ target_rem_sim_mask,
1681
+ target_dosing_amounts,
1682
+ target_dosing_routes,
1683
+ # time scales
1684
+ ts=ts,
1685
+ target_capacity=num_targets,
1686
+ )
1687
+
1688
+ def prepare_full_simulation_list_with_repeated_targets(
1689
+ self,
1690
+ num_targets: int,
1691
+ batch_index: int = 0,
1692
+ num_of_different_dosages: int = 1,
1693
+ device: Optional[torch.device | str] = None,
1694
+ ) -> List["AICMECompartmentsDataBatch"]:
1695
+ """Build one shared context and ``L`` repeated-target dosing batches.
1696
+
1697
+ This helper is responsible for context creation exactly once, then
1698
+ looping over ``num_of_different_dosages`` target dosing conditions.
1699
+ Packaging is delegated to
1700
+ :meth:`_generate_synthetic_list_with_repeated_target`.
1701
+
1702
+ Parameters
1703
+ ----------
1704
+ num_targets:
1705
+ Number of target individuals per dosing condition.
1706
+ batch_index:
1707
+ Synthetic sample index used by the simulation backend.
1708
+ num_of_different_dosages:
1709
+ Number of target dosing conditions ``L``.
1710
+ device:
1711
+ Optional device where returned batches should live.
1712
+ """
1713
+
1714
+ if num_targets < 0:
1715
+ raise ValueError("num_targets must be non-negative")
1716
+ if batch_index < 0:
1717
+ raise ValueError("batch_index must be non-negative")
1718
+ if num_of_different_dosages < 0:
1719
+ raise ValueError("num_of_different_dosages must be non-negative")
1720
+
1721
+ if not self._prepared:
1722
+ self.prepare_data()
1723
+
1724
+ (
1725
+ context_sim,
1726
+ context_times,
1727
+ dosing_amounts_ctx,
1728
+ dosing_routes_ctx,
1729
+ target_simulations,
1730
+ target_times_list,
1731
+ target_dosing_amounts_list,
1732
+ target_dosing_routes_list,
1733
+ _time_points,
1734
+ time_scales,
1735
+ ) = prepare_full_simulation_list_with_repeated_targets_backend(
1736
+ self.meta_config,
1737
+ self.model_config.dosing,
1738
+ n_targets=num_targets,
1739
+ num_of_different_dosages=num_of_different_dosages,
1740
+ idx=batch_index,
1741
+ )
1742
+
1743
+ # Build context once and reuse verbatim for all list elements.
1744
+ context_obs_pack = self.train_dataset._safe_generate(
1745
+ self.train_dataset.context_fn,
1746
+ context_sim,
1747
+ context_times,
1748
+ time_scales=time_scales,
1749
+ )
1750
+ (
1751
+ context_obs,
1752
+ context_obs_time,
1753
+ context_obs_mask,
1754
+ context_rem_sim,
1755
+ context_rem_sim_time,
1756
+ context_rem_sim_mask,
1757
+ context_time_scales,
1758
+ ) = context_obs_pack
1759
+
1760
+ shared_context_pack = (
1761
+ context_obs,
1762
+ context_obs_time,
1763
+ context_obs_mask,
1764
+ context_rem_sim,
1765
+ context_rem_sim_time,
1766
+ context_rem_sim_mask,
1767
+ dosing_amounts_ctx,
1768
+ dosing_routes_ctx,
1769
+ )
1770
+
1771
+ base_time_scales = context_time_scales if context_time_scales is not None else time_scales
1772
+
1773
+ synthetic_batches: List[AICMECompartmentsDataBatch] = []
1774
+ for target_sim, target_times, target_dosing_amounts, target_dosing_routes in zip(
1775
+ target_simulations,
1776
+ target_times_list,
1777
+ target_dosing_amounts_list,
1778
+ target_dosing_routes_list,
1779
+ ):
1780
+ synthetic_batches.append(
1781
+ self._generate_synthetic_list_with_repeated_target(
1782
+ shared_context_pack=shared_context_pack,
1783
+ target_sim=target_sim,
1784
+ target_times=target_times,
1785
+ target_dosing_amounts=target_dosing_amounts,
1786
+ target_dosing_routes=target_dosing_routes,
1787
+ base_time_scales=base_time_scales,
1788
+ num_targets=num_targets,
1789
+ )
1790
+ )
1791
+
1792
+ if device is None:
1793
+ return synthetic_batches
1794
+
1795
+ return list_of_databath_to_device(synthetic_batches, device)
1796
+
1797
+ def generate_synthetic_with_repeated_target(
1798
+ self,
1799
+ num_targets: int,
1800
+ batch_index: int = 0,
1801
+ different_dosing: bool = False,
1802
+ device: Optional[torch.device | str] = None,
1803
+ ) -> List["AICMECompartmentsDataBatch"]:
1804
+ """Generate one synthetic batch list with configurable target dosing.
1805
+
1806
+ The generated sample follows the current datamodule ``data_config`` and
1807
+ observation strategies while overriding only the number of target
1808
+ individuals. The returned tensors include an explicit leading batch
1809
+ dimension ``B=1`` to match dataloader outputs.
1810
+
1811
+ Parameters
1812
+ ----------
1813
+ num_targets:
1814
+ Number of target individuals to include in the generated synthetic
1815
+ sample.
1816
+ batch_index:
1817
+ Dataset index used by the internal synthetic generator.
1818
+ different_dosing:
1819
+ If ``False`` (default), target individuals share one repeated dosing
1820
+ configuration.
1821
+ If ``True``, each target individual receives an independent dosing
1822
+ sample drawn from the same dosing distribution as context.
1823
+ device:
1824
+ Optional device where returned batches should live. When provided,
1825
+ returned batches are moved to ``device``.
1826
+
1827
+ Returns
1828
+ -------
1829
+ List[AICMECompartmentsDataBatch]
1830
+ A list containing a single synthetic databatch.
1831
+ """
1832
+
1833
+ if num_targets < 0:
1834
+ raise ValueError("num_targets must be non-negative")
1835
+ if batch_index < 0:
1836
+ raise ValueError("batch_index must be non-negative")
1837
+
1838
+ if not self._prepared:
1839
+ self.prepare_data()
1840
+
1841
+ batches = self.train_dataset._generate_item_sample_target_dosing(
1842
+ batch_index,
1843
+ n_targets=num_targets,
1844
+ different_dosing=different_dosing,
1845
+ )
1846
+ batch_list = [self._add_batch_dim_to_synthetic_batch(batch) for batch in batches]
1847
+
1848
+ if device is None:
1849
+ return batch_list
1850
+
1851
+ return list_of_databath_to_device(batch_list, device)
1852
+
1853
+ def generate_synthetic_list_of_repeated_target(
1854
+ self,
1855
+ num_targets: int,
1856
+ batch_index: int = 0,
1857
+ num_of_different_dosages: int = 1,
1858
+ device: Optional[torch.device | str] = None,
1859
+ ) -> List["AICMECompartmentsDataBatch"]:
1860
+ """Generate a list of synthetic batches sharing one context.
1861
+
1862
+ The returned list has length ``num_of_different_dosages``. Context
1863
+ fields are identical across all elements, while targets are regenerated
1864
+ per element using repeated dosing within each element.
1865
+ """
1866
+
1867
+ synthetic_batches = self.prepare_full_simulation_list_with_repeated_targets(
1868
+ num_targets=num_targets,
1869
+ batch_index=batch_index,
1870
+ num_of_different_dosages=num_of_different_dosages,
1871
+ device=device,
1872
+ )
1873
+ batch_list = [self._add_batch_dim_to_synthetic_batch(batch) for batch in synthetic_batches]
1874
+ return batch_list
sim_priors_pk/data/extra/compartment_models_vectorized.py ADDED
@@ -0,0 +1,182 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import numpy as np
2
+ import torch
3
+
4
+ def sample_individual_configs_vectorized(study_config):
5
+ """
6
+ Vectorizes the sampling of parameters for a population of individuals.
7
+
8
+ Parameters
9
+ ----------
10
+ study_config : StudyConfig
11
+ Contains the study settings and distribution parameters.
12
+
13
+ Returns
14
+ -------
15
+ config_dict : dict
16
+ Dictionary containing the vectorized parameters and time-magnitudes.
17
+ Keys:
18
+ 'k_a', 'k_e', 'V': Tensors of shape (N,)
19
+ 'k_1p', 'k_p1': Tensors of shape (N, P)
20
+ 'k_a_tmag', 'k_e_tmag', 'V_tmag': Scalars
21
+ 'k_1p_tmag', 'k_p1_tmag': Tensors of shape (P,)
22
+ 'num_peripherals': int
23
+ """
24
+ N = study_config.num_individuals
25
+ P = study_config.num_peripherals
26
+
27
+ # Sample the central parameters as tensors of shape (N,)
28
+ k_a = torch.from_numpy(np.random.lognormal(study_config.log_k_a_mean, study_config.log_k_a_std, size=N)).float()
29
+ k_e = torch.from_numpy(np.random.lognormal(study_config.log_k_e_mean, study_config.log_k_e_std, size=N)).float()
30
+ V = torch.from_numpy(np.random.lognormal(study_config.log_V_mean, study_config.log_V_std, size=N)).float()
31
+
32
+ # Sample the peripheral parameters as tensors of shape (N, P)
33
+ k_1p = []
34
+ k_p1 = []
35
+ for i in range(P):
36
+ k_1p_i = torch.from_numpy(np.random.lognormal(study_config.log_k_1p_mean[i],
37
+ study_config.log_k_1p_std[i], size=N)).float()
38
+ k_p1_i = torch.from_numpy(np.random.lognormal(study_config.log_k_p1_mean[i],
39
+ study_config.log_k_p1_std[i], size=N)).float()
40
+ k_1p.append(k_1p_i)
41
+ k_p1.append(k_p1_i)
42
+ # Stack along the peripheral dimension: shape becomes (N, P)
43
+ k_1p = torch.stack(k_1p, dim=1)
44
+ k_p1 = torch.stack(k_p1, dim=1)
45
+
46
+ # Pack time-magnitudes (assumed scalars for central parameters and lists for peripherals)
47
+ k_a_tmag = study_config.k_a_tmag # scalar
48
+ k_e_tmag = study_config.k_e_tmag # scalar
49
+ V_tmag = study_config.V_tmag # scalar
50
+ # For peripherals, we assume the study_config gives lists/arrays of length P.
51
+ k_1p_tmag = torch.tensor(study_config.k_1p_tmag).float() # shape (P,)
52
+ k_p1_tmag = torch.tensor(study_config.k_p1_tmag).float() # shape (P,)
53
+
54
+ config_dict = {
55
+ 'k_a': k_a,
56
+ 'k_e': k_e,
57
+ 'V': V,
58
+ 'k_1p': k_1p,
59
+ 'k_p1': k_p1,
60
+ 'k_a_tmag': k_a_tmag,
61
+ 'k_e_tmag': k_e_tmag,
62
+ 'V_tmag': V_tmag,
63
+ 'k_1p_tmag': k_1p_tmag,
64
+ 'k_p1_tmag': k_p1_tmag,
65
+ 'num_peripherals': P,
66
+ }
67
+ return config_dict
68
+
69
+ import torch
70
+
71
+ def compute_rates(config, t):
72
+ """
73
+ Computes the dynamic rates for all individuals at a given time t.
74
+
75
+ Parameters
76
+ ----------
77
+ config : dict
78
+ Dictionary returned by sample_individual_configs_vectorized.
79
+ t : float or torch.Tensor
80
+ Current time point.
81
+
82
+ Returns
83
+ -------
84
+ k_a, k_e, V : torch.Tensor
85
+ Tensors of shape (N,).
86
+ k_1p, k_p1 : torch.Tensor
87
+ Tensors of shape (N, P).
88
+ """
89
+ # Ensure t is a tensor
90
+ if not isinstance(t, torch.Tensor):
91
+ t = torch.tensor(t, dtype=config['k_a_tmag'].dtype, device=config['k_a_tmag'].device)
92
+
93
+ k_a = config['k_a'] * torch.exp(-config['k_a_tmag'] * t)
94
+ k_e = config['k_e'] * torch.exp(-config['k_e_tmag'] * t)
95
+ V = config['V'] * torch.exp(-config['V_tmag'] * t)
96
+
97
+ # Use broadcasting for peripheral compartments
98
+ k_1p = config['k_1p'] * torch.exp(-config['k_1p_tmag'] * t)
99
+ k_p1 = config['k_p1'] * torch.exp(-config['k_p1_tmag'] * t)
100
+
101
+ return k_a, k_e, V, k_1p, k_p1
102
+
103
+ def ode_func(t_val, y, config):
104
+ """
105
+ ODE function using vectorized rate computations.
106
+
107
+ Parameters
108
+ ----------
109
+ t_val : torch.Tensor
110
+ Current time point.
111
+ y : torch.Tensor
112
+ Current state, shape (N, M) where M = 2 + num_peripherals.
113
+ config : dict
114
+ Vectorized individual configuration dictionary.
115
+
116
+ Returns
117
+ -------
118
+ dy_dt : torch.Tensor
119
+ Time derivative of y, shape (N, M).
120
+ """
121
+ # Get the dynamic rates for all individuals at time t_val.
122
+ k_a, k_e, _, k_1p, k_p1 = compute_rates(config, t_val)
123
+ N = y.size(0)
124
+ P = config['num_peripherals']
125
+ M = 2 + P
126
+
127
+ # Build the ODE rate matrix A(t) in a vectorized fashion
128
+ A_all = torch.zeros((N, M, M), dtype=torch.float32)
129
+ A_all[:, 0, 0] = -k_a # Loss from gut
130
+ A_all[:, 1, 0] = k_a # Transfer gut -> central
131
+ A_all[:, 1, 1] = -k_e - k_1p.sum(dim=1) # Loss from central and distribution to peripherals
132
+ A_all[:, 1, 2:2+P] = k_p1 # Transfer central -> peripherals
133
+ A_all[:, 2:2+P, 1] = k_1p # Transfer peripherals -> central
134
+ # Peripheral compartments clearance:
135
+ for i in range(P):
136
+ A_all[:, 2 + i, 2 + i] = -k_p1[:, i]
137
+
138
+ # Compute dy/dt = A_all @ y for each individual.
139
+ dy_dt = torch.bmm(A_all, y.unsqueeze(-1)).squeeze(-1)
140
+ return dy_dt
141
+
142
+ def sample_study_vectorized(study_config, dosing_config, t, solver_method="rk4"):
143
+ """
144
+ Simulates the pharmacokinetic study using vectorized individual configurations.
145
+
146
+ Parameters
147
+ ----------
148
+ study_config : StudyConfig
149
+ Contains global study settings and distribution parameters.
150
+ dosing_config : DosingConfig
151
+ Contains dosing information.
152
+ t : torch.Tensor
153
+ Time points at which the simulation is evaluated.
154
+
155
+ Returns
156
+ -------
157
+ full_simulation : torch.Tensor
158
+ Concentration profiles (N, len(t)).
159
+ full_times : torch.Tensor
160
+ Time points replicated for each individual.
161
+ """
162
+ from torchdiffeq import odeint
163
+
164
+ # Get the vectorized configuration dictionary
165
+ config = sample_individual_configs_vectorized(study_config)
166
+ N = study_config.num_individuals
167
+ P = study_config.num_peripherals
168
+ M = 2 + P
169
+
170
+ # Initial conditions: dose in the gut (first compartment), zeros elsewhere.
171
+ y0 = torch.zeros((N, M), dtype=torch.float32)
172
+ y0[:, 0] = dosing_config.dose
173
+
174
+ def wrapped_ode(t_val, y):
175
+ return ode_func(t_val, y, config)
176
+
177
+ # Solve the ODE system for all individuals in batch
178
+ y = odeint(wrapped_ode, y0, t, method=solver_method)
179
+ # Extract central compartment (index 1) for each individual
180
+ full_simulation = y[:, :, 1].T
181
+ full_times = t.unsqueeze(0).repeat(N, 1)
182
+ return full_simulation, full_times
sim_priors_pk/data/extra/kernels.py ADDED
@@ -0,0 +1,28 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import torch
2
+ import gpytorch
3
+
4
+ def create_kernel(config):
5
+ kernel_params = config.kernel_params
6
+ if 'type' not in kernel_params:
7
+ raise ValueError("Kernel type must be specified in kernel_params")
8
+ if kernel_params['type'] == 'RBF':
9
+ kernel = gpytorch.kernels.RBFKernel(ard_num_dims=config.input_dim, requires_grad=False)
10
+ kernel_params_ = kernel_params.get('params', {})
11
+ kernel_length_scale = kernel_params_["raw_lengthscale"]
12
+ kernel_length_scale = torch.tensor([kernel_length_scale] * config.input_dim)
13
+ kernel.initialize(raw_lengthscale=kernel_length_scale)
14
+ return kernel
15
+ raise ValueError(f"Unsupported kernel type: {kernel_params['type']}")
16
+
17
+
18
+ def create_kernel_mix(kernel_params,input_dim=1):
19
+ if 'type' not in kernel_params:
20
+ raise ValueError("Kernel type must be specified in kernel_params")
21
+ if kernel_params['type'] == 'RBF':
22
+ kernel = gpytorch.kernels.RBFKernel(ard_num_dims=input_dim, requires_grad=False)
23
+ kernel_params_ = kernel_params.get('params', {})
24
+ kernel_length_scale = kernel_params_["raw_lengthscale"]
25
+ kernel_length_scale = torch.tensor([kernel_length_scale] * input_dim)
26
+ kernel.initialize(raw_lengthscale=kernel_length_scale)
27
+ return kernel
28
+ raise ValueError(f"Unsupported kernel type: {kernel_params['type']}")
sim_priors_pk/hub_runtime/README.md ADDED
@@ -0,0 +1,187 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Hub Runtime Bundle
2
+
3
+ This directory contains the parallel Hugging Face export path for
4
+ consumer-facing model bundles.
5
+
6
+ The existing training export remains unchanged:
7
+
8
+ - native export: `BasicLightningExperiment._push_model_to_hub(...)`
9
+ - runtime export: `push_loaded_model_runtime_bundle(...)`
10
+
11
+ The runtime export is intended for users who should be able to load a model
12
+ from the Hugging Face Hub through `transformers` without installing the local
13
+ `sim_priors_pk` package.
14
+
15
+ ## Important Constraint
16
+
17
+ The consumer entrypoint is `transformers`, but `transformers` alone is **not**
18
+ enough today.
19
+
20
+ These runtime bundles still execute PyTorch-based custom code and reconstruct
21
+ the internal PK architecture, so the user needs the runtime Python
22
+ dependencies, but not a local checkout of this repository.
23
+
24
+ Reliable consumer install:
25
+
26
+ ```bash
27
+ pip install torch transformers huggingface_hub lightning datasets pandas torchtyping gpytorch pot torchdiffeq torchsde ruamel.yaml pyyaml
28
+ ```
29
+
30
+ What the consumer does **not** need:
31
+
32
+ - `pip install sim_priors_pk`
33
+ - a local clone of this repository
34
+ - access to the training checkpoint directory
35
+
36
+ ## Consumer Workflow
37
+
38
+ Use the runtime repo, not the native training-artifact repo.
39
+
40
+ ```python
41
+ from transformers import AutoModel
42
+
43
+ model = AutoModel.from_pretrained(
44
+ "your-org/your-model-runtime",
45
+ trust_remote_code=True,
46
+ )
47
+ ```
48
+
49
+ Then call the stable runtime task API:
50
+
51
+ ```python
52
+ outputs = model.run_task(
53
+ task="generate", # or "predict"
54
+ studies=studies, # one StudyJSON or a list[StudyJSON]
55
+ num_samples=8,
56
+ )
57
+ ```
58
+
59
+ The return payload is:
60
+
61
+ ```python
62
+ {
63
+ "task": "generate",
64
+ "io_schema_version": "studyjson-v1",
65
+ "model_info": {...},
66
+ "results": [
67
+ {
68
+ "input_index": 0,
69
+ "samples": [study_json_0, study_json_1, ...],
70
+ }
71
+ ],
72
+ }
73
+ ```
74
+
75
+ ## Generate Example
76
+
77
+ ```python
78
+ from transformers import AutoModel
79
+
80
+ model = AutoModel.from_pretrained(
81
+ "your-org/your-model-runtime",
82
+ trust_remote_code=True,
83
+ )
84
+
85
+ studies = [
86
+ {
87
+ "context": [
88
+ {
89
+ "name_id": "ctx_0",
90
+ "observations": [0.2, 0.5, 0.3],
91
+ "observation_times": [0.5, 1.0, 2.0],
92
+ "dosing": [1.0],
93
+ "dosing_type": ["oral"],
94
+ "dosing_times": [0.0],
95
+ "dosing_name": ["oral"],
96
+ }
97
+ ],
98
+ "target": [],
99
+ "meta_data": {
100
+ "study_name": "demo",
101
+ "substance_name": "drug_x",
102
+ },
103
+ }
104
+ ]
105
+
106
+ outputs = model.run_task(
107
+ task="generate",
108
+ studies=studies,
109
+ num_samples=4,
110
+ )
111
+
112
+ generated_studies = outputs["results"][0]["samples"]
113
+ ```
114
+
115
+ ## Predict Example
116
+
117
+ ```python
118
+ from transformers import AutoModel
119
+
120
+ model = AutoModel.from_pretrained(
121
+ "your-org/your-model-runtime",
122
+ trust_remote_code=True,
123
+ )
124
+
125
+ predict_studies = [
126
+ {
127
+ "context": [
128
+ {
129
+ "name_id": "ctx_0",
130
+ "observations": [0.2, 0.5, 0.3],
131
+ "observation_times": [0.5, 1.0, 2.0],
132
+ "dosing": [1.0],
133
+ "dosing_type": ["oral"],
134
+ "dosing_times": [0.0],
135
+ "dosing_name": ["oral"],
136
+ }
137
+ ],
138
+ "target": [
139
+ {
140
+ "name_id": "tgt_0",
141
+ "observations": [0.25, 0.31],
142
+ "observation_times": [0.5, 1.0],
143
+ "remaining": [0.0, 0.0, 0.0],
144
+ "remaining_times": [2.0, 4.0, 8.0],
145
+ "dosing": [1.0],
146
+ "dosing_type": ["oral"],
147
+ "dosing_times": [0.0],
148
+ "dosing_name": ["oral"],
149
+ }
150
+ ],
151
+ "meta_data": {
152
+ "study_name": "demo",
153
+ "substance_name": "drug_x",
154
+ },
155
+ }
156
+ ]
157
+
158
+ outputs = model.run_task(
159
+ task="predict",
160
+ studies=predict_studies,
161
+ num_samples=4,
162
+ )
163
+
164
+ prediction_samples = outputs["results"][0]["samples"]
165
+ ```
166
+
167
+ ## Producer Workflow
168
+
169
+ To publish a runtime repo from a locally loaded experiment:
170
+
171
+ ```python
172
+ from sim_priors_pk.hub_runtime import push_loaded_model_runtime_bundle
173
+
174
+ runtime_repo_id = push_loaded_model_runtime_bundle(
175
+ experiment=experiment,
176
+ model_card_path=["hf_model_cards", "AICME-PK_Readme.md"],
177
+ )
178
+ ```
179
+
180
+ By default this creates a separate repo:
181
+
182
+ ```text
183
+ <namespace>/<hf_model_name>-runtime
184
+ ```
185
+
186
+ That keeps the native training artifact export and the consumer runtime export
187
+ separate.
sim_priors_pk/hub_runtime/__init__.py ADDED
@@ -0,0 +1,19 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Public helpers for the parallel Hugging Face runtime bundle path."""
2
+
3
+ from sim_priors_pk.hub_runtime.configuration_sim_priors_pk import PKHubConfig
4
+ from sim_priors_pk.hub_runtime.modeling_sim_priors_pk import PKHubModel
5
+ from sim_priors_pk.hub_runtime.runtime_bundle import (
6
+ RuntimeBundleArtifacts,
7
+ build_runtime_bundle_dir,
8
+ default_runtime_repo_id,
9
+ push_loaded_model_runtime_bundle,
10
+ )
11
+
12
+ __all__ = [
13
+ "PKHubConfig",
14
+ "PKHubModel",
15
+ "RuntimeBundleArtifacts",
16
+ "build_runtime_bundle_dir",
17
+ "default_runtime_repo_id",
18
+ "push_loaded_model_runtime_bundle",
19
+ ]
sim_priors_pk/hub_runtime/configuration_sim_priors_pk.py ADDED
@@ -0,0 +1,42 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Hugging Face configuration for self-contained PK runtime bundles."""
2
+
3
+ from __future__ import annotations
4
+
5
+ from typing import Any, Dict, List, Optional
6
+
7
+ from transformers import PretrainedConfig
8
+
9
+ from sim_priors_pk.hub_runtime.runtime_contract import STUDY_JSON_IO_VERSION
10
+
11
+
12
+ class PKHubConfig(PretrainedConfig):
13
+ """Public Hub config describing a consumer-facing PK runtime bundle."""
14
+
15
+ model_type = "sim_priors_pk"
16
+
17
+ def __init__(
18
+ self,
19
+ architecture_name: Optional[str] = None,
20
+ experiment_type: str = "nodepk",
21
+ experiment_config: Optional[Dict[str, Any]] = None,
22
+ builder_config: Optional[Dict[str, Any]] = None,
23
+ supported_tasks: Optional[List[str]] = None,
24
+ default_task: Optional[str] = None,
25
+ io_schema_version: str = STUDY_JSON_IO_VERSION,
26
+ original_repo_id: Optional[str] = None,
27
+ runtime_repo_id: Optional[str] = None,
28
+ **kwargs,
29
+ ) -> None:
30
+ super().__init__(**kwargs)
31
+ self.architecture_name = architecture_name
32
+ self.experiment_type = experiment_type
33
+ self.experiment_config = dict(experiment_config or {})
34
+ self.builder_config = dict(builder_config or {})
35
+ self.supported_tasks = list(supported_tasks or [])
36
+ self.default_task = default_task or (self.supported_tasks[0] if self.supported_tasks else None)
37
+ self.io_schema_version = io_schema_version
38
+ self.original_repo_id = original_repo_id
39
+ self.runtime_repo_id = runtime_repo_id
40
+
41
+
42
+ __all__ = ["PKHubConfig"]
sim_priors_pk/hub_runtime/modeling_sim_priors_pk.py ADDED
@@ -0,0 +1,123 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Hugging Face AutoModel wrapper for consumer-facing PK runtime bundles."""
2
+
3
+ from __future__ import annotations
4
+
5
+ from typing import Any, Dict, Optional, Sequence, Union
6
+
7
+ import torch
8
+ from transformers import PreTrainedModel
9
+
10
+ from sim_priors_pk.data.data_empirical.json_schema import StudyJSON
11
+ from sim_priors_pk.hub_runtime.configuration_sim_priors_pk import PKHubConfig
12
+ from sim_priors_pk.hub_runtime.runtime_contract import (
13
+ RuntimeBuilderConfig,
14
+ build_batch_from_studies,
15
+ infer_supported_tasks,
16
+ instantiate_backbone_from_hub_config,
17
+ normalize_studies_input,
18
+ split_runtime_samples,
19
+ validate_studies_for_task,
20
+ )
21
+ from sim_priors_pk.models.amortized_inference.generative_pk import (
22
+ NewGenerativeMixin,
23
+ NewPredictiveMixin,
24
+ )
25
+
26
+
27
+ class PKHubModel(PreTrainedModel):
28
+ """Thin wrapper exposing a stable StudyJSON runtime API on top of PK models."""
29
+
30
+ config_class = PKHubConfig
31
+ base_model_prefix = "backbone"
32
+
33
+ def __init__(self, config: PKHubConfig, backbone: Optional[torch.nn.Module] = None) -> None:
34
+ super().__init__(config)
35
+ self.backbone = backbone if backbone is not None else instantiate_backbone_from_hub_config(config)
36
+ self.backbone.eval()
37
+
38
+ def forward(self, *args, **kwargs):
39
+ """Delegate raw forward calls to the wrapped PK backbone."""
40
+
41
+ return self.backbone(*args, **kwargs)
42
+
43
+ @property
44
+ def supported_tasks(self) -> Sequence[str]:
45
+ """Tasks supported by this runtime model."""
46
+
47
+ return tuple(getattr(self.config, "supported_tasks", []) or infer_supported_tasks(self.backbone))
48
+
49
+ @torch.inference_mode()
50
+ def run_task(
51
+ self,
52
+ *,
53
+ task: str,
54
+ studies: Union[StudyJSON, Sequence[StudyJSON]],
55
+ num_samples: int = 1,
56
+ **kwargs: Any,
57
+ ) -> Dict[str, Any]:
58
+ """Run the public StudyJSON inference contract for the requested task."""
59
+
60
+ supported_tasks = list(self.supported_tasks)
61
+ if task not in supported_tasks:
62
+ raise ValueError(
63
+ f"Unsupported task {task!r}. Supported tasks: {supported_tasks or 'none'}."
64
+ )
65
+ if int(num_samples) < 1:
66
+ raise ValueError("num_samples must be >= 1.")
67
+
68
+ canonical_studies = normalize_studies_input(studies)
69
+ builder_config = RuntimeBuilderConfig.from_dict(self.config.builder_config)
70
+ validate_studies_for_task(canonical_studies, task=task, builder_config=builder_config)
71
+
72
+ experiment_config_payload = getattr(self.config, "experiment_config", {})
73
+ meta_dosing_payload = experiment_config_payload.get("dosing", {})
74
+ batch = build_batch_from_studies(
75
+ canonical_studies,
76
+ builder_config=builder_config,
77
+ meta_dosing=self.backbone.meta_dosing.__class__(**meta_dosing_payload)
78
+ if meta_dosing_payload
79
+ else self.backbone.meta_dosing,
80
+ )
81
+ batch = batch.to(self.device)
82
+
83
+ if task == "generate":
84
+ if not isinstance(self.backbone, NewGenerativeMixin):
85
+ raise ValueError(f"Backbone {type(self.backbone).__name__} does not support generate.")
86
+ output_studies = self.backbone.sample_new_individuals_to_studyjson(
87
+ batch,
88
+ sample_size=int(num_samples),
89
+ num_steps=kwargs.get("num_steps"),
90
+ )
91
+ elif task == "predict":
92
+ if not isinstance(self.backbone, NewPredictiveMixin):
93
+ raise ValueError(f"Backbone {type(self.backbone).__name__} does not support predict.")
94
+ output_studies = self.backbone.sample_individual_prediction_from_batch_list_to_studyjson(
95
+ [batch],
96
+ sample_size=int(num_samples),
97
+ )[0]
98
+ else:
99
+ raise ValueError(f"Unsupported task {task!r}.")
100
+
101
+ results = [
102
+ {
103
+ "input_index": index,
104
+ "samples": split_runtime_samples(task, study),
105
+ }
106
+ for index, study in enumerate(output_studies)
107
+ ]
108
+
109
+ return {
110
+ "task": task,
111
+ "io_schema_version": self.config.io_schema_version,
112
+ "model_info": {
113
+ "architecture_name": self.config.architecture_name,
114
+ "experiment_type": self.config.experiment_type,
115
+ "supported_tasks": supported_tasks,
116
+ "runtime_repo_id": self.config.runtime_repo_id,
117
+ "original_repo_id": self.config.original_repo_id,
118
+ },
119
+ "results": results,
120
+ }
121
+
122
+
123
+ __all__ = ["PKHubModel"]
sim_priors_pk/hub_runtime/runtime_bundle.py ADDED
@@ -0,0 +1,269 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Manual export path for consumer-facing Hugging Face runtime bundles."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import re
6
+ import shutil
7
+ from dataclasses import dataclass
8
+ from pathlib import Path
9
+ from tempfile import TemporaryDirectory
10
+ from typing import Optional, Sequence
11
+
12
+ import torch
13
+ from huggingface_hub import HfApi, create_repo
14
+
15
+ from sim_priors_pk import config_dir, project_dir
16
+ from sim_priors_pk.hub_runtime.configuration_sim_priors_pk import PKHubConfig
17
+ from sim_priors_pk.hub_runtime.modeling_sim_priors_pk import PKHubModel
18
+ from sim_priors_pk.hub_runtime.runtime_contract import (
19
+ build_runtime_config_payload,
20
+ resolve_model_card_text,
21
+ runtime_readme_text,
22
+ )
23
+
24
# Filenames for the remote-code entrypoints copied to the bundle root so that
# transformers' auto_map can import them (see build_runtime_bundle_dir).
ROOT_CONFIGURATION_FILENAME = "configuration_sim_priors_pk.py"
ROOT_MODELING_FILENAME = "modeling_sim_priors_pk.py"
# Matches Hugging Face access tokens ("hf_" + 20 or more alphanumerics).
_HF_TOKEN_PATTERN = re.compile(r"hf_[A-Za-z0-9]{20,}")
# Match quoted assignments to COMET_API_KEY / HF_KEYS so their values can be
# redacted in place while keeping the assignment syntactically valid.
_COMET_KEY_ASSIGNMENT_PATTERN = re.compile(r"(COMET_API_KEY\s*=\s*)(['\"]).*?\2")
_HF_KEY_ASSIGNMENT_PATTERN = re.compile(r"(HF_KEYS\s*=\s*)(['\"]).*?\2")
29
+
30
+
31
@dataclass
class RuntimeBundleArtifacts:
    """Return metadata for a staged runtime bundle."""

    # Local directory where the bundle was staged.
    bundle_dir: Path
    # Hub repo id the bundle is (or will be) pushed to.
    runtime_repo_id: str
    # Legacy/native repo id this bundle was derived from, if known.
    original_repo_id: Optional[str]
    # Path to the README.md written into the bundle.
    readme_path: Path
39
+
40
+
41
def default_runtime_repo_id(experiment, *, suffix: str = "-runtime") -> str:
    """Resolve the default runtime bundle repo id for a loaded experiment.

    The id is ``<hub-user>/<hf_model_name><suffix>``, where the user name is
    looked up from the experiment's Hugging Face token.

    Raises:
        RuntimeError: if the experiment config or the Hugging Face token is missing.
    """

    exp_config = getattr(experiment, "exp_config", None)
    token = getattr(experiment, "hf_token", None)
    if exp_config is None:
        raise RuntimeError("Experiment config is not loaded.")
    if token is None:
        raise RuntimeError(
            "No Hugging Face token available. Set hugging_face_token in the config or KEYS.txt."
        )

    account_name = HfApi().whoami(token=token)["name"]
    return f"{account_name}/{exp_config.hf_model_name}{suffix}"
53
+
54
+
55
+ def _default_original_repo_id(experiment) -> Optional[str]:
56
+ """Infer the legacy/native Hub repo id if enough metadata is available."""
57
+
58
+ if getattr(experiment, "exp_config", None) is None:
59
+ return None
60
+ if getattr(experiment, "hf_token", None) is None:
61
+ return None
62
+ user = HfApi().whoami(token=experiment.hf_token)["name"]
63
+ return f"{user}/{experiment.exp_config.hf_model_name}"
64
+
65
+
66
+ def _validate_loaded_experiment(experiment) -> None:
67
+ """Ensure the loaded experiment has the minimum state needed for manual export."""
68
+
69
+ if getattr(experiment, "model", None) is None:
70
+ raise RuntimeError("Experiment model is not loaded.")
71
+ if getattr(experiment, "exp_config", None) is None:
72
+ raise RuntimeError("Experiment config is not loaded.")
73
+ if getattr(experiment, "experiment_dir", None) is None:
74
+ raise RuntimeError("Experiment directory is required before pushing.")
75
+ if getattr(experiment, "hf_token", None) is None:
76
+ raise RuntimeError(
77
+ "No Hugging Face token available. Set hugging_face_token in the config or KEYS.txt."
78
+ )
79
+
80
+
81
def _copy_runtime_support_files(bundle_dir: Path) -> None:
    """Copy the local package and root remote-code entrypoints into the bundle.

    After copying, secrets are scrubbed from text files and the bundle is
    re-scanned to guarantee no token-like strings remain.
    """

    package_src = project_dir / "sim_priors_pk"
    package_dst = bundle_dir / "sim_priors_pk"
    # Mirror the whole package so the bundle is importable stand-alone; skip bytecode caches.
    shutil.copytree(package_src, package_dst, dirs_exist_ok=True, ignore=shutil.ignore_patterns("__pycache__"))

    # The auto_map entries in the Hub config point at these root-level files.
    root_config_src = package_src / "hub_runtime" / ROOT_CONFIGURATION_FILENAME
    root_modeling_src = package_src / "hub_runtime" / ROOT_MODELING_FILENAME
    shutil.copy2(root_config_src, bundle_dir / ROOT_CONFIGURATION_FILENAME)
    shutil.copy2(root_modeling_src, bundle_dir / ROOT_MODELING_FILENAME)

    # Optional project-level extras; silently skipped when absent.
    for extra_name in ("requirements.txt", "LICENSE"):
        extra_src = project_dir / extra_name
        if extra_src.is_file():
            shutil.copy2(extra_src, bundle_dir / extra_name)

    # Redact secrets in copied text files, then fail hard if any HF token survived.
    _scrub_runtime_bundle_secrets(bundle_dir)
    _validate_no_hf_secrets(bundle_dir)
100
+
101
+
102
def _scrub_runtime_bundle_secrets(bundle_dir: Path) -> None:
    """Remove token-like secrets from copied source files before Hub upload.

    Rewrites HF tokens, COMET_API_KEY and HF_KEYS assignments in place, and
    replaces ``sim_priors_pk/utils/__init__.py`` wholesale with a redacted stub.
    """

    candidate_files = [
        *bundle_dir.rglob("*.py"),
        *bundle_dir.rglob("*.md"),
        *bundle_dir.rglob("*.txt"),
        *bundle_dir.rglob("*.json"),
    ]
    for path in candidate_files:
        try:
            text = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            # Binary-ish file with a text suffix; leave untouched.
            continue

        updated = text
        updated = _HF_TOKEN_PATTERN.sub("hf_REDACTED", updated)
        # \1 keeps the assignment prefix, \2 re-uses the original quote character.
        updated = _COMET_KEY_ASSIGNMENT_PATTERN.sub(r"\1\2REDACTED\2", updated)
        updated = _HF_KEY_ASSIGNMENT_PATTERN.sub(r"\1\2REDACTED\2", updated)

        # utils/__init__.py carries machine-specific paths and keys; replace it
        # entirely with an inert stub rather than trying to scrub line by line.
        if path.as_posix().endswith("sim_priors_pk/utils/__init__.py"):
            updated = (
                "PASCAL_BASE_DIR = ''\n"
                "NERSC_BASE_DIR = ''\n"
                "NERSC_EXPERIMENT_DIR = ''\n"
                "COMET_API_KEY = 'REDACTED'\n"
                "HF_KEYS = 'REDACTED'\n"
                "WORKSPACE = ''\n"
                "PROJECT = ''\n"
            )

        # Only touch files that actually changed to preserve mtimes elsewhere.
        if updated != text:
            path.write_text(updated, encoding="utf-8")
135
+
136
+
137
def _validate_no_hf_secrets(bundle_dir: Path) -> None:
    """Fail fast if token-like Hugging Face secrets remain after scrubbing.

    Raises:
        RuntimeError: listing the offending files (relative to ``bundle_dir``)
            when any scannable text file still matches the HF token pattern.
    """

    scannable_suffixes = {".py", ".md", ".txt", ".json"}
    offenders: list[str] = []
    for candidate in bundle_dir.rglob("*"):
        if not candidate.is_file() or candidate.suffix not in scannable_suffixes:
            continue
        try:
            content = candidate.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            # Undecodable file; the scrubber skipped it too.
            continue
        if _HF_TOKEN_PATTERN.search(content):
            offenders.append(str(candidate.relative_to(bundle_dir)))

    if offenders:
        raise RuntimeError(
            "Refusing to upload runtime bundle because token-like Hugging Face secrets "
            f"remain after scrubbing: {offenders}"
        )
158
+
159
+
160
def build_runtime_bundle_dir(
    *,
    experiment,
    bundle_dir: Path,
    model_card_path: Optional[Sequence[str]] = None,
    hf_repo_id: Optional[str] = None,
    original_repo_id: Optional[str] = None,
) -> RuntimeBundleArtifacts:
    """Stage a self-contained runtime bundle in ``bundle_dir`` without uploading it.

    Args:
        experiment: Loaded experiment holding ``model``, ``exp_config``,
            ``experiment_dir`` and ``hf_token``.
        bundle_dir: Target directory; created if missing.
        model_card_path: Path segments (under ``config_dir``) to the base model
            card; falls back to the config's ``hf_model_card_path``.
        hf_repo_id: Runtime repo id override; resolved from the Hub user otherwise.
        original_repo_id: Native repo id override; inferred when possible.

    Returns:
        RuntimeBundleArtifacts describing the staged bundle.
    """

    _validate_loaded_experiment(experiment)
    bundle_dir.mkdir(parents=True, exist_ok=True)

    runtime_repo_id = hf_repo_id or default_runtime_repo_id(experiment)
    native_repo_id = original_repo_id or _default_original_repo_id(experiment)

    # Explicit argument wins over the config's card path; both are path segments.
    normalized_model_card_path = tuple(
        model_card_path
        if model_card_path is not None
        else getattr(experiment.exp_config, "hf_model_card_path", ("hf_model_cards", "README.md"))
    )
    local_model_card_path = Path(config_dir).joinpath(*normalized_model_card_path)
    base_model_card = resolve_model_card_text(local_model_card_path)

    runtime_payload = build_runtime_config_payload(
        backbone=experiment.model,
        exp_config=experiment.exp_config,
        original_repo_id=native_repo_id,
        runtime_repo_id=runtime_repo_id,
    )
    # auto_map lets transformers load the remote-code classes from the bundle root
    # (ROOT_*_FILENAME minus the ".py" suffix).
    runtime_config = PKHubConfig(
        **runtime_payload,
        auto_map={
            "AutoConfig": f"{ROOT_CONFIGURATION_FILENAME[:-3]}.PKHubConfig",
            "AutoModel": f"{ROOT_MODELING_FILENAME[:-3]}.PKHubModel",
        },
        architectures=["PKHubModel"],
    )

    # Serialize weights on CPU so the bundle loads on machines without GPUs.
    runtime_model = PKHubModel(runtime_config, backbone=experiment.model)
    state_dict = {name: tensor.detach().cpu() for name, tensor in runtime_model.state_dict().items()}
    torch.save(state_dict, bundle_dir / "pytorch_model.bin")
    runtime_config.save_pretrained(str(bundle_dir))

    _copy_runtime_support_files(bundle_dir)

    readme_text = runtime_readme_text(
        base_model_card=base_model_card,
        runtime_repo_id=runtime_repo_id,
        original_repo_id=native_repo_id,
        supported_tasks=runtime_config.supported_tasks,
        default_task=runtime_config.default_task,
    )
    readme_path = bundle_dir / "README.md"
    readme_path.write_text(readme_text, encoding="utf-8")

    return RuntimeBundleArtifacts(
        bundle_dir=bundle_dir,
        runtime_repo_id=runtime_repo_id,
        original_repo_id=native_repo_id,
        readme_path=readme_path,
    )
222
+
223
+
224
def push_loaded_model_runtime_bundle(
    experiment,
    model_card_path: Optional[Sequence[str]] = None,
    hf_repo_id: Optional[str] = None,
    alias_name: str = "runtime_bundle_hf",
    commit_message: str = "manual runtime bundle push",
    *,
    original_repo_id: Optional[str] = None,
    exist_ok: bool = True,
) -> str:
    """Build and upload the consumer-facing runtime bundle for a loaded experiment.

    The bundle is staged in a temporary directory under
    ``<experiment_dir>/<alias_name>`` and uploaded as a single commit.

    Returns:
        The runtime repo id the bundle was pushed to.
    """

    _validate_loaded_experiment(experiment)
    runtime_repo_id = hf_repo_id or default_runtime_repo_id(experiment)
    # Create the target repo up front so upload_folder has somewhere to commit.
    create_repo(runtime_repo_id, exist_ok=exist_ok, token=experiment.hf_token)

    bundle_root = Path(experiment.experiment_dir) / alias_name
    bundle_root.mkdir(parents=True, exist_ok=True)

    # Stage inside a TemporaryDirectory so partial bundles are cleaned up on error.
    with TemporaryDirectory(dir=str(bundle_root), prefix="hf_runtime_bundle_") as temp_dir:
        staged_dir = Path(temp_dir)
        build_runtime_bundle_dir(
            experiment=experiment,
            bundle_dir=staged_dir,
            model_card_path=model_card_path,
            hf_repo_id=runtime_repo_id,
            original_repo_id=original_repo_id,
        )

        api = HfApi(token=experiment.hf_token)
        api.upload_folder(
            folder_path=str(staged_dir),
            repo_id=runtime_repo_id,
            commit_message=commit_message,
            token=experiment.hf_token,
        )

    return runtime_repo_id
262
+
263
+
264
# Public surface of the manual runtime-bundle export module.
__all__ = [
    "RuntimeBundleArtifacts",
    "build_runtime_bundle_dir",
    "default_runtime_repo_id",
    "push_loaded_model_runtime_bundle",
]
sim_priors_pk/hub_runtime/runtime_contract.py ADDED
@@ -0,0 +1,662 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Shared runtime-contract helpers for consumer-facing Hub bundles.
2
+
3
+ This module is imported both by the local exporter and by the copied package
4
+ inside the generated Hugging Face runtime bundle. Keep dependencies limited to
5
+ modules that are already required for model inference.
6
+ """
7
+
8
+ from __future__ import annotations
9
+
10
+ from copy import deepcopy
11
+ from dataclasses import asdict, dataclass
12
+ from pathlib import Path
13
+ from typing import Any, Dict, List, Mapping, Optional, Sequence, Union, get_args, get_origin
14
+
15
+ import torch
16
+ from transformers import PretrainedConfig
17
+
18
+ from sim_priors_pk.config_classes.data_config import (
19
+ MetaDosingConfig,
20
+ MetaStudyConfig,
21
+ MixDataConfig,
22
+ ObservationsConfig,
23
+ SimpleMetaStudyConfig,
24
+ )
25
+ from sim_priors_pk.config_classes.diffusion_pk_config import DiffusionPKExperimentConfig
26
+ from sim_priors_pk.config_classes.flow_pk_config import FlowPKExperimentConfig, VectorFieldPKConfig
27
+ from sim_priors_pk.config_classes.node_pk_config import (
28
+ EncoderDecoderNetworkConfig,
29
+ NodePKExperimentConfig,
30
+ )
31
+ from sim_priors_pk.config_classes.source_process_config import SourceProcessConfig
32
+ from sim_priors_pk.config_classes.training_config import TrainingConfig
33
+ from sim_priors_pk.data.data_empirical.builder import EmpiricalBatchConfig, JSON2AICMEBuilder
34
+ from sim_priors_pk.data.data_empirical.json_schema import IndividualJSON, StudyJSON, canonicalize_study
35
+ from sim_priors_pk.data.data_generation.observations_classes import ObservationStrategyFactory
36
+ from sim_priors_pk.models import get_model_class
37
+ from sim_priors_pk.models.amortized_inference.generative_pk import (
38
+ NewGenerativeMixin,
39
+ NewPredictiveMixin,
40
+ )
41
+
42
# Backbone class names the v1 runtime bundle knows how to wrap; enforced by
# validate_runtime_architecture before export.
SUPPORTED_RUNTIME_ARCHITECTURES = {
    "AICMEPK",
    "ContextVAEPK",
    "FlowPK",
    "PredictionPK",
}
# Version tag of the StudyJSON input/output contract stored in the Hub config.
STUDY_JSON_IO_VERSION = "studyjson-v1"
+
50
+
51
@dataclass
class RuntimeBuilderConfig:
    """Fixed builder capacities serialized into the Hub runtime config."""

    max_context_individuals: int
    max_target_individuals: int
    max_context_observations: int
    max_target_observations: int
    max_context_remaining: int
    max_target_remaining: int

    def to_dict(self) -> Dict[str, int]:
        """Return a JSON-serializable representation."""

        return asdict(self)

    @classmethod
    def from_dict(cls, payload: Mapping[str, Any]) -> "RuntimeBuilderConfig":
        """Instantiate the builder capacities from serialized config payload.

        Raises:
            KeyError: if any capacity field is missing from ``payload``.
        """

        capacity_fields = (
            "max_context_individuals",
            "max_target_individuals",
            "max_context_observations",
            "max_target_observations",
            "max_context_remaining",
            "max_target_remaining",
        )
        return cls(**{name: int(payload[name]) for name in capacity_fields})

    def to_empirical_batch_config(self, *, max_databatch_size: int) -> EmpiricalBatchConfig:
        """Translate runtime capacities to the builder used by StudyJSON IO."""

        # The undirected caps are the per-side maxima; context/target caps pass through.
        individual_cap = max(self.max_context_individuals, self.max_target_individuals)
        observation_cap = max(self.max_context_observations, self.max_target_observations)
        remaining_cap = max(self.max_context_remaining, self.max_target_remaining)
        return EmpiricalBatchConfig(
            max_databatch_size=int(max_databatch_size),
            max_individuals=individual_cap,
            max_observations=observation_cap,
            max_remaining=remaining_cap,
            max_context_individuals=self.max_context_individuals,
            max_target_individuals=self.max_target_individuals,
            max_context_observations=self.max_context_observations,
            max_target_observations=self.max_target_observations,
            max_context_remaining=self.max_context_remaining,
            max_target_remaining=self.max_target_remaining,
        )
+ )
95
+
96
+
97
+ def _coerce_annotation(annotation: Any, value: Any) -> Any:
98
+ """Best-effort coercion of JSON-loaded values into dataclass field types."""
99
+
100
+ if value is None:
101
+ return None
102
+
103
+ origin = get_origin(annotation)
104
+ args = get_args(annotation)
105
+
106
+ if origin is Union:
107
+ non_none = [arg for arg in args if arg is not type(None)]
108
+ for candidate in non_none:
109
+ if candidate in (dict, Dict, Any, Mapping):
110
+ continue
111
+ try:
112
+ return _coerce_annotation(candidate, value)
113
+ except Exception:
114
+ continue
115
+ return value
116
+
117
+ if origin in (list, List, Sequence):
118
+ (inner_type,) = args if args else (Any,)
119
+ return [_coerce_annotation(inner_type, item) for item in value]
120
+
121
+ if origin in (tuple,):
122
+ if not args:
123
+ return tuple(value)
124
+ if len(args) == 2 and args[1] is Ellipsis:
125
+ return tuple(_coerce_annotation(args[0], item) for item in value)
126
+ return tuple(_coerce_annotation(inner, item) for inner, item in zip(args, value))
127
+
128
+ if origin in (dict, Dict, Mapping):
129
+ return dict(value)
130
+
131
+ if annotation is Any:
132
+ return value
133
+
134
+ if annotation is MetaStudyConfig and isinstance(value, Mapping) and value.get("simple_mode"):
135
+ return SimpleMetaStudyConfig(**dict(value))
136
+
137
+ if hasattr(annotation, "__dataclass_fields__") and isinstance(value, Mapping):
138
+ kwargs = {}
139
+ for field_name, field_def in annotation.__dataclass_fields__.items():
140
+ if field_name in value:
141
+ kwargs[field_name] = _coerce_annotation(field_def.type, value[field_name])
142
+ return annotation(**kwargs)
143
+
144
+ return value
145
+
146
+
147
def _rebuild_node_config(payload: Mapping[str, Any]) -> NodePKExperimentConfig:
    """Reconstruct a ``NodePKExperimentConfig`` from serialized dict content.

    Missing keys fall back to the same defaults used at training time; nested
    sections are coerced back into their dataclass types.
    """

    return NodePKExperimentConfig(
        experiment_type=str(payload.get("experiment_type", "nodepk")).lower(),
        name_str=str(payload.get("name_str", "NodePK")),
        comet_ai_key=payload.get("comet_ai_key"),
        experiment_name=str(payload.get("experiment_name", "node_pk_compartments")),
        hugging_face_token=payload.get("hugging_face_token"),
        upload_to_hf_hub=bool(payload.get("upload_to_hf_hub", False)),
        hf_model_name=str(payload.get("hf_model_name", "NodePK_runtime")),
        hf_model_card_path=tuple(payload.get("hf_model_card_path", ("hf_model_cards", "README.md"))),
        tags=list(payload.get("tags", [])),
        # NOTE: "indentifier" spelling matches the config class's field name.
        experiment_indentifier=payload.get("experiment_indentifier"),
        my_results_path=payload.get("my_results_path"),
        experiment_dir=payload.get("experiment_dir"),
        verbose=bool(payload.get("verbose", False)),
        run_index=int(payload.get("run_index", 0)),
        debug_test=bool(payload.get("debug_test", False)),
        # Nested sections are rebuilt via the generic annotation-driven coercion.
        network=_coerce_annotation(EncoderDecoderNetworkConfig, payload.get("network", {})),
        mix_data=_coerce_annotation(MixDataConfig, payload.get("mix_data", {})),
        context_observations=_coerce_annotation(
            ObservationsConfig, payload.get("context_observations", {})
        ),
        target_observations=_coerce_annotation(
            ObservationsConfig, payload.get("target_observations", {})
        ),
        meta_study=_coerce_annotation(MetaStudyConfig, payload.get("meta_study", {})),
        dosing=_coerce_annotation(MetaDosingConfig, payload.get("dosing", {})),
        train=_coerce_annotation(TrainingConfig, payload.get("train", {})),
    )
+ )
178
+
179
+
180
def _rebuild_flow_config(payload: Mapping[str, Any]) -> FlowPKExperimentConfig:
    """Reconstruct a ``FlowPKExperimentConfig`` from serialized dict content.

    Mirrors :func:`_rebuild_node_config` with flow-specific extras
    (``flow_num_steps``, ``vector_field``, ``source_process``).
    """

    return FlowPKExperimentConfig(
        experiment_type=str(payload.get("experiment_type", "flowpk")).lower(),
        name_str=str(payload.get("name_str", "FlowPK")),
        comet_ai_key=payload.get("comet_ai_key"),
        experiment_name=str(payload.get("experiment_name", "flow_pk_compartments")),
        hugging_face_token=payload.get("hugging_face_token"),
        upload_to_hf_hub=bool(payload.get("upload_to_hf_hub", False)),
        hf_model_name=str(payload.get("hf_model_name", "FlowPK_runtime")),
        hf_model_card_path=tuple(payload.get("hf_model_card_path", ("hf_model_cards", "README.md"))),
        tags=list(payload.get("tags", [])),
        experiment_indentifier=payload.get("experiment_indentifier"),
        my_results_path=payload.get("my_results_path"),
        experiment_dir=payload.get("experiment_dir"),
        verbose=bool(payload.get("verbose", False)),
        run_index=int(payload.get("run_index", 0)),
        debug_test=bool(payload.get("debug_test", False)),
        flow_num_steps=int(payload.get("flow_num_steps", 50)),
        vector_field=_coerce_annotation(VectorFieldPKConfig, payload.get("vector_field", {})),
        source_process=_coerce_annotation(SourceProcessConfig, payload.get("source_process", {})),
        mix_data=_coerce_annotation(MixDataConfig, payload.get("mix_data", {})),
        context_observations=_coerce_annotation(
            ObservationsConfig, payload.get("context_observations", {})
        ),
        target_observations=_coerce_annotation(
            ObservationsConfig, payload.get("target_observations", {})
        ),
        meta_study=_coerce_annotation(MetaStudyConfig, payload.get("meta_study", {})),
        dosing=_coerce_annotation(MetaDosingConfig, payload.get("dosing", {})),
        train=_coerce_annotation(TrainingConfig, payload.get("train", {})),
    )
+ )
213
+
214
+
215
def _rebuild_diffusion_config(payload: Mapping[str, Any]) -> DiffusionPKExperimentConfig:
    """Reconstruct a ``DiffusionPKExperimentConfig`` from serialized dict content.

    Mirrors :func:`_rebuild_node_config` with diffusion-specific extras
    (``diffusion_type``, ``predict_gaussian_noise``, ``source_process``).
    """

    return DiffusionPKExperimentConfig(
        experiment_type=str(payload.get("experiment_type", "diffusionpk")).lower(),
        name_str=str(payload.get("name_str", "ContinuousDiffusionPK")),
        diffusion_type=str(payload.get("diffusion_type", "continuous")),
        comet_ai_key=payload.get("comet_ai_key"),
        experiment_name=str(payload.get("experiment_name", "diffusion_pk_compartments")),
        hugging_face_token=payload.get("hugging_face_token"),
        upload_to_hf_hub=bool(payload.get("upload_to_hf_hub", False)),
        hf_model_name=str(payload.get("hf_model_name", "DiffusionPK_runtime")),
        hf_model_card_path=tuple(payload.get("hf_model_card_path", ("hf_model_cards", "README.md"))),
        tags=list(payload.get("tags", [])),
        experiment_indentifier=payload.get("experiment_indentifier"),
        my_results_path=payload.get("my_results_path"),
        experiment_dir=payload.get("experiment_dir"),
        verbose=bool(payload.get("verbose", False)),
        run_index=int(payload.get("run_index", 0)),
        debug_test=bool(payload.get("debug_test", False)),
        predict_gaussian_noise=bool(payload.get("predict_gaussian_noise", True)),
        network=_coerce_annotation(EncoderDecoderNetworkConfig, payload.get("network", {})),
        source_process=_coerce_annotation(SourceProcessConfig, payload.get("source_process", {})),
        mix_data=_coerce_annotation(MixDataConfig, payload.get("mix_data", {})),
        context_observations=_coerce_annotation(
            ObservationsConfig, payload.get("context_observations", {})
        ),
        target_observations=_coerce_annotation(
            ObservationsConfig, payload.get("target_observations", {})
        ),
        meta_study=_coerce_annotation(MetaStudyConfig, payload.get("meta_study", {})),
        dosing=_coerce_annotation(MetaDosingConfig, payload.get("dosing", {})),
        train=_coerce_annotation(TrainingConfig, payload.get("train", {})),
    )
+ )
249
+
250
+
251
+ def rebuild_experiment_config(
252
+ payload: Mapping[str, Any],
253
+ ) -> Union[NodePKExperimentConfig, FlowPKExperimentConfig, DiffusionPKExperimentConfig]:
254
+ """Rebuild the serialized experiment config stored in the Hub config."""
255
+
256
+ experiment_type = str(payload.get("experiment_type", "nodepk")).lower()
257
+ if experiment_type == "nodepk":
258
+ return _rebuild_node_config(payload)
259
+ if experiment_type == "flowpk":
260
+ return _rebuild_flow_config(payload)
261
+ if experiment_type == "diffusionpk":
262
+ return _rebuild_diffusion_config(payload)
263
+ raise ValueError(f"Unsupported experiment_type for runtime bundle: {experiment_type!r}.")
264
+
265
+
266
+ def compute_runtime_builder_config(
267
+ exp_config: Union[NodePKExperimentConfig, FlowPKExperimentConfig, DiffusionPKExperimentConfig],
268
+ ) -> RuntimeBuilderConfig:
269
+ """Compute fixed empirical StudyJSON capacities from the experiment config."""
270
+
271
+ context_strategy = ObservationStrategyFactory.from_config(
272
+ exp_config.context_observations,
273
+ exp_config.meta_study,
274
+ )
275
+ target_strategy = ObservationStrategyFactory.from_config(
276
+ exp_config.target_observations,
277
+ exp_config.meta_study,
278
+ )
279
+ ctx_obs_cap, ctx_rem_cap = context_strategy.get_shapes()
280
+ tgt_obs_cap, tgt_rem_cap = target_strategy.get_shapes()
281
+
282
+ max_context_individuals = int(exp_config.meta_study.num_individuals_range[-1])
283
+ max_target_individuals = int(getattr(exp_config.mix_data, "n_of_target_individuals", 1))
284
+ if max_target_individuals < 0:
285
+ raise ValueError("n_of_target_individuals must be >= 0 for Hub runtime export.")
286
+
287
+ return RuntimeBuilderConfig(
288
+ max_context_individuals=max_context_individuals,
289
+ max_target_individuals=max_target_individuals,
290
+ max_context_observations=int(ctx_obs_cap),
291
+ max_target_observations=int(tgt_obs_cap),
292
+ max_context_remaining=int(ctx_rem_cap),
293
+ max_target_remaining=int(tgt_rem_cap),
294
+ )
295
+
296
+
297
+ def infer_supported_tasks(backbone: torch.nn.Module) -> List[str]:
298
+ """Infer the public task surface supported by the wrapped model."""
299
+
300
+ tasks: List[str] = []
301
+ if isinstance(backbone, NewGenerativeMixin):
302
+ tasks.append("generate")
303
+ if isinstance(backbone, NewPredictiveMixin):
304
+ tasks.append("predict")
305
+ return tasks
306
+
307
+
308
+ def validate_runtime_architecture(backbone: torch.nn.Module) -> str:
309
+ """Ensure the loaded architecture is supported by the runtime bundle v1."""
310
+
311
+ architecture_name = backbone.__class__.__name__
312
+ if architecture_name not in SUPPORTED_RUNTIME_ARCHITECTURES:
313
+ raise ValueError(
314
+ "Runtime Hub export only supports "
315
+ f"{sorted(SUPPORTED_RUNTIME_ARCHITECTURES)}, got {architecture_name!r}."
316
+ )
317
+ return architecture_name
318
+
319
+
320
+ def build_runtime_config_payload(
321
+ *,
322
+ backbone: torch.nn.Module,
323
+ exp_config: Union[NodePKExperimentConfig, FlowPKExperimentConfig, DiffusionPKExperimentConfig],
324
+ original_repo_id: Optional[str],
325
+ runtime_repo_id: Optional[str],
326
+ ) -> Dict[str, Any]:
327
+ """Build the serializable fields stored in the Hub config."""
328
+
329
+ architecture_name = validate_runtime_architecture(backbone)
330
+ supported_tasks = infer_supported_tasks(backbone)
331
+ if not supported_tasks:
332
+ raise ValueError(f"Model {architecture_name!r} does not expose runtime tasks.")
333
+
334
+ builder_config = compute_runtime_builder_config(exp_config)
335
+ return {
336
+ "architecture_name": architecture_name,
337
+ "experiment_type": str(getattr(exp_config, "experiment_type", "nodepk")).lower(),
338
+ "experiment_config": asdict(exp_config),
339
+ "builder_config": builder_config.to_dict(),
340
+ "supported_tasks": supported_tasks,
341
+ "default_task": supported_tasks[0],
342
+ "io_schema_version": STUDY_JSON_IO_VERSION,
343
+ "original_repo_id": original_repo_id,
344
+ "runtime_repo_id": runtime_repo_id,
345
+ }
346
+
347
+
348
+ def instantiate_backbone_from_hub_config(config: PretrainedConfig) -> torch.nn.Module:
349
+ """Rebuild the internal PK model represented by the public Hub wrapper."""
350
+
351
+ experiment_config_payload = getattr(config, "experiment_config", None)
352
+ if not isinstance(experiment_config_payload, Mapping):
353
+ raise ValueError("Hub config is missing the serialized experiment_config payload.")
354
+ exp_config = rebuild_experiment_config(experiment_config_payload)
355
+ model_cls = get_model_class(exp_config)
356
+ backbone = model_cls(exp_config)
357
+ backbone.eval()
358
+ return backbone
359
+
360
+
361
+ def normalize_studies_input(
362
+ studies: Union[StudyJSON, Sequence[StudyJSON]],
363
+ ) -> List[StudyJSON]:
364
+ """Normalize runtime input to a mutable list of canonicalized studies."""
365
+
366
+ if isinstance(studies, Mapping):
367
+ raw_studies = [dict(studies)]
368
+ else:
369
+ raw_studies = [dict(study) for study in studies]
370
+ return [canonicalize_study(study, drop_tgt_too_few=False) for study in raw_studies]
371
+
372
+
373
+ def validate_studies_for_task(
374
+ studies: Sequence[StudyJSON],
375
+ *,
376
+ task: str,
377
+ builder_config: RuntimeBuilderConfig,
378
+ ) -> None:
379
+ """Validate task semantics and reject inputs that exceed runtime capacities."""
380
+
381
+ for study_idx, study in enumerate(studies):
382
+ context = list(study.get("context", []))
383
+ target = list(study.get("target", []))
384
+
385
+ if task == "generate":
386
+ if not context:
387
+ raise ValueError("`generate` requires at least one context individual per study.")
388
+ if target:
389
+ raise ValueError("`generate` expects target to be empty in the input StudyJSON.")
390
+ elif task == "predict":
391
+ if not target:
392
+ raise ValueError("`predict` requires at least one target individual per study.")
393
+ else:
394
+ raise ValueError(f"Unsupported task {task!r}.")
395
+
396
+ if len(context) > builder_config.max_context_individuals:
397
+ raise ValueError(
398
+ f"Study {study_idx} exceeds context individual capacity "
399
+ f"({len(context)} > {builder_config.max_context_individuals})."
400
+ )
401
+ if len(target) > builder_config.max_target_individuals:
402
+ raise ValueError(
403
+ f"Study {study_idx} exceeds target individual capacity "
404
+ f"({len(target)} > {builder_config.max_target_individuals})."
405
+ )
406
+
407
+ _validate_individual_block(
408
+ study_idx=study_idx,
409
+ block_name="context",
410
+ individuals=context,
411
+ max_observations=builder_config.max_context_observations,
412
+ max_remaining=builder_config.max_context_remaining,
413
+ )
414
+ _validate_individual_block(
415
+ study_idx=study_idx,
416
+ block_name="target",
417
+ individuals=target,
418
+ max_observations=builder_config.max_target_observations,
419
+ max_remaining=builder_config.max_target_remaining,
420
+ )
421
+
422
+
423
+ def _validate_individual_block(
424
+ *,
425
+ study_idx: int,
426
+ block_name: str,
427
+ individuals: Sequence[IndividualJSON],
428
+ max_observations: int,
429
+ max_remaining: int,
430
+ ) -> None:
431
+ """Reject studies that would otherwise be truncated by the empirical builder."""
432
+
433
+ for ind_idx, individual in enumerate(individuals):
434
+ obs_len = len(individual.get("observations", []))
435
+ rem_len = len(individual.get("remaining", []))
436
+ if obs_len > max_observations:
437
+ raise ValueError(
438
+ f"Study {study_idx} {block_name}[{ind_idx}] exceeds observation capacity "
439
+ f"({obs_len} > {max_observations})."
440
+ )
441
+ if rem_len > max_remaining:
442
+ raise ValueError(
443
+ f"Study {study_idx} {block_name}[{ind_idx}] exceeds remaining capacity "
444
+ f"({rem_len} > {max_remaining})."
445
+ )
446
+
447
+
448
+ def build_batch_from_studies(
449
+ studies: Sequence[StudyJSON],
450
+ *,
451
+ builder_config: RuntimeBuilderConfig,
452
+ meta_dosing: MetaDosingConfig,
453
+ ):
454
+ """Convert canonical studies into the internal PK databatch representation."""
455
+
456
+ builder = JSON2AICMEBuilder(
457
+ builder_config.to_empirical_batch_config(max_databatch_size=max(1, len(studies)))
458
+ )
459
+ return builder.build_one_aicmebatch(list(studies), meta_dosing)
460
+
461
+
462
+ def split_runtime_samples(task: str, study: StudyJSON) -> List[StudyJSON]:
463
+ """Convert model-specific StudyJSON outputs into per-sample StudyJSONs."""
464
+
465
+ if task == "generate":
466
+ return _split_generate_samples(study)
467
+ if task == "predict":
468
+ return _split_predict_samples(study)
469
+ raise ValueError(f"Unsupported task {task!r}.")
470
+
471
+
472
def _split_generate_samples(study: StudyJSON) -> List[StudyJSON]:
    """Split generated target individuals into one StudyJSON per sample.

    Each returned study keeps an independent deep copy of the context and
    meta data, paired with exactly one generated target individual.
    """

    targets = list(study.get("target", []))
    if not targets:
        # Nothing to split — return an independent copy of the whole study.
        return [deepcopy(study)]

    return [
        {
            "context": deepcopy(study.get("context", [])),
            "target": [deepcopy(target)],
            "meta_data": deepcopy(study.get("meta_data", {})),
        }
        for target in targets
    ]
489
+
490
+
491
def _split_predict_samples(study: StudyJSON) -> List[StudyJSON]:
    """Split target prediction samples into one StudyJSON per sample index.

    Each returned study keeps the full context and every target individual,
    but each target carries exactly one ``prediction_samples`` entry — the
    one at the shared sample index. Targets without any samples are carried
    through unchanged.

    Raises:
        ValueError: if targets expose differing non-zero sample counts.
    """

    targets = list(study.get("target", []))
    if not targets:
        # Nothing to split — return an independent copy of the whole study.
        return [deepcopy(study)]

    # Number of per-sample studies = the largest sample count across targets.
    sample_count = 0
    for target in targets:
        sample_count = max(sample_count, len(target.get("prediction_samples", [])))
    if sample_count == 0:
        return [deepcopy(study)]

    split: List[StudyJSON] = []
    for sample_idx in range(sample_count):
        target_block: List[IndividualJSON] = []
        for target in targets:
            target_copy: IndividualJSON = deepcopy(target)
            samples = list(target.get("prediction_samples", []))
            if samples:
                # Every target that has samples must cover every sample index.
                if sample_idx >= len(samples):
                    raise ValueError(
                        "All target individuals must expose the same number of prediction samples."
                    )
                target_copy["prediction_samples"] = [deepcopy(samples[sample_idx])]
            target_block.append(target_copy)

        split.append(
            {
                "context": deepcopy(study.get("context", [])),
                "target": target_block,
                "meta_data": deepcopy(study.get("meta_data", {})),
            }
        )
    return split
526
+
527
+
528
def runtime_readme_text(
    *,
    base_model_card: str,
    runtime_repo_id: str,
    original_repo_id: Optional[str],
    supported_tasks: Sequence[str],
    default_task: str,
) -> str:
    """Compose the README uploaded with the consumer-facing runtime bundle.

    Args:
        base_model_card: Existing model-card markdown; the usage section is
            appended after it.
        runtime_repo_id: Hub id of the runtime bundle repository.
        original_repo_id: Hub id of the native training repo, or None if it
            was not recorded.
        supported_tasks: Task names advertised in the README.
        default_task: Task name used in the generated usage example.

    Returns:
        The full README markdown text.
    """

    original_line = (
        f"- Native training/artifact repo: `{original_repo_id}`"
        if original_repo_id
        else "- Native training/artifact repo: not recorded"
    )
    tasks_literal = ", ".join(f"`{task}`" for task in supported_tasks)

    # NOTE: doubled braces ({{ }}) render as literal braces in the f-string
    # so the embedded Python examples show plain dict syntax.
    usage = f"""

## Runtime Bundle

This repository is the consumer-facing runtime bundle for this PK model.

- Runtime repo: `{runtime_repo_id}`
{original_line}
- Supported tasks: {tasks_literal}
- Default task: `{default_task}`
- Load path: `AutoModel.from_pretrained(..., trust_remote_code=True)`

### Installation

You do **not** need to install `sim_priors_pk` to use this runtime bundle.

`transformers` is the public loading entrypoint, but `transformers` alone is
not sufficient because this is a PyTorch model with custom runtime code. A
reliable consumer environment is:

```bash
pip install torch transformers huggingface_hub lightning datasets pandas torchtyping gpytorch pot torchdiffeq torchsde ruamel.yaml pyyaml
```

### Python Usage

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("{runtime_repo_id}", trust_remote_code=True)

studies = [
    {{
        "context": [
            {{
                "name_id": "ctx_0",
                "observations": [0.2, 0.5, 0.3],
                "observation_times": [0.5, 1.0, 2.0],
                "dosing": [1.0],
                "dosing_type": ["oral"],
                "dosing_times": [0.0],
                "dosing_name": ["oral"],
            }}
        ],
        "target": [],
        "meta_data": {{"study_name": "demo", "substance_name": "drug_x"}},
    }}
]

outputs = model.run_task(
    task="{default_task}",
    studies=studies,
    num_samples=4,
)
print(outputs["results"][0]["samples"])
```

### Predictive Sampling

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("{runtime_repo_id}", trust_remote_code=True)

predict_studies = [
    {{
        "context": [
            {{
                "name_id": "ctx_0",
                "observations": [0.2, 0.5, 0.3],
                "observation_times": [0.5, 1.0, 2.0],
                "dosing": [1.0],
                "dosing_type": ["oral"],
                "dosing_times": [0.0],
                "dosing_name": ["oral"],
            }}
        ],
        "target": [
            {{
                "name_id": "tgt_0",
                "observations": [0.25, 0.31],
                "observation_times": [0.5, 1.0],
                "remaining": [0.0, 0.0, 0.0],
                "remaining_times": [2.0, 4.0, 8.0],
                "dosing": [1.0],
                "dosing_type": ["oral"],
                "dosing_times": [0.0],
                "dosing_name": ["oral"],
            }}
        ],
        "meta_data": {{"study_name": "demo", "substance_name": "drug_x"}},
    }}
]

outputs = model.run_task(
    task="predict",
    studies=predict_studies,
    num_samples=4,
)
print(outputs["results"][0]["samples"][0]["target"][0]["prediction_samples"])
```

### Notes

- `trust_remote_code=True` is required because this model uses custom Hugging Face Hub runtime code.
- The consumer API is `transformers` + `run_task(...)`; the consumer does not need a local clone of this repository.
- This runtime bundle is intentionally separate from the native training export so you can evaluate both distribution paths in parallel.
"""

    return base_model_card.rstrip() + "\n" + usage.strip() + "\n"
655
+
656
+
657
def resolve_model_card_text(model_card_path: Path) -> str:
    """Read and validate the model card that seeds the runtime README.

    Raises:
        FileNotFoundError: if *model_card_path* is not an existing file.
    """

    if model_card_path.is_file():
        return model_card_path.read_text(encoding="utf-8")
    raise FileNotFoundError(f"Model card not found at: {model_card_path}")
sim_priors_pk/metrics/__init__.py ADDED
File without changes
sim_priors_pk/metrics/pk_metrics.py ADDED
@@ -0,0 +1,490 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import torch
2
+ import numpy as np
3
+ from typing import List,Tuple
4
+ from matplotlib import pyplot as plt
5
+ from torchtyping import TensorType, patch_typeguard
6
+ from sim_priors_pk.data.datasets.aicme_batch import AICMECompartmentsDataBatch
7
+ from scipy import stats
8
+ import torch
9
+ from typing import Tuple
10
+
11
+ Tensor = torch.Tensor # for brevity – keep your own alias if you prefer
12
+ import os
13
+
14
def ensure_folder_exists(folder_name: str) -> None:
    """Create *folder_name* (including missing parents) if it does not exist.

    Prints a status line either way so interactive callers can see what
    happened.

    Parameters
    ----------
    folder_name : str
        Path of the directory to create.
    """
    if not os.path.exists(folder_name):
        # exist_ok=True guards against the race where another process creates
        # the directory between the exists() check and makedirs().
        os.makedirs(folder_name, exist_ok=True)
        print(f"✅ Created folder: {folder_name}")
    else:
        print(f"📁 Folder already exists: {folder_name}")
20
+
21
def combine_samples(
    samples_list: list[TensorType["S", "B", "I", "T", 1]]
) -> TensorType["S", "P", "T"]:
    """Stack per-permutation sample tensors into one [S, P, T] tensor.

    Each entry of ``samples_list`` (length P) is expected to have shape
    [S, B, I, T, 1] with B == I == 1; the singleton batch, individual and
    channel axes are dropped and the P entries are stacked along axis 1.
    """
    # Keep only the [S, T] slice of each tensor (B=1, I=1, last dim=1).
    flat_slices = [sample[:, 0, 0, :, 0] for sample in samples_list]
    # The new "permutation" axis P sits between S and T.
    return torch.stack(flat_slices, dim=1)
42
+
43
def extract_context_by_mask(
    db: AICMECompartmentsDataBatch
) -> Tuple[
    List[TensorType["n_i"]],  # context observations per individual
    List[TensorType["n_i"]]   # context times per individual
]:
    """Collect the masked context observations/times of a B=1 databatch.

    Expects:
        db.context_obs:      [1, c_ind, num_obs_c, 1]
        db.context_obs_time: [1, c_ind, num_obs_c, 1]
        db.context_obs_mask: [1, c_ind, num_obs_c]

    Returns two lists of length c_ind; entry i holds only the observations
    (resp. times) whose mask entry is truthy.
    """
    B, c_ind, num_obs_c, one = db.context_obs.shape
    assert B == 1 and one == 1, f"Expected B=1 and last dim=1, got B={B}, last={one}"

    # [1, c_ind, num_obs_c, 1] -> [c_ind, num_obs_c]
    obs_grid = db.context_obs.squeeze(0).squeeze(-1)
    time_grid = db.context_obs_time.squeeze(0).squeeze(-1)
    mask_grid = db.context_obs_mask.squeeze(0)

    obs_list: List[torch.Tensor] = []
    time_list: List[torch.Tensor] = []
    for obs_row, time_row, mask_row in zip(obs_grid, time_grid, mask_grid):
        keep = mask_row.bool()
        obs_list.append(obs_row[keep])
        time_list.append(time_row[keep])

    return obs_list, time_list
80
+
81
def compute_pd(
    y_obs : TensorType["I", "T"],      # observed data
    y_sim : TensorType["S", "I", "T"], # S simulated datasets
    mask  : TensorType["I", "T"],      # True/1 = valid obs
) -> TensorType["I", "T"]:             # pd, NaN where mask == 0
    """
    Prediction discrepancy (pd) — Eq. (4) Comets et al. 2008.

    NOTICE THAT THERE IS NO BATCH INDEX, this works only on individual
    substances.

    Parameters
    ----------
    y_obs : [I, T] observed values (padding entries are ignored via `mask`)
    y_sim : [S, I, T] S Monte-Carlo replicates generated from the model
    mask  : [I, T] binary mask — True at valid observation points

    Returns
    -------
    pd : [I, T] empirical CDF value at (i,j); NaN where mask==0
    """
    num_sims, num_ind, num_times = y_sim.shape
    assert y_obs.shape == (num_ind, num_times), "y_obs must be [I,T]"
    assert mask.shape == (num_ind, num_times), "mask must be [I,T]"

    # Indicator δ = 1[y_sim < y_obs]; y_obs broadcasts over the S axis.
    below_obs = (y_sim < y_obs.unsqueeze(0)).float()  # [S,I,T]

    # Averaging over simulations yields the empirical CDF at each (i,j).
    pd = below_obs.mean(dim=0)  # [I,T]

    # Flag padded entries with NaN so callers can tell them apart.
    nan_fill = torch.full_like(pd, float("nan"))
    return torch.where(mask.bool(), pd, nan_fill)
119
+
120
def sample_covariance_manual_torch(
    X: TensorType["S", "Tv"]  # simulations for one subject, S×Tᵥ
):
    """Unbiased sample covariance and mean of the rows of *X*.

    Returns
    -------
    cov      : [Tᵥ, Tᵥ] unbiased (S-1 denominator) covariance matrix
    mean_vec : [Tᵥ] column-wise mean
    """
    num_samples = X.shape[0]
    mean_vec = X.mean(dim=0)               # [Tᵥ]
    centered = X - mean_vec                # [S, Tᵥ]
    cov = centered.t() @ centered / (num_samples - 1)  # [Tᵥ, Tᵥ]
    return cov, mean_vec
132
+
133
def whiten_manual_torch_old(
    X: TensorType["S", "Tv"],  # data to whiten
    eps: float = 1e-8          # ridge for numerical safety
):
    """Manual whitening via the eigendecomposition of the sample covariance.

    Returns the whitened data, the whitening matrix W = Σ^{-1/2}, and the
    sample mean.
    """
    cov, mean_vec = sample_covariance_manual_torch(X)  # Σ, μ
    # Ridge keeps the eigendecomposition numerically safe.
    ridge = eps * torch.eye(cov.size(0), device=X.device)
    eigvals, eigvecs = torch.linalg.eigh(cov + ridge)
    inv_sqrt_diag = torch.diag(torch.rsqrt(eigvals))   # diag(1/√λ)
    whitening = eigvecs @ inv_sqrt_diag @ eigvecs.t()  # Σ^{-1/2}
    whitened = (X - mean_vec) @ whitening
    return whitened, whitening, mean_vec
147
+
148
def compute_npde_full_old(
    y_obs: TensorType["I", "T"],
    y_sim: TensorType["S", "I", "T"],
    mask : TensorType["I", "T"],
    eps  : float = 1e-8
) -> TensorType["I", "T"]:
    """
    Full NPDE with within-subject decorrelation (Σ^{-1/2}).

    NOTICE THAT THERE IS NO BATCH INDEX, this works only on individual
    substances.

    Parameters
    ----------
    y_obs : [I,T] observations (padding allowed)
    y_sim : [S,I,T] S Monte-Carlo replicates
    mask  : [I,T] True/1 = valid time-points

    Returns
    -------
    npde : [I,T]; NaN at padded points and where whitening degraded.
    """
    S, I, T = y_sim.shape
    N01 = torch.distributions.Normal(0.0, 1.0)
    out = torch.full_like(y_obs, float("nan"))  # result placeholder

    for i in range(I):
        # ---- select the irregular grid for subject i -------------------
        valid_idx = mask[i].bool()
        if not valid_idx.any():
            continue  # nothing to do

        y_i_obs = y_obs[i, valid_idx]      # [Tᵥ]
        y_i_sim = y_sim[:, i, valid_idx]   # [S,Tᵥ]

        # ---- whitening -------------------------------------------------
        # BUG FIX: whiten_manual_torch returns four values
        # (X_white, W, mean, ok); the previous three-way unpack raised
        # ValueError at runtime.
        y_i_sim_white, W, mean_vec, ok = whiten_manual_torch(y_i_sim, eps)
        if W is None:
            # Whitening degraded → leave this subject's entries as NaN.
            out[i, valid_idx] = float("nan")
            continue

        # same transform for the single observation vector
        y_i_obs_white = (y_i_obs - mean_vec) @ W  # [Tᵥ]

        # ---- empirical CDF on whitened scale (Eq. 4) -------------------
        delta = (y_i_sim_white < y_i_obs_white).float()  # [S,Tᵥ]
        pde = delta.mean(dim=0)                          # [Tᵥ]

        # ---- edge-case rule (Eq. 6): keep pde strictly inside (0,1) ----
        one_over_S = 1.0 / S
        pde = torch.where(pde == 0, torch.full_like(pde, one_over_S), pde)
        pde = torch.where(pde == 1, torch.full_like(pde, 1 - one_over_S), pde)

        # ---- NPDE (Eq. 7): inverse-normal transform --------------------
        npde = N01.icdf(pde)  # [Tᵥ]

        # ---- write back to full-size tensor ----------------------------
        out[i, valid_idx] = npde

    return out
205
+
206
+ # ---------------------------------------------------------------------
207
+ # 1. Robust whitening
208
+ # ---------------------------------------------------------------------
209
def whiten_manual_torch(
    X: Tensor,                   # [S, Tᵥ]
    eps: float = 1e-8,
    max_attempts: int = 5,
    base_jitter: float = 1e-6
) -> Tuple[Tensor, torch.Tensor | None, Tensor, bool]:
    """Robust whitening of the rows of *X* with escalating ridge jitter.

    Returns
    -------
    X_white : [S,Tᵥ]          whitened simulations
    W       : [Tᵥ,Tᵥ] | None  Σ^{-½} (None ⇒ degraded to diagonal)
    mean    : [Tᵥ]            sample mean
    ok      : bool            True if the full Σ^{-½} was used
    """
    num_samples, num_times = X.shape
    data64 = X.double()
    mean = data64.mean(dim=0)
    centered = data64 - mean
    cov = (centered.T @ centered) / (num_samples - 1)
    identity = torch.eye(num_times, dtype=data64.dtype, device=X.device)

    transform = None
    for attempt in range(max_attempts):
        # Ridge grows by a factor of 10 per failed attempt.
        ridge = eps + base_jitter * (10.0 ** attempt)
        try:
            eigvals, eigvecs = torch.linalg.eigh(cov + ridge * identity)
        except RuntimeError:
            continue  # eigh failed → retry with a bigger jitter
        if torch.any(eigvals <= 0):
            continue  # not positive definite → retry with a bigger jitter
        transform = eigvecs @ torch.diag(torch.rsqrt(eigvals)) @ eigvecs.T
        break

    ok = transform is not None
    if not ok:
        # Final fallback: decorrelation failed, whiten each dimension by
        # its own variance only.
        diag_var = cov.diag().clamp_min(eps)
        transform = torch.diag(torch.rsqrt(diag_var))

    whitened = (centered @ transform).float()
    return whitened, transform.float() if ok else None, mean.float(), ok
252
+
253
+
254
+ # ---------------------------------------------------------------------
255
+ # 2. NPDE with an *output* validity mask
256
+ # ---------------------------------------------------------------------
257
+
258
def compute_npde_full(
    y_obs: TensorType["I", "T"],
    y_sim: TensorType["S", "I", "T"],
    mask : TensorType["I", "T"],
    eps  : float = 1e-8,
) -> Tuple[TensorType["I", "T"], TensorType["I", "T"]]:
    """
    Full NPDE with within-subject decorrelation (Σ^{-1/2}).

    NOTICE THAT THERE IS NO BATCH INDEX, this works only on individual
    substances.

    Args
    ------
    y_obs : [I,T] observations (padding allowed)
    y_sim : [S,I,T] S Monte-Carlo replicates
    mask  : [I,T] True/1 = valid time-points

    Returns
    -------
    npde       : [I,T] – same shape as `y_obs`; NaN where not computed
    valid_mask : [I,T] – True where npde is statistically valid (the user
                 mask, minus subjects where whitening degraded)
    """
    S, I, T = y_sim.shape
    N01 = torch.distributions.Normal(0.0, 1.0)

    npde_out = torch.full_like(y_obs, float("nan"))
    valid_out = mask.clone().bool()  # start with the user mask

    for i in range(I):
        # ---- select the irregular grid for subject i -------------------
        valid_idx = mask[i].bool()
        if not valid_idx.any():
            # No valid points at all → whole row invalid.
            valid_out[i] = False
            continue

        y_i_obs = y_obs[i, valid_idx]      # [Tᵥ]
        y_i_sim = y_sim[:, i, valid_idx]   # [S,Tᵥ]

        # ---- whitening (robust Σ^{-1/2} with jitter fallback) ----------
        y_i_sim_white, W, mean_vec, ok = whiten_manual_torch(y_i_sim, eps)

        if not ok:  # whitening failed → invalidate this subject's points
            valid_out[i, valid_idx] = False
            continue

        # same transform for the single observation vector
        y_i_obs_white = (y_i_obs - mean_vec) @ W

        # ---- empirical CDF on whitened scale (Eq. 4) -------------------
        delta = (y_i_sim_white < y_i_obs_white).float()
        pde = delta.mean(dim=0)

        # ---- edge-case rule (Eq. 6): keep pde strictly inside (0,1) ----
        one_over_S = 1.0 / S
        pde = torch.where(pde == 0, torch.full_like(pde, one_over_S), pde)
        pde = torch.where(pde == 1, torch.full_like(pde, 1 - one_over_S), pde)

        # ---- NPDE (Eq. 7): inverse-normal transform --------------------
        npde = N01.icdf(pde)
        npde_out[i, valid_idx] = npde

    return npde_out, valid_out
321
+
322
+
323
def compute_npde_in_batch(
    y_obs: TensorType["B", "I", "T"],
    y_sim: TensorType["S", "B", "I", "T"],
    mask: TensorType["B", "I", "T"],
    eps: float = 1e-8,
) -> TensorType["B", "I", "T"]:
    """Compute NPDE for each element in a batch.

    Parameters
    ----------
    y_obs : [B, I, T] Observed values per batch item (context observations).
    y_sim : [S, B, I, T] Simulated predictions.
    mask  : [B, I, T] Validity mask for observations.

    Returns
    -------
    Tensor of shape [B, I, T] with NPDE values (NaN where invalid).
    """
    B = y_obs.size(0)
    results = []
    for b in range(B):
        # BUG FIX: compute_npde_full returns (npde, valid_mask); the old
        # code appended the whole tuple, so torch.stack raised a TypeError.
        npde_b, _ = compute_npde_full(y_obs[b], y_sim[:, b], mask[b], eps)
        results.append(npde_b)
    return torch.stack(results, dim=0)
347
+
348
def shapiro_wilk_normality(npde: TensorType["T"]) -> Tuple[float, float]:
    """Return Shapiro-Wilk normality test statistic and p-value for a 1-D tensor.

    Non-finite entries (NaN/Inf padding) are dropped before the test.
    """
    finite_vals = npde[torch.isfinite(npde)]
    result = stats.shapiro(finite_vals.detach().cpu().numpy())
    return float(result[0]), float(result[1])
353
+
354
def qq_plot(npde: TensorType["T"], train:bool =False, epoch:str|int = "na", **kwargs) -> str | None:
    """
    Generate and optionally save/show a QQ plot of NPDE values.

    Non-finite entries are dropped before plotting.

    Args:
        npde: Tensor containing NPDE values.
        train: If True, save the plot to ``qq_plot_epoch_{epoch}.png``;
            if False (the default), show the plot interactively instead.
        epoch: Epoch label used in the saved file name.
        **kwargs: Accepted for call-site compatibility; currently unused.

    Returns:
        File path if saved, None otherwise.
    """
    npde_np = npde[torch.isfinite(npde)].detach().cpu().numpy()

    fig = plt.figure()
    stats.probplot(npde_np, dist="norm", plot=plt)

    if train:
        # Save (instead of displaying) when called from a training loop.
        path = f"qq_plot_epoch_{epoch}.png"
        fig.savefig(path, bbox_inches="tight")
        plt.close(fig)
        return path
    else:
        plt.show()
        return None
380
+
381
# NOTE(review): the name looks like a typo of "vpc_from_sample" — kept as-is
# because callers may already reference it.
def vcp_from_sample(model,databatch_list,empirical_databatch,train=False):
    """
    Build a VPC plot from fresh model samples.

    In order to have a shape [S,I,T] vs [I,T] the model concatenates all
    samples for each held-out individual, which are of shape [S,B=1,I=1,T,1]
    (held-out sample) -> [S,I,T] (required by vpc).
    """
    # 30 Monte-Carlo samples per held-out individual, on a shared time grid.
    samples_list = model.sample(databatch_list,use_unique_times=True,num_samples=30)
    # pair[0] = observations, pair[1] = times for each held-out individual.
    combined_observation = combine_samples([pair[0] for pair in samples_list])
    combined_times = combine_samples([pair[1] for pair in samples_list])
    print(combined_observation.shape)
    # All individuals share the same grid, so any row of times will do.
    simulation_times = combined_times[0,0,:]
    print(simulation_times.shape)
    patients, patients_time = extract_context_by_mask(empirical_databatch)
    img = vpc(simulation_times, combined_observation, patients, patients_time,train=train)
    return img
395
+
396
def vpc_from_empirical(databatch_list,databatch_list_context,model,train=False,image_name="vpc.png",samples_number=100,y_scale=None):
    """Build a VPC plot by sampling new individuals against empirical targets.

    Args:
        databatch_list: list of databatch tuples; element 0 of each tuple is
            expected to expose ``target_obs`` / ``target_obs_time``.
        databatch_list_context: list whose first entry provides the context
            observation grid/mask used as the simulation time axis.
        model: model exposing ``sample_new_individual(db_tuple, n)``.
        train: forwarded to ``vpc`` (save vs. show).
        image_name, samples_number, y_scale: forwarded plot/sampling options.
    """
    aicme = databatch_list_context[0]
    # Observed target trajectories and their times, one array per batch tuple.
    patients = [db_tuple[0].target_obs.cpu().detach().numpy() for db_tuple in databatch_list]
    patients_time = [db_tuple[0].target_obs_time.cpu().detach().numpy() for db_tuple in databatch_list]
    # Individual with the most valid context observations defines the grid.
    max_time_index = aicme.context_obs_mask.sum(axis=2).squeeze().argmax()
    all_samples_times = aicme.context_obs_time[0,max_time_index,aicme.context_obs_mask[0,max_time_index]]
    all_samples = []
    for db_tuple in databatch_list:
        samples,samples_time = model.sample_new_individual(db_tuple,samples_number)
        all_samples.append(samples)
    # Concatenate individuals along dim=2, then keep only that individual's
    # valid time points so samples align with all_samples_times.
    all_samples = torch.cat(all_samples,dim=2).squeeze()
    all_samples = all_samples[:,:,aicme.context_obs_mask[0,max_time_index]]
    vpc(all_samples_times, all_samples, patients, patients_time,train=train,image_name=image_name,y_scale=y_scale)
409
+
410
def vpc(test_time, MetaStudies, patients, patients_time, train=True, image_name="vpc.png", y_scale=None):
    """
    Generate a Visual Predictive Check (VPC) plot with PyTorch tensor inputs.

    Parameters:
    - test_time: 1D PyTorch tensor of fixed time points for simulated data (shape [T])
    - MetaStudies: 3D PyTorch tensor of simulated data (shape [M, P, T])
    - patients: List of 1D PyTorch tensors, each with observed concentrations
    - patients_time: List of 1D PyTorch tensors, each with corresponding times
    - train: If True, save plot to `image_name` and return its path;
      else show it and return None
    - image_name: File name to save the image if train=True
    - y_scale: Set to "log" for log-scale y-axis; None for linear

    Returns:
        `image_name` when train=True, otherwise None.
    """
    # Collapse singleton dims so test_time is strictly 1-D.
    if len(test_time.shape) > 1:
        test_time = test_time.squeeze()

    test_time_np = test_time.detach().cpu().numpy()
    MetaStudies_np = MetaStudies.detach().cpu().numpy()

    percentiles = [5, 25, 50, 75, 95]
    # Percentiles across the P axis, per study → [5, M, T];
    # then the median across studies (M) → [5, T].
    sim_percentiles = np.percentile(MetaStudies_np, percentiles, axis=1)  # [5, M, T]
    sim_percentiles_agg = np.percentile(sim_percentiles, 50, axis=1)  # [5, T]

    p05, p25, p50, p75, p95 = sim_percentiles_agg

    plt.figure(figsize=(10, 6))
    plt.fill_between(test_time_np, p05, p95, color='blue', alpha=0.2, label='5th-95th Percentile')
    plt.fill_between(test_time_np, p25, p75, color='blue', alpha=0.4, label='25th-75th Percentile')
    plt.plot(test_time_np, p50, color='blue', label='Median (50th Percentile)')

    # Overlay the observed patient data on top of the simulated bands.
    for obs, times in zip(patients, patients_time):
        plt.scatter(times, obs, color='red', alpha=0.6, s=20)

    plt.xlabel('Time (hours)')
    plt.ylabel('Concentration (g/L)')
    if y_scale == "log":
        plt.yscale('log')

    plt.title('Visual Predictive Check (VPC)')
    plt.legend()
    plt.grid(True, linestyle='--', alpha=0.7)

    if train:
        plt.savefig(image_name)
        plt.close()
        return image_name
    else:
        plt.show()
        plt.close()
460
+
461
def get_unique_target_times(
    db_list: List[AICMECompartmentsDataBatch]
) -> TensorType[1, 1, "U", 1]:
    """
    Given P databatches, each with
        .target_obs_time: [B, t_ind, num_obs_t, 1]
    returns a tensor of shape [1, 1, U, 1] containing the sorted unique
    times across *all* batches and *all* target time points.

    NOTE: padded entries are NOT filtered out here, so any padding time
    value (e.g. 0) will appear among the unique times.

    Args:
        db_list: list of length P of AICMECompartmentsDataBatch

    Returns:
        unique_times: Tensor of shape [1, 1, U, 1], where U is the number of
            unique target-observation times across every batch. The two
            leading singleton axes make the result broadcastable against
            [B, I, T, 1]-shaped time tensors.
    """
    # 1) Flatten each batch's times:
    #    db.target_obs_time.squeeze(-1).reshape(-1) has shape [B * t_ind * num_obs_t]
    flat_times = [
        db.target_obs_time.squeeze(-1).reshape(-1)  # [B * t_ind * num_obs_t]
        for db in db_list
    ]
    # 2) Concatenate all P batches → [(P * B * t_ind * num_obs_t)]
    all_times = torch.cat(flat_times, dim=0)

    # 3) Compute sorted unique values → [U]
    unique = torch.unique(all_times)

    # 4) Return as a broadcastable column vector → [1, 1, U, 1]
    return unique.unsqueeze(-1).unsqueeze(0).unsqueeze(0)  # TensorType[1,1,"U", 1]
sim_priors_pk/metrics/quantiles_coverage.py ADDED
@@ -0,0 +1,310 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from typing import Tuple
2
+
3
+ import torch
4
+ from torchtyping import TensorType
5
+
6
+
7
def compute_predictive_quantiles(
    pred_values: TensorType["B", "S", "T", 1],
    pred_mask: TensorType["B", "S", "T"] | TensorType["B", "T"],
    alpha: float,
) -> Tuple[
    TensorType["B", "T", 1],
    TensorType["B", "T", 1],
]:
    """
    Compute lower and upper predictive quantiles (α/2, 1−α/2)
    across stochastic samples, supporting both shared and
    per-sample (per-individual) masks.

    Parameters
    ----------
    pred_values : [B, S, T, 1]
        Predicted sample trajectories.
    pred_mask : [B, T] or [B, S, T]
        Boolean mask marking valid time points.
        - If [B, T]: same mask for all samples.
        - If [B, S, T]: individual-specific masks.
        NOTE(review): indexing below requires a bool dtype — an int mask
        would be interpreted as gather indices. Confirm at call sites.
    alpha : float
        Significance level (e.g. 0.05 for 90% interval).

    Returns
    -------
    q_low, q_high : [B, T, 1]
        Predictive lower and upper quantile envelopes; 0.0 at time points
        with no valid samples.
    """
    B, S, T, _ = pred_values.shape
    device = pred_values.device

    # --- normalize mask shape: broadcast a shared [B,T] mask over S ---
    if pred_mask.ndim == 2:
        pred_mask = pred_mask.unsqueeze(1).repeat(1, S, 1)  # [B,S,T]

    q_low_list = []
    q_high_list = []

    for b in range(B):
        q_low_b = torch.zeros(T, device=device)
        q_high_b = torch.zeros(T, device=device)

        # for each time index, only include valid samples
        for t_idx in range(T):
            valid_s = pred_mask[b, :, t_idx]
            if valid_s.any():
                vals = pred_values[b, valid_s, t_idx, 0]
                q_low_b[t_idx] = vals.quantile(alpha / 2)
                q_high_b[t_idx] = vals.quantile(1 - alpha / 2)
            else:
                # leave zeros (or NaN if preferred)
                q_low_b[t_idx] = 0.0
                q_high_b[t_idx] = 0.0

        q_low_list.append(q_low_b.unsqueeze(-1))
        q_high_list.append(q_high_b.unsqueeze(-1))

    q_low = torch.stack(q_low_list, dim=0)   # [B,T,1]
    q_high = torch.stack(q_high_list, dim=0)  # [B,T,1]

    return q_low, q_high
69
+
70
+
71
def interpolate_quantiles_to_obs_times(
    q_low: TensorType["B", "Tpred", 1],
    q_high: TensorType["B", "Tpred", 1],
    pred_times: TensorType["B", "Tpred", 1],
    pred_mask: TensorType["B", "Tpred"],
    real_times: TensorType["B", "I", "Treal", 1],
    real_mask: TensorType["B", "I", "Treal"],
) -> Tuple[
    TensorType["B", "I", "Treal", 1],
    TensorType["B", "I", "Treal", 1],
]:
    """
    Interpolate predictive quantile bands (q_low, q_high) to the irregular
    observation times of real data.

    Parameters
    ----------
    q_low, q_high : TensorType["B", "Tpred", 1]
        Predictive lower and upper quantile curves at distinct time grid points.
    pred_times : TensorType["B", "Tpred", 1]
        Time grid corresponding to the quantile curves.
        NOTE(review): `torch.searchsorted` below assumes the valid entries
        are sorted ascending — confirm upstream ordering.
    pred_mask : TensorType["B", "Tpred"]
        Boolean mask marking valid predictive times per batch.
    real_times : TensorType["B", "I", "Treal", 1]
        Observation times for each individual and batch.
    real_mask : TensorType["B", "I", "Treal"]
        Mask indicating valid observed time points (bool — `~` is applied).

    Returns
    -------
    q_low_interp, q_high_interp : Tuple[
        TensorType["B", "I", "Treal", 1],
        TensorType["B", "I", "Treal", 1],
    ]
        Interpolated quantile band values at each observed time, padded
        where invalid.

    Notes
    -----
    - Uses linear interpolation between nearest predictive time knots.
    - Out-of-range times are clamped to the nearest boundary quantile.
    - Invalid (masked) observations are returned as zeros.
    - Batches with fewer than two valid predictive points yield all zeros.
    """

    B, I, Treal, _ = real_times.shape
    device = real_times.device

    q_low_interp_list, q_high_interp_list = [], []

    for b in range(B):
        # Extract valid predictive points for this batch
        valid_mask_b = pred_mask[b]  # [Tpred]
        valid_T = valid_mask_b.sum().item()
        if valid_T < 2:
            # Degenerate case: not enough points for interpolation
            q_low_interp_list.append(torch.zeros(I, Treal, 1, device=device))
            q_high_interp_list.append(torch.zeros(I, Treal, 1, device=device))
            continue

        t_pred = pred_times[b, valid_mask_b, 0]  # [T_b]
        ql = q_low[b, valid_mask_b, 0]           # [T_b]
        qh = q_high[b, valid_mask_b, 0]          # [T_b]

        # For each individual, interpolate its observation times
        q_low_i, q_high_i = [], []
        for i in range(I):
            t_obs = real_times[b, i, :, 0]  # [Treal]
            valid_obs = real_mask[b, i]     # [Treal]

            # Clamp obs times into predictive range
            t_clamped = torch.clamp(t_obs, t_pred.min(), t_pred.max())

            # Use searchsorted to find bracketing indices
            idx_right = torch.searchsorted(t_pred, t_clamped)
            idx_left = (idx_right - 1).clamp(min=0)
            idx_right = idx_right.clamp(max=valid_T - 1)

            # Gather times and values for interpolation
            t_L, t_R = t_pred[idx_left], t_pred[idx_right]
            ql_L, ql_R = ql[idx_left], ql[idx_right]
            qh_L, qh_R = qh[idx_left], qh[idx_right]

            # Linear weights; denom clamp avoids division by zero when an
            # observation falls exactly on a knot (idx_left == idx_right).
            denom = (t_R - t_L).clamp(min=1e-8)
            w_R = (t_clamped - t_L) / denom
            w_L = 1.0 - w_R

            ql_interp = w_L * ql_L + w_R * ql_R
            qh_interp = w_L * qh_L + w_R * qh_R

            # Zero out invalid times
            ql_interp = ql_interp.masked_fill(~valid_obs, 0.0)
            qh_interp = qh_interp.masked_fill(~valid_obs, 0.0)

            q_low_i.append(ql_interp.unsqueeze(-1))
            q_high_i.append(qh_interp.unsqueeze(-1))

        q_low_interp_list.append(torch.stack(q_low_i, dim=0))   # [I, Treal, 1]
        q_high_interp_list.append(torch.stack(q_high_i, dim=0))  # [I, Treal, 1]

    q_low_interp = torch.stack(q_low_interp_list, dim=0)   # [B, I, Treal, 1]
    q_high_interp = torch.stack(q_high_interp_list, dim=0)  # [B, I, Treal, 1]

    return q_low_interp, q_high_interp
+
175
+
176
def compute_time_weighted_coverage(
    real_values: TensorType["B", "I", "Treal", 1],
    real_times: TensorType["B", "I", "Treal", 1],
    real_mask: TensorType["B", "I", "Treal"],
    q_low_interp: TensorType["B", "I", "Treal", 1],
    q_high_interp: TensorType["B", "I", "Treal", 1],
    reduce: bool = True,
) -> TensorType["B"]:
    """
    Compute time-weighted coverage fraction of observations within predictive bands.

    Each observation is weighted by the time interval Δt it spans (normalized
    per batch element), so densely sampled regions do not dominate the metric.

    Parameters
    ----------
    real_values : TensorType["B", "I", "Treal", 1]
        Observed values per batch element and individual.
    real_times : TensorType["B", "I", "Treal", 1]
        Observation times corresponding to ``real_values``.
    real_mask : TensorType["B", "I", "Treal"]
        Boolean mask of valid observation points.
    q_low_interp, q_high_interp : TensorType["B", "I", "Treal", 1]
        Lower/upper predictive quantiles interpolated to the observation times.
    reduce : bool, optional (default True)
        If True, return the Δt-weighted coverage summed per batch element,
        shape [B]. If False, return the per-point weighted contributions,
        shape [B, I, Treal]; their sum over (I, Treal) equals the reduced value.

    Returns
    -------
    TensorType["B"] or TensorType["B", "I", "Treal"]
        Coverage in [0, 1]; shape depends on ``reduce``.
    """
    # Point-wise inclusion test, restricted to valid observations. [B, I, Treal]
    covered = (real_values >= q_low_interp) & (real_values <= q_high_interp)
    covered = covered.squeeze(-1) & real_mask

    # Δt along the time axis; prepending the first time makes the first interval 0.
    dt = torch.diff(real_times, dim=2, prepend=real_times[:, :, :1])
    dt = dt.squeeze(-1) * real_mask  # zero out invalid points  [B, I, Treal]
    dt_sum = dt.sum(dim=(1, 2), keepdim=True).clamp(min=1e-8)  # avoid div-by-zero
    weights = dt / dt_sum  # normalized time weights per batch element

    contributions = covered.float() * weights  # [B, I, Treal]
    if not reduce:
        # FIX: `reduce` was previously accepted but ignored; expose the
        # unreduced per-point contributions for diagnostics.
        return contributions
    return contributions.sum(dim=(1, 2))  # [B]
199
+
200
+
201
def compute_interval_score(
    real_values: TensorType["B", "I", "Treal", 1],
    real_times: TensorType["B", "I", "Treal", 1],
    real_mask: TensorType["B", "I", "Treal"],
    q_low_interp: TensorType["B", "I", "Treal", 1],
    q_high_interp: TensorType["B", "I", "Treal", 1],
    alpha: float,
) -> TensorType["B"]:
    """
    Compute the time-weighted interval score (Gneiting & Raftery, 2007).

    The score rewards narrow predictive bands and penalizes each observation
    falling outside the band by 2/alpha times its miss distance. Observations
    are weighted by the time interval Δt they span, normalized per batch element.
    """
    band_width = (q_high_interp - q_low_interp).abs()
    undershoot = (q_low_interp - real_values).clamp(min=0)  # observation below band
    overshoot = (real_values - q_high_interp).clamp(min=0)  # observation above band

    pointwise = band_width + (2 / alpha) * (undershoot + overshoot)
    pointwise = pointwise.squeeze(-1) * real_mask  # [B, I, Treal]

    # Δt weights normalized over individuals and time, per batch element.
    dt = torch.diff(real_times, dim=2, prepend=real_times[:, :, :1]).squeeze(-1)
    dt = dt * real_mask
    normalizer = dt.sum(dim=(1, 2), keepdim=True).clamp(min=1e-8)
    weights = dt / normalizer

    # Weighted mean per batch element.
    return (pointwise * weights).sum(dim=(1, 2))
228
+
229
+
230
def compute_percentile_coverage(
    pred_values,
    pred_times,
    pred_mask,
    real_values,
    real_times,
    real_mask,
    alpha: float = 0.05,
):
    """
    Compute predictive interval coverage and interval score between predicted
    and observed trajectories.

    Evaluates how well a stochastic predictive model captures the true (real)
    observations within its predictive uncertainty bands, by chaining:

    1. :func:`compute_predictive_quantiles` — lower/upper predictive quantiles.
    2. :func:`interpolate_quantiles_to_obs_times` — align quantiles to the
       observation times.
    3. :func:`compute_time_weighted_coverage` / :func:`compute_interval_score` —
       Δt-weighted coverage fraction and proper scoring rule.

    Parameters
    ----------
    pred_values : TensorType["B", "S", "T_pred", 1]
        Stochastic predictions per batch element `B` and sample `S`,
        typically obtained by sampling the model multiple times.
    pred_times : TensorType["B", "T_pred", 1]
        Prediction time grid per batch (shared across stochastic samples).
    pred_mask : TensorType["B", "T_pred"]
        Boolean mask of valid prediction time steps.
    real_values : TensorType["B", "I", "T_real", 1]
        Observed values per batch element and individual.
    real_times : TensorType["B", "I", "T_real", 1]
        Observation times corresponding to `real_values`.
    real_mask : TensorType["B", "I", "T_real"]
        Boolean mask of valid observed time points.
    alpha : float, optional (default = 0.05)
        Significance level of the central predictive interval, e.g.
        α = 0.05 → 90% interval (quantiles 0.025/0.975),
        α = 0.10 → 80% interval (quantiles 0.05/0.95).
        Smaller α yields wider, more conservative intervals.

    Returns
    -------
    dict[str, TensorType["B"]]
        ``"coverage"`` — Δt-weighted fraction of observations inside the band
        (≈ 1−α for a well-calibrated model);
        ``"interval_score"`` — proper interval score (lower is sharper and
        better calibrated).

    References
    ----------
    Gneiting, T. & Raftery, A. E. (2007). *Strictly Proper Scoring Rules,
    Prediction, and Estimation*. JASA, 102(477), 359-378.
    """
    lower, upper = compute_predictive_quantiles(pred_values, pred_mask, alpha)
    lower_at_obs, upper_at_obs = interpolate_quantiles_to_obs_times(
        lower, upper, pred_times, pred_mask, real_times, real_mask
    )

    return {
        "coverage": compute_time_weighted_coverage(
            real_values, real_times, real_mask, lower_at_obs, upper_at_obs
        ),
        "interval_score": compute_interval_score(
            real_values, real_times, real_mask, lower_at_obs, upper_at_obs, alpha
        ),
    }
sim_priors_pk/metrics/sampling_quality.py ADDED
@@ -0,0 +1,409 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Evaluate sampling quality of a model based on Visual Predictive Checks or Normalized Prediction Distribution Errors (NPDEs).
2
+ # Input for both evaluations: a StudyJSON object containing the observed data and a List[StudyJSON] containing replicates of simulated data from the model.
3
+ # This way, both neural networks and NLME models can be evaluated using the same code, as long as they can produce the required StudyJSON objects.
4
+
5
+ from typing import List, Optional, Sequence
6
+
7
+ import matplotlib.pyplot as plt
8
+ import numpy as np
9
+ import pandas as pd
10
+ from scipy.stats import chi2, norm, shapiro, ttest_1samp
11
+
12
+ from sim_priors_pk.data.data_empirical.json_schema import IndividualJSON, StudyJSON
13
+
14
+
15
def json_to_dataframe(study_json: StudyJSON) -> pd.DataFrame:
    """
    Flatten a StudyJSON object into a long-format pandas DataFrame.

    Both the "context" and "target" sections are traversed; each individual's
    observation series becomes a set of rows.

    Args:
        study_json (StudyJSON): The StudyJSON object to convert.

    Returns:
        pd.DataFrame: Columns ["Type", "ID", "Time", "Value"]; an empty frame
        with the same columns when the study contains no individuals.
    """
    pieces = []

    for section in ["context", "target"]:
        individuals = study_json.get(section, [])

        for idx, individual in enumerate(individuals):
            # Identifier preference: explicit name_id, then _id, then a
            # deterministic positional fallback.
            ident = individual.get("name_id") or individual.get("_id") or f"{section}_{idx}"

            pieces.append(
                pd.DataFrame(
                    {
                        "Type": section,
                        "ID": str(ident),  # ensure it's a string
                        "Time": individual["observation_times"],
                        "Value": individual["observations"],
                    }
                )
            )

    if not pieces:
        return pd.DataFrame(columns=["Type", "ID", "Time", "Value"])
    return pd.concat(pieces, ignore_index=True)
48
+
49
+
50
def json_list_to_dataframe(study_list: List[StudyJSON]) -> pd.DataFrame:
    """
    Flatten several replicate StudyJSON objects into one long DataFrame.

    Args:
        study_list (List[StudyJSON]): Replicate studies to convert.

    Returns:
        pd.DataFrame: Columns ["Type", "ID", "Time", "Value", "Replicate"],
        where "Replicate" is the index of the study within ``study_list``;
        an empty DataFrame when the list is empty.
    """
    tagged = []

    for rep, study in enumerate(study_list):
        frame = json_to_dataframe(study)
        frame["Replicate"] = rep
        tagged.append(frame)

    if not tagged:
        return pd.DataFrame()
    return pd.concat(tagged, ignore_index=True)
69
+
70
+
71
def validate_npde_vpc_inputs(
    data: pd.DataFrame, simulations: pd.DataFrame, differentTimesError: bool = True
) -> None:
    """
    Validate the inputs for NPDE / VPC calculation.

    Args:
        data (pd.DataFrame): Observed data with columns ["Type", "ID", "Time", "Value"].
        simulations (pd.DataFrame): Simulated data with columns
            ["Type", "ID", "Time", "Value", "Replicate"] — the single long-format
            frame produced by :func:`json_list_to_dataframe` (all replicates
            stacked), not a list of frames.
        differentTimesError (bool): Whether to raise if observation times differ
            between individuals (default: True).

    Returns:
        None: If the inputs are valid.

    Raises:
        ValueError: If the (Type, ID, Time) key structure of observations and
            simulations differs, or (when ``differentTimesError`` is True)
            individuals were observed at different time points.
    """
    key_cols = ["Type", "ID", "Time"]

    # The simulations must cover exactly the observed (Type, ID, Time) keys;
    # replicates collapse onto the same keys via drop_duplicates.
    obs_keys = data[key_cols].drop_duplicates().sort_values(key_cols).reset_index(drop=True)
    pred_keys = simulations[key_cols].drop_duplicates().sort_values(key_cols).reset_index(drop=True)

    if not obs_keys.equals(pred_keys):
        raise ValueError("Observations and predictions are not structurally identical.")

    if differentTimesError:
        # More than one distinct (sorted) time vector across individuals means
        # the time grids disagree. ("> 1" instead of "!= 1" so an empty frame,
        # where nunique() == 0, does not raise spuriously.)
        if (data.groupby("ID")["Time"].apply(lambda x: tuple(sorted(x))).nunique()) > 1:
            raise ValueError("Observation times differ between individuals.")

    return None
101
+
102
+
103
def compute_npde_data(data: StudyJSON, simulations: List[StudyJSON]) -> np.ndarray:
    """
    Compute Normalized Prediction Distribution Errors (NPDEs).

    For every observation, the empirical CDF of the simulated replicate values
    at the same (Type, ID, Time) key is evaluated at the observed value,
    truncated away from 0 and 1, then mapped through the standard-normal
    inverse CDF.

    Args:
        data (StudyJSON): The observed data in StudyJSON
        simulations (List[StudyJSON]): A list of StudyJSON objects representing
            simulated data from the model.

    Returns:
        np.ndarray: An array of NPDE values, one per observation.
    """
    # Flatten observed and simulated studies into long-format frames and
    # verify they share the same key structure.
    obs_df = json_to_dataframe(data)
    sim_df = json_list_to_dataframe(simulations)
    validate_npde_vpc_inputs(obs_df, sim_df, differentTimesError=False)

    key_cols = ["Type", "ID", "Time"]

    # One column per replicate, aligned to the observations by key.
    sims_wide = sim_df.pivot(index=key_cols, columns="Replicate", values="Value")
    obs_indexed = obs_df.set_index(key_cols)
    combined = sims_wide.join(obs_indexed["Value"].rename("Observed"))

    replicate_cols = sims_wide.columns
    sim_matrix = combined[replicate_cols].values  # [n_obs, n_replicates]
    obs_vector = combined["Observed"].values  # [n_obs]

    # Truncated empirical CDF of each observation under the simulated
    # distribution (shifted by half a step so it never hits 0 or 1),
    # then the normal inverse CDF gives the NPDE.
    n_rep = len(replicate_cols)
    pde = (sim_matrix <= obs_vector[:, None]).sum(axis=1) / (n_rep + 1) + 0.5 / (n_rep + 1)
    return norm.ppf(pde)
142
+
143
+
144
def npde_plot(npde_values: np.ndarray) -> None:
    """
    Draw a quantile-quantile plot of NPDE values against the standard normal.

    Args:
        npde_values (np.ndarray): An array of NPDE values to plot.

    Returns:
        None
    """
    plt.figure(figsize=(6, 6))
    plt.title("Q-Q Plot of NPDE Values")
    plt.xlabel("Theoretical Quantiles")
    plt.ylabel("Empirical Quantiles")
    # Empirical quantiles are simply the sorted NPDEs; theoretical quantiles
    # use the (i + 1) / (n + 1) plotting positions of N(0, 1).
    empirical = np.sort(npde_values)
    count = len(npde_values)
    theoretical = norm.ppf((np.arange(count) + 1) / (count + 1))
    plt.plot(theoretical, empirical, marker="o", linestyle="")
    # Reference line: perfect agreement with the standard normal.
    plt.plot(theoretical, theoretical, color="red", linestyle="--")
    plt.grid()
    plt.show()
164
+
165
+
166
def npde_pvalues(npde_values: np.ndarray) -> dict:
    """
    Calculate p-values based on the theoretical N(0,1) distribution of NPDE values.

    Args:
        npde_values (np.ndarray): An array of NPDE values to summarize.

    Returns:
        dict: A dictionary containing p-values for different tests applied to the NPDE values:
            - "mean": The (one-sample) t-test for zero mean of the NPDE values.
            - "variance": The (one-sample) chi-squared test for unit variance of the NPDE values.
            - "normality": The Shapiro-Wilk test for normality of the NPDE values.
    """

    # variance test not implemented in scipy, so we calculate the p-value manually based on
    # the chi-squared distribution of the sample variance under the null hypothesis of unit
    # variance: (n - 1) * s^2 ~ chi2(n - 1).
    n = len(npde_values)
    sample_var = np.var(npde_values, ddof=1)
    chi2_stat = (n - 1) * sample_var
    p_lower = chi2.cdf(chi2_stat, df=n - 1)
    # Use the survival function rather than 1 - cdf: sf keeps full floating-point
    # precision in the upper tail, where 1 - cdf catastrophically cancels.
    p_upper = chi2.sf(chi2_stat, df=n - 1)
    p_var = 2 * min(p_lower, p_upper)  # two-sided p-value, always <= 1

    return {
        "mean": ttest_1samp(npde_values, 0).pvalue,  # type: ignore
        "variance": p_var,
        "normality": shapiro(npde_values).pvalue,
    }
194
+
195
+
196
+ def compute_vpc_data(
197
+ data: StudyJSON,
198
+ simulations: Sequence[StudyJSON],
199
+ quantiles: List[float] = [0.05, 0.5, 0.95],
200
+ confidence: float = 0.9,
201
+ n_bins: Optional[int] = None,
202
+ binning: str = "equal_count", # "equal_count" or "equal_width"
203
+ ) -> pd.DataFrame:
204
+ """
205
+ Compute data for a Visual Predictive Check (VPC) plot for the given StudyJSON and a list of simulated StudyJSONs.
206
+
207
+ Args:
208
+ data (StudyJSON): The observed data in StudyJSON
209
+ simulations (List[StudyJSON]): A list of simulated data in StudyJSON format.
210
+ quantiles (List[float]): The quantiles to display in the VPC plot (default: [0.05, 0.5, 0.95]).
211
+ confidence (float): The confidence level for the prediction intervals (default: 0.9).
212
+ Returns:
213
+ pd.DataFrame: A DataFrame containing the VPC data.
214
+ """
215
+
216
+ observed_values = json_to_dataframe(data)
217
+ predicted_values = json_list_to_dataframe(simulations)
218
+
219
+ alpha_low = (1 - confidence) / 2
220
+ alpha_high = 1 - alpha_low
221
+
222
+ # --------------------------------
223
+ # Binning (if requested OR if needed)
224
+ # --------------------------------
225
+ if n_bins is not None:
226
+ validate_npde_vpc_inputs(observed_values, predicted_values, differentTimesError=False)
227
+
228
+ match binning:
229
+ case "equal_count":
230
+ observed_values["TimeBin"] = pd.qcut(
231
+ observed_values["Time"], q=n_bins, duplicates="drop"
232
+ )
233
+
234
+ # Use same bin edges for predicted
235
+ bins = observed_values["TimeBin"].cat.categories
236
+ predicted_values["TimeBin"] = pd.cut(predicted_values["Time"], bins=bins)
237
+
238
+ case "equal_width":
239
+ tmin = observed_values["Time"].min()
240
+ tmax = observed_values["Time"].max()
241
+ bins = np.linspace(tmin, tmax, n_bins + 1)
242
+
243
+ observed_values["TimeBin"] = pd.cut(
244
+ observed_values["Time"], bins=bins, include_lowest=True
245
+ )
246
+ predicted_values["TimeBin"] = pd.cut(
247
+ predicted_values["Time"], bins=bins, include_lowest=True
248
+ )
249
+
250
+ case _:
251
+ raise ValueError("binning must be 'equal_width' or 'equal_count'")
252
+
253
+ # Use bin midpoint for plotting
254
+ bin_midpoints = (
255
+ observed_values.groupby("TimeBin", observed=False)["Time"].mean().rename("Time")
256
+ )
257
+
258
+ # Replace Time with bin midpoint
259
+ observed_values["Time"] = observed_values["TimeBin"].map(bin_midpoints)
260
+ predicted_values["Time"] = predicted_values["TimeBin"].map(bin_midpoints)
261
+
262
+ # Drop bin column
263
+ observed_values = observed_values.drop(columns="TimeBin")
264
+ predicted_values = predicted_values.drop(columns="TimeBin")
265
+
266
+ else:
267
+ validate_npde_vpc_inputs(observed_values, predicted_values, differentTimesError=True) # type: ignore
268
+
269
+ # --------------------------------
270
+ # Quantile calculation
271
+ # --------------------------------
272
+ vpc_obs = (
273
+ observed_values.groupby("Time")["Value"]
274
+ .quantile(quantiles) # type: ignore
275
+ .rename("Obs")
276
+ .reset_index()
277
+ .rename(columns={"level_1": "Quantile"})
278
+ )
279
+
280
+ vpc_pred = (
281
+ predicted_values.groupby(["Time", "Replicate"])["Value"]
282
+ .quantile(quantiles) # type: ignore
283
+ .rename("SimQuantile")
284
+ .reset_index()
285
+ .rename(columns={"level_2": "Quantile"})
286
+ .groupby(["Time", "Quantile"])["SimQuantile"]
287
+ .quantile([alpha_low, alpha_high]) # type: ignore
288
+ .rename("VPC")
289
+ .reset_index()
290
+ .rename(columns={"level_2": "PI"})
291
+ .pivot(index=["Time", "Quantile"], columns="PI", values="VPC")
292
+ .reset_index()
293
+ .rename(columns={alpha_low: "LowerPred", alpha_high: "UpperPred"})
294
+ )
295
+
296
+ vpc_data = vpc_obs.merge(vpc_pred, on=["Time", "Quantile"], how="left")
297
+
298
+ return vpc_data
299
+
300
+
301
def vpc_plot(vpc_data: pd.DataFrame, ax=None, log_y: bool = False):
    """
    Create a Visual Predictive Check (VPC) plot for the given VPC data.

    Args:
        vpc_data (pd.DataFrame): VPC data with columns
            ["Time", "Quantile", "Obs", "LowerPred", "UpperPred"]
            (as produced by :func:`compute_vpc_data`), containing exactly
            three distinct quantiles.
        ax: Optional matplotlib axis to plot on. If None, a new figure and axis will be created.
        log_y: Whether to use a logarithmic scale for the y-axis (default: False).

    Returns:
        The matplotlib axis the VPC was drawn on.

    Raises:
        ValueError: If the data does not contain exactly 3 quantiles, or a log
            scale is requested while non-positive values are present.
    """

    quantiles = np.sort(vpc_data["Quantile"].unique())

    # Enforce exactly 3 quantiles (lower band, median, upper band)
    if len(quantiles) != 3:
        raise ValueError(f"Expected exactly 3 quantiles, got {len(quantiles)}: {quantiles}")

    # Default axis management
    if ax is None:
        _, ax = plt.subplots(figsize=(10, 6))

    # Log-scale option
    if log_y:
        # Safety check: log scale requires strictly positive values
        y_cols = ["Obs", "LowerPred", "UpperPred"]
        if (vpc_data[y_cols] <= 0).any().any():
            raise ValueError("Log scale requested but non-positive values detected.")
        ax.set_yscale("log")

    # Color scheme: lower, median, upper
    colors = ["tab:blue", "tab:orange", "tab:blue"]

    # Map sorted quantiles to colors
    q_to_color = dict(zip(quantiles, colors))

    # Plot observed quantiles and the matching simulated prediction bands.
    # (FIX: a second, unsorted read of the quantiles used to overwrite the
    # sorted array here, making the plotting/legend order nondeterministic;
    # iterate the sorted quantiles instead.)
    for q in quantiles:
        subset = vpc_data[vpc_data["Quantile"] == q]

        color = q_to_color[q]
        is_median = np.isclose(q, 0.5)

        ax.plot(
            subset["Time"],
            subset["Obs"],
            marker="o",
            color=color,
            linewidth=2 if is_median else 1,
            label=f"Observed {q:.0%}",
        )

        ax.fill_between(
            subset["Time"],
            subset["LowerPred"],
            subset["UpperPred"],
            color=color,
            alpha=0.25,
            label=f"Simulated {q:.0%} PI",
        )

    # Keep legend outside the plotting area to avoid occluding trajectories.
    ax.legend(
        loc="upper center",
        bbox_to_anchor=(0.5, -0.18),
        ncol=3,
        frameon=False,
    )
    ax.figure.subplots_adjust(bottom=0.25)
    return ax
373
+
374
+
375
if __name__ == "__main__":
    # Example usage: two individuals observed at shared time points, with two
    # simulated replicates of the same study.
    observed_data = StudyJSON(
        context=[
            IndividualJSON(name_id="1", observation_times=[0, 1, 2], observations=[10, 20, 30]),
            IndividualJSON(name_id="2", observation_times=[0, 1, 2], observations=[11, 21, 31]),
        ]
    )

    simulated_data = [
        StudyJSON(
            context=[
                IndividualJSON(name_id="1", observation_times=[0, 1, 2], observations=[12, 22, 32]),
                IndividualJSON(name_id="2", observation_times=[0, 1, 2], observations=[13, 21, 30]),
            ]
        ),
        StudyJSON(
            context=[
                IndividualJSON(name_id="1", observation_times=[0, 1, 2], observations=[8, 18, 28]),
                IndividualJSON(name_id="2", observation_times=[0, 1, 2], observations=[11, 19, 27]),
            ]
        ),
    ]
    # Convert to dataframes for visualization (optional)
    observed_values = json_to_dataframe(observed_data)
    simulated_values = json_list_to_dataframe(simulated_data)

    # FIX: the original called nonexistent helpers (validate_npde_inputs,
    # calculate_npde, create_vpc_data) and would crash with NameError; use the
    # functions actually defined in this module.
    validate_npde_vpc_inputs(observed_values, simulated_values)
    npde_results = compute_npde_data(observed_data, simulated_data)

    print("NPDE Results:", npde_results)

    vpc_data = compute_vpc_data(observed_data, simulated_data)

    vpc_plot(vpc_data)
    # vpc_plot only draws onto an axis; show the figure when run as a script.
    plt.show()