---
title: README
emoji: 🐨
colorFrom: pink
colorTo: purple
sdk: static
pinned: false
---

# EPI-Eval

A curated collection of large epidemiological datasets, normalized to a single
schema so they can be searched, joined, and benchmarked against each other.

## What we track

Time-series surveillance data on infectious disease, primarily respiratory
viruses (flu, COVID-19, RSV) and arboviral disease (dengue, Zika,
chikungunya), with smaller coverage of notifiable, mortality, wastewater, and
behavioural / search signals. Sources come from CDC, WHO, ECDC, PAHO, OWID,
and national public-health agencies; we re-publish them as Parquet with a
consistent set of row-level columns (`date`, `location_id`, `location_level`,
optional `condition` / `case_status` / `as_of`) and a metadata header
describing pathogens, geography, cadence, and per-column units.
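As a rough illustration of the row-level columns listed above, the sketch below checks a single row for the three required fields. It is not the pipeline's validator: the helper name `missing_required`, the `value` measurement column, and the example values (including the `US-CA` location code and the `state` level) are assumptions made for this example.

```python
from datetime import date

# Row-level columns named in the text above.
REQUIRED = ("date", "location_id", "location_level")
OPTIONAL = ("condition", "case_status", "as_of")


def missing_required(row: dict) -> list[str]:
    """Return the required columns that are absent or None in one row."""
    return [col for col in REQUIRED if row.get(col) is None]


# Hypothetical row: every value here, and the `value` column, is invented.
row = {
    "date": date(2024, 1, 6),
    "location_id": "US-CA",
    "location_level": "state",
    "condition": "influenza",
    "value": 12.5,
}

print(missing_required(row))                    # []
print(missing_required({"date": row["date"]}))  # ['location_id', 'location_level']
```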

## Why

Forecasting and modeling work routinely stalls on data plumbing: finding the
canonical version of a series, normalizing geography codes, reconciling
reporting cadences, tracking when a source was last revised. The goal of this
org is to do that work once, in the open. 

## Schema

Every dataset card on this org uses the same frontmatter format
([schema v0.1](https://github.com/ChrisHarig/apart-forecasting-tool/blob/main/upload_pipeline/schema/schema_v0.1.md)),
validated against a controlled vocabulary
([`vocabularies.yaml`](https://github.com/ChrisHarig/apart-forecasting-tool/blob/main/upload_pipeline/schema/vocabularies.yaml)).
Curated metadata (pathogens, license, units) lives alongside computed metadata
(time coverage, row count, observed cadence) generated at ingest.

## Contributing a dataset

The ingest pipeline is in
[`apart-forecasting-tool/upload_pipeline`](https://github.com/ChrisHarig/apart-forecasting-tool/tree/main/upload_pipeline).
A new dataset is one `ingest.py` + `card.yaml` under
`upload_pipeline/sources/<source_id>/`; the validator confirms schema fit
before upload. Each new truth dataset auto-creates an empty
`<id>-predictions` companion at upload time.

## Datasets (21)

### Respiratory

| Dataset | Pathogens | Geography | Cadence |
| --- | --- | --- | --- |
| [CDC FluSurv-NET: weekly flu hospitalisation rates](https://huggingface.co/datasets/EPI-Eval/delphi-flusurv) | influenza | US | weekly |
| [CDC NHSN Hospital Respiratory Data (HRD)](https://huggingface.co/datasets/EPI-Eval/nhsn-hrd) | influenza, sars-cov-2, rsv | US | weekly |
| [CDC NREVSS: weekly RSV test specimens and positives](https://huggingface.co/datasets/EPI-Eval/cdc-nrevss-rsv) | rsv | US | weekly |
| [COVID Tracking Project: US states daily (archived)](https://huggingface.co/datasets/EPI-Eval/covid-tracking-project) | sars-cov-2 | US | daily |
| [COVID-19 Forecast Hub: hospital admissions target](https://huggingface.co/datasets/EPI-Eval/covid19-forecast-hub) | sars-cov-2 | US | weekly |
| [ECDC ERVISS: ILI/ARI primary-care consultation rates](https://huggingface.co/datasets/EPI-Eval/ecdc-erviss) | influenza, sars-cov-2, rsv | multiple (30 countries) | weekly |
| [Flu MetroCast Hub: sub-state flu hosp forecast target](https://huggingface.co/datasets/EPI-Eval/flu-metrocast-hub) | influenza | US | weekly |
| [FluSight Forecast Hub: flu hospital admission target](https://huggingface.co/datasets/EPI-Eval/flusight-forecast-hub) | influenza | US | weekly |
| [JHU CSSE COVID-19: global daily (archived)](https://huggingface.co/datasets/EPI-Eval/jhu-csse-covid) | sars-cov-2 | multiple | daily |
| [NYT COVID-19: US county daily](https://huggingface.co/datasets/EPI-Eval/nyt-covid) | sars-cov-2 | US | daily |
| [OWID COVID-19: global daily compiled](https://huggingface.co/datasets/EPI-Eval/owid-covid) | sars-cov-2 | multiple | daily |
| [PHAC Respiratory Virus Detection Surveillance: Canada weekly](https://huggingface.co/datasets/EPI-Eval/canada-fluwatch) | influenza, influenza-a, influenza-b +7 | CA | weekly |
| [RSV Forecast Hub: RSV hospital admissions target](https://huggingface.co/datasets/EPI-Eval/rsv-forecast-hub) | rsv | US | weekly |
| [UKHSA Dashboard: England COVID-19 daily metrics](https://huggingface.co/datasets/EPI-Eval/ukhsa-covid-daily) | sars-cov-2 | GB | daily |
| [UKHSA Dashboard: England flu / COVID-19 / RSV weekly](https://huggingface.co/datasets/EPI-Eval/ukhsa-respiratory) | influenza, sars-cov-2, rsv | GB | weekly |

### Syndromic / ED

| Dataset | Pathogens | Geography | Cadence |
| --- | --- | --- | --- |
| [CDC NSSP / ESSENCE: ED visits for ILI / COVID / RSV](https://huggingface.co/datasets/EPI-Eval/cdc-nssp) | influenza, sars-cov-2, rsv | US | weekly |

### Arboviral

| Dataset | Pathogens | Geography | Cadence |
| --- | --- | --- | --- |
| [OpenDengue: national dengue case counts (V1.3)](https://huggingface.co/datasets/EPI-Eval/opendengue) | dengue | multiple | irregular |

### Mobility & contact

| Dataset | Pathogens | Geography | Cadence |
| --- | --- | --- | --- |
| [Google Community Mobility Reports: global daily](https://huggingface.co/datasets/EPI-Eval/global-mobility) | n/a | multiple | daily |

### Search & behavioural

| Dataset | Pathogens | Geography | Cadence |
| --- | --- | --- | --- |
| [Wikipedia pageviews: disease-article daily views](https://huggingface.co/datasets/EPI-Eval/wikipedia-pageviews) | influenza, sars-cov-2, rsv +6 | multiple | daily |

### Notifiable / other

| Dataset | Pathogens | Geography | Cadence |
| --- | --- | --- | --- |
| [OWID Mpox: global daily compiled](https://huggingface.co/datasets/EPI-Eval/owid-mpox) | mpox | multiple | daily |
| [WHO Global TB: annual country estimates](https://huggingface.co/datasets/EPI-Eval/who-tb-burden) | tuberculosis | multiple | annual |

## Predictions

Each truth dataset has a companion `EPI-Eval/<id>-predictions` repo that
accumulates community-submitted forecasts. Schema is long-format: one row per
`(target_date, [dim values…], quantile, value)`, with `quantile = NULL`
reserved for the point estimate. Forecasters submit through the
[EPI-Eval dashboard](https://github.com/ChrisHarig/apart-forecasting-tool);
a maintainer reviews each PR before merging, and merged predictions show up
on the corresponding truth dataset's *Show predictions* toggle in the
dashboard, with a per-submitter leaderboard (MAE / WIS / rWIS / coverage).
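To make the long format concrete, here is a minimal sketch of how such rows could be scored. It assumes rows are plain dictionaries with no extra dimension columns, uses `None` for the `NULL` quantile, and computes a plain MAE on the point estimates plus the empirical coverage of one central interval. The data, the function names, and the choice of the 0.25 to 0.75 interval are invented for illustration; the dashboard's own scoring code may differ.

```python
# Hypothetical long-format forecast rows for two target dates.
# quantile=None marks the point estimate, as described above.
rows = [
    {"target_date": "2024-01-06", "quantile": None, "value": 110.0},
    {"target_date": "2024-01-06", "quantile": 0.25, "value": 90.0},
    {"target_date": "2024-01-06", "quantile": 0.75, "value": 130.0},
    {"target_date": "2024-01-13", "quantile": None, "value": 95.0},
    {"target_date": "2024-01-13", "quantile": 0.25, "value": 80.0},
    {"target_date": "2024-01-13", "quantile": 0.75, "value": 100.0},
]

# Hypothetical observed values keyed by target date.
truth = {"2024-01-06": 120.0, "2024-01-13": 105.0}


def mean_absolute_error(rows, truth):
    """MAE over the point-estimate rows (quantile is None)."""
    errors = [
        abs(r["value"] - truth[r["target_date"]])
        for r in rows
        if r["quantile"] is None
    ]
    return sum(errors) / len(errors)


def interval_coverage(rows, truth, lower_q, upper_q):
    """Share of target dates whose truth lies within [lower_q, upper_q]."""
    bounds = {}
    for r in rows:
        if r["quantile"] in (lower_q, upper_q):
            bounds.setdefault(r["target_date"], {})[r["quantile"]] = r["value"]
    hits = [b[lower_q] <= truth[d] <= b[upper_q] for d, b in bounds.items()]
    return sum(hits) / len(hits)


print(mean_absolute_error(rows, truth))            # 10.0
print(interval_coverage(rows, truth, 0.25, 0.75))  # 0.5
```

Keeping the point estimate in the same table as the quantiles means one filter on `quantile` separates the two kinds of row, so both scores can be computed from a single pass over the data.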

## Status

Active. Coverage and dataset list grow through PRs to the upload pipeline.