ICU datasets — data • ICU data with R

The Laboratory for Computational Physiology (LCP) at MIT hosts several large-scale databases of hospital intensive care unit (ICU) data. Two of these (MIMIC-III and eICU) can be downloaded either in full or as demo subsets (MIMIC-III demo and eICU demo), while a third (HiRID) is available only in full. The demo datasets are freely available, whereas full downloads require credentialed access, which can be obtained by applying for an account with PhysioNet. Even though registration is required, the described datasets are all publicly available. With AmsterdamUMCdb, a data source hosted outside PhysioNet is available as well. As with the PhysioNet datasets, access is public but has to be granted by the data collectors.

data

Format

The exported data environment contains all datasets that have been made available to ricu. For datasets that are attached during package loading (see attach_src()), shortcuts to the datasets are set up in the package namespace, allowing the object ricu::data::mimic_demo to be accessed as ricu::mimic_demo (or, in case the package namespace has been attached, simply as mimic_demo). Datasets that are made available after the package namespace has been sealed will by default have their proxy object located in .GlobalEnv. Datasets are represented by src_env objects, while individual tables are src_tbl objects. These do not represent in-memory data, but rather data stored on disk, subsets of which can be loaded into memory.

Details

Setting up a dataset for use with ricu requires a configuration object. For the included datasets, configuration can be loaded from

system.file("extdata", "config", "data-sources.json", package = "ricu")

by calling load_src_cfg(). For datasets that are external to ricu, additional configuration can be made available by setting the environment variable RICU_CONFIG_PATH (for more information, refer to load_src_cfg()). Using the dataset configuration object, data can be downloaded (download_src()), imported (import_src()) and attached (attach_src()). While downloading and importing are one-time procedures, attaching a dataset is repeated every time the package is loaded. Briefly, downloading fetches the raw dataset from the internet (most likely in .csv format), importing consists of some preprocessing to make the data available more efficiently (by converting it to .fst format) and attaching sets up the data for use by the package. For more information on the individual steps, refer to the respective documentation pages.
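As a rough sketch of the steps described above, the following locates the bundled configuration file, registers an additional configuration directory and wraps the three setup steps in a helper. The directory name ricu-config and the helper setup_src() are hypothetical names chosen for illustration. It is also an assumption that download_src(), import_src() and attach_src() accept a source name passed as a string; their documentation pages are the authority on this. The helper is only defined, not called, since downloading a full dataset requires PhysioNet credentials.

```r
# Path to the configuration shipped with ricu; system.file() returns ""
# if the package is not installed
cfg_file <- system.file("extdata", "config", "data-sources.json",
                        package = "ricu")

# Hypothetical directory holding configuration for an external dataset
extra_cfg_dir <- file.path(tempdir(), "ricu-config")
dir.create(extra_cfg_dir, showWarnings = FALSE)

# Point ricu to the additional configuration directory
Sys.setenv(RICU_CONFIG_PATH = extra_cfg_dir)

# Hypothetical helper chaining the three steps for a named data source
setup_src <- function(src) {
  ricu::download_src(src)  # one-time: fetch the raw data (typically .csv)
  ricu::import_src(src)    # one-time: convert the raw data to .fst
  ricu::attach_src(src)    # repeated whenever the package is loaded
  invisible(src)
}
```

Under these assumptions, a call such as setup_src("mimic_demo") would run all three steps for the MIMIC-III demo dataset, provided an internet connection is available.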

A dataset that has been successfully made available can be explored interactively by typing its name into the console, and individual tables can be inspected using the $ operator. For example, for the MIMIC-III demo dataset and the icustays table, this gives

mimic_demo
#> <mimic_demo_env[25]>
#>         admissions            callout         caregivers        chartevents 
#>         [129 x 19]          [77 x 24]        [7,567 x 4]     [758,355 x 15] 
#>          cptevents              d_cpt    d_icd_diagnoses   d_icd_procedures 
#>       [1,579 x 12]          [134 x 9]       [14,567 x 4]        [3,882 x 4] 
#>            d_items         d_labitems     datetimeevents      diagnoses_icd 
#>      [12,487 x 10]          [753 x 6]      [15,551 x 14]        [1,761 x 5] 
#>           drgcodes           icustays     inputevents_cv     inputevents_mv 
#>          [297 x 8]         [136 x 12]      [34,799 x 22]      [13,224 x 31] 
#>          labevents microbiologyevents       outputevents           patients 
#>       [76,074 x 9]       [2,003 x 16]      [11,320 x 13]          [100 x 8] 
#>      prescriptions procedureevents_mv     procedures_icd           services 
#>      [10,398 x 19]         [753 x 25]          [506 x 5]          [163 x 6] 
#>          transfers 
#>         [524 x 13]
mimic_demo$icustays
#> # <mimic_tbl>: [136 x 12]
#> # ID options:  subject_id (patient) < hadm_id (hadm) < icustay_id (icustay)
#> # Defaults:    `intime` (index), `last_careunit` (val)
#> # Time vars:   `intime`, `outtime`
#>     row_id subject_id hadm_id icustay_id dbsource   first_careunit last_careunit
#>      <int>      <int>   <int>      <int> <chr>      <chr>          <chr>
#> 1    12742      10006  142345     206504 carevue    MICU           MICU
#> 2    12747      10011  105331     232110 carevue    MICU           MICU
#> 3    12749      10013  165520     264446 carevue    MICU           MICU
#> 4    12754      10017  199207     204881 carevue    CCU            CCU
#> 5    12755      10019  177759     228977 carevue    MICU           MICU
#> ...
#> 132  42676      44083  198330     286428 metavision CCU            CCU
#> 133  42691      44154  174245     217724 metavision MICU           MICU
#> 134  42709      44212  163189     239396 metavision MICU           MICU
#> 135  42712      44222  192189     238186 metavision CCU            CCU
#> 136  42714      44228  103379     217992 metavision SICU           SICU
#> # i 131 more rows
#> # i 5 more variables: first_wardid <int>, last_wardid <int>, intime <dttm>,
#> #   outtime <dttm>, los <dbl>

Table subsets can be loaded into memory, for example using the base::subset() function, which uses non-standard evaluation (NSE) to determine a row subset. This design choice stems from the fact that some tables have on the order of 10^8 rows, which makes loading full tables into memory an expensive operation. Table subsets loaded into memory are represented as data.table objects. Extending the above example, if only ICU stays corresponding to the patient with subject_id == 10124 are of interest, the respective data can be loaded as

subset(mimic_demo$icustays, subject_id == 10124)
#>    row_id subject_id hadm_id icustay_id dbsource first_careunit last_careunit
#> 1:  12863      10124  182664     261764  carevue           MICU          MICU
#> 2:  12864      10124  170883     222779  carevue           MICU          MICU
#> 3:  12865      10124  170883     295043  carevue            CCU           CCU
#> 4:  12866      10124  170883     237528  carevue           MICU          MICU
#>    first_wardid last_wardid              intime             outtime     los
#> 1:           23          23 2192-03-29 10:46:51 2192-04-01 06:36:00  2.8258
#> 2:           50          50 2192-04-16 20:58:32 2192-04-20 08:51:28  3.4951
#> 3:            7           7 2192-04-24 02:29:49 2192-04-26 23:59:45  2.8958
#> 4:           23          23 2192-04-30 14:50:44 2192-05-15 23:34:21 15.3636
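
To illustrate how non-standard evaluation resolves the subsetting expression, the following sketch uses a plain data.frame as a stand-in for the on-disk table. The rows are copied from the outputs above and only three columns are kept. It is an assumption that the subset() method for src_tbl objects resolves column names and outside variables in the same way as the data.frame method shown here.

```r
# Stand-in for mimic_demo$icustays, restricted to rows and columns shown above
icustays <- data.frame(
  subject_id     = c(10006L, 10017L, 10124L, 10124L, 10124L, 10124L),
  icustay_id     = c(206504L, 204881L, 261764L, 222779L, 295043L, 237528L),
  first_careunit = c("MICU", "CCU", "MICU", "MICU", "CCU", "MICU")
)

# NSE: `subject_id` is looked up among the columns of `icustays`
nse <- subset(icustays, subject_id == 10124)

# Equivalent standard evaluation using an explicit logical index
se <- icustays[icustays$subject_id == 10124, ]

# Names that are not columns are taken from the calling environment
patient <- 10124
from_var <- subset(icustays, subject_id == patient)
```

A variable that shares its name with a column (for example subject_id) would be shadowed by that column, so distinct variable names are the safer choice when building such expressions programmatically.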

Much care has been taken to make ricu extensible to new datasets. For example, the publicly available ICU database AmsterdamUMCdb, provided by the Amsterdam University Medical Center, is currently not part of the core datasets of ricu, but code for integrating this dataset is available on GitHub.

MIMIC-III

The Medical Information Mart for Intensive Care (MIMIC) database holds detailed clinical data from roughly 60,000 patient stays in Beth Israel Deaconess Medical Center (BIDMC) intensive care units between 2001 and 2012. The database includes information such as demographics, vital sign measurements made at the bedside (~1 data point per hour), laboratory test results, procedures, medications, caregiver notes, imaging reports, and mortality (both in and out of hospital). For further information, please refer to the MIMIC-III documentation.

The corresponding demo dataset contains the full data of a randomly selected subset of 100 patients from the patient cohort with confirmed in-hospital mortality. The only notable data omission is the noteevents table, which contains unstructured text reports on patients.

eICU

More recently, Philips Healthcare and LCP began assembling the eICU Collaborative Research Database as a multi-center resource for ICU data. Combining data of several critical care units throughout the continental United States from the years 2014 and 2015, this database contains de-identified health data associated with over 200,000 admissions, including vital sign measurements, care plan documentation, severity of illness measures, diagnosis information, and treatment information. For further information, please refer to the eICU documentation.

The demo subset includes data associated with over 2,500 unit stays selected from 20 of the larger hospitals. An important caveat that applies to the eICU-based datasets is the considerable variability among the large number of hospitals in terms of data availability.

HiRID

Moving to higher time resolution, HiRID is a freely accessible critical care dataset containing data relating to almost 34,000 patient admissions to the Department of Intensive Care Medicine of the Bern University Hospital, Switzerland. The dataset contains de-identified demographic information and a total of 681 routinely collected physiological variables, diagnostic test results and treatment parameters, collected during the period from January 2008 to June 2016. Depending on the type of measurement, time resolution can be on the order of 2 minutes.

AmsterdamUMCdb

With similar time resolution (for vital sign measurements) as HiRID, AmsterdamUMCdb contains data from 23,000 admissions of adult patients between 2003 and 2016 to the Department of Intensive Care of Amsterdam University Medical Center. In total, nearly 10^9 individual observations consisting of vital signs, clinical scoring systems, device data and lab results, as well as nearly 5*10^6 medication entries, alongside de-identified demographic information corresponding to the 20,000 individual patients, are spread over 7 tables.

MIMIC-IV

With the recent v1.0 release of MIMIC-IV, experimental support has been added in ricu. Building on the success of MIMIC-III, this next iteration contains data on patients admitted to an ICU or the emergency department at BIDMC between 2008 and 2019. Relative to MIMIC-III, data on patients admitted prior to 2008 (stored in a CareVue-based system) have therefore been removed, while data from 2012 onward have been added. This simplifies data queries considerably, as the CareVue/MetaVision data split in MIMIC-III no longer applies. While addition of ED data is planned, it is not part of the initial v1.0 release and is currently not supported by ricu. For further information, please refer to the MIMIC-IV documentation.

References

Johnson, A., Pollard, T., & Mark, R. (2016). MIMIC-III Clinical Database (version 1.4). PhysioNet. https://doi.org/10.13026/C2XW26.

MIMIC-III, a freely accessible critical care database. Johnson AEW, Pollard TJ, Shen L, Lehman L, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, and Mark RG. Scientific Data (2016). DOI: 10.1038/sdata.2016.35.

Johnson, A., Pollard, T., Badawi, O., & Raffa, J. (2019). eICU Collaborative Research Database Demo (version 2.0). PhysioNet. https://doi.org/10.13026/gxmm-es70.

The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Pollard TJ, Johnson AEW, Raffa JD, Celi LA, Mark RG and Badawi O. Scientific Data (2018). DOI: http://dx.doi.org/10.1038/sdata.2018.178.

Faltys, M., Zimmermann, M., Lyu, X., Hüser, M., Hyland, S., Rätsch, G., & Merz, T. (2020). HiRID, a high time-resolution ICU dataset (version 1.0). PhysioNet. https://doi.org/10.13026/hz5m-md48.

Hyland, S.L., Faltys, M., Hüser, M. et al. Early prediction of circulatory failure in the intensive care unit using machine learning. Nat Med 26, 364–373 (2020). https://doi.org/10.1038/s41591-020-0789-4

Thoral PJ, Peppink JM, Driessen RH, et al (2020) AmsterdamUMCdb: The First Freely Accessible European Intensive Care Database from the ESICM Data Sharing Initiative. https://www.amsterdammedicaldatascience.nl.

Elbers, Dr. P.W.G. (Amsterdam UMC) (2019): AmsterdamUMCdb v1.0.2. DANS. https://doi.org/10.17026/dans-22u-f8vd

Johnson, A., Bulgarelli, L., Pollard, T., Horng, S., Celi, L. A., & Mark, R. (2021). MIMIC-IV (version 1.0). PhysioNet. https://doi.org/10.13026/s6n6-xd98.

Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation (Online). 101 (23), pp. e215–e220.