data_env.Rd
The Laboratory for Computational Physiology (LCP) at MIT hosts several large-scale databases of hospital intensive care units (ICUs), two of which can be downloaded either in full (MIMIC-III and eICU) or as demo subsets (MIMIC-III demo and eICU demo), while a third data set is available only in full (HiRID). While the demo data sets are freely available, full download requires credentialed access, which can be gained by applying for an account with PhysioNet. Even though registration is required, the described datasets are all publicly available. With AmsterdamUMCdb, a non-PhysioNet hosted data source is available as well. As with the PhysioNet datasets, access is public but has to be granted by the data collectors.
data
The exported data environment contains all datasets that have been made
available to ricu. For datasets that are attached during package loading
(see attach_src()), shortcuts to the datasets are set up in the package
namespace, allowing the object ricu::data::mimic_demo to be accessed as
ricu::mimic_demo (or, in case the package namespace has been attached,
simply as mimic_demo). Datasets that are made available after the package
namespace has been sealed will have their proxy object by default located
in .GlobalEnv. Datasets are represented by src_env objects, while individual
tables are src_tbl objects, which do not represent in-memory data, but
rather data stored on disk, subsets of which can be loaded into memory.
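As a minimal sketch of the above, the names registered in the exported data environment can be listed without loading any table into memory. The snippet assumes that ricu::data is a regular R environment, as described here, and it falls back to an empty result when ricu is not installed; the variable name src_names is purely illustrative.

```r
# List the data sources registered in the exported `data` environment.
# Assumption: `ricu::data` is a plain environment; if ricu is not installed
# (or the assumption does not hold), an empty character vector is returned.
src_names <- character(0L)

if (requireNamespace("ricu", quietly = TRUE)) {
  data_env <- ricu::data
  if (is.environment(data_env)) {
    # ls() only reads the binding names, it does not touch the data on disk
    src_names <- ls(envir = data_env)
  }
}

print(src_names)
```

Listing names this way avoids evaluating the dataset objects themselves, which matters when a source is registered but its data has not been downloaded yet.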
Setting up a dataset for use with ricu requires a configuration object.
For the included datasets, configuration can be loaded from
system.file("extdata", "config", "data-sources.json", package = "ricu")
by calling load_src_cfg(), and for datasets that are external to ricu,
additional configuration can be made available by setting the environment
variable RICU_CONFIG_PATH (for more information, refer to load_src_cfg()).
Using the dataset configuration object, data can be downloaded
(download_src()), imported (import_src()) and attached (attach_src()).
While downloading and importing are one-time procedures, attaching of the
dataset is repeated every time the package is loaded. Briefly, downloading
loads the raw dataset from the internet (most likely in .csv format),
importing consists of some preprocessing to make the data available more
efficiently (by converting it to .fst format) and attaching sets up the
data for use by the package. For more information on the individual steps,
refer to the respective documentation pages.
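The three steps can be sketched as a small wrapper. The helper setup_demo_src() is hypothetical and not part of ricu; it assumes that download_src(), import_src() and attach_src() accept a data source name such as "mimic_demo" and use their default data directory. The function is only defined here, not called, since calling it needs internet access and writes to disk.

```r
# Hypothetical helper (not part of ricu): run download, import and attach
# for a single data source, identified by name.
setup_demo_src <- function(src = "mimic_demo") {
  stopifnot(is.character(src), length(src) == 1L)

  if (!requireNamespace("ricu", quietly = TRUE)) {
    stop("Package 'ricu' is required for this sketch.")
  }

  ricu::download_src(src)  # one-time: fetch the raw files (most likely .csv)
  ricu::import_src(src)    # one-time: convert the raw files to .fst
  ricu::attach_src(src)    # repeated whenever the package is loaded

  invisible(src)
}

# Usage (requires internet access and write permission for the data directory):
# setup_demo_src("mimic_demo")
```

For a dataset external to ricu, the environment variable would presumably be set before the configuration is loaded, for example with Sys.setenv(RICU_CONFIG_PATH = "/path/to/config"), where the path shown is a placeholder.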
A dataset that has been successfully made available can be explored
interactively by typing its name into the console, and individual tables
can be inspected using the $ operator. For example, for the MIMIC-III demo
dataset and the icustays table, this gives
mimic_demo
#> <mimic_demo_env[25]>
#> admissions callout caregivers chartevents
#> [129 x 19] [77 x 24] [7,567 x 4] [758,355 x 15]
#> cptevents d_cpt d_icd_diagnoses d_icd_procedures
#> [1,579 x 12] [134 x 9] [14,567 x 4] [3,882 x 4]
#> d_items d_labitems datetimeevents diagnoses_icd
#> [12,487 x 10] [753 x 6] [15,551 x 14] [1,761 x 5]
#> drgcodes icustays inputevents_cv inputevents_mv
#> [297 x 8] [136 x 12] [34,799 x 22] [13,224 x 31]
#> labevents microbiologyevents outputevents patients
#> [76,074 x 9] [2,003 x 16] [11,320 x 13] [100 x 8]
#> prescriptions procedureevents_mv procedures_icd services
#> [10,398 x 19] [753 x 25] [506 x 5] [163 x 6]
#> transfers
#> [524 x 13]
mimic_demo$icustays
#> # <mimic_tbl>: [136 x 12]
#> # ID options: subject_id (patient) < hadm_id (hadm) < icustay_id (icustay)
#> # Defaults: `intime` (index), `last_careunit` (val)
#> # Time vars: `intime`, `outtime`
#> row_id subject_id hadm_id icustay_id dbsource first_careunit last_careunit
#> <int> <int> <int> <int> <chr> <chr> <chr>
#> 1 12742 10006 142345 206504 carevue MICU MICU
#> 2 12747 10011 105331 232110 carevue MICU MICU
#> 3 12749 10013 165520 264446 carevue MICU MICU
#> 4 12754 10017 199207 204881 carevue CCU CCU
#> 5 12755 10019 177759 228977 carevue MICU MICU
#> ...
#> 132 42676 44083 198330 286428 metavision CCU CCU
#> 133 42691 44154 174245 217724 metavision MICU MICU
#> 134 42709 44212 163189 239396 metavision MICU MICU
#> 135 42712 44222 192189 238186 metavision CCU CCU
#> 136 42714 44228 103379 217992 metavision SICU SICU
#> # i 131 more rows
#> # i 5 more variables: first_wardid <int>, last_wardid <int>, intime <dttm>,
#> # outtime <dttm>, los <dbl>
Table subsets can be loaded into memory, for example using the
base::subset() function, which uses non-standard evaluation (NSE) to
determine which rows to load. This design choice stems from the fact that
some tables can have on the order of 10^8 rows, which makes loading full
tables into memory an expensive operation. Table subsets loaded into
memory are represented as data.table objects.
Extending the above example, if only ICU stays corresponding to the patient
with subject_id == 10124 are of interest, the respective data can be
loaded as
subset(mimic_demo$icustays, subject_id == 10124)
#> row_id subject_id hadm_id icustay_id dbsource first_careunit last_careunit
#> 1: 12863 10124 182664 261764 carevue MICU MICU
#> 2: 12864 10124 170883 222779 carevue MICU MICU
#> 3: 12865 10124 170883 295043 carevue CCU CCU
#> 4: 12866 10124 170883 237528 carevue MICU MICU
#> first_wardid last_wardid intime outtime los
#> 1: 23 23 2192-03-29 10:46:51 2192-04-01 06:36:00 2.8258
#> 2: 50 50 2192-04-16 20:58:32 2192-04-20 08:51:28 3.4951
#> 3: 7 7 2192-04-24 02:29:49 2192-04-26 23:59:45 2.8958
#> 4: 23 23 2192-04-30 14:50:44 2192-05-15 23:34:21 15.3636
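To see the non-standard evaluation at work without any ICU data on disk, the same call can be tried on a small in-memory stand-in. The data.frame icustays_toy below is an illustrative toy object built from three of the rows shown above; it is a base R data.frame, not a src_tbl, so it only mirrors how the row condition is interpreted, not how ricu reads from disk.

```r
# Toy stand-in for the icustays table (base R data.frame, not a src_tbl)
icustays_toy <- data.frame(
  subject_id = c(10006L, 10124L, 10124L),
  icustay_id = c(206504L, 261764L, 222779L)
)

# `subject_id` is looked up among the columns of the table, not in the
# calling environment: this is the non-standard evaluation part
res <- subset(icustays_toy, subject_id == 10124)

print(res$icustay_id)
#> [1] 261764 222779
```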
Much care has been taken to make ricu extensible to new datasets. For
example, the publicly available ICU database AmsterdamUMCdb, provided by
the Amsterdam University Medical Center, is currently not part of the core
datasets of ricu, but code for integrating this dataset is available on
GitHub.
The Medical Information Mart for Intensive Care (MIMIC) database holds detailed clinical data from roughly 60,000 patient stays in Beth Israel Deaconess Medical Center (BIDMC) intensive care units between 2001 and 2012. The database includes information such as demographics, vital sign measurements made at the bedside (~1 data point per hour), laboratory test results, procedures, medications, caregiver notes, imaging reports, and mortality (both in and out of hospital). For further information, please refer to the MIMIC-III documentation.
The corresponding demo dataset contains the full data of a randomly
selected subset of 100 patients from the patient cohort with confirmed
in-hospital mortality. The only notable data omission is the noteevents
table, which contains unstructured text reports on patients.
More recently, Philips Healthcare and LCP began assembling the eICU Collaborative Research Database as a multi-center resource for ICU data. Combining data of several critical care units throughout the continental United States from the years 2014 and 2015, this database contains de-identified health data associated with over 200,000 admissions, including vital sign measurements, care plan documentation, severity of illness measures, diagnosis information, and treatment information. For further information, please refer to the eICU documentation.
For the demo subset, data associated with over 2,500 unit stays selected from 20 of the larger hospitals is included. An important caveat that applies to the eICU-based datasets is considerable variability among the large number of hospitals in terms of data availability.
Moving to higher time-resolution, HiRID is a freely accessible critical care dataset containing data relating to almost 34,000 patient admissions to the Department of Intensive Care Medicine of the Bern University Hospital, Switzerland. The dataset contains de-identified demographic information and a total of 681 routinely collected physiological variables, diagnostic test results and treatment parameters, collected during the period from January 2008 to June 2016. Depending on the type of measurement, time resolution can be on the order of 2 minutes.
With similar time-resolution (for vital-sign measurements) as HiRID, AmsterdamUMCdb contains data from 23,000 admissions of adult patients from 2003-2016 to the department of Intensive Care of Amsterdam University Medical Center. In total, nearly 10^9 individual observations consisting of vital signs, clinical scoring systems, device data and lab results data, as well as nearly 5*10^6 medication entries, alongside de-identified demographic information corresponding to the 20,000 individual patients, are spread over 7 tables.
With the recent v1.0 release of MIMIC-IV, experimental support has been
added in ricu. Building on the success of MIMIC-III, this next iteration
contains data on patients admitted to an ICU or the emergency department
between 2008 and 2019 at BIDMC. Therefore, relative to MIMIC-III, data on
patients admitted prior to 2008 (which is stored in a CareVue-based system)
has been removed, while data from 2012 onward has been added. This
simplifies data queries considerably, as the CareVue/MetaVision data split
in MIMIC-III no longer applies. While addition of ED data is planned, this
is not part of the initial v1.0 release and currently is not supported by
ricu. For further information, please refer to the MIMIC-IV documentation.
Johnson, A., Pollard, T., & Mark, R. (2016). MIMIC-III Clinical Database (version 1.4). PhysioNet. https://doi.org/10.13026/C2XW26.
MIMIC-III, a freely accessible critical care database. Johnson AEW, Pollard TJ, Shen L, Lehman L, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, and Mark RG. Scientific Data (2016). DOI: 10.1038/sdata.2016.35.
Johnson, A., Pollard, T., Badawi, O., & Raffa, J. (2019). eICU Collaborative Research Database Demo (version 2.0). PhysioNet. https://doi.org/10.13026/gxmm-es70.
The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Pollard TJ, Johnson AEW, Raffa JD, Celi LA, Mark RG and Badawi O. Scientific Data (2018). DOI: http://dx.doi.org/10.1038/sdata.2018.178.
Faltys, M., Zimmermann, M., Lyu, X., Hüser, M., Hyland, S., Rätsch, G., & Merz, T. (2020). HiRID, a high time-resolution ICU dataset (version 1.0). PhysioNet. https://doi.org/10.13026/hz5m-md48.
Hyland, S.L., Faltys, M., Hüser, M. et al. Early prediction of circulatory failure in the intensive care unit using machine learning. Nat Med 26, 364–373 (2020). https://doi.org/10.1038/s41591-020-0789-4
Thoral PJ, Peppink JM, Driessen RH, et al (2020) AmsterdamUMCdb: The First Freely Accessible European Intensive Care Database from the ESICM Data Sharing Initiative. https://www.amsterdammedicaldatascience.nl.
Elbers, Dr. P.W.G. (Amsterdam UMC) (2019): AmsterdamUMCdb v1.0.2. DANS. https://doi.org/10.17026/dans-22u-f8vd
Johnson, A., Bulgarelli, L., Pollard, T., Horng, S., Celi, L. A., & Mark, R. (2021). MIMIC-IV (version 1.0). PhysioNet. https://doi.org/10.13026/s6n6-xd98.
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation (Online). 101 (23), pp. e215–e220.