Quick start guide • ICU data with R

Setting up ricu requires downloading datasets from several platforms. Two data sources, mimic_demo and eicu_demo, are available directly as R packages hosted on GitHub. The respective full-featured versions, mimic and eicu, as well as the hirid dataset, are available from PhysioNet, while the remaining standard dataset, aumc, is available from a separate website. The following steps cover package installation and data source setup, and conclude with some example data queries.

Package installation

Stable package releases are available from CRAN as

install.packages("ricu")

and the latest development version is available from GitHub as

remotes::install_github("eth-mds/ricu")

Demo datasets

The demo datasets mimic_demo and eicu_demo are listed as Suggests dependencies, so whether they are installed depends on the value passed as dependencies to the package installation function above. The following call explicitly installs the demo dataset packages:

install.packages(
  c("mimic.demo", "eicu.demo"),
  repos = "https://eth-mds.github.io/physionet-demo"
)
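Whether the demo packages ended up installed can be checked from base R before attempting to load any data. The following is a minimal sketch that only relies on requireNamespace(); the variable names are illustrative.

```r
# Names of the demo data packages as used in the install call above
demo_pkgs <- c("mimic.demo", "eicu.demo")

# TRUE for each package that can be loaded, FALSE otherwise
installed <- vapply(demo_pkgs, requireNamespace, logical(1L), quietly = TRUE)
print(installed)

# Report anything that still needs to be installed
not_installed <- demo_pkgs[!installed]
if (length(not_installed) > 0L) {
  message("Not installed: ", paste(not_installed, collapse = ", "))
}
```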

Full datasets

Included with ricu are functions for download and setup of the following datasets: mimic (MIMIC-III), eicu, hirid, aumc and miiv (MIMIC-IV), which can be invoked in several different ways.

To begin with, a directory is needed where the data can permanently be stored. The default location is platform dependent and can be overridden using the environment variable RICU_DATA_PATH. The current value can be retrieved by calling data_dir().
Access to the PhysioNet datasets (MIMIC-III, eICU and HiRID) and to AUMCdb is free but credentialed. In addition to setting up an account with PhysioNet, credentialing is required, which, in the case of HiRID, must also be followed by submitting an access request to the data owners. Requesting access to AUMCdb consists of filling out a form and completing a training course such as Data or Specimens Only Research (DSOR); the form, together with proof of training course completion, can then be submitted by email.
If raw data in .csv form has already been downloaded, it can be decompressed and copied to the appropriate sub-folder (mimic, eicu, hirid or aumc) of the directory identified by data_dir().
In order to have ricu download the required data, login credentials can be supplied as environment variables RICU_PHYSIONET_USER/RICU_PHYSIONET_PASS and RICU_AUMC_TOKEN (the string that follows token= in the download URL received from the AUMCdb data owners) or entered into the terminal manually in interactive sessions.
To enable efficient random row/column access, ricu converts .csv files into a binary format using the fst package.
Data conversion to .fst format (and potentially data download) is automatically triggered upon first access of a table. In interactive sessions, the user is asked for permission to set up the given data source, whereas in non-interactive sessions, access to missing data throws an error.
Instead of relying on first data access to trigger setup, up-front data conversion, possibly preceded by data download, can be invoked by calling setup_src_data().
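The setup steps above can be sketched as follows. The directory location and the credentials are placeholders, and the run_setup flag is a hypothetical switch added here so that nothing is downloaded by accident. Setting the environment variables before ricu is loaded (for example in ~/.Renviron) is assumed to be the safer choice.

```r
# Placeholder data directory; any writable location will do
data_root <- file.path(tempdir(), "ricu-data")
dir.create(data_root, recursive = TRUE, showWarnings = FALSE)
Sys.setenv(RICU_DATA_PATH = data_root)

# Placeholder credentials (not real); leave unset to be prompted
# in interactive sessions instead
Sys.setenv(
  RICU_PHYSIONET_USER = "my-user",
  RICU_PHYSIONET_PASS = "my-password"
)

# Sub-folder into which already downloaded mimic .csv files would be copied
mimic_dir <- file.path(data_root, "mimic")

# Hypothetical switch: set to TRUE to actually download and convert data
run_setup <- FALSE

if (run_setup && requireNamespace("ricu", quietly = TRUE)) {
  print(ricu::data_dir())        # should now point to data_root
  ricu::setup_src_data("mimic")  # download (if needed) and convert to .fst
}
```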

Concept loading

Many commonly used clinical data concepts are available for all data sources where the required data exists. An overview of available concepts can be obtained by calling explain_dictionary(), and concepts can be loaded using load_concepts():

src  <- "mimic_demo"
demo <- c(src, "eicu_demo")

head(explain_dictionary(src = demo))
#>       name     category            description
#> 1      abx  medications            antibiotics
#> 2 adh_rate  medications       vasopressin rate
#> 3      adm demographics patient admission type
#> 4      age demographics            patient age
#> 5      alb    chemistry                albumin
#> 6      alp    chemistry   alkaline phosphatase
load_concepts("alb", src, verbose = FALSE)
#> # A `ts_tbl`: 297 ✖ 3
#> # Id var:     `icustay_id`
#> # Units:      `alb` [g/dL]
#> # Index var:  `charttime` (1 hours)
#>     icustay_id charttime   alb
#>          <int> <drtn>    <dbl>
#>   1     201006   0 hours   2.4
#>   2     203766 -18 hours   2
#>   3     203766   4 hours   1.7
#>   4     204132   7 hours   3.6
#>   5     204201   9 hours   2.3
#>   …
#> 293     298685 130 hours   1.9
#> 294     298685 154 hours   2
#> 295     298685 203 hours   2
#> 296     298685 272 hours   2.2
#> 297     298685 299 hours   2.5
#> # … with 287 more rows

Concepts representing time-dependent measurements are loaded as ts_tbl objects, whereas static information is retrieved as id_tbl objects. Both classes inherit from data.table (and therefore also from data.frame) and can be coerced to either base class using as.data.table() and as.data.frame(), respectively. Using data.table ‘by-reference’ operations, this coercion is available as a zero-copy operation by passing by_ref = TRUE¹.

(dat <- load_concepts("height", src, verbose = FALSE))
#> # An `id_tbl`: 63 ✖ 2
#> # Id var:      `icustay_id`
#> # Units:       `height` [cm]
#>    icustay_id height
#>         <int>  <dbl>
#>  1     201006   157.
#>  2     201204   163.
#>  3     203766   165.
#>  4     204132   165.
#>  5     204201   157.
#>  …
#> 59     293429   155.
#> 60     295043   165.
#> 61     295741   175.
#> 62     296804   173.
#> 63     298685   175.
#> # … with 53 more rows
head(tmp <- as.data.frame(dat, by_ref = TRUE))
#>   icustay_id height
#> 1     201006 157.48
#> 2     201204 162.56
#> 3     203766 165.10
#> 4     204132 165.10
#> 5     204201 157.48
#> 6     210989 175.26
identical(dat, tmp)
#> [1] TRUE

Many functions exported by ricu use id_tbl and ts_tbl objects in order to enable more concise semantics. Merging an id_tbl with a ts_tbl, for example, will automatically use the columns identified by id_vars() of both tables as by.x/by.y arguments, while for two ts_tbl objects, the respective columns reported by id_vars() and index_var() will be used to merge on.
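As a brief illustration of these merge semantics, the sketch below joins the time-dependent alb concept with the static height concept from mimic_demo. It assumes that both ricu and the mimic.demo package are installed and is skipped otherwise; no by arguments are passed, so the merge columns are left for ricu to determine.

```r
merged <- NULL

if (requireNamespace("ricu", quietly = TRUE) &&
    requireNamespace("mimic.demo", quietly = TRUE)) {

  library(ricu)

  alb    <- load_concepts("alb", "mimic_demo", verbose = FALSE)     # ts_tbl
  height <- load_concepts("height", "mimic_demo", verbose = FALSE)  # id_tbl

  # ID columns of each table, which the merge is expected to use
  print(id_vars(alb))
  print(id_vars(height))

  # No by/by.x/by.y needed: ID columns are taken from the table metadata
  merged <- merge(alb, height)
  print(head(merged))
}
```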

When loading from multiple data sources simultaneously, load_concepts() will add a source column (which will be among the id_vars() of the resulting object), making it possible to identify which data source each stay ID belongs to.

load_concepts("weight", demo, verbose = FALSE)
#> # An `id_tbl`: 2,434 ✖ 3
#> # Id vars:     `source`, `icustay_id`
#> # Units:       `weight` [kg]
#>       source     icustay_id weight
#>       <chr>           <int>  <dbl>
#>     1 eicu_demo      141765   46.5
#>     2 eicu_demo      143870   77.5
#>     3 eicu_demo      144815   60.3
#>     4 eicu_demo      145427   91.7
#>     5 eicu_demo      147307   72.5
#>     …
#> 2,430 mimic_demo     295043   96.6
#> 2,431 mimic_demo     295741   81.6
#> 2,432 mimic_demo     296804   71
#> 2,433 mimic_demo     297782   78.8
#> 2,434 mimic_demo     298685   52
#> # … with 2,424 more rows

Extending the concept dictionary

In addition to the ~100 concepts that are available by default, user-defined concepts can be added either as R objects or, more robustly, as JSON configuration files.

Data concepts consist of zero, one, or several data items per data source, encoding how to retrieve the corresponding data. The constructors concept() and item() can be used to instantiate concepts as R objects.

ldh <- concept("ldh",
  item("mimic_demo", "labevents", "itemid", 50954),
  description = "Lactate dehydrogenase",
  unit = "IU/L"
)
load_concepts(ldh, verbose = FALSE)
#> # A `ts_tbl`: 365 ✖ 3
#> # Id var:     `icustay_id`
#> # Units:      `ldh` [IU/L]
#> # Index var:  `charttime` (1 hours)
#>     icustay_id charttime   ldh
#>          <int> <drtn>    <dbl>
#>   1     201006 -45 hours   249
#>   2     201006  48 hours   399
#>   3     203766   4 hours   227
#>   4     204132   7 hours   489
#>   5     204132  36 hours   574
#>   …
#> 361     298685 203 hours   222
#> 362     298685 226 hours   230
#> 363     298685 260 hours   218
#> 364     298685 272 hours   221
#> 365     298685 299 hours   253
#> # … with 355 more rows

Configuration files are searched for in both the package installation directory and in user-specified locations, which can be set either using the environment variable RICU_CONFIG_PATH or by passing paths as function arguments (load_dictionary(), for example, accepts a cfg_dirs argument).
Mechanisms for both extending and replacing existing concept dictionaries are supported by ricu. The default concept dictionary file is named concept-dict.json, and any file with the same name in user-specified locations will be used as an extension. In order to forgo the internal dictionary, a different file name can be chosen, which then has to be passed as a function argument (load_dictionary(), for example, has a name argument which defaults to concept-dict).

A JSON-based concept akin to the one above can be specified as

{
    "ldh": {
      "unit": "IU/L",
      "description": "Lactate dehydrogenase",
      "sources": {
        "mimic_demo": [
          {
            "ids": 50954,
            "table": "labevents",
            "sub_var": "itemid"
          }
        ]
      }
    }
}

and this can (given that it is saved as concept-dict.json in a directory pointed to by RICU_CONFIG_PATH) then be loaded using load_concepts() as

load_concepts("ldh", "mimic_demo")
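Putting these steps together, the following sketch writes the JSON definition to a temporary directory, points RICU_CONFIG_PATH at it and then loads the concept. The directory location is a placeholder, and the final step assumes that ricu and the mimic.demo package are installed; it is skipped otherwise.

```r
# Placeholder configuration directory
cfg_dir <- file.path(tempdir(), "ricu-config")
dir.create(cfg_dir, recursive = TRUE, showWarnings = FALSE)

# Same concept definition as shown above
ldh_json <- '{
  "ldh": {
    "unit": "IU/L",
    "description": "Lactate dehydrogenase",
    "sources": {
      "mimic_demo": [
        {
          "ids": 50954,
          "table": "labevents",
          "sub_var": "itemid"
        }
      ]
    }
  }
}'

# Using the default file name means the file extends the built-in dictionary
cfg_file <- file.path(cfg_dir, "concept-dict.json")
writeLines(ldh_json, cfg_file)

# Make the directory known to ricu
Sys.setenv(RICU_CONFIG_PATH = cfg_dir)

if (requireNamespace("ricu", quietly = TRUE) &&
    requireNamespace("mimic.demo", quietly = TRUE)) {
  ldh <- ricu::load_concepts("ldh", "mimic_demo", verbose = FALSE)
  print(head(ldh))
}
```

Instead of setting the environment variable, the directory could also be passed directly, for example as the cfg_dirs argument of load_dictionary().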

For further details on constructing concepts, refer to the documentation at ?concept and ?item.

¹ While data.table by-reference operations can be very useful due to their inherent efficiency benefits, much care is required when they are enabled, as they break with the usual base R by-value (copy-on-modify) semantics.