Load configuration for a data source — load_src

For a data source to be accessible by ricu, a configuration object inheriting from the S3 class src_cfg is required. Such objects can be generated from JSON-based configuration files using load_src_cfg(). Information encoded by this configuration object includes available ID systems (mainly for use in change_id()), default column names per table for columns with special meaning (such as index column, value columns, unit columns, etc.), as well as a specification used for initial setup of the dataset, which includes file names and column names alongside their data types.

load_src_cfg(src = NULL, name = "data-sources", cfg_dirs = NULL)

Arguments

src: (Optional) name(s) of data sources used for subsetting
name: String-valued name of a config file which will be looked up in the default config directories
cfg_dirs: Additional directory or directories in which to look for configuration files

Value

A list of data source configurations as src_cfg objects.

Details

Configuration files are looked up as files called name with the suffix .json added, starting with the directory (or directories) supplied as the cfg_dirs argument, followed by the directory specified by the environment variable RICU_CONFIG_PATH, and finally in extdata/config of the package install directory. If files with matching names are found in multiple places, they are concatenated such that, in cases of name clashes, the earlier hits take precedence over the later ones. The following JSON code blocks show excerpts of the config file available at

system.file("extdata", "config", "data-sources.json", package = "ricu")
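The lookup order described above can be sketched in base R as follows. This is a minimal illustration and not ricu's internal implementation: the helper candidate_cfg_files() is hypothetical, and the package directory is passed in as a plain argument so that the sketch runs without ricu installed.

```r
# Hypothetical helper (not part of ricu): list candidate config files in
# the order described above, earliest (highest precedence) hit first.
candidate_cfg_files <- function(name, cfg_dirs = NULL,
                                env_dir = Sys.getenv("RICU_CONFIG_PATH"),
                                pkg_dir = NULL) {
  dirs <- c(cfg_dirs, env_dir, pkg_dir)
  dirs <- dirs[nzchar(dirs)]  # drop an unset environment variable
  file.path(dirs, paste0(name, ".json"))
}

# Example directories are made up for illustration.
files <- candidate_cfg_files(
  "data-sources",
  cfg_dirs = "my-configs",
  env_dir  = "",
  pkg_dir  = file.path("ricu", "extdata", "config")
)
files
```

With real directories, files that do not exist could be dropped using file.exists() before the remaining ones are read and combined.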

A data source configuration entry in a config file starts with a name, followed by optional entries class_prefix and further (variable) key-value pairs, such as a URL. For more information on class_prefix, please refer to the end of this section. Further entries include id_cfg and tables, which are explained in more detail below. In outline, this gives the following JSON object for the data source mimic_demo:

{
  "name": "mimic_demo",
  "class_prefix": ["mimic_demo", "mimic"],
  "url": "https://physionet.org/files/mimiciii-demo/1.4",
  "id_cfg": {
    ...
  },
  "tables": {
    ...
  }
}

The id_cfg entry is used to specify the available ID systems for a data source and how they relate to each other. An ID system within the context of ricu is a patient identifier, of which typically several are present in a data set. In MIMIC-III, for example, three ID systems are available: patient IDs (subject_id), hospital admission IDs (hadm_id) and ICU stay IDs (icustay_id). Furthermore, there is a one-to-many relationship between subject_id and hadm_id, as well as between hadm_id and icustay_id. Defining an ID system requires a name, a position entry which orders the ID systems by their cardinality, and a table entry, alongside column specifications id, start and end, which define how the IDs themselves, combined with start and end times, can be loaded from a table. This gives the following specification for the ICU stay ID system in MIMIC-III:

{
  "icustay": {
    "id": "icustay_id",
    "position": 3,
    "start": "intime",
    "end": "outtime",
    "table": "icustays"
  }
}
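To illustrate how the position entry orders ID systems, the following base R sketch represents id_cfg entries as a named list and sorts them. Only the icustay entry is taken from the specification above. The patient and hadm entries are abbreviated, and their positions 1 and 2 are assumptions made for illustration.

```r
# ID systems as a named list mirroring the JSON structure. Only the
# icustay entry comes from the text; the other two are abbreviated and
# their positions are assumed.
id_cfg <- list(
  icustay = list(id = "icustay_id", position = 3L, start = "intime",
                 end = "outtime", table = "icustays"),
  patient = list(id = "subject_id", position = 1L),
  hadm    = list(id = "hadm_id", position = 2L)
)

# Order the ID systems by their position entry.
pos    <- vapply(id_cfg, function(x) x[["position"]], integer(1L))
id_cfg <- id_cfg[order(pos)]

# Named character vector: ID system name -> ID column.
ids <- vapply(id_cfg, function(x) x[["id"]], character(1L))
ids
```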

Tables are defined by a name and entries files, defaults, and cols, as well as optional entries num_rows and partitioning. The files entry is expected to be a character vector of file names. For all of MIMIC-III, a single .csv file corresponds to a table, but for HiRID, for example, some tables are distributed in partitions. The defaults entry consists of key-value pairs, identifying columns in a table with special meaning, such as the default index column or the set of all columns that represent timestamps. As an example, the table entry for the chartevents table in MIMIC-III is a JSON object like:

{
  "chartevents": {
    "files": "CHARTEVENTS.csv.gz",
    "defaults": {
      "index_var": "charttime",
      "val_var": "valuenum",
      "unit_var": "valueuom",
      "time_vars": ["charttime", "storetime"]
    },
    "num_rows": 330712483,
    "cols": {
      ...
    },
    "partitioning": {
      "col": "itemid",
      "breaks": [127, 210, 425, 549, 643, 741, 1483, 3458, 3695, 8440,
                 8553, 220274, 223921, 224085, 224859, 227629]
    }
  }
}

The optional num_rows entry is used when importing data (see import_src()) as a sanity check, which is not performed if this entry is missing from the data source configuration. The remaining table entry, partitioning, is optional in the sense that if it is missing, the table is not partitioned, and if it is present, the table will be partitioned accordingly when being imported (see import_src()). In order to specify a partitioning, two entries are required, col and breaks, where the former denotes a column and the latter a numeric vector which is used to construct intervals according to which col is binned. As such, col is currently required to be of numeric type. A partitioning entry as in the example above will assign rows corresponding to itemid 1 through 126 to partition 1, 127 through 209 to partition 2, and so on up to partition 17.
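The binning described above can be reproduced in base R with findInterval(). This is a sketch of the stated rule only. Whether ricu uses findInterval() internally is an assumption, and partition_of() is a hypothetical helper.

```r
# Breaks from the chartevents partitioning entry shown above.
breaks <- c(127, 210, 425, 549, 643, 741, 1483, 3458, 3695, 8440,
            8553, 220274, 223921, 224085, 224859, 227629)

# findInterval() counts how many breaks are <= each value; adding 1
# turns this into a 1-based partition index.
partition_of <- function(itemid) findInterval(itemid, breaks) + 1L

part <- partition_of(c(1, 126, 127, 209, 210, 227629))
part
```

Since 16 breaks delimit 17 intervals, the largest partition index is length(breaks) + 1.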

Column specifications consist of a name and a spec entry, and are themselves keyed by the column name that will be used by ricu. The name entry gives the column name as it appears in the raw data. The spec entry is expected to be the name of a column specification function of the readr package (see readr::cols()), and all further entries in a cols object are used as arguments to the readr column specification. For the admissions table of MIMIC-III, the columns hadm_id and admittime are represented by:

{
  ...,
  "hadm_id": {
    "name": "HADM_ID",
    "spec": "col_integer"
  },
  "admittime": {
    "name": "ADMITTIME",
    "spec": "col_datetime",
    "format": "%Y-%m-%d %H:%M:%S"
  },
  ...
}
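The mapping from a column entry to a readr column specification can be sketched as follows. The helper spec_call() is hypothetical and not how ricu necessarily implements this. It only constructs the call, so the sketch runs without readr installed. Evaluating the result with eval() requires readr.

```r
# Column entry as it would look after parsing the JSON above.
admittime <- list(name = "ADMITTIME", spec = "col_datetime",
                  format = "%Y-%m-%d %H:%M:%S")

# Hypothetical helper (not part of ricu): build, without evaluating, the
# readr call implied by a column entry. Every entry besides name and
# spec is passed on as an argument.
spec_call <- function(col) {
  args <- col[setdiff(names(col), c("name", "spec"))]
  fun  <- call("::", quote(readr), as.name(col[["spec"]]))
  as.call(c(list(fun), args))
}

cl <- spec_call(admittime)
cl
```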

Internally, a src_cfg object consists of further S3 classes, which are instantiated when loading a JSON source configuration file. Functions for creating and manipulating src_cfg and related objects are marked internal, but a brief overview is given here nevertheless:

src_cfg: wraps objects id_cfg, col_cfg and optionally tbl_cfg
id_cfg: contains information on ID systems and is created from id_cfg entries in config files
col_cfg: contains column default settings represented by defaults entries in table configuration blocks
tbl_cfg: used when importing data and therefore encompasses information from the files, num_rows and cols entries of table configuration blocks

A src_cfg can be instantiated without a corresponding tbl_cfg but consequently cannot be used for data import (see import_src()). In that sense, table config entries files and cols are optional as well, with the restriction that the data source has to be already available in .fst format.

An example of such a slimmed-down config file is available at

system.file("extdata", "config", "demo-sources.json", package = "ricu")

The class_prefix entry in a data source configuration is used to create sub-classes of the src_cfg, id_cfg, col_cfg and tbl_cfg classes and is passed on to the constructors of src_env (new_src_env()) and src_tbl (new_src_tbl()) objects. As an example, for the above class_prefix value of c("mimic_demo", "mimic"), the corresponding src_cfg will be assigned classes c("mimic_demo_cfg", "mimic_cfg", "src_cfg") and consequently the src_tbl objects will inherit from "mimic_demo_tbl", "mimic_tbl" and "src_tbl". This can be used to adapt the behavior of the involved S3 generic functions to specifics of the different data sources. An example of this is how load_difftime() uses these sub-classes to smooth out different time-stamp representations. Furthermore, such a design was chosen with extensibility in mind. Currently, download_src() is designed around data sources hosted on PhysioNet, but in order to include a dataset external to PhysioNet, the download_src() generic can simply be extended for the new class.
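The way class vectors follow from class_prefix, and how S3 dispatch then picks up source-specific methods, can be sketched in base R. The generic describe() and its methods are hypothetical and exist only for this illustration. They are not part of ricu.

```r
prefix <- c("mimic_demo", "mimic")

# Class vectors as described above: prefixed sub-classes first, the
# base class last.
cfg_class <- c(paste0(prefix, "_cfg"), "src_cfg")
tbl_class <- c(paste0(prefix, "_tbl"), "src_tbl")

# Hypothetical generic with a default for all sources and a method
# specific to MIMIC-style sources.
describe <- function(x) UseMethod("describe")
describe.src_cfg   <- function(x) "generic data source"
describe.mimic_cfg <- function(x) "MIMIC-style data source"

# No method exists for mimic_demo_cfg, so dispatch falls through to
# the mimic_cfg method before reaching src_cfg.
obj <- structure(list(), class = cfg_class)
res <- describe(obj)
res
```

A data source with a different prefix would fall back to the src_cfg method until a more specific one is defined, which is the extension mechanism described above for generics such as download_src().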

Examples

cfg <- load_src_cfg("mimic_demo")
str(cfg, max.level = 1L)
#> List of 1
#>  $ mimic_demo:List of 6
#>   ..- attr(*, "class")= chr [1:3] "mimic_demo_cfg" "mimic_cfg" "src_cfg"
cfg <- cfg[["mimic_demo"]]
str(cfg, max.level = 1L)
#> List of 6
#>  $ name   : chr "mimic_demo"
#>  $ prefix : chr [1:2] "mimic_demo" "mimic"
#>  $ id_cfg : mmc_dm_d [1:3] `subject_id`, `hadm_id`, `icustay_id`
#>  $ col_cfg: mmc_dm_c [1:25] [0, 0, 5, 0, 1], [0, 1, 6, 0, 1], [1, 0, 0, 0, 1], [0,...
#>  $ tbl_cfg: mmc_dm_t [1:25] [?? ✖ 19; 1], [?? ✖ 24; 1], [?? ✖ 4; 1], [?? ✖ 15; 2],...
#>  $ extra  :List of 1
#>  - attr(*, "class")= chr [1:3] "mimic_demo_cfg" "mimic_cfg" "src_cfg"

cols <- as_col_cfg(cfg)
index_var(head(cols))
#> $admissions
#> NULL
#> 
#> $callout
#> [1] "outcometime"
#> 
#> $caregivers
#> NULL
#> 
#> $chartevents
#> [1] "charttime"
#> 
#> $cptevents
#> [1] "chartdate"
#> 
#> $d_cpt
#> NULL
#> 
time_vars(head(cols))
#> $admissions
#> [1] "admittime" "dischtime" "deathtime" "edregtime" "edouttime"
#> 
#> $callout
#> [1] "createtime"             "updatetime"             "acknowledgetime"       
#> [4] "outcometime"            "firstreservationtime"   "currentreservationtime"
#> 
#> $caregivers
#> NULL
#> 
#> $chartevents
#> [1] "charttime" "storetime"
#> 
#> $cptevents
#> [1] "chartdate"
#> 
#> $d_cpt
#> NULL
#> 

as_id_cfg(cfg)
#> <mimic_demo_ids[3]>
#>      patient         hadm      icustay 
#> `subject_id`    `hadm_id` `icustay_id`