load_src_cfg.Rd
For a data source to be accessible by ricu, a configuration object
inheriting from the S3 class src_cfg is required. Such objects can be
generated from JSON-based configuration files using load_src_cfg().
Information encoded by this configuration object includes available ID
systems (mainly for use in change_id()), default column names per table
for columns with special meaning (such as index column, value columns, unit
columns, etc.), as well as a specification used for initial setup of the
dataset, which includes file names and column names alongside their data
types.
load_src_cfg(src = NULL, name = "data-sources", cfg_dirs = NULL)
src: (Optional) name(s) of data sources used for subsetting
name: String-valued name of a config file which will be looked up in the default config directories
cfg_dirs: Additional directory/ies to look for configuration files
A list of data source configurations as src_cfg objects.
Configuration files are looked up as files called name with the added suffix
.json, starting with the directory (or directories) supplied as the cfg_dirs
argument, followed by the directory specified by the environment variable
RICU_CONFIG_PATH, and finally in extdata/config of the package install
directory. If files with matching names are found in multiple places, they
are concatenated such that, in case of name clashes, the earlier hits take
precedence over the later ones. The following JSON code blocks show excerpts
of the config file available at
system.file("extdata", "config", "data-sources.json", package = "ricu")
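The lookup order and the precedence rule described above can be sketched in base R. This is only an illustration, not ricu's implementation: `candidate_cfg_files()` is a hypothetical helper, it assumes that RICU_CONFIG_PATH holds a single directory, and the toy lists `early` and `late` (including the source name `other_src`) are made up for the example.

```r
# Hypothetical helper (not part of ricu): list existing candidate config
# files in search order, highest precedence first.
candidate_cfg_files <- function(name = "data-sources", cfg_dirs = NULL) {
  # Assumption: the environment variable holds a single directory.
  env_dir <- Sys.getenv("RICU_CONFIG_PATH", unset = NA_character_)
  # system.file() returns "" if the package is not installed.
  pkg_dir <- system.file("extdata", "config", package = "ricu")
  dirs <- c(cfg_dirs, env_dir, pkg_dir)
  dirs <- dirs[!is.na(dirs) & nzchar(dirs)]
  files <- file.path(dirs, paste0(name, ".json"))
  files[file.exists(files)]
}

# Toy stand-ins for configs read from an earlier and a later location.
early <- list(mimic_demo = list(url = "user-supplied"))
late  <- list(mimic_demo = list(url = "package default"),
              other_src  = list(url = "package default"))

# Concatenate in search order, then keep only the first entry per name,
# so that earlier hits win in case of a name clash.
merged <- c(early, late)
merged <- merged[!duplicated(names(merged))]
```

Here `merged` keeps the `mimic_demo` entry from `early` and the `other_src` entry from `late`.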
A data source configuration entry in a config file starts with a name,
followed by optional entries class_prefix and further (variable)
key-value pairs, such as a URL. For more information on class_prefix,
please refer to the end of this section. Further entries include id_cfg
and tables, which are explained in more detail below. In outline, this
gives the following JSON object for the data source mimic_demo:
{
  "name": "mimic_demo",
  "class_prefix": ["mimic_demo", "mimic"],
  "url": "https://physionet.org/files/mimiciii-demo/1.4",
  "id_cfg": {
    ...
  },
  "tables": {
    ...
  }
}
The id_cfg entry is used to specify the available ID systems for a data
source and how they relate to each other. An ID system within the context
of ricu is a patient identifier, of which typically several are present in
a data set. In MIMIC-III, for example, three ID systems are available:
patient IDs (subject_id), hospital admission IDs (hadm_id) and ICU stay
IDs (icustay_id). Furthermore, there is a one-to-many relationship between
subject_id and hadm_id, as well as between hadm_id and icustay_id.
Defining an ID system requires a name, a position entry which orders the
ID systems by their cardinality, a table entry, and the column
specifications id, start and end, which define how the IDs themselves,
together with their start and end times, can be loaded from that table.
This gives the following specification for the ICU stay ID system in
MIMIC-III:
{
  "icustay": {
    "id": "icustay_id",
    "position": 3,
    "start": "intime",
    "end": "outtime",
    "table": "icustays"
  }
}
Tables are defined by a name and entries files, defaults, and cols,
as well as optional entries num_rows and partitioning. The files entry
is expected to be a character vector of file names. Throughout MIMIC-III, a
single .csv file corresponds to a table, but in HiRID, for example, some
tables are distributed over several partitions. The defaults entry consists
of key-value pairs identifying columns in a table with special meaning,
such as the default index column or the set of all columns that represent
timestamps. As an example, the table entry for the chartevents table in
MIMIC-III is a JSON object like:
{
  "chartevents": {
    "files": "CHARTEVENTS.csv.gz",
    "defaults": {
      "index_var": "charttime",
      "val_var": "valuenum",
      "unit_var": "valueuom",
      "time_vars": ["charttime", "storetime"]
    },
    "num_rows": 330712483,
    "cols": {
      ...
    },
    "partitioning": {
      "col": "itemid",
      "breaks": [127, 210, 425, 549, 643, 741, 1483, 3458, 3695, 8440,
                 8553, 220274, 223921, 224085, 224859, 227629]
    }
  }
}
The optional num_rows entry is used as a sanity check when importing data
(see import_src()); the check is skipped if this entry is missing from the
data source configuration. The remaining table entry, partitioning, is also
optional: if it is missing, the table is not partitioned, and if it is
present, the table is partitioned accordingly on import (see import_src()).
Specifying a partitioning requires two entries, col and breaks, where the
former denotes a column and the latter a numeric vector used to construct
the intervals into which col is binned. As such, col is currently required
to be of numeric type. A partitioning entry as in the example above assigns
rows with itemid 1 through 126 to partition 1, 127 through 209 to partition
2, and so on up to partition 17.
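The binning described above can be reproduced with base R's `findInterval()`. This is a minimal sketch of the interval logic only; whether ricu uses this function internally is an assumption, and the `itemid` values are arbitrary test points chosen around the break boundaries.

```r
# Breaks copied from the chartevents example above (16 breaks, 17 bins).
breaks <- c(127, 210, 425, 549, 643, 741, 1483, 3458, 3695, 8440,
            8553, 220274, 223921, 224085, 224859, 227629)

# Arbitrary item IDs around the first and last break points.
itemid <- c(1, 126, 127, 209, 210, 227628, 227629)

# findInterval() counts how many breaks are <= each value; adding 1
# turns that count into a 1-based partition number.
partition <- findInterval(itemid, breaks) + 1L
partition
```

With these inputs, item IDs 1 and 126 fall into partition 1, 127 and 209 into partition 2, 210 into partition 3, and 227629 into partition 17, matching the description above.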
Each column specification is keyed by the column name that will be used by
ricu and consists of a name entry (the column name as it appears in the raw
data, see the example below) and a spec entry. The spec entry is expected
to be the name of a column specification function of the readr package (see
readr::cols()) and all further entries in a cols object are used as
arguments to the readr column specification. For the admissions table of
MIMIC-III, the columns hadm_id and admittime are represented by:
{
  ...,
  "hadm_id": {
    "name": "HADM_ID",
    "spec": "col_integer"
  },
  "admittime": {
    "name": "ADMITTIME",
    "spec": "col_datetime",
    "format": "%Y-%m-%d %H:%M:%S"
  },
  ...
}
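To illustrate how such an entry could map onto a readr column specification, the following sketch turns the admittime entry into a call to the function named by spec. It is a sketch under assumptions, not ricu's implementation: the entry is written by hand as an R list rather than parsed from JSON, and the readr call is only evaluated if readr is installed.

```r
# The admittime entry from above, written by hand as an R list.
col <- list(name = "ADMITTIME", spec = "col_datetime",
            format = "%Y-%m-%d %H:%M:%S")

# Everything except `name` and `spec` is treated as an argument to the
# readr column specification function named by `spec`.
args <- col[setdiff(names(col), c("name", "spec"))]

# Look up the function by name and call it, only if readr is available.
if (requireNamespace("readr", quietly = TRUE)) {
  spec_fun <- getExportedValue("readr", col$spec)
  collector <- do.call(spec_fun, args)
}
```

For the hadm_id entry, `args` would be an empty list and the call reduces to `readr::col_integer()`.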
Internally, a src_cfg object consists of further S3 classes, which are
instantiated when loading a JSON source configuration file. Functions for
creating and manipulating src_cfg and related objects are marked
internal, but a brief overview is given here nevertheless:
src_cfg: wraps objects id_cfg, col_cfg and optionally tbl_cfg
id_cfg: contains information on ID systems and is created from id_cfg entries in config files
col_cfg: contains column default settings represented by defaults entries in table configuration blocks
tbl_cfg: used when importing data and therefore encompasses information in files, num_rows and cols entries of table configuration blocks
A src_cfg can be instantiated without a corresponding tbl_cfg but
consequently cannot be used for data import (see import_src()). In that
sense, the table config entries files and cols are optional as well, with
the restriction that the data source has to be already available in .fst
format. An example of such a slimmed-down config file is available at
system.file("extdata", "config", "demo-sources.json", package = "ricu")
The class_prefix entry in a data source configuration is used to create
sub-classes of the src_cfg, id_cfg, col_cfg and tbl_cfg classes and is
passed on to the constructors of src_env (new_src_env()) and src_tbl
(new_src_tbl()) objects. As an example, for the above class_prefix value
of c("mimic_demo", "mimic"), the corresponding src_cfg will be assigned
classes c("mimic_demo_cfg", "mimic_cfg", "src_cfg") and consequently the
src_tbl objects will inherit from "mimic_demo_tbl", "mimic_tbl" and
"src_tbl". This can be used to adapt the behavior of the S3 generic
functions involved to the specifics of the different data sources. An
example of this is how load_difftime() uses these sub-classes to smooth out
different time-stamp representations. Furthermore, this design was chosen
with extensibility in mind. Currently, download_src() is designed around
data sources hosted on PhysioNet, but in order to include a dataset external
to PhysioNet, the download_src() generic can simply be extended for the new
class.
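The way a class_prefix translates into S3 classes, and how a generic can then be specialized per data source, can be sketched in base R. The generic `describe_src()` and its methods are hypothetical and exist only for this illustration; they are not ricu functions, and the class vectors are built by hand rather than by ricu's constructors.

```r
prefix <- c("mimic_demo", "mimic")

# Append a suffix to each prefix and add the base class last.
cfg_class <- c(paste0(prefix, "_cfg"), "src_cfg")
tbl_class <- c(paste0(prefix, "_tbl"), "src_tbl")

# Hypothetical generic with a default for all sources and a more
# specific method for anything carrying the "mimic" prefix.
describe_src <- function(x, ...) UseMethod("describe_src")
describe_src.src_cfg <- function(x, ...) "generic data source"
describe_src.mimic_cfg <- function(x, ...) "MIMIC-style data source"

# A toy object carrying the derived classes.
obj <- structure(list(name = "mimic_demo"), class = cfg_class)
describe_src(obj)
```

Because no method exists for `mimic_demo_cfg`, dispatch falls through to the `mimic_cfg` method. Supporting a new data source in this scheme only requires adding a method for its class.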
cfg <- load_src_cfg("mimic_demo")
str(cfg, max.level = 1L)
#> List of 1
#> $ mimic_demo:List of 6
#> ..- attr(*, "class")= chr [1:3] "mimic_demo_cfg" "mimic_cfg" "src_cfg"
cfg <- cfg[["mimic_demo"]]
str(cfg, max.level = 1L)
#> List of 6
#> $ name : chr "mimic_demo"
#> $ prefix : chr [1:2] "mimic_demo" "mimic"
#> $ id_cfg : mmc_dm_d [1:3] `subject_id`, `hadm_id`, `icustay_id`
#> $ col_cfg: mmc_dm_c [1:25] [0, 0, 5, 0, 1], [0, 1, 6, 0, 1], [1, 0, 0, 0, 1], [0,...
#> $ tbl_cfg: mmc_dm_t [1:25] [?? ✖ 19; 1], [?? ✖ 24; 1], [?? ✖ 4; 1], [?? ✖ 15; 2],...
#> $ extra :List of 1
#> - attr(*, "class")= chr [1:3] "mimic_demo_cfg" "mimic_cfg" "src_cfg"
cols <- as_col_cfg(cfg)
index_var(head(cols))
#> $admissions
#> NULL
#>
#> $callout
#> [1] "outcometime"
#>
#> $caregivers
#> NULL
#>
#> $chartevents
#> [1] "charttime"
#>
#> $cptevents
#> [1] "chartdate"
#>
#> $d_cpt
#> NULL
#>
time_vars(head(cols))
#> $admissions
#> [1] "admittime" "dischtime" "deathtime" "edregtime" "edouttime"
#>
#> $callout
#> [1] "createtime" "updatetime" "acknowledgetime"
#> [4] "outcometime" "firstreservationtime" "currentreservationtime"
#>
#> $caregivers
#> NULL
#>
#> $chartevents
#> [1] "charttime" "storetime"
#>
#> $cptevents
#> [1] "chartdate"
#>
#> $d_cpt
#> NULL
#>
as_id_cfg(cfg)
#> <mimic_demo_ids[3]>
#> patient hadm icustay
#> `subject_id` `hadm_id` `icustay_id`