Load concept data — load_concepts • ICU data with R

Concept objects are used in ricu to specify how a clinical concept, such as heart rate, can be loaded from a data source. Building on this abstraction, load_concepts() powers concise loading of data, with data source specific preprocessing hidden away from the user, thereby providing a data source agnostic interface to data loading. With the default value of the argument merge_data, a tabular data structure (either a ts_tbl or an id_tbl, depending on what kind of concepts are requested), inheriting from data.table, is returned, representing the data in wide format (i.e. returning concepts as columns).

load_concepts(x, ...)

# S3 method for character
load_concepts(
  x,
  src = NULL,
  concepts = NULL,
  ...,
  dict_name = "concept-dict",
  dict_dirs = NULL
)

# S3 method for integer
load_concepts(
  x,
  src = NULL,
  concepts = NULL,
  ...,
  dict_name = "concept-dict",
  dict_dirs = NULL
)

# S3 method for numeric
load_concepts(x, ...)

# S3 method for concept
load_concepts(
  x,
  src = NULL,
  aggregate = NULL,
  merge_data = TRUE,
  verbose = TRUE,
  ...
)

# S3 method for cncpt
load_concepts(x, aggregate = NULL, ..., progress = NULL)

# S3 method for num_cncpt
load_concepts(x, aggregate = NULL, ..., progress = NULL)

# S3 method for unt_cncpt
load_concepts(x, aggregate = NULL, ..., progress = NULL)

# S3 method for fct_cncpt
load_concepts(x, aggregate = NULL, ..., progress = NULL)

# S3 method for lgl_cncpt
load_concepts(x, aggregate = NULL, ..., progress = NULL)

# S3 method for rec_cncpt
load_concepts(
  x,
  aggregate = NULL,
  patient_ids = NULL,
  id_type = "icustay",
  interval = hours(1L),
  ...,
  progress = NULL
)

# S3 method for item
load_concepts(
  x,
  patient_ids = NULL,
  id_type = "icustay",
  interval = hours(1L),
  progress = NULL,
  ...
)

# S3 method for itm
load_concepts(
  x,
  patient_ids = NULL,
  id_type = "icustay",
  interval = hours(1L),
  ...
)

Arguments

x: Object specifying the data to be loaded
...: Passed to downstream methods
src: A character vector, used to subset the concepts; NULL means no subsetting
concepts: The concepts to be used or NULL in which case load_dictionary() is called
dict_name, dict_dirs: In case no concepts are passed as concepts, these are forwarded to load_dictionary() as name and file arguments
aggregate: Controls how data within concepts is aggregated
merge_data: Logical flag, specifying whether to merge concepts into wide format or return a list, each entry corresponding to a concept
verbose: Logical flag for muting informational output
progress: Either NULL, or a progress bar object as created by progress::progress_bar
patient_ids: Optional vector of patient ids to subset the fetched data with
id_type: String specifying the patient id type to return
interval: The time interval used to discretize time stamps, specified as a base::difftime() object

Value

An id_tbl/ts_tbl or a list thereof, depending on loaded concepts and the value passed as merge_data.

Details

In order to allow for a large degree of flexibility (and extensibility), which is much needed owing to the considerable heterogeneity of different data sources, several nested S3 classes are involved in representing a concept. When resolving a concept, load_concepts() follows this hierarchy of classes recursively. The hierarchy can be outlined as follows:

concept: contains many cncpt objects (of potentially differing sub-types), each comprising some meta-data and an item object
item: contains many itm objects (of potentially differing sub-types), each encoding how to retrieve a data item.

The design choice of wrapping a vector of cncpt objects in a container class concept is motivated by the requirement of having several different sub-types of cncpt objects (all inheriting from the parent type cncpt), while retaining control over how the resulting vector of objects behaves in terms of S3 generic functions. Such a vector is homogeneous with respect to the parent type, but heterogeneous with respect to sub-types.
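The container idea can be illustrated with a minimal base R sketch. This is not ricu's implementation: the classes toy_concept, toy_cncpt, toy_num and toy_fct, as well as the generic describe(), are hypothetical names made up for this illustration. The method for the container decides how the vector as a whole behaves and then dispatches on the sub-type of each element.

```r
# Toy stand-ins for the container pattern; all names are hypothetical
# and not part of ricu.
new_toy_cncpt <- function(name, sub_type) {
  structure(list(name = name), class = c(sub_type, "toy_cncpt"))
}

new_toy_concept <- function(...) {
  structure(list(...), class = "toy_concept")
}

describe <- function(x, ...) UseMethod("describe")

# Methods for individual elements, dispatching on the sub-type
describe.toy_num <- function(x, ...) paste0(x$name, ": numeric")
describe.toy_fct <- function(x, ...) paste0(x$name, ": categorical")

# Method for the container: iterate over elements and dispatch again
describe.toy_concept <- function(x, ...) {
  vapply(unclass(x), describe, character(1L), ...)
}

toy <- new_toy_concept(
  new_toy_cncpt("hr", "toy_num"),
  new_toy_cncpt("sex", "toy_fct")
)

res <- describe(toy)
```

Because the container carries its own class, methods such as printing or subsetting could be defined for it without interfering with methods of the individual elements.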

Concept

Top-level entry points are a character vector of concept names, an integer vector of concept IDs (matched against omopid fields), or a concept object. Character and integer vectors are used to subset either a concept object or an entire concept dictionary. When passing a character/integer vector as first argument, the most important further arguments at that level control from where the dictionary is taken (dict_name or dict_dirs). At concept level, the most important additional arguments control the result structure: data merging can be disabled using merge_data and data aggregation is governed by the aggregate argument.
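The entry points described above might be used as in the following sketch. It assumes that both ricu and mimic.demo are installed, that the concepts hr and glu are available in the default dictionary, and that load_dictionary() accepts the source name and a vector of concept names as its first two arguments; none of this is guaranteed by the text above.

```r
if (requireNamespace("ricu", quietly = TRUE) &&
    requireNamespace("mimic.demo", quietly = TRUE)) {

  library(ricu)

  # Concept names are resolved against the default dictionary and the
  # result is merged into a single wide table
  wide <- load_concepts(c("hr", "glu"), "mimic_demo", verbose = FALSE)

  # With merge_data = FALSE, a list with one entry per concept is returned
  lst <- load_concepts(c("hr", "glu"), "mimic_demo", merge_data = FALSE,
                       verbose = FALSE)

  # Alternatively, obtain a concept object first and pass that
  # (assumed load_dictionary() usage)
  dict <- load_dictionary("mimic_demo", c("hr", "glu"))
  from_obj <- load_concepts(dict, verbose = FALSE)
}
```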

Data aggregation is important for merging several concepts into a wide-format table, as this requires data to be unique per observation (i.e. by either id or combination of id and index). Several value types are acceptable as aggregate argument, the most important being FALSE, which disables aggregation, NULL, which auto-determines a suitable aggregation function, or a string, which is ultimately passed to dt_gforce(), where it identifies a function such as sum(), mean(), min() or max(). More information on aggregation is available in the documentation of aggregate(). If the object passed as aggregate is scalar, it is applied to all requested concepts in the same way. In order to customize aggregation per concept, a named object (with names corresponding to concepts) of the same length as the number of requested concepts may be passed.
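A short sketch of the aggregation options follows. It assumes ricu and mimic.demo are installed and that a named list is an acceptable form of the "named object" mentioned above; the choice of concepts and aggregation functions is purely illustrative.

```r
if (requireNamespace("ricu", quietly = TRUE) &&
    requireNamespace("mimic.demo", quietly = TRUE)) {

  library(ricu)

  # A scalar string is applied to all requested concepts
  all_max <- load_concepts(c("hr", "glu"), "mimic_demo", aggregate = "max",
                           verbose = FALSE)

  # Per-concept aggregation via a named list (assumed to be accepted),
  # with one entry per requested concept
  per_cncpt <- load_concepts(c("hr", "glu"), "mimic_demo",
                             aggregate = list(hr = "max", glu = "min"),
                             verbose = FALSE)

  # FALSE disables aggregation, so rows need not be unique per
  # id and index; shown here for a single concept
  raw_glu <- load_concepts("glu", "mimic_demo", aggregate = FALSE,
                           verbose = FALSE)
}
```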

Under the hood, a concept object comprises several cncpt objects with varying sub-types (for example num_cncpt, representing continuous numeric data, or fct_cncpt, representing categorical data). This implementation detail is of no further importance for understanding concept loading; for more information, please refer to the concept documentation. The only argument that is introduced at cncpt level is progress, which controls progress reporting. If called directly, the default value of NULL yields messages sent to the terminal. Internally, if called from load_concepts() at concept level (with verbose set to TRUE), a progress::progress_bar is set up in a way that allows nested messages to be captured and not interrupt progress reporting (see msg_progress()).

Item

A single cncpt object contains an item object, which in turn is composed of several itm objects with varying sub-types. The relationship of item to itm is that of concept to cncpt, and the rationale for this implementation choice is the same as previously: a container class is used to represent a vector of objects of varying sub-types, all inheriting from a common super-type. For more information on the item class, please refer to the relevant documentation. Arguments introduced at item level include patient_ids, id_type and interval. Acceptable values for interval are scalar-valued base::difftime() objects (see also helper functions such as hours()), and this argument essentially controls the time resolution of the returned time series. Of course, the limiting factor is the raw time resolution, which is on the order of hours for data sets like MIMIC-III or eICU but can be much higher for a data set like HiRID. The argument id_type is used to specify what kind of id system should be used to identify different time series in the returned data. A data set like MIMIC-III, for example, makes it possible to resolve data to 3 nested ID systems:

patient (subject_id): identifies a person
hadm (hadm_id): identifies a hospital admission (several of which are possible for a given person)
icustay (icustay_id): identifies an admission to an ICU (several of which are possible for a given hospital admission).

Acceptable argument values are strings that match ID systems as specified by the data source configuration. Finally, patient_ids is used to define a patient cohort for which data can be requested. Values may either be a vector of IDs (which are assumed to be of the same type as specified by the id_type argument) or a tabular object inheriting from data.frame, which must contain a column named after the data set-specific ID system identifier (for MIMIC-III and an id_type argument of hadm, for example, that would be hadm_id).
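The item-level arguments might be combined as in the following sketch, which assumes ricu and mimic.demo are installed and that the hr concept is available. The cohort is taken from a first load so that no IDs need to be hard-coded; the choice of five admissions and a 30 minute interval is arbitrary.

```r
if (requireNamespace("ricu", quietly = TRUE) &&
    requireNamespace("mimic.demo", quietly = TRUE)) {

  library(ricu)

  # Identify time series by hospital admission, at 30 minute resolution
  hr_hadm <- load_concepts("hr", "mimic_demo", id_type = "hadm",
                           interval = mins(30L), verbose = FALSE)

  # Use the first five admissions as cohort; the IDs are of the same
  # type as requested via id_type
  cohort <- head(unique(hr_hadm[[id_var(hr_hadm)]]), 5L)

  hr_sub <- load_concepts("hr", "mimic_demo", id_type = "hadm",
                          interval = mins(30L), patient_ids = cohort,
                          verbose = FALSE)
}
```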

Extensions

The presented hierarchy of S3 classes is designed with extensibility in mind: while the current range of functionality covers settings encountered when dealing with the included concepts and datasets, further data sets and/or clinical concepts might necessitate different behavior for data loading. For this reason, various parts in the cascade of calls to load_concepts() can be adapted to new requirements by defining new sub-classes of cncpt or itm and providing methods for the generic function load_concepts() specific to these new classes. At cncpt level, method dispatch defaults to load_concepts.cncpt() if no method specific to the new class is provided, while at itm level, no default function is available.

Roughly speaking, the semantics for the two functions are as follows:

cncpt: Called with arguments x (the current cncpt object), aggregate (controlling how aggregation per time-point and ID is handled), ... (further arguments passed to downstream methods) and progress (controlling progress reporting), this function should be able to load and aggregate data for the given concept. Usually this involves extracting the item object and calling load_concepts() again, dispatching on the item class with arguments x (the given item), arguments passed as ..., as well as progress.
itm: Called with arguments x (the current object inheriting from itm), patient_ids (NULL or a patient ID selection), id_type (a string specifying what ID system to retrieve), and interval (the time series interval), this function actually carries out the loading of individual data items, using the specified ID system, rounding times to the correct interval and subsetting on patient IDs. As return value, an object of the class specified by the target entry is expected, and all data_vars() should be named consistently, as data corresponding to multiple itm objects is concatenated in row-wise fashion as in base::rbind().
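The dispatch behavior described above (a fallback at the parent class level versus no default) can be sketched in base R. The generic load_toy() and the classes toy_cncpt, scaled_cncpt, other_cncpt, toy_itm and new_itm are hypothetical and only mimic the structure; they are not part of ricu.

```r
# Hypothetical generic mimicking the dispatch structure
load_toy <- function(x, ...) UseMethod("load_toy")

# Parent-class method, used whenever a sub-class lacks its own method
load_toy.toy_cncpt <- function(x, ...) paste("default loader for", x$name)

# A new sub-class that extends the parent behavior via NextMethod()
load_toy.scaled_cncpt <- function(x, ...) {
  paste(NextMethod(), "(then rescaled)")
}

plain  <- structure(list(name = "a"), class = c("other_cncpt", "toy_cncpt"))
scaled <- structure(list(name = "b"), class = c("scaled_cncpt", "toy_cncpt"))
orphan <- structure(list(name = "c"), class = c("new_itm", "toy_itm"))

# Falls back to the parent method
res_plain <- load_toy(plain)

# Uses the sub-class method, which in turn calls the parent method
res_scaled <- load_toy(scaled)

# Neither new_itm nor toy_itm has a method, hence dispatch fails
res_orphan <- tryCatch(
  load_toy(orphan),
  error = function(e) "no applicable method"
)
```

In this toy setting, a new sub-class on the cncpt side only needs a method where its behavior differs from the parent, whereas a new sub-class on the itm side always has to bring its own method.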

Examples

if (require(mimic.demo)) {
  # Load glucose by concept name from the default dictionary
  dat <- load_concepts("glu", "mimic_demo")

  # Construct a comparable concept by hand
  gluc <- concept("gluc",
    item("mimic_demo", "labevents", "itemid", list(c(50809L, 50931L)))
  )

  identical(load_concepts(gluc), dat)

  class(dat)
  class(load_concepts(c("sex", "age"), "mimic_demo"))
}
#> ── Loading 1 concept ───────────────────────────────────────────────────────────
#> • glu
#>   ◯ removed 1 (0.05%) of rows due to `NA` values
#>   ◯ removed 1 (0.05%) of rows due to out of range entries
#> ────────────────────────────────────────────────────────────────────────────────
#> ── Loading 1 concept ───────────────────────────────────────────────────────────
#> • gluc
#>   ◯ removed 1 (0.05%) of rows due to `NA` values
#> ────────────────────────────────────────────────────────────────────────────────
#> ── Loading 2 concepts ──────────────────────────────────────────────────────────
#> • sex
#> • age
#> ────────────────────────────────────────────────────────────────────────────────
#> [1] "id_tbl"     "data.table" "data.frame"