load_concepts.Rd
Concept objects are used in ricu
as a way to specify how a clinical
concept, such as heart rate, can be loaded from a data source. Building on
this abstraction, load_concepts()
powers concise loading of data with
data source specific preprocessing hidden away from the user, thereby
providing a data source agnostic interface to data loading. With the default
value of the argument merge_data
, a tabular data structure (either a
ts_tbl
or an id_tbl
, depending on what kind of
concepts are requested), inheriting from
data.table
, is returned, representing the data
in wide format (i.e. returning concepts as columns).
load_concepts(x, ...)
# S3 method for character
load_concepts(
  x,
  src = NULL,
  concepts = NULL,
  ...,
  dict_name = "concept-dict",
  dict_dirs = NULL
)
# S3 method for integer
load_concepts(
  x,
  src = NULL,
  concepts = NULL,
  ...,
  dict_name = "concept-dict",
  dict_dirs = NULL
)
# S3 method for numeric
load_concepts(x, ...)
# S3 method for concept
load_concepts(
  x,
  src = NULL,
  aggregate = NULL,
  merge_data = TRUE,
  verbose = TRUE,
  ...
)
# S3 method for cncpt
load_concepts(x, aggregate = NULL, ..., progress = NULL)
# S3 method for num_cncpt
load_concepts(x, aggregate = NULL, ..., progress = NULL)
# S3 method for unt_cncpt
load_concepts(x, aggregate = NULL, ..., progress = NULL)
# S3 method for fct_cncpt
load_concepts(x, aggregate = NULL, ..., progress = NULL)
# S3 method for lgl_cncpt
load_concepts(x, aggregate = NULL, ..., progress = NULL)
# S3 method for rec_cncpt
load_concepts(
  x,
  aggregate = NULL,
  patient_ids = NULL,
  id_type = "icustay",
  interval = hours(1L),
  ...,
  progress = NULL
)
# S3 method for item
load_concepts(
  x,
  patient_ids = NULL,
  id_type = "icustay",
  interval = hours(1L),
  progress = NULL,
  ...
)
# S3 method for itm
load_concepts(
  x,
  patient_ids = NULL,
  id_type = "icustay",
  interval = hours(1L),
  ...
)
x: Object specifying the data to be loaded
...: Passed to downstream methods
src: A character vector, used to subset the concepts; NULL means no subsetting
concepts: The concepts to be used or NULL, in which case load_dictionary() is called
dict_name, dict_dirs: In case no concepts are passed as concepts, these are forwarded to load_dictionary() as name and file arguments
aggregate: Controls how data within concepts is aggregated
merge_data: Logical flag, specifying whether to merge concepts into wide format or return a list, each entry corresponding to a concept
verbose: Logical flag for muting informational output
progress: Either NULL, or a progress bar object as created by progress::progress_bar
patient_ids: Optional vector of patient ids to subset the fetched data with
id_type: String specifying the patient id type to return
interval: The time interval used to discretize time stamps with, specified as a base::difftime() object
An id_tbl
/ts_tbl
or a list thereof, depending on loaded
concepts and the value passed as merge_data
.
In order to allow for a large degree of flexibility (and extensibility),
which is much needed owing to considerable heterogeneity presented by
different data sources, several nested S3 classes are involved in
representing a concept and load_concepts()
follows this hierarchy of
classes recursively when
resolving a concept. An outline of this hierarchy can be described as follows:
concept
: contains many cncpt
objects (of potentially differing
sub-types), each comprising some meta-data and an item
object
item
: contains many itm
objects (of potentially differing
sub-types), each encoding how to retrieve a data item.
The design choice for wrapping a vector of cncpt
objects with a container
class concept
is motivated by the requirement of having several different
sub-types of cncpt
objects (all inheriting from the parent type cncpt
),
while retaining control over how this homogeneous w.r.t. parent type, but
heterogeneous w.r.t. sub-type vector of objects behaves in terms of S3
generic functions.
Top-level entry points are either a character vector of concept names or an
integer vector of concept IDs (matched against omopid
fields), which are
used to subset a concept
object or an entire concept dictionary, or a concept
object. When passing a
character/integer vector as first argument, the most important further
arguments at that level control from where the dictionary is taken
(dict_name
or dict_dirs
). At concept
level, the most important
additional arguments control the result structure: data merging can be
disabled using merge_data
and data aggregation is governed by the
aggregate
argument.
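The two result structures described above can be illustrated with a short sketch. It assumes that the ricu and mimic.demo packages are installed and that the default concept dictionary provides the concepts hr and glu; the variable names wide and parts are only illustrative.

```r
if (require(ricu) && require(mimic.demo)) {

  # Character vector of concept names, resolved against the default
  # dictionary (dict_name = "concept-dict")
  wide <- load_concepts(c("hr", "glu"), "mimic_demo", verbose = FALSE)

  # Wide format: requested concepts appear as columns
  colnames(wide)

  # With merge_data = FALSE, a list is returned instead, with each entry
  # corresponding to a concept
  parts <- load_concepts(c("hr", "glu"), "mimic_demo", merge_data = FALSE,
                         verbose = FALSE)
  length(parts)
}
```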
Data aggregation is important for merging several concepts into a
wide-format table, as this requires data to be unique per observation (i.e.
by either id or combination of id and index). Several value types are
acceptable as aggregate
argument, the most important being FALSE
, which
disables aggregation, NULL, which auto-determines a suitable aggregation
function or a string which is ultimately passed to dt_gforce()
where it
identifies a function such as sum()
, mean()
, min()
or max()
. More
information on aggregation is available in the documentation of aggregate().
If the object passed as aggregate
is scalar, it is applied to all
requested concepts in the same way. In order to customize aggregation per
concept, a named object (with names corresponding to concepts) of the same
length as the number of requested concepts may be passed.
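As a sketch of the aggregation options discussed above, the following calls pass a scalar string, a named list and FALSE as aggregate. It assumes ricu and mimic.demo are installed and that hr and glu are available in the default dictionary; the choice of "mean" and "max" is arbitrary and only serves as illustration.

```r
if (require(ricu) && require(mimic.demo)) {

  # Scalar string: the same aggregation function is used for all concepts
  mx <- load_concepts(c("hr", "glu"), "mimic_demo", aggregate = "max",
                      verbose = FALSE)

  # Named list with one entry per requested concept
  per <- load_concepts(c("hr", "glu"), "mimic_demo",
                       aggregate = list(hr = "mean", glu = "max"),
                       verbose = FALSE)

  # aggregate = FALSE disables aggregation; merging is switched off as well,
  # since merging requires data to be unique per observation
  raw <- load_concepts("glu", "mimic_demo", aggregate = FALSE,
                       merge_data = FALSE, verbose = FALSE)
}
```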
Under the hood, a concept
object comprises several cncpt
objects
with varying sub-types (for example num_cncpt
, representing continuous
numeric data or fct_cncpt
representing categorical data). This
implementation detail is of no further importance for understanding concept
loading and for more information, please refer to the
concept
documentation. The only argument that is introduced
at cncpt
level is progress
, which controls progress reporting. If
called directly, the default value of NULL
yields messages, sent to the
terminal. Internally, if called from load_concepts()
at concept
level
(with verbose
set to TRUE
), a progress::progress_bar is set up in a
way that allows nested messages to be captured and not interrupt progress
reporting (see msg_progress()
).
A single cncpt
object contains an item
object, which in turn is
composed of several itm
objects with varying sub-types, the relationship
item
to itm
being that of concept
to cncpt
and the rationale for
this implementation choice is the same as previously: a container class
used for representing a vector of objects of varying sub-types, all inheriting
from a common super-type. For more information on the item
class, please
refer to the relevant documentation. Arguments introduced at item
level include patient_ids
, id_type
and interval
. Acceptable values for
interval
are scalar-valued base::difftime()
objects (see also helper
functions such as hours()
) and this argument essentially controls the
time-resolution of the returned time-series. Of course, the limiting factor is
raw time resolution, which is on the order of hours for data sets like
MIMIC-III or
eICU but can be much higher for a
data set like HiRID. The argument
id_type
is used to specify what kind of id system should be used to
identify different time series in the returned data. A data set like
MIMIC-III, for example, makes possible the resolution of data to 3 nested
ID systems:
patient
(subject_id
): identifies a person
hadm
(hadm_id
): identifies a hospital admission (several of which are
possible for a given person)
icustay
(icustay_id
): identifies an admission to an ICU and again has
a one-to-many relationship to hadm
.
Acceptable argument values are strings that match ID systems as specified
by the data source configuration. Finally, patient_ids
is used to define a patient cohort for which data can be requested. Values
may either be a vector of IDs (which are assumed to be of the same type as
specified by the id_type
argument) or a tabular object inheriting from
data.frame
, which must contain a column named after the data set-specific
ID system identifier (for MIMIC-III and an id_type
argument of hadm
,
for example, that would be hadm_id
).
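The item-level arguments can be tried out as in the following sketch, which assumes ricu and mimic.demo are installed. The cohort is derived from previously loaded data so that the IDs are of the type requested via id_type; the names icu, adm, cohort and sub are illustrative only.

```r
if (require(ricu) && require(mimic.demo)) {

  # Defaults: id_type = "icustay" and interval = hours(1L)
  icu <- load_concepts("glu", "mimic_demo", verbose = FALSE)

  # Hospital admission IDs and a 30 minute time grid
  adm <- load_concepts("glu", "mimic_demo", id_type = "hadm",
                       interval = mins(30L), verbose = FALSE)
  id_vars(adm)
  interval(adm)

  # Restrict loading to a small cohort of ICU stays
  cohort <- head(unique(icu[[id_vars(icu)]]), 3L)
  sub <- load_concepts("glu", "mimic_demo", patient_ids = cohort,
                       verbose = FALSE)
}
```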
The presented hierarchy of S3 classes is designed with extensibility in
mind: while the current range of functionality covers settings encountered
when dealing with the included concepts and datasets, further data sets
and/or clinical concepts might necessitate different behavior for data
loading. For this reason, various parts in the cascade of calls to
load_concepts()
can be adapted for new requirements by defining new sub-classes
to cncpt
or itm
and providing methods for the generic function
load_concepts()
specific to these new classes. At cncpt
level, method
dispatch defaults to load_concepts.cncpt()
if no method specific to the
new class is provided, while at itm
level, no default function is
available.
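The dispatch behavior described above follows from how S3 classes work in R and can be sketched without ricu. In the following base R example, load_sketch() is a hypothetical stand-in for the load_concepts() generic, and scaled_cncpt and my_itm are hypothetical new sub-classes; none of these names are part of ricu.

```r
# Hypothetical stand-in for the load_concepts() generic
load_sketch <- function(x, ...) UseMethod("load_sketch")

# Fallback at cncpt level, analogous to load_concepts.cncpt()
load_sketch.cncpt <- function(x, ...) {
  paste("default cncpt loading of", x$name)
}

# Method for a hypothetical new sub-class, delegating to the parent method
load_sketch.scaled_cncpt <- function(x, ...) {
  paste("custom step, then", NextMethod())
}

plain  <- structure(list(name = "hr"), class = c("num_cncpt", "cncpt"))
custom <- structure(list(name = "hr"), class = c("scaled_cncpt", "cncpt"))

# No method for num_cncpt is defined here, so the cncpt method is used
res_plain <- load_sketch(plain)

# Dispatches on scaled_cncpt first, then continues with the cncpt method
res_custom <- load_sketch(custom)

# Without a fallback method (as at itm level), dispatch fails
res_itm <- tryCatch(
  load_sketch(structure(list(), class = "my_itm")),
  error = function(e) "no applicable method"
)
```

A new sub-class therefore only needs its own method where its behavior differs from the parent class.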
Roughly speaking, the semantics for the two functions are as follows:
cncpt
: Called with arguments x
(the current cncpt
object),
aggregate
(controlling how aggregation per time-point and ID is
handled), ...
(further arguments passed to downstream methods) and
progress
(controlling progress reporting), this function should be able
to load and aggregate data for the given concept. Usually this involves
extracting the item
object and calling load_concepts()
again,
dispatching on the item
class with arguments x
(the given item
),
arguments passed as ...
, as well as progress
.
itm
: Called with arguments x
(the current object inheriting from
itm
), patient_ids
(NULL
or a patient ID selection), id_type
(a
string specifying what ID system to retrieve), and interval
(the time
series interval), this function actually carries out the loading of
individual data items, using the specified ID system, rounding times to
the correct interval and subsetting on patient IDs. As return value, an
object of class as specified by the target
entry is expected and all
data_vars()
should be named consistently, as data corresponding to
multiple itm
objects is concatenated in row-wise fashion as in
base::rbind()
.
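The naming requirement for itm-level results can be illustrated with plain data frames. In this sketch, the data frames itm_a, itm_b and itm_c are hypothetical stand-ins for the id_tbl/ts_tbl objects returned by itm methods, and the column names and values are made up for illustration.

```r
# Hypothetical results of loading two itm objects of the same concept
itm_a <- data.frame(icustay_id = c(1L, 1L), charttime = c(0, 1),
                    glu = c(110, 95))
itm_b <- data.frame(icustay_id = 2L, charttime = 0, glu = 130)

# Row-wise concatenation works, as all columns are named consistently
combined <- rbind(itm_a, itm_b)

# A differently named data column prevents concatenation
itm_c <- data.frame(icustay_id = 3L, charttime = 0, glucose = 90)
mismatch <- tryCatch(rbind(itm_a, itm_c), error = function(e) "error")
```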
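if (require(mimic.demo)) {
  # load the glucose concept by name from the default concept dictionary
  dat <- load_concepts("glu", "mimic_demo")

  # define a concept ad hoc from lab item IDs
  gluc <- concept("gluc",
    item("mimic_demo", "labevents", "itemid", list(c(50809L, 50931L)))
  )

  # compare the ad hoc concept with the dictionary-based result
  identical(load_concepts(gluc), dat)

  class(dat)
  class(load_concepts(c("sex", "age"), "mimic_demo"))
}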
#> ── Loading 1 concept ───────────────────────────────────────────────────────────
#> • glu
#> ◯ removed 1 (0.05%) of rows due to `NA` values
#> ◯ removed 1 (0.05%) of rows due to out of range entries
#> ────────────────────────────────────────────────────────────────────────────────
#> ── Loading 1 concept ───────────────────────────────────────────────────────────
#> • gluc
#> ◯ removed 1 (0.05%) of rows due to `NA` values
#> ────────────────────────────────────────────────────────────────────────────────
#> ── Loading 2 concepts ──────────────────────────────────────────────────────────
#> • sex
#> • age
#> ────────────────────────────────────────────────────────────────────────────────
#> [1] "id_tbl" "data.table" "data.frame"