import.Rd
Making a dataset available to ricu consists of three steps: downloading (download_src()), importing (import_src()) and attaching (attach_src()). While downloading and importing are one-time procedures, attaching of the dataset is repeated every time the package is loaded. Briefly, downloading retrieves the raw dataset from the internet (most likely in .csv format), importing performs some preprocessing to make the data available more efficiently, and attaching sets up the data for use by the package.
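A minimal end-to-end sketch of these three steps, assuming the freely available mimic_demo source and default data directories:

library(ricu)

# one-time: fetch the raw data and convert it to .fst format
download_src("mimic_demo")
import_src("mimic_demo")

# per-session: make the imported tables available (normally triggered
# automatically when the package is loaded)
attach_src("mimic_demo")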
Usage

import_src(x, ...)

# S3 method for src_cfg
import_src(
  x,
  data_dir = src_data_dir(x),
  tables = NULL,
  force = FALSE,
  verbose = TRUE,
  ...
)

# S3 method for aumc_cfg
import_src(x, ...)

# S3 method for character
import_src(
  x,
  data_dir = src_data_dir(x),
  tables = NULL,
  force = FALSE,
  verbose = TRUE,
  cleanup = FALSE,
  ...
)

import_tbl(x, ...)

# S3 method for tbl_cfg
import_tbl(
  x,
  data_dir = src_data_dir(x),
  progress = NULL,
  cleanup = FALSE,
  ...
)
Arguments

x: Object specifying the source configuration

...: Passed to downstream methods (ultimately to readr::read_csv()/readr::read_csv_chunked()); present for generic consistency

data_dir: Directory containing the downloaded raw data, to which the imported tables are written

tables: Character vector specifying the tables to import. If NULL, all available tables are imported.

force: Logical flag; if TRUE, previously imported tables are re-imported and overwritten

verbose: Logical flag indicating whether to print progress information

cleanup: Logical flag indicating whether to remove the raw .csv files after conversion to .fst

progress: Either NULL or a progress bar object as created by progress::progress_bar()
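For example, a selective re-import that also discards the raw files could look as follows (the table name "admissions" is assumed to be part of the mimic_demo source):

import_src(
  "mimic_demo",
  data_dir = src_data_dir("mimic_demo"),
  tables = "admissions",  # restrict the import to a single table
  force = TRUE,           # overwrite a previous import
  cleanup = TRUE          # remove the raw .csv files afterwards
)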
Value

Called for side effects and returns NULL invisibly.
Details

In order to speed up data access operations, ricu does not directly use the PhysioNet-provided CSV files, but converts all data to fst::fst() format, which allows for random row and column access. Large tables are split into chunks in order to keep memory requirements reasonably low.
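The following conceptual sketch (not ricu's actual importer) illustrates what this buys: once a table is stored as .fst, arbitrary row and column subsets can be read without touching the rest of the file:

library(fst)

csv_file <- file.path(tempdir(), "events.csv")
fst_file <- sub("\\.csv$", ".fst", csv_file)

# a dummy .csv file standing in for a raw source table
write.csv(data.frame(id = seq_len(1e5), val = rnorm(1e5)), csv_file,
          row.names = FALSE)

# convert to .fst (maximum compression)
write_fst(read.csv(csv_file), fst_file, compress = 100)

# random row/column access: only rows 500 to 510 of column "val" are read
read_fst(fst_file, columns = "val", from = 500, to = 510)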
The per-dataset, one-time import step is fairly resource intensive: depending on CPU and storage system, it can take on the order of an hour to run to completion, and depending on the dataset, somewhere between 50 GB and 75 GB of temporary disk space are required, as tables are uncompressed, rows are reordered in the case of partitioned data, and the data is saved again in a storage-efficient format.
The S3 generic function import_src() performs import of an entire data source, internally calling the S3 generic function import_tbl() in order to import individual tables. Method dispatch is intended to occur on objects inheriting from src_cfg and tbl_cfg, respectively. Such objects can be generated from JSON-based configuration files, which contain information such as table names, column types or row numbers, in order to provide safety when parsing .csv files. For more information on data source configuration, refer to load_src_cfg().
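As a sketch of dispatch on such configuration objects (assuming the mimic_demo configuration ships with the package and that load_src_cfg() returns a named list of src_cfg objects):

cfg <- load_src_cfg("mimic_demo")[["mimic_demo"]]

# dispatches to the src_cfg method, equivalent to import_src("mimic_demo")
import_src(cfg, data_dir = src_data_dir("mimic_demo"))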
Current import capabilities include re-saving a .csv file to .fst at once (used for smaller sized tables), reading a large .csv file using the readr::read_csv_chunked() API while partitioning chunks and reassembling sub-partitions (used for splitting a large file into partitions), as well as re-partitioning an already partitioned table according to a new partitioning scheme. Care has been taken to keep the maximal memory requirements for this reasonably low, such that data import is feasible on laptop-class hardware.
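A conceptual sketch of such chunk-wise processing (not ricu's internal code; the chunk size and file layout are made up for illustration):

library(readr)
library(fst)

part_dir <- file.path(tempdir(), "big_table")
dir.create(part_dir, showWarnings = FALSE)

chunk_no <- 0

# stream a .csv file in chunks, writing each chunk to its own .fst partition
read_csv_chunked(
  readr_example("mtcars.csv"),
  callback = SideEffectChunkCallback$new(function(x, pos) {
    chunk_no <<- chunk_no + 1
    write_fst(x, file.path(part_dir, paste0(chunk_no, ".fst")))
  }),
  chunk_size = 10
)

list.files(part_dir)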
Examples

if (FALSE) {
  dir <- tempdir()
  list.files(dir)

  # fetch the raw data for the MIMIC-III demo source
  download_src("mimic_demo", dir)
  list.files(dir)

  # convert the downloaded .csv files to .fst
  import_src("mimic_demo", dir)
  list.files(dir)

  unlink(dir, recursive = TRUE)
}