Data import utilities — import_src • ICU data with R

Making a dataset available to ricu consists of three steps: downloading (download_src()), importing (import_src()) and attaching (attach_src()). While downloading and importing are one-time procedures, attaching the dataset is repeated every time the package is loaded. Briefly, downloading fetches the raw dataset from the internet (most likely in .csv format), importing preprocesses the data so that it can be accessed more efficiently, and attaching sets up the data for use by the package.

import_src(x, ...)

# S3 method for src_cfg
import_src(
  x,
  data_dir = src_data_dir(x),
  tables = NULL,
  force = FALSE,
  verbose = TRUE,
  ...
)

# S3 method for aumc_cfg
import_src(x, ...)

# S3 method for character
import_src(
  x,
  data_dir = src_data_dir(x),
  tables = NULL,
  force = FALSE,
  verbose = TRUE,
  cleanup = FALSE,
  ...
)

import_tbl(x, ...)

# S3 method for tbl_cfg
import_tbl(
  x,
  data_dir = src_data_dir(x),
  progress = NULL,
  cleanup = FALSE,
  ...
)

Arguments

x: Object specifying the source configuration
...: Passed to downstream methods (ultimately to readr::read_csv() or readr::read_csv_chunked()); also present for consistency with the generic
data_dir: Directory containing the downloaded raw data; the imported data is written to the same location.
tables: Character vector specifying the tables to import. If NULL, all available tables are imported.
force: Logical flag; if TRUE, existing data will be re-imported
verbose: Logical flag indicating whether to print progress information
cleanup: Logical flag indicating whether to remove raw csv files after conversion to fst
progress: Either NULL or a progress bar as created by progress::progress_bar()

Value

Called for side effects and returns NULL invisibly.

Details

In order to speed up data access operations, ricu does not directly use the PhysioNet-provided CSV files, but converts all data to the fst format (see fst::fst()), which allows for random row and column access. Large tables are split into chunks in order to keep memory requirements reasonably low.
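As a rough illustration of what random row and column access means in practice, the following sketch writes a small data frame to an .fst file and reads back only a slice of it. The table contents and the file name are made up for illustration, and this is not meant to reflect how ricu calls fst internally.

```r
library(fst)

# Hypothetical stand-in for an imported table
tbl <- data.frame(
  id    = 1:1000,
  value = seq(0.5, 500, by = 0.5),
  unit  = "mmHg"
)

path <- tempfile(fileext = ".fst")
write_fst(tbl, path)

# Read rows 101 to 110 of two columns only; the rest of the file
# does not need to be loaded into memory
part <- read_fst(path, columns = c("id", "value"), from = 101, to = 110)

dim(part)
```

Reading a narrow slice like this is the kind of access pattern that a plain .csv file cannot serve without parsing the file from the start.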

The one-time data import step per dataset is fairly resource intensive. Depending on CPU and available storage system, it takes on the order of an hour to run to completion. Depending on the dataset, between 50 GB and 75 GB of temporary disk space are required, as tables are uncompressed, rows are reordered in the case of partitioned data, and the data is then saved again in a storage-efficient format.

The S3 generic function import_src() performs the import of an entire data source, internally calling the S3 generic function import_tbl() to import individual tables. Method dispatch is intended to occur on objects inheriting from src_cfg and tbl_cfg, respectively. Such objects can be generated from JSON-based configuration files, which contain information such as table names, column types or row numbers, in order to provide safety when parsing .csv files. For more information on data source configuration, refer to load_src_cfg().
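The split between a source-level and a table-level generic can be sketched with plain S3 classes. All names below (toy_import_src(), toy_import_tbl(), toy_src_cfg, toy_tbl_cfg, new_toy_tbl()) are hypothetical and only mimic the structure described above; the actual ricu configuration objects carry considerably more information.

```r
# Hypothetical generics mirroring the import_src()/import_tbl() split
toy_import_src <- function(x, ...) UseMethod("toy_import_src")
toy_import_tbl <- function(x, ...) UseMethod("toy_import_tbl")

# Source-level method: iterate over the per-table configurations
toy_import_src.toy_src_cfg <- function(x, ...) {
  for (tbl in x$tables) toy_import_tbl(tbl, ...)
  invisible(NULL)
}

# Table-level method: here it only reports what would be imported
toy_import_tbl.toy_tbl_cfg <- function(x, ...) {
  message("importing table `", x$name, "` with columns: ",
          paste(x$cols, collapse = ", "))
  invisible(NULL)
}

# Hypothetical constructor for a table configuration
new_toy_tbl <- function(name, cols) {
  structure(list(name = name, cols = cols), class = "toy_tbl_cfg")
}

cfg <- structure(
  list(
    name   = "toy_source",
    tables = list(
      new_toy_tbl("patients", c("id", "age")),
      new_toy_tbl("labs", c("id", "time", "value"))
    )
  ),
  class = "toy_src_cfg"
)

# Dispatches on toy_src_cfg, which in turn dispatches on each toy_tbl_cfg
res <- toy_import_src(cfg)
```

With this layout, a data source that needs special treatment can supply its own method at either level without touching the generic code path.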

Current import capabilities include: re-saving a .csv file to .fst in one go (used for smaller tables); reading a large .csv file in chunks using the readr::read_csv_chunked() API, partitioning each chunk and reassembling the resulting sub-partitions (used for splitting a large file into partitions); and re-partitioning an already partitioned table according to a new partitioning scheme. Care has been taken to keep peak memory requirements reasonably low, such that data import is feasible on laptop-class hardware.
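The chunk-wise partitioning strategy can be sketched as follows. This is a simplified illustration under stated assumptions rather than the ricu implementation: the table, the single partition break at id 50, the chunk size, and the use of .rds files for sub-partitions (instead of .fst) are all chosen for the example.

```r
library(readr)

# Hypothetical raw table: 1000 shuffled rows, to be partitioned by `id`
csv <- tempfile(fileext = ".csv")
raw <- data.frame(id = rep(1:100, each = 10), value = seq_len(1000))
raw <- raw[sample(nrow(raw)), ]
write_csv(raw, csv)

breaks  <- 50   # ids <= 50 go to partition 1, all others to partition 2
out_dir <- tempfile("parts")
dir.create(out_dir)

# Called once per chunk, so only one chunk is held in memory at a time;
# each chunk is split and written out as one sub-partition per partition
split_chunk <- function(chunk, pos) {
  grp <- findInterval(chunk$id, breaks, left.open = TRUE) + 1L
  for (i in unique(grp)) {
    file <- file.path(out_dir, sprintf("part%d_pos%d.rds", i, pos))
    saveRDS(as.data.frame(chunk[grp == i, ]), file)
  }
}

read_csv_chunked(
  csv,
  callback   = SideEffectChunkCallback$new(split_chunk),
  chunk_size = 250,
  col_types  = cols(id = col_integer(), value = col_integer())
)

# Reassemble the sub-partitions belonging to each partition
partitions <- lapply(1:2, function(i) {
  files <- list.files(out_dir, pattern = sprintf("^part%d_", i),
                      full.names = TRUE)
  do.call(rbind, lapply(files, readRDS))
})

vapply(partitions, nrow, integer(1))
```

Peak memory here is bounded by the chunk size during reading and by the size of a single partition during reassembly, which is the property that makes this kind of import feasible on modest hardware.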

Examples

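if (FALSE) {

# Use a dedicated subdirectory so that the final cleanup does not
# remove the session-wide temporary directory itself
dir <- file.path(tempdir(), "ricu_demo")
dir.create(dir)
list.files(dir)

# Step 1: fetch the raw data
download_src("mimic_demo", dir)
list.files(dir)

# Step 2: convert the raw data to fst
import_src("mimic_demo", dir)
list.files(dir)

unlink(dir, recursive = TRUE)

}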