library(EDCimport)
library(dplyr)
db = edc_example(N=200) %>%
edc_unify_subjid() %>%
edc_clean_names()
db
#> ── EDCimport database ──────────────────────────────────────────────────────────
load_database(db)
Introduction
You imported your database, but now you need to check it for errors and inconsistencies.
There are a lot of ways to do so, so EDCimport provides functions for a few concepts.
As in previous vignettes, we will be using edc_example()
, but in the real world you should use EDC reading functions. See vignette("reading")
to see how.
Data warning
Data errors
The primary and most valuable data-checking tool in EDCimport is edc_data_warn()
.
Simply use dplyr::filter()
to identify problematic or inconsistent rows and pipe them into the function.
For example, let’s say that in our study:
Patients should be older than 25
Adverse event grades should be between 1 and 5.
Patients in the treatment arm should not have
data1$date1
before 2010-04-10
Here’s how you check these conditions:
enrol %>%
filter(age<25) %>%
edc_data_warn("Patients should be >25yo", issue_n=1)
#> Warning: Issue #01: Patients should be >25yo (2 patients: #18 and #59)
ae %>%
filter(aegr<1 | aegr>5) %>%
edc_data_warn("Incorrect adverse event grade", issue_n=2)
data1 %>%
edc_left_join(enrol) %>%
filter(arm=="Trt") %>%
filter(date1<"2010-04-10") %>%
edc_data_warn("Treated patients should have been seen later", issue_n=3)
#> Warning: Issue #03: Treated patients should have been seen later (9 patients: #1, #45,
#> #64, #69, #82, …)
You can implement these checks according to your Data Validation Plan, ensuring they run after every export. Any failed check will produce a warning in your R console. Once the database is corrected, the warnings will no longer appear (e.g. issue_n=2
).
After running all your checks, you can use edc_data_warnings()
to get a summary of all detected issues.
edc_data_warnings()
#> # A tibble: 2 × 4
#> issue_n message subjid fun
#> <chr> <chr> <list> <chr>
#> 1 01 Patients should be >25yo <chr [2]> cli_warn
#> 2 03 Treated patients should have been seen later <chr [9]> cli_warn
Fatal error
If one check is so mandatory that you cannot work in a database where it fails, use edc_data_stop()
instead.
For example, you can use it to check that some variable construction didn’t go wrong:
Duplicate-free dataset assertion
If you work with multiple datasets, your code probably include a lot of joins. As you may have painfully discovered, joining data carries a high risk of altering the data layout and resulting in multiple rows per patient.
This is why you should always include assert_no_duplicate()
in your pipeline if you expect only one row per patient.
enrol %>%
assert_no_duplicate() %>%
count(arm)
#> # A tibble: 2 × 2
#> arm n
#> <chr> <int>
#> 1 Ctl 99
#> 2 Trt 101
enrol %>%
edc_left_join(data1) %>% #oopsie
assert_no_duplicate() %>%
count(arm)
#> Error in `assert_no_duplicate()`:
#> ! Duplicate on column "subjid" for values 1, 2, 3, 4, 5, 6, 7, 8, 9, and
#> 10.
Tip
This is the Fail Fast principle: you’d better have an error in your R script than in your analysis report.
Last-news table
If your analysis involves a survival endpoint, you likely have a follow-up dataset that includes the vital status as of the last visit date.
However, in real-world scenarios, this dataset might not be accurately filled, and some patient can have other data after the date of last visit.
The lastnews_table()
function calculates the actual date of the last recorded information for each patient (SUBJID
), based on all Date/Datetime columns across all datasets.
Currently, edc_example()
does not include an explicit table for this scenario, so let’s consider the following example:
data3$date10
is your last visit date in your followup dataset. It is the origin you shouldprefer
if there is a tie and the reference to identify any inconsistency.data1
is a dataset of protocolary dates, such as planned medical visits. You should ignore all columns in this dataset, as they do not pertain to the patient’s last knowndate8
anddate9
are dates when treatments were administered. They imply that the patient was alive at that time. If a date is afterdate10
, it means that the survival time is underestimated.
Here is how to parameterize lastnews_table()
to fit this scenario:
lastnews_table(prefer="date10", except="data1", show_delta=TRUE) %>%
mutate(delta=round(delta)) %>%
arrange(desc(delta))
#> # A tibble: 200 × 8
#> subjid last_date origin_data origin_col origin_label preferred_last_date
#> <dbl> <date> <chr> <chr> <chr> <date>
#> 1 9 2010-08-01 data3 date9 Date at visit 9 2010-07-14
#> 2 41 2010-07-28 data3 date9 Date at visit 9 2010-07-09
#> 3 71 2010-07-16 data3 date8 Date at visit 8 2010-06-28
#> 4 186 2010-07-12 data3 date9 Date at visit 9 2010-06-23
#> 5 14 2010-07-30 data3 date9 Date at visit 9 2010-07-13
#> 6 61 2010-07-30 data3 date9 Date at visit 9 2010-07-12
#> 7 68 2010-08-06 data3 date8 Date at visit 8 2010-07-20
#> 8 120 2010-07-27 data3 date9 Date at visit 9 2010-07-09
#> 9 58 2010-07-31 data3 date9 Date at visit 9 2010-07-15
#> 10 93 2010-07-30 data3 date9 Date at visit 9 2010-07-13
#> # ℹ 190 more rows
#> # ℹ 2 more variables: preferred_origin <chr>, delta <drtn>
Warning
This table is useful for Overall Survival and data checking. However, you should be very careful when using it for Event-Free Survival: a patient can be alive at that point without having experienced an event.