Overview

diyar is an R package for linking records with shared characteristics. Linked records represent an entity which, depending on the context, can range from unique patients to occurrences as defined by a case definition. Each entity is assigned to a unique group with an identifier. These identifiers are stored as an S4 class, with useful information about each group in its slots.

The package is capable of assessing and comparing an entity’s characteristics in several ways, making it useful in ordinarily complex analyses such as record linkage and contact or network analyses, or the application of case definitions.

The main functions are links(), link_records(), episodes() and partitions(). They are flexible in which record characteristics they compare, how they compare them, and what counts as a match. Although each is best suited to certain use cases, their functionality sometimes overlaps.

  • links() and link_records() - compare records without a temporal aspect. For example, record linkage.
  • episodes() - compares records with a temporal aspect, factoring in the duration between individual records. For example, contact and network analysis.
  • partitions() - compares records with a temporal aspect, without factoring in the duration between individual records. For example, application of case definitions.

Record linkage

Case definitions

episodes

episodes() is primarily used for contact and network analysis. It compares dated records (events) and assigns them to unique groups (episodes) based on the duration between events. Three types of episodes are possible: fixed, rolling and recursive (Figure 1). A fixed episode is a set of events within a defined period before or after an index event. A rolling episode is a repeating series of fixed episodes linked together as recurrences. A recursive episode is a rolling episode where every event serves as an index event.

dfr_2 <- c(1:5, 10:15, 20:25)
dfr_2 <- data.frame(date = as.Date("2020-01-01") + dfr_2)
dfr_2$ep1 <- episodes(dfr_2$date, case_length = 5, episode_type = "fixed")
dfr_2$ep2 <- episodes(dfr_2$date, case_length = 5, episode_type = "rolling")
dfr_2$ep3 <- episodes(dfr_2$date, case_length = 5, episode_type = "recursive")
dfr_2
#>          date      ep1     ep2     ep3
#> 1  2020-01-02 E.01 (C) E.1 (C) E.1 (C)
#> 2  2020-01-03 E.01 (D) E.1 (D) E.1 (D)
#> 3  2020-01-04 E.01 (D) E.1 (D) E.1 (D)
#> 4  2020-01-05 E.01 (D) E.1 (D) E.1 (D)
#> 5  2020-01-06 E.01 (D) E.1 (D) E.1 (D)
#> 6  2020-01-11 E.06 (C) E.1 (R) E.1 (R)
#> 7  2020-01-12 E.06 (D) E.1 (R) E.1 (R)
#> 8  2020-01-13 E.06 (D) E.1 (D) E.1 (D)
#> 9  2020-01-14 E.06 (D) E.1 (D) E.1 (D)
#> 10 2020-01-15 E.06 (D) E.1 (D) E.1 (D)
#> 11 2020-01-16 E.06 (D) E.1 (D) E.1 (D)
#> 12 2020-01-21 E.12 (C) E.1 (R) E.1 (R)
#> 13 2020-01-22 E.12 (D) E.1 (R) E.1 (R)
#> 14 2020-01-23 E.12 (D) E.1 (D) E.1 (D)
#> 15 2020-01-24 E.12 (D) E.1 (D) E.1 (D)
#> 16 2020-01-25 E.12 (D) E.1 (D) E.1 (D)
#> 17 2020-01-26 E.12 (D) E.1 (D) E.1 (D)

Figure 1: Episodes.

Several options control which records are used as index events, how many index events are used, how many durations from each index event are assessed, the nature of recurrences where applicable, and additional matching criteria through sub_criteria objects as described above. See help(episodes) for details. Also useful is episodes_wf_splits(), a wrapper function for episodes() that is better optimised for analyses with duplicate records.
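As a rough illustration of the difference between fixed and rolling episodes, the base R sketch below groups the same dates without diyar. sketch_episodes() is a hypothetical helper written for this example. It is not how episodes() is implemented, and it ignores record labels, recursive episodes and every other option. It assumes that a fixed window is measured from the index event and that a rolling window is measured from the most recently linked event.

```r
# Simplified base R sketch of episode grouping; NOT diyar's implementation.
# Returns, for each event, the row number of its episode's index event.
sketch_episodes <- function(dates, case_length, type = c("fixed", "rolling")) {
  type <- match.arg(type)
  ord <- order(dates)
  d <- as.numeric(dates[ord])      # days since the epoch, in date order
  id <- integer(length(d))
  index_pos <- 1L                  # sorted position of the current index event
  ref <- d[1]                      # point the window is measured from
  for (i in seq_along(d)) {
    if (d[i] - ref > case_length) {
      # Outside the window: this event becomes a new index event
      index_pos <- i
      ref <- d[i]
    } else if (type == "rolling") {
      # Rolling: measure the next window from the latest linked event
      ref <- d[i]
    }
    id[i] <- ord[index_pos]
  }
  out <- integer(length(d))
  out[ord] <- id                   # restore the original record order
  out
}

dates <- as.Date("2020-01-01") + c(1:5, 10:15, 20:25)
fixed_id <- sketch_episodes(dates, case_length = 5, type = "fixed")
rolling_id <- sketch_episodes(dates, case_length = 5, type = "rolling")
data.frame(date = dates, fixed_id, rolling_id)
```

Under these assumptions the fixed grouping starts new episodes at records 1, 6 and 12, while the rolling grouping keeps all 17 records in one episode, which agrees with the group membership of ep1 and ep2 above.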

partitions

partitions() assigns events to groups (panes) if they are within a defined interval (Figure 2). Unlike episodes(), the duration between events is not a factor. Here, events from the same pane simply occurred within the same interval. See the example below.

event_dt <- seq(from = as.Date("2021-01-01"), to = as.Date("2021-01-11"), by = 1)
dfr_3 <- data.frame(date = event_dt)

# Group events into 2 equal parts per `strata`.
dfr_3$pn2 <- partitions(event_dt, length.out = 2, separate = TRUE)
# Group events into 3-day sequences per `strata`.
dfr_3$pn3 <- partitions(event_dt, by = 3, separate = TRUE)
# Group events into a specified period of time in each `strata`.
dfr_3$pn4 <- partitions(event_dt, window = number_line(event_dt[4], event_dt[7]))
# Group events from separate periods into one pane.
dfr_3$pn5 <- partitions(event_dt, length.out = 2, separate = FALSE)

Figure 2: Panes
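The idea that a pane depends only on the interval an event falls in, and not on the gaps between events, can be sketched in base R. The example below is an approximation written for illustration: pane_sketch is a hypothetical name, and the interval boundaries that partitions() chooses for by = 3 may differ from the ones assumed here (consecutive 3-day intervals counted from the earliest date).

```r
# Base R approximation of interval-based grouping; NOT diyar's implementation.
event_dt <- seq(from = as.Date("2021-01-01"), to = as.Date("2021-01-11"), by = 1)

# Whole days elapsed since the earliest event
days_from_start <- as.numeric(event_dt - min(event_dt))

# Events in the same 3-day interval share a pane number
pane_sketch <- days_from_start %/% 3 + 1

data.frame(date = event_dt, pane_sketch)
```

Each event is assigned independently of its neighbours, so removing or adding events would not change the pane of the remaining ones as long as the earliest date stays the same.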

Other useful functions

There are other supporting functions which are useful for common analytical tasks. Some of these are shown below.

number_line

Produces a number_line object. This is an S4 class representing two points of an interval. It’s covered in more detail in the number_line and overlaps vignette. It’s useful when tracking episodes from periods in time as opposed to points in time. See an example of this below.

data(hospital_admissions)
dfr_4 <- hospital_admissions[c("admin_dt", "discharge_dt")]
dfr_4$admin_period <- number_line(dfr_4$admin_dt, dfr_4$discharge_dt)

# Group overlapping hospital stays
dfr_4$nl1 <- index_window(dfr_4$admin_period)
dfr_4$ep4 <- episodes(date = dfr_4$admin_period, case_length = dfr_4$nl1)

# Group overlapping hospital stays and those within 21 days of the end point of an index hospital stay 
dfr_4$nl2 <- expand_number_line(index_window(dfr_4$admin_period), 20, "right")
dfr_4$ep5 <- episodes(date = dfr_4$admin_period, case_length = dfr_4$nl2)

number_line objects can also be used as an attribute in a sub_criteria object.

s_cri_c <- sub_criteria(dfr_4$admin_period, match_funcs = overlaps)
dfr_4$pd5 <- links("place_holder", sub_criteria = list("cr1" = s_cri_c), recursive = TRUE)
# Results
dfr_4[c("admin_period", "nl1", "nl2", "ep4", "ep5", "pd5")]
#>                admin_period      nl1       nl2     ep4     ep5           pd5
#> 1  2019-01-01 == 2019-01-01   0 == 0   0 -> 20 E.2 (D) E.2 (D) P.1 (CRI 001)
#> 2  2019-01-01 -> 2019-01-10  -9 -> 0  -9 -> 20 E.2 (C) E.2 (C) P.1 (CRI 001)
#> 3  2019-01-10 -> 2019-01-13  -3 -> 0  -3 -> 20 E.2 (D) E.2 (D) P.1 (CRI 001)
#> 4  2019-01-05 -> 2019-01-06  -1 -> 0  -1 -> 20 E.2 (D) E.2 (D) P.1 (CRI 001)
#> 5  2019-01-05 -> 2019-01-15 -10 -> 0 -10 -> 20 E.2 (D) E.2 (D) P.1 (CRI 001)
#> 6  2019-01-07 -> 2019-01-15  -8 -> 0  -8 -> 20 E.2 (D) E.2 (D) P.1 (CRI 001)
#> 7  2019-01-04 -> 2019-01-13  -9 -> 0  -9 -> 20 E.2 (D) E.2 (D) P.1 (CRI 001)
#> 8  2019-01-20 -> 2019-01-30 -10 -> 0 -10 -> 20 E.8 (C) E.2 (D) P.9 (CRI 001)
#> 9  2019-01-26 -> 2019-01-31  -5 -> 0  -5 -> 20 E.8 (D) E.2 (D) P.9 (CRI 001)
#> 10 2019-01-01 -> 2019-01-10  -9 -> 0  -9 -> 20 E.2 (D) E.2 (D) P.1 (CRI 001)
#> 11 2019-01-20 -> 2019-01-30 -10 -> 0 -10 -> 20 E.8 (D) E.2 (D) P.9 (CRI 001)
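The grouping in ep4 and pd5 rests on whether periods overlap. As a minimal sketch of that idea, the base R code below groups the same eleven admission periods with a single sorted pass. group_overlaps() is a hypothetical helper, not a diyar function, and it says nothing about how episodes(), links() or overlaps() work internally. It assumes that periods sharing an end point count as overlapping.

```r
# Admission periods copied from the admin_period column above
starts <- as.Date(c("2019-01-01", "2019-01-01", "2019-01-10", "2019-01-05",
                    "2019-01-05", "2019-01-07", "2019-01-04", "2019-01-20",
                    "2019-01-26", "2019-01-01", "2019-01-20"))
ends <- as.Date(c("2019-01-01", "2019-01-10", "2019-01-13", "2019-01-06",
                  "2019-01-15", "2019-01-15", "2019-01-13", "2019-01-30",
                  "2019-01-31", "2019-01-10", "2019-01-30"))

# Hypothetical helper: chain overlapping periods into groups
group_overlaps <- function(start, end) {
  ord <- order(start, end)
  s <- as.numeric(start[ord])
  e <- as.numeric(end[ord])
  grp <- integer(length(s))
  current <- 0L
  reach <- -Inf                       # latest end point seen in the current group
  for (i in seq_along(s)) {
    if (s[i] > reach) current <- current + 1L   # gap found: start a new group
    reach <- max(reach, e[i])
    grp[i] <- current
  }
  out <- integer(length(s))
  out[ord] <- grp                     # restore the original record order
  out
}

stay_group <- group_overlaps(starts, ends)
table(stay_group)
```

This yields two groups, one with 8 records and one with 3, with records 8, 9 and 11 forming the smaller group as in ep4 and pd5.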

make_pairs

make_pairs() creates record pairs.

prs_1 <- make_pairs(LETTERS[1:4], repeats_allowed = FALSE, permutations_allowed = FALSE)
prs_2 <- make_pairs(LETTERS[1:4], repeats_allowed = TRUE, permutations_allowed = TRUE)
prs_3 <- make_pairs(1:5000, repeats_allowed = TRUE, permutations_allowed = TRUE)

str(prs_1)
#> List of 4
#>  $ x_pos: int [1:6] 1 1 1 2 2 3
#>  $ y_pos: int [1:6] 2 3 4 3 4 4
#>  $ x_val: chr [1:6] "A" "A" "A" "B" ...
#>  $ y_val: chr [1:6] "B" "C" "D" "C" ...
str(prs_2)
#> List of 4
#>  $ x_pos: int [1:16] 1 1 1 1 2 2 2 2 3 3 ...
#>  $ y_pos: int [1:16] 1 2 3 4 1 2 3 4 1 2 ...
#>  $ x_val: chr [1:16] "A" "A" "A" "A" ...
#>  $ y_val: chr [1:16] "A" "B" "C" "D" ...
str(prs_3)
#> List of 4
#>  $ x_pos: int [1:25000000] 1 1 1 1 1 1 1 1 1 1 ...
#>  $ y_pos: int [1:25000000] 1 2 3 4 5 6 7 8 9 10 ...
#>  $ x_val: int [1:25000000] 1 1 1 1 1 1 1 1 1 1 ...
#>  $ y_val: int [1:25000000] 1 2 3 4 5 6 7 8 9 10 ...
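The position columns of prs_1 and prs_2 can be reproduced with base R, which may help clarify what the two arguments control. This is only a sketch of the idea using combn() and expand.grid(); it is not a description of how make_pairs() is implemented, and the object names are made up for the example.

```r
x <- LETTERS[1:4]

# No repeats, no permutations: every unordered pair of distinct positions
idx <- combn(seq_along(x), 2)
no_rep <- list(x_pos = idx[1, ], y_pos = idx[2, ],
               x_val = x[idx[1, ]], y_val = x[idx[2, ]])

# Repeats and permutations allowed: every ordered pair, including self-pairs.
# y_pos is listed first so that it varies fastest, as in prs_2.
grid <- expand.grid(y_pos = seq_along(x), x_pos = seq_along(x))

length(no_rep$x_pos)  # choose(4, 2) pairs
nrow(grid)            # 4^2 pairs
```

The number of pairs grows quickly with the second setting: n records give n^2 pairs, which is why 5,000 records produce the 25,000,000 pairs seen in prs_3.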

combi

combi() creates numeric codes for unique combinations of values across vectors. It aims to be a faster alternative to paste0() when knowing that a combination is unique matters more than knowing the exact combination.

cmbi_dfr <- dfr_stages[c(1, 2)]
cmbi_dfr$combi_nm <- paste(cmbi_dfr[[1]], cmbi_dfr[[2]])
cmbi_dfr$combi_cd <- combi(as.list(cmbi_dfr))
cmbi_dfr
#>   age hair_colour      combi_nm combi_cd
#> 1  30       Brown      30 Brown        1
#> 2  30        Teal       30 Teal        2
#> 3  30        <NA>         30 NA        3
#> 4  30       Green      30 Green        4
#> 5  30       Green      30 Green        4
#> 6  30  Dark brown 30 Dark brown        5
#> 7  30       Brown      30 Brown        1
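One way to obtain such codes without building strings is to code each vector as integers and then code the pairs of integers. The sketch below shows that approach in base R. combi_sketch() is a hypothetical helper for illustration only; whether combi() works this way internally is not something this example claims. The input values are copied from the table above.

```r
# Hypothetical helper: integer codes for unique combinations, without paste()
combi_sketch <- function(...) {
  vecs <- list(...)
  code <- rep(1L, length(vecs[[1]]))
  for (v in vecs) {
    v_code <- match(v, unique(v))               # integer code per value; NA is a value
    pair <- (code - 1) * max(v_code) + v_code   # distinct number per (code, v_code) pair
    code <- match(pair, unique(pair))           # renumber in order of first appearance
  }
  code
}

age <- rep(30, 7)
hair_colour <- c("Brown", "Teal", NA, "Green", "Green", "Dark brown", "Brown")
combi_sketch(age, hair_colour)
```

For these values the codes are 1, 2, 3, 4, 4, 5, 1, the same pattern as the combi_cd column.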

Processing time

episodes() and links() are iterative functions. Although both continue to be optimised, each iteration costs additional processing time. The main way to mitigate this is to reduce the number of iterations required to complete an analysis. To help with this, both functions have arguments which ensure that only the minimum number of records required to complete the process is used.

The flexibility of episodes() and links() results in situations where the use of different arguments will lead to the same outcome. However, one combination will usually require fewer iterations to complete the same process. Therefore, a good grasp of the role of each argument is useful in knowing the most efficient combination of arguments to use.

In some analyses, episodes() and links() can be used interchangeably; however, one is usually less efficient than the other in each situation. For example, the identifiers dfr_4$ep4 and dfr_4$pd5 are essentially the same outcome: 2 groups with the same set of records in each. However, we can see below that episodes() created dfr_4$ep4 in 2 iterations, compared to the 7 iterations it took links() to create dfr_4$pd5.

summary(dfr_4$ep4)
#> Iterations:                     2
#> Total records:                  11
#>  by record type:
#>      Case:                2
#>      Duplicate_C:         9
#> Total episodes:                 2
#>  by episode type:
#>      Fixed:               2
#>  by episode dataset:
#>      N/A
#>  by episodes duration:
#>      N/A
#>  by records per episode:
#>      3 records:           1
#>      8 records:           1
#>  by recurrence:
#>      N/A

summary(dfr_4$pd5)
#> Iterations:                     7
#> Total records:                  11
#>  by matching criteria:
#>      Criteria 1:          11
#> Total record groups:                 2
#>  by group dataset:
#>      N/A
#>  by records per group:
#>      3 records:           1
#>      8 records:           1
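The cost of iteration can be illustrated with a deliberately naive base R sketch. propagate_labels() is a hypothetical helper; it is not part of diyar and does not describe how links() or episodes() work internally. It repeatedly gives each period the smallest label among the periods it overlaps, and counts the passes needed before the labels stop changing. The periods are the admission periods from dfr_4, written as days of January 2019.

```r
# Admission periods from dfr_4, as day-of-month in January 2019
starts <- c(1, 1, 10, 5, 5, 7, 4, 20, 26, 1, 20)
ends   <- c(1, 10, 13, 6, 15, 15, 13, 30, 31, 10, 30)

# Hypothetical helper: iterative grouping by label propagation
propagate_labels <- function(start, end) {
  n <- length(start)
  # ovl[i, j] is TRUE where period i overlaps period j (shared end points count)
  starts_before_end <- outer(start, end, "<=")
  ovl <- starts_before_end & t(starts_before_end)
  label <- seq_len(n)
  passes <- 0L
  repeat {
    passes <- passes + 1L
    # Each period takes the smallest label among the periods it overlaps
    new_label <- vapply(seq_len(n), function(i) min(label[ovl[i, ]]), integer(1))
    if (identical(new_label, label)) break
    label <- new_label
  }
  list(label = label, passes = passes)
}

res <- propagate_labels(starts, ends)
table(res$label)
res$passes
```

The result is the same two groups of 8 and 3 records, but it takes more than one pass to get there because labels only travel one overlap per pass. Every pass touches all records, so an approach that needs fewer passes does less work for the same outcome.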

It is also worth noting that using a sub_criteria object can add considerably to processing time, so it should be avoided where possible. For example, dfr_stages$p4 was created using sub_cri_b, which took 6 iterations. However, the same outcome can be achieved in 1 iteration with a criteria alone. See the difference below.

dfr_stages$p4b <- links(criteria = combi(last_word_wf(dfr_stages$hair_colour),
                                         last_word_wf(dfr_stages$branch_office)))

summary(dfr_stages$p4b)
#> Iterations:                     1
#> Total records:                  7
#>  by matching criteria:
#>      Criteria 1:          3
#>      No hits:             4
#> Total record groups:                 5
#>  by group dataset:
#>      N/A
#>  by records per group:
#>      1 record:            4
#>      3 records:           1

all(dfr_stages$p4 == dfr_stages$p4b)
#> [1] TRUE

The time differences resulting from the number of iterations are negligible with relatively small datasets but become more apparent as the size of the dataset increases.