Multistage record linkage

Assign records to unique groups based on an ordered set of match criteria.

links(
  criteria,
  sub_criteria = NULL,
  sn = NULL,
  strata = NULL,
  data_source = NULL,
  data_links = "ANY",
  display = "none",
  group_stats = FALSE,
  expand = TRUE,
  shrink = FALSE,
  recursive = "none",
  check_duplicates = FALSE,
  tie_sort = NULL,
  batched = "yes",
  repeats_allowed = FALSE,
  permutations_allowed = FALSE,
  ignore_same_source = FALSE
)

Arguments

criteria: [list|atomic]. Ordered list of attributes to be compared. Each element of the list is a stage in the linkage process. See Details.
sub_criteria: [list|sub_criteria]. Nested match criteria. Each must be paired with a stage of the linkage process (criteria). See sub_criteria.
sn: [integer]. Unique record ID.
strata: [atomic]. Subsets of the dataset. Record-groups are created separately for each strata. See Details.
data_source: [character]. Source ID for each record. If provided, a list of all sources in each record-group is returned. See pid_dataset slot.
data_links: [list|character]. data_source values required in each record-group (pid). A record-group without records from these data_sources will be unlinked. See Details.
display: [character]. Display progress updates and/or generate a linkage report for the analysis. Options are: "none" (default), "progress", "stats", "none_with_report", "progress_with_report" or "stats_with_report".
group_stats: [character]. A selection of group-specific information to be returned for each record-group. Most are added to slots of the pid object. Options are NULL or any combination of "XX", "XX" and "XX".
expand: [logical]. If TRUE, a record-group gains new records if a match is found at the next stage of the linkage process. Not interchangeable with shrink.
shrink: [logical]. If TRUE, a record-group loses existing records if no match is found at the next stage of the linkage process. Not interchangeable with expand.
recursive: [logical]. If TRUE, within each iteration of the process, a match can spawn new matches. Ignored when batched is "no".
check_duplicates: [logical]. If TRUE, within each iteration of the process, duplicate values of an attribute are not checked. The outcome of the logical test on the first instance of the value is recycled for the duplicate values. Ignored when batched is "no".
tie_sort: [atomic]. Preferential order for breaking match ties within an iteration of record linkage.
batched: [character]. Determines if record-pairs are created and compared in batches. Options are "yes", "no" or "semi".
repeats_allowed: [logical]. If TRUE, pairs made up of repeat records are not created and compared. Only used when batched is "no".
permutations_allowed: [logical]. If TRUE, permutations of record-pairs are created and compared. Only used when batched is "no".
ignore_same_source: [logical]. If TRUE, only record-pairs from different data_sources are created and compared.

Value

pid; list

Details

The priority of matches decreases with each subsequent stage of the linkage process. Therefore, the attributes in criteria should be in an order of decreasing relevance.

Records with missing data (NA) for a criteria are skipped at the corresponding stage, while records with a missing strata (NA) are skipped at every stage.

If a record is skipped from a stage, another attempt will be made to match the record at the next stage. If a record is still unmatched by the last stage, it is assigned a unique group ID.
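
The skipping rules above can be illustrated with base R alone. The sketch below does not call links and is not the package's implementation. It only flags which records would be eligible for comparison at each stage under the rules as stated. The data and the object names (criteria, strata, eligible) are made up for illustration.

```r
# Hypothetical data: two stages of criteria and one strata vector
criteria <- list(
  surname  = c("OBI", NA, "ADE", "ADE"),
  forename = c("JOY", "SAM", NA, "TAYO")
)
strata <- c("north", "north", "south", NA)

# A record is eligible at a stage only if its value for that stage's
# criteria and its strata are both non-missing
eligible <- sapply(criteria, function(x) !is.na(x) & !is.na(strata))
eligible
```

Here, record 2 is skipped at stage 1 but is still eligible at stage 2, record 3 is eligible at stage 1 only, and record 4 is skipped at every stage because its strata is missing.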

A sub_criteria adds nested match criteria to each stage of the linkage process. If used, only records with a matching criteria and sub_criteria are linked.

In links, each sub_criteria must be linked to a criteria. This is done by adding each sub_criteria to a named element of a list, where the name is "cr" concatenated with the corresponding stage's number. For example, 3 sub_criteria linked to criteria 1, 5 and 13 will be:

list(cr1 = sub_criteria(...), cr5 = sub_criteria(...), cr13 = sub_criteria(...))

Any unlinked sub_criteria will be ignored.
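
When there are many stages, the "cr" names can be generated rather than typed out. The following is a minimal base R sketch of the naming convention only. The placeholder strings stand in for real sub_criteria(...) objects, and stage_numbers, placeholders and named are hypothetical names.

```r
# Stage numbers that each sub_criteria should be paired with
stage_numbers <- c(1, 5, 13)

# Placeholders standing in for sub_criteria(...) objects
placeholders <- list("first", "second", "third")

# Name each element "cr" followed by its stage number
named <- setNames(placeholders, paste0("cr", stage_numbers))
names(named)
```

The resulting list has the same names as the example above and could be passed to the sub_criteria argument once the placeholders are replaced with real sub_criteria objects.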

Every element in data_links must be named "l" (links) or "g" (groups). Unnamed elements of data_links will be assumed to be "l".

If named "l", groups without records from every listed data_source will be unlinked.
If named "g", groups without records from any listed data_source will be unlinked.
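
The difference between "l" and "g" can be sketched in base R. This is an illustration of the rule as stated above, not the package's own code. The helper keeps_links and its arguments are hypothetical.

```r
# Hypothetical helper: does a group satisfy one element of data_links?
#   sources  - data_source values of the records in one group
#   required - data_source values listed in the data_links element
#   type     - "l" (every listed source needed) or "g" (any listed source needed)
keeps_links <- function(sources, required, type = "l") {
  present <- required %in% sources
  if (type == "l") all(present) else any(present)
}

# A group with two records from source "A" and one from source "B"
group_sources <- c("A", "A", "B")

all_present  <- keeps_links(group_sources, c("A", "B"), type = "l")
one_missing  <- keeps_links(group_sources, c("A", "C"), type = "l")
any_present  <- keeps_links(group_sources, c("A", "C"), type = "g")
```

Under these assumptions, the group stays linked in the first and third cases and would be unlinked in the second, because it has no record from source "C".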

See vignette("links") for more information.

Examples

data(patient_records)
dfr <- patient_records
# An exact match on surname followed by an exact match on forename
stages <- as.list(dfr[c("surname", "forename")])
p1 <- links(criteria = stages)

# An exact match on forename followed by an exact match on surname
p2 <- links(criteria = rev(stages))

# Nested matches
# Same sex OR birth year
m.cri.1 <- sub_criteria(
  format(dfr$dateofbirth, "%Y"), dfr$sex,
  operator = "or")

# Same middle name AND birth years at most 10 years apart
age_diff <- function(x, y){
  diff <- abs(as.numeric(x) - as.numeric(y))
  wgt <-  diff %in% 0:10 & !is.na(diff)
  wgt
}
m.cri.2 <- sub_criteria(
  format(dfr$dateofbirth, "%Y"), dfr$middlename,
  operator = "and",
  match_funcs = c(age_diff, exact_match))

# Nested match criteria 'm.cri.1' OR 'm.cri.2'
n.cri <- sub_criteria(
  m.cri.1, m.cri.2,
  operator = "or")

# Record linkage with additional match criteria
p3 <- links(
  criteria = stages,
  sub_criteria = list(cr1 = m.cri.1,
                      cr2 = m.cri.2))

# Record linkage with additional nested match criteria
p4 <- links(
  criteria = stages,
  sub_criteria = list(cr1 = n.cri,
                      cr2 = n.cri))

dfr$p1 <- p1; dfr$p2 <- p2
dfr$p3 <- p3; dfr$p4 <- p4

head(dfr)
#>      forename middlename   surname dateofbirth specimendate  sex identity
#> 4545     <NA>       <NA>      <NA>  1971-11-28   2022-01-31 <NA>     2456
#> 9436     <NA>     GBENDA   IBEKAKU  1970-02-13   2021-04-29 <NA>     5157
#> 2303     <NA>    IKYAMBE EJOFODOMI  2002-03-18   2020-09-02    M     1256
#> 7813  HERVERT     AHAIWE     DOWGO        <NA>   2021-01-23    F     4258
#> 6448  OLUWOGA       <NA>      <NA>  1974-04-17   2021-02-27 <NA>     3492
#> 7336     <NA>    IKYAMBE      <NA>  1976-08-23   2022-03-07    U     4005
#>                    p1              p2            p3            p4
#> 4545 P.0001 (No hits) P.001 (No hits) P.1 (No hits) P.1 (No hits)
#> 9436 P.0002 (CRI 001) P.144 (CRI 002) P.2 (No hits) P.2 (No hits)
#> 2303 P.0003 (CRI 001) P.003 (CRI 002) P.3 (CRI 001) P.3 (CRI 001)
#> 7813 P.0004 (CRI 001) P.004 (CRI 001) P.4 (No hits) P.4 (No hits)
#> 6448 P.4143 (CRI 003) P.005 (CRI 001) P.5 (No hits) P.5 (No hits)
#> 7336 P.0006 (No hits) P.006 (No hits) P.6 (No hits) P.6 (No hits)
