Assign records to unique groups based on an ordered set of match criteria.
links(
  criteria,
  sub_criteria = NULL,
  sn = NULL,
  strata = NULL,
  data_source = NULL,
  data_links = "ANY",
  display = "none",
  group_stats = FALSE,
  expand = TRUE,
  shrink = FALSE,
  recursive = "none",
  check_duplicates = FALSE,
  tie_sort = NULL,
  batched = "yes",
  repeats_allowed = FALSE,
  permutations_allowed = FALSE,
  ignore_same_source = FALSE
)
criteria [list|atomic]. Ordered list of attributes to be compared. Each element of the list is a stage in the linkage process. See Details.
sub_criteria [list|sub_criteria]. Nested match criteria. This must be paired with a stage of the linkage process (criteria). See sub_criteria.
sn [integer]. Unique record ID.
strata [atomic]. Subsets of the dataset. Record-groups are created separately for each strata. See Details.
data_source [character]. Source ID for each record. If provided, a list of all sources in each record-group is returned. See pid_dataset slot.
data_links [list|character]. The data_source values required in each pid. A record-group without records from these data_sources will be unlinked. See Details.
display [character]. Display a progress update and/or generate a linkage report for the analysis. Options are "none" (default), "progress", "stats", "none_with_report", "progress_with_report" or "stats_with_report".
group_stats [character]. A selection of group-specific information to be returned for each record-group. Most are added to slots of the pid object. Options are NULL or any combination of "XX", "XX" and "XX".
expand [logical]. If TRUE, a record-group gains new records if a match is found at the next stage of the linkage process. Not interchangeable with shrink.
shrink [logical]. If TRUE, a record-group loses existing records if no match is found at the next stage of the linkage process. Not interchangeable with expand.
recursive [logical]. If TRUE, within each iteration of the process, a match can spawn new matches. Ignored when batched is "no".
check_duplicates [logical]. If TRUE, within each iteration of the process, duplicate values of an attribute are not checked. The outcome of the logical test on the first instance of the value is recycled for the duplicate values. Ignored when batched is "no".
tie_sort [atomic]. Preferential order for breaking match ties within an iteration of record linkage.
batched [character]. Determines whether record-pairs are created and compared in batches. Options are "yes", "no" or "semi".
repeats_allowed [logical]. If TRUE, pairs made up of repeat records are not created and compared. Only used when batched is "no".
permutations_allowed [logical]. If TRUE, permutations of record-pairs are created and compared. Only used when batched is "no".
ignore_same_source [logical]. If TRUE, only record-pairs from different data_sources are created and compared.
pid; list
The priority of matches decreases with each subsequent stage of the linkage process. Therefore, the attributes in criteria should be listed in order of decreasing relevance.
Records with missing data (NA) for a criteria are skipped at the respective stage, while records with a missing strata are skipped at every stage.
If a record is skipped at a stage, another attempt will be made to match the record at the next stage. If a record is still unmatched by the last stage, it is assigned a unique group ID.
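The staged behaviour described above can be sketched in base R. This is a simplified illustration only, not the package's implementation: staged_groups() is a hypothetical helper, it assumes unmatched records may join an existing group at a later stage (as with expand = TRUE), and it ignores strata, sub_criteria and every other argument of links().

```r
# Hypothetical helper: a simplified illustration of staged matching.
# It is NOT the algorithm used by links().
staged_groups <- function(criteria) {
  n <- length(criteria[[1]])
  group <- rep(NA_integer_, n)  # group ID per record
  stage <- rep(NA_integer_, n)  # stage at which each record was matched
  for (i in seq_along(criteria)) {
    x <- criteria[[i]]
    # Records with NA at this stage are skipped and retried at the next one
    for (v in unique(x[!is.na(x)])) {
      idx <- which(!is.na(x) & x == v)
      if (length(idx) < 2) next            # no other record to match with
      new <- idx[is.na(group[idx])]        # records not matched so far
      if (length(new) == 0) next           # earlier stages take priority
      old <- group[idx][!is.na(group[idx])]
      id <- if (length(old) > 0) old[1] else min(idx)
      group[new] <- id
      stage[new] <- i
    }
  }
  # Records still unmatched after the last stage get a unique group ID
  left <- is.na(group)
  group[left] <- which(left)
  list(group = group, stage = stage)
}

surname  <- c("Lee", "Lee", NA, "Obi", NA, NA)
forename <- c("Ann", "Ann", "Ann", "Joe", "Joe", NA)
res <- staged_groups(list(surname, forename))
res$group  # 1 1 1 4 4 6
res$stage  # 1 1 2 2 2 NA
```

Record 3 has no surname, so it is skipped at stage 1 and picked up at stage 2. Record 6 has no usable value at either stage and ends up in a group of its own.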
A sub_criteria adds nested match criteria to each stage of the linkage process. If used, only records with a matching criteria and sub_criteria are linked.
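As a rough illustration of that rule, the base-R sketch below compares every pair of records and links a pair only when both the criteria and the nested condition hold. The vectors and the "same sex OR same birth year" condition are made-up examples, and the pairwise comparison is an assumption for illustration rather than a description of how links() evaluates a sub_criteria.

```r
# Made-up data: three records that share a surname (the criteria)
surname    <- c("Lee", "Lee", "Lee")
sex        <- c("F", "F", "M")
birth_year <- c(1980, 1975, 1990)

# All unordered record-pairs: (1,2), (1,3), (2,3)
pairs <- t(combn(seq_along(surname), 2))
x <- pairs[, 1]
y <- pairs[, 2]

# Stage criteria: exact match on surname
cri_match <- surname[x] == surname[y]
# Nested condition: same sex OR same birth year
sub_match <- sex[x] == sex[y] | birth_year[x] == birth_year[y]
# A pair is linked only when both hold
linked <- cri_match & sub_match

data.frame(x, y, cri_match, sub_match, linked)
```

Only the pair (1, 2) is linked here. The other two pairs share a surname but fail the nested condition.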
In links, each sub_criteria must be linked to a criteria. This is done by adding each sub_criteria to a list element named "cr" followed by the corresponding stage's number. For example, 3 sub_criteria linked to criteria 1, 5 and 13 will be:
list(cr1 = sub_criteria(...), cr5 = sub_criteria(...), cr13 = sub_criteria(...))
Any unlinked sub_criteria will be ignored.
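The naming convention can be checked with a few lines of base R. The sketch below uses placeholder strings in place of real sub_criteria objects, and it assumes that "unlinked" means an element whose name does not follow the "cr" plus stage number pattern. How links() itself resolves these names is not shown in this page.

```r
# Placeholder strings stand in for sub_criteria objects
sub_list <- list(cr1 = "a", cr5 = "b", cr13 = "c", extra = "d")

nm <- names(sub_list)
# Assumption: only names of the form "cr<number>" are linked to a stage
is_linked <- grepl("^cr[0-9]+$", nm)
stage_no <- as.integer(sub("^cr", "", nm[is_linked]))

stage_no        # 1 5 13
nm[!is_linked]  # "extra"
```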
Every element in data_links must be named "l" (links) or "g" (groups). Unnamed elements of data_links will be assumed to be "l".
If named "l", groups without records from every listed data_source will be unlinked.
If named "g", groups without records from any listed data_source will be unlinked.
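The difference between "l" and "g" comes down to an all-versus-any test on the sources present in each group. The base-R sketch below illustrates this with made-up data. keep_group() is a hypothetical helper, not part of the package.

```r
# Made-up group IDs and source IDs for five records
group       <- c(1, 1, 2, 2, 3)
data_source <- c("A", "B", "A", "A", "C")

# Hypothetical helper: TRUE for groups that would stay linked
keep_group <- function(group, data_source, required, type = c("l", "g")) {
  type <- match.arg(type)
  # "l": every listed source must be present; "g": at least one must be
  test <- if (type == "l") all else any
  vapply(split(data_source, group),
         function(s) test(required %in% s),
         logical(1))
}

keep_l <- keep_group(group, data_source, c("A", "B"), "l")
keep_g <- keep_group(group, data_source, c("A", "B"), "g")
keep_l  # group 1: TRUE, group 2: FALSE, group 3: FALSE
keep_g  # group 1: TRUE, group 2: TRUE,  group 3: FALSE
```

Group 2 only has records from source "A", so it survives under "g" but would be unlinked under "l".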
See vignette("links") for more information.
library(diyar)  # assumed: the package that provides links() and patient_records
data(patient_records)
dfr <- patient_records

# An exact match on surname followed by an exact match on forename
stages <- as.list(dfr[c("surname", "forename")])
p1 <- links(criteria = stages)

# An exact match on forename followed by an exact match on surname
p2 <- links(criteria = rev(stages))

# Nested matches
# Same sex OR birth year
m.cri.1 <- sub_criteria(
  format(dfr$dateofbirth, "%Y"), dfr$sex,
  operator = "or")

# Same middle name AND birth years no more than 10 years apart
age_diff <- function(x, y){
  diff <- abs(as.numeric(x) - as.numeric(y))
  # TRUE when the difference is a whole number from 0 to 10; NA counts as no match
  wgt <- diff %in% 0:10 & !is.na(diff)
  wgt
}
m.cri.2 <- sub_criteria(
  format(dfr$dateofbirth, "%Y"), dfr$middlename,
  operator = "and",
  match_funcs = c(age_diff, exact_match))

# Nested match criteria 'm.cri.1' OR 'm.cri.2'
n.cri <- sub_criteria(
  m.cri.1, m.cri.2,
  operator = "or")

# Record linkage with additional match criteria
p3 <- links(
  criteria = stages,
  sub_criteria = list(cr1 = m.cri.1,
                      cr2 = m.cri.2))

# Record linkage with additional nested match criteria
p4 <- links(
  criteria = stages,
  sub_criteria = list(cr1 = n.cri,
                      cr2 = n.cri))

dfr$p1 <- p1; dfr$p2 <- p2
dfr$p3 <- p3; dfr$p4 <- p4
head(dfr)
#> forename middlename surname dateofbirth specimendate sex identity
#> 4545 <NA> <NA> <NA> 1971-11-28 2022-01-31 <NA> 2456
#> 9436 <NA> GBENDA IBEKAKU 1970-02-13 2021-04-29 <NA> 5157
#> 2303 <NA> IKYAMBE EJOFODOMI 2002-03-18 2020-09-02 M 1256
#> 7813 HERVERT AHAIWE DOWGO <NA> 2021-01-23 F 4258
#> 6448 OLUWOGA <NA> <NA> 1974-04-17 2021-02-27 <NA> 3492
#> 7336 <NA> IKYAMBE <NA> 1976-08-23 2022-03-07 U 4005
#> p1 p2 p3 p4
#> 4545 P.0001 (No hits) P.001 (No hits) P.1 (No hits) P.1 (No hits)
#> 9436 P.0002 (CRI 001) P.144 (CRI 002) P.2 (No hits) P.2 (No hits)
#> 2303 P.0003 (CRI 001) P.003 (CRI 002) P.3 (CRI 001) P.3 (CRI 001)
#> 7813 P.0004 (CRI 001) P.004 (CRI 001) P.4 (No hits) P.4 (No hits)
#> 6448 P.4143 (CRI 003) P.005 (CRI 001) P.5 (No hits) P.5 (No hits)
#> 7336 P.0006 (No hits) P.006 (No hits) P.6 (No hits) P.6 (No hits)