A specific use case of links for probabilistic record linkage.

links_wf_probabilistic(
  attribute,
  blocking_attribute = NULL,
  cmp_func = diyar::exact_match,
  cmp_threshold = 0.95,
  probabilistic = TRUE,
  m_probability = 0.95,
  score_threshold = 1,
  id_1 = NULL,
  id_2 = NULL,
  ...
)

prob_score_range(attribute, m_probability = 0.95)

Arguments

attribute

[list]. Attributes to compare.

blocking_attribute

[atomic]. Subsets of the dataset.

cmp_func

[list|function]. String comparators for each attribute. See Details.

cmp_threshold

[list|numeric|number_line]. Weight-thresholds for each cmp_func. See Details.

probabilistic

[logical]. If TRUE, scores are assigned based on the Fellegi-Sunter model for probabilistic record linkage. See Details.

m_probability

[list|numeric]. The probability that a match from the string comparator is actually from the same entity.

score_threshold

[numeric|number_line]. Score-threshold for linked records. See Details.

id_1

[list|numeric]. One half of a specific pair of records to check for match weights and score-thresholds.

id_2

[list|numeric]. One half of a specific pair of records to check for match weights and score-thresholds.

...

Arguments passed to links

Value

pid; list

Details

links_wf_probabilistic is a wrapper function of links for probabilistic record linkage. Its implementation is based on the Fellegi and Sunter (1969) model for deciding whether two records belong to the same entity.

In summary, record pairs are created and categorised as matches and non-matches (cmp_func). Two probabilities (m and u) are then estimated for each record pair to score matches and non-matches. The m-probability is the probability that matched records are actually from the same entity, i.e. a true match, while the u-probability is the probability that matched records are not from the same entity, i.e. a false match. m-probabilities must be supplied, but u-probabilities are calculated for each value of an attribute as the frequency of that value in the dataset. Record pairs whose total score is above a certain threshold (score_threshold) are assumed to belong to the same entity.
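As a rough illustration of the frequency-based u-probabilities described above, the base R sketch below computes them for a made-up attribute. The helper u_probability is hypothetical and not part of diyar. It assumes that the frequency is taken over all records and that missing values receive 0.

```r
# Hypothetical helper, not part of diyar.
# Assumption: u is the relative frequency of each value among all records,
# and missing values (NA) get a u-probability of 0.
u_probability <- function(x) {
  freq <- table(x)                                   # counts of non-missing values
  u <- as.numeric(freq[as.character(x)]) / length(x) # look up each record's value
  u[is.na(x)] <- 0
  u
}

# Made-up attribute with seven records, one of them missing
initials <- c("A", "B", "C", "C", NA, "A", "A")
u <- u_probability(initials)
u
```

Common values get a larger u-probability, so agreeing on them contributes less to the total score than agreeing on a rare value.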

Agreement (match) and disagreement (non-match) scores are calculated as described by Asher et al. (2020).

For each record pair, an agreement score for attribute \(i\) is calculated as:

$$\log_{2}(m_{i}/u_{i})$$

For each record pair, a disagreement score for attribute \(i\) is calculated as:

$$\log_{2}((1-m_{i})/(1-u_{i}))$$

where \(m_{i}\) and \(u_{i}\) are the m and u-probabilities for each value of attribute \(i\).

Missing data (NA) are categorised as non-matches and assigned a u-probability of 0.
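The two formulas can be checked by hand. The sketch below is a minimal base R illustration, not diyar's implementation. The helper names and the u-probabilities (0 for a missing value, then 3/7, 2/7 and 2/7) are assumptions chosen for the example.

```r
# Hypothetical helpers, not part of diyar
agreement_score <- function(m, u) log2(m / u)
disagreement_score <- function(m, u) log2((1 - m) / (1 - u))

m <- 0.95  # default m_probability

# One record pair compared on four attributes (assumed values).
# u = 0 stands for a missing value, which is treated as a non-match.
u <- c(staff_id = 0, initials = 3 / 7, hair_colour = 2 / 7, branch_office = 2 / 7)
is_match <- c(staff_id = FALSE, initials = TRUE, hair_colour = TRUE, branch_office = TRUE)

# Agreement score where the attribute matched, disagreement score otherwise
scores <- ifelse(is_match, agreement_score(m, u), disagreement_score(m, u))
total_score <- sum(scores)

round(scores, 6)
total_score
```

With these assumed inputs the total comes to roughly 0.2931724, which agrees with the first prb.weight value shown in the Examples.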

By default, matches and non-matches for each attribute are determined with exact_match, which gives a binary outcome. String comparators can also be used, with a threshold (cmp_threshold) for each similarity score. If probabilistic is FALSE, the sum of all similarity scores is compared against score_threshold instead of a score derived from the m and u-probabilities.
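To make the idea of a thresholded string comparator concrete, here is a minimal base R sketch. The comparator edit_similarity is hypothetical and not part of diyar, and the sketch assumes that a similarity at or above the threshold counts as a match. It does not show how links_wf_probabilistic applies cmp_threshold internally.

```r
# Hypothetical comparator, not part of diyar: similarity in [0, 1] derived
# from the Levenshtein distance returned by utils::adist.
edit_similarity <- function(x, y) {
  dist <- mapply(function(a, b) utils::adist(a, b)[1, 1], x, y, USE.NAMES = FALSE)
  1 - dist / pmax(nchar(x), nchar(y))
}

x <- c("brown", "black", "green")
y <- c("browm", "black", "grey")

similarity <- edit_similarity(x, y)
threshold <- 0.75                     # illustrative value only
is_match <- similarity >= threshold   # assumed rule: at or above the threshold

similarity
is_match
```

Here the first two pairs would be treated as matches and the third as a non-match. A stricter threshold such as the default 0.95 would keep only the identical pair.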

links_wf_probabilistic requires a score_threshold in advance of the linkage process. This differs from the typical approach where a score_threshold is selected after the linkage process, following a review of all calculated scores. To help with this, prob_score_range will return the range of scores attainable for a given set of attributes. Additionally, id_1 and id_2 can be used to link specific record pairs, aiding the review of potential scores.

A blocking_attribute can be used to reduce processing time by restricting comparisons to subsets of the dataset.
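The saving from blocking can be estimated by counting candidate pairs. The sketch below is only an illustration in base R with made-up data. It assumes every record is compared with every other record in its subset and says nothing about how diyar builds pairs internally.

```r
# Number of unordered pairs among n records
n_pairs <- function(n) n * (n - 1) / 2

# Made-up blocking attribute for seven records
block <- c("north", "north", "south", "south", "south", "east", "east")

all_pairs <- n_pairs(length(block))           # no blocking
blocked_pairs <- sum(n_pairs(table(block)))   # pairs within each block only

all_pairs
blocked_pairs
```

In this toy case, blocking cuts the number of comparisons from 21 to 5, at the cost of never comparing records that fall in different blocks.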

References

Fellegi, I. P., & Sunter, A. B. (1969). A Theory for Record Linkage. Journal of the American Statistical Association, 64(328), 1183–1210. https://doi.org/10.1080/01621459.1969.10501049

Asher, J., Resnick, D., Brite, J., Brackbill, R., & Cone, J. (2020). An Introduction to Probabilistic Record Linkage with a Focus on Linkage Processing for WTC Registries. International Journal of Environmental Research and Public Health, 17(18), 6937. https://doi.org/10.3390/ijerph17186937

See also

Examples

library(diyar)

# Using exact matches
dfr <- missing_staff_id[c("staff_id", "initials", "hair_colour", "branch_office")]
score_range <- prob_score_range(attribute = as.list(dfr))
prob_pids1 <- links_wf_probabilistic(attribute = as.list(dfr),
                                     score_threshold = score_range$minimum_score)
prob_pids1
#> $pid
#> [1] "P.1 (CRI 001)" "P.1 (CRI 001)" "P.1 (CRI 001)" "P.1 (CRI 001)"
#> [5] "P.1 (CRI 001)" "P.1 (CRI 001)" "P.1 (CRI 001)"
#>
#> $pid_weights
#>      sn_x sn_y cmp.staff_id cmp.initials cmp.hair_colour cmp.branch_office
#> [1,]    1    1            0            1               1                 1
#> [2,]    2    1            0            0               0                 0
#> [3,]    3    1            0            0               0                 0
#> [4,]    4    1            0            0               0                 0
#> [5,]    5    1            0            0               0                 0
#> [6,]    6    1            0            1               0                 0
#> [7,]    7    1            0            1               1                 1
#>      cmp.weight cmp.threshold prb.staff_id prb.initials prb.hair_colour
#> [1,]          3            NA    -4.321928     1.148392        1.733354
#> [2,]          0            NA    -4.321928    -4.099536       -4.099536
#> [3,]          0            NA    -4.321928    -3.836501       -4.321928
#> [4,]          0            NA    -4.321928    -3.836501       -3.836501
#> [5,]          0            NA    -4.321928    -4.321928       -3.836501
#> [6,]          1            NA    -3.836501     1.148392       -4.099536
#> [7,]          3            NA    -3.836501     1.148392        1.733354
#>      prb.branch_office  prb.weight prb.threshold
#> [1,]          1.733354   0.2931724             1
#> [2,]         -3.836501 -16.3575007             1
#> [3,]         -4.321928 -16.8022856             1
#> [4,]         -4.321928 -16.3168587             1
#> [5,]         -3.836501 -16.3168587             1
#> [6,]         -4.099536 -10.8871808             1
#> [7,]          1.733354   0.7785993             1
#>
# Using other logical tests e.g. string comparators
# For example, matching last word in `hair_colour` and `branch_office`
last_word_wf <- function(x) tolower(gsub("^.* ", "", x))
last_word_cmp <- function(x, y) last_word_wf(x) == last_word_wf(y)
prob_pids2 <- links_wf_probabilistic(attribute = as.list(dfr),
                                     cmp_func = c(diyar::exact_match,
                                                  diyar::exact_match,
                                                  last_word_cmp,
                                                  last_word_cmp),
                                     score_threshold = score_range$mid_scorce)
prob_pids2
#> $pid
#> [1] "P.1 (CRI 001)" "P.2 (No hits)" "P.3 (No hits)" "P.4 (No hits)"
#> [5] "P.5 (No hits)" "P.1 (CRI 001)" "P.1 (CRI 001)"
#>
#> $pid_weights
#>      sn_x sn_y cmp.staff_id cmp.initials cmp.hair_colour cmp.branch_office
#> [1,]    1    1            0            1               1                 1
#> [2,]    2    2           NA           NA              NA                NA
#> [3,]    3    3           NA           NA              NA                NA
#> [4,]    4    4           NA           NA              NA                NA
#> [5,]    5    5           NA           NA              NA                NA
#> [6,]    6    1            0            1               1                 1
#> [7,]    7    1            0            1               1                 1
#>      cmp.weight cmp.threshold prb.staff_id prb.initials prb.hair_colour
#> [1,]          3            NA    -4.321928     1.148392        1.733354
#> [2,]         NA            NA           NA           NA              NA
#> [3,]         NA            NA           NA           NA              NA
#> [4,]         NA            NA           NA           NA              NA
#> [5,]         NA            NA           NA           NA              NA
#> [6,]          3            NA    -3.836501     1.148392        2.733354
#> [7,]          3            NA    -3.836501     1.148392        1.733354
#>      prb.branch_office prb.weight prb.threshold
#> [1,]          1.733354  0.2931724             1
#> [2,]                NA         NA            NA
#> [3,]                NA         NA            NA
#> [4,]                NA         NA            NA
#> [5,]                NA         NA            NA
#> [6,]          2.733354  2.7785993             1
#> [7,]          1.733354  0.7785993             1
#>
# Results for specific record pairs
prob_pids3 <- links_wf_probabilistic(attribute = as.list(dfr),
                                     cmp_func = c(diyar::exact_match,
                                                  diyar::exact_match,
                                                  last_word_cmp,
                                                  last_word_cmp),
                                     score_threshold = score_range$mid_scorce,
                                     id_1 = c(1, 1, 1),
                                     id_2 = c(6, 7, 4))
prob_pids3
#> $pid
#> NULL
#>
#> $pid_weights
#>      sn_x sn_y cmp.staff_id cmp.initials cmp.hair_colour cmp.branch_office
#> [1,]    1    6            0            1               1                 1
#> [2,]    1    7            0            1               1                 1
#> [3,]    1    4            0            0               0                 0
#>      cmp.weight cmp.threshold prb.staff_id prb.initials prb.hair_colour
#> [1,]          3            NA    -4.321928     1.148392        1.733354
#> [2,]          3            NA    -4.321928     1.148392        1.733354
#> [3,]          0            NA    -4.321928    -3.514573       -3.836501
#>      prb.branch_office  prb.weight prb.threshold
#> [1,]          1.733354   0.2931724             1
#> [2,]          1.733354   0.2931724             1
#> [3,]         -3.836501 -15.5095038             0
#>