Record linkage

Deterministic and probabilistic record linkage with partial or evaluated matches.

link_records(
  attribute,
  blocking_attribute = NULL,
  cmp_func = diyar::exact_match,
  attr_threshold = 1,
  probabilistic = TRUE,
  m_probability = 0.95,
  u_probability = NULL,
  score_threshold = 1,
  repeats_allowed = FALSE,
  permutations_allowed = FALSE,
  data_source = NULL,
  ignore_same_source = TRUE,
  display = "none"
)

links_wf_probabilistic(
  attribute,
  blocking_attribute = NULL,
  cmp_func = diyar::exact_match,
  attr_threshold = 1,
  probabilistic = TRUE,
  m_probability = 0.95,
  u_probability = NULL,
  score_threshold = 1,
  id_1 = NULL,
  id_2 = NULL,
  ...
)

prob_score_range(attribute, m_probability = 0.95, u_probability = NULL)

Arguments

attribute	`[atomic\|list\|data.frame\|matrix\|d_attribute]`. Attributes to compare.
blocking_attribute	`[atomic]`. Subsets of the dataset.
cmp_func	`[list\|function]`. String comparators for each `attribute`. See `Details`.
attr_threshold	`[list\|numeric\|number_line]`. Weight-thresholds for each `cmp_func`. See `Details`.
probabilistic	`[logical]`. If `TRUE`, scores are assigned based on the Fellegi-Sunter model for probabilistic record linkage. See `Details`.
m_probability	`[list\|numeric]`. The probability that matching records are from the same entity.
u_probability	`[list\|numeric]`. The probability that matching records are not from the same entity.
score_threshold	`[numeric\|number_line]`. Score-threshold for linked records. See `Details`.
repeats_allowed	`[logical]`. If `TRUE`, repetitions are included.
permutations_allowed	`[logical]`. If `TRUE`, permutations are included.
data_source	`[character]`. Data source identifier. Adds the list of data sources in each record-group to the `pid`. Useful when the data is from multiple sources.
ignore_same_source	`[logical]`. If `TRUE`, only records from different `data_source` are compared.
display	`[character]`. Display or produce a status update. Options are: `"none"` (default), `"progress"`, `"stats"`, `"none_with_report"`, `"progress_with_report"` or `"stats_with_report"`.
id_1	`[list\|numeric]`. Record id or index of one half of a record pair.
id_2	`[list\|numeric]`. Record id or index of the other half of a record pair.
...	Arguments passed to `links`

Value

pid; list

Details

link_records() and links_wf_probabilistic() are functions to implement deterministic, fuzzy or probabilistic record linkage. link_records() compares every record-pair in one instance, while links_wf_probabilistic() is a wrapper function of links and so compares batches of record-pairs in iterations.

link_records() is more thorough in the sense that it compares every combination of record-pairs. This makes it faster but memory intensive, particularly if there's no blocking_attribute. In contrast, links_wf_probabilistic() is less memory intensive but takes longer since it does its checks in batches.

The implementation of probabilistic record linkage is based on the Fellegi and Sunter (1969) model for deciding if two records belong to the same entity.

In summary, record-pairs are created and categorised as matches and non-matches (attr_threshold) with user-defined functions (cmp_func). Two probabilities (m and u) are then estimated for each record-pair to score the matches and non-matches. The m-probability is the probability that matched records are actually from the same entity, i.e. a true match, while the u-probability is the probability that matched records are not from the same entity, i.e. a false match. By default, u-probabilities are calculated as the frequency of each value of an attribute; however, they can also be supplied along with m-probabilities. Record-pairs whose total score is above a certain threshold (score_threshold) are assumed to belong to the same entity.

Agreement (match) and disagreement (non-match) scores are calculated as described by Asher et al. (2020).

For each record pair, an agreement score for attribute $i$ is calculated as:

$$\log_{2}(m_{i}/u_{i})$$

For each record pair, a disagreement score for attribute $i$ is calculated as:

$$\log_{2}((1-m_{i})/(1-u_{i}))$$

where $m_{i}$ and $u_{i}$ are the m and u-probabilities for each value of attribute $i$.
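
As a minimal sketch of the two formulas above in base R. The values of `m` and `u` are hypothetical and are not taken from any dataset or from the package's internals.

```r
# Hypothetical m- and u-probabilities for one attribute of a record pair
m <- 0.95
u <- 0.10

# Agreement score, applied when the attribute matches
agreement <- log2(m / u)

# Disagreement score, applied when the attribute does not match
disagreement <- log2((1 - m) / (1 - u))

c(agreement = agreement, disagreement = disagreement)
```

When `m` is larger than `u`, the agreement score is positive and the disagreement score is negative, so matching attributes raise a pair's total score and non-matching attributes lower it.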

Note that each probability is calculated as a combined probability for the record pair. For example, if the values of the record-pair have u-probabilities of 0.1 and 0.2 respectively, then the u-probability for the pair will be 0.02.

Missing data (NA) are considered non-matches and assigned a u-probability of 0.
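
The default u-probabilities and the two notes above can be sketched in base R as follows. This is an illustration of the description only, not the package's code: the `initials` vector and the helper `combine_u()` are hypothetical, and using the total number of records as the denominator is an assumption.

```r
# Hypothetical attribute with one missing value
initials <- c("A.B", "C.D", "A.B", NA, "C.D", "A.B")

# Relative frequency of each non-missing value.
# Assumption: the denominator is the total number of records.
values <- unique(initials[!is.na(initials)])
counts <- tabulate(match(initials, values), nbins = length(values))
u <- counts[match(initials, values)] / length(initials)

# Missing values are assigned a u-probability of 0
u[is.na(initials)] <- 0

# Combined u-probability for a record pair, as the product of both values
combine_u <- function(u_x, u_y) u_x * u_y
combine_u(0.1, 0.2)
```

The last call reproduces the worked example in the text, where u-probabilities of 0.1 and 0.2 combine to 0.02.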

By default, matches and non-matches for each attribute are determined as an exact_match with a binary outcome. Alternatively, user-defined functions (cmp_func) are used to create similarity scores. Pairs with similarity scores within attr_threshold are then considered matches for the corresponding attribute.
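
To illustrate the idea of a similarity score checked against a threshold range, here is a base R sketch. The comparator `edit_similarity()` and the range 0.75 to 1 are hypothetical; the sketch applies the range by hand and does not show how link_records() evaluates attr_threshold internally.

```r
# Hypothetical comparator: similarity between 0 and 1 based on edit distance
edit_similarity <- function(x, y) {
  d <- mapply(function(a, b) drop(utils::adist(a, b)), x, y, USE.NAMES = FALSE)
  1 - d / pmax(nchar(x), nchar(y))
}

x <- c("brown", "green")
y <- c("browm", "blue")
sim <- edit_similarity(x, y)

# Assumed threshold range: scores from 0.75 to 1 count as a match
lower <- 0.75
upper <- 1
is_match <- sim >= lower & sim <= upper
is_match
```

Here "brown" and "browm" differ by one character and fall inside the range, while "green" and "blue" do not.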

If probabilistic is FALSE, the sum of all similarity scores is compared against score_threshold, instead of a score derived from the m and u-probabilities.
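
The deterministic case can be sketched in base R with three record pairs copied from the deterministic output in the Examples section (pairs 1-6, 1-7 and 6-7). The `>=` comparison is inferred from that output, where a weight of 2 with a score_threshold of 2 is flagged as a match.

```r
# Binary match outcomes for three record pairs from the Examples output
cmp <- data.frame(
  sn_x = c(1, 1, 6),
  sn_y = c(6, 7, 7),
  cmp.staff_id = c(0, 0, 1),
  cmp.initials = c(1, 1, 1),
  cmp.hair_colour = c(0, 1, 0),
  cmp.branch_office = c(0, 1, 0)
)

score_threshold <- 2

# Sum the per-attribute outcomes, then compare the sum with the threshold
cmp_cols <- grepl("^cmp\\.", names(cmp))
cmp$cmp.weight <- unname(rowSums(cmp[cmp_cols]))
cmp$record.match <- cmp$cmp.weight >= score_threshold
cmp
```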

A blocking_attribute can be used to reduce the processing time by restricting comparisons to subsets of the dataset.
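
A rough sketch of why blocking helps, counting record pairs with and without it in base R. The `block` vector is a hypothetical blocking attribute; the count of 21 unblocked pairs for 7 records agrees with the 21 rows of `pid_weights` in the Examples.

```r
# Without blocking: every combination of 7 records is compared
n <- 7
pairs_all <- choose(n, 2)

# Hypothetical blocking attribute splitting the 7 records into three subsets
block <- c("north", "north", "north", "south", "south", "east", "east")

# With blocking: only pairs within the same subset are compared
pairs_blocked <- sum(choose(as.vector(table(block)), 2))

c(all = pairs_all, blocked = pairs_blocked)
```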

In link_records(), score_threshold is a convenience argument because every combination of record-pairs is returned; therefore, a new score_threshold can be selected after reviewing the final scores. However, in links_wf_probabilistic(), the score_threshold is more important because a final selection is made at each iteration.

As a result, links_wf_probabilistic() requires an acceptable score_threshold in advance. To help with this, prob_score_range() can be used to return the range of scores attainable for a given set of attributes, m and u-probabilities. Additionally, id_1 and id_2 can be used to link specific record pairs, aiding the review of potential scores.
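
Selecting a new score_threshold after reviewing the scores from link_records() might look like the base R sketch below. The rows are copied from the `pid_weights` output of the string comparator example; `new_threshold` is a hypothetical value chosen for illustration.

```r
# A few rows of `pid_weights` copied from the Examples output
pid_weights <- data.frame(
  sn_x = c(1, 1, 2, 6),
  sn_y = c(6, 7, 5, 7),
  prb.weight = c(6.649355, 1.034645, -8.133495, 11.667162)
)

# Hypothetical stricter threshold chosen after reviewing the scores
new_threshold <- 5
pid_weights$record.match <- pid_weights$prb.weight >= new_threshold

# Record pairs that still qualify under the new threshold
pid_weights[pid_weights$record.match, c("sn_x", "sn_y")]
```

This only re-flags the record pairs. Regrouping records into new record-groups would still require running the linkage again with the chosen score_threshold.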

References

Fellegi, I. P., & Sunter, A. B. (1969). A Theory for Record Linkage. Journal of the American Statistical Association, 64(328), 1183–1210. https://doi.org/10.1080/01621459.1969.10501049

Asher, J., Resnick, D., Brite, J., Brackbill, R., & Cone, J. (2020). An Introduction to Probabilistic Record Linkage with a Focus on Linkage Processing for WTC Registries. International Journal of Environmental Research and Public Health, 17(18), 6937. https://doi.org/10.3390/ijerph17186937

Examples

# Deterministic linkage
dfr <- missing_staff_id[c(2, 4, 5, 6)]

link_records(dfr, attr_threshold = 1, probabilistic = FALSE, score_threshold = 2)
#> $pid
#> [1] "P.1 (CRI 001)" "P.2 (No hits)" "P.3 (No hits)" "P.4 (No hits)"
#> [5] "P.5 (No hits)" "P.1 (CRI 001)" "P.1 (CRI 001)"
#> 
#> $pid_weights
#>    sn_x sn_y cmp.staff_id cmp.initials cmp.hair_colour cmp.branch_office
#> 1     1    2            0            0               0                 0
#> 2     1    3            0            0               0                 0
#> 3     1    4            0            0               0                 0
#> 4     1    5            0            0               0                 0
#> 5     1    6            0            1               0                 0
#> 6     1    7            0            1               1                 1
#> 7     2    3            0            0               0                 0
#> 8     2    4            0            0               0                 0
#> 9     2    5            0            0               0                 1
#> 10    2    6            0            0               0                 0
#> 11    2    7            0            0               0                 0
#> 12    3    4            0            1               0                 0
#> 13    3    5            0            0               0                 0
#> 14    3    6            0            0               0                 0
#> 15    3    7            0            0               0                 0
#> 16    4    5            0            0               1                 0
#> 17    4    6            0            0               0                 0
#> 18    4    7            0            0               0                 0
#> 19    5    6            0            0               0                 0
#> 20    5    7            0            0               0                 0
#> 21    6    7            1            1               0                 0
#>    cmp.weight record.match
#> 1           0        FALSE
#> 2           0        FALSE
#> 3           0        FALSE
#> 4           0        FALSE
#> 5           1        FALSE
#> 6           3         TRUE
#> 7           0        FALSE
#> 8           0        FALSE
#> 9           1        FALSE
#> 10          0        FALSE
#> 11          0        FALSE
#> 12          1        FALSE
#> 13          0        FALSE
#> 14          0        FALSE
#> 15          0        FALSE
#> 16          1        FALSE
#> 17          0        FALSE
#> 18          0        FALSE
#> 19          0        FALSE
#> 20          0        FALSE
#> 21          2         TRUE
#> 
links_wf_probabilistic(dfr, attr_threshold = 1, probabilistic = FALSE,
                       score_threshold = 2, recursive = TRUE)
#> $pid
#> [1] "P.1 (CRI 001)" "P.2 (No hits)" "P.3 (No hits)" "P.4 (No hits)"
#> [5] "P.5 (No hits)" "P.1 (CRI 001)" "P.1 (CRI 001)"
#> 
#> $pid_weights
#>   sn_x sn_y cmp.staff_id cmp.initials cmp.hair_colour cmp.branch_office
#> 1    1    1            0            1               1                 1
#> 2    2    2           NA           NA              NA                NA
#> 3    3    3           NA           NA              NA                NA
#> 4    4    4           NA           NA              NA                NA
#> 5    5    5           NA           NA              NA                NA
#> 6    6    7            1            1               0                 0
#> 7    7    1            0            1               1                 1
#>   cmp.weight record.match
#> 1          3         TRUE
#> 2         NA           NA
#> 3         NA           NA
#> 4         NA           NA
#> 5         NA           NA
#> 6          2         TRUE
#> 7          3         TRUE
#> 

# Probabilistic linkage
prob_score_range(dfr)
#> $minimum_score
#> [1] -13.31096
#> 
#> $mid_scorce
#> [1] 1.692975
#> 
#> $maximum_score
#> [1] 16.69691
#> 
link_records(dfr, attr_threshold = 1, probabilistic = TRUE, score_threshold = -16)
#> $pid
#> [1] "P.1 (CRI 001)" "P.1 (CRI 001)" "P.1 (CRI 001)" "P.1 (CRI 001)"
#> [5] "P.1 (CRI 001)" "P.1 (CRI 001)" "P.1 (CRI 001)"
#> 
#> $pid_weights
#>    sn_x sn_y cmp.staff_id cmp.initials cmp.hair_colour cmp.branch_office
#> 1     1    2            0            0               0                 0
#> 2     1    3            0            0               0                 0
#> 3     1    4            0            0               0                 0
#> 4     1    5            0            0               0                 0
#> 5     1    6            0            1               0                 0
#> 6     1    7            0            1               1                 1
#> 7     2    3            0            0               0                 0
#> 8     2    4            0            0               0                 0
#> 9     2    5            0            0               0                 1
#> 10    2    6            0            0               0                 0
#> 11    2    7            0            0               0                 0
#> 12    3    4            0            1               0                 0
#> 13    3    5            0            0               0                 0
#> 14    3    6            0            0               0                 0
#> 15    3    7            0            0               0                 0
#> 16    4    5            0            0               1                 0
#> 17    4    6            0            0               0                 0
#> 18    4    7            0            0               0                 0
#> 19    5    6            0            0               0                 0
#> 20    5    7            0            0               0                 0
#> 21    6    7            1            1               0                 0
#>    cmp.weight prb.staff_id prb.initials prb.hair_colour prb.branch_office
#> 1           0    -3.358454    -3.267306       -3.298333         -3.235597
#> 2           0    -3.358454    -3.170009       -2.873027         -2.873027
#> 3           0    -3.358454    -3.170009       -3.235597         -2.873027
#> 4           0    -3.358454    -2.551099       -3.235597         -3.235597
#> 5           1    -3.358454     1.074391       -3.298333         -3.298333
#> 6           3    -3.358454     1.074391        1.659354          1.659354
#> 7           0    -3.358454    -3.298333       -3.136062         -2.873027
#> 8           0    -3.358454    -3.298333       -3.298333         -2.873027
#> 9           1    -3.358454    -3.136062       -3.298333          1.659354
#> 10          0    -3.358454    -3.267306       -3.328707         -3.298333
#> 11          0    -3.358454    -3.267306       -3.298333         -3.235597
#> 12          1    -3.358454     1.659354       -3.358454         -3.358454
#> 13          0    -3.358454    -2.873027       -3.358454         -3.358454
#> 14          0    -3.358454    -3.170009       -3.358454         -3.358454
#> 15          0    -3.358454    -3.170009       -3.358454         -3.358454
#> 16          1    -3.358454    -2.873027        1.659354         -3.358454
#> 17          0    -3.358454    -3.170009       -3.298333         -3.358454
#> 18          0    -3.358454    -3.170009       -3.235597         -3.358454
#> 19          0    -3.358454    -3.358454       -3.298333         -3.298333
#> 20          0    -3.358454    -3.358454       -3.235597         -3.235597
#> 21          2     1.659354     1.074391       -3.298333         -3.298333
#>    prb.weight record.match
#> 1  -13.159690         TRUE
#> 2  -12.274517         TRUE
#> 3  -12.637087         TRUE
#> 4  -12.380747         TRUE
#> 5   -8.880729         TRUE
#> 6    1.034645         TRUE
#> 7  -12.665876         TRUE
#> 8  -12.828147         TRUE
#> 9   -8.133495         TRUE
#> 10 -13.252800         TRUE
#> 11 -13.159690         TRUE
#> 12  -8.416008         TRUE
#> 13 -12.948389         TRUE
#> 14 -13.245371         TRUE
#> 15 -13.245371         TRUE
#> 16  -7.930581         TRUE
#> 17 -13.185250         TRUE
#> 18 -13.122514         TRUE
#> 19 -13.313574         TRUE
#> 20 -13.188102         TRUE
#> 21  -3.862921         TRUE
#> 
links_wf_probabilistic(dfr, attr_threshold = 1, probabilistic = TRUE,
                       score_threshold = -16, recursive = TRUE)
#> $pid
#> [1] "P.1 (CRI 001)" "P.1 (CRI 001)" "P.1 (CRI 001)" "P.1 (CRI 001)"
#> [5] "P.1 (CRI 001)" "P.1 (CRI 001)" "P.1 (CRI 001)"
#> 
#> $pid_weights
#>   sn_x sn_y cmp.staff_id cmp.initials cmp.hair_colour cmp.branch_office
#> 1    1    1            0            1               1                 1
#> 2    2    1            0            0               0                 0
#> 3    3    4            0            1               0                 0
#> 4    4    1            0            0               0                 0
#> 5    5    1            0            0               0                 0
#> 6    6    1            0            1               0                 0
#> 7    7    1            0            1               1                 1
#>   cmp.weight prb.staff_id prb.initials prb.hair_colour prb.branch_office
#> 1          3    -4.321928     1.148392        1.733354          1.733354
#> 2          0    -4.321928    -3.267306       -3.298333         -3.235597
#> 3          1    -4.321928     1.733354       -4.321928         -4.321928
#> 4          0    -4.321928    -3.170009       -3.235597         -4.321928
#> 5          0    -4.321928    -4.321928       -3.235597         -3.235597
#> 6          1    -3.836501     1.148392       -3.298333         -3.298333
#> 7          3    -3.836501     1.148392        1.733354          1.733354
#>    prb.weight record.match
#> 1   0.2931724         TRUE
#> 2 -14.1231644         TRUE
#> 3 -11.2324299         TRUE
#> 4 -15.0494623         TRUE
#> 5 -15.1150506         TRUE
#> 6  -9.2847754         TRUE
#> 7   0.7785993         TRUE
#> 

# Using string comparators
# For example, matching last word in `hair_colour` and `branch_office`
last_word_wf <- function(x) tolower(gsub("^.* ", "", x))
last_word_cmp <- function(x, y) last_word_wf(x) == last_word_wf(y)

link_records(dfr, attr_threshold = 1,
             cmp_func = c(diyar::exact_match,
                          diyar::exact_match,
                          last_word_cmp,
                          last_word_cmp),
             score_threshold = -4)
#> $pid
#> [1] "P.1 (CRI 001)" "P.2 (No hits)" "P.3 (No hits)" "P.4 (No hits)"
#> [5] "P.5 (No hits)" "P.1 (CRI 001)" "P.1 (CRI 001)"
#> 
#> $pid_weights
#>    sn_x sn_y cmp.staff_id cmp.initials cmp.hair_colour cmp.branch_office
#> 1     1    2            0            0               0                 0
#> 2     1    3            0            0               0                 0
#> 3     1    4            0            0               0                 0
#> 4     1    5            0            0               0                 0
#> 5     1    6            0            1               1                 1
#> 6     1    7            0            1               1                 1
#> 7     2    3            0            0               0                 0
#> 8     2    4            0            0               0                 0
#> 9     2    5            0            0               0                 1
#> 10    2    6            0            0               0                 0
#> 11    2    7            0            0               0                 0
#> 12    3    4            0            1               0                 0
#> 13    3    5            0            0               0                 0
#> 14    3    6            0            0               0                 0
#> 15    3    7            0            0               0                 0
#> 16    4    5            0            0               1                 0
#> 17    4    6            0            0               0                 0
#> 18    4    7            0            0               0                 0
#> 19    5    6            0            0               0                 0
#> 20    5    7            0            0               0                 0
#> 21    6    7            1            1               1                 1
#>    cmp.weight prb.staff_id prb.initials prb.hair_colour prb.branch_office
#> 1           0    -3.358454    -3.267306       -3.298333         -3.235597
#> 2           0    -3.358454    -3.170009       -2.873027         -2.873027
#> 3           0    -3.358454    -3.170009       -3.235597         -2.873027
#> 4           0    -3.358454    -2.551099       -3.235597         -3.235597
#> 5           3    -3.358454     1.074391        4.466709          4.466709
#> 6           3    -3.358454     1.074391        1.659354          1.659354
#> 7           0    -3.358454    -3.298333       -3.136062         -2.873027
#> 8           0    -3.358454    -3.298333       -3.298333         -2.873027
#> 9           1    -3.358454    -3.136062       -3.298333          1.659354
#> 10          0    -3.358454    -3.267306       -3.328707         -3.298333
#> 11          0    -3.358454    -3.267306       -3.298333         -3.235597
#> 12          1    -3.358454     1.659354       -3.358454         -3.358454
#> 13          0    -3.358454    -2.873027       -3.358454         -3.358454
#> 14          0    -3.358454    -3.170009       -3.358454         -3.358454
#> 15          0    -3.358454    -3.170009       -3.358454         -3.358454
#> 16          1    -3.358454    -2.873027        1.659354         -3.358454
#> 17          0    -3.358454    -3.170009       -3.298333         -3.358454
#> 18          0    -3.358454    -3.170009       -3.235597         -3.358454
#> 19          0    -3.358454    -3.358454       -3.298333         -3.298333
#> 20          0    -3.358454    -3.358454       -3.235597         -3.235597
#> 21          4     1.659354     1.074391        4.466709          4.466709
#>    prb.weight record.match
#> 1  -13.159690        FALSE
#> 2  -12.274517        FALSE
#> 3  -12.637087        FALSE
#> 4  -12.380747        FALSE
#> 5    6.649355         TRUE
#> 6    1.034645         TRUE
#> 7  -12.665876        FALSE
#> 8  -12.828147        FALSE
#> 9   -8.133495        FALSE
#> 10 -13.252800        FALSE
#> 11 -13.159690        FALSE
#> 12  -8.416008        FALSE
#> 13 -12.948389        FALSE
#> 14 -13.245371        FALSE
#> 15 -13.245371        FALSE
#> 16  -7.930581        FALSE
#> 17 -13.185250        FALSE
#> 18 -13.122514        FALSE
#> 19 -13.313574        FALSE
#> 20 -13.188102        FALSE
#> 21  11.667162         TRUE
#> 
links_wf_probabilistic(dfr, attr_threshold = 1,
                    cmp_func = c(diyar::exact_match,
                                 diyar::exact_match,
                                 last_word_cmp,
                                 last_word_cmp),
                    score_threshold = -4,
                    recursive = TRUE)
#> $pid
#> [1] "P.1 (CRI 001)" "P.2 (No hits)" "P.3 (No hits)" "P.4 (No hits)"
#> [5] "P.5 (No hits)" "P.1 (CRI 001)" "P.1 (CRI 001)"
#> 
#> $pid_weights
#>   sn_x sn_y cmp.staff_id cmp.initials cmp.hair_colour cmp.branch_office
#> 1    1    1            0            1               1                 1
#> 2    2    2           NA           NA              NA                NA
#> 3    3    3           NA           NA              NA                NA
#> 4    4    4           NA           NA              NA                NA
#> 5    5    5           NA           NA              NA                NA
#> 6    6    1            0            1               1                 1
#> 7    7    1            0            1               1                 1
#>   cmp.weight prb.staff_id prb.initials prb.hair_colour prb.branch_office
#> 1          3    -4.321928     1.148392        1.733354          1.733354
#> 2         NA           NA           NA              NA                NA
#> 3         NA           NA           NA              NA                NA
#> 4         NA           NA           NA              NA                NA
#> 5         NA           NA           NA              NA                NA
#> 6          3    -3.836501     1.148392        4.466709          4.466709
#> 7          3    -3.836501     1.148392        1.733354          1.733354
#>   prb.weight record.match
#> 1  0.2931724         TRUE
#> 2         NA           NA
#> 3         NA           NA
#> 4         NA           NA
#> 5         NA           NA
#> 6  6.2453079         TRUE
#> 7  0.7785993         TRUE
#>
