Predefined logical tests in diyar

A collection of predefined logical tests for use with sub_criteria objects.

exact_match(x, y)

range_match(x, y, range = 10)

prob_link(
  x,
  y,
  cmp_func,
  attr_threshold,
  score_threshold,
  probabilistic,
  return_weights = FALSE
)

true(x, y)

false(x, y)

Arguments

x: Attribute(s) to be compared against.
y: Attribute(s) to be compared by.
range: Maximum allowed difference between y and x (y - x).
cmp_func: Logical tests such as string comparators. See links_wf_probabilistic.
attr_threshold: Matching set of weight thresholds for each result of cmp_func. See links_wf_probabilistic.
score_threshold: Score threshold determining matched or linked records. See links_wf_probabilistic.
probabilistic: If TRUE, matches are determined through a score derived from the Fellegi-Sunter model for probabilistic linkage. See links_wf_probabilistic.
return_weights: If TRUE, returns the match-weights and score-thresholds for record pairs.

Details

exact_match() - test that x == y

range_match() - test that x $\le$ y $\le$ (x + range)

prob_link() - test that a record-pair relates to the same entity, based on the Fellegi and Sunter (1969) model for deciding if two records belong to the same entity.

In summary, record-pairs are created and categorised as matches and non-matches (attr_threshold) with user-defined functions (cmp_func). If probabilistic is TRUE, two probabilities (m and u) are used to calculate weights for matches and non-matches. The m-probability is the probability that matched records are actually from the same entity, i.e. a true match, while the u-probability is the probability that matched records are not from the same entity, i.e. a false match. Record-pairs whose total score is above a certain threshold (score_threshold) are assumed to belong to the same entity.

Agreement (match) and disagreement (non-match) scores are calculated as described by Asher et al. (2020).

For each record pair, an agreement score for attribute $i$ is calculated as:

$$\log_{2}(m_{i}/u_{i})$$

For each record pair, a disagreement score for attribute $i$ is calculated as:

$$\log_{2}((1-m_{i})/(1-u_{i}))$$

where $m_{i}$ and $u_{i}$ are the m- and u-probabilities for each value of attribute $i$.
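
As a rough illustration of the two formulas above, the sketch below computes agreement and disagreement scores in base R for two attributes and sums them for one record pair. The m- and u-probabilities, the agreement pattern and the threshold are made-up values chosen for illustration, and the strict "above the threshold" comparison is an assumption; this is not the internal code of prob_link().

```r
# Hypothetical m- and u-probabilities for two attributes (illustrative values only)
m <- c(forename = 0.95, surname = 0.90)
u <- c(forename = 0.05, surname = 0.10)

# Agreement score for each attribute: log2(m / u)
agree_wt <- log2(m / u)

# Disagreement score for each attribute: log2((1 - m) / (1 - u))
disagree_wt <- log2((1 - m) / (1 - u))

# Hypothetical record pair: forename agrees, surname disagrees
agrees <- c(forename = TRUE, surname = FALSE)
pair_wt <- ifelse(agrees, agree_wt, disagree_wt)

# Total score for the pair, compared against an assumed score threshold
total_score <- sum(pair_wt)
score_threshold <- 0
is_match <- total_score > score_threshold
```

With these values the agreement on forename contributes a positive score and the disagreement on surname a negative one; the pair is treated as a match only because the sum stays above the assumed threshold.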

Note that each probability is calculated as a combined probability for the record pair. For example, if the values of the record-pair have u-probabilities of 0.1 and 0.2 respectively, then the u-probability for the pair will be 0.02.
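
The arithmetic in this example can be checked directly. The sketch assumes the combined probability is the product of the two per-value probabilities, which is what the 0.02 figure implies; the variable names are illustrative only.

```r
# u-probabilities of the two values in the record pair (taken from the example)
u_x <- 0.1
u_y <- 0.2

# Combined u-probability for the pair, assumed to be the product of the two
u_pair <- u_x * u_y
```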

Missing data (NA) are considered non-matches and assigned a u-probability of 0.

Examples

`exact_match`
#> function(x, y) {
#>   x == y & !is.na(x) & !is.na(y)
#> }
#> <bytecode: 0x00000000199e06c0>
#> <environment: namespace:diyar>
exact_match(x = 1, y = 1)
#> [1] TRUE
exact_match(x = 1, y = 2)
#> [1] FALSE

`range_match`
#> function(x, y, range = 10){
#>   x <- as.numeric(x); y <- as.numeric(y)
#>   (x <= y) & (y <= x + range) & !is.na(x) & !is.na(y)
#> }
#> <bytecode: 0x000000004a69cc20>
#> <environment: namespace:diyar>
range_match(x = 10, y = 16, range = 6)
#> [1] TRUE
range_match(x = 16, y = 10, range = 6)
#> [1] FALSE
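
Going by the function bodies printed above, both tests are element-wise and return FALSE, not NA, when either value is missing. The sketch below uses local copies of those bodies so that it runs without diyar installed; the names exact_match_local and range_match_local are hypothetical and not part of the package.

```r
# Local copies of the function bodies printed above (hypothetical names)
exact_match_local <- function(x, y) {
  x == y & !is.na(x) & !is.na(y)
}
range_match_local <- function(x, y, range = 10) {
  x <- as.numeric(x); y <- as.numeric(y)
  (x <= y) & (y <= x + range) & !is.na(x) & !is.na(y)
}

# Element-wise comparison; a pair containing NA evaluates to FALSE
em <- exact_match_local(x = c("A", "B", NA), y = c("A", "C", "A"))
em
#> [1]  TRUE FALSE FALSE

rm_res <- range_match_local(x = c(10, 10, NA), y = c(16, 25, 12), range = 6)
rm_res
#> [1]  TRUE FALSE FALSE
```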