A collection of predefined logical tests used with sub_criteria objects.
exact_match(x, y)
range_match(x, y, range = 10)
prob_link(
x,
y,
cmp_func,
attr_threshold,
score_threshold,
probabilistic,
return_weights = FALSE
)
true(x, y)
false(x, y)
x - Attribute(s) to be compared against.
y - Attribute(s) to be compared by.
range - Maximum difference between y and x (y minus x) for a match.
cmp_func - Logical tests such as string comparators. See links_wf_probabilistic.
attr_threshold - Matching set of weight thresholds for each result of cmp_func. See links_wf_probabilistic.
score_threshold - Score threshold determining matched or linked records. See links_wf_probabilistic.
probabilistic - If TRUE, matches are determined through a score derived from the Fellegi-Sunter model for probabilistic linkage. See links_wf_probabilistic.
return_weights - If TRUE, returns the match-weights and score-thresholds for record pairs.
exact_match() - tests that x == y
range_match() - tests that \(x \le y \le x + \text{range}\)
prob_link() - tests that a record-pair relates to the same entity, using the Fellegi and Sunter (1969) model for deciding whether two records belong to the same entity.
In summary, record-pairs are created and categorised as matches and non-matches (attr_threshold) with user-defined functions (cmp_func).
If probabilistic is TRUE, two probabilities (m and u) are used to calculate weights for matches and non-matches.
The m-probability is the probability that matched records are actually from the same entity, i.e. a true match,
while the u-probability is the probability that matched records are not from the same entity, i.e. a false match.
Record-pairs whose total score is above a certain threshold (score_threshold) are assumed to belong to the same entity.
Agreement (match) and disagreement (non-match) scores are calculated as described by Asher et al. (2020).
For each record pair, an agreement score for attribute \(i\) is calculated as:
$$\log_{2}(m_{i}/u_{i})$$
For each record pair, a disagreement score for attribute \(i\) is calculated as:
$$\log_{2}((1-m_{i})/(1-u_{i}))$$
where \(m_{i}\) and \(u_{i}\) are the m- and u-probabilities for each value of attribute \(i\).
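As a rough illustration of the two formulas above, the following base R sketch computes an agreement and a disagreement score for a single attribute. The m- and u-probabilities are made-up values chosen for illustration, and `agreement_score()` and `disagreement_score()` are hypothetical helper names, not functions exported by the package.

```r
# Hypothetical helpers that restate the two formulas; not part of diyar.
agreement_score <- function(m, u) log2(m / u)
disagreement_score <- function(m, u) log2((1 - m) / (1 - u))

# Assumed (made-up) probabilities for one attribute.
m_i <- 0.95
u_i <- 0.05

agree_w <- agreement_score(m_i, u_i)        # positive when m > u
disagree_w <- disagreement_score(m_i, u_i)  # negative when m > u
```

With these assumed values, agreement on the attribute raises the score for a record pair and disagreement lowers it.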
Note that each probability is calculated as a combined probability for the record pair.
For example, if the values of the record-pair have u-probabilities of 0.1 and 0.2 respectively, then the u-probability for the pair will be 0.02.
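The arithmetic in this example can be reproduced with a short sketch. It assumes the combined probability is the simple product of the two values' probabilities, which is what the figures in the example imply; the variable names are illustrative only.

```r
# u-probabilities of the two values in a record pair (figures from the text).
u_x <- 0.1
u_y <- 0.2

# Combined u-probability for the pair, assumed here to be the product.
u_pair <- u_x * u_y
```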
Missing data (NA) are considered non-matches and assigned a u-probability of 0.
`exact_match`
#> function(x, y) {
#> x == y & !is.na(x) & !is.na(y)
#> }
#> <bytecode: 0x00000000199e06c0>
#> <environment: namespace:diyar>
exact_match(x = 1, y = 1)
#> [1] TRUE
exact_match(x = 1, y = 2)
#> [1] FALSE
`range_match`
#> function(x, y, range = 10){
#> x <- as.numeric(x); y <- as.numeric(y)
#> (x <= y) & (y <= x + range) & !is.na(x) & !is.na(y)
#> }
#> <bytecode: 0x000000004a69cc20>
#> <environment: namespace:diyar>
range_match(x = 10, y = 16, range = 6)
#> [1] TRUE
range_match(x = 16, y = 10, range = 6)
#> [1] FALSE
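The function bodies printed above include `!is.na(x) & !is.na(y)`, so both tests should return FALSE rather than NA when either value is missing. A small sketch of that behaviour follows. It uses local copies of the printed definitions (named `exact_match_copy` and `range_match_copy` purely for illustration) so that it runs without the package attached.

```r
# Local copies of the definitions printed above; the *_copy names are illustrative.
exact_match_copy <- function(x, y) {
  x == y & !is.na(x) & !is.na(y)
}
range_match_copy <- function(x, y, range = 10) {
  x <- as.numeric(x); y <- as.numeric(y)
  (x <= y) & (y <= x + range) & !is.na(x) & !is.na(y)
}

# Both tests are vectorised; the third pair has a missing x.
exact_res <- exact_match_copy(x = c(1, 2, NA), y = c(1, 3, 1))
range_res <- range_match_copy(x = c(10, 16, NA), y = c(16, 10, 12), range = 6)
```

Here the third element of each result is FALSE, because `NA & FALSE` evaluates to FALSE in R.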