n_gram_merge {refinr} | R Documentation |
This function takes a character vector and edits and merges values
that are approximately equivalent yet not identical. It uses a two-step
process. The first step clusters values based on their ngram fingerprint (described here
https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth).
The second step merges values based on approximate string matching of
the ngram fingerprints, using the [sd_lower_tri()] C function from the
package stringdist.
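To make the first step concrete, here is a minimal base R sketch of an ngram fingerprint, following the general recipe on the OpenRefine page linked above (lowercase, strip punctuation and whitespace, collect the unique ngrams, sort, and join). The helper name `ngram_fingerprint` is hypothetical and is not part of refinr; the package's own fingerprinting also handles business suffixes and `ignore_strings`, which this sketch leaves out.

```r
# Hypothetical helper, for illustration only (not a refinr function).
ngram_fingerprint <- function(x, n = 2) {
  # Lowercase, then drop everything that is not a letter or digit.
  s <- gsub("[^a-z0-9]", "", tolower(x))
  # Strings shorter than one ngram are returned as-is.
  if (nchar(s) < n) return(s)
  # Slide a window of width n across the cleaned string.
  starts <- seq_len(nchar(s) - n + 1)
  grams <- substring(s, starts, starts + n - 1)
  # Deduplicate, sort in a locale-independent order, and join.
  paste(sort(unique(grams), method = "radix"), collapse = "")
}

ngram_fingerprint("Acme Pizza, Inc.")
ngram_fingerprint("ACME  pizza inc")
```

Both calls produce the same key, so under this sketch the two spellings would fall into the same cluster before any approximate matching takes place.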
n_gram_merge(vect, numgram = 2, ignore_strings = NULL, bus_suffix = TRUE, edit_threshold = 1, weight = c(d = 0.33, i = 0.33, s = 1, t = 0.5), ...)
vect |
Character vector, items to be potentially clustered and merged. |
numgram |
Numeric value, indicating the number of characters that will occupy each ngram token. Default value is 2. |
ignore_strings |
Character vector, these strings will be ignored during
the merging of values within |
bus_suffix |
Logical, indicating whether the merging of records should be insensitive to common business suffixes or not. Default value is TRUE. |
edit_threshold |
Numeric value, indicating the threshold at which a
merge is performed, based on the sum of the edit values derived from
param |
weight |
Numeric vector, indicating the weights to assign to
the four edit operations (see details below), for the purpose of
approximate string matching. Default values are
c(d = 0.33, i = 0.33, s = 1, t = 0.5). This parameter gets passed along
to the |
... |
additional args to be passed along to the |
The values of arg weight are edit distance penalties that get passed
to the stringdist edit distance function. The param takes four named
values, each one a specific type of edit, with the following default
penalty values.
d: deletion, default value is 0.33
i: insertion, default value is 0.33
s: substitution, default value is 1
t: transposition, default value is 0.5
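The effect of the four penalties can be illustrated with a small weighted edit distance written in base R. This is a sketch only: `weighted_osa` is a hypothetical helper implementing a textbook optimal string alignment recurrence, it is not the stringdist code that refinr calls, and it is applied here to plain strings rather than to ngram fingerprints.

```r
# Hypothetical helper, for illustration only (not refinr or stringdist code).
weighted_osa <- function(a, b,
                         weight = c(d = 0.33, i = 0.33, s = 1, t = 0.5)) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  nx <- length(x)
  ny <- length(y)
  # D[i + 1, j + 1]: cost of turning the first i chars of a
  # into the first j chars of b.
  D <- matrix(0, nx + 1, ny + 1)
  D[, 1] <- (0:nx) * weight[["d"]]  # delete every char of a
  D[1, ] <- (0:ny) * weight[["i"]]  # insert every char of b
  for (i in seq_len(nx)) {
    for (j in seq_len(ny)) {
      sub_cost <- if (x[i] == y[j]) 0 else weight[["s"]]
      best <- min(D[i, j + 1] + weight[["d"]],  # deletion
                  D[i + 1, j] + weight[["i"]],  # insertion
                  D[i, j] + sub_cost)           # match or substitution
      # Transposition of two adjacent characters.
      if (i > 1 && j > 1 && x[i] == y[j - 1] && x[i - 1] == y[j]) {
        best <- min(best, D[i - 1, j - 1] + weight[["t"]])
      }
      D[i + 1, j + 1] <- best
    }
  }
  D[nx + 1, ny + 1]
}

weighted_osa("pizza", "piza")  # one deletion
weighted_osa("ab", "ba")       # one transposition
weighted_osa("cat", "cot")     # deletion plus insertion beats substitution
```

With the default weights in this sketch, a deletion followed by an insertion costs 0.66, which is cheaper than a single substitution at 1. That is worth keeping in mind when tuning `weight` together with `edit_threshold`.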
Character vector with similar values merged.
x <- c("Acme Pizza, Inc.", "ACME PIZA COMPANY", "Acme Pizzazza LLC")
n_gram_merge(vect = x)

# The performance of the approximate string matching can be adjusted using
# parameters 'weight' or 'edit_threshold'
n_gram_merge(vect = x, weight = c(d = 0.4, i = 1, s = 1, t = 1))

# Use parameter 'ignore_strings' to ignore specific strings during merging
# of values.
x <- c("Bakersfield Highschool", "BAKERSFIELD high", "high school, bakersfield")
n_gram_merge(vect = x, ignore_strings = c("high", "school", "highschool"))