Title: Superlatively Fast Fuzzy Joins
Description: Empowers users to fuzzily-merge data frames with millions or tens of millions of rows in minutes with low memory usage. The package uses the locality-sensitive hashing algorithms developed by Datar, Immorlica, Indyk, and Mirrokni (2004) <doi:10.1145/997817.997857>, and Broder (1998) <doi:10.1109/SEQUEN.1997.666900> to avoid having to compare every pair of records in each dataset, resulting in fuzzy-merges that finish in linear time.
Authors: Beniamino Green [aut, cre, cph], Etienne Bacher [ctb], The authors of the dependency Rust crates [ctb, cph] (see inst/AUTHORS file for details)
Maintainer: Beniamino Green <[email protected]>
License: GPL (>= 3)
Version: 0.2.0
Built: 2024-11-22 05:54:39 UTC
Source: https://github.com/beniaminogreen/zoomerjoin
A set of donor names from the Database on Ideology, Money in Politics, and Elections (DIME). This dataset was used as a benchmark in the 2021 APSR paper "Adaptive Fuzzy String Matching: How to Merge Datasets with Only One (Messy) Identifying Field" by Aaron R. Kaufman and Aja Klevs. The dataset in this package is a subset of the data from the replication archive of that paper. The full dataset can be found in the paper's replication materials at doi:10.7910/DVN/4031UL.
dime_data
A data frame with 10,000 rows and 2 columns:
id: Numeric ID / Row Number
name: Donor Name
Source: Adam Bonica
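A minimal sketch of loading the bundled data, assuming the zoomerjoin package is installed:

# load the packaged DIME subset and inspect the first rows
data("dime_data", package = "zoomerjoin")
head(dime_data)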
A Rust implementation of the Naive Bayes / Fellegi-Sunter model of record linkage as detailed in the article "Using a Probabilistic Model to Assist Merging of Large-Scale Administrative Records" by Enamorado, Fifield, and Imai (2019). Takes an integer matrix describing the similarities between each possible pair of observations, and a vector of initial guesses of the probability each pair is a match (these can either be set from domain knowledge, or one can hand-label a subset of the data and leave the rest as p = .5). Iteratively refines these guesses using the Expectation Maximization algorithm until an optimum is reached. For more details, see doi:10.1017/S0003055418000783.
em_link(X, g, tol = 10^-6, max_iter = 10^3)
X: an integer matrix of similarities. Must go from 0 (the most disagreement) to the maximum without any "gaps" or unused levels. As an example, a column with values 0, 1, 2, 3 is valid, but 0, 1, 2, 4 is not, as 3 is omitted.
g: a vector of initial guesses that are iteratively improved using the EM algorithm (my personal approach is to guess at logistic regression coefficients and use them to create the initial probability guesses). This is chosen to avoid the model getting stuck in a local optimum, and to avoid the problem of label-switching, where the labels for matches and non-matches are reversed.
tol: tolerance in the sense of the infinity norm, i.e. how close the parameters have to be between iterations before the EM algorithm terminates.
max_iter: the number of iterations after which the algorithm will error out if it has not converged.
a vector of probabilities representing the posterior probability each record pair is a match.
inv_logit <- function(x) {
  exp(x) / (1 + exp(x))
}

n <- 10^6
d <- 1:n %% 5 == 0
X <- cbind(
  as.integer(ifelse(d, runif(n) < .8, runif(n) < .2)),
  as.integer(ifelse(d, runif(n) < .9, runif(n) < .2)),
  as.integer(ifelse(d, runif(n) < .7, runif(n) < .2)),
  as.integer(ifelse(d, runif(n) < .6, runif(n) < .2)),
  as.integer(ifelse(d, runif(n) < .5, runif(n) < .2)),
  as.integer(ifelse(d, runif(n) < .1, runif(n) < .9)),
  as.integer(ifelse(d, runif(n) < .1, runif(n) < .9)),
  as.integer(ifelse(d, runif(n) < .8, runif(n) < .01))
)

# initial guess at class assignments based on a hypothetical logistic
# regression. Should be based on domain knowledge, or a handful of
# hand-coded observations.
x_sum <- rowSums(X)
g <- inv_logit((x_sum - mean(x_sum)) / sd(x_sum))

out <- em_link(X, g, tol = .0001, max_iter = 100)
Fuzzy joins for Euclidean distance using Locality Sensitive Hashing
euclidean_anti_join(a, b, by = NULL, threshold = 1, n_bands = 30, band_width = 5, r = 0.5, progress = FALSE)
euclidean_inner_join(a, b, by = NULL, threshold = 1, n_bands = 30, band_width = 5, r = 0.5, progress = FALSE)
euclidean_left_join(a, b, by = NULL, threshold = 1, n_bands = 30, band_width = 5, r = 0.5, progress = FALSE)
euclidean_right_join(a, b, by = NULL, threshold = 1, n_bands = 30, band_width = 5, r = 0.5, progress = FALSE)
euclidean_full_join(a, b, by = NULL, threshold = 1, n_bands = 30, band_width = 5, r = 0.5, progress = FALSE)
a, b: The two dataframes to join.
by: A named vector indicating which columns to join on. Format should be the same as dplyr: by = c("column_name_in_df_a" = "column_name_in_df_b").
threshold: The distance threshold below which units should be considered a match. Note that unlike the Jaccard joins, this value is a distance rather than a similarity, so a lower threshold demands a closer (more similar) match.
n_bands: The number of bands used in the LSH algorithm (default is 30). Use this in conjunction with the band_width argument to determine the performance of the hashing.
band_width: The length of each band used in the hashing algorithm (default is 5). Use this in conjunction with the n_bands argument to determine the performance of the hashing.
r: Hyperparameter used to govern the sensitivity of the locality sensitive hash. Corresponds to the width of the hash bucket in the LSH algorithm; increasing r makes it more likely that distant pairs are hashed into the same bucket and proposed as candidate matches.
progress: Set to TRUE to print progress.
A tibble fuzzily-joined on the basis of the variables in by. Tries to adhere to the same standards as the dplyr joins, and uses the same logical joining patterns (i.e. an inner join keeps only observations present in both datasets).
Datar, Mayur, Nicole Immorlica, Piotr Indyk, and Vahab Mirrokni. "Locality-Sensitive Hashing Scheme Based on p-Stable Distributions." SCG '04: Proceedings of the Twentieth Annual Symposium on Computational Geometry (2004): 253-262.
n <- 10

# Build two matrices that have close values
X_1 <- matrix(c(seq(0, 1, 1 / (n - 1)), seq(0, 1, 1 / (n - 1))), nrow = n)
X_2 <- X_1 + .0000001

X_1 <- as.data.frame(X_1)
X_2 <- as.data.frame(X_2)

X_1$id_1 <- 1:n
X_2$id_2 <- 1:n

# only keep observations that have a match
euclidean_inner_join(X_1, X_2, by = c("V1", "V2"), threshold = .00005)

# keep all observations from X_1, regardless of whether they have a match
euclidean_left_join(X_1, X_2, by = c("V1", "V2"), threshold = .00005)
Plot S-Curve for an LSH with given hyperparameters
euclidean_curve(n_bands, band_width, r, up_to = 100)
n_bands: The number of LSH bands calculated.
band_width: The number of hashes in each band.
r: the "r" hyperparameter used to govern the sensitivity of the hash.
up_to: the right extent of the x axis.
A plot showing the probability a pair is proposed as a match, given the Euclidean distance between the two items.
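As a minimal sketch, the call below plots the S-curve; the hyperparameter values are purely illustrative, not tuned recommendations:

# Plot the probability a pair is proposed as a match, as a function of the
# Euclidean distance between the pair, for 30 bands of width 5 and bucket
# width .5, out to a distance of 10:
euclidean_curve(n_bands = 30, band_width = 5, r = .5, up_to = 10)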
Find Probability of Match Based on Distance
euclidean_probability(distance, n_bands, band_width, r)
distance: the Euclidean distance between the two vectors you want to compare.
n_bands: The number of LSH bands used in hashing.
band_width: The number of hashes in each band.
r: the "r" hyperparameter used to govern the sensitivity of the hash.
a decimal number giving the probability that the two items will be returned as a candidate pair from the LSH algorithm.
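A minimal sketch, with illustrative hyperparameter values matching the Euclidean join defaults:

# Probability two vectors at Euclidean distance 1 are proposed as a
# candidate pair, given 30 bands of width 5 and bucket width .5:
euclidean_probability(distance = 1, n_bands = 30, band_width = 5, r = .5)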
Calculate Hamming distance of two character vectors
hamming_distance(a, b)
a: the first character vector.
b: the second character vector.
a vector of Hamming distances between the strings
hamming_distance( c("ACGTCGATGACGTGATGCGTAGCGTA", "ACGTCGATGTGCTCTCGTCGATCTAC"), c("ACGTCGACGACGTGATGCGCAGCGTA", "ACGTCGATGGGGTCTCGTCGATCTAC") )
hamming_distance( c("ACGTCGATGACGTGATGCGTAGCGTA", "ACGTCGATGTGCTCTCGTCGATCTAC"), c("ACGTCGACGACGTGATGCGCAGCGTA", "ACGTCGATGGGGTCTCGTCGATCTAC") )
Find similar rows between two tables using the Hamming distance. The Hamming distance is equal to the number of characters by which two strings differ, or infinity if the strings have different lengths.
hamming_inner_join(a, b, by = NULL, n_bands = 100, band_width = 8, threshold = 2, progress = FALSE, clean = FALSE, similarity_column = NULL)
hamming_anti_join(a, b, by = NULL, n_bands = 100, band_width = 100, threshold = 2, progress = FALSE, clean = FALSE, similarity_column = NULL)
hamming_left_join(a, b, by = NULL, n_bands = 100, band_width = 100, threshold = 2, progress = FALSE, clean = FALSE, similarity_column = NULL)
hamming_right_join(a, b, by = NULL, n_bands = 100, band_width = 100, threshold = 2, progress = FALSE, clean = FALSE, similarity_column = NULL)
hamming_full_join(a, b, by = NULL, n_bands = 100, band_width = 100, threshold = 2, progress = FALSE, clean = FALSE, similarity_column = NULL)
a, b: The two dataframes to join.
by: A named vector indicating which columns to join on. Format should be the same as dplyr: by = c("column_name_in_df_a" = "column_name_in_df_b").
n_bands: The number of bands used in the locality sensitive hashing algorithm (default is 100). Use this in conjunction with the band_width argument to determine the performance of the hashing.
band_width: The length of each band used in the hashing algorithm (default is 8 for hamming_inner_join and 100 for the other joins). Use this in conjunction with the n_bands argument to determine the performance of the hashing.
threshold: The Hamming distance threshold below which two strings should be considered a match. A distance of zero corresponds to complete equality between strings, while a distance of x between two strings means that x substitutions must be made to transform one string into the other.
progress: Set to TRUE to print progress.
clean: Should the strings that you fuzzy join on be cleaned (coerced to lower-case, stripped of punctuation and spaces)? Default is FALSE.
similarity_column: An optional character vector. If provided, the returned data frame will contain a column with this name giving the Hamming distance between the two fields. The extra column is not present when anti-joining.
A tibble fuzzily-joined on the basis of the variables in by. Tries to adhere to the same standards as the dplyr joins, and uses the same logical joining patterns (i.e. an inner join keeps only observations present in both datasets).
# load baby names data
# install.packages("babynames")
library(babynames)

baby_names <- data.frame(name = tolower(unique(babynames$name))[1:500])
baby_names_mispelled <- data.frame(
  name_mispelled = gsub("[aeiouy]", "x", baby_names$name)
)

# Run the join and only keep rows that have a match:
hamming_inner_join(
  baby_names,
  baby_names_mispelled,
  by = c("name" = "name_mispelled"),
  threshold = 3,
  n_bands = 150,
  band_width = 10,
  clean = FALSE # default
)

# Run the join and keep all rows from the first dataset, regardless of
# whether they have a match:
hamming_left_join(
  baby_names,
  baby_names_mispelled,
  by = c("name" = "name_mispelled"),
  threshold = 3,
  n_bands = 150,
  band_width = 10
)
Find Probability of Match Based on Distance
hamming_probability(distance, input_length, n_bands, band_width)
distance: The Hamming distance between the two strings you want to compare.
input_length: the length (number of characters) of the input strings.
n_bands: The number of LSH bands used in hashing.
band_width: The number of hashes in each band.
A decimal number giving the probability that the two items will be returned as a candidate pair from the LSH algorithm.
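A minimal sketch, with illustrative values:

# Probability two 25-character strings at Hamming distance 2 are proposed
# as a candidate pair, given 100 bands of width 8:
hamming_probability(distance = 2, input_length = 25, n_bands = 100, band_width = 8)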
Plot S-Curve for an LSH with given hyperparameters
jaccard_curve(n_bands, band_width)
n_bands: The number of LSH bands calculated.
band_width: The number of hashes in each band.
A plot showing the probability a pair is proposed as a match, given the Jaccard similarity of the two items.
# Plot the probability two pairs will be matched as a function of their
# jaccard similarity, given the hyperparameters n_bands and band_width.
jaccard_curve(40, 6)
Runs a grid search to find the hyperparameters that will achieve an (s1, s2, p1, p2)-sensitive locality sensitive hash. A locality sensitive hash can be called (s1, s2, p1, p2)-sensitive if two strings with a similarity less than s1 have a less than p1 chance of being compared, while two strings with a similarity greater than s2 have a greater than p2 chance of being compared. As an example, a (.1, .7, .001, .999)-sensitive LSH means that strings with similarity less than .1 will have a .1% chance of being compared, while strings with similarity greater than .7 will have a 99.9% chance of being compared.
jaccard_hyper_grid_search(s1 = 0.1, s2 = 0.7, p1 = 0.001, p2 = 0.999)
s1: the s1 parameter (the first similarity).
s2: the s2 parameter (the second similarity, must be greater than s1).
p1: the p1 parameter (the first probability).
p2: the p2 parameter (the second probability, must be greater than p1).
a named vector with the hyperparameters that will meet the LSH criteria, while reducing runtime.
# Help me find the parameters that will minimize runtime while ensuring that
# two strings with similarity .1 will be compared less than .1% of the time,
# and strings with similarity .9 will have a 99.5% chance of being compared:
jaccard_hyper_grid_search(.1, .9, .001, .995)
Fuzzy joins for Jaccard distance using MinHash
jaccard_inner_join(a, b, by = NULL, block_by = NULL, n_gram_width = 2, n_bands = 50, band_width = 8, threshold = 0.7, progress = FALSE, clean = FALSE, similarity_column = NULL)
jaccard_anti_join(a, b, by = NULL, block_by = NULL, n_gram_width = 2, n_bands = 50, band_width = 8, threshold = 0.7, progress = FALSE, clean = FALSE, similarity_column = NULL)
jaccard_left_join(a, b, by = NULL, block_by = NULL, n_gram_width = 2, n_bands = 50, band_width = 8, threshold = 0.7, progress = FALSE, clean = FALSE, similarity_column = NULL)
jaccard_right_join(a, b, by = NULL, block_by = NULL, n_gram_width = 2, n_bands = 50, band_width = 8, threshold = 0.7, progress = FALSE, clean = FALSE, similarity_column = NULL)
jaccard_full_join(a, b, by = NULL, block_by = NULL, n_gram_width = 2, n_bands = 50, band_width = 8, threshold = 0.7, progress = FALSE, clean = FALSE, similarity_column = NULL)
a, b: The two dataframes to join.
by: A named vector indicating which columns to join on. Format should be the same as dplyr: by = c("column_name_in_df_a" = "column_name_in_df_b").
block_by: A named vector indicating which column to block on, such that rows that disagree on this field cannot be considered a match. Format should be the same as dplyr: by = c("column_name_in_df_a" = "column_name_in_df_b").
n_gram_width: The length of the n-grams used in calculating the Jaccard similarity. For best performance, I set this large enough that the chance any string has a specific n-gram is low.
n_bands: The number of bands used in the minhash algorithm (default is 50). Use this in conjunction with the band_width argument to determine the performance of the hashing.
band_width: The length of each band used in the minhashing algorithm (default is 8). Use this in conjunction with the n_bands argument to determine the performance of the hashing.
threshold: The Jaccard similarity threshold above which two strings should be considered a match (default is .7). The similarity is equal to 1 - the Jaccard distance between the two strings, so 1 implies the strings are identical, while a similarity of zero implies the strings are completely dissimilar.
progress: Set to TRUE to print progress.
clean: Should the strings that you fuzzy join on be cleaned (coerced to lower-case, stripped of punctuation and spaces)? Default is FALSE.
similarity_column: An optional character vector. If provided, the returned data frame will contain a column with this name giving the Jaccard similarity between the two fields. The extra column is not present when anti-joining.
A tibble fuzzily-joined on the basis of the variables in by. Tries to adhere to the same standards as the dplyr joins, and uses the same logical joining patterns (i.e. an inner join keeps only observations present in both datasets).
# load baby names data
# install.packages("babynames")
library(babynames)

baby_names <- data.frame(name = tolower(unique(babynames$name))[1:500])
baby_names_sans_vowels <- data.frame(
  name_wo_vowels = gsub("[aeiouy]", "", baby_names$name)
)

# Check the probability two pairs of strings with similarity .8 will be
# matched with a band width of 8 and 30 bands using the
# `jaccard_probability()` function:
jaccard_probability(.8, 30, 8)

# Run the join and only keep rows that have a match:
jaccard_inner_join(
  baby_names,
  baby_names_sans_vowels,
  by = c("name" = "name_wo_vowels"),
  threshold = .8,
  n_bands = 20,
  band_width = 6,
  n_gram_width = 1,
  clean = FALSE # default
)

# Run the join and keep all rows from the first dataset, regardless of
# whether they have a match:
jaccard_left_join(
  baby_names,
  baby_names_sans_vowels,
  by = c("name" = "name_wo_vowels"),
  threshold = .8,
  n_bands = 20,
  band_width = 6,
  n_gram_width = 1
)
This is a port of the lsh_probability function from the textreuse package, with arguments changed to reflect the hyperparameters in this package. It gives the probability that two strings of a given Jaccard similarity will be matched, given the chosen band width and number of bands.
jaccard_probability(similarity, n_bands, band_width)
similarity: the Jaccard similarity of the two strings you want to compare.
n_bands: The number of LSH bands used in hashing.
band_width: The number of hashes in each band.
a decimal number giving the probability that the two items will be returned as a candidate pair from the minhash algorithm.
# Find the probability two pairs will be matched given they have a
# jaccard_similarity of .8, band width of 5, and 50 bands:
jaccard_probability(.8, n_bands = 50, band_width = 5)
Calculate Jaccard Similarity of two character vectors
jaccard_similarity(a, b, ngram_width = 2)
a: the first character vector.
b: the second character vector.
ngram_width: the length of the shingles / n-grams used in the similarity calculation.
a vector of jaccard similarities of the strings
jaccard_similarity( c("the quick brown fox", "jumped over the lazy dog"), c("the quck bron fx", "jumped over hte lazy dog") )
jaccard_similarity( c("the quick brown fox", "jumped over the lazy dog"), c("the quck bron fx", "jumped over hte lazy dog") )
Performs fuzzy string grouping in which similar strings are assigned to the same group. Uses the cluster_fast_greedy() community detection algorithm from the igraph package to create the groups. igraph must be installed in order to use this function.
jaccard_string_group(string, n_gram_width = 2, n_bands = 45, band_width = 8, threshold = 0.7, progress = FALSE)
string: a character vector you wish to perform entity resolution on.
n_gram_width: the length of the n-grams used in calculating the Jaccard similarity. For best performance, I set this large enough that the chance any string has a specific n-gram is low.
n_bands: the number of bands used in the minhash algorithm (default is 45). Use this in conjunction with the band_width argument to determine the performance of the hashing.
band_width: the length of each band used in the minhashing algorithm (default is 8). Use this in conjunction with the n_bands argument to determine the performance of the hashing.
threshold: the Jaccard similarity threshold above which two strings should be considered a match (default is .7). The similarity is equal to 1 - the Jaccard distance between the two strings, so 1 implies the strings are identical, while a similarity of zero implies the strings are completely dissimilar.
progress: set to TRUE to report progress of the algorithm.
a string vector storing the group of each element in the original input strings. The input vector is grouped so that similar strings belong to the same group, which is given a standardized name.
string <- c(
  "beniamino", "jack", "benjamin", "beniamin",
  "jacky", "giacomo", "gaicomo"
)
jaccard_string_group(string, threshold = .2, n_bands = 90, n_gram_width = 1)