Package: zoomerjoin 0.1.5

zoomerjoin: Superlatively Fast Fuzzy Joins

Empowers users to fuzzily-merge data frames with millions or tens of millions of rows in minutes with low memory usage. The package uses the locality sensitive hashing algorithms developed by Datar, Immorlica, Indyk and Mirrokni (2004) <doi:10.1145/997817.997857>, and Broder (1998) <doi:10.1109/SEQUEN.1997.666900> to avoid having to compare every pair of records in each dataset, resulting in fuzzy-merges that finish in linear time.
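
The idea behind the MinHash banding scheme can be sketched in a few lines of base R. This is a minimal illustration, not the package's implementation: the helper names `shingles`, `jaccard`, and `lsh_probability` are hypothetical, and the probability formula is the standard textbook expression for an LSH with `n_bands` bands of `band_width` hashes each, assumed here to correspond to what the package calls its S-curve.

```r
# Hypothetical helper: set of character n-grams of a single string.
# Assumes the string has at least n characters.
shingles <- function(x, n = 2) {
  stopifnot(nchar(x) >= n)
  starts <- seq_len(nchar(x) - n + 1)
  unique(substring(x, starts, starts + n - 1))
}

# Jaccard similarity of two strings: |A intersect B| / |A union B|
# computed on their n-gram sets.
jaccard <- function(a, b, n = 2) {
  sa <- shingles(a, n)
  sb <- shingles(b, n)
  length(intersect(sa, sb)) / length(union(sa, sb))
}

# Textbook banding formula: probability that two records with the given
# Jaccard similarity share at least one band, and are therefore compared.
lsh_probability <- function(similarity, n_bands, band_width) {
  1 - (1 - similarity^band_width)^n_bands
}

s <- jaccard("abcd", "abce")
p <- lsh_probability(s, n_bands = 2, band_width = 1)
```

Only pairs that collide in at least one band are compared directly, which is how the quadratic all-pairs comparison is avoided. Raising `band_width` makes collisions rarer for dissimilar pairs, while raising `n_bands` gives similar pairs more chances to collide.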

Authors: Beniamino Green [aut, cre, cph], Etienne Bacher [ctb], The authors of the dependency Rust crates [ctb, cph]

zoomerjoin_0.1.5.tar.gz
zoomerjoin_0.1.5.zip (r-4.5), zoomerjoin_0.1.5.zip (r-4.4), zoomerjoin_0.1.5.zip (r-4.3)
zoomerjoin_0.1.5.tgz (r-4.4-x86_64), zoomerjoin_0.1.5.tgz (r-4.4-arm64), zoomerjoin_0.1.5.tgz (r-4.3-arm64)
zoomerjoin_0.1.5.tar.gz (r-4.5-noble), zoomerjoin_0.1.5.tar.gz (r-4.4-noble)
zoomerjoin.pdf | zoomerjoin.html
zoomerjoin/json (API)
NEWS

# Install 'zoomerjoin' in R:
install.packages('zoomerjoin', repos = c('https://beniaminogreen.r-universe.dev', 'https://cloud.r-project.org'))

Bug tracker: https://github.com/beniaminogreen/zoomerjoin/issues

blazinglyfast, fuzzyjoin, join, rust, zoomer

24 exports 96 stars 4.31 score 23 dependencies 11 scripts 211 downloads

Last updated 3 months ago from: 4828a1a5ae. Checks: OK: 8. Indexed: yes.

Target               Result   Date
Doc / Vignettes      OK       Aug 31 2024
R-4.5-win-x86_64     OK       Aug 31 2024
R-4.5-linux-x86_64   OK       Aug 31 2024
R-4.4-win-x86_64     OK       Aug 31 2024
R-4.4-mac-x86_64     OK       Aug 31 2024
R-4.4-mac-aarch64    OK       Aug 31 2024
R-4.3-win-x86_64     OK       Aug 31 2024
R-4.3-mac-aarch64    OK       Aug 31 2024

Exports: em_link, euclidean_anti_join, euclidean_full_join, euclidean_inner_join, euclidean_left_join, euclidean_probability, euclidean_right_join, hamming_anti_join, hamming_distance, hamming_full_join, hamming_inner_join, hamming_left_join, hamming_probability, hamming_right_join, jaccard_anti_join, jaccard_curve, jaccard_full_join, jaccard_hyper_grid_search, jaccard_inner_join, jaccard_left_join, jaccard_probability, jaccard_right_join, jaccard_similarity, jaccard_string_group

Dependencies: cli, collapse, cpp11, dplyr, fansi, generics, glue, lifecycle, magrittr, pillar, pkgconfig, purrr, R6, Rcpp, rlang, stringi, stringr, tibble, tidyr, tidyselect, utf8, vctrs, withr

A Zoomerjoin Guided Tour

Rendered from guided_tour.Rmd using knitr::rmarkdown on Aug 31 2024.

Last update: 2024-02-14
Started: 2023-03-09

Benchmarks

Rendered from benchmarks.Rmd using knitr::rmarkdown on Aug 31 2024.

Last update: 2024-02-14
Started: 2023-03-09

Matching Vectors Based on Euclidean Distance

Rendered from matching_vectors.Rmd using knitr::rmarkdown on Aug 31 2024.

Last update: 2024-02-14
Started: 2023-08-06

Readme and manuals

Help Manual

Help page                                                             Topics
Donors from DIME Database                                             dime_data
Fit a Probabilistic Matching Model using Naive Bayes + E.M.           em_link
Fuzzy joins for Euclidean distance using Locality Sensitive Hashing   euclidean_anti_join, euclidean_full_join, euclidean_inner_join, euclidean_left_join, euclidean_right_join
Plot S-Curve for a LSH with given hyperparameters                     euclidean_curve
Find Probability of Match Based on Similarity                         euclidean_probability
Calculate Hamming distance of two character vectors                   hamming_distance
Fuzzy joins for Hamming distance using Locality Sensitive Hashing     hamming_anti_join, hamming_full_join, hamming_inner_join, hamming_left_join, hamming_right_join
Find Probability of Match Based on Similarity                         hamming_probability
Plot S-Curve for a LSH with given hyperparameters                     jaccard_curve
Help Choose the Appropriate LSH Hyperparameters                       jaccard_hyper_grid_search
Fuzzy joins for Jaccard distance using MinHash                        jaccard_anti_join, jaccard_full_join, jaccard_inner_join, jaccard_left_join, jaccard_right_join
Find Probability of Match Based on Similarity                         jaccard_probability
Calculate Jaccard Similarity of two character vectors                 jaccard_similarity
Fuzzy String Grouping Using Minhashing                                jaccard_string_group
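
As a rough usage sketch for the Jaccard join family listed above, the following joins two small data frames on misspelled names. The toy data and the column names `name_a` and `name_b` are hypothetical, and the argument names (`by`, `n_gram_width`, `n_bands`, `band_width`, `threshold`) and values are assumed from the package documentation rather than taken from this page, so check the help manual before relying on them. The call is guarded so the snippet also runs where the package is not installed.

```r
# Hypothetical toy data; column names are illustrative only.
donors_a <- data.frame(name_a = c("Jonathan Smith", "Maria Gonzalez", "Wei Chen"))
donors_b <- data.frame(name_b = c("Jonathon Smith", "Maria Gonzales", "Priya Patel"))

if (requireNamespace("zoomerjoin", quietly = TRUE)) {
  # Keep pairs whose Jaccard similarity (on character bigrams) is at
  # least `threshold`; n_bands and band_width tune the LSH step.
  matches <- zoomerjoin::jaccard_inner_join(
    donors_a, donors_b,
    by = c("name_a" = "name_b"),
    n_gram_width = 2,
    n_bands = 50,
    band_width = 4,
    threshold = 0.6
  )
  print(matches)
}
```

Because the hashing step is randomized, a borderline pair may be found in one run and missed in another; increasing `n_bands` reduces the chance of missing true matches at the cost of more comparisons.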