Package: zoomerjoin 0.2.0

zoomerjoin: Superlatively Fast Fuzzy Joins

Empowers users to fuzzily merge data frames with millions or tens of millions of rows in minutes with low memory usage. The package uses the locality sensitive hashing algorithms developed by Datar, Immorlica, Indyk and Mirrokni (2004) <doi:10.1145/997817.997857>, and Broder (1998) <doi:10.1109/SEQUEN.1997.666900> to avoid having to compare every pair of records across the two datasets, resulting in fuzzy merges that finish in linear time.
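
As a rough illustration of why hashing can replace all-pairs comparison, the sketch below computes the textbook MinHash banding probability: two records with Jaccard similarity s are compared only if they collide in at least one of n_bands bands, each made of band_width hashes. The helper name lsh_match_probability and the values 50 and 8 are assumptions chosen for illustration and are not taken from the package documentation. The sketch uses base R only and does not call zoomerjoin.

```r
# Probability that two records with Jaccard similarity `s` share at least one
# LSH band, under the standard banding scheme: a band matches with
# probability s^band_width, and there are n_bands independent bands.
# (Hypothetical helper for illustration; not a zoomerjoin function.)
lsh_match_probability <- function(s, n_bands, band_width) {
  1 - (1 - s^band_width)^n_bands
}

# Illustrative similarities and hyperparameters (assumed values).
similarities <- c(0.2, 0.5, 0.8, 1.0)
probs <- lsh_match_probability(similarities, n_bands = 50, band_width = 8)

# Dissimilar pairs are rarely compared, while similar pairs almost always
# are, which is what lets the join skip most candidate pairs.
print(data.frame(similarity = similarities, probability = probs))
```

The resulting S-shaped curve is the trade-off being tuned: more bands raise the chance of catching true matches, while wider bands suppress comparisons between dissimilar records.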

Authors: Beniamino Green [aut, cre, cph], Etienne Bacher [ctb], The authors of the dependency Rust crates [ctb, cph]

zoomerjoin_0.2.0.tar.gz
zoomerjoin_0.2.0.zip (r-4.5), zoomerjoin_0.2.0.zip (r-4.4), zoomerjoin_0.2.0.zip (r-4.3)
zoomerjoin_0.2.0.tgz (r-4.4-x86_64), zoomerjoin_0.2.0.tgz (r-4.4-arm64), zoomerjoin_0.2.0.tgz (r-4.3-arm64)
zoomerjoin_0.2.0.tar.gz (r-4.5-noble), zoomerjoin_0.2.0.tar.gz (r-4.4-noble)
zoomerjoin.pdf | zoomerjoin.html
zoomerjoin/json (API)
NEWS

# Install 'zoomerjoin' in R:
install.packages('zoomerjoin', repos = c('https://beniaminogreen.r-universe.dev', 'https://cloud.r-project.org'))

Bug tracker: https://github.com/beniaminogreen/zoomerjoin/issues

blazingly fast fuzzyjoin join rust zoomer

7.61 score, 103 stars, 11 scripts, 195 downloads, 24 exports, 23 dependencies

Last updated 2 months ago from: 466287c16e. Checks: OK: 8. Indexed: yes.

Target | Result | Date
Doc / Vignettes | OK | Nov 22 2024
R-4.5-win-x86_64 | OK | Nov 22 2024
R-4.5-linux-x86_64 | OK | Nov 22 2024
R-4.4-win-x86_64 | OK | Nov 22 2024
R-4.4-mac-x86_64 | OK | Nov 22 2024
R-4.4-mac-aarch64 | OK | Nov 22 2024
R-4.3-win-x86_64 | OK | Nov 22 2024
R-4.3-mac-aarch64 | OK | Nov 22 2024

Exports: em_link, euclidean_anti_join, euclidean_full_join, euclidean_inner_join, euclidean_left_join, euclidean_probability, euclidean_right_join, hamming_anti_join, hamming_distance, hamming_full_join, hamming_inner_join, hamming_left_join, hamming_probability, hamming_right_join, jaccard_anti_join, jaccard_curve, jaccard_full_join, jaccard_hyper_grid_search, jaccard_inner_join, jaccard_left_join, jaccard_probability, jaccard_right_join, jaccard_similarity, jaccard_string_group

Dependencies: cli, collapse, cpp11, dplyr, fansi, generics, glue, lifecycle, magrittr, pillar, pkgconfig, purrr, R6, Rcpp, rlang, stringi, stringr, tibble, tidyr, tidyselect, utf8, vctrs, withr

A Zoomerjoin Guided Tour

Rendered from guided_tour.Rmd using knitr::rmarkdown on Nov 22 2024.

Last update: 2024-02-14
Started: 2023-03-09

Benchmarks

Rendered from benchmarks.Rmd using knitr::rmarkdown on Nov 22 2024.

Last update: 2024-09-23
Started: 2023-03-09

Matching Vectors Based on Euclidean Distance

Rendered from matching_vectors.Rmd using knitr::rmarkdown on Nov 22 2024.

Last update: 2024-02-14
Started: 2023-08-06

Readme and manuals

Help Manual

Help page | Topics
Donors from DIME Database | dime_data
Fit a Probabilistic Matching Model using Naive Bayes + E.M. | em_link
Fuzzy joins for Euclidean distance using Locality Sensitive Hashing | euclidean_anti_join euclidean_full_join euclidean_inner_join euclidean_left_join euclidean_right_join
Plot S-Curve for a LSH with given hyperparameters | euclidean_curve
Find Probability of Match Based on Similarity | euclidean_probability
Calculate Hamming distance of two character vectors | hamming_distance
Fuzzy joins for Hamming distance using Locality Sensitive Hashing | hamming_anti_join hamming_full_join hamming_inner_join hamming_left_join hamming_right_join
Find Probability of Match Based on Similarity | hamming_probability
Plot S-Curve for a LSH with given hyperparameters | jaccard_curve
Help Choose the Appropriate LSH Hyperparameters | jaccard_hyper_grid_search
Fuzzy joins for Jaccard distance using MinHash | jaccard_anti_join jaccard_full_join jaccard_inner_join jaccard_left_join jaccard_right_join
Find Probability of Match Based on Similarity | jaccard_probability
Calculate Jaccard Similarity of two character vectors | jaccard_similarity
Fuzzy String Grouping Using Minhashing | jaccard_string_group