SURVEY OF SIMILARITY JOIN ALGORITHMS BASED ON MAPREDUCE
DOI:
https://doi.org/10.20319/mijst.2016.s21.214234
Keywords:
Hadoop, MapReduce, Similarity Join
Abstract
Similarity join is a data processing and analysis operation that retrieves all pairs of data items whose distance is less than a predefined threshold. Similarity join algorithms are used in real-world applications such as finding similar documents, images, and strings. This survey explains several similarity join algorithms that are based on the MapReduce approach: Set-Similarity Join, SSJ-2R, MRSimJoin, Pair-wise similarity, the multi-sig-er method, Trie-join, and PreJoin. We then compare these algorithms according to several criteria and discuss the results.
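To make the definition above concrete, the following is a minimal in-memory sketch of a similarity self-join written in a map/shuffle/reduce style. It is not taken from any of the surveyed algorithms: the choice of Jaccard distance over token sets, the function names, and the sample data are all assumptions made for illustration, and no Hadoop API is used. Records are grouped by shared token to produce candidate pairs, and each candidate is then verified against the threshold. Because two records with no token in common have Jaccard distance 1, this candidate generation misses no result as long as the threshold is at most 1.

```python
from collections import defaultdict
from itertools import combinations


def jaccard_distance(a, b):
    """Jaccard distance between two token sets (0 = identical, 1 = disjoint)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def map_phase(records):
    """Emit one (token, record_id) pair per token of every record."""
    for rid, tokens in records.items():
        for tok in tokens:
            yield tok, rid


def shuffle(pairs):
    """Group emitted values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Every two records that share a token become one candidate pair."""
    candidates = set()
    for rids in groups.values():
        for a, b in combinations(sorted(rids), 2):
            candidates.add((a, b))
    return candidates


def similarity_join(records, threshold):
    """Return all record-id pairs whose Jaccard distance is below threshold.

    Assumes threshold <= 1, so pairs sharing no token can be skipped safely.
    """
    candidates = reduce_phase(shuffle(map_phase(records)))
    return sorted(
        (a, b)
        for a, b in candidates
        if jaccard_distance(records[a], records[b]) < threshold
    )


if __name__ == "__main__":
    # Hypothetical sample data: three tiny "documents" as token sets.
    docs = {
        "d1": {"map", "reduce", "join"},
        "d2": {"map", "reduce", "hadoop"},
        "d3": {"image", "pixel"},
    }
    print(similarity_join(docs, 0.6))
```

In this sketch the set of candidate pairs deduplicates records that share several tokens; avoiding that kind of repeated comparison in a truly distributed setting is one of the concerns the surveyed algorithms address in their own ways.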
References
Baraglia, R., Morales, G. D. F., & Lucchese, C. (2010, December). Document similarity self-join with MapReduce. In 2010 IEEE International Conference on Data Mining (pp. 731-736). IEEE.
Gouda, K., & Rashad, M. (2012, May). Prejoin: An efficient trie-based string similarity join algorithm. In Informatics and Systems (INFOS), 2012 8th International Conference on (pp. DE-37). IEEE.
Kolb, L., Thor, A., & Rahm, E. (2013, June). Don't match twice: redundancy-free similarity computation with MapReduce. In Proceedings of the Second Workshop on Data Analytics in the Cloud (pp. 1-5). ACM.
Pang, J., Gu, Y., Xu, J., Bao, Y., & Yu, G. (2014, June). Efficient Graph Similarity Join with Scalable Prefix-Filtering Using MapReduce. In International Conference on Web-Age Information Management (pp. 415-418). Springer International Publishing.
Silva, Y. N., & Reed, J. M. (2012, May). Exploiting MapReduce-based similarity joins. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (pp. 693-696). ACM.
Silva, Y. N., Reed, J. M., & Tsosie, L. M. (2012, August). MapReduce-based similarity join for metric spaces. In Proceedings of the 1st International Workshop on Cloud Intelligence (p. 3). ACM.
Vernica, R., Carey, M. J., & Li, C. (2010, June). Efficient parallel set-similarity joins using MapReduce. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data (pp. 495-506). ACM.
Wang, J., Feng, J., & Li, G. (2010). Trie-join: Efficient trie-based string similarity joins with edit-distance constraints. Proceedings of the VLDB Endowment, 3(1-2), 1219-1230.
Yan, C., Song, Y., Wang, J., & Guo, W. (2015, May). Eliminating the Redundancy in MapReduce-based Entity Resolution. In Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on (pp. 1233-1236). IEEE.
License
Copyright (c) 2016 Authors
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Copyright of Published Articles
Author(s) retain the article copyright and publishing rights without any restrictions.
All published work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.