Apache Spark Implementation on Algorithms Boyer-Moore Horspool for Case Studies Internal Transcribed Spacer and Restriction Enzyme

Fidela Zhafirah; Topik Hidayat; Lala Septem Riza

doi:10.17509/jcs.v5i1.70790

Abstract

The rapid growth in data volume is a pressing problem today: larger datasets require more storage and take longer to process, while fast processing is needed to save time. This research addresses both the storage and the processing problem by building a string-matching computational model that runs the Boyer-Moore Horspool algorithm on the big data platform Apache Spark, with the Hadoop Distributed File System (HDFS) as data storage on the cluster. The study compares string-matching processing time across a stand-alone program, Apache Spark on a single node, and Apache Spark on 3, 5, 11, and 16 nodes, all using HDFS storage on clusters hosted on Google Cloud Platform. The case study is in bioinformatics and covers two problems from biology: searching for motifs related to distinguishing flowering plants from other plant groups, and searching for motifs to detect begomovirus, the cause of leaf curl disease. The results show no significant difference in execution time between configurations, because the data used could still be processed by a classical program. The accuracy of the program run on Apache Spark is 83.5%.
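As an illustration of the string-matching step named in the abstract, the following is a minimal Python sketch of a Boyer-Moore Horspool search. It is not the authors' implementation: the function names `build_shift_table` and `horspool_search` are hypothetical, and the sample sequence and the motif `GAATTC` (the EcoRI recognition site) are chosen here only as an example and are not taken from the study's data.

```python
def build_shift_table(pattern):
    """Map each character of the pattern (except the last position)
    to its distance from the end of the pattern."""
    m = len(pattern)
    return {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}


def horspool_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    shift = build_shift_table(pattern)
    matches = []
    pos = 0
    while pos <= n - m:
        # Compare the current window with the pattern.
        if text[pos:pos + m] == pattern:
            matches.append(pos)
        # Shift by the table entry for the last character of the window;
        # characters absent from the table allow a full-length shift.
        pos += shift.get(text[pos + m - 1], m)
    return matches


# Illustrative sequence and motif (assumptions, not data from the study).
sequence = "ATGAATTCGGAATTC"
motif = "GAATTC"
print(horspool_search(sequence, motif))
```

In a Spark job, a function of this kind could plausibly be applied to each sequence record in parallel, for example through an RDD `map`, but how the study actually distributes the work is not described in the abstract.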

Keywords

Apache Spark; Big data; Bioinformatics
