Combining Hadoop with MPI to Solve Metagenomics Problems that are both Data- and Compute-intensive

Lin, Han; Su, Zhichao; Meng, Xiandong; Jin, Xu; Wang, Zhong; Han, Wenting; An, Hong; Chi, Mengxian; Wu, Zheng

doi:10.1007/s10766-017-0524-z

Published: 07 October 2017

Volume 46, pages 762–775 (2018)

International Journal of Parallel Programming

Han Lin ORCID: orcid.org/0000-0001-7666-0150¹,
Zhichao Su¹,
Xiandong Meng²,
Xu Jin¹,
Zhong Wang²,
Wenting Han¹,
Hong An¹,
Mengxian Chi¹ &
Zheng Wu¹

418 Accesses
3 Citations

Abstract

Metagenomics, the study of all microbial species cohabiting in an environment, often produces large amounts of sequence data, ranging from several GBs to a few TBs. Analyzing metagenomics data involves both data-intensive and compute-intensive steps, making the entire process hard to scale. Here we aim to optimize a metagenomics application that partitions shotgun metagenomics sequences based on their species of origin. Our solution combines the MapReduce-based BioPig analytic toolkit with MPI to provide scalability with respect to both data and compute. We also improved the existing BioPig toolkit by using simplified data types and compressed k-mer storage. These optimizations lead to speedups of up to 193$\times$ for the compute-intensive step and 9.6$\times$ over the entire pipeline. Our optimized application is also capable of processing datasets that are 16 times larger on the same hardware platform. These results suggest that integrating heterogeneous technologies such as Hadoop and MPI is an efficient way to solve large genomics problems that are both data-intensive and compute-intensive.
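The abstract mentions compressed k-mer storage without giving details. As a rough illustration only, one common way to compress k-mers is to pack each base into 2 bits, so that a k-mer of length k fits in 2k bits instead of k characters. The sketch below assumes this 2-bit scheme and an alphabet restricted to A, C, G, and T; the function names (`pack_kmer`, `unpack_kmer`, `packed_kmers`) are hypothetical and are not taken from BioPig or from the paper.

```python
# Hypothetical illustration of 2-bit k-mer packing.
# Assumption: this encoding is a common technique, not necessarily the
# one used by the authors. Only the bases A, C, G, T are supported.

_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASE = "ACGT"


def pack_kmer(kmer: str) -> int:
    """Pack a k-mer into an integer, using 2 bits per base."""
    value = 0
    for base in kmer:
        # Raises KeyError for N or any other unsupported symbol.
        value = (value << 2) | _CODE[base]
    return value


def unpack_kmer(value: int, k: int) -> str:
    """Recover the k-mer string from its packed integer and its length k."""
    bases = []
    for _ in range(k):
        bases.append(_BASE[value & 3])  # lowest 2 bits hold the last base
        value >>= 2
    return "".join(reversed(bases))


def packed_kmers(read: str, k: int):
    """Yield the packed form of every k-mer in a read via a rolling update."""
    if len(read) < k:
        return
    mask = (1 << (2 * k)) - 1  # keeps only the lowest 2k bits
    value = pack_kmer(read[:k])
    yield value
    for base in read[k:]:
        # Shift in the new base and drop the oldest one.
        value = ((value << 2) | _CODE[base]) & mask
        yield value
```

For example, `pack_kmer("ACGT")` stores four bases in a single byte-sized integer, and `unpack_kmer` reverses the packing when the length is known. The rolling update in `packed_kmers` avoids re-encoding each overlapping k-mer from scratch.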



Acknowledgements

The work was supported by the National Key Research and Development Program of China (Grant No. 2016YFB1000403). Xiandong Meng and Zhong Wang’s work was supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

Author information

Authors and Affiliations

University of Science and Technology of China, Hefei, 230026, Anhui, China
Han Lin, Zhichao Su, Xu Jin, Wenting Han, Hong An, Mengxian Chi & Zheng Wu
DOE Joint Genome Institute and Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
Xiandong Meng & Zhong Wang

Corresponding author

Correspondence to Han Lin.

About this article

Cite this article

Lin, H., Su, Z., Meng, X. et al. Combining Hadoop with MPI to Solve Metagenomics Problems that are both Data- and Compute-intensive. Int J Parallel Prog 46, 762–775 (2018). https://doi.org/10.1007/s10766-017-0524-z

Received: 02 September 2017
Accepted: 18 September 2017
Published: 07 October 2017
Issue Date: August 2018
DOI: https://doi.org/10.1007/s10766-017-0524-z
