Link to original content: https://unpaywall.org/10.1007/S10766-017-0524-Z
Combining Hadoop with MPI to Solve Metagenomics Problems that are both Data- and Compute-intensive | International Journal of Parallel Programming

Abstract

Metagenomics, the study of all microbial species cohabiting in an environment, often produces large amounts of sequence data, ranging from several GBs to a few TBs. Analyzing metagenomics data includes both data-intensive and compute-intensive steps, making the entire process hard to scale. Here we aim to optimize a metagenomics application that partitions shotgun metagenomics sequences based on their species of origin. Our solution combines the MapReduce-based BioPig analytic toolkit with MPI to provide scalability with respect to both data and compute. We also improved the existing BioPig toolkit by using simplified data types and compressed k-mer storage. These optimizations lead to up to 193× speedup for the compute-intensive step and 9.6× speedup over the entire pipeline. Our optimized application is also capable of processing datasets that are 16 times larger on the same hardware platform. These results suggest that integrating heterogeneous technologies such as Hadoop and MPI is an efficient way to solve large genomics problems that are both data-intensive and compute-intensive.
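
The abstract mentions compressed k-mer storage but does not say which encoding is used. As a rough illustration only, the sketch below shows one common approach, packing each base of a k-mer into 2 bits of an integer. The function names (`pack_kmer`, `unpack_kmer`, `kmers`) and the choice to skip windows with ambiguous bases are assumptions made for this example and are not taken from the paper or from BioPig.

```python
# Hypothetical illustration of 2-bit k-mer packing; not the paper's implementation.
_ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_DECODE = "ACGT"


def pack_kmer(kmer: str) -> int:
    """Pack a k-mer over {A, C, G, T} into an integer, 2 bits per base."""
    value = 0
    for base in kmer:
        # Raises KeyError for N or any other symbol outside A, C, G, T.
        value = (value << 2) | _ENCODE[base]
    return value


def unpack_kmer(value: int, k: int) -> str:
    """Recover the k-mer string from its packed integer and its length k."""
    bases = []
    for _ in range(k):
        bases.append(_DECODE[value & 3])  # lowest 2 bits hold the last base
        value >>= 2
    return "".join(reversed(bases))


def kmers(read: str, k: int):
    """Yield packed k-mers from a read, skipping windows with ambiguous bases."""
    for i in range(len(read) - k + 1):
        window = read[i:i + k]
        if all(base in _ENCODE for base in window):
            yield pack_kmer(window)
```

Under this scheme a 20-mer needs 40 bits, so it fits in a single 64-bit integer instead of 20 bytes of ASCII text. That kind of saving is one plausible reason compressed storage would help at the data volumes described above.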

Acknowledgements

The work was supported by the National Key Research and Development Program of China (Grant No. 2016YFB1000403). Xiandong Meng and Zhong Wang’s work was supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

Author information

Corresponding author

Correspondence to Han Lin.

About this article

Cite this article

Lin, H., Su, Z., Meng, X. et al. Combining Hadoop with MPI to Solve Metagenomics Problems that are both Data- and Compute-intensive. Int J Parallel Prog 46, 762–775 (2018). https://doi.org/10.1007/s10766-017-0524-z

  • DOI: https://doi.org/10.1007/s10766-017-0524-z
