Abstract
Provenance data can support the reproducibility of experiments by providing the history of the data in a scientific workflow. Bioinformatics generates an increasing amount of data, which is often analyzed using workflows. This paper proposes a way to manage automatic executions of Bioinformatics workflows, storing their provenance and raw data in the MongoDB NoSQL database system. It uses a program that manages three different data models (referenced, embedded, and hybrid) for purposes of comparison. The results show general advantages and disadvantages of each data model, as well as some particularities of Bioinformatics.
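To make the three data models concrete, here is a minimal sketch of how one workflow activity and the raw file it consumes might be laid out as documents. It uses plain Python dictionaries instead of a live MongoDB connection, and every field and collection name (`activities`, `files`, `used_file_id`, `used_file`, `content`) is a hypothetical choice for illustration, not the schema used in the paper. The hybrid layout in particular is only one plausible interpretation: it embeds the file's metadata and keeps the bulky content in a separate document.

```python
# Hypothetical provenance records for one workflow step.
# All names and values are illustrative assumptions, not the paper's schema.
ACTIVITY = {"_id": "act1", "name": "alignment", "program": "aligner"}
RAW_FILE = {"_id": "file1", "name": "reads.fastq", "content": "ACGTACGT"}


def referenced(activity, raw_file):
    """Separate documents; the activity stores only the file's _id."""
    act = dict(activity, used_file_id=raw_file["_id"])
    return {"activities": [act], "files": [dict(raw_file)]}


def embedded(activity, raw_file):
    """A single document; the raw file is nested inside the activity."""
    act = dict(activity, used_file=dict(raw_file))
    return {"activities": [act]}


def hybrid(activity, raw_file):
    """File metadata is embedded; the bulky content stays in its own document."""
    metadata = {k: v for k, v in raw_file.items() if k != "content"}
    act = dict(activity, used_file=metadata)
    return {"activities": [act], "files": [dict(raw_file)]}


if __name__ == "__main__":
    for build in (referenced, embedded, hybrid):
        print(build.__name__, build(ACTIVITY, RAW_FILE))
```

In this sketch each top-level key stands for a collection. The referenced layout needs a second lookup to reach the raw data, the embedded layout returns everything in one read but makes each document larger, and the hybrid layout sits between the two.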
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Santana, I., da Silva, W.M.C., Holanda, M. (2019). A NoSQL Solution for Bioinformatics Data Provenance Storage. In: Rocha, Á., Adeli, H., Reis, L., Costanzo, S. (eds) New Knowledge in Information Systems and Technologies. WorldCIST'19 2019. Advances in Intelligent Systems and Computing, vol 930. Springer, Cham. https://doi.org/10.1007/978-3-030-16181-1_50
DOI: https://doi.org/10.1007/978-3-030-16181-1_50
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-16180-4
Online ISBN: 978-3-030-16181-1
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)