Abstract
Subgroup discovery is a well-known technique for extracting patterns with respect to a variable of interest in the data. However, the explosion in data gathering has hampered the performance of traditional algorithms at discovering interesting relationships between different objects in a set with respect to a specific property of interest to the user. In this regard, our goal is to propose a set of efficient techniques to mine subgroups on Big Data by means of Apache Spark. To this end, AprioriK-SD-OE and PFP-SD-OE are proposed as fast exhaustive search algorithms to discover subgroups on Big Data. The experimental study includes more than 70 datasets with search spaces larger than \(10^{15}\) subgroups. The scalability of our proposals is analyzed on datasets with 200 million instances, demonstrating the usefulness of Spark for tackling Big Data.
Acknowledgements
This work was supported by the Spanish Ministry of Economy and Competitiveness and FEDER funds, Project TIN-2014-55252-P.
Cite this article
Padillo, F., Luna, J.M. & Ventura, S. Exhaustive search algorithms to mine subgroups on Big Data using Apache Spark. Prog Artif Intell 6, 145–158 (2017). https://doi.org/10.1007/s13748-017-0112-x
DOI: https://doi.org/10.1007/s13748-017-0112-x