Abstract
Modern applications are digesting and generating data with rich features that are stored in high dimensional array or tensor. The computation applied to tensor, such as Canonical Polyadic decomposition (CP decomposition) plays an important role in understanding the internal relationships within the data. Using CP decomposition to analyze large tensor with billions of sizes requires tremendous computation power. In the meanwhile, the emerging Sunway many-core processor has demonstrated its computation advantage in powering the first hundred petaFLOPS supercomputer in the world. In this paper, we propose swTensor that adapts the CP decomposition to Sunway processor by leveraging the MapReduce framework for automatic parallelization and the unique architecture of Sunway for high performance. Specifically, we divide the major computation of CP decomposition into four sub-procedures and implement each using MapReduce framework with customized design key-value pair. Also, we tile the data during the computation so that it fits into the limited local device memory on Sunway for better performance. Moreover, we propose a performance auto-tuning mechanism to search for the optimal parameter settings in swTensor. The experimental results demonstrate swTensor achieves better performance than the state-of-the-art BigTensor and CSTF with the average speedup of 1.36 \(\times \) and 1.24 \(\times \), respectively. Besides, swTensor exhibits better scalability when scaling across multiple Sunway processors.
Similar content being viewed by others
References
Aarts, E., Korst, J., Michiels, W.: Simulated annealing. Local Search Combin. Optim. 36, partii(3), 187–210 (2007)
Acar, E., Dunlavy, D.M., Kolda, T.G., Mørup, M.: Scalable tensor factorizations for incomplete data. Chemometr. Intell. Lab. Syst. 106(1), 41–56 (2011)
Aja-Fernández, S., de Luis Garcia, R., Tao, D., Li, X.: Tensors in Image Processing and Computer Vision. Springer, Berlin (2009)
Bertsimas, D., Tsitsiklis, J.: Simulated annealing. Stat. Sci. 8(1), 10–15 (1993)
Beutel, A., Talukdar, P.P., Kumar, A., Faloutsos, C., Papalexakis, E.E., Xing, E.P.: Flexifact: scalable flexible factorization of coupled tensors on hadoop. In: Proceedings of the 2014 SIAM International Conference on Data Mining, pp. 109–117. SIAM (2014)
Blanco, Z., Liu, B., Dehnavi, M.M.: Cstf: Large-scale sparse tensor factorizations on distributed platforms. In: Proceedings of the 47th International Conference on Parallel Processing, p. 21. ACM (2018)
Chang, K.W., Yih, S.W.t., Yang, B., Meek, C.: Typed tensor decomposition of knowledge bases for relation extraction (2014)
Chen, B., Fu, H., Wei, Y., He, C., Zhang, W., Li, Y., Wan, W., Zhang, W., Gan, L., Zhang, W., et al.: Simulating the Wenchuan earthquake with accurate surface topography on Sunway taihulight. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, p. 40. IEEE Press (2018)
Choi, J., Liu, X., Smith, S., Simon, T.: Blocking optimization techniques for sparse tensor computation. In: 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 568–577. IEEE (2018)
Choi, J.H., Vishwanathan, S.V.N.: Dfacto: distributed factorization of tensors. Adv. Neural Inf. Process. Syst. 2, 1296–1304 (2014)
Cichocki, A., Mandic, D., De Lathauwer, L., Zhou, G., Zhao, Q., Caiafa, C., Phan, H.A.: Tensor decompositions for signal processing applications: from two-way to multiway component analysis. IEEE Signal Process. Mag. 32(2), 145–163 (2015)
Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
Duan, X., Gao, P., Zhang, T., Zhang, M., Liu, W., Zhang, W., Xue, W., Fu, H., Gan, L., Chen, D., et al.: Redesigning lammps for peta-scale and hundred-billion-atom simulation on sunway taihulight. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, p. 12. IEEE Press (2018)
Golub, G.H., Van Loan, C.F.: Matrix Computations, vol. 3. JHU Press, Baltimore (2012)
Han, Q., Yang, H., Luan, Z., Qian, D.: Accelerating tile low-rank gemm on Sunway architecture: Poster. In: Proceedings of the 16th ACM International Conference on Computing Frontiers, pp. 295–297. ACM (2019)
Harper, F.M., Konstan, J.A.: The movielens datasets: history and context. ACM Trans. Interact. Intell. Syst. (tiis) 5(4), 19 (2016)
He, X., Zhang, H., Kan, M.Y., Chua, T.S.: Fast matrix factorization for online recommendation with implicit feedback. In: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 549–558. ACM (2016)
Hitchcock, F.L.: The expression of a tensor or a polyadic as a sum of products. J. Math. Phys. 6(1–4), 164–189 (1927)
Hu, Y., Yang, H., Luan, Z., Qian, D.: Massively scaling seismic processing on sunway taihulight supercomputer. arXiv preprint arXiv:1907.11678 (2019)
Jeon, I., Papalexakis, E.E., Faloutsos, C., Sael, L., Kang, U.: Mining billion-scale tensors: algorithms and discoveries. Int. J. Very Large Data Bases 25(4), 519–544 (2016)
Jeon, I., Papalexakis, E.E., Kang, U., Faloutsos, C.: Haten2: Billion-scale tensor decompositions. In: Data Engineering (ICDE), 2015 IEEE 31st International Conference on, pp. 1047–1058. IEEE (2015)
Kang, U., Papalexakis, E., Harpale, A., Faloutsos, C.: Gigatensor: scaling tensor analysis up by 100 times-algorithms and discoveries. In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 316–324. ACM (2012)
Kolda, T.G., Bader, B.W.: Tensor decompositions and applications. SIAM Rev. 51(3), 455–500 (2009)
Lei, C., Yang, Y.H.: Tri-focal tensor-based multiple video synchronization with subframe optimization. IEEE Trans. Image Process. 15(9), 2473–2480 (2006)
Li, J., Ma, Y., Yan, C., Vuduc, R.: Optimizing sparse tensor times matrix on multi-core and many-core architectures. In: Proceedings of the Sixth Workshop on Irregular Applications: Architectures and Algorithms, pp. 26–33. IEEE Press (2016)
Li, L., Yu, T., Zhao, W., Fu, H., Wang, C., Tan, L., Yang, G., Thomson, J.: Large-scale hierarchical k-means for heterogeneous many-core supercomputers. In: SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 160–170. IEEE (2018a)
Li, M., Liu, Y., Yang, H., Luan, Z., Qian, D.: Multi-role sptrsv on sunway many-core architecture. In: 2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), pp. 594–601. IEEE (2018b)
Lim, L.H., Comon, P.: Multiarray signal processing: Tensor decomposition meets compressed sensing. arXiv preprint arXiv:1002.4935 (2010)
Liu, B., Wen, C., Sarwate, A.D., Dehnavi, M.M.: A unified optimization approach for sparse tensor operations on gpus. In: 2017 IEEE International Conference on Cluster Computing (CLUSTER), pp. 47–57. IEEE (2017)
Liu, C., Xie, B., Liu, X., Xue, W., Yang, H., Liu, X.: Towards efficient spmv on sunway manycore architectures. In: Proceedings of the 2018 International Conference on Supercomputing, pp. 363–373. ACM (2018)
Liu, C., Yang, H., Sun, R., Luan, Z., Qian, D.: swtvm: Exploring the automated compilation for deep learning on sunway architecture. arXiv preprint arXiv:1904.07404 (2019)
Ma, Y., Li, J., Wu, X., Yan, C., Sun, J., Vuduc, R.: Optimizing sparse tensor times matrix on gpus. J. Parallel Distrib. Comput. 129, 99–109 (2019)
Nickel, M., Tresp, V., Kriegel, H.P.: Factorizing yago: scalable machine learning for linked data. In: Proceedings of the 21st international conference on World Wide Web, pp. 271–280. ACM (2012)
Nisa, I., Li, J., Sukumaran-Rajam, A., Vuduc, R., Sadayappan, P.: Load-balanced sparse mttkrp on gpus. arXiv preprint arXiv:1904.03329 (2019)
Oh, S., Park, N., Jang, J.G., Sael, L., Kang, U.: High-performance tucker factorization on heterogeneous platforms. IEEE Trans. Parallel Distrib. Syst. (2019)
Park, N., Jeon, B., Lee, J., Kang, U.: Bigtensor: Mining billion-scale tensor made easy. In: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp. 2457–2460. ACM (2016)
Phipps, E.T., Kolda, T.G.: Software for sparse tensor decomposition on emerging computing architectures. SIAM J. Sci. Comput. 41(3), C269–C290 (2019)
Shashua, A., Hazan, T.: Non-negative tensor factorization with applications to statistics and computer vision. In: Proceedings of the 22nd international conference on Machine learning, pp. 792–799. ACM (2005)
Sidiropoulos, N.D., De Lathauwer, L., Fu, X., Huang, K., Papalexakis, E.E., Faloutsos, C.: Tensor decomposition for signal processing and machine learning. IEEE Trans. Signal Process. 65(13), 3551–3582 (2017)
Smith, S., Choi, J.W., Li, J., Vuduc, R., Park, J., Liu, X., Karypis, G.: Frostt: The formidable repository of open sparse tensors and tools (2017)
Smith, S., Park, J., Karypis, G.: Sparse tensor factorization on many-core processors with high-bandwidth memory. In: Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International, pp. 1058–1067. IEEE (2017)
Sonka, M., Hlavac, V., Boyle, R.: Image processing, analysis, and machine vision. Cengage Learning (2014)
Tew, P.A.: An investigation of sparse tensor formats for tensor libraries. Ph.D. thesis, Massachusetts Institute of Technology (2016)
Tsourakakis, C.E.: Data mining with mapreduce: Graph and tensor algorithms with applications. Diss. Master’s thesis, Carnegie Mellon University (2010)
Tucker, L.R.: Implications of factor analysis of three-way matrices for measurement of change. Probl. Meas. Change 15, 122–137 (1963)
Wang, X., Xue, W., Liu, W., Wu, L.: swsptrsv: a fast sparse triangular solve with sparse level tile layout on sunway architectures. In: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 338–353. ACM (2018)
Williams, S., Waterman, A., Patterson, D.: Roofline: An insightful visual performance model for floating-point programs and multicore architectures. Tech. rep., Lawrence Berkeley National Lab.(LBNL), Berkeley (2009)
Xu, Z., Lin, J., Matsuoka, S.: Benchmarking sw26010 many-core processor. In: Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017 IEEE International, pp. 743–752. IEEE (2017)
Yelp dataset challenge. (2019) https://www.yelp.com/dataset/challenge
Zhong, X., Li, M., Yang, H., Liu, Y., Qian, D.: swmr: A framework for accelerating mapreduce applications on sunway taihulight. IEEE Trans. Emerg. Topics Comput. (2018)
Acknowledgements
The authors would like to thank all anonymous reviewers for their insightful comments and suggestions. This work is supported by National Key R&D Program of China (Grant No. 2016YFB1000503 and 2016YFA0602200), National Natural Science Foundation of China (Grant No. 61502019 and 61732002), State Key Laboratory of Software Development Environment (Grant No. SKLSDE-2018ZX-19), Center for High Performance Computing and System Simulation, Pilot National Laboratory for Marine Science and Technology (Qingdao). Hailong Yang is the corresponding author.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Rights and permissions
About this article
Cite this article
Zhong, X., Yang, H., Luan, Z. et al. swTensor: accelerating tensor decomposition on Sunway architecture. CCF Trans. HPC 1, 161–176 (2019). https://doi.org/10.1007/s42514-019-00017-5
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s42514-019-00017-5