iBet uBet web content aggregator. Adding the entire web to your favor.
iBet uBet web content aggregator. Adding the entire web to your favor.



Link to original content: https://unpaywall.org/10.1007/S42514-019-00017-5
swTensor: accelerating tensor decomposition on Sunway architecture | CCF Transactions on High Performance Computing Skip to main content
Log in

swTensor: accelerating tensor decomposition on Sunway architecture

  • Regular Paper
  • Published:
CCF Transactions on High Performance Computing Aims and scope Submit manuscript

Abstract

Modern applications are digesting and generating data with rich features that are stored in high dimensional array or tensor. The computation applied to tensor, such as Canonical Polyadic decomposition (CP decomposition) plays an important role in understanding the internal relationships within the data. Using CP decomposition to analyze large tensor with billions of sizes requires tremendous computation power. In the meanwhile, the emerging Sunway many-core processor has demonstrated its computation advantage in powering the first hundred petaFLOPS supercomputer in the world. In this paper, we propose swTensor that adapts the CP decomposition to Sunway processor by leveraging the MapReduce framework for automatic parallelization and the unique architecture of Sunway for high performance. Specifically, we divide the major computation of CP decomposition into four sub-procedures and implement each using MapReduce framework with customized design key-value pair. Also, we tile the data during the computation so that it fits into the limited local device memory on Sunway for better performance. Moreover, we propose a performance auto-tuning mechanism to search for the optimal parameter settings in swTensor. The experimental results demonstrate swTensor achieves better performance than the state-of-the-art BigTensor and CSTF with the average speedup of 1.36 \(\times \) and 1.24 \(\times \), respectively. Besides, swTensor exhibits better scalability when scaling across multiple Sunway processors.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12

Similar content being viewed by others

References

  • Aarts, E., Korst, J., Michiels, W.: Simulated annealing. Local Search Combin. Optim. 36, partii(3), 187–210 (2007)

    MATH  Google Scholar 

  • Acar, E., Dunlavy, D.M., Kolda, T.G., Mørup, M.: Scalable tensor factorizations for incomplete data. Chemometr. Intell. Lab. Syst. 106(1), 41–56 (2011)

    Article  Google Scholar 

  • Aja-Fernández, S., de Luis Garcia, R., Tao, D., Li, X.: Tensors in Image Processing and Computer Vision. Springer, Berlin (2009)

    Book  Google Scholar 

  • Bertsimas, D., Tsitsiklis, J.: Simulated annealing. Stat. Sci. 8(1), 10–15 (1993)

    Article  Google Scholar 

  • Beutel, A., Talukdar, P.P., Kumar, A., Faloutsos, C., Papalexakis, E.E., Xing, E.P.: Flexifact: scalable flexible factorization of coupled tensors on hadoop. In: Proceedings of the 2014 SIAM International Conference on Data Mining, pp. 109–117. SIAM (2014)

  • Blanco, Z., Liu, B., Dehnavi, M.M.: Cstf: Large-scale sparse tensor factorizations on distributed platforms. In: Proceedings of the 47th International Conference on Parallel Processing, p. 21. ACM (2018)

  • Chang, K.W., Yih, S.W.t., Yang, B., Meek, C.: Typed tensor decomposition of knowledge bases for relation extraction (2014)

  • Chen, B., Fu, H., Wei, Y., He, C., Zhang, W., Li, Y., Wan, W., Zhang, W., Gan, L., Zhang, W., et al.: Simulating the Wenchuan earthquake with accurate surface topography on Sunway taihulight. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, p. 40. IEEE Press (2018)

  • Choi, J., Liu, X., Smith, S., Simon, T.: Blocking optimization techniques for sparse tensor computation. In: 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 568–577. IEEE (2018)

  • Choi, J.H., Vishwanathan, S.V.N.: Dfacto: distributed factorization of tensors. Adv. Neural Inf. Process. Syst. 2, 1296–1304 (2014)

    Google Scholar 

  • Cichocki, A., Mandic, D., De Lathauwer, L., Zhou, G., Zhao, Q., Caiafa, C., Phan, H.A.: Tensor decompositions for signal processing applications: from two-way to multiway component analysis. IEEE Signal Process. Mag. 32(2), 145–163 (2015)

    Article  Google Scholar 

  • Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)

    Article  Google Scholar 

  • Duan, X., Gao, P., Zhang, T., Zhang, M., Liu, W., Zhang, W., Xue, W., Fu, H., Gan, L., Chen, D., et al.: Redesigning lammps for peta-scale and hundred-billion-atom simulation on sunway taihulight. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, p. 12. IEEE Press (2018)

  • Golub, G.H., Van Loan, C.F.: Matrix Computations, vol. 3. JHU Press, Baltimore (2012)

    MATH  Google Scholar 

  • Han, Q., Yang, H., Luan, Z., Qian, D.: Accelerating tile low-rank gemm on Sunway architecture: Poster. In: Proceedings of the 16th ACM International Conference on Computing Frontiers, pp. 295–297. ACM (2019)

  • Harper, F.M., Konstan, J.A.: The movielens datasets: history and context. ACM Trans. Interact. Intell. Syst. (tiis) 5(4), 19 (2016)

    Google Scholar 

  • He, X., Zhang, H., Kan, M.Y., Chua, T.S.: Fast matrix factorization for online recommendation with implicit feedback. In: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 549–558. ACM (2016)

  • Hitchcock, F.L.: The expression of a tensor or a polyadic as a sum of products. J. Math. Phys. 6(1–4), 164–189 (1927)

    Article  Google Scholar 

  • Hu, Y., Yang, H., Luan, Z., Qian, D.: Massively scaling seismic processing on sunway taihulight supercomputer. arXiv preprint arXiv:1907.11678 (2019)

  • Jeon, I., Papalexakis, E.E., Faloutsos, C., Sael, L., Kang, U.: Mining billion-scale tensors: algorithms and discoveries. Int. J. Very Large Data Bases 25(4), 519–544 (2016)

    Article  Google Scholar 

  • Jeon, I., Papalexakis, E.E., Kang, U., Faloutsos, C.: Haten2: Billion-scale tensor decompositions. In: Data Engineering (ICDE), 2015 IEEE 31st International Conference on, pp. 1047–1058. IEEE (2015)

  • Kang, U., Papalexakis, E., Harpale, A., Faloutsos, C.: Gigatensor: scaling tensor analysis up by 100 times-algorithms and discoveries. In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 316–324. ACM (2012)

  • Kolda, T.G., Bader, B.W.: Tensor decompositions and applications. SIAM Rev. 51(3), 455–500 (2009)

    Article  MathSciNet  Google Scholar 

  • Lei, C., Yang, Y.H.: Tri-focal tensor-based multiple video synchronization with subframe optimization. IEEE Trans. Image Process. 15(9), 2473–2480 (2006)

    Article  Google Scholar 

  • Li, J., Ma, Y., Yan, C., Vuduc, R.: Optimizing sparse tensor times matrix on multi-core and many-core architectures. In: Proceedings of the Sixth Workshop on Irregular Applications: Architectures and Algorithms, pp. 26–33. IEEE Press (2016)

  • Li, L., Yu, T., Zhao, W., Fu, H., Wang, C., Tan, L., Yang, G., Thomson, J.: Large-scale hierarchical k-means for heterogeneous many-core supercomputers. In: SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 160–170. IEEE (2018a)

  • Li, M., Liu, Y., Yang, H., Luan, Z., Qian, D.: Multi-role sptrsv on sunway many-core architecture. In: 2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), pp. 594–601. IEEE (2018b)

  • Lim, L.H., Comon, P.: Multiarray signal processing: Tensor decomposition meets compressed sensing. arXiv preprint arXiv:1002.4935 (2010)

  • Liu, B., Wen, C., Sarwate, A.D., Dehnavi, M.M.: A unified optimization approach for sparse tensor operations on gpus. In: 2017 IEEE International Conference on Cluster Computing (CLUSTER), pp. 47–57. IEEE (2017)

  • Liu, C., Xie, B., Liu, X., Xue, W., Yang, H., Liu, X.: Towards efficient spmv on sunway manycore architectures. In: Proceedings of the 2018 International Conference on Supercomputing, pp. 363–373. ACM (2018)

  • Liu, C., Yang, H., Sun, R., Luan, Z., Qian, D.: swtvm: Exploring the automated compilation for deep learning on sunway architecture. arXiv preprint arXiv:1904.07404 (2019)

  • Ma, Y., Li, J., Wu, X., Yan, C., Sun, J., Vuduc, R.: Optimizing sparse tensor times matrix on gpus. J. Parallel Distrib. Comput. 129, 99–109 (2019)

    Article  Google Scholar 

  • Nickel, M., Tresp, V., Kriegel, H.P.: Factorizing yago: scalable machine learning for linked data. In: Proceedings of the 21st international conference on World Wide Web, pp. 271–280. ACM (2012)

  • Nisa, I., Li, J., Sukumaran-Rajam, A., Vuduc, R., Sadayappan, P.: Load-balanced sparse mttkrp on gpus. arXiv preprint arXiv:1904.03329 (2019)

  • Oh, S., Park, N., Jang, J.G., Sael, L., Kang, U.: High-performance tucker factorization on heterogeneous platforms. IEEE Trans. Parallel Distrib. Syst. (2019)

  • Park, N., Jeon, B., Lee, J., Kang, U.: Bigtensor: Mining billion-scale tensor made easy. In: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp. 2457–2460. ACM (2016)

  • Phipps, E.T., Kolda, T.G.: Software for sparse tensor decomposition on emerging computing architectures. SIAM J. Sci. Comput. 41(3), C269–C290 (2019)

    Article  MathSciNet  Google Scholar 

  • Shashua, A., Hazan, T.: Non-negative tensor factorization with applications to statistics and computer vision. In: Proceedings of the 22nd international conference on Machine learning, pp. 792–799. ACM (2005)

  • Sidiropoulos, N.D., De Lathauwer, L., Fu, X., Huang, K., Papalexakis, E.E., Faloutsos, C.: Tensor decomposition for signal processing and machine learning. IEEE Trans. Signal Process. 65(13), 3551–3582 (2017)

    Article  MathSciNet  Google Scholar 

  • Smith, S., Choi, J.W., Li, J., Vuduc, R., Park, J., Liu, X., Karypis, G.: Frostt: The formidable repository of open sparse tensors and tools (2017)

  • Smith, S., Park, J., Karypis, G.: Sparse tensor factorization on many-core processors with high-bandwidth memory. In: Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International, pp. 1058–1067. IEEE (2017)

  • Sonka, M., Hlavac, V., Boyle, R.: Image processing, analysis, and machine vision. Cengage Learning (2014)

  • Tew, P.A.: An investigation of sparse tensor formats for tensor libraries. Ph.D. thesis, Massachusetts Institute of Technology (2016)

  • Tsourakakis, C.E.: Data mining with mapreduce: Graph and tensor algorithms with applications. Diss. Master’s thesis, Carnegie Mellon University (2010)

  • Tucker, L.R.: Implications of factor analysis of three-way matrices for measurement of change. Probl. Meas. Change 15, 122–137 (1963)

    Google Scholar 

  • Wang, X., Xue, W., Liu, W., Wu, L.: swsptrsv: a fast sparse triangular solve with sparse level tile layout on sunway architectures. In: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 338–353. ACM (2018)

    Article  Google Scholar 

  • Williams, S., Waterman, A., Patterson, D.: Roofline: An insightful visual performance model for floating-point programs and multicore architectures. Tech. rep., Lawrence Berkeley National Lab.(LBNL), Berkeley (2009)

  • Xu, Z., Lin, J., Matsuoka, S.: Benchmarking sw26010 many-core processor. In: Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017 IEEE International, pp. 743–752. IEEE (2017)

  • Yelp dataset challenge. (2019) https://www.yelp.com/dataset/challenge

  • Zhong, X., Li, M., Yang, H., Liu, Y., Qian, D.: swmr: A framework for accelerating mapreduce applications on sunway taihulight. IEEE Trans. Emerg. Topics Comput. (2018)

Download references

Acknowledgements

The authors would like to thank all anonymous reviewers for their insightful comments and suggestions. This work is supported by National Key R&D Program of China (Grant No. 2016YFB1000503 and 2016YFA0602200), National Natural Science Foundation of China (Grant No. 61502019 and 61732002), State Key Laboratory of Software Development Environment (Grant No. SKLSDE-2018ZX-19), Center for High Performance Computing and System Simulation, Pilot National Laboratory for Marine Science and Technology (Qingdao). Hailong Yang is the corresponding author.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Hailong Yang.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Zhong, X., Yang, H., Luan, Z. et al. swTensor: accelerating tensor decomposition on Sunway architecture. CCF Trans. HPC 1, 161–176 (2019). https://doi.org/10.1007/s42514-019-00017-5

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s42514-019-00017-5

Keywords

Navigation