Abstract
The K-Means algorithm has effectively promoted the development of intelligent systems and data-driven decision-making through data clustering and analysis. A reasonable convergence criterion directly determines when model training can be terminated, which heavily affects model quality. Much existing work targets training acceleration and quality improvement, but little attention has been paid to the convergence judgment itself. Current convergence criteria still adopt a centralized judgment strategy based on a single loss value, and the same criterion is simply copied between different optimized K-Means variants, such as the fast Mini-Batch version and the traditional Full-Batch version. Our analysis reveals that such a design cannot guarantee that different variants converge to the same point; it can lead to abnormal situations such as false-positive convergence and over-training. To enable a fair comparison and guarantee model accuracy, we propose a new dynamic convergence criterion, VF (Vote for Freezing), and an optimized version, VF+. VF adopts a distributed judgment strategy in which each sample decides, based on the criterion, whether to keep participating in training or to freeze itself. Combined with sample priorities, VF adaptively adjusts the sample freezing threshold, which achieves asymptotic withdrawal of samples and accelerates model convergence. VF+ further introduces parameter freezing thresholds and freezing periods to eliminate redundant distance calculations, improving training efficiency. Experiments on multiple datasets validate the effectiveness of our convergence criterion in terms of training quality and efficiency.
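To make the per-sample freezing idea concrete, the following is a minimal sketch of how a distributed "vote for freezing" could be layered on Mini-Batch K-Means. It is not the authors' implementation: the function name minibatch_kmeans_with_freezing, the fixed tolerance freeze_tol, and the rule that a sample freezes once its distance to its assigned center stops changing are all illustrative assumptions; VF additionally adapts the threshold using sample priorities, and VF+ adds parameter freezing thresholds and freezing periods, both of which are omitted here.

import numpy as np

def minibatch_kmeans_with_freezing(X, k, batch_size=256, n_iters=100,
                                   freeze_tol=1e-3, seed=0):
    # Hypothetical sketch: per-sample freezing vote on top of Mini-Batch K-Means.
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = X[rng.choice(n, k, replace=False)].copy()
    counts = np.zeros(k)
    prev_dist = np.full(n, np.inf)     # each sample's last distance to its center
    frozen = np.zeros(n, dtype=bool)   # samples that have voted to freeze

    for _ in range(n_iters):
        active = np.flatnonzero(~frozen)
        if active.size == 0:           # every sample has withdrawn -> stop training
            break
        batch = rng.choice(active, min(batch_size, active.size), replace=False)
        # Distances from the batch samples to all k centers.
        dists = np.linalg.norm(X[batch][:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        best = dists[np.arange(batch.size), assign]

        # Freezing vote: a sample withdraws once its loss contribution stops changing.
        # (VF would adapt freeze_tol via sample priorities; it is fixed here for brevity.)
        frozen[batch] = np.abs(prev_dist[batch] - best) < freeze_tol
        prev_dist[batch] = best

        # Standard Mini-Batch K-Means center update with per-center learning rates.
        for i, c in zip(batch, assign):
            counts[c] += 1
            centers[c] += (X[i] - centers[c]) / counts[c]

    return centers, frozen

Under these assumptions, a call such as minibatch_kmeans_with_freezing(np.random.rand(10000, 2), k=5) would see the fraction of frozen samples grow over iterations, with the loop terminating once all samples have voted to freeze rather than when a single global loss value stabilizes.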
Acknowledgement
This work was supported by the Key R&D Program of Shandong Province, China (No. 2023CXPT020), and the National Natural Science Foundation of China (No. U23A20320 and No. U22A2068).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Yu, H. et al. (2024). A Dynamic Convergence Criterion for Fast K-means Computations. In: Jin, C., Yang, S., Shang, X., Wang, H., Zhang, Y. (eds) Web Information Systems and Applications. WISA 2024. Lecture Notes in Computer Science, vol 14883. Springer, Singapore. https://doi.org/10.1007/978-981-97-7707-5_17
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-7706-8
Online ISBN: 978-981-97-7707-5