Abstract
Processing and understanding Chinese text is difficult for machine learning methods because the basic unit of Chinese text is the phrase rather than the character, and Chinese text contains no natural delimiter that separates phrases. Processing large volumes of Chinese Web text is harder still, because such texts are often short, irregular, sparse, loosely focused on a topic, and lacking in context. This poses a challenge for mining, clustering, and classifying Chinese Web texts, and the accuracy with which the real meaning of such texts is recognized is typically low. In this paper, we propose a method that recognizes stable, abstract semantic topics that express the highly hierarchical relationships underlying Chinese texts from BaiduBaike. Based on these semantic topics, we establish a discrete distribution model that converts the analysis into a convex optimization problem by geometric programming. Our experiments demonstrate that the proposed approach outperforms many conventional machine learning methods, such as KNN, SVM, WIKI, CRFs, and LDA, in recognizing short Chinese Web texts and in settings with small training sets.
Acknowledgments
This study was supported by grants from the National Science Foundation of China (No. 61175121); the National Science Foundation of Fujian Province (No. 2013J06014); the Promotion Program for Young and Middle-aged Teacher in Science and Technology Research of Huaqiao University (No. ZQNYX108); and the Fundamental Research Funds for the Central Universities (No. JB-ZR1217).
Cite this article
Chen, Yw., Zhou, Q., Luo, W. et al. Classification of Chinese Texts Based on Recognition of Semantic Topics. Cogn Comput 8, 114–124 (2016). https://doi.org/10.1007/s12559-015-9346-8
DOI: https://doi.org/10.1007/s12559-015-9346-8