Semi-supervised learning in large scale text categorization

Xu, Zewen; Li, Jianqiang; Liu, Bo; Bi, Jing; Li, Rong; Mao, Rui

doi:10.1007/s12204-017-1835-3

Published: 30 May 2017

Volume 22, pages 291–302, (2017)

Journal of Shanghai Jiaotong University (Science)

Zewen Xu (许泽文)^1,2,
Jianqiang Li (李建强)^1,2,3,4,
Bo Liu (刘博)^1,
Jing Bi (毕敏)^1,
Rong Li (李蓉)^1 &
Rui Mao (毛睿)^3,4

Abstract

The rapid development of the Internet has produced a wide variety of raw information, including text and audio. Because of its sheer volume, it is difficult to find the most useful knowledge quickly and accurately. Automatic text classification based on machine learning can assign large numbers of natural language documents to the appropriate subject categories according to their semantics, which makes the text easier to grasp directly. A traditional supervised classifier for text categorization (TC) is obtained by learning from a set of hand-labeled documents. However, labeling all data by hand is labor intensive and time consuming. To address this problem, some scholars have proposed semi-supervised learning methods for training the classifier, but these are infeasible for the variety and volume of Web data because they still require some hand-labeled data. In 2012, Li et al. proposed a fully automatic categorization approach for text (FACT) based on supervised learning, in which no manual labeling effort is required. However, automatically labeling all data can introduce noise into the experiment, so the results may fail to meet the accuracy requirement. We propose a new idea: a subset of the data is automatically labeled with high accuracy based on the semantics of the category name, a classifier is then trained in a semi-supervised way on both labeled and unlabeled data, and ultimately a precise classification of massive text data is achieved. Empirical experiments show that the method outperforms the supervised support vector machine (SVM) in terms of both F1 performance and classification accuracy in most cases. This demonstrates the effectiveness of the semi-supervised algorithm in automatic TC.
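The two-stage idea in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes each category name has been expanded into a small hand-written term list, and it swaps in a simple nearest-centroid self-training loop in place of the semi-supervised SVM. All function names, thresholds, and the toy documents are hypothetical.

```python
from collections import Counter
import math

def tokenize(text):
    # Lowercase whitespace tokenizer; keeps alphabetic tokens only.
    return [w for w in text.lower().split() if w.isalpha()]

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def seed_labels(docs, category_terms, min_hits=2):
    # Stage 1: auto-label only documents that match exactly one
    # category's terms at least min_hits times (assumed confidence rule).
    labels = {}
    for i, doc in enumerate(docs):
        counts = Counter(tokenize(doc))
        hits = {c: sum(counts[t] for t in terms)
                for c, terms in category_terms.items()}
        ranked = sorted(hits.items(), key=lambda kv: kv[1], reverse=True)
        best, runner_up = ranked[0], ranked[1]
        if best[1] >= min_hits and runner_up[1] == 0:
            labels[i] = best[0]
    return labels

def centroids(docs, labels):
    # Sum the term counts of all documents currently labeled per category.
    cents = {}
    for i, c in labels.items():
        cents.setdefault(c, Counter()).update(tokenize(docs[i]))
    return cents

def self_train(docs, labels, threshold=0.25, max_rounds=10):
    # Stage 2: repeatedly label unlabeled documents whose similarity to
    # the closest category centroid reaches the threshold.
    labels = dict(labels)
    for _ in range(max_rounds):
        cents = centroids(docs, labels)
        added = False
        for i, doc in enumerate(docs):
            if i in labels:
                continue
            vec = Counter(tokenize(doc))
            scores = {c: cosine(vec, cen) for c, cen in cents.items()}
            best = max(scores, key=scores.get)
            if scores[best] >= threshold:
                labels[i] = best
                added = True
        if not added:
            break
    return labels

# Toy corpus: two documents match a category term list, two do not.
docs = [
    "football match football team wins the league",
    "stock market stock prices fall",
    "the team wins another match",
    "prices fall as investors sell",
]
category_terms = {"sports": ["football", "sport"],
                  "finance": ["stock", "market"]}

seeds = seed_labels(docs, category_terms)
final = self_train(docs, seeds)
```

Here only the first two documents receive seed labels; the remaining two are labeled in the self-training stage because they share vocabulary with the seeded documents. Raising `min_hits` or `threshold` trades coverage for fewer noisy labels, which mirrors the noise concern the abstract raises about labeling all data automatically.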

References

LI J Q, ZHAO Y, LIU B. Exploiting semantic resources for large scale text categorization [J]. Journal of Intelligent Information Systems, 2012, 39(3): 763–788.
MIYATO T, DAI A M, GOODFELLOW I. Virtual adversarial training for semi-supervised text classification [EB/OL]. (2016-07-22). https://arxiv.org/abs/1605.07725v1.
YIN C Y, XIANG J, ZHANG H, et al. A new SVM method for short text classification based on semi-supervised learning [C]//2015 4th International Conference on Advanced Information Technology and Sensor Application. Dubai, UAE: IEEE, 2015: 100–103.
JOHNSON R, ZHANG T. Semi-supervised convolutional neural networks for text categorization via region embedding [J]. Advances in Neural Information Processing Systems, 2015, 28: 919–927.
JOHNSON R, ZHANG T. Supervised and semi-supervised text categorization using LSTM for region embeddings [C]//Proceedings of the 33rd International Conference on Machine Learning. New York, USA: JMLR W&CP, 2016: 1–9.
SEBASTIANI F. Machine learning in automated text categorization [J]. ACM Computing Surveys, 2002, 34(1): 1–47.
JOACHIMS T. Transductive inference for text classification using support vector machines [C]//Proceedings of the 16th International Conference on Machine Learning. Bled, Slovenia: [s.n.], 1999: 200–209.
SIOLAS G, D’ALCHÉ-BUC F. Support vector machines based on a semantic kernel for text categorization [C]//Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. Washington, USA: IEEE, 2000: 205–209.
BASILI R, CAMMISA M, MOSCHITTI A. Effective use of WordNet semantics via kernel-based learning [C]//Proceedings of the 9th Conference on Computational Natural Language Learning. Ann Arbor, USA: Association for Computational Linguistics, 2005: 1–8.
GABRILOVICH E, MARKOVITCH S. Feature generation for text categorization using world knowledge [C]//International Joint Conference on Artificial Intelligence. [s.l.]: Morgan Kaufmann Publishers Inc, 2005: 1048–1053.
WANG P, DOMENICONI C. Building semantic kernels for text classification using Wikipedia [C]//ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Las Vegas, USA: ACM, 2008: 713–721.
CHAPELLE O, SCHÖLKOPF B, ZIEN A. Semi-supervised learning [M]. London, England: MIT Press, 2006.
SINDHWANI V, KEERTHI S S. Large scale semi-supervised linear SVMs [C]//International ACM SIGIR Conference on Research and Development in Information Retrieval. Washington, USA: ACM, 2006: 477–484.
SINDHWANI V, KEERTHI S S. Newton methods for fast solution of semi-supervised linear SVMs [EB/OL]. (2016-07-22). http://citeseerx.ist.psu.edu/viewdoc/download.
LI C H, YANG J C, PARK S C. Text categorization algorithms using semantic approaches, corpus-based thesaurus and WordNet [J]. Expert Systems with Applications, 2012, 39: 765–772.
FOX-ROBERTS P, ROSTEN E. Unbiased generative semi-supervised learning [J]. Journal of Machine Learning Research, 2014, 15: 367–443.
SHANG F H, JIAO L C, LIU Y Y, et al. Semi-supervised learning with nuclear norm regularization [J]. Pattern Recognition, 2013, 46(8): 2323–2336.
WANG J, JEBARA T, CHANG S F. Semi-supervised learning using greedy max-cut [J]. Journal of Machine Learning Research, 2013, 14: 729–758.
CHENG S, SHI Y H, QIN Q D. Particle swarm optimization based semi-supervised learning on Chinese text categorization [C]//Proceedings of the 2012 IEEE Congress on Evolutionary Computation. Brisbane, Australia: IEEE, 2012: 1–8.
LENG Y, XU X Y, QI G H. Combining active learning and semi-supervised learning to construct SVM classifier [J]. Knowledge-Based Systems, 2013, 44(1): 121–131.
LI J Q, LIU C C, LIU B, et al. Diversity-aware retrieval of medical records [J]. Computers in Industry, 2015, 69(1): 81–91.
YANG J M, LIU Y N, ZHU X D, et al. A new feature selection based on comprehensive measurement both in inter-category and intra-category for text categorization [J]. Information Processing and Management, 2012, 48(4): 741–754.
BREVE F, ZHAO L, QUILES M, et al. Particle competition and cooperation in networks for semi-supervised learning [J]. IEEE Transactions on Knowledge and Data Engineering, 2011, 24(9): 1686–1698.
LI J Q, WANG F. Semi-supervised learning via mean field methods [J]. Neurocomputing, 2016, 177: 385–393.

Author information

Authors and Affiliations

School of Software Engineering, Beijing University of Technology, Beijing, 100124, China
Zewen Xu (许泽文), Jianqiang Li (李建强), Bo Liu (刘博), Jing Bi (毕敏) & Rong Li (李蓉)
Beijing Engineering Research Center for IoT Software and Systems, Beijing University of Technology, Beijing, 100124, China
Zewen Xu (许泽文) & Jianqiang Li (李建强)
Guangdong Key Laboratory of Popular High Performance Computers, Shenzhen University, Shenzhen 518060, Guangdong, China
Jianqiang Li (李建强) & Rui Mao (毛睿)
Shenzhen Key Laboratory of Service Computing and Applications, Shenzhen University, Shenzhen 518060, Guangdong, China
Jianqiang Li (李建强) & Rui Mao (毛睿)

Corresponding author

Correspondence to Jianqiang Li (李建强).

Additional information

Foundation item: the National Key Technology Research and Development Program of China (No. 2015BAH13F01), and the Beijing Natural Science Foundation (No. 4152007)

About this article

Cite this article

Xu, Z., Li, J., Liu, B. et al. Semi-supervised learning in large scale text categorization. J. Shanghai Jiaotong Univ. (Sci.) 22, 291–302 (2017). https://doi.org/10.1007/s12204-017-1835-3

Received: 25 July 2016
Published: 30 May 2017
Issue Date: June 2017
DOI: https://doi.org/10.1007/s12204-017-1835-3
