Abstract
Natural Language Processing (NLP) research has changed greatly with the rapid growth of corpus scale, and NLP based on massive-scale natural annotations has become a new research hotspot. We summarize the state of the art in NLP based on massive-scale naturally annotated resources and propose a new concept, the “Natural Chunk”. In this paper, we analyze the concept and its properties and conduct experiments on natural chunk recognition, which demonstrate the feasibility of recognizing natural chunks from natural annotations. Chinese natural chunk research, as a new direction in language boundary recognition, has a positive influence on Chinese computing and a promising future.
Supported by NSFC (61170162), State Language Commission (YB125-42), National Science-technology Support Plan Projects (2012BAH16F00) and the Fundamental Research Funds for the Central Universities (13YCX192).
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Huang, Ze., Xun, Ed., Rao, Gq., Yu, D. (2013). Chinese Natural Chunk Research Based on Natural Annotations in Massive Scale Corpora. In: Sun, M., Zhang, M., Lin, D., Wang, H. (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. NLP-NABD 2013, CCL 2013. Lecture Notes in Computer Science, vol 8202. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41491-6_2
DOI: https://doi.org/10.1007/978-3-642-41491-6_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41490-9
Online ISBN: 978-3-642-41491-6
eBook Packages: Computer Science, Computer Science (R0)