Sardar Parhat, Gao Ting, Mijit Ablimit, A. Hamdulla
Published in: 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2019-11-01
DOI: 10.1109/APSIPAASC47483.2019.9023280 (https://doi.org/10.1109/APSIPAASC47483.2019.9023280)
A morpheme sequence and convolutional neural network based Kazakh text classification
Word embedding techniques can map language units into a sequential vector space based on context, which offers a natural way to extract and predict out-of-vocabulary (OOV) words from context information. Word-vector-based morphological analysis therefore provides a convenient approach to processing tasks for low-resource languages. In this paper, we discuss a Kazakh text classification experiment based on the m2asr morphological analyzer for small agglutinative languages. We experiment with morpheme segmentation and stem extraction from noisy Kazakh data, using a stem-vector similarity representation. After preparing both word-based and morpheme-based training corpora, we apply convolutional neural networks (CNNs) for both feature selection and text classification. Experimental results show that the morpheme-based approach outperforms the word-based approach.
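
To make the pipeline in the abstract more concrete, the following is a minimal sketch of the forward pass of a CNN text classifier over a token sequence: embedding lookup, one-dimensional convolution with several filter widths, ReLU, max-over-time pooling, and a softmax output layer. It is not the authors' implementation. The abstract does not give the architecture or hyperparameters, so the class name `TinyTextCNN`, the filter widths, the embedding size, the number of classes, and the random (untrained) weights are all assumptions chosen for illustration. The Kazakh tokens are a hand-made example of word-level versus morpheme-level input, not output of the m2asr analyzer.

```python
import math
import random


def conv1d_max_pool(embedded, filt, bias):
    """Slide one filter over the sequence, apply ReLU, then max-pool over time."""
    width = len(filt)
    best = 0.0  # ReLU output is never below zero
    for start in range(len(embedded) - width + 1):
        window = embedded[start:start + width]
        act = bias + sum(
            w * x
            for filt_row, vec in zip(filt, window)
            for w, x in zip(filt_row, vec)
        )
        best = max(best, act)
    return best  # stays 0.0 if the sequence is shorter than the filter


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class TinyTextCNN:
    """Hypothetical, untrained CNN classifier; all sizes are assumptions."""

    def __init__(self, vocab, embed_dim=8, widths=(2, 3),
                 filters_per_width=4, num_classes=3, seed=0):
        rng = random.Random(seed)

        def rand():
            return rng.uniform(-0.5, 0.5)

        self.index = {tok: i for i, tok in enumerate(vocab)}
        self.embed = [[rand() for _ in range(embed_dim)] for _ in vocab]
        self.unk = [0.0] * embed_dim  # zero vector for unseen (OOV) tokens
        # Each filter is a (weights, bias) pair; weights have shape width x embed_dim.
        self.filters = [
            ([[rand() for _ in range(embed_dim)] for _ in range(w)], rand())
            for w in widths
            for _ in range(filters_per_width)
        ]
        n_feat = len(self.filters)
        self.out_w = [[rand() for _ in range(n_feat)] for _ in range(num_classes)]
        self.out_b = [0.0] * num_classes

    def features(self, tokens):
        embedded = [
            self.embed[self.index[t]] if t in self.index else self.unk
            for t in tokens
        ]
        return [conv1d_max_pool(embedded, f, b) for f, b in self.filters]

    def predict_proba(self, tokens):
        feats = self.features(tokens)
        scores = [
            b + sum(w * x for w, x in zip(row, feats))
            for row, b in zip(self.out_w, self.out_b)
        ]
        return softmax(scores)


# Illustrative input only: the same two words as whole words and as morphemes.
word_tokens = ["кітаптар", "мектептерде"]
morph_tokens = ["кітап", "тар", "мектеп", "тер", "де"]

model = TinyTextCNN(vocab=sorted(set(morph_tokens)))
print(model.predict_proba(morph_tokens))  # class probabilities from random weights
```

With morpheme-level input, the stems and suffixes each receive their own embedding and can be shared across many surface word forms, which is one plausible reading of why a morpheme-based corpus helps for an agglutinative language. The sketch only shows inference; training the embeddings, filters, and output layer would require a gradient-based optimizer, which is omitted here.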