How to Segment Turkish Words for Neural Text Classification?
Abdullah Al Nahas, Aysenur Kulunk, Burak Gözütok, S. Kalkan, Hakki Yagiz Erdinc
2020 International Conference on INnovations in Intelligent SysTems and Applications (INISTA), 2020-08-01
DOI: https://doi.org/10.1109/INISTA49547.2020.9194661
Abstract
Neural text classifiers for agglutinative languages often suffer from large training vocabularies and high out-of-vocabulary rates at test time. The natural language processing community has developed and used numerous word segmentation procedures to alleviate these problems. However, their effect on the performance of neural classifiers for Turkish documents requires further investigation. In this empirical study, we carry out an extensive series of experiments to investigate how the choice of word segmentation procedure affects the performance of three different neural text classifiers on Turkish documents across multiple domains. Our experiments show that the choice of word segmentation procedure is another hyperparameter that needs tuning. This choice may depend on the domain and the neural architecture.
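To make the out-of-vocabulary problem concrete, the following minimal sketch compares the vocabulary size and out-of-vocabulary rate of whole-word units against a simple sub-word segmentation. It is an illustration only: the abstract does not name the segmentation procedures that were evaluated, so the character trigram segmenter (`char_ngrams`), the helper `vocab_and_oov`, and the toy Turkish sentences are all assumptions introduced here, not the authors' setup.

```python
def char_ngrams(word, n=3):
    """Assumed stand-in segmenter: overlapping character n-grams with boundary marks."""
    padded = f"<{word}>"
    if len(padded) <= n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def whole_word(word):
    """Baseline: no segmentation, each word is a single unit."""
    return [word]


def segment(text, segmenter):
    """Split on whitespace, then segment each word into units."""
    units = []
    for word in text.split():
        units.extend(segmenter(word))
    return units


def vocab_and_oov(train_texts, test_texts, segmenter):
    """Return (training vocabulary size, fraction of test units unseen in training)."""
    vocab = {u for t in train_texts for u in segment(t, segmenter)}
    test_units = [u for t in test_texts for u in segment(t, segmenter)]
    unseen = sum(1 for u in test_units if u not in vocab)
    return len(vocab), unseen / len(test_units)


# Hypothetical toy data: inflected forms of "ev" (house) and "kal-" (to stay).
train = ["evler güzel", "evde kaldım"]
test = ["evlerde kaldık"]

word_vocab, word_oov = vocab_and_oov(train, test, whole_word)
ngram_vocab, ngram_oov = vocab_and_oov(train, test, char_ngrams)

print(f"whole words : vocab={word_vocab}, OOV rate={word_oov:.2f}")
print(f"char 3-grams: vocab={ngram_vocab}, OOV rate={ngram_oov:.2f}")
```

In this toy case, both test words are unseen inflections, so every whole-word unit is out of vocabulary, while the trigram segmentation recovers the shared stems and part of the suffixes. The data is far too small to say anything about classification accuracy; the sketch only shows how the two quantities mentioned in the abstract can be measured for a given segmenter.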