{"title":"土耳其语文本分类的预训练神经模型","authors":"Halil Ibrahim Okur, A. Sertbas","doi":"10.1109/UBMK52708.2021.9558878","DOIUrl":null,"url":null,"abstract":"In the text classification process, which is a sub-task of NLP, the preprocessing and indexing of the text has a direct determining effect on the performance for NLP models. When the studies on pre-trained models are examined, it is seen that the changes made on the models developed for world languages or training the same model with a Turkish text dataset. Word-embedding is considered to be the most critical point of the text processing problem. The two most popular word embedding methods today are Word2Vec and Glove, which embed words into a corpus using multidimensional vectors. BERT, Electra and Fastext models, which have a contextual word representation method and a deep neural network architecture, have been frequently used in the creation of pre-trained models recently. In this study, the use and performance results of pre-trained models on TTC-3600 and TRT-Haber text sets prepared for Turkish text classification NLP task are shown. By using pre-trained models obtained with large corpus, a certain time and hardware cost, the text classification process is performed with less effort and high performance.","PeriodicalId":106516,"journal":{"name":"2021 6th International Conference on Computer Science and Engineering (UBMK)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Pretrained Neural Models for Turkish Text Classification\",\"authors\":\"Halil Ibrahim Okur, A. Sertbas\",\"doi\":\"10.1109/UBMK52708.2021.9558878\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In the text classification process, which is a sub-task of NLP, the preprocessing and indexing of the text has a direct determining effect on the performance for NLP models. When the studies on pre-trained models are examined, it is seen that the changes made on the models developed for world languages or training the same model with a Turkish text dataset. Word-embedding is considered to be the most critical point of the text processing problem. The two most popular word embedding methods today are Word2Vec and Glove, which embed words into a corpus using multidimensional vectors. BERT, Electra and Fastext models, which have a contextual word representation method and a deep neural network architecture, have been frequently used in the creation of pre-trained models recently. In this study, the use and performance results of pre-trained models on TTC-3600 and TRT-Haber text sets prepared for Turkish text classification NLP task are shown. 
By using pre-trained models obtained with large corpus, a certain time and hardware cost, the text classification process is performed with less effort and high performance.\",\"PeriodicalId\":106516,\"journal\":{\"name\":\"2021 6th International Conference on Computer Science and Engineering (UBMK)\",\"volume\":\"16 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-09-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 6th International Conference on Computer Science and Engineering (UBMK)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/UBMK52708.2021.9558878\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 6th International Conference on Computer Science and Engineering (UBMK)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/UBMK52708.2021.9558878","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Pretrained Neural Models for Turkish Text Classification
In text classification, a sub-task of NLP, the preprocessing and indexing of the text directly determine the performance of NLP models. A review of studies on pre-trained models shows that they typically either adapt models developed for major world languages or train the same architectures on a Turkish text dataset. Word embedding is considered the most critical part of the text-processing problem. The two most popular word embedding methods today are Word2Vec and GloVe, which represent the words of a corpus as multidimensional vectors. BERT, ELECTRA, and FastText, which offer contextual word representation methods and deep neural network architectures, have recently been used frequently to build pre-trained models. This study presents the use and performance results of pre-trained models on the TTC-3600 and TRT-Haber text sets prepared for the Turkish text classification task. By using pre-trained models, which are built from large corpora at a considerable cost in time and hardware, text classification can be performed with less effort and high performance.
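To make the pipeline the abstract describes concrete, the following is a minimal sketch (not the authors' exact setup) of Turkish text classification with a pre-trained contextual model, using the Hugging Face transformers library. The checkpoint name (dbmdz/bert-base-turkish-cased, i.e. BERTurk) and the TTC-3600-style category labels are illustrative assumptions; in practice the classification head would first be fine-tuned on a labeled set such as TTC-3600 or TRT-Haber.

# Minimal sketch: Turkish text classification with a pre-trained BERT.
# Assumes the Hugging Face `transformers` library; the checkpoint and
# labels below are illustrative assumptions, not the paper's setup.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "dbmdz/bert-base-turkish-cased"  # assumed Turkish BERT (BERTurk)
LABELS = ["ekonomi", "kultur-sanat", "saglik", "siyaset", "spor", "teknoloji"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# num_labels adds a randomly initialized classification head; it must be
# fine-tuned on labeled Turkish news (e.g. TTC-3600) before real use.
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=len(LABELS)
)
model.eval()

text = "Borsa haftaya yukselisle basladi."  # example news snippet
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(LABELS[logits.argmax(dim=-1).item()])

Because the encoder weights come from a large pre-trained corpus, only the small classification head needs task-specific training, which is the saving in time and hardware cost that the abstract refers to.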