{"title":"基于标记的文本分类数据增强","authors":"Patawee Prakrankamanant, E. Chuangsuwanich","doi":"10.1109/jcsse54890.2022.9836268","DOIUrl":null,"url":null,"abstract":"Tokenization is one of the most important data preprocessing steps in the text classification task and also one of the main contributing factors in the model performance. However, getting good tokenizations is non-trivial when the input is noisy, and is especially problematic for languages without an explicit word delimiter such as Thai. Therefore, we propose an alternative data augmentation method to improve the robustness of poor tokenization by using multiple tokenizations. We evaluate the performance of our algorithms on different Thai text classification datasets. The results suggest our augmentation scheme makes the model more robust to tokenization errors and can be combined well with other data augmentation schemes.","PeriodicalId":284735,"journal":{"name":"2022 19th International Joint Conference on Computer Science and Software Engineering (JCSSE)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Tokenization-based data augmentation for text classification\",\"authors\":\"Patawee Prakrankamanant, E. Chuangsuwanich\",\"doi\":\"10.1109/jcsse54890.2022.9836268\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Tokenization is one of the most important data preprocessing steps in the text classification task and also one of the main contributing factors in the model performance. However, getting good tokenizations is non-trivial when the input is noisy, and is especially problematic for languages without an explicit word delimiter such as Thai. Therefore, we propose an alternative data augmentation method to improve the robustness of poor tokenization by using multiple tokenizations. We evaluate the performance of our algorithms on different Thai text classification datasets. The results suggest our augmentation scheme makes the model more robust to tokenization errors and can be combined well with other data augmentation schemes.\",\"PeriodicalId\":284735,\"journal\":{\"name\":\"2022 19th International Joint Conference on Computer Science and Software Engineering (JCSSE)\",\"volume\":\"39 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-06-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 19th International Joint Conference on Computer Science and Software Engineering (JCSSE)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/jcsse54890.2022.9836268\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 19th International Joint Conference on Computer Science and Software Engineering (JCSSE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/jcsse54890.2022.9836268","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Tokenization-based data augmentation for text classification
Tokenization is one of the most important data preprocessing steps in text classification and one of the main contributing factors to model performance. However, obtaining good tokenizations is non-trivial when the input is noisy, and is especially problematic for languages without an explicit word delimiter, such as Thai. Therefore, we propose an alternative data augmentation method that improves robustness to poor tokenization by using multiple tokenizations of the same text. We evaluate the performance of our algorithm on different Thai text classification datasets. The results suggest that our augmentation scheme makes the model more robust to tokenization errors and combines well with other data augmentation schemes.
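The abstract does not spell out the augmentation procedure, but the general idea of pairing each training example with several alternative tokenizations can be sketched as follows. This is a minimal illustration, not the authors' exact method; it assumes PyThaiNLP is available, and the toy label in the usage example is hypothetical.

```python
# Minimal sketch of multi-tokenization augmentation (assumed implementation,
# not the paper's exact algorithm). Each labeled Thai text is segmented with
# several PyThaiNLP engines, and every resulting token sequence is kept as a
# separate training example.
from pythainlp.tokenize import word_tokenize

# Different segmentation engines produce different token boundaries.
ENGINES = ["newmm", "longest"]


def augment_with_tokenizations(dataset):
    """dataset: list of (text, label) pairs.

    Returns a list of (tokens, label) pairs, one per (example, engine)
    combination, so the classifier sees multiple segmentations of each text.
    """
    augmented = []
    for text, label in dataset:
        for engine in ENGINES:
            tokens = word_tokenize(text, engine=engine)
            augmented.append((tokens, label))
    return augmented


if __name__ == "__main__":
    # "ตากลม" is a classic Thai ambiguity: "ตา|กลม" (round eyes) vs
    # "ตาก|ลม" (exposed to the wind). The label here is purely illustrative.
    toy = [("ตากลมมาก", "positive")]
    for tokens, label in augment_with_tokenizations(toy):
        print(tokens, label)
```

Training on all of these variants exposes the classifier to several plausible segmentations of the same sentence, which is the intuition behind making it more tolerant of tokenization errors at inference time.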