Do We Need More Training Samples For Text Classification?
Wanwan Zheng, Mingzhe Jin
Artificial Intelligence and Cloud Computing Conference, December 21, 2018
DOI: 10.1145/3299819.3299836
In recent years, with the rise of powerful cloud computing technologies, the use of machine learning to solve complex problems has accelerated greatly. In text classification, machine learning gives computers the ability to learn and to predict labels without being explicitly programmed, and it is commonly said that enough data are needed for a machine to learn. However, more data can also cause machine learning algorithms to overfit, and there is no objective criterion for deciding how many samples are required to achieve a desired level of performance. This article addresses the problem with feature selection. In our experiments, feature selection reduced the required size of the training dataset by up to 66.67%, while the kappa coefficient, used as the performance measure of the classifiers, increased by up to 11 points. Furthermore, feature selection, as a technique for removing irrelevant features, was found to prevent overfitting to a great extent.
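The abstract does not name the specific feature-selection criterion used in the experiments, but a common choice for text classification is the chi-square statistic, which scores how strongly a term's presence is associated with a class. A minimal sketch, assuming a binary class and per-term 2x2 contingency counts (the function name and count layout are illustrative, not taken from the paper):

```python
def chi2_score(n11: int, n10: int, n01: int, n00: int) -> float:
    """Chi-square association between a term and a class.

    n11: docs in the class that contain the term
    n10: docs outside the class that contain the term
    n01: docs in the class without the term
    n00: docs outside the class without the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom

def select_top_k(term_counts: dict, k: int) -> list:
    """Keep the k terms most associated with the class; drop the rest."""
    scores = {t: chi2_score(*c) for t, c in term_counts.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Discarding low-scoring terms shrinks the feature space, which is the mechanism by which feature selection can reduce the amount of training data needed and curb overfitting.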