{"title":"Exploration of a Balanced Reference Corpus with a Wide Variety of Text Mining Tools","authors":"Nicolas Turenne, Bokai Xu, Xinyue Li, Xindi Xu, Hongyu Liu, Xiaolin Zhu","doi":"10.1145/3446132.3446192","DOIUrl":null,"url":null,"abstract":"To compare various techniques, the same platform is generally used into which the user will import a text dataset. Another approach uses an evaluation based on a gold standard for a specific task, but a balanced common language corpus is not often used. We choose the Corpus of Contemporary American English Corpus (COCA) as a balanced reference corpus, and split this corpus into categories, such as topics and genres, to apply families of feature extraction and machine learning algorithms. We found that the Stanford CoreNLP method was faster and more accurate than the NLTK method, and was more reliable and easier to understand. The results of clustering show that a higher modularity influences interpretation. For genre and topic classification, all techniques achieved a relatively high score, though these were below the state-of-the-art scores from challenge text datasets. Naïve Bayes outperformed the other alternatives. We hope that balanced corpora from a variety of different vernacular (or low-resource) languages can be used as references to determine the efficiency of the wide diversity of state-of-the-art text mining tools.","PeriodicalId":125388,"journal":{"name":"Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence","volume":"24 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3446132.3446192","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
To compare various text mining techniques, a common platform into which the user imports a text dataset is generally used. Another approach evaluates tools against a gold standard for a specific task, but a balanced general-language corpus is rarely used. We choose the Corpus of Contemporary American English (COCA) as a balanced reference corpus and split it into categories, such as topics and genres, in order to apply families of feature extraction and machine learning algorithms. We found that Stanford CoreNLP was faster and more accurate than NLTK, as well as more reliable and easier to interpret. The clustering results show that higher modularity influences interpretation. For genre and topic classification, all techniques achieved relatively high scores, although these remained below state-of-the-art scores reported on challenge text datasets. Naïve Bayes outperformed the other alternatives. We hope that balanced corpora from a variety of vernacular (or low-resource) languages can serve as references for assessing the efficiency of the wide range of state-of-the-art text mining tools.
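
The abstract reports that Naïve Bayes performed best among the classifiers tried for genre and topic classification on COCA categories. As a rough illustration only, and not the paper's actual pipeline, the sketch below uses scikit-learn's CountVectorizer and MultinomialNB on a few placeholder sentences; the documents, labels, and feature choices are assumptions made for this example.

```python
# A minimal sketch, assuming scikit-learn, of bag-of-words genre classification
# with a multinomial Naive Bayes classifier. The toy documents and labels below
# are hypothetical stand-ins for COCA excerpts, not the authors' actual data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical example texts with genre labels (news / fiction / academic).
docs = [
    "The senator announced a new budget proposal on Tuesday.",
    "She walked slowly through the orchard, remembering the summer.",
    "The results indicate a statistically significant effect of treatment.",
    "Markets fell sharply after the central bank's announcement.",
    "He laughed, and for a moment the old house felt alive again.",
    "We evaluate the model on a held-out set of annotated sentences.",
]
genres = ["news", "fiction", "academic", "news", "fiction", "academic"]

# Word-count features followed by multinomial Naive Bayes: a common baseline
# for genre and topic classification of text.
model = make_pipeline(CountVectorizer(lowercase=True), MultinomialNB())
model.fit(docs, genres)

print(model.predict(["The committee will vote on the bill next week."]))
```

In a setting closer to the paper's, the placeholder sentences would be replaced by COCA documents grouped by genre or topic, and the scores would be compared across the different feature extraction and learning methods under evaluation.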