Graph-Based Analysis of Similarities between Word Frequency Distributions of Various Corpora for Complex Word Identification
Yo Ehara
2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), 2019-12-01
DOI: 10.1109/ICMLA.2019.00317 (https://doi.org/10.1109/ICMLA.2019.00317)
Citations: 2
Abstract
Complex word identification (CWI) is a fundamental task in educational NLP and applied linguistics: it identifies the complex words in a text for applications such as text simplification. Recent studies have independently reported that word-frequency features from some uncommon corpora, when combined with those from a general corpus, improve CWI accuracy; this suggests that such corpora can serve as adjustments to a general corpus. However, although previous studies have analyzed the similarity between each pair of corpora, the significance of a given similarity within the entire set of corpora remains unclear. This makes it difficult to analyze which combinations of general and uncommon corpora improve CWI accuracy, so the search for effective corpora would otherwise have to be exhaustive. To support a better understanding and a non-exhaustive search, this paper proposes a novel graph-based analysis method. We first calculate several similarity measures among the word frequency distributions of the corpora in an unsupervised manner. We then regard each similarity measure as a weighted graph and analyze the importance of each pair of corpora, that is, each edge, within the entire graph structure. Our experiments show that the analysis method can explain why previously reported combinations of corpora were effective and, furthermore, that it can find effective corpus combinations.
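The two-step pipeline described in the abstract (pairwise similarities between word frequency distributions, then analysis of each corpus pair as an edge in a weighted graph) can be illustrated with a minimal Python sketch. The abstract does not name the similarity measures or the edge-importance criterion, so everything concrete here is an assumption: cosine similarity stands in for the similarity measure, membership in a maximum spanning tree stands in for edge importance, and the toy corpora and all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def relative_frequencies(tokens):
    """Map each word to its relative frequency in one corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse frequency vectors (assumed measure)."""
    dot = sum(p[w] * q[w] for w in p.keys() & q.keys())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def similarity_graph(corpora):
    """Build a complete weighted graph: one edge per pair of corpora."""
    freqs = {name: relative_frequencies(toks) for name, toks in corpora.items()}
    return {
        (a, b): cosine_similarity(freqs[a], freqs[b])
        for a, b in combinations(sorted(freqs), 2)
    }


def maximum_spanning_tree(nodes, edges):
    """Kruskal's algorithm on descending weights (assumed importance criterion).

    An edge that survives into the tree is treated as structurally
    important for the whole graph, not just for its own pair.
    """
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent links to the component root, compressing the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree = []
    for (a, b), _weight in sorted(edges.items(), key=lambda kv: kv[1], reverse=True):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            tree.append((a, b))
    return tree


# Hypothetical toy corpora; real corpora would be far larger.
corpora = {
    "general": "the cat sat on the mat".split(),
    "news": "the minister sat on the committee".split(),
    "medical": "the patient received the dose".split(),
}
graph = similarity_graph(corpora)
backbone = maximum_spanning_tree(corpora.keys(), graph)

for pair, weight in sorted(graph.items()):
    marker = "in tree" if pair in backbone else "not in tree"
    print(pair, round(weight, 3), marker)
```

The point of the second step is that an edge's raw weight is read against the rest of the graph: a pair of corpora matters when its edge is part of the structure that connects all corpora, which is the kind of whole-graph view the abstract contrasts with pairwise similarity values alone.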