FEATURE SELECTION AND CLASSIFICATION INTEGRATED METHOD FOR IDENTIFYING CITED TEXT SPANS FOR CITANCES ON IMBALANCED DATA

IF 1.2 | Region 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Malaysian Journal of Computer Science Pub Date : 2021-10-31 DOI:10.22452/mjcs.vol34no4.3

Jen-Yuan Yeh, Cheng-Jung Tsai, Tien-Yu Hsu, J. Lin, Pei-Cheng Cheng


Citations: 1

Abstract

Recent studies in scientific paper summarization have explored a new form of structured summary for a reference paper, in which all cited and citing sentences are grouped by facet. This involves three main tasks: (1) identifying cited text spans for citances (i.e., citing sentences), (2) classifying their discourse facets, and (3) generating a structured summary from the cited text spans and citances. This paper focuses on the first task and treats it as binary classification: distinguishing relevant pairs of citances and reference sentences from irrelevant pairs. We propose a new method that integrates feature selection and classification techniques to enhance classification performance. The proposed method investigates combinations of six feature selection methods (χ²-Statistics, Information Gain, Gain Ratio, Relief-F, Significance Attribute Evaluation, and Symmetrical Uncertainty) and five classification algorithms (k-Nearest Neighbors, Decision Tree, Support Vector Machine, Naïve Bayes, and Random Forest). Additionally, to address imbalanced data during training, we apply SMOTE (Synthetic Minority Over-sampling Technique), which generates synthetic examples of the minority class. Experiments are conducted using the CL-SciSumm corpora to assess the effect of feature selection on classification. The results show that feature selection significantly improves the F1 score and that our method is competitive with the state-of-the-art methods in the CL-SciSumm evaluations.
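To make the oversampling step concrete, the following is a minimal sketch of SMOTE-style interpolation in plain Python. It is not the authors' implementation: the function name `smote`, its parameters, and the two-dimensional feature vectors are hypothetical and chosen only for illustration. Each synthetic point is placed at a random position on the line segment between a minority-class sample and one of its k nearest minority-class neighbours.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        neighbours = sorted(
            (p for j, p in enumerate(minority) if j != i),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical feature vectors for the minority class (relevant pairs).
relevant = [(0.9, 0.8), (0.8, 0.7), (1.0, 0.9)]
irrelevant_count = 10  # assumed size of the majority class

new_points = smote(relevant, n_new=irrelevant_count - len(relevant), k=2)
balanced_minority = relevant + new_points
```

In this sketch the minority class is grown until it matches the assumed majority-class size. As the abstract states, oversampling is applied during training; synthetic points would not be added to the data used for evaluation.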

Source journal

Malaysian Journal of Computer Science (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; COMPUTER SCIENCE, THEORY & METHODS)

CiteScore

2.20

Self-citation rate

33.30%

Review time

7.5 months

About the journal: The Malaysian Journal of Computer Science (ISSN 0127-9084) has been published since 1985 by the Faculty of Computer Science and Information Technology, University of Malaya, and appears four times a year, in January, April, July and October. Over the years, the journal has gained popularity and the number of paper submissions has increased steadily. Rigorous reviews by the referees have helped maintain the high standard of the journal. Its objectives are to promote the exchange of information and knowledge in research work and new inventions and developments in Computer Science, and on the use of Information Technology towards the structuring of an information-rich society, and to assist academic staff from local and foreign universities, business and industrial sectors, government departments and academic institutions in publishing research results and studies in Computer Science and Information Technology through a scholarly publication. The journal is indexed and abstracted by Clarivate Analytics' Web of Science and Elsevier's Scopus.