Jen-Yuan Yeh, Cheng-Jung Tsai, Tien-Yu Hsu, J. Lin, Pei-Cheng Cheng
Feature Selection and Classification Integrated Method for Identifying Cited Text Spans for Citances on Imbalanced Data
Malaysian Journal of Computer Science, published 2021-10-31. DOI: https://doi.org/10.22452/mjcs.vol34no4.3
Citation count: 1
Abstract
Recent studies in scientific paper summarization have explored a new form of structured summary for a reference paper, built by grouping all cited and citing sentences together by facet. This involves three main tasks: (1) identifying cited text spans for citances (i.e., citing sentences), (2) classifying their discourse facets, and (3) generating a structured summary from the cited text spans and citances. This paper focuses on the first task and approaches it as binary classification, distinguishing relevant pairs of citances and reference sentences from irrelevant pairs. We propose a new method that integrates feature selection and classification techniques to enhance classification performance. The proposed method investigates combinations of six feature selection methods (χ² Statistics, Information Gain, Gain Ratio, Relief-F, Significance Attribute Evaluation, and Symmetrical Uncertainty) and five classification algorithms (k-Nearest Neighbors, Decision Tree, Support Vector Machine, Naïve Bayes, and Random Forest). Additionally, to address imbalanced data during training, we apply SMOTE (Synthetic Minority Over-sampling Technique) to introduce a synthetic bias towards the minority class. Experiments are conducted on the CL-SciSumm corpora to compare the effect of feature selection on classification. The results show that feature selection significantly boosts the F1 score, and that our method is competitive with the state-of-the-art methods in the CL-SciSumm evaluations.
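The abstract does not give implementation details for the oversampling step, so the following is only a minimal sketch of the general SMOTE idea: each synthetic minority example is placed at a random position on the line segment between a minority sample and one of its k nearest minority-class neighbours. The function name `smote_like_oversample`, the parameters `k` and `seed`, and the toy two-dimensional feature vectors are all assumptions made for illustration; this is not the authors' implementation or configuration.

```python
import random
from math import dist


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours.

    Hypothetical helper for illustration only (simplified SMOTE-style step).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest neighbours of `base` among the other minority samples
        neighbours = sorted(
            (p for j, p in enumerate(minority) if j != i),
            key=lambda p: dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Toy feature vectors standing in for "relevant" (minority) pairs of
# citances and reference sentences; the values are made up.
minority = [(0.9, 0.8), (0.8, 0.7), (0.7, 0.9), (0.85, 0.75)]
new_points = smote_like_oversample(minority, n_new=6)
```

Because every synthetic point lies between two existing minority samples, the new examples stay inside the region already occupied by the minority class. In a setup like the one the abstract describes, such oversampling would be applied to the training data only, so that evaluation still reflects the original class distribution.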
About the journal:
The Malaysian Journal of Computer Science (ISSN 0127-9084) has been published since 1985 by the Faculty of Computer Science and Information Technology, University of Malaya, and appears four times a year, in January, April, July and October. Over the years, the journal has gained popularity and the number of paper submissions has increased steadily. Rigorous reviews from the referees have helped to maintain the high standard of the journal. Its objectives are to promote the exchange of information and knowledge in research work and in new inventions and developments in Computer Science, to promote the use of Information Technology in building an information-rich society, and to assist academic staff from local and foreign universities, business and industrial sectors, government departments and academic institutions in publishing research results and studies in Computer Science and Information Technology through a scholarly publication. The journal is indexed and abstracted by Clarivate Analytics' Web of Science and Elsevier's Scopus.