Building Domain-Specific Lexicons: An Application to Financial News
Savaş Yıldırım, Dhanya Jothimani, Can Kavaklioglu, A. Bener
2019 International Conference on Deep Learning and Machine Learning in Emerging Applications (Deep-ML)
DOI: 10.1109/Deep-ML.2019.00013
Publication date: 2019-08-01
Citations: 3
Abstract
Natural Language Processing (NLP) has gained attention in recent years. Previous research (such as WordNet and Cyc) has focused on developing all-purpose (generalised) polarised lexicons. However, these lexicons carry little information in specialised domains such as finance and the medical sciences, and using them for text classification can hurt prediction accuracy. There is therefore a need to build domain- and context-specific lexicons. To achieve this, this work proposes a label-propagation algorithm over word embeddings to obtain positive and negative lexicons. The proposed algorithm combines principles from graph theory with word embeddings. It is tested on a Dow Jones newswire text feed to classify financial news as hot or non-hot. Three classifiers, namely Logistic Regression, Random Forest, and XGBoost, were trained using the polarised lexicons, seed words, and random words, and their performance was evaluated using accuracy. Lexicons generated with the proposed approach were more effective at classifying financial news articles as hot or non-hot than classifiers using seed words or random words. The proposed label propagation with word embeddings generates context-specific lexicons, which aids better representation of text in natural language processing tasks and mitigates the problem of high dimensionality.
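The core idea described in the abstract, propagating seed polarity labels through a similarity graph built from word embeddings, can be sketched in a few lines. The toy 2-D vectors, seed words, and hyperparameters below are illustrative assumptions for the sketch, not the paper's actual embeddings, data, or settings:

```python
import numpy as np

# Toy word vectors (hypothetical; a real system would use trained embeddings
# learned from the financial news corpus).
vectors = {
    "gain":   np.array([0.90, 0.10]),
    "profit": np.array([0.85, 0.20]),
    "rally":  np.array([0.80, 0.15]),
    "loss":   np.array([0.10, 0.90]),
    "debt":   np.array([0.20, 0.85]),
    "crash":  np.array([0.15, 0.80]),
}
# Seed words with known polarity: +1 positive, -1 negative.
seeds = {"gain": 1.0, "loss": -1.0}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def propagate(vectors, seeds, iters=50, alpha=0.8):
    """Spread seed polarity over a cosine-similarity graph of the vocabulary."""
    words = list(vectors)
    # Pairwise cosine similarities form the edge weights of the graph.
    sim = np.array([[cosine(vectors[u], vectors[v]) for v in words]
                    for u in words])
    np.fill_diagonal(sim, 0.0)
    sim /= sim.sum(axis=1, keepdims=True)      # row-normalise to a transition matrix
    y0 = np.array([seeds.get(w, 0.0) for w in words])
    y = y0.copy()
    for _ in range(iters):
        # Blend neighbour scores with the fixed seed labels.
        y = alpha * sim @ y + (1 - alpha) * y0
    return dict(zip(words, y))

scores = propagate(vectors, seeds)
positive = sorted(w for w, s in scores.items() if s > 0)
negative = sorted(w for w, s in scores.items() if s < 0)
```

Here `alpha` controls the trade-off between trusting the graph neighbourhood and staying anchored to the seed labels; words whose embeddings cluster near a positive seed end up in the positive lexicon, and likewise for negative.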