Building Domain-Specific Lexicons: An Application to Financial News

2019 International Conference on Deep Learning and Machine Learning in Emerging Applications (Deep-ML). Pub Date: 2019-08-01. DOI: 10.1109/Deep-ML.2019.00013

Savaş Yıldırım, Dhanya Jothimani, Can Kavaklioglu, A. Bener


Citations: 3

Abstract

Natural Language Processing (NLP) has gained attention in recent years. Previous research (such as WordNet and Cyc) has focused on developing all-purpose (generalised) polarised lexicons. However, these lexicons provide little information in specialised domains such as Finance and Medical Sciences, and using them for text classification could hurt prediction accuracy. Therefore, there is a need to build domain- and context-specific lexicons. To this end, this work proposes a label-propagation-based word embedding algorithm to obtain positive and negative lexicons. The algorithm builds on graph theory and word embeddings. It is tested on the Dow Jones news wires text feed to classify Financial news as hot or non-hot. Three classifiers, namely Logistic Regression, Random Forest and XGBoost, were each trained with polarised lexicons, seed words and random words, and their performance in all cases was evaluated using accuracy. Classifiers using lexicons generated by the proposed approach were more effective at classifying Financial news articles as hot or non-hot than classifiers using seed words or random words. The proposed label propagation with word embedding algorithm generates context-specific lexicons, which help to represent text better in natural language processing tasks and avoid the problem of high dimensionality.
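The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of propagating seed polarity over a word-embedding graph, not the authors' implementation. The toy 2-D vectors, the seed words, the function name `propagate_labels`, and the parameters `k`, `alpha` and `n_iter` are all assumptions made for illustration; a real setup would use embeddings trained on the news corpus.

```python
import numpy as np


def propagate_labels(embeddings, seeds, k=2, alpha=0.8, n_iter=50):
    """Spread seed polarity (+1 / -1) over a k-nearest-neighbour word graph.

    embeddings: dict mapping word -> vector (hypothetical toy data here)
    seeds:      dict mapping seed word -> +1.0 (positive) or -1.0 (negative)
    Returns a dict mapping every word to a polarity score.
    """
    words = sorted(embeddings)
    X = np.array([embeddings[w] for w in words], dtype=float)
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-length rows

    # Cosine similarity between all word pairs; exclude self-similarity.
    S = X @ X.T
    np.fill_diagonal(S, -np.inf)

    # Keep only each word's k most similar neighbours as graph edges,
    # dropping negative similarities so edge weights stay non-negative.
    W = np.zeros_like(S)
    for i in range(len(words)):
        nbrs = np.argsort(S[i])[-k:]
        W[i, nbrs] = np.maximum(S[i, nbrs], 0.0)
    W = np.maximum(W, W.T)  # make the graph undirected

    # Row-normalise to obtain a transition matrix.
    row_sums = W.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0
    P = W / row_sums

    # Iteratively mix neighbour scores with the original seed labels.
    y0 = np.array([seeds.get(w, 0.0) for w in words])
    y = y0.copy()
    for _ in range(n_iter):
        y = alpha * (P @ y) + (1 - alpha) * y0
    return dict(zip(words, y))


# Hypothetical toy embeddings: two clusters pointing in opposite directions.
toy_embeddings = {
    "gain": [1.0, 0.1], "surge": [0.9, 0.2], "rally": [0.95, 0.05],
    "loss": [-1.0, 0.1], "plunge": [-0.9, 0.2], "slump": [-0.95, 0.05],
}
toy_seeds = {"gain": 1.0, "loss": -1.0}

scores = propagate_labels(toy_embeddings, toy_seeds)
positive = sorted(w for w, s in scores.items() if s > 0)
negative = sorted(w for w, s in scores.items() if s < 0)
print("positive lexicon:", positive)
print("negative lexicon:", negative)
```

In this sketch the resulting positive and negative word lists play the role of the polarised lexicons; counts of lexicon words per article could then serve as low-dimensional features for a classifier such as Logistic Regression.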


