Automatic detection of relevant information, predictions and forecasts in financial news through topic modelling with Latent Dirichlet Allocation

Silvia García-Méndez, Francisco de Arriba-Pérez, Ana Barros-Vila, Francisco J. González-Castaño, Enrique Costa-Montenegro
arXiv - QuantFin - Statistical Finance. Published 2024-03-30. DOI: arxiv-2404.01338 (https://doi.org/arxiv-2404.01338)

Abstract

Financial news items are unstructured sources of information that can be mined to extract knowledge for market screening applications. Manual extraction of relevant information from the continuous stream of finance-related news is cumbersome and beyond the skills of many investors, who, at most, can follow a few sources and authors. Accordingly, we focus on the analysis of financial news to identify relevant text and, within that text, forecasts and predictions. We propose a novel Natural Language Processing (NLP) system to assist investors in the detection of relevant financial events in unstructured textual sources by considering both relevance and temporality at the discursive level. Firstly, we segment the text to group together closely related text. Secondly, we apply co-reference resolution to discover internal dependencies within segments. Finally, we perform relevant topic modelling with Latent Dirichlet Allocation (LDA) to separate relevant from less relevant text and then analyse the relevant text using a Machine Learning-oriented temporal approach to identify predictions and speculative statements. We created an experimental data set composed of 2,158 financial news items that were manually labelled by NLP researchers to evaluate our solution. The ROUGE-L values for the identification of relevant text and predictions/forecasts were 0.662 and 0.982, respectively. To our knowledge, this is the first work to jointly consider relevance and temporality at the discursive level. It contributes to the transfer of human associative discourse capabilities to expert systems through the combination of multi-paragraph topic segmentation and co-reference resolution to separate author expression patterns, topic modelling with LDA to detect relevant text, and discursive temporality analysis to identify forecasts and predictions within this text.
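
The abstract reports ROUGE-L scores for both tasks. As a minimal sketch of how such a score can be computed, the following pure-Python example derives a sentence-level ROUGE-L F-measure from the longest common subsequence (LCS) of two token sequences. The whitespace tokenisation, the lowercasing, the `beta` weighting, and the function names `lcs_length` and `rouge_l` are assumptions made for illustration; the abstract does not state which ROUGE-L variant or tooling the authors used.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous row of the DP table
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference: str, candidate: str, beta: float = 1.0) -> float:
    """Sentence-level ROUGE-L F-measure (assumed variant, whitespace tokens)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    recall = lcs / len(ref)       # share of the reference covered by the LCS
    precision = lcs / len(cand)   # share of the candidate covered by the LCS
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


if __name__ == "__main__":
    # Hypothetical example: manually labelled text vs. text selected by a system.
    labelled = "the central bank will raise rates next quarter"
    selected = "the bank will raise rates"
    print(round(rouge_l(labelled, selected), 3))
```

In this hypothetical example the selected text is a subsequence of the labelled text, so precision is 1 and the score is limited only by recall. A score near 1, such as the 0.982 reported for predictions/forecasts, would indicate that the extracted text almost fully matches the manual annotation under this kind of measure.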