Automatic detection of relevant information, predictions and forecasts in financial news through topic modelling with Latent Dirichlet Allocation
Silvia García-Méndez, Francisco de Arriba-Pérez, Ana Barros-Vila, Francisco J. González-Castaño, Enrique Costa-Montenegro
arXiv - QuantFin - Statistical Finance, 2024-03-30
DOI: arxiv-2404.01338 (https://doi.org/arxiv-2404.01338)
Abstract
Financial news items are unstructured sources of information that can be
mined to extract knowledge for market screening applications. Manual extraction
of relevant information from the continuous stream of finance-related news is
cumbersome and beyond the skills of many investors, who, at most, can follow a
few sources and authors. Accordingly, we focus on the analysis of financial
news to identify relevant text and, within that text, forecasts and
predictions. We propose a novel Natural Language Processing (NLP) system to
assist investors in the detection of relevant financial events in unstructured
textual sources by considering both relevance and temporality at the discursive
level. Firstly, we segment the text so that closely related passages are grouped together.
Secondly, we apply co-reference resolution to discover internal dependencies
within segments. Finally, we apply topic modelling with Latent
Dirichlet Allocation (LDA) to separate relevant from less relevant text and
then analyse the relevant text using a Machine Learning-oriented temporal
approach to identify predictions and speculative statements. To evaluate our
solution, we created an experimental data set of 2,158 financial news items
manually labelled by NLP researchers. The ROUGE-L values for
the identification of relevant text and predictions/forecasts were 0.662 and
0.982, respectively. To our knowledge, this is the first work to jointly
consider relevance and temporality at the discursive level. It contributes to
the transfer of human associative discourse capabilities to expert systems
through the combination of multi-paragraph topic segmentation and co-reference
resolution to separate author expression patterns, topic modelling with LDA to
detect relevant text, and discursive temporality analysis to identify forecasts
and predictions within this text.
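
For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a sentence-level ROUGE-L score can be computed from the longest common subsequence (LCS) between a manually labelled reference and the text selected by a system. The abstract does not state which ROUGE-L variant, tokenisation or weighting the authors used, so the lowercased whitespace tokenisation, the F-measure with beta = 1, and the function names `lcs_length` and `rouge_l` are all assumptions made for illustration.

```python
def lcs_length(a, b):
    """Return the length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous row of the DP table
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate, beta=1.0):
    """ROUGE-L F-measure between two strings (assumed: whitespace tokens, lowercase)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    recall = lcs / len(ref)       # share of the reference covered by the LCS
    precision = lcs / len(cand)   # share of the candidate covered by the LCS
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


if __name__ == "__main__":
    # Hypothetical example sentences, not taken from the paper's data set.
    gold = "the central bank will raise rates next month"
    predicted = "the bank will raise rates"
    print(round(rouge_l(gold, predicted), 3))
```

In this hypothetical example the candidate is a subsequence of the reference, so precision is 1 and recall is 5/8, which gives an F-measure of about 0.769. A score close to 1, such as the 0.982 reported for predictions and forecasts, indicates that the selected text almost entirely matches the annotated text in content and order.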