Predicting high increases in stock prices using text mining and data resampling techniques

IF 7.2, Zone 1 (Computer Science), JCR Q1 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)
Chih-Fong Tsai, Ming-Chang Wang, Wei-Chao Lin, Xin-Yu Zheng
Applied Soft Computing, vol. 176, Article 113228. Published 2025-05-01. Journal Article.
DOI: 10.1016/j.asoc.2025.113228
https://www.sciencedirect.com/science/article/pii/S1568494625005393
Citations: 0

Abstract

Text mining techniques have proven effective for building stock prediction models, most of which treat the task as binary classification: will the future stock price rise or fall? In practice, however, such models cannot support a well-defined investment portfolio composed of high-, medium-, and low-risk target stocks for different levels of return on investment (ROI). To meet this practical demand, prediction models should be able to predict different stock rise ratios for the portfolio. Constructing such models introduces a class imbalance problem: the training datasets contain far fewer examples in the high-rise class than in the nonhigh-rise class. This paper therefore examines how text mining-based stock prediction models built with different machine learning and deep learning techniques perform in predicting different high-stock-price ratios, namely 3%, 5%, 7%, and 9%. In addition, different data resampling techniques are used to rebalance the class-imbalanced training datasets before the prediction models are constructed, and the resulting performance is compared. The experimental results indicate that one-class classifiers, such as the one-class support vector machine and isolation forest, perform very well on the class-imbalanced datasets in terms of AUC and the type I error, defined here as misclassifying high-rise cases into the nonhigh-rise class. Furthermore, after the training datasets are rebalanced with over- and hybrid sampling algorithms, most classifiers show some performance improvement, with hybrid sampling a better choice than oversampling.
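The labeling and rebalancing steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline. The abstract does not name the specific resampling algorithms, so plain random oversampling stands in for them here. The function names, the `>=` comparison against the threshold, the sample returns, and the placeholder features are all assumptions made for illustration. The last helper follows the abstract's own definition of the type I error (high-rise cases predicted as nonhigh-rise).

```python
import random
from collections import Counter

# The 3%, 5%, 7%, and 9% rise ratios named in the abstract.
THRESHOLDS = (0.03, 0.05, 0.07, 0.09)


def label_high_rise(returns, threshold):
    """Label 1 (high-rise) when the return reaches the threshold, else 0.

    Using >= rather than > is an assumption; the abstract does not say.
    """
    return [1 if r >= threshold else 0 for r in returns]


def random_oversample(features, labels, seed=0):
    """Duplicate randomly chosen minority examples until both classes match in size.

    A simple stand-in for the oversampling algorithms the paper evaluates.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    if len(counts) < 2:
        return list(features), list(labels)  # only one class: nothing to rebalance
    minority = min(counts, key=counts.get)
    majority_size = max(counts.values())
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    extra = [rng.choice(minority_idx) for _ in range(majority_size - counts[minority])]
    new_features = list(features) + [features[i] for i in extra]
    new_labels = list(labels) + [labels[i] for i in extra]
    return new_features, new_labels


def high_rise_miss_rate(y_true, y_pred):
    """Share of true high-rise cases predicted as nonhigh-rise."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    if not predictions_for_positives:
        return 0.0
    misses = sum(1 for p in predictions_for_positives if p == 0)
    return misses / len(predictions_for_positives)


# Hypothetical returns and placeholder feature vectors (not data from the paper).
returns = [0.01, -0.02, 0.04, 0.00, 0.08, 0.02, -0.01, 0.10, 0.03, 0.005]
features = [[i] for i in range(len(returns))]

for threshold in THRESHOLDS:
    labels = label_high_rise(returns, threshold)
    _, balanced_labels = random_oversample(features, labels)
    # Class counts before and after rebalancing for this threshold.
    print(threshold, dict(Counter(labels)), dict(Counter(balanced_labels)))
```

Raising the threshold shrinks the high-rise class, which is the imbalance the abstract describes. In a real experiment the resampling would be applied to the training split only, so that evaluation data keep their original class distribution.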
Source journal
Applied Soft Computing (Engineering & Technology - Computer Science: Interdisciplinary Applications)
CiteScore: 15.80
Self-citation rate: 6.90%
Articles published: 874
Review time: 10.9 months
About the journal: Applied Soft Computing is an international journal promoting an integrated view of soft computing to solve real-life problems. Its focus is to publish the highest-quality research on the application and convergence of Fuzzy Logic, Neural Networks, Evolutionary Computing, Rough Sets, and other similar techniques to address real-world complexities. Applied Soft Computing is a rolling publication: articles are published as soon as the editor-in-chief has accepted them. The website is therefore continuously updated with new articles, and the publication time is short.