Big Data Research Pub Date: 2024-08-02 DOI: 10.1016/j.bdr.2024.100484
S. Sridhar, S. Anusuya
{"title":"A dual algorithmic approach to deal with multiclass imbalanced classification problems","authors":"S. Sridhar , S. Anusuya","doi":"10.1016/j.bdr.2024.100484","DOIUrl":"10.1016/j.bdr.2024.100484","url":null,"abstract":"<div><p>Many real-world applications involve multiclass classification problems, and often the data across classes is not evenly distributed. Due to this disproportion, supervised learning models tend to classify instances towards the class with the maximum number of instances, which is a severe issue that needs to be addressed. In multiclass imbalanced data classification, machine learning researchers try to reduce the learning model's bias towards the class with a high sample count. Researchers attempt to reduce this unfairness by balancing the data before the classifier learns it, by modifying the classifier's learning phase to pay more attention to the class with the fewest instances, or by combining both. The existing algorithmic approaches struggle to identify a clear boundary between the samples of different classes because of the uneven class distribution and class overlap. As a result, the minority class recognition rate is poor. A new algorithmic approach is proposed that uses dual decision trees. The first tree creates an induced dataset using a PCA-based grouping approach and by assigning weights to the data samples; the second tree learns and predicts from the induced dataset. The distinct feature of this algorithmic approach is that it recognizes the data instances without altering their underlying data distribution and is applicable to all categories of multiclass imbalanced datasets. 
Five multiclass imbalanced datasets from UCI were used to classify the data using our proposed algorithm, and the results revealed that the duo-decision tree approach pays better attention to both the minor and major class samples.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"38 ","pages":"Article 100484"},"PeriodicalIF":3.5,"publicationDate":"2024-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141985718","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Big Data Research Pub Date: 2024-07-29 DOI: 10.1016/j.bdr.2024.100482
Lipeng Zhao, Bing Guo, Cheng Dai, Yan Shen, Fei Chen, Mingjie Zhao, Yuchuan Hu
{"title":"Multi-step trend aware graph neural network for traffic flow forecasting","authors":"Lipeng Zhao , Bing Guo , Cheng Dai , Yan Shen , Fei Chen , Mingjie Zhao , Yuchuan Hu","doi":"10.1016/j.bdr.2024.100482","DOIUrl":"10.1016/j.bdr.2024.100482","url":null,"abstract":"<div><p>Traffic flow prediction plays an important role in smart cities. Although many neural network models can already predict traffic flow, they still have shortcomings when faced with complex spatio-temporal data. Firstly, although they take local spatio-temporal relations into account, they ignore global information and therefore cannot capture the global trend. Secondly, although most models construct spatio-temporal graphs for convolution, they ignore the dynamic characteristics of those graphs and therefore cannot capture local fluctuation. Finally, the current popular models need a lot of training time to obtain better prediction results, which raises computing cost. To this end, we propose a new model: <strong>M</strong>ulti-<strong>S</strong>tep <strong>T</strong>rend <strong>A</strong>ware <strong>G</strong>raph <strong>N</strong>eural <strong>N</strong>etwork (MSTAGNN), which considers the influence of global spatio-temporal information and captures the dynamic characteristics of the spatio-temporal graph. It not only accurately captures local fluctuation but also extracts the global trend and dramatically reduces computing cost. The experimental results showed that our proposed model outperformed the baselines. In particular, on the PEMSD8 dataset the mean absolute error (MAE) was reduced by 6.25% and the total training time was reduced by 79%. 
The source code is available at: <span><span>https://github.com/Vitalitypi/MSTAGNN</span></span>.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"38 ","pages":"Article 100482"},"PeriodicalIF":3.5,"publicationDate":"2024-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141940505","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Big Data Research Pub Date: 2024-06-25 DOI: 10.1016/j.bdr.2024.100481
Fatima Zahrah, Jason R.C. Nurse, Michael Goldsmith
{"title":"Unmasking hate in the pandemic: A cross-platform study of the COVID-19 infodemic","authors":"Fatima Zahrah , Jason R.C. Nurse , Michael Goldsmith","doi":"10.1016/j.bdr.2024.100481","DOIUrl":"https://doi.org/10.1016/j.bdr.2024.100481","url":null,"abstract":"<div><p>The past few decades have established how digital technologies and platforms have provided an effective medium for spreading hateful content, which has been linked to several catastrophic consequences. Recent academic studies have also highlighted how online hate is a phenomenon that strategically makes use of multiple online platforms. In this article, we seek to advance the current research landscape by harnessing a cross-platform approach to computationally analyse content relating to the 2020 COVID-19 pandemic. More specifically, we analyse content on hate-specific environments from Twitter, Reddit, 4chan and Stormfront. Our findings show how content and posting activity can change across platforms, and how the psychological components of online content can differ depending on the platform being used. Through this, we provide unique insight into the cross-platform behaviours of online hate. 
We further define several avenues for future research within this field so as to gain a more comprehensive understanding of the global hate ecosystem.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"37 ","pages":"Article 100481"},"PeriodicalIF":3.5,"publicationDate":"2024-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2214579624000558/pdfft?md5=a8e2330701051448866927c6cb877d10&pid=1-s2.0-S2214579624000558-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141480176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Big Data Research Pub Date: 2024-06-19 DOI: 10.1016/j.bdr.2024.100473
Jiaxing Yan, Hai Liu, Zhiqi Lei, Yanghui Rao, Guan Liu, Haoran Xie, Xiaohui Tao, Fu Lee Wang
{"title":"Two-dimensional data partitioning for non-negative matrix tri-factorization","authors":"Jiaxing Yan , Hai Liu , Zhiqi Lei , Yanghui Rao , Guan Liu , Haoran Xie , Xiaohui Tao , Fu Lee Wang","doi":"10.1016/j.bdr.2024.100473","DOIUrl":"https://doi.org/10.1016/j.bdr.2024.100473","url":null,"abstract":"<div><p>As a two-sided clustering and dimensionality reduction paradigm, Non-negative Matrix Tri-Factorization (NMTF) has attracted much attention from machine learning and data mining researchers due to its excellent performance and reliable theoretical support. Unlike Non-negative Matrix Factorization (NMF) methods, which are applicable to one-sided clustering only, NMTF introduces an additional factor matrix and uses the inherent duality of data so that sample clustering and feature clustering reinforce each other, thus showing great advantages in many scenarios (e.g., text co-clustering). However, the existing methods for solving NMTF usually involve intensive matrix multiplication with high time and space complexities; that is, they suffer from slow convergence of the multiplicative update rules and high memory overhead. To solve these problems, this paper develops a distributed parallel algorithm with a 2-dimensional data partition scheme for NMTF (i.e., PNMTF-2D). 
Experiments on multiple text datasets show that the proposed PNMTF-2D can substantially improve the computational efficiency of NMTF (e.g., the average iteration time is reduced by up to 99.7% on Amazon) while ensuring the effectiveness of convergence and co-clustering.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"37 ","pages":"Article 100473"},"PeriodicalIF":3.5,"publicationDate":"2024-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141480175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Big Data Research Pub Date: 2024-06-13 DOI: 10.1016/j.bdr.2024.100480
Peng Wang, Jiang Li, Yingli Wang, Youchun liu, Yu Zhang
{"title":"Assessment of soil fertility in Xinjiang oasis cotton field based on big data techniques","authors":"Peng Wang , Jiang Li , Yingli Wang , Youchun liu , Yu Zhang","doi":"10.1016/j.bdr.2024.100480","DOIUrl":"10.1016/j.bdr.2024.100480","url":null,"abstract":"<div><p>Assessing soil fertility through traditional methods has faced challenges due to the vast amount of meteorological data and the complexity of heterogeneous data. In this study, we address these challenges by employing the K-means algorithm for cluster analysis on soil fertility data and developing a novel K-means algorithm within the Hadoop framework. Our research aims to provide a comprehensive analysis of soil fertility in the Shihezi region, particularly in the context of oasis cotton fields, leveraging big data techniques. The methodology involves utilizing soil nutrient data from 29 sampling points across six round fields in 2022. Through K-means clustering with varying K values, we determine that setting K to 3 yields optimal cluster effects, aligning closely with the actual soil fertility distribution. Furthermore, we compare the performance of our proposed K-means algorithm under the MapReduce framework with traditional serial K-means algorithms, demonstrating significant improvements in operational speed and successful completion of large-scale data computations. Our findings reveal that soil fertility in the Shihezi region can be classified into four distinct grades, providing valuable insights for agricultural practices and land management strategies. 
This classification contributes to a better understanding of soil resources in oasis cotton fields and facilitates informed decision-making processes for farmers and policymakers alike.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"37 ","pages":"Article 100480"},"PeriodicalIF":3.3,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141393452","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Big Data Research Pub Date: 2024-06-13 DOI: 10.1016/j.bdr.2024.100475
Shuo Wang, Xiang Yu, Dan Zhao, Guocai Ma, Wei Ren, Shuxin Duan
{"title":"Intelligent geological interpretation of AMT data based on machine learning","authors":"Shuo Wang , Xiang Yu , Dan Zhao , Guocai Ma , Wei Ren , Shuxin Duan","doi":"10.1016/j.bdr.2024.100475","DOIUrl":"10.1016/j.bdr.2024.100475","url":null,"abstract":"<div><p>AMT (Audio Magnetotelluric) is widely used for obtaining geological settings related to sandstone-type uranium deposits, such as the range of buried sand bodies and the top boundary of baserock. However, these geological settings are hard to read from survey sections without geological interpretation, which relies heavily on experience and cognition. Moreover, with the development of 3D technology, manual geological interpretation shows low efficiency and reliability. In this paper, a machine learning model built on U-net was used for the geological interpretation of AMT data in the Naren-Yihegaole area. To train the model, a training dataset was built from simulated data generated by random models, which addresses the issue of insufficient data samples. In the prediction stage, sand bodies and baserock were delineated from the inversion resistivity images. A comparison of the machine learning interpretation with the manual one showed high consistency between the two, with the machine learning method taking less time. 
It indicates that this technology is more individualized and effective than the traditional way.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"37 ","pages":"Article 100475"},"PeriodicalIF":3.5,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141408443","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Big Data Research Pub Date: 2024-06-13 DOI: 10.1016/j.bdr.2024.100474
Marco Ortu, Maurizio Romano, Andrea Carta
{"title":"Semi-supervised topic representation through sentiment analysis and semantic networks","authors":"Marco Ortu, Maurizio Romano, Andrea Carta","doi":"10.1016/j.bdr.2024.100474","DOIUrl":"10.1016/j.bdr.2024.100474","url":null,"abstract":"<div><p>This paper proposes a novel approach to topic detection aimed at improving the semi-supervised clustering of customer reviews in the context of customers' services. The proposed methodology, named SeMi-supervised clustering for Assessment of Reviews using Topic and Sentiment (SMARTS) for Topic-Community Representation with Semantic Networks, combines semantic and sentiment analysis of words to derive topics related to positive and negative reviews of specific services. To achieve this, a semantic network of words is constructed based on word embedding semantic similarity to identify relationships between words used in the reviews. The resulting network is then used to derive the topics present in users' reviews, which are grouped by positive and negative sentiment based on words related to specific services. Clusters of words, obtained from the network's communities, are used to extract topics related to particular services and to improve the interpretation of users' assessments of those services. The proposed methodology is applied to tourism review data from Booking.com, and the results demonstrate the efficacy of the approach in enhancing the interpretability of the topics obtained by semi-supervised clustering. 
The methodology has the potential to provide valuable insights into the sentiment of customers toward tourism services, which could be utilized by service providers and decision-makers to enhance the quality of their services.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"37 ","pages":"Article 100474"},"PeriodicalIF":3.5,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2214579624000509/pdfft?md5=46a689f4478007ad8db7233af95c8c2e&pid=1-s2.0-S2214579624000509-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141401445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Big Data Research Pub Date: 2024-06-13 DOI: 10.1016/j.bdr.2024.100477
Huimin Han, Bouba oumarou Aboubakar, Mughair Bhatti, Bandeh Ali Talpur, Yasser A. Ali, Muna Al-Razgan, Yazeed Yasid Ghadi
{"title":"Optimizing image captioning: The effectiveness of vision transformers and VGG networks for remote sensing","authors":"Huimin Han , Bouba oumarou Aboubakar , Mughair Bhatti , Bandeh Ali Talpur , Yasser A. Ali , Muna Al-Razgan , Yazeed Yasid Ghadi","doi":"10.1016/j.bdr.2024.100477","DOIUrl":"10.1016/j.bdr.2024.100477","url":null,"abstract":"<div><p>This study presents a comprehensive evaluation of two prominent deep learning models, Vision Transformer (ViT) and VGG16, within the domain of image captioning for remote sensing data. By leveraging the BLEU score, a widely accepted metric for assessing the quality of text generated by machine learning models against a set of reference captions, this research aims to dissect and understand the capabilities and performance nuances of these models across various sample sizes: 25, 50, 75, and 100 samples. Our findings reveal that the Vision Transformer model generally outperforms the VGG16 model across all evaluated sample sizes, achieving its peak performance at 50 samples with a BLEU score of 0.5507. This performance shows that ViT benefits from its ability to capture global dependencies within the data, providing a more nuanced understanding of the images. However, the performance slightly decreases as the sample size increases beyond 50, indicating potential challenges in scalability or overfitting to the training data. Conversely, the VGG16 model shows a different performance trajectory, starting with a lower BLEU score for smaller sample sizes but demonstrating a consistent improvement as the sample size increases, culminating in its highest BLEU score of 0.4783 for 100 samples. This pattern suggests that VGG16 may require a larger dataset to adequately learn and generalize from the data, although it achieves a more modest performance ceiling compared to ViT. Through a detailed analysis of these findings, the study underscores the strengths and limitations of each model in the context of image captioning. 
The Vision Transformer's superior performance highlights its potential for applications requiring high accuracy in text generation from images. In contrast, the gradual improvement exhibited by VGG16 suggests its utility in scenarios where large datasets are available, and scalability is a priority. This study contributes to the ongoing discourse in the AI community regarding the selection and optimization of deep learning models for complex tasks such as image captioning, offering insights that could guide future research and application development in this field.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"37 ","pages":"Article 100477"},"PeriodicalIF":3.5,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141415449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Big Data Research Pub Date: 2024-06-13 DOI: 10.1016/j.bdr.2024.100476
{"title":"Research on the legal system of economic-ecological synergistic compensation in carbon neutral marine cities with a background in big data","authors":"","doi":"10.1016/j.bdr.2024.100476","DOIUrl":"10.1016/j.bdr.2024.100476","url":null,"abstract":"<div><p>With the increasingly severe global carbon emissions problem and the serious threat ecosystems face, carbon neutrality has gradually attracted widespread attention. This study provides an in-depth analysis of practical cases of international carbon neutrality initiatives and relevant experiences of marine cities, focusing on the construction and implementation of a legal system for economic, ecologically coordinated compensation. To evaluate the actual effectiveness of the legal system in marine cities, this study used a multiple linear regression model, considering factors such as the strictness of the legal system, enforcement efforts, and the level of participation of local enterprises and residents. The research results indicate that carbon emissions have significantly decreased in cities where legal systems are effectively enforced, from an average of 1.5 million tons per year to 1 million tons. At the same time, the economic growth rate of these cities has also significantly improved, increasing by about 2.5 percentage points from the original annual average of 4 % to 6.5 %. The study also found that the biodiversity index of these cities increased by 15 %, far higher than the average increase of 5 % in other cities, indicating the positive role of legal systems in protecting biodiversity. The public's participation rate in environmental protection activities has also increased from 25 % to 45 %, and the growth rate of green investment has reached an average of 8 % per year, far exceeding the 3 % growth rate of other cities. In terms of the ecosystem, data shows that the distribution of the ecosystem is stable, with an average ecological index of 508, which is in a relatively ideal state. 
The annual average growth rate of ecosystem restoration is about 3.5 %, further proving the effectiveness of ecological protection measures. Comprehensive empirical analysis shows that implementing the new legal system effectively reduces carbon emissions, enhances biodiversity, and promotes sustainable economic development. The economic growth rate increased from an average of 4.2 % to 5.1 % per year after implementing the new legal system, fully demonstrating the important role of the economic, ecologically coordinated compensation legal system in promoting carbon neutrality goals in marine cities.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"37 ","pages":"Article 100476"},"PeriodicalIF":3.5,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141412546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable Diversified Top-k Pattern Matching in Big Graphs","authors":"Aissam Aouar , Saïd Yahiaoui , Lamia Sadeg , Kadda Beghdad Bey","doi":"10.1016/j.bdr.2024.100464","DOIUrl":"10.1016/j.bdr.2024.100464","url":null,"abstract":"<div><p>Typically, graph pattern matching is expressed in terms of subgraph isomorphism. Graph simulation and its variants were introduced to reduce the time complexity and obtain more meaningful results in big graphs. Among these models, the matching subgraphs obtained by tight simulation are more compact and topologically closer to the pattern graph than results produced by other approaches. However, the number of resulting subgraphs can be huge, overlapping each other and unequally relaxed from the pattern graph. Hence, we introduce a ranking and diversification method for tight simulation results, which allows the user to obtain the most diversified and relevant matching subgraphs. This approach exploits the weights on edges of the big graph to express the interest of the matching subgraph by tight simulation. Furthermore, we provide distributed scalable algorithms to evaluate the proposed methods based on distributed programming paradigms. The experiments on real data graphs succeed in demonstrating the effectiveness of the proposed models and the efficiency of the associated algorithms. 
The result diversification reached 123% within a time frame that does not exceed 40%, on average, of the duration required for tight simulation graph pattern matching.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"36 ","pages":"Article 100464"},"PeriodicalIF":3.3,"publicationDate":"2024-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141043195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}