Big Data Research Pub Date: 2024-11-07 DOI: 10.1016/j.bdr.2024.100494
Zhihui Lai, Xiaomei Fang, Heng Kong
{"title":"Deep semantics-preserving cross-modal hashing","authors":"Zhihui Lai , Xiaomei Fang , Heng Kong","doi":"10.1016/j.bdr.2024.100494","DOIUrl":"10.1016/j.bdr.2024.100494","url":null,"abstract":"<div><div>Cross-modal hashing has received widespread attention in recent years due to its outstanding performance in cross-modal data retrieval. Cross-modal hashing can be decomposed into two steps, i.e., feature learning and binarization. However, most existing cross-modal hashing methods do not take the supervisory information of the data into consideration during binary quantization, and thus often fail to adequately preserve semantic information. To solve this problem, this paper proposes a novel deep cross-modal hashing method called deep semantics-preserving cross-modal hashing (DSCMH), which makes full use of intra- and inter-modal semantic information to improve the model's performance. Moreover, by designing a label network for semantic alignment during the binarization process, DSCMH's performance can be further improved. To verify the performance of the proposed method, extensive experiments were conducted on four large datasets. The results show that the proposed method outperforms most existing cross-modal hashing methods. In addition, the ablation experiment shows that the proposed regularization terms all have positive effects on the model's performance in cross-modal retrieval.
The code of this paper can be downloaded from <span><span>http://www.scholat.com/laizhihui</span></span>.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"38 ","pages":"Article 100494"},"PeriodicalIF":3.5,"publicationDate":"2024-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142650889","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Big Data Research Pub Date: 2024-09-27 DOI: 10.1016/j.bdr.2024.100493
Yinuo Qian, Fuzhong Nian
{"title":"Research on the characteristics of information propagation dynamic on the weighted multiplex Weibo networks","authors":"Yinuo Qian, Fuzhong Nian","doi":"10.1016/j.bdr.2024.100493","DOIUrl":"10.1016/j.bdr.2024.100493","url":null,"abstract":"<div><div>To simulate how different categories of Weibo posts are forwarded and to discover interesting propagation phenomena in different layers of Weibo networks, this paper proposes a retweeting weighted multiplex network and a propagation model coupled with multi-class Weibo. Firstly, the weighted multiplex social network is constructed by processing Weibo network data. Secondly, a new information propagation model is established by using the weights and interlayer information of the Weibo multiplex network combined with the coupling factors in the propagation. Finally, the information propagation simulated by the model is compared with real data to summarize the different information propagation phenomena in the multiplex social network. At the same time, by comparing the structures of the forwarding weighted multiplex networks constructed from short-term and long-term data, we find that the forwarding weighted multiplex network is self-similar, which demonstrates the generalizability of the experiment.
Through the above research, the mystery of the Weibo social network has been deeply explored, and a new perspective has been opened up for the exploration of social media information propagation.</div></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"38 ","pages":"Article 100493"},"PeriodicalIF":3.5,"publicationDate":"2024-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142417738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Big Data Research Pub Date: 2024-08-08 DOI: 10.1016/j.bdr.2024.100483
Bilal Tahir, Muhammad Amir Mehmood
{"title":"Leveraging social computing for epidemic surveillance: A case study","authors":"Bilal Tahir , Muhammad Amir Mehmood","doi":"10.1016/j.bdr.2024.100483","DOIUrl":"10.1016/j.bdr.2024.100483","url":null,"abstract":"<div><p>Social media platforms have become a popular source of information for real-time monitoring of events and user behavior. In particular, Twitter provides invaluable information related to diseases and public health for building real-time disease surveillance systems. Effective use of such social media platforms for public health surveillance requires data-driven AI models, whose development is hindered by the difficult, expensive, and time-consuming task of collecting high-quality and large-scale datasets. In this paper, we build and analyze the Epidemic TweetBank (EpiBank) dataset containing 271 million English tweets related to six epidemic-prone diseases: COVID19, Flu, Hepatitis, Dengue, Malaria, and HIV/AIDS. For this purpose, we develop a tool, ESS-T (Epidemic Surveillance Study via Twitter), which collects tweets according to provided input parameters and keywords. Our tool also assigns locations to tweets with 95% accuracy and analyzes the collected tweets, focusing on temporal distribution, spatial patterns, users, entities, sentiment, and misinformation. Leveraging ESS-T, we build two geo-tagged datasets, EpiBank-global and EpiBank-Pak, containing 86 million tweets from 190 countries and 2.6 million tweets from Pakistan, respectively.
Our spatial analysis of EpiBank-global for COVID19, Malaria, and Dengue indicates that our framework correctly identifies high-risk epidemic-prone countries according to World Health Organization (WHO) statistics.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"38 ","pages":"Article 100483"},"PeriodicalIF":3.5,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141978839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Big Data Research Pub Date: 2024-08-02 DOI: 10.1016/j.bdr.2024.100485
Daniela N. Rim, DongNyeong Heo, Chungjun Lee, Sukhyun Nam, Jae-Hyoung Yoo, James Won-Ki Hong, Heeyoul Choi
{"title":"Anomaly detection based on system text logs of virtual network functions","authors":"Daniela N. Rim , DongNyeong Heo , Chungjun Lee , Sukhyun Nam , Jae-Hyoung Yoo , James Won-Ki Hong , Heeyoul Choi","doi":"10.1016/j.bdr.2024.100485","DOIUrl":"10.1016/j.bdr.2024.100485","url":null,"abstract":"<div><p>In virtual network environments, building secure and effective systems is crucial for correct functioning, and anomaly detection is at the core of this task. To uncover and predict abnormalities in the behavior of a virtual machine, it is desirable to extract relevant information from system text logs. The main issue is that text is unstructured, symbolic data that is very expensive to process. However, recent advances in deep learning have shown remarkable capabilities for handling such data. In this work, we propose using a simple LSTM recurrent network on top of a pre-trained Sentence-BERT, which encodes the system logs into fixed-length vectors. We trained the model in an unsupervised fashion to learn the likelihood of the represented sequences of logs. This way, the model can trigger a warning with an accuracy of 81% when a virtual machine generates an abnormal sequence. Our approach is not only easy to train and computationally cheap but also generalizes to the content of any input.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"38 ","pages":"Article 100485"},"PeriodicalIF":3.5,"publicationDate":"2024-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141940501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Big Data Research Pub Date: 2024-08-02 DOI: 10.1016/j.bdr.2024.100484
S. Sridhar, S. Anusuya
{"title":"A dual algorithmic approach to deal with multiclass imbalanced classification problems","authors":"S. Sridhar , S. Anusuya","doi":"10.1016/j.bdr.2024.100484","DOIUrl":"10.1016/j.bdr.2024.100484","url":null,"abstract":"<div><p>Many real-world applications involve multiclass classification problems, and often the data across classes is not evenly distributed. Due to this disproportion, supervised learning models tend to classify instances towards the class with the maximum number of instances, which is a severe issue that needs to be addressed. In multiclass imbalanced data classification, machine learning researchers try to reduce the learning model's bias towards the class with a high sample count. Researchers attempt to reduce this bias by balancing the data before the classifier learns it, by modifying the classifier's learning phase to pay more attention to the class with the fewest instances, or by a combination of both. Existing algorithmic approaches struggle to identify a clear boundary between the samples of different classes due to skewed class distribution and class overlap. As a result, the minority class recognition rate is poor. A new algorithmic approach is proposed that uses dual decision trees. The first tree creates an induced dataset using a PCA-based grouping approach and by assigning weights to the data samples; the second learns and predicts from the induced dataset. The distinctive feature of this algorithmic approach is that it recognizes the data instances without altering their underlying data distribution and is applicable to all categories of multiclass imbalanced datasets.
Five multiclass imbalanced datasets from UCI were classified using our proposed algorithm, and the results revealed that the dual decision tree approach pays better attention to both the minority and majority class samples.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"38 ","pages":"Article 100484"},"PeriodicalIF":3.5,"publicationDate":"2024-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141985718","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Big Data Research Pub Date: 2024-07-29 DOI: 10.1016/j.bdr.2024.100482
Lipeng Zhao, Bing Guo, Cheng Dai, Yan Shen, Fei Chen, Mingjie Zhao, Yuchuan Hu
{"title":"Multi-step trend aware graph neural network for traffic flow forecasting","authors":"Lipeng Zhao , Bing Guo , Cheng Dai , Yan Shen , Fei Chen , Mingjie Zhao , Yuchuan Hu","doi":"10.1016/j.bdr.2024.100482","DOIUrl":"10.1016/j.bdr.2024.100482","url":null,"abstract":"<div><p>Traffic flow prediction plays an important role in smart cities. Although many existing neural network models can predict traffic flow, they still have shortcomings when faced with complex spatio-temporal data. Firstly, although they take local spatio-temporal relations into account, they ignore global information and therefore cannot capture the global trend. Secondly, although most models construct spatio-temporal graphs for convolution, they ignore the dynamic characteristics of these graphs and therefore cannot capture local fluctuations. Finally, currently popular models need a lot of training time to obtain better prediction results, resulting in higher computing cost. To this end, we propose a new model: <strong>M</strong>ulti-<strong>S</strong>tep <strong>T</strong>rend <strong>A</strong>ware <strong>G</strong>raph <strong>N</strong>eural <strong>N</strong>etwork (MSTAGNN), which considers the influence of global spatio-temporal information and captures the dynamic characteristics of spatio-temporal graphs. It can not only accurately capture local fluctuations but also extract the global trend and dramatically reduce computing cost. The experimental results show that our proposed model achieves the best results compared to the baseline. In particular, on the PEMSD8 dataset, mean absolute error (MAE) was reduced by 6.25% and total training time was reduced by 79%.
The source codes are available at: <span><span>https://github.com/Vitalitypi/MSTAGNN</span></span>.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"38 ","pages":"Article 100482"},"PeriodicalIF":3.5,"publicationDate":"2024-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141940505","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Big Data Research Pub Date: 2024-06-25 DOI: 10.1016/j.bdr.2024.100481
Fatima Zahrah, Jason R.C. Nurse, Michael Goldsmith
{"title":"Unmasking hate in the pandemic: A cross-platform study of the COVID-19 infodemic","authors":"Fatima Zahrah , Jason R.C. Nurse , Michael Goldsmith","doi":"10.1016/j.bdr.2024.100481","DOIUrl":"https://doi.org/10.1016/j.bdr.2024.100481","url":null,"abstract":"<div><p>The past few decades have established how digital technologies and platforms have provided an effective medium for spreading hateful content, which has been linked to several catastrophic consequences. Recent academic studies have also highlighted how online hate is a phenomenon that strategically makes use of multiple online platforms. In this article, we seek to advance the current research landscape by harnessing a cross-platform approach to computationally analyse content relating to the 2020 COVID-19 pandemic. More specifically, we analyse content on hate-specific environments from Twitter, Reddit, 4chan and Stormfront. Our findings show how content and posting activity can change across platforms, and how the psychological components of online content can differ depending on the platform being used. Through this, we provide unique insight into the cross-platform behaviours of online hate. 
We further define several avenues for future research within this field so as to gain a more comprehensive understanding of the global hate ecosystem.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"37 ","pages":"Article 100481"},"PeriodicalIF":3.5,"publicationDate":"2024-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2214579624000558/pdfft?md5=a8e2330701051448866927c6cb877d10&pid=1-s2.0-S2214579624000558-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141480176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Big Data Research Pub Date: 2024-06-19 DOI: 10.1016/j.bdr.2024.100473
Jiaxing Yan, Hai Liu, Zhiqi Lei, Yanghui Rao, Guan Liu, Haoran Xie, Xiaohui Tao, Fu Lee Wang
{"title":"Two-dimensional data partitioning for non-negative matrix tri-factorization","authors":"Jiaxing Yan , Hai Liu , Zhiqi Lei , Yanghui Rao , Guan Liu , Haoran Xie , Xiaohui Tao , Fu Lee Wang","doi":"10.1016/j.bdr.2024.100473","DOIUrl":"https://doi.org/10.1016/j.bdr.2024.100473","url":null,"abstract":"<div><p>As a two-sided clustering and dimensionality reduction paradigm, Non-negative Matrix Tri-Factorization (NMTF) has attracted much attention from machine learning and data mining researchers due to its excellent performance and reliable theoretical support. Unlike Non-negative Matrix Factorization (NMF) methods, which are applicable to one-sided clustering only, NMTF introduces an additional factor matrix and uses the inherent duality of data to realize the mutual promotion of sample clustering and feature clustering, thus showing great advantages in many scenarios (e.g., text co-clustering). However, existing methods for solving NMTF usually involve intensive matrix multiplication with high time and space complexity; that is, they suffer from slow convergence of the multiplicative update rules and high memory overhead. To solve these problems, this paper develops a distributed parallel algorithm with a 2-dimensional data partition scheme for NMTF (i.e., PNMTF-2D).
Experiments on multiple text datasets show that the proposed PNMTF-2D can substantially improve the computational efficiency of NMTF (e.g., the average iteration time is reduced by up to 99.7% on Amazon) while ensuring the effectiveness of convergence and co-clustering.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"37 ","pages":"Article 100473"},"PeriodicalIF":3.5,"publicationDate":"2024-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141480175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Big Data Research Pub Date: 2024-06-13 DOI: 10.1016/j.bdr.2024.100480
Peng Wang, Jiang Li, Yingli Wang, Youchun liu, Yu Zhang
{"title":"Assessment of soil fertility in Xinjiang oasis cotton field based on big data techniques","authors":"Peng Wang , Jiang Li , Yingli Wang , Youchun liu , Yu Zhang","doi":"10.1016/j.bdr.2024.100480","DOIUrl":"10.1016/j.bdr.2024.100480","url":null,"abstract":"<div><p>Assessing soil fertility through traditional methods has faced challenges due to the vast amount of meteorological data and the complexity of heterogeneous data. In this study, we address these challenges by employing the K-means algorithm for cluster analysis on soil fertility data and developing a novel K-means algorithm within the Hadoop framework. Our research aims to provide a comprehensive analysis of soil fertility in the Shihezi region, particularly in the context of oasis cotton fields, leveraging big data techniques. The methodology involves utilizing soil nutrient data from 29 sampling points across six round fields in 2022. Through K-means clustering with varying K values, we determine that setting K to 3 yields optimal cluster effects, aligning closely with the actual soil fertility distribution. Furthermore, we compare the performance of our proposed K-means algorithm under the MapReduce framework with traditional serial K-means algorithms, demonstrating significant improvements in operational speed and successful completion of large-scale data computations. Our findings reveal that soil fertility in the Shihezi region can be classified into four distinct grades, providing valuable insights for agricultural practices and land management strategies. 
This classification contributes to a better understanding of soil resources in oasis cotton fields and facilitates informed decision-making processes for farmers and policymakers alike.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"37 ","pages":"Article 100480"},"PeriodicalIF":3.3,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141393452","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Big Data Research Pub Date: 2024-06-13 DOI: 10.1016/j.bdr.2024.100475
Shuo Wang, Xiang Yu, Dan Zhao, Guocai Ma, Wei Ren, Shuxin Duan
{"title":"Intelligent geological interpretation of AMT data based on machine learning","authors":"Shuo Wang , Xiang Yu , Dan Zhao , Guocai Ma , Wei Ren , Shuxin Duan","doi":"10.1016/j.bdr.2024.100475","DOIUrl":"10.1016/j.bdr.2024.100475","url":null,"abstract":"<div><p>AMT (Audio Magnetotelluric) is widely used for obtaining geological settings related to sandstone-type uranium deposits, such as the extent of buried sand bodies and the top boundary of baserock. However, these geological settings are hard to identify from survey sections without geological interpretation, which relies heavily on experience and cognition. Moreover, with the development of 3D technology, manual geological interpretation shows low efficiency and reliability. In this paper, a machine learning model built on U-net is used for the geological interpretation of AMT data in the Naren-Yihegaole area. To train the model, a training dataset was built from simulated data generated by random models, which addresses the issue of insufficient data samples. In the prediction stage, sand bodies and baserock were delineated from the inversion resistivity images. A comparison of the two interpretations showed that the machine learning result was highly consistent with the manual one while taking less time.
It indicates that this technology is more individualized and effective than the traditional way.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"37 ","pages":"Article 100475"},"PeriodicalIF":3.5,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141408443","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}