Big Data Research. Pub Date: 2023-08-28. DOI: 10.1016/j.bdr.2023.100394
Linqin Cai, Lingjun Wang, Rongdi Yuan, Tingjie Lai
{"title":"Meta-Learning Based Dynamic Adaptive Relation Learning for Few-Shot Knowledge Graph Completion","authors":"Linqin Cai, Lingjun Wang, Rongdi Yuan, Tingjie Lai","doi":"10.1016/j.bdr.2023.100394","DOIUrl":"https://doi.org/10.1016/j.bdr.2023.100394","url":null,"abstract":"<div><p>As artificial intelligence<span> gradually steps into the cognitive intelligence stage, knowledge graphs (KGs) play an increasingly important role in many natural language processing<span><span> tasks. Due to the prevalence of long-tail relations in KGs, few-shot knowledge graph completion (KGC) for link prediction of long-tail relations has gradually become a hot research topic. Current few-shot KGC methods mainly focus on the static representation of surrounding entities to explore the potential semantic features<span> of entities, while ignoring the dynamic properties among entities and the special influence of the long-tail relation on link prediction. In this paper, a new meta-learning based dynamic adaptive relation learning model (DARL) is proposed for few-shot KGC. To obtain better semantic information of the meta knowledge, the proposed DARL model applies a dynamic neighbor encoder to incorporate neighbor relations into entity embedding. In addition, DARL builds </span></span>an attention-mechanism-based fusion strategy for different attributes of the same relation to further enhance the relation-meta learning ability. We evaluate our DARL model on two public benchmark datasets, NELL-One and WIKI-One, for link prediction. 
Extensive experimental results indicate that our DARL outperforms the state-of-the-art models, with average relative improvements of about 23.37% and 32.46% in MRR and Hits@1 on NELL-One, respectively.</span></span></p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"33 ","pages":"Article 100394"},"PeriodicalIF":3.3,"publicationDate":"2023-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49711677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
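MRR and Hits@1, the two metrics quoted in the abstract above, are standard ranking measures for link prediction. The following is a minimal Python sketch, assuming each test query yields the 1-based rank of the correct entity. The function names and the sample ranks are illustrative assumptions and are not taken from the paper.

```python
from typing import Sequence


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank over all queries (rank 1 means the correct entity came first)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of queries whose correct entity is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def relative_improvement(new: float, baseline: float) -> float:
    """Gain of `new` over `baseline`, expressed as a fraction of the baseline."""
    return (new - baseline) / baseline


# Hypothetical ranks for four test queries (illustrative data only).
sample_ranks = [1, 2, 4, 1]
print(mean_reciprocal_rank(sample_ranks))  # 0.6875
print(hits_at_k(sample_ranks, 1))          # 0.5
```

A relative improvement is reported as a percentage of the baseline score, so the same absolute gain looks larger against a weaker baseline.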
Big Data Research. Pub Date: 2023-08-28. DOI: 10.1016/j.bdr.2023.100382
Mintae Kim, Wooju Kim
{"title":"Task-Oriented Collaborative Graph Embedding Using Explicit High-Order Proximity for Recommendation","authors":"Mintae Kim, Wooju Kim","doi":"10.1016/j.bdr.2023.100382","DOIUrl":"https://doi.org/10.1016/j.bdr.2023.100382","url":null,"abstract":"<div><p><span><span>A recommender or recommendation system is a subclass<span> of information filtering systems that seeks to predict the “rating” or “preference” that a user would assign to an item. Although many collaborative filtering (CF) approaches based on neural matrix factorization (NMF) have been successful, significant scope for improvement in recommendation systems exists. The primary challenge in </span></span>recommender systems<span> is to extract high-quality user–item interaction information from sparse data. However, most studies have focused on additional review text or metadata instead of fully using high-order relationships between users and items. In this paper, we propose a novel model, the Cross Neighborhood Attention Network (CNAN), that solves this problem by designing high-order neighborhood selection and neighborhood attention networks to learn user–item interaction efficiently. Our CNAN performs rating prediction using an architecture considering only user–item interaction data. Furthermore, the proposed model uses only user–item interaction (from the user–item ratings matrix) information without additional information such as review text or metadata. We evaluated the effectiveness of the proposed model by performing experiments on five datasets with review text and three datasets with metadata. Consequently, the CNAN model demonstrated a performance improvement of up to 7.59% over the model using review text and up to 1.99% over the model using metadata. Experimental results show that CNAN achieves better recommendation performance through higher-order neighborhood </span></span>information integration with neighborhood selection and attention. 
The results show that our model delivers higher prediction performance via efficient structural improvement without using additional information.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"33 ","pages":"Article 100382"},"PeriodicalIF":3.3,"publicationDate":"2023-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49711464","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Big Data Research. Pub Date: 2023-08-28. DOI: 10.1016/j.bdr.2023.100395
Ling Ding, Peng Du, Haiwei Hou, Jian Zhang, Di Jin, Shifei Ding
{"title":"Botnet DGA Domain Name Classification Using Transformer Network with Hybrid Embedding","authors":"Ling Ding, Peng Du, Haiwei Hou, Jian Zhang, Di Jin, Shifei Ding","doi":"10.1016/j.bdr.2023.100395","DOIUrl":"https://doi.org/10.1016/j.bdr.2023.100395","url":null,"abstract":"<div><p><span>One of the most severe threats to cyber security is the botnet, which typically uses domain names generated by Domain Generation Algorithms (DGAs) to communicate with its Command and Control (C&C) infrastructure. </span>DGA detection<span> and classification play an important role in helping cyber security researchers detect botnet C&C servers. However, many existing DGA detection models rely on a single-scale word embedding<span> method, and very few models are specifically designed to extract more effective features for DGA detection from multi-scale word embeddings. To address these problems, we first propose a hybrid word embedding method, which combines character-level embedding and bigram-level embedding to make full use of the information in domain names; we then design a deep neural network with the hybrid embedding method to distinguish DGA domains from known legitimate domains. Finally, we evaluate our hybrid embedding method and the proposed model on the ONIST dataset and compare our methods with several state-of-the-art DGA classification methods.</span></span></p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"33 ","pages":"Article 100395"},"PeriodicalIF":3.3,"publicationDate":"2023-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49711678","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
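The abstract above combines character-level and bigram-level views of a domain name. The following is a minimal Python sketch of the tokenization step only, assuming overlapping bigrams and a vocabulary built on the fly. The abstract does not say how the two embeddings are fused, so the helper names and the encoding scheme here are hypothetical.

```python
from typing import Dict, List


def char_tokens(domain: str) -> List[str]:
    """Split a domain name into single characters."""
    return list(domain)


def bigram_tokens(domain: str) -> List[str]:
    """Split a domain name into overlapping two-character windows."""
    return [domain[i:i + 2] for i in range(len(domain) - 1)]


def encode(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map tokens to integer ids, adding unseen tokens to `vocab` (id 0 is left free for padding)."""
    return [vocab.setdefault(tok, len(vocab) + 1) for tok in tokens]


# Illustrative domain name; each id sequence would feed its own embedding table.
char_vocab: Dict[str, int] = {}
bigram_vocab: Dict[str, int] = {}
domain = "abc.com"
char_ids = encode(char_tokens(domain), char_vocab)
bigram_ids = encode(bigram_tokens(domain), bigram_vocab)
print(char_ids)    # [1, 2, 3, 4, 3, 5, 6]
print(bigram_ids)  # [1, 2, 3, 4, 5, 6]
```

Keeping separate vocabularies for the two scales means a character and a bigram never share an id, which is one straightforward way to feed two embedding layers.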
Big Data Research. Pub Date: 2023-08-28. DOI: 10.1016/j.bdr.2023.100396
Callum Roberts, Adrian Gepp, James Todd
{"title":"A Big Data Framework to Address Building Sum Insured Misestimation","authors":"Callum Roberts, Adrian Gepp, James Todd","doi":"10.1016/j.bdr.2023.100396","DOIUrl":"https://doi.org/10.1016/j.bdr.2023.100396","url":null,"abstract":"<div><p><span>In the insurance industry, the accumulation of complex problems and volume of data creates a large scope for actuaries to apply big data techniques to investigate and provide unique solutions for millions of policyholders. With much of the actuarial focus on traditional problems like price optimisation or improving claims management, there is an opportunity to tackle other known product inefficiencies with a data-driven approach. The purpose of this paper is to build a framework that exploits </span>big data technologies<span> to measure and explain Australian policyholder Sum Insured Misestimation (SIM). Big data clustering and dimension reduction techniques are leveraged to measure SIM for a national home insurance portfolio. We then design predictive and prescriptive models to explore the relationship between socioeconomic and demographic factors with SIM. Real-world results from a national home insurance portfolio provide actionable business insight on SIM and facilitate solutions for stakeholders, being government and insurers.</span></p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"33 ","pages":"Article 100396"},"PeriodicalIF":3.3,"publicationDate":"2023-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49733789","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Big Data Research. Pub Date: 2023-08-28. DOI: 10.1016/j.bdr.2023.100398
Amr M. Abdeltif, Khalid M. Hosny, Mohamed M. Darwish, Ahmad Salah, Kenli Li
{"title":"Parallel Framework for Memory-Efficient Computation of Image Descriptors for Megapixel Images","authors":"Amr M. Abdeltif, Khalid M. Hosny, Mohamed M. Darwish, Ahmad Salah, Kenli Li","doi":"10.1016/j.bdr.2023.100398","DOIUrl":"https://doi.org/10.1016/j.bdr.2023.100398","url":null,"abstract":"<div><p><span>Image moments are image descriptors widely utilized in several image processing, pattern recognition, computer vision, and multimedia security applications. In the era of big data, the computation of image moments yields a huge memory demand, especially for large moment order and/or high-resolution images (i.e., megapixel images). The state-of-the-art moment computation methods successfully accelerate the image moment computation for digital images of a resolution smaller than 1K × 1K pixels. For digital images of higher resolutions, image moment computation is problematic. Researchers utilized GPU-based </span>parallel processing<span> to overcome this problem. In practice, the parallel computation of image moments using GPUs encounters the non-extended memory problem, which is the main challenge. This paper proposes a recurrence-based method for computing the Polar Complex Exponent Transform (PCET) moments of fractional orders. The proposed method utilizes the symmetry of the image kernel to reduce kernel computation. In the proposed method, once a kernel value is computed in one quadrant, the other three corresponding values in the remaining three quadrants can be trivially computed. Moreover, the proposed method utilizes recurrence equations to compute kernels. Thus, the memory otherwise required to store pre-computed kernel values is saved. Finally, we implemented the proposed method on the GPU parallel architecture. The proposed method overcomes the memory limit because it avoids storing the kernel. 
The experiments show that the proposed parallel-friendly and memory-efficient method is superior to the state-of-the-art moment computation methods in memory consumption and runtimes. The proposed method computes the PCET moment of order 50 for an image of size 2K × 2K pixels in 3.5 seconds, while the state-of-the-art method of comparison needs 7.0 seconds to process the same image; the memory requirements for the proposed method and the method of comparison were 67.0 MB and 3.4 GB, respectively. The method of comparison could not compute the image moment for any image with a resolution higher than 2K × 2K pixels. In contrast, the proposed method managed to compute image moments for images of up to 16K × 16K pixels.</span></p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"33 ","pages":"Article 100398"},"PeriodicalIF":3.3,"publicationDate":"2023-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49711262","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Big Data Research. Pub Date: 2023-08-22. DOI: 10.1016/j.bdr.2023.100407
Felipe Tomazelli Lima, Vinicius M.A. Souza
{"title":"A Large Comparison of Normalization Methods on Time Series","authors":"Felipe Tomazelli Lima, Vinicius M.A. Souza","doi":"10.1016/j.bdr.2023.100407","DOIUrl":"10.1016/j.bdr.2023.100407","url":null,"abstract":"<div><p>Normalization is a mandatory preprocessing step<span><span><span> in time series problems to guarantee similarity comparisons invariant to unexpected distortions in amplitude and offset. Such distortions are usual for most time series data<span>. A typical example is gait recognition by motion collected on subjects with varying body height and width. To rescale the data to the same range of values, the vast majority of researchers consider z-normalization as the default method for any domain application, data, or task. This choice is made without a search process, such as the one used to set the parameters of an algorithm, and without any experimental evidence in the literature considering a variety of scenarios to support this decision. To address this gap, we evaluate the impact of different normalization methods on time series data. Our analysis is based on an extensive experimental comparison on classification problems involving 10 normalization methods, 3 state-of-the-art classifiers, and 38 benchmark datasets. We consider the </span></span>classification task<span> due to the simplicity of the experimental settings and well-defined metrics. However, our findings can be extrapolated to other time series mining tasks, such as forecasting or clustering. Based on our results, we suggest evaluating the maximum absolute scale as an alternative to z-normalization. Besides being time efficient, this alternative shows promising results for similarity-based methods using Euclidean distance. 
For </span></span>deep learning, mean normalization could be considered.</span></p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"34 ","pages":"Article 100407"},"PeriodicalIF":3.3,"publicationDate":"2023-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43624406","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
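The abstract above names three normalization methods: z-normalization, maximum absolute scaling, and mean normalization. The following is a minimal Python sketch of their common textbook definitions, using only the standard library. The exact formulas evaluated in the paper are not given in the abstract, so these definitions (in particular dividing by the range in mean normalization and using the population standard deviation) are assumptions.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def z_normalize(series: Sequence[float]) -> List[float]:
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = fmean(series), pstdev(series)
    if sigma == 0:  # constant series: no amplitude to rescale
        return [0.0] * len(series)
    return [(x - mu) / sigma for x in series]


def max_abs_scale(series: Sequence[float]) -> List[float]:
    """Divide by the largest absolute value, mapping the series into [-1, 1]."""
    peak = max(abs(x) for x in series)
    if peak == 0:
        return [0.0] * len(series)
    return [x / peak for x in series]


def mean_normalize(series: Sequence[float]) -> List[float]:
    """Subtract the mean and divide by the range (max - min)."""
    mu, span = fmean(series), max(series) - min(series)
    if span == 0:
        return [0.0] * len(series)
    return [(x - mu) / span for x in series]


print(max_abs_scale([2.0, -4.0, 1.0]))   # [0.5, -1.0, 0.25]
print(mean_normalize([1.0, 2.0, 3.0]))   # [-0.5, 0.0, 0.5]
```

Maximum absolute scaling needs a single pass to find the peak and no mean or variance, which is consistent with the abstract's remark that it is time efficient.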
Big Data Research. Pub Date: 2023-08-01. DOI: 10.1016/j.bdr.2023.100398
Amr M. Abdeltif, K. Hosny, M. M. Darwish, Ahmad Salah, KenLi Li
{"title":"Parallel Framework for Memory-Efficient Computation of Image Descriptors for Megapixel Images","authors":"Amr M. Abdeltif, K. Hosny, M. M. Darwish, Ahmad Salah, KenLi Li","doi":"10.1016/j.bdr.2023.100398","DOIUrl":"https://doi.org/10.1016/j.bdr.2023.100398","url":null,"abstract":"","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"33 1","pages":"100398"},"PeriodicalIF":3.3,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"54134995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Big Data Research. Pub Date: 2023-05-28. DOI: 10.1016/j.bdr.2023.100379
Liqing Qiu, Jingcheng Zhou, Caixia Jing, Yuying Liu
{"title":"Heterogeneous Graph Convolutional Network Based on Correlation Matrix","authors":"Liqing Qiu, Jingcheng Zhou, Caixia Jing, Yuying Liu","doi":"10.1016/j.bdr.2023.100379","DOIUrl":"https://doi.org/10.1016/j.bdr.2023.100379","url":null,"abstract":"<div><p><span>Heterogeneous graph embedding maps a high-dimensional graph that has different sorts of nodes and edges to a low-dimensional space, making it perform well in downstream tasks. The existing models mainly use two approaches to explore and embed heterogeneous graph information. One is to use meta-paths to mine heterogeneous information; the other is to use special modules designed by researchers to explore heterogeneous information. These models show excellent performance in heterogeneous graph embedding tasks. However, none of the models considers using the number of meta-path instances between nodes to improve the performance of heterogeneous graph embedding. The paper proposes a </span><em><strong>H</strong>eterogeneous <strong>G</strong>raph <strong>C</strong>onvolutional <strong>N</strong>etwork based on <strong>C</strong>orrelation <strong>M</strong>atrix</em><span> (CMHGCN) to make full use of the number of meta-path instances between nodes to discover interactive information between nodes in heterogeneous graphs. CMHGCN contains two core components: the node-level correlation component and the semantic-level correlation component. The node-level correlation component is able to use the number of meta-path instances between nodes to calculate the correlation between nodes guided by different meta-paths. The semantic-level correlation component can reasonably integrate such information from different meta-paths. 
On heterogeneous graphs with a large number of meta-path instances, CMHGCN outperforms baselines in node classification and clustering, according to experiments carried out on three benchmark heterogeneous datasets.</span></p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"32 ","pages":"Article 100379"},"PeriodicalIF":3.3,"publicationDate":"2023-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49713936","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
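The abstract above relies on the number of meta-path instances between nodes. For a symmetric two-hop meta-path such as author-paper-author, that count can be read off the product of the incidence matrix with its transpose. The following is a minimal Python sketch of the counting step only, with an illustrative author-by-paper matrix. How CMHGCN converts these counts into its correlation matrix is not described in the abstract, so nothing here should be read as the paper's method.

```python
from typing import List


def metapath_instance_counts(incidence: List[List[int]]) -> List[List[int]]:
    """Return C = M * M^T, where C[i][j] counts the two-hop paths i -> x -> j.

    `incidence[i][x]` is 1 when source node i is linked to intermediate node x.
    """
    n = len(incidence)
    return [
        [sum(a * b for a, b in zip(incidence[i], incidence[j])) for j in range(n)]
        for i in range(n)
    ]


# Illustrative data: 3 authors (rows) and 3 papers (columns).
author_paper = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 1],
]
counts = metapath_instance_counts(author_paper)
print(counts[0][1])  # 1: authors 0 and 1 share one paper
print(counts[0][2])  # 2: authors 0 and 2 share two papers
```

The diagonal entry `counts[i][i]` is simply the number of papers written by author i, so a normalized similarity would typically divide the off-diagonal counts by some function of the diagonal.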
Big Data Research. Pub Date: 2023-05-28. DOI: 10.1016/j.bdr.2023.100380
Jinghui Peng, Xinyu Hu, Wenbo Huang, Jian Yang
{"title":"What Is a Multi-Modal Knowledge Graph: A Survey","authors":"Jinghui Peng, Xinyu Hu, Wenbo Huang, Jian Yang","doi":"10.1016/j.bdr.2023.100380","DOIUrl":"https://doi.org/10.1016/j.bdr.2023.100380","url":null,"abstract":"<div><p>With the explosive growth of multi-modal information on the Internet, the multi-modal knowledge graph (MMKG) has become an important research topic in knowledge graphs to meet the needs of data management and application. Most research on MMKG has taken image-text data as the research object and used the multi-modal deep learning approach to process multi-modal data. By comparison, there is no uniform statement of the structure of the MMKG. This paper focuses on MMKG, introduces the related theories of multi-modal knowledge, and analyzes several common ideas about its construction. The survey also explains the structural evolution, proposes mirror node alignment to represent cross-modal knowledge for MMKG, lists the difficulties of some tasks, and ultimately gives a sample MMKG for the news scene.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"32 ","pages":"Article 100380"},"PeriodicalIF":3.3,"publicationDate":"2023-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49713867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Big Data Research. Pub Date: 2023-05-28. DOI: 10.1016/j.bdr.2023.100384
Junru Wang, Shixin Zhang, Anbang Dai
{"title":"Spatio-Temporal Characteristics of Influenza Burden and Its Influence Factors in Japan in the Past Three Decades: An Influenza Disease Burden Data-Based Modeling Study","authors":"Junru Wang, Shixin Zhang, Anbang Dai","doi":"10.1016/j.bdr.2023.100384","DOIUrl":"https://doi.org/10.1016/j.bdr.2023.100384","url":null,"abstract":"<div><p><strong>Introduction:</strong> Influenza still poses a great threat to humans. Knowledge of the systematic disease burden of influenza in Japan is limited. This study aimed to investigate the spatio-temporal characteristics of the influenza burden and its influencing factors over the past three decades.</p><p><strong>Methods:</strong> Data on annual deaths, years lived with disability (YLDs), years of life lost (YLLs) and disability-adjusted life years (DALYs) for influenza from 1990 to 2019 in Japan were available from the Global Health Data Exchange (GHDx), and data on annual social household factors were available from e-Stat in Japan. A joinpoint regression model was used to assess the trends of influenza from 1990 to 2019, a discrete Poisson model to analyze the spatial and temporal clustering of influenza, and a generalized linear model to assess the association of influenza deaths and DALYs with social household factors.</p><p><strong>Results:</strong> From 1990 to 2019, the mortality rate increased from 9.95 per 100000 to 19.49 per 100000 in Japan, with an AAPC of 2.2% (95% CI: 1.5, 3.0, P<0.05). The DALYs rate increased from 153.86 per 100000 to 209.22 per 100000, with an AAPC of 1.0% (95% CI: 0.1, 1.9, P<0.05). The mortality rate ranged from 1.98 per 100000 (Chiba) to 16.9 per 100000 (Kochi) in 1990, and from 5.10 per 100000 (Chiba) to 35.74 per 100000 (Akita) in 2019. The population aged 60+ had the highest mortality rates, from 53.79 per 100000 in 1990 to 55.74 per 100000 in 2019 (AAPC: 0.0%, 95% CI: -0.5, 0.6, P=0.944), and the highest DALYs rates, from 713.43 per 100000 to 565.22 per 100000 (AAPC: -0.9%, 95% CI: -1.5, -0.3, P<0.05). 
YLLs and DALYs rates among the population aged 1-4 were also high from 1990 to 2019, ranked after that among populations aged 60+. The mortality rate had two stages of spatio-temporal aggregation across Japan, northern Japan with the period of 2005-2019 (RR = 1.36, P < 0.001) and southern Japan with the same period in the northern area (RR = 1.36, P < 0.001). The generalized linear model (GLM) indicated that year was positively correlated with the mortality rate of influenza (<em>β</em> = 0.18, p<0.01); while the ratio of households ordered via the internet and population were negatively correlated with the mortality rate of influenza (<em>β</em> = -4.41, p<0.05 and <em>β</em> =-0.17, p<0.01, respectively).</p><p><strong>Conclusions:</strong><span> The disease burden of influenza in Japan increased in the past three decades, especially among the population aged 60+ years, followed by the population aged 1-4 years. It had two stages of spatio-temporal aggregation across Japan. Lifestyle of households ordered via the internet contributed to the low mortality rate of influenza.</span></p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"32 ","pages":"Article 100384"},"PeriodicalIF":3.3,"publicationDate":"2023-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49714138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}