{"title":"Fusion learning of preference and bias from ratings and reviews for item recommendation","authors":"Junrui Liu , Tong Li , Zhen Yang , Di Wu , Huan Liu","doi":"10.1016/j.datak.2024.102283","DOIUrl":"10.1016/j.datak.2024.102283","url":null,"abstract":"<div><p>Recommendation methods improve rating prediction performance by learning the selection bias phenomenon, in which users tend to rate items they like. These methods model selection bias by calculating the propensities of ratings, but inaccurate propensities could introduce more noise, fail to model selection bias, and reduce prediction performance. We argue that learning interaction features can effectively model selection bias and improve model performance, as interaction features explain the reason for the trend. Reviews can be used to model interaction features because they have a strong intrinsic correlation with user interests and item interactions. In this study, we propose a preference- and bias-oriented fusion learning model (PBFL) that models the interaction features based on reviews and user preferences to make rating predictions. Our proposal both embeds traditional user preferences in reviews, interactions, and ratings and considers word distribution bias and review quoting to model interaction features. Six real-world datasets are used to demonstrate effectiveness and performance. 
PBFL achieves an average improvement of 4.46% in root-mean-square error (RMSE) and 3.86% in mean absolute error (MAE) over the best baseline.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"150 ","pages":"Article 102283"},"PeriodicalIF":2.5,"publicationDate":"2024-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139677949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multivariate hierarchical DBSCAN model for enhanced maritime data analytics","authors":"Nitin Newaliya, Yudhvir Singh","doi":"10.1016/j.datak.2024.102282","DOIUrl":"10.1016/j.datak.2024.102282","url":null,"abstract":"<div><p>Clustering is an important data analytics technique and has numerous use cases. It leads to the determination of insights and knowledge which would not be readily discernible on routine examination of the data. Enhancement of clustering techniques is an active field of research, with various optimisation models being proposed. Such enhancements are also undertaken to address particular issues being faced in specific applications. This paper looks at a particular use case in the maritime domain and how an enhancement of Density-Based Spatial Clustering of Applications with Noise (DBSCAN) clustering results in the apt use of data analytics to solve a real-life issue. Passage of vessels over water is one of the significant utilisations of maritime regions. Trajectory analysis of these vessels helps provide valuable information, thus, maritime movement data and the knowledge extracted from manipulation of this data play an essential role in various applications, viz., assessing traffic densities, identifying traffic routes, reducing collision risks, etc. Optimised trajectory information would help enable safe and energy-efficient green operations at sea and assist autonomous operations of maritime systems and vehicles. Many studies focus on determining trajectory densities but miss out on individual trajectory granularities. Determining trajectories by using unique identities of the vessels may also lead to errors. Using an unsupervised DBSCAN method of identifying trajectories could help overcome these limitations. Further, to enhance outcomes and insights, the inclusion of temporal information along with additional parameters of Automatic Identification System (AIS) data in DBSCAN is proposed. 
Towards this, a new design and implementation for data analytics called the Multivariate Hierarchical DBSCAN method for better clustering of Maritime movement data, such as AIS, has been developed, which helps determine granular information and individual trajectories in an unsupervised manner. It is seen from the evaluation metrics that the performance of this method is better than other data clustering techniques.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"150 ","pages":"Article 102282"},"PeriodicalIF":2.5,"publicationDate":"2024-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139667962","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AI system architecture design methodology based on IMO (Input-AI Model-Output) structure for successful AI adoption in organizations","authors":"Seungkyu Park , Joong yoon Lee , Jooyeoun Lee","doi":"10.1016/j.datak.2023.102264","DOIUrl":"10.1016/j.datak.2023.102264","url":null,"abstract":"<div><p>With the advancement of AI technology, the successful AI adoption in organizations has become a top priority in modern society. However, many organizations still struggle to articulate the necessary AI, and AI experts have difficulties understanding the problems faced by these organizations. This knowledge gap makes it difficult for organizations to identify the technical requirements, such as necessary data and algorithms, for adopting AI. To overcome this problem, we propose a new AI system architecture design methodology based on the IMO (Input-AI Model-Output) structure. The IMO structure enables effective identification of the technical requirements necessary to develop real AI models. While previous research has identified the importance and challenges of technical requirements, such as data and AI algorithms, for AI adoption, there has been little research on methodology to concretize them. Our methodology is composed of three stages: problem definition, system AI solution, and AI technical solution to design the AI technology and requirements that organizations need at a system level. 
The effectiveness of our methodology is demonstrated through a case study, a logical comparative analysis with other studies, and expert reviews, which show that it can support successful AI adoption in organizations.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"150 ","pages":"Article 102264"},"PeriodicalIF":2.5,"publicationDate":"2024-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0169023X23001246/pdfft?md5=e0d3a91ff85a9662d7d0a2bed8c5acfd&pid=1-s2.0-S0169023X23001246-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139588883","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A new sentence embedding framework for the education and professional training domain with application to hierarchical multi-label text classification","authors":"Guillaume Lefebvre , Haytham Elghazel , Theodore Guillet , Alexandre Aussem , Matthieu Sonnati","doi":"10.1016/j.datak.2024.102281","DOIUrl":"10.1016/j.datak.2024.102281","url":null,"abstract":"<div><p><span>In recent years, Natural Language Processing<span> (NLP) has made significant advances through advanced general language embeddings, allowing breakthroughs in NLP tasks such as semantic similarity and text classification<span>. However, complexity increases with hierarchical multi-label classification (HMC), where a single entity can belong to several hierarchically organized classes. In such complex situations, applied on specific-domain texts, such as the Education and professional training domain, general language embedding models often inadequately represent the unique terminologies and contextual nuances of a specialized domain. To tackle this problem, we present HMCCCProbT, a novel hierarchical multi-label text classification approach. This innovative framework chains multiple classifiers<span>, where each individual classifier is built using a novel sentence-embedding method BERTEPro based on existing Transformer models, whose pre-training has been extended on education and professional training texts, before being fine-tuned on several NLP tasks. Each individual classifier is responsible for the predictions of a given hierarchical level and propagates local probability predictions augmented with the input feature vectors to the classifier in charge of the subsequent level. HMCCCProbT tackles issues of model scalability and semantic interpretation, offering a powerful solution to the challenges of domain-specific hierarchical multi-label classification. 
Experiments over three domain-specific textual HMC datasets indicate the effectiveness of </span></span></span></span><span>HMCCCProbT</span><span> to compare favorably to state-of-the-art HMC algorithms<span> in terms of classification accuracy and also the ability of </span></span><span>BERTEPro</span> to obtain better probability predictions, well suited to <span>HMCCCProbT</span><span>, than three other vector representation techniques.</span></p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"150 ","pages":"Article 102281"},"PeriodicalIF":2.5,"publicationDate":"2024-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139500576","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Issues in inter-organizational data sharing: Findings from practice and research challenges","authors":"Ilka Jussen , Frederik Möller , Julia Schweihoff , Anna Gieß , Giulia Giussani , Boris Otto","doi":"10.1016/j.datak.2024.102280","DOIUrl":"10.1016/j.datak.2024.102280","url":null,"abstract":"<div><p>Sharing data is highly potent in assisting companies in internal optimization and designing new products and services. While the benefits seem obvious, sharing data is accompanied by a spectrum of concerns ranging from fears of sharing something of value, unawareness of what will happen to the data, or simply a lack of understanding of the short- and mid-term benefits. The article analyzes data sharing in inter-organizational relationships by examining 13 cases in a qualitative interview study and through public data analysis. Given the importance of inter-organizational data sharing as indicated by large research initiatives such as Gaia-X and Catena-X, we explore issues arising in this process and formulate research challenges. We use the theoretical lens of Actor-Network Theory to analyze our data and entangle its constructs with concepts in data sharing.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"150 ","pages":"Article 102280"},"PeriodicalIF":2.5,"publicationDate":"2024-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0169023X24000041/pdfft?md5=8cca34784bb0ed03de222b7dc6fbfc47&pid=1-s2.0-S0169023X24000041-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139412627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A deep learning model for predicting the number of stores and average sales in commercial district","authors":"Suan Lee , Sangkeun Ko , Arousha Haghighian Roudsari , Wookey Lee","doi":"10.1016/j.datak.2024.102277","DOIUrl":"10.1016/j.datak.2024.102277","url":null,"abstract":"<div><p>This paper presents a plan for preparing for changes in the business environment by analyzing and predicting business district data in Seoul. The COVID-19 pandemic and economic crisis caused by inflation have led to an increase in store closures and a decrease in sales, which has had a significant impact on commercial districts. The number of stores and sales are critical factors that directly affect the business environment and can help prepare for changes. This study conducted correlation analysis to extract factors related to the commercial district’s environment in Seoul and estimated the number of stores and sales based on these factors. Using the Kendall tau correlation coefficient, the study found that existing population and working population were the most influential factors. Linear regression, tensor decomposition, Factorization Machine, and deep neural network models were used to estimate the number of stores and sales, with the deep neural network model showing the best performance in RMSE and evaluation indicators. This study also predicted the number of stores and sales of the service industry in a specific area using the population prediction results of the neural prophet model. 
The study’s findings can help identify commercial district information and predict the number of stores and sales based on location, industry, and influencing factors, contributing to the revitalization of commercial districts.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"150 ","pages":"Article 102277"},"PeriodicalIF":2.5,"publicationDate":"2024-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0169023X24000016/pdfft?md5=399d90f81e8f5fbe38aeaa5e86a26560&pid=1-s2.0-S0169023X24000016-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139095414","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A transformer-based neural network framework for full names prediction with abbreviations and contexts","authors":"Ziming Ye , Shuangyin Li","doi":"10.1016/j.datak.2023.102275","DOIUrl":"10.1016/j.datak.2023.102275","url":null,"abstract":"<div><p>With the rapid spread of information, abbreviations are used more and more commonly because they are convenient. However, the duplication of abbreviations can lead to confusion in many cases, such as information management and information retrieval. The resultant confusion annoys users. Thus, inferring a full name from an abbreviation has practical and significant advantages. The bulk of studies in the literature mainly inferred full names based on rule-based methods, statistical models, the similarity of representation, etc. However, these methods are unable to use various grained contexts properly. In this paper, we propose a flexible framework of Multi-attention mask Abbreviation Context and Full name language model<span>, named MACF, to address the problem. With the abbreviation and contexts as the inputs, the MACF can automatically predict a full name by generation, where the contexts can be variously grained. That is, different grained contexts ranging from coarse to fine can be selected to perform such complicated tasks in which contexts include paragraphs, several sentences, or even just a few keywords. A novel multi-attention mask mechanism is also proposed, which allows the model to learn the relationships among abbreviations, contexts, and full names, a process that makes the most of various grained contexts. Three corpora of different languages and fields were analyzed and measured with seven metrics in various aspects to evaluate the proposed framework. According to the experimental results, the MACF yielded more significant and consistent outputs than other baseline methods. 
Moreover, we discuss the significance and findings, and present case studies to show the performance in real applications.</span></p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"150 ","pages":"Article 102275"},"PeriodicalIF":2.5,"publicationDate":"2023-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139069387","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A bitwise approach on influence overload problem","authors":"Charles Cheolgi Lee , Jafar Afshar , Arousha Haghighian Roudsari , Woong-Kee Loh , Wookey Lee","doi":"10.1016/j.datak.2023.102276","DOIUrl":"10.1016/j.datak.2023.102276","url":null,"abstract":"<div><p><span>Increasingly developing online social networks has enabled users to send or receive information very fast. However, due to the availability of an excessive amount of data in today’s society, managing the information has become very cumbersome, which may lead to the problem of information overload. This highly eminent problem, where the existence of too much relevant information available becomes a hindrance rather than a help, may cause losses, delays, and hardships in making decisions. Thus, in this paper, by defining information overload from a different aspect, we aim to maximize the information propagation while minimizing the information overload (duplication). To do so, we theoretically present the lower and upper bounds for the information overload using a bitwise-based approach as the leverage to mitigate the computation complexities and obtain an approximation ratio of </span><span><math><mrow><mn>1</mn><mo>−</mo><mfrac><mrow><mn>1</mn></mrow><mrow><mi>e</mi></mrow></mfrac></mrow></math></span>. We propose two main algorithms, B-square and C-square, and compare them with the existing algorithms. 
Experiments on two types of datasets, synthetic and real-world networks, verify the effectiveness and efficiency of the proposed approach in addressing the problem.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"150 ","pages":"Article 102276"},"PeriodicalIF":2.5,"publicationDate":"2023-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139069125","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mining Keys for Graphs","authors":"Morteza Alipourlangouri, Fei Chiang","doi":"10.1016/j.datak.2023.102274","DOIUrl":"10.1016/j.datak.2023.102274","url":null,"abstract":"<div><p><span>Keys for graphs are a class of data quality rules that use topological and value constraints to uniquely identify entities in a data graph. They have been studied to support object identification, knowledge fusion, data deduplication, and social network reconciliation. Manual specification and discovery of graph keys is tedious and infeasible over large-scale graphs. To make </span><span><math><mi>GKeys</mi></math></span> useful in practice, we study the <span><math><mi>GKey</mi></math></span> discovery problem, and present <span><math><mi>GKMiner</mi></math></span>, an algorithm that mines keys over graphs. Our algorithm discovers keys in a graph via frequent subgraph expansion, and notably, identifies <em>recursive</em> keys, i.e., where the unique identification of an entity type is dependent upon the identification of another entity type. We introduce the key properties, <em>minimality</em> and <em>support</em>, which effectively help to reduce the space of candidate keys. <span><math><mi>GKMiner</mi></math></span><span> uses a set of auxiliary structures to summarize an input graph, and to identify likely candidate keys for greater pruning efficiency and evaluation of the search space. 
Our evaluation shows that identifying and using recursive keys in entity linking leads to improved accuracy over keys found using existing graph key mining techniques.</span></p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"150 ","pages":"Article 102274"},"PeriodicalIF":2.5,"publicationDate":"2023-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139055186","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}