{"title":"Explaining cube measures through Intentional Analytics","authors":"Matteo Francia , Stefano Rizzi , Patrick Marcel","doi":"10.1016/j.is.2023.102338","DOIUrl":"10.1016/j.is.2023.102338","url":null,"abstract":"<div><p>The Intentional Analytics Model (IAM) has been devised to couple OLAP and analytics by (i) letting users express their analysis intentions on multidimensional data cubes and (ii) returning enhanced cubes, i.e., multidimensional data annotated with knowledge insights in the form of models (e.g., correlations). Five intention operators were proposed to this end; of these, <span>describe</span> and <span>assess</span> have been investigated in previous papers. In this work we enrich the IAM picture by focusing on the <span>explain</span> operator, whose goal is to provide an answer to the user asking “why does measure <span><math><mi>m</mi></math></span> show these values?”; specifically, we consider models that explain <span><math><mi>m</mi></math></span> in terms of one or more other measures. We propose a syntax for the operator and discuss how enhanced cubes are built by (i) finding the relationship between <span><math><mi>m</mi></math></span> and the other cube measures via regression analysis and cross-correlation, and (ii) highlighting the most interesting one. Finally, we test the operator implementation in terms of efficiency and effectiveness.</p></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"121 ","pages":"Article 102338"},"PeriodicalIF":3.7,"publicationDate":"2023-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0306437923001746/pdfft?md5=23f8fab78fdd903fb8bd9c0b6f06f739&pid=1-s2.0-S0306437923001746-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138742073","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LSPC: Exploring contrastive clustering based on local semantic information and prototype","authors":"Jun-Fen Chen, Lang Sun, Bo-Jun Xie","doi":"10.1016/j.is.2023.102336","DOIUrl":"10.1016/j.is.2023.102336","url":null,"abstract":"<div><p>In recent years, several prominent contrastive learning<span><span> algorithms, a kind of self-supervised learning method, have been extensively studied; they can efficiently extract useful feature representations from input images by means of data augmentation techniques. How to further partition these representations into meaningful clusters is the issue that deep clustering addresses. In this work, a deep </span>clustering algorithm based on local semantic information and prototypes, referred to as LSPC, is proposed that aims at learning a group of representative prototypes. Rather than learning the distinguishing characteristics between different images, more attention is given to the essential characteristics of images that may belong to the same potential category. In the training framework, contrastive learning is skillfully combined with the k-means clustering algorithm, and predictions are transformed into soft assignments for end-to-end training. To enable the model to accurately capture the semantic information between images, we mine similar samples of the training samples in the embedding space as local semantic information, effectively increasing the similarity between samples belonging to the same cluster. Experimental results show that our algorithm achieves state-of-the-art performance on several commonly used public datasets, and additional experiments prove that this superior clustering performance also extends to large datasets such as ImageNet.</span></p></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"121 ","pages":"Article 102336"},"PeriodicalIF":3.7,"publicationDate":"2023-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138628611","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Heterogeneous graph neural networks for fraud detection and explanation in supply chain finance","authors":"Bin Wu , Kuo-Ming Chao , Yinsheng Li","doi":"10.1016/j.is.2023.102335","DOIUrl":"https://doi.org/10.1016/j.is.2023.102335","url":null,"abstract":"<div><p>It is a critical mission for financial service providers to discover fraudulent borrowers in a supply chain. The borrowers’ transactions in an ongoing business are inspected to support the providers’ decision on whether to lend the money. Given the multiple participants in a supply chain business, borrowers may use sophisticated tricks to cheat, making fraud detection challenging. In this work, we propose a multitask learning<span> framework, MultiFraud, for complex fraud detection with reasonable explanations. Heterogeneous, multi-view information around the entities is leveraged in a detection framework based on heterogeneous graph neural networks. MultiFraud enables multiple domains to share embeddings and enhances modeling capabilities for fraud detection. The developed explainer provides comprehensive explanations across multiple graphs. Experimental results on five datasets demonstrate the framework’s effectiveness in fraud detection and explanation across domains.</span></p></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"121 ","pages":"Article 102335"},"PeriodicalIF":3.7,"publicationDate":"2023-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138577761","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Attention-based multi attribute matrix factorization for enhanced recommendation performance","authors":"Dongsoo Jang , Qinglong Li , Chaeyoung Lee , Jaekyeong Kim","doi":"10.1016/j.is.2023.102334","DOIUrl":"10.1016/j.is.2023.102334","url":null,"abstract":"<div><p><span>In E-commerce platforms, auxiliary information containing several attributes (e.g., price, quality, and brand) can improve recommendation performance. However, previous studies either used a simple combined embedding approach that did not consider the importance of each attribute embedded in the auxiliary information or used only some attributes of the auxiliary information. Yet user purchasing behavior can vary significantly depending on these attributes. Thus, we propose multi attribute-based matrix factorization (MAMF), which considers the importance of each attribute embedded in various auxiliary information. MAMF obtains more representative and specific attention features of the user and item using a self-attention mechanism. By acquiring attentive representations, MAMF precisely learns high-level interactions between users and items. To evaluate the performance of the proposed MAMF, we conducted extensive experiments using three real-world datasets from amazon.com. The experimental results show that MAMF exhibits excellent recommendation performance compared with various </span>baseline models.</p></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"121 ","pages":"Article 102334"},"PeriodicalIF":3.7,"publicationDate":"2023-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138572711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An efficient visual exploration approach of geospatial vector big data on the web map","authors":"Zebang Liu , Luo Chen , Mengyu Ma , Anran Yang , Zhinong Zhong , Ning Jing","doi":"10.1016/j.is.2023.102333","DOIUrl":"10.1016/j.is.2023.102333","url":null,"abstract":"<div><p><span>The visual exploration of geospatial vector data has become an increasingly important part of the management and analysis of geospatial vector big data (GVBD). With the rapid growth of data scale, it is difficult to realize efficient visual exploration of GVBD with current visualization technologies, even when parallel distributed computing<span> technology is adopted. To fill this gap, this paper proposes a visual exploration approach for GVBD on the web map. In this approach, we propose a display-driven computing model and combine it with the traditional data-driven computing method to design an adaptive real-time visualization algorithm. At the same time, we design a pixel-quad-R tree spatial index structure. Finally, we realize multilevel real-time interactive visual exploration of GVBD on a single machine by constructing the index offline to support online computation for visualization, and all visualization results can be computed in real time without external cache occupation. The experimental results show that the approach outperforms current mainstream </span></span>visualization methods and obtains the visualization results at any zoom level within 0.5 s, so it can be well applied to multilevel real-time interactive visual exploration of billion-scale GVBD.</p></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"121 ","pages":"Article 102333"},"PeriodicalIF":3.7,"publicationDate":"2023-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138567682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Validation set sampling strategies for predictive process monitoring","authors":"Jari Peeperkorn , Seppe vanden Broucke , Jochen De Weerdt","doi":"10.1016/j.is.2023.102330","DOIUrl":"10.1016/j.is.2023.102330","url":null,"abstract":"<div><p>Previous studies investigating the efficacy of long short-term memory (LSTM) recurrent neural networks in predictive process monitoring and their ability to capture the underlying process structure have raised concerns about their limited ability to generalize to unseen behavior. Event logs often fail to capture the full spectrum of behavior permitted by the underlying processes. To overcome these challenges, this study introduces innovative validation set sampling strategies based on control-flow variant-based resampling. These strategies have undergone extensive evaluation to assess their impact on hyperparameter selection and early stopping, resulting in notable enhancements to the generalization capabilities of trained LSTM models. In addition, this study expands the experimental framework to enable accurate interpretation of underlying process models and provide valuable insights. By conducting experiments with event logs representing process models of varying complexities, this research elucidates the effectiveness of the proposed validation strategies. Furthermore, the extended framework facilitates investigations into the influence of event log completeness on the learning quality of predictive process models. The novel validation set sampling strategies proposed in this study facilitate the development of more effective and reliable predictive process models, ultimately bolstering generalization capabilities and improving the understanding of underlying process dynamics.</p></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"121 ","pages":"Article 102330"},"PeriodicalIF":3.7,"publicationDate":"2023-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138566986","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On tuning parameters guiding similarity computations in a data deduplication pipeline for customers records","authors":"Witold Andrzejewski , Bartosz Bębel , Paweł Boiński , Robert Wrembel","doi":"10.1016/j.is.2023.102323","DOIUrl":"10.1016/j.is.2023.102323","url":null,"abstract":"<div><p><span><span>Data stored in information systems are often erroneous. Duplicate data are one of the typical error types. To discover and handle duplicates, so-called deduplication methods are applied; these are complex and time-costly algorithms. In </span>data deduplication<span><span>, pairs of records are compared and their similarities are computed. For a given deduplication problem, the challenging tasks are: (1) deciding which similarity measures are the most adequate for the attributes being compared, (2) defining the importance of the attributes being compared, and (3) defining adequate similarity thresholds between similar and dissimilar pairs of records. In this paper, we summarize the experience gained from a real R&D project run for a large financial institution. In particular, we answer the following three research questions: (1) what are adequate similarity measures for comparing attributes of text data types, (2) what are adequate weights of attributes in the procedure of comparing pairs of records, and (3) what are the similarity thresholds between the classes: duplicates, probable duplicates, and non-duplicates? The answers to these questions are based on an experimental evaluation of 54 similarity measures for text values. The measures were compared on five </span>real data sets with different data characteristics. The similarity measures were assessed based on: (1) the similarity values they produced for the values being compared and (2) their execution time. Furthermore, we present our method, based on </span></span>mathematical programming, for computing weights of attributes and similarity thresholds for records being compared. The experimental evaluation of the method and its assessment by experts from the financial institution proved that it is adequate for the deduplication problem at hand. The whole data deduplication pipeline that we have developed has been deployed in the financial institution and runs in their production system, processing batches of over 20 million customer records.</p></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"121 ","pages":"Article 102323"},"PeriodicalIF":3.7,"publicationDate":"2023-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138524619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Foundations and practice of binary process discovery","authors":"Tijs Slaats, S. Debois, Christoffer Olling Back, Axel Kjeld Fjelrad Christfort","doi":"10.1016/j.is.2023.102339","DOIUrl":"https://doi.org/10.1016/j.is.2023.102339","url":null,"abstract":"","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"188 ","pages":""},"PeriodicalIF":3.7,"publicationDate":"2023-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139026507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Worker similarity-based noise correction for crowdsourcing","authors":"Yufei Hu , Liangxiao Jiang , Wenjun Zhang","doi":"10.1016/j.is.2023.102321","DOIUrl":"https://doi.org/10.1016/j.is.2023.102321","url":null,"abstract":"<div><p>Crowdsourcing offers a cost-effective way to obtain multiple noisy labels for each instance by employing multiple crowd workers. Label integration is then used to infer each instance’s integrated label. Despite the effectiveness of label integration algorithms, a certain degree of noise always remains in the integrated labels. Thus, noise correction algorithms have been proposed to reduce the impact of this noise. However, almost all existing noise correction algorithms focus only on individual workers and ignore the correlations among workers. In this paper, we argue that similar workers have similar annotating skills and tend to be consistent in annotating the same or similar instances. Based on this premise, we propose a novel noise correction algorithm called worker similarity-based noise correction (WSNC). First, WSNC exploits the annotating information of similar workers on similar instances to estimate the quality of each label annotated by each worker on each instance. Then, WSNC re-infers the integrated label of each instance based on the qualities of its multiple noisy labels. Finally, WSNC treats each instance whose re-inferred integrated label differs from its original integrated label as a noise instance and further corrects it. Extensive experiments on a large number of simulated datasets and three real-world crowdsourced datasets verify the effectiveness of WSNC.</p></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"121 ","pages":"Article 102321"},"PeriodicalIF":3.7,"publicationDate":"2023-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138474894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A novel self-supervised graph model based on counterfactual learning for diversified recommendation","authors":"Pu Ji, Minghui Yang, Rui Sun","doi":"10.1016/j.is.2023.102322","DOIUrl":"10.1016/j.is.2023.102322","url":null,"abstract":"<div><p>Consumers’ needs are becoming increasingly diversified, which has driven the emergence of diversified recommendation systems. However, existing diversified recommendation research mostly focuses on objective function construction rather than on the root cause that limits diversity, namely imbalanced data distribution. This study considers how to balance data distribution to improve recommendation diversity. We propose a novel self-supervised graph model based on counterfactual learning (SSG-CL) for diversified recommendation. SSG-CL first distinguishes the dominant and disadvantaged categories for each user based on long-tail theory. It then introduces counterfactual learning to construct an auxiliary view with a relatively balanced distribution between the dominant and disadvantaged categories. Next, we conduct contrastive learning between the user–item interaction graph and the auxiliary view as a self-supervised auxiliary task that aims to improve recommendation diversity. Finally, SSG-CL leverages a multitask training strategy to jointly optimize the main accuracy-oriented recommendation task and the self-supervised auxiliary task. We then conduct experimental studies on real-world datasets, and the results indicate that SSG-CL performs well in terms of both accuracy and diversity.</p></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"121 ","pages":"Article 102322"},"PeriodicalIF":3.7,"publicationDate":"2023-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138524618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}