{"title":"Analyzing Cultural Assimilation through the Lens of Yelp Restaurant Reviews","authors":"Zaiqian Chen, Joonsuk Park","doi":"10.1109/DSAA53316.2021.9564170","DOIUrl":"https://doi.org/10.1109/DSAA53316.2021.9564170","url":null,"abstract":"Given the steady stream of immigrants from around the world, cultural assimilation in North America has long been a topic of interest. However, existing research focuses only on assimilation to North American culture, overlooking mutual influence and making very limited use of data-driven approaches. In this paper, we investigate assimilation among various cultures in North America through the lens of discussions surrounding food. We first present Cross-Cuisine Cross-Region LDA (c3rLDA), a novel probabilistic graphical model to jointly uncover latent topics shared across cuisines, as well as their regional variants for each cuisine. Then, we employ the model on 3.7 million Yelp restaurant reviews to find that cuisines assimilate to one another in varying degrees depending on the cuisines involved, the topic, and the region: A cuisine tends to be more influenced by other cuisines if it is regularly fused with others (e.g. Japanese), for certain topics (e.g. breakfast and dessert), and in specific regions (e.g. stronger Mexican influence in the Southwestern US and French influence in Eastern Canada). Lastly, we demonstrate that the topics generated by our model, on which the qualitative analysis is based, are more coherent than or comparable to those generated by existing neural and non-neural topic models. 
This work represents the first step toward large-scale data-driven analysis of cultural assimilation in North America, which is made possible by the abundant data available in social media.","PeriodicalId":129612,"journal":{"name":"2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127849795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Discovery of Patient Phenotypes through Multi-layer Network Analysis on the Example of Tinnitus","authors":"Clara Puga, Uli Niemann, Vishnu Unnikrishnan, Miro Schleicher, W. Schlee, M. Spiliopoulou","doi":"10.1109/DSAA53316.2021.9564158","DOIUrl":"https://doi.org/10.1109/DSAA53316.2021.9564158","url":null,"abstract":"Electronic health records (EHR) often include multiple perspectives on a patient's current state of well-being (e.g. vital signs and subjective indicators measured by questionnaires). In this study, we use these perspectives to build phenotypes of chronic tinnitus patients and investigate how these phenotypes are associated with response to treatment. To this end, we model patients as nodes in a network, where those perspectives are interpreted as layers of a multi-layer network. To identify phenotypes of patients in the network, we implement a community detection algorithm. Some of these communities can be considered phenotypes if they represent subgroups of patients that are similar according to the investigated perspectives. Furthermore, we analyze the influence of the layers on the final community structure of patients. We then propose a method to add layers given their community structure similarity. Finally, we fit a model, per community, to predict the treatment outcome. 
In some communities, this prediction outperformed the baseline scenario where the predictor was fitted to all patients.","PeriodicalId":129612,"journal":{"name":"2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116811002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The use of machine learning to identify the correctness of HS Code for the customs import declarations","authors":"Hao Chen, Ben van Rijnsoever, Marcel Molenhuis, Dennis van Dijk, Yao-Hua Tan, B. Rukanova","doi":"10.1109/DSAA53316.2021.9564203","DOIUrl":"https://doi.org/10.1109/DSAA53316.2021.9564203","url":null,"abstract":"With the increasing volume of international trade around the world, the number of cross-border import declarations is growing rapidly, resulting in an unprecedented scale of potentially fraudulent transactions, in particular false commodity codes (e.g., HS Code). An incorrect HS Code causes duty risk and adversely impacts revenue collection. Physical investigation by customs administrations is impractical due to the substantial quantity of declarations. This paper provides an automatic approach that harnesses machine learning techniques to relieve the burden on customs targeting officers. We introduce a novel model based on an off-the-shelf embedding encoder to identify the correctness of an HS Code without any human effort. Determining whether the HS Code correctly matches the commodity description is a classification task, so labelled data is typically required. However, the lack of gold-standard labelled data sets in the customs domain limits the development of supervised approaches. Our model is developed with an unsupervised mechanism and trained on unlabelled historical declaration records, which makes it robust and allows it to be smoothly adapted by different customs administrations. Rather than classifying whether the HS Code is correct or not, our model predicts a score indicating the degree to which the HS Code is correct. We have evaluated our proposed model on a ground-truth data set provided by Dutch customs officers. 
Results show promising performance of 71% overall accuracy.","PeriodicalId":129612,"journal":{"name":"2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115228141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic Graph Convolutional LSTM application for traffic flow estimation from error-prone measurements: results and transferability analysis","authors":"Safa Boudabous, S. Clémençon, H. Labiod, Julian Garbiso","doi":"10.1109/DSAA53316.2021.9564206","DOIUrl":"https://doi.org/10.1109/DSAA53316.2021.9564206","url":null,"abstract":"Technological advances in the transportation and automotive industry have led to the use of new types of sensing systems that are more cost-effective and suited to large-scale dense deployment. These sensing techniques allow traffic measurement time series to be gathered continuously at different geospatial locations. The accuracy of the raw measurements is often hindered by factors related to the sensing environment and the sensing process itself, so the measurements fail to capture the short-term traffic variations crucial for real-time traffic monitoring. In this paper, we propose the DGC-LSTM model for area-wide traffic estimation from error-prone measurement time series. The backbone of the DGC-LSTM model is a graph convolutional Long Short Term Memory model with a dynamic adjacency matrix. The adjacency matrix is learned and optimized during model training. The adjacency matrix values are estimated from the set of contextual features that impact the dynamicity of the dependencies in both the spatial and temporal dimensions. A realistic synthetic labelled Bluetooth counts dataset is used for model evaluation. 
Lastly, we highlight the importance of transfer learning methods to improve the model applicability by ensuring model adaptation to the new deployment site while avoiding the extensive data-labelling effort.","PeriodicalId":129612,"journal":{"name":"2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122236358","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"XPROAX-Local explanations for text classification with progressive neighborhood approximation","authors":"Yi Cai, A. Zimek, Eirini Ntoutsi","doi":"10.1109/DSAA53316.2021.9564153","DOIUrl":"https://doi.org/10.1109/DSAA53316.2021.9564153","url":null,"abstract":"The importance of the neighborhood for training a local surrogate model to approximate the local decision boundary of a black box classifier has been already highlighted in the literature. Several attempts have been made to construct a better neighborhood for high dimensional data, like texts, by using generative autoencoders. However, existing approaches mainly generate neighbors by selecting purely at random from the latent space and struggle under the curse of dimensionality to learn a good local decision boundary. To overcome this problem, we propose a progressive approximation of the neighborhood using counterfactual instances as initial landmarks and a careful 2-stage sampling approach to refine counterfactuals and generate factuals in the neighborhood of the input instance to be explained. Our work focuses on textual data and our explanations consist of both word-level explanations from the original instance (intrinsic) and the neighborhood (extrinsic) and factual- and counterfactual-instances discovered during the neighborhood generation process that further reveal the effect of altering certain parts in the input text. 
Our experiments on real-world datasets demonstrate that our method outperforms the competitors in terms of usefulness and stability (for the qualitative part) and completeness, compactness and correctness (for the quantitative part).","PeriodicalId":129612,"journal":{"name":"2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA)","volume":"142 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128604362","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards optimized actions in critical situations of soccer games with deep reinforcement learning","authors":"Pegah Rahimian, Afshin Oroojlooy, László Toka","doi":"10.1109/DSAA53316.2021.9564207","DOIUrl":"https://doi.org/10.1109/DSAA53316.2021.9564207","url":null,"abstract":"Soccer is a sparse-reward game: any smart or careless action in critical situations can change the result of the match. Therefore players, coaches, and scouts are all curious about the best action to be performed in critical situations, such as the times with a high probability of losing ball possession or scoring a goal. This work proposes a new state representation for the soccer game and a batch reinforcement learning approach to train a smart policy network. This network takes the contextual information of the situation and proposes the optimal action to maximize the expected goal for the team. We performed extensive numerical experiments on the soccer logs made by InStat for 104 European soccer matches. The results show that in all 104 games, the optimized policy obtains higher rewards than its counterpart in the behavior policy. Besides, our framework learns policies that are close to the expected behavior in the real world. 
For instance, in the optimized policy, we observe that some actions, such as a foul or ball out, can sometimes be more rewarding than a shot in specific situations.","PeriodicalId":129612,"journal":{"name":"2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131075345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Multi-Graph Convolution Network For Metro Passenger Volume Prediction","authors":"Fuchen Gao, Zhanquan Wang, Zhenguang Liu","doi":"10.1109/DSAA53316.2021.9564196","DOIUrl":"https://doi.org/10.1109/DSAA53316.2021.9564196","url":null,"abstract":"Accurate prediction of metro passenger volume (number of passengers) is valuable for real-time metro system management and is a pivotal yet challenging task in intelligent transportation. Due to the complex spatial correlation and temporal variation of urban subway ridership behavior, deep learning has been widely used to capture nonlinear spatial-temporal dependencies. Unfortunately, current deep learning methods adopt a graph convolutional network only as a component to model spatial relationships, without making full use of the different spatial correlation patterns between stations. In order to further improve the accuracy of metro passenger volume prediction, a deep learning model composed of Parallel multi-graph convolution and stacked Bidirectional unidirectional Gated Recurrent Unit (PB-GRU) is proposed in this paper. The parallel multi-graph convolution captures the origin-destination (OD) distribution and similar flow patterns between the metro stations, while the bidirectional gated recurrent unit considers the passenger volume sequence in forward and backward directions and learns complex temporal features. Extensive experiments on two real-world datasets of subway passenger flow show the efficacy of the model. 
Surprisingly, compared with the existing methods, PB-GRU achieves much lower prediction error.","PeriodicalId":129612,"journal":{"name":"2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131191914","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Neural Approach for Detecting Morphological Analogies","authors":"Safa Alsaidi, Amandine Decker, Puthineath Lay, Esteban Marquer, Pierre-Alexandre Murena, Miguel Couceiro","doi":"10.1109/DSAA53316.2021.9564186","DOIUrl":"https://doi.org/10.1109/DSAA53316.2021.9564186","url":null,"abstract":"Analogical proportions are statements of the form “A is to B as C is to D” that are used for several reasoning and classification tasks in artificial intelligence and natural language processing (NLP). For instance, there are analogy based approaches to semantics as well as to morphology. In fact, symbolic approaches were developed to solve or to detect analogies between character strings, e.g., the axiomatic approach as well as that based on Kolmogorov complexity. In this paper, we propose a deep learning approach to detect morphological analogies, for instance, with reinflexion or conjugation. We present empirical results that show that our framework is competitive with the above-mentioned state of the art symbolic approaches. We also explore empirically its transferability capacity across languages, which highlights interesting similarities between them.","PeriodicalId":129612,"journal":{"name":"2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115861475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Interpretable Summaries of Black Box Incident Triaging with Subgroup Discovery","authors":"Youcef Remil, Anes Bendimerad, M. Plantevit, C. Robardet, Mehdi Kaytoue-Uberall","doi":"10.1109/DSAA53316.2021.9564164","DOIUrl":"https://doi.org/10.1109/DSAA53316.2021.9564164","url":null,"abstract":"The need for predictive maintenance comes with an increasing number of incidents reported by monitoring systems and equipment/software users. In the front line, on-call engineers (OCEs) have to quickly assess the degree of severity of an incident and decide which service to contact for corrective actions. To automate these decisions, several predictive models have been proposed, but the most efficient models are opaque (say, black box), strongly limiting their adoption. In this paper, we propose an efficient black box model based on 170K incidents reported to our company over the last 7 years and emphasize the need to automate triage when incidents are massively reported on thousands of servers running our product, an ERP. Recent developments in eXplainable Artificial Intelligence (XAI) help in providing global explanations of the model, but also, and most importantly, local explanations for each model prediction/outcome. Sadly, providing a human with an explanation for each outcome is not conceivable when dealing with a large number of daily predictions. To address this problem, we propose an original data-mining method rooted in Subgroup Discovery, a pattern mining technique with the natural ability to group objects that share similar explanations of their black box predictions and provide a description for each group. We evaluate this approach and present our preliminary results, which give us good hope for effective adoption by OCEs. 
We believe that this approach provides a new way to address the problem of model agnostic outcome explanation.","PeriodicalId":129612,"journal":{"name":"2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA)","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134329690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reducing Unintended Bias of ML Models on Tabular and Textual Data","authors":"Guilherme Alves, M. Amblard, Fabien Bernier, Miguel Couceiro, A. Napoli","doi":"10.1109/DSAA53316.2021.9564112","DOIUrl":"https://doi.org/10.1109/DSAA53316.2021.9564112","url":null,"abstract":"Unintended biases in machine learning (ML) models are among the major concerns that must be addressed to maintain public trust in ML. In this paper, we address process fairness of ML models, which consists in reducing the dependence of models on sensitive features without compromising their performance. We revisit the framework FixOut, which is inspired by the approach “fairness through unawareness”, to build fairer models. We introduce several improvements such as automating the choice of FixOut's parameters. FixOut was originally proposed to improve fairness of ML models on tabular data; we also demonstrate the feasibility of FixOut's workflow for models on textual data. We present several experimental results illustrating that FixOut improves process fairness in different classification settings.","PeriodicalId":129612,"journal":{"name":"2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA)","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131530417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}