{"title":"A Multi-Source Information Learning Framework for Airbnb Price Prediction","authors":"Lu Jiang, Y. Li, Na Luo, Jianan Wang, Qiao Ning","doi":"10.1109/ICDMW58026.2022.00009","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00009","url":null,"abstract":"With the development of technology and the sharing economy, Airbnb, a well-known short-term rental platform, has become the first choice for many young people. Airbnb pricing has long been a problem worth studying. While previous studies achieve promising results, deficiencies remain, such as: (1) the feature attributes of rentals are not rich enough; (2) the research on rental text information is not deep enough; (3) there are few studies on predicting the rental price in combination with the points of interest (POI) around the house. To address the above challenges, we propose a multi-source information embedding (MSIE) model to predict the rental price of Airbnb. Specifically, we first select the statistical features to embed the original rental data. Secondly, we generate the word feature vector and emotional score combination of three different types of text information to form the text feature embedding. Thirdly, we use the points of interest (POI) around the rental house to generate a variety of spatial network graphs, and learn the embedding of the network to obtain the spatial feature embedding. Finally, this paper combines the three modules into multi-source rental representations, and uses the constructed fully connected neural network to predict the price. 
The analysis of the experimental results shows the effectiveness of our proposed model.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130213241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"U-Net Transfer Learning for Image Restoration on Sparse CT Reconstruction in Pre-Clinical Research","authors":"Huanyi Zhou, Honggang Zhao, Wenlu Wang","doi":"10.1109/ICDMW58026.2022.00053","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00053","url":null,"abstract":"Sparse computed tomography (CT) reconstruction can lead to significant streak artifacts. Image restoration that removes these artifacts while recovering image features is an important area of research in low-dose sparse CT imaging. In pre-clinical research, where a lag still exists in the use of professional CT equipment, existing imaging devices provide limited X-ray dose energy accompanied by strong noise patterns when scanning. Reconstructed CT images contain significant noise and artifacts. We propose a deep transfer learning (DTL) neural network training method that exploits open-source data for initial training and a small-scale detected phantom image with its total variation result for transfer learning to address this issue. We hypothesize that a pre-trained neural network from open-source data has no prior knowledge of our device configuration, which prevents its application on our measured data, and deep transfer learning on small-scale detected phantom can feed specific configurations into the model. 
Our experiment has demonstrated that our proposed method, incorporating a modified total variation (TV) algorithm, can successfully realize a good balance between artifact removal and image feature restoration.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128594862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Empirical analysis of fairness-aware data segmentation","authors":"Seiji Okura, T. Mohri","doi":"10.1109/ICDMW58026.2022.00029","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00029","url":null,"abstract":"Fairness in machine learning is a recently established research area that aims to mitigate the bias of unfair models that treat unprivileged people unfavorably based on protected attributes. We aim to mitigate such bias based on the idea of data segmentation, that is, dividing data into segments where people should be treated similarly. Such an approach should be useful in the sense that the mitigation process itself is explainable in cases where similar people should be treated similarly. Although research on such cases exists, the question of the effectiveness of data segmentation itself remains to be answered. In this paper, we answer this question by empirically analyzing the experimental results of data segmentation using two datasets, i.e., the UCI Adult dataset and the Kaggle ‘Give me some credit’ (gmsc) dataset. We empirically show that (1) fairness can be controlled while training models by the way data is divided into segments; more specifically, by selecting the attributes and setting the number of segments to adjust statistics such as the statistical parity of the segments and the mutual information between the attributes; 
(2) the effects of data segmentation depend on the classifier, and (3) there exist weak trade-offs between fairness and accuracy with regard to data segmentation.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130234027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detection of Mild Cognitive Impairment from Quantitative Analysis of Timed Up and Go (TUG)","authors":"Mahmoud Seifallahi, J. Galvin, B. Ghoraani","doi":"10.1109/ICDMW58026.2022.00042","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00042","url":null,"abstract":"Mild cognitive impairment (MCI) is abnormal cognitive decline beyond expected normal decline. The rate of progression to Alzheimer's disease (AD) in people with MCI is an estimated 80% in 6 years. However, distinguishing MCI from normal cognition in older adults remains a clinical challenge in early AD detection. We investigated a new method for detecting MCI based on patients' gait and balance. Our approach performs a comprehensive analysis of the Timed Up and Go test (TUG), based on the first application of a Kinect v.2 camera to record and provide movement measures and machine learning to differentiate between the two groups of older adults with MCI and healthy controls (HC). We collected movement data from 25 joints of the body via a Kinect v.2 camera as 30 HC and 25 MCI subjects performed TUG. The collected data provided a comprehensive list of gait and balance measures with 61 features, including duration of TUG, duration and velocity of transition phases, and micro and macro gait features. Our analysis showed that 25 features were significantly different between MCI and HC subjects, where 20 of them were unique features as indicated by our correlation analysis. The classification results using three different classifiers (support vector machine (SVM), random forest, and artificial neural network) showed that our approach detected MCI subjects best using SVM, with 94% accuracy, 100% precision, 93.33% F-score, and 0.94 AUC. These observations suggest the possibility of our approach as a low-cost, easy-to-use MCI screening tool for objectively detecting subjects at high risk of developing AD. 
Such a tool is well-suited for widespread application in clinical settings and nursing homes to detect early signs of cognitive impairment and promote healthy aging.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125740847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Weight-Training Ensemble Model for Stock Price Forecast","authors":"Jianing Zhao, Ayana Takai, E. Kita","doi":"10.1109/ICDMW58026.2022.00024","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00024","url":null,"abstract":"An ensemble model is applied to stock price prediction in this study. The proposed ensemble model is based on the weighted average estimation of the values predicted by base algorithms. The base algorithms include Linear Regression, Long Short-Term Memory (LSTM), Support Vector Regression (SVR) and LightGBM. The performance of the proposed model depends on the weight parameters. Past data are collected to calculate the weight parameters for the base models of the ensemble models. The stock price prediction of Toyota Motor Corporation is considered as a numerical example. Then LSTM, SVR and LightGBM are built to recognize the trend of the weight sequence data and to predict the most suitable combination weights for the ensemble. The experimental results show that each ensemble model achieves significantly better accuracy than each component model. The proposed model also achieved a lower error than the simple average and error-based combination methods. Even a tiny difference in choosing associated combining weights can play a crucial role in linear combination of models for prediction.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128503683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hybrid Oversampling Technique Based on Star Topology and Rejection Methodology for Classifying Imbalanced Data","authors":"Chaekyu Lee, Jaekwang Kim","doi":"10.1109/ICDMW58026.2022.00033","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00033","url":null,"abstract":"In this paper, we propose the star topology and rejection method (STARM), a new oversampling technique that generally performs well for varying data and algorithms. STARM is a hybrid technique that combines the advantages of Polynom-fit-SMOTE, LEE, and SMOTE, all of which have yielded high performance based on different technical features, and eliminates their disadvantages. To verify that the proposed technique exhibits high performance in general situations, we conducted 28,028 experiments to compare the predictive performance of 77 oversampling techniques with four machine learning algorithms for 91 imbalanced datasets of various types. Consequently, STARM yielded the highest performance on average among the 77 techniques. In addition, it showed excellent performance even in various algorithms, various imbalanced ratios, and various data volumes.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126988768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Incremental constrained clustering with application to remote sensing images time series","authors":"Baptiste Lafabregue, P. Gançarski, J. Weber, G. Forestier","doi":"10.1109/ICDMW58026.2022.00110","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00110","url":null,"abstract":"Automatically extracting knowledge from various datasets is a valuable task to help experts explore new types of data and save time on annotations. This is especially required for new topics such as emergency management or environmental monitoring. Traditional unsupervised methods often tend to not fulfill experts' intuitions or non-formalized knowledge. On the other hand, supervised methods tend to require a lot of knowledge to be efficient. Constrained clustering, a form of semi-supervised methods, mitigates these two effects, as it allows experts to inject their knowledge into the clustering process. However, constraints often have a poor effect on the result because it is hard for experts to give both informative and coherent constraints. Based on the idea that it is easier to criticize than to construct, this article presents a new method, I-SAMARAH, an incremental constrained clustering method. Through an iterative process, it alternates between a clustering phase where constraints are incorporated, and a criticize phase where the expert can give feedback on the clustering. We demonstrate experimentally the efficiency of our method on remote sensing image time series. 
We compare it to other constrained clustering methods in terms of result quality and to supervised methods in terms of number of annotations.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132409988","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A study of the Dream Net model robustness across continual learning scenarios","authors":"M. Mainsant, M. Mermillod, C. Godin, M. Reyboz","doi":"10.1109/ICDMW58026.2022.00111","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00111","url":null,"abstract":"Continual learning is one of the major challenges of deep learning. For decades, many studies have proposed efficient models overcoming catastrophic forgetting when learning new data. However, as they were focused on providing the best forgetting-reduction performance, studies have moved away from real-life applications where algorithms need to adapt to changing environments and perform, no matter the type of data arrival. Therefore, there is a growing need to define new scenarios to assess the robustness of existing methods with those challenges in mind. The issue of data availability during training is another essential point in the development of solid continual learning algorithms. Depending on the streaming formulation, in the more extreme scenarios the model needs to be able to adapt to new data as soon as it arrives and without the possibility to review it afterwards. In this study, we propose a review of existing continual learning scenarios and their associated terms. Those existing terms and definitions are synthesized in an atlas in order to provide a better overview. Based on two of the main categories defined in the atlas, “Class-IL” and “Domain-IL”, we define eight different scenarios with data streams of varying complexity that allow testing the model's robustness in changing data arrival scenarios. We choose to evaluate Dream Net - Data Free, a privacy-preserving continual learning algorithm, in each proposed scenario and demonstrate that this model is robust enough to succeed in every proposed scenario, regardless of how the data is presented. 
We also show that it is competitive with other continual learning literature algorithms that are not privacy preserving which is a clear advantage for real-life human-centered applications.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129317867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Understanding Concept Identification as Consistent Data Clustering Across Multiple Feature Spaces","authors":"Felix Lanfermann, Sebastian Schmitt, Patricia Wollstadt","doi":"10.1109/ICDMW58026.2022.00032","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00032","url":null,"abstract":"Identifying meaningful concepts in large data sets can provide valuable insights into engineering design problems. Concept identification aims at identifying non-overlapping groups of design instances that are similar in a joint space of all features, but which are also similar when considering only subsets of features. These subsets usually comprise features that characterize a design with respect to one specific context, for example, constructive design parameters, performance values, or operation modes. It is desirable to evaluate the quality of design concepts by considering several of these feature subsets in isolation. In particular, meaningful concepts should not only identify dense, well separated groups of data instances, but also provide non-overlapping groups of data that persist when considering pre-defined feature subsets separately. In this work, we propose to view concept identification as a special form of clustering algorithm with a broad range of potential applications beyond engineering design. To illustrate the differences between concept identification and classical clustering algorithms, we apply a recently proposed concept identification algorithm to two synthetic data sets and show the differences in identified solutions. In addition, we introduce the mutual information measure as a metric to evaluate whether solutions return consistent clusters across relevant subsets. 
To support the novel understanding of concept identification, we consider a simulated data set from a decision-making problem in the energy management domain and show that the identified clusters are more interpretable with respect to relevant feature subsets than clusters found by common clustering algorithms and are thus more suitable to support a decision maker.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116344041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the Deployment of Post-Disaster Building Damage Assessment Tools using Satellite Imagery: A Deep Learning Approach","authors":"Shahrzad Gholami, Caleb Robinson, Anthony Ortiz, Siyu Yang, J. Margutti, Cameron Birge, R. Dodhia, J. Ferres","doi":"10.1109/ICDMW58026.2022.00134","DOIUrl":"https://doi.org/10.1109/ICDMW58026.2022.00134","url":null,"abstract":"The frequency of natural disasters is growing globally. Every year 350 million people are affected and billions of dollars of damage are incurred. Providing timely and appropriate humanitarian interventions like shelters, medical aid, and food to affected communities is a challenging problem. AI frameworks can help support existing efforts in solving these problems in various ways. In this study, we propose using high-resolution satellite imagery from before and after disasters to develop a convolutional neural network model for localizing buildings and scoring their damage level. We categorize damage to buildings into four levels, spanning from not damaged to destroyed, based on the xView2 dataset's scale. Due to the emergency nature of disaster response efforts, the value of automating damage assessment lies primarily in the inference speed, rather than accuracy. We show that our proposed solution works three times faster than the fastest xView2 challenge winning solution and over 50 times faster than the slowest first place solution, which indicates a significant improvement from an operational viewpoint. Our proposed model achieves a pixel-wise F1 score of 0.74 for the building localization and a pixel-wise harmonic F1 score of 0.6 for damage classification and uses a simpler architecture compared to other studies. Additionally, we develop a web-based visualizer that can display the before and after imagery along with the model's building damage predictions on a custom map. 
This study has been collaboratively conducted to empower a humanitarian organization as the stakeholder, that plans to deploy and assess the model along with the visualizer for their disaster response efforts in the field.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123438009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}