{"title":"Discovery of Informal Topics from Post Traumatic Stress Disorder Forums","authors":"Reilly Grant, David Kucher, A. Leon, Jonathan F. Gemmell, D. Raicu","doi":"10.1109/ICDMW.2017.65","DOIUrl":"https://doi.org/10.1109/ICDMW.2017.65","url":null,"abstract":"Post Traumatic Stress Disorder (PTSD) is a public health problem afflicting millions of people each year. It is especially prominent among military veterans. Understanding the language, attitudes, and topics associated with PTSD presents an important and challenging problem. Based on their expertise, mental health professionals have constructed a formal definition of PTSD. However, even the most assiduous mental health professionals can care for only a small fraction of those suffering from PTSD, limiting their perspective of the disorder. As social networking sites have grown in acceptance, users have begun to express personal thoughts and feelings, such as those related to PTSD. This wealth of content can be viewed as an enormous collective description of PTSD and its related issues. We automatically extract informal latent topics from thousands of social media posts in which users describe their experience with PTSD and compare these topics to the formal description generated by mental health professionals. We then explore the pattern and associations of these topics. Our informal topic discovery evaluation reveals that we can successfully identify meaningful topics in PTSD social media related data. When comparing our topics to the criteria included in the Diagnostic and Statistical Manual of Mental Disorders (DSM), we found that we were able to automatically reproduce many of the criteria. 
We also discovered new topics which were not mentioned in the DSM, but were prevalent across the collaborative narrative of thousands of users' experiences with PTSD.","PeriodicalId":389183,"journal":{"name":"2017 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"156 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127355478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Big Data Analysis Framework Using Apache Spark and Deep Learning","authors":"Anand Gupta, H. Thakur, Ritvik Shrivastava, Pulkit Kumar, Sreyashi Nag","doi":"10.1109/ICDMW.2017.9","DOIUrl":"https://doi.org/10.1109/ICDMW.2017.9","url":null,"abstract":"With the growing prevalence of Big Data, many advances have recently been made in this field. Frameworks such as Apache Hadoop and Apache Spark have gained a lot of traction over the past decade and have become massively popular, especially in industry. It is becoming increasingly evident that effective big data analysis is key to solving artificial intelligence problems. Thus, a multi-algorithm library was implemented in the Spark framework, called MLlib. While this library supports multiple machine learning algorithms, there is still scope to use the Spark setup efficiently for highly time-intensive and computationally expensive procedures like deep learning. In this paper, we propose a novel framework that combines the distributed computing capabilities of Apache Spark and the advanced machine learning architecture of a deep multi-layer perceptron (MLP), using the popular concept of Cascade Learning. We conduct empirical analysis of our framework on two real-world datasets. 
The results are encouraging and corroborate our proposed framework, in turn proving that it is an improvement over traditional big data analysis methods that use either Spark or Deep learning as individual elements.","PeriodicalId":389183,"journal":{"name":"2017 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123642461","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Personal Identification by Pedestrians Behavior","authors":"E. Kita, Xuanang Feng, Hiroki Shimokubo","doi":"10.1109/ICDMW.2017.88","DOIUrl":"https://doi.org/10.1109/ICDMW.2017.88","url":null,"abstract":"Recent progress in motion sensor systems enables personal identification from human behavior observed by the sensor. Kinect is a motion-sensing input device developed by Microsoft for Xbox 360 and Xbox One. This study presents personal identification using the Microsoft Kinect sensor (hereafter, Kinect). The Kinect is used to estimate a pedestrian's body size and walking behavior. Body-size measurements such as height and width, and walking-behavior measurements such as joint angle and stride length, are taken as the explanatory variables. The models that identify pedestrians from the explanatory variables are built with a traditional neural network (NN) and a Support Vector Machine (SVM). In the numerical experiments, body sizes and walking-behavior images are collected from fifteen examinees. The walking direction is set to 0°, 90°, 180° and 225°, and the accuracy is compared across directions. The results show that identification accuracy is best for the 180° walking direction and that the support vector machine is more accurate than the neural network.","PeriodicalId":389183,"journal":{"name":"2017 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123166400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Multilevel NER Framework for Automatic Clinical Name Entity Recognition","authors":"T. Luu, R. Phan, Rachel Davey, G. Chetty","doi":"10.1109/ICDMW.2017.161","DOIUrl":"https://doi.org/10.1109/ICDMW.2017.161","url":null,"abstract":"In this paper, we propose a novel multilevel NER framework, for addressing the challenges of clinical name entity recognition, based on different machine learning and text mining algorithms. The proposed framework, with multiple levels, allows models for increasingly complex NER tasks to be built. The experimental evaluation on two different publicly available datasets, corresponding to different application contexts - the CLEF 2016 challenge shared task 1A for nursing handover context, and the BIONLP/NLPBPA 2004 challenge shared task on GENIA corpus for recognizing entities in microbiology, has validated the proposed framework.","PeriodicalId":389183,"journal":{"name":"2017 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116959758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automated Storytelling Evaluation and Story Chain Generation","authors":"J. Rigsby, Daniel Barbará","doi":"10.1109/ICDMW.2017.15","DOIUrl":"https://doi.org/10.1109/ICDMW.2017.15","url":null,"abstract":"Given a beginning and ending document, automated storytelling attempts to fill in intermediary documents to form a coherent story. This is a common problem for analysts; they often have two snippets of information and want to find the other pieces that relate them. Evaluation of the quality of the created stories is difficult and has routinely involved human judgment. This work extends the state of the art by providing quantitative methods of story quality evaluation which are shown to have good agreement with human judgment. Two methods of automated storytelling evaluation, dispersion and coherence, are developed. Dispersion, a measure of story flow, ascertains how well the generated story flows away from the beginning document and towards the ending document. Coherence measures how well the articles in the middle of the story provide information about the relationship of the beginning and ending document pair. Kullback-Leibler divergence (KLD) is used to measure the ability to encode the vocabulary of the beginning and ending story documents using the set of middle documents in the story. The dispersion and coherence methodologies developed here have the added benefit that they do not require parametrization or user inputs and are also easily automated. An automated storytelling algorithm is proposed as a multicriteria optimization problem that maximizes dispersion and coherence simultaneously. The developed storytelling methodologies will allow for the automated identification of information which associates disparate documents in support of literature-based discovery and link analysis tasking. 
In addition, the methods provide quantitative measures of the strength of these associations.","PeriodicalId":389183,"journal":{"name":"2017 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"189 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114198183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sequential Heterogeneous Attribute Embedding for Item Recommendation","authors":"Kuan Liu, Xing Shi, P. Natarajan","doi":"10.1109/ICDMW.2017.107","DOIUrl":"https://doi.org/10.1109/ICDMW.2017.107","url":null,"abstract":"Attributes, such as metadata and profile, carry useful information which in principle can help improve accuracy in recommender systems. However, existing approaches have difficulty in fully leveraging attribute information due to practical challenges such as heterogeneity and sparseness. These approaches also fail to combine recurrent neural networks which have recently shown effectiveness in item recommendations in applications such as video and music browsing. To overcome the challenges and to harvest the advantages of sequence models, we present a novel approach, Heterogeneous Attribute Recurrent Neural Networks (HA-RNN), which incorporates heterogeneous attributes and captures sequential dependencies in both items and attributes. HA-RNN extends recurrent neural networks with 1) a hierarchical attribute combination input layer and 2) an output attribute embedding layer. Experiments on two large-scale datasets show significant improvements over the state-of-the-art models. Ablation experiments demonstrate the crucialness of the two components to address heterogeneous attribute challenges including variable lengths and attribute sparseness. 
Furthermore, our exploratory studies also shed light on why sequence modeling works well.","PeriodicalId":389183,"journal":{"name":"2017 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115984965","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Inside the Atoms: Mining a Network of Networks and Beyond","authors":"Hanghang Tong","doi":"10.1109/ICDMW.2017.138","DOIUrl":"https://doi.org/10.1109/ICDMW.2017.138","url":null,"abstract":"Networks (i.e., graphs) appear in many high-impact applications. Often these networks are collected from different sources, at different times, at different granularities. In this talk, I will present our recent work on mining such multiple networks. First, we will present several new data models, whose key idea is to leverage networks as context to connect different data sets or different data mining models, including a network of networks (NoN) model, a network of co-evolving time series (NoT) model and a network of regression model. Second, we will present some algorithmic examples on how to perform mining with such new models where the key idea is to leverage the contextual network as an effective regularizer during the mining process, including ranking, imputation, prediction and inference. Finally, we will demonstrate the effectiveness of our new models and algorithms in some applications, including bioinformatics, sensor networks, critical infrastructure networks and scholarly data mining.","PeriodicalId":389183,"journal":{"name":"2017 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121633943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Factor Analysis for Anonymization","authors":"Aida Calvino, Palmira Aldeguer, J. Domingo-Ferrer","doi":"10.1109/ICDMW.2017.139","DOIUrl":"https://doi.org/10.1109/ICDMW.2017.139","url":null,"abstract":"In this paper we propose a new method to anonymize (share relevant and detailed information while not naming names) and protect data sets (minimize the utility loss) based on Factor Analysis. The method basically consists of obtaining the factors, which are uncorrelated, protecting them and undoing the transformation in order to get interpretable protected variables. We first show how to proceed when all variables in the data set need protection and, then, we focus on the case where only a subset of variables has to be protected. Finally, we perform a simulation study to compare the proposed method with two alternative techniques: Microaggregation plus noise addition (which has been recognized as a very powerful method) and one anonymization method recently proposed based on Principal Components Analysis.","PeriodicalId":389183,"journal":{"name":"2017 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116830352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multiple Queries of Information Retrieval Using Krylov Subspace Method","authors":"Youzuo Lin","doi":"10.1109/ICDMW.2017.75","DOIUrl":"https://doi.org/10.1109/ICDMW.2017.75","url":null,"abstract":"The Krylov subspace based information retrieval (IR) approach has been shown to provide comparable accuracy to latent semantic indexing (LSI), while providing some computational advantages. Recently, in the area of numerical linear algebra, attention has been drawn to the block Krylov subspace methods, which are shown to be more efficient than the classic Krylov subspace methods in solving linear systems with multiple right hand sides. Such improvement in the algorithm gives us the opportunity to extend the original retrieval method, enabling single query searching, to multiple query searching. In this paper, we report such improvement in the retrieval algorithm, and demonstrate its performance by comparing to several other retrieval methods using the Medline corpus.","PeriodicalId":389183,"journal":{"name":"2017 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129111193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Probable Biomarker Identification Using Recursive Feature Extraction and Network Analysis","authors":"Arpita Mishra, Abhishek Gupta, Umesh Maheswari, Laeeq Siddique","doi":"10.1109/ICDMW.2017.67","DOIUrl":"https://doi.org/10.1109/ICDMW.2017.67","url":null,"abstract":"Biomarkers have tremendous potential in different phases of treatment such as risk assessment, screening/detection, diagnosis and prediction of patient response. In this paper, we present an approach to developing a generic tool for end-to-end analysis of expression data to identify probable biomarkers. We follow machine learning and network analysis approaches in parallel. We use statistical techniques as preliminaries for quality analysis, followed by a feature (gene) selection approach. For network analysis, we use measures such as eigen centrality, closeness centrality and betweenness centrality to filter the most influential mutated genes which act as biomarkers.","PeriodicalId":389183,"journal":{"name":"2017 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131455892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}