Ishtiaq Ahmed, Shiyong Lu, Changxin Bai, F. Bhuyan
{"title":"Diagnosis Recommendation Using Machine Learning Scientific Workflows","authors":"Ishtiaq Ahmed, Shiyong Lu, Changxin Bai, F. Bhuyan","doi":"10.1109/BigDataCongress.2018.00018","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2018.00018","url":null,"abstract":"Diagnosis recommendation plays a significant role in healthcare, where a clinician infers an optimal diagnosis for a patient. This problem has a major impact on improving patients’ quality of life. Existing machine learning techniques for solving this problem require many labeled instances, which are not readily available. To overcome this limitation, we present a scientific workflow for representing a semi-supervised, clustering-based diagnosis recommendation model. In this approach, initial clusters are formed from a labeled dataset; then, by imposing a relative threshold on a cluster, frequent patterns and their corresponding labels are obtained. Subsequently, unlabeled instances are labeled by assigning them to the most similar clusters. Finally, we form clusters on the newly generated datasets and recommend the diagnosis label by applying a minimum threshold. To evaluate our model, we perform extensive experiments on the i2b2 datasets and compare our proposed algorithm with the self-training and co-training methods. The experimental results show that our proposed algorithm outperforms these methods in most cases. The proposed workflow is implemented in the DATAVIEW system.","PeriodicalId":177250,"journal":{"name":"2018 IEEE International Congress on Big Data (BigData Congress)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123125232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Khodabakhsh, Ismail Ari, Mustafa Bakir, Serhat Murat Alagoz
{"title":"Stream Analytics and Adaptive Windows for Operational Mode Identification of Time-Varying Industrial Systems","authors":"A. Khodabakhsh, Ismail Ari, Mustafa Bakir, Serhat Murat Alagoz","doi":"10.1109/BigDataCongress.2018.00042","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2018.00042","url":null,"abstract":"It is necessary to develop accurate, yet simple and efficient models that can be used with high-speed industrial data streams. In this paper, we develop a mode identification technique using stream analytics and show that it may be more effective than batch models, especially for time-varying systems. These industrial systems continuously monitor hundreds of sensors, but the relationships among variables change over time, and these changes are identified as different operational modes. To detect drifts among modes, predictive modeling techniques such as regression analysis, K-means and DBSCAN clustering are used over sensor data streams from an oil refinery, and models are updated in real time using window-based analysis. Finally, an adaptive window size tuning approach based on the TCP congestion control algorithm is discussed, which reduces model update costs as well as prediction errors.","PeriodicalId":177250,"journal":{"name":"2018 IEEE International Congress on Big Data (BigData Congress)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128651211","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Insights on Apache Spark Usage by Mining Stack Overflow Questions","authors":"L. J. Rodríguez, Xiaoran Wang, Jilong Kuang","doi":"10.1109/BigDataCongress.2018.00037","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2018.00037","url":null,"abstract":"Apache Spark is one of the most popular big data tools. Despite its popularity, there are no studies regarding its overall usage among software developers. As such, essential questions remain unanswered. For instance, it is not known what the common issues faced by Spark users are, what the most popular Spark libraries are, or what technologies are most commonly used together with Spark. In this paper, we mine Stack Overflow questions to shed some light on the above issues. Specifically, we first apply Latent Dirichlet Allocation (LDA) to Stack Overflow questions and obtain the main topics of discussion. By computing previously proposed metrics and a novel modification, we provide insights into Spark usage while taking question view count into account. Further insights are then given by applying newly proposed metrics to the question tags. Finally, temporal trends are discussed after analyzing the proposed metrics over time.","PeriodicalId":177250,"journal":{"name":"2018 IEEE International Congress on Big Data (BigData Congress)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129154486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zainab Abbas, A. Al-Shishtawy, Sarunas Girdzijauskas, Vladimir Vlassov
{"title":"Short-Term Traffic Prediction Using Long Short-Term Memory Neural Networks","authors":"Zainab Abbas, A. Al-Shishtawy, Sarunas Girdzijauskas, Vladimir Vlassov","doi":"10.1109/BigDataCongress.2018.00015","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2018.00015","url":null,"abstract":"Short-term traffic prediction allows Intelligent Transport Systems to proactively respond to events before they happen. With the rapid increase in the amount, quality, and detail of traffic data, new techniques are required that can exploit the information in the data in order to provide better results while being able to scale and cope with increasing amounts of data and growing cities. We propose and compare three models for short-term road traffic density prediction based on Long Short-Term Memory (LSTM) neural networks. We have trained the models using real traffic data collected by the Motorway Control System in Stockholm, which monitors highways and collects flow and speed data per lane every minute from radar sensors. In order to deal with the challenge of scale and to improve prediction accuracy, we propose to partition the road network into road stretches and junctions, and to model each of the partitions with one or more LSTM neural networks. Our evaluation results show that partitioning of roads improves the prediction accuracy by reducing the root mean square error by a factor of 5. We show that we can reduce the complexity of the LSTM network by limiting the number of input sensors, on average to 35% of the original number, without compromising the prediction accuracy.","PeriodicalId":177250,"journal":{"name":"2018 IEEE International Congress on Big Data (BigData Congress)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124099310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Compile-Time Code Generation for Embedded Data-Intensive Query Languages","authors":"L. Fegaras, Md Hasanuzzaman Noor","doi":"10.1109/BigDataCongress.2018.00008","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2018.00008","url":null,"abstract":"Many emerging Big Data programming environments, such as Spark and Flink, provide powerful APIs that are inspired by functional programming. However, because of the complexity involved in developing and fine-tuning data analysis applications using the provided APIs, many programmers prefer to use declarative languages, such as Hive and Spark SQL, to code their distributed applications. Unfortunately, current data analysis query languages, which are typically based on the relational model, cannot effectively capture the rich data types and computations required for complex data analysis applications. Furthermore, these query languages are not well-integrated with the host programming language, as they are based on an incompatible data model, and are checked for correctness at run-time, which results in a significantly longer program development time. To address these shortcomings, we introduce a new query language for data-intensive scalable computing, called DIQL, that is deeply embedded in Scala, and a query optimization framework that optimizes and translates DIQL queries to byte code at compile-time. In contrast to other query languages, our query embedding eliminates impedance mismatch as any Scala code can be seamlessly mixed with SQL-like syntax, without having to add any special declaration. DIQL supports nested collections and hierarchical data and allows query nesting at any place in a query. With DIQL, programmers can express complex data analysis tasks, such as PageRank and matrix factorization, using SQL-like syntax exclusively. The DIQL query optimizer can find any possible join in a query, including joins hidden across deeply nested queries, thus unnesting any form of query nesting. Currently, DIQL can run on three Big Data platforms: Apache Spark, Apache Flink, and Twitter's Cascading/Scalding.","PeriodicalId":177250,"journal":{"name":"2018 IEEE International Congress on Big Data (BigData Congress)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126465455","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fie Sternberg, Kasper Hedegaard Pedersen, Niklas Klve Ryelund, R. Mukkamala, Ravikiran Vatrapu
{"title":"Analysing Customer Engagement of Turkish Airlines Using Big Social Data","authors":"Fie Sternberg, Kasper Hedegaard Pedersen, Niklas Klve Ryelund, R. Mukkamala, Ravikiran Vatrapu","doi":"10.1109/BigDataCongress.2018.00017","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2018.00017","url":null,"abstract":"Companies have started taking advantage of the unlocked potential of Big Social Data; however, research on airlines’ use of social media is limited. This research aims to investigate to what extent Turkish Airlines can utilize their Facebook page to improve performance metrics. This study exploits the concepts of Big Social Data, customer satisfaction, and sentiment analysis to answer the research questions by employing data and text mining and machine learning. The results showed a weak relationship between the business data and Facebook data; however, the findings provided explanations of customer behavior and showed that most of the company’s Facebook users were likely to purchase a Turkish Airlines ticket. Therefore, Turkish Airlines could utilize their Facebook page in the short term to improve revenue-generating indicators such as customer satisfaction and likelihood of purchase.","PeriodicalId":177250,"journal":{"name":"2018 IEEE International Congress on Big Data (BigData Congress)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133876642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Big Web Colors: Analyzing the World Top Sites","authors":"M. Marchiori, Giulio Rigoni","doi":"10.1109/BigDataCongress.2018.00020","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2018.00020","url":null,"abstract":"Colors are obviously important for web sites, but how much? In this paper we try to study the problem of abstracting from the actual content, and analyze if and how colors in images have a higher-level fundamental importance. Focusing on the world's top web sites, we collected a large pool (almost two million) of images, and then investigated the relationships of colors with the attractiveness of a page. Can colors alone boost the success of a page, and in what terms? To answer this question we developed an experiment involving a large number of people, measuring how and how much colors affect a page, abstracting from the content. The results show that, rather surprisingly, colors do have a more fundamental significance that can be decoupled from the underlying shapes. We provide qualitative and quantitative insights on how important colors are, and how they actually impact the success of a site in terms of user perception.","PeriodicalId":177250,"journal":{"name":"2018 IEEE International Congress on Big Data (BigData Congress)","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133979725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance Modeling and Task Scheduling in Distributed Graph Processing","authors":"Daniel Presser, Frank Siqueira, Fábio Reina","doi":"10.1109/BigDataCongress.2018.00025","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2018.00025","url":null,"abstract":"The accelerated growth of datasets observed in modern applications also applies to datasets modeled as graphs. To handle this problem, several large scale distributed graph processing models have been proposed, such as Pregel. These systems are designed to run in large clusters, where the resources must be allocated efficiently. In this paper we present a prediction model and a scheduler for Pregel-based distributed graph processing jobs. The jobs are treated as moldable tasks by the scheduler that, based on the predictions, allocates the best number of workers to each job in order to minimize makespan. Experimental results show that the prediction model has accuracy close to 90%, allowing the scheduler to work within the theoretical approximation limits of the optimal makespan.","PeriodicalId":177250,"journal":{"name":"2018 IEEE International Congress on Big Data (BigData Congress)","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131779166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Incorporating Word Embedding into Cross-Lingual Topic Modeling","authors":"Chia-Hsuan Chang, San-Yih Hwang, Tou-Hsiang Xui","doi":"10.1109/BigDataCongress.2018.00010","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2018.00010","url":null,"abstract":"In this paper, we address cross-lingual topic modeling, an important technique that enables global enterprises to detect and compare topic trends across global markets. Previous works in cross-lingual topic modeling have proposed methods that utilize parallel or comparable corpora in constructing the polylingual topic model. However, parallel or comparable corpora are in many cases not available. In this research, we combine cross-lingual word space mapping with topic modeling (LDA) and propose two methods: Translated Corpus with LDA (TC-LDA) and Post Match LDA (PM-LDA). The cross-lingual word space mapping allows us to compare words of different languages, and LDA enables us to group words into topics. Neither TC-LDA nor PM-LDA needs parallel or comparable corpora, and hence both apply to more domains. The effectiveness of both methods is evaluated using UM-Corpus and WS-353. Our evaluation results indicate that both methods are able to identify similar documents written in different languages. In addition, PM-LDA is shown to achieve better performance than TC-LDA, especially when document length is short.","PeriodicalId":177250,"journal":{"name":"2018 IEEE International Congress on Big Data (BigData Congress)","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127541421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jianwu Wang, Chen Liu, Meiling Zhu, Pei Guo, Yapeng Hu
{"title":"Sensor Data Based System-Level Anomaly Prediction for Smart Manufacturing","authors":"Jianwu Wang, Chen Liu, Meiling Zhu, Pei Guo, Yapeng Hu","doi":"10.1109/BigDataCongress.2018.00028","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2018.00028","url":null,"abstract":"With the popularity of Supervisory Information Systems (SIS), Supervisory Control and Data Acquisition (SCADA) systems, and Internet of Things (IoT) sensors, we can easily obtain abundant sensor data in manufacturing. We could save manufacturing maintenance costs and prevent further damage if we could accurately predict system anomalies from the sensor data. Yet learning from individual sensors often cannot directly determine whether the system will have an anomaly, because each sensor only measures a partial state of a big system. By collectively detecting events across sensors and their temporal dependencies, this paper proposes a new system-level anomaly prediction framework that mines an anomaly dependency graph from sensor data. The advantages of the approach include explainability, collective prediction and temporal sensitivity. We applied our approach to a real-world power plant dataset to evaluate its feasibility.","PeriodicalId":177250,"journal":{"name":"2018 IEEE International Congress on Big Data (BigData Congress)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132486737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}