{"title":"Survey of Distributed Computing Frameworks for Supporting Big Data Analysis","authors":"Xudong Sun;Yulin He;Dingming Wu;Joshua Zhexue Huang","doi":"10.26599/BDMA.2022.9020014","DOIUrl":"https://doi.org/10.26599/BDMA.2022.9020014","url":null,"abstract":"Distributed computing frameworks are the fundamental component of distributed computing systems. They provide an essential way to support the efficient processing of big data on clusters or cloud. The size of big data increases at a pace that is faster than the increase in the big data processing capacity of clusters. Thus, distributed computing frameworks based on the MapReduce computing model are not adequate to support big data analysis tasks which often require running complex analytical algorithms on extremely big data sets in terabytes. In performing such tasks, these frameworks face three challenges: computational inefficiency due to high I/O and communication costs, non-scalability to big data due to memory limit, and limited analytical algorithms because many serial algorithms cannot be implemented in the MapReduce programming model. New distributed computing frameworks need to be developed to conquer these challenges. In this paper, we review MapReduce-type distributed computing frameworks that are currently used in handling big data and discuss their problems when conducting big data analysis. 
In addition, we present a non-MapReduce distributed computing framework that has the potential to overcome big data analysis challenges.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 2","pages":"154-169"},"PeriodicalIF":13.6,"publicationDate":"2023-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10026288/10026506.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67984954","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cloud-Based Software Development Lifecycle: A Simplified Algorithm for Cloud Service Provider Evaluation with Metric Analysis","authors":"Santhosh S;Narayana Swamy Ramaiah","doi":"10.26599/BDMA.2022.9020016","DOIUrl":"https://doi.org/10.26599/BDMA.2022.9020016","url":null,"abstract":"At present, hundreds of cloud vendors in the global market provide various services based on a customer's requirements. All cloud vendors are not the same in terms of the number of services, infrastructure availability, security strategies, cost per customer, and reputation in the market. Thus, software developers and organizations face a dilemma when choosing a suitable cloud vendor for their developmental activities. Thus, there is a need to evaluate various cloud service providers (CSPs) and platforms before choosing a suitable vendor. Already existing solutions are either based on simulation tools as per the requirements or evaluated concerning the quality of service attributes. However, they require more time to collect data, simulate and evaluate the vendor. The proposed work compares various CSPs in terms of major metrics, such as establishment, services, infrastructure, tools, pricing models, market share, etc., based on the comparison, parameter ranking, and weightage allocated. Furthermore, the parameters are categorized depending on the priority level. The weighted average is calculated for each CSP, after which the values are sorted in descending order. The experimental results show the unbiased selection of CSPs based on the chosen parameters. 
The proposed parameter-ranking priority level weightage (PRPLW) algorithm simplifies the selection of the best-suited cloud vendor in accordance with the requirements of software development.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 2","pages":"127-138"},"PeriodicalIF":13.6,"publicationDate":"2023-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10026288/10026515.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67984952","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EScope: Effective Event Validation for IoT Systems Based on State Correlation","authors":"Jian Mao;Xiaohe Xu;Qixiao Lin;Liran Ma;Jianwei Liu","doi":"10.26599/BDMA.2022.9020034","DOIUrl":"https://doi.org/10.26599/BDMA.2022.9020034","url":null,"abstract":"Typical Internet of Things (IoT) systems are event-driven platforms, in which smart sensing devices sense or subscribe to events (device state changes), and react according to the preconfigured trigger-action logic, also known as automation rules. “Events” are essential elements to perform automatic control in an IoT system. However, events are not always trustworthy. Sensing fake event notifications injected by attackers (called an event spoofing attack) can trigger sensitive actions through automation rules without involving authorized users. Existing solutions verify events via “event fingerprints” extracted by surrounding sensors. However, if a system has homogeneous sensors that have strong correlations among them, traditional threshold-based methods may cause information redundancy and noise amplification, consequently decreasing the checking accuracy. To address this, we propose “EScope”, an effective event validation approach to check the authenticity of system events based on device state correlation. EScope selects informative and representative sensors using a Neural-Network-based (NN-based) sensor selection component and extracts a verification sensor set for event validation. We evaluate our approach using an existing dataset provided by Peeves. 
The experiment results demonstrate that EScope achieves an average 67% sensor amount reduction on 22 events compared with the existing work, and increases the event spoofing detection accuracy.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 2","pages":"218-233"},"PeriodicalIF":13.6,"publicationDate":"2023-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10026288/10026512.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67984886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Medical Knowledge Graph: Data Sources, Construction, Reasoning, and Applications","authors":"Xuehong Wu;Junwen Duan;Yi Pan;Min Li","doi":"10.26599/BDMA.2022.9020021","DOIUrl":"https://doi.org/10.26599/BDMA.2022.9020021","url":null,"abstract":"Medical knowledge graphs (MKGs) are the basis for intelligent health care, and they have been in use in a variety of intelligent medical applications. Thus, understanding the research and application development of MKGs will be crucial for future relevant research in the biomedical field. To this end, we offer an in-depth review of MKG in this work. Our research begins with the examination of four types of medical information sources, knowledge graph creation methodologies, and six major themes for MKG development. Furthermore, three popular models of reasoning from the viewpoint of knowledge reasoning are discussed. A reasoning implementation path (RIP) is proposed as a means of expressing the reasoning procedures for MKG. In addition, we explore intelligent medical applications based on RIP and MKG and classify them into nine major types. Finally, we summarize the current state of MKG research based on more than 130 publications and future challenges and opportunities.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 2","pages":"201-217"},"PeriodicalIF":13.6,"publicationDate":"2023-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10026288/10026520.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67984887","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficacy of Bluetooth-Based Data Collection for Road Traffic Analysis and Visualization Using Big Data Analytics","authors":"Ashish Rajeshwar Kulkarni;Narendra Kumar;K. Ramachandra Rao","doi":"10.26599/BDMA.2022.9020039","DOIUrl":"https://doi.org/10.26599/BDMA.2022.9020039","url":null,"abstract":"Effective management of daily road traffic is a huge challenge for traffic personnel. Urban traffic management has come a long way from manual control to artificial intelligence techniques. Still, real-time adaptive traffic control remains an unfulfilled dream due to the lack of low-cost, easy-to-install traffic sensors with real-time communication capability. With the increasing number of on-board Bluetooth devices in new-generation automobiles, these devices can act as sensors to convey traffic information indirectly. This paper presents the efficacy of road-side Bluetooth scanners for traffic data collection and big-data analytics to process the collected data to extract traffic parameters. Extracted information and analysis are presented through visualizations and tables. All data analytics and visualizations are carried out off-line in the R Studio environment. Reliability aspects of the collected and processed data are also investigated. Higher speed of traffic in one direction owing to the geometry of the road is also established through data analysis. Increased penetration of smartphones and fitness bands in day-to-day use is also established through the device type of the data collected. The results of this work can be used for regular data collection, in contrast to the traditional road surveys carried out annually or bi-annually. It is also found that, compared to previous studies published in the literature, the device penetration rate and sample size found in this study are quite high and very encouraging. 
This is a novel work in literature, which would be quite useful for effective road traffic management in future.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 2","pages":"139-153"},"PeriodicalIF":13.6,"publicationDate":"2023-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10026288/10026507.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67984955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semi-Supervised Machine Learning for Fault Detection and Diagnosis of a Rooftop Unit","authors":"Mohammed G. Albayati;Jalal Faraj;Amy Thompson;Prathamesh Patil;Ravi Gorthala;Sanguthevar Rajasekaran","doi":"10.26599/BDMA.2022.9020015","DOIUrl":"https://doi.org/10.26599/BDMA.2022.9020015","url":null,"abstract":"Most heating, ventilation, and air-conditioning (HVAC) systems operate with one or more faults that result in increased energy consumption and that could lead to system failure over time. Today, most building owners are performing reactive maintenance only and may be less concerned or less able to assess the health of the system until catastrophic failure occurs. This is mainly because the building owners did not previously have good tools to detect and diagnose these faults, determine their impact, and act on findings. Commercially available fault detection and diagnostics (FDD) tools have been developed to address this issue and have the potential to reduce equipment downtime, energy costs, and maintenance costs, and to improve occupant comfort and system reliability. However, many of these tools require an in-depth knowledge of system behavior and thermodynamic principles to interpret the results. In this paper, supervised and semi-supervised machine learning (ML) approaches are applied to datasets collected from a system operating in the field to develop new FDD methods and to help building owners see the value proposition of performing proactive maintenance. The study data was collected from one packaged rooftop unit (RTU) HVAC system running under normal operating conditions at an industrial facility in Connecticut. 
This paper compares three different approaches for fault classification for a real-time operating RTU using semi-supervised learning, achieving accuracies as high as 95.7% using few-shot learning.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 2","pages":"170-184"},"PeriodicalIF":13.6,"publicationDate":"2023-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10026288/10026516.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67984956","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Continuous and Discrete Similarity Coefficient for Identifying Essential Proteins Using Gene Expression Data","authors":"Jiancheng Zhong;Zuohang Qu;Ying Zhong;Chao Tang;Yi Pan","doi":"10.26599/BDMA.2022.9020019","DOIUrl":"https://doi.org/10.26599/BDMA.2022.9020019","url":null,"abstract":"Essential proteins play a vital role in biological processes, and the combination of gene expression profiles with Protein-Protein Interaction (PPI) networks can improve the identification of essential proteins. However, gene expression data are prone to significant fluctuations due to noise interference in topological networks. In this work, we discretized gene expression data and used the discrete similarities of the gene expression spectrum to eliminate noise fluctuation. We then proposed the Pearson Jaccard coefficient (PJC) that consisted of continuous and discrete similarities in the gene expression data. Using the graph theory as the basis, we fused the newly proposed similarity coefficient with the existing network topology prediction algorithm at each protein node to recognize essential proteins. This strategy exhibited a high recognition rate and good specificity. We validated the new similarity coefficient PJC on PPI datasets of Krogan, Gavin, and DIP of yeast species and evaluated the results by receiver operating characteristic analysis, jackknife analysis, top analysis, and accuracy analysis. Compared with that of node-based network topology centrality and fusion biological information centrality methods, the new similarity coefficient PJC showed a significantly improved prediction performance for essential proteins in DC, IC, Eigenvector centrality, subgraph centrality, betweenness centrality, closeness centrality, NC, PeC, and WDC. We also compared the PJC coefficient with other methods using the NF-PIN algorithm, which predicts proteins by constructing active PPI networks through dynamic gene expression. 
The experimental results proved that our newly proposed similarity coefficient PJC has superior advantages in predicting essential proteins.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 2","pages":"185-200"},"PeriodicalIF":13.6,"publicationDate":"2023-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10026288/10026519.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67984957","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Denoising Graph Inference Network for Document-Level Relation Extraction","authors":"Hailin Wang;Ke Qin;Guiduo Duan;Guangchun Luo","doi":"10.26599/BDMA.2022.9020051","DOIUrl":"https://doi.org/10.26599/BDMA.2022.9020051","url":null,"abstract":"Relation Extraction (RE) is to obtain a predefined relation type of two entities mentioned in a piece of text, e.g., a sentence-level or a document-level text. Most existing studies suffer from the noise in the text, and necessary pruning is of great importance. The conventional sentence-level RE task addresses this issue by a denoising method using the shortest dependency path to build a long-range semantic dependency between entity pairs. However, this kind of denoising method is scarce in document-level RE. In this work, we explicitly model a denoised document-level graph based on linguistic knowledge to capture various long-range semantic dependencies among entities. We first formalize a Syntactic Dependency Tree forest (SDT-forest) by introducing the syntax and discourse dependency relation. Then, the Steiner tree algorithm extracts a mention-level denoised graph, Steiner Graph (SG), removing linguistically irrelevant words from the SDT-forest. We then devise a slide residual attention to highlight word-level evidence on text and SG. Finally, the classification is established on the SG to infer the relations of entity pairs. We conduct extensive experiments on three public datasets. 
The results evidence that our method is beneficial to establish long-range semantic dependency and can improve the classification performance with longer texts.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 2","pages":"248-262"},"PeriodicalIF":13.6,"publicationDate":"2023-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10026288/10026508.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67984889","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DeepRetention: A Deep Learning Approach for Intron Retention Detection","authors":"Zhenpeng Wu;Jiantao Zheng;Jiashu Liu;Cuixiang Lin;Hong-Dong Li","doi":"10.26599/BDMA.2022.9020023","DOIUrl":"https://doi.org/10.26599/BDMA.2022.9020023","url":null,"abstract":"As the least understood mode of alternative splicing, Intron Retention (IR) is emerging as an interesting area and has attracted more and more attention in the field of gene regulation and disease studies. Existing methods detect IR exclusively based on one or a few predefined metrics describing local or summarized characteristics of retained introns. These metrics are not able to describe the pattern of sequencing depth of intronic reads, which is an intuitive and informative characteristic of retained introns. We hypothesize that incorporating the distribution pattern of intronic reads will improve the accuracy of IR detection. Here we present DeepRetention, a novel approach for IR detection by modeling the pattern of sequencing depth of introns. Due to the lack of a gold standard dataset of IR, we first compare DeepRetention with two state-of-the-art methods, i.e. iREAD and IRFinder, on simulated RNA-seq datasets with retained introns. The results show that DeepRetention outperforms these two methods. Next, DeepRetention performs well when it is applied to third-generation long-read RNA-seq data, while IRFinder and iREAD are not applicable to detecting IR from the third-generation sequencing data. Further, we show that IRs predicted by DeepRetention are biologically meaningful on an RNA-seq dataset from Alzheimer's Disease (AD) samples. The differential IRs are found to be significantly associated with AD based on statistical evaluation of an AD-specific functional gene network. The parent genes of differential IRs are enriched in AD-related functions. 
In summary, DeepRetention detects IR from a new angle of view, providing a valuable tool for IR analysis.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 2","pages":"115-126"},"PeriodicalIF":13.6,"publicationDate":"2023-01-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10026288/10026289.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67984953","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Ultra-Short Wave Communication Squelch Algorithm Based on Deep Neural Network","authors":"Yuanxin Xiang;Yi Lv;Wenqiang Lei;Jiancheng Lv","doi":"10.26599/BDMA.2022.9020025","DOIUrl":"https://doi.org/10.26599/BDMA.2022.9020025","url":null,"abstract":"The squelch problem of ultra-short wave communication under non-stationary noise and low Signal-to-Noise Ratio (SNR) in a complex electromagnetic environment is still challenging. To alleviate the problem, we proposed a squelch algorithm for ultra-short wave communication based on a deep neural network and the traditional energy decision method. The proposed algorithm first predicts the speech existence probability using a three-layer Gated Recurrent Unit (GRU) with the speech banding spectrum as the feature. Then it gets the final squelch result by combining the strength of the signal energy and the speech existence probability. Multiple simulations and experiments are done to verify the robustness and effectiveness of the proposed algorithm. We simulate the algorithm in three situations: the typical Amplitude Modulation (AM) and Frequency Modulation (FM) in the ultra-short wave communication under different SNR environments, the non-stationary burst-like noise environments, and the real received signal of the ultra-short wave radio. The experimental results show that the proposed algorithm performs better than the traditional squelch methods in all the simulations and experiments. 
In particular, the false alarm rate of the proposed squelch algorithm for non-stationary burst-like noise is significantly lower than that of traditional squelch methods.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 1","pages":"106-114"},"PeriodicalIF":13.6,"publicationDate":"2022-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/9962810/09962958.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68007926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}