{"title":"Intelligent and adaptive web data extraction system using convolutional and long short-term memory deep learning networks","authors":"Sudhir Kumar Patnaik;C. Narendra Babu;Mukul Bhave","doi":"10.26599/BDMA.2021.9020012","DOIUrl":"https://doi.org/10.26599/BDMA.2021.9020012","url":null,"abstract":"Data, which are collected using advanced web scraping technologies, are crucial to the growth of e-commerce in today's world of highly demanding hyper-personalized consumer experiences. However, core data extraction engines fail because they cannot adapt to the dynamic changes in website content. This study investigates an intelligent and adaptive web data extraction system with convolutional and Long Short-Term Memory (LSTM) networks to enable automated web page detection using the You Only Look Once (YOLO) algorithm and Tesseract LSTM to extract product details, which are detected as images from web pages. This state-of-the-art system does not need a core data extraction engine, and thus can adapt to dynamic changes in website layout. Experiments conducted on real-world retail cases demonstrate an image detection (precision) and character extraction accuracy (precision) of 97% and 99%, respectively. In addition, a mean average precision of 74%, with an input dataset of 45 objects or images, is obtained.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"4 4","pages":"279-297"},"PeriodicalIF":13.6,"publicationDate":"2021-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/9523493/09523501.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68022829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Coronavirus pandemic analysis through tripartite graph clustering in online social networks","authors":"Xueting Liao;Danyang Zheng;Xiaojun Cao","doi":"10.26599/BDMA.2021.9020010","DOIUrl":"https://doi.org/10.26599/BDMA.2021.9020010","url":null,"abstract":"The COVID-19 pandemic has hit the world hard. The reaction to pandemic-related issues has been pouring into social platforms, such as Twitter. Many public officials and governments use Twitter to make policy announcements. People keep close track of the related information and express their concerns about the policies on Twitter. It is beneficial yet challenging to derive important information or knowledge out of such Twitter data. In this paper, we propose a Tripartite Graph Clustering for Pandemic Data Analysis (TGC-PDA) framework that builds on the proposed models and analysis: (1) tripartite graph representation, (2) non-negative matrix factorization with regularization, and (3) sentiment analysis. We collect the tweets containing a set of keywords related to the coronavirus pandemic as the ground truth data. Our framework can detect the communities of Twitter users and analyze the topics that are discussed in the communities. The extensive experiments show that our TGC-PDA framework can effectively and efficiently identify the topics and correlations within the Twitter data for monitoring and understanding public opinions, which would provide policy makers with useful information and statistics for decision making.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"4 4","pages":"242-251"},"PeriodicalIF":13.6,"publicationDate":"2021-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/9523493/09523498.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68022831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LotusSQL: SQL engine for high-performance big data systems","authors":"Xiaohan Li;Bowen Yu;Guanyu Feng;Haojie Wang;Wenguang Chen","doi":"10.26599/BDMA.2021.9020009","DOIUrl":"https://doi.org/10.26599/BDMA.2021.9020009","url":null,"abstract":"In recent years, Apache Spark has become the de facto standard for big data processing. SparkSQL is a module offering support for relational analysis on Spark with Structured Query Language (SQL). SparkSQL provides convenient data processing interfaces. Despite its efficient optimizer, SparkSQL still suffers from the inefficiency of Spark resulting from the Java virtual machine and the unnecessary data serialization and deserialization. Adopting native languages such as C++ could help to avoid such bottlenecks. Benefiting from a bare-metal runtime environment and template usage, systems with C++ interfaces usually achieve superior performance. However, the complexity of native languages also increases the required programming and debugging efforts. In this work, we present LotusSQL, an engine to provide SQL support for dataset abstraction on a native backend Lotus. We employ a convenient SQL processing framework to deal with frontend jobs. Advanced query optimization technologies are added to improve the quality of execution plans. Above the storage design and user interface of the compute engine, LotusSQL implements a set of structured dataset operations with high efficiency and integrates them with the frontend. Evaluation results show that LotusSQL achieves a speedup of up to 9× in certain queries and outperforms SparkSQL in a standard query benchmark by more than 2× on average.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"4 4","pages":"252-265"},"PeriodicalIF":13.6,"publicationDate":"2021-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/9523493/09523499.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68022830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A deep-learning prediction model for imbalanced time series data forecasting","authors":"Chenyu Hou;Jiawei Wu;Bin Cao;Jing Fan","doi":"10.26599/BDMA.2021.9020011","DOIUrl":"https://doi.org/10.26599/BDMA.2021.9020011","url":null,"abstract":"Time series forecasting has attracted wide attention in recent decades. However, some time series are imbalanced and show different patterns between special and normal periods, which degrades the prediction accuracy for special periods. In this paper, we aim to develop a unified model to alleviate the imbalance and thus improve the prediction accuracy for special periods. This task is challenging for two reasons: (1) the temporal dependency of series, and (2) the tradeoff between mining similar patterns and distinguishing different distributions between different periods. To tackle these issues, we propose a self-attention-based time-varying prediction model with a two-stage training strategy. First, we use an encoder-decoder module with the multi-head self-attention mechanism to extract common patterns of time series. Then, we propose a time-varying optimization module to optimize the results of special periods and eliminate the imbalance. Moreover, we propose reverse distance attention in place of traditional dot attention to highlight the importance of similar historical values to forecast results. Finally, extensive experiments show that our model performs better than other baselines in terms of mean absolute error and mean absolute percentage error.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"4 4","pages":"266-278"},"PeriodicalIF":13.6,"publicationDate":"2021-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/9523493/09523500.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68022965","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Call for papers: Special issue on intelligent systems and Internet of Things","authors":"","doi":"10.26599/BDMA.2021.9020007","DOIUrl":"https://doi.org/10.26599/BDMA.2021.9020007","url":null,"abstract":"Big Data Mining and Analytics is an international academic journal sponsored by Tsinghua University and published quarterly. It focuses on technologies to enable and accelerate big data discovery. All published papers are available on the IEEE Xplore Digital Library in open access mode. The journal is indexed and abstracted in Ei Compendex, Scopus, DBLP Computer Science, Google Scholar, and CNKI.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"4 3","pages":"222-222"},"PeriodicalIF":13.6,"publicationDate":"2021-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/9430128/09430138.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67859023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A multitask multiview neural network for end-to-end aspect-based sentiment analysis","authors":"Yong Bie;Yan Yang","doi":"10.26599/BDMA.2021.9020003","DOIUrl":"https://doi.org/10.26599/BDMA.2021.9020003","url":null,"abstract":"Aspect-based sentiment analysis (ABSA) consists of two subtasks: aspect term extraction and aspect sentiment prediction. Existing methods deal with both subtasks one by one in a pipeline manner, which causes problems in performance and real-world application. This study investigates the end-to-end ABSA and proposes a novel multitask multiview network (MTMVN) architecture. Specifically, the architecture takes the unified ABSA as the main task with the two subtasks as auxiliary tasks. Meanwhile, the representation obtained from the branch network of the main task is regarded as the global view, whereas the representations of the two subtasks are considered two local views with different emphases. Through multitask learning, the main task can be facilitated by additional accurate aspect boundary information and sentiment polarity information. By enhancing the correlations between the views under the idea of multiview learning, the representation of the global view can be optimized to improve the overall performance of the model. The experimental results on three benchmark datasets show that the proposed method exceeds the existing pipeline methods and end-to-end methods, proving the superiority of our MTMVN architecture.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"4 3","pages":"195-207"},"PeriodicalIF":13.6,"publicationDate":"2021-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/9430128/09430135.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67859134","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AIPerf: Automated machine learning as an AI-HPC benchmark","authors":"Zhixiang Ren;Yongheng Liu;Tianhui Shi;Lei Xie;Yue Zhou;Jidong Zhai;Youhui Zhang;Yunquan Zhang;Wenguang Chen","doi":"10.26599/BDMA.2021.9020004","DOIUrl":"https://doi.org/10.26599/BDMA.2021.9020004","url":null,"abstract":"The plethora of complex Artificial Intelligence (AI) algorithms and available High-Performance Computing (HPC) power stimulates the expeditious development of AI components with heterogeneous designs. Consequently, the need for cross-stack performance benchmarking of AI-HPC systems has rapidly emerged. In particular, the de facto HPC benchmark, LINPACK, cannot reflect the AI computing power and input/output performance without a representative workload. Current popular AI benchmarks, such as MLPerf, have a fixed problem size and therefore limited scalability. To address these issues, we propose an end-to-end benchmark suite utilizing automated machine learning, which not only represents real AI scenarios, but also is auto-adaptively scalable to various scales of machines. We implement the algorithms in a highly parallel and flexible way to ensure the efficiency and optimization potential on diverse systems with customizable configurations. We utilize Operations Per Second (OPS), which is measured in an analytical and systematic approach, as a major metric to quantify the AI performance. We perform evaluations on various systems to ensure the benchmark's stability and scalability, from 4 nodes with 32 NVIDIA Tesla T4 (56.1 Tera-OPS measured) up to 512 nodes with 4096 Huawei Ascend 910 (194.53 Peta-OPS measured), and the results show near-linear weak scalability. With a flexible workload and single metric, AIPerf can easily scale on and rank AI-HPC, providing a powerful benchmark suite for the coming supercomputing era.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"4 3","pages":"208-220"},"PeriodicalIF":13.6,"publicationDate":"2021-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/9430128/09430136.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68026694","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A survey on algorithms for intelligent computing and smart city applications","authors":"Zhao Tong;Feng Ye;Ming Yan;Hong Liu;Sunitha Basodi","doi":"10.26599/BDMA.2020.9020029","DOIUrl":"https://doi.org/10.26599/BDMA.2020.9020029","url":null,"abstract":"With the rapid development of human society, the urbanization of the world's population is also progressing rapidly. Urbanization has brought many challenges and problems to the development of cities. For example, the urban population is under excessive pressure, various natural resources and energy are increasingly scarce, and environmental pollution is increasing. Therefore, the original urban model has to be changed to enable people to live in greener and more sustainable cities, thus providing them with a more convenient and comfortable living environment. The new urban framework, the smart city, provides excellent opportunities to meet these challenges, while solving urban problems at the same time. At this stage, many countries are actively responding to calls for smart city development plans. This paper investigates the current stage of the smart city. First, it introduces the background of smart city development and gives a brief definition of the concept of the smart city. Second, it describes the framework of a smart city in accordance with the given definition. Finally, various intelligent algorithms to make cities smarter, along with specific examples, are discussed and analyzed.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"4 3","pages":"155-172"},"PeriodicalIF":13.6,"publicationDate":"2021-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/9430128/09430132.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68026696","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improvising personalized travel recommendation system with recency effects","authors":"Paromita Nitu;Joseph Coelho;Praveen Madiraju","doi":"10.26599/BDMA.2020.9020026","DOIUrl":"https://doi.org/10.26599/BDMA.2020.9020026","url":null,"abstract":"A travel recommendation system based on social media activity provides a customized place of interest to accommodate user-specific needs and preferences. In general, the user's inclination towards travel destinations is subject to change over time. In this project, we have analyzed users' Twitter data, as well as that of their friends and followers, in a timely fashion to understand recent travel interest. A machine learning classifier identifies tweets relevant to travel. The travel tweets are then used to obtain personalized travel recommendations. Unlike most personalized recommendation systems, our proposed model takes into account a user's most recent interest by incorporating a time-sensitive recency weight into the model. Our proposed model has outperformed the existing personalized place of interest recommendation model, and the overall accuracy is 75.23%.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"4 3","pages":"139-154"},"PeriodicalIF":13.6,"publicationDate":"2021-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/9430128/09430131.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68026697","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Call for papers: Special issue on unlocking genetic diseases by integrating machine learning techniques and medical data","authors":"","doi":"10.26599/BDMA.2021.9020005","DOIUrl":"https://doi.org/10.26599/BDMA.2021.9020005","url":null,"abstract":"Big Data Mining and Analytics is an international academic journal sponsored by Tsinghua University and published quarterly. It focuses on technologies to enable and accelerate big data discovery. All published papers are available on the IEEE Xplore Digital Library in open access mode. The journal is indexed and abstracted in Ei Compendex, Scopus, DBLP Computer Science, Google Scholar, and CNKI.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"4 3","pages":"221-221"},"PeriodicalIF":13.6,"publicationDate":"2021-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/9430128/09430137.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"67859135","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}