Pipeline Design for Data Preparation for Social Media Analysis

IF 2.9 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Journal of Data and Information Quality Pub Date : 2023-05-20 DOI:10.1145/3597305

Carlo A. Bono, C. Cappiello, B. Pernici, Edoardo Ramalli, Monica Vitali


Citations: 0

Abstract

In a data-driven culture, in which analytics applications are the main resources for supporting decision-making, the use of high-quality datasets is mandatory to minimize errors and risks. For this reason, data analysis tasks need to be preceded by a data preparation pipeline. The design of such a pipeline is not trivial: the data analyst must carefully choose the appropriate operations considering several aspects. This is often done by trial and error, which does not always lead to the most effective solution. In addition, extracting information from social media poses specific problems: only posts relevant to the analysis should be considered, relevance depends on the context being considered, the content is multimedia, and automatic filters risk discarding informative posts. In this paper, we propose a systematic approach to support the design of pipelines that effectively extract a dataset relevant to the goal of a social media analysis. We provide a conceptual model for designing and annotating the data preparation pipeline with quality and performance information, thus giving the data analyst preliminary, context-aware information on the expected quality of the resulting dataset. The generation of metadata related to the processing tasks has been recognized as essential for enabling data sharing and reusability. To this end, the dataset resulting from the pipeline application is automatically annotated with provenance metadata that describe in detail all the activities the pipeline performed on it. As a case study, we consider the design of a pipeline for creating datasets of images extracted from social media in order to analyze behavioural aspects during COVID-19.
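The idea of a preparation pipeline whose steps carry quality annotations and whose output carries provenance metadata can be illustrated with a minimal Python sketch. This is not the authors' conceptual model: the names `Step`, `run_pipeline`, `expected_precision`, the two filters, and the sample posts are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Post = Dict[str, Any]


@dataclass
class Step:
    """One preparation step with a hypothetical quality annotation."""
    name: str
    apply: Callable[[List[Post]], List[Post]]
    expected_precision: float  # assumed value, not taken from the paper


def run_pipeline(steps: List[Step], posts: List[Post]) -> Tuple[List[Post], List[Dict[str, Any]]]:
    """Apply each step in order and record what it did as provenance."""
    provenance: List[Dict[str, Any]] = []
    for step in steps:
        input_count = len(posts)
        posts = step.apply(posts)
        provenance.append({
            "step": step.name,
            "input_count": input_count,
            "output_count": len(posts),
            "expected_precision": step.expected_precision,
        })
    return posts, provenance


# Hypothetical filters: keep posts that have an image, then posts mentioning "mask".
def keep_with_image(posts: List[Post]) -> List[Post]:
    return [p for p in posts if p.get("image")]


def keep_relevant(posts: List[Post]) -> List[Post]:
    return [p for p in posts if "mask" in p["text"].lower()]


steps = [
    Step("keep_with_image", keep_with_image, expected_precision=1.0),
    Step("keep_relevant", keep_relevant, expected_precision=0.8),
]

# Made-up sample posts, for illustration only.
posts = [
    {"id": 1, "text": "Mask on the bus", "image": "a.jpg"},
    {"id": 2, "text": "Hello world", "image": None},
    {"id": 3, "text": "Lunch today", "image": "b.jpg"},
]

dataset, provenance = run_pipeline(steps, posts)
for record in provenance:
    print(record)
```

Each provenance record notes how many posts a step received and how many it kept, so a reader of the resulting dataset can see where posts were filtered out and how much a given filter contributed to the loss.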




Source journal

ACM Journal of Data and Information Quality (COMPUTER SCIENCE, INFORMATION SYSTEMS)

CiteScore

4.10

Self-citation rate

4.80%

Number of published articles