Adaptive and Cost-Effective Collection of High-Quality Data for Critical Infrastructure and Emergency Management in Smart Cities—Framework and Challenges

E. Bertino and M. R. Jahanshahi
{"title":"智慧城市关键基础设施和应急管理高质量数据的自适应和成本效益收集——框架和挑战","authors":"E. Bertino, M. Jahanshahi","doi":"10.1145/3190579","DOIUrl":null,"url":null,"abstract":"ing with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. © 2018 ACM 1936-1955/2018/05-ART1 $15.00 https://doi.org/10.1145/3190579 ACM Journal of Data and Information Quality, Vol. 10, No. 1, Article 1. Publication date: May 2018. 1:2 E. Bertino and M. R. Jahanshahi Fig. 1. A spatially incomplete object. agents (mobile phones, small drones, robots, sensors); 5G networks and edge computing processing [2]; crowdsourcing. In this article, we first briefly discuss relevant data quality requirements related to applications in the area of critical infrastructure and emergency management, although this framework can be extended to other applications. We then present a comprehensive framework for a real-time, adaptive, and cost-effective collection of high-quality data for such applications that leverage many of the above technologies, and elaborate on a few research challenges. 2 DATA QUALITY REQUIREMENTS Data quality is usually characterized by many different dimensions [3]. In our context, e.g., objects extracted from image data, key requirements include: —Spatial Completeness: The objects of interest should be “fully covered” by the image data. For example, an image reporting only half of a building crack would not have satisfactory spatial completeness (see Figure 1 for an example of a spatially incomplete object). —Temporal Completeness: The temporal evolution of the objects of interest should be covered as it is critical for accurate prediction. —Precision: The object images should be sharp and have high resolution. —Traceability: Information about the entire process, according to which data of interest was collected, processed, and transmitted, should be recorded; this is critical for identifying errors that lead to poor quality data about the objects of interest. —Minimality: The presence of non-relevant objects should be minimized. It is, however, important to remark that other quality requirements, such as currentness and consistency, are also relevant in our context. 3 DATA COLLECTION FRAMEWORK Our framework (see Figure 2) is based on two conceptual parties: data collection coordinator (referred to as base station (BS)); and data collectors (e.g., agents in charge of data gathering). The data collection coordinator is the interface system that coordinates the data acquisition tasks and data quality assessment. It interfaces on one side with the data users (e.g., end-users and applications) and on the other with data collectors. Given a data acquisition task and geographical area of interest, it allocates a number of data collectors, based on the capabilities of collectors, for the execution of the task, by also trying to optimize the cost of data acquisition and minimize ACM Journal of Data and Information Quality, Vol. 10, No. 1, Article 1. Publication date: May 2018. Adaptive and Cost-Effective Collection of High-Quality Data 1:3 Fig. 2. Data collection framework. the response time. Such allocation decisions can be basically supported by optimization techniques developed in the area of operation research. The main challenge is to determine the most suitable optimization techniques for dynamic contexts. 
3 DATA COLLECTION FRAMEWORK

Our framework (see Figure 2) is based on two conceptual parties: the data collection coordinator (referred to as the base station (BS)), and the data collectors (e.g., agents in charge of data gathering).

Fig. 2. Data collection framework.

The data collection coordinator is the interface system that coordinates the data acquisition tasks and the data quality assessment. It interfaces on one side with the data users (e.g., end-users and applications) and on the other with the data collectors. Given a data acquisition task and a geographical area of interest, it allocates a number of data collectors, based on the capabilities of the collectors, for the execution of the task, while also trying to optimize the cost of data acquisition and minimize the response time. Such allocation decisions can largely be supported by optimization techniques developed in the area of operations research. The main challenge is to determine the most suitable optimization techniques for dynamic contexts.
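As one illustration of the kind of operations-research technique that could support these allocation decisions, the sketch below casts a simplified version of the problem as an assignment problem and solves it with SciPy's Hungarian-algorithm implementation. The cost values, the collector descriptions, and the assumption that each sub-area needs exactly one collector are ours, for illustration only.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Rows: candidate data collectors; columns: sub-areas of the geographical area of interest.
# Each entry blends estimated acquisition cost and expected response time; the large
# penalty marks collectors whose capabilities do not meet the task's quality requirements.
UNSUITABLE = 1e6
cost = np.array([
    [12.0,  7.5, UNSUITABLE],   # high-resolution drone with on-board GPU
    [ 4.0,  5.0,        6.5],   # small, agile drone
    [ 9.0, UNSUITABLE,   3.0],  # ground robot with sampling equipment
])

rows, cols = linear_sum_assignment(cost)  # minimum-cost one-to-one assignment
for collector, area in zip(rows, cols):
    print(f"collector {collector} -> sub-area {area} (cost {cost[collector, area]:.1f})")
```

In a dynamic context the cost matrix would have to be recomputed as collector positions, battery levels, and link quality change, which is where the choice of optimization technique becomes the main challenge.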
The data collection coordinator must also assess the quality of the data with respect to the specific quality requirements provided as input by the data users. Since a data collection task may often be split among data collectors, the coordinator may have to integrate the various collected data to see whether, overall, the data meets the specified quality requirements. The coordinator may also support data enrichment, for example, by using GIS data [5] and data linkage with other sources.

The data collectors carry out the basic tasks of collecting data, assessing the quality of the collected data, and, based on this assessment, collecting more data. Notice that data collectors may have different capabilities. For example, some collectors may have equipment for very high-resolution imagery and powerful computing capabilities, and can run machine learning tools that require large storage and GPUs; these collectors may thus be able to perform highly accurate data quality analysis. Other collectors are very small and can therefore easily move very close to the objects and take images from very short distances; however, their capability for data quality assessment may be very limited. Finally, other collectors may be equipped with mechanical devices to take samples from the environment, such as a sample of soil or water, or to perform active testing by injecting dynamic disturbances through the collector's actuators at selected locations (e.g., exciting the structure with a hammer and collecting the characteristics of the propagated waves for damage detection). The decision about the right combination of data collectors for a data acquisition task is taken by the data collection coordinator based on its knowledge of the capabilities of each data collection device. However, as research in the area of distributed decision making for autonomous systems progresses, such decisions could even be taken autonomously by swarms of data collectors.
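The coordinator's knowledge of collector capabilities could be represented along the lines of the following sketch; the profile fields and collector types are hypothetical, chosen only to mirror the examples above.

```python
from dataclasses import dataclass


@dataclass
class CollectorProfile:
    """Illustrative capability profile the coordinator could keep for each collector."""
    collector_id: str
    max_image_resolution: tuple      # (height, width) of the on-board camera, if any
    has_gpu: bool                    # can run heavyweight machine-learning assessment
    min_standoff_distance_m: float   # how close it can safely get to the object
    can_take_physical_samples: bool  # soil/water sampling equipment
    can_actively_excite: bool        # actuators for active testing (e.g., hammer excitation)


profiles = [
    CollectorProfile("drone-A", (6000, 4000), True, 10.0, False, False),
    CollectorProfile("micro-drone-B", (1920, 1080), False, 0.5, False, False),
    CollectorProfile("ground-robot-C", (1280, 720), False, 0.0, True, True),
]
```

Such profiles would feed directly into the cost matrix of the allocation sketch above.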
Our framework is based on the notion of a data collection cycle, which is organized as a continuous loop consisting of two main phases: (a) data collection; (b) data quality assessment. Once data is collected, it is assessed for quality. If the quality is insufficient, further data is collected. Further data collection is typically tailored to improve the quality; for example, a data collector may be required to collect data of higher resolution for a specific object.
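A skeleton of this data collection cycle might look as follows; the collect/assess/refine interfaces are hypothetical and serve only to illustrate the control flow.

```python
def data_collection_cycle(task, collectors, assess_quality, max_rounds=5):
    """Continuous loop of (a) data collection and (b) data quality assessment.

    `assess_quality` is assumed to return (ok, deficiencies), where `deficiencies`
    describes what is missing (e.g., "higher-resolution imagery of object 17") and
    is used to tailor the next round of collection.
    """
    collected = []
    for _ in range(max_rounds):
        for c in collectors:
            collected.extend(c.collect(task))        # phase (a): data collection
        ok, deficiencies = assess_quality(collected, task.requirements)  # phase (b)
        if ok:
            return collected                          # quality requirements are met
        task = task.refined_for(deficiencies)         # tailor further data collection
    return collected                                  # best effort after max_rounds
```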
Data quality assessment is executed at three levels: (1) locally at the data collector; (2) collaboratively within the data collector swarms; (3) globally at the data collection coordinator. Assessments (1) and (2) may not always be possible. Assessment (1) may not be possible if the data collector does not have the capabilities to assess the data. Assessment (2) may not be possible if the swarm does not have the capabilities to assess the data, or if a data collector is isolated from the rest of the swarm. However, Assessment (2) may be highly desirable when connections with the BS are fragmented or unreliable. Adaptation capabilities are thus crucial to deal with those situations. A critical challenge is to develop approaches for automatically assessing the quality of the collected data and automatically determining which additional data needs to be collected to refine, complete, or enhance the quality of the initial data. In particular, when data collection is performed by a swarm of data collectors, the swarm should automatically assess the data and decide on further data collection.
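One plausible way to adapt among these three levels is sketched below; the predicates on the collector and on link quality are assumptions, and the policy itself is only an example of the kind of adaptation called for here.

```python
def choose_assessment_level(collector, swarm_reachable: bool, bs_link_reliable: bool) -> str:
    """Pick where to assess data quality: locally, within the swarm, or at the BS.

    Illustrative policy: prefer the cheapest feasible level, and favor swarm-level
    assessment when the connection to the BS is fragmented or unreliable.
    """
    if collector.can_assess_locally:
        return "local"      # level (1): on the data collector itself
    if swarm_reachable and collector.swarm_can_assess:
        return "swarm"      # level (2): collaborative assessment within the swarm
    if bs_link_reliable:
        return "global"     # level (3): at the data collection coordinator (BS)
    return "defer"          # isolated and unable to assess: store data and retry later
```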
The development of such a framework requires addressing several challenges:

—Optimized data-quality-driven allocation of data collection tasks to agents: Data collectors are typically heterogeneous with respect to hardware and software capabilities and with respect to special equipment for data acquisition; for example, a drone may have equipment for acquiring images at very high resolution. Also, data collectors may be located in different geographical regions. Data collection also depends on the quality requirements; for example, when performing an initial assessment, data of low quality may be acceptable. Therefore, it is important to design approaches that are able to support the optimal allocation of data acquisition tasks based on different constraints, requirements, and the data collectors' capabilities and status. Furthermore, it is important that each data collector has the capability of autonomously deciding which data to collect based on its own local assessment of the data that has already been collected. Thus, the allocation of data collection tasks is a combination of centralized decisions with decisions local to the data collectors and/or data collector swarms.
—Automatic (collaborative) data quality assessment: Techniques are needed to automatically assess the quality of the collected data with respect to the specific quality requirements; techniques based on machine learning are relevant here. The main issue is that such assessment may be carried out at three different levels (see the previous section), and thus tradeoffs may be needed between accuracy and resource usage. For example, at the level of the data collectors, resource usage should be minimized; however, minimizing resource use may lead to less accurate decisions. It is also critical to devise approaches by which such assessments can be carried out by data collector swarms. Finally, for assessments carried out at the BS level, it is important to determine the “optimal data transmission strategy,” namely whether the bulk data should be sent from the data collectors to the BS, or whether the data collectors should perform some local data reduction and then send the reduced data, based on the desired tradeoff between accuracy, communication costs, and the data collectors' resource usage (a minimal sketch of such a decision follows this list). We use the term data reduction here with a broad meaning, to indicate techniques that reduce the amount of data to be transmitted. Examples of such techniques include extracting features from images and sending only these features, discarding images that do not include objects of interest, discarding images of poor quality, and selecting relevant frames from videos. Data reduction is important when the computation, memory, power, and transmission bandwidth constraints of the data collectors are considered, particularly for large infrastructures.
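The sketch below illustrates the kind of transmission-strategy decision mentioned in the last challenge: it compares sending the bulk data against sending locally reduced data, under assumed (hypothetical) data sizes, link bandwidth, and estimated accuracy loss caused by the reduction.

```python
def choose_transmission_strategy(raw_bytes: float, reduced_bytes: float,
                                 bandwidth_bps: float, accuracy_loss: float,
                                 max_accuracy_loss: float = 0.05):
    """Decide whether to send bulk data to the BS or locally reduced data.

    Purely illustrative tradeoff: accept local data reduction (e.g., sending only
    extracted features, or only frames containing objects of interest) when the
    estimated accuracy loss stays below a threshold; otherwise pay the
    communication cost of transmitting the raw data.
    """
    raw_time_s = raw_bytes * 8 / bandwidth_bps
    reduced_time_s = reduced_bytes * 8 / bandwidth_bps
    if accuracy_loss <= max_accuracy_loss and reduced_time_s < raw_time_s:
        return "send_reduced", reduced_time_s
    return "send_raw", raw_time_s


# Example: 40 MB of imagery vs 2 MB of extracted features over a 5 Mbit/s link.
print(choose_transmission_strategy(40e6, 2e6, 5e6, accuracy_loss=0.03))
```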