Enabling data-centric AI through data quality management and data literacy

Impact factor: 1.3 | JCR: Q4 | Category: Computer Science, Information Systems

IT-Information Technology | Pub Date: 2022-02-18 | DOI: 10.1515/itit-2021-0048

Ziawasch Abedjan

Journal article. IT-Information Technology 64(1), pp. 67-70. Not open access.

Citations: 2


Enabling data-centric AI through data quality management and data literacy

Abstract: Data is being produced at an intractable pace. At the same time, there is an insatiable interest in using such data for use cases that span all imaginable domains, including health, climate, business, and gaming. Beyond the novel socio-technical challenges that surround data-driven innovations, there are still open data processing challenges that impede the usability of data-driven techniques. It is commonly acknowledged that overcoming heterogeneity of data with regard to syntax and semantics to combine various sources for a common goal is a major bottleneck. Furthermore, the quality of such data is always under question as the data science pipelines today are highly ad-hoc and without the necessary care for provenance. Finally, quality criteria that go beyond the syntactical and semantic correctness of individual values but also incorporate population-level constraints, such as equal parity and opportunity with regard to protected groups, play a more and more important role in this process. Traditional research on data integration was focused on post-merger integration of companies, where customer or product databases had to be integrated. While this is often hard enough, today the challenges aggravate because of the fact that more stakeholders are using data analytics tools to derive domain-specific insights. I call this phenomenon the democratization of data science, a process, which is both challenging and necessary. Novel systems need to be user-friendly in a way that not only trained database admins can handle them but also less computer science savvy stakeholders. Thus, our research focuses on scalable example-driven techniques for data preparation and curation. Furthermore, we believe that it is important to educate the breadth of society on implications of a data-driven world and actively promote the concept of data literacy as a fundamental competence.


Source journal

IT-Information Technology (Computer Science, Information Systems)

CiteScore

3.80

Self-citation rate

0.00%

Number of articles published