Unstructured Big Data Quality Measurements and Failed Data Re-Processing Approach

2023 World Conference on Communication & Computing (WCONF) Pub Date : 2023-07-14 DOI:10.1109/WCONF58270.2023.10235088

Prasad Pande, Nikhil Rao

Abstract

This paper discusses the role of data quality checks in identifying message loss during the processing of large volumes of semi-structured and unstructured data. We discuss the data quality dimensions applicable to semi-structured and unstructured data in the form of email data. Email data conforms to the RFC 822 format and can carry unstructured data, such as images and documents, as attachments. Along with the applicable data quality dimensions, we discuss the formulae used to evaluate data quality for each dimension. The paper then focuses on big-data re-processing, starting with the need for data re-processing and the identification of generic re-processing scenarios. This scenario identification step is important to ensure that the solution caters to all re-processing requirements. The paper further discusses the types of data re-processing, the core concepts on which the re-processing solution is based, and the approach for handling each type of re-processing. It subsequently discusses the principles of backlog processing for streaming data. Streaming data such as email flows continuously at a very high rate, so any downtime in the processing system results in an accumulating data backlog. This accumulated data must be processed fully without setting off a domino effect of further backlogs. Finally, the paper discusses a statistical approach for forecasting the time required to clear a piled-up backlog of streaming data.
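
The abstract does not reproduce the paper's formulae, so the following is only a minimal sketch of the kind of calculation it describes, not the authors' method. It assumes three illustrative measures: a header-completeness ratio for RFC 822 style messages, a message-loss ratio between a received count and a processed count, and a simple deterministic estimate of backlog clearance time (backlog divided by the surplus of processing rate over arrival rate) standing in for the statistical forecast. The function names, the list of required headers, and the rate parameters are all hypothetical.

```python
from email import message_from_string

# Hypothetical choice of headers that a "complete" message must carry.
REQUIRED_HEADERS = ("From", "To", "Date", "Message-ID")


def header_completeness(raw_messages, required=REQUIRED_HEADERS):
    """Fraction (0.0 to 1.0) of messages that carry every required header."""
    if not raw_messages:
        return 1.0  # nothing to check, so nothing is incomplete
    complete = 0
    for raw in raw_messages:
        msg = message_from_string(raw)  # standard-library RFC 822 style parser
        # msg.get() returns None for a missing header, "" for an empty one.
        if all(msg.get(name) for name in required):
            complete += 1
    return complete / len(raw_messages)


def message_loss_ratio(received_count, processed_count):
    """Share of received messages that never reached the processed store."""
    if received_count == 0:
        return 0.0
    return (received_count - processed_count) / received_count


def backlog_clear_time(backlog, arrival_rate, processing_rate):
    """Time units needed to drain `backlog` while new messages keep arriving.

    Assumed deterministic estimate: backlog / (processing_rate - arrival_rate).
    Returns None when processing is not faster than arrival, because the
    backlog would then never drain (the "domino" case).
    """
    surplus = processing_rate - arrival_rate
    if surplus <= 0:
        return None
    return backlog / surplus
```

Under these assumptions, a backlog of 90,000 messages with 100 messages per second arriving and 250 messages per second of processing capacity gives `backlog_clear_time(90000, 100, 250)`, which is 600 seconds. If the processing rate is not higher than the arrival rate, the function returns `None`, which corresponds to the situation the abstract warns about: the backlog keeps growing instead of clearing.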
