Authors: Prasad Pande, Nikhil Rao
Published in: 2023 World Conference on Communication & Computing (WCONF), 2023-07-14
DOI: 10.1109/WCONF58270.2023.10235088 (https://doi.org/10.1109/WCONF58270.2023.10235088)
Unstructured Big Data Quality Measurements and Failed Data Re-Processing Approach
This paper discusses the role of data quality checks in identifying message loss during the processing of large volumes of semi-structured and unstructured data. We discuss the data quality dimensions applicable to semi-structured and unstructured data in the form of email data. Email data conforms to the RFC 822 format and can carry unstructured data, such as images and documents, as attachments. Along with the applicable data quality dimensions, we discuss the formulae used to evaluate data quality for each dimension. The paper then focuses on big-data re-processing, starting with the need for data re-processing and the identification of generic re-processing scenarios. This scenario identification step is important to ensure that the solution caters to all re-processing requirements. The paper further discusses the types of data re-processing, the core concepts on which the re-processing solution is based, and the approach for handling each type of re-processing. It subsequently discusses the principles of backlog processing for streaming data. Streaming data such as email flows continuously at a very high rate, so any downtime in the processing system results in an accumulation of backlog. This accumulated data must be processed fully without creating a domino effect of further backlogs. Finally, the paper discusses a statistical approach for forecasting the time required to clear a piled-up backlog of streaming data.
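The abstract does not reproduce the paper's formulae, so the following is only a minimal sketch of the two ideas it names, under stated assumptions. It assumes that message loss can be measured as a completeness ratio over message identifiers, and that backlog clearing time can be approximated by a constant-rate model in which the backlog drains at the difference between the processing rate and the arrival rate. The function names `completeness` and `backlog_clear_time` and all parameters are hypothetical and are not taken from the paper, whose statistical approach may differ.

```python
import math
from typing import Iterable


def completeness(source_ids: Iterable[str], processed_ids: Iterable[str]) -> float:
    """Fraction of source message IDs that also appear in the processed output.

    Assumption: each message carries a unique identifier (for email, the
    Message-ID header could play this role).
    """
    source = set(source_ids)
    if not source:
        return 1.0  # nothing was expected, so nothing was lost
    return len(source & set(processed_ids)) / len(source)


def backlog_clear_time(backlog: float, arrival_rate: float, processing_rate: float) -> float:
    """Estimated time to drain a backlog while new data keeps arriving.

    Assumption: both rates are constant and expressed in messages per unit
    time, so the backlog shrinks at (processing_rate - arrival_rate).
    """
    if backlog < 0 or arrival_rate < 0 or processing_rate < 0:
        raise ValueError("backlog and rates must be non-negative")
    if backlog == 0:
        return 0.0
    if processing_rate <= arrival_rate:
        return math.inf  # the backlog never clears; it holds steady or grows
    return backlog / (processing_rate - arrival_rate)


if __name__ == "__main__":
    # Three of four source messages were processed.
    print(completeness(["a", "b", "c", "d"], ["a", "b", "d"]))
    # 3.6 million queued messages, 500 msg/s arriving, 800 msg/s processed.
    print(backlog_clear_time(3_600_000, 500.0, 800.0))
```

The infinite result when the processing rate does not exceed the arrival rate corresponds to the situation the abstract warns about: without spare capacity, an outage's backlog cannot be cleared and further backlogs accumulate behind it.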