Error Analysis on Harvesting Data over the Internet

Proceedings of the 11th PErvasive Technologies Related to Assistive Environments Conference, Pub Date: 2018-06-26, DOI: 10.1145/3197768.3201537

S. Kapidakis


Citations: 3

Abstract

Harvesting tasks gather information into a central repository. We studied 880560 harvesting tasks from 3446 harvesting services in 354 harvesting rounds over a period of 15 months; 382705 of these tasks failed, and the remaining tasks occasionally returned fewer records. A significant part of the Open Archives Initiative harvesting services never worked or have ceased working, while many other services fail occasionally. A harvesting task includes many stages of information exchange, and each of them may fail, with different consequences each time. We studied the reported warning messages, the number of records returned, and the required response time to discover relations among them. We found that about half of the harvesting tasks in each harvesting round fail, and that the number of failing tasks is slowly increasing. We developed a method of analysis that can be used to reverse engineer such complex network systems and to categorize the reasons for failure into useful classes. Our results do not indicate a new approach to harvesting or lead to breakthrough advice, but they make clear the complexity of the operation in an ever-changing networking environment and alert the reader that some facts that may be considered trivial actually are not. The results help us better understand the risks involved and design more reliable procedures and improved ways to monitor them closely.
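
As a rough check on the figures quoted in the abstract, the short Python sketch below derives a few aggregate ratios from them. The four input numbers come from the abstract; the function name `summarize` and the derived quantities are illustrative assumptions, not part of the paper's method. In particular, reading the coverage ratio as "not every service was attempted in every round" is an inference from the arithmetic, not a claim made by the author.

```python
# Figures quoted in the abstract.
TOTAL_TASKS = 880_560    # harvesting tasks studied
FAILED_TASKS = 382_705   # tasks that failed
SERVICES = 3_446         # harvesting services
ROUNDS = 354             # harvesting rounds


def summarize(total, failed, services, rounds):
    """Derive simple aggregate ratios (hypothetical helper, for illustration only)."""
    max_possible = services * rounds  # if every service were tried in every round
    return {
        "failure_rate": failed / total,        # overall fraction of failed tasks
        "tasks_per_round": total / rounds,     # average number of tasks per round
        "max_possible_tasks": max_possible,
        "coverage": total / max_possible,      # share of service/round pairs observed
    }


summary = summarize(TOTAL_TASKS, FAILED_TASKS, SERVICES, ROUNDS)
for name, value in summary.items():
    print(f"{name}: {value:.4f}" if isinstance(value, float) else f"{name}: {value}")
```

The overall failure rate that results, a little under one half, is in line with the abstract's statement that about half of the tasks in each round fail. The per-round trend the abstract mentions cannot be recovered from these totals alone.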
