SELFIE: Self-Aware Information Extraction from Digitized Biocollections

2017 IEEE 13th International Conference on e-Science (e-Science). Pub Date: 2017-10-01. DOI: 10.1109/eScience.2017.19

I. Alzuru, Andréa M. Matsunaga, Maurício O. Tsugawa, J. Fortes


Citations: 5

Abstract


SELFIE: Self-Aware Information Extraction from Digitized Biocollections

Biological collections store information with broad societal and environmental impact. In the last 15 years, after worldwide investments and crowdsourcing efforts, 25% of the collected specimens have been digitized, a process that includes imaging the text attached to specimens and then extracting information from the resulting images. This information extraction (IE) process is complex and therefore slow, and it typically involves human tasks. We propose a hybrid (Human-Machine) information extraction model that efficiently uses resources of different cost (machines, volunteers and/or experts) and speeds up the biocollections' digitization process, while striving to maintain the same quality as human-only IE processes. In the proposed model, called SELFIE, self-aware IE processes determine whether their output quality is satisfactory. If the quality is unsatisfactory, additional or alternative processes that yield higher-quality output at higher cost are triggered. The effectiveness of this model is demonstrated by three SELFIE workflows for the extraction of Darwin-core terms from specimens' images. Compared to the traditional human-driven IE approach, SELFIE workflows showed, on average, a reduction of 27% in the information-capture time and a decrease of 32% in the required number of humans and their associated cost, while the quality of the results was negligibly reduced by 0.27%.
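The escalation idea in the abstract (a cheap process checks its own output quality and hands off to a costlier one only when that quality is unsatisfactory) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Stage` and `Result` types, the confidence scores, the thresholds, the costs, and the toy `ocr` and `volunteer` extractors are all hypothetical and exist only to show the control flow.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical interface: an extractor takes an image identifier and returns
# (extracted value, self-assessed confidence in [0, 1]).
Extractor = Callable[[str], Tuple[str, float]]


@dataclass
class Stage:
    name: str          # e.g. "ocr", "volunteer", "expert" (illustrative labels)
    cost: float        # relative cost of running this stage (assumed units)
    extract: Extractor
    threshold: float   # minimum confidence considered satisfactory (assumed)


@dataclass
class Result:
    value: str
    stage: str         # stage that produced the returned value
    total_cost: float  # cost accumulated over every stage that ran
    accepted: bool     # True if some stage met its own threshold


def run_cascade(image_id: str, stages: List[Stage]) -> Result:
    """Run stages from cheapest to costliest, stopping at the first one
    whose self-assessed confidence meets its threshold."""
    if not stages:
        raise ValueError("at least one stage is required")
    total_cost = 0.0
    for stage in stages:
        value, confidence = stage.extract(image_id)
        total_cost += stage.cost
        if confidence >= stage.threshold:
            return Result(value, stage.name, total_cost, True)
    # No stage was satisfied: return the last (most expensive) output, flagged.
    return Result(value, stage.name, total_cost, False)


# Toy extractors standing in for real OCR and crowdsourcing steps.
def ocr(image_id: str) -> Tuple[str, float]:
    if image_id == "clear-label":
        return ("Quercus alba", 0.95)
    return ("Qu3rcus a1ba", 0.40)  # low confidence on a hard label


def volunteer(image_id: str) -> Tuple[str, float]:
    return ("Quercus alba", 0.90)


STAGES = [
    Stage("ocr", cost=1.0, extract=ocr, threshold=0.8),
    Stage("volunteer", cost=10.0, extract=volunteer, threshold=0.8),
]

easy = run_cascade("clear-label", STAGES)
hard = run_cascade("handwritten-label", STAGES)
print(easy.stage, easy.total_cost)
print(hard.stage, hard.total_cost)
```

In this toy setup only the hard label pays for the human stage, which is the mechanism by which a cascade of this kind could reduce human effort while keeping quality close to a human-only process.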
