Corpus refactoring: a feasibility study.

Journal of Biomedical Discovery and Collaboration. Pub Date: 2007-09-13. DOI: 10.1186/1747-5333-2-4

Helen L Johnson, William A Baumgartner, Martin Krallinger, K Bretonnel Cohen, Lawrence Hunter


Citation count: 24

Abstract

Background: Most biomedical corpora have not been used outside the lab that created them, even though the availability of the gold-standard evaluation data they provide is one of the rate-limiting factors for progress in biomedical text mining. Data suggest that one major factor affecting the use of a corpus outside its home laboratory is the format in which it is distributed. This paper tests the hypothesis that corpus refactoring (changing the format of a corpus without altering its semantics) is a feasible goal: that it can be accomplished with a semi-automatable process and in a time-efficient way. We used simple text processing methods and limited human validation to convert the Protein Design Group corpus into two new formats: WordFreak and embedded XML. We tracked the total time expended and the success rates of the automated steps.

Results: The refactored corpus is available for download from the BioNLP SourceForge website, http://bionlp.sourceforge.net. The total time expended was just over three person-weeks: about 102 hours of programming time (much of which is a one-time development cost) and 20 hours of manual validation of automatic outputs. The steps required to refactor any corpus are also presented.

Conclusion: We conclude that refactoring of publicly available corpora is a technically and economically feasible method for increasing the usage of data already available for evaluating biomedical language processing systems.
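
Illustrative sketch (not from the paper): the idea of changing a corpus's format without altering its semantics can be made concrete with a small round-trip check. The Python below assumes a hypothetical stand-off input of (start, end) character offsets and a hypothetical `protein` tag; neither is the actual Protein Design Group format nor the authors' code. It converts the stand-off annotations to embedded XML, then recovers the text and spans to confirm that nothing was lost.

```python
import re
from xml.sax.saxutils import escape, unescape


def to_embedded_xml(text, spans, tag="protein"):
    """Wrap each (start, end) character span of `text` in <tag>...</tag>.

    `spans` and `tag` are hypothetical illustrations, not the PDG format.
    """
    out, cursor = [], 0
    for start, end in sorted(spans):
        if start < cursor or end > len(text) or start >= end:
            raise ValueError(f"overlapping or out-of-range span: {(start, end)}")
        out.append(escape(text[cursor:start]))          # text before the span
        out.append(f"<{tag}>{escape(text[start:end])}</{tag}>")
        cursor = end
    out.append(escape(text[cursor:]))                   # trailing text
    return "".join(out)


def from_embedded_xml(xml, tag="protein"):
    """Recover (text, spans) from the embedded form for a round-trip check."""
    pattern = re.compile(rf"<{tag}>(.*?)</{tag}>", re.DOTALL)
    pieces, spans, cursor, length = [], [], 0, 0
    for m in pattern.finditer(xml):
        before = unescape(xml[cursor:m.start()])
        inner = unescape(m.group(1))
        pieces += [before, inner]
        start = length + len(before)
        spans.append((start, start + len(inner)))
        length = start + len(inner)
        cursor = m.end()
    pieces.append(unescape(xml[cursor:]))
    return "".join(pieces), spans


text = "RAD51 binds BRCA2 & p53."
spans = [(0, 5), (12, 17)]
xml = to_embedded_xml(text, spans)
# The refactoring preserves semantics if the original can be recovered exactly.
assert from_embedded_xml(xml) == (text, spans)
```

A round-trip test of this kind is one cheap way to automate part of the validation; it does not replace the manual checking of automatic outputs that the study reports.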


