UIMA based solution in pharma text

2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) Pub Date : 2015-11-09 DOI:10.1109/BIBM.2015.7359958

Aditya Rao, Thomas Joseph, V. Saipradeep, Rajgopal Srinivasan


Abstract

Background: Processing unstructured biomedical text has become crucial to pharma companies, for both legacy and current documentation. The Apache Unstructured Information Management Architecture (UIMA) framework addresses general information extraction requirements. In this poster we present two use cases that apply UIMA to specific unstructured biomedical information extraction tasks in pharma companies. The first use case requires extracting the values of specific fields from legacy clinical study documents. These fields are diverse; examples include study duration, study population, study arm, completion date and co-morbidity. The second use case deals with the accurate propagation of drug label information to digital channels such as drug-specific websites. Given the increased importance of such websites and mobile applications, pharma companies are looking to text-processing solutions to keep the information in these channels accurate and up to date.

Implementation: Both use cases were implemented using the UIMA framework. Each implementation comprises core UIMA modules and custom in-house modules built specifically for that use case. Key custom modules include document clustering, section identification, named entity recognition and relation identification. For the first use case, a total of 70 fields were extracted from clinical study reports, including study phase, study type, study duration, study start date and drug dosage. For the second use case, content was first extracted from drug websites, and fields such as target dosage, dosage regimen and study duration were then extracted from that content. The extracted field values were evaluated for accuracy against the label information.

Conclusion: Both implementations were successful, with a high degree of precision and recall. The second use case has moved from proof of concept to the pilot phase. While there is a need for comprehensive knowledge management solutions that deal with the exploration and management of biomedical text under the big data umbrella in pharma, we have seen that there also exist small, specific problems within the industry that can benefit from bespoke text-processing solutions built around frameworks such as UIMA.
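
The field-extraction and evaluation steps mentioned under Implementation can be illustrated with a minimal sketch. It is not the authors' code and does not use the UIMA API: the regular expressions, the field names, the `Annotation` type and the sample text are all hypothetical, chosen only to show how extracted field values might be compared with label information to obtain precision and recall.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A typed text span, loosely analogous to an annotation in a pipeline."""
    field: str
    value: str
    begin: int
    end: int


# Hypothetical patterns; a real system would rely on trained or curated modules.
FIELD_PATTERNS = {
    "study_phase": re.compile(r"\bPhase\s+(I{1,3}|IV)\b", re.IGNORECASE),
    "study_duration": re.compile(
        r"\b(\d+)\s*(?:-\s*)?(week|month|year)s?\b", re.IGNORECASE
    ),
    "drug_dosage": re.compile(
        r"\b(\d+(?:\.\d+)?)\s*(mg|mcg|g|ml)\b", re.IGNORECASE
    ),
}


def extract_fields(text):
    """Run every field pattern over the text and collect annotations."""
    annotations = []
    for field, pattern in FIELD_PATTERNS.items():
        for match in pattern.finditer(text):
            # Normalize to lowercase captured groups, e.g. "12-week" -> "12 week".
            value = " ".join(group.lower() for group in match.groups())
            annotations.append(
                Annotation(field, value, match.start(), match.end())
            )
    return annotations


def precision_recall(extracted, reference):
    """Compare extracted (field, value) pairs with reference pairs."""
    extracted, reference = set(extracted), set(reference)
    true_positives = len(extracted & reference)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical website content and hypothetical label values.
website_text = "In a 12-week Phase III study, patients received 50 mg once daily."
label_fields = {
    ("study_phase", "iii"),
    ("study_duration", "12 week"),
    ("drug_dosage", "100 mg"),  # deliberately differs from the website text
}

found = {(a.field, a.value) for a in extract_fields(website_text)}
precision, recall = precision_recall(found, label_fields)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy setup the website dosage disagrees with the label, so one of the three extracted values counts as a mismatch; that is the kind of discrepancy the second use case is meant to surface.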
