UIMA GRID: Distributed Large-scale Text Analysis
Authors: M. T. Egner, M. Lorch, Edd Biddle
Venue: Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07)
Published: 2007-05-14
DOI: 10.1109/CCGRID.2007.118
Citations: 19
Abstract
This paper shows how loosely coupled compute resources, managed by Condor, can be leveraged together with IBM OmniFind to implement a scalable environment for text analysis based on the Unstructured Information Management Architecture (UIMA). Text analysis can be used to extract valuable knowledge, such as entities and their relationships, from unstructured text data. When applied to large amounts of data, e.g., on the order of several million documents, the process can be too time-consuming to react to business needs. This becomes a particular problem when the rule sets, dictionaries, or taxonomies used by the text analysis components are changed to extract new information for a particular business purpose. Such changes may require that the entire set of documents be reanalyzed. In the scenario motivating this work, a constantly growing set of currently 10 million documents needs to be re-processed frequently to accommodate such changes. The text analysis algorithms deployed are very complex and compute-intensive, currently requiring about 20 CPU-years for a full re-analysis. Through the distributed architecture discussed in this paper, the re-analysis can be performed in one calendar month by opportunistically leveraging compute nodes from a heterogeneous Condor pool.
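The abstract's headline numbers imply a concrete pool size: 20 CPU-years of work finished in one calendar month requires roughly 240 concurrently busy nodes on average. The sketch below works through that arithmetic and a simple batching scheme for feeding such a pool. The workload figures (20 CPU-years, 10 million documents, one-month target) come from the abstract; the batch size, variable names, and ceiling-division job count are illustrative assumptions, not the authors' actual scheduling design.

```python
# Back-of-the-envelope sizing for the re-analysis workload described
# in the abstract.

CPU_YEARS_TOTAL = 20        # stated cost of a full re-analysis
TARGET_MONTHS = 1           # desired wall-clock time
DOCS_TOTAL = 10_000_000     # current corpus size (from the abstract)

# CPU-months of work divided by the wall-clock budget gives the
# average number of concurrently busy Condor nodes required.
cpu_months = CPU_YEARS_TOTAL * 12
avg_nodes_needed = cpu_months / TARGET_MONTHS
print(avg_nodes_needed)     # 240.0

# One plausible way to feed an opportunistic pool: split the corpus
# into fixed-size batches, each submitted as one Condor job that runs
# a UIMA analysis pipeline over its slice of documents.
BATCH_SIZE = 5_000                        # assumed documents per job
num_jobs = -(-DOCS_TOTAL // BATCH_SIZE)   # ceiling division
print(num_jobs)                           # 2000
```

Small batches suit an opportunistic pool: when a heterogeneous node is reclaimed by its owner, only one batch's worth of work is lost and re-queued rather than a large fraction of the corpus.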