Finite-State Machines for Mining Patterns in Very Large Text Repositories

Finite-State Methods and Natural Language Processing. Pub Date: 2009-07-11. DOI: 10.3233/978-1-58603-975-2-23

Wojciech Skut

Citations: 2

Abstract

The emergence of WWW search engines since the 1990s has changed the scale of many natural language processing applications. Text mining, information extraction and related tasks can now be applied to tens of billions of documents, which sets new efficiency standards for NLP algorithms. Finite-state machines are an obvious choice of formal framework for such applications. However, the scale of the task (the size of the searchable corpus and the number of patterns to be matched) often poses difficulties even for well-established finite-state string matching techniques. In my presentation, I will focus on the experience gained in implementing a finite-state matching library optimized for searching for large numbers of complex patterns in a WWW-scale repository of documents. Both algorithmic and implementation-related aspects of the task will be discussed. The library is based on OpenFST.
