Associative/parallel processors for searching very large textual data bases

R. M. Bird, J. Tu, R. Worthy
Computer Architecture Workshop
DOI: 10.1145/800180.810247 (https://doi.org/10.1145/800180.810247)
Citations: 44

Abstract

This paper describes an approach to solving a major problem in the information processing sciences: that of searching very large (5-50 billion characters) data bases of unstructured free text for random queries within a reasonable time and at an affordable price.

The need by information specialists and knowledge workers for large, fast, low-cost text and document retrieval systems is growing rapidly. Conventional approaches to the problem have usually depended upon expensive, general purpose computers, upon special preprocessing of the textual data (e.g., file inverting, indexing, abstracting), and upon elaborate, costly software. The resulting retrieval systems often cost hundreds of dollars per query, and the full scanning of an uninverted, unstructured billion-byte textual data base could take hours of computer services. However, in spite of these restrictions, such full text search systems have proved useful and even indispensable for many applications.

Computer technology of the late 1960's and the 1970's, in both hardware and software (e.g., minicomputers, low-cost, high density disk storage, "chip" electronics, natural language query systems), has made it practical to build special purpose, low-cost text retrieval systems. Such a system has been built, tested, and is now in a production stage. The system, called the Associative File Processor (AFP), uses a conventional minicomputer (DEC's PDP-11/45) for control, off-the-shelf high density disks for storage, a special purpose parallel search module as a text term detector, and query and retrieval software. The AFP is currently being field tested at two sites. Full text, parallel searches on un-preprocessed textual data bases are being performed at an effective matching rate of 4 billion bytes per second (8K byte key memory times 500 Kbyte/second data stream). Estimated costs are 10 to 25 cents per query for a one billion byte data base. The costs per query and the time for searching increase in a linear fashion as the data base grows. A basic architecture for the AFP is described and an implemented version is discussed. A more powerful term detector module is also under development. This system is designed around a finite state automaton algorithm.
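The figures and the term detector idea in the abstract can be illustrated with a small sketch. It is not the AFP design: the abstract does not say which finite state automaton algorithm is used, so the code below substitutes the well-known Aho-Corasick construction as one possible example. The function names are hypothetical, "K" is assumed to mean 1,000 so that the product matches the quoted 4 billion, and the scan-time helper assumes a single 500 Kbyte/second stream, which the abstract does not state.

```python
from collections import deque


def effective_matching_rate(key_memory_bytes, stream_bytes_per_sec):
    """Byte comparisons per second if every key byte is matched against every streamed byte."""
    return key_memory_bytes * stream_bytes_per_sec


def full_scan_seconds(database_bytes, stream_bytes_per_sec):
    """Time for one sequential pass; linear in data base size (single stream assumed)."""
    return database_bytes / stream_bytes_per_sec


def build_automaton(terms):
    """Build Aho-Corasick tables: goto transitions, failure links, outputs per state."""
    goto, fail, out = [{}], [0], [set()]
    for term in terms:
        state = 0
        for ch in term:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(term)
    # Breadth-first pass: depth-1 states keep failure link 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit terms that end at the failure state (suffix matches).
            out[nxt] |= out[fail[nxt]]
            queue.append(nxt)
    return goto, fail, out


def find_terms(text, automaton):
    """Scan text once and return (start_index, term) for every occurrence of any term."""
    goto, fail, out = automaton
    state, hits = 0, []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for term in out[state]:
            hits.append((pos - len(term) + 1, term))
    return hits


if __name__ == "__main__":
    print(effective_matching_rate(8_000, 500_000))
    print(full_scan_seconds(1_000_000_000, 500_000))
    automaton = build_automaton(["search", "arch", "disk"])
    print(sorted(find_terms("parallel search of disk files", automaton)))
```

The point of the automaton is that all query terms are detected in a single pass over the data stream, so scan time depends on the size of the data base rather than on the number of terms, which is consistent with the linear growth in search time that the abstract reports.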