{"title":"索引和查询重叠的结构","authors":"Faegheh Hasibi","doi":"10.1145/2484028.2484234","DOIUrl":null,"url":null,"abstract":"Structural information retrieval is mostly based on hierarchy. However, in real life information is not purely hierarchical and structural elements may overlap each other. The most common example is a document with two distinct structural views, where the logical view is section/ subsection/ paragraph and the physical view is page/ line. Each single structural view of this document is a hierarchy and the components are either disjoint or nested inside each other. The overlapping issue arises when one structural element cannot be neatly nested into others. For instance, when a paragraph starts in one page and terminates in the next page. Similar situations can appear in videos and other multimedia contents, where temporal or spatial constituents of a media file may overlap each other. Querying over overlapping structures is one of the challenges of large scale search engines. For instance, FSIS (FAST Search for Internet Sites) [1] is a Microsoft search platform, which encounters overlaps while analysing content of textual data. FSIS uses a pipeline process to extract structure and semantic information of documents. The pipeline contains several components, where each component writes annotations to the input data. These annotations consist of structural elements and some of them may overlap each other. Handling overlapping structures in search engines will add a novel capability of searching, where users can ask queries such as \"Find all the words that overlap two lines\" or \"Find the music played during Intro scene of Avatar movie\". There are also other use cases, where the user of the search engine is not a person, but is a specific program with complex, non-traditional information retrieval needs. This research attempts to index overlapping structures and provide efficient query processing for large-scale search engines. The current research on overlapping structures revolves around encoding and modelling data, while indexing and query processing methods need investigations. Moreover, due to intrinsic complexity of overlaps, XML indexing and query processing techniques cannot be used for overlapping structures. Hence, my research on overlapping structures comprises three main parts: (1) an indexing method that supports both hierarchies and overlaps; (2) a query processing method based on the indexing technique and (3) a query language that is close to natural language and supports both full text and structural queries. Our approach for indexing overlaps is to adapt the PrePost [3] XML indexing method to overlapping structures. This method labels each node with its start and end positions and requires modest storage space. However, PrePost indexing cannot be used for overlapping nodes. To overcome this issue, we need to define a data model for overlapping structures. Since hierarchies are not sufficient to describe overlapping components, several data structures have been introduced by scholars. One of the most interesting data models is GODDAG [2]. GODDAG is a tree-like graph, where nodes can have multiple parentage. This model can support overlaps as well as simple inheritance. Our proposed data model for indexing overlaps is such a tree-like structure, where we can define overlapping, parent-child and ancestor-descendant relationships.","PeriodicalId":178818,"journal":{"name":"Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval","volume":"375 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Indexing and querying overlapping structures\",\"authors\":\"Faegheh Hasibi\",\"doi\":\"10.1145/2484028.2484234\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Structural information retrieval is mostly based on hierarchy. However, in real life information is not purely hierarchical and structural elements may overlap each other. The most common example is a document with two distinct structural views, where the logical view is section/ subsection/ paragraph and the physical view is page/ line. Each single structural view of this document is a hierarchy and the components are either disjoint or nested inside each other. The overlapping issue arises when one structural element cannot be neatly nested into others. For instance, when a paragraph starts in one page and terminates in the next page. Similar situations can appear in videos and other multimedia contents, where temporal or spatial constituents of a media file may overlap each other. Querying over overlapping structures is one of the challenges of large scale search engines. For instance, FSIS (FAST Search for Internet Sites) [1] is a Microsoft search platform, which encounters overlaps while analysing content of textual data. FSIS uses a pipeline process to extract structure and semantic information of documents. The pipeline contains several components, where each component writes annotations to the input data. These annotations consist of structural elements and some of them may overlap each other. Handling overlapping structures in search engines will add a novel capability of searching, where users can ask queries such as \\\"Find all the words that overlap two lines\\\" or \\\"Find the music played during Intro scene of Avatar movie\\\". There are also other use cases, where the user of the search engine is not a person, but is a specific program with complex, non-traditional information retrieval needs. This research attempts to index overlapping structures and provide efficient query processing for large-scale search engines. The current research on overlapping structures revolves around encoding and modelling data, while indexing and query processing methods need investigations. Moreover, due to intrinsic complexity of overlaps, XML indexing and query processing techniques cannot be used for overlapping structures. Hence, my research on overlapping structures comprises three main parts: (1) an indexing method that supports both hierarchies and overlaps; (2) a query processing method based on the indexing technique and (3) a query language that is close to natural language and supports both full text and structural queries. Our approach for indexing overlaps is to adapt the PrePost [3] XML indexing method to overlapping structures. This method labels each node with its start and end positions and requires modest storage space. However, PrePost indexing cannot be used for overlapping nodes. To overcome this issue, we need to define a data model for overlapping structures. Since hierarchies are not sufficient to describe overlapping components, several data structures have been introduced by scholars. One of the most interesting data models is GODDAG [2]. GODDAG is a tree-like graph, where nodes can have multiple parentage. This model can support overlaps as well as simple inheritance. Our proposed data model for indexing overlaps is such a tree-like structure, where we can define overlapping, parent-child and ancestor-descendant relationships.\",\"PeriodicalId\":178818,\"journal\":{\"name\":\"Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval\",\"volume\":\"375 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-07-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2484028.2484234\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2484028.2484234","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1
摘要
结构信息检索主要基于层次结构。然而,在现实生活中,信息并不是纯粹分层的,结构元素可能相互重叠。最常见的例子是具有两个不同结构视图的文档,其中逻辑视图是节/分段/段落,物理视图是页/行。该文档的每个单一结构视图都是一个层次结构,组件要么不相交,要么彼此嵌套。当一个结构元素不能整齐地嵌套到其他元素中时,就会出现重叠问题。例如,当一个段落从一页开始,在下一页结束时。类似的情况也可能出现在视频和其他多媒体内容中,其中媒体文件的时间或空间成分可能相互重叠。对重叠结构的查询是大型搜索引擎面临的挑战之一。例如,FSIS (FAST Search For Internet Sites)[1]是微软的一个搜索平台,它在分析文本数据的内容时遇到了重叠。FSIS使用流水线过程提取文档的结构和语义信息。该管道包含几个组件,其中每个组件向输入数据写入注释。这些注释由结构元素组成,其中一些可能相互重叠。在搜索引擎中处理重叠结构将增加一种新颖的搜索功能,用户可以询问诸如“查找所有重叠两条线的单词”或“查找阿凡达电影介绍场景中播放的音乐”之类的问题。还有其他用例,其中搜索引擎的用户不是人,而是具有复杂非传统信息检索需求的特定程序。本研究试图对重叠结构进行索引,为大规模搜索引擎提供高效的查询处理。目前对重叠结构的研究主要围绕着数据的编码和建模,而索引和查询处理方法还有待研究。此外,由于重叠固有的复杂性,XML索引和查询处理技术不能用于重叠结构。因此,我对重叠结构的研究包括三个主要部分:(1)同时支持层次和重叠的索引方法;(2)一种基于索引技术的查询处理方法;(3)一种接近自然语言、支持全文查询和结构查询的查询语言。我们索引重叠的方法是将PrePost [3] XML索引方法应用于重叠的结构。该方法用开始和结束位置标记每个节点,并且需要适度的存储空间。但是,PrePost索引不能用于重叠节点。为了克服这个问题,我们需要为重叠结构定义一个数据模型。由于层次结构不足以描述重叠的组件,学者们引入了几种数据结构。最有趣的数据模型之一是GODDAG[2]。GODDAG是一个树状图,其中的节点可以有多个父节点。该模型既支持简单继承,也支持重叠。我们提出的索引重叠的数据模型就是这样一个树状结构,我们可以在其中定义重叠、父子关系和祖先-后代关系。
Structural information retrieval is mostly based on hierarchy. However, in real life information is not purely hierarchical and structural elements may overlap each other. The most common example is a document with two distinct structural views, where the logical view is section/ subsection/ paragraph and the physical view is page/ line. Each single structural view of this document is a hierarchy and the components are either disjoint or nested inside each other. The overlapping issue arises when one structural element cannot be neatly nested into others. For instance, when a paragraph starts in one page and terminates in the next page. Similar situations can appear in videos and other multimedia contents, where temporal or spatial constituents of a media file may overlap each other. Querying over overlapping structures is one of the challenges of large scale search engines. For instance, FSIS (FAST Search for Internet Sites) [1] is a Microsoft search platform, which encounters overlaps while analysing content of textual data. FSIS uses a pipeline process to extract structure and semantic information of documents. The pipeline contains several components, where each component writes annotations to the input data. These annotations consist of structural elements and some of them may overlap each other. Handling overlapping structures in search engines will add a novel capability of searching, where users can ask queries such as "Find all the words that overlap two lines" or "Find the music played during Intro scene of Avatar movie". There are also other use cases, where the user of the search engine is not a person, but is a specific program with complex, non-traditional information retrieval needs. This research attempts to index overlapping structures and provide efficient query processing for large-scale search engines. The current research on overlapping structures revolves around encoding and modelling data, while indexing and query processing methods need investigations. Moreover, due to intrinsic complexity of overlaps, XML indexing and query processing techniques cannot be used for overlapping structures. Hence, my research on overlapping structures comprises three main parts: (1) an indexing method that supports both hierarchies and overlaps; (2) a query processing method based on the indexing technique and (3) a query language that is close to natural language and supports both full text and structural queries. Our approach for indexing overlaps is to adapt the PrePost [3] XML indexing method to overlapping structures. This method labels each node with its start and end positions and requires modest storage space. However, PrePost indexing cannot be used for overlapping nodes. To overcome this issue, we need to define a data model for overlapping structures. Since hierarchies are not sufficient to describe overlapping components, several data structures have been introduced by scholars. One of the most interesting data models is GODDAG [2]. GODDAG is a tree-like graph, where nodes can have multiple parentage. This model can support overlaps as well as simple inheritance. Our proposed data model for indexing overlaps is such a tree-like structure, where we can define overlapping, parent-child and ancestor-descendant relationships.