Ulrich Schäfer, B. Kiefer, Christian Spurk, Jörg Steffen, Rui Wang, Benjamin Weitz, Magdalena Wolska
Journal: The Liber Quarterly
Published: 2013-02-21
DOI: 10.18352/LQ.8091 (https://doi.org/10.18352/LQ.8091)
The Searchbench - Combining Sentence-semantic, Full-text and Bibliographic Search in Digital Libraries
We describe a novel approach to precise searching in the full content of digital libraries. The Searchbench (short for search workbench) is based on sentence-wise syntactic and semantic natural language processing (NLP) of both born-digital and scanned publications in PDF format. The term born-digital means natively digital, i.e. prepared electronically using typesetting systems such as LaTeX, OpenOffice, and the like.

In the Searchbench, queries can be formulated as (possibly underspecified) statements consisting of simple subject-predicate-object constructs such as ‘algorithm improves word alignment’. This reduces the number of false hits in large document collections, which arise when the search words happen to appear close to each other but are not semantically related. The method also abstracts from passive voice and predicate synonyms. Moreover, negated statements can be excluded from the search results, and negated antonym predicates again count as synonyms (e.g. not include = exclude). A sentence-semantic search can be combined with search filters for classical full-text, bibliographic metadata and automatically computed domain terms. Auto-suggest fields facilitate text input. Queries can be bookmarked or emailed.

Furthermore, a novel citation browser in the Searchbench allows graphical navigation in citation networks, which have been extracted automatically from metadata and paper texts. The citation browser displays short phrases from citation sentences at the edges of the citation graph and thus allows students and researchers to quickly browse publications and immerse themselves in a new research field. Clicking on a citation edge shows the original citation sentence in context, and optionally also in the original PDF layout.

To showcase the usefulness of our research, we have applied it to a collection of currently approx. 25,000 open access research papers in the field of computational linguistics and language technology, the ACL Anthology (http://aclweb.org/anthology). The Searchbench user interface is a web application that runs in every modern, JavaScript-enabled web browser, including on smartphones and tablet computers. The system is a free and public service at http://aclasb.dfki.de. Because the NLP technology is domain-independent, it could also be applied to newspaper texts, technical documentation, or scientific publications from other disciplines.

The aim of this paper is to make the benefits of this new, language-technology-based approach known in library research and related fields. This article summarises 9 peer-reviewed publications from the past three years that were published at international conferences and workshops in the area of computational linguistics, and tries to present them in an appropriate way to the LIBER audience. The original papers contain more details and are freely available from the author’s homepage [1] or via the Searchbench [2].
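
The statement-based query matching described in the abstract can be illustrated with a small sketch. This is not the Searchbench implementation: it assumes that sentences have already been parsed into subject-predicate-object statements, and the names `Statement`, `Query`, `normalize`, `matches`, and the two toy lexicons are hypothetical. It only shows how underspecified queries, passive voice, predicate synonyms, and negated antonyms could interact.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical toy lexicons (assumptions for illustration only).
SYNONYMS = {"enhance": "improve", "boost": "improve",
            "contain": "include", "omit": "exclude"}
ANTONYMS = {"include": "exclude", "exclude": "include"}


@dataclass(frozen=True)
class Statement:
    """A statement extracted from one sentence, in surface order."""
    subject: str
    predicate: str
    obj: str
    negated: bool = False
    passive: bool = False  # True if the sentence was in passive voice


@dataclass(frozen=True)
class Query:
    """A query statement; None marks an underspecified slot."""
    subject: Optional[str] = None
    predicate: Optional[str] = None
    obj: Optional[str] = None


def normalize(s: Statement) -> Statement:
    """Map a statement to an active, canonical-predicate form."""
    # Passive voice: the surface subject is the logical object, so swap.
    subject, obj = (s.obj, s.subject) if s.passive else (s.subject, s.obj)
    # Predicate synonyms collapse to one canonical predicate.
    predicate = SYNONYMS.get(s.predicate, s.predicate)
    negated = s.negated
    # A negated antonym counts as a synonym: "not include" -> "exclude".
    if negated and predicate in ANTONYMS:
        predicate, negated = ANTONYMS[predicate], False
    return Statement(subject, predicate, obj, negated, False)


def matches(query: Query, stmt: Statement, exclude_negated: bool = True) -> bool:
    """Check whether a normalized statement satisfies the query."""
    s = normalize(stmt)
    if exclude_negated and s.negated:
        return False
    q_pred = None
    if query.predicate is not None:
        q_pred = SYNONYMS.get(query.predicate, query.predicate)
    return ((query.subject is None or query.subject == s.subject)
            and (q_pred is None or q_pred == s.predicate)
            and (query.obj is None or query.obj == s.obj))


query = Query("algorithm", "improve", "word alignment")
passive_hit = Statement("word alignment", "improve", "algorithm", passive=True)
synonym_hit = Statement("algorithm", "enhance", "word alignment")
negated_miss = Statement("algorithm", "improve", "word alignment", negated=True)
```

Here `matches(query, passive_hit)` and `matches(query, synonym_hit)` both succeed, while `negated_miss` is filtered out. Exact string comparison of subjects and objects is a deliberate simplification; a real system would compare linguistically analysed phrases rather than raw strings.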