Investigating the Impacts of Misspellings in Patent Search by Combining Natural Language Tools and Rule-Based Approaches

Science of aging knowledge environment : SAGE KE | Pub Date: 2022-09-07 | DOI: 10.3390/knowledge2030029

D. Russo, C. Spreafico, S. Avogadri, Andrea Precorvi


Citations: 2

Abstract

Among all sources of technical information, patent information is one of the richest and most comprehensive, and knowing how to search this mass of documents is becoming increasingly crucial. However, many users have limited knowledge of patents and search strategies, so they rely on intuitive, often approximate approaches that can be time-consuming and lead to highly inaccurate searches. Tools exist that expand queries to increase recall so that relevant documents are not missed, but handling misspellings in the search strategy remains an open problem. The presence of misspellings in patent text is typically underestimated even by experts in the field, and none of the available tools, free or paid, offers specific functionality to handle it. The goal of this article is to raise awareness of the difficulty of building a sound patent search strategy that also accounts for the possible presence of misspellings. It is important to know where misspellings are likely to occur and how much they may affect the final result. To this end, misspellings are divided into categories, distinguishing those in a generic keyword or multiword expression from those in acronyms, chemical formulas, names of applicants or inventors, and names of specific formulas or theorems. At least one example case is given for each category, showing when and how it may affect the result. Finally, an integrated approach is suggested that combines word and contextual embedding models based on deep learning with a rule-based algorithm built on wildcards and truncation operators; it corrects the query by automatically suggesting the most consistent misspellings, yielding a more accurate and reliable result.
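To make the rule-based half of the suggested approach more concrete, the following is a minimal sketch of query expansion with wildcards and truncation. It is not the authors' algorithm: the function names are hypothetical, and the operator syntax is an assumption ("?" stands for exactly one character, "*" for any run of characters); real patent databases each define their own operators. The embedding-based ranking of candidate misspellings is not covered here.

```python
import re
from typing import List


def wildcard_variants(term: str, wildcard: str = "?") -> List[str]:
    """Replace each interior character of `term`, one at a time, with a
    single-character wildcard. First and last characters are kept fixed.
    (Hypothetical helper; catches substitution errors only.)"""
    if len(term) < 3:
        return []
    return [term[:i] + wildcard + term[i + 1:] for i in range(1, len(term) - 1)]


def truncate(term: str, keep: int = 4, operator: str = "*") -> str:
    """Keep the first `keep` characters and append a truncation operator."""
    return term[:keep] + operator if len(term) > keep else term


def to_regex(pattern: str) -> re.Pattern:
    """Translate the assumed operators into a regular expression:
    '?' -> any single character, '*' -> any run of characters."""
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(".")
        elif ch == "*":
            parts.append(".*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


def matches(pattern: str, word: str) -> bool:
    """True if the whole `word` matches the wildcard `pattern`."""
    return to_regex(pattern).fullmatch(word) is not None


def expand_query(term: str) -> str:
    """Build an OR-query from the term and its wildcard variants."""
    return "(" + " OR ".join([term] + wildcard_variants(term)) + ")"


if __name__ == "__main__":
    print(expand_query("motor"))
    print(matches("mot?r", "moter"))   # substitution error is caught
    print(matches("mot?r", "mottor"))  # insertion error is not
    print(truncate("polymer"))
```

Single-character wildcards of this kind only cover substitutions; insertions, deletions, and transpositions need additional rules, which is one reason a purely rule-based expansion benefits from a model that ranks plausible misspellings.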

