使用随机森林和DBSCAN的专利数据库的发明人名称消歧

2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL) Pub Date : 2016-06-19 DOI:10.1145/2910896.2925465

Kunho Kim, Madian Khabsa, C. Lee Giles

{"title":"使用随机森林和DBSCAN的专利数据库的发明人名称消歧","authors":"Kunho Kim, Madian Khabsa, C. Lee Giles","doi":"10.1145/2910896.2925465","DOIUrl":null,"url":null,"abstract":"Inventor name disambiguation is the task that distinguishes each unique inventor from all other inventor records in a patent database. This task is essential for processing person name queries in order to get information related to a specific inventor, e.g. a list of all that inventor's patents. Using earlier work on author name disambiguation, we apply it to inventor name disambiguation. A random forest classifier is trained to classify whether each pair of inventor records is the same person. The DBSCAN algorithm is use for inventor record clustering, and its distance function is derived using the random forest classifier. For scalability, blocking functions are used to reduce the complexity of record matching and enable parallelization since each block can be run simultaneously. Tested on the USPTO patent database, 12 million inventor records were disambiguated in 6.5 hours. Evaluation on the labeled datasets from USPTO PatentsView competition shows our algorithm outperforms all algorithms submitted to the competition.","PeriodicalId":109613,"journal":{"name":"2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL)","volume":"198 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"17","resultStr":"{\"title\":\"Inventor name disambiguation for a patent database using a random forest and DBSCAN\",\"authors\":\"Kunho Kim, Madian Khabsa, C. Lee Giles\",\"doi\":\"10.1145/2910896.2925465\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Inventor name disambiguation is the task that distinguishes each unique inventor from all other inventor records in a patent database. This task is essential for processing person name queries in order to get information related to a specific inventor, e.g. a list of all that inventor's patents. Using earlier work on author name disambiguation, we apply it to inventor name disambiguation. A random forest classifier is trained to classify whether each pair of inventor records is the same person. The DBSCAN algorithm is use for inventor record clustering, and its distance function is derived using the random forest classifier. For scalability, blocking functions are used to reduce the complexity of record matching and enable parallelization since each block can be run simultaneously. Tested on the USPTO patent database, 12 million inventor records were disambiguated in 6.5 hours. Evaluation on the labeled datasets from USPTO PatentsView competition shows our algorithm outperforms all algorithms submitted to the competition.\",\"PeriodicalId\":109613,\"journal\":{\"name\":\"2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL)\",\"volume\":\"198 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-06-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"17\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2910896.2925465\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2910896.2925465","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 17

摘要

发明人姓名消歧义是将专利数据库中的每个唯一发明人与所有其他发明人记录区分开来的任务。此任务对于处理人名查询是必不可少的，以便获得与特定发明人相关的信息，例如该发明人的所有专利列表。利用前人关于作者姓名消歧的研究成果，将其应用于发明者姓名消歧中。训练随机森林分类器来分类每对发明家记录是否为同一个人。采用DBSCAN算法对发明家记录进行聚类，并利用随机森林分类器导出其距离函数。对于可伸缩性，块函数用于降低记录匹配的复杂性并启用并行化，因为每个块可以同时运行。在美国专利商标局的专利数据库中进行测试，在6.5小时内消除了1200万个发明家记录的歧义。对来自USPTO PatentsView竞赛的标记数据集的评估表明，我们的算法优于提交给竞赛的所有算法。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Inventor name disambiguation for a patent database using a random forest and DBSCAN

Inventor name disambiguation is the task that distinguishes each unique inventor from all other inventor records in a patent database. This task is essential for processing person name queries in order to get information related to a specific inventor, e.g. a list of all that inventor's patents. Using earlier work on author name disambiguation, we apply it to inventor name disambiguation. A random forest classifier is trained to classify whether each pair of inventor records is the same person. The DBSCAN algorithm is use for inventor record clustering, and its distance function is derived using the random forest classifier. For scalability, blocking functions are used to reduce the complexity of record matching and enable parallelization since each block can be run simultaneously. Tested on the USPTO patent database, 12 million inventor records were disambiguated in 6.5 hours. Evaluation on the labeled datasets from USPTO PatentsView competition shows our algorithm outperforms all algorithms submitted to the competition.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL)

自引率

0.00%

发文量