Automation of duplicate record detection for systematic reviews: Deduplicator.

IF 6.3 · CAS Tier 4 (Medicine) · JCR Q1 (Medicine, General & Internal)
Connor Forbes, Hannah Greenwood, Matt Carter, Justin Clark
Citations: 0

Abstract


Background: To describe the algorithm of, and investigate the efficacy of, a novel systematic review automation tool, the Deduplicator, for removing duplicate records from a multi-database systematic review search.

Methods: We constructed and tested the efficacy of the Deduplicator tool by using 10 previous Cochrane systematic review search results to compare the Deduplicator's 'balanced' algorithm to a semi-manual EndNote method. Two researchers each performed deduplication on the 10 libraries of search results. For five of those libraries, one researcher used the Deduplicator, while the other performed semi-manual deduplication with EndNote. They then switched methods for the remaining five libraries. In addition to this analysis, comparison between the three different Deduplicator algorithms ('balanced', 'focused' and 'relaxed') was performed on two datasets of previously deduplicated search results.
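The kind of record matching the Methods describe can be illustrated with a minimal fuzzy-matching sketch. This is not the Deduplicator's actual algorithm (its details are given in the paper); the title-plus-author key and the 0.9 similarity threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] of how closely two strings match, case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_duplicates(records, threshold=0.9):
    """Flag index pairs of records whose title+author keys exceed the threshold.

    Hypothetical illustration only; a real tool would also normalise
    punctuation and compare fields such as DOI, year, and journal.
    """
    keys = [f"{r['title']} {r['authors']}" for r in records]
    pairs = []
    for i in range(len(keys)):
        for j in range(i + 1, len(keys)):
            if similarity(keys[i], keys[j]) >= threshold:
                pairs.append((i, j))
    return pairs

records = [
    {"title": "Automation of duplicate record detection", "authors": "Forbes C"},
    {"title": "Automation of Duplicate Record Detection.", "authors": "Forbes C."},
    {"title": "A different study entirely", "authors": "Smith J"},
]
print(find_duplicates(records))  # → [(0, 1)]
```

Tightening or relaxing the threshold trades precision against recall, which is the trade-off the paper's 'focused', 'balanced', and 'relaxed' algorithm variants expose.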

Results: Before deduplication, the mean library size for the 10 systematic reviews was 1962 records. When using the Deduplicator, the mean time to deduplicate was 5 min per 1000 records compared to 15 min with EndNote. The mean error rate with Deduplicator was 1.8 errors per 1000 records in comparison to 3.1 with EndNote. Evaluation of the different Deduplicator algorithms found that the 'balanced' algorithm had the highest mean F1 score of 0.9647. The 'focused' algorithm had the highest mean accuracy of 0.9798 and the highest recall of 0.9757. The 'relaxed' algorithm had the highest mean precision of 0.9896.
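The F1, accuracy, recall, and precision figures above follow the standard classification definitions, treating each detected duplicate as a positive. A small sketch with hypothetical counts (not taken from the study):

```python
def dedup_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard classification metrics for duplicate detection.

    tp: true duplicates correctly removed; fp: unique records wrongly removed;
    fn: duplicates missed; tn: unique records correctly kept.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}

# Hypothetical counts for illustration only, not results from the study:
m = dedup_metrics(tp=95, fp=1, fn=4, tn=900)
```

A 'relaxed' algorithm that removes records only when very confident drives fp down (high precision) at the cost of fn (lower recall); a 'focused' algorithm does the reverse, which matches the pattern in the results above.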

Conclusions: These results demonstrate that using the Deduplicator for duplicate record detection reduces the time taken to deduplicate, while maintaining or improving accuracy compared to a semi-manual EndNote method. However, further research comparing a wider range of deduplication methods is needed to establish the relative performance of the Deduplicator against other tools.

Journal
Systematic Reviews (Medicine, miscellaneous)
CiteScore: 8.30
Self-citation rate: 0.00%
Annual publications: 241
Review time: 11 weeks
About the journal: Systematic Reviews encompasses all aspects of the design, conduct and reporting of systematic reviews. The journal publishes high-quality systematic review products, including systematic review protocols, systematic reviews related to a very broad definition of health, rapid reviews, updates of already completed systematic reviews, and methods research related to the science of systematic reviews, such as decision modelling. At this time, Systematic Reviews does not accept reviews of in vitro studies. The journal also aims to ensure that the results of all well-conducted systematic reviews are published, regardless of their outcome.