A CiteSeerX-Based Dataset for Record Linkage and Metadata Extraction
Z. Bodó
2018 20th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), September 2018
DOI: 10.1109/SYNASC.2018.00044 (https://doi.org/10.1109/SYNASC.2018.00044)
Data cleaning is an important problem in information science. Collecting data about the same entities from multiple sources, or with different methodologies, can yield slightly different, inconsistent records. The objective of data cleaning is to fuse these differing records into a single, cleaner dataset. In this paper we collect document metadata records from CiteSeerX and build a supervised record linker that matches them to Crossref. The supervised method is trained on a manually linked dataset containing 512 verified DOIs, which is, to our knowledge, the largest such dataset for bibliographic record linkage to date. We experiment with different supervised learning methods, and also show experimentally that more accurate attached metadata records can improve the performance of automatic metadata extraction systems.
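The abstract does not say which features or classifiers the record linker uses, so the following is only a minimal sketch of supervised record linkage in general, not the paper's method. It assumes each record is a dict with `title`, `authors` (surnames) and `year` fields, scores a candidate pair with three hand-picked similarity features, and fits a small logistic regression by stochastic gradient descent. The field names, the features, the helper names (`features`, `train_logistic`, `match_probability`) and the toy records are all invented for illustration.

```python
import math
from difflib import SequenceMatcher


def features(a, b):
    """Similarity features for a pair of metadata records (assumed dict layout)."""
    title_sim = SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()
    sa = {s.lower() for s in a["authors"]}
    sb = {s.lower() for s in b["authors"]}
    author_jaccard = len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0
    year_match = 1.0 if a["year"] == b["year"] else 0.0
    return [title_sim, author_jaccard, year_match]


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, epochs=500):
    """Fit logistic regression weights and bias with plain SGD."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = _sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log loss w.r.t. the linear score
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


def match_probability(model, a, b):
    """Estimated probability that records a and b describe the same document."""
    w, bias = model
    return _sigmoid(sum(wj * xj for wj, xj in zip(w, features(a, b))) + bias)


# Toy, invented training pairs: ((source record, candidate record), is_match).
GRAPH = {"title": "Fast graph matching", "authors": ["Smith"], "year": 2010}
NOISE = {"title": "Noise in metadata", "authors": ["Kim", "Park"], "year": 2012}
TABLES = {"title": "Deep parsing of tables", "authors": ["Lee"], "year": 2015}
TRAIN = [
    ((GRAPH, {"title": "Fast Graph Matching.", "authors": ["smith"], "year": 2010}), 1),
    ((GRAPH, TABLES), 0),
    ((NOISE, {"title": "Noise in meta-data", "authors": ["Kim", "Park"], "year": 2012}), 1),
    ((NOISE, GRAPH), 0),
]

X = [features(a, b) for (a, b), _ in TRAIN]
y = [label for _, label in TRAIN]
model = train_logistic(X, y)

if __name__ == "__main__":
    query = {"title": "Linking citations at scale", "authors": ["Novak"], "year": 2016}
    candidate = {"title": "Linking Citations at Scale", "authors": ["Novak"], "year": 2016}
    print(match_probability(model, query, candidate))
```

In a realistic setting the candidate records would come from a query against the target database, and the labeled pairs would come from a manually verified set such as the 512 DOIs mentioned above; the classifier could equally be swapped for any other supervised learner.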