Rule Based Approach to Extract Metadata from Scientific PDF Documents
Authors: Ahmer Maqsood Hashmi, M. Afzal, S. Rehman
Published in: 2020 5th International Conference on Innovative Technologies in Intelligent Systems and Industrial Applications (CITISIA), 2020-11-25
DOI: 10.1109/CITISIA50690.2020.9371784 (https://doi.org/10.1109/CITISIA50690.2020.9371784)
Citation count: 1
Abstract
The number of scientific PDF documents is growing rapidly, and searching through them is becoming a time-consuming task. To make search and storage more efficient, we need a mechanism that extracts metadata from these documents and stores it according to its semantics. Extracting and storing this metadata manually is very time-consuming and requires considerable human effort because of the large number of documents and their varying formats. In this paper, we present a rule-based approach for extracting metadata from research articles. The approach was developed and evaluated on a diverse dataset provided by ESWC (2016) that covers a number of different formats and features. Evaluation results show that our proposed approach performs 22% better than CERMINE and 9% better than GROBID.
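The abstract does not list the rules the authors use, so the following is only an illustrative sketch of what rule-based metadata extraction can look like in general, not the paper's method. It assumes the first page of a PDF has already been converted to a list of text lines. The function name `extract_metadata`, the three rules, and the regular expressions are all hypothetical choices made for this example.

```python
import re

# Hypothetical patterns for this sketch; they are not taken from the paper.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
SECTION_RE = re.compile(
    r"^(?:\d+\.?\s+)?(abstract|keywords|introduction)\b[:.]?\s*(.*)$",
    re.IGNORECASE,
)


def extract_metadata(lines):
    """Apply three simple hand-written rules to first-page text lines.

    Rule 1: the first non-empty line before any section heading is the title.
    Rule 2: e-mail addresses found before the first heading belong to authors.
    Rule 3: text between the "Abstract" heading and the next heading is the
            abstract.
    """
    title = None
    emails = []
    abstract_parts = []
    current = None  # name of the section we are currently inside, if any

    for raw in lines:
        line = raw.strip()
        if not line:
            continue

        heading = SECTION_RE.match(line)
        if heading:
            current = heading.group(1).lower()
            # Handle "Abstract: text on the same line".
            if current == "abstract" and heading.group(2):
                abstract_parts.append(heading.group(2))
            continue

        if current is None:
            if title is None:
                title = line  # Rule 1
            emails.extend(EMAIL_RE.findall(line))  # Rule 2
        elif current == "abstract":
            abstract_parts.append(line)  # Rule 3

    return {
        "title": title,
        "emails": emails,
        "abstract": " ".join(abstract_parts),
    }


if __name__ == "__main__":
    sample = [
        "A Sample Paper Title",
        "Jane Doe, John Roe",
        "jane@example.org, john.roe@uni.example.edu",
        "",
        "Abstract",
        "We study things.",
        "Results are good.",
        "Keywords: pdf, metadata",
        "1 Introduction",
        "Body text.",
    ]
    print(extract_metadata(sample))
```

Rules this simple break easily: a title that spans two lines, or a body line that happens to start with the word "Introduction", would be misread. A practical rule set would likely also need layout cues such as font size and position.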