Hwan-Gue Cho, H. Tak, Han-Ho Kim, Yeoneo Kim, Yongju Shin, Chulsu Lim, Kwangnam Choi
Title: Evaluation of Full-Text Retrieval System Using Collection of Serially Evolved Documents
Published in: Proceedings of the 3rd International Conference on Industrial and Business Engineering
Publication date: 2017-08-17
DOI: 10.1145/3133811.3133817 (https://doi.org/10.1145/3133811.3133817)
Citation count: 0
Abstract
Finding a document that is similar to a specified query document within a large document database is one of the important issues of the Big Data era, as most available data is in the form of unstructured text. Our test collection consists of two parts. In the first part, the texts were produced by humans using an artificial plagiarism approach in a linear pipelined procedure. In the second part, the texts were generated by software that inserts, deletes, and substitutes certain parts of a target document to derive a similar document from an input document. This document set is known as the Serially Evolved Documents (SED). We propose two new measures, Order Preserving Precision (OPP) and Order Preserving Recall (OPR), to quantify how well the evolutionary order is preserved among the output documents returned by the IR system under test. Using these test texts, we evaluated KONAN, a document retrieval system for Korean documents.
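The software-generated part of the collection and the idea of order preservation can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names (`evolve`, `make_sed`, `concordant_pair_fraction`), the token-level edits, the number of edits per generation, and the filler vocabulary are all hypothetical and are not taken from the paper. In particular, `concordant_pair_fraction` is a simple stand-in that counts result pairs ranked in generation order; it is not the definition of OPP or OPR, which the abstract does not give.

```python
import random


def evolve(tokens, rng, n_edits=3, vocab=("alpha", "beta", "gamma")):
    """Return a copy of tokens after n_edits random edits.

    Each edit is an insertion, deletion, or substitution of one token.
    The edit count and vocabulary are illustrative assumptions.
    """
    out = list(tokens)
    for _ in range(n_edits):
        op = rng.choice(("insert", "delete", "substitute"))
        if op == "insert" or not out:
            # Also used as a fallback when there is nothing left to edit.
            out.insert(rng.randrange(len(out) + 1), rng.choice(vocab))
        elif op == "delete":
            del out[rng.randrange(len(out))]
        else:
            out[rng.randrange(len(out))] = rng.choice(vocab)
    return out


def make_sed(seed_tokens, generations, rng):
    """Build a serial chain in which document k is derived from document k-1."""
    chain = [list(seed_tokens)]
    for _ in range(generations):
        chain.append(evolve(chain[-1], rng))
    return chain


def concordant_pair_fraction(ranked_generations):
    """Fraction of result pairs whose rank order matches generation order.

    ranked_generations lists the generation index of each retrieved
    document, in the order the retrieval system returned them.
    Illustrative stand-in only; not the paper's OPP or OPR.
    """
    n = len(ranked_generations)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 1.0
    good = sum(ranked_generations[i] < ranked_generations[j] for i, j in pairs)
    return good / len(pairs)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the chain is reproducible
    chain = make_sed("the quick brown fox jumps".split(), 4, rng)
    for generation, doc in enumerate(chain):
        print(generation, " ".join(doc))
    # A system that returns generations in the order 0, 2, 1, 3
    print(concordant_pair_fraction([0, 2, 1, 3]))
```

Querying with generation 0 and comparing the ranks of later generations is one way such a chain could be used: a system that tracks similarity faithfully would be expected to rank nearer generations ahead of more distant ones.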