Searching harsh documents
O. Frieder
Proceedings of the 21st ACM Symposium on Document Engineering, 2021-08-16
https://doi.org/10.1145/3469096.3469864
Conventional, textual document search is arguably well understood. Traditional and modern (neural) algorithms are available; benchmark collections and evaluation metrics are prevalent. However, not all documents are conventional or purely textual. We explore what it takes to search "harsh" document collections. Such collections potentially comprise documents that are natively non-digital, are multilingual, include components that are not strictly textual, are corrupted, or are a combination thereof. We address machine readability and its implications for search. We give an overview of component segmentation and integration as a search process. We describe the processing of search queries that are informationally deficient or corrupt. We then comment on the evaluation of the selected efforts presented and highlight their history from concept to practice. We conclude with a brief commentary on ongoing efforts.