Classifying Turkish Trade Registry Gazette Announcements
İrem Nur Demirtaş, Seçcil Arslan, Gülşen Eryiğit
2022 7th International Conference on Computer Science and Engineering (UBMK), published 2022-09-14
DOI: 10.1109/UBMK55850.2022.9919536 (https://doi.org/10.1109/UBMK55850.2022.9919536)
Citations: 1
Abstract
The Turkish Trade Registry Gazette is an important source of information in many sectors, such as banking and telecommunications. Although the gazette is publicly available, its data is hard to acquire, and announcements are offered in image format. A user can search for a specific company's announcement, but the returned image also contains many unrelated announcements. This poses several challenges for information extraction. The layout of the documents in these images makes it hard to apply OCR directly. Even once the text is extracted, announcement boundaries must be detected to split the page into individual announcements. The announcement belonging to the searched company must then be matched among them. Because the query returns no information about the surrounding announcements, these should also be categorized to detect events of interest involving other companies. In this work, we address all of these problems and present a pipeline comprising image processing, OCR, announcement splitting, and document classification steps.
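To make the last three stages concrete, here is a minimal sketch of how splitting, company matching, and classification might fit together once OCR text is available. It is not the authors' method: the blank-line boundary rule, the keyword classifier, the category names, and every function and class name are hypothetical placeholders, and image processing and OCR are left out entirely.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical categories and keywords, chosen only for illustration.
CATEGORY_KEYWORDS = {
    "liquidation": ("tasfiye",),
    "capital_increase": ("sermaye",),
}


@dataclass
class Announcement:
    text: str
    category: str = "other"


def split_announcements(page_text: str) -> List[str]:
    """Split OCR output on blank lines (an assumed boundary marker)."""
    parts = re.split(r"\n\s*\n", page_text)
    return [p.strip() for p in parts if p.strip()]


def classify(text: str) -> str:
    """Return the first category whose keyword occurs in the text."""
    lowered = text.casefold()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(k in lowered for k in keywords):
            return category
    return "other"


def process_page(page_text: str) -> List[Announcement]:
    """Split a page into announcements and label each one."""
    return [Announcement(t, classify(t)) for t in split_announcements(page_text)]


def find_company(announcements: List[Announcement], company: str) -> Optional[Announcement]:
    """Return the first announcement mentioning the company, if any."""
    needle = company.casefold()
    for ann in announcements:
        if needle in ann.text.casefold():
            return ann
    return None


# Made-up OCR output with two announcements separated by a blank line.
sample_page = "ABC Ltd. tasfiye karari\nalinmistir.\n\nXYZ A.S. sermaye artirimi."
labelled = process_page(sample_page)
match = find_company(labelled, "xyz")
```

Labelling every announcement on the page, not only the searched company's, mirrors the abstract's point that surrounding announcements may contain events of interest. A real system would replace the keyword rules with a trained classifier and the blank-line rule with a proper boundary detector.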