Classifying Turkish Trade Registry Gazette Announcements

İrem Nur Demirtaş, Seçil Arslan, Gülşen Eryiğit
2022 7th International Conference on Computer Science and Engineering (UBMK), published 2022-09-14
DOI: 10.1109/UBMK55850.2022.9919536 (https://doi.org/10.1109/UBMK55850.2022.9919536)
Citations: 1

Abstract

The Turkish Trade Registry Gazette is an important source of information in many sectors, such as banking and telecommunications. Although the gazette is publicly available, the data is hard to acquire, and announcements are offered in image format. It is possible to search for a specific company's announcement, but the returned image also contains many other, unrelated announcements. This poses several challenges for information extraction. Because of the structure of the documents in these images, it is hard to perform OCR directly. Moreover, even once the text is extracted, the announcement boundaries must be detected to split the page into individual announcements. Once the announcements are split, the one belonging to the searched company must be identified. Since the query result gives no information about the surrounding announcements, these should also be categorized to detect events of interest concerning other companies. In this work, we address all of these problems and present a pipeline that includes image processing, OCR, announcement splitting, and document classification steps.
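
The last stages of such a pipeline (announcement splitting, classification, and matching the searched company) might be sketched as follows. This is only an illustration under stated assumptions, not the authors' method: it starts from text that has already been OCR'd, and the "Announcement No:" header pattern, the keyword lists, the category names, and the sample page are all hypothetical. The keyword lookup merely stands in for whatever document classifier the pipeline actually uses.

```python
import re
from dataclasses import dataclass
from typing import List

# ASSUMPTION: each announcement in the OCR output starts with a header line
# of this form. The real gazette layout may differ.
HEADER = re.compile(r"^Announcement No:\s*\d+", re.MULTILINE)

# ASSUMPTION: hypothetical categories and keywords, used only for illustration.
CATEGORY_KEYWORDS = {
    "address_change": ("new address", "relocated"),
    "capital_change": ("capital increase", "capital decrease"),
    "bankruptcy": ("bankruptcy", "insolvency"),
}


@dataclass
class Announcement:
    text: str
    category: str


def split_announcements(page_text: str) -> List[str]:
    """Cut the OCR text of one page at every header match."""
    starts = [m.start() for m in HEADER.finditer(page_text)]
    ends = starts[1:] + [len(page_text)]
    return [page_text[s:e].strip() for s, e in zip(starts, ends)]


def classify(text: str) -> str:
    """Return the first category whose keyword occurs in the text."""
    lowered = text.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "other"


def process_page(page_text: str) -> List[Announcement]:
    """Split a page and label every announcement, not only the searched one."""
    return [Announcement(t, classify(t)) for t in split_announcements(page_text)]


def match_company(announcements: List[Announcement], company: str) -> List[Announcement]:
    """Keep the announcements that mention the searched company name."""
    needle = company.lower()
    return [a for a in announcements if needle in a.text.lower()]


# Hypothetical OCR output for a single page.
SAMPLE_PAGE = """Announcement No: 101
Acme Trading Ltd. has relocated; the new address is registered.
Announcement No: 102
Beta Telecom Inc. resolved a capital increase.
Announcement No: 103
Gamma Foods Ltd. appointed a new board member.
"""

if __name__ == "__main__":
    for item in process_page(SAMPLE_PAGE):
        print(item.category, "|", item.text.splitlines()[0])
```

Every announcement on the page is classified, so events concerning companies other than the searched one are labeled as well, which is the motivation given in the abstract. The plain substring match in match_company is a simplification; OCR noise in company names would likely call for fuzzy matching.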