Proceedings of the 22nd ACM Symposium on Document Engineering: Latest Papers

From print to online newspapers on small displays: a layout generation approach aimed at preserving entry points
Proceedings of the 22nd ACM Symposium on Document Engineering Pub Date : 2022-09-20 DOI: 10.1145/3558100.3563847
Sebastián Gallardo Díaz, Dorian Mazauric, Pierre Kornprobst
Abstract: Simply transposing the print newspapers into digital media can not be satisfactory because they were not designed for small displays. One key feature lost is the notion of entry points that are essential for navigation. By focusing on headlines as entry points, we show how to produce alternative layouts for small displays that preserve entry points quality (readability and usability) while optimizing aesthetics and style. Our approach consists in a relayouting approach implemented via a genetic-inspired approach. We tested it on realistic newspaper pages. For the case discussed here, we obtained more than 2000 different layouts where the font was increased by a factor of two. We show that the quality of headlines is globally much better with the new layouts than with the original layout. Future work will tend to generalize this promising approach, accounting for the complexity of real newspapers, with user experience quality as the primary goal.
Citations: 0
A cascaded approach for page-object detection in scientific papers
Proceedings of the 22nd ACM Symposium on Document Engineering Pub Date : 2022-09-20 DOI: 10.1145/3558100.3563851
Erika Spiteri Bailey, Alexandra Bonnici, Stefania Cristina
Abstract: In recent years, Page Object Detection (POD) has become a popular document understanding task, proving to be a non-trivial task given the potential complexity of documents. The rise of neural networks facilitated a more general learning approach to this task. However, in the literature, the different objects such as formulae, or figures among others, are generally considered individually. In this paper, we describe the joint localisation of six object classes relevant to scientific papers, namely isolated formulae, embedded formulae, figures, tables, variables and references. Through a qualitative analysis of these object classes, we note a hierarchy among the classes and propose a new localisation approach, using two, cascaded You Only Look Once (YOLO) networks. We also present a new data set consisting of labelled bounding boxes for all six object classes. This data set combines two commonly used data sets in the literature for formulae localisation, adding to the document images in these data sets the labels for figures, tables, variables and references. Using this data set, we achieve an average F1-score of 0.755 across all classes, which is comparable to the state-of-the-art for the object classes when considered individually for localisation.
Citations: 0
Academic writing and publishing beyond documents
Proceedings of the 22nd ACM Symposium on Document Engineering Pub Date : 2022-09-20 DOI: 10.1145/3558100.3563840
C. Mahlow, M. Piotrowski
Abstract: Research on writing tools stopped in the late 1980s when Microsoft Word had achieved monopoly status. However, the development of the Web and the advent of mobile devices are increasingly rendering static print-like documents obsolete. In this vision paper we reflect on the impact of this development on scholarly writing and publishing. Academic publications increasingly include dynamic elements, e.g., code, data plots, and other visualizations, which clearly requires other tools for document production than traditional word processors. When the printed page no longer is the desired final product, content and form can be addressed explicitly and separately, thus emphasizing the structure of texts rather than the structure of documents. The resulting challenges have not yet been fully addressed by document engineering.
Citations: 2
Binarization of photographed documents image quality, processing time and size assessment
Proceedings of the 22nd ACM Symposium on Document Engineering Pub Date : 2022-09-20 DOI: 10.1145/3558100.3564159
R. Lins, R. Bernardino, Ricardo da Silva Barboza, S. Simske
Abstract: Today, over eighty percent of the world's population owns a smart-phone with an in-built camera, and they are very often used to photograph documents. Document binarization is a key process in many document processing platforms. This competition on binarizing photographed documents assessed the quality, time, space, and performance of five new algorithms and sixty-four "classical" and alternative algorithms. The evaluation dataset is composed of offset, laser, and deskjet printed documents, photographed using six widely-used mobile devices with the strobe flash on and off, under two different angles and places of capture.
Citations: 2
Detecting malware using text documents extracted from spam email through machine learning
Proceedings of the 22nd ACM Symposium on Document Engineering Pub Date : 2022-09-20 DOI: 10.1145/3558100.3563854
Luis Ángel Redondo-Gutierrez, Francisco Jáñez-Martino, Eduardo FIDALGO, Enrique Alegre, V. González-Castro, R. Alaíz-Rodríguez
Abstract: Spam has become an effective way for cybercriminals to spread malware. Although cybersecurity agencies and companies develop products and organise courses for people to detect malicious spam email patterns, spam attacks are not totally avoided yet. In this work, we present and make publicly available "Spam Email Malware Detection - 600" (SEMD-600), a new dataset, based on Bruce Guenter's, for malware detection in spam using only the text of the email. We also introduce a pipeline for malware detection based on traditional Natural Language Processing (NLP) techniques. Using SEMD-600, we compare the text representation techniques Bag of Words and Term Frequency-Inverse Document Frequency (TF-IDF), in combination with three different supervised classifiers: Support Vector Machine, Naive Bayes and Logistic Regression, to detect malware in plain text documents. We found that combining TF-IDF with Logistic Regression achieved the best performance, with a macro F1 score of 0.763.
Citations: 0
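The macro F1 score reported for this entry is the unweighted mean of the per-class F1 scores, so a rare class counts as much as a frequent one. The following is a minimal sketch of that metric only, not the authors' pipeline; the function name `macro_f1` and the toy labels are illustrative assumptions.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels (hypothetical, not taken from SEMD-600)
y_true = ["malware", "malware", "benign", "benign", "benign"]
y_pred = ["malware", "benign", "benign", "benign", "malware"]
score = macro_f1(y_true, y_pred)
```

In this toy case the "malware" class has F1 = 0.5 and the "benign" class has F1 = 2/3, so the macro average is 7/12.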
Anonymizing and obfuscating PDF content while preserving document structure
Proceedings of the 22nd ACM Symposium on Document Engineering Pub Date : 2022-09-20 DOI: 10.1145/3558100.3563849
Charlotte Curtis
Abstract: The portable document format (PDF) is both versatile and complex, with a specification exceeding well over a thousand pages. For independent developers writing software that reads, displays, or transforms PDFs, it is difficult to comprehensively account for all of the potential variations that might exist in the wild. Compounding this problem are the usage agreements that often accompany purchased and proprietary PDFs, preventing end users from uploading a troublesome document as part of a bug report and limiting the set of test cases that can be made public for open source development. In this paper, pdf-mangler is presented as a solution to this problem. The goal of pdf-mangler is to remove information in the form of text, images, and vector graphics while retaining as much of the document structure and general visual appearance as possible. The intention is for pdf-mangler to be deployed as part of an automated bug reporting tool for PDF software.
Citations: 0
Triplet transformer network for multi-label document classification
Proceedings of the 22nd ACM Symposium on Document Engineering Pub Date : 2022-09-20 DOI: 10.1145/3558100.3563843
J. Melsbach, Sven Stahlmann, Stefan Hirschmeier, D. Schoder
Abstract: Multi-label document classification is the task of assigning one or more labels to a document, and has become a common task in various businesses. Typically, current state-of-the-art models based on pretrained language models tackle this task without taking the textual information of label names into account, therefore omitting possibly valuable information. We present an approach that leverages this information stored in label names by reformulating the problem of multi label classification into a document similarity problem. To achieve this, we use a triplet transformer network that learns to embed labels and documents into a joint vector space. Our approach is fast at inference, classifying documents by determining the closest and therefore most similar labels. We evaluate our approach on a challenging real-world dataset of a German radio-broadcaster and find that our model provides competitive results compared to other established approaches.
Citations: 1
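The inference step described in this entry, picking the labels closest to a document in a joint vector space, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the embeddings are hand-written toy vectors instead of transformer outputs, cosine similarity is assumed as the closeness measure, and the `threshold` cut-off is a hypothetical parameter that the abstract does not mention.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict_labels(doc_vec, label_vecs, threshold=0.5):
    """Return label names whose embedding is close to the document,
    most similar first. `threshold` is an assumed cut-off."""
    scored = [(cosine(doc_vec, vec), name) for name, vec in label_vecs.items()]
    return [name for score, name in sorted(scored, reverse=True) if score >= threshold]


# Toy 2-D embeddings (hypothetical); a trained model would produce these
label_vecs = {
    "sports": [1.0, 0.0],
    "politics": [0.0, 1.0],
    "football": [0.9, 0.1],
}
predicted = predict_labels([1.0, 0.2], label_vecs)
```

Because label embeddings can be computed once and cached, classifying a new document only needs one document embedding plus a similarity lookup, which is consistent with the abstract's claim of fast inference.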
Tab this folder of documents: page stream segmentation of business documents
Proceedings of the 22nd ACM Symposium on Document Engineering Pub Date : 2022-09-20 DOI: 10.1145/3558100.3563852
Thisanaporn Mungmeeprued, Yuxin Ma, Nisarg Mehta, Aldo Lipani
Abstract: In the midst of digital transformation, automatically understanding the structure and composition of scanned documents is important in order to allow correct indexing, archiving, and processing. In many organizations, different types of documents are usually scanned together in folders, so it is essential to automate the task of segmenting the folders into documents which then proceed to further analysis tailored to specific document types. This task is known as Page Stream Segmentation (PSS). In this paper, we propose a deep learning solution to solve the task of determining whether or not a page is a breaking-point given a sequence of scanned pages (a folder) as input. We also provide a dataset called TABME (TAB this folder of docuMEnts) generated specifically for this task. Our proposed architecture combines LayoutLM and ResNet to exploit both textual and visual features of the document pages and achieves an F1 score of 0.953. The dataset and code used to run the experiments in this paper are available at the following web link: https://github.com/aldolipani/TABME.
Citations: 3
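Once a model has produced a per-page decision, turning those decisions into documents is a simple grouping step. The sketch below shows only that post-processing, not the LayoutLM and ResNet classifier itself; it assumes that a breaking-point flag marks the first page of a new document, which the abstract does not specify, and `split_stream` is a hypothetical helper name.

```python
def split_stream(starts_new_document):
    """Group page indices into documents.

    `starts_new_document[i]` is True when page i is predicted to begin a
    new document (an assumed reading of "breaking-point").
    """
    documents = []
    for page_index, starts_new in enumerate(starts_new_document):
        # The first page always opens a document, whatever its flag says
        if starts_new or not documents:
            documents.append([page_index])
        else:
            documents[-1].append(page_index)
    return documents


# Six scanned pages with hypothetical predictions
segments = split_stream([True, False, False, True, True, False])
```

Here pages 0 to 2 form the first document, page 3 is a single-page document, and pages 4 and 5 form the third.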
Optical character recognition with transformers and CTC
Proceedings of the 22nd ACM Symposium on Document Engineering Pub Date : 2022-09-20 DOI: 10.1145/3558100.3563845
Israel Campiotti, R. Lotufo
Abstract: Text recognition tasks are commonly solved by using a deep learning pipeline called CRNN. The classical CRNN is a sequence of a convolutional network, followed by a bidirectional LSTM and a CTC layer. In this paper, we perform an extensive analysis of the components of a CRNN to find what is crucial to the entire pipeline and what characteristics can be exchanged for a more effective choice. Given the results of our experiments, we propose two different architectures for the task of text recognition. The first model, CNN + CTC, is a combination of a convolutional model followed by a CTC layer. The second model, CNN + Tr + CTC, adds an encoder-only Transformers between the convolutional network and the CTC layer. To the best of our knowledge, this is the first time that a Transformers have been successfully trained using just CTC loss. To assess the capabilities of our proposed architectures, we train and evaluate them on the SROIE 2019 data set. Our CNN + CTC achieves an F1 score of 89.66% possessing only 4.7 million parameters. CNN + Tr + CTC attained an F1 score of 93.76% with 11 million parameters, which is almost 97% of the performance achieved by the TrOCR using 334 million parameters and more than 600 million synthetic images for pretraining.
Citations: 1
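A model trained with CTC emits one symbol per time step, including a special blank, and the output text is recovered by merging consecutive repeats and then dropping blanks. The sketch below shows the common greedy (best-path) form of that decoding; it is an assumption that this matches the decoding used in the paper, and the blank index 0 and the toy alphabet are illustrative choices.

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse a per-frame argmax sequence into a label sequence:
    merge consecutive repeats, then drop blank symbols."""
    decoded = []
    previous = blank
    for idx in frame_ids:
        if idx != previous and idx != blank:
            decoded.append(idx)
        previous = idx
    return decoded


# Hypothetical alphabet: index 0 is the CTC blank
alphabet = {1: "a", 2: "b"}
frames = [0, 1, 1, 0, 1, 2, 2, 0]
text = "".join(alphabet[i] for i in ctc_greedy_decode(frames))
```

The blank between the two runs of index 1 is what lets the decoder keep a genuinely doubled character, so the frames above decode to "aab" rather than "ab".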
Chinese public procurement document harvesting pipeline
Proceedings of the 22nd ACM Symposium on Document Engineering Pub Date : 2022-09-20 DOI: 10.1145/3558100.3563848
Danrun Cao, Oussama Ahmia, Nicolas Béchet, P. Marteau
Abstract: We present a processing pipeline for Chinese public procurement document harvesting, with the aim of producing strategic data with greater added value. It consists of three micro-modules: data collection, information extraction, database indexing. The information extraction part is implemented through a hybrid system which combines rule-based and machine learning approaches. Rule-based method is used for extracting information with presenting recurring morphological features, such as dates, amounts and contract awardee information. Machine learning method is used for trade detection in the title of procurement documents.
Citations: 1
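Fields with recurring surface patterns, such as dates and amounts, are a natural fit for rules. The sketch below illustrates the idea with two regular expressions; both patterns, the function name `extract_fields`, and the sample sentence are hypothetical and are not taken from the paper, whose actual rules are not given in the abstract.

```python
import re

# Assumed patterns: a date written as YYYY年M月D日, and an amount
# followed by 元 (yuan) or 万元 (ten thousand yuan)
DATE_RE = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")
AMOUNT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(万元|元)")


def extract_fields(text):
    """Return the first date (as ISO text) and first amount (in yuan)
    found in `text`, or None for a field that is absent."""
    date = None
    match = DATE_RE.search(text)
    if match:
        year, month, day = (int(part) for part in match.groups())
        date = f"{year:04d}-{month:02d}-{day:02d}"
    amount = None
    match = AMOUNT_RE.search(text)
    if match:
        value = float(match.group(1))
        amount = value * 10_000 if match.group(2) == "万元" else value
    return {"date": date, "amount_yuan": amount}


# Invented sample text for illustration only
fields = extract_fields("公告日期:2022年9月20日,中标金额:123.5万元")
```

Rules like these are easy to audit and need no training data, which is one reason a hybrid system might reserve machine learning for less regular fields such as the trade category.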