{"title":"Session details: Systems for visual document analysis","authors":"Tamir Hassan","doi":"10.1145/3482786","DOIUrl":"https://doi.org/10.1145/3482786","url":null,"abstract":"","PeriodicalId":423462,"journal":{"name":"Proceedings of the 21st ACM Symposium on Document Engineering","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126923515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Document engineering issues in malware analysis","authors":"Charles K. Nicholas, R. J. Joyce, S. Simske","doi":"10.1145/3469096.3470950","DOIUrl":"https://doi.org/10.1145/3469096.3470950","url":null,"abstract":"We present an overview of the field of malware analysis with emphasis on issues related to document engineering. We will introduce the field with a discussion of the types of malware, including executable binaries, malicious PDFs, polymorphic malware, ransomware, and exploit kits. We will conclude with our view of important research questions in the field. This is an updated version of tutorials presented in previous years, with more information about newly-available tools.","PeriodicalId":423462,"journal":{"name":"Proceedings of the 21st ACM Symposium on Document Engineering","volume":"2016 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123837933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Recognizing creative visual design: multiscale design characteristics in free-form web curation documents","authors":"Ajit Jain, A. Kerne, Nic Lupfer, Gabriel Britain, Aaron Perrine, Y. Choe, J. Keyser, Ruihong Huang","doi":"10.1145/3469096.3469869","DOIUrl":"https://doi.org/10.1145/3469096.3469869","url":null,"abstract":"Multiscale design is the widely practiced use of space and scale to visually explore and articulate relationships. Free-form web curation (FFWC) is an approach to supporting multiscale design, involving creative strategies of collecting content, assembling it to juxtapose and organize, sketching, writing, shifting perspective to navigate, and exhibiting to share and collaborate. Our long term goal is to support design students with automatic, on demand feedback. We introduce a spatial clustering technique for recognizing multiscale design characteristics---scales and clusters---in FFWC documents. We perform quantitative evaluation to establish baseline performance. We contribute to human-centered AI by advancing fundamental human aspirations, through automatic recognizers of creative design, e.g., for representing and communicating abstract ideas. We develop implications, (1) for supporting people using content recognition in creative contexts, such as design education; (2) for overcoming design fixation with human-centered AI; and (3) for recognizing multiscale design characteristics.","PeriodicalId":423462,"journal":{"name":"Proceedings of the 21st ACM Symposium on Document Engineering","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132023464","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Security and sensitive documents","authors":"Charles K. Nicholas","doi":"10.1145/3482784","DOIUrl":"https://doi.org/10.1145/3482784","url":null,"abstract":"","PeriodicalId":423462,"journal":{"name":"Proceedings of the 21st ACM Symposium on Document Engineering","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115205560","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient clustering of short text streams using online-offline clustering","authors":"Md. Rashadul Hasan Rakib, N. Zeh, E. Milios","doi":"10.1145/3469096.3469866","DOIUrl":"https://doi.org/10.1145/3469096.3469866","url":null,"abstract":"Short text stream clustering is an important but challenging task since massive amount of text is generated from different sources such as micro-blogging, question-answering, and social news aggregation websites. The two major challenges of clustering such massive amount of text is to cluster them within a reasonable amount of time and to achieve better clustering result. To overcome these two challenges, we propose an efficient short text stream clustering algorithm (called EStream) consisting of two modules: online and offline. The online module of EStream algorithm assigns a text to a cluster one by one as it arrives. To assign a text to a cluster it computes similarity between a text and a selected number of clusters instead of all clusters and thus significantly reduces the running time of the clustering of short text streams. EStream assigns a text to a cluster (new or existing) using the dynamically computed similarity thresholds. Thus EStream efficiently deals with the concept drift problem. The offline module of EStream algorithm enhances the distributions of texts in the clusters obtained by the online module so that the upcoming short texts can be assigned to the appropriate clusters. Experimental results demonstrate that EStream outperforms the state-of-the-art short text stream clustering methods (in terms of clustering result) by a statistically significant margin on several short text datasets. Moreover, the running time of EStream is several orders of magnitude faster than that of the state-of-the-art methods.","PeriodicalId":423462,"journal":{"name":"Proceedings of the 21st ACM Symposium on Document Engineering","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122071169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Text line extraction using deep learning and minimal sub seams","authors":"Adi Azran, A. Schclar, Raid Saabni","doi":"10.1145/3469096.3474941","DOIUrl":"https://doi.org/10.1145/3469096.3474941","url":null,"abstract":"Accurate text line extraction is a vital prerequisite for efficient and successful text recognition systems ranging from keywords/phrases searching to complete conversion to text. In many cases, the proposed algorithms target binary pre-processed versions of the image, which may cause insufficient results due to poor quality document images. Recently, more papers present solutions that work directly on gray-level images [1,2,7,12,15]. In this paper, we present a novel robust, and efficient algorithm to extract text-lines directly from gray-level document images. The proposed approach uses a combination of two variants of Convolutional Neural Network (CNNs), followed by minimal energy seam extraction. The first ConvNet is a modified version of the autoencoder used for biomedical image segmentation [8]. The second is a deep convolutional Neural Network, working on overlapping vertical slices of the original image. The two variants are combined to one neural net after re-attaching the resulting slices of the second net. The merged results of the two nets are used as a preprocessed image to obtain an energy map for a second phase. In the second step, we use the algorithm presented in [2], to track minimal energy sub-seams accumulated to perform a full local minimal/maximal separating and medial seam defining the text baselines and the text line regions. We have tested our approach on multi-lingual various datasets written at a range of image quality based on the ICDAR datasets.","PeriodicalId":423462,"journal":{"name":"Proceedings of the 21st ACM Symposium on Document Engineering","volume":"2016 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128738189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MTLV","authors":"Fatemeh Rahimi, Evangelos E. Milios, S. Matwin","doi":"10.1145/3469096.3474926","DOIUrl":"https://doi.org/10.1145/3469096.3474926","url":null,"abstract":"Multi-Task Learning (MTL) for text classification takes advantage of the data to train a single shared model with multiple task-specific layers on multiple related classification tasks to improve its generalization performance. We choose pre-trained language models (BERT-family) as the shared part of this architecture. Although they have achieved noticeable performance in different downstream NLP tasks, their performance in an MTL setting for the biomedical domain is not thoroughly investigated. In this work, we investigate the performance of BERT-family models in different MTL settings with Open-I (radiology reports) and OHSUMED (PubMed abstracts) datasets. We introduce the MTLV (Multi-Task Learning Visualizer) library for building Multi-task learning-related architectures which use existing infrastructure (e.g., Hugging Face Transformers and MLflow Tracking). Following previous work in computer vision, we clustered tasks and trained a separate model on each cluster (Grouped Multi-Task Learning (GMTL)). Contextual representation of the class labels (Tasks) and their descriptions was used by the library as features to cluster the tasks. We observed that grouping tasks for training with few models (GMTL) outperforms the MTL also GMTL is computationally more efficient than the STL setting (a separate model is trained for each task).","PeriodicalId":423462,"journal":{"name":"Proceedings of the 21st ACM Symposium on Document Engineering","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125432045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Searching harsh documents","authors":"O. Frieder","doi":"10.1145/3469096.3469864","DOIUrl":"https://doi.org/10.1145/3469096.3469864","url":null,"abstract":"Conventional, textual document search is arguably well understood. Traditional and modern (neural) algorithms are available; benchmark collections and evaluation metrics are prevalent. However, not all documents are conventional or purely textual. We explore what is takes to search \"harsh\" document collections. Such collections comprise potentially of documents that are natively non-digital, are multilingual, include components that are not strictly textual, are corrupted, or are a combination thereof. We address machine readability and its implication on search. We overview component segmentation and integration as a search process. We describe the processing of search queries that are informationally deficient or corrupt. We then comment on the evaluation of the selected efforts presented and highlight their history from concept to practice. We conclude with a brief commentary on ongoing efforts.","PeriodicalId":423462,"journal":{"name":"Proceedings of the 21st ACM Symposium on Document Engineering","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122172795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Shock wave: a graph layout algorithm for text analyzing","authors":"Maxime Cauz, Julien Albert, Anne Wallemacq, Isabelle Linden, Bruno Dumas","doi":"10.1145/3469096.3474925","DOIUrl":"https://doi.org/10.1145/3469096.3474925","url":null,"abstract":"The EVOQ tool offers researchers in social sciences a set of text analysis tools relying on the post-structuralist approach. This analysis approach relies on the identification of association and opposition relations between terms (words or expressions). The so-defined graph is presented in EVOQ by a node-link diagram. The Shock Wave is a placement algorithm specifically designed to be combined with a classical force-directed algorithm to produce a graph layout which meets the interpretability needs of the text analysts while preserving efficiency on large numbers of nodes. It structures the nodes on a circular placement with transversal opposition relations to highlight oppositions within the text concepts. Beyond our use case, the interest of Shock Wave lies in the fact that it is a novel method to present graphs of text with a strong emphasis on underlying semantic fields.","PeriodicalId":423462,"journal":{"name":"Proceedings of the 21st ACM Symposium on Document Engineering","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125774309","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Ordering sentences and paragraphs with pre-trained encoder-decoder transformers and pointer ensembles","authors":"Rémi Calizzano, Malte Ostendorff, G. Rehm","doi":"10.1145/3469096.3469874","DOIUrl":"https://doi.org/10.1145/3469096.3469874","url":null,"abstract":"Passage ordering aims to maximise discourse coherence in document generation or document modification tasks such as summarisation or storytelling. This paper extends the passage ordering task from sentences to paragraphs, i.e., passages with multiple sentences. Increasing the passage length increases the task's difficulty. To account for this, we propose the combination of a pre-trained encoder-decoder Transformer model, namely BART, with variations of pointer networks. We empirically evaluate the proposed models for sentence and paragraph ordering. Our best model outperforms previous state of the art methods by 0.057 Kendall's Tau on one of three sentence ordering benchmarks (arXiv, VIST, ROC-Story). For paragraph ordering, we construct two novel datasets from Wikipedia and CNN-DailyMail on which we achieve 0.67 and 0.47 Kendall's Tau. The best model variation utilises multiple pointer networks in an ensemble-like fashion. We hypothesise that the use of multiple pointers better reflects the multitude of possible orders of paragraphs in more complex texts. Our code, data, and models are publicly available1.","PeriodicalId":423462,"journal":{"name":"Proceedings of the 21st ACM Symposium on Document Engineering","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134117076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}