{"title":"How did Dennis Ritchie produce his PhD thesis?: a typographical mystery","authors":"D. Brailsford, B. Kernighan, Williamson Ritchie","doi":"10.1145/3558100.3563839","DOIUrl":"https://doi.org/10.1145/3558100.3563839","url":null,"abstract":"Dennis Ritchie, the creator of the C programming language and, with Ken Thompson, the co-creator of the Unix operating system, completed his Harvard PhD thesis on recursive function theory in early 1968. But for unknown reasons, he never officially received his degree, and the thesis itself disappeared for nearly 50 years. This strange set of circumstances raises at least three broad questions: • What was the technical contribution of the thesis? • Why wasn't the degree granted? • How was the thesis prepared? This paper investigates the third question: how was a long and typographically complicated mathematical thesis produced at a very early stage in the history of computerized document preparation?","PeriodicalId":146244,"journal":{"name":"Proceedings of the 22nd ACM Symposium on Document Engineering","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114498233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modifying PDF sewing patterns for use with projectors","authors":"Charlotte Curtis","doi":"10.1145/3558100.3563853","DOIUrl":"https://doi.org/10.1145/3558100.3563853","url":null,"abstract":"Print-at-home PDF sewing patterns have gained popularity over the last decade and now represent a significant proportion of the home sewing pattern market. Recently, an all-digital workflow has emerged through the use of ceiling-mounted projectors, allowing for patterns to be projected directly onto fabric. However, PDF patterns produced for printing are not suitable for projecting. This paper presents PDFStitcher, an open-source cross-platform graphical tool that enables end users to modify PDF sewing patterns for use with a projector. The key functionality of PDFStitcher is described, followed by a brief discussion on the future of sewing pattern file formats and information processing.","PeriodicalId":146244,"journal":{"name":"Proceedings of the 22nd ACM Symposium on Document Engineering","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131290044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Theory entity extraction for social and behavioral sciences papers using distant supervision","authors":"Xin Wei, Lamia Salsabil, Jian Wu","doi":"10.1145/3558100.3563855","DOIUrl":"https://doi.org/10.1145/3558100.3563855","url":null,"abstract":"Theories and models, which are common in scientific papers in almost all domains, usually provide the foundations of theoretical analysis and experiments. Understanding the use of theories and models can shed light on the credibility and reproducibility of research works. Compared with metadata such as title, author, and keywords, theory extraction from scientific literature is rarely explored, especially in the social and behavioral science (SBS) domains. One challenge of applying supervised learning methods is the lack of a large number of labeled samples for training. In this paper, we propose an automated framework based on distant supervision that leverages entity mentions from Wikipedia to build a ground truth corpus consisting of more than 4,500 automatically annotated sentences containing theory/model mentions. We use this corpus to train models for theory extraction in SBS papers. We compared four deep learning architectures and found that RoBERTa-BiLSTM-CRF performs best, with a precision as high as 89.72%. The model shows promise for convenient extension to domains other than SBS. The code and data are publicly available at https://github.com/lamps-lab/theory.","PeriodicalId":146244,"journal":{"name":"Proceedings of the 22nd ACM Symposium on Document Engineering","volume":"341-342 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123865609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Graphical document representation for french newsletters analysis","authors":"Alexis Blandin, Farida Saïd, Jeanne Villaneau, P. Marteau","doi":"10.1145/3558100.3563856","DOIUrl":"https://doi.org/10.1145/3558100.3563856","url":null,"abstract":"Document analysis is essential in many industrial applications. However, engineering natural language resources to represent entire documents is still challenging. Moreover, available resources in French are scarce and do not cover all possible tasks, especially in specific business applications. In this context, we present a French newsletter dataset and its use to predict whether a newsletter will have a positive or negative impact on its readers. We propose a new graph-based representation of newsletters that takes their layout into account. We evaluate the relevance of the proposed representation to predict a newsletter's performance in terms of open and click rates using graph analysis methods.","PeriodicalId":146244,"journal":{"name":"Proceedings of the 22nd ACM Symposium on Document Engineering","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117163831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optical character recognition guided image super resolution","authors":"Philipp Hildebrandt, Maximilian Schulze, S. Cohen, Vanja Doskoc, Raid Saabni, Tobias Friedrich","doi":"10.1145/3558100.3563841","DOIUrl":"https://doi.org/10.1145/3558100.3563841","url":null,"abstract":"Recognizing disturbed text in real-life images is a difficult problem, as information that is missing due to low resolution or out-of-focus text has to be recreated. Combining text super-resolution and optical character recognition deep learning models can be a valuable tool to enlarge and enhance text images for better readability, as well as to recognize text automatically afterwards. We achieve improved peak signal-to-noise ratio and text recognition accuracy scores over TBSRN, a state-of-the-art text super-resolution model, on the real-world low-resolution dataset TextZoom, while having a smaller theoretical model size due to the use of quantization techniques. In addition, we show how different training strategies influence the performance of the resulting model.","PeriodicalId":146244,"journal":{"name":"Proceedings of the 22nd ACM Symposium on Document Engineering","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128170195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scholarly big data quality assessment: a case study of document linking and conflation with S2ORC","authors":"Jian Wu, Ryan Hiltabrand, Dominik Soós, C. Lee Giles","doi":"10.1145/3558100.3563850","DOIUrl":"https://doi.org/10.1145/3558100.3563850","url":null,"abstract":"Recently, the Allen Institute for Artificial Intelligence released the Semantic Scholar Open Research Corpus (S2ORC), one of the largest open-access scholarly big datasets with more than 130 million scholarly paper records. S2ORC contains a significant portion of automatically generated metadata. The metadata quality could impact downstream tasks such as citation analysis, citation prediction, and link analysis. In this project, we assess the document linking quality and estimate the document conflation rate for the S2ORC dataset. Using semi-automatically curated ground truth corpora, we estimated that the overall document linking quality is high, with 92.6% of documents correctly linking to six major databases, but the linking quality varies depending on subject domains. The document conflation rate is around 2.6%, meaning that about 97.4% of documents are unique. We further quantitatively compared three near-duplicate detection methods using the ground truth created from S2ORC. The experiments indicated that locality-sensitive hashing was the best method in terms of effectiveness and scalability, achieving high performance (F1=0.960) and a much reduced runtime. Our code and data are available at https://github.com/lamps-lab/docconflation.","PeriodicalId":146244,"journal":{"name":"Proceedings of the 22nd ACM Symposium on Document Engineering","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131445789","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Long-term lifecycle-related management of digital building documents: towards a holistic and standard-based concept for a technical and organizational solution in building authorities","authors":"Uwe M. Borghoff, Eberhard Pfeiffer, Peter Rödig","doi":"10.1145/3558100.3563842","DOIUrl":"https://doi.org/10.1145/3558100.3563842","url":null,"abstract":"The long-term lifecycle-related management of digital building information is essential to improve the overall quality of public built assets. However, this management task still poses great challenges for building authorities, as they are usually responsible for large, heterogeneous and long-lived built assets with countless data sets and documents that are increasingly changing from analogue to digital representations. These digital collections are characterized by complex dependencies, by numerous different, sometimes highly specialized and proprietary formats, and also by their inappropriate organization. The major challenge is to ensure completeness, consistency and usability over the entire lifecycle of buildings and their associated digital data and documents. In this paper, we present an approach for a holistic and standard-based concept for a technical and organizational solution in building authorities. Holistic means integrating concepts for the long-term usability of digital building information, taking into account the framework conditions described in building authorities, including the introduction of BIM (building information modeling). To this end, we outline how the concepts of the consolidated and widely accepted ISO-standardized reference model OAIS (open archival information system) can be applied to a building-specific information architecture. First, we sketch the history of electronic data processing in the building sector and introduce the essential concepts of OAIS. Then, we illustrate typical major actors and their (future) IT systems, including systems intended for OAIS-compliant long-term usability. Next, we outline major (future) software components and their interactions and assignment to lifecycle phases. Finally, we delineate how the generic information model of OAIS can be used. In summary, ensuring the long-term usability of digital information in the building sector will remain a grand challenge, but our proposed approach to the systematic application and further refinement of the OAIS reference model can help to better organize future discussions as well as research, development and implementation activities. We conclude with some suggestions for further research based on the concepts of the OAIS reference model, such as refining information models or developing information repositories needed for long-term interpretation of digital objects.","PeriodicalId":146244,"journal":{"name":"Proceedings of the 22nd ACM Symposium on Document Engineering","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129867435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SeNMFk-SPLIT: large corpora topic modeling by semantic non-negative matrix factorization with automatic model selection","authors":"M. Eren, N. Solovyev, Manish Bhattarai, Kim Ø. Rasmussen, Charles Nicholas, B. Alexandrov","doi":"10.1145/3558100.3563844","DOIUrl":"https://doi.org/10.1145/3558100.3563844","url":null,"abstract":"As the amount of text data continues to grow, topic modeling is serving an important role in understanding the content hidden by the overwhelming quantity of documents. One popular topic modeling approach is non-negative matrix factorization (NMF), an unsupervised machine learning (ML) method. Recently, Semantic NMF with automatic model selection (SeNMFk) has been proposed as a modification to NMF. In addition to heuristically estimating the number of topics, SeNMFk also incorporates the semantic structure of the text. This is performed by jointly factorizing the term frequency-inverse document frequency (TF-IDF) matrix with the co-occurrence/word-context matrix, the values of which represent the number of times two words co-occur in a predetermined window of the text. In this paper, we introduce a novel distributed method, SeNMFk-SPLIT, for semantic topic extraction suitable for large corpora. Contrary to SeNMFk, our method enables the joint factorization of large documents by decomposing the word-context and term-document matrices separately. We demonstrate the capability of SeNMFk-SPLIT by applying it to the entire artificial intelligence (AI) and ML scientific literature uploaded on arXiv.","PeriodicalId":146244,"journal":{"name":"Proceedings of the 22nd ACM Symposium on Document Engineering","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115812369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Downstream transformer generation of question-answer pairs with preprocessing and postprocessing pipelines","authors":"Cheng Zhang, Hao Zhang, Jie Wang","doi":"10.1145/3558100.3563846","DOIUrl":"https://doi.org/10.1145/3558100.3563846","url":null,"abstract":"We present a method to perform a downstream task of transformers on generating question-answer pairs (QAPs) from a given article. We first finetune pretrained transformers on QAP datasets. We then use a preprocessing pipeline to select appropriate answers from the article, and feed each answer and the relevant context to the finetuned transformer to generate a candidate QAP. Finally, we use a postprocessing pipeline to filter inadequate QAPs. In particular, using pretrained T5 models as transformers and the SQuAD dataset as the finetuning dataset, we obtain a finetuned T5 model that outperforms previous models on standard performance measures over the SQuAD dataset. We then show that our method based on this finetuned model generates a satisfactory number of high-quality QAPs on the Gaokao-EN dataset, as assessed by human judges.","PeriodicalId":146244,"journal":{"name":"Proceedings of the 22nd ACM Symposium on Document Engineering","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117316826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the 22nd ACM Symposium on Document Engineering","authors":"C. Concolato, P. Schmitz","doi":"10.1145/2361354","DOIUrl":"https://doi.org/10.1145/2361354","url":null,"abstract":"It is our great pleasure to welcome you to the 2012 ACM Symposium on Document Engineering -- DocEng 2012, which is being held September 4-7, 2012, in Paris, France. This year's symposium continues its tradition of being the premier forum for presentation of research results and experience reports on leading-edge issues of document presentation and adaptation, analysis, modeling, transformation, systems, theory, and applications. The mission of the symposium is to share significant results, to evaluate novel approaches and models, and to identify promising directions for future research and development. DocEng gives researchers and practitioners a unique opportunity to share their perspectives with others interested in the various aspects of document engineering. The call for papers attracted 89 submissions from Asia, Australia, Canada, Europe, the Russian Federation, and the United States. The program committee accepted 14 of 42 full paper submissions (33%), plus another 20 short papers, and 5 demos and posters, for a combined acceptance rate of 44%. The papers cover a variety of topics, including Layout and Presentation Control, Document Analysis, OCR and Visual Analysis, Multimedia and Hypermedia, XML and Related Tools, Architecture and Document Management, Search and Sense-making, and Digital Humanities. In addition, the program includes workshops on authoring issues, and on education models and curricula for Document Engineering. DocEng 2012 features keynote speeches by Bruno Bachimont of the Institut National de l'Audiovisuel and the Université de Technologie de Compiègne, and by Thierry Delprat of Nuxeo. We hope that these proceedings will serve as a valuable reference for document engineering researchers and developers.","PeriodicalId":146244,"journal":{"name":"Proceedings of the 22nd ACM Symposium on Document Engineering","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132806773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}