{"title":"How did Dennis Ritchie produce his PhD thesis?: a typographical mystery","authors":"D. Brailsford, B. Kernighan, Williamson Ritchie","doi":"10.1145/3558100.3563839","DOIUrl":"https://doi.org/10.1145/3558100.3563839","url":null,"abstract":"Dennis Ritchie, the creator of the C programming language and, with Ken Thompson, the co-creator of the Unix operating system, completed his Harvard PhD thesis on recursive function theory in early 1968. But for unknown reasons, he never officially received his degree, and the thesis itself disappeared for nearly 50 years. This strange set of circumstances raises at least three broad questions: • What was the technical contribution of the thesis? • Why wasn't the degree granted? • How was the thesis prepared? This paper investigates the third question: how was a long and typographically complicated mathematical thesis produced at a very early stage in the history of computerized document preparation?","PeriodicalId":146244,"journal":{"name":"Proceedings of the 22nd ACM Symposium on Document Engineering","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114498233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modifying PDF sewing patterns for use with projectors","authors":"Charlotte Curtis","doi":"10.1145/3558100.3563853","DOIUrl":"https://doi.org/10.1145/3558100.3563853","url":null,"abstract":"Print-at-home PDF sewing patterns have gained popularity over the last decade and now represent a significant proportion of the home sewing pattern market. Recently, an all-digital workflow has emerged through the use of ceiling-mounted projectors, allowing for patterns to be projected directly onto fabric. However, PDF patterns produced for printing are not suitable for projecting. This paper presents PDFStitcher, an open-source cross-platform graphical tool that enables end users to modify PDF sewing patterns for use with a projector. The key functionality of PDFStitcher is described, followed by a brief discussion on the future of sewing pattern file formats and information processing.","PeriodicalId":146244,"journal":{"name":"Proceedings of the 22nd ACM Symposium on Document Engineering","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131290044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Theory entity extraction for social and behavioral sciences papers using distant supervision","authors":"Xin Wei, Lamia Salsabil, Jian Wu","doi":"10.1145/3558100.3563855","DOIUrl":"https://doi.org/10.1145/3558100.3563855","url":null,"abstract":"Theories and models, which are common in scientific papers in almost all domains, usually provide the foundations of theoretical analysis and experiments. Understanding the use of theories and models can shed light on the credibility and reproducibility of research works. Compared with metadata such as title, author, and keywords, theory extraction from scientific literature is rarely explored, especially in the social and behavioral science (SBS) domains. One challenge of applying supervised learning methods is the lack of a large number of labeled samples for training. In this paper, we propose an automated framework based on distant supervision that leverages entity mentions from Wikipedia to build a ground truth corpus consisting of more than 4,500 automatically annotated sentences containing theory/model mentions. We use this corpus to train models for theory extraction in SBS papers. We compared four deep learning architectures and found that RoBERTa-BiLSTM-CRF performs best, with a precision as high as 89.72%. The model shows promise for convenient extension to domains other than SBS. The code and data are publicly available at https://github.com/lamps-lab/theory.","PeriodicalId":146244,"journal":{"name":"Proceedings of the 22nd ACM Symposium on Document Engineering","volume":"341-342 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123865609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Graphical document representation for french newsletters analysis","authors":"Alexis Blandin, Farida Saïd, Jeanne Villaneau, P. Marteau","doi":"10.1145/3558100.3563856","DOIUrl":"https://doi.org/10.1145/3558100.3563856","url":null,"abstract":"Document analysis is essential in many industrial applications. However, engineering natural language resources to represent entire documents is still challenging. Moreover, available resources in French are scarce and do not cover all possible tasks, especially in specific business applications. In this context, we present a French newsletter dataset and its use to predict whether a newsletter will have a positive or negative impact on its readers. We propose a new graph-based representation of newsletters that takes their layout into account. We evaluate the relevance of the proposed representation to predict a newsletter's performance in terms of open and click rates using graph analysis methods.","PeriodicalId":146244,"journal":{"name":"Proceedings of the 22nd ACM Symposium on Document Engineering","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117163831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optical character recognition guided image super resolution","authors":"Philipp Hildebrandt, Maximilian Schulze, S. Cohen, Vanja Doskoc, Raid Saabni, Tobias Friedrich","doi":"10.1145/3558100.3563841","DOIUrl":"https://doi.org/10.1145/3558100.3563841","url":null,"abstract":"Recognizing disturbed text in real-life images is a difficult problem, as information that is missing due to low resolution or out-of-focus text has to be recreated. Combining text super-resolution and optical character recognition deep learning models can be a valuable tool to enlarge and enhance text images for better readability, as well as to recognize text automatically afterwards. We achieve improved peak signal-to-noise ratio and text recognition accuracy scores over TBSRN, a state-of-the-art text super-resolution model, on the real-world low-resolution dataset TextZoom, while having a smaller theoretical model size due to the use of quantization techniques. In addition, we show how different training strategies influence the performance of the resulting model.","PeriodicalId":146244,"journal":{"name":"Proceedings of the 22nd ACM Symposium on Document Engineering","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128170195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scholarly big data quality assessment: a case study of document linking and conflation with S2ORC","authors":"Jian Wu, Ryan Hiltabrand, Dominik Soós, C. Lee Giles","doi":"10.1145/3558100.3563850","DOIUrl":"https://doi.org/10.1145/3558100.3563850","url":null,"abstract":"Recently, the Allen Institute for Artificial Intelligence released the Semantic Scholar Open Research Corpus (S2ORC), one of the largest open-access scholarly big datasets with more than 130 million scholarly paper records. S2ORC contains a significant portion of automatically generated metadata. The metadata quality could impact downstream tasks such as citation analysis, citation prediction, and link analysis. In this project, we assess the document linking quality and estimate the document conflation rate for the S2ORC dataset. Using semi-automatically curated ground truth corpora, we estimated that the overall document linking quality is high, with 92.6% of documents correctly linking to six major databases, but the linking quality varies depending on subject domains. The document conflation rate is around 2.6%, meaning that about 97.4% of documents are unique. We further quantitatively compared three near-duplicate detection methods using the ground truth created from S2ORC. The experiments indicated that locality-sensitive hashing was the best method in terms of effectiveness and scalability, achieving high performance (F1=0.960) and a much reduced runtime. Our code and data are available at https://github.com/lamps-lab/docconflation.","PeriodicalId":146244,"journal":{"name":"Proceedings of the 22nd ACM Symposium on Document Engineering","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131445789","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Long-term lifecycle-related management of digital building documents: towards a holistic and standard-based concept for a technical and organizational solution in building authorities","authors":"Uwe M. Borghoff, Eberhard Pfeiffer, Peter Rödig","doi":"10.1145/3558100.3563842","DOIUrl":"https://doi.org/10.1145/3558100.3563842","url":null,"abstract":"The long-term lifecycle-related management of digital building information is essential to improve the overall quality of public built assets. However, this management task still poses great challenges for building authorities, as they are usually responsible for large, heterogeneous and long-lived built assets with countless data sets and documents that are increasingly changing from analogue to digital representations. These digital collections are characterized by complex dependencies, by numerous different, sometimes highly specialized and proprietary formats, and also by their inappropriate organization. The major challenge is to ensure completeness, consistency and usability over the entire lifecycle of buildings and their associated digital data and documents. In this paper, we present an approach for a holistic and standard-based concept for a technical and organizational solution in building authorities. Holistic means integrating concepts for the long-term usability of digital building information, taking into account the framework conditions described in building authorities, including the introduction of BIM (building information modeling). To this end, we outline how the concepts of the consolidated and widely accepted ISO-standardized reference model OAIS (open archival information system) can be applied to a building-specific information architecture. First, we sketch the history of electronic data processing in the building sector and introduce the essential concepts of OAIS. Then, we illustrate typical major actors and their (future) IT systems, including systems intended for OAIS-compliant long-term usability. Next, we outline major (future) software components and their interactions and assignment to lifecycle phases. Finally, we delineate how the generic information model of OAIS can be used. In summary, ensuring the long-term usability of digital information in the building sector will remain a grand challenge, but our proposed approach to the systematic application and further refinement of the OAIS reference model can help to better organize future discussions as well as research, development and implementation activities. We conclude with some suggestions for further research based on the concepts of the OAIS reference model, such as refining information models or developing information repositories needed for long-term interpretation of digital objects.","PeriodicalId":146244,"journal":{"name":"Proceedings of the 22nd ACM Symposium on Document Engineering","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129867435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SeNMFk-SPLIT: large corpora topic modeling by semantic non-negative matrix factorization with automatic model selection","authors":"M. Eren, N. Solovyev, Manish Bhattarai, Kim Ø. Rasmussen, Charles Nicholas, B. Alexandrov","doi":"10.1145/3558100.3563844","DOIUrl":"https://doi.org/10.1145/3558100.3563844","url":null,"abstract":"As the amount of text data continues to grow, topic modeling is serving an important role in understanding the content hidden by the overwhelming quantity of documents. One popular topic modeling approach is non-negative matrix factorization (NMF), an unsupervised machine learning (ML) method. Recently, Semantic NMF with automatic model selection (SeNMFk) has been proposed as a modification to NMF. In addition to heuristically estimating the number of topics, SeNMFk also incorporates the semantic structure of the text. This is performed by jointly factorizing the term frequency-inverse document frequency (TF-IDF) matrix with the co-occurrence/word-context matrix, the values of which represent the number of times two words co-occur in a predetermined window of the text. In this paper, we introduce a novel distributed method, SeNMFk-SPLIT, for semantic topic extraction suitable for large corpora. Contrary to SeNMFk, our method enables the joint factorization of large documents by decomposing the word-context and term-document matrices separately. We demonstrate the capability of SeNMFk-SPLIT by applying it to the entire artificial intelligence (AI) and ML scientific literature uploaded on arXiv.","PeriodicalId":146244,"journal":{"name":"Proceedings of the 22nd ACM Symposium on Document Engineering","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115812369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Downstream transformer generation of question-answer pairs with preprocessing and postprocessing pipelines","authors":"Cheng Zhang, Hao Zhang, Jie Wang","doi":"10.1145/3558100.3563846","DOIUrl":"https://doi.org/10.1145/3558100.3563846","url":null,"abstract":"We present a method to perform a downstream task of transformers on generating question-answer pairs (QAPs) from a given article. We first finetune pretrained transformers on QAP datasets. We then use a preprocessing pipeline to select appropriate answers from the article, and feed each answer and the relevant context to the finetuned transformer to generate a candidate QAP. Finally, we use a postprocessing pipeline to filter inadequate QAPs. In particular, using pretrained T5 models as transformers and the SQuAD dataset as the finetuning dataset, we obtain a finetuned T5 model that outperforms previous models on standard performance measures over the SQuAD dataset. We then show that our method based on this finetuned model generates a satisfactory number of high-quality QAPs on the Gaokao-EN dataset, as assessed by human judges.","PeriodicalId":146244,"journal":{"name":"Proceedings of the 22nd ACM Symposium on Document Engineering","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117316826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the 22nd ACM Symposium on Document Engineering","authors":"C. Concolato, P. Schmitz","doi":"10.1145/2361354","DOIUrl":"https://doi.org/10.1145/2361354","url":null,"abstract":"It is our great pleasure to welcome you to the 2012 ACM Symposium on Document Engineering -- DocEng 2012, which is being held September 4-7, 2012, in Paris, France. This year's symposium continues its tradition of being the premier forum for presentation of research results and experience reports on leading-edge issues of document presentation and adaptation, analysis, modeling, transformation, systems, theory, and applications. The mission of the symposium is to share significant results, to evaluate novel approaches and models, and to identify promising directions for future research and development. DocEng gives researchers and practitioners a unique opportunity to share their perspectives with others interested in the various aspects of document engineering. The call for papers attracted 89 submissions from Asia, Australia, Canada, Europe, the Russian Federation, and the United States. The program committee accepted 14 of 42 full paper submissions (33%), plus another 20 short papers, and 5 demos and posters, for a combined acceptance rate of 44%. The papers cover a variety of topics, including Layout and Presentation Control, Document Analysis, OCR and Visual Analysis, Multimedia and Hypermedia, XML and Related Tools, Architecture and Document Management, Search and Sense-making, and Digital Humanities. In addition, the program includes workshops on authoring issues, and on education models and curricula for Document Engineering. DocEng 2012 features keynote speeches by Bruno Bachimont of the Institut National de l'Audiovisuel and the Université de Technologie de Compiègne, and by Thierry Delprat of Nuxeo. We hope that these proceedings will serve as a valuable reference for document engineering researchers and developers.","PeriodicalId":146244,"journal":{"name":"Proceedings of the 22nd ACM Symposium on Document Engineering","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132806773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}