Proceedings of the ACM Symposium on Document Engineering: latest articles

Humanist-centric tools for big data: Berkeley Prosopography Services
P. Schmitz, L. Pearce
Abstract: In this paper, we describe Berkeley Prosopography Services (BPS), a new set of tools for prosopography (the identification of individuals and the study of their interactions) in support of humanities research. Prosopography is an example of "big data" in the humanities, characterized not by the size of the datasets but by the way that computational and data-driven methods can transform scholarly workflows. BPS is based upon reusable infrastructure, supporting generalized web services for corpus management, social network analysis, and visualization. The BPS disambiguation model is a formal implementation of the traditional heuristics used by humanists, and supports plug-in rules for adaptation to a wide range of domain corpora. A workspace model supports exploratory research and collaboration. We contrast the BPS model of configurable heuristic rules with other approaches to automated text analysis, and explain how our model facilitates interpretation by humanist researchers. We describe the significance of the BPS assertion model, in which researchers assert conclusions or possibilities, allowing them to override automated inference, to explore ideas in what-if scenarios, and to formally publish and subscribe to asserted annotations among colleagues and/or with students. We present an initial evaluation of researchers' experience using the tools to study corpora of cuneiform tablets, and describe plans to expand the application of the tools to a broader range of corpora.
DOI: https://doi.org/10.1145/2644866.2644870. Pages 179-188. Published 2014-09-16.
Citations: 1
Fine-grained change detection in structured text documents
Hannes Dohrn, D. Riehle
Abstract: Detecting and understanding changes between document revisions is an important task. The acquired knowledge can be used to classify the nature of a new document revision or to support a human editor in the review process. While purely textual change detection algorithms offer fine-grained results, they do not understand the syntactic meaning of a change. By representing structured text documents as XML documents we can apply tree-to-tree correction algorithms to identify the syntactic nature of a change. Many algorithms for change detection in XML documents have been proposed, but most of them focus on the intricacies of generic XML data and emphasize speed over the quality of the result. Structured text requires a change detection algorithm to pay close attention to the content in text nodes; however, recent algorithms treat text nodes as black boxes. We present an algorithm that combines the advantages of the purely textual approach with the advantages of tree-to-tree change detection by redistributing text from non-overlapping common substrings to the nodes of the trees. This allows us to spot changes not only in the structure but also in the text itself, thus achieving higher quality and a fine-grained result in linear time on average. The algorithm is evaluated by applying it to the corpus of structured text documents that can be found in the English Wikipedia.
DOI: https://doi.org/10.1145/2644866.2644880. Pages 87-96. Published 2014-09-16.
Citations: 4
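The Dohrn and Riehle abstract above mentions "non-overlapping common substrings" shared by two revisions. As a rough illustration of that one notion only, the sketch below uses Python's standard difflib to list such substrings for two strings. It is not the authors' algorithm: the function name common_substrings and the min_len threshold are assumptions made for this example, and nothing here redistributes text onto tree nodes.

```python
from difflib import SequenceMatcher


def common_substrings(old: str, new: str, min_len: int = 3):
    """Return non-overlapping common substrings as (old_pos, new_pos, text).

    Hypothetical helper for illustration; min_len drops very short,
    accidental matches such as a single shared letter.
    """
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    return [
        (m.a, m.b, old[m.a:m.a + m.size])
        for m in matcher.get_matching_blocks()
        if m.size >= min_len  # also skips the zero-length terminator block
    ]


if __name__ == "__main__":
    for old_pos, new_pos, text in common_substrings(
        "The quick brown fox", "The quick red fox"
    ):
        print(old_pos, new_pos, repr(text))
```

Text outside the reported substrings is what a textual diff would flag as changed; a tree-based method would additionally say which node the change belongs to.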
Image-based document management: aggregating collections of handwritten forms
J. Barrus, E. L. Schwartz
Abstract: Many companies still operate critical business processes using paper-based forms, including customer surveys, inspections, contracts and invoices. Converting those handwritten forms to symbolic data is expensive and complicated. This paper presents an overview of the Image-Based Document Management (IBDM) system for analyzing handwritten forms without requiring conversion to symbolic data. Strokes captured in a questionnaire on a tablet are separated into fields that are then displayed in a spreadsheet. Rows represent documents while columns represent corresponding fields across all documents. IBDM allows a process owner to capture and analyze large collections of documents with minimal IT support. IBDM supports the creation of filters and queries on the data. IBDM also allows the user to request symbolic conversion of individual columns of data and permits the user to create custom views by reordering and sorting the columns. In other words, IBDM provides a "writing on paper" experience for the data collector and a web-based database experience for the analyst.
DOI: https://doi.org/10.1145/2644866.2644891. Pages 117-120. Published 2014-09-16.
Citations: 0
The virtual splitter: refactoring web applications for the multiscreen environment
Mira Sarkis, C. Concolato, Jean-Claude Dufourd
Abstract: Creating web applications for the multiscreen environment is still a challenge. One approach is to transform existing single-screen applications, but this has not yet been done automatically or generically. This paper proposes a refactoring system. It consists of a generic and extensible mapping phase that automatically analyzes the application content based on a semantic or a visual criterion determined by the author or the user, and prepares it for the splitting process. The system then splits the application and as a result delivers two instrumented applications ready for distribution across devices. During runtime, the system uses a mirroring phase to maintain the functionality of the distributed application and to support a dynamic splitting process. Developed as a Chrome extension, our approach is validated on several web applications, including a YouTube page and a video application from Mozilla.
DOI: https://doi.org/10.1145/2644866.2644893. Pages 139-142. Published 2014-09-16.
Citations: 8
On automatic text segmentation
Boris Dadachev, A. Balinsky, H. Balinsky
Abstract: Automatic text segmentation, which is the task of breaking a text into topically consistent segments, is a fundamental problem in Natural Language Processing, Document Classification and Information Retrieval. Text segmentation can significantly improve the performance of various text mining algorithms by splitting heterogeneous documents into homogeneous fragments and thus facilitating subsequent processing. Applications range from screening of radio communication transcripts to document summarization, from automatic document classification to information visualization, and from automatic filtering to security policy enforcement; all rely on, or can largely benefit from, automatic document segmentation. In this article, a novel approach for automatic text and data stream segmentation is presented and studied. The proposed automatic segmentation algorithm takes advantage of feature extraction and unusual behaviour detection algorithms developed in [4, 5]. It is entirely unsupervised and flexible enough to allow segmentation at different scales, such as short paragraphs and large sections. We also briefly review the most popular and important algorithms for automatic text segmentation and present detailed comparisons of our approach with several of those state-of-the-art algorithms.
DOI: https://doi.org/10.1145/2644866.2644874. Pages 73-80. Published 2014-09-16.
Citations: 9
Personalized document clustering with dual supervision
Yeming Hu, E. Milios, J. Blustein, Shali Liu
Abstract: The potential for semi-supervised techniques to produce personalized clusters has not been explored. This is because semi-supervised clustering algorithms have usually been evaluated using oracles based on underlying class labels. Although using oracles allows clustering algorithms to be evaluated quickly and without labor-intensive labeling, it has the key disadvantage that oracles always give the same answer for an assignment of a document or a feature. However, different human users might give different assignments of the same document and/or feature because of different but equally valid points of view. In this paper, we conduct a user study in which we ask participants (users) to group the same document collection into clusters according to their own understanding; these groupings are then used to evaluate semi-supervised clustering algorithms for user personalization. Through our user study, we observe that different users have their own personalized organizations of the same collection and that a user's organization changes over time. Therefore, we propose that document clustering algorithms should be able to incorporate user input and produce personalized clusters based on that input. We also confirm that semi-supervised algorithms with noisy user input can still produce organizations that match a user's expectations (personalization) better than traditional unsupervised ones. Finally, we demonstrate that, with respect to user personalization, labeling keywords for clusters at the same time as labeling documents improves clustering performance further compared to labeling documents only.
DOI: https://doi.org/10.1145/2361354.2361393. Pages 161-170. Published 2012-09-04.
Citations: 14
Just-in-time personalized video presentations
Jack Jansen, Pablo César, R. Guimarães, D. Bulterman
Abstract: Using high-quality video cameras on mobile devices, it is relatively easy to capture a significant volume of video content for community events such as local concerts or sporting events. A more difficult problem is selecting and sequencing individual media fragments that meet the personal interests of a viewer of such content. In this paper, we consider an infrastructure that supports the just-in-time delivery of personalized content. Based on user profiles and interests, tailored video mash-ups can be created at view-time and then further tailored to user interests via simple end-user interaction. Unlike other mash-up research, our system focuses on client-side compilation based on personal (rather than aggregate) interests. This paper concentrates on a discussion of language and infrastructure issues required to support just-in-time video composition and delivery. Using a high school concert as an example, we provide a set of requirements for dynamic content delivery. We then provide an architecture and infrastructure that meets these requirements. We conclude with a technical and user analysis of the just-in-time personalized video approach.
DOI: https://doi.org/10.1145/2361354.2361368. Pages 59-68. Published 2012-09-04.
Citations: 10
Ad insertion in automatically composed documents
Niranjan Damera-Venkata, José Bento
Abstract: We consider the problem of automatically inserting advertisements (ads) into machine-composed documents. We explicitly analyze the fundamental tradeoff between expected revenue due to ad insertion and the quality of the corresponding composed documents. We show that the optimal tradeoff a publisher can expect may be expressed as an efficient frontier in the revenue-quality space. We develop algorithms to compose documents that lie on this optimal tradeoff frontier. These algorithms can automatically choose distributions of ad sizes and ad placement locations to optimize revenue for a given quality or optimize quality for a given revenue. Such automation allows a market maker to accept highly personalized content from publishers who have no design or ad inventory management capability and distribute formatted documents to end users with aesthetic ad placement. The ad density/coverage may be controlled by the publisher or the end user on a per-document basis by simply sliding along the tradeoff frontier. Business models where ad sales precede (ad-pull) or follow (ad-push) document composition are analyzed from a document engineering perspective.
DOI: https://doi.org/10.1145/2361354.2361358. Pages 3-12. Published 2012-09-04.
Citations: 2
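The Damera-Venkata and Bento abstract above describes an efficient frontier in revenue-quality space and "sliding along" it. The sketch below shows the general idea of such a frontier for a small, made-up set of candidate layouts. It is only an illustration under stated assumptions: the Layout type, the numeric scores and both function names are hypothetical, and the paper's actual composition algorithms are not reproduced here.

```python
from typing import List, NamedTuple, Optional


class Layout(NamedTuple):
    """Hypothetical candidate layout with illustrative scores."""
    name: str
    revenue: float
    quality: float


def efficient_frontier(layouts: List[Layout]) -> List[Layout]:
    """Keep layouts that no other layout beats on both revenue and quality."""
    frontier = []
    best_quality = float("-inf")
    # Scan from highest revenue down; a layout survives only if its quality
    # exceeds that of every layout with equal or higher revenue.
    for layout in sorted(layouts, key=lambda l: (-l.revenue, -l.quality)):
        if layout.quality > best_quality:
            frontier.append(layout)
            best_quality = layout.quality
    return frontier


def best_quality_for_revenue(
    frontier: List[Layout], min_revenue: float
) -> Optional[Layout]:
    """Slide along the frontier: highest quality with revenue >= min_revenue."""
    eligible = [l for l in frontier if l.revenue >= min_revenue]
    return max(eligible, key=lambda l: l.quality) if eligible else None


if __name__ == "__main__":
    candidates = [
        Layout("A", 10.0, 0.2),
        Layout("B", 8.0, 0.5),
        Layout("C", 7.0, 0.4),  # dominated by B
        Layout("D", 3.0, 0.9),
        Layout("E", 3.0, 0.6),  # dominated by D
    ]
    frontier = efficient_frontier(candidates)
    print([l.name for l in frontier])
    print(best_quality_for_revenue(frontier, 5.0))
```

Raising or lowering min_revenue selects a different frontier point, which is one simple way to picture a per-document control over ad density.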
Receipts2Go: the big world of small documents
Bill Janssen, E. Saund, E. Bier, Patricia Wall, M. Sprague
Abstract: The Receipts2Go system is about the world of one-page documents: cash register receipts, book covers, cereal boxes, price tags, train tickets, fire extinguisher tags. In that world, we're exploring techniques for extracting accurate information from documents for which we have no layout descriptions (indeed, no initial idea of what the document's genre is), using photos taken with cell phone cameras by users who aren't skilled document capture technicians. This paper outlines the system and reports on some initial results, including the algorithms we've found useful for cleaning up those document images, and the techniques used to extract and organize relevant information from thousands of similar-but-different page layouts.
DOI: https://doi.org/10.1145/2361354.2361381. Pages 121-124. Published 2012-09-04.
Citations: 9
A methodology for evaluating algorithms for table understanding in PDF documents
Max C. Göbel, Tamir Hassan, Ermelinda Oro, G. Orsi
Abstract: This paper presents a methodology for the evaluation of table understanding algorithms for PDF documents. The evaluation takes into account three major tasks: table detection, table structure recognition and functional analysis. We provide a general and flexible output model for each task along with corresponding evaluation metrics and methods. We also present a methodology for collecting and ground-truthing PDF documents based on consensus-reaching principles and provide a publicly available ground-truthed dataset.
DOI: https://doi.org/10.1145/2361354.2361365. Pages 45-48. Published 2012-09-04.
Citations: 61