Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering最新文献_第4页

Humanist-centric tools for big data: berkeley prosopography services 以人为中心的大数据工具:伯克利人文学服务

Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering Pub Date : 2014-09-16 DOI: 10.1145/2644866.2644870

P. Schmitz, L. Pearce

{"title":"Humanist-centric tools for big data: berkeley prosopography services","authors":"P. Schmitz, L. Pearce","doi":"10.1145/2644866.2644870","DOIUrl":"https://doi.org/10.1145/2644866.2644870","url":null,"abstract":"In this paper, we describe Berkeley Prosopography Services (BPS), a new set of tools for prosopography - the identification of individuals and study of their interactions - in support of humanities research. Prosopography is an example of \"big data\" in the humanities, characterized not by the size of the datasets, but by the way that computational and data-driven methods can transform scholarly workflows. BPS is based upon re-usable infrastructure, supporting generalized web services for corpus management, social network analysis, and visualization. The BPS disambiguation model is a formal implementation of the traditional heuristics used by humanists, and supports plug-in rules for adaptation to a wide range of domain corpora. A workspace model supports exploratory research and collaboration. We contrast the BPS model of configurable heuristic rules to other approaches for automated text analysis, and explain how our model facilitates interpretation by humanist researchers. We describe the significance of the BPS assertion model in which researchers assert conclusions or possibilities, allowing them to override automated inference, to explore ideas in what-if scenarios, and to formally publish and subscribe-to asserted annotations among colleagues, and/or with students. We present an initial evaluation of researchers' experience using the tools to study corpora of cuneiform tablets, and describe plans to expand the application of the tools to a broader range of corpora.","PeriodicalId":91385,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering","volume":"78 1 1","pages":"179-188"},"PeriodicalIF":0.0,"publicationDate":"2014-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78290101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

Fine-grained change detection in structured text documents 结构化文本文档中的细粒度变更检测

Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering Pub Date : 2014-09-16 DOI: 10.1145/2644866.2644880

Hannes Dohrn, D. Riehle

{"title":"Fine-grained change detection in structured text documents","authors":"Hannes Dohrn, D. Riehle","doi":"10.1145/2644866.2644880","DOIUrl":"https://doi.org/10.1145/2644866.2644880","url":null,"abstract":"Detecting and understanding changes between document revisions is an important task. The acquired knowledge can be used to classify the nature of a new document revision or to support a human editor in the review process. While purely textual change detection algorithms offer fine-grained results, they do not understand the syntactic meaning of a change. By representing structured text documents as XML documents we can apply tree-to-tree correction algorithms to identify the syntactic nature of a change.\u0000 Many algorithms for change detection in XML documents have been propsed but most of them focus on the intricacies of generic XML data and emphasize speed over the quality of the result. Structured text requires a change detection algorithm to pay close attention to the content in text nodes, however, recent algorithms treat text nodes as black boxes.\u0000 We present an algorithm that combines the advantages of the purely textual approach with the advantages of tree-to-tree change detection by redistributing text from non-overlapping common substrings to the nodes of the trees. This allows us to not only spot changes in the structure but also in the text itself, thus achieving higher quality and a fine-grained result in linear time on average. The algorithm is evaluated by applying it to the corpus of structured text documents that can be found in the English Wikipedia.","PeriodicalId":91385,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering","volume":"149 1","pages":"87-96"},"PeriodicalIF":0.0,"publicationDate":"2014-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79440986","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 4

Image-based document management: aggregating collections of handwritten forms 基于图像的文档管理:聚合手写表单的集合

Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering Pub Date : 2014-09-16 DOI: 10.1145/2644866.2644891

J. Barrus, E. L. Schwartz

引用次数: 0

The virtual splitter: refactoring web applications for themultiscreen environment 虚拟分配器:为多屏幕环境重构web应用程序

Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering Pub Date : 2014-09-16 DOI: 10.1145/2644866.2644893

Mira Sarkis, C. Concolato, Jean-Claude Dufourd

引用次数: 8

On automatic text segmentation 自动文本分割

Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering Pub Date : 2014-09-16 DOI: 10.1145/2644866.2644874

Boris Dadachev, A. Balinsky, H. Balinsky

{"title":"On automatic text segmentation","authors":"Boris Dadachev, A. Balinsky, H. Balinsky","doi":"10.1145/2644866.2644874","DOIUrl":"https://doi.org/10.1145/2644866.2644874","url":null,"abstract":"Automatic text segmentation, which is the task of breaking a text into topically-consistent segments, is a fundamental problem in Natural Language Processing, Document Classification and Information Retrieval. Text segmentation can significantly improve the performance of various text mining algorithms, by splitting heterogeneous documents into homogeneous fragments and thus facilitating subsequent processing. Applications range from screening of radio communication transcripts to document summarization, from automatic document classification to information visualization, from automatic filtering to security policy enforcement - all rely on, or can largely benefit from, automatic document segmentation. In this article, a novel approach for automatic text and data stream segmentation is presented and studied. The proposed automatic segmentation algorithm takes advantage of feature extraction and unusual behaviour detection algorithms developed in [4, 5]. It is entirely unsupervised and flexible to allow segmentation at different scales, such as short paragraphs and large sections. We also briefly review the most popular and important algorithms for automatic text segmentation and present detailed comparisons of our approach with several of those state-of-the-art algorithms.","PeriodicalId":91385,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering","volume":"9 1","pages":"73-80"},"PeriodicalIF":0.0,"publicationDate":"2014-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89114138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 9

Personalized document clustering with dual supervision 具有双重监督的个性化文档聚类

Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering Pub Date : 2012-09-04 DOI: 10.1145/2361354.2361393

Yeming Hu, E. Milios, J. Blustein, Shali Liu

{"title":"Personalized document clustering with dual supervision","authors":"Yeming Hu, E. Milios, J. Blustein, Shali Liu","doi":"10.1145/2361354.2361393","DOIUrl":"https://doi.org/10.1145/2361354.2361393","url":null,"abstract":"The potential for semi-supervised techniques to produce personalized clusters has not been explored. This is due to the fact that semi-supervised clustering algorithms used to be evaluated using oracles based on underlying class labels. Although using oracles allows clustering algorithms to be evaluated quickly and without labor intensive labeling, it has the key disadvantage that oracles always give the same answer for an assignment of a document or a feature. However, different human users might give different assignments of the same document and/or feature because of different but equally valid points of view. In this paper, we conduct a user study in which we ask participants (users) to group the same document collection into clusters according to their own understanding, which are then used to evaluate semi-supervised clustering algorithms for user personalization. Through our user study, we observe that different users have their own personalized organizations of the same collection and a user's organization changes over time. Therefore, we propose that document clustering algorithms should be able to incorporate user input and produce personalized clusters based on the user input. We also confirm that semi-supervised algorithms with noisy user input can still produce better organizations matching user's expectation (personalization) than traditional unsupervised ones. Finally, we demonstrate that labeling keywords for clusters at the same time as labeling documents can improve clustering performance further compared to labeling only documents with respect to user personalization.","PeriodicalId":91385,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering","volume":"59 Pt A 1","pages":"161-170"},"PeriodicalIF":0.0,"publicationDate":"2012-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86924607","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 14

Just-in-time personalized video presentations 即时的个性化视频演示

Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering Pub Date : 2012-09-04 DOI: 10.1145/2361354.2361368

Jack Jansen, Pablo César, R. Guimarães, D. Bulterman

{"title":"Just-in-time personalized video presentations","authors":"Jack Jansen, Pablo César, R. Guimarães, D. Bulterman","doi":"10.1145/2361354.2361368","DOIUrl":"https://doi.org/10.1145/2361354.2361368","url":null,"abstract":"Using high-quality video cameras on mobile devices, it is relatively easy to capture a significant volume of video content for community events such as local concerts or sporting events. A more difficult problem is selecting and sequencing individual media fragments that meet the personal interests of a viewer of such content. In this paper, we consider an infrastructure that supports the just-in-time delivery of personalized content. Based on user profiles and interests, tailored video mash-ups can be created at view-time and then further tailored to user interests via simple end-user interaction. Unlike other mash-up research, our system focuses on client-side compilation based on personal (rather than aggregate) interests. This paper concentrates on a discussion of language and infrastructure issues required to support just-in-time video composition and delivery. Using a high school concert as an example, we provide a set of requirements for dynamic content delivery. We then provide an architecture and infrastructure that meets these requirements. We conclude with a technical and user analysis of the just-in-time personalized video approach.","PeriodicalId":91385,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering","volume":"1 1","pages":"59-68"},"PeriodicalIF":0.0,"publicationDate":"2012-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83657243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 10

Ad insertion in automatically composed documents 自动组合文档中的广告插入

Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering Pub Date : 2012-09-04 DOI: 10.1145/2361354.2361358

Niranjan Damera-Venkata, José Bento

{"title":"Ad insertion in automatically composed documents","authors":"Niranjan Damera-Venkata, José Bento","doi":"10.1145/2361354.2361358","DOIUrl":"https://doi.org/10.1145/2361354.2361358","url":null,"abstract":"We consider the problem of automatically inserting advertisements (ads) into machine composed documents. We explicitly analyze the fundamental tradeoff between expected revenue due to ad insertion and the quality of the corresponding composed documents. We show that the optimal tradeoff a publisher can expect may be expressed as an efficient-frontier in the revenue-quality space. We develop algorithms to compose documents that lie on this optimal tradeoff frontier. These algorithms can automatically choose distributions of ad sizes and ad placement locations to optimize revenue for a given quality or optimize quality for given revenue. Such automation allows a market maker to accept highly personalized content from publishers who have no design or ad inventory management capability and distribute formatted documents to end users with aesthetic ad placement. The ad density/coverage may be controlled by the publisher or the end user on a per document basis by simply sliding along the tradeoff frontier. Business models where ad sales precede (ad-pull) or follow (ad-push) document composition are analyzed from a document engineering perspective.","PeriodicalId":91385,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering","volume":"82 1","pages":"3-12"},"PeriodicalIF":0.0,"publicationDate":"2012-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87378701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 2

Receipts2Go: the big world of small documents Receipts2Go:小文档的大世界

Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering Pub Date : 2012-09-04 DOI: 10.1145/2361354.2361381

Bill Janssen, E. Saund, E. Bier, Patricia Wall, M. Sprague

引用次数: 9

A methodology for evaluating algorithms for table understanding in PDF documents 一种评估PDF文档中表理解算法的方法

Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering Pub Date : 2012-09-04 DOI: 10.1145/2361354.2361365

Max C. Göbel, Tamir Hassan, Ermelinda Oro, G. Orsi

引用次数: 61