Summarizing Web Archive Corpora Via Social Media Storytelling By Automatically Selecting and Visualizing Exemplars

IF 2.6; CAS Zone 4 (Computer Science); JCR Q2 (Computer Science, Information Systems)

ACM Transactions on the Web. Pub Date: 2023-07-03. DOI: https://dl.acm.org/doi/10.1145/3606030

Shawn M. Jones, Martin Klein, Michele C. Weigle, Michael L. Nelson

Citations: 0

Abstract

People often create themed collections to make sense of an ever-increasing number of archived web pages. Some of these collections contain hundreds of thousands of documents. Thousands of collections exist, many covering the same topic. Few collections include standardized metadata. This scale makes understanding a collection an expensive proposition. Our Dark and Stormy Archives (DSA) five-process model implements a novel summarization method to help users understand a collection by combining web archives and social media storytelling. The five processes of the DSA model are: select exemplars, generate story metadata, generate document metadata, visualize the story, and distribute the story. Selecting exemplars produces a set of k documents from the N documents in the collection, where k ≪ N, thus reducing the number of documents visitors need to review to understand a collection. Generating story and document metadata selects images, titles, descriptions, and other content from these exemplars. Visualizing the story ties this metadata together in a format the visitor can consume. Without distribution, the story is not shared for others to consume. We present a research study demonstrating that our algorithmic primitives can be combined to select relevant exemplars that are otherwise undiscoverable using a conventional search engine and query generation methods. Having demonstrated improved methods for selecting exemplars, we visualize the story. Previous work established that the social card is the best format for visitors to consume surrogates. The social card combines metadata fields, including the document’s title, a brief description, and a striking image. Social cards are commonly found on social media platforms. We discovered that these platforms perform poorly for mementos and rely on web page authors to supply the necessary values for these metadata fields. With web archives, we often encounter archived web pages that predate the existence of this metadata. To generate this missing metadata and ensure that storytelling is available for these documents, we apply machine learning to generate the images needed for social cards with a P@1 of 0.8314. We also provide the length values needed for executing automatic summarization algorithms to generate document descriptions. Applying these concepts helps us create the visualizations needed to fulfill the final processes of story generation. We close this work with examples and applications of this technology.
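
The image-selection score of 0.8314 quoted in the abstract is read here as precision at 1 (P@1): the fraction of documents for which the top-ranked candidate is judged relevant. The sketch below only illustrates how that metric is computed. The function name, the memento identifiers, the image file names, and the relevance judgments are all hypothetical and are not taken from the paper or its data.

```python
from typing import Dict, List, Set


def precision_at_1(rankings: Dict[str, List[str]],
                   relevant: Dict[str, Set[str]]) -> float:
    """Return the fraction of documents whose top-ranked image is relevant.

    rankings maps a document id to its candidate images, best first.
    relevant maps a document id to the images judged acceptable for it.
    Both structures are illustrative assumptions, not the paper's format.
    """
    if not rankings:
        raise ValueError("no rankings to score")
    hits = 0
    for doc_id, ranked in rankings.items():
        # Only the first (top-ranked) candidate counts toward P@1.
        if ranked and ranked[0] in relevant.get(doc_id, set()):
            hits += 1
    return hits / len(rankings)


# Hypothetical toy data: four archived pages, each with ranked image candidates.
rankings = {
    "memento-a": ["hero.jpg", "logo.png"],
    "memento-b": ["ad.gif", "photo.jpg"],
    "memento-c": ["map.png"],
    "memento-d": ["banner.jpg", "icon.png"],
}
relevant = {
    "memento-a": {"hero.jpg"},
    "memento-b": {"photo.jpg"},
    "memento-c": {"map.png"},
    "memento-d": {"banner.jpg"},
}

score = precision_at_1(rankings, relevant)
print(score)  # 0.75: three of the four top-ranked images are relevant
```

On this toy data the top choice is wrong only for "memento-b", so the score is 3/4. A reported P@1 of 0.8314 would mean the top-ranked image was judged suitable for roughly 83% of the evaluated documents.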

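Source journal

ACM Transactions on the Web (Engineering & Technology, Computer Science: Software Engineering)

CiteScore

4.90

Self-citation rate

0.00%

Publication volume

Review time

7.5 months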

Journal description: Transactions on the Web (TWEB) is a journal publishing refereed articles reporting the results of research on Web content, applications, use, and related enabling technologies. Topics in the scope of TWEB include but are not limited to the following: Browsers and Web Interfaces; Electronic Commerce; Electronic Publishing; Hypertext and Hypermedia; Semantic Web; Web Engineering; Web Services and Service-Oriented Computing; and XML. In addition, papers addressing the intersection of the following broader technologies with the Web are also in scope: Accessibility; Business Services; Education; Knowledge Management and Representation; Mobility and Pervasive Computing; Performance and Scalability; Recommender Systems; Searching, Indexing, Classification, Retrieval and Querying, Data Mining and Analysis; Security and Privacy; and User Interfaces. Papers discussing specific Web technologies, applications, content generation and management, and use are within scope. Papers describing novel applications of the Web, as well as papers on the underlying technologies, are also welcome.