{"title":"Big Data Text Summarization for Events: A Problem Based Learning Course","authors":"Tarek Kanan, Xuan Zhang, M. Magdy, E. Fox","doi":"10.1145/2756406.2756943","DOIUrl":"https://doi.org/10.1145/2756406.2756943","url":null,"abstract":"Problem/project Based Learning (PBL) is a highly effective student-centered teaching method, where student teams learn by solving problems. This paper describes an instance of PBL applied to digital library education. We show the design, implementation, results, and partial evaluation of a Computational Linguistics course that provides students an opportunity to engage in active learning about adding value to digital libraries with large collections of text, i.e., one aspect of \"big data.\" Students are engaging in PBL with the semester long challenge of generating good English summaries of an event, given a large collection from our webpage archives. Six teams, each working with a different type of event, and applying three different summarization methods, learned how to generate good summaries; these have fair precision relative to the Wikipedia page that describes their event.","PeriodicalId":256118,"journal":{"name":"Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116999490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Before the Repository: Defining the Preservation Threats to Research Data in the Lab","authors":"Stacy T. Kowalczyk","doi":"10.1145/2756406.2756909","DOIUrl":"https://doi.org/10.1145/2756406.2756909","url":null,"abstract":"This paper describes the results of a large survey designed to quantify the risks and threats to the preservation of the research data in the lab and to determine the mitigating actions of researchers. A total of 724 National Science Foundation awardees completed this survey. Identifying risks and threats to digital preservation has been a significant research stream. Much of this work has been within the context of a preservation technology infrastructure such as data archives for a digital repository. This study looks at the risks and threats to research data prior to its inclusion in a preservation technology infrastructure. The greatest threat to preservation is human error, followed by equipment malfunction, obsolete software, and data corruption. Lost and mislabeled media are not components in the threat taxonomies developed for repositories; however, they do represent an important threat to research data in the lab. Researchers have recognized the need to mitigate the risks inherent in maintaining digital data by implementing data management in their lab environments and have taken their responsibility as data managers seriously; however, they would still prefer to have professional data management support.","PeriodicalId":256118,"journal":{"name":"Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121463678","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The HathiTrust Research Center: Providing analytic access to the HathiTrust Digital Library's 4.7 billion pages","authors":"J. S. Downie","doi":"10.1145/2756406.2771494","DOIUrl":"https://doi.org/10.1145/2756406.2771494","url":null,"abstract":"This lecture provides an update on the recent developments and activities of the HathiTrust Research Center (HTRC). The HTRC is the research arm of the HathiTrust, an online repository dedicated to the provision of access to a comprehensive body of published works for scholarship and education. The HathiTrust is a partnership of over 100 major research institutions and libraries working to ensure that the cultural record is preserved and accessible long into the future. Membership is open to institutions worldwide. Over 13.1 million volumes (4.7 billion pages) have been ingested into the HathiTrust digital archive from sources including Google Books, member university libraries, the Internet Archive, and numerous private collections. The HTRC is dedicated to facilitating scholarship by enabling analytic access to the corpus, developing research tools, fostering research projects and communities, and providing additional resources such as enhanced metadata and indices that will assist scholars to more easily exploit the HathiTrust materials. This talk will outline the mission, goals and structure of the HTRC. It will also provide an overview of recent work being conducted on a range of projects, partnerships and initiatives. Projects include Workset Creation for Scholarly Analysis project (WCSA, funded by the Andrew W. Mellon Foundation) and the HathiTrust + Bookworm project (HT+BW, funded by the National Endowment for the Humanities). HTRC's involvement with the NOVEL(TM) text mining project and the Single Interface for Music Score Searching and Analysis (SIMSSA) project, both funded by the SSHRC Partnership Grant programme, will be introduced. The HTRC's new feature extraction and Data Capsule initiatives, part of its ongoing efforts to enable the non-consumptive analyses of the approximately 8 million volumes under copyright restrictions, will also be discussed. The talk will conclude with some suggestions on how the non-consumptive research model might be improved upon and possibly extended beyond the HathiTrust context.","PeriodicalId":256118,"journal":{"name":"Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115272594","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Session 9 - Archiving, Repositories, and Content","authors":"Maureen Henninger","doi":"10.1145/3260517","DOIUrl":"https://doi.org/10.1145/3260517","url":null,"abstract":"","PeriodicalId":256118,"journal":{"name":"Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121736925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Case Study of Waiting List on WPLC Digital Library","authors":"Wooseob Jeong, H. Han, Laura Ridenour","doi":"10.1145/2756406.2756961","DOIUrl":"https://doi.org/10.1145/2756406.2756961","url":null,"abstract":"With the increasing popularity of e-books and audiobooks provided by public libraries in the U.S., the demand does not seem to be met with sufficient supply, as many popular titles require months of waiting time. In this study, we collected data from the Wisconsin Public Library Consortium's digital libraries service once a day for more than two months for selected popular titles. This data reflects the current supply and demand of popular titles in public libraries' digital library services. Based on our data analysis and observation, we suggest ways to achieve faster circulation, which ultimately allows for better services to library users.","PeriodicalId":256118,"journal":{"name":"Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries","volume":"89 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117169866","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"iCrawl: Improving the Freshness of Web Collections by Integrating Social Web and Focused Web Crawling","authors":"Gerhard Gossen, Elena Demidova, T. Risse","doi":"10.1145/2756406.2756925","DOIUrl":"https://doi.org/10.1145/2756406.2756925","url":null,"abstract":"Researchers in the Digital Humanities and journalists need to monitor, collect and analyze fresh online content regarding current events such as the Ebola outbreak or the Ukraine crisis on demand. However, existing focused crawling approaches only consider topical aspects while ignoring temporal aspects and therefore cannot achieve thematically coherent and fresh Web collections. Especially Social Media provide a rich source of fresh content, which is not used by state-of-the-art focused crawlers. In this paper we address the issues of enabling the collection of fresh and relevant Web and Social Web content for a topic of interest through seamless integration of Web and Social Media in a novel integrated focused crawler. The crawler collects Web and Social Media content in a single system and exploits the stream of fresh Social Media content for guiding the crawler.","PeriodicalId":256118,"journal":{"name":"Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries","volume":"358 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122728865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic Classification of Research Documents using Textual Entailment","authors":"B. Ojokoh, O. Omisore, O. W. Samuel","doi":"10.1145/2756406.2756960","DOIUrl":"https://doi.org/10.1145/2756406.2756960","url":null,"abstract":"Exploring the accumulative nature of Internet documents has become a rising issue that requires systematic ways to construct what we need from what we have. Manual and semi-manual document classification techniques have facilitated retrieval and maintenance of document repositories for easy access; however, they are customarily painstaking and labor-intensive. Herein, we propose a document classification model using automatic access of natural language meaning. The model is made up of application, business, and storage layers. The business layer, as a core component, automatically extracts sentences containing keywords from research documents and classifies them using the geometrical similarity of their sentential entailments.","PeriodicalId":256118,"journal":{"name":"Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123182803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Session 3 - Big Data, Big Resources","authors":"G. Newton","doi":"10.1145/3260511","DOIUrl":"https://doi.org/10.1145/3260511","url":null,"abstract":"","PeriodicalId":256118,"journal":{"name":"Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129981385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Demystifying the Semantics of Relevant Objects in Scholarly Collections: A Probabilistic Approach","authors":"J. M. Pinto, Wolf-Tilo Balke","doi":"10.1145/2756406.2756923","DOIUrl":"https://doi.org/10.1145/2756406.2756923","url":null,"abstract":"Efforts to make highly specialized knowledge accessible through scientific digital libraries need to go beyond mere bibliographic metadata, since here information search is mostly entity-centric. Previous work has realized this trend and developed different methods to recognize and (to some degree even automatically) annotate several important types of entities: genes and proteins, chemical structures and molecules, or drug names to name but a few. Moreover, such entities are often cross-referenced with entries in curated databases. However, several questions still remain to be answered: Given a scientific discipline what are the important entities? How can they be automatically identified? Are really all of them relevant, i.e. do all of them carry deeper semantics for assessing a publication? How can they be represented, described, and subsequently annotated? How can they be used for search tasks? In this work we focus on answering some of these questions. We claim that to bring the use of scientific digital libraries to the next level we must treat topic-specific entities as first class citizens and deeply integrate their semantics into the search process. To support this we propose a novel probabilistic approach that not only successfully provides a solution to the integration problem, but also demonstrates how to leverage the knowledge encoded in entities and provide insights to explore the use of our approach in different scenarios. Finally, we show how our results can benefit information providers.","PeriodicalId":256118,"journal":{"name":"Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries","volume":"os-44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127782629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analyzing User Requests for Anime Recommendations","authors":"Jin Ha Lee, Yun-Jeong Shim, Jacob Jett","doi":"10.1145/2756406.2756969","DOIUrl":"https://doi.org/10.1145/2756406.2756969","url":null,"abstract":"Anime is increasingly becoming recognized as an important commercial product and cultural artifact. However, little is known regarding users' information needs and behavior related to anime. This study specifically attempts to improve our understanding of how people seek anime recommendations. We analyzed 546 user questions in natural language, collected from a Korean Q&A website Naver Knowledge-iN, where users are asking for anime recommendations. The findings suggest the importance of establishing robust metadata for the seven commonly used features for anime recommenders (i.e., title, genre, artistic style, story, character description, series title, and mood) in digital libraries, as well as allowing users to specify known anime and series titles as examples for seeking similar items, or examples of the kinds of items to be excluded.","PeriodicalId":256118,"journal":{"name":"Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127103118","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}