{"title":"The Archival Acid Test: Evaluating archive performance on advanced HTML and JavaScript","authors":"Mat Kelly, Michael L. Nelson, Michele C. Weigle","doi":"10.1109/JCDL.2014.6970146","DOIUrl":"https://doi.org/10.1109/JCDL.2014.6970146","url":null,"abstract":"When preserving web pages, archival crawlers sometimes produce a result that varies from what an end-user expects. To quantitatively evaluate the degree to which an archival crawler is capable of comprehensively reproducing a web page from the live web into the archives, the crawlers' capabilities must be evaluated. In this paper, we propose a set of metrics to evaluate the capability of archival crawlers and other preservation tools using the Acid Test concept. For a variety of web preservation tools, we examine previous captures within web archives and note the features that produce incomplete or unexpected results. From there, we design the test to produce a quantitative measure of how well each tool performs its task.","PeriodicalId":92278,"journal":{"name":"Proceedings of the ... ACM/IEEE Joint Conference on Digital Libraries. ACM/IEEE Joint Conference on Digital Libraries","volume":"95 1","pages":"25-28"},"PeriodicalIF":0.0,"publicationDate":"2014-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90523481","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PageRank-based Word Sense Induction within Web Search Results Clustering","authors":"Jose G. Moreno, G. Dias","doi":"10.1109/JCDL.2014.6970227","DOIUrl":"https://doi.org/10.1109/JCDL.2014.6970227","url":null,"abstract":"Word Sense Induction is an open problem in Natural Language Processing. Many recent works have been addressing this problem with a wide spectrum of strategies based on content analysis. In this paper, we present a sense induction strategy exclusively based on link analysis over the Web. In particular, we explore the idea that the main different senses of a given word share similar linking properties and can be found by performing clustering with link-based similarity metrics. The evaluation results show that PageRank-based sense induction achieves interesting results when compared to state-of-the-art content-based algorithms in the context of Web Search Results Clustering.","PeriodicalId":92278,"journal":{"name":"Proceedings of the ... ACM/IEEE Joint Conference on Digital Libraries. ACM/IEEE Joint Conference on Digital Libraries","volume":"145 1","pages":"465-466"},"PeriodicalIF":0.0,"publicationDate":"2014-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89086956","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Crowd-sourcing Web knowledge for metadata extraction","authors":"Zhaohui Wu, W. Huang, Chen Liang, C. Lee Giles","doi":"10.1109/JCDL.2014.6970160","DOIUrl":"https://doi.org/10.1109/JCDL.2014.6970160","url":null,"abstract":"We explore a new metadata extraction framework without human annotators with the ground truth harvested from Web. A new training sample is selected based on not only the uncertainty and representativeness in the unlabeled pool, but also on its availability and credibility in Web knowledge bases. We construct a dataset of 4329 books with valid metadata and evaluate our approach using 5 Web book databases as oracles. Empirical results demonstrate its effectiveness and efficiency.","PeriodicalId":92278,"journal":{"name":"Proceedings of the ... ACM/IEEE Joint Conference on Digital Libraries. ACM/IEEE Joint Conference on Digital Libraries","volume":"135 1","pages":"141-144"},"PeriodicalIF":0.0,"publicationDate":"2014-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86424825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Life span of web pages: A survey of 10 million pages collected in 2001","authors":"Teru Agata, Yosuke Miyata, Emi Ishita, Atsushi Ikeuchi, S. Ueda","doi":"10.1109/JCDL.2014.6970226","DOIUrl":"https://doi.org/10.1109/JCDL.2014.6970226","url":null,"abstract":"This paper highlights the results of a survival survey and life span study of 10 million web pages, mainly in Japanese, that were collected for NTCIR-3 (web task) in 2001. To calculate web page life span, metadata was collected from Internet Archive's Wayback Machine via Memento. The life span study showed that the average life span of a web page is 1,132.1 days.","PeriodicalId":92278,"journal":{"name":"Proceedings of the ... ACM/IEEE Joint Conference on Digital Libraries. ACM/IEEE Joint Conference on Digital Libraries","volume":"55 1","pages":"463-464"},"PeriodicalIF":0.0,"publicationDate":"2014-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73821077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Amplifying scientific paper's abstract by leveraging data-weighted reconstruction","authors":"Shansong Yang, Weiming Lu, Zhanjiang Zhang, Baogang Wei, Wenjia An","doi":"10.1016/j.ipm.2015.12.014","DOIUrl":"https://doi.org/10.1016/j.ipm.2015.12.014","url":null,"abstract":"","PeriodicalId":92278,"journal":{"name":"Proceedings of the ... ACM/IEEE Joint Conference on Digital Libraries. ACM/IEEE Joint Conference on Digital Libraries","volume":"10 1","pages":"447-448"},"PeriodicalIF":0.0,"publicationDate":"2014-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72926792","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring relationships among video games","authors":"R. Clarke, Jin Ha Lee, Jacob Jett, S. Sacchi","doi":"10.1109/JCDL.2014.6970235","DOIUrl":"https://doi.org/10.1109/JCDL.2014.6970235","url":null,"abstract":"This poster explores relationships among video games in an attempt to better understand the domain of video games and interactive media as well as improve user access to games. Video games are related in complex ways that cannot be adequately represented by contemporary conceptual models like Functional Requirements for Bibliographic Records (FRBR). Relationships between game editions, series, distribution methods and additional game content all pose challenges for those seeking to describe video games in a user-centered way.","PeriodicalId":92278,"journal":{"name":"Proceedings of the ... ACM/IEEE Joint Conference on Digital Libraries. ACM/IEEE Joint Conference on Digital Libraries","volume":"36 1","pages":"481-482"},"PeriodicalIF":0.0,"publicationDate":"2014-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75076702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards a stratified learning approach to predict future citation counts","authors":"Tanmoy Chakraborty, Suhansanu Kumar, Pawan Goyal, Niloy Ganguly, Animesh Mukherjee","doi":"10.1109/JCDL.2014.6970190","DOIUrl":"https://doi.org/10.1109/JCDL.2014.6970190","url":null,"abstract":"In this paper, we study the problem of predicting future citation count of a scientific article after a given time interval of its publication. To this end, we gather and conduct an exhaustive analysis on a dataset of more than 1.5 million scientific papers of computer science domain. On analysis of the dataset, we notice that the citation count of the articles over the years follows a diverse set of patterns; on closer inspection we identify six broad categories of citation patterns. This important observation motivates us to adopt stratified learning approach in the prediction task, whereby, we propose a two-stage prediction model - in the first stage, the model maps a query paper into one of the six categories, and then in the second stage a regression module is run only on the subpopulation corresponding to that category to predict the future citation count of the query paper. Experimental results show that the categorization of this huge dataset during the training phase leads to a remarkable improvement (around 50%) in comparison to the well-known baseline system.","PeriodicalId":92278,"journal":{"name":"Proceedings of the ... ACM/IEEE Joint Conference on Digital Libraries. ACM/IEEE Joint Conference on Digital Libraries","volume":"41 1","pages":"351-360"},"PeriodicalIF":0.0,"publicationDate":"2014-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77931761","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The anatomy of a search and mining system for digital humanities","authors":"Martyn Harris, M. Levene, Dell Zhang, D. Levene","doi":"10.1109/JCDL.2014.6970163","DOIUrl":"https://doi.org/10.1109/JCDL.2014.6970163","url":null,"abstract":"Samtla (Search And Mining Tools with Linguistic Analysis) is an online integrated research environment designed in collaboration with historians and linguists to facilitate the study of digitised texts written in any language. It currently supports the research of two corpora: the Genizah collection held by the Taylor-Schechter Genizah Research Unit in Cambridge University, and a collection of Aramaic incantation texts from late antiquity. In contrast to standard search engines and text mining systems that rely on the bag-of-words representation of text, Samtla provides the retrieval and discovery of fuzzy text patterns/motifs (aka “formulae” to historians), which is achieved through applying a character-based n-gram statistical language model built on top of a powerful generalised suffix tree data structure. This paper briefly describes the major components of Samtla and their underlying techniques.","PeriodicalId":92278,"journal":{"name":"Proceedings of the ... ACM/IEEE Joint Conference on Digital Libraries. ACM/IEEE Joint Conference on Digital Libraries","volume":"11 1","pages":"165-168"},"PeriodicalIF":0.0,"publicationDate":"2014-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80280288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards automatic identification of core concepts in educational resources","authors":"Md Arafat Sultan, Steven Bethard, T. Sumner","doi":"10.1109/JCDL.2014.6970194","DOIUrl":"https://doi.org/10.1109/JCDL.2014.6970194","url":null,"abstract":"Automatically identifying and extracting key ideas and concepts from educational resources is an important but challenging computational task. We present a supervised machine learning approach to assessing the “coreness” of concepts expressed by resource sentences. The algorithm has been developed and evaluated in the domain of science education where coreness refers to the degree to which a sentence embodies key concepts important to developing a robust understanding of the domain. Our method operates by automatically computing and leveraging the degree of semantic similarity between resource sentences and standard domain concepts designed by human experts for various STEM domains. In our experiments, the algorithm demonstrates high accuracy in identifying sentence coreness when there is agreement between human experts on the coreness rating. We also present performance comparisons with a number of baseline systems.","PeriodicalId":92278,"journal":{"name":"Proceedings of the ... ACM/IEEE Joint Conference on Digital Libraries. ACM/IEEE Joint Conference on Digital Libraries","volume":"1 1","pages":"379-388"},"PeriodicalIF":0.0,"publicationDate":"2014-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79643586","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The value of risk management for data management in science and engineering","authors":"Filipe Ferreira, Ricardo Vieira, J. Borbinha","doi":"10.1109/JCDL.2014.6970214","DOIUrl":"https://doi.org/10.1109/JCDL.2014.6970214","url":null,"abstract":"An established concept to address data management challenges in science and engineering is the Data Management Plans. However, we claim that in some complex scenarios the actual principles for Data Management Plans might not be enough, especially when Risk Management turns to be relevant. Therefore, we propose a method, based on the ISO 31000, for science and engineering projects to create a Risk Management Plan that can complement the Data Management Plan. The validation of this proposal is presented in the real case of an engineering laboratory.","PeriodicalId":92278,"journal":{"name":"Proceedings of the ... ACM/IEEE Joint Conference on Digital Libraries. ACM/IEEE Joint Conference on Digital Libraries","volume":"29 6 1","pages":"439-440"},"PeriodicalIF":0.0,"publicationDate":"2014-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81444292","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}