2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL): Latest Papers, Page 7

Diachronic Analysis of German Parliamentary Proceedings: Ideological Shifts through the Lens of Political Biases

2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL) Pub Date : 2021-08-13 DOI: 10.1109/JCDL52503.2021.00017

Tobias Walter, Celina Kirschner, Steffen Eger, Goran Glavaš, Anne Lauscher, Simone Paolo Ponzetto

Citations: 10

COMPARE: A Taxonomy and Dataset of Comparison Discussions in Peer Reviews

2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL) Pub Date : 2021-08-09 DOI: 10.1109/JCDL52503.2021.00068

Shruti Singh, M. Singh, Pawan Goyal

Citations: 4

Profiling Web Archival Voids for Memento Routing

2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL) Pub Date : 2021-08-06 DOI: 10.1109/JCDL52503.2021.00027

Sawood Alam, Michele C. Weigle, Michael L. Nelson

Citations: 3

Garbage, Glitter, or Gold: Assigning Multi-Dimensional Quality Scores to Social Media Seeds for Web Archive Collections

2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL) Pub Date : 2021-07-06 DOI: 10.1109/JCDL52503.2021.00020

Alexander C. Nwala, Michele C. Weigle, Michael L. Nelson

Abstract: From popular uprisings to pandemics, the Web is an essential source consulted by scientists and historians for reconstructing and studying past events. Unfortunately, the Web is plagued by link rot and content drift (together, reference rot), which cause important Web resources to disappear. Web archive collections help reduce the costly effects of reference rot by saving Web resources that chronicle important stories and events before they disappear. These collections often begin with URLs called seeds, hand-selected by experts or scraped from social media posts. Because the quality of social media content varies widely, we propose a framework for assigning multidimensional quality scores to social media seeds for Web archive collections about stories and events. We leveraged contributions from social media research for attributing quality to social media content and users based on credibility, reputation, and influence. We combined these with contributions from Web archive research that emphasize the importance of geographical and temporal constraints when selecting seeds. Next, we developed the Quality Proxies (QP) framework, which assigns seeds extracted from social media a quality score across 10 major dimensions, including popularity, geographical, temporal, subject-expert, retrievability, relevance, reputation, and scarcity. We instantiated the framework and showed that seeds can be scored across multiple QP classes that map to different policies for ranking seeds, such as prioritizing seeds from local news or from reputable and/or popular sources. The QP framework is extensible and robust; seeds can be scored even when a subset of the QP dimensions is absent. Most importantly, scores assigned by Quality Proxies are explainable, providing the opportunity to critique them. Our results showed that Quality Proxies selected quality seeds with increased precision (by ≈0.13) both when novelty is and when it is not prioritized. These contributions provide an explainable score that can be used to rank and select quality seeds for Web archive collections and for other domains that select seeds from social media.
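The abstract says seeds can still be scored when some QP dimensions are absent and that the scores are explainable. A minimal sketch of one way such behaviour could work is shown below. It is not the authors' implementation: the weighted-mean aggregation, the function names `combined_score` and `explain`, and the [0, 1] score range are all assumptions made for illustration; only the eight dimension names come from the abstract.

```python
from typing import Dict, List, Optional, Tuple

# The eight dimension names listed in the abstract.
DIMENSIONS = (
    "popularity", "geographical", "temporal", "subject-expert",
    "retrievability", "relevance", "reputation", "scarcity",
)


def combined_score(
    scores: Dict[str, Optional[float]],
    weights: Optional[Dict[str, float]] = None,
) -> Optional[float]:
    """Weighted mean over the dimensions that are present.

    Hypothetical aggregation rule (an assumption, not from the paper):
    dimensions that are missing or None are skipped, so a seed with
    partial information still receives a score.
    """
    weights = weights or {}
    total = 0.0
    weight_sum = 0.0
    for dim in DIMENSIONS:
        value = scores.get(dim)
        if value is None:
            continue  # absent dimension: ignore rather than fail
        w = weights.get(dim, 1.0)  # unweighted dimensions default to 1.0
        total += w * value
        weight_sum += w
    if weight_sum == 0:
        return None  # no usable dimension at all
    return total / weight_sum


def explain(scores: Dict[str, Optional[float]]) -> List[Tuple[str, float]]:
    """Return the present dimensions sorted from highest to lowest score."""
    present = [(d, v) for d, v in scores.items() if d in DIMENSIONS and v is not None]
    return sorted(present, key=lambda item: item[1], reverse=True)
```

Changing the `weights` argument is one hypothetical way to express different ranking policies, for example weighting `reputation` more heavily to favour reputable sources.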

Citations: 2

Automatic Metadata Extraction Incorporating Visual Features from Scanned Electronic Theses and Dissertations

2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL) Pub Date : 2021-07-01 DOI: 10.1109/JCDL52503.2021.00066

Muntabir Hasan Choudhury, Himarsha R. Jayanetti, Jian Wu, William A. Ingram, E. Fox

Abstract: Electronic Theses and Dissertations (ETDs) contain domain knowledge that can be used for many digital library tasks, such as analyzing citation networks and predicting research trends. Automatic metadata extraction is important for building scalable digital library search engines. Most existing methods, such as GROBID, CERMINE, and ParsCit, are designed for born-digital documents, so they often fail to extract metadata from scanned documents such as ETDs. Traditional sequence tagging methods rely mainly on text-based features. In this paper, we propose a conditional random field (CRF) model that combines text-based and visual features. To verify the robustness of our model, we extended an existing corpus and created a new ground truth corpus consisting of 500 ETD cover pages with human-validated metadata. Our experiments show that the CRF with visual features outperformed both a heuristic baseline and a CRF model with only text-based features. The proposed model achieved F1 scores of 81.3% to 96% on seven metadata fields. The data and source code are publicly available on Google Drive (https://tinyurl.com/y8kxzwrp) and in a GitHub repository (https://github.com/lamps-lab/ETDMiner/tree/master/etd_crf), respectively.
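To make the idea of combining text-based and visual features concrete, the sketch below builds one feature dictionary per token, the input shape that CRF toolkits commonly accept. It is an illustration only, not the feature set from the paper: the `Token` class, its bounding-box fields, and every feature name are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

Feature = Union[str, float, bool]


@dataclass
class Token:
    """A word from an OCR'd cover page (hypothetical structure)."""
    text: str
    x: float       # left edge of the bounding box, in pixels
    y: float       # top edge of the bounding box, in pixels
    height: float  # bounding-box height, used as a proxy for font size


def token_features(
    tok: Token, page_width: float, page_height: float, median_height: float
) -> Dict[str, Feature]:
    """Combine text-based and visual (layout) features for one token."""
    return {
        # Text-based features
        "lower": tok.text.lower(),
        "is_title_case": tok.text.istitle(),
        "is_upper": tok.text.isupper(),
        "has_digit": any(c.isdigit() for c in tok.text),
        # Visual features, normalized so they are comparable across pages
        "rel_x": tok.x / page_width,
        "rel_y": tok.y / page_height,
        "rel_height": tok.height / median_height,
    }


def page_features(
    tokens: List[Token], page_width: float, page_height: float
) -> List[Dict[str, Feature]]:
    """Feature sequence for one page, in reading order."""
    if not tokens:
        return []
    heights = sorted(t.height for t in tokens)
    median_height = heights[len(heights) // 2]  # upper median for even counts
    return [token_features(t, page_width, page_height, median_height) for t in tokens]
```

The intuition behind this kind of design is that a title on a cover page tends to sit near the top in larger type, so relative position and relative height can separate it from fields that look similar as plain text.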

Citations: 7

GraphConfRec: A Graph Neural Network-Based Conference Recommender System

2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL) Pub Date : 2021-06-23 DOI: 10.1109/JCDL52503.2021.00021

Andreea Iana, Heiko Paulheim

Citations: 6

ScanBank: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations

2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL) Pub Date : 2021-06-23 DOI: 10.1109/JCDL52503.2021.00030

S. Kahu, William A. Ingram, E. Fox, Jian Wu

Abstract: We focus on electronic theses and dissertations (ETDs), aiming to improve access and expand their utility, since more than 6 million are publicly available and they constitute an important corpus to aid research and education across disciplines. The corpus is growing as new born-digital documents are included and as millions of older theses and dissertations have been converted to digital form for dissemination in institutional repositories. In ETDs, as in other scholarly works, figures and tables can communicate a large amount of information concisely. Although methods have been proposed for extracting figures and tables from born-digital PDFs, they do not work well with scanned ETDs. Our assessment is that state-of-the-art figure extraction systems perform poorly on scanned PDFs because they have been trained only on born-digital documents. To address this limitation, we present ScanBank, a new dataset containing 10 thousand scanned page images, manually labeled for the presence of the 3.3 thousand figures or tables found therein. We use this dataset to train a deep neural network model based on YOLOv5 to accurately extract figures and tables from scanned ETDs. We pose and answer important research questions aimed at finding better methods for figure extraction from scanned documents. One of these concerns the training value of data augmentation techniques applied to born-digital documents, which are used to train models better suited to figure extraction from scanned documents. To the best of our knowledge, ScanBank is the first manually annotated dataset for figure and table extraction from scanned ETDs. A YOLOv5-based model trained on ScanBank outperforms existing comparable open-source and freely available baseline methods by a considerable margin.

Citations: 10

TweetPap: A Dataset to Study the Social Media Discourse of Scientific Papers

2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL) Pub Date : 2021-06-14 DOI: 10.1109/JCDL52503.2021.00055

Naman Jain, M. Singh

Citations: 1

ConSTR: A Contextual Search Term Recommender

2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL) Pub Date : 2021-06-08 DOI: 10.1109/JCDL52503.2021.00042

Thomas Kramer, Zeljko Carevic, Dwaipayan Roy, Claus-Peter Klas, Philipp Mayr

Citations: 0

MexPub: Deep Transfer Learning for Metadata Extraction from German Publications

2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL) Pub Date : 2021-06-04 DOI: 10.1109/JCDL52503.2021.00076

Zeyd Boukhers, Nada Beili, Timo Hartmann, Prantik Goswami, Muhammad Arslan Zafar

Abstract: In contrast to most English scientific publications, which follow standard and simple layouts, the order, content, position, and size of metadata in German publications vary greatly. This variety causes traditional NLP methods to fail at accurately extracting metadata from these publications. In this paper, we present a method that extracts metadata from PDF documents with different layouts and styles by viewing the document as an image. We used Mask R-CNN trained on the COCO dataset and fine-tuned on the PubLayNet dataset, which consists of 200K PDF snapshots with five basic classes (e.g., text, figure). We then fine-tuned the model again on our proposed synthetic dataset of 30K article snapshots to extract nine patterns (e.g., author, title). Our synthetic dataset is generated using content in both German and English and a finite set of challenging templates obtained from German publications. Our method achieved an average accuracy of around 90%, which validates its capability to accurately extract metadata from a variety of PDF documents with challenging templates.

Citations: 5