{"title":"Detecting Research from an Uncurated HTML Archive Using Semi-Supervised Machine Learning","authors":"John McNulty, Sarai Alvarez, Michael Langmayr","doi":"10.1109/SIEDS52267.2021.9483725","DOIUrl":null,"url":null,"abstract":"The Internet Archive seeks to provide \"universal access to all knowledge\" through their digital library, which includes a digital repository of over 475 billion crawled web documents in addition to other content. Of particular interest, to those who use their platform, is the preservation and access to research due to its inherent value. Research or scholarly work outside of mainstream institutions, publishers, topics, or languages is at particular risk of not being properly archived. The Internet Archive preserves these documents in its attempts to archive all content, however, these documents of interest are still at risk of not being discoverable due to lack of proper indexing within this uncurated archive. We provide a preliminary classifier to identify and prioritize research, to include long tail research, which circumvents this issue and enhances their overall approach. Classification is complicated by the fact that documents are in many different formats, there are no clear boundaries between official and unofficial research, and documents are not labeled. To address this problem, we focus on HTML documents and develop a semi-supervised approach that identifies documents by their provenance, structure, content, and linguistic formality heuristics. We describe a semi-supervised machine learning classifier to filter crawled HTML documents as research, both mainstream and obscure, or non-research. Because the HTML datasets were not labelled, a provenanced approach was used where provenance was substituted for label. A data pipeline was built to deconstruct HTML website content into raw text. We targeted structural features, content features, and stylistic features which were extracted from the text and metadata. This methodology provides the ability to leverage the similarities found across differing subjects and languages in scholarly work. The optimal classifier explored, XGBoost, predicts whether a crawled HTML document is research or non-research with 98% accuracy. This project lays the foundation for future work to further distinguish between mainstream and long tail research, both English and non-English.","PeriodicalId":426747,"journal":{"name":"2021 Systems and Information Engineering Design Symposium (SIEDS)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 Systems and Information Engineering Design Symposium (SIEDS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SIEDS52267.2021.9483725","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1
Abstract
The Internet Archive seeks to provide "universal access to all knowledge" through its digital library, which includes a repository of over 475 billion crawled web documents in addition to other content. Of particular interest to users of the platform is the preservation of, and access to, research, owing to its inherent value. Research and scholarly work outside mainstream institutions, publishers, topics, or languages is at particular risk of not being properly archived. The Internet Archive preserves these documents in its attempt to archive all content; however, they remain at risk of being undiscoverable because this uncurated archive lacks proper indexing. We provide a preliminary classifier to identify and prioritize research, including long-tail research, which circumvents this issue and enhances the Archive's overall approach. Classification is complicated by the facts that documents come in many different formats, that there are no clear boundaries between official and unofficial research, and that documents are unlabeled. To address this problem, we focus on HTML documents and develop a semi-supervised approach that identifies documents by their provenance, structure, content, and linguistic-formality heuristics. We describe a semi-supervised machine learning classifier that filters crawled HTML documents into research, both mainstream and obscure, and non-research. Because the HTML datasets were unlabeled, we used a provenance-based approach in which a document's provenance substituted for a label. A data pipeline was built to deconstruct HTML website content into raw text, from which we extracted structural, content, and stylistic features drawn from the text and metadata. This methodology leverages the similarities found across subjects and languages in scholarly work. The best classifier we explored, XGBoost, predicts whether a crawled HTML document is research or non-research with 98% accuracy. This project lays the foundation for future work to further distinguish mainstream from long-tail research, in both English and other languages.
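To make the described pipeline concrete, the following is a minimal sketch of an HTML-deconstruction and feature-extraction step of the kind the abstract outlines. The specific features (tag counts, scholarly keywords, sentence-length and vocabulary statistics) and the use of BeautifulSoup are illustrative assumptions; the paper does not specify its exact feature set or tooling.

```python
# Sketch: deconstruct an HTML document into raw text and derive
# structural, content, and stylistic (formality-heuristic) features.
# Feature choices here are assumptions, not the authors' exact set.
import re
from bs4 import BeautifulSoup

# Assumed content cues for scholarly writing (illustrative only).
SCHOLARLY_TERMS = ("abstract", "references", "doi", "methodology", "et al")

def extract_features(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(separator=" ", strip=True)  # HTML -> raw text
    words = text.split()
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]

    return {
        # Structural features: document layout as reflected in the markup.
        "n_headings": len(soup.find_all(["h1", "h2", "h3"])),
        "n_paragraphs": len(soup.find_all("p")),
        "n_links": len(soup.find_all("a")),
        "n_tables": len(soup.find_all("table")),
        # Content features: scholarly vocabulary and citation-like patterns.
        "scholarly_term_count": sum(text.lower().count(t) for t in SCHOLARLY_TERMS),
        "n_bracket_citations": len(re.findall(r"\[\d+\]", text)),
        # Stylistic features: crude linguistic-formality heuristics.
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "type_token_ratio": len(set(w.lower() for w in words)) / max(len(words), 1),
    }
```

Because the features are drawn from markup structure and surface style rather than topic vocabulary, a representation like this can transfer across subjects and, to a degree, across languages, which is the similarity the abstract refers to.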
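The training step can be sketched in the same spirit. Below, provenance (the host a document was crawled from) substitutes for a human label, and an XGBoost classifier is fit on the extracted features. The host lists, synthetic feature matrix, and hyperparameters are placeholders standing in for the paper's actual data and configuration.

```python
# Sketch: provenance-as-label training with XGBoost.
import numpy as np
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical provenance sets: hosts assumed to publish research vs. not.
RESEARCH_HOSTS = {"arxiv.org", "pubmed.ncbi.nlm.nih.gov"}
NON_RESEARCH_HOSTS = {"news.example.com", "shop.example.com"}

def provenance_label(host: str) -> int | None:
    """Substitute a document's crawl provenance for a human label."""
    if host in RESEARCH_HOSTS:
        return 1
    if host in NON_RESEARCH_HOSTS:
        return 0
    return None  # unknown provenance: excluded from supervised training

# Synthetic stand-in for the feature matrix produced by extract_features():
# one row per provenance-labeled document, one column per feature.
rng = np.random.default_rng(0)
X = rng.random((1000, 9))
y = (X[:, 0] + 0.5 * X[:, 5] + rng.normal(0.0, 0.1, 1000) > 0.75).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
model = XGBClassifier(n_estimators=200, max_depth=6, eval_metric="logloss")
model.fit(X_train, y_train)
print(f"held-out accuracy: {accuracy_score(y_test, model.predict(X_test)):.2%}")
```

Note that accuracy measured this way reflects agreement with provenance-derived labels on held-out documents, not ground-truth human judgments; that caveat applies to any provenance-as-label evaluation, including the 98% figure quoted above.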