{"title":"Detecting Research from an Uncurated HTML Archive Using Semi-Supervised Machine Learning","authors":"John McNulty, Sarai Alvarez, Michael Langmayr","doi":"10.1109/SIEDS52267.2021.9483725","DOIUrl":null,"url":null,"abstract":"The Internet Archive seeks to provide \"universal access to all knowledge\" through their digital library, which includes a digital repository of over 475 billion crawled web documents in addition to other content. Of particular interest, to those who use their platform, is the preservation and access to research due to its inherent value. Research or scholarly work outside of mainstream institutions, publishers, topics, or languages is at particular risk of not being properly archived. The Internet Archive preserves these documents in its attempts to archive all content, however, these documents of interest are still at risk of not being discoverable due to lack of proper indexing within this uncurated archive. We provide a preliminary classifier to identify and prioritize research, to include long tail research, which circumvents this issue and enhances their overall approach. Classification is complicated by the fact that documents are in many different formats, there are no clear boundaries between official and unofficial research, and documents are not labeled. To address this problem, we focus on HTML documents and develop a semi-supervised approach that identifies documents by their provenance, structure, content, and linguistic formality heuristics. We describe a semi-supervised machine learning classifier to filter crawled HTML documents as research, both mainstream and obscure, or non-research. Because the HTML datasets were not labelled, a provenanced approach was used where provenance was substituted for label. A data pipeline was built to deconstruct HTML website content into raw text. We targeted structural features, content features, and stylistic features which were extracted from the text and metadata. This methodology provides the ability to leverage the similarities found across differing subjects and languages in scholarly work. The optimal classifier explored, XGBoost, predicts whether a crawled HTML document is research or non-research with 98% accuracy. This project lays the foundation for future work to further distinguish between mainstream and long tail research, both English and non-English.","PeriodicalId":426747,"journal":{"name":"2021 Systems and Information Engineering Design Symposium (SIEDS)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 Systems and Information Engineering Design Symposium (SIEDS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SIEDS52267.2021.9483725","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1
Abstract
The Internet Archive seeks to provide "universal access to all knowledge" through its digital library, which includes a repository of over 475 billion crawled web documents in addition to other content. Of particular interest to users of the platform is the preservation of, and access to, research, owing to its inherent value. Research and scholarly work outside mainstream institutions, publishers, topics, or languages is at particular risk of not being properly archived. The Internet Archive preserves these documents in its attempt to archive all content; however, they remain at risk of being undiscoverable because this uncurated archive lacks proper indexing. We provide a preliminary classifier to identify and prioritize research, including long-tail research, which circumvents this issue and enhances the Archive's overall approach. Classification is complicated by the facts that documents come in many different formats, that there are no clear boundaries between official and unofficial research, and that documents are unlabeled. To address this problem, we focus on HTML documents and develop a semi-supervised approach that identifies documents by their provenance, structure, content, and linguistic-formality heuristics. We describe a semi-supervised machine learning classifier that filters crawled HTML documents into research, both mainstream and obscure, and non-research. Because the HTML datasets were unlabeled, we used a provenance-based approach in which a document's provenance substituted for a label. A data pipeline was built to deconstruct HTML website content into raw text, from which we extracted structural, content, and stylistic features drawn from the text and metadata. This methodology leverages the similarities found across subjects and languages in scholarly work. The best classifier we explored, XGBoost, predicts whether a crawled HTML document is research or non-research with 98% accuracy. This project lays the foundation for future work to further distinguish mainstream from long-tail research, in both English and other languages.
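To make the described pipeline concrete, the following is a minimal sketch of an HTML-deconstruction and feature-extraction step of the kind the abstract outlines. The specific features (tag counts, scholarly keywords, sentence-length and vocabulary statistics) and the use of BeautifulSoup are illustrative assumptions; the paper does not specify its exact feature set or tooling.

```python
# Sketch: deconstruct an HTML document into raw text and derive
# structural, content, and stylistic (formality-heuristic) features.
# Feature choices here are assumptions, not the authors' exact set.
import re
from bs4 import BeautifulSoup

# Assumed content cues for scholarly writing (illustrative only).
SCHOLARLY_TERMS = ("abstract", "references", "doi", "methodology", "et al")

def extract_features(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(separator=" ", strip=True)  # HTML -> raw text
    words = text.split()
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]

    return {
        # Structural features: document layout as reflected in the markup.
        "n_headings": len(soup.find_all(["h1", "h2", "h3"])),
        "n_paragraphs": len(soup.find_all("p")),
        "n_links": len(soup.find_all("a")),
        "n_tables": len(soup.find_all("table")),
        # Content features: scholarly vocabulary and citation-like patterns.
        "scholarly_term_count": sum(text.lower().count(t) for t in SCHOLARLY_TERMS),
        "n_bracket_citations": len(re.findall(r"\[\d+\]", text)),
        # Stylistic features: crude linguistic-formality heuristics.
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "type_token_ratio": len(set(w.lower() for w in words)) / max(len(words), 1),
    }
```

Because the features are drawn from markup structure and surface style rather than topic vocabulary, a representation like this can transfer across subjects and, to a degree, across languages, which is the similarity the abstract refers to.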
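The training step can be sketched in the same spirit. Below, provenance (the host a document was crawled from) substitutes for a human label, and an XGBoost classifier is fit on the extracted features. The host lists, synthetic feature matrix, and hyperparameters are placeholders standing in for the paper's actual data and configuration.

```python
# Sketch: provenance-as-label training with XGBoost.
import numpy as np
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical provenance sets: hosts assumed to publish research vs. not.
RESEARCH_HOSTS = {"arxiv.org", "pubmed.ncbi.nlm.nih.gov"}
NON_RESEARCH_HOSTS = {"news.example.com", "shop.example.com"}

def provenance_label(host: str) -> int | None:
    """Substitute a document's crawl provenance for a human label."""
    if host in RESEARCH_HOSTS:
        return 1
    if host in NON_RESEARCH_HOSTS:
        return 0
    return None  # unknown provenance: excluded from supervised training

# Synthetic stand-in for the feature matrix produced by extract_features():
# one row per provenance-labeled document, one column per feature.
rng = np.random.default_rng(0)
X = rng.random((1000, 9))
y = (X[:, 0] + 0.5 * X[:, 5] + rng.normal(0.0, 0.1, 1000) > 0.75).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
model = XGBClassifier(n_estimators=200, max_depth=6, eval_metric="logloss")
model.fit(X_train, y_train)
print(f"held-out accuracy: {accuracy_score(y_test, model.predict(X_test)):.2%}")
```

Note that accuracy measured this way reflects agreement with provenance-derived labels on held-out documents, not ground-truth human judgments; that caveat applies to any provenance-as-label evaluation, including the 98% figure quoted above.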