A Framework for data mining of structured semantic markup extracted from educational resources on University websites

Lorena Recalde, Rosa Navarrete, Luis Rosero Correa
Usability and User Experience. DOI: 10.54941/ahfe1001745 (https://doi.org/10.54941/ahfe1001745)

Abstract

The coronavirus pandemic has forced education at all levels to shift from face-to-face to online learning. In response, universities are releasing a significant number of educational resources on the Web to support virtual education. End users who need these resources explore the Web through search engines such as Google, Yahoo, Yandex, or Bing; unfortunately, the search results they obtain lack accuracy and do not necessarily meet their requirements. The problem arises because Web resources are published without regard for their visibility or findability. One way to improve the experience of users who browse the Web is to deliver more appropriate content in response to their searches. One approach to enriching the meaning of web search results is to embed structured semantic markup in the HTML of web pages, using standards such as JSON-LD and the Schema.org vocabulary, in compliance with W3C recommendations. Search engines can interpret this markup to understand the resources being published and, consequently, improve the accuracy of search results. For example, Google uses structured semantic markup to show rich fragments (Rich Snippets) or even Knowledge Graph information in user searches.

This research proposes a framework that enables a systematic analysis of the websites of top-ranking universities. The analysis focuses on the educational content they provide and reviews the embedded semantic markup annotated with JSON-LD and the Schema.org vocabulary. To this end, a worldwide list of the universities that appear in the top international ranking has been compiled. Then, using web scraping techniques, we analyzed these universities' websites in search of educational resources and checked whether embedded structured markup is included. Finally, data mining techniques were used to describe and organize the educational resources obtained.

The contribution of this work is two-fold. The first contribution is the analysis of embedded structured markup that uses the Schema.org vocabulary and the JSON-LD format in university websites. This analysis is relevant because previous research has not explicitly focused on the educational field or has not used a specific dataset within this context. The second contribution is a framework that supports this type of analysis of embedded structured markup, from the data collection phase through to obtaining results and indicators on the data. It covers the data mining process from download to final data analysis.

The proposed framework consists of eleven components distributed across three well-defined layers: a data access layer, a service layer, and an application layer. The component development process merges two methodologies: Design Science Research (DSR), to guide the creation of an artifact, and CRISP-DM, to address the data mining process. The architecture integrates tools such as Scrapy (Python) for web scraping and crawling; MongoDB for handling semi-structured data in a NoSQL store; Redis as an auxiliary in-memory database that is queried to determine whether URLs extracted during web scraping have already been processed (duplicate control); and Apache Kafka as a communication intermediary that facilitates the flow and exchange of information between the other components.

Moreover, this work provides a dataset made up of the HTML pages of the universities' websites that can be used for further analysis.
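
The markup review and duplicate-control steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only the Python standard library instead of Scrapy, a plain Python set stands in for the Redis lookup, and the names `JsonLdExtractor`, `schema_types`, `process_page`, the sample page, and the `university.example` URL are all hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                parsed = json.loads("".join(self._buffer))
            except json.JSONDecodeError:
                return  # skip malformed markup rather than abort the crawl
            # A block may hold a single object or a list of objects.
            self.items.extend(parsed if isinstance(parsed, list) else [parsed])


def schema_types(items):
    """Return the top-level @type values declared in the extracted items.

    Nested objects and @graph containers are not traversed in this sketch.
    """
    types = []
    for item in items:
        if not isinstance(item, dict):
            continue
        declared = item.get("@type", [])
        types.extend(declared if isinstance(declared, list) else [declared])
    return types


def process_page(url, html, seen_urls):
    """Extract JSON-LD from a page unless its URL was already processed."""
    if url in seen_urls:  # assumption: a set stands in for the Redis query
        return None
    seen_urls.add(url)
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical page carrying Schema.org markup for a course.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Course",
 "name": "Introduction to Data Mining",
 "provider": {"@type": "CollegeOrUniversity", "name": "Example University"}}
</script>
</head><body><h1>Introduction to Data Mining</h1></body></html>
"""

seen = set()
items = process_page("https://university.example/courses/dm", SAMPLE_HTML, seen)
print(schema_types(items))  # ['Course']
# The second visit to the same URL is skipped by the duplicate check.
print(process_page("https://university.example/courses/dm", SAMPLE_HTML, seen))  # None
```

In a crawler of the kind the abstract describes, the set would presumably be replaced by a shared store so that several workers see the same record of visited URLs, and the extracted items would be persisted for later analysis.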