Dimitrios A. Koutsomitropoulos, Andreas D. Andriopoulos, S. Likothanassis
Title: Subject Classification of Learning Resources Using Word Embeddings and Semantic Thesauri
Venue: 2019 IEEE International Symposium on INnovations in Intelligent SysTems and Applications (INISTA)
Publication date: 2019-07-03
DOI: https://doi.org/10.1109/INISTA.2019.8778377
Citations: 5
Abstract
Open Educational Resources (OERs) are often scattered across various sources and may follow different metadata schemata. In addition, their annotations may not be exhaustive; worse, their subject characterization, if any, may consist of arbitrary, ad hoc keywords rather than terms from standard, controlled vocabularies, which enlarges the search space and hampers interoperability. To address this issue, we propose a twofold method based on two seemingly disjoint technology stacks: machine learning and the semantic web. First, OERs harvested from various repositories are assigned subject terms from a formal, standard thesaurus for a domain of interest, by discovering the semantic matches of the keyword used for harvesting within the thesaurus ontology. Then, we use word embeddings to represent an item's metadata and compute its similarity to the thesaurus keywords. These embeddings are learned by a doc2vec model trained on already annotated corpora from the biomedical domain. By combining the two approaches, we show that it is possible to produce a reasonable set of thematic suggestions, namely those whose similarity exceeds a certain threshold.
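The similarity step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the similarity measure or the threshold value, so cosine similarity and the value 0.7 are assumptions, and the three-dimensional vectors and term names are hypothetical toy stand-ins for embeddings that a trained doc2vec model would supply.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def suggest_subjects(item_vector, term_vectors, threshold):
    """Return (term, score) pairs whose score exceeds the threshold,
    ordered from most to least similar."""
    scored = [(term, cosine(item_vector, vec))
              for term, vec in term_vectors.items()]
    kept = [(term, score) for term, score in scored if score > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings for three thesaurus terms (assumption:
# in practice these would be inferred by the trained doc2vec model).
TERM_VECTORS = {
    "Neoplasms": [0.9, 0.1, 0.0],
    "Cardiology": [0.1, 0.9, 0.1],
    "Genetics": [0.6, 0.2, 0.6],
}

# Hypothetical embedding of one harvested item's metadata.
ITEM_VECTOR = [0.8, 0.2, 0.1]

# The threshold 0.7 is an illustrative assumption, not a value from the paper.
suggestions = suggest_subjects(ITEM_VECTOR, TERM_VECTORS, threshold=0.7)
for term, score in suggestions:
    print(f"{term}: {score:.3f}")
```

With these toy values, only the terms whose vectors point in roughly the same direction as the item vector survive the cut. Raising the threshold trades recall for precision in the suggested subject set.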