An Efficient Approach to Extract and Store Big Semantic Web Data Using Hadoop and Apache Spark GraphX

Wria Mohammed Salih Mohammed, Alaa Khalil Ju Maa
ADCAIJ: Advances in Distributed Computing and Artificial Intelligence Journal, 12(6), published 2024-06-05. DOI: 10.14201/adcaij.31506

Abstract

The volume of data is growing at an astonishing speed. Traditional techniques for storing and processing data, such as relational and centralized databases, have become inefficient and time-consuming. Linked Data and the Semantic Web make internet data machine-readable. Because the volume of Linked Data and Semantic Web data keeps increasing, traditional approaches to storing and processing it no longer suffice and quickly exhaust the hardware resources of a single machine. To solve this problem, datasets must be stored using distributed, clustered methods. Hadoop can store such datasets because it distributes them across the many disks of a cluster, and Apache Spark can process data in parallel more efficiently than Hadoop MapReduce because Spark operates in memory rather than on disk. In this paper, Semantic Web data is stored and processed using Apache Spark GraphX and the Hadoop Distributed File System (HDFS). Spark's in-memory processing and distributed computing enable efficient analysis of massive datasets stored in HDFS, and Spark GraphX supports graph-based processing of Semantic Web data. The fundamental objective of this work is to provide a way to combine Semantic Web and big data technologies efficiently, exploiting their joint strengths in data analysis and processing. First, the proposed approach uses the SPARQL query language to extract Semantic Web data from DBpedia, a large, publicly available Semantic Web dataset built from Wikipedia. Second, the extracted data is converted to the GraphX data format, generating vertex and edge files; this conversion is implemented with Apache Spark GraphX. Third, both the vertex and edge tables are stored in HDFS, where they are available for visualization and analysis operations.
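The conversion in the second step can be sketched in plain Python (the paper implements it with Apache Spark GraphX; the function and the sample DBpedia-style prefixes below are illustrative, not the authors' code). Each distinct subject and object resource becomes a numbered vertex, and each triple becomes an edge labelled with its predicate:

```python
def triples_to_graph(triples):
    """Convert (subject, predicate, object) RDF triples into
    GraphX-style vertex and edge tables.

    Vertices: (vertex_id, uri) pairs, one per distinct resource.
    Edges:    (src_id, dst_id, predicate), referencing vertex ids.
    """
    vertex_ids = {}          # uri -> numeric vertex id
    vertices, edges = [], []
    for s, p, o in triples:
        for uri in (s, o):
            if uri not in vertex_ids:
                vertex_ids[uri] = len(vertex_ids)
                vertices.append((vertex_ids[uri], uri))
        edges.append((vertex_ids[s], vertex_ids[o], p))
    return vertices, edges

# Two toy DBpedia-style triples.
triples = [
    ("dbr:Berlin", "dbo:country", "dbr:Germany"),
    ("dbr:Hamburg", "dbo:country", "dbr:Germany"),
]
vertices, edges = triples_to_graph(triples)
```

In GraphX itself, the resulting vertex and edge tables map directly onto a `VertexRDD` and `EdgeRDD`, which is why the paper writes them out as separate vertex and edge files.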
Furthermore, the proposed technique improves storage efficiency, roughly halving the space required when converting Semantic Web data to GraphX files: the RDF representation measures around 133.8, versus 75.3 for GraphX. Adopting the parallel data processing provided by Apache Spark further reduces the time required for processing and analysis. This article concludes that Apache Spark GraphX can enhance Semantic Web and big data technologies: converting Semantic Web data to the GraphX format minimizes data size and processing time, enabling efficient data management and seamless integration.
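The roughly two-fold reduction is plausible because the vertex/edge encoding stores each URI once and lets every edge reference compact numeric ids, whereas an N-Triples serialization repeats full resource names in every statement. A small illustrative comparison (a toy measurement under assumed tab-separated table layouts, not the paper's experiment):

```python
def ntriples_size(triples):
    """Bytes needed to serialise triples as N-Triples-style lines."""
    return sum(len(f"{s} {p} {o} .\n") for s, p, o in triples)

def graph_tables_size(triples):
    """Bytes for a vertex table (id<TAB>uri) plus an edge table
    (src<TAB>dst<TAB>predicate), with each URI stored only once."""
    ids = {}
    for s, p, o in triples:
        for uri in (s, o):
            ids.setdefault(uri, len(ids))
    vertex_bytes = sum(len(f"{i}\t{u}\n") for u, i in ids.items())
    edge_bytes = sum(len(f"{ids[s]}\t{ids[o]}\t{p}\n") for s, p, o in triples)
    return vertex_bytes + edge_bytes

# A star-shaped toy graph: many triples share the same subject,
# so the graph encoding avoids repeating "dbr:Germany" 100 times.
triples = [("dbr:Germany", "dbo:city", f"dbr:City_{i}") for i in range(100)]
assert graph_tables_size(triples) < ntriples_size(triples)
```

The actual ratio depends on URI length and graph shape; densely connected graphs with long URIs benefit the most.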