A scalable crawler framework for FLOSS data

Proceedings of the 5th Asia-Pacific Symposium on Internetware. Pub Date: 2013-10-23. DOI: 10.1145/2532443.2532454

Lingxiao Zhang, Yanzhen Zou, Bing Xie

Citations: 3

Abstract

Free / Libre / Open Source Software (FLOSS) data, such as bug reports, mailing lists and related webpages, contains valuable information for reusing open source software projects. Before conducting further experiments on FLOSS data, researchers often need to download the data into a local storage system. We refer to this preprocessing step as FLOSS data retrieval, which in many cases can be a challenging task. In this paper, we propose a crawler framework to ease the process of FLOSS data retrieval. To cope with the various types of FLOSS data scattered across the Internet, we designed the framework in a scalable manner, so that a crawler program can be easily plugged into the system to extend its functionality. Researchers can perform the retrieval process on datasets of various types and sources simply by adding new configurations to the system. We have implemented the framework and provide its basic functions via web-based interfaces. We present the usage of the system through a detailed case study in which we retrieved various types of datasets related to the Apache Lucene project using our framework.
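The plug-in idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the registry `CRAWLERS`, the `register` decorator, the `Crawler` base class, the configuration keys `type` and `url`, and the example.org URLs are all hypothetical names chosen for illustration. The crawlers return placeholder records instead of downloading anything.

```python
from typing import Callable, Dict, List, Type

# Hypothetical registry: maps a data-type name to a crawler class.
CRAWLERS: Dict[str, Type["Crawler"]] = {}


class Crawler:
    """Base class for a pluggable crawler (illustrative only)."""

    def __init__(self, config: dict) -> None:
        self.config = config

    def crawl(self) -> List[dict]:
        raise NotImplementedError


def register(data_type: str) -> Callable[[Type[Crawler]], Type[Crawler]]:
    """Class decorator that plugs a crawler into the registry."""

    def decorator(cls: Type[Crawler]) -> Type[Crawler]:
        CRAWLERS[data_type] = cls
        return cls

    return decorator


@register("mailing_list")
class MailingListCrawler(Crawler):
    def crawl(self) -> List[dict]:
        # Placeholder: a real crawler would fetch and parse list archives.
        return [{"type": "mailing_list", "source": self.config["url"]}]


@register("bug_report")
class BugReportCrawler(Crawler):
    def crawl(self) -> List[dict]:
        # Placeholder: a real crawler would query an issue tracker.
        return [{"type": "bug_report", "source": self.config["url"]}]


def run(configs: List[dict]) -> List[dict]:
    """Dispatch each configuration entry to the crawler registered for its type."""
    results: List[dict] = []
    for config in configs:
        crawler_cls = CRAWLERS[config["type"]]
        results.extend(crawler_cls(config).crawl())
    return results


# Adding a dataset means adding a configuration entry, not changing run().
configs = [
    {"type": "mailing_list", "url": "https://example.org/lists/dev"},
    {"type": "bug_report", "url": "https://example.org/bugs"},
]
records = run(configs)
```

Under these assumptions, supporting a new data type means writing one more registered class, and retrieving a new dataset means appending one more entry to `configs`.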
