Challenge Paper: Towards Open Datasets for Internet of Things Malware

Journal of Data and Information Quality (JDIQ) Pub Date : 2018-09-07 DOI:10.1145/3230669

E. Karanja, S. Masupe, Mandu Gasennelwe-Jeffrey


Abstract

Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. © 2018 ACM 1936-1955/2018/09-ART7 $15.00 https://doi.org/10.1145/3230669

ACM Journal of Data and Information Quality, Vol. 10, No. 2, Article 7. Publication date: September 2018.

2 THE VISION FOR IOT MALWARE DATASETS

Adopting existing malware data portals for IoT research raises the challenge of how to accommodate factors associated with heterogeneity. We derive two branches of factors that together describe a well-defined IoT malware ecosystem: functional factors and non-functional factors. Functional factors are based on the inherent features of the malware data. We also highlight the shortfalls, measured against these factors, that make existing dataset ecosystems inappropriate for IoT malware.

(1) Malware data description: The key aspect is whether the data have sufficient descriptors that fully explain their format and interactions. Malware data for research should be in formats that can be reused and published without barriers. IoT malware can originate from various heterogeneous sources but needs to be presented in a form that is easy to extract into popular standard malware data formats without losing the relevance of the associated metadata (Ding et al. 2014). In existing portals, malware data files are usually stored as hash values, and associated descriptors such as family and class integer are provided. For malware sourced from heterogeneous architectures, it is worthwhile to give data format descriptors that capture the architecture or device diversity of the attack. Malware datasets need to use a variety of identifiers for each malware sample, such as the universally unique identifier (UUID), hash name, or generic malware name.
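
The descriptors named in item (1) can be pictured as one record per sample. The following Python sketch is only an illustration under stated assumptions: the class name `SampleDescriptor`, the helper `describe`, the field names, the choice of SHA-256 as the hash, and the example values are all hypothetical and are not taken from any existing portal or from the paper.

```python
import hashlib
import uuid
from dataclasses import dataclass, field, asdict


@dataclass
class SampleDescriptor:
    """Hypothetical per-sample record; every field name is an illustrative assumption."""
    sha256: str        # hash name under which the file is stored (SHA-256 assumed)
    generic_name: str  # generic malware name
    family: str        # family descriptor
    class_id: int      # integer class label
    architecture: str  # architecture descriptor, e.g. "ARM", "MIPS", "X86"
    file_format: str   # data format descriptor, e.g. "ELF"
    # Random UUID generated when the record is created.
    sample_uuid: str = field(default_factory=lambda: str(uuid.uuid4()))


def describe(payload: bytes, generic_name: str, family: str, class_id: int,
             architecture: str, file_format: str) -> SampleDescriptor:
    """Build a descriptor whose hash name is derived from the file contents."""
    digest = hashlib.sha256(payload).hexdigest()
    return SampleDescriptor(digest, generic_name, family, class_id,
                            architecture, file_format)


# Placeholder bytes stand in for a sample; no real malware is involved.
record = describe(b"placeholder bytes", "Example.Generic", "example_family",
                  1, "ARM", "ELF")
print(asdict(record))
```

Carrying the hash name, the UUID, and the generic name side by side means a sample can still be looked up when one naming scheme is missing, and the architecture and format fields keep the heterogeneity visible in the record itself.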

(2) Provenance: Malware data provenance is a measure of the trustworthiness of the data and of the processes that created them (Zafar et al. 2017). Data-oriented provenance includes access to data origins and to provenance-related metadata associated with the data item. Artifacts of the data creation and access processes are also included (Hartig 2009). Data citation and attribution parameters need to be included in the malware data: the citation shows the original contributor, while the attribution shows derived contributions or improvements. Any proposed IoT malware portal needs a citation request in which contributors give a first introduction and description. Currently only a few datasets, such as Allix et al. (2016), Daniel et al. (2014), and Ronen et al. (2018), have complete citation descriptors. Citation improves dataset provenance; for instance, 50 academic articles have used the malware data in Ronen et al. (2018), thereby enriching its usage.

(3) Linkage of data and metadata: Heterogeneous open datasets can be created by linking various data sources or datasets. The linkage creates both opportunities and challenges; a detailed survey of trends, opportunities, and challenges in linked data is provided in Freitas et al. (2012). The malware source can be either a user submission or a uniform resource locator (URL) if the malware is a zero day detected by interlinked anti-malware tools. The exchangeable image file format (Exif) metadata, if generated, need to be fully described. For instance, the malware labels indicating type, name, and tokenization data for Allix et al. (2016) are generated using Euphony (Hurier et al. 2017). When an external tool such as Euphony is used in malware data creation or labeling, the complete descriptors and attribution metrics need to be published.
The datasets in the portal could end up as isolated sectoral data islands (e.g., by architecture, such as X86) that, because of poor linkage, are not linked even where a malware family spans multiple architectures. Middleware is a critical component of IoT interactions, and it would be interesting to document its influence on malware propagation.

(4) Reliability of data: This is a measure of whether the data are complete enough to be acceptable for their usage and context within the subjective norms of the malware community. Reliability can also be understood as repeatability or consistency: the same data are obtained when the prescribed data collection instrument and method are used again. Reliability encapsulates quixotic reliability (a single method of data collection should yield results that do not vary over time, etc.) and diachronic reliability (e.g., in situ data should be stable as observed over time) (Kitchin 2014).

(5) Data modes and frequency: The IoT malware portal needs a timely, if not real-time, data registry. Malware data need to be incorporated or made available on demand so that various users can use them to understand malware, for example zero-day vulnerabilities discovered by vendors.

(6) Quality of ontologies: There is a need for complete metadata and an ontology for IoT malware datasets, neither of which is available at the moment. The metadata need to include the affected domains, the date the file was first created or submitted, and the subsequent submission dates. In heterogeneously created datasets such as IoT malware, quality ontologies have benefits such as (i) creating a harmonized view of the structure of the data and (ii) enabling reuse of data domain knowledge and making assumptions explicit within the acceptable norms of the user community.

There are various non-functional factors that would enhance the creation of open IoT malware datasets.
Below is a brief description of selected key non-functional metrics.

(1) Subjective norms: This is the general acceptance of the use of a given dataset. Acceptance can be based on the view of peers or of the data consumers. There is a need for a peer-based voting model for validating and rating the usability of datasets. Data citation frequency and span give a view of dataset acceptance; for example, Daniel et al. (2014) has been used by 157 universities globally.

(2) Access rights: Users can either access a dataset directly or submit credentials such as a password or public/private keys. Security aspects such as confidentiality and non-repudiation need to be enforced. This rests on a debatable compromise between anonymity as an attribute of openness and the need for accountability through a non-repudiable user control mechanism. Most open data platforms handle privacy as a contextual integrity issue, and online open data portals can implement privacy as contextual integrity (Grodzinsky and Tavani 2011; Barth et al. 2006).

(3) Mode of license and legal awareness of use: Data scientists might be very familiar with open source versus proprietary software licenses, or with copyright versus copyleft, but data are usually not classified as creative works to which these concepts apply (Miller et al. 2008). Use of open data is usually governed by drafted principles, e.g., the Nairobi data sharing principles (CODATA 2014) and the joint declaration of data citation principles (Martone 2014), among others, that can offer guidance on data processes. The principles are accepted as norms within the ratifying community. Users in IoT malware ecosystems can adopt a licensing model for data, such as the GNU General Public License version 3 (GPLv3), to enhance fair use.

3 THE UDA FRAMEWORK FOR IOT MALWARE OPEN DATA ECOSYSTEM

To achieve the vision of a robust IoT malware dataset ecosystem, we propose the User, Data and Access (UDA) framework shown in Figure 1.
The framework offers a summary of the protocols that need to be defined for users, data, and access. It also offers a basic checklist of key items that realize the functional and non-functional requirements of the IoT malware data ecosystem. The User Protocol defines role descriptions, gives the voting criteria as an appraisal mechanism, and sets out data citation standards and a portal usability evaluation standard. The Data Protocol highlights the parameters that need to be described for each IoT malware dataset. The Access Protocol describes the metrics that define how accessible the ecosystem is to users.

Fig. 1. UDA framework.

Open dataset ecosystem evaluation is a broad subject. In the proposed UDA framework, we focus on evaluation based on usability aspects. A good portal provides interaction between data providers and data consumers for feedback and audit purposes. Aspects such as accessibility, navigation, interactivity, and the information content of the portal are used to evaluate user experience. In the design of web resources, human–computer interaction needs to be considered. Nielsen (1999) offers a practical guide for designing web portals for usability.
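
The three protocols can be read as a checklist applied to each dataset entry. The Python sketch below is a hedged illustration rather than the framework itself: the item names are paraphrased from the factors described in the text and are not copied from Figure 1, and `UDA_CHECKLIST`, `missing_items`, and the example entry are hypothetical.

```python
# Checklist items paraphrased from the text; Figure 1 may list different ones.
UDA_CHECKLIST = {
    "user": ["role_descriptions", "voting_criteria",
             "citation_standard", "usability_evaluation"],
    "data": ["description", "provenance", "linkage",
             "reliability", "frequency", "ontology"],
    "access": ["access_rights", "license"],
}


def missing_items(entry: dict) -> dict:
    """Return, per protocol, the checklist items the entry leaves undocumented."""
    gaps = {}
    for protocol, items in UDA_CHECKLIST.items():
        documented = entry.get(protocol, {})
        # An item counts as documented only if it has a non-empty value.
        absent = [item for item in items if not documented.get(item)]
        if absent:
            gaps[protocol] = absent
    return gaps


# Hypothetical dataset entry with only some items documented.
entry = {
    "user": {"role_descriptions": "contributor, reviewer, consumer",
             "citation_standard": "cite the original contributor"},
    "data": {"description": "samples named by hash",
             "provenance": "user submission"},
    "access": {"access_rights": "credentialed", "license": "GPLv3"},
}
print(missing_items(entry))
```

A portal could run such a check at submission time and return the gaps to the data provider, which fits the feedback and audit role the text assigns to a good portal.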
