E. Karanja, S. Masupe, Mandu Gasennelwe-Jeffrey
Citation count: 0
Challenge Paper: Towards Open Datasets for Internet of Things Malware
© 2018 ACM 1936-1955/2018/09-ART7 $15.00
https://doi.org/10.1145/3230669
ACM Journal of Data and Information Quality, Vol. 10, No. 2, Article 7. Publication date: September 2018.

2 THE VISION FOR IOT MALWARE DATASETS

Adopting existing malware data portals for IoT research raises the question of how to accommodate the factors associated with heterogeneity. We derive two branches of factors that together describe a well-defined IoT malware ecosystem: functional factors and non-functional factors. Functional factors are based on the inherent features of the malware data. For each factor we also highlight the shortfalls that make the existing dataset ecosystems inappropriate for IoT malware.

(1) Malware data description: The key aspect is whether the data have sufficient descriptors that fully explain their format and interactions. Malware data for research should be in formats that can be reused and published without barriers. IoT malware can originate from various heterogeneous sources but needs to be presented in a form that is easy to extract to popular standard malware data formats without losing the relevance of the associated metadata (Ding et al. 2014). In the existing portals, malware data files are usually stored as hash values, and associated descriptors such as family and class integer are provided. For malware sourced from heterogeneous architectures, it is worthwhile to give the data format descriptors associated with the architecture or device diversity of the attack. Malware datasets need to use a variety of identifiers for each malware sample, such as the universally unique identifier (UUID), hash name, or generic malware name.

(2) Provenance: Malware data provenance is a measure of the trustworthiness of the data and of its associated data creation processes (Zafar et al. 2017). Data-oriented provenance includes access to data origins and to the provenance-related metadata associated with a data item. Artifacts of the data creation and access processes are also included (Hartig 2009). Data citation and attribution parameters need to be included in the malware data: the citation identifies the original contributor, while the attribution identifies derived contributions or improvements. Any proposed IoT malware portal needs a citation request in which contributors give an initial introduction and description. Currently only a few datasets, such as Allix et al. (2016), Daniel et al. (2014), and Ronen et al. (2018), have complete citation descriptors. Citation improves dataset provenance; for instance, 50 academic articles have used the malware data in Ronen et al. (2018), thereby enriching its usage.

(3) Linkage of data and metadata: Heterogeneous open datasets can be created by linking various data sources or datasets. The linkage creates both opportunities and challenges. A detailed survey of trends, opportunities, and challenges in linked data is provided in Freitas et al. (2012). The malware source can be either a user submission or a uniform resource locator (URL) if the malware is a zero day detected by interlinked anti-malware tools. The exchangeable image file format (Exif) metadata, if generated, needs to be fully described. For instance, the malware labels indicating type, name, and tokenization data for Allix et al. (2016) are generated using Euphony (Hurier et al. 2017). When an external tool such as Euphony is used in malware data creation or labeling, the complete descriptors and attribution metrics need to be published.
The datasets in a portal could end up as isolated sectoral data islands (e.g., split by architecture such as x86) that, because of poor linkage, are not connected even where a malware family spans multiple architectures. Middleware is a critical component of IoT interactions, and it would be interesting to document its influence on malware propagation.

(4) Reliability of data: This is a measure of whether the data are complete enough for their usage and context, judged within the subjective norms of the malware community. Reliability can also be understood as repeatability or consistency: the same data should be obtained when the prescribed data collection instrument and method are used again. Reliability encapsulates quixotic reliability (a single method of data collection should yield results that do not vary over time) and diachronic reliability (e.g., in situ data should be stable as observed over time) (Kitchin 2014).

(5) Data modes and frequency: The IoT malware portal needs a timely, if not real-time, data registry. Data on malware need to be incorporated or made available on demand so that various users can apply them to malware understanding, for example to zero-day vulnerabilities discovered by vendors.

(6) Quality of ontologies: There is a need for complete metadata and a complete ontology for IoT malware datasets, neither of which is available at the moment. The metadata need to include the affected domains, the date the file was first created or submitted, and the subsequent submission dates. In heterogeneously created datasets such as IoT malware, quality ontologies have benefits such as the following: (i) they create a harmonized view of the structure of the data; (ii) they enable re-use of data domain knowledge and make assumptions explicit within the accepted norms of the user community.

There are various non-functional factors that would enhance the creation of open IoT malware datasets.
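The functional factors above (multiple identifiers, architecture descriptors, citation and attribution, and submission dates) could be captured in a per-sample record. The following is a minimal sketch under stated assumptions, not a format proposed in the paper: the class names `Provenance` and `SampleRecord`, the field names, and the helper `describe_sample` are all hypothetical and chosen only for illustration.

```python
import hashlib
import uuid
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Provenance:
    """Citation and attribution descriptors (hypothetical field names)."""
    citation: str                                          # original contributor
    attributions: List[str] = field(default_factory=list)  # derived contributions
    labeling_tool: Optional[str] = None                    # external labeler, if any


@dataclass
class SampleRecord:
    """One malware sample described by several identifiers and descriptors."""
    sha256: str                    # hash name
    generic_name: str              # generic malware name
    family: str
    architectures: List[str]       # e.g. ["ARM", "MIPS"] for a cross-architecture family
    provenance: Provenance
    first_submitted: date
    later_submissions: List[date] = field(default_factory=list)
    sample_id: str = field(default_factory=lambda: str(uuid.uuid4()))  # UUID

    def missing_descriptors(self) -> List[str]:
        """Return the names of descriptors that were left empty."""
        checks = {
            "generic_name": self.generic_name,
            "family": self.family,
            "architectures": self.architectures,
            "citation": self.provenance.citation,
        }
        return [name for name, value in checks.items() if not value]


def describe_sample(payload: bytes, generic_name: str, family: str,
                    architectures: List[str], citation: str,
                    first_submitted: date) -> SampleRecord:
    """Build a record, deriving the hash identifier from the raw bytes."""
    return SampleRecord(
        sha256=hashlib.sha256(payload).hexdigest(),
        generic_name=generic_name,
        family=family,
        architectures=architectures,
        provenance=Provenance(citation=citation),
        first_submitted=first_submitted,
    )


if __name__ == "__main__":
    record = describe_sample(b"placeholder bytes", "example-name", "",
                             ["ARM", "MIPS"], "Contributor A", date(2018, 9, 1))
    print(record.sample_id, record.missing_descriptors())
```

Keeping the architectures as a list on the sample, rather than filing samples under one architecture, is one way to avoid the data islands described under linkage; a portal could reject or flag records for which `missing_descriptors()` is non-empty.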
Below is a brief description of selected key non-functional metrics.

(1) Subjective norms: This is the general acceptance of the use of a given dataset. Acceptance can be based on the view of peers or of the data consumers. There is a need to create a peer-based voting model for validating and rating the usability of datasets. Data citation frequency and span give a view of dataset acceptance; for example, Daniel et al. (2014) has been used by 157 universities globally.

(2) Access rights: Users can access a dataset directly or after submitting credentials such as a password or public/private keys. Security aspects such as confidentiality and non-repudiation need to be enforced. This rests on a debatable compromise between anonymity as an attribute of openness and the need for accountability through a non-repudiable user control mechanism. Most open data platforms handle privacy as a contextual integrity issue, and online open data portals can implement privacy as contextual integrity (Grodzinsky and Tavani 2011; Barth et al. 2006).

(3) Mode of license and legal awareness of use: Data scientists may be very familiar with open source versus proprietary software licenses, or with copyright versus copyleft, but data are usually not classified as creative works to which these concepts apply (Miller et al. 2008). Use of open data is usually governed by drafted principles, e.g., the Nairobi data sharing principles (CODATA 2014) and the joint declaration of data citation principles (Martone 2014), among others, which can offer guidance on data processes. The principles are accepted as norms within the ratifying community. Users in IoT malware ecosystems can adopt a licensing model for data, such as the GNU General Public License version 3 (GPLv3), to enhance fair use.

3 THE UDA FRAMEWORK FOR IOT MALWARE OPEN DATA ECOSYSTEM

To achieve the vision for a robust IoT malware dataset ecosystem, we propose the User, Data and Access (UDA) framework shown in Figure 1.
The framework offers a summary of the protocols that need to be defined for users, data, and access. It also offers a basic checklist of key items that realize the functional and non-functional requirements of the IoT malware data ecosystem. The User Protocol will be used to define role descriptions, give the voting criteria as an appraisal mechanism, and postulate data citation standards and a portal usability evaluation standard. The Data Protocol highlights the parameters that need to be described for each IoT malware dataset. The Access Protocol describes the metrics that define how accessible the ecosystem is to users.

Fig. 1. UDA framework.

Open dataset ecosystem evaluation is a broad subject. In the proposed UDA framework, we focus on evaluation based on usability aspects. A good portal provides interaction between data providers and data consumers for feedback and audit purposes. Aspects such as accessibility, navigation, interactivity, and the information content of the portal are used to evaluate user experience. In the design of web resources, human–computer interaction needs to be considered. Nielsen (1999) offers a practical guide for designing web portals for usability.
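The checklist role of the three protocols, together with the peer-based voting mentioned as an appraisal mechanism, can be sketched in a few lines. This is an illustrative sketch only: the contents of Figure 1 are not reproduced here, so the item names in `UDA_CHECKLIST` are assumptions drawn from the factors discussed in Section 2, and `unmet_items`, `peer_rating`, and the 1 to 5 voting scale are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Set

# Assumed checklist items per protocol; not the actual contents of Figure 1.
UDA_CHECKLIST: Dict[str, List[str]] = {
    "user": ["role descriptions", "voting criteria",
             "citation standard", "usability evaluation standard"],
    "data": ["data description", "provenance", "linkage",
             "reliability", "modes and frequency", "ontology"],
    "access": ["access rights", "license"],
}


def unmet_items(declared: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """For each protocol, list the checklist items a portal has not declared."""
    return {
        protocol: [item for item in items
                   if item not in declared.get(protocol, set())]
        for protocol, items in UDA_CHECKLIST.items()
    }


def peer_rating(votes: List[int], low: int = 1, high: int = 5) -> float:
    """Average peer votes on an assumed low..high usability scale."""
    if not votes:
        raise ValueError("at least one vote is required")
    if any(v < low or v > high for v in votes):
        raise ValueError("vote outside the rating scale")
    return mean(votes)


if __name__ == "__main__":
    portal = {
        "user": {"role descriptions", "voting criteria"},
        "data": set(UDA_CHECKLIST["data"]),
    }
    print(unmet_items(portal))
    print(peer_rating([4, 5, 3]))
```

A portal description that leaves a protocol out entirely, as `access` is left out in the usage example, is reported with every item of that protocol unmet, which matches the idea of the framework as a basic checklist.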