A Binary Feature Extraction Based Data Provenance System Implemented on Flink Platform

2018 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC), Pub Date: 2018-10-01, DOI: 10.1109/CYBERC.2018.00045

Yang Wang, Lan Li, Lei Fan



Abstract

Data protection and the control of information flow are basic requirements for the secure operation of enterprises and organizations. Document data provenance is a function that records the transmission of a specific document so that its origin can be traced afterwards. Although it is an important part of enterprise information security control, it has long suffered from high management costs. This paper therefore proactively monitors the enterprise's internal traffic data, restores document content from that traffic, and uses the proposed algorithm to accurately find each document's parent document, removing the constraints of traditional document tracing. To keep streaming data restoration flexible and scalable, the algorithm modules are built on Flink, a stream processing platform, and the key computing services are migrated to it. Capture agents placed at key nodes collect traffic data, which is fed into the stream processing system through a message queue. The stream processing system restores each file using the document restoration algorithm and hands it to the feature extraction module. Once the feature extraction module has analyzed the file, the output is stored in file systems or structured data storage systems, where it awaits document tracking requests. The resulting system is completely separated from the daily business of the enterprise and places only a very small load on the internal network. In addition, relying on Flink's distributed capabilities, the experiments show that the data provenance results are satisfactory.
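The stages named in the abstract (restoration of documents from captured traffic, binary feature extraction, and parent lookup) can be illustrated with a minimal sketch. The abstract does not specify the paper's restoration or feature algorithms, so everything below is an assumption made for illustration: the record layout `(flow_id, seq, payload)`, the SimHash-style 64-bit fingerprint, the Hamming-distance lookup, and all function names are hypothetical stand-ins. The code is plain Python rather than Flink API code; in the described system each function would presumably run as a stage in the stream processing job.

```python
import hashlib
from collections import defaultdict


def restore_documents(packets):
    """Group captured (flow_id, seq, payload) records by flow and
    join the payloads in sequence order (assumed record layout)."""
    flows = defaultdict(list)
    for flow_id, seq, payload in packets:
        flows[flow_id].append((seq, payload))
    return {
        flow_id: b"".join(p for _, p in sorted(parts, key=lambda t: t[0]))
        for flow_id, parts in flows.items()
    }


def binary_feature(data, bits=64, shingle=4):
    """SimHash-style fingerprint over byte shingles.
    This is a stand-in technique, not the paper's feature extractor."""
    counts = [0] * bits
    for i in range(max(len(data) - shingle + 1, 1)):
        digest = hashlib.blake2b(data[i:i + shingle], digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    # A bit is set when the majority of shingle hashes set it.
    return sum(1 << b for b in range(bits) if counts[b] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def find_parent(fingerprint, index):
    """Return (doc_id, distance) for the stored fingerprint closest to
    the query, or None when the index is empty."""
    if not index:
        return None
    candidates = ((doc_id, hamming(fingerprint, fp)) for doc_id, fp in index.items())
    return min(candidates, key=lambda t: t[1])


if __name__ == "__main__":
    # Packets arrive out of order, as they might from a message queue.
    packets = [("f1", 1, b"world"), ("f2", 0, b"abc"), ("f1", 0, b"hello ")]
    docs = restore_documents(packets)
    index = {flow_id: binary_feature(body) for flow_id, body in docs.items()}
    print(find_parent(binary_feature(b"hello world"), index))
```

A fixed-width binary fingerprint is used here because it keeps the stored index compact and makes similarity checks cheap bit operations; whether the paper makes the same trade-off is not stated in the abstract.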


