Local fuzzy rough attribute reduction for large-scale mixed data with limited missing labels based on local fuzzy self information

IF 8.1; Region 1: Computer Science; COMPUTER SCIENCE, INFORMATION SYSTEMS

Information Sciences. Pub Date: 2024-11-06. DOI: 10.1016/j.ins.2024.121613

Zhaowen Li, Run Guo, Ning Lin, Tao Lu

Information Sciences, volume 691, Article 121613. Publisher page: https://www.sciencedirect.com/science/article/pii/S0020025524015275

Citations: 0

Abstract

The era of big data has brought large-scale data of many types, and extracting potential value and rules from such data remains a challenge. Owing to various external and internal factors, large-scale data commonly have a limited number of missing labels. A large-scale mixed information system with limited missing labels (LSMDISLML) is typically handled with a local neighborhood rough set model (LNRS-model). However, such a model often uses an identical neighborhood radius for all numerical attributes, which may weaken the classification capability of the data. A local fuzzy rough set model (LFRS-model) can overcome this limitation. This paper studies local fuzzy rough attribute reduction for large-scale mixed data with limited missing labels, based on the LFRS-model, via local fuzzy self information and an overlap degree function. First, fuzzy relations on the entire sample set are established from the statistical distribution of the data; this allows different fuzzy similarity radii to be used when calculating similarity, thereby adapting to different data distributions. Next, the samples with missing labels are discarded, since they constitute a small proportion of the entire sample set and have little impact on the overall performance of the dataset. The limited computing resources and storage space are thus focused on the sample set with complete labels (denoted as the target set). Then, based on the target set, local fuzzy λ-upper and lower approximations are defined, and the LFRS-model is constructed. This model not only reduces processing time and sources of error in large-scale data but also improves data quality and enhances the reliability of the experimental results. Local fuzzy λ-self information is then introduced and used to design a local fuzzy rough attribute reduction algorithm for an LSMDISLML. Furthermore, an overlap degree function is introduced to evaluate and reorder the attributes by importance, so that redundant attributes with high overlap and low importance are eliminated first from the preordered attribute set. This strategy improves the efficiency of obtaining the optimal subset. Finally, a series of experiments is carried out. The experimental results demonstrate that the designed algorithm exhibits excellent performance in classification tasks and outlier detection tasks, surpassing four existing algorithms.
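The abstract gives no formulas, so the following Python sketch only illustrates the general shape of the pipeline it describes: fuzzy relations built on all samples, unlabeled samples dropped, and attributes selected on the labeled target set. Everything specific is an assumption and not the authors' method: the per-attribute radius is taken to be the attribute's standard deviation, the textbook fuzzy rough lower approximation stands in for the local fuzzy λ-approximations, a dependency score stands in for local fuzzy λ-self information, and the overlap degree reordering is not modeled. The names `fuzzy_relation`, `combined_relation`, `dependency`, and `greedy_reduct` are hypothetical.

```python
import numpy as np

def fuzzy_relation(col, numeric):
    """Fuzzy similarity matrix for one attribute (illustrative definition)."""
    col = np.asarray(col)
    if not numeric:
        # Categorical attribute: crisp equality.
        return (col[:, None] == col[None, :]).astype(float)
    col = col.astype(float)
    radius = col.std()  # ASSUMPTION: radius derived from the attribute's spread
    if radius == 0:
        return np.ones((len(col), len(col)))
    diff = np.abs(col[:, None] - col[None, :])
    return np.clip(1.0 - diff / radius, 0.0, 1.0)

def combined_relation(relations, subset):
    """Element-wise minimum of the relations of the chosen attributes."""
    return np.minimum.reduce([relations[a] for a in subset])

def dependency(rel, y):
    """Mean lower-approximation membership of each sample in its own class."""
    differs = y[:, None] != y[None, :]
    # Similarity to the most similar sample carrying a different label.
    worst = np.where(differs, rel, 0.0).max(axis=1)
    return float((1.0 - worst).mean())

def greedy_reduct(columns, numeric_flags, labels):
    """Forward selection on labeled samples; returns chosen attribute indices."""
    keep = np.array([lab is not None for lab in labels])
    # Relations use all samples, then are restricted to the labeled target set.
    full_rel = [fuzzy_relation(c, f) for c, f in zip(columns, numeric_flags)]
    relations = [r[np.ix_(keep, keep)] for r in full_rel]
    y = np.unique([lab for lab in labels if lab is not None],
                  return_inverse=True)[1]
    attrs = list(range(len(relations)))
    target = dependency(combined_relation(relations, attrs), y)
    chosen, best = [], 0.0
    while best < target - 1e-12:
        scores = {a: dependency(combined_relation(relations, chosen + [a]), y)
                  for a in attrs if a not in chosen}
        a_star = max(scores, key=scores.get)
        chosen.append(a_star)
        best = scores[a_star]
    return chosen

# Toy data: attribute 0 separates the classes, attributes 1 and 2 do not.
columns = [
    [0.0, 0.1, 0.2, 5.0, 5.1, 5.2],
    ["x", "x", "x", "x", "x", "x"],
    [1.0, 3.0, 2.0, 1.0, 3.0, 2.0],
]
labels = ["a", "a", None, "b", "b", "b"]  # third sample has a missing label
print(greedy_reduct(columns, [True, False, True], labels))  # prints [0]
```

In this toy run the selection stops after the first attribute because it alone reaches the dependency of the full attribute set. A faithful implementation would replace the score and the stopping rule with the definitions given in the paper itself.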


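Source journal

Information Sciences (Engineering & Technology; Computer Science: Information Systems)

CiteScore: 14.00

Self-citation rate: 17.30%

Articles published: 1322

Review time: 10.4 months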

Journal description: Informatics and Computer Science Intelligent Systems Applications is an esteemed international journal that focuses on publishing original and creative research findings in the field of information sciences. We also feature a limited number of timely tutorial and surveying contributions. Our journal aims to cater to a diverse audience, including researchers, developers, managers, strategic planners, graduate students, and anyone interested in staying up-to-date with cutting-edge research in information science, knowledge engineering, and intelligent systems. While readers are expected to share a common interest in information science, they come from varying backgrounds such as engineering, mathematics, statistics, physics, computer science, cell biology, molecular biology, management science, cognitive science, neurobiology, behavioral sciences, and biochemistry.