{"title":"A robust optimization method for label noisy datasets based on adaptive threshold: Adaptive-k","authors":"Enes Dedeoglu, Himmet Toprak Kesgin, Mehmet Fatih Amasyali","doi":"10.1007/s11704-023-2430-4","DOIUrl":"https://doi.org/10.1007/s11704-023-2430-4","url":null,"abstract":"<p>The use of all samples in the optimization process does not produce robust results in datasets with label noise, because the gradients calculated according to the losses of the noisy samples cause the optimization process to go in the wrong direction. In this paper, we recommend using samples with loss less than a threshold determined during the optimization, instead of using all samples in the mini-batch. Our proposed method, Adaptive-<i>k</i>, aims to exclude label noise samples from the optimization process and make the process robust. On noisy datasets, we found that using a threshold-based approach, such as Adaptive-<i>k</i>, produces better results than using all samples or a fixed number of low-loss samples in the mini-batch. On the basis of our theoretical analysis and experimental results, we show that the Adaptive-<i>k</i> method is closest to the performance of the Oracle, in which noisy samples are entirely removed from the dataset. Adaptive-<i>k</i> is a simple but effective method. It does not require prior knowledge of the noise ratio of the dataset, does not require additional model training, and does not increase training time significantly. In the experiments, we also show that Adaptive-<i>k</i> is compatible with different optimizers such as SGD, SGDM, and Adam. 
The code for Adaptive-<i>k</i> is available at GitHub.</p>","PeriodicalId":12640,"journal":{"name":"Frontiers of Computer Science","volume":"104 1","pages":""},"PeriodicalIF":4.2,"publicationDate":"2023-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138681629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Gria: an efficient deterministic concurrency control protocol","authors":"Xinyuan Wang, Yun Peng, Hejiao Huang","doi":"10.1007/s11704-023-2605-z","DOIUrl":"https://doi.org/10.1007/s11704-023-2605-z","url":null,"abstract":"<p>Deterministic databases are able to reduce coordination costs in replication. This property has fostered a significant interest in the design of efficient deterministic concurrency control protocols. However, the state-of-the-art deterministic concurrency control protocol Aria has three issues. First, it is impractical to configure a suitable batch size when the read-write set is unknown. Second, Aria running in low-concurrency scenarios, e.g., a single-thread scenario, suffers from the same conflicts as running in high-concurrency scenarios. Third, the single-version schema brings write-after-write conflicts.</p><p>To address these issues, we propose Gria, an efficient deterministic concurrency control protocol. Gria has the following properties. First, the batch size of Gria is auto-scaling. Second, Gria’s conflict probability in low-concurrency scenarios is lower than that in high-concurrency scenarios. Third, Gria has no write-after-write conflicts by adopting a multi-version structure. To further reduce conflicts, we propose two optimizations: a reordering mechanism as well as a rechecking strategy. 
Evaluation results on two popular benchmarks show that Gria outperforms Aria by 13x.</p>","PeriodicalId":12640,"journal":{"name":"Frontiers of Computer Science","volume":"5 1","pages":""},"PeriodicalIF":4.2,"publicationDate":"2023-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138681630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Density estimation-based method to determine sample size for random sample partition of big data","authors":"","doi":"10.1007/s11704-023-2356-x","DOIUrl":"https://doi.org/10.1007/s11704-023-2356-x","url":null,"abstract":"<p>Random sample partition (RSP) is a newly developed big data representation and management model to deal with big data approximate computation problems. Academic research and practical applications have confirmed that RSP is an efficient solution for big data processing and analysis. However, a challenge for implementing RSP is determining an appropriate sample size for RSP data blocks. While a large sample size increases the burden of big data computation, a small size will lead to insufficient distribution information for RSP data blocks. To address this problem, this paper presents a novel density estimation-based method (DEM) to determine the optimal sample size for RSP data blocks. First, a theoretical sample size is calculated based on the multivariate Dvoretzky-Kiefer-Wolfowitz (DKW) inequality by using the fixed-point iteration (FPI) method. Second, a practical sample size is determined by minimizing the validation error of a kernel density estimator (KDE) constructed on RSP data blocks for an increasing sample size. Finally, a series of persuasive experiments are conducted to validate the feasibility, rationality, and effectiveness of DEM. Experimental results show that (1) the iteration function of the FPI method is convergent for calculating the theoretical sample size from the multivariate DKW inequality; (2) the KDE constructed on RSP data blocks with sample size determined by DEM can yield a good approximation of the probability density function (<em>p.d.f.</em>); and (3) DEM provides more accurate sample sizes than the existing sample size determination methods from the perspective of <em>p.d.f.</em> estimation. 
This demonstrates that DEM is a viable approach to deal with the sample size determination problem for big data RSP implementation.</p>","PeriodicalId":12640,"journal":{"name":"Frontiers of Computer Science","volume":"60 1","pages":""},"PeriodicalIF":4.2,"publicationDate":"2023-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138681701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Minimizing the cost of periodically replicated systems via model and quantitative analysis","authors":"Chenhao Zhang, Liang Wang, Limin Xiao, Shixuan Jiang, Meng Han, Jinquan Wang, Bing Wei, Guangjun Qin","doi":"10.1007/s11704-023-2625-8","DOIUrl":"https://doi.org/10.1007/s11704-023-2625-8","url":null,"abstract":"<p>Geographically replicating objects across multiple data centers improves the performance and reliability of cloud storage systems. Maintaining consistent replicas comes with high synchronization costs, as it faces more expensive WAN transport prices and increased latency. Periodic replication is a widely used technique to reduce the synchronization costs. Periodic replication strategies in existing cloud storage systems are too static to handle traffic changes, which makes them inflexible in the face of unforeseen loads and results in additional synchronization cost. We propose quantitative analysis models to quantify consistency and synchronization cost for periodically replicated systems, and derive the optimal synchronization period to achieve the best tradeoff between consistency and synchronization cost. Based on this, we propose a dynamic periodic synchronization method, Sync-Opt, which allows systems to set the optimal synchronization period according to the variable load in clouds to minimize the synchronization cost. Simulation results demonstrate the effectiveness of our models. 
Compared with the policies widely used in modern cloud storage systems, the Sync-Opt strategy significantly reduces the synchronization cost.</p>","PeriodicalId":12640,"journal":{"name":"Frontiers of Computer Science","volume":"25 1","pages":""},"PeriodicalIF":4.2,"publicationDate":"2023-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138681627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Index-free triangle-based graph local clustering","authors":"Zhe Yuan, Zhewei Wei, Fangrui Lv, Ji-Rong Wen","doi":"10.1007/s11704-023-2768-7","DOIUrl":"https://doi.org/10.1007/s11704-023-2768-7","url":null,"abstract":"<p>Motif-based graph local clustering (MGLC) is a popular method for graph mining tasks due to its various applications. However, the traditional two-phase approach of precomputing motif weights before performing local clustering loses locality and is impractical for large graphs. While some attempts have been made to address the efficiency bottleneck, there is still no applicable algorithm for large scale graphs with billions of edges. In this paper, we propose a purely local and index-free method called Index-free Triangle-based Graph Local Clustering (TGLC*) to solve the MGLC problem w.r.t. a triangle. TGLC* directly estimates the Personalized PageRank (PPR) vector using random walks with the desired triangle-weighted distribution and proposes the clustering result using a standard sweep procedure. We demonstrate TGLC*’s scalability through theoretical analysis and its practical benefits through a novel visualization layout. TGLC* is the first algorithm to solve the MGLC problem without precomputing the motif weight. Extensive experiments on seven real-world large-scale datasets show that TGLC* is applicable and scalable for large graphs.</p>","PeriodicalId":12640,"journal":{"name":"Frontiers of Computer Science","volume":"232 1","pages":""},"PeriodicalIF":4.2,"publicationDate":"2023-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138579486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Constrained clustering with weak label prior","authors":"Jing Zhang, Ruidong Fan, Hong Tao, Jiacheng Jiang, Chenping Hou","doi":"10.1007/s11704-023-3355-7","DOIUrl":"https://doi.org/10.1007/s11704-023-3355-7","url":null,"abstract":"<p>Clustering is widely exploited in data mining. It has been shown that embedding a weak label prior into clustering effectively improves its performance. Previous research mainly focuses on only one type of prior. However, in many real scenarios, two kinds of weak label prior information, e.g., pairwise constraints and cluster ratio, are easily obtained or already available. How to incorporate them to improve clustering performance is important but rarely studied. We propose a novel constrained Clustering with Weak Label Prior method (CWLP), which is an integrated framework. Within the unified spectral clustering model, the pairwise constraints are employed as a regularizer in spectral embedding and label proportion is added as a constraint in spectral rotation. To approximate a variant of the embedding matrix more precisely, we replace a cluster indicator matrix with its scaled version. Instead of fixing an initial similarity matrix, we propose a new similarity matrix that is more suitable for deriving clustering results. In addition to the theoretical convergence and computational complexity analyses, we validate the effectiveness of CWLP through several benchmark datasets, together with its ability to discriminate suspected breast cancer patients from healthy controls. 
The experimental evaluation illustrates the superiority of our proposed approach.</p>","PeriodicalId":12640,"journal":{"name":"Frontiers of Computer Science","volume":"34 1","pages":""},"PeriodicalIF":4.2,"publicationDate":"2023-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138579387","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Safeguarding text generation API’s intellectual property through meaning-preserving lexical watermarks","authors":"Shiyu Zhu, Yun Li, Xiaoye Ouyang, Xiaocheng Hu, Jipeng Qiang","doi":"10.1007/s11704-023-3252-0","DOIUrl":"https://doi.org/10.1007/s11704-023-3252-0","url":null,"abstract":"<p>We aim to protect text generation APIs in this work. Previous lexical watermark (LW) methods compromised text quality and, because they did not consider polysemy, made watermarks easy to detect through error analysis. To fix this, we propose a meaning-preserving lexical substitution method that considers the target word’s correct meaning in context <b>x</b>. This enables high-confidence identification while making watermarks more invisible.</p>","PeriodicalId":12640,"journal":{"name":"Frontiers of Computer Science","volume":"7 1","pages":""},"PeriodicalIF":4.2,"publicationDate":"2023-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138581818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semantic similarity-based program retrieval: a multi-relational graph perspective","authors":"Qianwen Gou, Yunwei Dong, YuJiao Wu, Qiao Ke","doi":"10.1007/s11704-023-2678-8","DOIUrl":"https://doi.org/10.1007/s11704-023-2678-8","url":null,"abstract":"<p>In this paper, we formulate the program retrieval problem as a graph similarity problem. This is achieved by first explicitly representing queries and program snippets as AMR and CPG, respectively. Then, intra-level and inter-level attention mechanisms infer fine-grained correspondence by propagating node correspondence along the graph edges. Moreover, such a design can learn the correspondence of nodes at different levels, which was mostly ignored by previous works. Experiments have demonstrated the superiority of USRAE.</p>","PeriodicalId":12640,"journal":{"name":"Frontiers of Computer Science","volume":"13 1","pages":""},"PeriodicalIF":4.2,"publicationDate":"2023-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138579522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The governance technology for blockchain systems: a survey","authors":"Guocheng Zhu, Debiao He, Haoyang An, Min Luo, Cong Peng","doi":"10.1007/s11704-023-3113-x","DOIUrl":"https://doi.org/10.1007/s11704-023-3113-x","url":null,"abstract":"<p>After the Ethereum DAO attack in 2016, which resulted in significant economic losses, blockchain governance has become a prominent research area. However, there is a lack of comprehensive and systematic literature reviews on blockchain governance. To deeply understand the process of blockchain governance and provide guidance for the future design of the blockchain governance model, we provide an in-depth review of blockchain governance. In this paper, we first introduce the consensus algorithms currently used in blockchain and relate them to governance theory. Second, we present the main content of off-chain governance and investigate two well-known off-chain governance projects. Third, we investigate four common on-chain governance voting techniques, then summarize the seven attributes that the on-chain governance voting process should meet, and finally analyze four well-known on-chain governance blockchain projects based on the previous research. We hope this survey will provide an in-depth insight into the potential development direction of blockchain governance and help devise a future research agenda.</p>","PeriodicalId":12640,"journal":{"name":"Frontiers of Computer Science","volume":"14 1","pages":""},"PeriodicalIF":4.2,"publicationDate":"2023-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138534550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MLDA: a multi-level k-degree anonymity scheme on directed social network graphs","authors":"Yuanjing Hao, Long Li, Liang Chang, Tianlong Gu","doi":"10.1007/s11704-023-2759-8","DOIUrl":"https://doi.org/10.1007/s11704-023-2759-8","url":null,"abstract":"<p>With the emergence of network-centric data, social network graph publishing helps data analysts mine the value of social networks, analyze the social behavior of individuals or groups, implement personalized recommendations, and so on. However, published social network graphs are often subject to re-identification attacks from adversaries, which results in the leakage of users’ privacy. The <i>k</i>-anonymity technology, which is quite effective at resisting re-identification attacks, is widely used in the field of graph publishing. However, current research still has some issues to be solved: the protection of directed graphs has received less attention than that of undirected graphs; the protection of graph structure is often ignored while achieving the protection of nodes’ identities; the same protection is performed for different users, which doesn’t meet the different privacy requirements of users. Therefore, to address the above issues, a multi-level <i>k</i>-degree anonymity (MLDA) scheme on directed social network graphs is proposed in this paper. First, node sets with different importance are divided by the firefly algorithm and constrained connectedness upper approximation, and different levels of <i>k</i>-degree anonymity protection are applied to them to meet the different privacy requirements of users. Second, a new graph anonymity method is proposed, which achieves the addition and removal of edges with the help of fake nodes. In addition, to improve the utility of the anonymized graph, a new edge cost criterion is proposed, which is used to select the most appropriate edge to be removed. 
Third, to protect the community structure of the original graph as much as possible, fake nodes contained in the same community are merged before fake nodes contained in different communities. Experimental results on real datasets show that the newly proposed MLDA scheme effectively balances the privacy and utility of the anonymized graph.</p>","PeriodicalId":12640,"journal":{"name":"Frontiers of Computer Science","volume":"1 1","pages":""},"PeriodicalIF":4.2,"publicationDate":"2023-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138534539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}