Orthogonal Gated Recurrent Unit With Neumann-Cayley Transformation
Vasily Zadorozhnyy; Edison Mucllari; Cole Pospisil; Duc Nguyen; Qiang Ye
Neural Computation, 36(12): 2651-2676. Published November 19, 2024. DOI: 10.1162/neco_a_01710

Abstract: In recent years, orthogonal matrices have been shown to be a promising tool for improving the training, stability, and convergence of recurrent neural networks (RNNs), particularly for controlling gradients. While gated recurrent unit (GRU) and long short-term memory (LSTM) architectures address the vanishing gradient problem through a variety of gates and memory cells, they remain prone to the exploding gradient problem. In this work, we analyze the gradients in GRU and propose the use of orthogonal matrices to prevent exploding gradients and enhance long-term memory. We study where orthogonal matrices should be used and propose a Neumann series-based scaled Cayley transformation for training orthogonal matrices in GRU, which we call Neumann-Cayley orthogonal GRU (NC-GRU). We present detailed experiments on several synthetic and real-world tasks, which show that NC-GRU significantly outperforms GRU as well as several other RNNs.
Sparse-Coding Variational Autoencoders
Victor Geadah; Gabriel Barello; Daniel Greenidge; Adam S. Charles; Jonathan W. Pillow
Neural Computation, 36(12): 2571-2601. Published November 19, 2024. DOI: 10.1162/neco_a_01715

Abstract: The sparse coding model posits that the visual system has evolved to efficiently code natural stimuli using a sparse set of features from an overcomplete dictionary. However, the original sparse coding model suffered from two key limitations: (1) computing the neural response to an image patch required minimizing a nonlinear objective function via recurrent dynamics, and (2) fitting relied on approximate inference methods that ignored uncertainty. Although subsequent work has developed several methods to overcome these obstacles, we propose a novel solution inspired by the variational autoencoder (VAE) framework. We introduce the sparse coding variational autoencoder (SVAE), which augments the sparse coding model with a probabilistic recognition model parameterized by a deep neural network. This recognition model provides a neurally plausible feedforward implementation of the mapping from image patches to neural activities and enables a principled method for fitting the sparse coding model to data by maximizing the evidence lower bound (ELBO). The SVAE differs from standard VAEs in three key respects: the latent representation is overcomplete (there are more latent dimensions than image pixels), the prior is sparse or heavy-tailed instead of gaussian, and the decoder network is a linear projection instead of a deep network. We fit the SVAE to natural image data under different assumed prior distributions and show that it obtains higher test performance than previous fitting methods. Finally, we examine the response properties of the recognition network and show that it captures important nonlinear properties of neurons in the early visual pathway.
{"title":"Fine Granularity Is Critical for Intelligent Neural Network Pruning","authors":"Alex Heyman;Joel Zylberberg","doi":"10.1162/neco_a_01717","DOIUrl":"10.1162/neco_a_01717","url":null,"abstract":"Neural network pruning is a popular approach to reducing the computational costs of training and/or deploying a network and aims to do so while minimizing accuracy loss. Pruning methods that remove individual weights (fine granularity) can remove more total network parameters before reaching a given degree of accuracy loss, while methods that preserve some or all of a network’s structure (coarser granularity, such as pruning channels from a CNN) take better advantage of hardware and software optimized for dense matrix computations. We compare intelligent iterative pruning using several different criteria sampled from the literature against random pruning at initialization across multiple granularities on two different architectures and three image classification tasks. Our work is the first direct and comprehensive investigation of the relationship between granularity and the efficacy of intelligent pruning relative to a random-pruning baseline. We find that the accuracy advantage of intelligent over random pruning decreases dramatically as granularity becomes coarser, with minimal advantage for intelligent pruning at granularity coarse enough to fully preserve network structure. For instance, at pruning rates where random pruning leaves ResNet-20 at 85.0% test accuracy on CIFAR-10 after 30,000 training iterations, intelligent weight pruning with the best-in-context criterion leaves it at about 90.0% accuracy (on par with the unpruned network), kernel pruning leaves it at about 86.5%, and channel pruning leaves it at about 85.5%. Our results suggest that compared to coarse pruning, fine pruning combined with efficient implementation of the resulting networks is a more promising direction for easing the trade-off between high accuracy and low computational cost.","PeriodicalId":54731,"journal":{"name":"Neural Computation","volume":"36 12","pages":"2677-2709"},"PeriodicalIF":2.7,"publicationDate":"2024-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142395374","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"KLIF: An Optimized Spiking Neuron Unit for Tuning Surrogate Gradient Function","authors":"Chunming Jiang;Yilei Zhang","doi":"10.1162/neco_a_01712","DOIUrl":"10.1162/neco_a_01712","url":null,"abstract":"Spiking neural networks (SNNs) have garnered significant attention owing to their adeptness in processing temporal information, low power consumption, and enhanced biological plausibility. Despite these advantages, the development of efficient and high-performing learning algorithms for SNNs remains a formidable challenge. Techniques such as artificial neural network (ANN)-to-SNN conversion can convert ANNs to SNNs with minimal performance loss, but they necessitate prolonged simulations to approximate rate coding accurately. Conversely, the direct training of SNNs using spike-based backpropagation (BP), such as surrogate gradient approximation, is more flexible and widely adopted. Nevertheless, our research revealed that the shape of the surrogate gradient function profoundly influences the training and inference accuracy of SNNs. Importantly, we identified that the shape of the surrogate gradient function significantly affects the final training accuracy. The shape of the surrogate gradient function is typically manually selected before training and remains static throughout the training process. In this article, we introduce a novel k-based leaky integrate-and-fire (KLIF) spiking neural model. KLIF, featuring a learnable parameter, enables the dynamic adjustment of the height and width of the effective surrogate gradient near threshold during training. Our proposed model undergoes evaluation on static CIFAR-10 and CIFAR-100 data sets, as well as neuromorphic CIFAR10-DVS and DVS128-Gesture data sets. Experimental results demonstrate that KLIF outperforms the leaky Integrate-and-Fire (LIF) model across multiple data sets and network architectures. The superior performance of KLIF positions it as a viable replacement for the essential role of LIF in SNNs across diverse tasks.","PeriodicalId":54731,"journal":{"name":"Neural Computation","volume":"36 12","pages":"2636-2650"},"PeriodicalIF":2.7,"publicationDate":"2024-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142309089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Associative Learning and Active Inference
Petr Anokhin; Artyom Sorokin; Mikhail Burtsev; Karl Friston
Neural Computation, 36(12): 2602-2635. Published November 19, 2024. DOI: 10.1162/neco_a_01711

Abstract: Associative learning is a behavioral phenomenon in which individuals develop connections between stimuli or events based on their co-occurrence. Initially studied by Pavlov in his conditioning experiments, the fundamental principles of learning have since been expanded through the discovery of a wide range of learning phenomena. Computational models have been developed based on the concept of minimizing reward prediction errors. The Rescorla-Wagner model, in particular, is a well-known model that has greatly influenced the field of reinforcement learning. However, the simplicity of these models restricts their ability to fully explain the diverse range of behavioral phenomena associated with learning. In this study, we adopt the free energy principle, which suggests that living systems strive to minimize surprise or uncertainty under their internal models of the world. We treat the learning process as the minimization of free energy and investigate its relationship with the Rescorla-Wagner model, focusing on the informational aspects of learning, different types of surprise, and prediction errors based on beliefs and values. Furthermore, we explore how well-known behavioral phenomena such as blocking, overshadowing, and latent inhibition can be modeled within the active inference framework. We accomplish this by using the informational and novelty aspects of attention, which share ideas proposed by the seemingly contradictory Mackintosh and Pearce-Hall models. Thus, we demonstrate that the free energy principle, as a theoretical framework derived from first principles, can integrate the ideas and models of associative learning proposed on the basis of empirical experiments and serve as a framework for better understanding the computational processes behind associative learning in the brain.
{"title":"Optimizing Attention and Cognitive Control Costs Using Temporally Layered Architectures","authors":"Devdhar Patel;Terrence Sejnowski;Hava Siegelmann","doi":"10.1162/neco_a_01718","DOIUrl":"10.1162/neco_a_01718","url":null,"abstract":"The current reinforcement learning framework focuses exclusively on performance, often at the expense of efficiency. In contrast, biological control achieves remarkable performance while also optimizing computational energy expenditure and decision frequency. We propose a decision-bounded Markov decision process (DB-MDP) that constrains the number of decisions and computational energy available to agents in reinforcement learning environments. Our experiments demonstrate that existing reinforcement learning algorithms struggle within this framework, leading to either failure or suboptimal performance. To address this, we introduce a biologically inspired, temporally layered architecture (TLA), enabling agents to manage computational costs through two layers with distinct timescales and energy requirements. TLA achieves optimal performance in decision-bounded environments and in continuous control environments, matching state-of-the-art performance while using a fraction of the computing cost. Compared to current reinforcement learning algorithms that solely prioritize performance, our approach significantly lowers computational energy expenditure while maintaining performance. These findings establish a benchmark and pave the way for future research on energy and time-aware control.","PeriodicalId":54731,"journal":{"name":"Neural Computation","volume":"36 12","pages":"2734-2763"},"PeriodicalIF":2.7,"publicationDate":"2024-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142395375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generalization Analysis of Transformers in Distribution Regression.","authors":"Peilin Liu, Ding-Xuan Zho","doi":"10.1162/neco_a_01726","DOIUrl":"10.1162/neco_a_01726","url":null,"abstract":"<p><p>In recent years, models based on the transformer architecture have seen widespread applications and have become one of the core tools in the field of deep learning. Numerous successful and efficient techniques, such as parameter-efficient fine-tuning and efficient scaling, have been proposed surrounding their applications to further enhance performance. However, the success of these strategies has always lacked the support of rigorous mathematical theory. To study the underlying mechanisms behind transformers and related techniques, we first propose a transformer learning framework motivated by distribution regression, with distributions being inputs, connect a two-stage sampling process with natural language processing, and present a mathematical formulation of the attention mechanism called attention operator. We demonstrate that by the attention operator, transformers can compress distributions into function representations without loss of information. Moreover, with the advantages of our novel attention operator, transformers exhibit a stronger capability to learn functionals with more complex structures than convolutional neural networks and fully connected networks. Finally, we obtain a generalization bound within the distribution regression framework. Throughout theoretical results, we further discuss some successful techniques emerging with large language models (LLMs), such as prompt tuning, parameter-efficient fine-tuning, and efficient scaling. We also provide theoretical insights behind these techniques within our novel analysis framework.</p>","PeriodicalId":54731,"journal":{"name":"Neural Computation","volume":" ","pages":"1-34"},"PeriodicalIF":2.7,"publicationDate":"2024-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142669939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Generalization Guarantees of Gradient Descent for Shallow Neural Networks
Puyu Wang; Yunwen Lei; Di Wang; Yiming Ying; Ding-Xuan Zhou
Neural Computation, advance publication, pp. 1-59. Published November 18, 2024. DOI: 10.1162/neco_a_01725

Abstract: Significant progress has been made recently in understanding the generalization of neural networks (NNs) trained by gradient descent (GD) using the algorithmic stability approach. However, most of the existing research has focused on one-hidden-layer NNs and has not addressed the impact of different network scalings, where network scaling corresponds to the normalization of the layers. In this article, we greatly extend previous work (Lei et al., 2022; Richards & Kuzborskij, 2021) by conducting a comprehensive stability and generalization analysis of GD for two-layer and three-layer NNs. For two-layer NNs, our results are established under general network scaling, relaxing previous conditions. In the case of three-layer NNs, our technical contribution lies in demonstrating their nearly co-coercive property by utilizing a novel induction strategy that thoroughly explores the effects of overparameterization. As a direct application of our general findings, we derive the excess risk rate of O(1/√n) for GD in both two-layer and three-layer NNs. This sheds light on sufficient or necessary conditions for underparameterized and overparameterized NNs trained by GD to attain the desired risk rate of O(1/√n). Moreover, we demonstrate that as the scaling factor increases or the network complexity decreases, less overparameterization is required for GD to achieve the desired error rates. Additionally, under a low-noise condition, we obtain a fast risk rate of O(1/n) for GD in both two-layer and three-layer NNs.
Deep Nonnegative Matrix Factorization With Beta Divergences
Valentin Leplat; Le T. K. Hien; Akwum Onwunta; Nicolas Gillis
Neural Computation, 36(11): 2365-2402. Published October 11, 2024. DOI: 10.1162/neco_a_01679

Abstract: Deep nonnegative matrix factorization (deep NMF) has recently emerged as a valuable technique for extracting multiple layers of features across different scales. However, all existing deep NMF models and algorithms have primarily centered their evaluation on the least squares error, which may not be the most appropriate metric for assessing the quality of approximations on diverse data sets. For instance, when dealing with data types such as audio signals and documents, it is widely acknowledged that β-divergences offer a more suitable alternative. In this article, we develop new models and algorithms for deep NMF using some β-divergences, with a focus on the Kullback-Leibler divergence. Subsequently, we apply these techniques to the extraction of facial features, the identification of topics within document collections, and the identification of materials within hyperspectral images.
{"title":"Multimodal and Multifactor Branching Time Active Inference","authors":"Théophile Champion;Marek Grześ;Howard Bowman","doi":"10.1162/neco_a_01703","DOIUrl":"10.1162/neco_a_01703","url":null,"abstract":"Active inference is a state-of-the-art framework for modeling the brain that explains a wide range of mechanisms. Recently, two versions of branching time active inference (BTAI) have been developed to handle the exponential (space and time) complexity class that occurs when computing the prior over all possible policies up to the time horizon. However, those two versions of BTAI still suffer from an exponential complexity class with regard to the number of observed and latent variables being modeled. We resolve this limitation by allowing each observation to have its own likelihood mapping and each latent variable to have its own transition mapping. The implicit mean field approximation was tested in terms of its efficiency and computational cost using a dSprites environment in which the metadata of the dSprites data set was used as input to the model. In this setting, earlier implementations of branching time active inference (namely, BTAIVMP and BTAIBF) underperformed in relation to the mean field approximation (BTAI3MF) in terms of performance and computational efficiency. Specifically, BTAIVMP was able to solve 96.9% of the task in 5.1 seconds, and BTAIBF was able to solve 98.6% of the task in 17.5 seconds. Our new approach outperformed both of its predecessors by solving the task completely (100%) in only 2.559 seconds.","PeriodicalId":54731,"journal":{"name":"Neural Computation","volume":"36 11","pages":"2479-2504"},"PeriodicalIF":2.7,"publicationDate":"2024-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142114870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}