Convergence analysis of feedforward neural networks using the online gradient method with smoothing L1 regularization
Khidir Shaib Mohamed and Suhail Abdullah Alsaqer
Abstract
The online gradient method is one of the simplest and most widely used training methods for feedforward neural networks (FFNNs). However, a problem can arise with this method: the weights can grow very large, leading to overfitting. Regularization is a technique used to improve generalization performance and prevent overfitting in networks. This paper focuses on the convergence analysis of an online gradient method with L1 regularization for training FFNNs. L1 regularization promotes sparse models but complicates the convergence analysis because the absolute value function is not differentiable at the origin. To address this issue, an adaptive smoothing function is introduced into the error function to replace the L1 regularization term near the origin. This approach encourages a sparser network structure by driving weights toward zero during training so that they can be eliminated afterwards, which simplifies the network structure and accelerates convergence, as demonstrated by the numerical experiments presented in this paper. It also makes it possible to prove the convergence of the proposed training method. Numerical experiments on 4-bit and 5-bit parity problems, Gaussian and hyperbolic function approximations, and the Monk and Sonar classification problems are provided to validate the theoretical findings and the superiority of the proposed algorithm.
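For illustration, a typical smoothed L1 penalty of the kind described above is sketched below; the exact smoothing function, notation, and parameters used in the paper may differ, so this particular piecewise form and the symbols $a$, $\lambda$, and $\eta$ should be read as assumptions. The absolute value is kept away from the origin and replaced by a quadratic on a small interval $[-a, a]$:

\[
f_a(w) \;=\;
\begin{cases}
|w|, & |w| \ge a,\\[4pt]
\dfrac{w^2}{2a} + \dfrac{a}{2}, & |w| < a,
\end{cases}
\qquad a > 0,
\]

so that $f_a$ is continuously differentiable, coincides with $|w|$ for $|w| \ge a$, and matches it in value and slope at $w = \pm a$. With $\tilde{E}_j(\mathbf{w})$ denoting the error on training sample $j$, the smoothed regularized error $E_j(\mathbf{w}) = \tilde{E}_j(\mathbf{w}) + \lambda \sum_i f_a(w_i)$ is then minimized sample by sample via the online gradient update $\mathbf{w}^{k+1} = \mathbf{w}^k - \eta \nabla_{\mathbf{w}} E_j(\mathbf{w}^k)$, where $\lambda$ is the regularization coefficient and $\eta$ the learning rate.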
Keywords
Online gradient method, Smoothing function, L1 regularization, Convergence, Feedforward neural network.
Cite this article
Mohamed KS, Alsaqer SA. Convergence analysis of feedforward neural networks using the online gradient method with smoothing L1 regularization. International Journal of Advanced Technology and Engineering Exploration. 2024;11(117):1127-1142. DOI: 10.19101/IJATEE.2024.111100102