Convergence analysis of feedforward neural networks using the online gradient method with smoothing L1 regularization
Khidir Shaib Mohamed and Suhail Abdullah Alsaqer
Abstract
The online gradient method is one of the simplest and most widely used training methods for feedforward neural networks (FFNNs). However, a problem can arise with this method: the weights can grow very large, leading to overfitting. Regularization is a technique used to improve generalization performance and prevent overfitting in networks. This paper focuses on the convergence analysis of an online gradient method with L1 regularization for training FFNNs. L1 regularization promotes sparse models but complicates the convergence analysis because the absolute value function is not differentiable at the origin. To address this issue, an adaptive smoothing function is introduced into the error function to replace the L1 regularization term near the origin. This approach encourages a sparser network structure by driving weights toward zero during training so that they can be eliminated afterwards, which simplifies the network structure and accelerates convergence, as demonstrated by the numerical experiments presented in this paper. It also makes it possible to prove the convergence of the proposed training method. Numerical experiments on 4-bit and 5-bit parity problems, Gaussian and hyperbolic function approximations, and the Monk and Sonar classification problems are provided to validate the theoretical findings and the superiority of the proposed algorithm.
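For illustration, a typical smoothed L1 penalty of the kind described above is sketched below; the exact smoothing function, notation, and parameters used in the paper may differ, so this particular piecewise form and the symbols $a$, $\lambda$, and $\eta$ should be read as assumptions. The absolute value is kept away from the origin and replaced by a quadratic on a small interval $[-a, a]$:

\[
f_a(w) \;=\;
\begin{cases}
|w|, & |w| \ge a,\\[4pt]
\dfrac{w^2}{2a} + \dfrac{a}{2}, & |w| < a,
\end{cases}
\qquad a > 0,
\]

so that $f_a$ is continuously differentiable, coincides with $|w|$ for $|w| \ge a$, and matches it in value and slope at $w = \pm a$. With $\tilde{E}_j(\mathbf{w})$ denoting the error on training sample $j$, the smoothed regularized error $E_j(\mathbf{w}) = \tilde{E}_j(\mathbf{w}) + \lambda \sum_i f_a(w_i)$ is then minimized sample by sample via the online gradient update $\mathbf{w}^{k+1} = \mathbf{w}^k - \eta \nabla_{\mathbf{w}} E_j(\mathbf{w}^k)$, where $\lambda$ is the regularization coefficient and $\eta$ the learning rate.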
Keywords
Online gradient method, Smoothing function, L1 regularization, Convergence, Feedforward neural network.
Cite this article
Mohamed KS, Alsaqer SA. Convergence analysis of feedforward neural networks using the online gradient method with smoothing L1 regularization. International Journal of Advanced Technology and Engineering Exploration. 2024;11(117):1127-1142. DOI: 10.19101/IJATEE.2024.111100102