Convergence of batch gradient training-based smoothing L1 regularization via adaptive momentum for feedforward neural networks
Khidir Shaib Mohamed and Raed Muhammad Albadrani
Abstract
Momentum is a widely used method for accelerating convergence in practical training and has been studied extensively in combination with regularization. Unfortunately, no effective acceleration method for L1 regularization has yet been presented. To train and prune feedforward neural networks, a batch gradient training algorithm with smoothing L1 regularization and adaptive momentum (BGSL1AM) is developed and its convergence is analyzed. The usual L1 regularizer is nonsmooth at the origin; it generates oscillations in the computation and makes convergence analysis difficult. To overcome this issue, a smoothing function that approximates the L1 regularizer at the origin is proposed. The smoothed penalty progressively drives redundant weights toward zero during training and allows them to be removed afterwards. Conditions guaranteeing convergence are given, together with weak and strong convergence analyses and a discussion of the significance of the proposed approach. Numerical simulation results on a range of function approximation and pattern classification problems demonstrate the effectiveness of the BGSL1AM algorithm: the proposed learning method exhibits good convergence properties and accuracy.
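As an illustration only, the NumPy sketch below shows the two ingredients the abstract describes: a smooth approximation of |w| near the origin and a batch weight update whose momentum coefficient adapts to the current gradient. The quartic smoothing polynomial, the interval width a, and the alignment-based momentum rule are assumptions made for illustration; the paper defines its own smoothing function and adaptive momentum coefficient.

import numpy as np

def smooth_abs(w, a=0.1):
    # Piecewise approximation of |w| that is smooth at the origin: equal to |w|
    # for |w| >= a, and a quartic polynomial on (-a, a) whose value and first
    # derivative match |w| at w = +/-a. (Illustrative choice, not the paper's exact form.)
    poly = -w**4 / (8 * a**3) + 3 * w**2 / (4 * a) + 3 * a / 8
    return np.where(np.abs(w) < a, poly, np.abs(w))

def smooth_abs_grad(w, a=0.1):
    # Derivative of smooth_abs; replaces the nondifferentiable sign(w) at the origin.
    poly_grad = -w**3 / (2 * a**3) + 3 * w / (2 * a)
    return np.where(np.abs(w) < a, poly_grad, np.sign(w))

def batch_step(w, grad_E, velocity, eta=0.05, lam=1e-3, mu_max=0.9):
    # One batch update: gradient of the training error plus the smoothed L1 penalty,
    # with an adaptive momentum coefficient that shrinks when the previous momentum
    # direction disagrees with the current descent direction (heuristic rule only).
    g = grad_E + lam * smooth_abs_grad(w)                      # penalized batch gradient
    denom = np.linalg.norm(g) * np.linalg.norm(velocity) + 1e-12
    align = max(0.0, -np.dot(g, velocity) / denom)             # alignment in [0, 1]
    velocity = mu_max * align * velocity - eta * g
    return w + velocity, velocity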
Keywords
Convergence, Smoothing L1 regularization, Adaptive momentum, Feedforward neural network, Batch gradient training.
Cite this article
Mohamed KS, Albadrani RM. Convergence of batch gradient training-based smoothing L1 regularization via adaptive momentum for feedforward neural networks. International Journal of Advanced Technology and Engineering Exploration. 2024; 11(116):1005-1019. DOI: 10.19101/IJATEE.2024.111100125.