Using boosting algorithms, some researchers [13,21,35] have shown that a strong learning model can, in theory, be designed by combining multiple weak learners. These boosting algorithms collect and optimize decision trees, which serve as the weak learners. Among the available boosting algorithms, AdaBoost can advantageously use recurrent neural networks as weak learners for analyzing text or time series, as demonstrated by Assaad et al. and Buabin et al. In addition, Karianakis et al. used the AdaBoost decision trees method to combine CNNs for classifying images. Friedman et al. introduced gradient boosting machines (GBM), which can be applied to training neural networks and can serve as an effective gradient descent optimization algorithm. In turn, Gao et al. developed a boosting algorithm that selects weak CNN learners from a set of CNNs to build a stronger learner for video recognition. Wang et al. and Han et al. built a boosting algorithm to regularize the CNN loss function.
3. Gradient boosting trees
The boosting trees algorithm was developed to solve regression and classification problems by ensembling weak prediction models into a strong prediction model. It can be viewed as an optimization algorithm that computes a sequence of successive trees, in which every tree is used to predict the residuals of the preceding tree. In general, the gradient boosting trees algorithm aims to combine weak classifiers into a single strong classifier.
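The residual-fitting idea described above can be illustrated with a minimal sketch. It assumes a squared-error loss (for which the residuals coincide with the negative gradient), a single input feature, and one-split regression trees (stumps) as the weak learners. The function names `fit_stump`, `predict_stump`, and `boost`, as well as the toy data, are hypothetical and chosen only for illustration; this is not the implementation used in this work.

```python
import numpy as np

def fit_stump(x, residuals):
    """Fit a one-split regression tree (stump) to the residuals.

    Assumes x contains at least two distinct values.
    """
    best = None
    values = np.unique(x)
    # Candidate thresholds: midpoints between consecutive distinct values.
    for threshold in (values[:-1] + values[1:]) / 2.0:
        left = residuals[x <= threshold]
        right = residuals[x > threshold]
        # Squared error when each side predicts the mean of its residuals.
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    """Return the leaf score of the stump for every entry of x."""
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost(x, v, n_trees=50, learning_rate=0.1):
    """Build a sequence of stumps, each fitted to the current residuals."""
    base = v.mean()
    prediction = np.full_like(v, base, dtype=float)
    stumps = []
    for _ in range(n_trees):
        residuals = v - prediction          # what the ensemble still gets wrong
        stump = fit_stump(x, residuals)     # the new tree predicts those residuals
        prediction += learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, stumps, prediction

# Toy data (an assumption for illustration): a noisy sine curve.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
v = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)

base, stumps, fitted = boost(x, v)
mse_before = np.mean((v - v.mean()) ** 2)   # error of the constant initial model
mse_after = np.mean((v - fitted) ** 2)      # error after 50 boosting rounds
```

Each round shrinks the training error a little, because the new stump is a least-squares fit to the current residuals and only a fraction of it (the learning rate) is added to the ensemble.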
3.1. Regularized learning objective
Like any supervised learning algorithm, the objective of the gradient boosting algorithm is to classify objects by minimizing a loss function. To solve this optimization problem, boosting trees are trained by using gradient descent algorithms, and the predictions of new boosting trees are updated and improved based on a learning rate.
Here, h denotes a regression tree that can classify training data u into the corresponding group. Each single tree h(u) consists of D leaves. Thus, each $f_k$ is a function of a regression tree h and corresponding leaf weights w. We denote the score on the ith leaf of the regression tree h by $\delta_i$. Each example is classified in accordance with the leaves of the regression tree h based on a set of decision rules. As a result, we can compute the final prediction for this example by summing up the scores in the corresponding leaves. The set of functions can be found by solving the following minimization problem:

$$G(\phi) = \sum_{i} g(\hat{v}_i, v_i) + \sum_{k} \delta(f_k) \qquad (2)$$

where $\delta(f) = \gamma D + \frac{1}{2} \lambda \sum_{i=1}^{D} \delta_i^2$. In this formula, the difference between the prediction $\hat{v}_i$ and the target $v_i$ is computed by the differentiable convex loss function g, and the complexity of the model is measured by the additional regularization term $\delta(f)$. The regularization term is added to this equation to avoid the overfitting problem and to smooth the learning weights: with it, the boosting-trees model tends to select simple functions for the prediction task. Accordingly, the training parameters are minimized. Ideally, if the loss function $G(\phi)$ can be minimized, the sum of its residuals should be approximately zero. In fact, each training data point has one residual, equal to the difference between the observed value $v_i$ and the predicted value $\hat{v}_i$. The idea of the gradient boosting algorithm is to repeatedly add new boosting trees that reduce the residuals and strengthen the model. The training process can stop if the sum of these residuals is less than a predefined threshold.
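To make the regularized objective concrete, the following sketch evaluates a data term plus a per-tree penalty of the form γD + ½λ Σ δ_i², where D is the number of leaves of a tree and δ_i are its leaf scores. Several points are assumptions made for illustration only: the loss g is taken to be the squared error, the function names are hypothetical, the numbers are toy values, and the stopping test compares the absolute value of the summed residuals with the threshold.

```python
import numpy as np

def tree_complexity(leaf_scores, gamma, lam):
    """Penalty gamma * D + 0.5 * lam * sum(delta_i ** 2) for one tree with D leaves."""
    leaf_scores = np.asarray(leaf_scores, dtype=float)
    return gamma * leaf_scores.size + 0.5 * lam * np.sum(leaf_scores ** 2)

def regularized_objective(v_hat, v, trees_leaf_scores, gamma, lam):
    """Data term (squared error, an assumed choice of g) plus the penalty of every tree."""
    v_hat = np.asarray(v_hat, dtype=float)
    v = np.asarray(v, dtype=float)
    data_term = np.sum((v_hat - v) ** 2)
    penalty = sum(tree_complexity(s, gamma, lam) for s in trees_leaf_scores)
    return data_term + penalty

def residuals_below_threshold(v_hat, v, threshold):
    """Stopping test: is the absolute summed residual smaller than the threshold?"""
    residual_sum = np.sum(np.asarray(v, dtype=float) - np.asarray(v_hat, dtype=float))
    return abs(residual_sum) < threshold

# Toy values (assumptions): three examples and two trees with 2 and 3 leaves.
v = [1.0, 2.0, 3.0]
v_hat = [1.5, 2.0, 3.5]
trees_leaf_scores = [[0.5, -0.5], [1.0, 0.0, -1.0]]

objective = regularized_objective(v_hat, v, trees_leaf_scores, gamma=1.0, lam=2.0)
# Data term: 0.25 + 0 + 0.25 = 0.5
# Tree 1:    1 * 2 + 0.5 * 2 * 0.5 = 2.5
# Tree 2:    1 * 3 + 0.5 * 2 * 2.0 = 5.0
# Objective: 0.5 + 2.5 + 5.0 = 8.0
stop = residuals_below_threshold(v_hat, v, threshold=0.1)  # summed residual is -1.0
```

In this example the penalty dominates the objective: a tree with more leaves or larger leaf scores is charged more, which is how the regularization term steers the model toward simpler functions.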
3.2. Gradient tree boosting
The optimal solution for Eq. (2) cannot be directly found by conventional optimization methods. Instead, we apply an effective approximation method to optimize parameters and functions in this equation. We denote the prediction of the ith example at the tth iteration by $\hat{v}_i^{(t)}$. The solution for Eq. (2) can be found by solving the following minimization problem: