Decision tree close reading: XGBoost: A Scalable Tree Boosting System.
For a given dataset $D$ with $n$ training samples and $m$ features, the tree ensemble model predicts the output with a sum of $K$ additive functions (Eq. 1 of the paper):

$$\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F},$$

where $\mathcal{F}$ is the space of regression trees (CART) and each $f_k$ corresponds to an independent tree structure and set of leaf weights.
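As a minimal sketch of this additive model (not the paper's implementation), assuming each fitted tree exposes a scikit-learn-style `predict` method:

```python
import numpy as np

def ensemble_predict(trees, X):
    """Additive prediction: y_hat_i = sum_{k=1..K} f_k(x_i)."""
    # Each tree f_k maps a sample to the weight of the leaf it falls into;
    # the ensemble output is simply the sum of the K per-tree scores.
    return np.sum([f_k.predict(X) for f_k in trees], axis=0)
```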
The regularized objective of Section 2.1 (Eq. 2 of the paper) takes functions as parameters, so it cannot be optimized by traditional Euclidean-space methods. The model is therefore trained additively (forward stagewise). Let $\hat{y}_i^{(t)}$ be the prediction for the $i$-th instance at the $t$-th iteration; a new $f_t$ is added to minimize the objective

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\!\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t), \qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2,$$

where $T$ is the number of leaves and $w$ the vector of leaf weights.
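A sketch of the forward stagewise loop, where `fit_one_tree` is a hypothetical helper that fits a single regression tree against the current predictions:

```python
import numpy as np

def train_additive(X, y, num_rounds, fit_one_tree):
    """Forward stagewise training: earlier trees stay fixed; at round t a
    new f_t is fitted against the current prediction y_hat^(t-1)."""
    trees, y_hat = [], np.zeros(len(y))   # y_hat^(0) = 0
    for t in range(num_rounds):
        # fit_one_tree (hypothetical) approximately minimizes
        # sum_i l(y_i, y_hat_i + f_t(x_i)) + Omega(f_t)
        f_t = fit_one_tree(X, y, y_hat)
        trees.append(f_t)
        y_hat += f_t.predict(X)           # y_hat^(t) = y_hat^(t-1) + f_t(x)
    return trees
```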
To optimize this objective, XGBoost uses a Taylor expansion. Unlike GBDT, which uses only first-order gradient information, XGBoost expands the loss to second order. Dropping the constant term yields the simplified objective (loss function)

$$\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t),$$

where $g_i = \partial_{\hat{y}^{(t-1)}}\, l(y_i, \hat{y}^{(t-1)})$ and $h_i = \partial^2_{\hat{y}^{(t-1)}}\, l(y_i, \hat{y}^{(t-1)})$ are the first- and second-order gradients of the loss.
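For concreteness, the per-sample statistics $g_i$ and $h_i$ for two common losses (standard calculus, not code from the paper):

```python
import numpy as np

def grad_hess_squared(y, y_hat):
    """g_i and h_i for squared loss l = 0.5 * (y - y_hat)^2."""
    return y_hat - y, np.ones_like(y)

def grad_hess_logistic(y, y_hat):
    """g_i and h_i for logistic loss; y in {0, 1}, y_hat a raw score."""
    p = 1.0 / (1.0 + np.exp(-y_hat))      # sigmoid of the current margin
    return p - y, p * (1.0 - p)
```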
Define $q$ as the function that maps an input $x$ to a leaf index, and define the instance set of leaf $j$ as $I_j = \{\, i \mid q(x_i) = j \,\}$. For a fixed tree structure $q$, the optimal weight of leaf $j$ and the resulting structure score are

$$w_j^* = -\frac{G_j}{H_j + \lambda}, \qquad \tilde{\mathcal{L}}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T,$$

with $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$.
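These two formulas translate directly into code; a minimal sketch:

```python
import numpy as np

def leaf_weight(G_j, H_j, lam):
    """Optimal leaf weight w_j* = -G_j / (H_j + lambda)."""
    return -G_j / (H_j + lam)

def structure_score(G, H, lam, gamma):
    """Structure score given per-leaf sums G_j, H_j (lower is better)."""
    G, H = np.asarray(G), np.asarray(H)
    return -0.5 * np.sum(G**2 / (H + lam)) + gamma * len(G)
```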
Besides the regularized objective, two further techniques are used to prevent overfitting: shrinkage, which scales the weights of each newly added tree by a factor $\eta$ (similar to a learning rate), and column (feature) subsampling.
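Both techniques are exposed as parameters in the xgboost Python package (`eta` for shrinkage, `colsample_bytree` for column subsampling); the random data below is only for demonstration:

```python
import numpy as np
import xgboost as xgb

X, y = np.random.rand(200, 5), np.random.rand(200)
dtrain = xgb.DMatrix(X, label=y)
params = {
    "eta": 0.1,               # shrinkage: scale each new tree's weights by eta
    "colsample_bytree": 0.8,  # column subsampling: 80% of features per tree
    "lambda": 1.0,            # L2 penalty on leaf weights (part of Omega)
    "gamma": 0.0,             # minimum loss reduction required to split
}
bst = xgb.train(params, dtrain, num_boost_round=50)
```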
When growing a tree and searching for split points, two questions must be answered: which feature (dimension) to split on, and at which threshold within the chosen feature to split.
The exact greedy algorithm enumerates all possible splits over all features: starting from the root, it enumerates every usable feature at each leaf. For efficiency, the algorithm first sorts the data by feature value and visits it in order, accumulating the gradient statistics that enter the split-scoring formula. (In effect, a two-level loop exhausts both choices, feature and threshold, trying them one by one while keeping the best split found so far.) The gain of a candidate split is

$$\mathcal{L}_{split} = \frac{1}{2}\left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma.$$
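A minimal single-feature sketch of this sweep (the real implementation handles many features, caching, and parallelism):

```python
import numpy as np

def best_split_exact(x, g, h, lam, gamma):
    """Exact greedy split search over one feature: sort by value, sweep
    left to right with running sums of g and h, score each threshold."""
    order = np.argsort(x)
    xs, gs, hs = x[order], g[order], h[order]
    G, H = gs.sum(), hs.sum()
    G_L = H_L = 0.0
    best_gain, best_thresh = 0.0, None
    for i in range(len(xs) - 1):
        G_L += gs[i]
        H_L += hs[i]
        if xs[i] == xs[i + 1]:
            continue                      # no threshold between equal values
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam)
                      - G**2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain, best_thresh = gain, 0.5 * (xs[i] + xs[i + 1])
    return best_gain, best_thresh
```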
This algorithm must enumerate the entire dataset, so it cannot run efficiently when the data does not fit completely into memory (too many feature values for the machine to hold).
The approximate algorithm first proposes candidate split points according to percentiles of the feature distribution. Continuous features are then mapped into buckets delimited by these candidate points, and the gradient statistics are aggregated per bucket. The best split is chosen among the candidates from the aggregated statistics. (Computing the statistics within each bucket yields the best split point and its gain.)
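A simplified sketch of the bucketed search, using plain (unweighted) percentiles as candidates:

```python
import numpy as np

def best_split_approx(x, g, h, lam, gamma, num_buckets=32):
    """Approximate split search: candidate cuts at feature percentiles,
    g/h aggregated per bucket, gain evaluated only at the candidates."""
    qs = np.linspace(0, 100, num_buckets + 1)[1:-1]
    cuts = np.unique(np.percentile(x, qs))          # candidate split points
    bucket = np.searchsorted(cuts, x)               # bucket index per sample
    G_b = np.bincount(bucket, weights=g, minlength=len(cuts) + 1)
    H_b = np.bincount(bucket, weights=h, minlength=len(cuts) + 1)
    G, H = G_b.sum(), H_b.sum()
    G_L, H_L = np.cumsum(G_b)[:-1], np.cumsum(H_b)[:-1]  # stats left of each cut
    gain = 0.5 * (G_L**2 / (H_L + lam) + (G - G_L)**2 / (H - H_L + lam)
                  - G**2 / (H + lam)) - gamma
    best = int(np.argmax(gain))
    return gain[best], cuts[best]
```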
Quantiles are computed approximately with a quantile sketch, which in turn answers split-candidate queries approximately. The general idea is to project a data stream into a small amount of storage that summarizes the whole stream; this small-space summary (which also maintains the minimum and maximum of the original sequence) is called a sketch, and it can approximately answer a specific class of queries. XGBoost in fact uses a weighted quantile sketch in which each instance is weighted by its second-order gradient $h_i$.
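The paper's weighted quantile sketch comes with merge and prune operations and provable error bounds; the toy class below is not that algorithm, only a minimal illustration of the sketch idea: a bounded summary (here a reservoir sample plus exact min/max) that approximately answers quantile queries:

```python
import random

class ToyQuantileSketch:
    """Bounded-size stream summary for approximate quantile queries.
    NOT the paper's weighted quantile sketch; illustration only."""
    def __init__(self, capacity=256, seed=0):
        self.capacity, self.n = capacity, 0
        self.sample = []
        self.lo, self.hi = float("inf"), float("-inf")
        self.rng = random.Random(seed)

    def update(self, v):
        self.n += 1
        self.lo, self.hi = min(self.lo, v), max(self.hi, v)
        if len(self.sample) < self.capacity:
            self.sample.append(v)
        else:
            j = self.rng.randrange(self.n)   # reservoir sampling keeps a
            if j < self.capacity:            # uniform sample of the stream
                self.sample[j] = v

    def quantile(self, q):
        """Approximate q-quantile, 0 <= q <= 1."""
        if q <= 0:
            return self.lo
        if q >= 1:
            return self.hi
        s = sorted(self.sample)
        return s[int(q * (len(s) - 1))]
```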
Each tree node is given a default direction, so the model automatically learns where to send instances with missing values. At each split, the missing values are tentatively routed to the left child and then to the right child; by computing and comparing the two scores, the better option is learned as the default direction for missing values of that feature.
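A minimal sketch of choosing the default direction at one split, with missing values encoded as NaN:

```python
import numpy as np

def choose_default_direction(x, g, h, thresh, lam):
    """Score 'missing goes left' vs 'missing goes right'; keep the better."""
    present = ~np.isnan(x)
    left = np.zeros_like(present)
    left[present] = x[present] < thresh
    right = present & ~left
    score = lambda G, H: G**2 / (H + lam)     # leaf term of the gain formula
    G_m, H_m = g[~present].sum(), h[~present].sum()
    G_L, H_L = g[left].sum(), h[left].sum()
    G_R, H_R = g[right].sum(), h[right].sum()
    gain_left = score(G_L + G_m, H_L + H_m) + score(G_R, H_R)    # missing -> left
    gain_right = score(G_L, H_L) + score(G_R + G_m, H_R + H_m)   # missing -> right
    return "left" if gain_left >= gain_right else "right"
```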
XGBoost adds a penalty (regularization) term to the objective function, which greatly improves the model's generalization; it also supports subsampling of both rows and columns and optimizes computation speed.
Especially interesting is the handling of sparse values: the model automatically learns a default split direction at each node and selects the best one.