Preface

I won't go over the basics here (the overall training flow, split-attribute selection, and pruning); let's jump straight to the key points below.

Handling Missing Values

I used to think decision trees couldn't handle missing values, which is why my earlier ensemble-learning post implemented missing-value handling for random forests by hand. This time, let me show you a decision tree that can deal with missing values itself.

import numpy as np
from sklearn.tree import DecisionTreeClassifier


class DecisionTreeClassifierWithMissing(DecisionTreeClassifier):
    # Note: scikit-learn's DecisionTreeClassifier builds its tree in Cython and does not
    # expose a Python-level _split_node hook, so this override documents the intended
    # splitting strategy rather than something fit() will actually call.
    def _split_node(self, X, y, sample_weight, depth, impurity, n_node_samples,
                    weighted_n_node_samples, feature, threshold):
        # Samples with a missing value on the split feature are sent down BOTH branches;
        # all other samples go left or right by the usual threshold comparison.
        left_indices = []
        right_indices = []
        for i in range(X.shape[0]):
            if np.isnan(X[i, feature]):
                left_indices.append(i)
                right_indices.append(i)
            elif X[i, feature] <= threshold:
                left_indices.append(i)
            else:
                right_indices.append(i)

        left_indices = np.array(left_indices)
        right_indices = np.array(right_indices)

        if sample_weight is None:
            left_sample_weight = np.ones(left_indices.shape[0])
            right_sample_weight = np.ones(right_indices.shape[0])
        else:
            left_sample_weight = sample_weight[left_indices]
            right_sample_weight = sample_weight[right_indices]

        left_child = super()._split_node(X[left_indices], y[left_indices],
                                         left_sample_weight, depth + 1,
                                         impurity, n_node_samples,
                                         weighted_n_node_samples, feature, threshold)
        right_child = super()._split_node(X[right_indices], y[right_indices],
                                          right_sample_weight, depth + 1,
                                          impurity, n_node_samples,
                                          weighted_n_node_samples, feature, threshold)

        return left_child, right_child

    def predict(self, X):
        predictions = []
        for i in range(X.shape[0]):
            node_id = 0
            # A leaf is a node whose left and right children are equal (both -1).
            while self.tree_.children_left[node_id] != self.tree_.children_right[node_id]:
                feature = self.tree_.feature[node_id]
                threshold = self.tree_.threshold[node_id]
                if np.isnan(X[i, feature]):
                    # Missing value: walk down both subtrees and vote on the two results.
                    left_node_id = self.tree_.children_left[node_id]
                    right_node_id = self.tree_.children_right[node_id]
                    left_pred = self._predict_from_node(X[i], left_node_id)
                    right_pred = self._predict_from_node(X[i], right_node_id)
                    votes = [left_pred, right_pred]
                    classes, counts = np.unique(votes, return_counts=True)
                    prediction = classes[np.argmax(counts)]
                    break
                elif X[i, feature] <= threshold:
                    node_id = self.tree_.children_left[node_id]
                else:
                    node_id = self.tree_.children_right[node_id]
            else:
                # Reached a leaf without meeting a missing value: take its majority class.
                value = self.tree_.value[node_id][0]
                prediction = np.argmax(value)
            predictions.append(prediction)
        return np.array(predictions)

    def _predict_from_node(self, x, node_id):
        # Walk a single sample down the subtree rooted at node_id.
        while self.tree_.children_left[node_id] != self.tree_.children_right[node_id]:
            feature = self.tree_.feature[node_id]
            threshold = self.tree_.threshold[node_id]
            if np.isnan(x[feature]):
                # Missing value inside the subtree: recurse into both children and vote.
                left_node_id = self.tree_.children_left[node_id]
                right_node_id = self.tree_.children_right[node_id]
                left_pred = self._predict_from_node(x, left_node_id)
                right_pred = self._predict_from_node(x, right_node_id)
                votes = [left_pred, right_pred]
                classes, counts = np.unique(votes, return_counts=True)
                return classes[np.argmax(counts)]
            elif x[feature] <= threshold:
                node_id = self.tree_.children_left[node_id]
            else:
                node_id = self.tree_.children_right[node_id]
        value = self.tree_.value[node_id][0]
        return np.argmax(value)
  1. _split_node: when a sample has a missing value on the split feature, it is not filtered out; it is assigned to both the left and the right subtree.
  2. _predict_from_node: when a missing value is encountered, the sample walks down both subtrees, and the final class is decided by a vote.
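
A quick usage sketch (the toy data and the 10% missing rate are my own choices, not from the post). One caveat: whether fit() accepts NaN depends on your scikit-learn version (1.3+ tolerates missing values natively, with its own strategy), and as noted in the comments above, _split_node is not called by fit(); the missing-value logic that is guaranteed to run here is the overridden predict.

import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 2] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan    # inject roughly 10% missing values

clf = DecisionTreeClassifierWithMissing(max_depth=4)
clf.fit(X, y)                            # needs a scikit-learn version whose fit() tolerates NaN
print(clf.predict(X[:10]))               # on NaN, predict() descends both branches and votes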

Why train the model with the missing values left in?

Because simply dropping every sample that contains a missing value can throw away useful information: the non-missing features of those rows are discarded along with them.
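
As a rough illustration of how much listwise deletion can cost (hypothetical numbers, not from the post): with 20 features and only 5% of cells missing at random, dropping every row that contains a NaN removes roughly two thirds of the data.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
X[rng.random(X.shape) < 0.05] = np.nan          # 5% of cells missing at random

complete = ~np.isnan(X).any(axis=1)             # rows with no missing value at all
print(f"rows kept after listwise deletion: {complete.sum()} / {X.shape[0]}")
# Expected survival rate is about 0.95 ** 20, i.e. roughly 36% of the rows.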

That said, inside a random forest there is no need to use this missing-value-aware decision tree; ordinary decision trees are enough.

The reason is that a random forest trains each tree on a randomly chosen subset of features, so a sample that some trees have to drop (because it is missing a value on one of their features) can still be learned from by the other trees that never look at that feature.

That random forest also has to be implemented by hand; for the details, just read my earlier post.
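
A minimal sketch of that idea (my own illustration; the function name and parameters are hypothetical, and this is not the code from the earlier post): each tree draws its own random feature subset and drops only the rows that have missing values within that subset, so almost every sample is still used by most of the trees.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest_dropping_per_tree(X, y, n_trees=50, max_features=0.5, seed=0):
    """Hypothetical helper: train ordinary trees, each on a random feature subset,
    discarding only the rows that are missing within that subset."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    k = max(1, int(max_features * n_features))
    forest = []
    for _ in range(n_trees):
        feats = rng.choice(n_features, size=k, replace=False)
        rows = ~np.isnan(X[:, feats]).any(axis=1)   # keep rows complete on these features only
        tree = DecisionTreeClassifier(random_state=int(rng.integers(1 << 31)))
        tree.fit(X[rows][:, feats], y[rows])
        forest.append((tree, feats))
    return forest

Prediction would need the same per-tree guard on the selected features; I leave that out to keep the sketch short.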

Multivariate Decision Trees

How should I put this part... constructing combined variables like this really belongs more to data mining.