sklearn random forest classifier: RandomForestClassifier
Random forest classifier, scikit-learn v0.19.1.
A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and to control over-fitting. The sub-sample size is always the same as the original input sample size, but the samples are drawn with replacement if bootstrap=True (the default).
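Before walking through the parameters, here is a minimal usage sketch; the synthetic dataset from make_classification simply stands in for your own data:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy data: 1000 samples, 20 features, binary labels.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A forest of 10 trees using the default Gini criterion.
clf = RandomForestClassifier(n_estimators=10, criterion='gini', random_state=0)
clf.fit(X_train, y_train)

print(clf.predict(X_test[:5]))    # predicted class labels
print(clf.score(X_test, y_test))  # mean accuracy on the held-out split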
Let's first look at the parameters of this class:
class sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False, class_weight=None)
The specific meaning of each parameter is as follows:
Parameters:
n_estimators : integer, optional (default=10)
The number of trees in the forest.
criterion : string, optional (default="gini")
The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain. Note: this parameter is tree-specific.
First of all, Gini impurity is not related to the Gini coefficient. Gini impurity measures how often a randomly chosen element from a set would be mislabeled if it were labeled at random according to the distribution of labels in the set. For any given label, the probability of a correct classification is the probability of drawing an element with that label multiplied by the probability of assigning that label. Gini impurity is therefore simply 1 minus the sum of these correct-classification probabilities, i.e. the probability of an incorrect classification; it reaches its minimum value 0 when every element in the set belongs to a single class.
Let the labels be 1, 2, …, m and let f_i be the fraction of elements with label i in the set. Then
I_G(f) = sum_{i=1..m} f_i * (1 - f_i) = sum_{i=1..m} (f_i - f_i^2) = sum_{i=1..m} f_i - sum_{i=1..m} f_i^2 = 1 - sum_{i=1..m} f_i^2
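As a quick sanity check of the formula, here is a small sketch; the gini helper and the example label lists are purely illustrative and not part of scikit-learn's API:

from collections import Counter

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions.
    n = len(labels)
    return 1.0 - sum((count / float(n)) ** 2 for count in Counter(labels).values())

print(gini([0, 0, 0, 0]))  # 0.0   -> a pure node
print(gini([0, 0, 1, 1]))  # 0.5   -> maximally mixed for two classes
print(gini([0, 0, 0, 1]))  # 0.375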
max_features : int, float, string or None, optional (default="auto")
The number of features to consider when looking for the best split.
max_depth : integer or None, optional (default=None)
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
(A leaf is "pure" when all of the samples it contains belong to the same class, i.e. its impurity is 0.)
min_samples_split : int, float, optional (default=2)
The minimum number of samples required to split an internal node.
min_samples_leaf : int, float, optional (default=1)
The minimum number of samples required to be at a leaf node.
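Both of these act as regularizers: larger values stop the trees from splitting down to tiny, noisy leaves. A rough comparison sketch, with arbitrarily chosen values:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# A forest allowed to grow to pure leaves vs. one forced to keep larger leaves.
loose = RandomForestClassifier(min_samples_split=2, min_samples_leaf=1, random_state=0)
tight = RandomForestClassifier(min_samples_split=20, min_samples_leaf=10, random_state=0)

for name, clf in [("loose", loose), ("tight", tight)]:
    print(name, cross_val_score(clf, X, y, cv=5).mean())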
min_weight_fraction_leaf : float, optional (default=0.)
The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.
max_leaf_nodes : int or None, optional (default=None)
Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
min_impurity_split : float
Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf.
Deprecated since version 0.19: min_impurity_split has been deprecated in favor of min_impurity_decrease in 0.19 and will be removed in 0.21. Use min_impurity_decrease instead.
min_impurity_decrease : float, optional (default=0.)
A node will be split if this split induces a decrease of the impurity greater than or equal to this value.
The weighted impurity decrease equation is the following:
N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)
where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.
N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.
New in version 0.19.
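For intuition, here is a small worked sketch of that formula; all the numbers below are made up:

def weighted_impurity_decrease(N, N_t, N_t_L, N_t_R,
                               impurity, left_impurity, right_impurity):
    # The quantity that is compared against min_impurity_decrease.
    return (N_t / float(N)) * (impurity
                               - (N_t_R / float(N_t)) * right_impurity
                               - (N_t_L / float(N_t)) * left_impurity)

# A node holding 100 of 1000 samples, split into children of 60 and 40 samples.
print(weighted_impurity_decrease(N=1000, N_t=100, N_t_L=60, N_t_R=40,
                                 impurity=0.5, left_impurity=0.2, right_impurity=0.3))
# 0.1 * (0.5 - 0.4 * 0.3 - 0.6 * 0.2) = 0.026, so this split is performed only
# when min_impurity_decrease <= 0.026.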
bootstrap : boolean, optional (default=True)
Whether bootstrap samples (sampling with replacement) are used when building trees.
oob_score : bool (default=False)
Whether to use out-of-bag samples to estimate the generalization accuracy.
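Out-of-bag scoring gives a validation estimate essentially for free, using the samples each bootstrapped tree never saw during training. A minimal sketch, again on synthetic data:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# oob_score=True only makes sense together with bootstrap=True (the default).
clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf.fit(X, y)

print(clf.oob_score_)  # accuracy estimated on the out-of-bag samples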
n_jobs : integer, optional (default=1)
The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
verbose : int, optional (default=0)
Controls the verbosity of the tree building process.
warm_start : bool, optional (default=False)
When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble; otherwise, just fit a whole new forest.
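A short sketch of how warm_start is typically used, growing the forest in stages instead of refitting it from scratch; the estimator counts are arbitrary:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

clf = RandomForestClassifier(n_estimators=50, warm_start=True, random_state=0)
clf.fit(X, y)                  # fits the first 50 trees

clf.set_params(n_estimators=100)
clf.fit(X, y)                  # keeps the existing 50 trees and adds 50 more

print(len(clf.estimators_))    # 100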
class_weight : dict, list of dicts, "balanced",