Data Mining and Machine Learning: Outlier Detection with Isolation Forest
(Isolation Forest)
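1. Overview
Put simply, an isolation forest uses binary trees and random split values to divide the data into left and right branches. Normal points end up deep in the tree with plenty of company, while anomalies are split off early and left on their own.
Some anomaly detection methods work by describing the normal samples. Isolation forest does not profile the normal points; it isolates the anomalous ones.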
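2. Background concepts
Binary search tree, forest, random forest, harmonic series
(Binary search tree, also called binary sort tree; Binary Search Tree, BST)
The value at the root is greater than the value of every node in its left subtree and less than the value of every node in its right subtree. The same holds for every subtree.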
Example of the harmonic series: 1 + 1/2 + 1/3 + ... + 1/n + ...
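The partial sums of the harmonic series (the harmonic numbers) reappear later when path lengths are normalized, so a small numeric check may help. The sketch below compares the exact partial sum with the logarithmic estimate; the function names `harmonic` and `harmonic_approx` are made up for this illustration.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler's constant

def harmonic(n: int) -> float:
    """Exact n-th harmonic number H(n) = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / k for k in range(1, n + 1))

def harmonic_approx(n: int) -> float:
    """Estimate H(n) as ln(n) + Euler's constant."""
    return math.log(n) + EULER_GAMMA

# The gap between the exact value and the estimate shrinks as n grows.
for n in (1, 10, 100, 1000):
    print(n, round(harmonic(n), 4), round(harmonic_approx(n), 4))
```

The estimate is poor for very small n, which is one reason implementations often treat tiny sample sizes as special cases.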
3. Theoretical definitions
The English passages are quoted from the original paper.
Definition: Isolation Tree.
Let T be a node of an isolation tree.
If T is a node of an isolation tree, it has the following properties.
T is either an external-node with no child, or an internal-node with one test and exactly two daughter nodes (Tl,Tr).
T either has no children (T is a leaf, i.e. an external node) or has exactly two children.
A test consists of an attribute q and a split value p such that the test q < p divides data points into Tl and Tr.
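A test consists of an attribute q and a split value p; comparing each point's value of q against p sends the point to the left or the right child.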
Given a sample of data X = {x1, ..., xn} of n instances from a d-variate distribution, to build an isolation tree (iTree).
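To build an isolation tree (iTree for short), we are given a dataset X with n instances and d dimensions (that is, d attributes).
We recursively divide X by randomly selecting an attribute q and a split value p, until either:
X is then partitioned recursively: at each step an attribute q and a split value p are chosen at random. Since this is a recursion, it needs termination conditions, which are: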
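(1) the tree reaches a height limit
The tree has reached its maximum allowed height.
(2) |X| = 1
The current partition of X contains only one point.
(3) all data in X have the same values
All points in the current partition of X have identical attribute values.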
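An iTree is a proper binary tree, where each node in the tree has exactly zero or two daughter nodes.
A well-formed (fully constructed) isolation tree is a binary tree in which every node has either zero or two children.
This should not be confused with a complete binary tree. In a complete binary tree only the last level may be partially filled, and it is filled from left to right, so any missing nodes are on the right of the last level. The tree described here only constrains the number of children of each node.
Assuming all instances are distinct, each instance is isolated to an external node when an iTree is fully grown, in which case the number of external nodes is n and the number of internal nodes is n-1; the total number of nodes of an iTree is 2n-1; and thus the memory requirement is bounded and only grows linearly with n.
Assuming all instances are distinct, a fully grown iTree isolates each instance in its own external node. There are then n external nodes and n-1 internal nodes, 2n-1 nodes in total, so the memory requirement is bounded and grows only linearly with n.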
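The recursive construction and the 2n-1 node count can be sketched as follows. This is an illustrative sketch, not the paper's reference code: the dict-based node layout, the names `build_itree` and `count_nodes`, and the choice to pick q only among attributes that still vary in the current partition are all assumptions made here.

```python
import random

def build_itree(X, height=0, height_limit=None, rng=random):
    """Recursively build an isolation tree over a list of equal-length tuples."""
    # Termination: height limit reached, a single point, or all points identical.
    at_limit = height_limit is not None and height >= height_limit
    if at_limit or len(X) <= 1 or all(x == X[0] for x in X):
        return {"size": len(X)}  # external node

    # Assumption: choose q only among attributes that still have a spread,
    # so that every test produces two non-empty children.
    dims = [q for q in range(len(X[0]))
            if min(x[q] for x in X) < max(x[q] for x in X)]
    q = rng.choice(dims)
    lo = min(x[q] for x in X)
    hi = max(x[q] for x in X)
    p = rng.uniform(lo, hi)
    while p <= lo:  # p == lo would leave the left child empty
        p = rng.uniform(lo, hi)

    left = [x for x in X if x[q] < p]    # points passing the test q < p
    right = [x for x in X if x[q] >= p]
    return {"q": q, "p": p,
            "left": build_itree(left, height + 1, height_limit, rng),
            "right": build_itree(right, height + 1, height_limit, rng)}

def count_nodes(node):
    """Return (number of external nodes, number of internal nodes)."""
    if "left" not in node:
        return 1, 0
    left_ext, left_int = count_nodes(node["left"])
    right_ext, right_int = count_nodes(node["right"])
    return left_ext + right_ext, left_int + right_int + 1

# 50 distinct 2-D points, fully grown tree (no height limit).
X = [(i, (7 * i) % 50) for i in range(50)]
tree = build_itree(X, rng=random.Random(0))
external, internal = count_nodes(tree)
print(external, internal, external + internal)  # 50 49 99
```

Because every internal node has exactly two children and every distinct point ends in its own leaf, the counts come out as n, n-1 and 2n-1 regardless of which random splits are drawn.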
We define path length and anomaly score as follows.
Path length and anomaly score
Definition: Path Length
h(x) of a point x is measured by the number of edges x traverses an iTree from the root node until the traversal is terminated at an external node.
The path length h(x) of a point x is the number of edges passed from the root node to the external node at which x ends up.
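To make the edge counting concrete, the sketch below walks a small hand-built tree. The dict layout (`q`, `p`, `left`, `right`) and the split values are hypothetical and chosen only so that the result can be checked by hand.

```python
def path_length(x, node, edges=0):
    """Count the edges from the root to the external node where x terminates."""
    if "left" not in node:  # external node: the traversal stops here
        return edges
    child = node["left"] if x[node["q"]] < node["p"] else node["right"]
    return path_length(x, child, edges + 1)

# Hand-built iTree over the 1-D points 1, 2, 3 and 100 (attribute index 0).
leaf = {}
tree = {"q": 0, "p": 50.0,
        "left": {"q": 0, "p": 2.5,
                 "left": {"q": 0, "p": 1.5, "left": leaf, "right": leaf},
                 "right": leaf},
        "right": leaf}

print(path_length((100,), tree))  # 1: split off by the very first test
print(path_length((1,), tree))    # 3: needs three tests to be isolated
```

The far-away value 100 is isolated after a single edge, while the clustered values need more tests, which is the intuition behind using h(x) as a measure of abnormality.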
An anomaly score is required for any anomaly detection method. The difficulty in deriving such a score from h(x) is that while the maximum possible height of iTree grows in the order of n, the average height grows in the order of log n.
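Every anomaly detection method needs an anomaly score. The difficulty in deriving one from h(x) is that the maximum possible height of an iTree grows on the order of n, while the average height grows on the order of log n.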
Normalization of h(x) by any of the above terms is either not bounded or cannot be directly compared.
Normalizing h(x) by either of these quantities gives a score that is either unbounded or not directly comparable.
Since iTrees have an equivalent structure to Binary Search Tree or BST (see Table 1), the estimation of average h(x) for external node terminations is the same as the unsuccessful search in BST.
Because an iTree has the same structure as a binary search tree (BST), the average h(x) for terminations at external nodes can be estimated in the same way as an unsuccessful search in a BST.
The table below lists the corresponding cases in which h(x) is equivalent.

iTree                        BST
Proper binary trees          Proper binary trees
External node termination    Unsuccessful search
Not applicable               Successful search
We borrow the analysis from BST to estimate the average path length of iTree.
We use the BST analysis to estimate the average path length of an iTree.
The average path length is:

$$c(n) = 2H(n-1) - \frac{2(n-1)}{n}$$

Note that this is the formula as revised in 2012.
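A minimal sketch of the normalizing term in Python, assuming the form c(n) = 2H(n-1) - 2(n-1)/n with H(i) estimated by ln(i) plus Euler's constant. The special cases for n = 2 and n < 2 are an assumption borrowed from common implementations; they are not spelled out in the text above.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler's constant

def c(n: int) -> float:
    """Estimated average path length of an unsuccessful BST search over n points."""
    if n > 2:
        harmonic = math.log(n - 1) + EULER_GAMMA  # estimate of H(n - 1)
        return 2.0 * harmonic - 2.0 * (n - 1) / n
    # Assumed special cases: the logarithmic estimate is unreliable for tiny n.
    return 1.0 if n == 2 else 0.0

for n in (2, 16, 256, 4096):
    print(n, round(c(n), 3))
```

The values grow roughly like 2 ln(n), which matches the earlier remark that the average tree height grows on the order of log n.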
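where H(i) is the harmonic number and it can be estimated by ln(i) + 0.5772156649 (Euler’s constant). As c(n) is the average of h(x) given n, we use it to normalise h(x). The anomaly score s of an instance x is defined as:
Here H(i) is the harmonic number, which can be estimated by ln(i) + 0.5772156649 (Euler's constant).
Because c(n) is the average of h(x) for a given n, it is used to normalize h(x). The anomaly score s of an instance x is defined as:

$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$$

where E(h(x)) is the average of h(x) from a collection of isolation trees.
That is, E(h(x)) is the mean of h(x) over a set of isolation trees.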
· when E(h(x)) → c(n), s → 0.5;
· when E(h(x)) → 0, s → 1;
· and when E(h(x)) → n − 1, s → 0.
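The three limiting cases can be checked numerically. The sketch below assumes the score s = 2^(-E(h(x))/c(n)) and the logarithmic estimate of c(n); the sample size n = 256 and the helper names are arbitrary choices for illustration.

```python
import math

def c(n: int) -> float:
    """Estimated average path length for n points (special cases assumed for n <= 2)."""
    if n > 2:
        return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n
    return 1.0 if n == 2 else 0.0

def anomaly_score(mean_path_length: float, n: int) -> float:
    """s = 2 ** (-E(h(x)) / c(n))."""
    return 2.0 ** (-mean_path_length / c(n))

n = 256  # hypothetical sample size
for label, mean_h in (("E(h(x)) = 0", 0.0),
                      ("E(h(x)) = c(n)", c(n)),
                      ("E(h(x)) = n - 1", float(n - 1))):
    print(label, anomaly_score(mean_h, n))
```

The first two cases give exactly 1 and 0.5, and the last one is already very close to 0 for n = 256.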
s is monotonic to h(x). Figure 2 illustrates the relationship between E(h(x)) and s, and the following conditions applied where 0 < s ≤ 1 for 0 < h(x) ≤ n -1. Using the anomaly score s, we are able to make the following assessment:
s is monotonic in h(x). Figure 2 shows the relationship between E(h(x)) and s; when 0 < h(x) ≤ n - 1, we have 0 < s ≤ 1. Using the anomaly score s, we can make the following assessment:
· (a) if instances return s very close to 1, then they are definitely anomalies,
If s is very close to 1, the instance is an anomaly.
· (b) if instances have s much smaller than 0.5, then they are quite safe to be regarded as normal instances, and
If s is much smaller than 0.5, the instance can be regarded as normal.
· (c) if all the instances return s ≈ 0.5, then the entire sample does not really have any distinct anomaly.
If every instance has s approximately equal to 0.5, the sample does not really contain any distinct anomaly.
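Putting the pieces together, the following is a toy end-to-end sketch in plain Python, not a production implementation. Several details go beyond the text above and are labeled as assumptions: each tree is built on a random subsample of size `psi`, the height limit is ceil(log2(psi)), and a point that stops in an unresolved leaf of size m gets c(m) added to its path length. The synthetic data, `psi = 64` and `n_trees = 100` are arbitrary.

```python
import math
import random

def c(n):
    """Estimated average path length for n points (special cases assumed for n <= 2)."""
    if n > 2:
        return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n
    return 1.0 if n == 2 else 0.0

def build(X, height, limit, rng):
    """Build one isolation tree with a height limit."""
    if height >= limit or len(X) <= 1 or all(x == X[0] for x in X):
        return {"size": len(X)}
    dims = [q for q in range(len(X[0]))
            if min(x[q] for x in X) < max(x[q] for x in X)]
    q = rng.choice(dims)
    lo, hi = min(x[q] for x in X), max(x[q] for x in X)
    p = rng.uniform(lo, hi)
    while p <= lo:  # keep both children non-empty
        p = rng.uniform(lo, hi)
    return {"q": q, "p": p,
            "left": build([x for x in X if x[q] < p], height + 1, limit, rng),
            "right": build([x for x in X if x[q] >= p], height + 1, limit, rng)}

def path_length(x, node, edges=0):
    """Edges to the terminating leaf, plus c(size) for an unresolved leaf (assumption)."""
    if "left" not in node:
        return edges + c(node["size"])
    child = node["left"] if x[node["q"]] < node["p"] else node["right"]
    return path_length(x, child, edges + 1)

def score(x, trees, psi):
    """Anomaly score s = 2 ** (-E(h(x)) / c(psi)), averaged over all trees."""
    mean_h = sum(path_length(x, t) for t in trees) / len(trees)
    return 2.0 ** (-mean_h / c(psi))

rng = random.Random(42)
normal = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
data = normal + [(8.0, 8.0)]  # one synthetic, obvious outlier
psi, n_trees = 64, 100        # assumed subsample size and number of trees
limit = math.ceil(math.log2(psi))
trees = [build(rng.sample(data, psi), 0, limit, rng) for _ in range(n_trees)]

s_out = score((8.0, 8.0), trees, psi)
s_in = score((0.0, 0.0), trees, psi)
print(round(s_out, 3), round(s_in, 3))
```

The outlier should receive a clearly higher score than a point in the middle of the cluster, in line with assessments (a) and (b) above.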
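4. Example and code
The data comes from the original open-source project. The raw dataset has 15,000 rows and seven dimensions. The first attribute is the IP address, which is dropped before use, leaving six dimensions. To make visualization easier, the dimensionality was then reduced: the correlations between attributes were analyzed and the highly correlated attributes were reduced with PCA. The details are not covered here.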