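A detailed walkthrough of the Gini index calculation for decision trees
I recently read an article on how the Gini index is calculated for decision trees. It is detailed and complete,
so I am saving it here for reference.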
An algorithm can be transparent only if its decisions can be read and understood clearly by people. Even though deep learning is the superstar of machine learning nowadays, it is an opaque algorithm and we do not know the reasons behind its decisions. Decision tree algorithms therefore remain popular because they produce transparent decisions. ID3 uses information gain whereas C4.5 uses gain ratio for splitting. CART is an alternative decision tree building algorithm. It can handle both classification and regression tasks. For classification tasks, it uses a metric named the Gini index to create decision points. We will walk through a step-by-step CART decision tree example by hand, from scratch.
We will work on the same dataset as in the ID3 example. There are 14 instances of golf playing decisions based on outlook, temperature, humidity and wind factors.
Day  Outlook   Temp.  Humidity  Wind    Decision
1    Sunny     Hot    High      Weak    No
2    Sunny     Hot    High      Strong  No
3    Overcast  Hot    High      Weak    Yes
4    Rain      Mild   High      Weak    Yes
5    Rain      Cool   Normal    Weak    Yes
6    Rain      Cool   Normal    Strong  No
7    Overcast  Cool   Normal    Strong  Yes
8    Sunny     Mild   High      Weak    No
9    Sunny     Cool   Normal    Weak    Yes
10   Rain      Mild   Normal    Weak    Yes
11   Sunny     Mild   Normal    Strong  Yes
12   Overcast  Mild   High      Strong  Yes
13   Overcast  Hot    Normal    Weak    Yes
14   Rain      Mild   High      Strong  No
Gini index
The Gini index is the metric CART uses for classification tasks. It is one minus the sum of the squared probabilities of each class. We can formulate it as shown below.
Gini = 1 – Σ (Pi)^2 for i = 1 to number of classes
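The formula above can be sketched in a few lines of Python. This is only an illustration: the function name gini_impurity and the choice to pass raw class counts are assumptions made here, not part of the original article.

```python
def gini_impurity(class_counts):
    """Return 1 - sum(p_i^2) for a list of per-class instance counts."""
    total = sum(class_counts)
    if total == 0:
        # Assumption for this sketch: an empty node is treated as pure.
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


# Two nodes from the golf dataset, given as [yes, no] counts.
print(round(gini_impurity([2, 3]), 3))  # Outlook = Sunny
print(round(gini_impurity([4, 0]), 3))  # Outlook = Overcast
```

A node containing a single class scores 0, and a two-class node split evenly scores 0.5, which is the maximum for two classes.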
Outlook
Outlook is a nominal feature. It can be sunny, overcast or rain. I will summarize the final decisions for outlook feature.
Outlook   Yes  No  Number of instances
Sunny     2    3   5
Overcast  4    0   4
Rain      3    2   5
Gini(Outlook=Sunny) = 1 – (2/5)^2 – (3/5)^2 = 1 – 0.16 – 0.36 = 0.48
Gini(Outlook=Overcast) = 1 – (4/4)^2 – (0/4)^2 = 0
Gini(Outlook=Rain) = 1 – (3/5)^2 – (2/5)^2 = 1 – 0.36 – 0.16 = 0.48
Then, we will calculate weighted sum of gini indexes for outlook feature.
Gini(Outlook) = (5/14) x 0.48 + (4/14) x 0 + (5/14) x 0.48 = 0.171 + 0 + 0.171 = 0.342
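The weighted sum can be checked with a short script. This is a minimal sketch: the helper names and the dictionary layout (feature value mapped to [yes, no] counts) are illustrative choices, not something the article prescribes.

```python
def gini_impurity(class_counts):
    """Return 1 - sum(p_i^2) for a list of per-class instance counts."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


def weighted_gini(counts_by_value):
    """Weight each value's Gini index by its share of the instances."""
    n = sum(sum(counts) for counts in counts_by_value.values())
    return sum(
        sum(counts) / n * gini_impurity(counts)
        for counts in counts_by_value.values()
    )


# [yes, no] counts per outlook value, taken from the summary table.
outlook = {"Sunny": [2, 3], "Overcast": [4, 0], "Rain": [3, 2]}
print(f"{weighted_gini(outlook):.4f}")
```

At full precision the result is about 0.3429. The hand calculation reports 0.342 because each intermediate term was cut to three decimals before summing.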
Temperature
Similarly, temperature is a nominal feature and it could have 3 different values: Cool, Hot and Mild. Let’s summarize decisions for temperature feature.
Temperature  Yes  No  Number of instances
Hot          2    2   4
Cool         3    1   4
Mild         4    2   6
Gini(Temp=Hot) = 1 – (2/4)^2 – (2/4)^2 = 0.5
Gini(Temp=Cool) = 1 – (3/4)^2 – (1/4)^2 = 1 – 0.5625 – 0.0625 = 0.375
Gini(Temp=Mild) = 1 – (4/6)^2 – (2/6)^2 = 1 – 0.444 – 0.111 = 0.445
We’ll calculate the weighted sum of gini indexes for the temperature feature.
Gini(Temp) = (4/14) x 0.5 + (4/14) x 0.375 + (6/14) x 0.445 = 0.142 + 0.107 + 0.190 = 0.439
Humidity
Humidity is a binary feature. It can be high or normal.
Humidity  Yes  No  Number of instances
High      3    4   7
Normal    6    1   7
Gini(Humidity=High) = 1 – (3/7)^2 – (4/7)^2 = 1 – 0.184 – 0.327 = 0.489
Gini(Humidity=Normal) = 1 – (6/7)^2 – (1/7)^2 = 1 – 0.735 – 0.020 = 0.245
The weighted sum for the humidity feature is calculated next.
Gini(Humidity) = (7/14) x 0.489 + (7/14) x 0.245 = 0.367
Wind
Wind is a binary feature, similar to humidity. It can be weak or strong.
Wind    Yes  No  Number of instances
Weak    6    2   8
Strong  3    3   6
Gini(Wind=Weak) = 1 – (6/8)^2 – (2/8)^2 = 1 – 0.5625 – 0.0625 = 0.375
Gini(Wind=Strong) = 1 – (3/6)^2 – (3/6)^2 = 1 – 0.25 – 0.25 = 0.5
Gini(Wind) = (8/14) x 0.375 + (6/14) x 0.5 = 0.428
Time to decide
We’ve calculated gini index values for each feature. The winner is the outlook feature because its cost is the lowest.
Feature      Gini index
Outlook      0.342
Temperature  0.439
Humidity     0.367
Wind         0.428
We’ll put outlook decision at the top of the tree.
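To cross-check the comparison, the four scores can be recomputed directly from the 14 rows. This is a sketch under stated assumptions: the tuple layout, the constant names and the helper functions are illustrative and do not come from the article.

```python
from collections import Counter, defaultdict

FEATURES = ["Outlook", "Temp", "Humidity", "Wind"]
# Rows are (Outlook, Temp, Humidity, Wind, Decision) for days 1..14.
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]


def gini_impurity(labels):
    """Return 1 - sum(p_i^2) over the class labels in one node."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def weighted_gini(rows, column):
    """Group decisions by the value in `column`, then weight each group's Gini."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[-1])
    return sum(len(g) / len(rows) * gini_impurity(g) for g in groups.values())


scores = {name: weighted_gini(ROWS, i) for i, name in enumerate(FEATURES)}
best_feature = min(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: {score:.4f}")
print("Root feature:", best_feature)
```

The full-precision scores can differ from the hand-computed table in the third decimal place, since the hand calculation cuts intermediate terms to three decimals. The ranking is unaffected, and outlook still has the lowest score.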
First decision would be outlook feature
You might notice that the sub dataset in the overcast leaf has only yes decisions. This means that the overcast leaf is finished.
Tree is over for overcast outlook leaf
We will apply the same principles to the remaining sub datasets in the following steps.
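The stopping rule used for the overcast leaf (every instance has the same decision, so the Gini index is 0) can be expressed as a small check. The function name is_pure is a hypothetical helper introduced for this sketch.

```python
def is_pure(decisions):
    """A node is finished when all of its instances share one decision."""
    return len(set(decisions)) <= 1


overcast_decisions = ["Yes", "Yes", "Yes", "Yes"]   # days 3, 7, 12, 13
sunny_decisions = ["No", "No", "No", "Yes", "Yes"]  # days 1, 2, 8, 9, 11

print(is_pure(overcast_decisions))  # no further split needed
print(is_pure(sunny_decisions))     # still mixed, keep splitting
```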
Focus on the sub dataset for sunny outlook. We need to find the gini index scores for the temperature, humidity and wind features respectively.
Day  Outlook  Temp.  Humidity  Wind    Decision
1    Sunny    Hot    High      Weak    No
2    Sunny    Hot    High      Strong  No
8    Sunny    Mild   High      Weak    No
9    Sunny    Cool   Normal    Weak    Yes
11   Sunny    Mild   Normal    Strong  Yes
Gini of temperature for sunny outlook
Temperature  Yes  No  Number of instances
Hot          0    2   2
Cool         1    0   1
Mild         1    1   2
Gini(Outlook=Sunny and Temp.=Hot) = 1 – (0/2)^2 – (2/2)^2 = 0
Gini(Outlook=Sunny and Temp.=Cool) = 1 – (1/1)^2 – (0/1)^2 = 0
Gini(Outlook=Sunny and Temp.=Mild) = 1 – (1/2)^2 – (1/2)^2 = 1 – 0.25 – 0.25 = 0.5
Gini(Outlook=Sunny and Temp.) = (2/5) x 0 + (1/5) x 0 + (2/5) x 0.5 = 0.2
Gini of humidity for sunny outlook
Humidity  Yes  No  Number of instances
High      0    3   3
Normal    2    0   2
Gini(Outlook=Sunny and Humidity=High) = 1 – (0/3)^2 – (3/3)^2 = 0
Gini(Outlook=Sunny and Humidity=Normal) = 1 – (2/2)^2 – (0/2)^2 = 0
Gini(Outlook=Sunny and Humidity) = (3/5) x 0 + (2/5) x 0 = 0
Gini of wind for sunny outlook
Wind    Yes  No  Number of instances
Weak    1    2   3
Strong  1    1   2
Gini(Outlook=Sunny and Wind=Weak) = 1 – (1/3)^2 – (2/3)^2 = 1 – 0.111 – 0.444 = 0.444
Gini(Outlook=Sunny and Wind=Strong) = 1 – (1/2)^2 – (1/2)^2 = 1 – 0.25 – 0.25 = 0.5
Gini(Outlook=Sunny and Wind) = (3/5) x 0.444 + (2/5) x 0.5 = 0.266 + 0.2 = 0.466
Decision for sunny outlook
We’ve calculated gini index scores for each feature when outlook is sunny. The winner is humidity because it has the lowest value.
Feature      Gini index
Temperature  0.2
Humidity     0
Wind         0.466
We’ll put humidity check at the extension of sunny outlook.
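The same scoring can be applied to just the five sunny rows to confirm the choice of humidity. This is a sketch: rank_features and the column layout are assumptions made for illustration.

```python
from collections import Counter, defaultdict

CANDIDATES = ["Temp", "Humidity", "Wind"]
# Sunny-outlook rows only: (Temp, Humidity, Wind, Decision).
SUNNY_ROWS = [
    ("Hot", "High", "Weak", "No"),        # day 1
    ("Hot", "High", "Strong", "No"),      # day 2
    ("Mild", "High", "Weak", "No"),       # day 8
    ("Cool", "Normal", "Weak", "Yes"),    # day 9
    ("Mild", "Normal", "Strong", "Yes"),  # day 11
]


def gini_impurity(labels):
    """Return 1 - sum(p_i^2) over the class labels in one node."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def weighted_gini(rows, column):
    """Group decisions by the value in `column`, then weight each group's Gini."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[-1])
    return sum(len(g) / len(rows) * gini_impurity(g) for g in groups.values())


def rank_features(rows, names):
    """Return (score, feature) pairs sorted from lowest to highest Gini."""
    return sorted((weighted_gini(rows, i), name) for i, name in enumerate(names))


ranking = rank_features(SUNNY_ROWS, CANDIDATES)
for score, name in ranking:
    print(f"{name}: {score:.3f}")
```

The first entry of ranking is the feature to split on next. For the sunny rows it is humidity, with a score of 0.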
Sub datasets for high and normal humidity
As seen, the decision is always no for high humidity and sunny outlook. On the other hand, the decision is always yes for normal humidity and sunny outlook. This branch is finished.
Decisions for high and normal humidity
Now, we need to focus on rain outlook.
Rain outlook
Day  Outlook  Temp.  Humidity  Wind    Decision
4    Rain     Mild   High      Weak    Yes
5    Rain     Cool   Normal    Weak    Yes
6    Rain     Cool   Normal    Strong  No
10   Rain     Mild   Normal    Weak    Yes
14   Rain     Mild   High      Strong  No
We’ll calculate gini index scores for temperature, humidity and wind features when outlook is rain.
Gini of temperature for rain outlook
Temperature  Yes  No  Number of instances
Cool         1    1   2
Mild         2    1   3
Gini(Outlook=Rain and Temp.=Cool) = 1 – (1/2)^2 – (1/2)^2 = 0.5
Gini(Outlook=Rain and Temp.=Mild) = 1 – (2/3)^2 – (1/3)^2 = 0.444
Gini(Outlook=Rain and Temp.) = (2/5) x 0.5 + (3/5) x 0.444 = 0.466
Gini of humidity for rain outlook
Humidity  Yes  No  Number of instances
High      1    1   2
Normal    2    1   3
Gini(Outlook=Rain and Humidity=High) = 1 – (1/2)^2 – (1/2)^2 = 0.5
Gini(Outlook=Rain and Humidity=Normal) = 1 – (2/3)^2 – (1/3)^2 = 0.444
Gini(Outlook=Rain and Humidity) = (2/5) x 0.5 + (3/5) x 0.444 = 0.466
Gini of wind for rain outlook
Wind    Yes  No  Number of instances
Weak    3    0   3
Strong  0    2   2
Gini(Outlook=Rain and Wind=Weak) = 1 – (3/3)^2 – (0/3)^2 = 0
Gini(Outlook=Rain and Wind=Strong) = 1 – (0/2)^2 – (2/2)^2 = 0
Gini(Outlook=Rain and Wind) = (3/5) x 0 + (2/5) x 0 = 0
Decision for rain outlook
The winner for rain outlook is the wind feature because it has the lowest gini index score among the features.
Feature      Gini index
Temperature  0.466
Humidity     0.466
Wind         0
Put the wind feature on the rain outlook branch and look at the new sub datasets.
Sub datasets for weak and strong wind and rain outlook
As seen, the decision is always yes when wind is weak. On the other hand, the decision is always no when wind is strong. This means that this branch is finished.
Final form of the decision tree built by CART algorithm
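The whole procedure can be sketched as a short recursive function that reproduces the tree derived above. It follows the walkthrough's convention of one branch per feature value. CART implementations in libraries such as scikit-learn use binary splits instead, so this is an illustration of the hand calculation rather than a reference CART implementation. All names here (build_tree, split_rows, the nested-dict tree format) are assumptions made for this sketch.

```python
from collections import Counter, defaultdict

FEATURES = ["Outlook", "Temp", "Humidity", "Wind"]
# Rows are (Outlook, Temp, Humidity, Wind, Decision) for days 1..14.
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]


def gini_impurity(labels):
    """Return 1 - sum(p_i^2) over the class labels in one node."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def split_rows(rows, column):
    """Partition rows by the value they hold in `column`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    return groups


def weighted_gini(rows, column):
    """Weighted sum of the Gini index of each partition."""
    return sum(
        len(group) / len(rows) * gini_impurity([r[-1] for r in group])
        for group in split_rows(rows, column).values()
    )


def build_tree(rows, columns):
    """Return a decision label for a leaf, or {feature: {value: subtree}}."""
    labels = [row[-1] for row in rows]
    if len(set(labels)) == 1 or not columns:
        # Pure node, or no features left: fall back to the majority decision.
        return Counter(labels).most_common(1)[0][0]
    best = min(columns, key=lambda c: weighted_gini(rows, c))
    remaining = [c for c in columns if c != best]
    return {
        FEATURES[best]: {
            value: build_tree(subset, remaining)
            for value, subset in split_rows(rows, best).items()
        }
    }


tree = build_tree(ROWS, list(range(len(FEATURES))))
print(tree)
```

The printed structure has outlook at the root, a humidity check under the sunny branch, a direct yes under overcast, and a wind check under the rain branch.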