基于ID3算法的决策树实现
本文最后更新于:1 年前
Decision Tree
name: He Xiang (贺翔)
date: 2018-11-10
Experimental enviroment
OS: Windows 10
IDE: Visual Studio 2017 & vim 8.1
Requirement
A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute (e.g whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (decision taken after computing all attributes). The paths from root to leaf represent classification rules.
In Decision Tree the major challenge is the identification of the attribute for the root node in each level. This processis known as attribute selection. We have a popular attribute selection measure, Information Gain, which is defined based on entropy. Entropy is the measure of uncertainty of a random variable; it characterizes the impurity of an arbitrary collection of examples. As we use a node in decision tree to partition the training instances into smaller subsets, the entropy changes. Information gain is a measure of this change in entropy. The best attribute should maximize the information gain given a set of training instances.
The table below specifies the conditions for whether to play badmintion or not. You are required to construct a decision tree based on the information in the table.
| No. | Outlook | Temperature | Humidity | Windy | Play? |
|---|---|---|---|---|---|
| 1 | Sunny | Hot | High | False | No |
| 2 | Sunny | Hot | High | True | No |
| 3 | Overcast | Hot | High | False | Yes |
| 4 | Rain | Mild | High | False | Yes |
| 5 | Rain | Cool | Normal | False | Yes |
| 6 | Rain | Cool | Normal | True | No |
| 7 | Overcast | Cool | Normal | True | Yes |
| 8 | Sunny | Mild | High | False | No |
| 9 | Sunny | Cool | Normal | False | Yes |
| 10 | Rain | Mild | Normal | False | Yes |
| 11 | Sunny | Mild | Normal | True | Yes |
| 12 | Overcast | Mild | High | True | Yes |
| 13 | Overcast | Hot | Normal | False | Yes |
| 14 | Rain | Mild | High | True | No |
Algorithm
The decision tree build is ruled by the ID3 algorithm(Iterative Dichotomiser 3). And its main theory is Information theory.
description
If X is a discrete random
variable and its probability distribution is p(x)=P(X=x),x∈X, then the information
entropy of X is H(X) H(X)=−∑ip(xi)log2p(xi)
Support that condition that an event yj happens, the conditional probability
of the random event xi occurring
is p(xi∣yj). Under the
condition of yj, the uncertainty
about X is defined as the posterior
entropy H(X∣yj). H(X|yj)=−∑ip(xi|yj)log2(p(xi|yj))
example
For the training table. Consider attribute Outlook={Sunny,Overcast,Rainy}.
- H(Play∣Sunny)=−25⋅log225−35⋅log235=0.970951
- H(Play∣Overcast)=−44⋅log244−0⋅log20=0
- H(Play∣Rainy)=−35⋅log235−23⋅log223=0.970951
So the conditional entropy of Outlook is H(Play∣Outlook)=514⋅0.970951+414⋅0+514⋅0.970951=0.693536. and then the information gain from Outlook to Play is IG(Play,Outlook)=H(Play)−H(Play∣Outlook)=0.24675.
For all attributes are the same:
IG(Play,Outlook)=0.24675
IG(Play,Temperature)=0.0292226
IG(Play,Humidity)=0.151836
IG(Play,Windy)=0.048127
Alfter get all attributes' IG, then choose the maximum IG. Here is IG(Play,Outlook), For the decision tree current node, make Outlook as the node's attribute and make all its value be the node's child nodes' index.
Divide the from when process the child node, and recursively build the tree.
class definition
1 | |
Analysis and discussion
ID3 algorithm is a kind of greedy algorithm, when set a node's attribute, alway choose the IG max one.
this algorithm is easy to train data, but there are also same disadvantages. First, it only can process descrete data, if data is continuous, it does not work. Second, this algorithm is very reply on the training data, so it is easy to build a insufficient decision tree. And at most time, there may be some attribute won't support any effective information, in this condition, the tree be built may be a bit unsuitable.
For me to implement this algorithm, I use a lot of STL like string, map and vector, may waste a lot of space and make the time efficiency lower.