Carol's Blog

A Decision Tree Implementation Based on the ID3 Algorithm

Decision Tree


Requirement

A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute (e.g., whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (the decision taken after computing all attributes). The paths from root to leaf represent classification rules.

In a decision tree, the major challenge is the identification of the attribute for the root node at each level. This process is known as attribute selection. We have a popular attribute selection measure, Information Gain, which is defined based on entropy. Entropy is the measure of uncertainty of a random variable; it characterizes the impurity of an arbitrary collection of examples. As we use a node in a decision tree to partition the training instances into smaller subsets, the entropy changes. Information gain is a measure of this change in entropy. The best attribute should maximize the information gain over the given set of training instances.

The table below specifies the conditions for whether to play badminton or not. You are required to construct a decision tree based on the information in the table.

| No. | Outlook  | Temperature | Humidity | Windy | Play? |
|-----|----------|-------------|----------|-------|-------|
| 1   | Sunny    | Hot         | High     | False | No    |
| 2   | Sunny    | Hot         | High     | True  | No    |
| 3   | Overcast | Hot         | High     | False | Yes   |
| 4   | Rain     | Mild        | High     | False | Yes   |
| 5   | Rain     | Cool        | Normal   | False | Yes   |
| 6   | Rain     | Cool        | Normal   | True  | No    |
| 7   | Overcast | Cool        | Normal   | True  | Yes   |
| 8   | Sunny    | Mild        | High     | False | No    |
| 9   | Sunny    | Cool        | Normal   | False | Yes   |
| 10  | Rain     | Mild        | Normal   | False | Yes   |
| 11  | Sunny    | Mild        | Normal   | True  | Yes   |
| 12  | Overcast | Mild        | High     | True  | Yes   |
| 13  | Overcast | Hot         | Normal   | False | Yes   |
| 14  | Rain     | Mild        | High     | True  | No    |

Algorithm

The decision tree is built with the ID3 algorithm (Iterative Dichotomiser 3), which is grounded in information theory.

description

If $X$ is a discrete random variable and its probability distribution is $p(x)=P(X=x),\ x\in X$, then the information entropy of $X$ is $H(X)$:

$$H(X) = -\sum_{i} p(x_i)\log_2 p(x_i)$$

$H(X)$ is also called the prior entropy of $X$.
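
As a sanity check, the formula can be evaluated directly. This is a minimal standalone sketch, not part of the classes shown later:

```cpp
#include <cmath>
#include <vector>

// Shannon entropy H(X) in bits, given the distribution p(x_i).
// Terms with p = 0 are skipped, following the convention 0 * log2(0) = 0.
double entropy(const std::vector<double>& p) {
    double h = 0.0;
    for (double pi : p)
        if (pi > 0.0)
            h -= pi * std::log2(pi);
    return h;
}
```

A fair coin gives $H = 1$ bit, and a certain event gives $H = 0$; the full play/no-play column of the table (9 Yes, 5 No) gives roughly 0.940 bits.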

Suppose that, under the condition that an event $y_j$ happens, the conditional probability of the random event $x_i$ occurring is $p(x_i\mid y_j)$. Under the condition of $y_j$, the uncertainty about $X$ is defined as the posterior entropy $H(X\mid y_j)$:

$$H(X\mid y_j) = -\sum_{i} p(x_i\mid y_j)\log_2 p(x_i\mid y_j)$$

For attribute $Y$ (the set of all values $y_j$), the conditional entropy of set $X$ is $H(X\mid Y)$:

$$H(X\mid Y) = \sum_{j} p(y_j)H(X\mid y_j) = -\sum_{j} p(y_j)\sum_{i} p(x_i\mid y_j)\log_2 p(x_i\mid y_j)$$

With the prior entropy $H(X)$ and the conditional entropy $H(X\mid Y)$, we can define the information gain $IG(X,Y)$, which measures how much information about $X$ is gained from attribute $Y$:

$$IG(X,Y) = H(X) - H(X\mid Y)$$

example

Consider the attribute $Outlook=\{Sunny, Overcast, Rain\}$ from the training table.

The table contains 9 Yes and 5 No instances, so the prior entropy is $H(Play) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.940286$. Among the 5 Sunny instances there are 2 Yes and 3 No ($H = 0.970951$), all 4 Overcast instances are Yes ($H = 0$), and among the 5 Rain instances there are 3 Yes and 2 No ($H = 0.970951$). So the conditional entropy of $Outlook$ is $H(Play\mid Outlook)=\frac{5}{14}\cdot 0.970951+\frac{4}{14}\cdot 0+\frac{5}{14}\cdot 0.970951=0.693536$, and the information gain from $Outlook$ to $Play$ is $IG(Play,Outlook)=H(Play)-H(Play\mid Outlook)=0.24675$.
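
The number above can be reproduced with a small helper that computes $IG(X,Y)$ directly from (attribute value, class label) pairs. This is a sketch for checking the arithmetic, not the author's implementation:

```cpp
#include <cmath>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Entropy (bits) over a map of label -> count, where n is the total count.
static double entropyOf(const std::map<std::string, int>& cnt, int n) {
    double h = 0.0;
    for (const auto& kv : cnt) {
        double p = double(kv.second) / n;
        if (p > 0.0) h -= p * std::log2(p);
    }
    return h;
}

// IG(X, Y) = H(X) - sum_j p(y_j) H(X | y_j), computed from (y, x) pairs.
double informationGain(const std::vector<std::pair<std::string, std::string>>& yx) {
    std::map<std::string, int> labels;                           // x -> count
    std::map<std::string, std::map<std::string, int>> byValue;   // y -> (x -> count)
    for (const auto& s : yx) {
        ++labels[s.second];
        ++byValue[s.first][s.second];
    }
    int n = int(yx.size());
    double cond = 0.0;
    for (const auto& kv : byValue) {
        int nj = 0;
        for (const auto& lc : kv.second) nj += lc.second;
        cond += double(nj) / n * entropyOf(kv.second, nj);       // p(y_j) * H(X|y_j)
    }
    return entropyOf(labels, n) - cond;
}
```

Feeding it the Outlook column paired with the Play column of the table yields approximately 0.24675, matching the hand computation.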

The remaining attributes are handled in the same way:

class definition
  #include <fstream>
  #include <map>
  #include <string>
  #include <vector>

  /**
   * @brief Stores one attribute (column) of the sample table.
   */
  class Attr {
  public:
      // index of this column in the table
      int colIndex;
      // number of distinct values this attribute takes
      int typeNum;
      // conditional entropy and information gain of this attribute
      double conEntroy, InforGain;
      std::string attribute;
      std::vector<std::string> attriValue;
      // posterior entropy of each attribute value, initialized to 0.0
      std::map<std::string, double> typeEntropy;
      // occurrence count of each attribute value, counted from 1
      std::map<std::string, unsigned int> typeCnt;
      // attribute value -> (class label -> count)
      std::map<std::string, std::map<std::string, unsigned int> > attriValue_to_result;

      Attr(int col = 0, std::string str = ""):
          colIndex(col), typeNum(0), conEntroy(0), InforGain(0), attribute(str) { }
      ~Attr() { }
  };

  /**
   * @brief Defines the information kept in one tree node.
   */
  class TreeNode {
  public:
      // the attribute this node tests
      std::string attri;
      // attribute value -> child node
      std::map<std::string, TreeNode *> attri_To_Children;
      // the sub-table of attributes still available at this node
      std::vector<Attr> Table;
      int rowNum, colNum;
      double InforGain;
      // index of the attribute with the largest information gain
      int biggestIGidx;
      bool isLeaf;

      TreeNode(): rowNum(0), colNum(0), InforGain(0.0),
          biggestIGidx(0), isLeaf(true) { }
      ~TreeNode() { }

      void chooseIG();
      void clearCol();
  };

  /**
   * @brief Defines the decision tree.
   */
  class DecisionTree {
  public:
      DecisionTree() { root = new TreeNode; }
      ~DecisionTree() { removeAll(root); }

      void BuildTree() { builder(root); }
      void ReadTable(std::ifstream &);
      void testTree() { testTreeHelp(root); }
      void printTree() { printTreeHelp(root, "NULL"); }
  private:
      void removeAll(TreeNode *);
      void printSample(TreeNode *);
      void computeEntropyFir(TreeNode *);
      void computeEntropySec(TreeNode *);
      void builder(TreeNode *);
      void processSample(TreeNode *);
      void divideChild(TreeNode *);
      void testTreeHelp(TreeNode *);
      void printTreeHelp(TreeNode *, std::string);

      TreeNode *root;
  };
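
The classes above only declare the interface; the core recursion (what `builder()` does) can be sketched standalone. The following is a minimal, hypothetical re-implementation over a plain table where each row is a `vector<string>` with the class label in the last column; it is not the code behind the declarations above:

```cpp
#include <cmath>
#include <map>
#include <memory>
#include <string>
#include <vector>

using Row = std::vector<std::string>;   // attribute values ..., class label last

struct Node {
    int splitCol = -1;                  // -1 means this node is a leaf
    std::string label;                  // majority class label at this node
    std::map<std::string, std::unique_ptr<Node>> children;
};

// Entropy (bits) of the class-label column of `rows`.
static double labelEntropy(const std::vector<Row>& rows) {
    std::map<std::string, int> cnt;
    for (const Row& r : rows) ++cnt[r.back()];
    double h = 0.0;
    for (const auto& kv : cnt) {
        double p = double(kv.second) / rows.size();
        h -= p * std::log2(p);
    }
    return h;
}

// IG = H(label) - sum_j p(v_j) * H(label | column c = v_j).
static double gain(const std::vector<Row>& rows, int c) {
    std::map<std::string, std::vector<Row>> parts;
    for (const Row& r : rows) parts[r[c]].push_back(r);
    double cond = 0.0;
    for (const auto& kv : parts)
        cond += double(kv.second.size()) / rows.size() * labelEntropy(kv.second);
    return labelEntropy(rows) - cond;
}

// ID3 recursion: stop when the labels are pure or no attribute is left,
// otherwise split on the attribute with the maximal information gain.
std::unique_ptr<Node> build(const std::vector<Row>& rows,
                            const std::vector<int>& cols) {
    auto node = std::make_unique<Node>();
    std::map<std::string, int> cnt;
    for (const Row& r : rows) ++cnt[r.back()];
    int bestCnt = 0;
    for (const auto& kv : cnt)
        if (kv.second > bestCnt) { bestCnt = kv.second; node->label = kv.first; }
    if (cnt.size() == 1 || cols.empty()) return node;          // leaf

    int best = cols[0];
    for (int c : cols)
        if (gain(rows, c) > gain(rows, best)) best = c;        // greedy choice
    node->splitCol = best;

    std::vector<int> rest;                                     // drop the used column
    for (int c : cols) if (c != best) rest.push_back(c);
    std::map<std::string, std::vector<Row>> parts;
    for (const Row& r : rows) parts[r[best]].push_back(r);
    for (const auto& kv : parts)
        node->children[kv.first] = build(kv.second, rest);
    return node;
}
```

On the badminton table this sketch splits the root on Outlook, makes the Overcast branch a Yes leaf, and then splits the Sunny branch on Humidity and the Rain branch on Windy, which matches the hand computation in the example section.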

Analysis and discussion

The ID3 algorithm is a greedy algorithm: when setting a node's attribute, it always chooses the one with the maximum information gain.

This algorithm is easy to train, but it also has some disadvantages. First, it can only process discrete data; it does not work directly on continuous attributes. Second, it relies heavily on the training data, so it can easily build an inadequate decision tree. Moreover, some attributes may not contribute any useful information, in which case the tree that is built may be somewhat unsuitable.

In my implementation I use many STL containers such as string, map, and vector, which may waste space and lower the time efficiency.

