Python implementation of the ID3 decision tree algorithm

  • 2020-06-19 10:44:42
  • OfStack

The ID3 decision tree algorithm and its Python implementation are presented below.

Main contents

Decision tree background
The decision tree building process
ID3 algorithm splitting attribute selection
ID3 algorithm flow and analysis of its advantages and disadvantages
ID3 algorithm Python code implementation

1. Decision tree background knowledge

Decision tree is one of the most important and commonly used methods in data mining, mainly used for classification and prediction. A decision tree is a representation of knowledge in which each path from the root to a leaf node corresponds to a classification rule. Decision tree algorithms were first developed on the basis of information theory; after several decades of development, the commonly used algorithms are ID3, C4.5 and CART.

2. The decision tree building process

Building a decision tree is a top-down process. The growth of the tree is a process of repeatedly partitioning the data into finer subsets, and each partition generates a node corresponding to a data subset. Starting from the root node, which contains all the data, the training set is divided into subsets according to the values of the selected splitting attribute, and a new non-leaf node is generated for each training data subset. The same procedure is repeated for the generated non-leaf nodes until a termination condition is met; the partitioning then stops and leaf nodes corresponding to the data subsets are generated, namely the required categories. After the decision tree is built, its performance is measured on a test set; if the performance is not up to standard, the decision tree needs to be improved until the expected performance indicators are achieved.
Note: The selection of the splitting attribute is the key step in building a decision tree; it determines the performance and structure of the generated tree. The criterion used to select the splitting attribute is the fundamental difference between decision tree algorithms.

3. ID3 algorithm splitting attribute selection -- information gain

Attribute selection is the core of a decision tree algorithm and plays a decisive role in the structure and performance of the tree. The ID3 algorithm selects the splitting attribute by information gain, i.e. it chooses the attribute that makes the information entropy decrease fastest: based on information theory, it selects the attribute with the highest information gain as the splitting attribute of the current node. After splitting on this attribute, the information obtained about the split samples is maximized and the uncertainty, that is the entropy, is minimized.
Information gain is defined as the difference in entropy before and after the split, and entropy is defined as the expected value of information. Therefore, before defining entropy and information gain, we first need to define information.
Information: if the frequency of class label xi in the sample set S is denoted p(xi), then the information of xi is defined as -log2 p(xi).
Entropy of the sample set before splitting: E(S) = -∑_{i=1}^{N} p(xi) log2 p(xi), where N is the number of class labels.
Entropy of the sample set after splitting on attribute A: E_A(S) = ∑_{j=1}^{m} (|Sj| / |S|) E(Sj), where m is the number of sub-sample sets produced by the values of attribute A, |Sj| is the number of samples in the j-th sub-sample set, and |S| is the total number of samples before splitting.
Information gain of splitting the sample set on attribute A: InfoGain(S, A) = E(S) - E_A(S).
Note: the splitting attribute is chosen so that the information gain of the split is as large as possible, which is equivalent to making the entropy after the split as small as possible.
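As a quick illustration of these formulas (with made-up class counts, not data from this article), suppose a set S contains 9 samples of one class and 5 of another, and an attribute A splits it into three subsets with class counts (2, 3), (4, 0) and (3, 2). The snippet below computes E(S) ≈ 0.940, E_A(S) ≈ 0.694 and InfoGain(S, A) ≈ 0.247.

# Illustrative calculation of entropy and information gain (hypothetical counts).
from math import log2

def entropy(counts):  # counts: number of samples of each class
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

E_S = entropy([9, 5])                 # entropy before splitting, about 0.940
subsets = [[2, 3], [4, 0], [3, 2]]    # class counts of each sub-sample set Sj
n = sum(sum(s) for s in subsets)      # |S| = 14
E_A = sum(sum(s) / n * entropy(s) for s in subsets)  # weighted entropy after splitting, about 0.694
print(E_S, E_A, E_S - E_A)            # the information gain is about 0.247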

4. ID3 algorithm

The ID3 algorithm is a decision tree learning method that selects attributes by information gain. Its core idea is to choose the splitting attribute at every level of the decision tree by computing the information gain of each attribute, so that testing a sample at each non-leaf node yields the maximum amount of category information. The basic method is: for all remaining attributes, choose the one with the maximum information gain as the splitting attribute of the current node, create one branch for each of its values, and recursively apply the same method to build the child node of each branch, until every subset contains only one category or there are no attributes left to split on. The resulting decision tree can then be used to classify new sample data.

ID3 Algorithm flow:

(1) Create an initial node. If the samples in the node belong to the same category, the algorithm terminates and the node is marked as a leaf node labelled with that category.
(2) Otherwise, select the attribute with the maximum information gain as the splitting attribute of the node.
(3) For each value of the splitting attribute, extend a corresponding branch and divide the samples according to that attribute value.
(4) Repeat the same procedure recursively from the top down until one of the following three conditions is met, at which point the recursion stops:
A. All samples of the node to be split belong to the same class.
B. All samples in the training sample set have been classified.
C. All attributes have been used once as splitting attributes. If the leaf node still contains samples of different categories at this point, the category containing the most samples is chosen as the classification of the leaf node.

Analysis of advantages and disadvantages of ID3 algorithm

Advantages: the decision tree is built relatively quickly, the algorithm is simple, and the generated rules are easy to understand.
Disadvantages: when selecting attributes it tends to prefer attributes with many values, which are not necessarily the best splitting attributes; it cannot handle attributes with continuous values; and it has no pruning process, so the decision tree cannot be optimized and may overfit.
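As an aside on the first disadvantage: C4.5, mentioned in section 1, counters the bias towards many-valued attributes by dividing the information gain by the split information of the attribute (the entropy of the subset sizes). The sketch below is only an illustration of that correction and is not part of the ID3 implementation in section 5.

# Illustrative sketch of C4.5's gain ratio (not used by the ID3 code below).
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def gain_ratio(class_counts_before, subsets):  # subsets: class counts per sub-sample set
    n = sum(class_counts_before)
    gain = entropy(class_counts_before) - sum(sum(s) / n * entropy(s) for s in subsets)
    split_info = entropy([sum(s) for s in subsets])  # large for attributes with many values
    return gain / split_info if split_info else 0.0

print(gain_ratio([9, 5], [[2, 3], [4, 0], [3, 2]]))  # about 0.156 with these hypothetical counts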

5. ID3 algorithm Python code implementation


# -*- coding: utf-8 -*-
__author__ = 'zhihua_oba'

import operator
from math import log
from numpy import zeros  # only zeros() is used from numpy

# Read the data file into a data set
def file2matrix(filename, attribute_num):  # arguments: file name, number of attributes
    fr = open(filename)
    arrayOLines = fr.readlines()
    fr.close()
    numberOfLines = len(arrayOLines)  # number of rows in the data set (number of samples)
    dataMat = zeros((numberOfLines, attribute_num))
    classLabelVector = []  # class labels
    index = 0
    for line in arrayOLines:
        line = line.strip()  # remove the trailing '\n'
        listFromLine = line.split()  # split the line into a list of strings; without an argument, split() splits on whitespace
        dataMat[index, :] = listFromLine[0:attribute_num]  # the sample's attribute values
        classLabelVector.append(listFromLine[-1])  # the sample's class label
        index += 1
    dataSet = []  # convert the numpy array back to a list, appending the class label to each row
    for index in range(0, numberOfLines):
        temp = list(dataMat[index, :])
        temp.append(classLabelVector[index])
        dataSet.append(temp)
    return dataSet

# Split the data set: keep the rows whose element in column `axis` equals `value`
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featvec in dataSet:  # each row
        if featvec[axis] == value:  # matching row: drop column `axis` and keep the rest
            reducedFeatVec = featvec[:axis]
            reducedFeatVec.extend(featvec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
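# For example (illustrative values): splitDataSet([[1, 2, 'yes'], [1, 3, 'no'], [2, 2, 'no']], 0, 1)
# keeps the first two rows and drops column 0, returning [[2, 'yes'], [3, 'no']].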

# Calculate the Shannon entropy of the data set, i.e. the entropy of its class labels
def calcShannonEnt(dataSet):
    numEntries = len(dataSet)  # number of samples in the data set
    labelCounts = {}  # counts of each class label
    for featVec in dataSet:
        currentLabel = featVec[-1]  # the class label is the last element of each row
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt
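# For example (illustrative values): calcShannonEnt([[1, 'yes'], [2, 'yes'], [3, 'no']])
# returns -(2/3)*log2(2/3) - (1/3)*log2(1/3), about 0.918.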

# Choose the best splitting attribute by information gain:
# the smaller the entropy after splitting on an attribute, the better
def chooseBestFeatureToSplit(dataSet):
    baseEntropy = calcShannonEnt(dataSet)  # entropy of the data set before splitting
    numFeatures = len(dataSet[0]) - 1  # the last column is the class label
    bestInfoGain = 0.0  # largest information gain found so far
    bestFeature = 0  # index of the best splitting attribute
    for i in range(0, numFeatures):
        featList = [example[i] for example in dataSet]  # the i-th attribute value of every sample
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:  # entropy of the data set after splitting on attribute i
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy  # the smaller the entropy after splitting, the larger the gain
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature

# If all attributes have been used but the class labels in a leaf node are still not unique,
# classify the leaf node by majority vote
def majorityCnt(classList):  # argument: the class labels in the leaf node
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
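# For example (illustrative values): majorityCnt(['yes', 'no', 'yes']) returns 'yes'.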

# Create the decision tree
def createTree(dataSet, labels):  # arguments: data set, attribute labels (the labels make the output tree readable)
    classList = [example[-1] for example in dataSet]  # class labels of the samples
    if classList.count(classList[0]) == len(classList):  # all samples belong to one class: this is a leaf node
        return classList[0]
    if len(dataSet[0]) == 1:  # only the class-label column is left: all attributes used, vote for the majority class
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)  # choose the splitting attribute by information gain
    bestFeatLabel = labels[bestFeat]  # record its label
    myTree = {bestFeatLabel: {}}  # the tree, stored as nested dictionaries
    del(labels[bestFeat])  # remove the chosen attribute from the label list
    # build one branch for each value of the chosen attribute
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]
        subDataSet = splitDataSet(dataSet, bestFeat, value)
        myTree[bestFeatLabel][value] = createTree(subDataSet, subLabels)
    return myTree

# Test the algorithm: classify a new sample with the decision tree
def classify(inputTree, featLabels, testVec):  # arguments: decision tree, attribute labels, sample to classify
    firstStr = list(inputTree.keys())[0]  # the attribute tested at the root of this (sub)tree
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)  # position of that attribute in the attribute labels
    for key in secondDict.keys():
        if testVec[featIndex] == key:
            if type(secondDict[key]).__name__ == 'dict':  # branch leads to a subtree: recurse
                classLabel = classify(secondDict[key], featLabels, testVec)
            else:  # branch leads to a leaf: its value is the class label
                classLabel = secondDict[key]
    return classLabel

def main():
    dataSet = file2matrix('test_sample.txt', 4)
    labels = ['attr01', 'attr02', 'attr03', 'attr04']
    labelsForCreateTree = labels[:]  # createTree deletes entries, so pass a copy
    Tree = createTree(dataSet, labelsForCreateTree)
    testvec = [2, 3, 2, 3]
    print(classify(Tree, labels, testvec))

if __name__ == '__main__':
    main()
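
To run the script end to end, a data file is needed. The article does not include test_sample.txt, so the snippet below writes a small hypothetical file in the format file2matrix expects (whitespace-separated attribute values followed by the class label on each line) and then builds and queries the tree. It assumes the functions above are in scope, for example by using it in place of main().

# Hypothetical end-to-end usage; the rows below are made-up data, not the article's data set.
sample_rows = [
    '1 1 2 1 no',
    '2 3 2 3 yes',
    '1 2 3 1 no',
    '2 3 1 3 yes',
    '3 1 2 2 no',
]
with open('test_sample.txt', 'w') as f:
    f.write('\n'.join(sample_rows) + '\n')

dataSet = file2matrix('test_sample.txt', 4)
labels = ['attr01', 'attr02', 'attr03', 'attr04']
tree = createTree(dataSet, labels[:])        # pass a copy, createTree modifies the label list
print(tree)                                  # e.g. {'attr01': {1.0: 'no', 2.0: 'yes', 3.0: 'no'}}
print(classify(tree, labels, [2, 3, 2, 3]))  # 'yes' for this made-up tree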

