Python implementation of the ID3 decision tree algorithm
- 2020-06-19 10:44:42
- OfStack
The ID3 decision tree algorithm and its Python implementation are described below.
Main contents:
Decision tree background
General process of building a decision tree
Split attribute selection in the ID3 algorithm
ID3 algorithm flow and analysis of its advantages and disadvantages
Python implementation of the ID3 algorithm
1. Decision tree background knowledge
Decision trees are one of the most important and commonly used methods in data mining, mainly for classification and prediction. A decision tree is a representation of knowledge in which each path from the root to a leaf node is a classification rule. Decision tree algorithms were first developed on the basis of information theory; after several decades of development, the commonly used algorithms are ID3, C4.5 and CART.
2. The general process of building a decision tree
Building a decision tree is a top-down process in which the data are repeatedly segmented and subdivided; each split generates a node corresponding to a subset of the data. Starting from the root node, which contains all the data, the training set is divided into subsets according to the values of the chosen split attribute, and a new non-leaf node is generated for each subset. This process is repeated on the generated non-leaf nodes until a termination condition is met; splitting then stops, and leaf nodes corresponding to the final data subsets, namely the required classes, are generated. After the tree is built, a test set is used to evaluate its performance; if the performance is not up to standard, the tree must be improved until the expected performance index is achieved.
Note: split attribute selection is the key step in building a decision tree. It determines the performance and structure of the generated tree, and the selection criterion is the fundamental difference between decision tree algorithms.
3. ID3 algorithm split attribute selection -- Information gain
The & # 8195; The & # 8195; Attribute selection is the core of decision tree algorithm. It plays a decisive role in the structure and performance of the decision tree. ID3 algorithm is based on information gain split attribute selection. Attribute selection based on information gain refers to the method of selecting attributes with the decreasing speed of information entropy. Based on the information theory of, it selects the attribute with the highest information gain as the splitting attribute of the current node. After choosing this attribute as the splitting attribute, the information of the split sample is maximized and the uncertainty is minimized, that is, the entropy is minimized.
The & # 8195; The & # 8195; Information gain is defined as the difference of entropy before and after change, while entropy is defined as the expected value of information. Therefore, before understanding entropy and information gain, we need to understand the definition of information.
The & # 8195; The & # 8195; Information: The frequency of the classification tag xi in the sample set S is denoted as p(xi), then the information of xi is defined as: − log2p (xi).
The & # 8195; The & # 8195; The entropy of the sample set before splitting: E(S)=− ∑Ni=1p(xi)log2p(xi), where N is the number of classification tags.
The & # 8195; The & # 8195; The entropy of the sample set after the splitting of attribute A: EA(S)=− ∑mj=1|Sj||S|E(Sj), in which m represents the original sample set divided into m sub-sample set by the attribute value of A, |Sj| represents the number of samples in the j sub-sample set, and |S| represents the total number of samples in the data set before splitting.
The & # 8195; The & # 8195; The information gain of the sample set after splitting by attribute A: InfoGain(S,A)=E(S)− EA (S)
The & # 8195; The & # 8195; Note: The selection criteria for the splitting attribute are: the larger the information gain before and after the splitting, the better; that is, the smaller the entropy after the splitting, the better.
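The definitions above can be checked with a small worked example. The sketch below (the four toy samples and binary attribute A are made up for illustration) computes E(S), E_A(S) and InfoGain(S, A):

```python
from math import log

def entropy(labels):
    """Shannon entropy E(S) of a list of class labels."""
    n = len(labels)
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return -sum((c / n) * log(c / n, 2) for c in counts.values())

# Toy set S: 4 samples, one binary attribute A, class label 'yes'/'no'
samples = [(0, 'no'), (0, 'no'), (1, 'yes'), (1, 'no')]

# E(S): entropy before splitting (3 'no', 1 'yes')
e_before = entropy([lab for _, lab in samples])

# E_A(S): weighted entropy after splitting on A
e_after = 0.0
for v in {a for a, _ in samples}:
    subset = [lab for a, lab in samples if a == v]
    e_after += len(subset) / len(samples) * entropy(subset)

gain = e_before - e_after
print(round(e_before, 4), round(e_after, 4), round(gain, 4))
# E(S) ≈ 0.8113, E_A(S) = 0.5, InfoGain(S, A) ≈ 0.3113
```

Splitting on A leaves the A=0 branch pure (entropy 0) and the A=1 branch maximally mixed (entropy 1), so the weighted entropy drops from about 0.81 to 0.5.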
4. ID3 algorithm
The ID3 algorithm is a decision tree learning method that selects attributes by information gain. Its core idea: by computing the information gain of each attribute, choose the split attribute at every level of the tree, so that each test at a non-leaf node obtains the maximum class information about the samples. The basic method: compute the information gain of all remaining attributes, split the node on the attribute with the maximum gain, create one branch per attribute value, and recursively apply the same method to build the subtree of each branch, until every subset contains only one class or no attribute is left to split on. The result is a decision tree that can be used to classify new samples.
ID3 algorithm flow:
(1) Create an initial node. If all samples at the node belong to the same class, the algorithm terminates; the node is marked as a leaf node and labeled with that class.
(2) Otherwise, select the attribute with the maximum information gain as the splitting attribute of the node.
(3) For each value of the splitting attribute, extend a corresponding branch and divide the samples according to their attribute values.
(4) Repeat the same procedure recursively, top down, until one of the following three conditions is met:
A. All samples at the node to be split belong to the same class.
B. All samples in the training set have been classified.
C. All attributes have been used as splitting attributes. If a leaf node still contains samples of different classes at this point, the class containing the most samples is chosen as the classification of the leaf node.
Analysis of the advantages and disadvantages of the ID3 algorithm
Advantages: the decision tree is built relatively quickly, the algorithm is simple, and the generated rules are easy to understand.
Disadvantages: attribute selection tends to favor attributes with many values, which are not necessarily the best split attributes; it cannot handle attributes with continuous values; and there is no pruning step, so the generated tree cannot be optimized and may overfit.
5. Python implementation of the ID3 algorithm
# -*- coding: utf-8 -*-
__author__ = 'zhihua_oba'
import operator
from numpy import *
from math import log
# Read the data file
def file2matrix(filename, attribute_num):  # arguments: file name, number of attributes
    fr = open(filename)
    arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)  # number of rows in the data set (number of samples)
    dataMat = zeros((numberOfLines, attribute_num))
    classLabelVector = []  # class labels
    index = 0
    for line in arrayOLines:
        line = line.strip()  # strip() removes the trailing '\n'
        listFromLine = line.split()  # split the line into a list of strings; with no argument, split() separates on whitespace
        dataMat[index, :] = listFromLine[0:attribute_num]  # read the attribute values of the sample
        classLabelVector.append(listFromLine[-1])  # read the class label
        index += 1
    dataSet = []  # convert the array to a list
    for index in range(0, numberOfLines):
        temp = list(dataMat[index, :])
        temp.append(classLabelVector[index])
        dataSet.append(temp)
    return dataSet
# Split the data set
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featvec in dataSet:  # for each row whose element at position axis equals value,
        if featvec[axis] == value:  # drop that element and add the rest of the row to retDataSet
            reducedFeatVec = featvec[:axis]
            reducedFeatVec.extend(featvec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
# Calculate the Shannon entropy of the data set (over its class labels)
def calcShannonEnt(dataSet):
    numEntries = len(dataSet)  # number of samples in the data set
    labelCounts = {}  # class label counts
    for featVec in dataSet:  # count the class labels in dictionary form
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt
# Choose the best attribute to split on: the lower the entropy after splitting on an attribute, the better
def chooseBestFeatureToSplit(dataSet):
    baseEntropy = calcShannonEnt(dataSet)  # Shannon entropy of the data set before splitting
    numFeatures = len(dataSet[0]) - 1
    bestInfoGain = 0.0  # maximum information gain
    bestFeature = 0  # best attribute
    for i in range(0, numFeatures):
        featList = [example[i] for example in dataSet]  # the i-th element of every row, as a new list
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:  # entropy of the data set after splitting on attribute i
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy  # the smaller the entropy after splitting, the larger the information gain
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
# If all attributes have been used but the class labels in a leaf node are still not unique,
# classify the leaf node by majority vote
def majorityCnt(classList):  # argument: the class labels in the leaf node
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
# Create the tree
def createTree(dataSet, labels):  # arguments: data set, attribute labels (the labels make the printed tree easier to read)
    classList = [example[-1] for example in dataSet]  # class labels of the sample data set
    if classList.count(classList[0]) == len(classList):  # all samples belong to the same class: this is a leaf node
        return classList[0]
    if len(dataSet[0]) == 1:  # only the class label column is left: all attributes used, classify the leaf by majority vote
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)  # choose the best attribute to split on
    bestFeatLabel = labels[bestFeat]  # record the attribute label
    myTree = {bestFeatLabel: {}}  # the tree
    del(labels[bestFeat])  # remove the chosen attribute from the label list
    # build the subtrees based on the best attribute
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]
        subDataSet = splitDataSet(dataSet, bestFeat, value)
        myTree[bestFeatLabel][value] = createTree(subDataSet, subLabels)
    return myTree
# Test the algorithm: use the decision tree to classify a new sample
def classify(inputTree, featLabels, testVec):  # arguments: decision tree, attribute labels, sample to classify
    firstStr = list(inputTree.keys())[0]  # the attribute represented by the root
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)  # position of the root's attribute in the attribute labels
    for key in secondDict.keys():
        if testVec[featIndex] == key:
            if type(secondDict[key]).__name__ == 'dict':
                classLabel = classify(secondDict[key], featLabels, testVec)
            else:
                classLabel = secondDict[key]
    return classLabel
def main():
    dataSet = file2matrix('test_sample.txt', 4)
    labels = ['attr01', 'attr02', 'attr03', 'attr04']
    labelsForCreateTree = labels[:]
    Tree = createTree(dataSet, labelsForCreateTree)
    testvec = [2, 3, 2, 3]
    print(classify(Tree, labels, testvec))

if __name__ == '__main__':
    main()
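The article does not show the contents of test_sample.txt. The sketch below assumes the whitespace-separated layout that file2matrix expects: four attribute values per line, with the class label last. The sample values and labels are made up for illustration; it parses the text the same way file2matrix does.

```python
# Hypothetical test_sample.txt layout: 4 attribute values, then the class label.
sample_text = """\
1 2 1 3 A
2 3 2 3 B
2 3 2 3 B
"""

# Parse each line the way file2matrix does: strip, split on whitespace,
# convert the attribute columns to floats, keep the label as a string.
dataSet = []
for line in sample_text.strip().split('\n'):
    parts = line.split()
    dataSet.append([float(v) for v in parts[:4]] + [parts[-1]])

print(dataSet[0])  # [1.0, 2.0, 1.0, 3.0, 'A']
```

Each row of the resulting dataSet has the attribute values as floats (because file2matrix stores them in a numpy array first) and the class label as the final element.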