Python implementation of a decision tree
- 2020-06-19 10:52:54
- OfStack
This article shares a concrete Python implementation of a decision tree, for your reference; the details are as follows.
Advantages and disadvantages of the algorithm:
Advantages: low computational complexity; the output is easy to interpret; insensitive to missing intermediate values; can handle irrelevant features
Disadvantages: prone to overfitting (excessive matching)
Applicable data types: numeric and nominal
Algorithm idea:
1. Overall idea of decision tree construction:
In plain English, a decision tree is like a chain of if-else statements: the result is a tree in which you start at the root, make a judgment at each node, and follow the matching branch down to a leaf. But these if-else branches are not something we set by hand; we need a method by which the computer can build the tree for us. The key is to pick out, from the many features, the ones that are valuable, and to choose them in the best order from root to leaf. Once we can do that, the decision tree can be constructed recursively.
2. Information gain
The biggest principle of partitioning a data set is to make disordered data more ordered. Since this involves the order and disorder of information, it is natural to think of information entropy, which is what is used here (another option is Gini impurity). The formula is as follows:
H(D) = -sum_i p_i * log2(p_i)
where p_i is the proportion of instances in the data set belonging to class i.
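As a concrete check of the formula, using the class proportions of the sample data set defined later in this article (2 'yes' and 3 'no' out of 5 instances), the entropy can be computed directly:

```python
from math import log

# class proportions for the sample data set: 2 'yes', 3 'no' out of 5
probs = [2 / 5, 3 / 5]

# H = -sum(p * log2(p)) over all classes
entropy = -sum(p * log(p, 2) for p in probs)
print(round(entropy, 4))  # -> 0.971
```

So a 2/3 split of five instances has an entropy of about 0.97 bits; a pure node (all one class) would have entropy 0, which is what the splitting criterion tries to move towards.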
Requirements on the input data:
The data must be a list of lists, and every row must have the same length
The last element of each instance (the last column) must be the class label of that instance
Function:
calcShannonEnt(dataSet)
Computes the Shannon entropy of the data set in two steps: first count the frequency of each class label, then apply the entropy formula.
splitDataSet(dataSet, axis, value)
Partitions the data set: returns the subset of instances satisfying X[axis] == value, with the axis attribute removed (it is no longer needed after the split)
chooseBestFeatureToSplit(dataSet)
Chooses the best attribute to split on. The idea is very simple: try splitting on each attribute in turn and see which split yields the largest information gain. A set is used here to extract the unique elements of a list, which is a quick way to de-duplicate
majorityCnt(classList)
Since the recursive tree construction consumes attributes, it can happen that the last attribute is used up while the instances are still not perfectly classified; in that case the node's class is decided by majority vote
createTree(dataSet, labels)
Builds the decision tree recursively. labels holds the names of the features, purely for readability of the resulting tree.
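To make the splitting behavior concrete before the full listing, here is a self-contained sketch using the same logic as the splitDataSet function below, applied to the sample data:

```python
def splitDataSet(dataSet, axis, value):
    # keep only rows where column `axis` equals `value`,
    # and drop that column from the result
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis + 1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

dataSet = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'],
           [0, 1, 'no'], [0, 1, 'no']]

# rows whose first feature equals 1, with that feature removed
print(splitDataSet(dataSet, 0, 1))  # -> [[1, 'yes'], [1, 'yes'], [0, 'no']]
```

chooseBestFeatureToSplit then calls this once per candidate value of each feature and weighs the entropy of each subset by its size.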
#coding=utf-8
import operator
from math import log
import time

def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels
# Calculate the Shannon entropy of the data set
def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt
# Return the subset of instances whose value on `axis` equals `value`,
# with that attribute removed
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis + 1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1  # the last column is the class label
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
# Because the recursive construction consumes attributes, the last attribute
# may be used up while the classification is still not finished; in that case
# the node's class is decided by majority vote
def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    # return the most frequent class label; a bare max(classCount) would
    # compare the label values themselves, not their counts
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):  # all classes identical: stop splitting
        return classList[0]
    if len(dataSet[0]) == 1:  # all features are used up
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]  # copy so the recursion does not modify the original list
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value),
                                                  subLabels)
    return myTree
def main():
    data, labels = createDataSet()
    t1 = time.perf_counter()  # time.clock() was removed in Python 3.8
    myTree = createTree(data, labels)
    t2 = time.perf_counter()
    print(myTree)
    print('execute for', t2 - t1)

if __name__ == '__main__':
    main()
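The article stops after building the tree; to actually classify a new instance you still need a lookup step. The classify function below is a sketch not present in the original code (the function name and the hard-coded example tree are illustrative assumptions); it walks a nested-dict tree of the shape createTree returns:

```python
def classify(inputTree, featLabels, testVec):
    # the root key of each subtree is a feature name; descend along the
    # branch matching the test vector's value for that feature
    featName = next(iter(inputTree))
    secondDict = inputTree[featName]
    featIndex = featLabels.index(featName)
    valueOfFeat = secondDict[testVec[featIndex]]
    if isinstance(valueOfFeat, dict):  # internal node: keep descending
        return classify(valueOfFeat, featLabels, testVec)
    return valueOfFeat                 # leaf node: class label

# a tree in the nested-dict format createTree produces
myTree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
labels = ['no surfacing', 'flippers']
print(classify(myTree, labels, [1, 1]))  # -> yes
print(classify(myTree, labels, [1, 0]))  # -> no
```

Note that classify needs the original labels list, whereas createTree mutates its labels argument with del; passing a copy such as labels[:] into createTree avoids the conflict.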