Implementing a decision tree in Python


This article shares the concrete code of a Python implementation of a decision tree, for your reference. The details are as follows.

Advantages and disadvantages of the algorithm:

Advantages: low computational complexity, easy-to-interpret output, insensitive to missing intermediate values, and able to handle irrelevant feature data

Disadvantages: prone to over-matching (overfitting) the training data

Applicable data types: numeric and nominal

Algorithm idea:

1. Overall idea of decision tree construction:

In plain terms, a decision tree works like an if-else structure: the result of training is a tree that you can walk from the root, making one test at a time, until you reach a leaf node. However, the if-else conditions are not something we want to write by hand; what we need is a method by which the computer can derive the decision tree for us. The key of this method is to pick, out of the many features, the ones that are most valuable, and to choose them in the best order from root to leaf. Once we can do that, the decision tree can be constructed recursively.
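
To make the if-else picture concrete: the tree learned from the small fish dataset used later in this article is equivalent to the hand-written function below (the function and parameter names are made up here for illustration; they are not part of the code that follows):

def isFish(noSurfacing, flippers):
  # hand-written equivalent of the learned tree:
  # the root tests 'no surfacing', its '1' branch tests 'flippers',
  # and every other path ends in 'no'
  if noSurfacing == 1:
    if flippers == 1:
      return 'yes'
    return 'no'
  return 'no'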

2. Information gain

The guiding principle of partitioning a data set is to make the unordered data more ordered. Since this is a question of how ordered the information is, it is natural to reach for information entropy, which is what is used here (another option is Gini impurity). The formula is:

H(X) = -Σ p(x_i) · log2 p(x_i)  (summed over all classes x_i)

where p(x_i) is the fraction of instances that belong to class x_i.
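
A quick worked example, using the class proportions of the toy dataset defined later (2 'yes' and 3 'no' labels):

from math import log

# entropy of a set with 2 'yes' and 3 'no' labels
pYes, pNo = 2 / 5.0, 3 / 5.0
print(-pYes * log(pYes, 2) - pNo * log(pNo, 2))  # ~0.9710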

Data requirements to be met:

The data must be a list of lists, and every instance (row) must have the same length.
The last element of each instance must be the class label of that instance (see the small example below).
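
For example, this is a valid dataSet (a few rows of the toy dataset defined later); every row has the same length and ends with its class label:

dataSet = [[1, 1, 'yes'],
           [1, 0, 'no'],
           [0, 1, 'no']]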

Functions:

calcShannonEnt(dataSet)

Computes the Shannon entropy of the data set in two steps: first count the frequency of each class label, then compute the Shannon entropy from the formula above.
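
On the toy dataset from the listing below (2 'yes', 3 'no') this reproduces the value from the worked example above:

data, labels = createDataSet()
print(calcShannonEnt(data))  # 0.9709505944546686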

splitDataSet(dataSet, axis, value)

Partitions the data set: collects the instances that satisfy X[axis] == value into one subset and returns it, with the axis feature itself removed, since it is no longer needed after the split.
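
For example, splitting the toy dataset on feature 0 with value 1 keeps the three matching instances and drops their first column:

data, labels = createDataSet()
print(splitDataSet(data, 0, 1))  # [[1, 'yes'], [1, 'yes'], [0, 'no']]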

chooseBestFeatureToSplit(dataSet)

Chooses the best attribute to split on. The idea is simple: try the split on each attribute and see which one works best, i.e. which gives the largest information gain. A set is used here to get the unique values of a column, which is a quick way to do that in Python.
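
On the toy dataset the information gain of feature 0 ('no surfacing') is roughly 0.420, versus roughly 0.171 for feature 1 ('flippers'), so the best feature is index 0:

data, labels = createDataSet()
print(chooseBestFeatureToSplit(data))  # 0, i.e. 'no surfacing'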

majorityCnt(classList)

Since the recursive construction of the tree consumes one attribute at each level, it can happen that all attributes are used up while the instances at a node still belong to more than one class; in that case the class of the node is decided by a majority vote.
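
For example:

print(majorityCnt(['yes', 'no', 'no']))  # 'no' -- the most frequent label wins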

createTree(dataSet, labels)

Builds the decision tree recursively. labels here holds the names of the features; it is only used to make the resulting tree easier to read and understand.
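
On the toy dataset in the listing below, the resulting tree is a nested dictionary with feature names at internal nodes and class labels at the leaves:

{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}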


#coding=utf-8
import operator
from math import log
import time

def createDataSet():
  dataSet=[[1,1,'yes'],
      [1,1,'yes'],
      [1,0,'no'],
      [0,1,'no'],
      [0,1,'no']]
  labels = ['no surfacing','flippers']
  return dataSet, labels

# Calculate the Shannon entropy 
def calcShannonEnt(dataSet):
  numEntries = len(dataSet)
  labelCounts = {}
  for feaVec in dataSet:
    currentLabel = feaVec[-1]
    if currentLabel not in labelCounts:
      labelCounts[currentLabel] = 0
    labelCounts[currentLabel] += 1
  shannonEnt = 0.0
  for key in labelCounts:
    prob = float(labelCounts[key])/numEntries
    shannonEnt -= prob * log(prob, 2)
  return shannonEnt

def splitDataSet(dataSet, axis, value):
  retDataSet = []
  for featVec in dataSet:
    if featVec[axis] == value:
      reducedFeatVec = featVec[:axis]
      reducedFeatVec.extend(featVec[axis+1:])
      retDataSet.append(reducedFeatVec)
  return retDataSet
  
def chooseBestFeatureToSplit(dataSet):
  numFeatures = len(dataSet[0]) - 1  # the last column is the class label
  baseEntropy = calcShannonEnt(dataSet)
  bestInfoGain = 0.0
  bestFeature = -1
  for i in range(numFeatures):
    featList = [example[i] for example in dataSet]
    uniqueVals = set(featList)
    newEntropy = 0.0
    for value in uniqueVals:
      subDataSet = splitDataSet(dataSet, i, value)
      prob = len(subDataSet) / float(len(dataSet))
      newEntropy += prob * calcShannonEnt(subDataSet)
    infoGain = baseEntropy - newEntropy
    if infoGain > bestInfoGain:
      bestInfoGain = infoGain
      bestFeature = i
  return bestFeature
      
# Because the recursive construction of the tree consumes one attribute per level, the attributes may all be
# used up before the classification is finished; in that case the class of the node is decided by majority vote
def majorityCnt(classList):
  classCount = {}
  for vote in classList:
    if vote not in classCount.keys():
      classCount[vote] = 0
    classCount[vote] += 1
  # return the label with the highest count
  sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
  return sortedClassCount[0][0]
  
def createTree(dataSet, labels):
  classList = [example[-1] for example in dataSet]
  if classList.count(classList[0]) ==len(classList):# If the categories are the same, the division is stopped 
    return classList[0]
  if len(dataSet[0]) == 1:# All the features are used up 
    return majorityCnt(classList)
  bestFeat = chooseBestFeatureToSplit(dataSet)
  bestFeatLabel = labels[bestFeat]
  myTree = {bestFeatLabel:{}}
  del(labels[bestFeat])
  featValues = [example[bestFeat] for example in dataSet]
  uniqueVals = set(featValues)
  for value in uniqueVals:
    subLabels = labels[:]  # copy so that the recursive call does not modify the original labels list
    myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, 
                    bestFeat, value),subLabels)
  return myTree
  
def main():
  data, label = createDataSet()
  t1 = time.time()
  myTree = createTree(data, label)
  t2 = time.time()
  print(myTree)
  print('execute for', t2 - t1)

if __name__ == '__main__':
  main()
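
The listing above only builds the tree; it does not include a function for classifying a new instance. A minimal companion sketch (not part of the original code; classify and its arguments are names chosen here for illustration) that walks the nested dictionary could look like this:

def classify(tree, featLabels, testVec):
  featName = list(tree.keys())[0]         # feature tested at this node
  featIndex = featLabels.index(featName)  # its position in the input vector
  branch = tree[featName][testVec[featIndex]]
  if isinstance(branch, dict):            # internal node: keep descending
    return classify(branch, featLabels, testVec)
  return branch                           # leaf: the class label

data, labels = createDataSet()
tree = createTree(data, labels[:])        # pass a copy, createTree consumes labels
print(tree)                               # {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
print(classify(tree, labels, [1, 0]))     # 'no'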

