Python algorithm walkthrough: the One Rule (OneR) algorithm


The idea behind OneR is simple. Suppose a feature takes only two values, 0 and 1, and the data set contains three classes. Among the samples where the feature is 0, say 20 belong to class A, 60 to class B, and 20 to class C. A sample with this feature equal to 0 is therefore most likely class B, but the other 40 samples are not class B, so predicting class B for this feature value has an error rate of 40%. Do this count for every value of every feature, sum the errors for each feature, and pick the feature with the lowest total error as the single classification rule. That is OneR (One Rule).
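As a quick sanity check, here is a minimal sketch of that error-rate arithmetic, using the illustrative 20/60/20 counts from above (the class names and numbers are the toy values, not iris data):

# Toy counts: how many samples of each class have this feature value
counts = {"A": 20, "B": 60, "C": 20}
# Predict the most frequent class for this feature value
most_frequent = max(counts, key=counts.get)
# Every sample outside the predicted class is an error
error_rate = sum(n for c, n in counts.items() if c != most_frequent) / sum(counts.values())
print(most_frequent, error_rate)  # B 0.4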

Now implement the algorithm in code.


# OneR algorithm implementation
import numpy as np
from sklearn.datasets import load_iris
# Load the iris data set
dataset = load_iris()
# data array: the features of the data set
X = dataset.data
# target array: the class of each sample
y_true = dataset.target
# Compute the mean of each feature
attribute_means = X.mean(axis=0)
# Discretize the continuous features: a value greater than or equal to its
# feature's mean becomes 1, a value below it becomes 0.
x = np.array(X >= attribute_means, dtype="int")
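# As an illustration (values approximate, not part of the original walkthrough):
# the iris feature means are roughly [5.84, 3.05, 3.76, 1.20], so the first
# sample [5.1, 3.5, 1.4, 0.2] is discretized to [0, 1, 0, 0].
# print(attribute_means)
# print(x[0])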


from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y_true, random_state=14)
from operator import itemgetter
from collections import defaultdict
# For a given feature and feature value, find the most frequent class among samples with that value, and the error of predicting that class.
def train_feature_class(x, y_true, feature_index, feature_values):
  num_class = defaultdict(int)
  for sample, y in zip(x, y_true):
    if sample[feature_index] == feature_values:
      num_class[y] += 1
  # Sort the class counts from largest to smallest to find the most frequent class
  sorted_num_class = sorted(num_class.items(), key=itemgetter(1), reverse=True)
  most_frequent_class = sorted_num_class[0][0]
  error = sum(value_num for class_num, value_num in sorted_num_class if class_num != most_frequent_class)
  return most_frequent_class, error
# print(train_feature_class(x_train, y_train, 0, 1))
# Now define a function that, for one feature, finds the predicted class for each of its values and the total error of using that feature alone.
def train_feature(x, y_true, feature_index):
  n_sample, n_feature = x.shape
  assert 0 <= feature_index < n_feature
  value = set(x[:, feature_index])
  predictors = {}
  errors = []
  for current_value in value:
    most_frequent_class, error = train_feature_class(x, y_true, feature_index, current_value)
    predictors[current_value] = most_frequent_class
    errors.append(error)
  total_error = sum(errors)
  return predictors, total_error
# Compute the predictor for every feature. The result maps each feature index to a tuple of
# (a dict from feature value to predicted class, that feature's total error). For example, the
# entry {0: ({0: 0, 1: 2}, 41)} means: for feature 0, value 0 predicts class 0 and value 1
# predicts class 2, with 41 errors in total.
all_predictors = {feature: train_feature(x_train, y_train, feature) for feature in range(x_train.shape[1])}
# print(all_predictors)
# Extract the total error of each feature
errors = {feature: error for feature, (mapping, error) in all_predictors.items()}
# Sort by error to get the best feature and its lowest error; that feature and its value-to-class mapping are the model and the rule. This is the One Rule (OneR) algorithm.
best_feature, best_error = sorted(errors.items(), key=itemgetter(1), reverse=False)[0]
# print "The best model is based on feature {0} and has error {1:.2f}".format(best_feature, best_error)
# print all_predictors[best_feature][0]
#  Build a model 
model = {"feature": best_feature, "predictor": all_predictors[best_feature][0]}
# print(model)
# Test: classify each test sample by looking up, in the predictor, the class for its value of the best feature.
def predict(x_test, model):
  feature = model["feature"]
  predictor = model["predictor"]
  y_predictor = np.array([predictor[int(sample[feature])] for sample in x_test])
  return y_predictor

y_predictor = predict(x_test, model)
# print(y_predictor)
# Compare the predictions with the test labels to get the accuracy under the best feature.
accuracy = np.mean(y_predictor == y_test) * 100
print "The test accuracy is {0:.2f}%".format(accuracy)

from sklearn.metrics import classification_report

print(classification_report(y_test, y_predictor))

Conclusion: at first I thought that once OneR found the single feature with the lowest error rate, it could classify using all the features. In fact it only predicts a class for each value of that one feature, so it clearly has limitations; its virtues are speed and simplicity, and whether that is good enough depends on the case. Note also that after mean-based discretization each feature has only two values, so the model can predict at most two of the three iris classes, which is why one class in the report below is never predicted.

             precision    recall  f1-score   support

          0       0.94      1.00      0.97        17
          1       0.00      0.00      0.00        13
          2       0.40      1.00      0.57         8

avg / total       0.51      0.66      0.55        38

Note: in the code above,

for sample in x_test:
  print(sample[0])

prints the first column of x_test, that is, the value of feature 0 for every test sample. By contrast,

print(x_test[0])

prints the first row of x_test, that is, every feature value of the first test sample. Notice the difference.
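For reference, the idiomatic NumPy way to take a whole column in one step is slicing (this line is an illustration added here, not part of the original script):

first_column = x_test[:, 0]  # all rows, column 0: the same values the loop prints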

