Void’s Vault

Knowledge source for efficiency.

Estimate Performance of a Classifier With Cross-Validation and Confidence Approximation

During the winter session at Laval University, I took a course on machine learning. One of the homework assignments was to learn how to use scikit-learn in Python. I teamed up with Mathieu Dumoulin. In this article, I present some code so you can learn how this tool works.

First, you can go to the scikit-learn website. There is a lot of useful documentation just waiting to be found, and plenty of code examples!

Now, in the example, I’ll show how to do cross-validation. But first, what is cross-validation?

In a perfect world, when you try to do some machine learning, you always have millions of examples. You split your data in two parts, S and T. You also create a classifier that uses an algorithm that can learn. Then, you give S to the classifier so it can learn. When the classifier is ready, you give T to the classifier and ask it to predict the outputs y of the X parts of T. Because you have a lot of data to train with and a lot of data to test your classifier with, the success rate of the predictions (how often the predicted yi is the same as the true yi) can be thought of as the real probability of success given any input X.
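To make that ideal setting concrete, here is a minimal sketch (the toy data, the 80/20 split and the choice of 3 neighbors are my own assumptions, not part of the homework):

import numpy as np
from sklearn import neighbors

#Hypothetical toy data: 1000 examples with 5 features, label depends on the first feature
rng = np.random.RandomState(0)
X = rng.rand(1000, 5)
y = (X[:, 0] > 0.5).astype(int)

#Split the data in two parts: S for training, T for testing
split = int(0.8 * X.shape[0])
X_s, y_s = X[:split], y[:split]
X_t, y_t = X[split:], y[split:]

clf = neighbors.KNeighborsClassifier(n_neighbors=3)
clf.fit(X_s, y_s)            #give S to the classifier so it can learn
y_pred = clf.predict(X_t)    #ask it to predict the outputs y of the X parts of T
print("Success rate:", float(np.sum(y_pred == y_t)) / X_t.shape[0])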

Now, in the real world, you won’t have enough data. Furthermore, you cannot use the testing dataset more than once, otherwise your classifier may overfit this testing dataset, which will cause it to work poorly on a new unknown X. The only thing you can use is a dataset S of 300 examples. Moreover, for a given classifier, you need to find which values the hyperparameters of the algorithm should take. What can you possibly do? The answer to this question is cross-validation.

Here’s how it works. You take your training dataset S and cut it into k folds (T1, T2, …, Tk). For i = 1..k, you build a training dataset S\Ti and keep Ti as a testing dataset. You then train the classifier, with a fixed set of hyperparameters, on every training dataset and test it with the corresponding testing dataset. After that, you evaluate the performance of the classifier according to the number of wrong classifications. Then, you change the hyperparameters and start over. When this part is done, you keep the hyperparameters that worked best and train the classifier on the whole dataset S. That’s it, you can now try to predict with this classifier!
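Concretely, if Ri is the error rate measured on fold Ti, the code below estimates the cross-validation risk and its variance with the usual mean and variance-of-the-mean formulas:

$$R_{cv} = \frac{1}{k}\sum_{i=1}^{k} R_i, \qquad \widehat{\mathrm{Var}}(R_{cv}) = \frac{1}{k(k-1)}\sum_{i=1}^{k} \left(R_i - R_{cv}\right)^2$$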

As always, there is nothing like a code example.

Here’s the code to do cross-validation with a KNeighborsClassifier.

import numpy as np
from math import sqrt

import confidence_intervals as ci
from sklearn import neighbors        #sklearn k-nearest-neighbors classifier
from sklearn import cross_validation #useful to cut datasets for cross-validation

def run_knn_with():
    X_s, y_s, X_t, y_t = loadDataset() #X_s and y_s are for training, X_t and y_t are for testing performance
    folds = 10
    estimators = {}
    for n_neighbors in (1, 3, 10):
        clf = neighbors.KNeighborsClassifier(n_neighbors=n_neighbors,) #set hyperparameters (here, it's the number of nearest neighbors)

        kf = cross_validation.KFold(X_s.shape[0], n_folds=folds) #Used to cut the example set into 10 folds

        scores = []
        for train_index, test_index in kf:
            X_train, X_test = X_s[train_index], X_s[test_index] #Create the training dataset
            y_train, y_test = y_s[train_index], y_s[test_index]  #Create the testing dataset

            clf.fit(X_train, y_train)        #It's that easy with scikit to make a classifier learn!
            y_pred = clf.predict(X_test)      #The classifier returns the predictions over X_test  
            score = float(np.sum(y_pred != y_test))/X_test.shape[0] #We count the number of wrong predictions
            scores.append(score)

        #These lines are used to estimate the performance of the classifier with the hyperparameters
        Rcv = float(sum(scores)) / folds
        total = 0
        for score in scores:
            total += (score - Rcv)**2
        variance = total / (folds * (folds - 1))

        estimators[n_neighbors] = (Rcv,variance) #Save the performance estimator

    bestScore = 100000
    bestKey = -1
    for key in estimators:
        print ("The classifier ", key, " scored : ", estimators[key][0], " +/- ", estimators[key][1])
        if estimators[key][0] < bestScore:
            bestScore = estimators[key][0]
            bestKey = key
        elif estimators[key][0] == bestScore and estimators[key][1] < estimators[bestKey][1]:
            bestScore = estimators[key][0] #same risk: keep the hyperparameter with the smaller variance
            bestKey = key

    print("The best hyperparameter is k=", bestKey)
    clf = neighbors.KNeighborsClassifier(n_neighbors=bestKey) #set the hyperparameter to the classifier
    clf.fit(X_s, y_s) #Train the classifier with the whole training dataset

    y_pred = clf.predict(X_s) #Evaluate the performance on training dataset (helps to check if classifier is overfitting on training set)
    Rs = float(np.sum(y_pred != y_s)) / X_s.shape[0]
    print("The empirical risk Rs is:", Rs)

    y_pred = clf.predict(X_t) #Evaluation of performance with the testing dataset
    Rt = float(np.sum(y_pred != y_t)) / X_t.shape[0]
    print("The empirical risk Rt is: ", Rt)

    #Now calculate the confidence intervals
    n = X_t.shape[0]
    delta = 0.1
    print("Interval of confidence over Rt:")
    print("Gaussian: +/- %0.5f" % (ci.confint_normal(n, Rt, delta)))
    print("Hoeffding: +/- %0.5f" % (ci.confint_hoeffding(n, delta)))
    print("Binomial: +/- %0.5f" % ci.confint_binomial(n, n  Rt, delta))

See the last lines? They evaluate the confidence interval of the empirical risk Rt. In short, the Gaussian method underestimates the error on Rt, while Hoeffding will overestimate it most of the time. The proper way to evaluate the confidence interval is the binomial method.
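For reference, the half-widths returned by confint_normal and confint_hoeffding in the file below correspond to the usual formulas, with z the critical value for a confidence level of 1 − δ:

$$\text{Gaussian: } \pm\, z\sqrt{\frac{R_t(1-R_t)}{n}}, \qquad \text{Hoeffding: } \pm\,\sqrt{\frac{\ln(2/\delta)}{2n}}$$

The binomial method does not use an approximation: it searches numerically for the smallest and largest true risks that are still compatible (at level δ/2 on each side) with the observed number of errors.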

I am glad to include the binomial method on my blog, thanks to Mathieu Dumoulin who implemented it. Below is the code file that shows the three methods.

from math import log, sqrt
from scipy.stats import binom

def find_rmax(n, k, delta, step_size=500):
    #print("find_rmax")
    i = 0
    r = float(i)/step_size
    prob = binom.cdf(k, n, r)
    #print("%d, r=%0.4f, prob=%f"%(i,r, prob))
    while prob > delta and i < step_size:
        i += 1; r = float(i)/step_size
        prob = binom.cdf(k, n, r)
        #print("%d, r=%0.4f, prob=%f"%(i,r, prob))

    return r

def find_rmin(n, k, delta, step_size=500):
    #print("find_rmin")
    i = 0
    r = float(i)/step_size
    prob = binom.cdf(k, n, r)
    #print("%d, r=%0.4f, prob=%f"%(i,r, prob))
    while prob > 1 - delta and i < step_size:
        i += 1; r = float(i)/step_size
        prob = binom.cdf(k, n, r)
        #print("%d, r=%0.4f, prob=%f"%(i,r, prob))

    return r

def confint_binomial(n, k, delta):
    upperBound = find_rmax(n, k, delta/2)
    lowerBound = find_rmin(n, k, delta/2)
    return upperBound - lowerBound

z_table = {0.5:0.67, 0.68:1, 0.80:1.28, 0.90:1.64, 0.95:1.96, 0.98:2.33, 0.99:2.58}
def confint_normal(n, Rt, delta):
    z = z_table[1 - delta]
    return z * sqrt((1.0 / n) * Rt * (1 - Rt))

def confint_hoeffding(n, delta):
    return sqrt((1.0 / (2 * n)) * log(2.0 / delta))

#Example values: n test examples, k misclassifications, confidence parameter delta
n = 500
k = 10
delta = 0.1
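As a quick usage sketch (the numbers are my own example, mirroring the example values at the end of the file), this is how the three functions would be called once the file above is saved as confidence_intervals.py:

import confidence_intervals as ci

#Hypothetical example: 500 test examples, 10 misclassifications, delta = 0.1
n, k, delta = 500, 10, 0.1
Rt = float(k) / n
print("Gaussian: +/- %0.5f" % ci.confint_normal(n, Rt, delta))
print("Hoeffding: +/- %0.5f" % ci.confint_hoeffding(n, delta))
print("Binomial: +/- %0.5f" % ci.confint_binomial(n, k, delta))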

Enjoy!