BioRpy: ML1. K-Nearest Neighbor in Python

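K-Nearest Neighbor (KNN) is a supervised, lazy learning technique: it builds no model during training, but stores the labelled training data and classifies a new point by majority vote among its k nearest training points.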
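To make the "lazy" part concrete, here is a minimal sketch of the idea in plain Python. The function name knn_predict and the six 2D points are made up for illustration; this is not how scikit-learn implements its classifier.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # "Lazy" learning: nothing is fitted in advance, all the work happens at query time.
    distances = [
        (math.sqrt(sum((a - b) ** 2 for a, b in zip(point, query))), label)
        for point, label in zip(train_x, train_y)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2D points for illustration (not Iris data)
train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_x, train_y, (1.1, 1.0)))  # query near the first cluster -> 0
print(knn_predict(train_x, train_y, (5.1, 5.0)))  # query near the second cluster -> 1
```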

The Iris dataset is used, with 150 instances, 4 features and 3 classes. The rows are sorted by class: the first 50 observations (rows) correspond to class 0, the next 50 rows to class 1 and the last 50 rows to class 2. The program prints the class names.
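The layout described above can be checked directly. This is a small sketch that only inspects the arrays returned by load_iris; the use of NumPy helpers here is my own choice, not part of the original script.

```python
import numpy as np
from sklearn.datasets import load_iris

data = load_iris()
print(data.data.shape)            # (150, 4): 150 instances, 4 features
print(data.feature_names)         # names of the 4 measured features
print(np.bincount(data.target))   # [50 50 50]: 50 rows per class

# The rows are sorted by class: 50 zeros, then 50 ones, then 50 twos
print(np.array_equal(data.target, np.repeat([0, 1, 2], 50)))  # True
```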

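10-fold cross validation is used. Thus 150/10 = 15 instances are used for testing, and the remaining 135 for training. This is done 10 times, each time with a different set of test indices, so every instance is tested exactly once. The KFold function has the shuffle parameter set to True: because the rows are sorted by class, unshuffled folds of 15 consecutive rows would mostly contain a single class. Shuffling makes it very likely, though not guaranteed, that each test/training split has samples from all 3 classes.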
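The effect of shuffling can be seen by counting the distinct classes in each test fold. The sketch below assumes a scikit-learn version that provides sklearn.model_selection (0.18 or later); the helper classes_per_test_fold and the fixed random_state=0 are my additions for illustration. StratifiedKFold is included only as a comparison, since it is the splitter that keeps class proportions equal in every fold.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold

x, y = load_iris(return_X_y=True)

def classes_per_test_fold(splitter):
    """Return the number of distinct classes found in each test fold."""
    return [len(np.unique(y[tst])) for _, tst in splitter.split(x, y)]

# random_state=0 is an assumption made here so that the run is repeatable
plain = classes_per_test_fold(KFold(n_splits=10, shuffle=False))
shuffled = classes_per_test_fold(KFold(n_splits=10, shuffle=True, random_state=0))
stratified = classes_per_test_fold(
    StratifiedKFold(n_splits=10, shuffle=True, random_state=0))

print('no shuffle:', plain)       # [1, 1, 1, 2, 1, 1, 2, 1, 1, 1]
print('shuffle   :', shuffled)    # depends on the random seed
print('stratified:', stratified)  # [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
```

Without shuffling, eight of the ten test folds hold a single class, so the classifier would be tested on a class it has seen less of in training. With plain shuffling the mix is left to chance, which is usually good enough for a dataset of this size.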

The accuracy_score function is used to find the fraction of correctly labelled test values. Since there are 135 training samples, classifying one test sample by brute force means finding 135 distances in 4D space, so each train-test iteration involves up to 15 x 135 = 2025 distances.
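Both points can be sketched in a few lines. The split below (15 random test rows, seed 0) is a stand-in for one fold, and the explicit NumPy distance matrix is for counting only: scikit-learn may use a tree-based search that computes fewer distances, so algorithm='brute' is set here as an assumption to match the brute-force picture.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

x, y = load_iris(return_X_y=True)

# One hypothetical train/test split of the same size as a single fold
rng = np.random.RandomState(0)
idx = rng.permutation(len(x))
tst, trn = idx[:15], idx[15:]

clf = KNeighborsClassifier(n_neighbors=5, algorithm='brute').fit(x[trn], y[trn])
pred = clf.predict(x[tst])

# accuracy_score is the fraction of predictions that match the true labels
acc = accuracy_score(y[tst], pred)
print(np.isclose(acc, np.mean(pred == y[tst])))  # True

# Brute-force Euclidean distances: one row per test point, one column per training point
diff = x[tst][:, None, :] - x[trn][None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=2))
print(dist.shape)  # (15, 135)
print(dist.size)   # 2025
```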

# ML1.py
from __future__ import print_function, division
from sklearn.neighbors import KNeighborsClassifier
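# KFold lives in sklearn.model_selection since scikit-learn 0.18;
# the older sklearn.cross_validation module was removed in 0.20.
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Loading data (150,4)
data = load_iris()
x = data.data
y = data.target
print('The three classes are',data.target_names)

# Use 5 nearest neighbors
classifier = KNeighborsClassifier(n_neighbors=5)

# Running 10 tests using 10-fold cross validation
seen = set()   # test indices used in earlier folds
test = set()   # indices that appear in more than one test fold
acc = []
kf = KFold(n_splits=10, shuffle=True)
for trn,tst in kf.split(x):
    x_train = x[trn]
    y_train = y[trn]
    print('length of x_train:',len(x_train))
    classifier.fit(x_train, y_train)
    x_test = x[tst]
    y_test = y[tst]
    # Intersecting with an initially empty set is always empty, so compare
    # against the indices already tested; the folds should never overlap.
    test |= seen.intersection(tst)
    seen.update(tst)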
    print('length of x_test:',len(x_test))
    print('tst:',tst)
    pred = classifier.predict(x_test)
    acc.append(accuracy_score(y_test,pred))

# Accuracy
print('Result: {}'.format(sum(acc)/len(acc)))
print('length of test: {}'.format(len(test)))

#The three classes are ['setosa' 'versicolor' 'virginica']
#length of x_train: 135
#length of x_test: 15
#tst: [ 11  20  37  42  58  88  94  95  99 101 117 121 132 136 146]
#length of x_train: 135
#length of x_test: 15
#tst: [  0  13  19  26  47  64  76  86  97  98 104 105 120 133 143]
#length of x_train: 135
#length of x_test: 15
#tst: [ 12  18  24  27  30  33  35  38  48  51  55  60 106 122 144]
#length of x_train: 135
#length of x_test: 15
#tst: [  5  32  45  52  65  66  81  83  90 102 116 131 137 139 148]
#length of x_train: 135
#length of x_test: 15
#tst: [  3   4  17  23  29  31  40  41  49  79  85  87 109 114 145]
#length of x_train: 135
#length of x_test: 15
#tst: [  1   2  54  57  61  80  89  96 113 115 118 127 128 134 141]
#length of x_train: 135
#length of x_test: 15
#tst: [  7   8  10  15  16  71  74  82 125 129 130 135 140 142 149]
#length of x_train: 135
#length of x_test: 15
#tst: [  9  14  53  56  68  69  73  75  77  91 100 103 107 110 111]
#length of x_train: 135
#length of x_test: 15
#tst: [  6  21  25  28  34  44  46  62  63  70  92 119 126 138 147]
#length of x_train: 135
#length of x_test: 15
#tst: [ 22  36  39  43  50  59  67  72  78  84  93 108 112 123 124]
#Result: 0.973333333333
#length of test: 0
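For comparison, the same averaged accuracy can be obtained more compactly with cross_val_score. This is a sketch assuming scikit-learn 0.18 or later; the fixed random_state=0 is my addition, so the number printed will not necessarily match the result above, which came from an unseeded shuffle.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

x, y = load_iris(return_X_y=True)

# Same setup as the loop: 5 neighbors, 10 shuffled folds
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), x, y,
                         cv=cv, scoring='accuracy')

print(len(scores))     # 10, one accuracy value per fold
print(scores.mean())   # average accuracy over the folds
```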

BioRpy

Sunday, May 10, 2015
