k-Nearest Neighbors Classifier

Classification using K-Nearest Neighbor (KNN)

import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

from IPython.display import display
pd.set_option('display.notebook_repr_html', True)

Prescription Drug Classification

KNN bases its classifications on the k nearest neighbors. A neighbor's "nearness" is determined by its attributes, or predictors. For example, below, the attributes are simple: every patient at a hospital has an age attribute and a Na/K (sodium/potassium) ratio attribute. Based on those attributes, a patient is assigned a classification (a type of drug). If you share roughly the same age and Na/K ratio as another patient, that patient is considered "near," and you're probably going to be given the same classification.

Of course, you can have multiple neighbors, and you likely will, so it's important to specify a reasonable number of nearest neighbors, k, to base the classification on. If k is too small, the classification is overly sensitive to individual (possibly noisy) records. An even k can leave you with a tie between two classifications. And if k is too large, you might be looking at a long compute time.
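Here is a minimal sketch of the tie problem, using made-up one-dimensional data (hypothetical, separate from the drug example below):

#A toy illustration of how an even k produces a tie
from sklearn.neighbors import KNeighborsClassifier

X_toy = [[0.0], [1.0], [2.0], [3.0]]
y_toy = ["class 1", "class 1", "class 2", "class 2"]

tie = KNeighborsClassifier(n_neighbors=2)
tie.fit(X_toy, y_toy)

#The query point 1.5 sits exactly between the two classes, so each class gets one of the two votes
print(tie.predict_proba([[1.5]]))  #-> [[0.5 0.5]], a dead tie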

The target categorical variable in this example is the drug to be prescribed, which is partitioned into different classes—drug A, drug B, and drug C. The predictor variables are the sodium/potassium (Na/K) ratio and age. This example isn't really ideal because there are only three records; there should be far more. The more records there are, the better we can find some rare cases to include in our model, and it's important to find some balance between common and rare cases.

#A new patient for whom we want to predict which drug to prescribe
new = np.array([0.05, 0.25])

#Three existing patient records
A = np.array([0.0467, 0.2471])
B = np.array([0.0533, 0.1912])
C = np.array([0.0917, 0.2794])

#X, the training set
X = [A, B, C]
#y, the target (or class)
y = ["Drug A", "Drug B", "Drug C"]

#A dataframe to get a glance at the relationship of the variables
df = pd.DataFrame(data = X, index = y, columns = ["Age (MMN)", "Na/K (MMN)"])
display(df)
Age (MMN) Na/K (MMN)
Drug A 0.0467 0.2471
Drug B 0.0533 0.1912
Drug C 0.0917 0.2794
#Fits the model using the training data and targets
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X, y)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')
#Printing the Euclidean distances between the new patient and the three recorded patients
print(neigh.kneighbors([new]))
(array([[ 0.00439318,  0.05102205,  0.05889253]]), array([[0, 2, 1]], dtype=int64))
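
As a sanity check, the same distances can be computed by hand with np.linalg.norm; the values match the kneighbors output above, with patient A nearest, then C, then B (hence the index order [0, 2, 1]):

#Verifying the Euclidean distance from the new patient to each recorded patient
for label, record in zip(y, X):
    print(label, np.linalg.norm(new - record))
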
#predicts the class of the 'new' array, given its values' proximity to the other values/classes
predictions = neigh.predict([new])
prob = neigh.predict_proba([new])
print("Class of provided data: " + str(predictions) + "\nProbability of classification: " + str(prob))
Class of provided data: ['Drug A']
Probability of classification: [[ 0.33333333  0.33333333  0.33333333]]

Surprise: the code and solution in the book are wrong. For whatever reason, the book gives a probability of 0.66667, but intuitively that doesn't make sense. With k = 3, unweighted votes, and only three records (one per class), each class gets exactly one vote, so each probability has to be 1/3 (the model simply picks the three most similar patients, and here that is every patient). The same solution appears when I run the code in R as well. Since a three-way tie makes for a useless model, I'll make it more useful.

neigh2 = KNeighborsClassifier(n_neighbors=1)
neigh2.fit(X, y)
#Distances from the original k=3 model, shown again for reference
print(neigh.kneighbors([new]))

predictions2 = neigh2.predict([new])
prob2 = neigh2.predict_proba([new])
print("Class of provided data: " + str(predictions2) + "\nProbability of classification: " + str(prob2))
(array([[ 0.00439318,  0.05102205,  0.05889253]]), array([[0, 2, 1]], dtype=int64))
Class of provided data: ['Drug A']
Probability of classification: [[ 1.  0.  0.]]

Now that k=1, there is no tie between the three neighbors. The model chooses the point that is in closest proximity, which is a patient that has been prescribed Drug A. We know this because when we look at the distances, that patient is only .00439 units away, while the others are .05102 and .05889 units away.
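
As an aside, scikit-learn can also break the original three-way tie without dropping to k=1: keeping k=3 but passing weights='distance' makes each neighbor's vote count in proportion to the inverse of its distance, and since the Drug A patient is roughly ten times closer than the other two, its vote dominates. A quick sketch, reusing X, y, and new from above:

#k=3 again, but votes are weighted by inverse distance instead of counted equally
neigh_w = KNeighborsClassifier(n_neighbors=3, weights='distance')
neigh_w.fit(X, y)
print(neigh_w.predict([new]))        #the nearby Drug A patient dominates the vote
print(neigh_w.predict_proba([new]))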

But instead of playing with the number of neighbors, we should go back to k=3 and treat particular predictors as more important. If a domain expert were to come in and say that the sodium/potassium ratio is 3x more important than the age predictor (note the 3 coefficient in the equations below), we could scale that axis accordingly. Remember, Euclidean distance is defined in terms of change in x and y, so if Na/K is represented on the y axis, its coordinate values can be scaled. Below are the initial distance calculations, and then the scaled calculations.

$$ \text{d(new,A) = }\sqrt{(0.05 - 0.0467)^2 + (0.25 - 0.2471)^2}\text{= 0.004393, becomes} $$

$$ \text{d(new, A) = }\sqrt{(0.05 - 0.0467)^2 + [3(0.25 - 0.2471)]^2}\text{ = 0.009305.} $$

$$ \text{d(new, B) = }\sqrt{(0.05 - 0.0533)^2 + (0.25 - 0.1912)^2}\text{ = 0.058893, becomes} $$

$$ \text{d(new, B) = }\sqrt{(0.05 - 0.0533)^2 + [3(0.25 - 0.1912)]^2}\text{ = 0.17643.} $$

$$ \text{d(new, C) = }\ldots $$
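
The same weighting can be applied in code by multiplying the Na/K column by 3 (the hypothetical expert's coefficient) in both the training records and the new patient before fitting. A sketch, reusing X, y, and new from above; the distances reported by kneighbors correspond to the scaled calculations:

#Scaling the Na/K predictor by 3 so it counts three times as much in the Euclidean distance
scale = np.array([1.0, 3.0])              #[age weight, Na/K weight]
X_scaled = np.array(X) * scale            #scale the three training records
new_scaled = new * scale                  #scale the new patient the same way

neigh_s = KNeighborsClassifier(n_neighbors=3)
neigh_s.fit(X_scaled, y)
print(neigh_s.kneighbors([new_scaled]))   #distances now match the scaled equations above
print(neigh_s.predict([new_scaled]))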

Credit Risk Classification

risk = pd.read_csv("classifyrisk.txt")
display(risk)
mortgage loans age marital_status income risk
0 y 3 34 other 28060.70 bad loss
1 n 2 37 other 28009.34 bad loss
2 n 2 29 other 27614.60 bad loss
3 y 2 33 other 27287.18 bad loss
4 y 2 39 other 26954.06 bad loss
5 n 2 28 other 26271.86 bad loss
6 n 3 28 other 40445.00 bad loss
... ... ... ... ... ... ...
241 y 0 51 married 46810.12 good risk
242 y 0 55 married 45709.78 good risk
243 y 0 51 married 44896.42 good risk
244 y 0 54 married 44301.52 good risk
245 y 1 60 married 54096.00 good risk

246 rows × 6 columns

#A small sample of eight rows for the training data
risk2 = risk.iloc[[50, 64, 78, 86, 123, 140, 149, 161], [4, 0, 3, 5]]
#Indicator (0/1) columns for marital status
risk2['married'] = np.where(risk2['marital_status'] == "married", 1, 0)
risk2['single'] = np.where(risk2['marital_status'] == "single", 1, 0)
display(risk2)
#Checking the type of one of the new indicator values
type(risk2.loc[64, 'married'])
#Dropping the columns that won't be used as predictors
del risk2['mortgage']
del risk2['marital_status']
income mortgage marital_status risk married single
50 20188.10 n married bad loss 1 0
64 24787.34 y other bad loss 0 0
78 19886.72 y other bad loss 0 0
86 43281.44 y single bad loss 0 1
123 39994.90 y single good risk 0 1
140 34716.50 n single good risk 0 1
149 55186.75 n married good risk 1 0
161 52726.50 n married good risk 1 0
display(risk2)
income risk married single
50 20188.10 bad loss 1 0
64 24787.34 bad loss 0 0
78 19886.72 bad loss 0 0
86 43281.44 bad loss 0 1
123 39994.90 good risk 0 1
140 34716.50 good risk 0 1
149 55186.75 good risk 1 0
161 52726.50 good risk 1 0
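
As an aside, pandas can build this kind of 0/1 indicator column in one call with pd.get_dummies (a sketch; it also emits an 'other' column, which the manual approach above simply leaves implicit as married = single = 0):

#An equivalent way to create indicator columns for marital_status
sample_rows = [50, 64, 78, 86, 123, 140, 149, 161]
display(pd.get_dummies(risk.iloc[sample_rows]['marital_status']))
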
new2 = risk.iloc[162, [4, 0, 3]]
new2['married'] = 1
new2['single'] = 0
del new2['mortgage']
del new2['marital_status']
print(new2)
income     42120.3
married          1
single           0
Name: 162, dtype: object

================================================================================================

This classification is dependent on three fields (predictors): Income, married, and single. The target variable is risk.

X = All observations’ incomes and marital status (whether married and/or single is 1 or 0)

y = All observations’ risk classification

  1. The classification model is fit with fit(X, y) as its training data
  2. The model is used to find the distance of the new observation’s k-nearest neighbors with the kneighbors method
  3. The model predicts the classification of the new observation with the predict method
  4. The model finds the associated probability of each classification, given the nearest neighbors

The new observation is classified as ‘good risk.’

=================================================================================================

neigh3 = KNeighborsClassifier(n_neighbors=3)
neigh3.fit(risk2.iloc[:, [0, 2, 3]], risk2.iloc[:,1])
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')
#Printing the Euclidean distances between the new applicant and the three nearest neighbors
print(neigh3.kneighbors([new2]))

#predicts the class of the 'new' array, given its values' proximity to the other values/classes
predictions3 = neigh3.predict([new2])
prob3 = neigh3.predict_proba([new2])
print("Class of provided data: " + str(predictions3) + "\nProbability of classification: " + str(prob3))
(array([[ 1161.10086125,  2125.44047049,  7403.84013507]]), array([[3, 4, 5]], dtype=int64))
Class of provided data: ['good risk']
Probability of classification: [[ 0.33333333  0.66666667]]
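
One caveat: income here is measured in dollars while married and single are 0/1, so the Euclidean distances above are driven almost entirely by income. The drug example handled this with min-max normalization (the "MMN" columns); a sketch of the same treatment for this model, min-max normalizing income over the eight training records:

#Min-max normalizing income so it sits on the same 0-1 scale as the indicator columns
X_risk = risk2.iloc[:, [0, 2, 3]].copy()
low, high = X_risk['income'].min(), X_risk['income'].max()
X_risk['income'] = (X_risk['income'] - low) / (high - low)

new2_mmn = new2.astype(float)
new2_mmn['income'] = (new2_mmn['income'] - low) / (high - low)

neigh4 = KNeighborsClassifier(n_neighbors=3)
neigh4.fit(X_risk, risk2.iloc[:, 1])
print(neigh4.kneighbors([new2_mmn]))
print(neigh4.predict([new2_mmn]))
print(neigh4.predict_proba([new2_mmn]))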
