k-Nearest Neighbors Classifier

Classification using K-Nearest Neighbor (KNN)

import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

from IPython.display import display
pd.set_option('display.notebook_repr_html', True)

Prescription Drug Classification

KNN bases its classifications on the k nearest neighbors. A neighbor's "nearness" is determined by its attributes, or predictors. For example, below, the attributes are simple: every patient at a hospital has an age attribute and a Na/K (sodium/potassium) ratio attribute. Based on those attributes, a patient is assigned a classification (a type of drug). If you share roughly the same age and Na/K ratio as another patient, that patient is considered "near," and you're probably going to be given the same classification.

Of course, you can have multiple neighbors, and you likely will, so it's important to specify a reasonable number of nearest neighbors, k, to base the classification on. If k is too small, the classification is overly sensitive to individual (possibly noisy) records. An even k can leave you with a tie between two classifications. And if k is too large, you might be looking at a long compute time.
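Here is a minimal sketch of the tie problem, using made-up one-dimensional data (hypothetical, separate from the drug example below):

#A toy illustration of how an even k produces a tie
from sklearn.neighbors import KNeighborsClassifier

X_toy = [[0.0], [1.0], [2.0], [3.0]]
y_toy = ["class 1", "class 1", "class 2", "class 2"]

tie = KNeighborsClassifier(n_neighbors=2)
tie.fit(X_toy, y_toy)

#The query point 1.5 sits exactly between the two classes, so each class gets one of the two votes
print(tie.predict_proba([[1.5]]))  #-> [[0.5 0.5]], a dead tie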

The target categorical variable in this example is the drug to be prescribed, which is partitioned into different classes—drug A, drug B, and drug C. The predictor variables are the sodium/potassium (Na/K) ratio and age. This example isn't really ideal because there are only three records; there should be far more. The more records there are, the better we can find some rare cases to include in our model, and it's important to find some balance between common and rare cases.

#A new patient for whom we want to predict which drug to prescribe
new = np.array([0.05, 0.25])

#Three existing patient records
A = np.array([0.0467, 0.2471])
B = np.array([0.0533, 0.1912])
C = np.array([0.0917, 0.2794])

#X, the training set
X = [A, B, C]
#y, the target (or class)
y = ["Drug A", "Drug B", "Drug C"]

#A dataframe to get a glance at the relationship of the variables
df = pd.DataFrame(data = X, index = y, columns = ["Age (MMN)", "Na/K (MMN)"])
display(df)
Age (MMN) Na/K (MMN)
Drug A 0.0467 0.2471
Drug B 0.0533 0.1912
Drug C 0.0917 0.2794
#Fits the model using the training data and targets
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X, y)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')
#Printing the Euclidean distances between the new patient and the three recorded patients
print(neigh.kneighbors([new]))
(array([[ 0.00439318,  0.05102205,  0.05889253]]), array([[0, 2, 1]], dtype=int64))
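
As a sanity check, the same distances can be computed by hand with np.linalg.norm; the values match the kneighbors output above, with patient A nearest, then C, then B (hence the index order [0, 2, 1]):

#Verifying the Euclidean distance from the new patient to each recorded patient
for label, record in zip(y, X):
    print(label, np.linalg.norm(new - record))
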
#predicts the class of the 'new' array, given its values' proximity to the other values/classes
predictions = neigh.predict([new])
prob = neigh.predict_proba([new])
print("Class of provided data: " + str(predictions) + "\nProbability of classification: " + str(prob))
Class of provided data: ['Drug A']
Probability of classification: [[ 0.33333333  0.33333333  0.33333333]]

Surprise: the code and solution in the book are wrong. For whatever reason, the book gives a probability of 0.66667, but intuitively that doesn't make sense. With k = 3, unweighted votes, and only three records (one per class), each class gets exactly one vote, so each probability has to be 1/3 (the model simply picks the three most similar patients, and here that is every patient). The same solution appears when I run the code in R as well. Since a three-way tie makes for a useless model, I'll make it more useful.

neigh2 = KNeighborsClassifier(n_neighbors=1)
neigh2.fit(X, y)
#Distances from the original k=3 model, shown again for reference
print(neigh.kneighbors([new]))

predictions2 = neigh2.predict([new])
prob2 = neigh2.predict_proba([new])
print("Class of provided data: " + str(predictions2) + "\nProbability of classification: " + str(prob2))
(array([[ 0.00439318,  0.05102205,  0.05889253]]), array([[0, 2, 1]], dtype=int64))
Class of provided data: ['Drug A']
Probability of classification: [[ 1.  0.  0.]]

Now that k=1, there is no tie between the three neighbors. The model chooses the point that is in closest proximity, which is a patient that has been prescribed Drug A. We know this because when we look at the distances, that patient is only .00439 units away, while the others are .05102 and .05889 units away.
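
As an aside, scikit-learn can also break the original three-way tie without dropping to k=1: keeping k=3 but passing weights='distance' makes each neighbor's vote count in proportion to the inverse of its distance, and since the Drug A patient is roughly ten times closer than the other two, its vote dominates. A quick sketch, reusing X, y, and new from above:

#k=3 again, but votes are weighted by inverse distance instead of counted equally
neigh_w = KNeighborsClassifier(n_neighbors=3, weights='distance')
neigh_w.fit(X, y)
print(neigh_w.predict([new]))        #the nearby Drug A patient dominates the vote
print(neigh_w.predict_proba([new]))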

But instead of playing with the number of neighbors, we should go back to k=3 and treat particular predictors as more important. If a domain expert were to come in and say that the sodium/potassium ratio is 3x more important than the age predictor (note the 3 coefficient in the equations below), we could scale that axis accordingly. Remember, Euclidean distance is defined in terms of change in x and y, so if Na/K is represented on the y axis, its coordinate values can be scaled. Below are the initial distance calculations, and then the scaled calculations.

$$ \text{d(new,A) = }\sqrt{(0.05 - 0.0467)^2 + (0.25 - 0.2471)^2}\text{= 0.004393, becomes} $$

$$ \text{d(new, A) = }\sqrt{(0.05 - 0.0467)^2 + [3(0.25 - 0.2471)]^2}\text{ = 0.009305.} $$

$$ \text{d(new, B) = }\sqrt{(0.05 - 0.0533)^2 + (0.25 - 0.1912)^2}\text{ = 0.058893, becomes} $$

$$ \text{d(new, B) = }\sqrt{(0.05 - 0.0533)^2 + [3(0.25 - 0.1912)]^2}\text{ = 0.17643.} $$

$$ \text{d(new, C) = }\ldots $$
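
The same weighting can be applied in code by multiplying the Na/K column by 3 (the hypothetical expert's coefficient) in both the training records and the new patient before fitting. A sketch, reusing X, y, and new from above; the distances reported by kneighbors correspond to the scaled calculations:

#Scaling the Na/K predictor by 3 so it counts three times as much in the Euclidean distance
scale = np.array([1.0, 3.0])              #[age weight, Na/K weight]
X_scaled = np.array(X) * scale            #scale the three training records
new_scaled = new * scale                  #scale the new patient the same way

neigh_s = KNeighborsClassifier(n_neighbors=3)
neigh_s.fit(X_scaled, y)
print(neigh_s.kneighbors([new_scaled]))   #distances now match the scaled equations above
print(neigh_s.predict([new_scaled]))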

Credit Risk Classification

risk = pd.read_csv("classifyrisk.txt")
display(risk)
mortgage loans age marital_status income risk
0 y 3 34 other 28060.70 bad loss
1 n 2 37 other 28009.34 bad loss
2 n 2 29 other 27614.60 bad loss
3 y 2 33 other 27287.18 bad loss
4 y 2 39 other 26954.06 bad loss
5 n 2 28 other 26271.86 bad loss
6 n 3 28 other 40445.00 bad loss
... ... ... ... ... ... ...
241 y 0 51 married 46810.12 good risk
242 y 0 55 married 45709.78 good risk
243 y 0 51 married 44896.42 good risk
244 y 0 54 married 44301.52 good risk
245 y 1 60 married 54096.00 good risk

246 rows × 6 columns

#A small sample of eight rows for the training data
risk2 = risk.iloc[[50, 64, 78, 86, 123, 140, 149, 161], [4, 0, 3, 5]]
#Indicator (0/1) columns for marital status
risk2['married'] = np.where(risk2['marital_status'] == "married", 1, 0)
risk2['single'] = np.where(risk2['marital_status'] == "single", 1, 0)
display(risk2)
#Checking the type of one of the new indicator values
type(risk2.loc[64, 'married'])
#Dropping the columns that won't be used as predictors
del risk2['mortgage']
del risk2['marital_status']
income mortgage marital_status risk married single
50 20188.10 n married bad loss 1 0
64 24787.34 y other bad loss 0 0
78 19886.72 y other bad loss 0 0
86 43281.44 y single bad loss 0 1
123 39994.90 y single good risk 0 1
140 34716.50 n single good risk 0 1
149 55186.75 n married good risk 1 0
161 52726.50 n married good risk 1 0
display(risk2)
income risk married single
50 20188.10 bad loss 1 0
64 24787.34 bad loss 0 0
78 19886.72 bad loss 0 0
86 43281.44 bad loss 0 1
123 39994.90 good risk 0 1
140 34716.50 good risk 0 1
149 55186.75 good risk 1 0
161 52726.50 good risk 1 0
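
As an aside, pandas can build this kind of 0/1 indicator column in one call with pd.get_dummies (a sketch; it also emits an 'other' column, which the manual approach above simply leaves implicit as married = single = 0):

#An equivalent way to create indicator columns for marital_status
sample_rows = [50, 64, 78, 86, 123, 140, 149, 161]
display(pd.get_dummies(risk.iloc[sample_rows]['marital_status']))
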
new2 = risk.iloc[162, [4, 0, 3]]
new2['married'] = 1
new2['single'] = 0
del new2['mortgage']
del new2['marital_status']
print(new2)
income     42120.3
married          1
single           0
Name: 162, dtype: object

================================================================================================

This classification is dependent on three fields (predictors): Income, married, and single. The target variable is risk.

X = All observations’ incomes and marital status (whether married and/or single is 1 or 0)

y = All observations’ risk classification

  1. The classification model is fit with fit(X, y) as its training data
  2. The model is used to find the distance of the new observation’s k-nearest neighbors with the kneighbors method
  3. The model predicts the classification of the new observation with the predict method
  4. The model finds the associated probability of each classification, given the nearest neighbors

The new observation is classified as ‘good risk.’

=================================================================================================

neigh3 = KNeighborsClassifier(n_neighbors=3)
neigh3.fit(risk2.iloc[:, [0, 2, 3]], risk2.iloc[:,1])
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')
#Printing the Euclidean distances between the new applicant and the three nearest neighbors
print(neigh3.kneighbors([new2]))

#predicts the class of the 'new' array, given its values' proximity to the other values/classes
predictions3 = neigh3.predict([new2])
prob3 = neigh3.predict_proba([new2])
print("Class of provided data: " + str(predictions3) + "\nProbability of classification: " + str(prob3))
(array([[ 1161.10086125,  2125.44047049,  7403.84013507]]), array([[3, 4, 5]], dtype=int64))
Class of provided data: ['good risk']
Probability of classification: [[ 0.33333333  0.66666667]]
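
One caveat: income here is measured in dollars while married and single are 0/1, so the Euclidean distances above are driven almost entirely by income. The drug example handled this with min-max normalization (the "MMN" columns); a sketch of the same treatment for this model, min-max normalizing income over the eight training records:

#Min-max normalizing income so it sits on the same 0-1 scale as the indicator columns
X_risk = risk2.iloc[:, [0, 2, 3]].copy()
low, high = X_risk['income'].min(), X_risk['income'].max()
X_risk['income'] = (X_risk['income'] - low) / (high - low)

new2_mmn = new2.astype(float)
new2_mmn['income'] = (new2_mmn['income'] - low) / (high - low)

neigh4 = KNeighborsClassifier(n_neighbors=3)
neigh4.fit(X_risk, risk2.iloc[:, 1])
print(neigh4.kneighbors([new2_mmn]))
print(neigh4.predict([new2_mmn]))
print(neigh4.predict_proba([new2_mmn]))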
