Classification using K-Nearest Neighbor (KNN)
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from IPython.display import display
pd.set_option('display.notebook_repr_html', True)
Prescription Drug Classification
KNN bases its classifications on the k nearest neighbors. A neighbor’s “nearness” is based on its attributes, or predictors. For example, below, the attributes are simple: every patient at a hospital has an age attribute and a Na/K (sodium/potassium) ratio attribute. Based on those attributes, a patient is assigned a classification (a type of drug to prescribe). If you share the same age and the same Na/K ratio as another patient, that patient is considered “near”, and you’re probably going to be given the same classification.
Of course, you can have multiple neighbors, and you likely will, so it’s important to specify a reasonable number of nearest neighbors (k) to base the classification on. If k is too small, the classification might be inaccurate. An even k might produce a tie between two classifications. And if k is too large, you might be looking at a long compute time.
The target categorical variable in this example is the drug to be prescribed, which is partitioned into three classes: drug A, drug B, and drug C. The predictor variables are the sodium/potassium ratio and age. This example isn’t really ideal because there are only three records; there should be far more. The more records there are, the better we can find some rare cases to include in our model, and it’s important to find some balance between common and rare cases.
#A new patient for whom we want to determine which drug to prescribe
new = np.array([0.05, 0.25])
#Three existing patient records
A = np.array([0.0467, 0.2471])
B = np.array([0.0533, 0.1912])
C = np.array([0.0917, 0.2794])
#X, the training set
X = [A, B, C]
#y, the target (or class)
y = ["Drug A", "Drug B", "Drug C"]
#A dataframe to get a glance at the relationship of the variables
df = pd.DataFrame(data = X, index = y, columns = ["Age (MMN)", "Na/K (MMN)"])
display(df)
| | Age (MMN) | Na/K (MMN) |
| --- | --- | --- |
| Drug A | 0.0467 | 0.2471 |
| Drug B | 0.0533 | 0.1912 |
| Drug C | 0.0917 | 0.2794 |
#Fits the model using the training data and targets
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X, y)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=3, p=2,
weights='uniform')
#Printing the euclidean distance between the new patient and the three recorded patients
print(neigh.kneighbors([new]))
(array([[ 0.00439318, 0.05102205, 0.05889253]]), array([[0, 2, 1]], dtype=int64))
#predicts the class of the 'new' array, given its values' proximity to the other values/classes
predictions = neigh.predict([new])
prob = neigh.predict_proba([new])
print("Class of provided data: " + str(predictions) + "\nProbability of classification: " + str(prob))
Class of provided data: ['Drug A']
Probability of classification: [[ 0.33333333 0.33333333 0.33333333]]
Surprise: the code and solution in the book are wrong. For whatever reason, the book gives a probability of 0.66667, but intuitively that doesn’t make sense. With k = 3 and unweighted voting, the model just picks the three most similar patients, and since each of those patients was prescribed a different drug, each class gets exactly one vote and a probability of 1/3. I get the same result when I run the code in R as well. Since this is a useless model, I’ll make it more useful.
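#Refit the classifier with k=1 so that only the single closest patient determines the classification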
neigh2 = KNeighborsClassifier(n_neighbors=1)
neigh2.fit(X, y)
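#Using the original k=3 model here to print all three distances for comparison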
print(neigh.kneighbors([new]))
predictions2 = neigh2.predict([new])
prob2 = neigh2.predict_proba([new])
print("Class of provided data: " + str(predictions2) + "\nProbability of classification: " + str(prob2))
(array([[ 0.00439318, 0.05102205, 0.05889253]]), array([[0, 2, 1]], dtype=int64))
Class of provided data: ['Drug A']
Probability of classification: [[ 1. 0. 0.]]
Now that k = 1, there is no three-way tie between the classes. The model simply chooses the single point in closest proximity, which is a patient who was prescribed Drug A. We know this because, looking at the distances, that patient is only 0.00439 units away, while the others are 0.05102 and 0.05889 units away.
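As a side note, the tie can also be broken while keeping k = 3 by weighting each neighbor’s vote by the inverse of its distance, rather than the uniform weights used above. This is only a quick sketch using scikit-learn’s weights='distance' option:

#Same three training records, but each neighbor's vote is weighted by 1/distance
neigh_d = KNeighborsClassifier(n_neighbors=3, weights='distance')
neigh_d.fit(X, y)
#The closest patient (Drug A, 0.00439 units away) now dominates the vote
print(neigh_d.predict([new]))
print(neigh_d.predict_proba([new]))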
But instead of playing with the number of neighbors (or the vote weighting), we should go back to k = 3 and treat particular predictors as more important. Suppose a domain expert tells us that the sodium/potassium ratio is three times as important as the age predictor (note the 3 coefficient in the equations below). We can then scale that axis accordingly; remember, Euclidean distance is defined in terms of the change in x and y, so if Na/K is represented on the y axis, its coordinate values can be scaled. Below are the initial distance calculations, followed by the scaled calculations.
$$ \text{d(new,A) = }\sqrt{(0.05 - 0.0467)^2 + (0.25 - 0.2471)^2}\text{= 0.004393, becomes} $$
$$ \text{d(new, A) = }\sqrt{(0.05 - 0.0467)^2 + [3(0.25 - 0.2471)]^2}\text{ = 0.009305.} $$
$$ \text{d(new, B) = }\sqrt{(0.05 - 0.0533)^2 + (0.25 - 0.1912)^2}\text{ = 0.058893, becomes} $$
$$ \text{d(new, B) = }\sqrt{(0.05 - 0.0533)^2 + [3(0.25 - 0.1912)]^2}\text{ = 0.17643.} $$
$$ \text{d(new, C) = }\mathellipsis $$
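To mirror those hand calculations in code, one simple option is to multiply the Na/K column by 3 before fitting. This is just a sketch; the scaling factor comes from the hypothetical domain expert above, and the variable names (scale, X_scaled, neigh_s, and so on) are mine:

#Scale the Na/K (second) column by 3 to reflect its assumed greater importance
scale = np.array([1, 3])
X_scaled = np.array(X) * scale
new_scaled = new * scale

neigh_s = KNeighborsClassifier(n_neighbors=3)
neigh_s.fit(X_scaled, y)
#The distances now correspond to the scaled calculations above (e.g., d(new, A) = 0.009305)
print(neigh_s.kneighbors([new_scaled]))
print(neigh_s.predict([new_scaled]))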
Credit Risk Classification
risk = pd.read_csv("classifyrisk.txt")
display(risk)
| | mortgage | loans | age | marital_status | income | risk |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | y | 3 | 34 | other | 28060.70 | bad loss |
| 1 | n | 2 | 37 | other | 28009.34 | bad loss |
| 2 | n | 2 | 29 | other | 27614.60 | bad loss |
| 3 | y | 2 | 33 | other | 27287.18 | bad loss |
| 4 | y | 2 | 39 | other | 26954.06 | bad loss |
| 5 | n | 2 | 28 | other | 26271.86 | bad loss |
| 6 | n | 3 | 28 | other | 40445.00 | bad loss |
| ... | ... | ... | ... | ... | ... | ... |
| 241 | y | 0 | 51 | married | 46810.12 | good risk |
| 242 | y | 0 | 55 | married | 45709.78 | good risk |
| 243 | y | 0 | 51 | married | 44896.42 | good risk |
| 244 | y | 0 | 54 | married | 44301.52 | good risk |
| 245 | y | 1 | 60 | married | 54096.00 | good risk |

246 rows × 6 columns
#Random sample for training data
risk2 = risk.iloc[[50, 64, 78, 86, 123, 140, 149, 161], [4, 0, 3, 5]]
risk2['married'] = np.where(risk2['marital_status'] == "married", 1, 0)
risk2['single'] = np.where(risk2['marital_status'] == "single", 1, 0)
display(risk2)
#Check the data type of one of the new dummy variables
type(risk2.loc[64, 'married'])
del risk2['mortgage']
del risk2['marital_status']
| | income | mortgage | marital_status | risk | married | single |
| --- | --- | --- | --- | --- | --- | --- |
| 50 | 20188.10 | n | married | bad loss | 1 | 0 |
| 64 | 24787.34 | y | other | bad loss | 0 | 0 |
| 78 | 19886.72 | y | other | bad loss | 0 | 0 |
| 86 | 43281.44 | y | single | bad loss | 0 | 1 |
| 123 | 39994.90 | y | single | good risk | 0 | 1 |
| 140 | 34716.50 | n | single | good risk | 0 | 1 |
| 149 | 55186.75 | n | married | good risk | 1 | 0 |
| 161 | 52726.50 | n | married | good risk | 1 | 0 |
display(risk2)
| | income | risk | married | single |
| --- | --- | --- | --- | --- |
| 50 | 20188.10 | bad loss | 1 | 0 |
| 64 | 24787.34 | bad loss | 0 | 0 |
| 78 | 19886.72 | bad loss | 0 | 0 |
| 86 | 43281.44 | bad loss | 0 | 1 |
| 123 | 39994.90 | good risk | 0 | 1 |
| 140 | 34716.50 | good risk | 0 | 1 |
| 149 | 55186.75 | good risk | 1 | 0 |
| 161 | 52726.50 | good risk | 1 | 0 |
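#A new applicant (record 162) that we want to classify as 'good risk' or 'bad loss'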
new2 = risk.iloc[162, [4, 0, 3]]
new2['married'] = 1
new2['single'] = 0
del new2['mortgage']
del new2['marital_status']
print(new2)
income 42120.3
married 1
single 0
Name: 162, dtype: object
================================================================================================
This classification is dependent on three fields (predictors): Income, married, and single. The target variable is risk.
X = All observations’ incomes and marital status (whether married and/or single is 1 or 0)
y = All observations’ risk classification
- The classification model is fit with fit(X, y) as its training data
- The model is used to find the distance of the new observation’s k-nearest neighbors with the kneighbors method
- The model predicts the classification of the new observation with the predict method
- The model finds the associated probability of each classification, given the nearest neighbors
The new observation is classified as ‘good risk.’
=================================================================================================
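#Fits the model using income, married, and single as predictors, and risk as the target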
neigh3 = KNeighborsClassifier(n_neighbors=3)
neigh3.fit(risk2.iloc[:, [0, 2, 3]], risk2.iloc[:,1])
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=3, p=2,
weights='uniform')
#Printing the euclidean distance between the new applicant and its nearest three neighbors
print(neigh3.kneighbors([new2]))
#predicts the class of the 'new2' observation, given its values' proximity to the other values/classes
predictions3 = neigh3.predict([new2])
prob3 = neigh3.predict_proba([new2])
print("Class of provided data: " + str(predictions3) + "\nProbability of classification: " + str(prob3))
(array([[ 1161.10086125, 2125.44047049, 7403.84013507]]), array([[3, 4, 5]], dtype=int64))
Class of provided data: ['good risk']
Probability of classification: [[ 0.33333333 0.66666667]]
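One thing the distances above (1161, 2125, 7403) make obvious is that income dominates the calculation, because it is orders of magnitude larger than the 0/1 dummy variables. Below is a rough sketch of min-max normalizing income, the same MMN treatment the drug example’s columns use, before fitting; the income_mmn column name and the risk3/new3/neigh4 variables are mine:

#Min-max normalize income so it falls on the same 0-1 scale as the dummy variables
risk3 = risk2.copy()
risk3['income_mmn'] = (risk3['income'] - risk3['income'].min()) / (risk3['income'].max() - risk3['income'].min())

new3 = new2.copy()
new3['income_mmn'] = (new3['income'] - risk2['income'].min()) / (risk2['income'].max() - risk2['income'].min())

neigh4 = KNeighborsClassifier(n_neighbors=3)
neigh4.fit(risk3[['income_mmn', 'married', 'single']], risk3['risk'])

#Distances and votes are no longer driven almost entirely by income
print(neigh4.kneighbors([new3[['income_mmn', 'married', 'single']]]))
print(neigh4.predict([new3[['income_mmn', 'married', 'single']]]))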
Useful Resources:
- This page is a compilation of my notes from the book Data Mining and Predictive Analytics.