KNN Accuracy of 100%?
I have used the following code for KNN
jd <- jobdata
head(jd)
jd$`permanency rate` <- as.integer(as.factor(jd$`permanency rate`))
jd$`job skills`=as.integer(as.factor(jd$`job skills`))
jd$Default <- factor(jd$Default)
num.vars <- sapply(jd, is.numeric)
jd[num.vars] <- lapply(jd[num.vars], scale)
jd$`permanency rate` <- factor(jd$`permanency rate`)
num.vars <- sapply(jd, is.numeric)
jd[num.vars] <- lapply(jd[num.vars], scale)
myvars <- c("permanency rate", "job skills")
jd.subset <- jd[myvars]
summary(jd.subset)
set.seed(123)
test <- 1:100
train.jd <- jd.subset[-test,]
test.jd <- jd.subset[test,]
train.def <- jd$`permanency rate`[-test]
test.def <- jd$`permanency rate`[test]
library(class)
knn.1 <- knn(train.jd, test.jd, train.def, k=1)
knn.3 <- knn(train.jd, test.jd, train.def, k=3)
knn.5 <- knn(train.jd, test.jd, train.def, k=5)
But whenever I calculate the proportion of correct classifications for k = 1, 3, and 5, I always get 100% accuracy. Is this normal, or have I gone wrong somewhere?
Thanks
Solution 1:[1]
We can't say that the KNN classifier always produces wrong results; it depends on the dataset. In the degenerate case, the test data can be identical to the training data, and then the classifier will always score 100%.
Train data == test data → 100% accuracy in every case.
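To make the point above concrete, here is a minimal sketch in plain Python (no external libraries, invented toy data): a 1-NN classifier evaluated on its own training set always reaches 100% accuracy, because every training point's nearest neighbour is itself.

```python
# Minimal sketch: 1-NN scored on its own training data always gets 100%,
# since each point is at distance zero from itself. Data is made up.
def knn_predict(train_X, train_y, x, k=1):
    # Sort training indices by squared distance to x, vote among the k nearest.
    order = sorted(range(len(train_X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    votes = [train_y[i] for i in order[:k]]
    return max(set(votes), key=votes.count)

train_X = [(1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
train_y = ["low", "low", "high", "high"]

# Evaluate on the training set itself: accuracy is trivially 1.0 for k=1.
preds = [knn_predict(train_X, train_y, x, k=1) for x in train_X]
accuracy = sum(p == y for p, y in zip(preds, train_y)) / len(train_y)
print(accuracy)  # 1.0
```

This is why a held-out test set that never overlaps the training set is essential before trusting an accuracy figure.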
Solution 2:[2]
This happens only when the model overfits. An overfit model has memorized the training data rather than capturing its underlying randomness, and so predicts the training set with 100 percent accuracy.
Solution 3:[3]
This is unlikely in most projects, because with a complex dataset and a large number of independent variables (predictors), the target labels (y) rarely separate this cleanly.
It would be worth trying some clustering techniques, or a simple pair plot of your variables coloured by the target variable, to see whether the classes really are neatly grouped.
An example would be:
# A pair plot in Python using seaborn, coloured by the target variable
import matplotlib.pyplot as plt
import seaborn as sns
sns.pairplot(data = jd, hue = "permanency rate")
plt.show()
Depending on the language and library you are using, the KNN classifier usually defaults to K = 5 (for example, scikit-learn's KNeighborsClassifier uses n_neighbors=5). You can try values above this to see whether the result changes.
You should also construct a confusion matrix and review your metrics.
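As a sketch of that last step, here is a confusion matrix built by hand in plain Python; the labels and predictions are invented purely for illustration.

```python
# Hedged sketch: build a confusion matrix from (actual, predicted) pairs.
# The diagonal holds correct classifications; data is made up.
from collections import Counter

actual    = ["stay", "stay", "leave", "leave", "stay", "leave"]
predicted = ["stay", "leave", "leave", "leave", "stay", "stay"]

# Count each (actual, predicted) combination.
cm = Counter(zip(actual, predicted))
labels = sorted(set(actual))

# Print a small table: rows are actual classes, columns are predictions.
print("actual\\predicted", *labels)
for a in labels:
    print(a, *[cm[(a, p)] for p in labels])

# Accuracy is the diagonal total over the number of observations.
accuracy = sum(cm[(l, l)] for l in labels) / len(actual)
print("accuracy:", round(accuracy, 2))  # 0.67
```

If your real confusion matrix is perfectly diagonal on a held-out test set, check that the target variable has not leaked into the predictors.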
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 | Sridhar Cr
Solution 2 | abak1802
Solution 3 | ahjim0m0