KNN Function in R Programming: A Beginner's Guide
The K-Nearest Neighbors (KNN) algorithm is a widely used supervised machine learning algorithm for classification and regression problems. It is simple and easy to understand, and with versatile applications in healthcare, finance, retail, and other industries, it has become a popular choice for building models.
In this tutorial, we will explore how to implement KNN in R programming. We will go over the KNN function in R, KNN regression, and other related concepts. So, let's get started!
K Nearest Neighbor in R
The KNN algorithm works on the principle of finding the K nearest neighbors to a new data point and making predictions based on the majority class or mean value of the K neighbors. Here, K represents the number of neighbors we want to consider.
The distance between the new data point and the other points in the dataset is calculated using a distance metric such as Euclidean distance, Manhattan distance, or another metric chosen to suit the data.
Once the distances are calculated, the K nearest data points are selected, and predictions are made based on the majority class or mean value of the K neighbors.
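To make this procedure concrete, here is a minimal from-scratch sketch of KNN classification: compute distances, take the K nearest, and vote. The helper names (`knn_predict`, `train_x`, and so on) are illustrative, not part of any package.

```r
# Minimal KNN classifier sketch: Euclidean distance + majority vote.
# All names here (knn_predict, train_x, ...) are illustrative.
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  # Euclidean distance from new_x to every training row
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Indices of the k nearest neighbors
  nearest <- order(dists)[1:k]
  # Majority class among those neighbors
  names(which.max(table(train_y[nearest])))
}

train_x <- as.matrix(iris[, 1:4])
train_y <- iris$Species
# A setosa-like flower (close to the first rows of iris)
knn_predict(train_x, train_y, c(5.0, 3.5, 1.4, 0.2), k = 5)  # returns "setosa"
```

The `class` package's `knn` function does essentially this, but with an optimized implementation.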
KNN in R
To implement the KNN algorithm in R, we can use the `class` package, which provides the `knn` function for classification. For KNN regression, the `FNN` package provides a `knn.reg` function. Let's explore both.
KNN Classification
To perform KNN classification in R, we can use the `knn` function. Its syntax is as follows:

```r
knn(train, test, cl, k)
```
Here, `train` is the training dataset, `test` is the test dataset, `cl` is the vector of class labels for the training data, and `k` is the number of neighbors to consider. The output of the `knn` function is a factor of predicted class labels for the test dataset.
Let's see an example of how to use the `knn` function in R for classification.
```r
library(class)

# Load the iris dataset
data(iris)

# iris is sorted by species, so shuffle before splitting
set.seed(42)
shuffled <- iris[sample(nrow(iris)), ]

# Split the shuffled data into training and test sets
train <- shuffled[1:100, ]
test  <- shuffled[101:150, ]

# Perform KNN classification (column 5 is Species)
pred <- knn(train[, -5], test[, -5], train[, 5], k = 3)

# Print the predicted class labels
print(pred)
```
Here, we loaded the `iris` dataset and split it into training and test sets, then used the `knn` function to predict the class labels for the test set.
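To gauge how well the classifier did, we can compare the predictions against the true labels. A quick, self-contained sketch (the split mirrors the example above, with a shuffle because `iris` is sorted by species):

```r
library(class)

# Shuffle iris (it is sorted by species) and split it
set.seed(42)
shuffled <- iris[sample(nrow(iris)), ]
train <- shuffled[1:100, ]
test  <- shuffled[101:150, ]

pred <- knn(train[, -5], test[, -5], train[, 5], k = 3)

# Confusion table: predicted vs. actual species
print(table(Predicted = pred, Actual = test$Species))

# Overall accuracy on the test set
accuracy <- mean(pred == test$Species)
print(accuracy)
```

The confusion table shows which species get confused with which, while the accuracy gives a single summary number.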
KNN Regression
To perform KNN regression in R, we can use the `knn.reg` function from the `FNN` package (the `class` package does not provide a regression variant). Its syntax is as follows:

```r
knn.reg(train, test, y, k)
```
Here, `train` is the training dataset, `test` is the test dataset, `y` is the target variable for the training data, and `k` is the number of neighbors to consider. The `knn.reg` function returns a list whose `pred` component contains the predicted target values for the test dataset.
Let's see an example of how to use the `knn.reg` function in R for regression.
```r
library(FNN)

# Load the cars dataset (50 rows: speed and stopping distance)
data(cars)

# Split the dataset into training and test sets
train <- cars[1:20, ]
test  <- cars[21:50, ]

# Predict stopping distance (dist) from speed
pred <- knn.reg(train["speed"], test["speed"], train$dist, k = 3)

# Print the predicted target values
print(pred$pred)
```
Here, we loaded the `cars` dataset and split it into training and test sets, then used the `knn.reg` function to predict the target values for the test set.
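For regression, prediction quality is commonly summarized with the root mean squared error (RMSE). A self-contained sketch, assuming the same `cars` split as above:

```r
library(FNN)

data(cars)
train <- cars[1:20, ]
test  <- cars[21:50, ]

pred <- knn.reg(train["speed"], test["speed"], train$dist, k = 3)

# Root mean squared error of the predictions on the test set
rmse <- sqrt(mean((pred$pred - test$dist)^2))
print(rmse)
```

Note that `cars` is ordered by speed, so this split forces the model to extrapolate to speeds it never saw; shuffling before splitting (as we did with `iris`) would give a fairer estimate.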
Choosing the K Value
Choosing the optimal value of K is critical to the performance of the KNN algorithm. A small value of K may lead to overfitting (predictions become sensitive to noise), while a large value may lead to underfitting (predictions become overly smoothed). To choose the optimal value of K, we can use cross-validation techniques, such as k-fold cross-validation or leave-one-out cross-validation.
Let's see a simple example of choosing K in R by evaluating each candidate on a held-out validation set, a lightweight stand-in for full k-fold cross-validation.
```r
library(class)

# Load the iris dataset
data(iris)

# iris is sorted by species, so shuffle before splitting
set.seed(42)
shuffled <- iris[sample(nrow(iris)), ]

# Split into training and validation sets
train <- shuffled[1:100, ]
valid <- shuffled[101:150, ]

# Candidate K values
k_values <- seq(1, 20, 2)

# Validation error for each candidate K
errors <- rep(0, length(k_values))
for (i in seq_along(k_values)) {
  preds <- knn(train[, -5], valid[, -5], train[, 5], k = k_values[i])
  errors[i] <- sum(preds != valid[, 5])
}

# Plot validation error vs. K
plot(k_values, errors, type = "b", xlab = "K value", ylab = "Validation error")
```
Here, we loaded the `iris` dataset and split it into training and validation sets. We created a vector of candidate K values, computed the validation error for each, and plotted the errors against K; the K with the lowest error is a reasonable choice.
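For proper leave-one-out cross-validation, the `class` package also ships `knn.cv`, which classifies each point using all the *other* points as neighbors. A sketch of using it to pick K on the full dataset:

```r
library(class)

data(iris)
set.seed(42)  # knn.cv breaks distance ties at random
k_values <- seq(1, 20, 2)

# Leave-one-out CV error rate for each candidate K
loocv_errors <- sapply(k_values, function(k) {
  preds <- knn.cv(iris[, -5], iris[, 5], k = k)
  mean(preds != iris[, 5])
})

# Best K is the one with the lowest LOOCV error
best_k <- k_values[which.min(loocv_errors)]
print(best_k)
```

Because every observation serves as its own test case, this uses the data more efficiently than a single hold-out split, at the cost of more computation.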
Conclusion
That's it! In this tutorial, we learned how to implement the KNN algorithm in R programming. We explored the KNN function in R, KNN regression, and other related concepts. We also saw how to choose the optimal value of K using cross-validation techniques.
The KNN algorithm is powerful and easy to understand, with versatile applications across industries. By implementing K-Nearest Neighbors in R, we can tackle many data science problems in a straightforward way. So, start building models with the KNN algorithm in R today!