KNN Function in R Programming: A Beginner's Guide

The K-Nearest Neighbors (KNN) algorithm is a simple, widely used supervised machine learning algorithm for classification and regression problems. With versatile applications in healthcare, finance, retail, and other industries, KNN has become a popular choice for building models.

In this tutorial, we will explore how to implement KNN in R programming. We will go over the KNN function in R, KNN regression, and other related concepts. So, let's get started!

K Nearest Neighbor in R

The KNN algorithm works on the principle of finding the K nearest neighbors to a new data point and making predictions based on the majority class or mean value of the K neighbors. Here, K represents the number of neighbors we want to consider.

The distance between the new data point and the existing data points in the dataset is calculated with a distance metric, such as Euclidean distance or Manhattan distance, chosen to suit the data type.

Once the distances are calculated, the K nearest data points are selected, and predictions are made based on the majority class or mean value of the K neighbors.
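As a small illustration (with made-up feature values, not from any dataset in this tutorial), the two most common distance metrics can be computed in base R:

```r
# Two hypothetical data points (e.g., petal length and petal width)
a <- c(1.4, 0.2)
b <- c(4.7, 1.4)

# Euclidean distance: square root of the sum of squared differences
euclid <- sqrt(sum((a - b)^2))
print(euclid)  # ~3.51

# Manhattan distance: sum of absolute differences
manhat <- sum(abs(a - b))
print(manhat)  # 4.5

# The built-in dist() function computes the same metrics
print(dist(rbind(a, b), method = "manhattan"))
```

The `dist()` function is handy when you need all pairwise distances in a dataset rather than a single pair.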

KNN in R

To implement the KNN algorithm in R programming, we can use the class package. The class package provides functions for KNN classification, such as knn and knn.cv; for KNN regression we will use the knn.reg function from the FNN package. Let's explore these functions.

KNN Classification

To perform KNN classification in R, we can use the knn function. The syntax for the knn function is as follows:

knn(train, test, cl, k)

Here, train is the training dataset, test is the test dataset, cl is the class labels, and k is the number of neighbors to consider. The output of the knn function is a vector of predicted class labels for the test dataset.

Let's see an example of how to use the knn function in R for classification.

library(class)

# Load the iris dataset
data(iris)

# Randomly split the dataset into training and test datasets
# (a sequential 1:100 split would leave the virginica class
# out of the training set entirely, since iris is sorted by species)
set.seed(42)
idx <- sample(nrow(iris), 100)
train <- iris[idx, ]
test <- iris[-idx, ]

# Perform KNN classification
pred <- knn(train[, -5], test[, -5], train[, 5], k = 3)

# Print the predicted class labels
print(pred)

Here, we loaded the iris dataset and randomly split it into training and test datasets; a random split matters because iris is sorted by species, so taking the first 100 rows as training data would exclude one class entirely. We then used the knn function to predict the class labels for the test dataset.
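The raw vector of predictions is hard to judge on its own. A common follow-up (not part of the original example) is to tabulate predictions against the true labels and compute an accuracy score. This sketch repeats the random split so it runs on its own; the seed value is arbitrary:

```r
library(class)

# Load the iris dataset and split it randomly (seed is arbitrary)
data(iris)
set.seed(42)
idx <- sample(nrow(iris), 100)
train <- iris[idx, ]
test <- iris[-idx, ]

pred <- knn(train[, -5], test[, -5], train[, 5], k = 3)

# Confusion matrix: predicted labels vs. true labels
print(table(Predicted = pred, Actual = test[, 5]))

# Overall accuracy: proportion of correct predictions
accuracy <- mean(pred == test[, 5])
print(accuracy)
```

The off-diagonal cells of the confusion matrix show which species the model confuses, which a single accuracy number hides.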

KNN Regression

To perform KNN regression in R, we can use the knn.reg function from the FNN package (the class package does not provide a regression function). The syntax for the knn.reg function is as follows:

knn.reg(train, test, y, k)

Here, train is the training dataset, test is the test dataset, y is the target variable, and k is the number of neighbors to consider. The knn.reg function returns a list whose pred component holds the predicted target values for the test dataset.

Let's see an example of how to use the knn.reg function in R for regression.

library(FNN)

# Load the cars dataset
data(cars)

# Randomly split the dataset into training and test datasets
# (cars is sorted by speed, so a sequential split would force
# the model to extrapolate to speeds it has never seen)
set.seed(42)
idx <- sample(nrow(cars), 35)
train <- cars[idx, ]
test <- cars[-idx, ]

# Perform KNN regression (predict stopping distance from speed)
pred <- knn.reg(train["speed"], test["speed"], train$dist, k = 3)

# Print the predicted target variable values
print(pred$pred)

Here, we loaded the cars dataset, randomly split it into training and test datasets, and used the knn.reg function to predict stopping distance (dist) from speed for the test dataset. The predictions are stored in the pred component of the returned list.
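For regression, we evaluate predictions with an error measure rather than an accuracy score; root mean squared error (RMSE) is a common choice. This sketch assumes the FNN package is installed and repeats an arbitrary random split so it runs on its own:

```r
library(FNN)

# Load the cars dataset and split it randomly (seed is arbitrary)
data(cars)
set.seed(42)
idx <- sample(nrow(cars), 35)
train <- cars[idx, ]
test <- cars[-idx, ]

pred <- knn.reg(train["speed"], test["speed"], train$dist, k = 3)

# Root mean squared error between predictions and true dist values
rmse <- sqrt(mean((pred$pred - test$dist)^2))
print(rmse)
```

Recomputing the RMSE for several values of k gives a quick, if rough, way to compare regression models before turning to proper cross-validation.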

Choosing the K Value

Choosing the optimal value of K is critical in the performance of the KNN algorithm. A small value of K may lead to overfitting, while a large value of K may lead to underfitting. To choose the optimal value of K, we can use cross-validation techniques, such as k-fold cross-validation or leave-one-out cross-validation.

Let's see an example of how to choose the K value using leave-one-out cross-validation, which the class package supports directly through the knn.cv function.

library(class)

# Load the iris dataset
data(iris)

# Create a vector of possible K values
k_values <- seq(1, 20, 2)

# Initialize a vector to store the cross-validation errors
cv_errors <- rep(0, length(k_values))

# Perform leave-one-out cross-validation for each K value:
# knn.cv predicts each observation from its neighbors among
# the remaining observations
for (i in seq_along(k_values)) {
  cv_preds <- knn.cv(iris[, -5], iris[, 5], k = k_values[i])
  cv_errors[i] <- sum(cv_preds != iris[, 5])
}

# Plot the cross-validation errors vs. K values
plot(k_values, cv_errors, type = "b",
     xlab = "K value", ylab = "Cross-validation error")

Here, we loaded the iris dataset and created a vector of possible K values. For each K value, knn.cv performed leave-one-out cross-validation, predicting every observation from its nearest neighbors among the remaining observations. We counted the misclassifications for each K and plotted the errors against the K values; the K with the lowest cross-validation error is a good candidate.

Conclusion

That's it! In this tutorial, we learned how to implement the KNN algorithm in R programming. We explored the KNN function in R, KNN regression, and other related concepts. We also saw how to choose the optimal value of K using cross-validation techniques.

The KNN algorithm is a powerful and easy-to-understand algorithm with versatile applications across industries. By implementing the K-Nearest Neighbors algorithm in R, we can solve data science problems in a simplified way. So, start building models using the KNN algorithm in R today!