Logistic Regression Equation in R: Understanding the Formula with Examples
Logistic regression is one of the most popular statistical techniques used in machine learning for binary classification problems. It uses a logistic function to model the relationship between a dependent variable and one or more independent variables. The goal of logistic regression is to find the best relationship between the input features and the output variable. In this article, we will discuss the logistic regression equation with examples in R.
The logistic regression equation can be defined as follows:

log(P(Y=1) / (1 − P(Y=1))) = β0 + β1X1 + β2X2 + … + βpXp

where:
- Y: the dependent variable or response variable (binary)
- X1, X2, …, Xp: independent variables or predictors
- β0, β1, β2, …, βp: beta coefficients or model parameters
The logistic regression model estimates the values of beta coefficients. The beta coefficients represent the change in the log-odds of the dependent variable when the corresponding independent variable changes by one unit. The logistic function (also called the sigmoid function) then transforms the log-odds into probabilities between 0 and 1.
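This log-odds-to-probability transformation is easy to see in R, where the logistic function is available in base R as plogis():

```r
# Convert log-odds to probabilities with the logistic (sigmoid) function.
# plogis(x) is base R's logistic CDF: 1 / (1 + exp(-x))
log_odds <- c(-2, 0, 2)
probs <- plogis(log_odds)
print(round(probs, 3))  # 0.119 0.500 0.881
```

Note that a log-odds of 0 maps to a probability of exactly 0.5, which is why 0.5 is the natural classification threshold.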
In this section, we will use the glm() function in R to build and train a logistic regression model on a sample dataset, hr_analytics.
First, we load the required packages and the dataset:
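A minimal setup might look like the following. Note that the file name hr_analytics.csv is a placeholder assumption (the original source of the dataset is not shown), and the rsample package is loaded because it provides the initial_split(), training(), and testing() functions used later:

```r
# rsample provides initial_split(), training(), testing() for data splitting
library(rsample)

# Read the dataset -- the file path here is an assumed placeholder
hr_analytics <- read.csv("hr_analytics.csv", stringsAsFactors = TRUE)

# Inspect the variables: age, gender, education, department, left_company
str(hr_analytics)
```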
The hr_analytics dataset contains information about employees of a certain company, including their age, gender, education level, department, and whether they left the company or not.
We convert the target variable left_company into a binary variable:
hr_analytics$left_company <- ifelse(hr_analytics$left_company == "Yes", 1, 0)
Next, we split the dataset into training and test sets:
set.seed(123)
split <- initial_split(hr_analytics, prop = 0.7)
train <- training(split)
test <- testing(split)
We fit a logistic regression model using the glm() function:
logistic_model <- glm(left_company ~ ., data = train, family = "binomial")
In this example, we use all the available independent variables (age, gender, education, department) to predict the dependent variable (left_company). The family argument specifies the type of model we want to fit. Since we are dealing with a binary classification problem, we specify "binomial" as the family.
To evaluate the fitted model, we use the summary() function:
Call:
glm(formula = left_company ~ ., family = "binomial", data = train)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-2.389  -0.640  -0.378   0.665   2.866

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)
(Intercept)          -0.721620   0.208390  -3.462 0.000534 ***
age                  -0.008328   0.004781  -1.742 0.081288 .
genderMale            0.568869   0.086785   6.553 5.89e-11 ***
educationHigh School  0.603068   0.132046   4.567 4.99e-06 ***
educationMaster's    -0.175406   0.156069  -1.123 0.261918
departmentHR          1.989789   0.171596  11.594  < 2e-16 ***
departmentIT          0.906366   0.141395   6.414 1.39e-10 ***
departmentSales       1.393794   0.177948   7.822 5.12e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6589.7  on 4799  degrees of freedom
Residual deviance: 5878.5  on 4792  degrees of freedom
AIC: 5894.5

Number of Fisher Scoring iterations: 5
The output shows the coefficients of the model (beta coefficients), their standard errors, z-value, and p-value. We can interpret the coefficients as follows:
- Coefficients with a significant p-value (p < 0.05) are statistically significant and have a measurable impact on the outcome. In this case, gender, education (High School), and all department levels are significant predictors of whether an employee leaves the company.
- Coefficients with a non-significant p-value (p > 0.05) are not statistically significant. In this case, age (p ≈ 0.081) and education level (Master's, p ≈ 0.262) are not significant predictors at the 0.05 level.
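Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to interpret. For example, using the genderMale estimate from the summary output above:

```r
# Convert a log-odds coefficient to an odds ratio
coef_gender_male <- 0.568869          # genderMale estimate from the summary output
odds_ratio <- exp(coef_gender_male)
print(round(odds_ratio, 2))  # 1.77
```

An odds ratio of about 1.77 means that, holding the other predictors constant, male employees have roughly 1.77 times the odds of leaving the company.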
To make predictions on new data, we use the predict() function:
predictions <- predict(logistic_model, newdata = test, type = "response")
The newdata argument specifies the data on which we want to make predictions, and the type argument controls the scale of the output. Setting type = "response" returns predicted probabilities between 0 and 1, rather than values on the log-odds scale.
Finally, we evaluate the predictions using the confusion matrix:
table(Predicted = ifelse(predictions > 0.5, 1, 0), Actual = test$left_company)
         Actual
Predicted    0    1
        0 1941  334
        1  206  419
The confusion matrix shows the number of true positives, false positives, true negatives, and false negatives. We can use these values to calculate performance metrics such as precision, recall, and F1 score.
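For instance, these metrics can be computed directly from the counts in the matrix above (419 true positives, 206 false positives, 1941 true negatives, 334 false negatives):

```r
# Counts taken from the confusion matrix above
tp <- 419; fp <- 206; tn <- 1941; fn <- 334

accuracy  <- (tp + tn) / (tp + fp + tn + fn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

print(round(c(accuracy = accuracy, precision = precision,
              recall = recall, f1 = f1), 3))
# accuracy = 0.814, precision = 0.670, recall = 0.556, f1 = 0.608
```

The model identifies stayers well but misses a substantial share of leavers (recall ≈ 0.56), which suggests the 0.5 threshold could be tuned depending on the cost of false negatives.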
In this article, we discussed the logistic regression equation and how it is used to model the relationship between independent variables and a binary dependent variable. We also demonstrated how to use the glm() function in R to build, train, and evaluate a logistic regression model on a sample dataset. Logistic regression is a powerful technique for binary classification problems and is widely used in machine learning.