Plotting a Logistic Regression in Base R

Steven P. Sanderson II, MPH

1 month ago

This article was first published on Steve's Data Tips and Tricks and kindly contributed to R-bloggers.

Introduction

Logistic regression is a statistical method used for predicting the probability of a binary outcome. It’s a fundamental tool in machine learning and statistics, often employed in various fields such as healthcare, finance, and marketing. We use logistic regression when we want to understand the relationship between one or more independent variables and a binary outcome, which can be “yes/no,” “1/0,” or any two-class distinction.
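As a quick illustration of how a model can turn any real number into a probability, here is a minimal sketch of the logistic (S-shaped) function using base R's plogis() and its inverse qlogis(). The values in eta are hypothetical and chosen only for illustration.

```r
# Hypothetical linear-predictor values (log-odds), chosen only for illustration
eta <- c(-4, -2, 0, 2, 4)

# plogis() is the logistic function: 1 / (1 + exp(-eta))
p <- plogis(eta)

# The manual formula gives the same result
p_manual <- 1 / (1 + exp(-eta))
all.equal(p, p_manual)

# qlogis() is the inverse (the logit): it maps probabilities back to log-odds
all.equal(qlogis(p), eta)
```

Every value in p lies strictly between 0 and 1, and a log-odds of 0 maps to a probability of exactly 0.5.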

Getting Started

Before we dive into plotting the logistic regression curve, let’s start with the basics. First, you’ll need some data. For this blog post, I’ll assume you have your dataset ready. If you don’t, you can easily find sample datasets online to practice with.

Load the Data

In R, we use the read.csv function to load a CSV file into a data frame. For example, if you have a dataset called “mydata.csv,” you can load it like this:

# Load the data into a data frame
data <- read.csv("mydata.csv")

In this post, we will instead simulate a data set, so the example is fully reproducible:

library(dplyr)

set.seed(123)
df <- tibble(
    x = runif(100, 0, 10),
    y = rbinom(100, 1, 1 / (1 + exp(-1 * (0.5 * x - 2.5))))
)

head(df)

# A tibble: 6 × 2
      x     y
  <dbl> <int>
1 2.88      0
2 7.88      1
3 4.09      0
4 8.83      0
5 9.40      1
6 0.456     0
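To make the simulation easier to read, here is a sketch of the same steps in base R only. It assumes a plain data.frame is an acceptable stand-in for the tibble, and the names p_true and df_base are introduced here for illustration. The probability passed to rbinom is the logistic function of the linear predictor 0.5 * x - 2.5, so the data are generated with a true intercept of -2.5 and a true slope of 0.5.

```r
# Base-R sketch of the simulation above (data.frame instead of tibble)
set.seed(123)
x <- runif(100, 0, 10)

# True success probability: logistic function of the linear predictor
p_true <- plogis(0.5 * x - 2.5)  # same as 1 / (1 + exp(-(0.5 * x - 2.5)))

# Draw one Bernoulli (0/1) outcome per observation
y <- rbinom(100, 1, p_true)
df_base <- data.frame(x = x, y = y)

# The true curve crosses 0.5 where 0.5 * x - 2.5 = 0, i.e. at x = 5
plogis(0.5 * 5 - 2.5)  # 0.5
```

Knowing the true parameters is handy later on: the fitted coefficients should land reasonably close to -2.5 and 0.5.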

Fit a Logistic Regression Model

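Next, we need to fit a logistic regression model to our data. We’ll use the glm (Generalized Linear Model) function with family = binomial, which uses the logit link by default. Suppose we want to predict the probability of a “success” (1) based on a single predictor variable “x.” After fitting, the glance, tidy, and augment functions from the broom package (which must be installed) summarize the model fit, the coefficients, and the per-observation diagnostics.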

# Fit a logistic regression model
model <- glm(y ~ x, data = df, family = binomial)

broom::glance(model)

# A tibble: 1 × 8
  null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
          <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
1          138.      99  -51.5  107.  112.     103.          98   100

broom::tidy(model)

# A tibble: 2 × 5
  term        estimate std.error statistic     p.value
  <chr>          <dbl>     <dbl>     <dbl>       <dbl>
1 (Intercept)   -2.63      0.571     -4.60 0.00000422 
2 x              0.505     0.102      4.96 0.000000699

head(broom::augment(model), 1) |> 
  dplyr::glimpse()

Rows: 1
Columns: 8
$ y          <int> 0
$ x          <dbl> 2.875775
$ .fitted    <dbl> -1.175925
$ .resid     <dbl> -0.7333581
$ .hat       <dbl> 0.01969748
$ .sigma     <dbl> 1.028093
$ .cooksd    <dbl> 0.003162007
$ .std.resid <dbl> -0.7406892
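The coefficient table above is reported on the log-odds scale, which can be hard to read directly. The sketch below is one common way to interpret it, not part of the original workflow: it refits the model in base R only (assuming a plain data.frame-free setup with vectors x and y), then computes the odds ratio for x and the value of x at which the fitted probability is 0.5. The names odds_ratio and x_half are introduced here for illustration.

```r
# Recreate the simulated data and the model with base R only
set.seed(123)
x <- runif(100, 0, 10)
y <- rbinom(100, 1, plogis(0.5 * x - 2.5))
model <- glm(y ~ x, family = binomial)

b <- coef(model)  # named vector with "(Intercept)" and "x"

# Odds ratio: multiplicative change in the odds of y = 1
# for a one-unit increase in x
odds_ratio <- exp(b[["x"]])

# Value of x where the fitted probability equals 0.5
# (the linear predictor is zero there)
x_half <- -b[["(Intercept)"]] / b[["x"]]

odds_ratio
x_half
```

With the rounded estimates printed above (-2.63 and 0.505), the odds ratio is about exp(0.505) ≈ 1.66 and the midpoint is about 2.63 / 0.505 ≈ 5.2, close to the true value of 5 used in the simulation. Note also that the .fitted column from augment is on the log-odds scale, which is why it can be negative.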

Predict Probabilities

Now that we have our model, we can use it to predict probabilities. We’ll create a sequence of values for our predictor variable, and for each value, we’ll predict the probability of success, that is, the probability that y equals 1.

# Create a sequence of predictor values
x_seq <- seq(0, 10, 0.01)

# Predict probabilities
probabilities <- predict(
  model, 
  newdata = data.frame(x = x_seq), 
  type = "response"
  )

head(x_seq)

[1] 0.00 0.01 0.02 0.03 0.04 0.05

head(probabilities)

         1          2          3          4          5          6 
0.06732923 0.06764710 0.06796636 0.06828702 0.06860908 0.06893255

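The predict function here calculates the probabilities using our logistic regression model. Because we set type = "response", the values are returned on the probability scale (between 0 and 1) rather than on the default log-odds scale.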
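To see what predict is doing under the hood, the following sketch compares three routes to the same numbers: type = "response", type = "link" followed by plogis(), and a manual calculation from the coefficients. It rebuilds the data with base R only, and the names df_base, new_x, log_odds, probs, and probs_manual are introduced here for illustration.

```r
# Recreate the simulated data and the model with base R only
set.seed(123)
x <- runif(100, 0, 10)
y <- rbinom(100, 1, plogis(0.5 * x - 2.5))
df_base <- data.frame(x = x, y = y)
model <- glm(y ~ x, data = df_base, family = binomial)

new_x <- data.frame(x = seq(0, 10, 0.01))

# type = "link" returns the linear predictor (log-odds)
log_odds <- predict(model, newdata = new_x, type = "link")

# type = "response" applies the inverse link and returns probabilities
probs <- predict(model, newdata = new_x, type = "response")

# The same probabilities computed by hand from the coefficients
b <- coef(model)
probs_manual <- plogis(b[["(Intercept)"]] + b[["x"]] * new_x$x)

all.equal(unname(probs), plogis(unname(log_odds)))
all.equal(unname(probs), probs_manual)
```

Both comparisons should return TRUE, which confirms that the curve we are about to plot is simply the logistic function applied to the fitted straight line.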

Plot the Logistic Regression Curve

Finally, let’s plot the logistic regression curve. We’ll use the plot function to create a scatter plot of the data points, and then we’ll overlay the logistic curve using the lines function.

# Plot the data points
plot(
  df$x, df$y, 
  pch = 16, 
  col = "blue", 
  xlab = "Predictor Variable", 
  ylab = "Probability of Success"
  )

# Add the logistic regression curve
lines(x_seq, probabilities, col = "red", lwd = 2)

And there you have it! You’ve successfully plotted a logistic regression curve in base R. The blue dots represent your data points, and the red curve is the logistic regression curve, showing how the probability of success changes with the predictor variable.
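If you would like to show the uncertainty around the curve as well, one possible extension (not part of the original walkthrough) is an approximate 95% confidence band. The sketch below assumes a Wald-type interval: it asks predict for standard errors on the log-odds scale with se.fit = TRUE, builds the interval there, and transforms the limits to probabilities with plogis(). The names df_base, pred, fit, lower, and upper are introduced here for illustration.

```r
# Recreate the simulated data and the model with base R only
set.seed(123)
x <- runif(100, 0, 10)
y <- rbinom(100, 1, plogis(0.5 * x - 2.5))
df_base <- data.frame(x = x, y = y)
model <- glm(y ~ x, data = df_base, family = binomial)

x_seq <- seq(0, 10, 0.01)

# Predictions and standard errors on the link (log-odds) scale
pred <- predict(model, newdata = data.frame(x = x_seq),
                type = "link", se.fit = TRUE)

# Approximate 95% Wald interval on the link scale,
# then transformed to the probability scale
z <- qnorm(0.975)
fit   <- plogis(pred$fit)
lower <- plogis(pred$fit - z * pred$se.fit)
upper <- plogis(pred$fit + z * pred$se.fit)

# Data points, shaded band, and fitted curve
plot(df_base$x, df_base$y, pch = 16, col = "blue",
     xlab = "Predictor Variable", ylab = "Probability of Success")
polygon(c(x_seq, rev(x_seq)), c(lower, rev(upper)),
        col = adjustcolor("red", alpha.f = 0.2), border = NA)
lines(x_seq, fit, col = "red", lwd = 2)
```

Building the interval on the log-odds scale and transforming afterwards keeps the band inside the 0 to 1 range, which an interval computed directly on the probability scale would not guarantee.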

Conclusion

I encourage you to try this out with your own dataset. Logistic regression is a powerful tool for modeling binary outcomes, and visualizing the curve helps you understand the relationship between your predictor variable and the probability of success. Experiment with different datasets and predictor variables to gain a deeper understanding of this essential statistical technique.

Remember, practice makes perfect, and the more you work with logistic regression in R, the more proficient you’ll become. Happy coding!
