Site icon R-bloggers

A Gentle Introduction to Poisson Regression for Count Data: School’s Out, Job Offers Incoming!

[This article was first published on Steve's Data Tips and Tricks, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
< section id="introduction" class="level1">

Introduction

Hey data enthusiasts! Today, we’re diving into the fascinating world of count data and its trusty sidekick, Poisson regression. Buckle up, because we’re about to explore how this statistical powerhouse helps us understand the factors influencing, you guessed it, counts.

Scenario: Imagine you’re an education researcher, eager to understand how a student’s GPA might influence their job offer count after graduation. But hold on, job offers aren’t continuous – they’re discrete, ranging from 0 to a handful. That’s where Poisson regression comes in!

< section id="generating-data" class="level1">

Generating Data

We’ll keep the data generation part the same, just adjusting the variables in our data frame.

library(tidyverse)

# Set seed for reproducibility
set.seed(123)

# Creating data frame
data <- data.frame(
  School = sample(c("A", "B", "C"), 100, replace = TRUE),
  GPA = c(
    round(runif(50, 1, 3), 1),
    round(runif(30, 2, 3.5), 1),
    round(runif(20, 3, 4), 1)
  ),
  JobOffers = c(rep(0, 50), rep(1, 30), rep(2, 10), rep(3, 7), rep(4, 3))
)

summary(data)
    School               GPA          JobOffers   
 Length:100         Min.   :1.000   Min.   :0.00  
 Class :character   1st Qu.:1.875   1st Qu.:0.00  
 Mode  :character   Median :2.600   Median :0.50  
                    Mean   :2.532   Mean   :0.83  
                    3rd Qu.:3.200   3rd Qu.:1.00  
                    Max.   :4.000   Max.   :4.00  
data |>
  group_by(JobOffers) |>
  summarise(mean_gpa = mean(GPA))
# A tibble: 5 × 2
  JobOffers mean_gpa
      <dbl>    <dbl>
1         0     1.94
2         1     2.87
3         2     3.41
4         3     3.5 
5         4     3.8 
< section id="visualizing-the-data" class="level2">

Visualizing the Data

Let’s update the plots to reflect the change in the predictor and outcome.

library(ggplot2)

# Plotting GPA distribution by school
ggplot(data, aes(JobOffers, fill = School)) +
  geom_histogram(binwidth=.5, position="dodge") +
  theme_minimal()

The density plot now showcases the distribution of GPA scores for each school.

Next, let’s visualize the relationship between GPA and job offers.

# Plotting Job Offers vs. GPA
ggplot(data, aes(x = GPA, y = JobOffers, color = School)) +
  geom_point(aes(y = JobOffers), alpha = .628,
             position = position_jitter(h = .2)) +
  labs(title = "Scatter Plot of Job Offers vs. GPA",
       x = "GPA", y = "Job Offers") +
  theme_minimal()

This scatter plot gives us a visual cue that higher GPAs might correlate with more job offers.

< section id="poisson-regression" class="level2">

Poisson Regression

Now, let’s adjust the Poisson Regression model to reflect the change in predictor and outcome.

# Fitting Poisson Regression model
poisson_model <- glm(JobOffers ~ GPA + School, data = data, family = "poisson")

# Summary of the model
summary(poisson_model)
Call:
glm(formula = JobOffers ~ GPA + School, family = "poisson", data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.2452  -0.5856  -0.3483   0.3221   1.6491  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -5.25817    0.70621  -7.446 9.65e-14 ***
GPA          1.73169    0.21241   8.153 3.56e-16 ***
SchoolB      0.03135    0.27524   0.114    0.909    
SchoolC     -0.19137    0.27637  -0.692    0.489    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 138.07  on 99  degrees of freedom
Residual deviance:  43.18  on 96  degrees of freedom
AIC: 168.06

Number of Fisher Scoring iterations: 5

The model summary will now provide insights into how GPA influences the number of job offers.

< section id="visualizing-model-fits" class="level2">

Visualizing Model Fits

Let’s update the plot to reflect the relationship between GPA and predicted job offers.

# Adding predicted values to the data frame
data$Predicted <- predict(poisson_model, type = "response")

# Plotting observed vs. predicted values
ggplot(data, aes(x = GPA, y = Predicted, color = School)) +
  geom_point(aes(y = JobOffers), alpha = .628,
             position = position_jitter(h = .2)) +
  geom_line() +
  labs(
    title = "Observed vs. Predicted Job Offers",
    x = "GPA", 
    y = "Predicted Job Offers",
    color = "School") +
  theme_minimal()

This plot now illustrates how the Poisson Regression model predicts job offers based on GPA.

< section id="predicted-vs.-actual-values" class="level1">

Predicted vs. actual values

ggplot(data, aes(x = JobOffers, y = predict(poisson_model),
                 color = School)) +
  geom_point(aes(y = JobOffers), alpha = .628,
             position = position_jitter(h = .2)) +
  labs(
    title = "Predicted vs. Actual Job Offers", 
    x = "Actual Job Offers", 
    y = "Predicted Job Offers",
    color = "School") +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  theme_minimal()

< section id="residuals-vs.-predicted-values" class="level1">

Residuals vs. predicted values

plot(poisson_model, which = 1)

To leave a comment for the author, please follow the link and comment on their blog: Steve's Data Tips and Tricks.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Exit mobile version