Examining the Tweeting Patterns of Prominent Crossfit Gyms

[This article was first published on R – Mathew Analytics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A. Introduction

The growth of Crossfit has been one of the biggest developments in the fitness industry over the past decade. Promoted as both a physical exercise philosophy and also as a competitive fitness sport, Crossfit is a high-intensity fitness program incorporating elements from several sports and exercise protocols such as high-intensity interval training, Olympic weightlifting, plyometrics, powerlifting, gymnastics, strongman, and so forth. Now with over 10,000 Crossfit affiliated gyms (boxes) throughout the United States, the market has certainly become more saturated and gyms must initiate more unique marketing strategies to attract new members. In this post, I will investigate how some prominent Crossfit boxes are utilizing Twitter to engage with consumers. While Twitter is a great platform for news and entertainment, it is usually not the place for customer acquisition given the lack of targeted messaging. Furthermore, unlike platforms like Instagram,Twitter is simply not an image/video centric tool where followers can view accomplishments from their favorite fitness heroes, witness people achieving their goals, and so forth. Given these shortcomings, I wanted to understand how some prominent Crossfit boxes are actually using their Twitter accounts. In this post, I will investigate tweeting patterns, content characteristics, perform sentiment analysis, and build a predictive model to predict retweets.

B. Extract Data From Twitter

We begin by extracting the desired data from Twitter using the rtweet package in R. There are six prominent Crossfit gyms whose entire Twitter timeline we will use. To get this data, I looped through a vector containing each of their Twitter handles and used the get_timeline function to pull the desired data. Notice that there is a user defined function called add_engineered_features that is used to add a number of extra date columns. That function is available on my GitHub page here.


library(rtweet)
library(lubridate)
library(devtools)
library(data.table)
library(ggplot2)
library(hms)
library(scales)

# set working directory
setwd("~/Desktop/rtweet_crossfit")

final_dat.tmp <- list()

cf_gyms <- c("reebokcrossfit5", "crossfitmayhem", "crossfitsanitas", "sfcrossfit", "cfbelltown", "WindyCityCF")

for(each_box in cf_gyms){

message("Getting data for: ", each_box)

each_cf % data.table()
each_cf$crossfit_name <- each_box
suppressWarnings( add_engineered_dates(each_cf, date_col = "created_at") )

final_dat.tmp[[each_box]] <- each_cf
message("")

}

final_dat <- rbindlist(final_dat.tmp)

final_dat$contains_hashtags <- ifelse(!is.na(final_dat$hashtags), 1, 0)

final_dat$hashtags_count <- lapply(final_dat$hashtags, function(x) length(x[!is.na(x)]))

C. Exploratory Data Analysis

Let us start by investigating this data set to get a better understanding of trends and patterns across these various Crossfit boxes. The important thing to note is that not all these twitter accounts are currently active. We can see that crossfitmayhem, sfcrossfit, and WindyCityCF are the only ones who remain active.


final_dat[, .(min_max = range(as.Date(created_at))), by=crossfit_name][,label := rep(c("first_tweet","last_tweet"))][]

C1. Total Number of Tweets

Sfcrossfit, which is the oldest of these gyms, has the highest number of tweets. However, when looking at the total number of tweets per active month, they were less active than two other gyms.


## total number of tweets 

p1 = final_dat[, .N, by=crossfit_name] %>%
   ggplot(., aes(x=reorder(crossfit_name, N), y=N)) + geom_bar(stat='identity', fill="steelblue") +
      coord_flip() + labs(x="", y="") + ylim(0,3000) + ggtitle("Total Number of Tweets") +
         theme(plot.title = element_text(hjust = 0.5),
               axis.ticks.x = element_line(colour = "black"),
               axis.ticks.y = element_line(colour = "black"))  

## number of tweets per active month 

p2 = final_dat[, .(.N, start=lubridate::ymd_hms(min(created_at)), months_active=lubridate::interval(lubridate::ymd_hms(min(created_at)), Sys.Date()) %/% months(1)), by=crossfit_name][,
     .(tweets_per_month = N/months_active), by=crossfit_name] %>%
   ggplot(., aes(x=reorder(crossfit_name, tweets_per_month), y=tweets_per_month)) +
      geom_bar(stat='identity', fill="steelblue") + coord_flip() + labs(x="", y="") + ylim(0,32) +
         theme(plot.title = element_text(hjust = 0.5),
               axis.ticks.x = element_line(colour = "black"),
               axis.ticks.y = element_line(colour = "black")) +
            ggtitle("Total Number of Tweets per Active Month")

## add both plots to a single pane

grid.arrange(p1, p2, nrow=1)

C2. Total Number of Tweets Over Time

The time series for the total number of tweets by month shows that each gym had one or two peaks from 2012 through 2016 where they were aggressively sharing content with their followers. However, over the past two years, each gym has reduced their twitter usage significantly.


## total number of tweets by month 

final_dat[, .N, by = .(crossfit_name, created_at_YearMonth)][order(crossfit_name, created_at_YearMonth)][,
      created_at_YearMonth := lubridate::ymd(paste(created_at_YearMonth, "-01"))] %>%
   ggplot(., aes(created_at_YearMonth, N, colour=crossfit_name)) + geom_line(group=1, lwd=0.6) +
      facet_wrap(~crossfit_name) + labs(x="", y="") + theme(legend.position="none") +
         theme(plot.title = element_text(hjust = 0.5),
               axis.ticks.x = element_line(colour = "black"),
               axis.ticks.y = element_line(colour = "black"),
               strip.text.x = element_text(size = 10)) +
            ggtitle("Total Number of Tweets")

## total number of tweets by year

ggplot(data = final_dat,
       aes(lubridate::month(created_at, label=TRUE, abbr=TRUE),
           group=factor(lubridate::year(created_at)), color=factor(lubridate::year(created_at))))+
   geom_line(stat="count") + geom_point(stat="count") +
   facet_wrap(~crossfit_name) + labs(x="", colour="Year") + xlab("") + ylab("") +
   theme(plot.title = element_text(hjust = 0.5),
         axis.ticks.x = element_line(colour = "black"),
         axis.ticks.y = element_line(colour = "black"),
         strip.text.x = element_text(size = 10)) +
   ggtitle("Total Number of Tweets by Year")

C3. Tweeting Volume by Year, Month, and Day

For each Crossfit gym, I plotted the volume of tweets by year, month, and day. Oddly enough, there really are not any noticeable patterns in these charts.


## years with the highest number of tweets 

ggplot(final_dat, aes(created_at_Year)) + geom_bar(fill="steelblue") +
   facet_wrap(~crossfit_name) + labs(x="", y="") +
   theme(plot.title = element_text(hjust = 0.5),
         axis.ticks.x = element_line(colour = "black"),
         axis.ticks.y = element_line(colour = "black"),
         strip.text.x = element_text(size = 10)) + ylim(0,800) +
   ggtitle("Total Number of Tweets by Year")

## months with the highest number of tweets 

final_dat[, created_at_YearMonth2 := lubridate::ymd(paste(created_at_YearMonth, "-01"))][] %>%
   ggplot(., aes(lubridate::month(created_at_YearMonth2, label=TRUE, abbr=TRUE))) + geom_bar(fill="steelblue") +
   facet_wrap(~crossfit_name) + labs(x="", y="") +
   theme(plot.title = element_text(hjust = 0.5),
         axis.ticks.x = element_line(colour = "black"),
         axis.ticks.y = element_line(colour = "black"),
         strip.text.x = element_text(size = 10)) + ylim(0,500) +
   ggtitle("Total Number of Tweets by Month")

## days with the highest number of tweets 

final_dat[, created_at_YearMonth2 := lubridate::wday(lubridate::ymd(paste(created_at_YearMonth, "-01")), label=T)][] %>%
   ggplot(., aes(created_at_YearMonth2)) + geom_bar(fill="steelblue") +
   facet_wrap(~crossfit_name) + labs(x="", y="") +
   theme(plot.title = element_text(hjust = 0.5),
         axis.ticks.x = element_line(colour = "black"),
         axis.ticks.y = element_line(colour = "black"),
         strip.text.x = element_text(size = 10)) + ylim(0,600) +
   ggtitle("Total Number of Tweets by Day")

C4. Tweeting Volume by Time of Day

For most of these gyms, we see that the majority of their tweets came during the second half of the day or early in the morning. Given that most Crossfit boxes have their first classes at around 5am or 6am, and are usually most busy in the evening, this indicates that these businesses are tweeting during those hours where their facility is busiest.


## tweet density over the day 

ggplot(data = final_dat) +
   geom_density(aes(x = created_at_Time, y = ..scaled..),
                fill="steelblue", alpha=0.3) +
   xlab("Time") + ylab("Density") +
   scale_x_datetime(breaks = date_breaks("2 hours"),
                    labels = date_format("%H:%M")) +
   facet_wrap(~crossfit_name) + labs(x="", y="") +
   theme(plot.title = element_text(hjust = 0.5),
         axis.text.x=element_text(angle=90,hjust=1)) +
   theme(plot.title = element_text(hjust = 0.5),
         axis.ticks.x = element_line(colour = "black"),
         axis.ticks.y = element_line(colour = "black"),
         strip.text.x = element_text(size = 10)) +
   ggtitle("Tweet Density by Time of Day")

C5. Source of Tweets

While a couple of the gyms are using a marketing management platform like Hootsuite and IFTTT, the majority of Tweets are coming from a Twitter application or some other social media tool such as Facebook and Instagram.


## source of tweets

final_dat[, .N, by = .(crossfit_name, source)][order(crossfit_name,-N)][,
        head(.SD, 3),by=crossfit_name] %>%
   ggplot(., aes(x=source, y=N)) +
   geom_bar(stat="identity", fill="steelblue") + coord_flip() +
   facet_wrap(~crossfit_name) + labs(x="", y="") +
   labs(x="", y="", colour="") + ylim(0,2500) +
   theme(plot.title = element_text(hjust = 0.5),
         axis.ticks.x = element_line(colour = "black"),
         axis.ticks.y = element_line(colour = "black")) +
   ggtitle("Source of Tweets") 

D. Content Characteristics

Let us investigate the characteristics of the content that is being tweeted. From looking at average tweet length to the amount of retweets, this information will further expand our understanding of how these prominent Crossfit boxes are utilizing their Twitter accounts.

D1. Frequency of Retweets and Original Tweets

Only one gym, cfbelltown, had a majority of their tweets being retweets. Furthermore, a plurality of tweets from crossfitmayhem were not original. In the plots showing retweets over time, we can also see that crossfitmayhem did a lot of retweeting from 2014 through 2016, but started pushing original tweets from then on out.


## retweets and original tweets 

final_dat[, .N, by=.(crossfit_name,is_retweet)][, percen := N/sum(N), by=.(crossfit_name)][is_retweet=="TRUE"] %>%
   ggplot(., aes(x=reorder(crossfit_name, N), y=percen)) +
      geom_bar(stat="identity", position ="dodge", fill="steelblue") + coord_flip() +
         labs(x="", y="", colour="") + ylim(0,0.8) +
            theme(plot.title = element_text(hjust = 0.5),
                  axis.ticks.x = element_line(colour = "black"),
                  axis.ticks.y = element_line(colour = "black")) +
            ggtitle("Percentage of Tweets That Were Not Original")

# total number of retweets over time 

ggplot(data = final_dat, aes(x = created_at, fill = is_retweet)) +
   geom_histogram(bins=48) +
   labs(x="", y="", colour="") +
   scale_fill_manual(values = c("steelblue2","steelblue"), name = "Retweet") +
   facet_wrap(~crossfit_name) + labs(x="", y="") +
   theme(plot.title = element_text(hjust = 0.5),
         axis.ticks.x = element_line(colour = "black"),
         axis.ticks.y = element_line(colour = "black"),
         strip.text.x = element_text(size = 10)) +
   ggtitle("Total Number of Retweets Over Time")

D2. Frequency of Favourited Tweets

Crossfitmayhem was the only box where a majority of their tweets were actually favorited at least once. Furthermore, they had a moderately large number of those tweets that were getting favorited three or more times.


## amount of favorited tweets 

final_dat[, .N, by=.(crossfit_name,favorite_count)][, percen := N/sum(N), by=.(crossfit_name)][favorite_count==0,] %>%
   ggplot(., aes(x=reorder(crossfit_name,percen), y=percen)) +
   geom_bar(stat="identity", fill="steelblue") + coord_flip() +
   labs(x="", y="", colour="") + ylim(0,1) +
   theme(plot.title = element_text(hjust = 0.5),
         axis.ticks.x = element_line(colour = "black"),
         axis.ticks.y = element_line(colour = "black")) +
   ggtitle("Percentage of Tweets That Were Not Favorited")

## amount of favorited tweets by count

final_dat[, .N, by=.(crossfit_name,favorite_count)][, percen := N/sum(N), by=.(crossfit_name)][favorite_count <= 10,][order(crossfit_name,favorite_count)] %>%
   ggplot(., aes(x=reorder(favorite_count, -favorite_count), y=percen)) +
   geom_bar(stat="identity", fill="steelblue") + coord_flip() +
   labs(x="", y="", colour="") +
   facet_wrap(~crossfit_name) + ylim(0,1) +
   theme(plot.title = element_text(hjust = 0.5),
         axis.ticks.x = element_line(colour = "black"),
         axis.ticks.y = element_line(colour = "black"),
         strip.text.x = element_text(size = 10)) +
   ggtitle("Percentage of Tweets Based on Their Favorite Count")

D3. Use of Hashtags

Reebokcrossfit5 had the largest percentage of tweets that contained hashtags while crossfitsanitas and sfcrossfit rarely used hashtags in their tweets. And when it comes to the hashtags that are being used, it seems that these boxes are using the crossfit hashtag and the name of their own gym.


## contains hashtags 

final_dat[, .N, by=.(crossfit_name,contains_hashtags)][, percen := N/sum(N), by=.(crossfit_name)][contains_hashtags==1] %>%
   ggplot(., aes(x=reorder(crossfit_name,percen), y=percen)) +
   geom_bar(stat="identity", fill="steelblue") + coord_flip() +
   labs(x="", y="", colour="") + ylim(0,0.8) +
   theme(plot.title = element_text(hjust = 0.5),
         axis.ticks.x = element_line(colour = "black"),
         axis.ticks.y = element_line(colour = "black")) +
   ggtitle("Percentage of Tweets That Contain Hashtags") 

## frequently used hashtags 

final_dat[hashtags != "", ][,.N, by=.(crossfit_name, hashtags)][order(crossfit_name, -N)][,
    head(.SD, 3),by=crossfit_name] %>%
   ggplot(., aes(x = reorder(hashtags, -N), y=N)) +
   geom_bar(stat="identity", fill="steelblue") +
   coord_flip() + ylim(0,150) +
   facet_wrap(~crossfit_name) + labs(x="", y="", colour="") +
   theme(plot.title = element_text(hjust = 0.5),
         axis.ticks.x = element_line(colour = "black"),
         axis.ticks.y = element_line(colour = "black"),
         strip.text.x = element_text(size = 10)) +
   ggtitle("Most Commonly Used Hashtags") 

D4. Tweet Length

The following chart shows the summary statistics pertaining to the length of the tweets from each of the crossfit boxes. crossfitsanitas and WindyCityCF seem to have the longest tweets on average, and sfcrossfit has the shortest tweet.


## tweet length 

melt(final_dat[, .(avg_length = mean(display_text_width,na.rm=T),
                   med_length = median(display_text_width,na.rm=T),
                   min_length = min(display_text_width,na.rm=T),
                   max_length = mean(display_text_width,na.rm=T)), by=.(crossfit_name)], id="crossfit_name") %>%
   ggplot(., aes(x=reorder(crossfit_name, value), y=value)) +
      geom_bar(stat="identity", fill="steelblue") + coord_flip() +
         labs(x="Date", y="Number of Tweets", colour="Crossfit Box") +
         theme(plot.title = element_text(hjust = 0.5)) +
      facet_wrap(~variable) + labs(x="Time", y="Number of Tweets", colour="Crossfit Box") +
      theme(axis.text.x=element_text(angle=90,hjust=1)) +
      theme(plot.title = element_text(hjust = 0.5),
            strip.text.x = element_text(size = 10)) + ylim(0,200) +
      ggtitle("Tweet Length Summary Statistics") 

E. Sentiment Analysis

In order to evaluate the emotion associated with the tweets of each crossfit box, I used the syuzhet package. This package is based on emotion lexicon which maps different words with the various emotions (joy, fear, etc.) and sentiment polarity (positive/negative). We’ll have to calculate the emotion score based on the words present in the tweets and plot the same.

We can see that the majority of tweets for each Crossfit box had a largely positive sentiment, trust and anticipation were other common emotions. For cfbelltown and crossfitsanitas, the third most common emotion was negative.


## sentiment analysis 

library(syuzhet)

sentiment_scores = list() 

#each_gym="sfcrossfit"
for(each_gym in unique(final_dat[,crossfit_name])){

   print(each_gym)

   tweet_text = final_dat[crossfit_name==each_gym, text]

   # removing retweets
   tweet_text <- gsub("(RT|via)((?:\\b\\w*@\\w+)+)","",tweet_text)

   # removing mentions
   tweet_text <- gsub("@\\w+","",tweet_text)

   all_sentiments <- get_nrc_sentiment((tweet_text))
   #head(all_sentiments)

   final_sentimentscores <- data.table(colSums(all_sentiments))
   #head(final_sentimentscores)

   names(final_sentimentscores) <- "score"
   final_sentimentscores <- cbind("sentiment"=colnames(all_sentiments), final_sentimentscores)
   final_sentimentscores$gym = each_gym
   #final_sentimentscores

   sentiment_scores[[each_gym]] <- final_sentimentscores

}

sentiment_scores <- rbindlist(sentiment_scores)
dim(sentiment_scores)
sentiment_scores

ggplot(sentiment_scores, aes(x=sentiment, y=score))+
   geom_bar(aes(fill=sentiment), stat = "identity")+
   theme(legend.position="none")+
   xlab("Sentiments")+ylab("Scores")+ coord_flip() +
   facet_wrap(~gym) + labs(x="", y="", colour="") +
   theme(plot.title = element_text(hjust = 0.5),
         axis.ticks.x = element_line(colour = "black"),
         axis.ticks.y = element_line(colour = "black"),
         strip.text.x = element_text(size = 10)) +
   ggtitle("Total Sentiment Across All Original Tweets") 

F. Predictive Model

Out of curiosity, I wanted to see if a predictive model could be generated to predict whether a tweet would be retweeted. For this task, I looked at just the tweets from sfcrossfit. The target variable was 0 or 1, with zero representing a tweet that was not retweeted and one representing any tweet that was retweeted one or more times. Using the caret package, I trained a random forest on my training data and tried to predict on new data. We can see from the confusion matrix that the model did a very job at predicting whether a tweet would be retweeted or not. Perhaps with some better feature, a better model could have been produced, but I chose not to invest more time on this task.


library(tidyverse)
library(tidytext)
library(stringr)
library(caret)
library(tm)

final_dat[1:2]

final_dat[, .N, by=crossfit_name]
final_dat_sub = final_dat[crossfit_name=="sfcrossfit",]
final_dat_sub[, favorite_count_new := ifelse(favorite_count==0, 0, 1)]
final_dat_sub[, retweet_count_new := ifelse(retweet_count==0, 0, 1)]
final_dat_sub[1:3]
#final_dat_sub$retweet_count_new

splt <- createDataPartition(final_dat_sub$retweet_count_new, p = 0.70, list = F)
train <- final_dat_sub[splt, .(created_at,favorite_count,is_retweet, display_text_width, contains_hashtags, retweet_count_new)]
dim(train)
test <- final_dat_sub[-splt, .(created_at,favorite_count,is_retweet, display_text_width, contains_hashtags, retweet_count_new)]
dim(test)

congress_rf <- train(x = as.matrix(train[,.(year(created_at), month(created_at),favorite_count,is_retweet, display_text_width, contains_hashtags)]),
                     y = factor(train$retweet_count_new),
                     method = "rf",
                     ntree = 200,
                     trControl = trainControl(method = "oob"))
congress_rf$finalModel
varImpPlot(congress_rf$finalModel)

pred = predict(congress_rf, as.matrix(final_dat_sub[-splt,.(year(created_at), month(created_at),favorite_count,is_retweet, display_text_width, contains_hashtags)]))
confusionMatrix(final_dat_sub[-splt,retweet_count_new], pred, positive="1")

G. Conclusion

So there you have it. An investigation of how several prominent Crossfit gyms are using their Twitter accounts to engage with consumers. At the end of the day, I would suggest that any business in the health and wellness space should invest more time on Instagram or YouTube than Twitter when it comes to brand marketing or customer acquisition. Twitter is great for news and entertainment, but it isn’t the ideal platform to share fitness content or inspire new members.

To leave a comment for the author, please follow the link and comment on their blog: R – Mathew Analytics.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)