Blue Bikes Sharing in Boston
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Blue Bikes is a bicycle sharing system in the Boston, Massachusetts. The bikes sharing program started on 28 July 2011. This program aimed for individuals to use it for short-term basis for a price. The program allows individuals to borrow a bike from a dock station and retrun to another dock station after using it. According to 2017 report, there are more than 1600 bicycles in used.
The aims of the current project are to gain insight of the user who used this program such as gender. It also aimed to investigate the duration of the usage. The dataset used in this project can be found in this link. For the purpose of this project, two datasets are used the station dataset as recorded in 16th August 2017 and the trip data on 8th August 2018.
library(data.table) # for faster loading on large dataset library(tidyverse) # load ggplot, dplyr library(leaflet) # interactive mapping library(leaflet.extras) #extra mapping for leaflet library(geosphere) #used to calculate distance library(lubridate) #formatting date and time library(gridExtra) # multiple plot arrange library(grid) #multiple plot arrange library(psych) #to get describe function #load the data set data <- as.data.frame(fread("201807-bluebikes-tripdata.csv")) #load station data station <- read.csv("Hubway_Stations_as_of_July_2017.csv")
Before the analysis, let's visualize all the dock stations around Boston.
map1 <- station %>% leaflet() %>% setView(lng = -71.0589, lat = 42.3601, zoom = 13) %>% addTiles() %>% addMarkers(lng = station$Longitude, lat = station$Latitude, popup = station$Station) map1
As you can see above, these are the distribution of dock station around Boston. You can run the code in RStudio and interact with the map.
Data Cleaning
glimpse(data) ## Observations: 242,891 ## Variables: 15 ## $ tripduration <int> 220, 6578, 1311, 831, 1318, 609, 523... ## $ starttime <chr> "2018-07-01 00:01:06.5090", "2018-07... ## $ stoptime <chr> "2018-07-01 00:04:46.9020", "2018-07... ## $ `start station id` <int> 11, 26, 20, 116, 194, 75, 149, 75, 7... ## $ `start station name` <chr> "Longwood Ave at Binney St", "Washin... ## $ `start station latitude` <dbl> 42.33863, 42.34152, 42.35991, 42.370... ## $ `start station longitude` <dbl> -71.10650, -71.06892, -71.05143, -71... ## $ `end station id` <int> 45, 26, 61, 117, 100, 179, 104, 179,... ## $ `end station name` <chr> "Jersey St. at Boylston St.", "Washi... ## $ `end station latitude` <dbl> 42.34468, 42.34152, 42.34876, 42.366... ## $ `end station longitude` <dbl> -71.09785, -71.06892, -71.08238, -71... ## $ bikeid <int> 1188, 2020, 2875, 2334, 2813, 2152, ... ## $ usertype <chr> "Subscriber", "Subscriber", "Subscri... ## $ `birth year` <int> 1969, 1969, 1961, 1990, 1990, 1993, ... ## $ gender <int> 0, 0, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, ...
A quick overview of the dataset shows that there is 242,891 observation of the dat and 15 variables. The date range from 1st July 2018 to 31st July 2018. In addition, the data is not as clean as it looks. First, the trip duration is caluculated in second which we need to convert into minutes. Next, only birth year gave which we need to perform a simple calculation to calculate users' age. Besides that, gender variable is dummy coded and the data dictionary does not explain what does each value mean. Therefore, I'm going to code them based on assumption (0 = Female, 1 = Male, 2 = Prefered not to say). Although the data dictionary explained that there are different user types such as Casual = 24-Hour or 72-Hour Pass user; Member = Annual or Monthly Member. But, all have been coded into one type which is a subscriber. This does not provide any use for the analysis. Lastly, I'm going to calculate the trip distances based on the start and end point of longitude and latitude using haversine formula. The formula is given below:
\(a = \sin^2(\frac{\bigtriangleup \varphi}{2} + \cos \varphi_l * \cos \varphi_2 * \sin^2(\frac{\bigtriangleup \lambda}{2}))\)
\(c = 2 * a \tan2 (\sqrt{a}, \sqrt{(1 – a)}\)
\(d = R * c\)
where \(\varphi\) is latitude, \(\lambda\) is longitude, R is earth's radius which is 6371. This formula calculates the distance into the meter. To convert it into miles changes the earth's radius to 3958.756. The details of the formula can be seen here.. The distance calculated here might not be accurate because haversine formula is calculated based on shortest distance without considering the slope, hill and other factors.
#convert second to minutes and round it non decimal points data$minutes <- round(data$tripduration/60, 0) #calculate the age from birth year data$age <- 2018 - data$`birth year` #recoding gender variables data$gender <- as.factor(recode(data$gender, "0" = "Female", "1" = "Male", "2" = "Prefer not to say")) #convert the time and date variable to correct format data$starttime <- ymd_hms(data$starttime) data$stoptime <- ymd_hms(data$stoptime) #separate date and time data <- data %>% mutate_at(vars(starttime), ymd_hms) %>% mutate_at(vars(starttime), funs("start_date" = date(.))) data <- data %>% mutate_at(vars(stoptime), ymd_hms) %>% mutate_at(vars(stoptime), funs("stop_date" = date(.))) #extracting day, weekday and hour data$day <- day(data$starttime) data$weekday <- wday(data$starttime, label = TRUE) data$hour <- hour(data$starttime) # writing function for the distance formula dist <- function(long1, lat1, long2, lat2) { R <- 6371 radian <- pi / 180 a1 <- lat1 * radian a2 <- long1 * radian b1 <- lat2 * radian b2 <- long2 * radian diff_long <- b2 - a2 diff_lat <- b1 - a1 a <- (sin(diff_lat/2))^2 + cos(a1) * cos(b1) * (sin(diff_long/2))^2 c <- 2 * atan2(sqrt(a), sqrt(1 - a)) d <- R * c return(d) } # return the calculation into 2 decimals and data frame a <- as.data.frame(round(dist(data$`start station longitude`, data$`start station latitude`, data$`end station longitude`, data$`end station latitude`),2)) #binding two data frame new_data <- cbind(data, a) #select relevant variable bike_data <- select(new_data, "tripduration", "start_date", "stop_date", "start station latitude", "start station longitude", "end station latitude", "end station longitude", "gender", "minutes", "age", "round(dist(data$`start station longitude`, data$`start station latitude`, data$`end station longitude`, data$`end station latitude`), 2)", "day", "weekday", "hour") #rename variabe colnames(bike_data) <- c("tripduration", "start_date", "stop_date", "start_lat", "start_long", "end_lat", "end_long", "gender", "minutes", "age", "distances", "day", "weekday", "hour") # convert into factor bike_data$weekday <- as.factor(bike_data$weekday) glimpse(bike_data) ## Observations: 242,891 ## Variables: 14 ## $ tripduration <int> 220, 6578, 1311, 831, 1318, 609, 523, 564, 1499, ... ## $ start_date <date> 2018-07-01, 2018-07-01, 2018-07-01, 2018-07-01, ... ## $ stop_date <date> 2018-07-01, 2018-07-01, 2018-07-01, 2018-07-01, ... ## $ start_lat <dbl> 42.33863, 42.34152, 42.35991, 42.37080, 42.38614,... ## $ start_long <dbl> -71.10650, -71.06892, -71.05143, -71.10441, -71.0... ## $ end_lat <dbl> 42.34468, 42.34152, 42.34876, 42.36610, 42.39697,... ## $ end_long <dbl> -71.09785, -71.06892, -71.08238, -71.08639, -71.1... ## $ gender <fct> Female, Female, Male, Male, Male, Male, Male, Pre... ## $ minutes <dbl> 4, 110, 22, 14, 22, 10, 9, 9, 25, 22, 5, 23, 3, 1... ## $ age <dbl> 49, 49, 57, 28, 28, 25, 45, 26, 24, 26, 31, 28, 2... ## $ distances <dbl> 0.98, 0.00, 2.83, 1.57, 3.87, 0.92, 1.86, 0.92, 3... ## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1... ## $ weekday <ord> Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun,... ## $ hour <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0... describe(bike_data$age) ## vars n mean sd median trimmed mad min max range skew ## X1 1 242891 35.65 11.81 32 34.91 11.86 16 132 116 0.64 ## kurtosis se ## X1 -0.07 0.02
A quick examination shows that there are many impossible values from the age where the maximum age is 132. By guessing, it maybe because some of the users prefered not to identify their age and choose the default setting from the Blue Bikes system. Therefore, any age above 99 will be removed.
bike_data <- bike_data[bike_data$age < 99,]
Exploratory Data Analysis
First we are going to examine the Blue Bikes usage in July, 2018.
group <- bike_data %>% group_by(start_date) %>% summarise(total = n()) group %>% ggplot(aes(x = start_date, y = total)) + geom_line(lty = 1) + geom_point() + labs(title = "Total of Blue Bikes Usage in July, 2018", x = "Date", y = "Total of Blue Bikes Used") + theme_bw() + annotate(geom="text", x = as.Date("2018-07-03"), y = 5000, label = " mean = 7833.87\n sd = 1508.95\n min = 3466\n max = 9860", color = "red", size = 3.5)
The plot above shows that total usage of the Blue Bikes on July 2018. It shows that there are two lowest usage of Blue Bikes during 17th and 22nd of July. Is there a reason why this is happening maybe due to weather, weekend or holiday?
Next, I'm going to examine which day have the most usage of the Blue Bikes.
days <- bike_data %>% group_by(weekday, start_date) %>% summarise(total = n()) %>% filter(start_date < as_date("2018-07-08")) day2 <- bike_data %>% group_by(weekday, start_date) %>% summarise(n = n()) %>% filter(start_date < as_date("2018-07-15") & start_date > as_date("2018-07-07")) day3 <- bike_data %>% group_by(weekday, start_date) %>% summarise(n1 = n()) %>% filter(start_date < as_date("2018-07-22") & start_date > as_date("2018-07-14")) day4 <- bike_data %>% group_by(weekday, start_date) %>% summarise(n2 = n()) %>% filter(start_date < as_date("2018-07-29") & start_date > as_date("2018-07-21")) p1 <- ggplot(days, aes(x = weekday, y = total)) + geom_col(fill = "red") + geom_label(aes(label = total), color = "black", hjust = 1) + coord_flip() + labs(title = "From 1st July to 7th July", x = "Weekday", y = "Total") + theme_bw() p2 <- ggplot(day2, aes(x = weekday, y = n)) + geom_col(fill = "red") + geom_label(aes(label = n), color = "black", hjust = 1) + coord_flip() + labs(title = "From 8th July to 14th July", x = "Weekday", y = "Total") + theme_bw() p3 <- ggplot(day3, aes(x = weekday, y = n1)) + geom_col(fill = "red") + geom_label(aes(label = n1), color = "black", hjust = 1) + coord_flip() + labs(title = "From 15th July to 21st July", x = "Weekday", y = "Total") + theme_bw() p4 <- ggplot(day4, aes(x = weekday, y = n2)) + geom_col(fill = "red") + geom_label(aes(label = n2), color = "black", hjust = 1) + coord_flip() + labs(title = "From 22nd July to 28th July", x = "Weekday", y = "Total") + theme_bw() grid.arrange(p1,p2,p3,p4, top = textGrob("Total of Blue Bikes Usage in Four Different Weeks",gp = gpar(fontsize = 15, font = 3)))
As presented above, it shows that weekdays have the highest Blue Bikes usage while weekends have the lowest.
Next, I'm going to visulize hour that have most Blue Bikes usage.
Sun <- bike_data %>% group_by(hour, start_date) %>% summarise(total = n()) %>% filter(start_date < as_date("2018-07-02")) Mon <- bike_data %>% group_by(hour, start_date) %>% summarise(total = n()) %>% filter(start_date < as_date("2018-07-03") & start_date > as_date("2018-07-01")) Tue <- bike_data %>% group_by(hour, start_date) %>% summarise(total = n()) %>% filter(start_date < as_date("2018-07-04") & start_date > as_date("2018-07-02")) Wed <- bike_data %>% group_by(hour, start_date) %>% summarise(total = n()) %>% filter(start_date < as_date("2018-07-05") & start_date > as_date("2018-07-03")) Thu <- bike_data %>% group_by(hour, start_date) %>% summarise(total = n()) %>% filter(start_date < as_date("2018-07-06") & start_date > as_date("2018-07-04")) Fri <- bike_data %>% group_by(hour, start_date) %>% summarise(total = n()) %>% filter(start_date < as_date("2018-07-07") & start_date > as_date("2018-07-05")) Sat <- bike_data %>% group_by(hour, start_date) %>% summarise(total = n()) %>% filter(start_date < as_date("2018-07-08") & start_date > as_date("2018-07-06")) m1 <- Sun %>% ggplot(aes(x = hour, y = total)) + geom_line(lty = 2) + geom_point() + labs(title = "Sunday", x = "Hour", y = "Total") + theme_bw() m2 <- Mon %>% ggplot(aes(x = hour, y = total)) + geom_line(lty = 2) + geom_point() + labs(title = "Monday", x = "Hour", y = "Total") + theme_bw() m3 <- Tue %>% ggplot(aes(x = hour, y = total)) + geom_line(lty = 2) + geom_point() + labs(title = "Tuesday", x = "Hour", y = "Total") + theme_bw() m4 <- Wed %>% ggplot(aes(x = hour, y = total)) + geom_line(lty = 2) + geom_point() + labs(title = "Wednesday", x = "Hour", y = "Total") + theme_bw() m5 <- Thu %>% ggplot(aes(x = hour, y = total)) + geom_line(lty = 2) + geom_point() + labs(title = "Thursday", x = "Hour", y = "Total") + theme_bw() m6 <- Fri %>% ggplot(aes(x = hour, y = total)) + geom_line(lty = 2) + geom_point() + labs(title = "Friday", x = "Hour", y = "Total") + theme_bw() m7 <- Sat %>% ggplot(aes(x = hour, y = total)) + geom_line(lty = 2) + geom_point() + labs(title = "Saturday", x = "Hour", y = "Total") + theme_bw() grid.arrange(m1,m2,m3,m4,m5,m6,m7, top = textGrob("Total of Blue Bikes Usage in Hour for One Week",gp = gpar(fontsize = 15,font = 3)))
As you can see in the plot, it shows that there is a pattern across different hour during weekdays and weekends. On weekends, Blue Bikes mostly likely to be used during the afternoon. However, during weekday Blue Bikes are most likely to be used on before and after working hour.
Next, I'm going to visualize the gender and age group most likely to use Blue Bikes.
gender <- bike_data %>% group_by(gender) %>% summarise(total = n()) %>% filter(gender != "Prefer not to say") gender %>% ggplot(aes(x = gender, y = total, fill = gender)) + geom_col() + geom_text(aes(label = total), vjust = -0.1) + labs(main = "Gender on Blue Bikes Usage", x = "Gender", y = "Total") + guides(fill = FALSE) + theme_bw()
Based on the observation, Male are more likely to used Blue Bikes compared to female in July 2018. However, this plot might not be true because we code the gender based on our assumption.
Let's test is there is a significant difference between gender and distance travel using Blue Bikes.
Before that, let's examine the distance.
ggplot(data = bike_data, aes(x = distances)) + geom_density(alpha = 0.3, fill = "black") + guides(fill = FALSE) + labs(title = "Desntiy Plot of Distance", x = "Distance", y = "Density") + annotate(geom ="text", x = 6, y = 0.3, label = " mean = 1.97\n sd = 1.37\n min = 0\n max = 11.24", color = "red", size = 4) + theme_bw()
Based on the description analysis, it shows that the distance is not normally distributed and it is positively skewed. It was suggested that on average the travel distances of Blue Bikes are 1.97 meter and the maximum travel distance is 11.24 meter.
Let's examine the difference between male and female.
gender_distance <- bike_data %>% group_by(gender, distances) %>% summarise(total = n()) %>% filter(gender != "Prefer not to say") gender_distance %>% ggplot(aes(x = distances)) + geom_density(alpha = 0.3, fill = "black") + guides(fill = FALSE) + labs(title = "Density Plot of Distance Based on Gender" , x = "Distance", y = "Density") + facet_wrap(~ gender)
The plot shows that both male and female have a similar distribution, but this does not give much information about the differeces between gender on distance travel using Blue Bikes. I'm going to conduct a T-Test to test the hypothesis.
t.test(gender_distance$distances ~ gender_distance$gender) ## ## Welch Two Sample t-test ## ## data: gender_distance$distances by gender_distance$gender ## t = -2.2447, df = 1610.1, p-value = 0.02492 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## -0.52101431 -0.03509147 ## sample estimates: ## mean in group Female mean in group Male ## 4.078028 4.356081
The T-test shows that \(t_{1610.1} = -2.2447\) with p = 0.025 which is lower than 0.05 suggested that there is a significant difference between male and female when using Blue Bikes. It was suggested that male are more likely to travel futther 0.28 meter compared to female.
gender_distance %>% ggplot(aes(x = gender, y = distances, fill = gender)) + geom_boxplot() + labs(title = "Boxplot of Gender vs Distance", x = "Gender", y = "Distances") + guides(fill=guide_legend(title = "Gender")) + theme_bw()
Next, I'm going to visualize the heatmap of starting location and ending location of using Blue Bikes.
#starting location #mapping bike_data %>% leaflet() %>% setView(lng = -71.0589, lat = 42.3601, zoom = 13) %>% addTiles() %>% addHeatmap(lng = bike_data$start_long, lat = bike_data$start_lat, max = 2, radius = 15)
#ending location #mapping bike_data %>% leaflet() %>% setView(lng = -71.0589, lat = 42.3601, zoom = 13) %>% addTiles() %>% addHeatmap(lng = bike_data$end_long, lat = bike_data$end_lat, max = 2, radius = 15)
Conclusion
This analysis is just my personal interest and trying to mapping using leaflet.
Related Post
- Analysis of Los Angeles Crime with R
- Mapping the Prevalence of Alzheimer Disease Mortality in the USA
- Animating the Goals of the World Cup: Comparing the old vs. new gganimate and tweenr API
- Machine Learning Results in R: one plot to rule them all! (Part 1 – Classification Models)
- Seaborn Categorical Plots in Python
Categories
Tags
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.