R usage in Brazil

Posted on May 16, 2019 by R on msperlin in R bloggers | 0 Comments

[This article was first published on R on msperlin, and kindly contributed to R-bloggers].

I’ve been using R for at least five years and have always been curious about its usage in Brazil. I see some minor personal evidence that the number of users is increasing over time. Sales of my book in Portuguese are steadily increasing, and I’ve been receiving far more emails about my R packages. Conferences are also booming: every year there are at least two or three R conferences in Brazil.

What I learned from experience is that software choice is a group decision. It is very likely that you will use whatever your peer group uses. For example, if you are a PhD student, you will never convince your adviser to change research software, even if you have perfectly good reasons!

It takes some independence and autonomy to be able to break free from bad group choices. In academia, you can only do that later on, when you finish your PhD and start your career. Then you can use whatever floats your boat. And, even then, it takes courage and humility to relearn all your research tricks, from data acquisition to research reporting.

In this post I’ll investigate the use of R in Brazil. RStudio publishes a log page covering all R downloads and package installations, stored in downloadable .csv files. The data is organized by day and is very easy to download and parse within R. After downloading it, I kept only the downloads from Brazil and saved the result in a .rds file. Let’s explore it.
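The preparation step described above is not shown in this post. A minimal sketch of it, in base R, could look like the code below. The URL pattern for the daily R-download files and the helper names `build_log_url` and `filter_brazil` are assumptions made for illustration, so check them against the log page before relying on them.

```r
# Sketch only: build the URL of a daily R-download log file.
# ASSUMPTION: files are named <year>/<yyyy-mm-dd>-r.csv.gz on the log page.
build_log_url <- function(day) {
  day <- as.Date(day)
  sprintf('http://cran-logs.rstudio.com/%s/%s-r.csv.gz',
          format(day, '%Y'), format(day, '%Y-%m-%d'))
}

# Keep only the rows whose country code is 'BR' (hypothetical helper)
filter_brazil <- function(df) {
  df[!is.na(df$country) & df$country == 'BR', , drop = FALSE]
}

# The URL that would be requested for a single day
print(build_log_url('2019-05-16'))

# Hypothetical usage (needs an internet connection, so it is commented out):
# days   <- seq(as.Date('2012-10-31'), as.Date('2012-11-02'), by = 'day')
# l.logs <- lapply(build_log_url(days), function(u) {
#   tmp <- tempfile(fileext = '.csv.gz')
#   download.file(u, tmp, mode = 'wb', quiet = TRUE)
#   read.csv(tmp, stringsAsFactors = FALSE)
# })
# df.dls <- filter_brazil(do.call(rbind, l.logs))
# saveRDS(df.dls, 'data/r-downloads-brazil.rds')
```

The file saved by such a step is what gets loaded in the next code block.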

library(tidyverse)

df.dls <- read_rds('data/r-downloads-brazil.rds')

glimpse(df.dls)
## Observations: 72,853
## Variables: 7
## $ date    <date> 2012-10-31, 2012-10-31, 2012-10-31, 2012-10-31, 2012-10…
## $ time    <drtn> 16:17:34, 18:21:35, 19:26:20, 19:28:03, 19:26:03, 19:06…
## $ size    <dbl> 49351035, 33301364, 49351035, 49351024, 1424794, 6652340…
## $ version <chr> "2.15.2", "2.15.2", "2.15.2", "2.15.2", "2.15.2", "2.15.…
## $ os      <chr> "win", "win", "win", "win", "win", "osx", "win", "win", …
## $ country <chr> "BR", "BR", "BR", "BR", "BR", "BR", "BR", "BR", "BR", "B…
## $ ip_id   <dbl> 30, 59, 73, 30, 87, 90, 143, 213, 231, 260, 260, 134, 18…

As you can see, we have the date, time, download size, R version, os (platform), country and an ip identifier (randomized daily). First of all, let’s see how many downloads per day we have for Brazil. I’m also including the release dates of the major R versions.

df_by_day <- df.dls %>%
  group_by(ref.date = date) %>%
  summarise(n = n())

df.R.releases <- tibble(
  ref.date  = as.Date(c('2013-04-03', '2014-04-10', '2015-04-16',
                        '2016-05-03', '2017-04-21', '2018-04-23',
                        '2019-04-26')),
  R_version = c('3.0.0', '3.1.0', '3.2.0',
                '3.3.0', '3.4.0', '3.5.0',
                '3.6.0'))

p <- ggplot(data = df_by_day, aes(y = n, 
                                  x = ref.date) ) + 
  geom_point() + geom_smooth(size = 2) + 
  labs(x = 'Date (day)', y= 'Number of Downloads', 
       title = paste0('Number of R downloads in Brazil'),
       subtitle = 'Data from Rstudio logs <http://cran-logs.rstudio.com/>') + 
  geom_vline(data = df.R.releases,
             aes(xintercept = ref.date, color = R_version ), size = 1) + 
  scale_color_grey(start = 0.8, end = 0.2 )

print(p)

The number of downloads is steadily increasing over time. New releases of R also seem to explain the outliers in the dataset. Let’s clean it up a bit by lowering the frequency and calculating the number of downloads per month instead of per day.

df_by_month <- df.dls %>%
  group_by(ref.month = lubridate::ymd(format(date, '%Y-%m-01'))) %>%
  summarise(n = n())
  
p <- ggplot(data = df_by_month, aes(y = n, 
                                  x = ref.month) ) + 
  geom_point() + geom_smooth(size = 2) + 
  labs(x = 'Date (month)', y= 'Number of Downloads', 
       title = paste0('Number of R downloads in Brazil'),
       subtitle = 'Data from Rstudio logs <http://cran-logs.rstudio.com/>') + 
  geom_vline(data = df.R.releases,
             aes(xintercept = ref.date, color = R_version ), size = 1) + 
  scale_color_grey(start = 0.8, end = 0.2 )


print(p)

Much better! Overall, R downloads average about 910.7 per month, with a compound monthly growth rate of 6.40%. This means that, on average, the number of downloads increases by 6.40% from one month to the next.
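For reference, here is a minimal sketch of how such figures can be computed from a vector of monthly counts. It assumes the compound rate is taken between the first and the last month. The function name `compound_monthly_rate` and the toy counts are made up for illustration, and this is not necessarily the calculation behind the numbers above.

```r
# Compound monthly growth rate: the constant rate r such that
# n_first * (1 + r)^(number of monthly steps) equals n_last.
# ASSUMPTION: only the first and last observations are used.
compound_monthly_rate <- function(n) {
  n_steps <- length(n) - 1
  (n[length(n)] / n[1])^(1 / n_steps) - 1
}

# Toy monthly counts standing in for df_by_month$n (made-up numbers)
n_toy <- c(100, 110, 121, 133.1)

avg_monthly   <- mean(n_toy)                   # average downloads per month
rate_monthly  <- compound_monthly_rate(n_toy)  # growth rate per month

cat(sprintf('Average per month: %.1f\n', avg_monthly))
cat(sprintf('Compound monthly rate: %.2f%%\n', 100 * rate_monthly))
```

With the real data, `n_toy` would be replaced by `df_by_month$n`. A rate based only on the end points is sensitive to unusual first or last months, so a regression of log counts on time is a common alternative.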

The data also includes information about the operating system. Let’s check its distribution:

df_by_os <- df.dls %>%
  group_by(os) %>%
  count() %>%
  na.omit() %>% ungroup() %>%
  mutate(os = fct_recode(os, 
                         "Windows" = "win",
                         'Mac OS' = 'osx',
                         'Linux' = 'src'))

p <- ggplot(df_by_os, aes(x = os, y = n)) + 
  geom_col() + 
  labs(x = 'Operating System', y = 'Number of Download Cases', 
       title = 'Distribution of OS',
       subtitle = 'Data from Rstudio logs <http://cran-logs.rstudio.com/>')

print(p)

Not unexpectedly, Windows is the winner! I’m very surprised to see that Mac OS has more downloads than Linux. With an unfavorable exchange rate and many import taxes, Mac computers (desktop or laptop) are exorbitantly expensive in Brazil. This says a lot about the purchasing power of R users.
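To put numbers on a comparison like this one, the counts can be turned into percentage shares. The sketch below uses made-up counts, not the values from the logs. Keep in mind that the Linux label stands for the 'src' category, as recoded in the code above.

```r
# Made-up counts, standing in for the n column of df_by_os
n_by_os <- c(Windows = 700, `Mac OS` = 200, Linux = 100)

# Share of each OS in the total number of downloads, in percent
share_by_os <- 100 * prop.table(n_by_os)

print(round(share_by_os, 1))
```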

I hope you liked this post. Next time I’ll analyze the logs of package installation in Brazil.
