Connecting to MAMP install of MySQL in R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In the course of doing a web dev tutorial (namely Tania Rascia’s great post on building a simple CRUD app) I had to install MAMP. This gives you a local server environment consisting of Apache, MySQL, and PHP.
Since I work with databases so often, I was curious about the simplest way to connect to my newly installed instance of MySQL in R.
Which R package?
This tutorial assumes you have already installed MAMP and have it currently running on your local machine in a macOS environment.
There are lots of different ways to connect to a db in R. RStudio has a great website that walks through their enterprise packages for working with databases, but I’ve often found this site confusing. Setting up drivers, ODBC connections, and DSN files is difficult enough in Windows, let alone on a Mac.
So instead of attempting to set up an ODBC connection to MySQL, I am simply going to use the RMariaDB
package coupled with DBI
(I’m skipping the odbc
package altogether). I tend to always trust the folks at RStudio for great package development, and this is a part of the DBI interface that in my experience gives the best performance for database work in R.
So if you haven’t done so already, install the following packages (along with nycflights13
which we’ll be using for a demo):
install.packages("tidyverse") install.packages("DBI") install.packages("RMariaDB") install.packages("nycflights13")
Creating a connection to the database
Okay, let’s first see if we can connect to the database. Since I don’t want to embed my username and password, I am using environmental variables. You can learn more on how to do that on RStudio’s Databases using R page.
library(tidyverse) library(odbc) library(DBI) library(RMariaDB) library(nycflights13) con <- DBI::dbConnect(RMariaDB::MariaDB(), user = Sys.getenv("mysql_user"), password = Sys.getenv("mysql_pass"), host = "localhost")
Error: Failed to connect: Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2)
Great! Immediately we have an error. According to MySQL’s documentation, Unix systems use this socket file for connecting to MySQL databases. Mine isn’t there because I used MAMP to install MySQL; MAMP places the socket file here: /Applications/MAMP/tmp/mysql/mysql.sock
.
So what we can do is write a bash command that creates a hard link from the MAMP location to where the RMariaDB
package expects it:
#either open terminal and run what's within quotes, or run it write from r with the system() function. system("ln -s /Applications/MAMP/tmp/mysql/mysql.sock /tmp/mysql.sock")
Okay, let’s try to connect again.
con <- DBI::dbConnect(RMariaDB::MariaDB(), user = Sys.getenv("mysql_user"), password = Sys.getenv("mysql_pass"), host = "localhost") dbGetQuery(conn = con, statement = "show databases;") ## Database ## 1 information_schema ## 2 crud ## 3 mysql ## 4 nyc_flights ## 5 performance_schema ## 6 sys
Awesome! We are connected to our MySQL instance that we installed with MAMP. Note: If you restart your machine, you will have to re-run the bash command above…Not quite sure why that is, but you can probably figure out a more permanent solution.
Reading and writing from the database
The dbGetQuery
function is the easiest way to read/write to the database. The first argument is just the name of your connection (mine is con
), and the second is your sql statement.
Let’s create a database of the nycflights13
data, and query it!
dbGetQuery(con, "CREATE DATABASE nyc_flights;") #create the database
Great, okay my plan is to insert each of the five dataframes of nycflights13
into the db as tables. So below, I first need to modify my connection to connect to the database we just created. I’m then going to create a simple function that takes each dataframe in the package and copies it to a table in the database.
#new connection con_flights <- DBI::dbConnect(RMariaDB::MariaDB(), user = Sys.getenv("mysql_user"), password = Sys.getenv("mysql_pass"), db = "nyc_flights", host = "localhost") #added db argument. #function insert_db_table <- function(x) { #creating function insert_db_table copy_to(con_flights, x, #x is the dataframe deparse(substitute(x)), #here I'm using the substitute() and deparse() function just to get the dataframe name as a string temporary = FALSE, #we'll make them permanent tables overwrite = TRUE #and overwrite them if they already exist ) } #run the function for each dataframe in nycflights13 insert_db_table(airlines) insert_db_table(airports) insert_db_table(flights) insert_db_table(planes) insert_db_table(weather)
Let’s see if the data’s there.
dbGetQuery(con_flights, statement = "select * from flights") %>% as_tibble() ## # A tibble: 336,776 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 1 1 517 515 2 830 819 ## 2 2013 1 1 533 529 4 850 830 ## 3 2013 1 1 542 540 2 923 850 ## 4 2013 1 1 544 545 -1 1004 1022 ## 5 2013 1 1 554 600 -6 812 837 ## 6 2013 1 1 554 558 -4 740 728 ## 7 2013 1 1 555 600 -5 913 854 ## 8 2013 1 1 557 600 -3 709 723 ## 9 2013 1 1 557 600 -3 838 846 ## 10 2013 1 1 558 600 -2 753 745 ## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>, ## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, ## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Awesome!
Querying with dbplyr
So far we have been using dbGetQuery
, which queries the data and brings it back in memory in R. But since the data is in a SQL database, we can actually write queries remotely and not pull the data into R until we need it; dbplyr
executes dplyr
code on the MySQL server and lazily evaluates it’s results. You can then use collect()
to actually capture the total results in memory for plotting.
tbl_flights <- tbl(con_flights, "flights") #dbplyr's tbl function tbl_airports <- tbl(con_flights, "airports") tbl_flights %>% inner_join(tbl_airports, by = c("dest" = "faa")) %>% count(name) ## # Source: lazy query [?? x 2] ## # Database: mysql [root@localhost:NA/nyc_flights] ## name n ## <chr> <int64> ## 1 Akron Canton Regional Airport 864 ## 2 Albany Intl 439 ## 3 Albuquerque International Sunport 254 ## 4 Asheville Regional Airport 275 ## 5 Austin Bergstrom Intl 2439 ## 6 Baltimore Washington Intl 1781 ## 7 Bangor Intl 375 ## 8 Birmingham Intl 297 ## 9 Blue Grass 1 ## 10 Bob Hope 371 ## # … with more rows
Although in my early experience of using dbplyr
commands I have found it much slower than just bringing all the data into R, I think it’s still a useful method for working with databases. You can read more about it on RStudio’s website. Cheers!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.