Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I am trying to have better workflows to ensure data quality and two important things for me are first, scripting as much as posible the data manipulation process, and second, backing up the database we use under version control (e.g. Git*). I succeeded on that, but it was a 11 TAB problem**, so I though would be good post it here for others, and for my future self.
One goal of the task is to be able to do everything within the same program (Rstudio) for simplicity, so the task implies connecting R to MySQL server via RMySQL, update the database, make a backup copy from R and commit changes.
1) Running mysqldump from R to make an initial backup: RMySQL can’t (that I know) run the most commonly used backup mysqldump. But you can do it by calling the shell (aka Terminal) from R with the function system{base}.
#first we can see which is our path on the shell/terminal system("pwd") #it should be the working directory. #Second, we make a mysqldump (has to be in one non broken line, #sorry beautiful 80 characters-max code lines) system("/Applications/MAMP/Library/bin/mysqldump -u USER -h HOSP_IP -pPASWORD --port=8889 --skip-extended-insert DATABASE > DATABASE.sql") #note that my mysqldump is located in applications folder because #instaled MySQL using MAMP (highly recommended for an easy set up #of a local MySQL environment in a Mac). #Yours may be in /usr/local/mysq/bin/mysqldump #The --skip-extended-insert makes a INSERT for each row, #which makes version control lighter and more readable later on.
2) Put the backup under version control. Rstudio provide a nice hook to git, so you just need to ‘git init‘ the directory, ‘git add DATABASE.sql‘ and ‘git commit‘ the changes with a nice commit message.
3) Connecting R to MYSQL server to update it: There are good tutorials on this, so I’ll be brief. In this example I manipulated data within R (your.data.frame) and want to append the resulting dataframe to an existing table.
library(RMySQL) source(psw.R) # this is a way to avoid committing your password. #You can have an r file (added to .gitignore) with the line #psw <- "mypassword", which you can call later from the code. conn <- dbConnect(dbDriver("MySQL"), user = "USER", password = psw, dbname="DATABASE", port = 8889 , host= "HOST_IP") #this set up the connection. dbListTables(conn) # show tables dbWriteTable(conn, "TABLE", your.data.frame, append = TRUE, overwrite=FALSE, row.name=FALSE) #adds the new rows at the bottom of the table. dbDisconnect(conn) #always disconnect at the end.
4) Make another backup:
system("/Applications/MAMP/Library/bin/mysqldump -u USER -h HOSP_IP -pPASWORD --port=8889 --skip-extended-insert DATABASE > DATABASE.sql")
5) And finally you can make another commit reflecting the changes ‘git add DATABASE.sql‘ and ‘git commit’ with a nice commit message. You can compare the two versions now, and let anyone in the project play with it by pushing it to a central repo (e.g. Github).
I am sure there are other ways to do this, but I am pretty happy of how it works!
—-
*Git forces you to not only document changes, but explain why those changes are done in a nice story. So I prefer it to Audit Trail options in MySQL.
**Someone on the internet suggested to rate the difficulty of a problem in the number of browser tabs you need to open in order to fix it. Sorry I forgot the source.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.