
Google Big Query with R

[This article was first published on Stories by Tim M. Schendzielorz on Medium, and kindly contributed to R-bloggers.]

Lightning-fast database querying with the R API.


What is Google Big Query?

Big Query is a highly performant, serverless cloud data warehouse service which started in 2011. You can manage it inside the Google Cloud Console and query the storage with standard SQL, either from the bq command-line tool or via the API. It is easy to set up, scales automatically, and offers a variety of established connectors to Google and third-party services. In this article I will show you the advantages of working with Big Query, how to use the API from R, and how to build queries with dplyr functions.

What are the advantages of Big Query?

Google Big Query has a few advantages in comparison to other cloud data stores. Individual technologies may be comparable or even better on single points, but the combination of all these advantages, and especially the Google integration, is what makes Big Query really stand out. Benchmark results illustrate the performance side:

Big Query shows exceptional performance for queries without GROUP BY statements (Q1) and significantly worse performance with GROUP BY statements (Q2-Q3). With many large JOINs and many GROUP BYs (Q4) it performs in the middle of the other tested technologies. (Source)
Big Query really shines under concurrent queries: query time stays nearly constant compared to the other technologies, thanks to fast auto-scaling. (Source)

Use Big Query with R

Enable Big Query and get your credentials

1. Go to the Google Cloud Platform and log in with your Google account. In the top left corner go to “Choose Project” and start a new project. If you go from your home dashboard to “Go to APIs overview”, you will see the activated APIs of your Google Cloud project. “BigQuery API” and “BigQuery Storage API” should be activated by default for all new projects.
Activated APIs for a new Google Cloud project.

2. Get your API key as described by the gargle R package here. In short, go to the Credentials section of the Google Cloud Platform dashboard shown above and choose Create credentials > API key. You can rename your API key and restrict it to certain APIs such as “BigQuery API”. If you need non-interactive, application access to BQ, you will need a service account token, which you can download as a JSON file.

Querying with R

For querying BQ we will use the R library bigrquery. The other prominent R library for BQ is bigQueryR, which, in contrast to bigrquery, depends on the googleAuthR library and is therefore more compatible with Shiny and other packages.

First, we load the libraries and authenticate, either with our created API key or with the downloaded service account token JSON.
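A minimal sketch of this step, assuming the service account token was saved as “my-service-account.json” (the file name is a placeholder):

# install.packages(c("bigrquery", "DBI", "dplyr"))  # if not yet installed
library(bigrquery)
library(DBI)
library(dplyr)

# Either authenticate interactively with your Google account ...
bq_auth()

# ... or non-interactively with the downloaded service account token
bq_auth(path = "my-service-account.json")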

Now we can start querying our Big Query data or public datasets. We will query the Real-time Air Quality dataset from openAQ. This is an open source project which provides real-time data (if you stretch the definition of “real time”) from 5,490 worldwide air quality measurement stations, which is awesome! You can see the dataset and a short description on Big Query here if you are logged in to Google. To find open datasets in the Cloud Console, scroll down in the left menu; there you should see “Big Query” under the header “Big Data”. If you then go to “+Add Data” you will be able to browse public datasets.
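You can also inspect this dataset directly from R with bigrquery’s low-level helpers. A short sketch, assuming you are already authenticated (global_air_quality is the table the public openaq dataset exposes):

# Point at the public dataset and look at what it contains
openaq <- bq_dataset("bigquery-public-data", "openaq")
bq_dataset_tables(openaq)    # lists the available tables

# Inspect the air quality table itself
air_quality <- bq_table("bigquery-public-data", "openaq", "global_air_quality")
bq_table_fields(air_quality) # column names and types
bq_table_nrow(air_quality)   # number of rows currently stored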

We will wrap the bigrquery API with DBI to be able to use it with dplyr verbs; however, the bigrquery package also provides a low-level API.
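A sketch of how this could look, where “your-billing-project-id” stands in for the ID of the Google Cloud project created above (queries against public data are billed to your own project) and the column names come from the table’s schema:

con <- DBI::dbConnect(
  bigrquery::bigquery(),
  project = "bigquery-public-data",    # project hosting the public dataset
  dataset = "openaq",
  billing = "your-billing-project-id"  # your own project, used for billing
)

air_quality_tbl <- tbl(con, "global_air_quality")

# dplyr verbs are lazily translated to Big Query standard SQL
india_pm25 <- air_quality_tbl %>%
  filter(country == "IN", pollutant == "pm25") %>%
  group_by(city) %>%
  summarise(mean_value = mean(value, na.rm = TRUE)) %>%
  arrange(desc(mean_value))

show_query(india_pm25)   # inspect the generated SQL
collect(india_pm25)      # execute on Big Query and pull the result into R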

In this example you can see querying with dplyr functions that are converted to SQL queries; however, you do not get the full flexibility that direct SQL querying provides. For that, you can send SQL queries via DBI::dbGetQuery() from R, as sketched below. The global air quality dataset gets updated regularly; however, older entries are omitted, probably to save storage costs. Check out my next post on how to build a Dockerized cron job to get the newest air pollution data from India while keeping older records.
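For comparison, the same question as a hand-written standard SQL query sent through DBI::dbGetQuery(); again only a sketch, reusing the connection object con from the snippet above:

DBI::dbGetQuery(con, "
  SELECT city, AVG(value) AS mean_value
  FROM `bigquery-public-data.openaq.global_air_quality`
  WHERE country = 'IN' AND pollutant = 'pm25'
  GROUP BY city
  ORDER BY mean_value DESC
")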

This article was also published on https://www.r-bloggers.com/.


Google Big Query with R was originally published in Analytics Vidhya on Medium.
