Site icon R-bloggers

Using Azure as an R datasource, Part 4 – Pulling data from SQL Server to Linux

[This article was first published on Revolutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

by Gregory Vandenbrouck
Software Engineer, Microsoft

This post is the fourth in a series that covers pulling data from Microsoft SQL Server or MySQL/MariaDB on Azure to an R client on Windows or Linux.

In the previous posts, we covered pulling data from SQL Server to Windows and from MySQL/MariaDB to both Windows and Linux. This time we’ll be pulling data from Microsoft SQL Server to an R client on Linux.

Setting up the Linux client machine

Creating the client VM

We ended creating an Ubuntu Virtual Machine (VM) in Azure, using the same process as the previous post in the series. We named it MyAzureUbuntu.cloudapp.net and are accessing it using account azureuser.

Accessing the client VM

To access the client VM from your local machine, you can type ssh azureuser@MyAzureUbuntu.cloudapp.net from a command prompt if you have an ssh client installed. If that fails, verify that ssh (TCP port 22) is one of the endpoints of the VM.

If your local machine runs Windows, you can use PuTTY or MobaXterm as an ssh client.

Connecting to the database from R on Linux

Using RODBC’s odbcDriverConnect function

To install R and the tools/libraries to connect to SQL Server via RODBC, type:

sudo apt-get install r-cran-rodbc unixodbc-bin unixodbc odbcinst freetds-bin tdsodbc

Note: if not using Ubuntu, you may have to use an alternative to apt-get, such as yum, zypper, etc.

To test connecting to SQL Server, you can use the tsql utility, which is included in the FreeTDS package.

FreeTDS defaults are for another database vendor (Sybase) that uses a different port and TDS version than SQL Server. There are a few complications around the TDS version when connecting to SQL Server using FreeTDS. You need to know:

  1. that the TDS version needs to be explicitly specified (not obvious since tsql’s help and manual don’t mention it),
  2. what TDS version should be used, and
  3. how to specify it (tsql’s manual is silent about this).

In our case, 7.1 ended up being the TDS version that worked (although the documentation suggests instead 7.3) and 1433 is the port used by our SQL Azure instance.

Below we’re successfully connecting to our SQL Server instance using tsql and getting back the version of SQL Server:

azureuser@MyAzureUbuntu:~$ TDSVER=7.1 tsql -H MySqlAzure.database.windows.net -p 1433 -U MyUser -P MyPassword
locale is "en_US.UTF-8"
locale charset is "UTF-8"
using default charset "UTF-8"
1> version
using TDS version 7.1
1> select serverproperty('Edition'), serverproperty('ProductVersion')
2> go

SQL Azure       12.0.2000.8
(1 row affected)
1> quit

Notes:

  • tsql does not appear to support specifying the default database and SQL Azure does not allow for changing the database once connected; therefore using tsql we only have access to the master database when connecting to SQL Azure. Not ideal but still useful to test the connection.
  • When not specifying TDSVER or specifying an incorrect TDSVER, you’ll get an error message similar to Adaptive Server connection failed

FreeTDS’s ODBC drivers need to be registered in order to connect using RODBC. Below we show how to do it and how to see the contents of the configuration file.

azureuser@MyAzureUbuntu:~$ cd /etc
azureuser@MyAzureUbuntu:/etc$ wc -l odbcinst.ini
0 odbcinst.ini
azureuser@MyAzureUbuntu:/etc$ sudo odbcinst -i -d -f /usr/share/tdsodbc/odbcinst.ini
odbcinst: Driver installed. Usage count increased to 1.
    Target directory is /etc
azureuser@MyAzureUbuntu:/etc$ head -2 odbcinst.ini
[FreeTDS]
Description=TDS driver (Sybase/MS SQL)
azureuser@MyAzureUbuntu:/etc$

Now we can connect to our SQL Azure instance using RODBC’s odbcDriverConnect() function. We just need to make sure we pass the correct driver name (the name in the square brackets in odbcinst.ini), port number and TDS version in the connection string in addition to the regular settings: server, database, login and password.

azureuser@MyAzureUbuntu:~$ cat test.R
library(RODBC)
connStr <- "Server=MySqlAzure.database.windows.net;uid=MyUser;pwd=MyPassword;Database=MyDatabase" # regular info
connStr <- paste0(connStr, ";Driver=FreeTDS;TDS_Version=8.0;Port=1433") # FreeTDS specific info
conn <- odbcDriverConnect(connStr)
sqlQuery(conn, "select serverproperty('ProductLevel') as [Product Level], db_name() as [Database name]")
close(conn)

azureuser@MyAzureUbuntu:~$ R --slave -f test.R
  Product Level Database name
1           RTM    MyDatabase
azureuser@MyAzureUbuntu:~$

Using RODBC’s odbcConnect function

Some of this information can be persisted in a Data Source Name (DSN), allowing for shorter connection strings. Unfortunately, login and password cannot be persisted in the DSN for SQL Server whereas they can in the case of MySQL.

Below we show how to set up a SQL Azure/SQL Server DSN with FreeTDS, and how to test the changes using isql:

azureuser@MyAzureUbuntu:~$ cat /etc/odbc.ini
[Azure-MSSQL-FreeTDS]
# Matches the name in odbcinst.ini
Driver = FreeTDS
Description = Testing MSSQL Azure
Trace = No
Server = MySqlAzure.database.windows.net
Database = MyDatabase
# Port must be specified as default corresponds to Sybase
PORT=1433
# Sometimes incorrectly documented as "TDS Version" (space instead of underscore)
# Silently ignored when incorrect. Must be specified as default corresponds to Sybase
TDS_Version=7.1
# UID and PWD can only be specified in connection strings
# and are silently ignored in odbc.ini

azureuser@MyAzureUbuntu:~$ echo select 1 + 1 as Answer | isql Azure-MSSQL-FreeTDS -b MyUser MyPassword
+------------+
| Answer     |
+------------+
| 2          |
+------------+
SQLRowCount returns 1
1 rows fetched
azureuser@MyAzureUbuntu:~$

While we’re at it, we use isql to run a script that will create a table with 2 columns and 10K rows, all with random values between 0 and 1. isql does not support batch delimiters (such as GO), therefore each batch must fit on a single line:

azureuser@MyAzureUbuntu:~$ cat createTable.sql
set nocount on
select 'Creating table ' + db_name() + '..MyTable on machine ' + @@servername+ '...'
create table MyTable(x float not null, y float not null)
declare @i int = 0; while @i < 10000 begin insert into dbo.MyTable values(rand(), rand()); set @i += 1; end
select count(*) as [Total number of rows] from dbo.MyTable

azureuser@MyAzureUbuntu:~$ cat createTable.sql | isql Azure-MSSQL-FreeTDS -b MyUser MyPassword | grep -v +-
SQLRowCount returns -1
|
|
| Creating table MyDatabase..MyTable on machine MySqlAzure...
|
SQLRowCount returns 1
1 rows fetched
SQLRowCount returns -1
SQLRowCount returns 1
| Total number of rows|
| 10000            |
SQLRowCount returns 1
1 rows fetched
azureuser@MyAzureUbuntu:~$ 

Note: script may take a while to complete. Be patient 🙂

We can now query our table using the DNS and RODBC’s odbcConnect() function. Here we’re querying the x column of MyTable and showing the impact of increasing the sample size: the distribution of the samples mean gets more concentrated around the population mean (0.5).

library(RODBC)
conn <- odbcConnect("Azure-MSSQL-FreeTDS", "MyUser", "MyPassword")
ds <- sqlQuery(conn, "SELECT x FROM MyTable")
close(conn)
randomSamples <- function(times, sampleSize) {
    replicate(n = times, expr = mean(sample(ds$x, size = sampleSize, replace = FALSE)), simplify = TRUE)
}
par(mfrow=c(2,2))

numberSamples <- 100
mean <- .5

for (i in 2:5) {
    sampleSize <- i ^ 2
    plot(density(randomSamples(numberSamples, sampleSize)),
        main = paste0("Sample size: ", sampleSize),
        col = i,
        xlab = "",
        xlim = c(.2, .8))
    abline(v = mean)
}

 


par(mfrow=c(1,1))

Using RJDBC

Setup

To use RJDBC to connect to SQL Server, you need to:

  • From a command prompt:
    • Install Java: sudo apt-get install openjdk-8-jdk libjtds-java sqlline
    • Register Java in R: sudo R CMD javareconf
  • From an R terminal:
    • Install the R package (no Ubuntu packages via apt-get are provided): install.packages("RJDBC")

In R:

In addition to the url connector prefix, we also need to specify in the call to JDBC:

  • classPath: the location of the jar file. Running locate *jtds*.jar from a command prompt may help here. If you don’t get any results, run sudo updatedb first.
  • driverClass: the class name. If the one we use below doesn’t work, check the corresponding JDBC driver documentation.

In our particular setup:

library(RJDBC)
## Loading required package: DBI
## Loading required package: rJava
drv <- JDBC(
  driverClass = "net.sourceforge.jtds.jdbc.Driver", # check driver's documentation name
  classPath = "/usr/share/java/jtds.jar", # use locate and updatedb to find location
  identifier.quote="`")
conn <- dbConnect(drv, 
  "jdbc:jtds:sqlserver://MySqlAzure.database.windows.net/MyDatabase", 
  "MyUser",
  "MyPassword")
# only query 1/4 of the records otherwise we get a solid color graph 
ds <- dbGetQuery(conn, "select top 25 percent * from MyTable")
dbDisconnect(conn)
## [1] TRUE
plot(ds, 
     col=rgb(200,0,0,50, maxColorValue=255), 
     main = "Querying SQL Server with RJDBC", 
     pch = 19, cex = 1)

Using RSQLServer

Much easier to use than RJDBC (no url connector, driverClass and classpath to figure out). The prerequisites are the same as for RJDBC (RSSQLServer is a wrapper on top of RJDBC): install and register Java, then install the package.

    library(RSQLServer)
    conn <- dbConnect(
        RSQLServer::SQLServer(), 
        "MySqlAzure.database.windows.net", 
        "useNTLMv2=false;user=MyUser;Password=MyPassword",
        database = "MyDatabase")
    ds <- dbGetQuery(conn, "select * from MyTable")
    dbDisconnect(conn)
## [1] TRUE
    plot(
         density(ds$x*ds$y),
         col = "salmon",
         lwd = 3,
         main = "Querying SQL Server with RSQLServer")

Using Microsoft’s ODBC drivers

Microsoft provides Linux ODBC drivers for SQL Server that can be used with RODBC.

Before you get started

Here are some pros/cons that may help deciding if the pros justify the cons in your specific case:

  • Pros:
    • Better defaults than FreeTDS so configuration is simplified.
    • From our limited testing, better overall performance than FreeTDS and RJDBC/RSQLServer.
    • Theoretical feature-set compatibility with SQL Server should be better than FreeTDS’s one.
  • Cons:
    • These drivers are only supported on a couple specific Linux distributions.
    • The process of getting these installed is much more involved than FreeTDS since it is done with scripts instead of packages.
    • These drivers depend on a version of unixODBC that is also not available as a package meaning that unixODBC needs to be manually installed, and in turn any other component that depends on unixODBC may have to be manually compiled and installed.

Here are some links that provide some more insights:

Driver setup

For our testing, we created a machine in Azure using the latest SP of the “SUSE Linux Enterprise 11” images: Microsoft’s ODBC drivers for SQL Server on Linux require Redhat or Suse.

We named this VM MyAzureSuse.cloudapp.net and are connecting to it with account azureuser.

Another of the prerequisites is “64-bit UnixODBC 2.3.0”, which unfortunately is not available as a package at this time. To check what version is available, run zypper info UnixODBC and look at the version field (in our case: 2.2.12-198.17).

Since two versions of unixODBC cannot be installed at the same time, you need to ensure it’s not yet installed (for example by running locate unixodbc) and un-install it if needed.

After downloading the driver under /tmp, we did the following from a root prompt started by running sudo bash (“[…]” means we’ve trimmed the output for brievety):

MyAzureSuse:/tmp # zypper install gcc 
MyAzureSuse:/tmp # tar zxvf msodbcsql-11.0.2260.0.tar.gz
[...]
MyAzureSuse:/tmp # cd msodbcsql-11.0.2260.0/
MyAzureSuse:/tmp/msodbcsql-11.0.2260.0 # ./build_dm.sh
[...]

The script when successful prompts to run a second command to install unixODBC, in our case: cd /tmp/unixODBC.27735.30044.32204/unixODBC-2.3.0; make install

Once unixODBC 2.3 is successfully installed, we can proceed with the installation of Microsoft’s drivers:

MyAzureSuse:/tmp/msodbcsql-11.0.2260.0 # ./install.sh verify
[...]
MyAzureSuse:/tmp/msodbcsql-11.0.2260.0 # ./install.sh install
[...]
MyAzureSuse:/tmp/msodbcsql-11.0.2260.0 # odbcinst -q -d -n "ODBC Driver 11 for SQL Server"
[...]
MyAzureSuse:/tmp/msodbcsql-11.0.2260.0 # cat /etc/odbcinst.ini
[ODBC Driver 11 for SQL Server]
Description=Microsoft ODBC Driver 11 for SQL Server
Driver=/opt/microsoft/msodbcsql/lib64/libmsodbcsql-11.0.so.2260.0
Threading=1
UsageCount=1

We can now use RODBC’s odbcDriverConnect() or odbcConnect() functions.

Below is the appropriate DSN configuration entry to our test server in /etc/odbc.ini:

[Azure-MSSQL-Microsoft]
Driver = ODBC Driver 11 for SQL Server
Server = tcp:MySqlAzure.database.windows.net
Database = MyDatabase
Description = Testing MSSQL Azure using Microsoft's drivers

Note: user/password entries in odbc.ini are silently ignored for SQL Server.

The quick test below with isql confirms everything is working:

azureuser@MyAzureSuse:~> echo select 'Server=' + @@servername | isql Azure-MSSQL-Microsoft MyUser MyPassword | grep Server
| Server=MySqlAzure          

Using in R

We installed R from a terminal: sudo zypper install r-base. For some versions of Suse you may have to look a little to find the package. This page is helpful (click on the “Show other versions” link).

Then we installed RODBC from an R session install.packages("RODBC"), and verified we have connectivity to SQL Server using the DSN that uses Microsoft’s drivers:

library(RODBC)
conn <- odbcConnect("Azure-MSSQL-Microsoft", "MyUser", "MyPassword")
sqlQuery(conn, "SELECT COUNT(*) AS CountRows FROM dbo.MyTable")
##   CountRows
## 1     10000
close(conn)

Summary

  • We’ve tried the RODBC, RJDBC and RSQLServer packages to connect to a SQL Azure/SQL Server database from a Linux client, also hosted in a VM on Azure.
  • We also tested Microsoft’s ODBC drivers for Linux, on Suse.
  • There’s no clear winner here; it really depends on what matters most:
    • Those that favor performance will want Microsoft’s ODBC drivers with RODBC.
    • Those that favor portability will use RODBC with a DSN (using odbcConnect) or RSQLServer.
    • Those that favor ease of setup and configuration will use RSQLServer.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.