Tabular Data I/O in Julia
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Importing tabular data into Julia can be done in (at least) three ways: reading a delimited file into an array, reading a delimited file into a DataFrame and accessing databases using ODBC.
Reading a file into an array using readdlm
The most basic way to read data into Julia is through the use of the readdlm function, which will create an array:
readdlm
(source
, delim::Char
, T::Type
; options...
)
If you are reading in a fairly normal delimited file, you can get away with just using the first two arguments, source and delim:
julia> airline_array = readdlm("/Users/randyzwitch/airline/1987.csv", ','); | |
julia> size(airline_array) | |
(1311827,29) | |
julia> typeof(airline_array) | |
Array{Any,2} |
julia> airline_array = readdlm("/Users/randyzwitch/airline/1987.csv", ',' , String); | |
julia> size(airline_array) | |
(1311827,29) | |
julia> typeof(airline_array) | |
Array{String,2} |
It’s probably the case that unless you are looking to do linear algebra or other specific mathy type work, you’ll likely find that reading your data into a DataFrame will be more comfortable to work with (especially if you are coming from an R, Python/pandas or even spreadsheet tradition).
To write an array out to a file, you can use the writedlm function (defaults to comma-separated):
writedlm
(filename
, array
, delim::Char
)
Reading a file into a DataFrame using readtable
As I covered in my prior blog post about Julia, you can also read in delimited files into Julia using the DataFrames package, which returns a DataFrame instead of an array. Besides just being able to read in delimited files, the DataFrames package also supports reading in gzippped files on the fly:
julia> using DataFrames | |
julia> airline_df = readtable("/Users/randyzwitch/airline/1987.csv.gz"); | |
julia> size(airline_df) | |
(1311826,29) | |
julia> typeof(airline_df) | |
DataFrame (use methods(DataFrame) to see constructors) |
writetable(filename::String
, df::DataFrame
) By default, the writetable function will use the delimiter specified by the filename extension and default to printing the column names as a header.
Accessing Databases using ODBC
The third major way of importing tabular data into Julia is through the use of ODBC access to various databases such as MySQL and PostgreSQL.
Using a DSN
The Julia ODBC package provides functionality to connect to a database using a Data Source Name (DSN). Assuming you store all the credentials in your DSN (server name, username, password, etc.), connecting to a database is as easy as:
julia> using ODBC | |
julia> ODBC.connect("MySQL") | |
Connection 1 to MySQL successful. |
Of course, if you don’t want to store your password in your DSN (especially in the case where there are multiple users for a computer), you can pass the “usr” and “pwd” arguments to the ODBC.connect function:
ODBC.connect(dsn; usr="", pwd="")
Using a connection string
Alternatively, you can build your own connection strings within a Julia session using the advancedconnect function:
#Amazon Redshift/Postgres connection string | |
Julia> red = advancedconnect("Driver={psqlODBC};ServerName=reporting.XXXXX.us-east-1.redshift.amazonaws.com;Username=XXXX;Password=XXXX;Database=XXXX;Port=XXXX"); | |
Connection 1 to Driver={psqlODBC};ServerName=reporting.XXXXX.us-east-1.redshift.amazonaws.com;Username=XXXX;Password=XXXX;Database=XXXX;Port=XXXX successful. | |
#MySQL connection string | |
julia> my = advancedconnect("Driver={MySQL};user=root;server=localhost;database=airline;") | |
Connection 1 to Driver={MySQL};user=root;server=localhost;database=airline; successful. |
Regardless of which way you connect, you can query data using the query function. If you want your output as a DataFrame, you can assign the result of the function to an object. If you want to save the results to a file, you specify the “file” argument:
julia> using ODBC | |
julia> ODBC.connect("MySQL") | |
Connection 1 to MySQL successful. | |
#Save query results into a DataFrame called 'results' | |
julia> results = query("Select * from a1987;"); | |
julia> typeof(results) | |
DataFrame (use methods(DataFrame) to see constructors) | |
#Save query results to a file, tab-delimited (default) | |
julia> query("Select * from a1987;", file="output.tab", delim = '\t'); | |
Summary
Overall, importing data into Julia is no easier/more difficult than any other language. The biggest thing I’ve noticed thus far is that Julia is a bit less efficient than Python/pandas or R in terms of the amount of RAM needed to store data. In my experience, this is really only an issue once you are working with 1GB+ files (of course, depending on the resources available to you on your machine).
Tabular Data I/O in Julia is an article from randyzwitch.com, a blog dedicated to helping newcomers to Web Analytics, with a heavy focus on WordPress self-hosted blogs and Google Analytics. If you liked this post, please visit randyzwitch.com to read more. Or better yet, tell a friend…the best compliment is to share with others!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.