Exchange Data between Python and R with SQLite
[This article was first published on Yet Another Blog in Statistical Computing » S+/R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
SQLite is a light-weight database with zero-configuration. Being fast, reliable, and simple, SQLite is a good choice to store / query large data, e.g. terabytes, and is well supported by both Python and R.
In [1]: # LOAD PYTHON PACKAGES
In [2]: import pandas as pd
In [3]: import pandas.io.sql as pd_sql
In [4]: import sqlite3 as sql
In [5]: import pyper as pr
In [6]: # READ DATA
In [7]: py_data = pd.read_table("/home/liuwensui/Documents/data/csdata.txt")
In [8]: print py_data
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4421 entries, 0 to 4420
Data columns:
LEV_LT3 4421 non-null values
TAX_NDEB 4421 non-null values
COLLAT1 4421 non-null values
SIZE1 4421 non-null values
PROF2 4421 non-null values
GROWTH2 4421 non-null values
AGE 4421 non-null values
LIQ 4421 non-null values
IND2A 4421 non-null values
IND3A 4421 non-null values
IND4A 4421 non-null values
IND5A 4421 non-null values
dtypes: float64(7), int64(5)
In [9]: # CREATE A CONNECTION TO SQLITE DB
In [10]: con = sql.connect("/home/liuwensui/Documents/data/tmp.db")
In [11]: # WRITE THE DATAFRAME INTO SQLITE DB
In [12]: con.execute("drop table if exists tbldata")
Out[12]: <sqlite3.Cursor at 0xa00d820>
In [13]: pd_sql.write_frame(py_data, "tbldata", con)
In [14]: con.commit()
In [15]: # TEST THE DATA WRITTEN INTO SQLITE DB
In [16]: test_data = pd_sql.read_frame("select * from tbldata limit 5", con)
In [17]: print test_data
LEV_LT3 TAX_NDEB COLLAT1 SIZE1 PROF2 GROWTH2 AGE LIQ IND2A IND3A IND4A IND5A
0 0 0.530298 0.079172 13.131993 0.082016 1.166493 53 0.385779 0 0 1 0
1 0 0.370025 0.040745 12.132626 0.082615 11.092048 54 0.224123 1 0 0 0
2 0 0.636884 0.307242 13.322921 0.245129 -6.316099 43 0.055441 1 0 0 0
3 0 0.815549 0.295864 16.274536 0.164052 1.394809 24 0.016731 1 0 0 0
4 0 0.097690 0.033567 13.491299 0.160505 10.204010 49 0.387136 1 0 0 0
In [18]: # CREATE A R INSTANCE
In [19]: r = pr.R()
In [20]: # LOAD R LIBRARY
In [21]: print r("library(sqldf)")
try({library(sqldf)})
Loading required package: DBI
Loading required package: gsubfn
Loading required package: proto
Loading required namespace: tcltk
Loading Tcl/Tk interface ... done
Loading required package: chron
Loading required package: RSQLite
Loading required package: RSQLite.extfuns
In [22]: # READ DATA FROM SQLITE DB
In [23]: print r("r_data <- sqldf('select * from tbldata', dbname = '/home/liuwensui/Documents/data/tmp.db')")
try({r_data <- sqldf('select * from tbldata', dbname = '/home/liuwensui/Documents/data/tmp.db')})
Loading required package: tcltk
In [24]: print r("str(r_data)")
try({str(r_data)})
'data.frame': 4421 obs. of 12 variables:
$ LEV_LT3 : num 0 0 0 0 0 0 0 0 0 0 ...
$ TAX_NDEB: num 0.5303 0.37 0.6369 0.8155 0.0977 ...
$ COLLAT1 : num 0.0792 0.0407 0.3072 0.2959 0.0336 ...
$ SIZE1 : num 13.1 12.1 13.3 16.3 13.5 ...
$ PROF2 : num 0.082 0.0826 0.2451 0.1641 0.1605 ...
$ GROWTH2 : num 1.17 11.09 -6.32 1.39 10.2 ...
$ AGE : int 53 54 43 24 49 24 35 77 33 81 ...
$ LIQ : num 0.3858 0.2241 0.0554 0.0167 0.3871 ...
$ IND2A : int 0 1 1 1 1 1 1 1 1 0 ...
$ IND3A : int 0 0 0 0 0 0 0 0 0 0 ...
$ IND4A : int 1 0 0 0 0 0 0 0 0 0 ...
$ IND5A : int 0 0 0 0 0 0 0 0 0 1 ...
To leave a comment for the author, please follow the link and comment on their blog: Yet Another Blog in Statistical Computing » S+/R.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.