[This article was first published on Yet Another Blog in Statistical Computing » S+/R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
SQLite is a light-weight database with zero-configuration. Being fast, reliable, and simple, SQLite is a good choice to store / query large data, e.g. terabytes, and is well supported by both Python and R.
In [1]: # LOAD PYTHON PACKAGES In [2]: import pandas as pd In [3]: import pandas.io.sql as pd_sql In [4]: import sqlite3 as sql In [5]: import pyper as pr In [6]: # READ DATA In [7]: py_data = pd.read_table("/home/liuwensui/Documents/data/csdata.txt") In [8]: print py_data <class 'pandas.core.frame.DataFrame'> Int64Index: 4421 entries, 0 to 4420 Data columns: LEV_LT3 4421 non-null values TAX_NDEB 4421 non-null values COLLAT1 4421 non-null values SIZE1 4421 non-null values PROF2 4421 non-null values GROWTH2 4421 non-null values AGE 4421 non-null values LIQ 4421 non-null values IND2A 4421 non-null values IND3A 4421 non-null values IND4A 4421 non-null values IND5A 4421 non-null values dtypes: float64(7), int64(5) In [9]: # CREATE A CONNECTION TO SQLITE DB In [10]: con = sql.connect("/home/liuwensui/Documents/data/tmp.db") In [11]: # WRITE THE DATAFRAME INTO SQLITE DB In [12]: con.execute("drop table if exists tbldata") Out[12]: <sqlite3.Cursor at 0xa00d820> In [13]: pd_sql.write_frame(py_data, "tbldata", con) In [14]: con.commit() In [15]: # TEST THE DATA WRITTEN INTO SQLITE DB In [16]: test_data = pd_sql.read_frame("select * from tbldata limit 5", con) In [17]: print test_data LEV_LT3 TAX_NDEB COLLAT1 SIZE1 PROF2 GROWTH2 AGE LIQ IND2A IND3A IND4A IND5A 0 0 0.530298 0.079172 13.131993 0.082016 1.166493 53 0.385779 0 0 1 0 1 0 0.370025 0.040745 12.132626 0.082615 11.092048 54 0.224123 1 0 0 0 2 0 0.636884 0.307242 13.322921 0.245129 -6.316099 43 0.055441 1 0 0 0 3 0 0.815549 0.295864 16.274536 0.164052 1.394809 24 0.016731 1 0 0 0 4 0 0.097690 0.033567 13.491299 0.160505 10.204010 49 0.387136 1 0 0 0 In [18]: # CREATE A R INSTANCE In [19]: r = pr.R() In [20]: # LOAD R LIBRARY In [21]: print r("library(sqldf)") try({library(sqldf)}) Loading required package: DBI Loading required package: gsubfn Loading required package: proto Loading required namespace: tcltk Loading Tcl/Tk interface ... done Loading required package: chron Loading required package: RSQLite Loading required package: RSQLite.extfuns In [22]: # READ DATA FROM SQLITE DB In [23]: print r("r_data <- sqldf('select * from tbldata', dbname = '/home/liuwensui/Documents/data/tmp.db')") try({r_data <- sqldf('select * from tbldata', dbname = '/home/liuwensui/Documents/data/tmp.db')}) Loading required package: tcltk In [24]: print r("str(r_data)") try({str(r_data)}) 'data.frame': 4421 obs. of 12 variables: $ LEV_LT3 : num 0 0 0 0 0 0 0 0 0 0 ... $ TAX_NDEB: num 0.5303 0.37 0.6369 0.8155 0.0977 ... $ COLLAT1 : num 0.0792 0.0407 0.3072 0.2959 0.0336 ... $ SIZE1 : num 13.1 12.1 13.3 16.3 13.5 ... $ PROF2 : num 0.082 0.0826 0.2451 0.1641 0.1605 ... $ GROWTH2 : num 1.17 11.09 -6.32 1.39 10.2 ... $ AGE : int 53 54 43 24 49 24 35 77 33 81 ... $ LIQ : num 0.3858 0.2241 0.0554 0.0167 0.3871 ... $ IND2A : int 0 1 1 1 1 1 1 1 1 0 ... $ IND3A : int 0 0 0 0 0 0 0 0 0 0 ... $ IND4A : int 1 0 0 0 0 0 0 0 0 0 ... $ IND5A : int 0 0 0 0 0 0 0 0 0 1 ...
To leave a comment for the author, please follow the link and comment on their blog: Yet Another Blog in Statistical Computing » S+/R.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.