Turning Data Into Awesome With sqldf and pandasql
[This article was first published on   Mathew Analytics » R, and kindly contributed to R-bloggers].  (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
            Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Both R and Python possess libraries for using SQL statements to interact with data frames. While both languages have native facilities for manipulating data, the sqldf and pandasql provide a simple and elegant interface for conducting tasks using an intuitive framework that’s widely used by analysts.
R and sqldf
sqldf("SELECT COUNT(*) FROM df2 WHERE state = 'CA'")
 
  COUNT(*)
1        4
 
sqldf("SELECT 
            df2.firstname, 
            df2.lastname, 
            df1.var1, 
            df2.state 
      FROM df1
      INNER JOIN df2 ON df1.personid = df2.id 
      WHERE df2.state = 'TX'")
 
  firstname lastname  var1 state
1     David    Spade -2.09    TX
2       Joe  Montana  1.16    TX
 
sqldf("SELECT 
            df2.state, 
            COUNT(df1.var1) 
      FROM df1
      INNER JOIN df2 ON df1.personid = df2.id WHERE df1.var1 > 0
      GROUP BY df2.state")
 
   state COUNT(df1.var1)
1     AZ               1
2     CA               1
3     GA               1
4     IL               1
5     NC               1
6     NY               1
7     OK               1
8     SC               1
9     TX               1
10    VT               1
Python and pandasql
import pandasql as ps
 
 
q1 = """SELECT COUNT(*) FROM df2 WHERE state = 'CA'"""
 
print ps.sqldf(q1, locals())
 
   COUNT(*)
0         4
 
q2 = """
    SELECT 
        df2.firstname, 
        df2.lastname, 
        df1.var1, 
        df2.state 
    FROM df1 INNER JOIN df2 ON df1.personid = df2.id 
    WHERE df2.state = "TX";
    """
 
print ps.sqldf(q2, locals())
 
  firstname lastname  var1 state
0     David    Spade -2.09    TX
1       Joe  Montana  1.16    TX
 
q3 = """SELECT 
            df2.state, 
            COUNT(df1.var1) 
      FROM df1
      INNER JOIN df2 ON df1.personid = df2.id WHERE df1.var1 > 0
      GROUP BY df2.state"""
 
print ps.sqldf(q3, locals())
 
  state  COUNT(df1.var1)
0    AZ                1
1    CA                1
2    GA                1
3    IL                1
4    NC                1
5    NY                1
6    OK                1
7    SC                1
8    TX                1
9    VT                1
 
		
            
To leave a comment for the author, please follow the link and comment on their blog:  Mathew Analytics » R.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
 

