PythonMusings #6: dplyr in Python? First impressions of the siuba (小巴) module

Benjamin Smith

6 months ago

[This article was first published on r – bensstats, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Whats great about Blogging and social media is that you get to learn so much if you use it right. After sharing my last blog on LinkedIn, Casper Crause told me about the siuba module created by Michael Chow from which allows Python users experience to using R’s powerful dplyr package (developed by Hadley Wickham) for data wrangling. What impressed me the most was seeing how similar the code was to dplyr syntax and functionality. There is even a pipe operator that you can use with >> used in lieu of the magrittr pipe %>%.

In this blog post, I am going to give the siuba module a spin by looking at the wine ratings dataset from the TidyTuesday project. While I hope to post an entire analysis in Python in the future, my focus is primarily on the features that the siuba has to offer. The questions, however trivial, are aimed at trying to highlight the features that are available.

The data

To stay in the spirit of doing things entirely in Python, I’m going to load the raw data from the github with pandas’ read.csv() function.

import pandas as pd


dt = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-05-28/winemag-data-130k-v2.csv")

Now that the data is loaded, lets begin with playing with the data! The following questions we’re going to ask are the following:

What are the 10 most frequent countries in the data set?
Who were the tasters who reviewed Cabernet Sauvignon wines?
What was is the 5 number summary for cost of Merlot wines?
What were the top wine varieties with the best scores?

While these questions are basic, it will give us the opportunity to show what the siuba module can do. I’m sure there are more sophisticated questions that we can ask, but this should be good for now to show off what can be done with siuba.

So without further ado, lets get into it!

1. The 10 most frequent countries in the data set.

This seems like something pretty standard that can be done in R with dplyr. The following is a comparison between dplyr in R and siuba in Python:

With `dplyr` (in R)

dt2 %>% 
  select(country) %>% 
  group_by(country) %>%
  summarize(n=n()) %>% 
  arrange(desc(n)) %>% 
  top_n(10)


## Error: `n()` must only be used inside dplyr verbs.

With `siuba` (in Python)

from siuba import *
from siuba.dply.vector import *



(
  dt >> 
  select(_.country)>>
  group_by(_.country)>>
  summarize(n= n( _.country)) >>

  # using desc
  arrange(desc(_.n)) >>
  top_n(10)
)


     country      n
40         US  54504
37      Spain   6645
31   Portugal   5691
22      Italy  19540
17    Germany   2165
15     France  22093
8       Chile   4472
3     Austria   3345
2   Australia   2329
0   Argentina   3800

There are a few differences that are immediately noticeable. With siuba we have to import all the the functions available and also import the siuba.dply.vector to get the relevant dplyr functions. Additionally, the desc(_.n) argument in the arrange() doesn’t give the top 10 most frequented countries in descending order.

After a bit of trial and error and some Googling, to remedy the siuba code, instead of writing arrange(desc(_.n)), we can write arrange(-_.n).

(
dt >> 
  select(_.country)>>
  group_by(_.country)>>
  summarize(n= n( _.country)) >>

  # desc doesn't work
  arrange(-_.n) >>
  top_n(10)

)


      country      n
40         US  54504
15     France  22093
22      Italy  19540
37      Spain   6645
31   Portugal   5691
8       Chile   4472
0   Argentina   3800
3     Austria   3345
2   Australia   2329
17    Germany   2165

Much better!

2.The tasters who reviewed Cabernet Sauvignon wines

The difference between siuba and dplyr here seems to be more rooted in the difference between R and Python (but I could be wrong). R accounts for NA values while Python does not.

With `dplyr` (in R)

dt2 %>% 
  filter(variety =="Cabernet Sauvignon") %>% 
  select(taster_name) %>% 
  group_by(taster_name) %>% 
  summarize(n=n()) %>% 
  arrange(desc(n))


## Error: `n()` must only be used inside dplyr verbs.

With `siuba` (in Python)

(
  dt >>
  filter(_.variety=="Cabernet Sauvignon") >>
  select(_.taster_name)>>
  group_by(_.taster_name)>>
  summarize(n= n( _.taster_name)) >>

  # using desc
  arrange(-_.n)
)


           taster_name     n
17      Virginie Boone  1829
11   Michael Schachner  1358
15    Sean P. Sullivan   785
13        Paul Gregutt   677
10       Matt Kettmann   454
6           Jim Gordon   399
7       Joe Czerwinski   263
9        Lauren Buzzeo   140
1   Anna Lee C. Iijima    94
16     Susan Kostrzewa    67
12       Mike DeSimone    66
14          Roger Voss    49
5         Jeff Jenssen    35
8        Kerin O'Keefe    32
0   Alexander Peartree     8
3         Carrie Dykes     7
4          Fiona Adams     6
2     Anne Krebiehl MW     2

What I like about R is that it accounts for missing data which I can see and account for, but with the way I wrote the Python code I am unable to see this.

3. 5 number summary for price of Merlot wines

With `dplyr` (in R)

dt2 %>% 
  filter(variety =="Merlot") %>% 
  select(price) %>% 
  summary()


##      price       
##  Min.   :  4.00  
##  1st Qu.: 15.00  
##  Median : 24.00  
##  Mean   : 29.54  
##  3rd Qu.: 35.00  
##  Max.   :625.00  
##  NA's   :40

With `siuba` (in Python)

(
dt >> 
  filter(_.variety =="Merlot")>>
  select(_.price)
).describe()


             price
count  3062.000000
mean     29.543436
std      33.340882
min       4.000000
25%      15.000000
50%      24.000000
75%      35.000000
max     625.000000

With dplyr I am able to pipe my filtered data directly into summary() to get a five number summary. With siuba, since the object output is still a pandas object (for good reason), I am only able to get the 5 number summary by using the .describe() method.

4. Top wine varieties with the best scores

With `dplyr` (in R)

dt2 %>% 
  distinct(variety,points) %>% 
  filter(points==max(points))


## # A tibble: 13 x 2
##    points variety                   
##     <dbl> <chr>                     
##  1    100 Muscat                    
##  2    100 Prugnolo Gentile          
##  3    100 Champagne Blend           
##  4    100 Merlot                    
##  5    100 Portuguese Red            
##  6    100 Sangiovese                
##  7    100 Cabernet Sauvignon        
##  8    100 Bordeaux-style Red Blend  
##  9    100 Chardonnay                
## 10    100 Sangiovese Grosso         
## 11    100 Syrah                     
## 12    100 Port                      
## 13    100 Bordeaux-style White Blend

With `siuba` (in Python)

import numpy as np
(
dt >> 
  distinct(_.variety,_.points) >>
  filter(_.points==np.max(_.points))
)


                         variety  points
186                       Muscat     100
1188            Prugnolo Gentile     100
2477             Champagne Blend     100
2539                      Merlot     100
2629              Portuguese Red     100
2744                  Sangiovese     100
2747          Cabernet Sauvignon     100
3047    Bordeaux-style Red Blend     100
3567                  Chardonnay     100
3840           Sangiovese Grosso     100
3875                       Syrah     100
3887                        Port     100
3977  Bordeaux-style White Blend     100

Aside for numpy being imported to get the max() function the code with siuba and dplyr is the same.

Conclusion

With everything I checked out. Aside for a hiccup with the desc() function, using the siuba module is great for making use of dplyr verbs with dataframes in Python. If properly maintained and stable, I can see tons of applications of siuba being used beyond jupyter notebooks in back end programs which deal with data.

This is only my small review, so I recommend everyone else give it a try themselves by installing it (pip install siuba) and seeing what you think! You won’t be disappointed!

To leave a comment for the author, please follow the link and comment on their blog: r – bensstats.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The data

1. The 10 most frequent countries in the data set.

With dplyr (in R)

With siuba (in Python)

2.The tasters who reviewed Cabernet Sauvignon wines

With dplyr (in R)

With siuba (in Python)

3. 5 number summary for price of Merlot wines

With dplyr (in R)

With siuba (in Python)

4. Top wine varieties with the best scores

With dplyr (in R)

With siuba (in Python)

Conclusion

With `dplyr` (in R)

With `siuba` (in Python)

With `dplyr` (in R)

With `siuba` (in Python)

With `dplyr` (in R)

With `siuba` (in Python)

With `dplyr` (in R)

With `siuba` (in Python)