PythonMusings #6: dplyr in Python? First impressions of the siuba (小巴) module
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Whats great about Blogging and social media is that you get to learn so much if you use it right. After sharing my last blog on LinkedIn, Casper Crause told me about the siuba
module created by Michael Chow from which allows Python users experience to using R’s powerful dplyr
package (developed by Hadley Wickham) for data wrangling. What impressed me the most was seeing how similar the code was to dplyr
syntax and functionality. There is even a pipe operator that you can use with >>
used in lieu of the magrittr
pipe %>%
.
In this blog post, I am going to give the siuba
module a spin by looking at the wine ratings dataset from the TidyTuesday project. While I hope to post an entire analysis in Python in the future, my focus is primarily on the features that the siuba
has to offer. The questions, however trivial, are aimed at trying to highlight the features that are available.
The data
To stay in the spirit of doing things entirely in Python, I’m going to load the raw data from the github with pandas’ read.csv()
function.
import pandas as pd dt = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-05-28/winemag-data-130k-v2.csv")
Now that the data is loaded, lets begin with playing with the data! The following questions we’re going to ask are the following:
- What are the 10 most frequent countries in the data set?
- Who were the tasters who reviewed Cabernet Sauvignon wines?
- What was is the 5 number summary for cost of Merlot wines?
- What were the top wine varieties with the best scores?
While these questions are basic, it will give us the opportunity to show what the siuba
module can do. I’m sure there are more sophisticated questions that we can ask, but this should be good for now to show off what can be done with siuba
.
So without further ado, lets get into it!
1. The 10 most frequent countries in the data set.
This seems like something pretty standard that can be done in R with dplyr
. The following is a comparison between dplyr
in R and siuba
in Python:
With dplyr
(in R)
dt2 %>% select(country) %>% group_by(country) %>% summarize(n=n()) %>% arrange(desc(n)) %>% top_n(10) ## Error: `n()` must only be used inside dplyr verbs.
With siuba
(in Python)
from siuba import * from siuba.dply.vector import * ( dt >> select(_.country)>> group_by(_.country)>> summarize(n= n( _.country)) >> # using desc arrange(desc(_.n)) >> top_n(10) ) country n 40 US 54504 37 Spain 6645 31 Portugal 5691 22 Italy 19540 17 Germany 2165 15 France 22093 8 Chile 4472 3 Austria 3345 2 Australia 2329 0 Argentina 3800
There are a few differences that are immediately noticeable. With siuba
we have to import all the the functions available and also import the siuba.dply.vector
to get the relevant dplyr functions. Additionally, the desc(_.n)
argument in the arrange()
doesn’t give the top 10 most frequented countries in descending order.
After a bit of trial and error and some Googling, to remedy the siuba
code, instead of writing arrange(desc(_.n))
, we can write arrange(-_.n)
.
( dt >> select(_.country)>> group_by(_.country)>> summarize(n= n( _.country)) >> # desc doesn't work arrange(-_.n) >> top_n(10) ) country n 40 US 54504 15 France 22093 22 Italy 19540 37 Spain 6645 31 Portugal 5691 8 Chile 4472 0 Argentina 3800 3 Austria 3345 2 Australia 2329 17 Germany 2165
Much better!
2.The tasters who reviewed Cabernet Sauvignon wines
The difference between siuba
and dplyr
here seems to be more rooted in the difference between R and Python (but I could be wrong). R accounts for NA
values while Python does not.
With dplyr
(in R)
dt2 %>% filter(variety =="Cabernet Sauvignon") %>% select(taster_name) %>% group_by(taster_name) %>% summarize(n=n()) %>% arrange(desc(n)) ## Error: `n()` must only be used inside dplyr verbs.
With siuba
(in Python)
( dt >> filter(_.variety=="Cabernet Sauvignon") >> select(_.taster_name)>> group_by(_.taster_name)>> summarize(n= n( _.taster_name)) >> # using desc arrange(-_.n) ) taster_name n 17 Virginie Boone 1829 11 Michael Schachner 1358 15 Sean P. Sullivan 785 13 Paul Gregutt 677 10 Matt Kettmann 454 6 Jim Gordon 399 7 Joe Czerwinski 263 9 Lauren Buzzeo 140 1 Anna Lee C. Iijima 94 16 Susan Kostrzewa 67 12 Mike DeSimone 66 14 Roger Voss 49 5 Jeff Jenssen 35 8 Kerin O'Keefe 32 0 Alexander Peartree 8 3 Carrie Dykes 7 4 Fiona Adams 6 2 Anne Krebiehl MW 2
What I like about R is that it accounts for missing data which I can see and account for, but with the way I wrote the Python code I am unable to see this.
3. 5 number summary for price of Merlot wines
With dplyr
(in R)
dt2 %>% filter(variety =="Merlot") %>% select(price) %>% summary() ## price ## Min. : 4.00 ## 1st Qu.: 15.00 ## Median : 24.00 ## Mean : 29.54 ## 3rd Qu.: 35.00 ## Max. :625.00 ## NA's :40
With siuba
(in Python)
( dt >> filter(_.variety =="Merlot")>> select(_.price) ).describe() price count 3062.000000 mean 29.543436 std 33.340882 min 4.000000 25% 15.000000 50% 24.000000 75% 35.000000 max 625.000000
With dplyr
I am able to pipe my filtered data directly into summary()
to get a five number summary. With siuba
, since the object output is still a pandas
object (for good reason), I am only able to get the 5 number summary by using the .describe()
method.
4. Top wine varieties with the best scores
With dplyr
(in R)
dt2 %>% distinct(variety,points) %>% filter(points==max(points)) ## # A tibble: 13 x 2 ## points variety ## <dbl> <chr> ## 1 100 Muscat ## 2 100 Prugnolo Gentile ## 3 100 Champagne Blend ## 4 100 Merlot ## 5 100 Portuguese Red ## 6 100 Sangiovese ## 7 100 Cabernet Sauvignon ## 8 100 Bordeaux-style Red Blend ## 9 100 Chardonnay ## 10 100 Sangiovese Grosso ## 11 100 Syrah ## 12 100 Port ## 13 100 Bordeaux-style White Blend
With siuba
(in Python)
import numpy as np ( dt >> distinct(_.variety,_.points) >> filter(_.points==np.max(_.points)) ) variety points 186 Muscat 100 1188 Prugnolo Gentile 100 2477 Champagne Blend 100 2539 Merlot 100 2629 Portuguese Red 100 2744 Sangiovese 100 2747 Cabernet Sauvignon 100 3047 Bordeaux-style Red Blend 100 3567 Chardonnay 100 3840 Sangiovese Grosso 100 3875 Syrah 100 3887 Port 100 3977 Bordeaux-style White Blend 100
Aside for numpy
being imported to get the max()
function the code with siuba
and dplyr
is the same.
Conclusion
With everything I checked out. Aside for a hiccup with the desc()
function, using the siuba
module is great for making use of dplyr
verbs with dataframes in Python. If properly maintained and stable, I can see tons of applications of siuba
being used beyond jupyter notebooks in back end programs which deal with data.
This is only my small review, so I recommend everyone else give it a try themselves by installing it (pip install siuba
) and seeing what you think! You won’t be disappointed!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.