Why I’m Switching to Polars

[This article was first published on R – Ari Lamstein, and kindly contributed to R-bloggers.]

I recently decided to switch from Pandas to Polars for my Python projects that use dataframes. I came to this decision while taking a workshop on Polars last week: I found its syntax to be so intuitive that I couldn’t justify continuing to try to get “better” at Pandas, despite Pandas being the more established library. The fact that Polars is faster (its main selling point) was, surprisingly, not a factor in my decision.

A similar transformation recently happened in R. For most of the history of R there was only one way to interact with dataframes: Base R. Then the Tidyverse came along, and offered both performance improvements and easier syntax. Eventually the Tidyverse became the primary way that many people interact with dataframes. I believe that the Tidyverse’s easier syntax is what led to its widespread adoption, and I think that something similar is likely to happen with Polars.

A lot of this might be explainable by Bloom’s Taxonomy. I first learned about Bloom’s Taxonomy when I took a “train the trainer” course many years ago. The Taxonomy lists the stages people go through on the path from beginner to expert. Here’s the key point: the foundation of the pyramid is “remember”. If you cannot remember how to do a basic task (say subsetting a dataframe) then you cannot apply it to your work, evaluate someone else’s code, or contribute your own extension to the language / library.

I believe that both Polars and the Tidyverse have an easier-to-remember syntax than the libraries that came before them. In the case of the Tidyverse this likely contributed to it becoming the primary dataframe library for many users. I expect something similar will happen with Polars.

Understanding these syntax differences can help us all become better programmers. While few of us work on libraries with millions of downloads, most of us do write code that others use. Figuring out what makes some APIs easier to master than others can help make our next project more successful. To help with this, below I solve the same trivial problem with both Polars and the Tidyverse, as well as the dataframe libraries that came before them (Pandas and Base R). The problem is:

  1. Read in a CSV file of US Counties to a dataframe.
  2. Subset the rows to counties named “Washington”.
  3. Subset the columns to “county.name” and “state.name”.

(This example was chosen because subsetting rows and columns is one of the most basic operations you can do, yet it still demonstrates my point. And back when I was doing a project with US County data I found it funny that so many states have a county named “Washington”.)

The code I use in this post is also available on GitHub. Feel free to use it as a starting point to explore these libraries on your own.

Polars vs. Pandas

Polars

In Polars you subset rows with the function filter and subset columns with the function select. Both are methods on the dataframe class. Python users style long method “chains” by wrapping them in (). So after reading in the data the code looks like this:

(
    df
    .filter(pl.col('county.name')=='washington')
    .select(['county.name','state.name'])
)

My first thought when reading this code was that each action you perform on the dataframe has a descriptive function name. While this might sound obvious, both Pandas and Base R frequently use operators / symbols instead of functions, and the operators can do different things depending on the input. This can make it hard to remember how to use the library.

R programmers will note that filter and select are the exact same name that dplyr uses for the same tasks. When I saw this I assumed that Ritchie Vink (the founder of Polars) simply copied what the Tidyverse did. But when I asked him on LinkedIn he said that he does not program in R, and was actually unaware of this! He humorously called it “convergent evolution”.

When I first saw this syntax I was happy because I thought it would be easy to remember: each task (subsetting rows, subsetting columns, pulling columns by name) has one function associated with it, and the function is named in a way that makes it easy to remember. As we’ll see below, this is very similar to the Tidyverse, and very different from both Pandas and Base R.

Pandas

When I attempted to write this code using Pandas I thought “Ah, I should be using .loc (as opposed to .iloc) here, right? But does that function take [] or ()? I always forget. Let me ask Copilot.”

Copilot’s response surprised me. It did not mention either .loc or .iloc. Instead it said to use plain old []:

df[df['county.name'] == 'washington']

This syntax has the advantage of being very terse. So if you know what you’re doing then it is fast to both read and write. But as a newcomer it can be confusing. One way it’s confusing is that you are using an operator / symbol instead of a clearly named function (so you have to remember what [] does). Second, [] is actually doing two different things in the same line of code. In the inner expression you give it a column name and it returns the column’s values. But in the outer expression you give it a logical Series and it returns the corresponding rows in the dataframe.
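To make the two roles of [] concrete, the one-liner can be split into its intermediate steps. This is a small sketch on made-up rows, not the original dataset; the variable names are only for illustration.

```python
import pandas as pd

# Made-up rows for illustration.
df = pd.DataFrame({
    "county.name": ["washington", "autauga", "washington"],
    "state.name": ["alabama", "alabama", "oregon"],
})

column = df["county.name"]     # string key: [] returns one column (a Series)
mask = column == "washington"  # elementwise comparison: a boolean Series
rows = df[mask]                # boolean Series key: [] returns the matching rows

print(type(column).__name__, type(mask).__name__, type(rows).__name__)
```

The same pair of brackets returns a column in one case and a row-filtered dataframe in the other, depending only on the type of the key.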

Since I wanted to see the .loc version I told Copilot “show me another way”. It returned this:

df.query('`county.name` == "washington"')

When I first learned Pandas I was excited about the “query” API because it seemed simple to use. But then an “advanced” Pandas user I looked up to told me that he never uses it, so I decided not to use it myself. Once again I told Copilot “show me another way”. It returned:

df.loc[df['county.name'] == 'washington']

Exactly what I was looking for. So my instinct to use .loc here was correct. And even though I wasn’t sure whether it takes [] or (), I at least remembered that it was something to watch out for. Adding in code to select the columns gives us this solution:

(
    df
    .loc[df['county.name'] == 'washington']
    [['county.name', 'state.name']]
)

Interestingly, the column-select code adds two more pairs of []. And again they mean different things (the inner pair means “a Python list” and the outer pair means “subset columns”).
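For completeness, .loc also accepts the row mask and the column list together in a single pair of brackets, which is yet another spelling of the same task. A minimal sketch on made-up rows (the population column is hypothetical):

```python
import pandas as pd

# Made-up rows for illustration.
df = pd.DataFrame({
    "county.name": ["washington", "autauga", "washington"],
    "state.name": ["alabama", "alabama", "oregon"],
    "population": [100, 200, 300],
})

mask = df["county.name"] == "washington"

# The chained version from the post: rows first, then columns.
chained = df.loc[mask][["county.name", "state.name"]]

# The single-call version: rows and columns in one .loc[rows, columns].
single = df.loc[mask, ["county.name", "state.name"]]

print(chained.equals(single))
```

Both forms return the same two rows and two columns here.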

For me, there are two obstacles to climbing Bloom’s Taxonomy with regard to Pandas. The first is knowing which of the many possible ways to solve a task I should use. (This reminds me of the quote “There should be one– and preferably only one –obvious way to do it.” from The Zen of Python.) The second is remembering the details of the syntax.

Tidyverse vs. Base R

Tidyverse

The tidyverse has a principle that code should be designed for humans. In practice this means creating functions that have clear names, and having each function do just one thing. It also means composing those functions with the pipe operator |> (read “then”). This means that our simple analysis can be done like this:

df |>
    filter(county.name == "washington") |>
    select(county.name, state.name)

This code is remarkably similar to the equivalent Polars code. In July I taught an “Intro R” workshop. We covered both the Tidyverse and Base R. The students were able to solve simple problems using the Tidyverse much faster than they were able to solve similar problems using Base R. I attribute this to the Tidyverse relying on functions that have explicit names that do only one thing.

Another feature of this code is that it uses Non-Standard Evaluation (NSE). In the call to filter we can refer to the contents of a column by writing the column name with no quotes (e.g. county.name). In Polars we need to write pl.col('county.name') and in Pandas we need to write something like df['county.name']. NSE is so useful, and leads to such clean code, that I am not sure why neither Pandas nor Polars has adopted it.

Base R

The “Base R” version of the above code is very different. Similar to Pandas, there are no explicit function calls. Instead you use the operators / symbols [] and $:

df[df$county.name == "washington", c("county.name", "state.name")]

As a longtime R user I find code like this to be very easy to both read and write. But the students in my workshop struggled to write it. They were comfortable writing the vectorized logical test by itself (df$county.name == "washington"). But they struggled to put that test inside the [].

Another issue this code exposes is that operators can often be overloaded, and this can further confuse newcomers. For example, in the above code df$county.name == "washington" is being used as a subscript. That’s fine, and once boolean indexing “clicks” you are good to go. But there are five kinds of subscripts in R, and newcomers need to learn all of them. This is less of an issue when using explicit functions and, indeed, the Tidyverse has a lot of explicitly named functions for niche cases (e.g. starts_with and ends_with).

Summary

Last December, when I first began learning how to work with tabular data in Python, I chose to learn Pandas because it is the most popular dataframe library in Python. I now think that it is better for newcomers to learn Polars, and am putting my energy there. My primary reason is that I find the syntax for Polars to be much easier to remember. Per Bloom’s Taxonomy, I think that a library that is easier to remember how to use will in turn enable me to make significant contributions faster. As a bonus, Polars has better performance than Pandas.

The “Polars vs. Pandas” debate reminds me of the “Tidyverse vs. Base R” debate that began about a decade ago. At that time experienced R users (such as myself) scoffed at people who thought that “learning the Tidyverse” was somehow the same thing as “learning R”. In retrospect, we were wrong. We underestimated how quickly people would adopt a simpler, more explicit dataframe API. I think that experienced Pandas users who take a similar stance against Polars today will likely be proven wrong as well, and for similar reasons. Of course, only time will tell.

The creation of Polars and the Tidyverse reminds me of something that a lead engineer at my first job told me: “The first time you build something, focus on making it work. The second time you build something, focus on making it pretty.” Both Pandas and Base R are major contributions to the world of statistical computing. They work. Very, very well. Polars and the Tidyverse, which came after, have the luxury of being able to focus on being “pretty”.

As the saying goes: “History doesn’t repeat itself, but it sure does rhyme.”

While comments on my blog are not enabled, I welcome feedback from my readers. Use this form to contact me.
