
Book Review – Pandas Workout

[This article was first published on rstats on Irregularly Scheduled Programming, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here.)

Python seems to be everywhere these days, and I’m really into learning languages, so it should come as no surprise that I’m learning a lot of Python. This post serves as a review of Pandas Workout as well as a ‘first impression’ review of the Pandas approach to data.

I am not entirely unfamiliar with Python, but I haven’t really had to do anything serious with a lot of data in the way that I do with R. {dplyr} and friends make working with rectangular data so clean and simple; why would I want anything else?

I recently needed to work with an API that would return some tabular data – it enables programmatic communication with a system containing a lot of data, and the only wrapper I could find for it was in Python (and not all that well documented). I’ve built a wrapper for a very similar system myself using {httr} in R but I didn’t necessarily want to go through all of that again. “Fine, I’ll use Python” I said, and promptly realised that I wasn’t familiar with how to rectangle the resulting data.

Around the same time as the API needed wrapping I was fortunate enough to be asked to review the book Pandas Workout written by Reuven Lerner, from Manning Publications. I really enjoy books from Manning – I published my own book Beyond Spreadsheets with R with them and I’m grateful for their DRM-free offerings across a wide range of tech topics. What a perfect opportunity! I will note that apart from receiving a digital copy of the book to review, I was not otherwise paid or compensated for this review. If I’m reviewing a book, it’s an honest review.

As I’m entirely unfamiliar with Pandas but know enough Python to keep my head above water, this seemed like a good chance to review both at the same time; how well does this book provide an introduction to Pandas for a newcomer?

The subtitle of the book is “200 Exercises to Strengthen Your Data Analysis Skills” and it delivers on the ‘exercises’ part, providing real-world data import/cleaning tasks that go well beyond an mtcars dataset.

I’m halfway through the book, and I’m actually following the exercises by typing out my own solutions in a python file and/or the REPL – “you learn with your hands, not your eyes”.

The first problem, since this is Python, is getting Pandas working in a script, which means dealing with environments, since pip now tries to prevent us from messing up our global package availability.

I figured this was a good time to try out uv as an alternative to pip and, as far as I can tell, this worked well. I didn’t expect Pandas Workout to guide me through any of the “get Python environments working” parts, as it does assume Python knowledge, but the author mentions using pip install pandas, which doesn’t really cut it (though plausibly did at the time of writing, April 2024). Apart from that, it’s just a matter of

import pandas as pd

(and all of the Python code in this post is generated from the code blocks thanks to {reticulate} and its virtualenv support).

There’s a joke to be made here about my home city Adelaide and the fact that we have once again imported some pandas.

import pandas

Chapter 1 walks through using the Series data type, which for an R user is most similar to a regular vector, except that there isn’t a strict restriction on the ‘singular’ type of the elements; if you provide a mixture of types the resulting Series will have dtype: object. This is, I suspect, a necessity, given that Python doesn’t have the ‘classed NA’ values that R uses – all of the elements of

x <- c(1L, NA, 3L)

are the same class (‘type’), integer, including the missing value

x[2]
## [1] NA
class(x[2])
## [1] "integer"

which is actually NA_integer_. pd.Series([1, pd.NA, 3]) still produces dtype: object, and can’t be converted to np.int8.
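Pandas does offer a workaround here, though it isn’t covered at this point in the book: the nullable Int64 extension dtype, which stores missing values as pd.NA while keeping an integer type. A minimal sketch:

```python
import pandas as pd

# A plain constructor call falls back to dtype: object when pd.NA is present
s_obj = pd.Series([1, pd.NA, 3])
print(s_obj.dtype)  # object

# Requesting the nullable extension dtype keeps the values as integers
s_int = pd.Series([1, pd.NA, 3], dtype="Int64")
print(s_int.dtype)            # Int64
print(s_int.isna().tolist())  # [False, True, False]
```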

The first real annoyance comes when describing how indexing works in Pandas vs regular Python; the endpoint of the s.loc[] syntax is “up to and including” whereas Python would usually use “up to and not including”. There are reasons, but things like this are good to keep in the back of one’s mind whenever someone complains that “R is confusing/inconsistent”. With that said, it’s called out in Pandas Workout with some concrete examples, so it shouldn’t be a gotcha.
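A quick sketch of the difference, for my own benefit: label-based .loc slices include the endpoint, while positional .iloc slices, like plain Python slicing, exclude it.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=list('abcde'))

# Label-based: the endpoint 'c' IS included
print(s.loc['a':'c'].tolist())    # [10, 20, 30]

# Position-based: index 2 is NOT included, just like a Python list
print(s.iloc[0:2].tolist())       # [10, 20]
print([10, 20, 30, 40, 50][0:2])  # [10, 20]
```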

Where an inconsistency isn’t so well handled is when mentioning rounding of values; Pandas Workout suggests

“Be sure to read the documentation for the round method (http://mng.bz/8rzg) to understand its arguments and how it handles numbers like 15 and 75.”

which points to the documentation for pd.Series.round, but that makes no mention of how these halfway values are handled – I dug up the answer myself: Pandas does “banker’s rounding”, taking half-values to the nearest even number. Trying this out myself also demonstrates this

pd.Series([0.2, 0.5, 1.5, 2.5, 3.5]).round()
## 0    0.0
## 1    0.0
## 2    2.0
## 3    2.0
## 4    4.0
## dtype: float64

Incidentally, this is the same for R

round(c(0.2, 0.5, 1.5, 2.5, 3.5))
## [1] 0 0 2 2 4
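If conventional round-half-up behaviour is actually wanted, a common workaround for non-negative values (my own, not from the book) is to add 0.5 and take the floor:

```python
import numpy as np
import pandas as pd

s = pd.Series([0.2, 0.5, 1.5, 2.5, 3.5])

# Banker's rounding: halves go to the nearest even number
print(s.round().tolist())          # [0.0, 0.0, 2.0, 2.0, 4.0]

# Round-half-up for non-negative values: shift then floor
print(np.floor(s + 0.5).tolist())  # [0.0, 1.0, 2.0, 3.0, 4.0]
```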

Some additional Series gotchas are detailed, including the fact that while this monstrosity works in Python

'1' + '2' + '3' + '4'
## '1234'

and this does not

sum(['1', '2', '3', '4'])
## TypeError: unsupported operand type(s) for +: 'int' and 'str'

this one does in fact work

pd.Series(['1', '2', '3', '4']).sum()
## '1234'
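For completeness (my own notes, not the book’s): the idiomatic pure-Python fix for that TypeError is str.join, and Pandas has its own vectorised spelling via .str.cat, both of which agree with the Series .sum() above.

```python
import pandas as pd

chars = ['1', '2', '3', '4']

# sum() starts from 0, so it can't concatenate strings; join can
print(''.join(chars))  # '1234'

# The Series .sum() concatenates, as does the more explicit .str.cat()
s = pd.Series(chars)
print(s.sum())      # '1234'
print(s.str.cat())  # '1234'
```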

The similarities between R’s (named) vectors and Pandas’ Series help me to grasp what I might want to do with these, but there are some distinct differences in how the ‘index’/‘name’ part is handled; in R the names can be repeated, although that makes it very difficult to extract elements based on names

x <- c(a = 1, b = 2, a = 3, d = 4)
x
## a b a d 
## 1 2 3 4
names(x)
## [1] "a" "b" "a" "d"
x["a"]
## a 
## 1

whereas in Pandas the index can be repeated and extracting with a repeated value actually returns all the relevant values

x = pd.Series([1, 2, 3, 4], index=list('abad'))
x
## a    1
## b    2
## a    3
## d    4
## dtype: int64
x.loc['a']
## a    1
## a    3
## dtype: int64

which means that x.loc[z] is essentially x[names(x) == z]

x[names(x) == "a"]
## a a 
## 1 3

As a sidenote, that “use a list as a series of characters” bit is going to take me a while to get used to, but I see the value of it. One of my big regrets about R is that strings are not vectors of characters; something I worked around by making my own “character vector” class.

What really blew my mind was the behaviour around adding two Series with overlapping indexes; the comparison is explicitly based on the index, so 'a' matches to 'a' and 'd' matches to 'd' regardless of where they appear in the Series

s1 = pd.Series([10, 20, 30, 40], index=list('abcd'))
s1
## a    10
## b    20
## c    30
## d    40
## dtype: int64
s2 = pd.Series([100, 200, 300, 400], index=list('dcba'))
s2
## d    100
## c    200
## b    300
## a    400
## dtype: int64
s1+s2
## a    410
## b    320
## c    230
## d    140
## dtype: int64

This seems like such an obvious choice – I had to double check what happens in R and a worried look started to creep across my face

s1 = c(a = 10, b = 20, c = 30, d = 40)
s1
##  a  b  c  d 
## 10 20 30 40
s2 = c(d = 100, c = 200, b = 300, a = 400)
s2
##   d   c   b   a 
## 100 200 300 400
s1 + s2
##   a   b   c   d 
## 110 220 330 440
oh no!

R just ignores the names entirely. If I wanted the same behaviour, I’d need to do the aligning myself

s1[order(names(s1))] + s2[order(names(s2))]
##   a   b   c   d 
## 410 320 230 140
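One thing the aligned addition raises (my own experiment, not from the book): labels present in only one Series produce NaN, and the Series.add method’s fill_value argument handles that.

```python
import pandas as pd

s1 = pd.Series([10, 20], index=list('ab'))
s2 = pd.Series([100], index=list('a'))

# 'b' has no partner in s2, so alignment produces NaN there
total = s1 + s2
print(total['a'])  # 110.0
print(total['b'])  # nan

# fill_value treats the missing partner as 0 instead
print(s1.add(s2, fill_value=0).tolist())  # [110.0, 20.0]
```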

Closing out the chapter is a set of exercises using some real data. There is a link to that data on the book website, but that location is not mentioned in the book itself, which isn’t ideal. There is a hyperlink in the PDF version

The data is in the file taxi-passenger-count.csv, available along with the other data files used in this course.

but that points to a repository of the exercises, not the data. I presume the print version lacks a way for the reader to find this data. These sorts of mistakes happen – I noted some of them on Manning’s discussion forum but have not yet heard back from the author about any of them.

It’s worth mentioning that while the data is from the real world and has genuine ‘problems’ that one would experience when performing an analysis, the data is a single zip file comprising a whopping 852MB download, which might be a bit much for some people.

Chapter 2 naturally moves on to DataFrames; rectangular structures constructed as a combination of multiple Series (though the only mention along those lines seems to be a throwaway comment to that effect).

The humble ‘Data Frame’ (data.frame) in R is a core data type. I’m not entirely clear on the history, but they were present in R’s predecessor language S in 1992 and likely even earlier. Nowadays, lots of Data Frame implementations exist for the purpose of rectangling – and slicing thereof – including Polars in Rust, Tidier.jl in Julia, dataframe in Haskell, data-frame in Racket, and I’m sure many others. This structure resembles a database table, and operations on these tend to mimic SQL syntax – think filter, select, join, etc.

In R, I know that data.frame is explicitly a “list of vectors all of the same length” but that seems to be glossed over here in favour of “you know about tables – this is similar”. The constructor examples show either a list of lists or a list of dicts, and that’s perhaps because a list of Series objects doesn’t do what I expect

a = pd.Series([1, 2, 3])
b = pd.Series([4, 5, 6])
pd.DataFrame([a, b])
##    0  1  2
## 0  1  2  3
## 1  4  5  6

The resulting DataFrame is created row-wise, which is enough of a headache in R, let alone the differences here depending on how the object is created.

There are apparently a handful of ways one can do this, but they’re not mentioned in the book

a = pd.Series([1, 2, 3])
b = pd.Series([4, 5, 6])

pd.DataFrame({'a':a, 'b':b})
##    a  b
## 0  1  4
## 1  2  5
## 2  3  6
pd.DataFrame(dict(a=a, b=b))
##    a  b
## 0  1  4
## 1  2  5
## 2  3  6
pd.concat([a, b], axis=1, keys=list('ab'))
##    a  b
## 0  1  4
## 1  2  5
## 2  3  6

I was also curious about what happens if they aren’t the same length

a = pd.Series([1, 2, 3])
b = pd.Series([4, 5, 6, 7])

pd.DataFrame({'a':a, 'b':b})
##      a  b
## 0  1.0  4
## 1  2.0  5
## 2  3.0  6
## 3  NaN  7

The inserted NaN is potentially surprising; R bails out of trying to construct such an object

a <- c(1, 2, 3)
b <- c(4, 5, 6, 7)

data.frame(a, b)
## Error in data.frame(a, b): arguments imply differing number of rows: 3, 4

except when it tries to be clever by recycling some values, with often surprising results…

a <- 1:3
b <- 4:9

data.frame(a, b)
##   a b
## 1 1 4
## 2 2 5
## 3 3 6
## 4 1 7
## 5 2 8
## 6 3 9

The various methods for subsetting rows and columns mostly make sense coming from R, although it will still take me some time to get used to seeing things like

df[list('bcd')]

to extract the columns labelled 'b', 'c', and 'd'. One thing I noticed was that only one axis (columns) was specified for the extraction, and while the same thing works in R, e.g.

m <- head(mtcars)
m[c("cyl", "mpg", "wt")]
##                   cyl  mpg    wt
## Mazda RX4           6 21.0 2.620
## Mazda RX4 Wag       6 21.0 2.875
## Datsun 710          4 22.8 2.320
## Hornet 4 Drive      6 21.4 3.215
## Hornet Sportabout   8 18.7 3.440
## Valiant             6 18.1 3.460

the absence of a note about it was probably an oversight. In R, if the vector specifying the selection contains any missing values (NA) then an error is triggered

m[c("cyl", NA, "wt")]
## Error in `[.data.frame`(m, c("cyl", NA, "wt")): undefined columns selected

This is confusing, for sure. R is fine with us selecting missing rows

m[c(1, NA, 3), ]
##             mpg cyl disp  hp drat   wt  qsec vs am gear carb
## Mazda RX4  21.0   6  160 110 3.90 2.62 16.46  0  1    4    4
## NA           NA  NA   NA  NA   NA   NA    NA NA NA   NA   NA
## Datsun 710 22.8   4  108  93 3.85 2.32 18.61  1  1    4    1

but not missing columns

m[1:3, c(1, NA, 3)]
## Error in `[.data.frame`(m, 1:3, c(1, NA, 3)): undefined columns selected

So, what about Pandas? The fact that x[cols] works to select columns, but selecting rows requires x.loc[rows, cols] is a bit ugly, in my opinion, and there are parallels here with the mess of R’s drop=TRUE argument which results in a vector rather than a data.frame when only a single dimension is selected.

x = pd.DataFrame({'a':a, 'b':b})
x
##      a  b
## 0  1.0  4
## 1  2.0  5
## 2  3.0  6
## 3  NaN  7
x['a']
## 0    1.0
## 1    2.0
## 2    3.0
## 3    NaN
## Name: a, dtype: float64
x.loc[:, 'a']
## 0    1.0
## 1    2.0
## 2    3.0
## 3    NaN
## Name: a, dtype: float64
x.loc[1:2, ['a', 'b']]
##      a  b
## 1  2.0  5
## 2  3.0  6
x.loc[1:2, ['a', pd.NA]]
## KeyError: '[<NA>] not in index'

One powerful but dangerous feature of {dplyr} (and some parts of base R) is Non-Standard Evaluation (NSE) which enables a ‘shortcut’ in writing out expressions for filtering or selecting data in a data.frame. Essentially, the user can use column names as variables and they are translated as such within the function, so rather than writing out

filter(mydataframe, mydataframe$a > 300 & mydataframe$w %% 2 == 1)

one can use the column names ‘a’ and ‘w’ as if they were defined as variables

filter(mydataframe, a > 300 & w %% 2 == 1)

This is handy, but of course has some sharp edges. I have mixed feelings about finding the equivalent in Pandas in the form of df.query. This takes an SQL-like statement and similarly treats columns as variables, but in this case the entire thing is provided as a string

df.query('a > 300 & w % 2 == 1')

I imagine this is incompatible with any language-server features, though the metaprogramming-enjoying part of me does wonder if it makes building up these expressions programmatically a little easier.
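On the programmatic front, query does have one nicety I later found (not from the book): local Python variables can be referenced inside the query string with an @ prefix.

```python
import pandas as pd

df = pd.DataFrame({'a': [100, 250, 400], 'w': [1, 2, 3]})
threshold = 300

# '@threshold' resolves against the local Python scope
print(df.query('a > @threshold')['a'].tolist())  # [400]
```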

Another gotcha mentioned in this section is the fact that Pandas makes internal copies of data, and produces an interesting warning

df = pd.DataFrame(
    {'a': [10, 50, 90],
     'b': [20, 60, 100],
     'c': [30, 70, 110],
     'd': [40, 80, 120]},
     index = list('xyz'))
df[df['b'] > 30]['b'] = 0
df
## <string>:1: SettingWithCopyWarning: 
## A value is trying to be set on a copy of a slice from a DataFrame.
## Try using .loc[row_indexer,col_indexer] = value instead
## 
## See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

(with the value not being set to 0).
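Following the warning’s advice, a single .loc with both a row mask and a column label does perform the assignment; this is my own follow-up rather than the book’s code.

```python
import pandas as pd

df = pd.DataFrame(
    {'a': [10, 50, 90],
     'b': [20, 60, 100],
     'c': [30, 70, 110],
     'd': [40, 80, 120]},
    index=list('xyz'))

# One indexing operation: rows where b > 30, column 'b'
df.loc[df['b'] > 30, 'b'] = 0
print(df['b'].tolist())  # [20, 0, 0]
```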

This is reminiscent of SQL ‘views’ into data – and not being able to modify those – and that seems like a useful feature, but it again makes me wonder what happens in R…

df <- data.frame(a = c(10, 50, 90),
                 b = c(20, 60, 100),
                 c = c(30, 70, 110), 
                 d = c(40, 80, 120), 
                 row.names = c("x", "y", "z"))
dfo <- df
df
##    a   b   c   d
## x 10  20  30  40
## y 50  60  70  80
## z 90 100 110 120

The following updates all of the b column. As we saw above, specifying a single selector restricts the columns, so since df$b > 30 is c(FALSE, TRUE, TRUE), the intermediate subset is just the second and third columns (“b” and “c”), after which we select the “b” column and assign the value of 0 to all of its elements. No actual filtering of rows has occurred here, and the entire column in the full data is updated

df[df$b > 30]["b"] <- 0 
df
##    a b   c   d
## x 10 0  30  40
## y 50 0  70  80
## z 90 0 110 120

Equivalently, [][["b"]] or []$b.

In order to select rows for which b > 30 this needs to be df[df$b > 30, ] (with a comma), and this updates just the relevant slice (again in the full data)

df <- dfo
df[df$b > 30, ]$b <- 0 
df
##    a  b   c   d
## x 10 20  30  40
## y 50  0  70  80
## z 90  0 110 120

and this works just fine in R, despite the fact that we have explicitly subset the data prior to selecting a column, and assigned a value to that subset.

Maybe that warning isn’t a bad thing at all – such subsets are done regularly in R, and there are no memory issues with doing so (it’s well-defined in terms of [<-) but I can see why it’s confusing.

Chapter 3 details importing and exporting data, and it’s here that I start to see how much has been rolled into Pandas – things that are spread across several R packages.

I’m familiar with scraping HTML tables from websites in R – there are several ways to read from a URL, and several ways to convert the resulting data into a rectangle such as a data.frame – but Pandas surprises me with pd.read_html(url) which returns a list of DataFrames, one for each table on a webpage. Trying this out, it works quite nicely

url = 'https://en.wikipedia.org/wiki/R_(programming_language)#Version_names'
tabs = pd.read_html(url)[1]
x=tabs.loc[:,['Version', 'Name', 'Release date']]
x
##     Version                      Name Release date
## 0     4.4.2            Pile of Leaves   2024-10-31
## 1     4.4.1        Race for Your Life   2024-06-14
## 2     4.4.0                 Puppy Cup   2024-04-24
## 3     4.3.3           Angel Food Cake   2024-02-29
## 4     4.3.2                 Eye Holes   2023-10-31
## 5     4.3.1             Beagle Scouts   2023-06-16
## 6     4.3.0          Already Tomorrow   2023-04-21
## 7     4.2.3          Shortstop Beagle   2023-03-15
## 8     4.2.2     Innocent and Trusting   2022-10-31
## 9     4.2.1         Funny-Looking Kid   2022-06-23
## 10    4.2.0     Vigorous Calisthenics   2022-04-22
## 11    4.1.3               One Push-Up   2022-03-10
## 12    4.1.2               Bird Hippie   2021-11-01
## 13    4.1.1               Kick Things   2021-08-10
## 14    4.1.0           Camp Pontanezen   2021-05-18
## 15    4.0.5           Shake and Throw   2021-03-31
## 16    4.0.4         Lost Library Book   2021-02-15
## 17    4.0.3   Bunny-Wunnies Freak Out   2020-10-10
## 18    4.0.2          Taking Off Again   2020-06-22
## 19    4.0.1            See Things Now   2020-06-06
## 20    4.0.0                 Arbor Day   2020-04-24
## 21    3.6.3      Holding the Windsock   2020-02-29
## 22    3.6.2     Dark and Stormy Night   2019-12-12
## 23    3.6.1        Action of the Toes   2019-07-05
## 24    3.6.0        Planting of a Tree   2019-04-26
## 25    3.5.3               Great Truth   2019-03-11
## 26    3.5.2           Eggshell Igloos   2018-12-20
## 27    3.5.1             Feather Spray   2018-07-02
## 28    3.5.0            Joy in Playing   2018-04-23
## 29    3.4.4        Someone to Lean On   2018-03-15
## 30    3.4.3          Kite-Eating Tree   2017-11-30
## 31    3.4.2              Short Summer   2017-09-28
## 32    3.4.1             Single Candle   2017-06-30
## 33    3.4.0       You Stupid Darkness   2017-04-21
## 34    3.3.3             Another Canoe   2017-03-06
## 35    3.3.2     Sincere Pumpkin Patch   2016-10-31
## 36    3.3.1          Bug in Your Hair   2016-06-21
## 37    3.3.0    Supposedly Educational   2016-05-03
## 38    3.2.5  Very, Very Secure Dishes   2016-04-11
## 39    3.2.4        Very Secure Dishes   2016-03-11
## 40    3.2.3     Wooden Christmas-Tree   2015-12-10
## 41    3.2.2               Fire Safety   2015-08-14
## 42    3.2.1    World-Famous Astronaut   2015-06-18
## 43    3.2.0       Full of Ingredients   2015-04-16
## 44    3.1.3           Smooth Sidewalk   2015-03-09
## 45    3.1.2            Pumpkin Helmet   2014-10-31
## 46    3.1.1             Sock it to Me   2014-07-10
## 47    3.1.0              Spring Dance   2014-04-10
## 48    3.0.3                Warm Puppy   2014-03-06
## 49    3.0.2           Frisbee Sailing   2013-09-25
## 50    3.0.1                Good Sport   2013-05-16
## 51    3.0.0             Masked Marvel   2013-04-03
## 52   2.15.3          Security Blanket   2013-03-01
## 53   2.15.2            Trick or Treat   2012-10-26
## 54   2.15.1      Roasted Marshmallows   2012-06-22
## 55   2.15.0             Easter Beagle   2012-03-30
## 56   2.14.2       Gift-Getting Season   2012-02-29
## 57   2.14.1       December Snowflakes   2011-12-22
## 58   2.14.0             Great Pumpkin   2011-10-31
## 59  r-devel   Unsuffered Consequences          NaN

I’m all too familiar with issues of nested tables and data that doesn’t rectangle so easily, but for this simple example it worked well, I think.

Chapter 4 covers indexes and again the idea of repeated values comes into play. The long-standing issue that the {tibble} team have with rownames comes to mind – I like being able to name both axes of the data and hate that {tibble} is opposed to them – so it’s extra surprising to have a Data Frame where the ‘row’ labels can repeat. What’s more, the index can be hierarchical as a multi-index. With the example from above, splitting the version into a multi-index is interesting

v=x.Version.str.split('.', expand=True)
v.columns=['major', 'minor', 'patch']
xv=pd.concat([x, v], axis=1)
xv=xv.set_index(['major','minor','patch'])
xv.head(25)
##                   Version                     Name Release date
## major minor patch                                              
## 4     4     2       4.4.2           Pile of Leaves   2024-10-31
##             1       4.4.1       Race for Your Life   2024-06-14
##             0       4.4.0                Puppy Cup   2024-04-24
##       3     3       4.3.3          Angel Food Cake   2024-02-29
##             2       4.3.2                Eye Holes   2023-10-31
##             1       4.3.1            Beagle Scouts   2023-06-16
##             0       4.3.0         Already Tomorrow   2023-04-21
##       2     3       4.2.3         Shortstop Beagle   2023-03-15
##             2       4.2.2    Innocent and Trusting   2022-10-31
##             1       4.2.1        Funny-Looking Kid   2022-06-23
##             0       4.2.0    Vigorous Calisthenics   2022-04-22
##       1     3       4.1.3              One Push-Up   2022-03-10
##             2       4.1.2              Bird Hippie   2021-11-01
##             1       4.1.1              Kick Things   2021-08-10
##             0       4.1.0          Camp Pontanezen   2021-05-18
##       0     5       4.0.5          Shake and Throw   2021-03-31
##             4       4.0.4        Lost Library Book   2021-02-15
##             3       4.0.3  Bunny-Wunnies Freak Out   2020-10-10
##             2       4.0.2         Taking Off Again   2020-06-22
##             1       4.0.1           See Things Now   2020-06-06
##             0       4.0.0                Arbor Day   2020-04-24
## 3     6     3       3.6.3     Holding the Windsock   2020-02-29
##             2       3.6.2    Dark and Stormy Night   2019-12-12
##             1       3.6.1       Action of the Toes   2019-07-05
##             0       3.6.0       Planting of a Tree   2019-04-26

This does mean that I can extract all of the 4.3.x series releases

xv.loc[('4', '3')]
## <string>:1: PerformanceWarning: indexing past lexsort depth may impact performance.
##       Version              Name Release date
## patch                                       
## 3       4.3.3   Angel Food Cake   2024-02-29
## 2       4.3.2         Eye Holes   2023-10-31
## 1       4.3.1     Beagle Scouts   2023-06-16
## 0       4.3.0  Already Tomorrow   2023-04-21
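That PerformanceWarning is about the index not being sorted (lexsorted); sorting it first with sort_index makes partial .loc lookups clean. A small sketch with a toy version of the same shape:

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('4', '4', '2'), ('4', '3', '3'), ('4', '3', '1'), ('3', '6', '0')],
    names=['major', 'minor', 'patch'])
df = pd.DataFrame({'Name': ['w', 'x', 'y', 'z']}, index=idx)

# Sorting the MultiIndex enables efficient partial-key .loc lookups
sorted_df = df.sort_index()
print(sorted_df.index.is_monotonic_increasing)  # True
print(len(sorted_df.loc[('4', '3')]))           # 2
```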

Getting the x.x.0 releases is a little messier, requiring slice(None)

xv.loc[(slice(None), slice(None), '0')] 
##             Version                    Name Release date
## major minor                                             
## 4     4       4.4.0               Puppy Cup   2024-04-24
##       3       4.3.0        Already Tomorrow   2023-04-21
##       2       4.2.0   Vigorous Calisthenics   2022-04-22
##       1       4.1.0         Camp Pontanezen   2021-05-18
##       0       4.0.0               Arbor Day   2020-04-24
## 3     6       3.6.0      Planting of a Tree   2019-04-26
##       5       3.5.0          Joy in Playing   2018-04-23
##       4       3.4.0     You Stupid Darkness   2017-04-21
##       3       3.3.0  Supposedly Educational   2016-05-03
##       2       3.2.0     Full of Ingredients   2015-04-16
##       1       3.1.0            Spring Dance   2014-04-10
##       0       3.0.0           Masked Marvel   2013-04-03
## 2     15     2.15.0           Easter Beagle   2012-03-30
##       14     2.14.0           Great Pumpkin   2011-10-31
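It turns out pd.IndexSlice is shorthand for exactly those slice(None) tuples, which reads a little better (again, my own digging rather than the book’s):

```python
import pandas as pd

# IndexSlice just builds the tuple of slices for you
print(pd.IndexSlice[:, :, '0'] == (slice(None), slice(None), '0'))  # True

idx = pd.MultiIndex.from_tuples(
    [('4', '4', '0'), ('4', '4', '2'), ('4', '3', '0'), ('3', '6', '3')],
    names=['major', 'minor', 'patch'])
df = pd.DataFrame({'Name': ['a', 'b', 'c', 'd']}, index=idx).sort_index()

# Select every x.x.0 release without spelling out slice(None)
zeros = df.loc[pd.IndexSlice[:, :, '0'], :]
print(len(zeros))  # 2
```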

While this was interesting to achieve, a filter would probably be better

xv=pd.concat([x, v], axis=1)
xv.query('patch == "0"')
##    Version                    Name Release date major minor patch
## 2    4.4.0               Puppy Cup   2024-04-24     4     4     0
## 6    4.3.0        Already Tomorrow   2023-04-21     4     3     0
## 10   4.2.0   Vigorous Calisthenics   2022-04-22     4     2     0
## 14   4.1.0         Camp Pontanezen   2021-05-18     4     1     0
## 20   4.0.0               Arbor Day   2020-04-24     4     0     0
## 24   3.6.0      Planting of a Tree   2019-04-26     3     6     0
## 28   3.5.0          Joy in Playing   2018-04-23     3     5     0
## 33   3.4.0     You Stupid Darkness   2017-04-21     3     4     0
## 37   3.3.0  Supposedly Educational   2016-05-03     3     3     0
## 43   3.2.0     Full of Ingredients   2015-04-16     3     2     0
## 47   3.1.0            Spring Dance   2014-04-10     3     1     0
## 51   3.0.0           Masked Marvel   2013-04-03     3     0     0
## 55  2.15.0           Easter Beagle   2012-03-30     2    15     0
## 58  2.14.0           Great Pumpkin   2011-10-31     2    14     0

I believe that is the {tibble} team’s argument – that a filter is more suitable, but I stand by the fact that selecting rows with names which don’t need to be in the data itself is a useful approach.

This chapter also introduces pivot tables in the form of

df.pivot_table(index, columns, values, aggfunc)

the equivalent of which in {dplyr} is dplyr::summarise() or dplyr::count(). Way back when I started learning R I recall finding it odd that the (often useful) name “pivot table” rarely came up in R documentation, which is a bit of a shame. I used them all the time in Excel.

xv.pivot_table(index='major', 
               columns='minor', 
               values='patch', 
               aggfunc='count', 
               fill_value=0)
## minor  0  1  14  15  2  3  4  5  6
## major                             
## 2      0  0   3   4  0  0  0  0  0
## 3      4  4   0   0  6  4  5  4  4
## 4      6  4   0   0  4  4  3  0  0

My major complaint here is that the function is passed as a string – Y U NO use first class functions???
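In fairness, aggfunc does accept real functions too; the strings appear to be convenient aliases for common cases. A quick check of my own (not from the book):

```python
import pandas as pd

df = pd.DataFrame({'g': ['a', 'a', 'b'], 'v': [1, 2, 3]})

# A first-class callable works where the string 'count' does
by_string = df.pivot_table(index='g', values='v', aggfunc='count')
by_callable = df.pivot_table(index='g', values='v', aggfunc=len)
print(by_string['v'].tolist())    # [2, 1]
print(by_callable['v'].tolist())  # [2, 1]
```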

My notes for Chapter 5 (Cleaning Data) only involve the fact that pd.isna and pd.isnull are apparently the exact same thing, which is vastly different from R.
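A minimal demonstration that the two really do behave identically (unlike R’s is.na and is.null, which answer entirely different questions):

```python
import numpy as np
import pandas as pd

vals = [1.0, np.nan, None, pd.NA]

# Both functions flag the same elements as missing
print(pd.isna(pd.Series(vals)).tolist())    # [False, True, True, True]
print(pd.isnull(pd.Series(vals)).tolist())  # [False, True, True, True]
```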

For Chapter 6 (Grouping, joining, and sorting) I have a rekindled annoyance at the mixing of methods and attributes, driven by the fact that df.transpose() (a method) has an alias df.T (an attribute). I have enough of a hard time trying to remember whether it’s x.len or x.len() or length(x) or whatever.


I will continue to make my way through the rest of the book, but figured this was a good point to take stock of how I feel about it and how much I’ve learned. I’m already able to work with the API I was trying to wrap, and can rectangle the results into something useful, including being able to save those as a CSV or even Excel sheet(s).

Pandas Workout is, in my opinion, a good resource for learning Pandas, provided you already know your way around Python and perhaps data analysis in another language as well. Minor issues of typos and omissions happen in any book, and I wouldn’t say any of them are dealbreakers. The book covers a lot of the questions I had as I was working through problems – one of the hardest things when trying to follow along is having a ‘simple’ question that’s hard to answer yourself. I’ve abandoned learning other languages (or at least using specific resources) when I’ve hit that point.

Pandas itself seems like it makes some good choices in terms of Series and DataFrame structures, and when I’m using Python I’ll be sure to load this library and make use of them.

Comments, improvements, or sharing your experiences are most welcome. I can be found on Mastodon, or use the comments below.


<details><summary>devtools::session_info()</summary>
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.4.1 (2024-06-14)
##  os       macOS Sonoma 14.6
##  system   aarch64, darwin20
##  ui       X11
##  language (EN)
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##  tz       Australia/Adelaide
##  date     2025-02-13
##  pandoc   3.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version    date (UTC) lib source
##  blogdown      1.19       2024-02-01 [1] CRAN (R 4.4.0)
##  bookdown      0.41       2024-10-16 [1] CRAN (R 4.4.1)
##  bslib         0.8.0      2024-07-29 [1] CRAN (R 4.4.0)
##  cachem        1.1.0      2024-05-16 [1] CRAN (R 4.4.0)
##  cli           3.6.3      2024-06-21 [1] CRAN (R 4.4.0)
##  devtools      2.4.5      2022-10-11 [1] CRAN (R 4.4.0)
##  digest        0.6.37     2024-08-19 [1] CRAN (R 4.4.1)
##  ellipsis      0.3.2      2021-04-29 [1] CRAN (R 4.4.0)
##  evaluate      1.0.1      2024-10-10 [1] CRAN (R 4.4.1)
##  fansi         1.0.6      2023-12-08 [1] CRAN (R 4.4.0)
##  fastmap       1.2.0      2024-05-15 [1] CRAN (R 4.4.0)
##  fs            1.6.5      2024-10-30 [1] CRAN (R 4.4.1)
##  glue          1.8.0      2024-09-30 [1] CRAN (R 4.4.1)
##  htmltools     0.5.8.1    2024-04-04 [1] CRAN (R 4.4.0)
##  htmlwidgets   1.6.4      2023-12-06 [1] CRAN (R 4.4.0)
##  httpuv        1.6.15     2024-03-26 [1] CRAN (R 4.4.0)
##  jquerylib     0.1.4      2021-04-26 [1] CRAN (R 4.4.0)
##  jsonlite      1.8.9      2024-09-20 [1] CRAN (R 4.4.1)
##  knitr         1.48       2024-07-07 [1] CRAN (R 4.4.0)
##  later         1.4.1      2024-11-27 [1] CRAN (R 4.4.1)
##  lattice       0.22-6     2024-03-20 [1] CRAN (R 4.4.1)
##  lifecycle     1.0.4      2023-11-07 [1] CRAN (R 4.4.0)
##  magrittr      2.0.3      2022-03-30 [1] CRAN (R 4.4.0)
##  Matrix        1.7-1      2024-10-18 [1] CRAN (R 4.4.1)
##  memoise       2.0.1      2021-11-26 [1] CRAN (R 4.4.0)
##  mime          0.12       2021-09-28 [1] CRAN (R 4.4.0)
##  miniUI        0.1.1.1    2018-05-18 [1] CRAN (R 4.4.0)
##  pillar        1.9.0      2023-03-22 [1] CRAN (R 4.4.0)
##  pkgbuild      1.4.5      2024-10-28 [1] CRAN (R 4.4.1)
##  pkgload       1.4.0      2024-06-28 [1] CRAN (R 4.4.0)
##  png           0.1-8      2022-11-29 [1] CRAN (R 4.4.0)
##  profvis       0.4.0      2024-09-20 [1] CRAN (R 4.4.1)
##  promises      1.3.2      2024-11-28 [1] CRAN (R 4.4.1)
##  purrr         1.0.2      2023-08-10 [1] CRAN (R 4.4.0)
##  R6            2.5.1      2021-08-19 [1] CRAN (R 4.4.0)
##  Rcpp          1.0.13-1   2024-11-02 [1] CRAN (R 4.4.1)
##  remotes       2.5.0.9000 2024-11-03 [1] Github (r-lib/remotes@5b7eb08)
##  reticulate    1.40.0     2024-11-15 [1] CRAN (R 4.4.1)
##  rlang         1.1.4      2024-06-04 [1] CRAN (R 4.4.0)
##  rmarkdown     2.28       2024-08-17 [1] CRAN (R 4.4.0)
##  rstudioapi    0.17.1     2024-10-22 [1] CRAN (R 4.4.1)
##  sass          0.4.9      2024-03-15 [1] CRAN (R 4.4.0)
##  sessioninfo   1.2.2      2021-12-06 [1] CRAN (R 4.4.0)
##  shiny         1.9.1      2024-08-01 [1] CRAN (R 4.4.0)
##  urlchecker    1.0.1      2021-11-30 [1] CRAN (R 4.4.0)
##  usethis       3.0.0      2024-07-29 [1] CRAN (R 4.4.0)
##  utf8          1.2.4      2023-10-22 [1] CRAN (R 4.4.0)
##  vctrs         0.6.5      2023-12-01 [1] CRAN (R 4.4.0)
##  xfun          0.50.5     2025-01-23 [1] Github (yihui/xfun@116d689)
##  xtable        1.8-4      2019-04-21 [1] CRAN (R 4.4.0)
##  yaml          2.3.10     2024-07-26 [1] CRAN (R 4.4.0)
## 
##  [1] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
## 
## ─ Python configuration ───────────────────────────────────────────────────────
##  python:         /Users/jono/.virtualenvs/r-reticulate/bin/python
##  libpython:      /Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/config-3.9-darwin/libpython3.9.dylib
##  pythonhome:     /Users/jono/.virtualenvs/r-reticulate:/Users/jono/.virtualenvs/r-reticulate
##  version:        3.9.6 (default, Oct  4 2024, 08:01:31)  [Clang 16.0.0 (clang-1600.0.26.4)]
##  numpy:          /Users/jono/.virtualenvs/r-reticulate/lib/python3.9/site-packages/numpy
##  numpy_version:  2.0.2
##  
##  NOTE: Python version was forced by use_python() function
## 
## ──────────────────────────────────────────────────────────────────────────────

