Python seems to be everywhere these days, and I’m really into learning languages, so it should come as no surprise that I’m learning a lot of Python. This post serves as a review of Pandas Workout as well as a ‘first impression’ review of the Pandas approach to data.
I am not entirely unfamiliar with Python, but I haven’t really had to do anything serious with a lot of data in the way that I do with R. {dplyr} and friends make working with rectangular data so clean and simple; why would I want anything else?
I recently needed to work with an API that would return some tabular data – it enables programmatic communication with a system containing a lot of data, and the only wrapper I could find for it was in Python (and not all that well documented). I’ve built a wrapper for a very similar system myself using {httr} in R but I didn’t necessarily want to go through all of that again. “Fine, I’ll use Python” I said, and promptly realised that I wasn’t familiar with how to rectangle the resulting data.
Around the same time as the API needed wrapping I was fortunate enough to be asked to review the book Pandas Workout written by Reuven Lerner, from Manning Publications. I really enjoy books from Manning – I published my own book Beyond Spreadsheets with R with them and I’m grateful for their DRM-free offerings across a wide range of tech topics. What a perfect opportunity! I will note that apart from receiving a digital copy of the book to review, I was not otherwise paid or compensated for this review. If I’m reviewing a book, it’s an honest review.
As I’m entirely unfamiliar with Pandas but know enough Python to keep my head above water, this seemed like a good chance to review both at the same time; how well does this book provide an introduction to Pandas for a newcomer?
The subtitle of the book is “200 Exercises to Strengthen Your Data Analysis Skills” and it delivers on the ‘exercises’ part, providing real-world data import/cleaning tasks that go well beyond the mtcars dataset.
I’m halfway through the book, and I’m actually following the exercises by typing out my own solutions in a Python file and/or the REPL – “you learn with your hands, not your eyes”.
The first problem, since this is Python, is getting Pandas to work with a script, which means dealing with environments, since pip now tries to prevent us from messing up our global package availability. I figured this was a good time to try out uv as an alternative to pip and, as far as I can tell, this worked well. I don’t expect Pandas Workout to have guided me through the “get Python environments working” parts, as it does assume Python knowledge, but the author mentions using pip install pandas, which doesn’t really cut it (though plausibly did at the time of writing, April 2024).

Apart from that, it’s just a matter of

import pandas as pd

(and all of the Python code in this post is generated from the code blocks thanks to {reticulate} and its virtualenv support).
There’s a joke to be made here about my home city Adelaide and the fact that we have once again imported some pandas.
Chapter 1 walks through using the Series data type, which for an R user is most similar to a regular vector, except that there isn’t a strict restriction on the ‘singular’ type of the elements; if you provide a mixture of types the resulting Series will have dtype: object. This is, I suspect, a necessity, given that Python doesn’t have the ‘classed NA’ values that R uses – all of the elements of x <- c(1L, NA, 3L) are the same class (‘type’), integer, including the missing value

x[2]
## [1] NA
class(x[2])
## [1] "integer"

which is actually NA_integer_. pd.Series([1, pd.NA, 3]) still produces dtype: object, and can’t be converted to np.int8.
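For what it’s worth, Pandas does have an opt-in answer to this in its nullable extension dtypes – the capital-I 'Int64' rather than numpy’s int64 – which keep the values integral while still permitting a missing entry. A quick check of my own (not from the book):

# pandas' nullable integer dtype accepts pd.NA without falling back to object
pd.Series([1, pd.NA, 3], dtype='Int64')
## 0       1
## 1    <NA>
## 2       3
## dtype: Int64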
The first real annoyance comes when describing how indexing works in Pandas vs regular Python; the endpoint of the s.loc[] syntax is “up to and including” whereas Python would usually use “up to and not including”. There are reasons, but things like this are good to keep in the back of one’s mind whenever someone complains that “R is confusing/inconsistent”. With that said, it’s called out in Pandas Workout with some concrete examples, so it shouldn’t be a gotcha.
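To make the difference concrete (my own example, not one from the book), fetching the first three elements by label includes the endpoint, while fetching them by position does not:

s = pd.Series([10, 20, 30, 40], index=list('abcd'))
s.loc['a':'c']   # label-based: endpoint included
## a    10
## b    20
## c    30
## dtype: int64
s.iloc[0:3]      # position-based: endpoint excluded, as in regular Python
## a    10
## b    20
## c    30
## dtype: int64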
Where an inconsistency isn’t so well handled is when mentioning rounding of values; Pandas Workout suggests
“Be sure to read the documentation for the round method (http://mng.bz/8rzg) to understand its arguments and how it handles numbers like 1.5 and 7.5.”
which points to the documentation for pd.Series.round, but that makes no mention of how these halfway values are handled – I dug up the answer myself and Pandas does “banker’s rounding”, taking half-values to evens. Trying this out myself also demonstrates this
pd.Series([0.2, 0.5, 1.5, 2.5, 3.5]).round()
## 0    0.0
## 1    0.0
## 2    2.0
## 3    2.0
## 4    4.0
## dtype: float64
Incidentally, this is the same for R
round(c(0.2, 0.5, 1.5, 2.5, 3.5))
## [1] 0 0 2 2 4
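and, for that matter, for Python’s built-in round() – so Pandas is at least consistent with its host language here

# Python 3's built-in round() also rounds halves to the nearest even value
[round(x) for x in [0.2, 0.5, 1.5, 2.5, 3.5]]
## [0, 0, 2, 2, 4]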
Some additional Series gotchas are detailed, including the fact that while this monstrosity works in Python

'1' + '2' + '3' + '4'
## '1234'

and this does not

sum(['1', '2', '3', '4'])
## TypeError: unsupported operand type(s) for +: 'int' and 'str'

this one does in fact work

pd.Series(['1', '2', '3', '4']).sum()
## '1234'
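For the record, the idiomatic plain-Python spelling of that concatenation is str.join – sum() flat-out refuses a string start value:

# join the strings directly rather than reducing with +
''.join(['1', '2', '3', '4'])
## '1234'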
The similarities between R’s (named) vectors and Pandas’ Series help me to grasp what I might want to do with these, but there are some distinct differences in how the ‘index’/‘name’ part is handled. In R the names can be repeated, although that makes it very difficult to extract elements based on names

x <- c(a = 1, b = 2, a = 3, d = 4)
x
## a b a d
## 1 2 3 4
names(x)
## [1] "a" "b" "a" "d"
x["a"]
## a
## 1
whereas in Pandas the index can be repeated and extracting with a repeated value actually returns all the relevant values
x = pd.Series([1, 2, 3, 4], index=list('abad'))
x
## a    1
## b    2
## a    3
## d    4
## dtype: int64
x.loc['a']
## a    1
## a    3
## dtype: int64
which means that x.loc[z] is essentially x[names(x) == z]

x[names(x) == "a"]
## a a
## 1 3
As a sidenote, that “use a list as a series of characters” bit is going to take me a while to get used to, but I see the value of it. One of my big regrets about R is that strings are not vectors of characters; something I worked around by making my own “character vector” class.
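For anyone else getting used to that idiom: Python strings are themselves sequences, so list() simply splits one into its characters

# a string behaves as a sequence of single-character strings
list('abad')
## ['a', 'b', 'a', 'd']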
What really blew my mind was the behaviour around adding two Series with overlapping indexes; the comparison is explicitly based on the index, so 'a' matches to 'a' and 'd' matches to 'd' regardless of where they appear in the Series

s1 = pd.Series([10, 20, 30, 40], index=list('abcd'))
s1
## a    10
## b    20
## c    30
## d    40
## dtype: int64
s2 = pd.Series([100, 200, 300, 400], index=list('dcba'))
s2
## d    100
## c    200
## b    300
## a    400
## dtype: int64
s1 + s2
## a    410
## b    320
## c    230
## d    140
## dtype: int64
This seems like such an obvious choice – I had to double check what happens in R and a worried look started to creep across my face
s1 = c(a = 10, b = 20, c = 30, d = 40)
s1
##  a  b  c  d
## 10 20 30 40
s2 = c(d = 100, c = 200, b = 300, a = 400)
s2
##   d   c   b   a
## 100 200 300 400
s1 + s2
##   a   b   c   d
## 110 220 330 440
R just ignores the names entirely. If I wanted the same behaviour, I’d need to do the aligning myself
s1[order(names(s1))] + s2[order(names(s2))]
##   a   b   c   d
## 410 320 230 140
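One thing I checked myself (the book doesn’t cover it at this point): when the indexes don’t fully overlap, the alignment produces NaN for any unpaired label, and Series.add offers a fill_value argument to treat the missing side as a default

s1 = pd.Series([10, 20], index=list('ab'))
s2 = pd.Series([100, 200], index=list('bc'))
s1 + s2              # 'a' and 'c' have no partner, so they become NaN
## a      NaN
## b    120.0
## c      NaN
## dtype: float64
s1.add(s2, fill_value=0)   # treat a missing partner as 0 instead
## a     10.0
## b    120.0
## c    200.0
## dtype: float64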
Closing out the chapter is a set of exercises using some real data. There is a link to that data on the book website, but that location is not mentioned in the book itself, which isn’t ideal. There is a hyperlink in the PDF version

The data is in the file taxi-passenger-count.csv, available along with the other data files used in this course.

but that points to a repository of the exercises, not the data. I presume the print version lacks a way for the reader to find this data. These sorts of mistakes happen – I noted some of them on Manning’s discussion forum but have yet to hear back from the author about any of them.
It’s worth mentioning that while the data is from the real world and has genuine ‘problems’ that one would experience when performing an analysis, the data is a single zip file comprising a whopping 852MB download, which might be a bit much for some people.
Chapter 2 naturally moves on to DataFrames; rectangular structures constructed as a combination of multiple Series (though the only mention along those lines seems to be a throwaway comment to that effect).

The humble ‘Data Frame’ (data.frame) in R is a core data type. I’m not entirely clear on the history, but they were present in R’s predecessor language S in 1992 and likely even earlier. Nowadays, lots of Data Frame implementations exist for the purpose of rectangling – and slicing thereof – including Polars in Rust, Tidier.jl in Julia, dataframe in Haskell, data-frame in Racket, and I’m sure many others. This structure resembles a database table, and operations on these tend to mimic SQL syntax – think filter, select, join, etc…
In R, I know that a data.frame is explicitly a “list of vectors all of the same length”, but that seems to be glossed over here in favor of “you know about tables – this is similar”. The constructor examples show either a list of lists or a list of dicts, and that’s perhaps because a list of Series objects doesn’t do what I expect

a = pd.Series([1, 2, 3])
b = pd.Series([4, 5, 6])
pd.DataFrame([a, b])
##    0  1  2
## 0  1  2  3
## 1  4  5  6
The resulting DataFrame is created row-wise, which is enough of a headache in R, let alone the differences here depending on how the object is created.
There are apparently a handful of ways one can do this, but they’re not mentioned in the book
a = pd.Series([1, 2, 3])
b = pd.Series([4, 5, 6])
pd.DataFrame({'a': a, 'b': b})
##    a  b
## 0  1  4
## 1  2  5
## 2  3  6
pd.DataFrame(dict(a=a, b=b))
##    a  b
## 0  1  4
## 1  2  5
## 2  3  6
pd.concat([a, b], axis=1, keys=list('ab'))
##    a  b
## 0  1  4
## 1  2  5
## 2  3  6
I was also curious about what happens if they aren’t the same length
a = pd.Series([1, 2, 3])
b = pd.Series([4, 5, 6, 7])
pd.DataFrame({'a': a, 'b': b})
##      a  b
## 0  1.0  4
## 1  2.0  5
## 2  3.0  6
## 3  NaN  7
The inserted NaN is potentially surprising; R bails out of trying to construct such an object

a <- c(1, 2, 3)
b <- c(4, 5, 6, 7)
data.frame(a, b)
## Error in data.frame(a, b): arguments imply differing number of rows: 3, 4
except when it tries to be clever by recycling some values, with often surprising results…
a <- 1:3
b <- 4:9
data.frame(a, b)
##   a b
## 1 1 4
## 2 2 5
## 3 3 6
## 4 1 7
## 5 2 8
## 6 3 9
The various methods for subsetting rows and columns mostly make sense coming from R, although it will still take me some time to get used to seeing things like

df[list('bcd')]

to extract the columns labelled 'b', 'c', and 'd'. One thing I noticed was that only one axis (columns) was specified for extraction, and while the same thing works in R, e.g.
m <- head(mtcars)
m[c("cyl", "mpg", "wt")]
##                   cyl  mpg    wt
## Mazda RX4           6 21.0 2.620
## Mazda RX4 Wag       6 21.0 2.875
## Datsun 710          4 22.8 2.320
## Hornet 4 Drive      6 21.4 3.215
## Hornet Sportabout   8 18.7 3.440
## Valiant             6 18.1 3.460
the absence of a note about it was probably an oversight. In R, if the vector specifying the selection contains any missing values (NA) then an error is triggered
m[c("cyl", NA, "wt")] ## Error in `[.data.frame`(m, c("cyl", NA, "wt")): undefined columns selected
This is confusing, for sure. R is fine with us selecting missing rows
m[c(1, NA, 3), ]
##             mpg cyl disp  hp drat   wt  qsec vs am gear carb
## Mazda RX4  21.0   6  160 110 3.90 2.62 16.46  0  1    4    4
## NA           NA  NA   NA  NA   NA   NA    NA NA NA   NA   NA
## Datsun 710 22.8   4  108  93 3.85 2.32 18.61  1  1    4    1
but not missing columns
m[1:3, c(1, NA, 3)]
## Error in `[.data.frame`(m, 1:3, c(1, NA, 3)): undefined columns selected
So, what about Pandas? The fact that x[cols] works to select columns, but selecting rows requires x.loc[rows, cols] is a bit ugly, in my opinion, and there are parallels here with the mess of R’s drop=TRUE argument, which results in a vector rather than a data.frame when only a single dimension is selected.
x = pd.DataFrame({'a': a, 'b': b})
x
##      a  b
## 0  1.0  4
## 1  2.0  5
## 2  3.0  6
## 3  NaN  7
x['a']
## 0    1.0
## 1    2.0
## 2    3.0
## 3    NaN
## Name: a, dtype: float64
x.loc[:, 'a']
## 0    1.0
## 1    2.0
## 2    3.0
## 3    NaN
## Name: a, dtype: float64
x.loc[1:2, ['a', 'b']]
##      a  b
## 1  2.0  5
## 2  3.0  6
x.loc[1:2, ['a', pd.NA]]
## KeyError: '[<NA>] not in index'
One powerful but dangerous feature of {dplyr} (and some parts of base R) is Non-Standard Evaluation (NSE), which enables a ‘shortcut’ in writing out expressions for filtering or selecting data in a data.frame. Essentially, the user can use column names as variables and they are translated as such within the function, so rather than writing out

filter(mydataframe, mydataframe$a > 300 & mydataframe$w %% 2 == 1)

one can use the column names ‘a’ and ‘w’ as if they were defined as variables

filter(mydataframe, a > 300 & w %% 2 == 1)
This is handy, but of course has some sharp edges. I have mixed feelings about finding the equivalent in Pandas in the form of df.query. This takes an SQL-like statement and similarly treats columns as variables, but in this case the entire thing is provided as a string

df.query('v > 300 & w % 2 == 1')
I imagine this is incompatible with any language-server features, though the metaprogramming-enjoying part of me does wonder if it makes building up these expressions programmatically a little easier.
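For comparison, the same filter can be written without a string as an ordinary boolean mask – wordier, but fully visible to tooling (toy data of my own here):

df = pd.DataFrame({'v': [100, 400, 500], 'w': [1, 3, 4]})
df[(df['v'] > 300) & (df['w'] % 2 == 1)]   # rows where v > 300 and w is odd
##      v  w
## 1  400  3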
Another gotcha mentioned in this section is the fact that Pandas makes internal copies of data, and produces an interesting warning
df = pd.DataFrame(
    {'a': [10, 50, 90],
     'b': [20, 60, 100],
     'c': [30, 70, 110],
     'd': [40, 80, 120]},
    index = list('xyz'))
df[df['b'] > 30]['b'] = 0
df
## <string>:1: SettingWithCopyWarning:
## A value is trying to be set on a copy of a slice from a DataFrame.
## Try using .loc[row_indexer,col_indexer] = value instead
##
## See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
(with the value not being set to 0).
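Following the warning’s advice and using a single .loc with both indexers does assign into the original frame:

# one indexing operation, so the assignment hits the frame, not a copy
df.loc[df['b'] > 30, 'b'] = 0
df
##     a   b    c    d
## x  10  20   30   40
## y  50   0   70   80
## z  90   0  110  120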
This copy-versus-view behaviour is reminiscent of SQL ‘views’ into data – and not being able to modify those – and that seems like a useful feature, but it again makes me wonder what happens in R…
df <- data.frame(a = c(10, 50, 90),
                 b = c(20, 60, 100),
                 c = c(30, 70, 110),
                 d = c(40, 80, 120),
                 row.names = c("x", "y", "z"))
dfo <- df
df
##    a   b   c   d
## x 10  20  30  40
## y 50  60  70  80
## z 90 100 110 120
The following updates all of the b column because, as we saw above, just specifying a single selector restricts the columns, so since df$b > 30 is c(FALSE, TRUE, TRUE) the intermediate subset is just the second and third columns (“b” and “c”), after which we select the “b” column and assign the value of 0 to all elements. No actual filtering of rows has occurred here, and the entire column in the full data is updated

df[df$b > 30]["b"] <- 0
df
##    a b   c   d
## x 10 0  30  40
## y 50 0  70  80
## z 90 0 110 120
Equivalently, [][["b"]] or []$b.
In order to select rows for which b > 30 this needs to be df[df$b > 30, ] (with a comma), and this updates just the relevant slice (again in the full data)

df <- dfo
df[df$b > 30, ]$b <- 0
df
##    a   b   c   d
## x 10  20  30  40
## y 50   0  70  80
## z 90   0 110 120
and this works just fine in R, despite the fact that we have explicitly subset the data prior to selecting a column, and assigned a value to that subset.
Maybe that warning isn’t a bad thing at all – such subsets are done regularly in R, and there are no memory issues with doing so (it’s well-defined in terms of [<-), but I can see why it’s confusing.
Chapter 3 details importing and exporting data, and it’s here that I start to see how much has been rolled into Pandas – things that are spread across several R packages.
I’m familiar with scraping HTML tables from websites in R – there are many base packages which can read from a URL, and several ways to convert the resulting data into a rectangle such as a data.frame – but Pandas surprises me with pd.read_html(url), which returns a list of DataFrames, one for each table on a webpage. Trying this out, it works quite nicely

url = 'https://en.wikipedia.org/wiki/R_(programming_language)#Version_names'
tabs = pd.read_html(url)[1]
x = tabs.loc[:, ['Version', 'Name', 'Release date']]
x
##    Version                      Name Release date
## 0    4.4.2            Pile of Leaves   2024-10-31
## 1    4.4.1        Race for Your Life   2024-06-14
## 2    4.4.0                 Puppy Cup   2024-04-24
## 3    4.3.3           Angel Food Cake   2024-02-29
## 4    4.3.2                 Eye Holes   2023-10-31
## 5    4.3.1             Beagle Scouts   2023-06-16
## 6    4.3.0          Already Tomorrow   2023-04-21
## 7    4.2.3          Shortstop Beagle   2023-03-15
## 8    4.2.2     Innocent and Trusting   2022-10-31
## 9    4.2.1         Funny-Looking Kid   2022-06-23
## 10   4.2.0     Vigorous Calisthenics   2022-04-22
## 11   4.1.3               One Push-Up   2022-03-10
## 12   4.1.2               Bird Hippie   2021-11-01
## 13   4.1.1               Kick Things   2021-08-10
## 14   4.1.0           Camp Pontanezen   2021-05-18
## 15   4.0.5           Shake and Throw   2021-03-31
## 16   4.0.4         Lost Library Book   2021-02-15
## 17   4.0.3   Bunny-Wunnies Freak Out   2020-10-10
## 18   4.0.2          Taking Off Again   2020-06-22
## 19   4.0.1            See Things Now   2020-06-06
## 20   4.0.0                 Arbor Day   2020-04-24
## 21   3.6.3      Holding the Windsock   2020-02-29
## 22   3.6.2     Dark and Stormy Night   2019-12-12
## 23   3.6.1        Action of the Toes   2019-07-05
## 24   3.6.0        Planting of a Tree   2019-04-26
## 25   3.5.3               Great Truth   2019-03-11
## 26   3.5.2           Eggshell Igloos   2018-12-20
## 27   3.5.1             Feather Spray   2018-07-02
## 28   3.5.0            Joy in Playing   2018-04-23
## 29   3.4.4        Someone to Lean On   2018-03-15
## 30   3.4.3          Kite-Eating Tree   2017-11-30
## 31   3.4.2              Short Summer   2017-09-28
## 32   3.4.1             Single Candle   2017-06-30
## 33   3.4.0       You Stupid Darkness   2017-04-21
## 34   3.3.3             Another Canoe   2017-03-06
## 35   3.3.2     Sincere Pumpkin Patch   2016-10-31
## 36   3.3.1          Bug in Your Hair   2016-06-21
## 37   3.3.0    Supposedly Educational   2016-05-03
## 38   3.2.5  Very, Very Secure Dishes   2016-04-11
## 39   3.2.4        Very Secure Dishes   2016-03-11
## 40   3.2.3     Wooden Christmas-Tree   2015-12-10
## 41   3.2.2               Fire Safety   2015-08-14
## 42   3.2.1    World-Famous Astronaut   2015-06-18
## 43   3.2.0       Full of Ingredients   2015-04-16
## 44   3.1.3           Smooth Sidewalk   2015-03-09
## 45   3.1.2            Pumpkin Helmet   2014-10-31
## 46   3.1.1             Sock it to Me   2014-07-10
## 47   3.1.0              Spring Dance   2014-04-10
## 48   3.0.3                Warm Puppy   2014-03-06
## 49   3.0.2           Frisbee Sailing   2013-09-25
## 50   3.0.1                Good Sport   2013-05-16
## 51   3.0.0             Masked Marvel   2013-04-03
## 52  2.15.3          Security Blanket   2013-03-01
## 53  2.15.2            Trick or Treat   2012-10-26
## 54  2.15.1      Roasted Marshmallows   2012-06-22
## 55  2.15.0             Easter Beagle   2012-03-30
## 56  2.14.2       Gift-Getting Season   2012-02-29
## 57  2.14.1       December Snowflakes   2011-12-22
## 58  2.14.0             Great Pumpkin   2011-10-31
## 59 r-devel   Unsuffered Consequences          NaN
I’m all too familiar with issues of nested tables and data that doesn’t rectangle so easily, but for this simple example it worked well, I think.
Chapter 4 covers indexes and again the idea of repeated values comes into play. The long-standing issue that the {tibble} team have with rownames comes to mind – I like being able to name both axes of the data and hate that {tibble} is opposed to them – so it’s extra surprising to have a Data Frame where the ‘row’ labels can repeat. What’s more, the index can be hierarchical as a multi-index. With the example from above, splitting the version into a multi-index is interesting
v = x.Version.str.split('.', expand=True)
v.columns = ['major', 'minor', 'patch']
xv = pd.concat([x, v], axis=1)
xv = xv.set_index(['major', 'minor', 'patch'])
xv.head(25)
##                    Version                     Name Release date
## major minor patch
## 4     4     2        4.4.2           Pile of Leaves   2024-10-31
##             1        4.4.1       Race for Your Life   2024-06-14
##             0        4.4.0                Puppy Cup   2024-04-24
##       3     3        4.3.3          Angel Food Cake   2024-02-29
##             2        4.3.2                Eye Holes   2023-10-31
##             1        4.3.1            Beagle Scouts   2023-06-16
##             0        4.3.0         Already Tomorrow   2023-04-21
##       2     3        4.2.3         Shortstop Beagle   2023-03-15
##             2        4.2.2    Innocent and Trusting   2022-10-31
##             1        4.2.1        Funny-Looking Kid   2022-06-23
##             0        4.2.0    Vigorous Calisthenics   2022-04-22
##       1     3        4.1.3              One Push-Up   2022-03-10
##             2        4.1.2              Bird Hippie   2021-11-01
##             1        4.1.1              Kick Things   2021-08-10
##             0        4.1.0          Camp Pontanezen   2021-05-18
##       0     5        4.0.5          Shake and Throw   2021-03-31
##             4        4.0.4        Lost Library Book   2021-02-15
##             3        4.0.3  Bunny-Wunnies Freak Out   2020-10-10
##             2        4.0.2         Taking Off Again   2020-06-22
##             1        4.0.1           See Things Now   2020-06-06
##             0        4.0.0                Arbor Day   2020-04-24
## 3     6     3        3.6.3     Holding the Windsock   2020-02-29
##             2        3.6.2    Dark and Stormy Night   2019-12-12
##             1        3.6.1       Action of the Toes   2019-07-05
##             0        3.6.0       Planting of a Tree   2019-04-26
This does mean that I can extract all of the 4.3.x series releases
xv.loc[('4', '3')]
## <string>:1: PerformanceWarning: indexing past lexsort depth may impact performance.
##       Version              Name Release date
## patch
## 3       4.3.3   Angel Food Cake   2024-02-29
## 2       4.3.2         Eye Holes   2023-10-31
## 1       4.3.1     Beagle Scouts   2023-06-16
## 0       4.3.0  Already Tomorrow   2023-04-21
Getting the x.x.0 releases is a little messier, requiring slice(None)
xv.loc[(slice(None), slice(None), '0')]
##              Version                    Name Release date
## major minor
## 4     4        4.4.0               Puppy Cup   2024-04-24
##       3        4.3.0        Already Tomorrow   2023-04-21
##       2        4.2.0   Vigorous Calisthenics   2022-04-22
##       1        4.1.0         Camp Pontanezen   2021-05-18
##       0        4.0.0               Arbor Day   2020-04-24
## 3     6        3.6.0      Planting of a Tree   2019-04-26
##       5        3.5.0          Joy in Playing   2018-04-23
##       4        3.4.0     You Stupid Darkness   2017-04-21
##       3        3.3.0  Supposedly Educational   2016-05-03
##       2        3.2.0     Full of Ingredients   2015-04-16
##       1        3.1.0            Spring Dance   2014-04-10
##       0        3.0.0           Masked Marvel   2013-04-03
## 2     15      2.15.0           Easter Beagle   2012-03-30
##       14      2.14.0           Great Pumpkin   2011-10-31
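A slightly tidier spelling for the same selection uses pd.IndexSlice, something I found in the Pandas documentation rather than in the book:

idx = pd.IndexSlice
xv.loc[idx[:, :, '0'], :]   # equivalent to the slice(None) version above
## (same table as above)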
and while this was interesting to achieve, a filter would probably be better
xv = pd.concat([x, v], axis=1)
xv.query('patch == "0"')
##    Version                    Name Release date major minor patch
## 2    4.4.0               Puppy Cup   2024-04-24     4     4     0
## 6    4.3.0        Already Tomorrow   2023-04-21     4     3     0
## 10   4.2.0   Vigorous Calisthenics   2022-04-22     4     2     0
## 14   4.1.0         Camp Pontanezen   2021-05-18     4     1     0
## 20   4.0.0               Arbor Day   2020-04-24     4     0     0
## 24   3.6.0      Planting of a Tree   2019-04-26     3     6     0
## 28   3.5.0          Joy in Playing   2018-04-23     3     5     0
## 33   3.4.0     You Stupid Darkness   2017-04-21     3     4     0
## 37   3.3.0    Supposedly Educational 2016-05-03     3     3     0
## 43   3.2.0     Full of Ingredients   2015-04-16     3     2     0
## 47   3.1.0            Spring Dance   2014-04-10     3     1     0
## 51   3.0.0           Masked Marvel   2013-04-03     3     0     0
## 55  2.15.0           Easter Beagle   2012-03-30     2    15     0
## 58  2.14.0           Great Pumpkin   2011-10-31     2    14     0
I believe that is the {tibble} team’s argument – that a filter is more suitable, but I stand by the fact that selecting rows with names which don’t need to be in the data itself is a useful approach.
This chapter also introduces pivot tables in the form of

df.pivot_table(index, columns, values, aggfunc)

the equivalent of which in {dplyr} is dplyr::summarise() or dplyr::count().
Way back when I started learning R I recall finding it odd that the (often useful) name “pivot table” rarely came up in R documentation, which is a bit of a shame. I used them all the time in Excel.
xv.pivot_table(index='major', columns='minor', values='patch', aggfunc='count', fill_value=0)
## minor  0  1  14  15  2  3  4  5  6
## major
## 2      0  0   3   4  0  0  0  0  0
## 3      4  4   0   0  6  4  5  4  4
## 4      6  4   0   0  4  4  3  0  0
My major complaint here is that the function is passed as a string – Y U NO use first class functions???
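In fairness, digging into the documentation shows that aggfunc also accepts an actual callable (or a list of them), so the string form is a convenience rather than a requirement; swapping in Python’s len should produce the same counts here

xv.pivot_table(index='major', columns='minor', values='patch',
               aggfunc=len, fill_value=0)   # len counts rows per group
## (same table as above)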
My notes for Chapter 5 (Cleaning Data) only involve the fact that pd.isna and pd.isnull are apparently the exact same thing, which is vastly different from R.
For Chapter 6 (Grouping, joining, and sorting) I have a rekindled annoyance at the mixing of methods and attributes, driven by the fact that df.transpose() (a method) has an alias df.T (an attribute). I have enough of a hard time trying to remember whether it’s x.len or x.len() or length(x) or whatever.
I will continue to make my way through the rest of the book, but figured this was a good point to take stock of how I feel about it and how much I’ve learned. I’m already able to work with the API I was trying to wrap, and can rectangle the results into something useful, including being able to save those as a CSV or even Excel sheet(s).
Pandas Workout is, in my opinion, a good resource for learning Pandas, provided you already know your way around Python and perhaps data analysis in another language as well. The minor issues of typos and omissions happen in any book and I wouldn’t say any of them are dealbreakers. The book covers a lot of the questions I had as I was working through problems – one of the hardest things when trying to follow along is having a ‘simple’ question that’s hard to answer yourself. I’ve abandoned learning other languages (or at least using specific resources) when I’ve hit that point.
Pandas itself seems like it makes some good choices in terms of Series and DataFrame structures, and when I’m using Python I’ll be sure to load this library and make use of them.
Comments, improvements, or sharing your experiences are most welcome. I can be found on Mastodon or use the comments below.
devtools::session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.4.1 (2024-06-14)
##  os       macOS Sonoma 14.6
##  system   aarch64, darwin20
##  ui       X11
##  language (EN)
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##  tz       Australia/Adelaide
##  date     2025-02-13
##  pandoc   3.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)
##
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version    date (UTC) lib source
##  blogdown      1.19       2024-02-01 [1] CRAN (R 4.4.0)
##  bookdown      0.41       2024-10-16 [1] CRAN (R 4.4.1)
##  bslib         0.8.0      2024-07-29 [1] CRAN (R 4.4.0)
##  cachem        1.1.0      2024-05-16 [1] CRAN (R 4.4.0)
##  cli           3.6.3      2024-06-21 [1] CRAN (R 4.4.0)
##  devtools      2.4.5      2022-10-11 [1] CRAN (R 4.4.0)
##  digest        0.6.37     2024-08-19 [1] CRAN (R 4.4.1)
##  ellipsis      0.3.2      2021-04-29 [1] CRAN (R 4.4.0)
##  evaluate      1.0.1      2024-10-10 [1] CRAN (R 4.4.1)
##  fansi         1.0.6      2023-12-08 [1] CRAN (R 4.4.0)
##  fastmap       1.2.0      2024-05-15 [1] CRAN (R 4.4.0)
##  fs            1.6.5      2024-10-30 [1] CRAN (R 4.4.1)
##  glue          1.8.0      2024-09-30 [1] CRAN (R 4.4.1)
##  htmltools     0.5.8.1    2024-04-04 [1] CRAN (R 4.4.0)
##  htmlwidgets   1.6.4      2023-12-06 [1] CRAN (R 4.4.0)
##  httpuv        1.6.15     2024-03-26 [1] CRAN (R 4.4.0)
##  jquerylib     0.1.4      2021-04-26 [1] CRAN (R 4.4.0)
##  jsonlite      1.8.9      2024-09-20 [1] CRAN (R 4.4.1)
##  knitr         1.48       2024-07-07 [1] CRAN (R 4.4.0)
##  later         1.4.1      2024-11-27 [1] CRAN (R 4.4.1)
##  lattice       0.22-6     2024-03-20 [1] CRAN (R 4.4.1)
##  lifecycle     1.0.4      2023-11-07 [1] CRAN (R 4.4.0)
##  magrittr      2.0.3      2022-03-30 [1] CRAN (R 4.4.0)
##  Matrix        1.7-1      2024-10-18 [1] CRAN (R 4.4.1)
##  memoise       2.0.1      2021-11-26 [1] CRAN (R 4.4.0)
##  mime          0.12       2021-09-28 [1] CRAN (R 4.4.0)
##  miniUI        0.1.1.1    2018-05-18 [1] CRAN (R 4.4.0)
##  pillar        1.9.0      2023-03-22 [1] CRAN (R 4.4.0)
##  pkgbuild      1.4.5      2024-10-28 [1] CRAN (R 4.4.1)
##  pkgload       1.4.0      2024-06-28 [1] CRAN (R 4.4.0)
##  png           0.1-8      2022-11-29 [1] CRAN (R 4.4.0)
##  profvis       0.4.0      2024-09-20 [1] CRAN (R 4.4.1)
##  promises      1.3.2      2024-11-28 [1] CRAN (R 4.4.1)
##  purrr         1.0.2      2023-08-10 [1] CRAN (R 4.4.0)
##  R6            2.5.1      2021-08-19 [1] CRAN (R 4.4.0)
##  Rcpp          1.0.13-1   2024-11-02 [1] CRAN (R 4.4.1)
##  remotes       2.5.0.9000 2024-11-03 [1] Github (r-lib/remotes@5b7eb08)
##  reticulate    1.40.0     2024-11-15 [1] CRAN (R 4.4.1)
##  rlang         1.1.4      2024-06-04 [1] CRAN (R 4.4.0)
##  rmarkdown     2.28       2024-08-17 [1] CRAN (R 4.4.0)
##  rstudioapi    0.17.1     2024-10-22 [1] CRAN (R 4.4.1)
##  sass          0.4.9      2024-03-15 [1] CRAN (R 4.4.0)
##  sessioninfo   1.2.2      2021-12-06 [1] CRAN (R 4.4.0)
##  shiny         1.9.1      2024-08-01 [1] CRAN (R 4.4.0)
##  urlchecker    1.0.1      2021-11-30 [1] CRAN (R 4.4.0)
##  usethis       3.0.0      2024-07-29 [1] CRAN (R 4.4.0)
##  utf8          1.2.4      2023-10-22 [1] CRAN (R 4.4.0)
##  vctrs         0.6.5      2023-12-01 [1] CRAN (R 4.4.0)
##  xfun          0.50.5     2025-01-23 [1] Github (yihui/xfun@116d689)
##  xtable        1.8-4      2019-04-21 [1] CRAN (R 4.4.0)
##  yaml          2.3.10     2024-07-26 [1] CRAN (R 4.4.0)
##
##  [1] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
##
## ─ Python configuration ───────────────────────────────────────────────────────
##  python:         /Users/jono/.virtualenvs/r-reticulate/bin/python
##  libpython:      /Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/config-3.9-darwin/libpython3.9.dylib
##  pythonhome:     /Users/jono/.virtualenvs/r-reticulate:/Users/jono/.virtualenvs/r-reticulate
##  version:        3.9.6 (default, Oct 4 2024, 08:01:31) [Clang 16.0.0 (clang-1600.0.26.4)]
##  numpy:          /Users/jono/.virtualenvs/r-reticulate/lib/python3.9/site-packages/numpy
##  numpy_version:  2.0.2
##
##  NOTE: Python version was forced by use_python() function
##
## ──────────────────────────────────────────────────────────────────────────────