The day after

HighlandR

2 years ago

[This article was first published on HighlandR, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I woke yesterday to see ‘Excel’ trending on Twitter.

However it didn’t take to long to discover this was not a good thing.

For those not in the know, it turned out that an issue with importing data via Excel had caused a large number of cases to be missed from the UK Government Covid-19 data.

The data pipeline consisted of data in csv format being imported to Excel which was then to be loaded to a database.

Unfortunately, an old version of Excel was being used, so once it maxed out at around 65.3K rows data never got through to the back end.

Any Excel geek will tell you that this is Excel 2003 – from 2007 onwards you have over a million rows at your disposal. Which begs the question – why was such an old .xls format being used?

I personally wonder if the database architecture is quite old as well. I know from experience that importing xlsx files into a 2008 instance of SQL Server ( for example) was quite difficult if you were running Office 2003, and in some cases, you had to use Access drivers rather than Excel ones to be able to import the data. Or, you simply went down the .xls route and everything was more straighforward.

However, to be clear, I know nothing of the architecture of the database. Maybe someone somewhere just really likes Excel 2003.

Either way, it didn’t work out, and Twitter was aghast, appalled and apoplectic. That was just the UK. Then by lunchtime, the US started waking up, and my timeline was full of indignant US data scientists.

“Don’t they know any better?”
“Haven’t they heard of R or Python?” “OMFG why is anyone using Excel for anything?”

Should we jump up and down and demand that Excel gets thrown out of the window?

Excel isn’t going away. I don’t hate it. But I use it a lot less than I used to. I don’t think I ever want to go back to using it as my main tool for working with data. I’d much rather be using R full time.

Why do I advocate using R instead? Is my analysis better because I used R? Maybe not, but it is easier to reproduce, and easier to share, and more importantly, easier for others to check, test and verify.

How could R have helped in this situation? Again, without knowing the details, it’s difficult to be precise, but data.table allows you to read CSV and text files very quickly using its fread() function.

People use it to handle millions of rows every day.

Once the data is in R, it can be written direct to a database. But before that, the data can be checked and tested. Various tools and packages exist to describe the variables in a dataset, and give you an indication of data quality.

Its easy to get carried away with R. I know, I am too zealous about it. But that’s because it makes my life easier.

Other analysts are nervous about it. They don’t want to code. They’ve got excel expertise and they don’t want to lose that.

However, the Excel guru is probably quite likely to realise that they like R. And they will quickly find themselves building up their expertise. It another tool, and it’s different, but fundamentally it makes working with data easier so that you can get onto the parts of your job that you find more stimulating. We know that 80% of data work is so called drudgery. At least have some fun with it.

That’s the other thing. R is fun. If you are learning it from scratch, you are probably using the tidyverse. I would say the key strength of these packages are their high return for fairly minimal effort.

A few years ago I sat in on a training session on the eve of the first NHS-R conference. The girl beside me had never used R.

Within ten minutes she had imported data from Excel, done some data manipulation and plotted it.

Amazing. When I started learning R (slightly before dplyr), Excel was a pain and you had to export everything to csv first.

The tidyverse makes things easier and you can get into a flow more easily.

Further down the line, you might get into data.table.
It’s bit harder to pick up – you do have to think about it, but it is fast, and concise. Nothing is slowing down the data.table train, and if you get on board, it’s quite a journey.

Anyway, what is my point?

Oh yeah, Excel. It won’t get chucked out. SQL isn’t going anywhere either.

R is not a like for like replacement for either of those.

And it doesn’t have to be.

It’s enough that it can help counteract some of the problems that can occur when working with the other tools.

(I say this as someone who has had to port their own R package to SQL, and who had to write some fairly tortuous SQL to do so).

Knowing R at the moment is a bonus, or, can make you an oddity. However, the momentum is building. R is coming down the track. There is a small core of analysts who can see this, and maybe events of the last couple of days will ultimately help push R / Python that bit further forward, and we will all be better off.

To leave a comment for the author, please follow the link and comment on their blog: HighlandR.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.