When life gives you coloured cells, make categories
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Let’s start by making one thing clear. Using coloured cells in Excel to encode different categories of data is wrong. Next time colleagues explain excitedly how “green equals normal and red = tumour”, you must explain that (1) they have sinned and (2) what they meant to do was add a column containing the words “normal” and “tumour”.
I almost hesitate to write this post but…we have to deal with the world as it is, not as we would like it to be. So in the interests of just getting the job done: here’s one way to deal with coloured cells in Excel, should someone send them your way.
I have created a simple Excel workbook. It contains one sheet, in which you will find one column headed “x” followed by the numbers 1-10. The cells A2:A11 are filled with colour and alternate: red-green-red-green-red-green-red-green-red-green.
We’ll read it into R using the xlsx package and extract the worksheet.
library(xlsx) wb <- loadWorkbook("test.xlsx") sheet1 <- getSheets(wb)[[1]]
Now we can get the rows and the cells.
# get all rows rows <- getRows(sheet1) cells <- getCells(rows) # quick look at the values sapply(cells, getCellValue) # 1.1 2.1 3.1 4.1 5.1 6.1 7.1 8.1 9.1 10.1 11.1 # "x" "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
Colour information is contained in the cell styles.
styles <- sapply(cells, getCellStyle)
Somewhat confusingly, although the Excel GUI specifies “fill background”, the property is accessed using getFillForegroundXSSFColor().
Here’s a little function which, given a cell, returns the RGB code for the fill colour. The header cell has no fill colour, so getRgb() returns an error that we can catch.
cellColor <- function(style) { fg <- style$getFillForegroundXSSFColor() rgb <- tryCatch(fg$getRgb(), error = function(e) NULL) rgb <- paste(rgb, collapse = "") return(rgb) }
Applying that to our styles gives this:
sapply(styles, cellColor) 1.1 2.1 3.1 4.1 5.1 6.1 7.1 8.1 9.1 10.1 "" "ff0000" "00ff00" "ff0000" "00ff00" "ff0000" "00ff00" "ff0000" "00ff00" "ff0000" 11.1 "00ff00"
Knowing that, for example, green = 00ff00 = normal and red = ff0000 = tumour, we can now generate category labels something like this:
pheno <- list(normal = "00ff00", tumour = "ff0000") m <- match(sapply(styles, cellColor), pheno) labs <-names(pheno)[m] labs [1] NA "tumour" "normal" "tumour" "normal" "tumour" "normal" "tumour" "normal" [10] "tumour" "normal"
and get on with our lives. Or at least, our analyses.
Filed under: bioinformatics, programming, R, statistics Tagged: excel, spreadsheet
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.