Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
… or Inferring Identity from Observations
A conservation organisation starts a project to geographically catalogue the remaining representatives of an endangered plant species. For that purpose, hikers are encouraged to report the location of the plant whenever they encounter it. Because those hikers use GPS technology ranging from cheap smartphones to high-end GPS devices, and because weather and environmental circumstances vary, the measurements are of varying accuracy. The goal of the conservation organisation is to build a map locating all found plants, each with an ID assigned to it. Every time a new location measurement is entered into the system, a clustering is applied to identify related measurements, i.e. those belonging to the same plant.
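To make the setup concrete, here is a minimal sketch of such a clustering step. The article does not say which algorithm is used, so complete-linkage hierarchical clustering, the 15 m threshold and the coordinates are all assumptions made purely for illustration.

```r
# Hypothetical measurements in metres on a local grid (made-up data)
obs <- rbind(c(0, 0), c(3, 1), c(1, 2),   # three readings near one plant
             c(100, 100), c(102, 99),     # two readings near a second plant
             c(50, 200))                  # a single reading of a third plant

# Complete linkage: any two measurements in a cluster are at most h apart
tree <- hclust(dist(obs), method = "complete")
cx <- cutree(tree, h = 15)                # assumed threshold of 15 m

# A new measurement arrives and everything is clustered again
obs_new <- rbind(obs, c(51, 198))
cx_new <- cutree(hclust(dist(obs_new), method = "complete"), h = 15)
```

The cluster numbers returned by `cutree()` carry no meaning beyond telling clusters apart, which is exactly why a separate rule for handing down IDs is needed.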
“I am he as you are he as you are me …
(… And we are all together” – I am the Walrus / Beatles)
So far so good, but it gets a bit tricky when deciding how to deal with the IDs of clusters / plants when a newly introduced location estimate
“Who Cares?”
Fair question, as one might argue that an ID only serves the purpose of differentiating and that there is no need to maintain a family tree of clusters. In the use case above this argument is not easily dismissed either. But a stable inheritance of IDs might make it easier to understand the dynamics of how clustering takes place: a large number of representatives might render a cluster and the entity it represents “important”, and it would be weird to have no stable way to refer to it. Some other possible motivations come to mind. Maybe the organisation will send researchers to selected plants to examine them and henceforth intends to refer to those specific plants.
“Take arms!”
    # calculates the contingency table described below
    cross <- function(c0, cx) {
      uc0 <- unique(c0[c0 != "?"])
      ucx <- unique(cx)

      cross <- matrix(0,
        ncol=length(ucx), nrow=length(uc0),
        dimnames=list(uc0, ucx)
      )

      for(id_c0 in uc0) {
        for(id_cx in ucx) {
          cross[id_c0, id_cx] <- length(intersect(
            which(c0 == id_c0), which(cx == id_cx)
          ))
        }
      }

      return(cross)
    }

    # helper function: "A B" -> c("A","B")
    sv <- function(str) {
      strsplit(str," +")[[1]]
    }
    > c0 <- sv("A A B B C C C ?")
    > cx <- sv("3 3 2 2 2 1 1 2")
    >
    > cross(c0,cx)
      3 2 1
    A 2 0 0
    B 0 2 0
    C 0 1 2
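For comparison, base R's `table()` yields the same counts. This is only a sketch that reuses the vectors from the session above. Wrapping `cx` in a factor with explicit levels is my own choice: it keeps the column order of `cross()` and preserves a column for a cluster that consists solely of “?” elements.

```r
sv <- function(str) strsplit(str, " +")[[1]]

c0 <- sv("A A B B C C C ?")
cx <- sv("3 3 2 2 2 1 1 2")

# only measurements that already carry a label contribute to the counts
known <- c0 != "?"

tab <- table(old = c0[known],
             new = factor(cx[known], levels = unique(cx)))
```

Each row of `tab` says where the members of an old cluster ended up, and each column says where the members of a new cluster came from.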
Choosing a Label for a Mixed Set
Or take the situation illustrated to the right. For set 1 the label is a clear choice. But with the democratic labeling heuristic above we would have to choose the same label for set 2, and this would lead to a conflict. :/
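The conflict can be reproduced with a small made-up contingency table. The sketch assumes that “democratic” means each new cluster takes the old label most common among its members; the counts below are hypothetical and are not read off the illustration.

```r
# rows: old labels, columns: new clusters (hypothetical counts)
x <- matrix(c(3, 2,
              0, 1),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("A", "B"), c("1", "2")))

# majority vote: every new cluster picks the most frequent old label
majority <- apply(x, 2, function(col) rownames(x)[which.max(col)])

# new clusters that end up claiming the same label
conflict <- majority[duplicated(majority) | duplicated(majority, fromLast = TRUE)]
```

Here cluster 1 consists of former “A” elements only, while cluster 2 is mixed but still has an “A” majority, so both would claim the label “A”.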
A Conservative Approach to Restore Peace
To make a long story short, a possible way to go is to take a very conservative stance and expect a cluster to properly tend its flock if it would like to keep its label. That is, if a cluster loses an element, or gains one that previously belonged to another labeled cluster, its new label is chosen randomly. Previously unlabeled measurements (marked “?”) do not count as a gain. This can be told by checking the contingency table: a cluster keeps its label if one and only one field in its row is non-zero and that field is also the only non-zero field in its column.
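    # determines unambiguous cluster labeling cases;
    # "+" marks a cluster that needs a new label
    labeling <- function(cross) {
      labels <- c()

      for(id_cx in colnames(cross)) {
        # exactly one non-zero field in the column; the max() > 0 guard keeps
        # a cluster made up only of "?" elements from inheriting a label
        if(max(cross[,id_cx]) > 0 &&
           sum(cross[,id_cx]) == max(cross[,id_cx])) {
          id_c0 <- which.max(cross[,id_cx])

          # ... and that field is the only non-zero one in its row
          if(sum(cross[id_c0,]) == max(cross[id_c0,])) {
            labels[id_cx] <- names(id_c0)
          } else {
            labels[id_cx] <- "+"
          }
        } else {
          labels[id_cx] <- "+"
        }
      }

      return(labels)
    }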
And now in action:
    > c0 <- sv("A A B B C C C D D ?")
    > cx <- sv("3 3 2 2 1 1 1 1 4 2")
    >
    > x <- cross(c0,cx)
    > x
      3 2 1 4
    A 2 0 0 0
    B 0 2 0 0
    C 0 0 3 0
    D 0 0 1 1
    >
    > labeling(x)
      3   2   1   4
    "A" "B" "+" "+"
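What remains is to turn every “+” into an actual new ID. The following is one possible sketch and not part of the original code: `assign_ids` is a hypothetical helper, it assumes single-letter IDs as in the examples, and it assumes that the labels of dissolved clusters (here “C” and “D”) are retired rather than reused.

```r
# Hypothetical helper: swaps every "+" for a randomly chosen, unused label
assign_ids <- function(labels, used) {
  pool <- setdiff(LETTERS, used)          # labels that were never handed out
  need <- which(labels == "+")
  stopifnot(length(need) <= length(pool)) # give up if the alphabet runs out
  labels[need] <- sample(pool, length(need))
  labels
}

# the labeling result from the session above, written out literally
labels <- c("3" = "A", "2" = "B", "1" = "+", "4" = "+")
ids <- assign_ids(labels, used = c("A", "B", "C", "D"))

# map every measurement to the ID of the cluster it now belongs to
cx <- c("3", "3", "2", "2", "1", "1", "1", "1", "4", "2")
c1 <- unname(ids[cx])
```

The resulting `c1` can serve as the `c0` of the next round, once a further measurement has been appended as “?”.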
Much Ado about Something
Congratulations on making it to this point – you are now part of a small, distinguished circle! Write me a mail and I will organize a session for you in which you will receive the fierce-looking joyofdata tattoo on your forehead, which will grant you bargains in organic supermarkets all over the world and make it easier to meet people at night clubs. Okay, seriously, I’d be interested in input!
(original article published on www.joyofdata.de)