Find the intersection of overlapping histograms in R
Here, I demonstrate how to find the point where two overlapping histograms, represented by their density curves, intersect. While this is an approximation, it is accurate to within the spacing of the grid on which the densities are evaluated.
Prepare simulated data
I created two data sets, gamma_dist and norm_dist, which are made up of a different number of values sampled randomly from a gamma distribution and a normal distribution, respectively. I specifically made the data sets different sizes to make the point that this method is still applicable.
    library(tibble)

    set.seed(0)
    gamma_dist <- rgamma(1e5, shape = 2, scale = 2)
    norm_dist <- rnorm(5e5, mean = 20, sd = 5)

    df <- tibble(
      x = c(gamma_dist, norm_dist),
      original_dataset = c(rep("gamma_dist", 1e5), rep("norm_dist", 5e5))
    )
    df
    #> # A tibble: 600,000 x 2
    #>        x original_dataset
    #>    <dbl> <chr>
    #>  1  6.89 gamma_dist
    #>  2  2.25 gamma_dist
    #>  3  1.30 gamma_dist
    #>  4  4.10 gamma_dist
    #>  5  7.77 gamma_dist
    #>  6  5.08 gamma_dist
    #>  7  4.58 gamma_dist
    #>  8  2.30 gamma_dist
    #>  9  1.36 gamma_dist
    #> 10  1.67 gamma_dist
    #> # … with 599,990 more rows
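The different sample sizes do not matter because a kernel density estimate is normalized: each curve integrates to roughly 1 regardless of how many values went into it. The following is a minimal sketch to check this, not part of the original analysis. The helper area_under() is a hypothetical name introduced here for illustration.

```r
set.seed(0)
gamma_dist <- rgamma(1e5, shape = 2, scale = 2)
norm_dist <- rnorm(5e5, mean = 20, sd = 5)

# Hypothetical helper: trapezoidal area under a density() result.
area_under <- function(d) {
  sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
}

# Use the default range of density() so the tails are not cut off.
gamma_area <- area_under(density(gamma_dist))
norm_area <- area_under(density(norm_dist))

# Both areas should be close to 1 despite the 5x difference in sample size.
c(gamma = gamma_area, norm = norm_area)
```

Because both curves are on the same scale, their heights can be compared directly at any x value.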
I used ‘ggplot2’ to plot the densities of the two data sets. The gamma distribution is in red and the normal distribution is in blue. I broke the creation of the plot into two steps: the essential step to create the density curves, and the styling step to make the plot look nice. Of course, these could be combined into a single long ggplot statement.
    library(ggplot2)

    # Essential step: one density curve per data set
    p <- ggplot(df) +
      geom_density(aes(x = x, color = original_dataset))

    # Styling step
    # expansion() replaces expand_scale(), which is deprecated as of ggplot2 3.3.0
    p <- p +
      scale_y_continuous(expand = expansion(mult = c(0, 0.05))) +
      scale_color_manual(values = c("tomato", "dodgerblue")) +
      theme_minimal() +
      theme(
        legend.title = element_blank(),
        plot.title = element_text(hjust = 0.5)
      ) +
      labs(x = "values", title = "Two density curves")
Finding the point of intersection
To find the point of intersection, I first estimated the density of each data set using density(). It is essential to use the same from and to values for each data set. By default, density() evaluates the estimate at 512 equally spaced points between from and to (n = 512), so providing the same starting and ending values makes density() use the same grid of x values for both data sets.
    from <- 0
    to <- 40
    gamma_density <- density(gamma_dist, from = from, to = to)
    norm_density <- density(norm_dist, from = from, to = to)
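As a quick sanity check (a sketch, not part of the original analysis), the two density objects can be compared to confirm they really share the same x grid, and the grid spacing can be read off directly. The names same_grid, n_points, and grid_step are introduced here for illustration.

```r
set.seed(0)
gamma_dist <- rgamma(1e5, shape = 2, scale = 2)
norm_dist <- rnorm(5e5, mean = 20, sd = 5)

from <- 0
to <- 40
gamma_density <- density(gamma_dist, from = from, to = to)
norm_density <- density(norm_dist, from = from, to = to)

# TRUE when both estimates were evaluated at the same x values
same_grid <- isTRUE(all.equal(gamma_density$x, norm_density$x))

# Number of evaluation points (the default is n = 512)
n_points <- length(gamma_density$x)

# Distance between neighbouring grid points: (to - from) / (n - 1)
grid_step <- diff(gamma_density$x)[1]

list(same_grid = same_grid, n_points = n_points, grid_step = grid_step)
```

The grid spacing bounds the resolution of any intersection read off the grid. Passing a larger n to density() gives a finer grid.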
The final step was to find where the density of the gamma distribution first drops below that of the normal distribution. I applied this logic to create the logical vector idx. I also included two other filters to restrict the result to between 5 and 20 because, from the plot above, I can see that the intersection falls within this range. The smallest x value that satisfies all three conditions is taken as the point of intersection.
    idx <- (gamma_density$y < norm_density$y) &
      (gamma_density$x > 5) & (gamma_density$x < 20)
    poi <- min(gamma_density$x[idx])
    poi
    #> 10.64579
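The grid-based result can only land on one of the evaluated x values. If a finer estimate is wanted, one option is to interpolate linearly between grid points and solve for the crossing. This is a different technique from the one used in this post, sketched here with approxfun() and uniroot() from base R. The names density_diff, poi_grid, and poi_interp are hypothetical.

```r
set.seed(0)
gamma_dist <- rgamma(1e5, shape = 2, scale = 2)
norm_dist <- rnorm(5e5, mean = 20, sd = 5)

gamma_density <- density(gamma_dist, from = 0, to = 40)
norm_density <- density(norm_dist, from = 0, to = 40)

# Grid-based estimate, as in the post
idx <- (gamma_density$y < norm_density$y) &
  (gamma_density$x > 5) & (gamma_density$x < 20)
poi_grid <- min(gamma_density$x[idx])

# Linear interpolation of the difference between the two curves
density_diff <- approxfun(gamma_density$x, gamma_density$y - norm_density$y)

# Root of the difference inside the same 5 to 20 window;
# uniroot() needs the difference to change sign across the interval
poi_interp <- uniroot(density_diff, interval = c(5, 20))$root

c(grid = poi_grid, interpolated = poi_interp)
```

If the curves cross only once in the window, the two values should differ by less than one grid step.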
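That’s it: the point of intersection of the two density estimates has been located to within one grid step, which is (to - from) / 511 ≈ 0.078 here because density() evaluates 512 points by default. A vertical line was added to the plot at poi.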
    p <- p +
      geom_vline(xintercept = poi, linetype = 2, size = 0.3, color = "black") +
      annotate(geom = "text", label = round(poi, 3), x = poi - 1, y = 0.1, size = 4, angle = 90)
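Because the data here were simulated from known distributions, the estimate can be compared against the crossing of the true density functions. This is only possible with simulated data and is not part of the original method. The names true_diff and true_poi are introduced here for illustration. Kernel smoothing and sampling noise can shift the estimated curves slightly, so some difference from poi is to be expected.

```r
# Difference between the true densities used to simulate the data
true_diff <- function(x) {
  dgamma(x, shape = 2, scale = 2) - dnorm(x, mean = 20, sd = 5)
}

# Solve for the crossing within the same 5 to 20 window used above
true_poi <- uniroot(true_diff, interval = c(5, 20), tol = 1e-8)$root
true_poi
```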