A simple histogram (and why you need to practice it)
In data science, before doing almost anything else, you need to know your data.
This is why visualization is one of the pillars of data science: visualization allows you to see your data and “know” it in a way that your mind is wired for.
(And, it’s why I emphasize mastering data visualization before almost anything else.)
In practice, “knowing your data” typically begins by using data visualizations and summary statistics to examine individual variables. You need to ask and answer questions: What’s in the variable? How is it distributed? What is the mean?
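As a rough sketch of what that first pass can look like in R, the snippet below asks those three questions of a single numeric variable using base functions only. The variable x is simulated purely for illustration; it is an assumption, not data from any real analysis.

```r
# Simulated example variable (an assumption, for illustration only)
set.seed(42)
x <- rnorm(1000, mean = 5)

# What's in the variable?
str(x)        # type, length, and the first few values
head(x)

# How is it distributed?
summary(x)    # min, quartiles, median, mean, max
x_sd <- sd(x) # spread around the mean
quantile(x, probs = c(0.05, 0.95))

# What is the mean?
x_mean <- mean(x)
x_mean
```

Numbers like these tell you where the data sits, but not what shape it has, which is where a plot comes in.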
For answering some of these questions about individual variables, there are few visualization techniques that are simpler, or more useful, than the histogram.
Let’s take a look at a simple histogram in R.
Code
    #-------------
    # LOAD GGPLOT2
    #-------------
    library(ggplot2)

    #--------------------------------------
    # CREATE VARIABLE, NORMALLY DISTRIBUTED
    #--------------------------------------

    # set "seed" for random numbers
    set.seed(42)

    # create variable
    xvar_rand_norm <- rnorm(1000, mean = 5)

    #--------------------------------
    # CREATE DATA FRAME FROM VARIABLE
    #--------------------------------
    df.xvar <- data.frame(xvar_rand_norm)

    #---------------------------------
    # CALCULATE MEAN
    #  we'll use this in an annotation
    #---------------------------------
    xvar_mean <- mean(xvar_rand_norm)

    #-----------------------------------------------
    # PLOT
    #  Here, we're going to plot the histogram
    #  We'll also add a line at the calculated mean
    #  and also add an annotation to specify the
    #  value of the calculated mean
    #-----------------------------------------------
    ggplot(data = df.xvar, aes(x = xvar_rand_norm)) +
      geom_histogram() +
      geom_vline(xintercept = xvar_mean, color = "dark red") +
      annotate("text", label = paste("Mean: ", round(xvar_mean, digits = 2)),
               x = xvar_mean, y = 30, color = "white", size = 5)
The output plot
How this code works
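This is a pretty straightforward histogram with the addition of a vertical line to indicate where the mean is (geom_vline()) and a text annotation giving its value (annotate()).
In case you’re not familiar with how ggplot2 works, here is a quick walk through the code.
First, we’re loading the ggplot2 package with library(ggplot2).
Next, we’re using the rnorm() function to create a normally distributed variable of 1,000 values with a mean of 5. (Calling set.seed(42) first makes the random numbers reproducible.)
After creating the variable itself, we’re using data.frame() to put it into a data frame, because ggplot() takes its data as a data frame.
Next, we calculate the mean. This is extremely straightforward. The mean() function returns the mean of the variable, which we store in xvar_mean for use in the line and the annotation.
Finally, we use ggplot2 to build the plot.
We initially call the ggplot() function and pass it the data frame with data = df.xvar.
The aes() function maps xvar_rand_norm to the x axis.
After specifying which variable we’re going to include in the plot, on the next line, geom_histogram() indicates that we want to draw a histogram.
Lastly, we’re using geom_vline() to draw the line at the mean and annotate() to add the text label.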
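One way to see what each piece contributes is to build the same plot one step at a time and store each stage in its own object. This is only a sketch: the intermediate names (p_base, p_hist, p_line, p_final) are made up for illustration, and bins = 30 is written out explicitly because that is the default ggplot2 falls back to (with a message) when no bin setting is given.

```r
library(ggplot2)

# Recreate the data from the example above
set.seed(42)
df.xvar <- data.frame(xvar_rand_norm = rnorm(1000, mean = 5))
xvar_mean <- mean(df.xvar$xvar_rand_norm)

# Step 1: data plus aesthetic mapping only.
# Printing this draws an empty panel with the x axis set up.
p_base <- ggplot(data = df.xvar, aes(x = xvar_rand_norm))

# Step 2: add the histogram layer
p_hist <- p_base + geom_histogram(bins = 30)

# Step 3: add a vertical reference line at the mean
p_line <- p_hist + geom_vline(xintercept = xvar_mean, color = "dark red")

# Step 4: add the text annotation at the same x position
p_final <- p_line +
  annotate("text", label = paste("Mean:", round(xvar_mean, digits = 2)),
           x = xvar_mean, y = 30, color = "white", size = 5)

print(p_final)
```

Printing p_base, p_hist, and p_line in turn shows the plot growing by one layer each time, which makes the "+" syntax easier to reason about.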
You need to master the histogram
We’re using a few useful techniques in this visualization, but the critical piece that you really need to know is the first two lines of the ggplot() code:
    ggplot(data = df.xvar, aes(x = xvar_rand_norm)) +
      geom_histogram()
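The same two-line pattern carries over to any numeric variable in any data frame. As a sketch, here it is applied to the faithful dataset that ships with R (not part of the original example), with binwidth set by hand; the value 5 is an arbitrary choice for illustration.

```r
library(ggplot2)

# Same pattern, different data: waiting times between eruptions
# from R's built-in 'faithful' dataset
p_waiting <- ggplot(data = faithful, aes(x = waiting)) +
  geom_histogram(binwidth = 5)  # each bar spans 5 minutes (arbitrary choice)

print(p_waiting)
```

Only the data frame, the variable name, and the bin setting change. The structure of the call stays the same, which is what makes it worth memorizing.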
I’ve been beating this drum for well over a year now, but this bears repeating:
The histogram is one of the plots that you need to master. You’ll use it constantly in analysis and reporting. When you move on to more advanced topics like machine learning, you’ll need to use the histogram to examine how your variables are distributed (although you can also use its fraternal twin, the density plot).
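For comparison, a density plot of the same kind of variable needs only a different geom. The sketch below recreates the simulated data so it runs on its own, then overlays the density curve on a histogram. The overlay uses after_stat(density), which assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)

# Recreate the simulated variable from the histogram example
set.seed(42)
df.xvar <- data.frame(xvar_rand_norm = rnorm(1000, mean = 5))

# Density plot: same mapping, different geom
p_density <- ggplot(data = df.xvar, aes(x = xvar_rand_norm)) +
  geom_density()

# Histogram and density together: rescale the histogram's y axis
# from counts to density so both layers share one scale
# (after_stat() assumes ggplot2 >= 3.3.0)
p_both <- ggplot(data = df.xvar, aes(x = xvar_rand_norm)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30) +
  geom_density(color = "dark red")

print(p_density)
print(p_both)
```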
Here’s what I mean by “master”: you should be able to write the code for a histogram “in your sleep”.
You should be able to write the code to create a histogram with your eyes closed.
Fluency with the basics: a critical milestone
And not just the histogram.
You need the same level of fluency with all of the other primary tools of visualization like the scatterplot, the bar chart, and the line chart. You also need that level of fluency with the basic data wrangling techniques.
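To give a sense of how little code those other plots take, here is a minimal sketch of each. The datasets are stand-ins chosen for convenience: mtcars ships with R and economics ships with ggplot2. The object names are made up for illustration.

```r
library(ggplot2)

# Scatterplot: car weight vs. fuel economy
p_scatter <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Bar chart: number of cars per cylinder count
# (factor() makes cyl a categorical variable)
p_bar <- ggplot(data = mtcars, aes(x = factor(cyl))) +
  geom_bar()

# Line chart: number of unemployed over time
p_line <- ggplot(data = economics, aes(x = date, y = unemploy)) +
  geom_line()

print(p_scatter)
print(p_bar)
print(p_line)
```

Each one follows the same shape as the histogram: a ggplot() call that names the data and the mapping, then a geom that says what to draw.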
You need to practice
Achieving that level of fluency isn’t hard, but most people never practice, so they never get there.
I want to make this clear: in order to master these tools, you need to practice.
In this regard, learning data science is much like learning a musical instrument: you need to practice. Ideally, you need to practice every day.
Sound like work? It is. But the rewards are profound.
Discover how to rapidly master data science
The big barrier is that most people don’t know how to practice writing code.
If you want to discover how to practice data science and rapidly master the techniques, then sign up for the Sharp Sight Labs email list.
In the near future, Sharp Sight Labs will be publishing a lot more material about the strategies and systems for practicing data science and achieving rapid results.
The post A simple histogram (and why you need to practice it) appeared first on SHARP SIGHT LABS.