Dynamic analysis on outliers

[This article was first published on R - Data Science Heroes Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.




Treating outliers

Introduction

Outliers are the extreme values that a variable has, depending on the model or requirement, it could be necessary to treat them, either transforming or deleting.

Variable “Income” distribution

01_income_distribution

This is going to be our main variable in this example, which represents customer's income in $. We can observe how there are a few cases with very high values, while on the other hand, there are lots of cases with low/mid values.

If we choose to delete them…

A common question is: “How many cases do we have to leave out?”, we can choose to leave out highest 1%, so we will obtain:

02_income_p99.JPG

Now the distribution looks very similar to last one, except now it reaches $300.000 instead of $500.000.

If we do this process iteratively -deleting highest 1%, and then to that result, we delete again highest 1%, and so on, repeating this process 10 times- we're analyzing different cut-off values in order to leave out extreme values. We obtain a curious result, silhouette remains always similar to:

03_density_simple.JPG



Animating the example

The following animation shows in action this iterative deleting process: As we leave out the highest 1%, silhouette keeps a similar aspect to:

04_fractal_outliers.gif

In other words, there are always lots of people with low/mid income, and just a few number of cases with high income -because of distribution nature-. Axis values change within each iteration.

If we change the histogram plot, by a density one, the result is more similar to zoom on the data left side:

05_fractal_outliers_density.gif

When we delete the lowest or highest values of any variable, what we are doing is a “zoom” to the area where most cases are.



Final thoughts

In this particular case, we could choose to leave out highest 0.5 or 1% of data. However it is not always recommended to delete all outliers, sometimes they represent valuable information such as fraud or a machine failure, or any other event which deserves further inspection.


Contact

Made by Pablo C. from Data Science Heroes.

R code and data available on github

Outliers issues and data treatment are topics of the e-learning course: Data Science with R (just email for a free demo:[email protected])

To leave a comment for the author, please follow the link and comment on their blog: R - Data Science Heroes Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)