Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Installation
You can install xray from github with:
# install.packages("devtools") devtools::install_github("sicarul/xray")
Usage
Anomaly detection
xray::anomalies
analyzes all your columns for anomalies, whether they are NAs, Zeroes, Infinite, etc, and warns you if it detects variables with at least 80% of rows with those anomalies. It also warns you when all rows have the same value.
Example:
data(longley) badLongley=longley badLongley$GNP=NA xray::anomalies(badLongley) #> Warning in xray::anomalies(badLongley): Found 1 possible problematic variables: #> GNP #> $variables #> Variable q qNA pNA qZero pZero qBlank pBlank qInf pInf qDistinct #> 1 GNP 16 16 1 0 0 0 0 0 0 1 #> 2 GNP.deflator 16 0 0 0 0 0 0 0 0 16 #> 3 Unemployed 16 0 0 0 0 0 0 0 0 16 #> 4 Armed.Forces 16 0 0 0 0 0 0 0 0 16 #> 5 Population 16 0 0 0 0 0 0 0 0 16 #> 6 Year 16 0 0 0 0 0 0 0 0 16 #> 7 Employed 16 0 0 0 0 0 0 0 0 16 #> type anomalous_percent #> 1 Logical 1 #> 2 Numeric 0 #> 3 Numeric 0 #> 4 Numeric 0 #> 5 Numeric 0 #> 6 Integer 0 #> 7 Numeric 0 #> #> $problem_variables #> Variable q qNA pNA qZero pZero qBlank pBlank qInf pInf qDistinct #> 1 GNP 16 16 1 0 0 0 0 0 0 1 #> type anomalous_percent #> 1 Logical 1 #> problems #> 1 Anomalies present in 100% of the rows. Less than 2 distinct values.
Distributions
xray::distributions
tries to analyze the distribution of your variables, so you can understand how each variable is statistically structured. It also returns a percentiles table of numeric variables as a result, which can inform you of the shape of the data.
distrLongley=longley distrLongley$testCategorical=c(rep('One',7), rep('Two', 9)) xray::distributions(distrLongley)
#> Variable p_1 p_10 p_25 p_50 p_75 p_90 #> 1 GNP.deflator 83.78 88.35 94.525 100.6 111.25 114.95 #> 2 GNP 237.8537 258.74 317.881 381.427 454.0855 510.387 #> 3 Unemployed 187.93 201.55 234.825 314.35 384.25 434.4 #> 4 Armed.Forces 147.61 160.3 229.8 271.75 306.075 344.85 #> 5 Population 107.7616 109.2025 111.7885 116.8035 122.304 126.61 #> 6 Year 1947.15 1948.5 1950.75 1954.5 1958.25 1960.5 #> 7 Employed 60.1938 60.7225 62.7125 65.504 68.2905 69.4475 #> p_99 #> 1 116.72 #> 2 549.3859 #> 3 478.725 #> 4 358.695 #> 5 129.7466 #> 6 1961.85 #> 7 70.403
Distributions along a time axis
xray::timebased
also investigates into your distributions, but shows you the change over time, so if there is any change in the distribution over time (For example a variable stops or starts being collected) you can easily visualize it.
dateLongley=longley dateLongley$Year=as.Date(paste0(dateLongley$Year,'-01-01')) dateLongley$Data='Original' ndateLongley=dateLongley ndateLongley$GNP=dateLongley$GNP+10 ndateLongley$Data='Offseted' xray::timebased(rbind(dateLongley, ndateLongley), 'Year')
If you are just starting to learn about Data Science or want to explore further, i encourage you to check this cool book made by my buddy Pablo Casas: The Data Science Live Book
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.