Descriptive Statistics in R

finnstats

5 months ago

[This article was first published on Methods – finnstats, and kindly contributed to R-bloggers].

In this tutorial, you’ll learn about descriptive statistics in R, one strategy you might employ in exploratory data analysis.

Before you invest time constructing intricate models, you should first study your data. Calculating descriptive statistics is an easy way to do so.

Descriptive Statistics in R

Descriptive statistical analysis aids in describing the fundamental characteristics of a dataset and gives a brief description of the sample and data measurements.

One way to do this is with the summarise() function from dplyr, which is part of the tidyverse.

The summarise() function is frequently used in conjunction with group_by() to collapse each group into a single-row summary.

The names of the columns that hold the categorical variables you want summary statistics for are passed as arguments to group_by().

Let’s take an airline data set for analysis.

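library(tidyverse)  # attaches dplyr and ggplot2, among others

# Read the airline data; the first row holds the column names
data <- read.csv("D:/RStudio/Airlinedata.csv", header = TRUE)

# Mean and standard deviation of the arrival delay for each airline
summary <- data %>%
  group_by(Reporting_Airline) %>%
  summarize(mean = mean(ArrDelayMinutes, na.rm = TRUE),
            std_dev = sd(ArrDelayMinutes, na.rm = TRUE))
summary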
    Reporting_Airline  mean   std_dev
1                AA 10.12226  24.95855
2                AS 12.91111  25.59903
3                B6 18.55039  47.14758
4                DL 13.83650  48.08742
5                HP 19.21429  25.31700
6            PA (1) 33.54545 118.73265
7                TW 15.59459  36.82041
8                UA 11.73462  27.09474
9                VX 14.93798  31.14125

The mean, total number of data points, standard deviation, quartiles, and extreme values can all be displayed in summary statistics.
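As a reproducible sketch of such a fuller summary, the example below uses the mtcars data set that ships with R instead of the airline file, which is an assumption made purely for illustration. The grouping column cyl and the measured column mpg stand in for Reporting_Airline and ArrDelayMinutes.

```r
library(dplyr)

# mtcars ships with R; cyl (number of cylinders) is the grouping variable here
mpg_summary <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    n       = n(),                                # number of data points in the group
    mean    = mean(mpg, na.rm = TRUE),
    std_dev = sd(mpg, na.rm = TRUE),
    min     = min(mpg, na.rm = TRUE),             # smallest value
    q25     = quantile(mpg, 0.25, na.rm = TRUE),  # lower quartile
    median  = median(mpg, na.rm = TRUE),
    q75     = quantile(mpg, 0.75, na.rm = TRUE),  # upper quartile
    max     = max(mpg, na.rm = TRUE)              # largest value
  )
mpg_summary
```

Each row of mpg_summary describes one cylinder group, so the result has as many rows as there are distinct values of cyl.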

These statistics skip NA values because na.rm = TRUE is passed to mean() and sd(); without that argument, both functions return NA whenever a value is missing. The summarize() function helps you understand how your variables are distributed.
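The role of the na.rm = TRUE argument used in the code above can be seen in a small base R sketch. The delays vector is made up for illustration and is not taken from the airline data.

```r
# Made-up delays in minutes; one value is missing
delays <- c(5, NA, 20, 35)

mean(delays)                 # NA: the missing value propagates
mean(delays, na.rm = TRUE)   # 20: computed from 5, 20 and 35
sd(delays, na.rm = TRUE)     # 15
```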

Your dataset can also include categorical variables. These are variables with discrete values that can be classified into different categories or groups.

The reporting airline, for example, is a categorical variable in this dataset, with the following categories: UA, AS, DL, and six others.

Using the count() function is one technique to summarise categorical data.

data %>%
  count(Reporting_Airline)
   Reporting_Airline    n
1                AA 1096
2                AS   45
3                B6  258
4                DL  526
5                HP   14
6            PA (1)   33
7                TW  185
8                UA  569
9                VX  129
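For a version that runs without the airline file, here is a minimal sketch on the built-in mtcars data, where cyl is assumed to play the role of the categorical variable. The sort = TRUE argument of count() lists the largest group first, and table() is a base R alternative.

```r
library(dplyr)

# Count the rows per category, largest group first
cyl_counts <- mtcars %>%
  count(cyl, sort = TRUE)
cyl_counts

# Base R equivalent that needs no extra packages
table(mtcars$cyl)
```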

Boxplot

Boxplots are a wonderful way to visualize numeric data since they let you see how the data are distributed. A boxplot marks the data’s median, the value of the middle data point.

Let’s take a typical boxplot for illustration purposes.

The 75th percentile is represented by the Upper Quartile, whereas the 25th percentile is represented by the Lower Quartile.

The Interquartile Range (IQR) covers the data between the Lower and Upper Quartiles; its width is the Upper Quartile minus the Lower Quartile.

The Lower and Upper Extremes are next.

The Upper Extreme lies 1.5 times the IQR above the 75th percentile, and the Lower Extreme lies 1.5 times the IQR below the 25th percentile. In practice, R draws each whisker at the most extreme data point that still falls within these limits.

Finally, outliers appear as separate dots beyond the upper and lower extremes on boxplots.

Outliers, as well as the distribution and skewness of the data, can be clearly identified using boxplots.
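The quartiles and the 1.5 times IQR rule described above can be computed directly. This is a minimal sketch that assumes the built-in mtcars data rather than the airline data; the names lower_limit, upper_limit and p_box are illustrative only.

```r
library(ggplot2)

x <- mtcars$hp                 # horsepower from the built-in mtcars data

q1  <- quantile(x, 0.25)       # Lower Quartile (25th percentile)
q3  <- quantile(x, 0.75)       # Upper Quartile (75th percentile)
iqr <- q3 - q1                 # Interquartile Range

lower_limit <- q1 - 1.5 * iqr
upper_limit <- q3 + 1.5 * iqr

# Values beyond the limits are the ones a boxplot draws as separate dots
outliers <- x[x < lower_limit | x > upper_limit]
outliers

# One box per cylinder group
p_box <- ggplot(mtcars, aes(x = factor(cyl), y = hp)) +
  geom_boxplot() +
  labs(x = "Cylinders", y = "Horsepower")
p_box
```

Note that the limits above are computed over all cars at once, whereas the plot applies the same rule separately within each cylinder group.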

Scatterplot

A scatter plot is an excellent method to represent the relationship between two variables.

In a scatter plot, each observation is represented by a point.

Let’s take a typical scatter plot for illustration purposes.

The predictor variable is the one that you’re using to make a prediction.

The variable you’re trying to forecast is known as the target variable.

In a scatterplot, the predictor variable is usually plotted on the x-axis (horizontal axis), while the target variable is plotted on the y-axis (vertical axis).
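A scatter plot following this convention might look like the sketch below. It assumes the built-in mtcars data, with car weight (wt) as the predictor and fuel efficiency (mpg) as the target; these choices are for illustration and are not part of the airline analysis.

```r
library(ggplot2)

# Predictor (car weight) on the x-axis, target (fuel efficiency) on the y-axis
p_scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")
p_scatter

# Pearson correlation as a one-number summary of the linear relationship
wt_mpg_cor <- cor(mtcars$wt, mtcars$mpg)
wt_mpg_cor
```

The correlation is negative here, matching the downward trend in the plot: heavier cars tend to travel fewer miles per gallon.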

In this tutorial, you learned that descriptive statistics describe the basic characteristics of data and provide brief summaries of the sample and data measurements.

You can visualize data distributions and correlations using boxplots and scatter plots.

The post Descriptive Statistics in R appeared first on finnstats.
