Fitting Distributions to Data with R

emraher

9 years ago

[This article was first published on Category: r | Emrah Er, and kindly contributed to R-bloggers].


In “Fitting Distributions with R”, Vito Ricci writes:

“Fitting distributions consists in finding a mathematical function which represents in a good way a statistical variable. A statistician often is facing with this problem: he has some observations of a quantitative character and he wishes to test if those observations, being a sample of an unknown population, belong from a population with a pdf (probability density function) f(x, θ), where θ is a vector of parameters to estimate with available data.

We can identify 4 steps in fitting distributions:

Model/function choice: hypothesize families of distributions;

Estimate parameters;

Evaluate quality of fit;

Goodness of fit statistical tests.”

In SAS this can be done with proc capability, whereas in R we can do the same thing with fitdistrplus and some other packages. In this post I will try to compare the procedures in R and SAS.

The following code chunk creates 10,000 observations from a normal distribution with a mean of 10 and a standard deviation of 5, then summarizes the data and plots a histogram of it.
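A minimal sketch of such a chunk, using only base R, might look like this. The seed value and the number of histogram bins are assumptions for reproducibility and readability; the post does not state either.

```r
# Seed is an assumption; the original post does not state one
set.seed(123)

# 10,000 draws from a normal distribution with mean 10 and standard deviation 5
x <- rnorm(10000, mean = 10, sd = 5)

# Minimum, quartiles, median, mean and maximum
summary(x)

# Histogram of the simulated data (50 bins is an arbitrary choice)
hist(x, breaks = 50, main = "Histogram of x", xlab = "x")
```

Because the data are random, the numbers change with the seed, so results will only match a SAS run if exactly the same values are exported and imported.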

If we import the data we created in R into SAS and run the following code:

PROC CAPABILITY;
HISTOGRAM x / NORMAL;
RUN;

SAS gives us the following results:

Moments
Basic Statistical Measures (Location and Variability)
Tests for Location
Observed Quantiles
Extreme Observations
Histogram
Parameter Estimates
Goodness-of-Fit Test Results
Estimated Quantiles

We can obtain the same results in R by using the e1071, raster, plotrix, stats, fitdistrplus and nortest packages.

1. Moments

N :

Sum Weights : A numeric variable can be specified as a weight variable to weight the values of the analysis variable. The default weight is 1 for each observation. This field is the sum of the values of the weight variable. In our case, since we didn’t specify a weight variable, SAS uses the default weights, so the sum of weights is the same as the number of observations. (Source)

Mean :

Sum Observations :

Std Deviation :

Skewness :

Kurtosis :

Uncorrected SS : Sum of squared data values. (Source)

Corrected SS : The sum of squared distance of data values from the mean. (Source)

Coeff Variation : The ratio of the standard deviation to the mean. (Source)

Std Error Mean : The estimated standard deviation of the sample mean. (Source)
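The quantities in this table can be sketched in base R as follows. This is an illustration rather than the post's own code: the post names e1071, raster and plotrix for skewness, kurtosis, the coefficient of variation and the standard error, while the version below writes the formulas out so it runs without extra packages. The skewness and kurtosis use the bias-adjusted "type 2" definitions, which the e1071 documentation describes as the ones used in SAS. The seed is an assumption.

```r
set.seed(123)                          # assumed seed, not from the post
x <- rnorm(10000, mean = 10, sd = 5)

n <- length(x)
m <- mean(x)
s <- sd(x)

# Central moments with divisor n, used to build skewness and kurtosis
m2 <- mean((x - m)^2)
m3 <- mean((x - m)^3)
m4 <- mean((x - m)^4)
g1 <- m3 / m2^1.5                      # plain moment skewness
g2 <- m4 / m2^2 - 3                    # plain moment excess kurtosis

moments <- c(
  "N"                = n,
  "Sum Weights"      = n,              # default weight of 1 per observation
  "Mean"             = m,
  "Sum Observations" = sum(x),
  "Std Deviation"    = s,
  "Skewness"         = g1 * sqrt(n * (n - 1)) / (n - 2),
  "Kurtosis"         = ((n + 1) * g2 + 6) * (n - 1) / ((n - 2) * (n - 3)),
  "Uncorrected SS"   = sum(x^2),
  "Corrected SS"     = sum((x - m)^2),
  "Coeff Variation"  = 100 * s / m,    # expressed as a percentage
  "Std Error Mean"   = s / sqrt(n)
)
print(moments)
```

With the packages installed, e1071's skewness(x, type = 2) and kurtosis(x, type = 2), raster's cv(x) and plotrix's std.error(x) should give the corresponding values.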

2. Basic Statistical Measures (Location and Variability)

Range :

Interquartile Range :
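Both measures are available in base R. The sketch below assumes a seed, and it passes type = 2 to IQR() on the assumption that SAS is using its default percentile definition, which I believe corresponds to R's quantile type 2; with 10,000 observations the choice of type makes very little difference.

```r
set.seed(123)                          # assumed seed, not from the post
x <- rnorm(10000, mean = 10, sd = 5)

# Range as a single number: maximum minus minimum
data.range <- diff(range(x))

# Interquartile range: third quartile minus first quartile
data.iqr <- IQR(x, type = 2)

print(c(Range = data.range, "Interquartile Range" = data.iqr))
```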

3. Tests for Location

Student’s t : Skipped this part

Sign : Skipped this part

Signed Rank :
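For the signed rank test, a base R sketch could use wilcox.test() from the stats package. The null location of 0 is assumed here because it is the SAS default; the seed is also an assumption. R reports V, the sum of the ranks of the positive values, while SAS reports a centered statistic S, which as far as I know equals V minus n(n + 1)/4 when there are no zeros or ties.

```r
set.seed(123)                          # assumed seed, not from the post
x <- rnorm(10000, mean = 10, sd = 5)
n <- length(x)

# One-sample Wilcoxon signed rank test of location 0
signed.rank <- wilcox.test(x, mu = 0)
print(signed.rank)

# R's statistic V and the centered form described above
V <- unname(signed.rank$statistic)
S <- V - n * (n + 1) / 4
print(c(V = V, S = S))
```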

4. Observed Quantiles

Quantiles :
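The observed quantiles can be reproduced with quantile() from the stats package. The probability levels below are the ones SAS usually lists in this table, and type = 2 is an assumption about the percentile definition SAS uses by default. The seed is also assumed.

```r
set.seed(123)                          # assumed seed, not from the post
x <- rnorm(10000, mean = 10, sd = 5)

# Levels from 100% (maximum) down to 0% (minimum)
probs <- c(1, 0.99, 0.95, 0.90, 0.75, 0.50, 0.25, 0.10, 0.05, 0.01, 0)

observed.quantiles <- quantile(x, probs = probs, type = 2)
print(observed.quantiles)
```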

5. Extreme Observations : Skipped this part

6. Histogram and Parameter Estimates

Mean (Mu) :

Std Dev (Sigma) :
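A sketch of the normal fit follows. The histogram with the fitted density overlaid uses only base R. The fitdist() call is the one this post relies on, and it is guarded here so the example still runs when fitdistrplus is not installed. The seed and the bin count are assumptions. Note that the maximum-likelihood estimate of sigma divides by n, so it is slightly smaller than sd(x), which divides by n - 1.

```r
set.seed(123)                          # assumed seed, not from the post
x <- rnorm(10000, mean = 10, sd = 5)

# Maximum-likelihood estimates for the normal distribution
mu.hat    <- mean(x)
sigma.hat <- sqrt(mean((x - mu.hat)^2))
print(c("Mean (Mu)" = mu.hat, "Std Dev (Sigma)" = sigma.hat))

# Density-scaled histogram with the fitted normal curve on top
hist(x, breaks = 50, freq = FALSE, main = "Histogram of x", xlab = "x")
curve(dnorm(x, mean = mu.hat, sd = sigma.hat), add = TRUE, lwd = 2)

# Same estimates through fitdistrplus, if the package is available
if (requireNamespace("fitdistrplus", quietly = TRUE)) {
  normal.fit <- fitdistrplus::fitdist(x, "norm")
  print(normal.fit$estimate)           # named vector with elements mean and sd
}
```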

7. Goodness-of-Fit Test Results

Kolmogorov-Smirnov, Cramer-von Mises, and Anderson-Darling

Kolmogorov-Smirnov :

Cramer-von Mises :

Anderson-Darling :

Chi-Square :
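The test statistics can be written out in base R to show what each one measures. This is a sketch under assumptions, not the post's code: the seed, the use of sd(x) for sigma, and the 20 equal-probability classes for the chi-square test are my choices, and SAS bins the chi-square test differently, so that value is not expected to match. For p-values, nortest provides lillie.test(), cvm.test(), ad.test() and pearson.test(), and fitdistrplus provides gofstat() for a fitted object.

```r
set.seed(123)                          # assumed seed, not from the post
x <- rnorm(10000, mean = 10, sd = 5)
n <- length(x)

mu.hat    <- mean(x)
sigma.hat <- sd(x)

# Fitted normal CDF evaluated at the ordered observations
u <- pnorm(sort(x), mean = mu.hat, sd = sigma.hat)
i <- seq_len(n)

# Kolmogorov-Smirnov D: largest gap between empirical and fitted CDF
D <- max(pmax(i / n - u, u - (i - 1) / n))

# Cramer-von Mises W-squared
W2 <- 1 / (12 * n) + sum((u - (2 * i - 1) / (2 * n))^2)

# Anderson-Darling A-squared
A2 <- -n - mean((2 * i - 1) * (log(u) + log(1 - rev(u))))

# Chi-square with k equal-probability classes under the fitted normal
k        <- 20                         # hypothetical number of classes
breaks   <- qnorm(seq(0, 1, length.out = k + 1), mean = mu.hat, sd = sigma.hat)
observed <- as.vector(table(cut(x, breaks = breaks)))
expected <- n / k
chisq    <- sum((observed - expected)^2 / expected)
df       <- k - 1 - 2                  # two parameters were estimated
p.chisq  <- pchisq(chisq, df = df, lower.tail = FALSE)

print(c(D = D, "W-Sq" = W2, "A-Sq" = A2, "Chi-Sq" = chisq, "Pr > Chi-Sq" = p.chisq))
```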

8. Estimated Quantiles : Skipped this part

We can change the commands to fit other distributions. In SAS, this is as simple as replacing normal with something like beta(theta = SOME NUMBER, scale = SOME NUMBER) or weibull. In R, one can change the name of the distribution in the normal.fit <- fitdist(x,"norm") command to the desired distribution name. When fitting densities, you should take the properties of the specific distribution into account. For example, the Beta distribution is defined between 0 and 1, so you may need to rescale your data before fitting it.
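As an illustration of the rescaling step, the sketch below maps the data into the open interval (0, 1) and then fits a Beta distribution with fitdist(). The small margin eps is a hypothetical choice that keeps the minimum and maximum away from exactly 0 and 1, where the Beta likelihood can break down. The seed is assumed, and the fit is guarded so the example still runs without fitdistrplus installed.

```r
set.seed(123)                          # assumed seed, not from the post
x <- rnorm(10000, mean = 10, sd = 5)

# Rescale into (0, 1); eps is a hypothetical margin, not from the post
eps <- 1e-6
x01 <- (x - min(x) + eps) / (max(x) - min(x) + 2 * eps)
print(range(x01))

# Fit a Beta distribution to the rescaled data, if fitdistrplus is available
if (requireNamespace("fitdistrplus", quietly = TRUE)) {
  beta.fit <- fitdistrplus::fitdist(x01, "beta")
  print(beta.fit$estimate)             # named vector with shape1 and shape2
}
```

The estimated shape parameters describe the rescaled data, so any fitted quantile has to be mapped back to the original scale before it is interpreted.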
