Some Basic Concepts about Design of Experiments and How to Perform their Analysis in R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Basic design of experiments in R for one factor and two factors designs.
You can find all the code, data and results in the GitHub repository for this post: Basic design of experiments.
There is no signal without noise
It never hurts to go back to basics before tackling more complex things. The purpose of this post is to give a brief overview of the basics of design of experiments, their analysis and how to present results using R and packages like ggplot2 and agricolae. Included are one- and two-factor experiments.
What is the design of experiments?
The design of experiments (DOE) deals with the planning and performance of tests with the objective of generating data. Statistical analysis of these data will provide objective evidence that will allow the researcher to resolve questions about a given situation, process or phenomenon.
DOE can be applied to scientific research and in industry. Its goal is to generate knowledge and learning in an efficient manner. Ideally, this process can be considered as a cycle where initial hypotheses are refined as more information is obtained. Thus, we start with an initial hypothesis that we test through experimentation. If the data (our evidence) does not agree with the hypothesis, a new hypothesis is sought to explain the observed discrepancy.
Hypothesis?
Hypotheses are tentative explanations of the research phenomenon that are stated as propositions or assertions. There are different types of hypotheses, such as:
- Research hypotheses. Tentative propositions about possible relationships between two or more variables.
- Null hypothesis. Propositions that deny or refute the relationship between variables.
- Alternative hypotheses. These are different or “alternate” possibilities to the research and null hypotheses.
Data and measurements
A datum is a number that is the product of a measurement. A measurement is a process that links abstract concepts with empirical indicators. Of course, measurements could not be made without a measurement instrument, which can be defined as the resource used by the researcher to record information or data.
Other basic definitions in the design of experiments
Experiment
It is a change in the operating conditions of a system or process, which is made with the objective of measuring the effect of the change on one or more properties of the product or result.
Experimental unit
Piece(s) or sample(s) used to generate a value that is representative of the test result.
Response variable (Depedent variable)
Through these variable(s) we know the effect or results of each experimental test. They are the product of our measurements once we make changes in the system.
Controllable factors (Independent variable)
They are process variables that can be modified and fixed at a given level.
Non-controllable factors
These are variables that cannot be controlled during the experiment or the normal operation of the process.
Factors studied
These are the variables that are investigated in the experiment to observe how they affect or influence the response variables.
Levels and treatments
The different values assigned to each factor studied in an experimental design are called levels. A combination of levels of all the factors studied is called a treatment or design point. In the case of experimenting with a single factor, each level is a treatment.
Design matrix
It is the arrangement formed by the treatments that will be carried out, including the repetitions. In this same array are usually included the results of each response variable for each treatment and repetition. For me this is a concept of great importance, since from this matrix it is possible to perform directly the different statistical tests in R.
Random error
It is the observed variability that cannot be explained by the factors studied. It is the result of the effect of factors not studied and experimental error.
Experimental error
A component of random error that reflects the experimenter’s errors in the planning and execution of the experiment.
Basic principles of experimental design
- Randomization. Performing the experimental runs (treatments) in a randomized manner is essential to ensure, as far as possible, the assumption of independence of errors.
- Repetition. It is running a treatment or combination of factors more than once. This makes it possible to estimate random error and in turn calculate realistic statistics in data analysis.
- Blocking. This consists of nullifying the effect of factors in which we are not really interested and which may affect the response variables. Not taking into account their effect can lead us to erroneous conclusions about the factors in which we are interested.
Some concepts of statistical inference
- Population. In a very general way, it can be defined as the totality of possible individuals, specimens, objects or measurements of interest on which a study is made. Populations can be finite or infinite.
- Parameters. Characteristics that, by means of their numerical value, describe the entire population.
- Representative sample. An appropriately selected part of a population that retains key aspects of a population.
- Statistic. A quantity obtained from the data of a sample that summarizes the characteristics of the sample.
- Statistical inference. These are statistical statements about the population or process based on the information contained in the sample.
- Probability distribution. It relates the set of values of a characteristic in a given population to the probability associated with each of these values.
Some hypothesis testing concepts
- Statistical hypothesis. An assertion about the values of the parameters of a population or process, which can be tested from the information in a sample.
- Test statistic. Formula with which a number is calculated from the data, the magnitude of which makes it possible to determine whether or not the null hypothesis is rejected.
- Acceptance region. These are the possible values of the test statistic where the null hypothesis is not rejected.
- Rejection region. It is the set of possible values of the test statistic that lead to rejecting the null hypothesis.
- Type I error. It is when a null hypothesis that is true is rejected. It can also be called false positive.
- Type II error. This is when a null hypothesis that is false is accepted. It is basically a false negative.
- Power of the test. It is the probability of rejecting the null hypothesis when it is false.
- Predefined significance. It is the maximum risk that the experimenter is willing to take with respect to the type I error. It is denoted by α.
- Observed significance. It is the area under the reference distribution beyond the value of the test statistic. It is called p-value.
Experiments with a single factor
When the objective is to study the effect of a single factor on the mean of a response variable taking into account more than two values of the factor under consideration, it is advisable to perform a completely randomized experiment and analyze the results by analysis of variance (ANOVA). The reason for comparing means by ANOVA and not by Student’s t-tests is that the latter test increases false positives as the number of means to be compared increases, i.e., we will find differences between means where there are none.
The analysis of variance consists of separating the total variation observed in each of the sources that contribute to it.
Design and results
Obtaining a single factor design is very simple using R commands. Let’s say we want to compare the effect of three baking times (35, 45 and 60 min) in the average diameter of cookie batches. For each time we baked ten cookie batches so we have ten replicates. Note that each time and replicate have to be run in aleatory order:
set.seed(6) b_time <- rep(c(35, 45, 60), each = 10) b_time <- sample(b_time) run_order <- 1:30 one_fct_dgn <- data.frame( run_order, b_time ) head(one_fct_dgn) ## run_order b_time ## 1 1 60 ## 2 2 35 ## 3 3 45 ## 4 4 45 ## 5 5 60 ## 6 6 35
Subsequently, the design can be exported in some other format such as CSV:
write.csv(one_fct_dgn, "data/one_fct_dgn.csv", row.names = FALSE)
The results can be integrated into the design matrix directly, let’s simulate the average diameter with a small function:
av_diam <- function(diameter = 45) { SD <- 2.5 if (diameter == 35) rnorm(1, 15, sd = SD) else if (diameter == 45) rnorm(1, 20, sd = SD) else if (diameter == 60) rnorm(1, 23, sd = SD) } set.seed(3) diameter <- mapply(av_diam, one_fct_dgn$b_time) one_fct_dgn$diameter <- diameter head(one_fct_dgn) ## run_order b_time diameter ## 1 1 60 20.59517 ## 2 2 35 14.26869 ## 3 3 45 20.64697 ## 4 4 45 17.11967 ## 5 5 60 23.48946 ## 6 6 35 15.07531
Or, once the results have been recorded externally, they can be imported from a CSV file:
exp_one_factor <- read.csv("data/exp_one_factor.csv", sep = ";") head(exp_one_factor) ## factor resp ## 1 1 264 ## 2 2 208 ## 3 3 220 ## 4 4 217 ## 5 1 260 ## 6 2 220
Analysis
The aov
function is used for the analysis of factorial designs. Previously, it is recommended to convert the factor values into the factor class and then perform the analysis. Here I going to analyze our first design (one_fct_dgn
):
one_fct_dgn$b_time <- as.factor(one_fct_dgn$b_time) one_fct_aov <- aov(diameter ~ b_time, data = one_fct_dgn)
Note the aov
requires a formula
that explicitly specifies the relationship between the response and the factor(s) in the design. Once this has been done, it is possible to obtain an ANOVA table using the “anova” function:
one_fct_anova <- anova(one_fct_aov) one_fct_anova ## Analysis of Variance Table ## ## Response: diameter ## Df Sum Sq Mean Sq F value Pr(>F) ## b_time 2 367.51 183.755 44.902 2.587e-09 *** ## Residuals 27 110.49 4.092 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
And export it in the format of your choice:
write.csv(one_fct_anova, "results/anova_one_fct.csv", na = "", row.names = FALSE)
Multiple range comparisons or tests
Once significant differences are established between at least two of the means, the next step is to perform a multiple comparison test, which will allow us to define which means are significantly different. There are several methods that can be used as the LSD method, Tukey’s method, or Duncan’s method. Nice functions for these methods can be found in agricolae
package.
LSD
library(agricolae) one_fct_lsd <- LSD.test(one_fct_aov, "b_time") one_fct_lsd ## $statistics ## MSerror Df Mean CV t.value LSD ## 4.09237 27 18.75029 10.78896 2.051831 1.856282 ## ## $parameters ## test p.ajusted name.t ntr alpha ## Fisher-LSD none b_time 3 0.05 ## ## $means ## diameter std r LCL UCL Min Max Q25 Q50 ## 35 14.46991 1.566410 10 13.15732 15.78250 12.17195 17.79153 13.48176 14.24977 ## 45 18.73777 2.048728 10 17.42518 20.05036 15.83381 22.90154 17.27960 18.38145 ## 60 23.04320 2.371958 10 21.73061 24.35579 19.95286 26.16842 20.74760 23.43479 ## Q75 ## 35 15.17899 ## 45 19.73752 ## 60 25.02251 ## ## $comparison ## NULL ## ## $groups ## diameter groups ## 60 23.04320 a ## 45 18.73777 b ## 35 14.46991 c ## ## attr(,"class") ## [1] "group"
Tukey HSD
one_fct_hsd <- HSD.test(one_fct_aov, "b_time") one_fct_hsd ## $statistics ## MSerror Df Mean CV MSD ## 4.09237 27 18.75029 10.78896 2.243118 ## ## $parameters ## test name.t ntr StudentizedRange alpha ## Tukey b_time 3 3.506426 0.05 ## ## $means ## diameter std r Min Max Q25 Q50 Q75 ## 35 14.46991 1.566410 10 12.17195 17.79153 13.48176 14.24977 15.17899 ## 45 18.73777 2.048728 10 15.83381 22.90154 17.27960 18.38145 19.73752 ## 60 23.04320 2.371958 10 19.95286 26.16842 20.74760 23.43479 25.02251 ## ## $comparison ## NULL ## ## $groups ## diameter groups ## 60 23.04320 a ## 45 18.73777 b ## 35 14.46991 c ## ## attr(,"class") ## [1] "group"
Duncan
one_fct_duncan <- duncan.test(one_fct_aov, "b_time") one_fct_duncan ## $statistics ## MSerror Df Mean CV ## 4.09237 27 18.75029 10.78896 ## ## $parameters ## test name.t ntr alpha ## Duncan b_time 3 0.05 ## ## $duncan ## Table CriticalRange ## 2 2.901727 1.856282 ## 3 3.048662 1.950279 ## ## $means ## diameter std r Min Max Q25 Q50 Q75 ## 35 14.46991 1.566410 10 12.17195 17.79153 13.48176 14.24977 15.17899 ## 45 18.73777 2.048728 10 15.83381 22.90154 17.27960 18.38145 19.73752 ## 60 23.04320 2.371958 10 19.95286 26.16842 20.74760 23.43479 25.02251 ## ## $comparison ## NULL ## ## $groups ## diameter groups ## 60 23.04320 a ## 45 18.73777 b ## 35 14.46991 c ## ## attr(,"class") ## [1] "group"
The results are consistent among the three methods. The section groups indicates which means are different, different letters mean significant differences. You can export the results of this kind of methods, as a TXT file, with the function capture.output
. For example, to export the results of the Tukey method:
capture.output(one_fct_hsd, file = "results/hsd_one_fct.txt")
Verification of model assumptions
Something that is not very widespread in the scientific literature is the verification of the normality of the residuals, the equality between variances for the means of each treatment and the independence between each data recorded. Verification of these assumptions helps to validate the differences established as significant.
Shapiro-Wilks test to verify normality
This test is easy to perform:
one_fct_shapiro <- shapiro.test(one_fct_aov$residuals) one_fct_shapiro ## ## Shapiro-Wilk normality test ## ## data: one_fct_aov$residuals ## W = 0.96561, p-value = 0.4271
Here, if p-value > 0.05, we can say that data come, approximately, from a normal distribution.
The results of this analysis can be exported with the capture.output
function:
capture.output(one_fct_shapiro, file = "results/shapiro_one_fct.txt")
Bartlett’s test for equality of variances
To perform this test it is possible to use the bartlett.test
function:
one_fct_bt <- bartlett.test(diameter ~ b_time, data = one_fct_dgn) one_fct_bt ## ## Bartlett test of homogeneity of variances ## ## data: diameter by b_time ## Bartlett's K-squared = 1.4403, df = 2, p-value = 0.4867
A p-value > 0.05 mean that treatment means have equal variances. We can export the results in the same way:
capture.output(one_fct_bt, file = "results/bartlett_one_fct.txt")
Independence
To verify independence between observations, residuals are plotted as a function of run order. If the residuals follow a defined pattern it will be a clear indication of lack of independence. For this plot I used the ggplot2 package:
library(ggplot2) one_fct_dgn$residuals <- one_fct_aov$residuals ggplot(one_fct_dgn, aes(run_order, residuals)) + geom_point() + theme_classic() + xlab("Run order") + ylab("Residuals")
Tests such as the Durbin-Watson test are also available to verify this assumption. Use the function durbinWatsonTest
in the car
package:
library(car) one_fct_dw <- durbinWatsonTest(one_fct_aov) one_fct_dw ## lag Autocorrelation D-W Statistic p-value ## 1 -0.1264564 2.176091 0.658 ## Alternative hypothesis: rho != 0
A p-value > 0.05 means that data observations are not auto correlated. You can also save the results in the usual way:
capture.output(one_fct_dw, file = "results/durbin_watson_one_fct.txt")
Display of results
Bar graph
To visualize the results of this type of experiments, bar graphs are usually used, whose height represents the magnitude of each mean, together with error bars representing the standard deviation and letters showing the significant differences established by the multiple comparisons test:
library(dplyr) # Means, standard deviations and groups from Tukey HSD one_fct_means <- one_fct_dgn %>% group_by(b_time) %>% summarise( av_diam = mean(diameter), sd_diam = sd(diameter) ) %>% ungroup() %>% arrange(desc(av_diam)) %>% mutate(group = unlist(one_fct_hsd$groups["groups"])) # Bar plot ggplot(one_fct_means, aes(b_time, av_diam)) + geom_col(width = 0.5) + geom_errorbar( aes(ymin = av_diam - sd_diam, ymax = av_diam + sd_diam), width = 0.1 ) + geom_text(aes(label = group, y = av_diam + sd_diam), vjust = -0.5) + xlab("Baking time (min)") + ylab("Average diameter") + theme_classic()
Alternative
Although bar graphs are widely used in the scientific literature of various types, their use hides the true dispersion of the data for each treatment. One thing I have also noticed is that people tend to judge differences between means by the overlap or separation of standard deviations, which seems to me to be more of a bias than an objective criterion.
As an alternative to the bar chart, it is possible to make a graph showing each observation, the mean, and the letters obtained in the multiple comparisons test:
ggplot(one_fct_dgn, aes(b_time, diameter)) + geom_point() + stat_summary(fun = mean, geom = "crossbar", width = 0.2, color = "blue") + geom_text( data = one_fct_means, aes(label = group, y = av_diam), hjust = -3 ) + xlab("Baking time (min)") + ylab("Diameter") + theme_classic()
Two-factor designs
In this type of design the objective is to verify the individual and interaction effect of two factors. Before showing how to obtain and analyze them, it is necessary to recall some concepts:
- Qualitative factor. Its levels take discrete or nominal values.
- Quantitative factor. Its levels can take any value within a certain interval. The scale is continuous.
- Effect of a factor. It is the change observed in the response variable due to a change in the level of the factor.
- Interaction effect. Two factors interact significantly on the response variable when the effect of one depends on the level of the other.
Design and results
To obtain this kind of designs we can use the function fac.design
from DoE.base
package. For this we need specify the number of factors and the number of levels of each factor:
library(DoE.base) two_fct_dgn <- fac.design( nlevels = 3, nfactors = 2, replications = 3, seed = 4 ) head(two_fct_dgn) ## A B Blocks ## 1 2 3 .1 ## 2 3 1 .1 ## 3 3 3 .1 ## 4 1 3 .1 ## 5 1 2 .1 ## 6 3 2 .1
For more details about this function, just type ?fac.design
in your console. Also note that there is an extra column with the name “Blocks”, this is not big deal since there is just one and we won’t taking it into account further in the analysis. In future posts I will address block designs.
This design can also be exported in the desired format:
write.csv(two_fct_dgn, file = "data/two_fct_dgn.csv", row.names = FALSE)
The results can also be integrated directly or imported once recorded in the design matrix. Here, I’m going to import and work with data previously recorded:
exp_two_factor <- read.csv("data/exp_two_factor.csv", sep = ";") head(exp_two_factor) ## run_order factor1 factor2 res ## 1 1 0.15 0.2 74 ## 2 2 0.15 0.2 64 ## 3 3 0.15 0.2 60 ## 4 4 0.18 0.2 79 ## 5 5 0.18 0.2 68 ## 6 6 0.18 0.2 73
Analysis
To analyze this type of design, the “aov” function is used, specifying each main effect and the interaction effect in the formula:
exp_two_factor$factor1 <- as.factor(exp_two_factor$factor1) exp_two_factor$factor2 <- as.factor(exp_two_factor$factor2) two_fct_aov <- aov( res ~ factor1 + factor2 + factor1:factor2, data = exp_two_factor )
You can also use the short form Y ~ A*B
, which will be displayed in full once you get the ANOVA table:
two_fct_aov <- aov(res ~ factor1*factor2, data = exp_two_factor) two_fct_anova <- anova(two_fct_aov) two_fct_anova ## Analysis of Variance Table ## ## Response: res ## Df Sum Sq Mean Sq F value Pr(>F) ## factor1 3 2125.11 708.37 24.6628 1.652e-07 *** ## factor2 2 3160.50 1580.25 55.0184 1.086e-09 *** ## factor1:factor2 6 557.06 92.84 3.2324 0.01797 * ## Residuals 24 689.33 28.72 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 write.csv(two_fct_anova, file = "results/anova_two_fct.csv", na = "")
Note that previously I also converted the values of each factor in the factor class.
Multiple range comparisons tests
Once a significant interaction effect is established, the next step is to establish which means differ. For this, a Tukey test can be used, for which the interaction effect must first be explicitly specified in a separated aov model:
two_fct_aov_2 <- aov( res ~ interaction(factor1, factor2), data = exp_two_factor ) two_fct_hsd <- HSD.test(two_fct_aov_2, "interaction(factor1, factor2)") two_fct_hsd ## $statistics ## MSerror Df Mean CV MSD ## 28.72222 24 94.33333 5.681249 15.77773 ## ## $parameters ## test name.t ntr StudentizedRange alpha ## Tukey interaction(factor1, factor2) 12 5.09913 0.05 ## ## $means ## res std r Min Max Q25 Q50 Q75 ## 0.15.0.2 66.00000 7.211103 3 60 74 62.0 64 69.0 ## 0.15.0.25 88.66667 3.055050 3 86 92 87.0 88 90.0 ## 0.15.0.3 99.66667 2.081666 3 98 102 98.5 99 100.5 ## 0.18.0.2 73.33333 5.507571 3 68 79 70.5 73 76.0 ## 0.18.0.25 96.66667 8.082904 3 88 104 93.0 98 101.0 ## 0.18.0.3 99.33333 4.509250 3 95 104 97.0 99 101.5 ## 0.21.0.2 87.33333 5.033223 3 82 92 85.0 88 90.0 ## 0.21.0.25 100.66667 6.658328 3 95 108 97.0 99 103.5 ## 0.21.0.3 105.66667 5.859465 3 99 110 103.5 108 109.0 ## 0.24.0.2 99.66667 4.041452 3 96 104 97.5 99 101.5 ## 0.24.0.25 104.33333 5.507571 3 99 110 101.5 104 107.0 ## 0.24.0.3 110.66667 3.511885 3 107 114 109.0 111 112.5 ## ## $comparison ## NULL ## ## $groups ## res groups ## 0.24.0.3 110.66667 a ## 0.21.0.3 105.66667 a ## 0.24.0.25 104.33333 ab ## 0.21.0.25 100.66667 abc ## 0.15.0.3 99.66667 abc ## 0.24.0.2 99.66667 abc ## 0.18.0.3 99.33333 abc ## 0.18.0.25 96.66667 abc ## 0.15.0.25 88.66667 bcd ## 0.21.0.2 87.33333 cd ## 0.18.0.2 73.33333 de ## 0.15.0.2 66.00000 e ## ## attr(,"class") ## [1] "group"
The results of Tukey’s test can be exported directly:
capture.output(two_fct_hsd, file = "results/hsd_two_fct.txt")
Verification of model assumptions
For the verification of the assumptions it is possible to proceed in a manner similar to that of the single-factor designs.
Shapiro-Wilks test to verify normality
two_fct_shapiro <- shapiro.test(two_fct_aov$residuals) two_fct_shapiro ## ## Shapiro-Wilk normality test ## ## data: two_fct_aov$residuals ## W = 0.97051, p-value = 0.4397 capture.output(two_fct_shapiro, file = "results/shapiro_two_fct.txt")
Bartlett’s test for equality of variances
For Bartlett test I previously made an extra column specifying the combination between factors (treatments). This will indicate the groups to bartlett.test
function:
exp_two_factor$treatment <- mapply( paste, exp_two_factor$factor1, exp_two_factor$factor2 ) two_fct_bartlett <- bartlett.test(res ~ treatment, data = exp_two_factor) two_fct_bartlett ## ## Bartlett test of homogeneity of variances ## ## data: res by treatment ## Bartlett's K-squared = 4.6743, df = 11, p-value = 0.9459 capture.output(two_fct_bartlett, file = "results/bartlett_two_fct.txt")
Independence
exp_two_factor$residuals <- two_fct_aov$residuals ggplot(exp_two_factor, aes(run_order, residuals)) + geom_point() + theme_classic() + xlab("Run order") + ylab("Residuals")
two_fct_dw <- durbinWatsonTest(two_fct_aov) two_fct_dw ## lag Autocorrelation D-W Statistic p-value ## 1 -0.3920052 2.671663 0.828 ## Alternative hypothesis: rho != 0 capture.output(one_fct_dw, file = "results/durbin_watson_two_fct.txt")
Display of results
Bar graph
The realization of this type of graphs requires keeping in mind the individual means of each treatment and the interaction effects, if there is some. To calculate them together with the standard deviations we used the “group_by” function and then “summarise” from the “dplyr” package:
# Means, standard deviations and groups from Tukey HSD two_fct_means <- exp_two_factor %>% group_by(factor1, factor2) %>% summarise( av_res = mean(res), sd_res = sd(res) ) %>% ungroup() %>% arrange(desc(av_res)) %>% mutate(group = unlist(two_fct_hsd$groups["groups"])) # Bar plot ggplot(two_fct_means, aes(factor1, av_res, fill = factor2)) + geom_col(width = 0.5, position = position_dodge()) + geom_errorbar( aes(ymin = av_res - sd_res, ymax = av_res + sd_res), width = 0.1, position = position_dodge(width = 0.5) ) + geom_text( aes(label = group, y = av_res + sd_res), vjust = -0.5, position = position_dodge(width = 0.5) ) + labs(fill = "Factor 2") + xlab("Factor 1") + ylab("Average response") + theme_classic()
Alternative
To make the graph showing each observation, the mean, a confidence interval and the letters obtained in the multiple comparisons test, it is necessary to use the data.frame with the observations next to the one with the means:
ggplot(exp_two_factor, aes(factor1, res, group = factor2)) + geom_point(aes(color = factor2), position = position_dodge(width = 0.9)) + stat_summary( fun = mean, geom = "crossbar", width = 0.2, color = "blue", position = position_dodge(width = 0.9) ) + geom_text( data = two_fct_means, aes(label = group, y = av_res), vjust = -4, size = 3.5, position = position_dodge(width = 0.9) ) + ylim(c(55, 130)) + xlab("Factor 1") + ylab("Response") + labs(color = "Factor 2") + theme_classic()
## Interaction plot
A good way to visualize the interaction effect is through interaction.plot
function:
Factor1 <- exp_two_factor$factor1 Factor2 <- exp_two_factor$factor2 Response <- exp_two_factor$res interaction.plot(Factor1, Factor2, Response)
That’s it! Thank you very much for visiting this site, I hope you find the content of this post useful. See you soon!
Juan Pablo Carreón Hidalgo 🤓
[email protected]
https://twitter.com/JuanPa2601
The text and code on this tutorial is under Creative Commons Attribution 4.0 International License.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.