[This article was first published on Statistic on aiR, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Suppose we want to create a polynomial that can approximate better the following dataset on the population of a certain Italian city over 10 years. The table summarizes the data:Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
$$\begin{tabular}{|1|1|}\hline Year & Population\\ \hline 1959&4835\\ 1960&4970\\ 1961&5085\\ 1962&5160\\ 1963&5310\\ 1964&5260\\ 1965&5235\\ 1966&5255\\ 1967&5235\\ 1968&5210\\ 1969&5175\\ \hline \end{tabular}$$
First we import the data into R:
Year <- c(1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969) Population < - c(4835, 4970, 5085, 5160, 5310, 5260, 5235, 5255, 5235, 5210, 5175)
Now we create the dataframe named sample1:
sample1 <- data.frame(Year, Population) sample1 Year Population 1 1959 4835 2 1960 4970 3 1961 5085 4 1962 5160 5 1963 5310 6 1964 5260 7 1965 5235 8 1966 5255 9 1967 5235 10 1968 5210 11 1969 5175
At this point may be useful to chart these values, to observe the trend and take an idea of the final polynomial function. For convenience we modify the column
Year
, creating a neighborhood of zero, thus:sample1$Year <- sample1$Year - 1964 sample1 Year Population 1 -5 4835 2 -4 4970 3 -3 5085 4 -2 5160 5 -1 5310 6 0 5260 7 1 5235 8 2 5255 9 3 5235 10 4 5210 11 5 5175
Put the values on a chart
plot(sample1$Year, sample1$Population, type="b")
At this point we can start with the search for a polynomial model that adequately approximates our data. First, we specify that we want a polynomial function of X, ie a raw polynomial , is different from the orthogonal polynomial. This is an important addition because the controls and the results will change in the two cases R. So we want a function of X like:
$$f(x)=\beta_0+\beta_1x+\beta_2x^2+\beta_3x^3+ ... +\beta_nx^n$$
At what degree of the polynomial stop? Depends on the degree of precision that we seek. The greater the degree of the polynomial, the greater the accuracy of the model, but the greater the difficulty in calculating; we must also verify the significance of coefficients that are found. But let's get straight to the point.
In R for fitting a polynomial regression model (not orthogonal), there are two methods, among them identical. Suppose we seek the values of beta coefficients for a polynomial of degree 1, then 2nd degree, and 3rd degree:
fit1 <- lm(sample1$Population ~ sample1$Year) fit2 <- lm(sample1$Population ~ sample1$Year + I(sample1$Year^2)) fit3 < - lm(sample1$Population ~ sample1$Year + I(sample1$Year^2) + I(sample1$Year^3))
Or we can write more quickly, for polynomials of degree 2 and 3:
fit2b <- lm(sample1$Population ~ poly(sample1$Year, 2, raw=TRUE)) fit3b < - lm(sample1$Population ~ poly(sample1$Year, 3, raw=TRUE))
The function
poly
is useful if you want to get a polynomial of high degree, because it avoids explicitly write the formula. If we specify raw=TRUE
, the two methods provide the same output, but if we do not specify raw=TRUE
(or rgb(153, 0, 0);">raw=F), the function poly
give us the values of the beta parameters of an orthogonal polynomials, which is different from the general formula I wrote above, although the models are both effective.Let's look at the output.
summary(fit2) ## or summary(fit2b) Call: lm(formula = sample1$Population ~ sample1$Year + I(sample1$Year^2)) Residuals: Min 1Q Median 3Q Max -46.888 -18.834 -3.159 2.040 86.748 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 5263.159 17.655 298.110 < 2e-16 *** sample1$Year 29.318 3.696 7.933 4.64e-05 *** I(sample1$Year^2) -10.589 1.323 -8.002 4.36e-05 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 38.76 on 8 degrees of freedom Multiple R-squared: 0.9407, Adjusted R-squared: 0.9259 F-statistic: 63.48 on 2 and 8 DF, p-value: 1.235e-05
The output of
summary(fit2b)
is the same. We obtained the values of beta0 (5263,159), beta1 (29,318) and beta2 (-10,589), which appear to be significant AII 3. The equation of polynomial of degree 2 of our model is:$$f(x)=5263.1597+29.318x-10.589x^2$$
If we want a polynomial of 3rd degree, we have:
summary(fit3) ## of summary(fit3b) Call: lm(formula = sample1$Population ~ sample1$Year + I(sample1$Year^2) + I(sample1$Year^3)) Residuals: Min 1Q Median 3Q Max -32.774 -14.802 -1.253 3.199 72.634 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 5263.1585 15.0667 349.324 4.16e-16 *** sample1$Year 14.3638 8.1282 1.767 0.1205 I(sample1$Year^2) -10.5886 1.1293 -9.376 3.27e-05 *** I(sample1$Year^3) 0.8401 0.4209 1.996 0.0861 . --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 33.08 on 7 degrees of freedom Multiple R-squared: 0.9622, Adjusted R-squared: 0.946 F-statistic: 59.44 on 3 and 7 DF, p-value: 2.403e-05
The equation is:
$$f(x)=5263.1585+14.3638x-10.5886x^2+0.8401x^3$$
In the latter case, however, the coefficients beta1 and beta3 are not significant, then the best model is the polynomial of 2nd degree. Furthermore look at the Multiple R-squared: in the 2nd degree model it is 94.07%, while in the 3rd degree model it is 96.22%. It seems that there has been an increase in accuracy of the model, but it is a significant increase? We can compare the two model with an ANOVA table:
anova(fit2, fit3) Analysis of Variance Table Model 1: sample1$Population ~ sample1$Year + I(sample1$Year^2) Model 2: sample1$Population ~ sample1$Year + I(sample1$Year^2) + I(sample1$Year^3) Res.Df RSS Df Sum of Sq F Pr(>F) 1 8 12019.8 2 7 7659.5 1 4360.3 3.9848 0.0861 . --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Since the p-value is greater than 0.05, we accept the null hypothesis: there wasn't a significant improvement of the model.
The biggest problem now is to represent graphically the result. In fact, R does not exist (as far as I know) a function for plotting polynomials found. We must therefore proceed with graphic artifacts still valid, but somewhat laborious.
First, we plotted the values, with the command seen before. This time only display the lines and not points, for convenience graphics:
plot(sample1$Year, sample1$Population, type="l", lwd=3)
Now add to this chart the progress of the 2nd degree polynomial, in this way:
points(sample1$Year, predict(fit2), type="l", col="red", lwd=2)
The function
predict()
compute the Y values given the X values. The the coordinates are linked with lines. Is not plotted the continuous, but the discrete. With a few values, this method is highly debilitating.Let's add the graph of the polynomial of 3rd degree:
points(sample1&Year, predict(fit3), type="l", col="blue", lwd=2)
As you can see the two models have very similar trends.
If we would instead obtain the graph of continuous functions obtained, we proceed in this manner. First you create the polynomial equation we previously found:
pol2 < - function(x) fit2$coefficient[3]*x^2 + fit2$coefficient[2]*x + fit2$coefficient[1]
Remember that:
- coefficient[1] = beta0
- coefficient[2] = beta1
- coefficient[3] = beta2
and so on.
At this point we plotted the coordinates of sample1 and then the created curve with
curve(x)
:plot(sample1$Year, sample1$Population, type="p", lwd=3) pol2 <- function(x) fit2$coefficient[3]*x^2 + fit2$coefficient[2]*x + fit2$coefficient[1] curve(pol2, col="red", lwd=2)
The point, however, disappear, but we can replace them with the command
points
:points(sample1$Year, sample1$Population, type="p", lwd=3)
A note: you must follow the order of commands as I have described, otherwise the function
curve
creates a wrong graph. So summarizing the commands to get the continuous function, and the experimental points on the same graph are the following:plot(sample1$Year, sample1$Population, type="p", lwd=3) pol2 <- function(x) fit2$coefficient[3]*x^2 + fit2$coefficient[2]*x + fit2$coefficient[1] curve(pol2, col="red", lwd=2) points(sample1$Year, sample1$Population, type="p", lwd=3)
The graph we get is the following:
Now draw the graph of the polynomial of 3rd degree:
plot(sample1$Year, sample1$Population, type="p", lwd=3) pol3 <- function(x) fit3$coefficient[4]*x^3 + fit3$coefficient[3]*x^2 + fit3$coefficient[2]*x + fit3$coefficient[1] curve(pol3, col="red", lwd=2) points(sample1$Year, sample1$Population, type="p", lwd=3)
To leave a comment for the author, please follow the link and comment on their blog: Statistic on aiR.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.