Data transformation and standardization in r

R-Blog on Data modelling to develop ...

7 hours ago

[This article was first published on R-Blog on Data modelling to develop ..., and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction

Often economic and other Machine Learning data are of different units or sizes making either estimation, interpretation or visualization difficult. The solution to these issues can be handled if the data can be transformed to unitless or data of similar magnitude. When the need to transform thus arises, then one finds it difficult to get handy function to achieve that.

In this blog, I share with you a function data_transform from Dyn4cast package that can easily transform your data.frame for estimation and visualization purposes. It is a one line code and easy to use. The usage is as follows:

data_transform(data, method, x, MARGIN)

data clean numeric data frame

method method of transformation or standardization: 1 = min-max, 2 = log, 3 = mean-SD.

MARGIN optional, to indicate if the data is column-wise or row-wise. Defaults to coulmn-wise if not indicate

Load library

library(Dyn4cast)
Transform <- readRDS("data/Transform.RDS")

Visualize the raw data without any transformation

data0 <- Transform %>%
pivot_longer(!X, names_to = "Factors", values_to = "Data")
ggplot(data = data0, aes(x = X, y = Data, fill = Factors, color = Factors)) +
geom_line() +
scale_fill_brewer(palette = "Set1") +
scale_color_brewer(palette = "Set1") +
labs(y = "Data", x = "Series", color = "Factors") +
theme_bw(base_size = 12)

The pattern of the smaller data is hidden by the larger data, it is even difficult to see their distribution

Transformation by min-max method

You could also transform the X column but is is better not to.

data11 <- data1 <- data_transform(Transform[, -1], 1)
data1 <- cbind(Transform[, 1], data1)
data1 <- data1 %>%
pivot_longer(!X, names_to = "Factors", values_to = "Data")
ggplot(data = data1, aes(x = X, y = Data, fill = Factors, color = Factors)) +
geom_line() +
scale_fill_brewer(palette = "Set1") +
scale_color_brewer(palette = "Set1") +
labs(y = "Data", x = "Series", color = "Factors") +
theme_bw(base_size = 12)

The pattern of each of the variables are now very evident.

`log` transformation of the data

data21 <- data2 <- data_transform(Transform[, -1], 2)
data2 <- cbind(Transform[, 1], data2)
data2 <- data2 %>%
pivot_longer(!X, names_to = "Factors", values_to = "Data")
ggplot(data = data2, aes(x = X, y = Data, fill = Factors, color = Factors)) +
geom_line() +
scale_fill_brewer(palette = "Set1") +
scale_color_brewer(palette = "Set1") +
labs(y = "Data", x = "Series", color = "Factors") +
theme_bw(base_size = 12)

log is a linear transformation of the data. The pattern are shown but much less that the transformation with min-max method.

`Mean-SD` transformation

data31 <- data3 <- data_transform(Transform[, -1], 3)
data3 <- cbind(Transform[, 1], data3)
data3 <- data3 %>%
pivot_longer(!X, names_to = "Factors", values_to = "Data")
ggplot(data = data3, aes(x = X, y = Data, fill = Factors, color = Factors)) +
geom_line() +
scale_fill_brewer(palette = "Set1") +
scale_color_brewer(palette = "Set1") +
labs(y = "Data", x = "Series", color = "Factors") +
theme_bw(base_size = 12)

Much similar to the min-max transformation, but the essential pattern of the data is evident.

Comparison of the linear regression of the raw and transformed data

Raw <- lm(col1 ~ ., data = Transform[, -1])
Data1 <- lm(col1 ~ ., data = data.frame(data11))
Data2 <- lm(col1 ~ ., data = data.frame(data21))
Data3 <- lm(col1 ~ ., data = data.frame(data31))
m_list <- list(Raw = Raw, Max = Data1, Log = Data2, Mean = Data3)
modelsummary::modelsummary(m_list, stars = TRUE, digits = 2)

	Raw	Max	Log	Mean
(Intercept)	840399.559+	0.288+	10.042***	0.000
	(429034.752)	(0.140)	(1.804)	(0.099)
col2	642.404***	0.740***	0.114***	0.722***
	(111.262)	(0.128)	(0.014)	(0.125)
col3	−114.479	−0.107	0.016	−0.079
	(195.042)	(0.183)	(0.168)	(0.135)
col4	−6770.682*	−0.317*	−0.189	−0.244*
	(2935.693)	(0.138)	(0.180)	(0.106)
col5	1422.072+	0.276+	0.093*	0.211+
	(776.594)	(0.151)	(0.044)	(0.115)
col6	2088.735	0.249	0.629+	0.186
	(1357.125)	(0.161)	(0.302)	(0.121)
Num.Obs.	25	25	25	25
R2	0.805	0.805	0.889	0.805
R2 Adj.	0.753	0.753	0.860	0.753
AIC	636.6	−17.4	−68.7	43.1
BIC	645.1	−8.9	−60.2	51.6
Log.Lik.	−311.285	15.701	41.363	−14.539
F	15.671	15.671	30.555	15.671
RMSE	61849.44	0.13	0.05	0.43
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

modelsummary::modelplot(m_list)

The coefficients of the transformed data are better than the coefficients of the raw data although the effects of the variables looks same except for the intercept and col4 and col5. Log transformation estimated a very significant intercept wheareas those of raw and max transformation are rarely significant. The Mean transformation did not estimate a significant intercept. All the transformations estimated a significant col4 except log wheareas log estimated a significant col5 while others are rarely significant.

In terms of the model properties, the log transformation is the best, followed by max, then Mean; and the raw data gave the worst model properties.

To leave a comment for the author, please follow the link and comment on their blog: R-Blog on Data modelling to develop ....

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.