Tidy correlation tests in R
When we estimate correlation coefficients between multiple variables, obtaining a simple and tidy result becomes more complicated. A simple solution is to use the tidy() function from the {broom} package. In this post we are going to estimate the correlation coefficients between the annual precipitation of several Spanish cities and climate teleconnection indices: download. The teleconnection data are preprocessed, but can be downloaded directly from crudata.uea.ac.uk. The daily precipitation data come from ECA&D.
Packages
In this post we will use the following packages:
| Package | Description |
|---|---|
| tidyverse | Collection of packages (visualization, manipulation): ggplot2, dplyr, purrr, etc. |
| broom | Converts results of statistical functions (lm, t.test, cor.test, etc.) into tidy tables |
| fs | Provides a cross-platform, uniform interface to file system operations |
| lubridate | Easy manipulation of dates and times |
```r
# install the packages if necessary
if(!require("tidyverse")) install.packages("tidyverse")
if(!require("broom")) install.packages("broom")
if(!require("fs")) install.packages("fs")
if(!require("lubridate")) install.packages("lubridate")

# load libraries
library(tidyverse)
library(broom)
library(fs)
library(lubridate)
```
Import data
First we have to import the daily precipitation of the selected weather stations.
- Create a vector with all precipitation files using the dir_ls() function of the {fs} package.
- Import the data using the map_df() function of the {purrr} package, which applies another function to a vector or list and joins the results together into a single data.frame.
- Then a) select the columns that interest us, b) convert the date string into a date object using the ymd() function of the {lubridate} package, c) create a new column yr with the years, d) divide the precipitation values by 10 and reclassify the missing value -9999 as NA, and e) finally, reclassify the ID of each weather station, creating a factor with new labels.
More details about the use of the dir_ls() and map_df() functions can be found in this previous post.
```r
# precipitation files
files <- dir_ls(regexp = "txt")
files
## RR_STAID001393.txt RR_STAID001394.txt RR_STAID002969.txt
## RR_STAID003946.txt RR_STAID003969.txt

# import all files and join them together
pr <- files %>% map_df(read_csv, skip = 20)
## Parsed with column specification:
## cols(
##   STAID = col_double(),
##   SOUID = col_double(),
##   DATE = col_double(),
##   RR = col_double(),
##   Q_RR = col_double()
## )
## (this message is repeated once for each of the 5 files)

pr
## # A tibble: 133,343 x 5
##    STAID SOUID     DATE    RR  Q_RR
##    <dbl> <dbl>    <dbl> <dbl> <dbl>
##  1  1393 20611 19470301     0     0
##  2  1393 20611 19470302     5     0
##  3  1393 20611 19470303     0     0
##  4  1393 20611 19470304    33     0
##  5  1393 20611 19470305    15     0
##  6  1393 20611 19470306     0     0
##  7  1393 20611 19470307    85     0
##  8  1393 20611 19470308     3     0
##  9  1393 20611 19470309     0     0
## 10  1393 20611 19470310     0     0
## # ... with 133,333 more rows

# create levels for the factor
id <- unique(pr$STAID)
# the corresponding labels
lab <- c("Bilbao", "Santiago", "Barcelona", "Madrid", "Valencia")

# first changes
pr <- select(pr, STAID, DATE, RR) %>%
        mutate(DATE = ymd(DATE),
               RR = ifelse(RR == -9999, NA, RR/10),
               STAID = factor(STAID, id, lab),
               yr = year(DATE))
pr
## # A tibble: 133,343 x 4
##    STAID  DATE          RR    yr
##    <fct>  <date>     <dbl> <dbl>
##  1 Bilbao 1947-03-01   0    1947
##  2 Bilbao 1947-03-02   0.5  1947
##  3 Bilbao 1947-03-03   0    1947
##  4 Bilbao 1947-03-04   3.3  1947
##  5 Bilbao 1947-03-05   1.5  1947
##  6 Bilbao 1947-03-06   0    1947
##  7 Bilbao 1947-03-07   8.5  1947
##  8 Bilbao 1947-03-08   0.3  1947
##  9 Bilbao 1947-03-09   0    1947
## 10 Bilbao 1947-03-10   0    1947
## # ... with 133,333 more rows
```
We still need to filter the period and calculate the annual precipitation totals. Strictly speaking, it is not correct to sum precipitation without taking missing values into account, but it should be enough for this exercise. Then we change the table format with the spread() function, going from a long to a wide table; that is, we obtain one column per weather station.
```r
pr_yr <- filter(pr, DATE >= "1950-01-01", DATE < "2018-01-01") %>%
           group_by(STAID, yr) %>%
           summarise(pr = sum(RR, na.rm = TRUE))
pr_yr
## # A tibble: 324 x 3
## # Groups:   STAID [5]
##    STAID     yr    pr
##    <fct>  <dbl> <dbl>
##  1 Bilbao  1950 1342 
##  2 Bilbao  1951 1306.
##  3 Bilbao  1952 1355.
##  4 Bilbao  1953 1372.
##  5 Bilbao  1954 1428.
##  6 Bilbao  1955 1062.
##  7 Bilbao  1956 1254.
##  8 Bilbao  1957  968.
##  9 Bilbao  1958 1272.
## 10 Bilbao  1959 1450.
## # ... with 314 more rows

pr_yr <- spread(pr_yr, STAID, pr)
pr_yr
## # A tibble: 68 x 6
##       yr Bilbao Santiago Barcelona Madrid Valencia
##    <dbl>  <dbl>    <dbl>     <dbl>  <dbl>    <dbl>
##  1  1950  1342     1800.      345     NA        NA
##  2  1951  1306.    2344.     1072.   798.       NA
##  3  1952  1355.    1973.      415.   524.       NA
##  4  1953  1372.     973.      683.   365.       NA
##  5  1954  1428.    1348.      581.   246.       NA
##  6  1955  1062.    1769.      530.   473.       NA
##  7  1956  1254.    1533.      695.   480.       NA
##  8  1957   968.    1599.      635.   424.       NA
##  9  1958  1272.    2658.      479.   482.       NA
## 10  1959  1450.    2847.     1006    665.       NA
## # ... with 58 more rows
```
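As noted above, summing daily values with na.rm = TRUE simply ignores missing days, so years with many gaps will be underestimated. A minimal sketch of a stricter alternative, not part of the original workflow, could flag annual totals computed from too many missing daily records; the 30-day threshold used here is an arbitrary choice for illustration.

```r
# sketch: discard annual totals with too many missing daily records
# (the 30-day threshold is purely illustrative)
pr_yr_strict <- filter(pr, DATE >= "1950-01-01", DATE < "2018-01-01") %>%
  group_by(STAID, yr) %>%
  summarise(n_na = sum(is.na(RR)),          # number of missing days in the year
            pr   = sum(RR, na.rm = TRUE)) %>%
  mutate(pr = ifelse(n_na > 30, NA, pr)) %>% # set unreliable totals to NA
  select(-n_na)
```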
The next step is to import the climate teleconnection indices.
```r
# teleconnections
telecon <- read_csv("teleconnections_indices.csv")
## Parsed with column specification:
## cols(
##   yr = col_double(),
##   NAO = col_double(),
##   WeMO = col_double(),
##   EA = col_double(),
##   `POL-EUAS` = col_double(),
##   `EATL/WRUS` = col_double(),
##   MO = col_double(),
##   SCAND = col_double(),
##   AO = col_double()
## )

telecon
## # A tibble: 68 x 9
##       yr   NAO   WeMO     EA `POL-EUAS` `EATL/WRUS`    MO    SCAND        AO
##    <dbl> <dbl>  <dbl>  <dbl>      <dbl>       <dbl> <dbl>    <dbl>     <dbl>
##  1  1950  0.49  0.555 -0.332     0.0217     -0.0567 0.335  0.301   -1.99e-1
##  2  1951 -0.07  0.379 -0.372     0.402      -0.419  0.149 -0.00667 -3.65e-1
##  3  1952 -0.37  0.693 -0.688    -0.0117     -0.711  0.282  0.0642  -6.75e-1
##  4  1953  0.4  -0.213 -0.727    -0.0567     -0.0508 0.216  0.0233  -1.64e-2
##  5  1954  0.51  1.20  -0.912     0.142      -0.318  0.386  0.458   -5.83e-4
##  6  1955 -0.64  0.138 -0.824    -0.0267      0.154  0.134  0.0392  -3.62e-1
##  7  1956  0.17  0.617 -1.29     -0.197       0.0617 0.256  0.302   -1.63e-1
##  8  1957 -0.02  0.321 -0.952    -0.638      -0.167  0.322 -0.134   -3.42e-1
##  9  1958  0.12  0.941 -0.243     0.138       0.661  0.296  0.279   -8.68e-1
## 10  1959  0.49 -0.055 -0.23     -0.0142      0.631  0.316  0.725   -7.62e-2
## # ... with 58 more rows
```
Finally we need to join both tables by year.
```r
data_all <- left_join(pr_yr, telecon, by = "yr")
data_all
## # A tibble: 68 x 14
##       yr Bilbao Santiago Barcelona Madrid Valencia   NAO   WeMO     EA
##    <dbl>  <dbl>    <dbl>     <dbl>  <dbl>    <dbl> <dbl>  <dbl>  <dbl>
##  1  1950  1342     1800.      345     NA        NA  0.49  0.555 -0.332
##  2  1951  1306.    2344.     1072.   798.       NA -0.07  0.379 -0.372
##  3  1952  1355.    1973.      415.   524.       NA -0.37  0.693 -0.688
##  4  1953  1372.     973.      683.   365.       NA  0.4  -0.213 -0.727
##  5  1954  1428.    1348.      581.   246.       NA  0.51  1.20  -0.912
##  6  1955  1062.    1769.      530.   473.       NA -0.64  0.138 -0.824
##  7  1956  1254.    1533.      695.   480.       NA  0.17  0.617 -1.29 
##  8  1957   968.    1599.      635.   424.       NA -0.02  0.321 -0.952
##  9  1958  1272.    2658.      479.   482.       NA  0.12  0.941 -0.243
## 10  1959  1450.    2847.     1006    665.       NA  0.49 -0.055 -0.23 
## # ... with 58 more rows, and 5 more variables: `POL-EUAS` <dbl>,
## #   `EATL/WRUS` <dbl>, MO <dbl>, SCAND <dbl>, AO <dbl>
```
Correlation test
A correlation test between paired samples can be done with the cor.test() function of base R. In this case, we test the relationship between the annual precipitation of Bilbao and the NAO index.
```r
cor_nao_bil <- cor.test(data_all$Bilbao, data_all$NAO, method = "spearman")
## Warning in cor.test.default(data_all$Bilbao, data_all$NAO, method =
## "spearman"): Cannot compute exact p-value with ties

cor_nao_bil
## 
##  Spearman's rank correlation rho
## 
## data:  data_all$Bilbao and data_all$NAO
## S = 44372, p-value = 0.2126
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.1531149

str(cor_nao_bil)
## List of 8
##  $ statistic  : Named num 44372
##   ..- attr(*, "names")= chr "S"
##  $ parameter  : NULL
##  $ p.value    : num 0.213
##  $ estimate   : Named num 0.153
##   ..- attr(*, "names")= chr "rho"
##  $ null.value : Named num 0
##   ..- attr(*, "names")= chr "rho"
##  $ alternative: chr "two.sided"
##  $ method     : chr "Spearman's rank correlation rho"
##  $ data.name  : chr "data_all$Bilbao and data_all$NAO"
##  - attr(*, "class")= chr "htest"
```
We see that the result is in an unmanageable and untidy format: a console summary of the correlation with all the statistical parameters needed to draw a conclusion about the relationship. The original structure is a list of vectors. However, the tidy() function of the {broom} package allows us to convert the result into a table format.
```r
tidy(cor_nao_bil)
## # A tibble: 1 x 5
##   estimate statistic p.value method                          alternative
##      <dbl>     <dbl>   <dbl> <chr>                           <chr>      
## 1    0.153    44372.   0.213 Spearman's rank correlation rho two.sided
```
Apply the correlation test to multiple variables
The objective is to apply the correlation test to all weather stations and climate teleconnection indices.
First, we must reshape the table into long format, that is, create one column for the city and one for the corresponding precipitation value. Then we repeat the same for the teleconnection indices.
```r
data <- gather(data_all, city, pr, Bilbao:Valencia) %>%
          gather(telecon, index, NAO:AO)
data
## # A tibble: 2,720 x 5
##       yr city      pr telecon index
##    <dbl> <chr>  <dbl> <chr>   <dbl>
##  1  1950 Bilbao 1342  NAO      0.49
##  2  1951 Bilbao 1306. NAO     -0.07
##  3  1952 Bilbao 1355. NAO     -0.37
##  4  1953 Bilbao 1372. NAO      0.4 
##  5  1954 Bilbao 1428. NAO      0.51
##  6  1955 Bilbao 1062. NAO     -0.64
##  7  1956 Bilbao 1254. NAO      0.17
##  8  1957 Bilbao  968. NAO     -0.02
##  9  1958 Bilbao 1272. NAO      0.12
## 10  1959 Bilbao 1450. NAO      0.49
## # ... with 2,710 more rows
```
To apply the test to all cities, we need the corresponding groupings. Therefore, we use the group_by() function to indicate the two grouping variables: city and telecon. In addition, we apply the nest() function of the {tidyr} package ({tidyverse} collection) to create lists of tables nested per row. In other words, each row for a given city and teleconnection index will contain a nested table with the year, the precipitation value and the corresponding value of the teleconnection index.
```r
data_nest <- group_by(data, city, telecon) %>% nest()
data_nest
## # A tibble: 40 x 3
##    city      telecon data             
##    <chr>     <chr>   <list>           
##  1 Bilbao    NAO     <tibble [68 x 3]>
##  2 Santiago  NAO     <tibble [68 x 3]>
##  3 Barcelona NAO     <tibble [68 x 3]>
##  4 Madrid    NAO     <tibble [68 x 3]>
##  5 Valencia  NAO     <tibble [68 x 3]>
##  6 Bilbao    WeMO    <tibble [68 x 3]>
##  7 Santiago  WeMO    <tibble [68 x 3]>
##  8 Barcelona WeMO    <tibble [68 x 3]>
##  9 Madrid    WeMO    <tibble [68 x 3]>
## 10 Valencia  WeMO    <tibble [68 x 3]>
## # ... with 30 more rows

str(slice(data_nest, 1))
## Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  3 variables:
##  $ city   : chr "Bilbao"
##  $ telecon: chr "NAO"
##  $ data   :List of 1
##   ..$ :Classes 'tbl_df', 'tbl' and 'data.frame':   68 obs. of  3 variables:
##   .. ..$ yr   : num  1950 1951 1952 1953 1954 ...
##   .. ..$ pr   : num  1342 1306 1355 1372 1428 ...
##   .. ..$ index: num  0.49 -0.07 -0.37 0.4 0.51 -0.64 0.17 -0.02 0.12 0.49 ...
```
The next step is to create a function in which we define the correlation test and convert the result to the tidy format using the tidy() function; we will then apply this function to each grouping.
```r
cor_fun <- function(df) cor.test(df$pr, df$index, method = "spearman") %>% tidy()
```
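As a quick sanity check (not in the original post), we could apply the function to the first nested tibble created below, which corresponds to Bilbao and the NAO index; it should reproduce the one-row tibble obtained above with tidy(cor_nao_bil).

```r
# hedged check: run cor_fun on the first nested tibble (Bilbao ~ NAO);
# this should return the same one-row result as tidy(cor_nao_bil) above
cor_fun(data_nest$data[[1]])
```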
Now we only have to apply our function to the column that contains the nested tables for each combination of city and teleconnection. To do this, we use the map() function, which applies another function to a vector or list. What we do is create a new column that contains the result, a statistical summary table, for each combination.
```r
data_nest <- mutate(data_nest, model = map(data, cor_fun))
data_nest
## # A tibble: 40 x 4
##    city      telecon data              model           
##    <chr>     <chr>   <list>            <list>          
##  1 Bilbao    NAO     <tibble [68 x 3]> <tibble [1 x 5]>
##  2 Santiago  NAO     <tibble [68 x 3]> <tibble [1 x 5]>
##  3 Barcelona NAO     <tibble [68 x 3]> <tibble [1 x 5]>
##  4 Madrid    NAO     <tibble [68 x 3]> <tibble [1 x 5]>
##  5 Valencia  NAO     <tibble [68 x 3]> <tibble [1 x 5]>
##  6 Bilbao    WeMO    <tibble [68 x 3]> <tibble [1 x 5]>
##  7 Santiago  WeMO    <tibble [68 x 3]> <tibble [1 x 5]>
##  8 Barcelona WeMO    <tibble [68 x 3]> <tibble [1 x 5]>
##  9 Madrid    WeMO    <tibble [68 x 3]> <tibble [1 x 5]>
## 10 Valencia  WeMO    <tibble [68 x 3]> <tibble [1 x 5]>
## # ... with 30 more rows

str(slice(data_nest, 1))
## Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  4 variables:
##  $ city   : chr "Bilbao"
##  $ telecon: chr "NAO"
##  $ data   :List of 1
##   ..$ :Classes 'tbl_df', 'tbl' and 'data.frame':   68 obs. of  3 variables:
##   .. ..$ yr   : num  1950 1951 1952 1953 1954 ...
##   .. ..$ pr   : num  1342 1306 1355 1372 1428 ...
##   .. ..$ index: num  0.49 -0.07 -0.37 0.4 0.51 -0.64 0.17 -0.02 0.12 0.49 ...
##  $ model  :List of 1
##   ..$ :Classes 'tbl_df', 'tbl' and 'data.frame':   1 obs. of  5 variables:
##   .. ..$ estimate   : num 0.153
##   .. ..$ statistic  : num 44372
##   .. ..$ p.value    : num 0.213
##   .. ..$ method     : chr "Spearman's rank correlation rho"
##   .. ..$ alternative: chr "two.sided"
```
How can we undo the list of tables in each row of our table? First we remove the column with the data, and then we can simply apply the unnest() function.
```r
corr_pr <- select(data_nest, -data) %>% unnest()
corr_pr
## # A tibble: 40 x 7
##    city    telecon estimate statistic  p.value method           alternative
##    <chr>   <chr>      <dbl>     <dbl>    <dbl> <chr>            <chr>      
##  1 Bilbao  NAO       0.153     44372. 0.213    Spearman's rank~ two.sided  
##  2 Santia~ NAO      -0.181     61902. 0.139    Spearman's rank~ two.sided  
##  3 Barcel~ NAO      -0.0203    53460. 0.869    Spearman's rank~ two.sided  
##  4 Madrid  NAO      -0.291     64692. 0.0169   Spearman's rank~ two.sided  
##  5 Valenc~ NAO      -0.113     27600. 0.422    Spearman's rank~ two.sided  
##  6 Bilbao  WeMO      0.404     31242. 0.000706 Spearman's rank~ two.sided  
##  7 Santia~ WeMO      0.332     35014. 0.00594  Spearman's rank~ two.sided  
##  8 Barcel~ WeMO      0.0292    50862  0.813    Spearman's rank~ two.sided  
##  9 Madrid  WeMO      0.109     44660  0.380    Spearman's rank~ two.sided  
## 10 Valenc~ WeMO     -0.252     31056. 0.0688   Spearman's rank~ two.sided  
## # ... with 30 more rows
```
The result is a table in which we can see the correlations and their statistical significance for each city and teleconnection index.
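Since the result is an ordinary tibble, it can be queried directly with the usual {dplyr} verbs. As a small illustrative sketch (not part of the original post), we could keep only the significant correlations and order them by p-value:

```r
# sketch: list only the significant correlations, smallest p-value first
filter(corr_pr, p.value < 0.05) %>%
  arrange(p.value) %>%
  select(city, telecon, estimate, p.value)
```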
Heatmap of the results
Finally, we make a heatmap of the results. Beforehand, we create a column that indicates whether the correlation is significant, with a p-value less than 0.05.
```r
corr_pr <- mutate(corr_pr, sig = ifelse(p.value < 0.05, "Sig.", "Non Sig."))

ggplot() +
  geom_tile(data = corr_pr,
            aes(city, telecon, fill = estimate),
            size = 1,
            colour = "white") +
  geom_tile(data = filter(corr_pr, sig == "Sig."),
            aes(city, telecon),
            size = 1,
            colour = "black",
            fill = "transparent") +
  geom_text(data = corr_pr,
            aes(city, telecon, label = round(estimate, 2),
                fontface = ifelse(sig == "Sig.", "bold", "plain"))) +
  scale_fill_gradient2(breaks = seq(-1, 1, 0.2)) +
  labs(x = "", y = "", fill = "", p.value = "") +
  theme_minimal() +
  theme(panel.grid.major = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank(),
        axis.ticks = element_blank())
```