How to apply the Mann-Whitney U Test in R

Posted on March 8, 2020 by George Pipis in R bloggers

[This article was first published on R – Predictive Hacks, and kindly contributed to R-bloggers].


In statistics, the Mann–Whitney U test (also called Wilcoxon rank-sum test) is a nonparametric test of the null hypothesis that it is equally likely that a randomly selected value from one population will be less than or greater than a randomly selected value from a second population. This test can be used to investigate whether two independent samples were selected from populations having the same distribution.
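One way to make this null hypothesis concrete: the statistic W reported by R's wilcox.test() counts the pairs in which a value from the first sample exceeds a value from the second (ties count one half). Dividing W by \(n_1 n_2\) therefore estimates the probability that a random value from the first population exceeds one from the second, which is 0.5 under the null. The sketch below uses simulated data; the sample sizes, distributions and seed are illustrative assumptions and are not related to the dataset used later.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
x <- rnorm(30, mean = 0.5)       # hypothetical sample from population 1
y <- rnorm(40, mean = 0)         # hypothetical sample from population 2

# W counts the (x, y) pairs with x > y; ties would count as 1/2
w <- unname(wilcox.test(x, y)$statistic)

# the same count obtained by brute force over all n1 * n2 pairs
pairs_gt <- sum(outer(x, y, ">")) + 0.5 * sum(outer(x, y, "=="))

# estimated probability that a random x exceeds a random y
p_hat <- w / (length(x) * length(y))
p_hat
```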

Some investigators interpret this test as comparing the medians of the two populations. Recall that the parametric counterpart, the two-sample t-test, compares the means ( \(H_0: \mu_1=\mu_2\) ) of independent groups.

In contrast, the null and two-sided research hypotheses for the nonparametric test are stated as follows:

\(H_0\): The two populations are equal versus

\(H_1\): The two populations are not equal

This test is often performed as a two-sided test and, thus, the research hypothesis indicates that the populations are not equal as opposed to specifying directionality. A one-sided research hypothesis is used if interest lies in detecting a positive or negative shift in one population as compared to the other. The procedure for the test involves pooling the observations from the two samples into one combined sample, keeping track of which sample each observation comes from, and then ranking the pooled observations from lowest to highest, from 1 to \(n_1+n_2\).
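The pooling-and-ranking procedure can be sketched by hand and checked against wilcox.test(). The data below are simulated (the group sizes, means and seed are assumptions chosen only for illustration). The U statistic of the first sample is its rank sum minus \(n_1(n_1+1)/2\), which is the value R reports as W. The last line shows how a one-sided research hypothesis can be requested through the alternative argument.

```r
set.seed(2)                         # arbitrary seed
x <- rnorm(12, mean = 1)            # hypothetical sample 1
y <- rnorm(15, mean = 0)            # hypothetical sample 2
n1 <- length(x)
n2 <- length(y)

# pool the samples and rank from 1 to n1 + n2 (ties would get average ranks)
pooled_ranks <- rank(c(x, y))

# sum of the ranks that belong to the first sample
r1 <- sum(pooled_ranks[seq_len(n1)])

# U statistic of the first sample
u1 <- r1 - n1 * (n1 + 1) / 2

# wilcox.test() reports the same quantity as W
two_sided <- wilcox.test(x, y)
two_sided$statistic

# one-sided research hypothesis: x is shifted upward relative to y
one_sided <- wilcox.test(x, y, alternative = "greater")
one_sided$p.value
```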

Mann-Whitney U Test in Breast Cancer Dataset

We will work with the Breast Cancer Wisconsin dataset, applying the Mann-Whitney U test to every independent variable to compare patients diagnosed with a malignant tumor against those with a benign one.

We build a report that shows, for each variable, the mean, the standard deviation and the median per group, the difference in medians, and the p-value of the Mann-Whitney U test.

library(tidyverse)


# the column names of the dataset
names <- c('id_number', 'diagnosis', 'radius_mean', 
           'texture_mean', 'perimeter_mean', 'area_mean', 
           'smoothness_mean', 'compactness_mean', 
           'concavity_mean','concave_points_mean', 
           'symmetry_mean', 'fractal_dimension_mean',
           'radius_se', 'texture_se', 'perimeter_se', 
           'area_se', 'smoothness_se', 'compactness_se', 
           'concavity_se', 'concave_points_se', 
           'symmetry_se', 'fractal_dimension_se', 
           'radius_worst', 'texture_worst', 
           'perimeter_worst', 'area_worst', 
           'smoothness_worst', 'compactness_worst', 
           'concavity_worst', 'concave_points_worst', 
           'symmetry_worst', 'fractal_dimension_worst')

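# get the data from the URL and assign the column names
# (wdbc.data has no header row: without header=FALSE the first record
# would be read as a header and dropped, which slightly changes the
# group statistics)
df<-read.csv(url("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data"), header=FALSE, col.names=names)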

# remove the ID number
df<-df%>%select(-id_number)


# get the means of all the variables
means <- df %>%
  group_by(diagnosis) %>%
  summarise_all(list(mean), na.rm = TRUE) %>%
  gather("Variable", "Mean", -diagnosis) %>%
  spread(diagnosis, Mean) %>%
  rename("Mean_M" = "M", "Mean_B" = "B")

# get the standard deviation of all the variables
sds <- df %>%
  group_by(diagnosis) %>%
  summarise_all(list(sd), na.rm = TRUE) %>%
  gather("Variable", "SD", -diagnosis) %>%
  spread(diagnosis, SD) %>%
  rename("SD_M" = "M", "SD_B" = "B")

# get the median of all the variables
medians <- df %>%
  group_by(diagnosis) %>%
  summarise_all(list(median), na.rm = TRUE) %>%
  gather("Variable", "Median", -diagnosis) %>%
  spread(diagnosis, Median) %>%
  rename("Median_M" = "M", "Median_B" = "B")


# join the tables 
summary_report<-means%>%inner_join(sds, "Variable")%>%inner_join(medians, "Variable")%>%mutate(DiffInMedians=Median_M-Median_B)



# now apply the Mann-Whitney U test for all variables

variables<-colnames(df)[2:dim(df)[2]]



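# containers for the p-values and the variable names
pvals <- c()
vars <- c()

for (i in variables) {

  # keep only the grouping column and the current variable;
  # all_of() selects the column whose name is stored in the string i
  xxx <- df %>% dplyr::select(diagnosis, all_of(i))

  # values of the current variable in each group, without missing values
  x1 <- xxx %>% filter(diagnosis == "M") %>% dplyr::select(2) %>% na.omit() %>% pull()
  x2 <- xxx %>% filter(diagnosis == "B") %>% dplyr::select(2) %>% na.omit() %>% pull()

  # two-sided Mann-Whitney U (Wilcoxon rank-sum) test
  wc <- wilcox.test(x1, x2)

  # store the p-value rounded to 4 decimals (values below 0.00005 become 0)
  pvals <- c(pvals, round(wc$p.value, 4))
  vars <- c(vars, i)

}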

wc_df<-data.frame(Variable=vars,pvalues=pvals)

wc_df$Variable<-as.character(wc_df$Variable)
summary_report<-summary_report%>%inner_join(wc_df, by="Variable")

Running the R script above gives the following output. Almost all of the variables differ significantly (p-value < 0.05) between the two groups (Malignant and Benign). The only variables that do not are fractal_dimension_mean, smoothness_se and texture_se. Because the p-values are rounded to four decimal places, a value of 0 in the table means p < 0.00005, not a p-value of exactly zero.

Variable	Mean_B	Mean_M	SD_B	SD_M	Median_B	Median_M	DiffInMedians	pvalues
area_mean	462.7902	978.2692	134.2871	368.8097	458.4	930.9	472.5	0
area_se	21.13515	72.28981	8.843472	61.24716	19.63	58.38	38.75	0
area_worst	558.8994	1419.458	163.6014	597.967	547.4	1302	754.6	0
compactness_mean	0.0800846	0.1445602	0.03375	0.0533352	0.07529	0.1319	0.05661	0
compactness_se	0.0214383	0.0322017	0.0163515	0.0183944	0.01631	0.02855	0.01224	0
compactness_worst	0.1826725	0.373446	0.09218	0.1695886	0.1698	0.3559	0.1861	0
concave_points_mean	0.0257174	0.0877099	0.0159088	0.0342122	0.02344	0.08624	0.0628	0
concave_points_se	0.0098577	0.0150566	0.0057086	0.0055302	0.009061	0.0142	0.005139	0
concave_points_worst	0.0744443	0.1818432	0.0357974	0.0460601	0.07431	0.182	0.10769	0
concavity_mean	0.0460576	0.1601144	0.0434422	0.0745776	0.03709	0.1508	0.11371	0
concavity_se	0.0259967	0.0417676	0.0329182	0.0216391	0.0184	0.0371	0.0187	0
concavity_worst	0.1662377	0.4493672	0.1403677	0.1810384	0.1412	0.4029	0.2617	0
fractal_dimension_mean	0.0628674	0.0626041	0.0067473	0.0075099	0.06154	0.06149	-0.00005	0.4788
fractal_dimension_se	0.0036361	0.0040523	0.0029382	0.002041	0.002808	0.003739	0.000931	0
fractal_dimension_worst	0.0794421	0.0914002	0.0138041	0.021521	0.07712	0.08758	0.01046	0
perimeter_mean	78.07541	115.3301	11.80744	21.90059	78.18	114.2	36.02	0
perimeter_se	2.000321	4.303716	0.7711692	2.557696	1.851	3.654	1.803	0
perimeter_worst	87.00594	141.1655	13.52709	29.37531	86.92	137.9	50.98	0
radius_mean	12.14652	17.46033	1.780512	3.211384	12.2	17.3	5.1	0
radius_se	0.2840824	0.6067796	0.1125696	0.3442221	0.2575	0.5449	0.2874	0
radius_worst	13.3798	21.11469	1.981368	4.283704	13.35	20.58	7.23	0
smoothness_mean	0.0924777	0.102825	0.0134461	0.0125927	0.09076	0.102	0.01124	0
smoothness_se	0.0071959	0.0067819	0.0030606	0.0028972	0.00653	0.006208	-0.000322	0.2134
smoothness_worst	0.1249595	0.144763	0.0200135	0.021889	0.1254	0.1434	0.018	0
symmetry_mean	0.174186	0.1926768	0.0248068	0.0274958	0.1714	0.1896	0.0182	0
symmetry_se	0.0205838	0.0204271	0.0069985	0.0100671	0.01909	0.01768	-0.00141	0.0225
symmetry_worst	0.2702459	0.3228204	0.0417448	0.0742636	0.2687	0.3103	0.0416	0
texture_mean	17.91476	21.6581	3.995125	3.708042	17.39	21.46	4.07	0
texture_se	1.22038	1.212363	0.5891797	0.4838656	1.108	1.127	0.019	0.6195
texture_worst	23.51507	29.37502	5.493955	5.384249	22.82	29.02	6.2	0
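The zeros in the pvalues column come from the round(wc$p.value, 4) call in the loop. If the very small p-values should remain readable, base R offers alternatives to plain rounding. The sketch below uses made-up p-values (p_raw is a hypothetical stand-in for the unrounded wc$p.value results, not output from the dataset) to compare three display options.

```r
# hypothetical raw p-values, standing in for unrounded wc$p.value results
p_raw <- c(1e-10, 0.0225, 0.4788)

# plain rounding: the first value collapses to 0
rounded <- round(p_raw, 4)

# format.pval() reports values below eps as "less than eps"
shown <- format.pval(p_raw, digits = 3, eps = 1e-4)

# signif() keeps three significant digits instead of four decimals
sig <- signif(p_raw, 3)

rounded
shown
sig
```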

Discussion

Generally, it is a good idea to apply the Mann-Whitney U test during Exploratory Data Analysis, since it gives an idea of which variables may be important for the final machine learning model. Most importantly, because it is a non-parametric test, we do not need to assume that the variables follow any particular distribution, such as the normal. For example, instead of Student's t-test we can apply the Mann-Whitney U test without worrying about the normality assumption.
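As a rough illustration of this point, the sketch below runs both tests on simulated right-skewed data. The log-normal samples, their parameters and the seed are assumptions made only for this example. Note that t.test() in R performs the Welch variant by default; var.equal = TRUE gives the classical Student's t-test.

```r
set.seed(3)                                   # arbitrary seed
# hypothetical right-skewed measurements for two independent groups
g1 <- rlnorm(50, meanlog = 0,   sdlog = 1)
g2 <- rlnorm(50, meanlog = 0.5, sdlog = 1)

# Shapiro-Wilk test: a small p-value suggests g1 is not normally distributed
p_normality <- shapiro.test(g1)$p.value

# Student's t-test compares the means and relies on normality
p_t <- t.test(g1, g2, var.equal = TRUE)$p.value

# Mann-Whitney U test works on the ranks, with no normality assumption
p_mw <- wilcox.test(g1, g2)$p.value

c(normality = p_normality, t_test = p_t, mann_whitney = p_mw)
```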
