
How to apply the Mann-Whitney U Test in R

[This article was first published on R – Predictive Hacks, and kindly contributed to R-bloggers.]

In statistics, the Mann–Whitney U test (also called Wilcoxon rank-sum test) is a nonparametric test of the null hypothesis that it is equally likely that a randomly selected value from one population will be less than or greater than a randomly selected value from a second population. This test can be used to investigate whether two independent samples were selected from populations having the same distribution.

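Some investigators interpret this test as comparing the medians of the two populations, although that reading is strictly valid only when the two distributions have the same shape and differ only in location. Recall that the parametric counterpart, the two-sample t-test, compares the means ( \(H_0: \mu_1=\mu_2\) ) of independent groups.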

In contrast, the null and two-sided research hypotheses for the nonparametric test are stated as follows:

\(H_0\): The two populations are equal versus

\(H_1\): The two populations are not equal

This test is usually performed as a two-sided test, so the research hypothesis states that the populations are not equal rather than specifying a direction. A one-sided research hypothesis is used when interest lies in detecting a positive or negative shift in one population relative to the other. The procedure pools the observations from the two samples into one combined sample, keeps track of which sample each observation came from, and then ranks the pooled observations from lowest to highest, from 1 to \(n_1+n_2\). Tied observations receive the average of the ranks they span.
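As a minimal sketch of the ranking procedure described above, the snippet below pools two small made-up samples, ranks them, and derives the U statistic from the rank sum of the first sample. The values in `x` and `y` are hypothetical and are not taken from the dataset used later; the only non-base function used is `wilcox.test()` from the `stats` package that ships with R.

```r
# Two small hypothetical samples (illustrative values only, no ties)
x <- c(1.1, 2.4, 3.8, 5.0)
y <- c(0.7, 1.9, 2.2, 2.9, 3.1)

n1 <- length(x)
n2 <- length(y)

# Pool the observations and remember which sample each one came from
pooled <- c(x, y)
group  <- rep(c("x", "y"), times = c(n1, n2))

# Rank the pooled sample from 1 to n1 + n2 (ties would get average ranks)
ranks <- rank(pooled)

# Rank sum of the first sample and the corresponding U statistic
r1 <- sum(ranks[group == "x"])
u1 <- r1 - n1 * (n1 + 1) / 2

# Equivalent definition: the number of (x, y) pairs in which x exceeds y
u1_pairs <- sum(outer(x, y, ">"))

# R reports the same quantity as the statistic W of wilcox.test()
w <- unname(wilcox.test(x, y)$statistic)

print(c(rank_sum = r1, U = u1, U_pairs = u1_pairs, W = w))
```

Here the three ways of computing the statistic agree. A one-sided version of the test can be requested by passing `alternative = "greater"` or `alternative = "less"` to `wilcox.test()`.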

Mann-Whitney U Test in Breast Cancer Dataset

We will work with the Breast Cancer Wisconsin dataset and apply the Mann-Whitney U test to every independent variable, comparing patients diagnosed with a malignant tumor against those diagnosed with a benign one.

We produce a report that shows, for each variable, the mean, the standard deviation and the median in each group, the difference in medians, and the p-value of the Mann-Whitney U test.

library(tidyverse)


# the column names of the dataset
names <- c('id_number', 'diagnosis', 'radius_mean', 
           'texture_mean', 'perimeter_mean', 'area_mean', 
           'smoothness_mean', 'compactness_mean', 
           'concavity_mean','concave_points_mean', 
           'symmetry_mean', 'fractal_dimension_mean',
           'radius_se', 'texture_se', 'perimeter_se', 
           'area_se', 'smoothness_se', 'compactness_se', 
           'concavity_se', 'concave_points_se', 
           'symmetry_se', 'fractal_dimension_se', 
           'radius_worst', 'texture_worst', 
           'perimeter_worst', 'area_worst', 
           'smoothness_worst', 'compactness_worst', 
           'concavity_worst', 'concave_points_worst', 
           'symmetry_worst', 'fractal_dimension_worst')

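# get the data from the URL and assign the column names
# header = FALSE because the file has no header row; with the default
# header = TRUE, read.csv() would consume the first record as a header
df <- read.csv(url("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data"), header = FALSE, col.names = names)

# remove the ID number
df <- df %>% select(-id_number)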


# get the means of all the variables
means <- df %>%
  group_by(diagnosis) %>%
  summarise_all(list(mean), na.rm = TRUE) %>%
  gather("Variable", "Mean", -diagnosis) %>%
  spread(diagnosis, Mean) %>%
  rename("Mean_M" = "M", "Mean_B" = "B")

# get the standard deviation of all the variables
sds <- df %>%
  group_by(diagnosis) %>%
  summarise_all(list(sd), na.rm = TRUE) %>%
  gather("Variable", "SD", -diagnosis) %>%
  spread(diagnosis, SD) %>%
  rename("SD_M" = "M", "SD_B" = "B")

# get the median of all the variables
medians <- df %>%
  group_by(diagnosis) %>%
  summarise_all(list(median), na.rm = TRUE) %>%
  gather("Variable", "Median", -diagnosis) %>%
  spread(diagnosis, Median) %>%
  rename("Median_M" = "M", "Median_B" = "B")


# join the tables and add the difference in medians (malignant minus benign)
summary_report <- means %>%
  inner_join(sds, "Variable") %>%
  inner_join(medians, "Variable") %>%
  mutate(DiffInMedians = Median_M - Median_B)



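# now apply the Mann-Whitney U test to every variable except the diagnosis

variables <- setdiff(colnames(df), "diagnosis")

# empty containers for the p-values and the variable names
pvals <- numeric(0)
vars <- character(0)

for (i in variables) {

  # keep only the diagnosis and the current variable
  xxx <- df %>% select(diagnosis, all_of(i))

  # values of the current variable in each group, with missing values dropped
  x1 <- xxx %>% filter(diagnosis == "M") %>% select(2) %>% na.omit() %>% pull()
  x2 <- xxx %>% filter(diagnosis == "B") %>% select(2) %>% na.omit() %>% pull()

  # with two samples, wilcox.test() performs the Mann-Whitney U test
  wc <- wilcox.test(x1, x2)

  # store the p-value rounded to 4 decimals (values below 0.00005 show as 0)
  pvals <- c(pvals, round(wc$p.value, 4))
  vars <- c(vars, i)

}

# collect the p-values and join them to the summary report
wc_df <- data.frame(Variable = vars, pvalues = pvals, stringsAsFactors = FALSE)
summary_report <- summary_report %>% inner_join(wc_df, by = "Variable")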
 

If we run the R script above, we get the following output. Almost all of the variables differ significantly between the two groups (malignant and benign) at the 0.05 level. Because the script rounds the p-values to four decimal places, a reported p-value of 0 means p < 0.00005, not exactly zero. The only variables without a statistically significant difference are fractal_dimension_mean, smoothness_se and texture_se.


Variable                 Mean_B     Mean_M     SD_B       SD_M       Median_B   Median_M   DiffInMedians  pvalues
area_mean                462.7902   978.2692   134.2871   368.8097   458.4      930.9      472.5          0
area_se                  21.13515   72.2898    8.843472   61.24716   19.63      58.38      38.75          0
area_worst               558.8994   1419.458   163.6014   597.967    547.4      1302       754.6          0
compactness_mean         0.0800846  0.1445602  0.03375    0.0533352  0.07529    0.1319     0.05661        0
compactness_se           0.0214383  0.0322017  0.0163515  0.0183944  0.01631    0.02855    0.01224        0
compactness_worst        0.1826725  0.373446   0.09218    0.1695886  0.1698     0.3559     0.1861         0
concave_points_mean      0.0257174  0.0877099  0.0159088  0.0342122  0.02344    0.08624    0.0628         0
concave_points_se        0.0098577  0.0150566  0.0057086  0.0055302  0.009061   0.0142     0.005139       0
concave_points_worst     0.0744443  0.1818432  0.0357974  0.0460601  0.07431    0.182      0.10769        0
concavity_mean           0.0460576  0.1601144  0.0434422  0.0745776  0.03709    0.1508     0.11371        0
concavity_se             0.0259967  0.0417676  0.0329182  0.0216391  0.0184     0.0371     0.0187         0
concavity_worst          0.1662377  0.4493672  0.1403677  0.1810384  0.1412     0.4029     0.2617         0
fractal_dimension_mean   0.0628674  0.0626041  0.0067473  0.0075099  0.06154    0.06149    -0.00005       0.4788
fractal_dimension_se     0.0036361  0.0040523  0.0029382  0.002041   0.002808   0.003739   0.000931       0
fractal_dimension_worst  0.0794421  0.0914002  0.0138041  0.021521   0.07712    0.08758    0.01046        0
perimeter_mean           78.07541   115.3301   11.80744   21.90059   78.18      114.2      36.02          0
perimeter_se             2.000321   4.303716   0.7711692  2.557696   1.851      3.654      1.803          0
perimeter_worst          87.00594   141.1655   13.52709   29.37531   86.92      137.9      50.98          0
radius_mean              12.14652   17.46033   1.780512   3.211384   12.2       17.3       5.1            0
radius_se                0.2840824  0.6067796  0.1125696  0.3442221  0.2575     0.5449     0.2874         0
radius_worst             13.3798    21.11469   1.981368   4.283704   13.35      20.58      7.23           0
smoothness_mean          0.0924777  0.102825   0.0134461  0.0125927  0.09076    0.102      0.01124        0
smoothness_se            0.0071959  0.0067819  0.0030606  0.0028972  0.00653    0.006208   -0.000322      0.2134
smoothness_worst         0.1249595  0.144763   0.0200135  0.021889   0.1254     0.1434     0.018          0
symmetry_mean            0.174186   0.1926768  0.0248068  0.0274958  0.1714     0.1896     0.0182         0
symmetry_se              0.0205838  0.0204271  0.0069985  0.0100671  0.01909    0.01768    -0.00141       0.0225
symmetry_worst           0.2702459  0.3228204  0.0417448  0.0742636  0.2687     0.3103     0.0416         0
texture_mean             17.91476   21.6581    3.995125   3.708042   17.39      21.46      4.07           0
texture_se               1.22038    1.212363   0.5891797  0.4838656  1.108      1.127      0.019          0.6195
texture_worst            23.51507   29.37502   5.493955   5.384249   22.82      29.02      6.2            0

Discussion

Generally, it is a good idea to apply the Mann-Whitney U test during the Exploratory Data Analysis stage, since it gives an idea of which variables may be important for the final machine learning model. Most importantly, because it is a non-parametric test, we do not need to assume that the variables follow a particular distribution such as the normal. For example, instead of Student’s t-test we can apply the Mann-Whitney U test without worrying about the normality assumption.
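As a rough sketch of that comparison, the snippet below runs both tests on simulated, right-skewed data. The samples `a` and `b`, the seed and the exponential rates are all hypothetical choices made for illustration and have nothing to do with the breast cancer data. Note that `t.test()` in R defaults to the Welch variant, so `var.equal = TRUE` is passed to obtain Student's t-test.

```r
# Simulated right-skewed samples (hypothetical data, arbitrary seed)
set.seed(42)
a <- rexp(40, rate = 1)
b <- rexp(40, rate = 0.5)

# Student's t-test: assumes normality and equal variances in both groups
t_p <- t.test(a, b, var.equal = TRUE)$p.value

# Mann-Whitney U test: based on ranks only, so no normality assumption
mw_p <- wilcox.test(a, b)$p.value

print(c(t_test = t_p, mann_whitney = mw_p))
```

The two p-values will generally differ, because the tests address different hypotheses: the t-test compares means, while the Mann-Whitney U test compares the rank distributions of the two samples.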
