Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In statistics, the Mann–Whitney U test (also called Wilcoxon rank-sum test) is a nonparametric test of the null hypothesis that it is equally likely that a randomly selected value from one population will be less than or greater than a randomly selected value from a second population. This test can be used to investigate whether two independent samples were selected from populations having the same distribution.
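Some investigators interpret this test as comparing the medians of the two populations, an interpretation that holds when the two distributions differ only by a shift in location. Recall that the parametric counterpart, the two-sample t-test, compares the means ( \(H_0: \mu_1=\mu_2\) ) between independent groups.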
In contrast, the null and two-sided research hypotheses for the nonparametric test are stated as follows:
\(H_0\): The two populations are equal versus
\(H_1\): The two populations are not equal
This test is often performed as a two-sided test and, thus, the research hypothesis states that the populations are not equal rather than specifying a direction. A one-sided research hypothesis is used if interest lies in detecting a positive or negative shift in one population compared to the other. The procedure involves pooling the observations from the two samples into one combined sample, keeping track of which sample each observation comes from, and then ranking the pooled observations from lowest to highest, from 1 to \(n_1+n_2\).
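The pooling and ranking steps described above can be sketched by hand. The two samples below are made-up values without ties, used only for illustration; the variable names `r1` and `u1` are our own. The sketch computes the rank sum of the first sample, converts it to the U statistic, and checks it against the `W` statistic reported by `wilcox.test()`.

```r
# Two small made-up samples (illustrative values only, no ties)
x <- c(1.1, 2.3, 3.8, 5.0)
y <- c(2.9, 4.4, 6.1, 7.5, 8.2)

n1 <- length(x)
n2 <- length(y)

# Pool the observations and rank them from 1 to n1 + n2;
# the first n1 ranks belong to x, the rest to y
pooled_ranks <- rank(c(x, y))

# Sum of the ranks that belong to the first sample
r1 <- sum(pooled_ranks[seq_len(n1)])

# U statistic for the first sample: rank sum minus its minimum possible value
u1 <- r1 - n1 * (n1 + 1) / 2

# wilcox.test() reports the same quantity as its statistic W
w <- unname(wilcox.test(x, y)$statistic)

print(c(rank_sum = r1, U = u1, W = w))
```

Here `u1` also equals the number of (x, y) pairs in which the x value is the larger one, which is why the test is often described in terms of the probability that a value from one population exceeds a value from the other.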
Mann-Whitney U Test in Breast Cancer Dataset
We will work with the Breast Cancer Wisconsin dataset and apply the Mann-Whitney U test to every independent variable, comparing patients diagnosed with a malignant tumor against those with a benign one.
The report shows, for each variable, the mean, the standard deviation, and the median in each group, the difference in medians, and the p-value of the Mann-Whitney U test.
    library(tidyverse)

    # the column names of the dataset
    col_names <- c('id_number', 'diagnosis', 'radius_mean',
                   'texture_mean', 'perimeter_mean', 'area_mean',
                   'smoothness_mean', 'compactness_mean',
                   'concavity_mean', 'concave_points_mean',
                   'symmetry_mean', 'fractal_dimension_mean',
                   'radius_se', 'texture_se', 'perimeter_se',
                   'area_se', 'smoothness_se', 'compactness_se',
                   'concavity_se', 'concave_points_se',
                   'symmetry_se', 'fractal_dimension_se',
                   'radius_worst', 'texture_worst',
                   'perimeter_worst', 'area_worst',
                   'smoothness_worst', 'compactness_worst',
                   'concavity_worst', 'concave_points_worst',
                   'symmetry_worst', 'fractal_dimension_worst')

    # get the data from the URL and assign the column names
    # (the file has no header row, so header = FALSE is needed; with the
    # default header = TRUE the first record is consumed as a header and dropped)
    df <- read.csv(url("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data"),
                   header = FALSE, col.names = col_names)

    # remove the ID number
    df <- df %>% select(-id_number)

    # get the means of all the variables
    means <- df %>%
      group_by(diagnosis) %>%
      summarise_all(list(mean), na.rm = TRUE) %>%
      gather("Variable", "Mean", -diagnosis) %>%
      spread(diagnosis, Mean) %>%
      rename("Mean_M" = "M", "Mean_B" = "B")

    # get the standard deviation of all the variables
    sds <- df %>%
      group_by(diagnosis) %>%
      summarise_all(list(sd), na.rm = TRUE) %>%
      gather("Variable", "SD", -diagnosis) %>%
      spread(diagnosis, SD) %>%
      rename("SD_M" = "M", "SD_B" = "B")

    # get the median of all the variables
    medians <- df %>%
      group_by(diagnosis) %>%
      summarise_all(list(median), na.rm = TRUE) %>%
      gather("Variable", "Median", -diagnosis) %>%
      spread(diagnosis, Median) %>%
      rename("Median_M" = "M", "Median_B" = "B")

    # join the tables
    summary_report <- means %>%
      inner_join(sds, "Variable") %>%
      inner_join(medians, "Variable") %>%
      mutate(DiffInMedians = Median_M - Median_B)

    # now apply the Mann-Whitney U test for all variables
    # (every column except the first one, diagnosis)
    variables <- colnames(df)[2:dim(df)[2]]
    pvals <- c()
    vars <- c()
    for (i in variables) {
      # keep only the group label and the current variable
      pair <- df %>% select(all_of(c("diagnosis", i)))
      x1 <- pair %>% filter(diagnosis == "M") %>% dplyr::select(2) %>% na.omit() %>% pull()
      x2 <- pair %>% filter(diagnosis == "B") %>% dplyr::select(2) %>% na.omit() %>% pull()

      wc <- wilcox.test(x1, x2)
      # p-values are rounded to four decimals, so very small ones show as 0
      pvals <- c(pvals, round(wc$p.value, 4))
      vars <- c(vars, i)
    }

    wc_df <- data.frame(Variable = vars, pvalues = pvals)
    wc_df$Variable <- as.character(wc_df$Variable)

    summary_report <- summary_report %>% inner_join(wc_df, by = "Variable")
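As a possible alternative to the explicit loop, the per-variable tests can be collected with `sapply()` and the formula interface of `wilcox.test()`. This is only a sketch: the data frame `toy` and its columns `feat_a` and `feat_b` are made-up stand-ins for the real dataset, so the p-values say nothing about the breast cancer data.

```r
# Toy data frame standing in for df (made-up values, no ties)
toy <- data.frame(
  diagnosis = rep(c("M", "B"), each = 5),
  feat_a = c(5.1, 6.2, 7.3, 8.4, 9.5, 1.1, 2.2, 3.3, 4.4, 5.0),
  feat_b = c(1.5, 3.5, 5.5, 7.5, 9.5, 2.5, 4.5, 6.5, 8.5, 10.5)
)

# Every column except the grouping variable is tested
features <- setdiff(names(toy), "diagnosis")

# One Mann-Whitney U test per feature; reformulate() builds e.g. feat_a ~ diagnosis
pvals_toy <- sapply(features, function(v) {
  wilcox.test(reformulate("diagnosis", response = v), data = toy)$p.value
})

# Same layout as wc_df in the script: one row per variable
wc_toy <- data.frame(Variable = features,
                     pvalues = round(pvals_toy, 4),
                     row.names = NULL,
                     stringsAsFactors = FALSE)
print(wc_toy)
```

In `feat_a` the two groups are completely separated, while in `feat_b` they are interleaved, so only the first feature yields a small p-value.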
If we run the R script above, we get the following output. As we can see, almost all of the variables differ significantly between the two groups (malignant and benign) at the 0.05 level. The only variables without a statistically significant difference are fractal_dimension_mean, smoothness_se, and texture_se. Because the script rounds the p-values to four decimals, a value of 0 in the pvalues column means p < 0.00005, not an exact zero.
Variable | Mean_B | Mean_M | SD_B | SD_M | Median_B | Median_M | DiffInMedians | pvalues |
area_mean | 462.7902 | 978.2692 | 134.2871 | 368.8097 | 458.4 | 930.9 | 472.5 | 0 |
area_se | 21.13515 | 72.28981 | 8.843472 | 61.24716 | 19.63 | 58.38 | 38.75 | 0 |
area_worst | 558.8994 | 1419.458 | 163.6014 | 597.967 | 547.4 | 1302 | 754.6 | 0 |
compactness_mean | 0.0800846 | 0.1445602 | 0.03375 | 0.0533352 | 0.07529 | 0.1319 | 0.05661 | 0 |
compactness_se | 0.0214383 | 0.0322017 | 0.0163515 | 0.0183944 | 0.01631 | 0.02855 | 0.01224 | 0 |
compactness_worst | 0.1826725 | 0.373446 | 0.09218 | 0.1695886 | 0.1698 | 0.3559 | 0.1861 | 0 |
concave_points_mean | 0.0257174 | 0.0877099 | 0.0159088 | 0.0342122 | 0.02344 | 0.08624 | 0.0628 | 0 |
concave_points_se | 0.0098577 | 0.0150566 | 0.0057086 | 0.0055302 | 0.009061 | 0.0142 | 0.005139 | 0 |
concave_points_worst | 0.0744443 | 0.1818432 | 0.0357974 | 0.0460601 | 0.07431 | 0.182 | 0.10769 | 0 |
concavity_mean | 0.0460576 | 0.1601144 | 0.0434422 | 0.0745776 | 0.03709 | 0.1508 | 0.11371 | 0 |
concavity_se | 0.0259967 | 0.0417676 | 0.0329182 | 0.0216391 | 0.0184 | 0.0371 | 0.0187 | 0 |
concavity_worst | 0.1662377 | 0.4493672 | 0.1403677 | 0.1810384 | 0.1412 | 0.4029 | 0.2617 | 0 |
fractal_dimension_mean | 0.0628674 | 0.0626041 | 0.0067473 | 0.0075099 | 0.06154 | 0.06149 | -0.00005 | 0.4788 |
fractal_dimension_se | 0.0036361 | 0.0040523 | 0.0029382 | 0.002041 | 0.002808 | 0.003739 | 0.000931 | 0 |
fractal_dimension_worst | 0.0794421 | 0.0914002 | 0.0138041 | 0.021521 | 0.07712 | 0.08758 | 0.01046 | 0 |
perimeter_mean | 78.07541 | 115.3301 | 11.80744 | 21.90059 | 78.18 | 114.2 | 36.02 | 0 |
perimeter_se | 2.000321 | 4.303716 | 0.7711692 | 2.557696 | 1.851 | 3.654 | 1.803 | 0 |
perimeter_worst | 87.00594 | 141.1655 | 13.52709 | 29.37531 | 86.92 | 137.9 | 50.98 | 0 |
radius_mean | 12.14652 | 17.46033 | 1.780512 | 3.211384 | 12.2 | 17.3 | 5.1 | 0 |
radius_se | 0.2840824 | 0.6067796 | 0.1125696 | 0.3442221 | 0.2575 | 0.5449 | 0.2874 | 0 |
radius_worst | 13.3798 | 21.11469 | 1.981368 | 4.283704 | 13.35 | 20.58 | 7.23 | 0 |
smoothness_mean | 0.0924777 | 0.102825 | 0.0134461 | 0.0125927 | 0.09076 | 0.102 | 0.01124 | 0 |
smoothness_se | 0.0071959 | 0.0067819 | 0.0030606 | 0.0028972 | 0.00653 | 0.006208 | -0.000322 | 0.2134 |
smoothness_worst | 0.1249595 | 0.144763 | 0.0200135 | 0.021889 | 0.1254 | 0.1434 | 0.018 | 0 |
symmetry_mean | 0.174186 | 0.1926768 | 0.0248068 | 0.0274958 | 0.1714 | 0.1896 | 0.0182 | 0 |
symmetry_se | 0.0205838 | 0.0204271 | 0.0069985 | 0.0100671 | 0.01909 | 0.01768 | -0.00141 | 0.0225 |
symmetry_worst | 0.2702459 | 0.3228204 | 0.0417448 | 0.0742636 | 0.2687 | 0.3103 | 0.0416 | 0 |
texture_mean | 17.91476 | 21.6581 | 3.995125 | 3.708042 | 17.39 | 21.46 | 4.07 | 0 |
texture_se | 1.22038 | 1.212363 | 0.5891797 | 0.4838656 | 1.108 | 1.127 | 0.019 | 0.6195 |
texture_worst | 23.51507 | 29.37502 | 5.493955 | 5.384249 | 22.82 | 29.02 | 6.2 | 0 |
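The zeros in the pvalues column come from `round(wc$p.value, 4)` in the script. A minimal sketch of one way to keep very small p-values readable is base R's `format.pval()`; the p-values below are hypothetical and chosen only to illustrate the formatting, and the threshold `eps = 1e-4` is an arbitrary choice.

```r
# Hypothetical p-values, chosen only to illustrate the formatting
p <- c(3.2e-12, 0.00004, 0.0225, 0.4788)

# Rounding to four decimals collapses the tiny values to 0
rounded <- round(p, 4)

# format.pval() reports values below eps as "< eps" instead of 0
formatted <- format.pval(p, digits = 2, eps = 1e-4)

print(data.frame(p = p, rounded = rounded, formatted = formatted))
```

Note that `format.pval()` returns character strings, so it suits the final report rather than any further computation on the p-values.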
Discussion
Generally, it is a good idea to apply the Mann-Whitney U test during exploratory data analysis, since it gives an idea of which variables may be significant for the final machine learning model. Most importantly, because it is a nonparametric test, we do not need to assume that the variables follow a particular distribution such as the normal. For example, instead of Student's t-test we can apply the Mann-Whitney U test without worrying about the normality assumption.
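To illustrate why the distribution matters less for the Mann-Whitney U test, the sketch below compares it with the t-test on simulated right-skewed data. The data are log-normal draws generated here for illustration, not taken from the breast cancer dataset, and the seed and parameters are arbitrary. Because the test uses only ranks, a monotone transformation such as `log()` leaves its p-value unchanged, whereas the t-test result depends on the scale.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Simulated right-skewed samples (log-normal); not from the breast cancer dataset
g1 <- rlnorm(50, meanlog = 0, sdlog = 1)
g2 <- rlnorm(50, meanlog = 0.5, sdlog = 1)

# Welch t-test on the raw and on the log scale
t_raw <- t.test(g1, g2)$p.value
t_log <- t.test(log(g1), log(g2))$p.value

# Mann-Whitney U test on the raw and on the log scale
mw_raw <- wilcox.test(g1, g2)$p.value
mw_log <- wilcox.test(log(g1), log(g2))$p.value

print(c(t_raw = t_raw, t_log = t_log, mw_raw = mw_raw, mw_log = mw_log))
```

The two Mann-Whitney p-values coincide because `log()` preserves the ordering of the pooled observations, while the two t-test p-values generally differ.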