
Mastering Quantile Normalization in R: A Step-by-Step Guide

[This article was first published on Steve's Data Tips and Tricks, and kindly contributed to R-bloggers].
Introduction

Quantile normalization is a crucial technique in data preprocessing, especially in fields like genomics and bioinformatics. It ensures that the distributions of different samples are aligned, making them directly comparable. In this tutorial, we’ll walk through the process step by step, demystifying the syntax and empowering you to apply this technique confidently in your projects.

Understanding Quantile Normalization

Before we dive into the code, let’s understand the concept behind quantile normalization. At its core, quantile normalization aims to equalize the distributions of multiple datasets by aligning their quantiles. This ensures that each dataset has the same distribution of values, making meaningful comparisons possible.
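To make the idea concrete, here is a minimal sketch on a tiny made-up matrix. The values and the names toy, sorted_cols, target, and toy_qn are illustrative assumptions, not part of the original post. Each column is sorted, the k-th smallest values are averaged across columns, and every value is then replaced by the average that matches its rank within its own column.

```r
# Toy data: 4 observations (rows) for 3 samples (columns); values are made up
toy <- matrix(c(5, 2, 3, 4,     # sample a
                4, 1, 6, 2,     # sample b
                3, 7, 13, 10),  # sample c
              nrow = 4,
              dimnames = list(NULL, c("a", "b", "c")))

# Sort every column on its own, then average the k-th smallest values
sorted_cols <- apply(toy, 2, sort)
target <- rowMeans(sorted_cols)   # 2 4 6 8

# Replace each value by the target value at its within-column rank
toy_qn <- apply(toy, 2, function(x) target[rank(x, ties.method = "first")])

print(target)
print(toy_qn)
```

In column a, the largest value (5) becomes 8 and the smallest (2) becomes 2, so the ordering within the column is kept while all three columns end up holding the same four values.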

Example

Step 1: Load Your Data

First things first: you’ll need some data to work with. For this tutorial, we simulate a data frame called df with three samples (columns) of 100 values each, drawn from normal distributions with different means and spreads.

set.seed(42)  # For reproducibility
df <- data.frame(
  sample1 = rnorm(100, mean = 5, sd = 2),
  sample2 = rnorm(100, mean = 10, sd = 1),
  sample3 = rnorm(100)
)

head(df)
   sample1   sample2    sample3
1 7.741917 11.200965 -2.0009292
2 3.870604 11.044751  0.3337772
3 5.726257  8.996791  1.1713251
4 6.265725 11.848482  2.0595392
5 5.808537  9.333227 -1.3768616
6 4.787751 10.105514 -1.1508556
hist(df$sample1, col = 'red', xlim = c(min(df), max(df)), 
     main = 'Distributions Before Normalization', xlab = 'Value')
hist(df$sample2, col = 'blue', add = TRUE)
hist(df$sample3, col = 'green', add = TRUE)
# Add legend
legend('topright', 
       c('Sample 1', 'Sample 2','Sample 3'), 
       fill=c('red','blue', 'green'))
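Alongside the histogram, a quick numeric comparison shows how far apart the three samples start. The sketch below regenerates the same simulated data and tabulates the quartiles of each column. The names q_before and q_spread are illustrative assumptions, not part of the original post.

```r
# Same simulated data as in the tutorial
set.seed(42)
df <- data.frame(
  sample1 = rnorm(100, mean = 5, sd = 2),
  sample2 = rnorm(100, mean = 10, sd = 1),
  sample3 = rnorm(100)
)

# Quartiles of each column before any normalization
q_before <- sapply(df, quantile, probs = seq(0, 1, 0.25))
print(round(q_before, 2))

# Gap between the largest and smallest sample at each quartile
# (a gap of 0 would mean the samples already agree at that quantile)
q_spread <- apply(q_before, 1, function(q) diff(range(q)))
print(round(q_spread, 2))
```

Large gaps at every quartile are what quantile normalization is meant to remove.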

Step 2: Perform Quantile Normalization

Now it’s time to perform quantile normalization. Base R has no built-in function for this, so we define a small helper, qn(), from base R building blocks such as apply(), sort(), and rowMeans(). The quantile() function is not needed for the normalization itself; we use it later only to check the result. Function based on: https://lifewithdata.com/2023/09/02/how-to-perform-quantile-normalization-in-r/

# Perform quantile normalization
qn <- function(.data){
 # Sort each column independently (smallest to largest)
 data_sort <- apply(.data, 2, sort)
 # Average the k-th smallest values across columns
 row_means <- rowMeans(data_sort)
 # Repeat the row means down every column (column-wise fill)
 data_sort <- matrix(row_means, 
                     nrow = nrow(data_sort), 
                     ncol = ncol(data_sort), 
                     byrow = FALSE
                     )
 # Rank of each original value within its own column
 index_rank <- apply(.data, 2, rank, ties.method = "first")
 normalized_data <- matrix(nrow = nrow(.data), ncol = ncol(.data))
 for(i in 1:ncol(.data)){
   normalized_data[,i] <- data_sort[index_rank[,i], i]
 }
 return(normalized_data)
}

normalized_data <- qn(df)

Let’s break down this code snippet:

1. Function Definition:

qn <- function(.data){
  # ... function body here ...
}

This defines a function named qn that takes a data frame or matrix (.data) as input, with one column per sample. This is the dataset you want to normalize.

2. Sorting Each Column:

data_sort <- apply(.data, 2, sort)

This line sorts each column of .data independently, from smallest to largest, so row k of the result holds the k-th smallest value of every sample. The values in a row no longer belong to the same original observation. The result is stored in data_sort.

3. Calculating Row Means:

row_means <- rowMeans(data_sort)

This line averages across each row of the sorted matrix. Because row k of data_sort holds the k-th smallest value from every column, row_means[k] is the mean of those k-th smallest values. Together, these means form the shared target distribution. The result is stored in row_means.

4. Replicating Row Means into a Matrix:

data_sort <- matrix(row_means, 
                    nrow = nrow(data_sort), 
                    ncol = ncol(data_sort), 
                    byrow = FALSE
                    )

This part is a bit trickier. It builds a new matrix (overwriting data_sort) with the same dimensions as the sorted data and fills every column with the vector of row means. With byrow = FALSE the matrix is filled column by column and row_means is recycled, so row k of every column holds the k-th row mean. Filling with byrow = TRUE instead would scatter the means across rows and break the lookup in step 7.

5. Ranking Each Value’s Position:

index_rank <- apply(.data, 2, rank, ties.method = "first")

This line assigns each value its rank within its own column of the original data. Imagine a race where first place gets rank 1, second place gets rank 2, and so on: here the smallest value in a column gets rank 1 and the largest gets rank nrow(.data). Setting ties.method = "first" keeps the ranks as whole numbers, so they can be used as row indices even when a column contains tied values. The result is stored in index_rank.

6. Building the Normalized Data Frame:

normalized_data <- matrix(nrow = nrow(.data), ncol = ncol(.data))

This line creates an empty matrix (normalized_data) with the same dimensions as the original data frame. This will eventually hold the normalized data.

7. Looping Through Columns and Assigning Ranked Values:

for(i in 1:ncol(.data)){
  normalized_data[,i] <- data_sort[index_rank[,i], i]
}

This is the core of the normalization process. It loops over the columns of the original data. For each column, the ranks in index_rank are used as row indices into the matrix of row means (data_sort), so a value that is the k-th smallest in its column is replaced by the k-th row mean. Every column therefore ends up with the same set of values, while the ordering of observations within each column is preserved.

8. Returning the Normalized Data:

return(normalized_data)

Finally, the function returns the normalized_data matrix, which contains the quantile normalized version of your original data frame.
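Step 5 talks about the rank of each value. In R, rank() and order() answer two different questions, and it can help to see them side by side. The sketch below uses a small made-up vector; x, target, by_rank, and by_order are illustrative names, not part of the original post.

```r
x <- c(30, 10, 20)   # made-up values

rank(x)    # sorted position of each value:  3 1 2
order(x)   # indices that would sort x:      2 3 1

# The two are inverse permutations of each other
inverse_ok <- identical(order(order(x)), as.integer(rank(x)))

# Looking up replacement values by rank keeps the largest input largest
target   <- c(1, 2, 3)         # already sorted replacement values (illustrative)
by_rank  <- target[rank(x)]    # 3 1 2: same ordering as x
by_order <- target[order(x)]   # 2 3 1: the ordering of x is lost

print(rbind(x, by_rank, by_order))
```

For a rank-based replacement like the one in step 7, the lookup by rank is the one that keeps each observation in its original relative position.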

In essence, this code performs a rank-based normalization: each column is sorted separately, the k-th smallest values are averaged across columns, and each original value is replaced by the average that matches its rank within its own column. As a result, every column ends up with exactly the same distribution of values.
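The same idea can be written more compactly, together with two sanity checks. This is a sketch under stated assumptions rather than the function from the post: qn_compact is a hypothetical name, the test matrix m is made up, and ties are simply broken by first occurrence.

```r
# Hypothetical compact variant (name and tie handling are assumptions)
qn_compact <- function(x) {
  x <- as.matrix(x)
  # Mean of the k-th smallest value across columns: the shared target distribution
  target <- rowMeans(apply(x, 2, sort))
  # Replace each value by the target value at its within-column rank
  out <- apply(x, 2, function(col) target[rank(col, ties.method = "first")])
  dimnames(out) <- dimnames(x)
  out
}

# Made-up data with three differently shaped columns
set.seed(1)
m <- cbind(a = rnorm(50, 5, 2), b = rexp(50), c = runif(50, 10, 20))
m_qn <- qn_compact(m)

# Check 1: every column holds the same set of values
all_same <- all(apply(m_qn, 2, function(col) {
  isTRUE(all.equal(sort(col), sort(m_qn[, 1])))
}))

# Check 2: the ordering of observations within each column is unchanged
order_kept <- all(sapply(seq_len(ncol(m)), function(j) {
  identical(order(m[, j]), order(m_qn[, j]))
}))

print(c(all_same = all_same, order_kept = order_kept))
```

Both checks should come back TRUE: the columns share one distribution, and no observation has changed its relative position within its sample.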

Step 3: Explore the Results

After quantile normalization, you’ll have a matrix of normalized values with one column per sample, ready for further analysis. Compare the summary of the original data with the summary of the normalized data to confirm that the distributions are aligned as expected.

summary(df)
    sample1           sample2          sample3        
 Min.   :-0.9862   Min.   : 7.975   Min.   :-2.69993  
 1st Qu.: 3.7666   1st Qu.: 9.409   1st Qu.:-0.71167  
 Median : 5.1796   Median : 9.931   Median :-0.02474  
 Mean   : 5.0650   Mean   : 9.913   Mean   :-0.01037  
 3rd Qu.: 6.3231   3rd Qu.:10.462   3rd Qu.: 0.65254  
 Max.   : 9.5733   Max.   :12.702   Max.   : 2.45959  
# Explore the results
summary(normalized_data)
       V1              V2              V3       
 Min.   :1.430   Min.   :1.430   Min.   :1.430  
 1st Qu.:4.154   1st Qu.:4.154   1st Qu.:4.154  
 Median :5.029   Median :5.029   Median :5.029  
 Mean   :4.989   Mean   :4.989   Mean   :4.989  
 3rd Qu.:5.812   3rd Qu.:5.812   3rd Qu.:5.812  
 Max.   :8.245   Max.   :8.245   Max.   :8.245  

Step 4: Obtain Quantiles

Now that the data is normalized, we can extract the quantiles to compare the distributions across datasets. This will help you confirm that the normalization process was successful.

as.data.frame(normalized_data) |> 
  sapply(function(x) quantile(x, probs = seq(0,1,1/4)))
           V1       V2       V3
0%   1.429737 1.429737 1.429737
25%  4.154481 4.154481 4.154481
50%  5.028521 5.028521 5.028521
75%  5.812480 5.812480 5.812480
100% 8.244925 8.244925 8.244925

As we can see, the quantiles of the normalized data are consistent across the different datasets. This indicates that the distributions have been aligned through quantile normalization.

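Let’s visualize the result as one more confirmation. Because the three normalized columns contain the same set of values, their histograms overlap exactly.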

df_normalized <- as.data.frame(normalized_data)

hist(df_normalized$V1, col = 'red')
hist(df_normalized$V2, col = 'blue', add = TRUE)
hist(df_normalized$V3, col = 'green', add = TRUE)

legend('topright', c('Sample 1', 'Sample 2','Sample 3'), fill=c('red','blue', 'green'))
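Side-by-side boxplots are another way to see the effect. The sketch below regenerates the simulated data and applies an inline rank-based normalization. The names target, df_qn, bp_before, and bp_after are illustrative assumptions, and ties are broken by first occurrence for simplicity.

```r
# Same simulated data as in the tutorial
set.seed(42)
df <- data.frame(
  sample1 = rnorm(100, mean = 5, sd = 2),
  sample2 = rnorm(100, mean = 10, sd = 1),
  sample3 = rnorm(100)
)

# Inline rank-based normalization (a sketch, not the post's function)
target <- rowMeans(apply(df, 2, sort))
df_qn <- as.data.frame(
  lapply(df, function(x) target[rank(x, ties.method = "first")])
)

# Two panels on a shared y-axis: raw samples left, normalized samples right
y_range <- range(unlist(df), unlist(df_qn))
old_par <- par(mfrow = c(1, 2))
bp_before <- boxplot(df, ylim = y_range, main = "Before normalization")
bp_after  <- boxplot(df_qn, ylim = y_range, main = "After normalization")
par(old_par)
```

Since overlaid histograms of identical columns hide one another, separate boxes make it easier to confirm that all three samples now match.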

Wrapping Up

Congratulations! You’ve worked through quantile normalization in R. By understanding the underlying concept, applying a rank-based helper such as qn(), and checking the result with quantile(), you can ensure that your datasets are comparable and ready for downstream analysis.

I encourage you to experiment with different datasets and explore the impact of quantile normalization on your analyses. Remember, practice makes perfect, so don’t hesitate to try it out on your own data. Happy coding!
