An easy way to manage your genome-wide association data: the GenABEL package

Andrea Pedretti

10 years ago

[This article was first published on Milano R net, and kindly contributed to R-bloggers].

Here is a short overview of the GenABEL library, developed by Yurii Aulchenko (www.genabel.org/).

GenABEL is a full-featured R library for genome-wide association (GWA) analysis of binary and quantitative traits.

Compared to the 'genetics' package and many other tools, GenABEL provides specific features for storing and manipulating large amounts of data, tests for GWA analysis, and functions for estimating the kinship matrix from a dense marker panel.

Perhaps the most useful feature of GenABEL is its special data class, gwaa.data. An object of this class stores GWA data efficiently and makes it simple to retrieve the information in your dataset.

At the first level, a gwaa.data-class object has the phdata slot, accessed with the command dataset@phdata, which contains all phenotypic information in a data frame (a data.frame-class object). The rows of this data frame correspond to study subjects, and the columns correspond to the variables/phenotypes. Two default variables are always present in phdata: the first is "id", which contains the study subject identification code and must be unique; the second is "sex", a dummy variable indicating the subject's sex.

If you want to add phenotypes from another data frame to a phdata object that already exists, use the special GenABEL function add.phdata. This function adds the variables contained in a data frame to the existing dataset@phdata object. The data frame to be added must contain an "id" variable whose values are identical to those already in the object.
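
The id-matching requirement can be illustrated without GenABEL itself. The sketch below uses base R's merge() on two toy data frames. It is only an analogy for what add.phdata needs from its input, not GenABEL's implementation; the subject IDs, the "height" column and the 0/1 sex coding are made up for the example.

```r
# Toy phenotype table laid out like phdata: one row per subject,
# "id" and "sex" first (sex coded 0/1 here as an assumption).
phdata <- data.frame(
  id  = c("p1", "p2", "p3"),
  sex = c(1, 0, 1),
  stringsAsFactors = FALSE
)

# New phenotypes to add; the "id" column must use the same subject IDs.
extra <- data.frame(
  id     = c("p3", "p1", "p2"),   # order may differ, values must match
  height = c(181.2, 175.0, 162.4),
  stringsAsFactors = FALSE
)

# Refuse to continue if the two ID sets differ or IDs are repeated.
stopifnot(setequal(phdata$id, extra$id), !anyDuplicated(extra$id))

# Join on "id"; all.x = TRUE keeps every subject already in phdata.
merged <- merge(phdata, extra, by = "id", all.x = TRUE)
print(merged)
```

Checking the ID sets before joining makes a mismatch fail loudly instead of silently producing missing values.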

The other slot of a gwaa.data-class object is gtdata, which contains all genetic data in an object of class snp.data. This class, in turn, has slots containing the number of study subjects, the ID names of these subjects, the number of SNPs typed, the SNP names, the chromosome each SNP belongs to, the map position of the SNPs, strand information and the sex code for the subjects. The latter is identical to the "sex" variable contained in phdata.

To import data into GenABEL, you need to prepare two files: one containing the phenotypic data, and another containing the genotypic data.

Phenotype file: the first column must contain the subjects' unique IDs. The IDs listed here and in the genotypic data file must be the same. The second column must contain the sex information, and the remaining columns should contain the phenotypic information. The names of the first two columns must be 'id' and 'sex'.
Genotype file: information on chromosome, map position and strand should be provided for every SNP, and the genotype at every SNP must be given for every study subject.
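
As a rough illustration of the phenotype file layout, the following base R sketch writes a small toy file. The phenotypes "age" and "bmi", the subject IDs, the 1 = male / 0 = female coding and the whitespace-delimited format with a header line are assumptions of this example, not requirements stated above.

```r
# Toy phenotype table: "id" and "sex" must be the first two columns.
# Sex is coded 1 = male, 0 = female here (an assumption of this sketch);
# "age" and "bmi" are invented example phenotypes.
pheno <- data.frame(
  id  = c("p1", "p2", "p3"),
  sex = c(1, 0, 0),
  age = c(43, 51, 38),
  bmi = c(24.1, 27.8, NA),
  stringsAsFactors = FALSE
)

# Basic checks matching the rules in the text.
stopifnot(
  identical(names(pheno)[1:2], c("id", "sex")),
  !anyDuplicated(pheno$id)
)

# Write a whitespace-delimited text file with a header line.
phe_file <- file.path(tempdir(), "phe.txt")
write.table(pheno, file = phe_file, quote = FALSE, row.names = FALSE)

# Read it back to confirm the layout survived the round trip.
check <- read.table(phe_file, header = TRUE, stringsAsFactors = FALSE)
print(check)
```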

GenABEL has a number of functions to convert these datasets from different formats to the internal GenABEL raw format. One of those formats is the Illumina format. To be clear, the "Illumina" format is just one of the possible text output formats of Illumina BeadStudio; similar formats are generated by HapMap and Affymetrix. A file in the "Illumina" format contains SNPs in rows and IDs in columns, and the first four columns should contain information on SNP name, chromosome, position and strand. Each of the remaining columns corresponds to an individual, with the ID as the column name; the elements of these columns are the genotypes.
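
To make the layout concrete, here is a small base R sketch that writes a toy file with SNPs in rows and subjects in columns, then checks that its subject IDs agree with the phenotype IDs. The header names, the two-letter genotype codes, the SNP names and the subject IDs are all assumptions of this example; consult the GenABEL documentation for the exact conventions the converter expects.

```r
# Toy genotype table in the layout described above: SNPs in rows,
# four annotation columns, then one column per subject.
# Header names and two-letter genotype codes are assumptions of this sketch.
geno <- data.frame(
  name   = c("rs1", "rs2", "rs3"),
  chr    = c(1, 1, 2),
  pos    = c(1500, 2800, 910),
  strand = c("+", "-", "+"),
  p1     = c("AA", "CT", "GG"),
  p2     = c("AG", "CC", "GT"),
  p3     = c("GG", "CT", "TT"),
  stringsAsFactors = FALSE
)

illu_file <- file.path(tempdir(), "gen.illu")
write.table(geno, file = illu_file, quote = FALSE, row.names = FALSE)

# Subject IDs are the column names after the first four columns.
header   <- scan(illu_file, what = character(), nlines = 1, quiet = TRUE)
geno_ids <- header[-(1:4)]

# IDs from the phenotype file (typed in here to keep the sketch standalone).
pheno_ids <- c("p1", "p2", "p3")

# Both files must describe the same subjects.
ids_match <- setequal(geno_ids, pheno_ids)
print(ids_match)
```

Running a check like this before conversion catches ID mismatches between the two files early.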

Once this file contains all the required genotypic information, you can convert the data to GenABEL raw format using the conversion command:

> convert.snp.illumina(inf = "gen.illu", out = "gen.raw", strand = "file")

The option strand = "file" indicates that the strand information is provided in the file.

Finally, you can load the data into GenABEL by typing:

> dataset <- load.gwaa.data(phe = "phe.txt", gen = "gen.raw")

Now you can start with the analyses!

This is only a brief introduction to the package. In my opinion, many different methods (parametric and non-parametric) are suitable for conducting a genome-wide association study, but the GenABEL package can be a great help with the management and quality control of your dataset.

P.S. On the GenABEL website you will find documentation, tutorials, and a forum where you can find answers to your questions.
