Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The answer to the question ‘R or Python for data science’ depends mainly on the problem to be solved, the tools required to solve it, and your personal preference.
Python is a general-purpose programming language created by Guido van Rossum and first released in 1991. R followed in 1993, created by Ross Ihaka and Robert Gentleman with statisticians in mind.
R has a steep learning curve, which makes it a bit difficult for beginners, but once the basics are clear the advanced material is easier to pick up. Python, on the other hand, is simple and readable, so its learning curve is relatively gentle and it is a good choice for beginners.
R often offers several different ways to write the same functionality, whereas Python tends to favor one obvious way of doing things.
RStudio is the best IDE for R. Spyder, IPython Notebook, and Eric are some of the IDEs for Python. Both R and Python have a huge number of reliable libraries. CRAN is the biggest repository of R packages, while PyPI is the Python package repository.
Popular libraries in R include caret, dplyr, data.table, zoo, ggplot2, ggvis, stringr, and lattice. Libraries like pandas, scikit-learn, SciPy, NumPy, and matplotlib make Python more attractive. Both R and Python have good support and documentation.
When it comes to data visualization, R has the upper hand over Python. ggplot2 and ggvis are two excellent visualization packages in R.
Here are a few code examples from both languages that produce the same results.
To import a .csv dataset,
R:
dataset_name <- read.csv("dataset_name.csv")
Python:
import pandas
dataset_name = pandas.read_csv("dataset_name.csv")
To find the dimensions of the dataset,
R:
dim(dataset_name)
Python:
dataset_name.shape
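To obtain the first few observations in a data frame (by default head() returns 6 rows in R and 5 in pandas; pass n as an argument to either one to get the first n rows),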
R:
head(dataset_name)
Python:
dataset_name.head()
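The three Python snippets above can be tried end to end without a file on disk. The following is a minimal sketch that assumes pandas is installed; the inline CSV text and its column names (id, height, weight) are made up for illustration and simply stand in for dataset_name.csv.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for "dataset_name.csv".
csv_text = """id,height,weight
1,170,65
2,182,80
3,165,55
4,177,72
"""

# read_csv accepts a file path or any file-like object.
dataset_name = pd.read_csv(io.StringIO(csv_text))

# shape is an attribute (no parentheses) holding (rows, columns).
n_rows, n_cols = dataset_name.shape

# head(n) returns the first n rows; with no argument it returns 5.
first_two = dataset_name.head(2)

print(n_rows, n_cols)
print(first_two)
```

Note that shape is the counterpart of R's dim(), and it is read as an attribute rather than called as a function.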
For splitting the dataset into training and test sets,
R:
RowCount <- floor(0.75 * nrow(dataset_name))
set.seed(123)
trainIndex <- sample(1:nrow(dataset_name), RowCount)
train <- dataset_name[trainIndex,]
test <- dataset_name[-trainIndex,]
Python:
train = dataset_name.sample(frac=0.75, random_state=1)
test = dataset_name.loc[~dataset_name.index.isin(train.index)]
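A quick way to check that the Python split behaves as intended is to run it on a small synthetic data frame and confirm that the two parts do not overlap and together cover every row. This is only a sketch: the 100-row frame and its column names are assumptions, and it relies on the index being unique, which the isin-based filter needs.

```python
import pandas as pd

# Hypothetical data: 100 rows with the default unique integer index.
dataset_name = pd.DataFrame({"x": range(100), "y": [v * 2 for v in range(100)]})

# Sample 75% of the rows at random; random_state makes the draw repeatable.
train = dataset_name.sample(frac=0.75, random_state=1)

# Every row whose index label was not drawn into train goes to the test set.
test = dataset_name.loc[~dataset_name.index.isin(train.index)]

# The two parts should be disjoint and should add up to the full dataset.
overlap = set(train.index) & set(test.index)
print(len(train), len(test), len(overlap))
```

Under the same unique-index assumption, dataset_name.drop(train.index) is a shorter way to obtain the same test set.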
R is more functional in nature and has a lot of built-in data analysis features. Python, on the other hand, is an object-oriented language that mostly relies on packages for data analysis. When it comes to data science, both languages are important, and it is up to the data analyst to choose between the two. If you know both, you are definitely ahead of many others in this field.