Step by Step Tutorial: Deep Learning with TensorFlow in R
Deep Learning with TensorFlow
Deep learning, also known as deep structured learning or hierarchical learning, is a type of machine learning that focuses on learning representations of the data (its features) rather than on algorithms built for one specific task. Feature learning, also known as representation learning, can be supervised, semi-supervised or unsupervised.
Deep learning architectures include deep neural networks, deep belief networks and recurrent neural networks. Real-world applications using deep learning include computer vision, speech recognition, machine translation, natural language processing, and image recognition.
The following recipe shows how to implement a deep neural network using TensorFlow, an open source software library, originally developed at Google, that performs complex computation by constructing network graphs of mathematical operations and data (Abadi et al. 2016; Cheng et al. 2017). Tang et al. (2017) developed an R interface to the TensorFlow API for our use.
A deep neural network is a neural network with multiple hidden layers. The extra layers add complexity to the model, but they also allow the network to learn the underlying patterns in the data.
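To make "multiple hidden layers" concrete, here is a minimal base R sketch of a single forward pass through a network with two hidden layers. It does not use TensorFlow. The layer sizes, the random weights, and the `relu` and `sigmoid` helpers are all made up for illustration, and biases are left out to keep it short.

```r
# Forward pass through a tiny network with two hidden layers.
# All sizes and weights are hypothetical; nothing here is trained.
set.seed(42)

relu <- function(x) pmax(x, 0)              # hidden-layer activation
sigmoid <- function(x) 1 / (1 + exp(-x))    # squashes the output into (0, 1)

n_inputs <- 4
layer_sizes <- c(5, 3)                      # two hidden layers

# One example with four input features
x <- matrix(rnorm(n_inputs), nrow = 1)

# Random weight matrices connecting consecutive layers
w1 <- matrix(rnorm(n_inputs * layer_sizes[1]), nrow = n_inputs)
w2 <- matrix(rnorm(layer_sizes[1] * layer_sizes[2]), nrow = layer_sizes[1])
w_out <- matrix(rnorm(layer_sizes[2]), ncol = 1)

h1 <- relu(x %*% w1)          # first hidden layer, 1 x 5
h2 <- relu(h1 %*% w2)         # second hidden layer, 1 x 3
p <- sigmoid(h2 %*% w_out)    # output, a single value between 0 and 1
```

Each hidden layer transforms the output of the previous one, which is what lets deeper networks build up more complex patterns from simpler ones.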
Before we use this library, we need to install it. Since this is a very recent library, we will install it directly from GitHub.
devtools::install_github("rstudio/tfestimators")
library(tfestimators)
Although we installed the library, we don’t yet have the compiled TensorFlow code itself, which we need to install using the install_tensorflow() command that comes with the tfestimators package.
install_tensorflow()
When you try to run this, you may run into an error like this one:
#> Error: Prerequisites for installing
#> TensorFlow not available. Execute the
#> following at a terminal to install the
#> prerequisites: $ sudo
#> /usr/local/bin/pip install --upgrade
#> virtualenv
I was able to fix the error by running the above command on a Mac. On Windows, you may need further troubleshooting. After installing the prerequisites, you can try installing TensorFlow again.
install_tensorflow()
We will use the sample donor data set from the book Data Science for Fundraising. We’ll load it using the read_csv function from the readr library.
library(readr)
library(dplyr)
donor_data <- read_csv("https://www.dropbox.com/s/ntd5tbhr7fxmrr4/DonorSampleDataCleaned.csv?raw=1")
Let’s see what this data looks like:
glimpse(donor_data)
#> Observations: 34,508
#> Variables: 23
#> $ ID                  <int> 1, 2, 3, 4, 5, 6,...
#> $ ZIPCODE             <chr> "23187", "77643",...
#> $ AGE                 <int> NA, 33, NA, 31, 6...
#> $ MARITAL_STATUS      <chr> "Married", NA, "M...
#> $ GENDER              <chr> "Female", "Female...
#> $ MEMBERSHIP_IND      <chr> "N", "N", "N", "N...
#> $ ALUMNUS_IND         <chr> "N", "Y", "N", "Y...
#> $ PARENT_IND          <chr> "N", "N", "N", "N...
#> $ HAS_INVOLVEMENT_IND <chr> "N", "Y", "N", "Y...
#> $ WEALTH_RATING       <chr> NA, NA, NA, NA, N...
#> $ DEGREE_LEVEL        <chr> NA, "UB", NA, NA,...
#> $ PREF_ADDRESS_TYPE   <chr> "HOME", NA, "HOME...
#> $ EMAIL_PRESENT_IND   <chr> "N", "Y", "N", "Y...
#> $ CON_YEARS           <int> 1, 0, 1, 0, 0, 0,...
#> $ PrevFYGiving        <chr> "$0", "$0", "$0",...
#> $ PrevFY1Giving       <chr> "$0", "$0", "$0",...
#> $ PrevFY2Giving       <chr> "$0", "$0", "$0",...
#> $ PrevFY3Giving       <chr> "$0", "$0", "$0",...
#> $ PrevFY4Giving       <chr> "$0", "$0", "$0",...
#> $ CurrFYGiving        <chr> "$0", "$0", "$200...
#> $ TotalGiving         <dbl> 10, 2100, 200, 0,...
#> $ DONOR_IND           <chr> "Y", "Y", "Y", "N...
#> $ BIRTH_DATE          <date> NA, 1984-06-16, ...
TensorFlow doesn’t tolerate missing values, so we will replace missing categorical values with the column mode and missing numeric values with the column median.
# mode helper adapted from
# https://stackoverflow.com/a/8189441/934898
# NAs are dropped first so that NA is never returned as the mode
my_mode <- function(x) {
  ux <- unique(x[!is.na(x)])
  ux[which.max(tabulate(match(x, ux)))]
}

donor_data <- donor_data %>%
  mutate_if(is.numeric,
            .funs = funs(
              ifelse(is.na(.), median(., na.rm = TRUE), .))) %>%
  mutate_if(is.character,
            .funs = funs(
              ifelse(is.na(.), my_mode(.), .)))
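Here is a small base R sketch of the same imputation idea on a made-up data frame, so you can check the result by eye. The `toy` data and the `mode_non_missing` helper are hypothetical. One thing to watch: a mode helper that counts NA as a value returns NA whenever missing values are the most common entry, so this sketch drops them before counting.

```r
# Hypothetical helper: most frequent non-missing value of a vector
mode_non_missing <- function(x) {
  ux <- unique(x[!is.na(x)])
  # match() gives NA for missing entries and tabulate() ignores them
  ux[which.max(tabulate(match(x, ux)))]
}

# Made-up data with missing values in both columns
toy <- data.frame(
  AGE = c(33, NA, 31, 68, NA),
  GENDER = c("Female", NA, "Female", "Male", NA),
  stringsAsFactors = FALSE
)

# Numeric column: fill with the median of the observed values
toy$AGE[is.na(toy$AGE)] <- median(toy$AGE, na.rm = TRUE)

# Character column: fill with the most frequent observed value
toy$GENDER[is.na(toy$GENDER)] <- mode_non_missing(toy$GENDER)

toy
```

After running this, the missing ages become 33 (the median of 31, 33 and 68) and the missing genders become "Female".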
Next, we need to convert the character variables to factors.
predictor_cols <- c("MARITAL_STATUS", "GENDER",
                    "ALUMNUS_IND", "PARENT_IND",
                    "WEALTH_RATING", "PREF_ADDRESS_TYPE")

# Convert feature to factor
donor_data <- mutate_at(donor_data,
                        .vars = predictor_cols,
                        .funs = as.factor)
Now, we need to let TensorFlow know about the column types. For factor columns, we need to specify all the values contained in those columns using the column_categorical_with_vocabulary_list function. Then, using the column_indicator function, we convert each of the factor values in a column to its own column of 0s and 1s. This process is known as one-hot encoding. For example, say the GENDER column has two possible values, male and female. One-hot encoding will create two columns: one for male and the other for female. Each of these columns will contain either 0 or 1, depending on the value in the GENDER column for that row.
feature_cols <- feature_columns(
  column_indicator(
    column_categorical_with_vocabulary_list(
      "MARITAL_STATUS",
      vocabulary_list = unique(donor_data$MARITAL_STATUS))),
  column_indicator(
    column_categorical_with_vocabulary_list(
      "GENDER",
      vocabulary_list = unique(donor_data$GENDER))),
  column_indicator(
    column_categorical_with_vocabulary_list(
      "ALUMNUS_IND",
      vocabulary_list = unique(donor_data$ALUMNUS_IND))),
  column_indicator(
    column_categorical_with_vocabulary_list(
      "PARENT_IND",
      vocabulary_list = unique(donor_data$PARENT_IND))),
  column_indicator(
    column_categorical_with_vocabulary_list(
      "WEALTH_RATING",
      vocabulary_list = unique(donor_data$WEALTH_RATING))),
  column_indicator(
    column_categorical_with_vocabulary_list(
      "PREF_ADDRESS_TYPE",
      vocabulary_list = unique(donor_data$PREF_ADDRESS_TYPE))),
  column_numeric("AGE"))
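To see what one-hot encoding produces, here is a small base R sketch on a made-up GENDER vector. It does not use tfestimators and is not what column_indicator runs internally. It only illustrates the shape of the result.

```r
# Made-up factor with two levels
gender <- factor(c("Female", "Male", "Female", "Male", "Male"))

# One 0/1 column per level: 1 where the row has that level, 0 otherwise
one_hot <- sapply(levels(gender), function(lv) as.integer(gender == lv))

one_hot
```

The result is a matrix with one row per observation and one column per level. Every row contains exactly one 1, in the column that matches its original value.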
Now that we have created the column types, let’s split the data set into train and test data sets.
row_indices <- sample(1:nrow(donor_data),
                      size = 0.8 * nrow(donor_data))
donor_data_train <- donor_data[row_indices, ]
donor_data_test <- donor_data[-row_indices, ]
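Because sample() draws rows at random, the split changes on every run. The following sketch on made-up data shows one way to make an 80/20 split repeatable and to confirm that the two sets do not overlap. The seed value 123 and the `toy` data frame are arbitrary choices for illustration.

```r
set.seed(123)  # arbitrary seed so the same rows are drawn on every run

# Made-up data: ten rows with an id and a label
toy <- data.frame(id = 1:10, label = rep(c("N", "Y"), 5))

n <- nrow(toy)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))

toy_train <- toy[train_idx, ]    # 80% of the rows
toy_test <- toy[-train_idx, ]    # the remaining 20%

# No row should appear in both sets
length(intersect(toy_train$id, toy_test$id))
```

Using floor() makes the training size an explicit whole number when 80% of the row count is not an integer.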
The TensorFlow package then requires that we create an input function that lists the input and output variables. We will predict the likelihood of a person’s donation.
donor_pred_fn <- function(data) {
  input_fn(data,
           features = c("AGE", "MARITAL_STATUS",
                        "GENDER", "ALUMNUS_IND",
                        "PARENT_IND", "WEALTH_RATING",
                        "PREF_ADDRESS_TYPE"),
           response = "DONOR_IND")
}
This is a modified excerpt from the book Data Science for Fundraising (Build Data Driven Solutions Using R).
Build a Deep Learning Classifier
Finally, we can use the prepared data set as well as the input function to build a deep learning classifier. We will create three hidden layers with 80, 40 and 30 nodes respectively.
classifier <- dnn_classifier(
  feature_columns = feature_cols,
  hidden_units = c(80, 40, 30),
  n_classes = 2,
  label_vocabulary = c("N", "Y"))
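To get a feel for how large this network is, the sketch below counts the weights and biases in a fully connected network with hidden layers of 80, 40 and 30 nodes. The `count_dense_params` helper is hypothetical. The input width of 20 is a made-up number (the real width depends on how many one-hot columns the vocabularies produce), and the single output node is an assumption about how a two-class classifier is typically set up.

```r
# Hypothetical helper: number of trainable parameters in a fully
# connected network, counting one weight per connection plus one
# bias per node in every layer after the input.
count_dense_params <- function(n_inputs, hidden_units, n_outputs = 1) {
  sizes <- c(n_inputs, hidden_units, n_outputs)
  fan_in <- head(sizes, -1)    # nodes feeding into each layer
  fan_out <- tail(sizes, -1)   # nodes in each layer
  sum(fan_in * fan_out + fan_out)
}

# Assumed input width of 20 columns, hidden layers as in the recipe
n_params <- count_dense_params(20, c(80, 40, 30))
n_params
```

With these assumptions the count is 1680 + 3240 + 1230 + 31 = 6181 parameters, which is a lot of capacity for seven predictors.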
Using the train function, we will train the classifier on the training data.
train(classifier, input_fn = donor_pred_fn(donor_data_train))
We will next predict the values using the model for the test data set as well as the full data set.
predictions_test <- predict(
  classifier,
  input_fn = donor_pred_fn(donor_data_test))
predictions_all <- predict(
  classifier,
  input_fn = donor_pred_fn(donor_data))
Similarly, we will evaluate the model on both the test data and the full data set. The first table below shows the evaluation on the test data, and the second shows the evaluation on the full data set.
evaluation_test <- evaluate(
  classifier,
  input_fn = donor_pred_fn(donor_data_test))
evaluation_all <- evaluate(
  classifier,
  input_fn = donor_pred_fn(donor_data))
Evaluation on the test data:

Measure | Value |
---|---|
accuracy | 84.34 |
accuracy_baseline | 0.63 |
auc | 216.00 |
auc_precision_recall | 0.51 |
average_loss | 0.62 |
global_step | 0.63 |
label/mean | 0.66 |
loss | 0.63 |
prediction/mean | 0.63 |

Evaluation on the full data set:

Measure | Value |
---|---|
accuracy | 84.87 |
accuracy_baseline | 0.62 |
auc | 216.00 |
auc_precision_recall | 0.51 |
average_loss | 0.62 |
global_step | 0.62 |
label/mean | 0.66 |
loss | 0.62 |
prediction/mean | 0.62 |
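To help read two of these measures, here is a base R sketch that computes accuracy and a majority-class baseline on made-up labels. The `actual` and `predicted` vectors are hypothetical. Treating accuracy_baseline as the accuracy of always predicting the most common class is my understanding of the TensorFlow estimator metric, so take that part as an assumption rather than a definition.

```r
# Made-up true labels and predicted labels
actual <- c("Y", "Y", "N", "Y", "N", "Y", "Y", "N")
predicted <- c("Y", "Y", "Y", "Y", "N", "Y", "N", "N")

# Share of predictions that match the true label
accuracy <- mean(actual == predicted)

# Share of positive labels
label_mean <- mean(actual == "Y")

# Accuracy of always predicting the most common class (assumed meaning)
accuracy_baseline <- max(label_mean, 1 - label_mean)

c(accuracy = accuracy, accuracy_baseline = accuracy_baseline)
```

A model is only adding value when its accuracy is clearly above this baseline.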
The overall accuracy doesn’t seem too impressive, even though we used a large number of nodes in the hidden layers. This is partially due to the data itself: it is a synthetic data set, after all. But you should try the above recipe with your own data set and see if you can get better results. All the best.
References
Abadi, Martín, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, et al. 2016. “TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems.” arXiv Preprint arXiv:1603.04467.
Cheng, Heng-Tze, Lichan Hong, Mustafa Ispir, Clemens Mewald, Zakaria Haque, Illia Polosukhin, Georgios Roumpos, et al. 2017. “TensorFlow Estimators: Managing Simplicity Vs. Flexibility in High-Level Machine Learning Frameworks.” In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1763–71. New York, NY, USA: ACM. http://doi.acm.org/10.1145/3097983.3098171.
Tang, Yuan, JJ Allaire, RStudio, Kevin Ushey, Daniel Falbel, and Google Inc. 2017. tfestimators: High-Level Estimator Interface to TensorFlow in R. https://github.com/rstudio/tfestimators.
The post Step by Step Tutorial: Deep Learning with TensorFlow in R appeared first on nandeshwar.info.