Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Introduction
This is the first of what I am hoping are a number of posts on different machine learning classifiers. The subject matter is not lab medicine but the methodology applies to any similar project. For example, maybe you want to classify the text of a general internal medicine consult into its subspecialty based on the words used or perhaps you want to determine which IT tickets are likely high priority. Maybe you want to convert free text diagnoses into categorical diagnoses. Ultimately, the problem I want to tackle is text classification.
In any case, the book that I have been reading at home is Deep Learning with R by Francois Chollet JJ Allaire and given the many interesting and easy-to-follow examples. Since it’s on my mind, I thought a deep learning model would be a good place to start. But I did not want to just redo one of the examples from the book because the data sets are already cleansed and in that sense much of the heavy lifting is done. I wanted to start from a new data set and use the approach shown in section 3.5 but apply it to a new text classification problem. Because I want to follow the basic flow of the Reuters News Wire classifier, I need a similar natural language processing (NLP) multiclass text classifier problem.
The problem I have chosen is one of for authorship classification. Specifically, given any Greek sentence take from the New Testament, can I make a deep learning classifier that will identify the author of a verse that the classifier has never seen?
Data Cleansing
The text of the New Testament is available online from numerous sources. I downloaded it here and chose the Byzantine Textform 2005 file. The text has been cleansed, put to lower case and transliterated to English characters. There are several steps to get it to a simple dataframe which the following code achieves. The code makes a dataframe where each row is a verse.
library(tidyverse) library(janitor) library(keras) library(knitr) # Deals with carriage returns within verses and brings all verses to one row versify <- function(text, book) { book_text <- tibble(reference = rep(NA,10000), verse = rep(NA,10000), book = book) chap_verse_pattern <- "[:space:]*[0-9]{1,3}[:]{1}[0-9]{1,3}[:space:]*" chap_verse_indices <- str_which(text, chap_verse_pattern) chap_verse_indices <- c(1, chap_verse_indices) for(i in 1:(length(chap_verse_indices) - 1)){ verse_text <- paste(text[chap_verse_indices[i] : (chap_verse_indices[i + 1] - 1)], collapse = " ") reference <- str_extract(verse_text, chap_verse_pattern) %>% str_trim verse_text <- str_replace(verse_text, chap_verse_pattern, "") book_text$verse[i] <- verse_text if(length(reference) == 0){ book_text$reference[i] <- NA } else { book_text$reference[i] <- reference } } book_text <- book_text %>% filter(verse != "") %>% na.omit() return(book_text) } # read in the books and get the in the expected order just for readability books <- list.files("Greek/", pattern = ".ASC") book_order <- c("MT05.ASC", "MR05.ASC", "LU05.ASC", "JOH05.ASC", "AC05.ASC", "RO05.ASC", "1CO05.ASC", "2CO05.ASC", "GA05.ASC", "EPH05.ASC", "PHP05.ASC", "COL05.ASC", "1TH05.ASC", "2TH05.ASC", "1TI05.ASC", "2TI05.ASC", "TIT05.ASC", "PHM05.ASC", "HEB05.ASC", "JAS05.ASC", "1PE05.ASC", "2PE05.ASC", "1JO05.ASC", "2JO05.ASC", "3JO05.ASC", "JUDE05.ASC", "RE05.ASC") # sort book names book_names <- sort(factor(books, levels = book_order)) # set up author vector authors <- c("matthew", "mark", "luke", "john", "luke", rep("paul", 13), "unknown", "james", rep("peter", 2), rep("john", 3), "jude", "john") # book author dataframe authors <- tibble(book = book_names, author = authors) # prepare empty tibble to store text nt_frame <- tibble(reference = character(), verse = character(), book = character()) for(i in 1:length(book_names)){ tmp <- readLines(con = paste0("Greek/",as.character(books[i]))) tmp <- versify(tmp, books[i]) nt_frame <- bind_rows(nt_frame,tmp) } nt_frame <- left_join(nt_frame, authors, by = "book") # force correct display order nt_frame$book <- factor(nt_frame$book, levels = book_names) nt_frame <- arrange(nt_frame, book)
Now that this wrangling is complete, we have a tibble that looks like this:
# show tibble format kable(head(nt_frame, 10))
reference | verse | book | author |
---|---|---|---|
1:1 | biblov genesewv ihsou cristou uiou dauid uiou abraam | MT05.ASC | matthew |
1:2 | abraam egennhsen ton isaak isaak de egennhsen ton iakwb iakwb de egennhsen ton ioudan kai touv adelfouv autou | MT05.ASC | matthew |
1:3 | ioudav de egennhsen ton farev kai ton zara ek thv yamar farev de egennhsen ton esrwm esrwm de egennhsen ton aram | MT05.ASC | matthew |
1:4 | aram de egennhsen ton aminadab aminadab de egennhsen ton naasswn naasswn de egennhsen ton salmwn | MT05.ASC | matthew |
1:5 | salmwn de egennhsen ton booz ek thv racab booz de egennhsen ton wbhd ek thv rouy wbhd de egennhsen ton iessai | MT05.ASC | matthew |
1:6 | iessai de egennhsen ton dauid ton basilea dauid de o basileuv egennhsen ton solomwna ek thv tou ouriou | MT05.ASC | matthew |
1:7 | solomwn de egennhsen ton roboam roboam de egennhsen ton abia abia de egennhsen ton asa | MT05.ASC | matthew |
1:8 | asa de egennhsen ton iwsafat iwsafat de egennhsen ton iwram iwram de egennhsen ton ozian | MT05.ASC | matthew |
1:9 | oziav de egennhsen ton iwayam iwayam de egennhsen ton acaz acaz de egennhsen ton ezekian | MT05.ASC | matthew |
1:10 | ezekiav de egennhsen ton manassh manasshv de egennhsen ton amwn amwn de egennhsen ton iwsian | MT05.ASC | matthew |
We should get verse counts that match what is expected, which we do.
# sanity check the verse counts by book verse_counts <- nt_frame %>% group_by(book) %>% summarise(counts = n()) kable(verse_counts)
book | counts |
---|---|
MT05.ASC | 1070 |
MR05.ASC | 677 |
LU05.ASC | 1149 |
JOH05.ASC | 878 |
AC05.ASC | 1003 |
RO05.ASC | 432 |
1CO05.ASC | 436 |
2CO05.ASC | 256 |
GA05.ASC | 148 |
EPH05.ASC | 154 |
PHP05.ASC | 103 |
COL05.ASC | 94 |
1TH05.ASC | 88 |
2TH05.ASC | 46 |
1TI05.ASC | 112 |
2TI05.ASC | 82 |
TIT05.ASC | 45 |
PHM05.ASC | 24 |
HEB05.ASC | 302 |
JAS05.ASC | 107 |
1PE05.ASC | 104 |
2PE05.ASC | 60 |
1JO05.ASC | 104 |
2JO05.ASC | 12 |
3JO05.ASC | 13 |
JUDE05.ASC | 24 |
RE05.ASC | 403 |
And we can check the unique word count
# check unique words word_list <- paste(nt_frame$verse, collapse = " ") %>% str_split(" ") %>% unlist %>% str_replace("[:punct:]","") %>% tolower() length(unique(word_list))
## [1] 17156
Normally at this point, we might remove stop words and then stem and lemmatize the text (ie get rid useless little words and suffixes that cause words of the same meaning to look different). This would be more important in more traditional learning classifiers but is likely less important when using Keras and Tensorflow. If I were running this classifier on the English text of the KJV for example, I would run it with and without such a process and guage the performance change. There are numerous NLP packages specifically dedicated to this task. I am going to skip it here. This process is, of course, highly language-dependent.
The other thing I need to do is make the author-factor column numbered 0-8 instead of 1-9 because R is going to be calling python code and python starts counting a 0. This bug took me a while to sort out.
nt_frame <- nt_frame %>% mutate(author_factor = as.numeric(as.factor(nt_frame$author)) - 1) %>% #pythonic mutate(verse_number = 1:nrow(nt_frame))
Now we will make a tokenizer, that is a function to convert words to integers and we will limit the model to the top 15000 out of the 17156 unique words found in the text.
max_features <- 15000 text <- nt_frame$verse tokenizer <- text_tokenizer(num_words = max_features) %>% fit_text_tokenizer(text)
Now we need to split the text randomly into training and testing sets in an 80:20 split.
set.seed(316) # Select random indices for training using 80% of the data. Notice that these are random! training_id <- sample.int(nrow(nt_frame), size = nrow(nt_frame)*0.8) # for reference separate training from testing data train_data <- nt_frame[training_id,] test_data <- nt_frame[-training_id,]
The data is very imbalanced, that is, there are authors (like Jude and James) that have very few verses ascribed to them and there are others (like Paul and Luke) who have many verses. For this reason, we should sanity check our training and testing data to make sure that we have sampled about 80% of each book. There are specific tools to achieve this process which is referred to as stratified sampling.
train_n <- table(train_data$author_factor) test_n <- table(test_data$author_factor) train_props <- round(train_n/(train_n + test_n),2) train_props
## ## 0 1 2 3 4 5 6 7 8 ## 0.78 0.79 0.67 0.81 0.81 0.81 0.79 0.80 0.81
We can see that we have a problem with author 2 who has only 24 verses. This is probably not going to matter much but we can try balanced sampling for which we do get better proportions.
# do balanced sampling library(splitstackshape) # get the balanced sample train_data <- stratified(nt_frame, "author_factor", .8) # randomize the sample train_data <- train_data[sample(nrow(train_data)),] training_id <- train_data$verse_number # the test indices are therefore testing_id <- which(!(nt_frame$verse_number %in% training_id)) # randomize the sample test_data <- nt_frame[testing_id,] test_data <- test_data[sample(nrow(test_data)),] train_n <- table(train_data$author_factor) test_n <- table(test_data$author_factor) train_props <- round(train_n/(train_n + test_n),2) train_props
## ## 0 1 2 3 4 5 6 7 8 ## 0.80 0.80 0.79 0.80 0.80 0.80 0.80 0.80 0.80
Now we can tokenize the data, that is, convert the verse from lists of integers to a one-hot encoded form.
# Create the training and testing x data x_train <- texts_to_matrix(tokenizer, text[training_id], mode = "binary") x_test <- texts_to_matrix(tokenizer, text[-training_id], mode = "binary") # Set the training and testing y data and then one hot encode them train_labels <- nt_frame$author_factor[training_id] y_train <- to_categorical(train_labels) test_labels <- nt_frame$author_factor[-training_id] y_test <- to_categorical(test_labels)
Satisfy ourselves that the training data is random in order
kable(head(nt_frame[training_id,], 10))
reference | verse | book | author | author_factor | verse_number |
---|---|---|---|---|---|
16:6 | all oti tauta lelalhka umin h luph peplhrwken umwn thn kardian | JOH05.ASC | john | 1 | 3584 |
2:20 | all ecw kata sou oti afeiv thn gunaika sou iezabel h legei eauthn profhtin kai didaskei kai plana touv emouv doulouv porneusai kai fagein eidwloyuta | RE05.ASC | john | 1 | 7563 |
21:23 | kai elyonti autw eiv to ieron proshlyon autw didaskonti oi arciereiv kai oi presbuteroi tou laou legontev en poia exousia tauta poieiv kai tiv soi edwken thn exousian tauthn | MT05.ASC | matthew | 5 | 705 |
5:1 | dikaiwyentev oun ek pistewv eirhnhn ecomen prov ton yeon dia tou kuriou hmwn ihsou cristou | RO05.ASC | paul | 6 | 4895 |
12:29 | h pwv dunatai tiv eiselyein eiv thn oikian tou iscurou kai ta skeuh autou diarpasai ean mh prwton dhsh ton iscuron kai tote thn oikian autou diarpasei | MT05.ASC | matthew | 5 | 374 |
4:24 | alla kai di hmav oiv mellei logizesyai toiv pisteuousin epi ton egeiranta ihsoun ton kurion hmwn ek nekrwn | RO05.ASC | paul | 6 | 4893 |
27:31 | eipen o paulov tw ekatontarch kai toiv stratiwtaiv ean mh outoi meinwsin en tw ploiw umeiv swyhnai ou dunasye | AC05.ASC | luke | 3 | 4734 |
1:25 | kai hrwthsan auton kai eipon autw ti oun baptizeiv ei su ouk ei o cristov oute hliav oute o profhthv | JOH05.ASC | john | 1 | 2921 |
3:6 | kai exelyontev oi farisaioi euyewv twn hrwdianwn sumboulion epoioun kat autou opwv auton apoleswsin | MR05.ASC | mark | 4 | 1149 |
8:4 | ei men gar hn epi ghv oud an hn iereuv ontwn twn ierewn twn prosferontwn kata ton nomon ta dwra | HEB05.ASC | unknown | 8 | 6930 |
Now we can build a basic model:
# Build a basic model model <- keras_model_sequential() %>% layer_dense(units = 64, activation = "relu", input_shape = c(max_features)) %>% layer_dense(units = 64, activation = "relu") %>% layer_dense(units = 9, activation = "softmax") model %>% compile( optimizer = "rmsprop", loss = "categorical_crossentropy", metrics = c("accuracy") )
and pull out validation data, again in an 80:20 split.
# pull out some validation data val_indices <- 1:floor(nrow(train_data))*0.2 x_val <- x_train[val_indices,] partial_x_train <- x_train[-val_indices,] y_val <- y_train[val_indices,] partial_y_train = y_train[-val_indices,]
Now we run the model:
history <- model %>% keras::fit( partial_x_train, # train in the non-validation training data partial_y_train, epochs = 5, batch_size = 256, validation_data = list(x_val, y_val) ) results <- model %>% keras::evaluate(x_test, y_test) results
## loss accuracy ## 1.0402801 0.6382576
plot(history) + geom_line()
We can show the model performance graphically:
library(corrplot) pred <- model %>% predict_classes(x_test) %>% factor(0:8) res_tab <- table(Pred = pred, Act = test_labels) res_prop <- prop.table(res_tab,2) author_key <- tibble(author = nt_frame$author, code = nt_frame$author_factor) %>% unique %>% arrange(code) colnames(res_prop) <- author_key$author rownames(res_prop) <- author_key$author corrplot(res_prop,is.corr = FALSE, method = "circle", addCoef.col = "lightblue", number.cex = 0.7)
Results are not great because many authors are being misclassified as Paul or Luke. This is likely from author imbalance so we can address the imbalance with weights and with dropout layers as suggested in this very informative tutorial from
Dr. Bharatendra Rai.
model <- keras_model_sequential() %>% layer_dense(units = 64, activation = "relu", input_shape = c(max_features)) %>% layer_dropout(rate = 0.4) %>% layer_dense(units = 32, activation = 'relu') %>% layer_dropout(rate = 0.2) %>% layer_dense(units = 9, activation = "softmax") model %>% compile(loss = 'categorical_crossentropy', optimizer = 'rmsprop', metrics = 'accuracy') history <- model %>% keras::fit( partial_x_train, partial_y_train, epochs = 5, batch_size = 256, validation_data = list(x_val, y_val), class_weight = list("0" = 20.1, "1" = 1.5, "2" = 89.7, "3" = 1.0, "4" = 3.2, "5" = 2.0, "6" = 1.1, "7" = 13.1, "8" = 7.1)) results <- model %>% keras::evaluate(x_test, y_test) results
## loss accuracy ## 1.4781048 0.5997475
pred <- model %>% predict_classes(x_test) %>% factor(0:8) # res_frame <- tibble(Prediction = pred, Target = test_labels) # res_frame <- res_frame %>% # group_by(Prediction, Target) %>% # summarise(N = n()) # plot_confusion_matrix(res_frame, add_normalized = FALSE) res_tab <- table(Pred = pred, Act = test_labels) res_tab
## Act ## Pred 0 1 2 3 4 5 6 7 8 ## 0 6 1 0 8 0 4 19 2 1 ## 1 1 211 0 34 18 26 19 2 3 ## 2 0 1 0 2 0 0 1 0 0 ## 3 1 21 1 261 31 42 38 4 8 ## 4 0 26 0 42 58 39 5 0 4 ## 5 1 7 2 52 23 97 5 1 2 ## 6 10 11 1 16 1 4 286 20 11 ## 7 0 0 0 1 0 0 0 0 0 ## 8 2 4 1 14 4 2 31 4 31
res_prop <- prop.table(res_tab,2) colnames(res_prop) <- author_key$author rownames(res_prop) <- author_key$author
What we get looks a little better with more counts on the diagonal.
corrplot(res_prop,is.corr = FALSE, method = "circle", addCoef.col = "lightblue", number.cex = 0.7)
The model is jumpy on the small books, probably because of undersampling of them. This means that k-fold cross validation help us assess model performance. Not sure if I should try to have balanced sampling in the folds but I am not going to worry about that at the moment.
build_model <- function() { model <- keras_model_sequential() %>% layer_dense(units = 64, activation = "relu", input_shape = c(max_features)) %>% layer_dropout(rate = 0.4) %>% layer_dense(units = 32, activation = 'relu') %>% layer_dropout(rate = 0.2) %>% layer_dense(units = 9, activation = "softmax") model %>% compile(loss = 'categorical_crossentropy', optimizer = 'rmsprop', metrics = 'accuracy') }
Run the k-fold cross-validation.
acc_histories <- NULL loss_histories <- NULL k <- 4 indices <- sample(1:nrow(train_data)) folds <- cut(1:length(indices), breaks = k, labels = FALSE) num_epochs <- 50 all_scores <- c() proportions_list <- list() for (i in 1:k) { cat("processing fold #", i, "\n") # Prepare the validation data: data from partition # k val_indices <- which(folds == i, arr.ind = TRUE) x_val_kfold <- x_train[val_indices,] y_val_kfold <- y_train[val_indices,] # Prepare the training data: data from all other partitions #partial_train_data <- train_data[-val_indices,] x_train_kfold <- x_train[-val_indices,] y_train_kfold <- y_train[-val_indices,] # Build the Keras model (already compiled) model <- build_model() # Train the model (in silent mode, verbose=0) history <- model %>% fit(x_train_kfold, y_train_kfold, epochs = num_epochs, batch_size = 256, validation_data = list(x_val_kfold, y_val_kfold), verbose = 0, class_weight = list("0" = 20.1, "1" = 1.5, "2" = 89.7, "3" = 1.0, "4" = 3.2, "5" = 2.0, "6" = 1.1, "7" = 13.1, "8" = 7.1)) # Evaluate the model on the fold's validation data results <- model %>% keras::evaluate(x_val_kfold, y_val_kfold) all_scores <- c(all_scores, results["accuracy"]) pred <- model %>% predict_classes(x_val_kfold) %>% factor(0:8) res_tab <- table(Pred = pred, Act = train_data$author_factor[val_indices]) res_prop <- prop.table(res_tab,2) proportions_list[[i]] <- res_prop acc_history <- history$metrics$val_accuracy acc_histories <- rbind(acc_histories, acc_history) loss_history <- history$metrics$loss loss_histories <- rbind(loss_history, acc_history) } all_scores mean(all_scores) average_acc_history <- data.frame( epoch = seq(1:ncol(acc_histories)), validation_acc = apply(acc_histories, 2, mean) ) average_loss_history <- data.frame( epoch = seq(1:ncol(loss_histories)), validation_loss = apply(loss_histories, 2, mean) )
Validation accuracy improves modestly with more epochs but the model definitely overfits the training data (getting to the high 90s in accuracy). This is a bit of a conundrum to me for which I do not know the answer (those who know, please comment): namely, I can overfit the model to make gains on the validation set and these do improve performance on the test set but I expect that this improvement is happening in some non-generalizable way.
library(ggplot2) ggplot(average_acc_history, aes(x = epoch, y = validation_acc)) + geom_line()
Likewise loss slowly declines over many epochs but the model overfits.
ggplot(average_loss_history, aes(x = epoch, y = validation_loss)) + geom_line()
In any case, this is the model performance rerunning the k-fold cross validation with 5 epochs.
Final Outcome
Satisfied enough that 5 epochs should be OK, I can run the model on the whole training set and look at its performance on the testing set.
model <- keras_model_sequential() %>% layer_dense(units = 64, activation = "relu", input_shape = c(max_features)) %>% layer_dropout(rate = 0.4) %>% layer_dense(units = 32, activation = 'relu') %>% layer_dropout(rate = 0.2) %>% layer_dense(units = 9, activation = "softmax") model %>% compile(loss = 'categorical_crossentropy', optimizer = 'rmsprop', metrics = 'accuracy') history <- model %>% keras::fit( x_train, y_train, epochs = 5, batch_size = 256, #validation_data = list(x_val, y_val), class_weight = list("0" = 20.1, "1" = 1.5, "2" = 89.7, "3" = 1.0, "4" = 3.2, "5" = 2.0, "6" = 1.1, "7" = 13.1, "8" = 7.1)) results <- model %>% keras::evaluate(x_test, y_test) results
## loss accuracy ## 1.4544461 0.5568182
pred <- model %>% predict_classes(x_test) %>% factor(0:8) # res_frame <- tibble(Prediction = pred, Target = test_labels) # res_frame <- res_frame %>% # group_by(Prediction, Target) %>% # summarise(N = n()) # # plot_confusion_matrix(res_frame, add_normalized = FALSE) res_tab <- table(Pred = pred, Act = test_labels) res_tab
## Act ## Pred 0 1 2 3 4 5 6 7 8 ## 0 10 8 0 15 3 3 45 3 4 ## 1 1 233 0 45 21 31 38 2 10 ## 2 0 0 0 0 0 0 1 0 0 ## 3 2 6 0 198 13 25 17 3 3 ## 4 0 19 1 77 64 57 7 0 1 ## 5 0 8 0 57 29 90 7 1 1 ## 6 6 4 0 14 1 5 250 19 7 ## 7 2 0 3 5 0 0 8 3 0 ## 8 0 4 1 19 4 3 31 2 34
res_prop <- prop.table(res_tab,2) colnames(res_prop) <- author_key$author rownames(res_prop) <- author_key$author corrplot(res_prop,is.corr = FALSE, method = "circle", addCoef.col = "lightblue", number.cex = 0.7)
Some interesting findings:
- John seems to be the easiest to classify. This fits well with his unique authorship style.
- The synoptic gospels are are easily misclassified among one another. Again, this fits with the overlap of stories, parables and other content.
- Hebrews looks more like Hebrews than it looks like Paul. This fits with the perspective that Paul is not the author of Hebrews.
- Poor James, Jude and Peter: just not enough verses to get proper classification. I am sure there are ways to address this kind of imballance were classifying Jude correctly a very important thing to do.
I think I am going to stop trying to improve this because it is not a real problem but I hope that someone else can recycle some code for a real-life problem. I would be interested in comments on how to get improved classification of small classes.
Parting Easter Thought
ouk estin wde hgeryh gar kaywv eipen deute idete ton topon opou ekeito o kuriov, Matthew 28:6
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.