Learning about Algorithms and Program Structure
Outline of the Problem
When learning to program, we often start with trivial exercises that don’t motivate anyone to treat programming seriously. Here I want to develop an example step-by-step that will illustrate key concepts of programming in R and do something useful. We are going to elaborate a method of searching through a list called “binary search”.
Suppose you were living in New York City back in the antediluvian times when we still used phone books. In those times, the phone book for Manhattan, the principal borough of New York, was ridiculously thick, as you can see in this photo of the Yellow Pages (commercial directory) from 10 years ago. The White Pages, the directory of names, addresses and phone numbers, was similarly thick.
How would you find a friend’s phone number in this multiple thousand page monstrosity? Or, to put the same question in the language of computer science, how would we design an algorithm to find our friend’s number?
What’s An Algorithm?
“Algorithm” is a word that simply means a solution method, the steps we adopt to solve a problem. We often enshrine such a solution in a computer program as we will do here. However, before we do that, let’s conceptualize the problem and try to find an efficient method to solve the problem.
A Naive Approach
We could simply start at the beginning, at page 1, and leaf page by page until we find our friend’s name. If her last name was “Zimmer”, this could take a while. Not very efficient.
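As a rough sketch of this page-by-page idea, the function below scans a vector from the front and counts how many entries it had to check. The function name linear_search and the five sample names are assumptions made for illustration; they are not part of the program we build later.

```r
# Hypothetical helper: scan front to back and count the comparisons made
linear_search <- function(entries, target) {
  for (i in seq_along(entries)) {
    if (entries[i] == target) {
      return(list(position = i, comparisons = i))
    }
  }
  # target never found: every entry was checked
  list(position = NA_integer_, comparisons = length(entries))
}

# Illustrative names only, not the real phone book
book <- c("Adams", "Baker", "Mantle", "Taylor", "Zimmer")
result <- linear_search(book, "Zimmer")
result$comparisons   # 5: a name at the end forces a check of every entry
```

With a name near the end of the list, the number of comparisons grows in step with the length of the list, which is what makes this approach slow for a thick book.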
Second Approach
A second way to find your friend’s name is to open the book at random and see if your friend’s name is on that page. If not, you close the book and open to a new random page. You continue until you find the page you are looking for. Perhaps this is even worse than the page-by-page method, because here we could end up checking the same page multiple times.
A More Reasonable (and Common) Approach
After years of asking this question, I have found that most people start out by suggesting that you open the book in the middle and check to see if your friend’s name is on that page. There is only a very small probability this is the case if the phone book is the size of the Manhattan book. If not, you note whether the name comes before or after the point at which you opened the book.
At this point, instead of ignorantly closing the book and trying again, you make use of the information you gained. If the name comes before the point where you opened the book, you can forget the second half of the book. You can toss that half away. Likewise, if the name comes after that point, you can toss away the first half of the book, because it cannot help you solve your problem.
You have cut your problem literally in half. Now, instead of trying to find your friend in a list of, say, 1,000 pages, you’ve only got to deal with 500 pages.
As a peripheral note, in the days when phone books were still an important tool for everyday life, professors illustrated this method by literally tearing a phone book in half. Because I can’t find one, I can’t give you a demonstration of that. However, the following link will show Prof. David Malan of Harvard doing this for his course CS50, a course I highly recommend. [LINK]
You can execute this approach repeatedly, each time cutting the problem in half, until you find the page you need. For example, in the worst case, you could narrow a 1,000-page phone book down to a single page in 10 halvings, always rounding up to whole pages (you can’t have a fraction of a page in this system).
1,000 > 500 > 250 > 125 > 63 > 32 > 16 > 8 > 4 > 2 > 1
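The sequence above lists 11 sizes, so it takes 10 halvings to get from 1,000 pages down to 1. A small sketch that reproduces the count, assuming we round up to whole pages at each step (the variable names are mine):

```r
# Count how many times 1,000 pages can be halved before one page remains
remaining <- 1000
halvings <- 0
sizes <- remaining
while (remaining > 1) {
  remaining <- ceiling(remaining / 2)  # round up: no fractional pages
  halvings <- halvings + 1
  sizes <- c(sizes, remaining)
}
sizes      # 1000 500 250 125 63 32 16 8 4 2 1
halvings   # 10
```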
This is binary search and we can implement this on a computer in a program.
Planning Our Binary Search Program
Pseudocode
The first phase in writing any computer program is to plan it. We write out a version of the program in our working language: English, Portuguese, whatever. Thus, before trying to translate our ideas into computer code, we are clear about what those ideas are.
Loading the Data
Our first task is to identify the list which contains the name we are looking for and read it into memory. In our case, I have created a list of 500 last names of professional American baseball players taken from the famous Lahman database of baseball statistics. [1] The list is saved in a compressed R data file, players_list.rds, so we will read that and install it in memory as players.
So, in pseudocode, we could lay out the first step as:
- Read file players_list.rds into memory as players
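If you do not have players_list.rds at hand, the read step can still be tried with a stand-in file. In this minimal sketch, the six names come from the start of the players list; the temporary file and the names demo_names, demo_file and demo_players are assumptions made only for this illustration.

```r
# Stand-in for the real data: six names from the start of the players list
demo_names <- c("Victorino", "Morales", "Evans", "Grube", "Figaro", "Cruz")

# Write them to a temporary .rds file, then read the file back into memory
demo_file <- tempfile(fileext = ".rds")
saveRDS(demo_names, demo_file)
demo_players <- readRDS(demo_file)

head(demo_players, 3)   # first three entries
length(demo_players)    # number of names read back
```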
To understand how to proceed from here, let’s take a quick look at the beginning of the list:
"Victorino" "Morales" "Evans" "Grube" "Figaro" "Cruz" ...
Preprocessing
players is a character vector, a simple set of 500 family names of baseball players. To make our example a bit more concrete, we are going to seek the name “Mantle”, the name of one of my childhood heroes, Mickey Mantle of the New York Yankees.
Does players have the same characteristics as the last names in a phone book? What would we need to do to pre-process the list, to make it ready for the binary search algorithm? In a phone book, the names are in alphabetical order. This is what enables us to tear it in half and know that the half we are now working with must contain the name we seek. If I am searching for the name “Mantle” and I tear a phone book in half at the letter “N”, I know I can throw away the second half of the book (“N” – “Z”) as “Mantle” cannot be there. Binary search depends on the list being searched actually being in order.
Our list is in random order. So, our next task is to put it in alphabetical order. We can do this with the base R sort function or the arrange function within dplyr, one of the key components of the tidyverse set of packages.
Our pseudocode with both steps:
- Read file players_list.rds into memory as players
- Put players in alphabetical order with sort or arrange
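To see what the sorting step does, here is a small sketch using base R’s sort on six names from the players list. The variable names sample_names and sorted_names are assumptions for illustration.

```r
# Six names in their original (unsorted) order
sample_names <- c("Victorino", "Morales", "Evans", "Grube", "Figaro", "Cruz")

# sort() returns a new vector in alphabetical order
sorted_names <- sort(sample_names)
sorted_names

# is.unsorted() is a quick check that the order is now suitable for binary search
is.unsorted(sample_names)   # TRUE
is.unsorted(sorted_names)   # FALSE
```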
Binary Search
In binary search, we have to identify the middle entry in our list. This will be the dividing point between the first and second halves of the list. To divide the list in the middle, we can divide the number of names by 2, using integer division. Integer division forces any non-integer result to the truncated number, that is, it simply drops the decimal part of the number. Whether we are rounding up or down makes little practical difference in binary searches, as we are still dividing the remaining list in half.
- Read file players_list.rds into memory as players
- Put players in alphabetical order with sort or arrange
- Count the entries in the list and integer divide the total by 2
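In R, integer division is written %/%. A quick sketch of how it behaves for the positive counts we use here (the variable names are only for illustration):

```r
half_of_500 <- 500 %/% 2     # 250
half_of_7 <- 7 %/% 2         # 3: the decimal part of 3.5 is dropped
plain_division <- 7 / 2      # 3.5: ordinary division keeps the fraction
```

Note that with seven entries this picks entry 3 rather than entry 4. As the text says, that makes little practical difference, because the remaining list is still cut roughly in half.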
Now, we can check the entry we have selected and see if it is “Mantle”. If so, we’re done and we stop.
- Read file players_list.rds into memory as players
- Put players in alphabetical order with sort or arrange
- Count the number of entries in the list and divide the total by 2
- If the entry chosen equals “Mantle”, report the result and end the program.
Cutting the List Down to Size
If not, we need to find the new middle. First, we need to decide if we are dealing with the first half or the second half of the list. If “Mantle” is later in the alphabet than the entry, we can redefine the limits of the list to be the next entry higher than the one we just looked at and the end of the list. If “Mantle” is earlier in the alphabet than the entry we examined, we would choose limits of the first entry up to the entry before the one we chose. For example, if we have a list with seven names:
Baker, Carter, George, Katz, Landis, Mantle, Taylor
we would first choose the middle name, “Katz”, and see if it equaled “Mantle”. Since it does not, we would compare “Katz” and “Mantle”. “Mantle” is later than “Katz” alphabetically, so we would define the new list as “Landis” through “Taylor”, since “Mantle” must be located there.
If, on the other hand, we had the following list:
Baker, Mantle, Nixon, Parker, Rosen, Smith, Taylor
we would start with “Parker”, the middle name, compare the name to “Mantle”, see that “Mantle” comes before “Parker” and so start the new list at the beginning “Baker” and go to “Nixon”. This defines a new step in the program, setting the new list, which is based on an “if . . . then . . . else” construct that we will see when we translate the pseudocode to R code.
- Read file players_list.rds into memory as players
- Put players in alphabetical order with sort or arrange
- Count the number of entries in the list and divide the total by 2
- If the entry chosen equals “Mantle”, report the result and end the program
- If “Mantle” is not the chosen item, set the limits of the new list:
5a. If “Mantle” is greater than the chosen item, define the new list as chosen item + 1 to the end of the list
5b. If “Mantle” is lesser than the chosen item, define the new list as the beginning of the list up to the previously chosen item – 1
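R can compare character strings directly with < and >, which is what makes this step easy to code. The sketch below applies the comparison to the two seven-name lists from the text; the variable names are assumptions for illustration. String comparison in R follows the collation rules of your locale, which should not matter for simple capitalized names like these.

```r
list_a <- c("Baker", "Carter", "George", "Katz", "Landis", "Mantle", "Taylor")
list_b <- c("Baker", "Mantle", "Nixon", "Parker", "Rosen", "Smith", "Taylor")

# "Mantle" comes after "Katz", so we keep the entries after position 4
later_than_katz <- "Mantle" > list_a[4]
kept_a <- list_a[5:7]

# "Mantle" comes before "Parker", so we keep the entries before position 4
earlier_than_parker <- "Mantle" < list_b[4]
kept_b <- list_b[1:3]
```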
Now, we just have to keep repeating the steps 3 – 5b, until we reach a point where step 4 returns TRUE, that is, we have found “Mantle” and we simply have to report the results.
This is then our program in pseudocode. Let’s turn it into real programming code and try it out.
Translate Program to R Code
Now, we will start to use code blocks to elaborate our program. Normally, the first block of code we put into a program specifies the libraries of functions that R has that facilitate our programming. However, we won’t need any of these libraries for this exercise, since the objective is to focus on the thought process of creating a program. Here, we will use base R functions, that is, those functions that come with the default distribution of R from CRAN.
In this program, I will also emphasize comments. It is very important to explain your computer code to yourself and to others. The best way to do this is to put observations in your program that explain why you coded as you did. This is primarily for your benefit. When you return to your program six months after you wrote it, the comments will help you understand what you did and what choices you made. The first step in commenting is to put your pseudocode steps in your code to serve as an outline of what your program does.
Code Block 1 – Read the Data
    # Read file `players_list.rds` into memory as `players`
    players <- readRDS("players_list.rds")
    # Have a look at the structure `str` of `players`
    str(players)
    ##  chr [1:500] "Victorino" "Morales" "Evans" "Grube" "Figaro" "Cruz" ...
As I stated above, players is a character vector containing 500 names. Technically, you need to put the entire path of folders from the working directory down to the folder where you placed “players_list.rds”. To keep things simple here, I put the data in the same directory as the program.
Also, the symbol <- in R is the primary way we indicate assignment, that is assigning some data to a name. We tend not to use =, as it can be confused with a logical test of equality, which is 2 equals signs together, ==.
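A short sketch of the difference between assignment and the equality test (the variable names are only for illustration):

```r
x <- 5              # assignment: store 5 under the name x
is_five <- x == 5   # equality test: TRUE
is_six <- x == 6    # equality test: FALSE
y = 10              # = also assigns at the top level, but is easier to misread
```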
Code Block 2 – Arrange the List in Alphabetical Order
    # Put `players` in alphabetical order with `sort`
    players <- sort(players)
In R, you can redefine an object like players to contain a transformed version of itself. In some languages, you need to give it a new name and then assign it back to its original name. In R, you can do it directly.
Code Block 3 – Find the Middle Entry
    # Find the middle entry using integer division
    size <- length(players)
    lower_limit <- 1     # for later use
    upper_limit <- size  # for later use
    middle_entry <- size %/% 2
Code Block 4 – Test the Entry against “Mantle”
    # If the entry chosen equals "Mantle", report the result
    # and end the program
    # Define Mickey Mantle as the `target`
    target <- "Mantle"
    finished <- FALSE # create a logical variable to mark if we are finished
    if (players[middle_entry] == target) {
      print(paste(target, "is entry number", middle_entry, "in the list."))
      finished <- TRUE
    }
When you use an “if . . . then” construction to create a logical test and you want the program to do multiple sub-steps, you need to enclose the steps in curly brackets {}. Also note that we will deal with implementing an end to the program when we construct a loop. In this block as well, I have assigned “Mantle” to the variable target, which will allow us to search for anyone without having to rewrite many lines of code whenever we have a new person to search for.
Code Block 5 – Determining the New, Half-Size List
    # A. If "Mantle" is greater than the chosen item, define new list
    #    as chosen item + 1 to the end of the list
    # B. If "Mantle" is lesser than the chosen item, define new list
    #    as the beginning of the list up to the previously chosen item - 1
    ifelse(target > players[middle_entry],
           lower_limit <- middle_entry + 1,
           upper_limit <- middle_entry - 1)
    ## [1] 251
    new_size <- upper_limit - lower_limit + 1 # 1 needed to include both end points
    # offset by lower_limit so the new midpoint falls inside the remaining range
    middle_entry <- lower_limit + (new_size %/% 2)
I am laying out the steps to define each iteration so you can follow the logic a bit better. There are faster ways to code this block, but I think you want a somewhat clearer demonstration of what is going on.
However, I have used one shortcut. ifelse is a compact, vectorized version of an “if . . . then . . . else” construction. Here, I used it to redefine the limits of the list we still want to consider. Because we are testing a single value, an ordinary if . . . else block would work just as well. The conditional tests if Mantle is later in the sorted list than the player we have identified as the middle_entry. If he is, we adjust the lower limit to be 1 entry higher than the player we just considered. If he comes before that player, we adjust the upper limit to be one lower than the entry we just tested. From these new limits, we can get a new size count of the list we are considering and determine a new midpoint.
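For comparison, here is a sketch of the same limit update written with a plain if . . . else block instead of ifelse. It runs on the seven-name list from the pseudocode section rather than the 500-name file, and the name sorted_players is an assumption for illustration.

```r
sorted_players <- c("Baker", "Carter", "George", "Katz", "Landis", "Mantle", "Taylor")
target <- "Mantle"

lower_limit <- 1
upper_limit <- length(sorted_players)
middle_entry <- length(sorted_players) %/% 2   # 3, "George"

if (target > sorted_players[middle_entry]) {
  lower_limit <- middle_entry + 1   # target is in the upper part
} else {
  upper_limit <- middle_entry - 1   # target is in the lower part
}

new_size <- upper_limit - lower_limit + 1        # entries 4 to 7, so 4
middle_entry <- lower_limit + (new_size %/% 2)   # 4 + 2 = 6
sorted_players[middle_entry]                     # "Mantle"
```

The if . . . else form does not print a value the way the ifelse call does, and it makes the two branches a little easier to read.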
players is still in memory in its original form (500 entries), but we are using progressively smaller portions of it to conduct our search for Mickey Mantle.
But, We Haven’t Finished
No, finished is still FALSE. We only have gone through our program once. Now, we have to modify the program so it will iterate through the list until it finds “Mantle”.
R, as well as most programming languages, can create loops to process data on an industrial basis. Loops can iterate through a series of data, such as our players list. Two varieties of loop are common – for loops and while loops. Programming languages process for loops a set number of times that the programmer defines through a variable. The program will apply the statements within the loop to the data exactly the number of times specified. while loops operate while a condition holds. The program will keep applying the loop’s programming statements to the data until the condition becomes FALSE.
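A minimal sketch of the two kinds of loop, using made-up numbers rather than our player data:

```r
# for loop: the number of passes is fixed in advance (here, 5)
total <- 0
for (i in 1:5) {
  total <- total + i
}
total        # 15

# while loop: keep going until the condition becomes FALSE
value <- 1
doublings <- 0
while (value < 500) {
  value <- value * 2
  doublings <- doublings + 1
}
doublings    # 9, because 2^9 = 512 is the first power of 2 to reach 500
```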
We don’t know how many iterations our search will take until it finds “Mantle” in the list of players, so a for loop is not appropriate. However, we can use our finished variable as a condition to define the loop. We are going to put code blocks 3, 4 and 5 inside the loop as they contain the program steps we want to iterate through.
Revision of Pseudocode
Because we need to incorporate a loop into our program, let’s return to the pseudocode and build in the looping structure. It’s very common to do revisions of the pseudocode to help you understand what you intended to do when you (or someone else) examines the code at a later date.
The first three code blocks are fine. However, we next will define a loop in step 4 and create a series of sub-steps that will execute the loop. We have most of these steps defined already in 4, 5a and 5b. First, we will execute the test to see if we have found “Mantle”. If not, we will redefine the limits as we did in 5a and 5b above and cycle around until the finished flag is TRUE. Then we will print the results and a counter to show how many iterations the binary search took.
- Read file players_list.rds into memory as players
- Put players in alphabetical order with sort or arrange
- Count the number of entries in the list and divide the total by 2
- Enter a while loop that continues until finished becomes TRUE
4a. Test to see if the middle_entry equals “Mantle”. If so, report the result and end the program
4b. If “Mantle” is greater than the chosen item, define new lower_limit of the list as the chosen item + 1 and leave the upper_limit as it was
4c. If “Mantle” is lesser than the chosen item, leave the lower_limit the same and change the upper_limit to be the previously chosen item – 1
4d. Calculate new size of the remaining list
4e. Calculate a new middle_entry to test in step 4a
4f. Update the entries tested by the binary search in the variable guesses
4g. Repeat the loop until finished is TRUE.
Now, we can go ahead and write the code incorporating the statements from our old program structure. Note that I have repeated some of the variable initializations we stated above so this code block will run correctly without reference to code blocks 4 and 5 above.
Code Block 4 Revised – Loop Structure
    # Initializing variables
    target <- "Mantle"   # repeated here so this block runs on its own
    size <- length(players)
    lower_limit <- 1
    upper_limit <- size
    middle_entry <- size %/% 2
    finished <- FALSE
    iter <- 1
    guesses <- middle_entry

    # loop
    while (!finished) {  # keep repeating until finished == TRUE
      if (players[middle_entry] == target) {  # passes test; go to end
        print(paste(target, "is entry number", middle_entry, "in the list."))
        print(paste("We used", iter, "iterations to find the target."))
        finished <- TRUE
      } else {  # Needs additional iteration
        iter <- iter + 1  # increase iteration counter
        ifelse(target > players[middle_entry],    # test
               lower_limit <- middle_entry + 1,   # TRUE
               upper_limit <- middle_entry - 1)   # FALSE
        new_size <- upper_limit - lower_limit + 1 # 1 needed to include both end points
        middle_entry <- lower_limit + (new_size %/% 2)
        guesses <- c(guesses, middle_entry)  # tally which indices were chosen
      }
    }
    ## [1] "Mantle is entry number 275 in the list."
    ## [1] "We used 9 iterations to find the target."
We Did It! Mission Accomplished
We have found my childhood hero in our list of players. It took 9 iterations to hit the name. The guesses were 250, 376, 313, 282, 266, 274, 278, 276, 275. If you run this program on your system, you will see that it runs very fast. However, the code is not as efficient as it could be. I have opted for clarity over speed in every case I could so the reader can have a clearer idea of what decisions I made and how I implemented them. I also did not encapsulate our procedures in functions that make repeated uses of blocks of code easier to manage. I would normally do that.
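As a rough idea of what such a function might look like, here is a sketch under my own assumptions; it is not the program stored in the repository. The name binary_search and the small roster are hypothetical. It computes the midpoint slightly differently from the loop above (directly from the two limits), and it stops when the limits cross, so it returns NA for a name that is not in the list instead of looping forever.

```r
# Hypothetical helper: returns the position of `target` and the iterations used
binary_search <- function(sorted_names, target) {
  lower_limit <- 1
  upper_limit <- length(sorted_names)
  iter <- 0
  while (lower_limit <= upper_limit) {   # stop once the limits cross
    iter <- iter + 1
    middle_entry <- lower_limit + (upper_limit - lower_limit) %/% 2
    if (sorted_names[middle_entry] == target) {
      return(list(position = middle_entry, iterations = iter))
    } else if (target > sorted_names[middle_entry]) {
      lower_limit <- middle_entry + 1
    } else {
      upper_limit <- middle_entry - 1
    }
  }
  list(position = NA_integer_, iterations = iter)   # target not in the list
}

# Small illustrative roster; the list must be sorted before searching
roster <- sort(c("Taylor", "Baker", "Mantle", "Katz", "Carter", "Landis", "George"))
found <- binary_search(roster, "Mantle")
not_found <- binary_search(roster, "Ruth")
found$position       # 6
not_found$position   # NA
```

Wrapping the search in a function means you can look up a new name with one call instead of editing and rerunning the whole block.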
How to achieve efficient, faster running code is the subject for a future lesson.
If You Want to Play with This Program
I have stored three files on GitHub in my repository MADBlog so you can manipulate the program as well. The first is the program itself as I’ve laid it out here: bin_search_program.R. The second is the 500 name player list: players_list.rds. The third is the 5,000 name player list: big_player_list.rds. Click on the file names in the list and you can download the files. If you are not familiar with GitHub, you will need to click on the “Raw” button when you open the bin_search_program.R file. The other two files will download directly. Also, if you are unfamiliar with GitHub and work with data science or programming, don’t miss out on this valuable resource that facilitates version control and simple file sharing with others.
If you want to explore this code further, I can recommend two options. First, have a look at the names in the list with the command players. This will print out on your console the full list of five hundred names. Pick one and modify the target variable to that name. Then run the Code Block 4 Revised again and see how many iterations it took to find the name you chose.
The second option is to replace the 500 name list with the 5,000 name list by modifying the line of code
players <- readRDS("players_list.rds")
with the following line of code
players <- readRDS("big_player_list.rds")
and then see how many iterations it takes to find “Mantle” or any other player who interests you and appears in the list. How much more time did it take than with the short list?
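Before you run it, you can estimate what to expect. A standard result for binary search is that a sorted list of n entries needs at most floor(log2(n)) + 1 checks. The helper name worst_case below is an assumption for illustration.

```r
# Largest number of entries binary search may need to check in a list of n
worst_case <- function(n) floor(log2(n)) + 1

short_list <- worst_case(500)    # 9
long_list <- worst_case(5000)    # 13
```

So a list ten times longer should cost only a handful of extra iterations, which is why the difference in running time is hard to notice.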
Summary
We learned about how to structure a program and move from an idea (“binary search”) through pseudocode to actual R code. We saw how a while loop functions and we conducted a number of logical tests.
Now you can modify this code to fit your needs. Have fun.
[1] Lahman, S. (2017) Lahman’s Baseball Database, 1871-2016, Main page, http://www.seanlahman.com/baseball-archive/statistics/