EC2 Trials and Tribulations, Part 1 (Web Crawling)
Elastic Compute Cloud (EC2) is a service provided by Amazon Web Services that allows users to leverage computing power without the need to build and maintain servers, or spend money on special hardware. The idea is simple: the user “boots” up one or more machines and then accesses those machines as if logged into any other remote machine. I used EC2 and Elastic MapReduce extensively for my M.S. thesis last spring, but mainly used its large memory capabilities rather than its potential for explicit parallelism.
Recently, I ran a crawling job on EC2 using a parallel crawler I wrote in Python with twill. Using EC2 poses its own challenges, and parallel code poses more. Combine those with the fact that crawling is I/O bound, and the challenges get even more interesting. If you have taken a course in operating systems, you have heard this stuff over and over again. So have I, but I am stubborn. I tend to learn lessons from experience, and this was no exception. Through this series of posts, I want to point out difficulties and “gotchas” that are important to keep in mind when using EC2 and, in this post, when using parallelism in your code to accomplish large tasks.
Monitor your Instances
Monitoring your instances has two important benefits. First, it lets you make sure that you are not maxing out resources on the machine. Second, EC2 is “elastic”: with some clever programming, you can boot up more machines when you notice resources becoming scarce on your current ones, and then decommission them later when they are not needed. I did not do this at first, and I ran into several issues.
Disk Space. The concept of a “disk” is very confusing in EC2. The AMI forms a disk, sort of. Above and beyond the OS and any other software and packages you may install as part of the AMI, you can use whatever free space remains to store output files. The total disk space available to the AMI seems to be fixed at the moment the AMI is constructed. Thus, it is not a good idea to store files in the instance. I did this. Fortunately, I found out before it was too late that my “disk” was filling up. I wrote a cron job to copy all of my output files to /mnt every five minutes. Use /mnt to store your files, as it has lots and lots of space; HOWEVER, if you terminate your instance, the files are gone. The same is true of space within the instance. Once your job completes, upload your files to S3. s3cmd allows access to S3 from the command line, and with the modification here, you can upload and download files in parallel (a life saver for big batches). Another option is to create an EBS volume, mount it, and write files directly to it, but EBS space is much more expensive than S3 space.
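As a rough sketch, the copy step that such a cron job might invoke every five minutes could look like the following in Python; the ~/output and /mnt/output paths are placeholders for wherever your crawler actually writes its files.

import os
import shutil

SRC = os.path.expanduser("~/output")  # placeholder: where the crawler writes output
DST = "/mnt/output"                   # instance storage: plenty of space, gone on termination

if not os.path.isdir(DST):
    os.makedirs(DST)

# Copy any output file not already present in /mnt; cheap enough to run every five minutes.
for name in os.listdir(SRC):
    src_path = os.path.join(SRC, name)
    dst_path = os.path.join(DST, name)
    if os.path.isfile(src_path) and not os.path.exists(dst_path):
        shutil.copy2(src_path, dst_path)

Once the job completes, the contents of /mnt/output can then be pushed up to S3 with s3cmd.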
Memory. On my first attempt, I maxed out memory to the point that the OS killed 6 of my 8 processes. This dealt a huge blow to the performance of my crawler and wasted the extra money I spent on an extra-large instance. Monitor your job’s memory using top. If memory usage grows faster than you would like, consider using a memory profiler to make sure that there are no memory leaks in your code. I have found that long-running Python processes eat up a lot of RAM, even when no data structures are explicitly growing.
Additionally, maxing out RAM means that the system will begin to swap. This is devastating to performance because the extra grinding of the disk decreases the total I/O throughput your job can handle, and throughput is crucial for crawlers, which need to write files to disk quickly.
If, after profiling, you find that your job is still using too much RAM, consider caching, or use a high-memory EC2 instance.
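Beyond watching top from the outside, each process can also log its own footprint. Here is a minimal sketch using the standard-library resource module; the log file name and the point at which you call it are arbitrary choices.

import os
import resource
import time

LOG = open("memory-%d.log" % os.getpid(), "a")

def log_memory():
    # ru_maxrss is the peak resident set size, reported in kilobytes on Linux.
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    LOG.write("%s\t%d KB\n" % (time.strftime("%Y-%m-%d %H:%M:%S"), peak_kb))
    LOG.flush()

# Call log_memory() periodically from the crawl loop, e.g. after every page fetched,
# and plot the log afterwards to spot steady growth that suggests a leak.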
I/O Throughput. How fast your job consumes and produces data is a good way to determine whether something is going amiss in your job, or with the other resources you are using. When I started my crawling job, I was crawling n pages per hour, but after twelve hours the throughput decreased exponentially, until it got so slow that I had to add more instances. One way to monitor throughput is to save the results of ls -latr --full-time to disk and extract the date/time of each file. Using a tool like R, you can quickly plot your I/O throughput over time using aggregate(). A decrease in I/O throughput can be the result of many things: 1) swapping from exhausted RAM, 2) low disk space, 3) network congestion within AWS, 4) poor resource performance (if crawling, the resource is the website being crawled), 5) hammering an external resource and/or HTTP throttling, 6) congestion in the Internet. For crawling, you may want to consider using several smaller instances rather than fewer larger instances. This way, you will be accessing the resource from many IPs and the effect of being throttled should be lessened. Additionally, use instances that have “High” I/O performance; some are rated “Moderate” or “Low.”
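If you would rather stay in Python than pull the listing into R, a minimal sketch of the same idea is to bucket the modification times of your output files by hour; the output directory below is a placeholder.

import os
import time
from collections import Counter

OUTPUT_DIR = "/mnt/output"  # placeholder: wherever the crawler writes its files

# Count how many output files were written during each hour of the crawl.
per_hour = Counter()
for name in os.listdir(OUTPUT_DIR):
    path = os.path.join(OUTPUT_DIR, name)
    if os.path.isfile(path):
        hour = time.strftime("%Y-%m-%d %H:00", time.localtime(os.path.getmtime(path)))
        per_hour[hour] += 1

for hour in sorted(per_hour):
    print("%s\t%d files" % (hour, per_hour[hour]))

A sharp drop in the hourly counts is the cue to start checking the list of suspects above.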
CPU. A general rule of thumb is that you can run n processes in parallel on n cores. Additionally, if each core supports hyperthreading, then the number of processes you can run is approximately 2n. If you run more than the suggested number, the cost of context switching can slow down your performance. If you find the need to routinely exceed this guideline, use an instance with more cores.
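In Python, the multiprocessing module can size the worker pool to the core count for you. A minimal sketch, where crawl_page is a stand-in for whatever work each process actually does:

import multiprocessing

def crawl_page(task):
    # Stand-in for the real per-task work: fetch a page, parse it, write the output.
    return task

if __name__ == "__main__":
    n_cores = multiprocessing.cpu_count()
    # One worker per core; roughly double this if the cores are hyperthreaded
    # and the work is I/O bound.
    pool = multiprocessing.Pool(processes=n_cores)
    results = pool.map(crawl_page, range(100))
    pool.close()
    pool.join()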
When running parallel code, routinely do a ps aux | grep processName to make sure the correct number of processes is running. If any were killed, this will be noted in the system log (e.g. /var/log/syslog) with a reason.
Financial metrics. Are you getting your money’s worth? Are you really using all of the cores you are paying for? Are you really using all of the memory you are paying for? This is for you and your budget to dictate, but do not get carried away and assume that you must stay with the same instance size. Most AMIs can run on different instance types (except that 64-bit AMIs are restricted to m1.large and larger).
Quarantine Essential Services
My crawler used Redis as a work queue. Each process could easily write new thread IDs and page numbers to the queue as they were discovered, and read thread IDs and page numbers from the queue as it became ready to crawl a page. One problem I faced was that I coupled the crawling operation and queue management in the same script, and ran the Redis server on a machine where a crawler was also running. This coupling posed two challenges. First, it can harbor nasty bugs. Whenever a process was created on the master Redis node, my code would wipe the Redis queue clean to prepare it for crawling (bug!). Flushing the queue and the initial population of the queue should have taken place in two separate scripts. Because of this bug, I wiped the entire queue clean in the middle of the crawl. Fortunately, I followed the advice in the next section.
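A minimal sketch of that separation using redis-py (the host name, key name, and ID range are assumptions for illustration): one script flushes and seeds the queue exactly once, and the worker code only ever consumes from it.

# seed_queue.py: run exactly once, before any crawler starts.
import redis

r = redis.Redis(host="queue-host")     # the dedicated queue instance
r.delete("thread_queue")               # clear any stale work items
for thread_id in range(1, 100000):     # hypothetical initial population
    r.rpush("thread_queue", "%d:1" % thread_id)  # items are "threadID:pageNumber"

# worker.py: run by each crawler process; it never flushes or repopulates the queue.
import redis

r = redis.Redis(host="queue-host", decode_responses=True)
while True:
    item = r.lpop("thread_queue")
    if item is None:
        break                          # queue is drained
    thread_id, page = item.split(":")
    # ... fetch and parse the page, and rpush any newly discovered pages back on ...

Keeping the destructive flush in its own script means a crashed or restarted worker can never wipe the queue mid-crawl.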
Second, I had to be careful that my processes did not exceed RAM limitations. Because Redis is mainly an in-memory key-value store, it can itself hog most of the RAM in the instance. For this reason, it is best to quarantine essential services such as queues on their own instances.
Document Everything
Log everything. Log every resource you are going to use (URLs for a crawl), everything that was done, and any problems that arose. Using the directory structure (ls) as well as a log of what work had already been performed, I was able to reconstruct and repopulate the work queue and essentially start where I left off. For my crawling operation, I wrote the following events to logs, each with a timestamp.
- Starting the crawl.
- Logging in to the site being crawled.
- Clearing and populating the queue.
- Visiting a thread’s first page.
- Discovering the number of pages of posts in the thread/inserting to the queue.
- HTTP redirects, when a thread has been moved.
- Visiting a thread ID that does not exist.
- Inadvertent logouts, marking work to be redone.
- Queuing inconsistencies.
An activity log verbosely documented everything that occurred without logging actual data. An inventory manifest indicated which URLs/forum posts had valid content and how many pages of content were associated with them. A standard directory listing indicated what work had been done. By cross-referencing the manifest and the directory listing, it is easy to see which posts have not yet been processed. A system log prepared by the operating system also documents critical failures for you, such as lack of disk space or processes being killed.
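A minimal sketch of that cross-referencing step; the manifest format (one thread ID per line) and the output file naming convention used here are assumptions for illustration.

import os

OUTPUT_DIR = "/mnt/output"  # placeholder: where finished thread files live
MANIFEST = "manifest.txt"   # placeholder: one thread ID per line, written during the crawl

# Thread IDs the crawl was supposed to cover.
with open(MANIFEST) as f:
    expected = set(line.strip() for line in f if line.strip())

# Thread IDs that actually produced an output file, assuming names like "thread-12345.html".
done = set(name.split("-")[1].split(".")[0]
           for name in os.listdir(OUTPUT_DIR)
           if name.startswith("thread-"))

# Anything expected but not done goes back onto the work queue.
remaining = expected - done
print("%d of %d threads still need to be crawled" % (len(remaining), len(expected)))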
When writing your logs, use the advice in the next section!
Take Care Not to Clobber Files and Objects
It’s been said over and over again: each process should hold as much of its own real estate as possible. When two or more processes write to the same object, corruption can occur unless there is a locking mechanism in place. If two processes write to the same file at the same time, you will notice garbled entries in your logs. This did not affect my crawled data because each file was written by a single process. The same is true for reading data. When spawning multiple processes, I shared the same Redis connection with all of the processes. If two processes read from the queue at the same time, one process would get the correct data (a thread ID and page number) and the other would get “OK”, which was the result of the first process’s fetch operation. This is mostly my fault, but partially redis-py’s fault for filling some buffer between Python and Redis with meaningless information (“OK”).
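The fix is for each process to open its own connection after it has been spawned, rather than inheriting one from the parent. A minimal sketch with multiprocessing and redis-py, reusing the assumed host and key names from above:

import multiprocessing
import redis

def worker(worker_id):
    # Open a fresh connection inside the child process; never share one across processes.
    r = redis.Redis(host="queue-host", decode_responses=True)
    while True:
        item = r.lpop("thread_queue")
        if item is None:
            break
        # ... crawl the item; worker_id can also be used to name this process' log file ...

if __name__ == "__main__":
    procs = [multiprocessing.Process(target=worker, args=(i,)) for i in range(8)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()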
Each process should also write its own log files. When opening a file, you can use the following:
import os

# Include the process ID in the file name so each process gets its own log.
OUT = open("mylogfile-%s.log" % str(os.getpid()), "w")
...
OUT.close()
Crawler-Specific: Set an Upper Bound
Crawling is fun, but you must practice moderation or it is easy to attempt to boil the ocean. When I first started, I would run a crawl, have it crash, deem the data out of date, and start over from the beginning, crawling until there was nothing left to crawl. It is good to set an upper bound: “I will crawl 10 days’ worth of data,” or “I will only use threads created prior to May 1, 2011.”
One of the keys to success with EC2 is to get over the penny pinching. If you have a project, just take the plunge and do it on EC2 (if required). What you spend on the first few projects will save you money on future ones.