
Simple Python performance timing by checkpoints

[This article was first published on From Learning and Evolution to Data Science, and kindly contributed to R-bloggers].

Summary

The need for a simple timing script

I have been revisiting Python recently and walked through Google’s Python tutorial. Interestingly, my first solution to one of the exercises, word count, took a long time to analyze a 600 KB text file.

I wanted a profiling tool to locate the bottleneck. There are lots of profilers, but they seemed too heavyweight for my simple purpose. This great Python profiler guide introduced a simple timing context manager script to measure execution time. You need to

wrap blocks of code that you want to time with Python’s with keyword and this Timer context manager. It will take care of starting the timer when your code block begins execution and stopping the timer when your code block ends.

Like this

from timer import Timer
from redis import Redis
rdb = Redis()

with Timer() as t:
    rdb.lpush("foo", "bar")
print "=> elapsed lpush: %s s" % t.secs

with Timer() as t:
    rdb.lpop("foo")
print "=> elapsed lpop: %s s" % t.secs
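The Timer class itself is not shown in that excerpt; a minimal sketch of such a context manager (assuming only the `secs` attribute used in the example above) might look like this:

```python
import time

class Timer(object):
    """Minimal timing context manager: records elapsed wall-clock seconds."""
    def __enter__(self):
        self.start = time.time()  # checkpoint when the with-block begins
        return self
    def __exit__(self, exc_type, exc_value, traceback):
        # elapsed seconds, available as t.secs after the block exits
        self.secs = time.time() - self.start
```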

I used this method and found that the problem in my script was this membership test:

if key in dict.keys()

which seemed natural, since I had seen this before:

for key in sorted(dict.keys()):

However, in Python 2 dict.keys() builds a list, so this is a linear lookup that never uses the constant-time access of the underlying hash table. The right way is to test against the dict directly, without keys():

if key in dict
So the problem was solved, but I still found the timing context manager not simple and intuitive enough. Adding a timer means editing and re-indenting the original code, and moving a checkpoint means editing and re-indenting again.

My version of an even simpler timing script

Therefore, I wrote a simple script that offers similar checkpoint-based timing of code blocks while keeping the usage effort minimal.

Here is an example of using the timing script in the word count exercise. You can search for times to see all the edits needed in the script.

wordcount.py
#!/usr/bin/python -tt
# Copyright 2010 Google Inc.
# Licensed under the Apache License, Version 2.0
# http://www.apache.org/licenses/LICENSE-2.0

# Google's Python Class
# http://code.google.com/edu/languages/google-python-class/

"""Wordcount exercise
Google's Python class
"""

import sys

from basic.util import times

###

# This basic command line argument parsing code is provided and
# calls the print_words() and print_top() functions which you must define.

def count_words(filename):
    """helper method for the other 2, return dict"""
    # read file into words
    times.seg_start('count words start')
    f = open(filename, 'rU')
    word_count = {}
    for line in f:
        words = line.split()
        for word in words:
            word = word.lower()
            if word in word_count:    # 0.069 for 640k
            #if word_count.get(word): # 0.03s for 250k text, 0.084 for 640k
                word_count[word] += 1
            else:
                word_count[word] = 1
    times.seg_stop('count words stop')
    return word_count

def print_words(filename):
    "count words"
    #sort dict and print
    word_count = count_words(filename)
    for word in sorted(word_count.keys()):
        print word, word_count[word]


def print_top(filename):
    word_count = count_words(filename)
    word_count_pairs = word_count.items()
    times.seg_start('sorting start')
    word_count_pairs = sorted(word_count_pairs, key=lambda pair: pair[-1], reverse=True)
    times.seg_stop('sorting stop')
    for i in range(3):
        print word_count_pairs[i][0], word_count_pairs[i][1]
    times.last_seg()

def main():
  if len(sys.argv) != 3:
    print 'usage: ./wordcount.py {--count | --topcount} file'
    sys.exit(1)
  times.start(5)
  option = sys.argv[1]
  filename = sys.argv[2]
  if option == '--count':
    print_words(filename)
  elif option == '--topcount':
    print_top(filename)
  else:
    print 'unknown option: ' + option
    sys.exit(1)
  times.end()
if __name__ == '__main__':
  main()

I put some effort into the print formatting so that the output lines up:

output
==>| Timer start | set to 5 digits after decimal point
=> << | 0          count words start
      | 0.06900 s  count words stop  >>
=> << | 0          sorting start
      | 0.00800 s  sorting stop  >>
the 5027
to 3353
and 2831
=> | 0.00100 s since last point
=> | 0.00300 s since last point. Timer end.
| 0.08600 s Total time elapsed |

Here is the script itself:

times.py
import time

__author__ = 'dracodoc'
# measure script time duration in segments. http://dracodoc.github.io/
# the clock value list: start, segments, end
# usage: import times, put times.start, seg_start, end etc in line.
T = []
Digit = [7]


def start(digit=7):
    """Timer start. digit control the number width to align"""
    del T[:]  # clean up first
    Digit[0] = digit
    T.append(time.time())
    print '==>| Timer start | set to', Digit[0], 'digits after decimal point'


def last_seg(s='since last point'):
    """calculate the duration between last point till this one"""
    T.append(time.time())
    duration = T[-1] - T[-2]
    print "=> | %.*f s" % (Digit[0], duration), s


def seg_start(s='start...'):
    """set a segment start, always used with seg_stop in pairs"""
    T.append(time.time())
    print "=> << | 0", ' ' * (Digit[0] + 3), s


def seg_stop(s='...stop'):
    """set a segment end, always used with seg_start in pairs"""
    T.append(time.time())
    duration = T[-1] - T[-2]
    print "      | %.*f s " % (Digit[0], duration), s, ' >>'

def end(s='since last point. Timer end.'):
    T.append(time.time())
    duration = T[-1] - T[-2]
    total = T[-1] - T[0]
    print "=> | %.*f s" % (Digit[0], duration), s
    print "==>| %.*f s" % (Digit[0], total), 'Total time elapsed'

Of course, sometimes you want more features, like execution frequency or memory analysis; then you can always use a more powerful profiler like line_profiler or memory_profiler. The profiler guide I mentioned earlier has detailed introductions to them.
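For instance, the standard library's cProfile (no extra install needed) reports call counts and per-function times. A quick sketch, with a stand-in `work()` function as the workload:

```python
import cProfile
import io
import pstats

def work():
    # a stand-in workload to profile
    return sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# print the five most expensive entries by cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(5)
print(stream.getvalue())
```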

That being said, I still find my simple timing script often useful enough, and easy to use with minimal overhead.

