Parallelized Back Testing
As mentioned earlier, currently I am playing with trading strategies based on Support Vector Machines. At a high level, the approach is quite similar to what I have implemented for my ARMA+GARCH strategy. Briefly, the simulation goes as follows: we step through the series one period (day, week, etc) at a time. For each period, we use a history of pre-defined length to determine the best model parameters. Then, using these parameters, we forecast the next period.
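To make the loop concrete, here is a minimal sketch of the sequential simulation. fitBestModel and forecastOne are hypothetical placeholders for whatever model selection and forecasting the particular system (ARMA+GARCH, SVM, etc) performs:

# Walk forward through the series one observation at a time,
# refitting on a sliding window of the last 'history' points
rollingForecasts = function( series, history=500 )
{
   len = NROW( series )
   forecasts = rep( NA, len )

   for( ii in (history + 1):len )
   {
      window = series[(ii - history):(ii - 1)]   # the available history
      fit = fitBestModel( window )               # pick the best model parameters
      forecasts[ii] = forecastOne( fit )         # forecast the next period
   }

   return( forecasts )
}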
There are two ways to parallelize this process. The approach taken in the ARMA+GARCH system parallelizes the model selection (done for each period), while stepping through the series sequentially. There are good reasons why I did it this way: for instance, the garchSearch function is usable standalone. Another reason was probably that I was dumb enough to hope that it (the garchSearch function) may get picked up and added to a GARCH package (fGarch preferably), so that I don't have to maintain it myself.:)
In any case, this is definitely the less efficient approach. In parallel execution, it pays off to increase the size of each task executed in parallel. The more efficient alternative in our case is to step through the series in parallel, with each task doing the fitting of all candidate models for a given period. A different way to think about it is to count how many tasks are executed in total. The ARMA+GARCH system executes inputSeriesLength*numberOfModels tasks, while the second approach executes only inputSeriesLength tasks. Since the total amount of work is the same, the latter approach has bigger tasks and, thus, less overhead for starting/stopping new tasks.
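The two shapes of parallelism are easier to see side by side. In the sketch below, fitOneModel, fitAllModels and pickBest are hypothetical helpers; the only point is where mclapply sits:

library( parallel )

# Approach 1 (ARMA+GARCH system): the loop over periods is sequential and only
# the model search inside each period runs in parallel -
# inputSeriesLength*numberOfModels small tasks in total
best = vector( "list", length( periods ) )
for( ii in seq_along( periods ) ) {
   fits = mclapply( models, fitOneModel, period=periods[ii], mc.cores=cores )
   best[[ii]] = pickBest( fits )
}

# Approach 2 (used here): the periods themselves run in parallel and each task
# tries all models for its period - only inputSeriesLength, bigger, tasks
best = mclapply( periods, fitAllModels, models=models, mc.cores=cores )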
This approach has another advantage: it also works better with the way SVMs, Neural Networks and other similar tools are implemented in R. For instance, the e1071 package comes with a tune function which already does the tuning over a predefined set of models. Thus, if we want to make use of the provided tune, we need to parallelize the tune invocations.
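For reference, a single tune call over the same kind of grid used below might look like this, where x and y stand for the feature matrix and the response over the history window (names assumed for the example):

library( e1071 )

# Cross-validated grid search over gamma and cost for an SVM
tuned = tune( svm, train.x=x, train.y=y,
              ranges=list( gamma=10^(-5:-1), cost=10^(0:2) ),
              tunecontrol=tune.control( sampling="cross", cross=10 ) )

tuned$best.parameters    # the winning gamma/cost combination
tuned$best.performance   # its cross-validation error
tuned$best.model         # an svm refitted with the winning parameters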
Let’s take a look at the driver code for an SVM (the comments should be helpful in understanding the code):
svmComputeForecasts = function(
      data,
      response,
      history=500,
      modelPeriod="days",
      modelPeriodMultiple=1,
      trace=TRUE,
      startDate,
      endDate,
      # All following parameters are system (SVM, Neural Network, etc) dependent
      kernel="radial",
      gamma=10^(-5:-1),
      cost=10^(0:2),
      sampling="cross",
      cross=10,
      selectFeatures=TRUE,
      cores )
{
   require( e1071 )
   require( xts )        # also loads zoo: index, endpoints, reclass, na.trim
   require( parallel )   # mclapply

   stopifnot( NROW( data ) == NROW( response ) )

   len = NROW( response )

   # Determine the starting index
   if( !missing( startDate ) )
   {
      startIndex = max( len - NROW( index( data[paste( sep="", startDate, "/" )] ) ) + 1,
                        history + 2 )
   }
   else
   {
      startIndex = history + 2
   }

   # Determine the ending index
   if( missing( endDate ) )
   {
      lastIndex = len
   }
   else
   {
      lastIndex = NROW( index( data[paste( sep="", "/", endDate )] ) )
   }

   if( startIndex > lastIndex )
   {
      return( NULL )
   }

   modelPeriod = tolower( modelPeriod[1] )

   # Get the interesting indexes
   periods = index( data )[startIndex:lastIndex]

   # Compute the end points for each period (day, week, month, etc)
   endPoints = endpoints( periods, modelPeriod, modelPeriodMultiple )

   # Compute the starting points of each period, relative to the *data* index
   startPoints = endPoints + startIndex

   # Remove the last start point - it's outside
   length( startPoints ) = length( startPoints ) - 1

   # Make the end points relative to the *data* index
   endPoints = endPoints + startIndex - 1

   # Remove the first end point - it's always zero
   endPoints = tail( endPoints, -1 )

   stopifnot( length( endPoints ) == length( startPoints ) )

   # Initialize the number of cores
   if( missing( cores ) )
   {
      cores = 1
   }

   res = mclapply( seq( 1, length( startPoints ) ),
                   svmComputeOneForecast,
                   data=data,
                   # System dependent
                   response=response,
                   startPoints=startPoints,
                   endPoints=endPoints,
                   len=len,
                   history=history,
                   trace=trace,
                   kernel=kernel,
                   gamma=gamma,
                   cost=cost,
                   selectFeatures=selectFeatures,
                   mc.cores=cores )

   # Vectors to summarize the results
   forecasts = rep( NA, len )
   gammas = rep( NA, len )
   costs = rep( NA, len )
   performances = rep( NA, len )
   features = rep( NA, len )

   # Format the results
   for( ll in res )
   {
      # Prepare the indexes
      ii = ll[["index"]]
      jj = ii + NROW( ll[["forecasts"]] ) - 1

      # Copy the output
      forecasts[ii:jj] = ll[["forecasts"]]
      gammas[ii:jj] = ll[["gamma"]]
      costs[ii:jj] = ll[["cost"]]
      performances[ii:jj] = ll[["performance"]]

      # Encode the participating features as a bit mask stored in a single
      # integer. This representation limits us to max 32 features.
      features[ii:jj] = sum( 2^( ll[["features"]] - 1 ) )
   }

   sigUp = ifelse( forecasts >= 0, 1, 0 )
   sigUp[is.na( sigUp )] = 0

   sigDown = ifelse( forecasts < 0, -1, 0 )
   sigDown[is.na( sigDown )] = 0

   sig = sigUp + sigDown

   res = merge( reclass( sig, response ),
                reclass( sigUp, response ),
                reclass( sigDown, response ),
                na.trim( reclass( forecasts, response ) ),
                reclass( performances, response ),
                reclass( gammas, response ),
                reclass( costs, response ),
                reclass( features, response ),
                all=F )
   colnames( res ) = c( "Indicator", "Up", "Down", "Forecasts",
                        "Performance", "Gamma", "Cost", "Features" )

   return( res )
}
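For completeness, a call might look roughly like the one below. The data preparation is purely an assumption (it is not part of this post), and it also assumes svmComputeOneForecast is defined, for instance along the lines of the sketch after the next paragraph: data carries the feature columns and response the series being forecast, both xts objects on the same index.

library( quantmod )   # getSymbols, ROC, Cl

# Hypothetical setup: weekly re-fitting on S&P 500 daily returns,
# with three lagged returns as features
getSymbols( "^GSPC", from="1990-01-01" )
rets = na.trim( ROC( Cl( GSPC ), type="discrete" ) )
feats = na.trim( merge( lag( rets, 1 ), lag( rets, 2 ), lag( rets, 3 ) ) )
rets = rets[index( feats )]

fcasts = svmComputeForecasts( data=feats,
                              response=rets,
                              history=500,
                              modelPeriod="weeks",
                              cores=8 )
head( fcasts )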
The pattern that I want to emphasize culminates in the call to mclapply. Each instance of the function executed in parallel (svmComputeOneForecast in this example) needs an index to identify the period it has to operate on. This is precisely what seq(1,length(startPoints)) is for. Each task is passed a unique index identifying its period, and the period is retrieved via startPoints[uniqueIndex] and endPoints[uniqueIndex], respectively.
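The worker itself is not shown here, but a minimal sketch of the pattern might look like the following. Only the index handling mirrors the driver above; the feature selection (selectFeatures), the exact feature/response alignment and the use of e1071's tune for the model search are assumptions:

svmComputeOneForecast = function(
      id,          # the unique period index passed in by mclapply
      data,
      response,
      startPoints,
      endPoints,
      len,
      history=500,
      trace=TRUE,
      kernel="radial",
      gamma=10^(-5:-1),
      cost=10^(0:2),
      selectFeatures=TRUE )
{
   require( e1071 )
   require( xts )

   # Recover the period this task is responsible for
   startIndex = startPoints[id]
   endIndex = endPoints[id]

   # The history window is the 'history' observations preceding the period
   histIds = (startIndex - history - 1):(startIndex - 2)
   xx = coredata( data[histIds,] )
   yy = as.numeric( response[histIds,] )

   # Grid search over gamma/cost on the history window
   tuned = tune( svm, train.x=xx, train.y=yy,
                 kernel=kernel,
                 ranges=list( gamma=gamma, cost=cost ),
                 tunecontrol=tune.control( sampling="cross", cross=10 ) )

   # Forecast every point in the period with the winning model
   fore = predict( tuned$best.model, coredata( data[startIndex:endIndex,] ) )

   # Return what the driver's summary loop expects
   return( list( index=startIndex,
                 forecasts=as.numeric( fore ),
                 gamma=tuned$best.parameters$gamma,
                 cost=tuned$best.parameters$cost,
                 performance=tuned$best.performance,
                 features=seq( 1, NCOL( data ) ) ) )
}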
I have used this approach in an earlier implementation of the ARMA+GARCH system, and now I am using it for the SVM system. Likely I will be using the same approach with Neural Networks, Random Forests and everything else on my to-test list. I thought it may be helpful to others too.