domingo, 7 de julio de 2013

Parallel programming in Octave

Motivation

Nowadays, it is almost impossible to be working in a single core CPU. Chip prices have been dropping constantly and new programming models need to be in place to exploit the available processing power.
I've been using Octave for quite a while now and found myself with the need to write some parallel code to perform some time consuming tasks like network access in parallel and to use the full power of Octave at the same time.
Currently there are two options to go parallel in Octave. One of them is exploiting multiple cores in a single machine through functions like pararrayfun or parcellfun. The second option is to "scale out" the execution to multiple computers with a package like parallel
The parallel package implements a message passing programming model, the so called MPI. On the other hand, functions like pararrayfun or parcellfun have a more functional flavor. We'll focus on them in this article.

Basics

We'll begin with pararrayfun(). The basic signature of this function is as follows,

[o1o2...] = pararrayfun (nproc, fun, a1, a2, ...)

You provide this function with a function handler, fun, that will point to the function that does the actual work.  Also, you specify the input parameters to the function a1,a2, aN. These have to be arrays, all of the same size. Finally, nproc will specify the amoung of worker processes that will be used in the computation.
The way it works is really simple. Assume for simplicity that fun accepts two scalar parameters, and a1, a2 are two arrays, then pararrayfun() will traverse your arrays taking two scalars at a time and will evaluate your function on this pair. At the end, pararrayfun() will return a concatenation of whatever you return from fun.
The parallelism comes into play with nproc>1. If you have, for instance, nproc==4, you will have four processes trying to perform the evaluation of your function in parallel on separate parts of your input data.
The underlying implementation uses fork() for the multiprocessing, this comes with some overhead, so it is better that each of the chunks of computation, or evaluations of fun are heavy enough to make up the price being paid. To put it simple, suppose your array has 1000 elements, using 4 workers leaves an average of 250 elements being processed by each. If the load for the 250 elements is not heavy enough you'll be slowing things down instead of speeding up.

Example

In this section we go through an example on how to perform a simple computation in parallel with pararrayfun(). The code can be downloaded here.

function c = printArgs(lilist, a)
    lilist
    c = lilist.+a;
end

function retcode = eh(error)
    a = error
    retcode = zeros(3, 1); 
end
  
ilist = zeros(1,3);
output = pararrayfun(2, @(a)printArgs(ilist, a), [3,4,5] , "ErrorHandler" , @eh);

In line 1, we define a function printArgs() that will print its first argument, ilist,  and return a version of it which each of its elements increased by one.
In line 6, we define a eh(), which stands for error handler. This will be called when printArgs() fails and will be responsible of producing the output that printArgs() couldn't create.
Finally, in line 12 is where the magic happens. The first parameters tells pararrayfun that we will be using two processes. The second parameter is a function handler. It looks a bit tricky, but we will explain it.
The @(a) creates the handler based in printArgs(). The thing is that pararrayfun here is set for receiving a one parameter function, and printArgs receives two. What you're telling Octave here is "create a function with one parameter, a, by using printArgs(), and hard code the other parameter to ilist". In this way you get the one parameter function.
In the functional programming jargon you may say that you're closing over ilist. And the whole @(a)printArgs(ilist, a) expression is called a closure.
On execution, paraarrayfun() will call printArgs() with each of the elements in [3,4,5] distributing the elements between available processors in a first come first serve fashion. The code above will lead to the following three calls, with, at most, two of them at the same time,

printArgs(ilist, 3)
printArgs(ilist, 4)
printArgs(ilist, 5)

Remember ilist will always be the one you declare in line 11. Run it yourself and check the output!.