Overcoming the No Free Lunch Theorem in Cut-off Algorithms for Fork-Join programs by Alcides Fonseca

Editorial note: I will start blogging more about technical content. This one is about my latest research paper.

Have you heard about the No Free Lunch Theorem? Basically it occurs when different solutions for a class of programs are not better nor worse than other solutions on average.

A non-mathematical example: if you have a great pasta chef, a great sushi chef and a great chimichanga chef, it’s a case of the “No Free Lunch Theorem”, and not because they will be expensive, but because if you ask them to prepare meals from the three cuisines, the quality will be equal on average (each one with a great meal and two so-so ones).

So this occurs in search and optimization algorithms, in which algorithms can be fined tuned for a given problem, but that extra-performance will not be seen in other types of search and optimization problems. Maybe the latest research on Transfer learning will show that this is not the case.

Back to my paper, it should evidence of the application of the theorem in the case of granularity control mechanisms for parallel programs. But what are those granularity control mechanisms?

When you write a divide-and-conquer parallel program, you split a program in two independent halves that you computer in parallel. But if you have more than 2 cores on your CPU, you can subdivide each half again, and you can now use 4 cores. And so until you can’t subdivide. But if you subdivide to have 1000 tasks, but you only have 16 CPU cores, your program will actually be slower! Dividing tasks and schedulinging each micro-task takes time, and doing that 1000 times is stupid. So the solution is to create as many tasks as you have CPU cores¹, right?

If only it was so simple. If a program being split in half would result in two independent tasks that take the same time to conclude, than the answer might be yes [2]. But there are several unbalanced or asymmetrical task graphs like this one:

In these cases, you have to create a dynamic condition that stops creating parallel tasks, and begins solving the problem sequentially. Here is an example program, using the Fibonacci example³:

The major research question of my PhD was to find the ideal function for that cut-off criteria. I proposed two new approaches and I have even tried to use Evolutionary Computation to find it for me, but I could never get to a function that would outperform all others in all of the 24 benchmark programs I have used.

So the answer to my PhD main Research Question is that it is impossible because this is a case of the No Free Lunch Theorem! There is no optimal cut-off criteria!

I could have given up on this, publishing a paper that would show that there is no answer, in order to prevent future PhD students to waste their time researching this topic like I did. But I ended up. not giving up on the big-picture problem here: Automatically optimizing parallel programs.

If there is no single cut-off criteria that is better than the rest, and typically there is one that is the best for each kind of problem, I took on the quest to automatically choose that best criteria for each problem. Using data from either my benchmarks, and random synthetic benchmarks, I applied Machine Learning Techniques to learn and predict the best criteria for future programs.

It’s 2018 and Machine Learning has saved the day once more [4]

1. Or hyper-threads, or CUDA cores, or OpenCL compute-units, of simply different machines.

2. But you still have to consider lock and memory contention.

3. Yes, this is a dumb implementation of fibonacci. It could be better either by being sequential or by using memoization. However, this dumb version is very asymmetrical and it’s hard to predict when its useful to stop parallelizing: it makes a wonderful hello world for this kind of problem.

4. Back in 2012 I have already applied Machine Learning to a very similar problem: deciding if a data-parallel program would execute on the CPU or GPU