When you first take your embarrassingly parallel, embarrassingly slow application and throw it on a big grid, it can sing and dance and return results fast enough to make you not really wonder — nor particularly care — about how it got from here to there. Indeed, when your grid is built on top of unreliable and inhomogeneous resources, you probably really don't want to think too much about how the sausage gets made — and a well-built grid, such as Frontier, insulates you from such concerns for the most part by being intelligent and playing games with the resources on your behalf so that they behave like a team of workhorses rather than the willful herd of cats they are. But once you work with a grid for a while and start to get familiar with its behaviour and quirks, then thinking a bit about what's actually going on behind the scenes can help take an app that goes and make it hum instead. Case in point: Monte Carlo.
But first, let me explain what I mean by the "unreliable and inhomogeneous resources" composing the grid. Because in a way, these two are the same thing. When you launch a thousand tasks out into the cloud, you'd like to imagine that they end up on a thousand identical boxes which crank away for exactly 735.2 seconds until out pops a result, and you just come along and collect the ripe fruit. And that's certainly an option, but a grid like that is a lot more expensive to build, use and maintain than one which, like Frontier, aggregates whatever resources happen to be available. What really happens in a system like Frontier is that the first task might end up on a hellacious monster of a workstation, while the second ends up on a brand-new desktop box with lightning-fast integer performance but middlin' floating-point, the third lands on a machine that doesn't have quite enough RAM to finish the task or has some esoteric DLL problem, and so on. Like I said, Frontier's scheduler will help insulate your job from this, for instance by noticing that some tasks have been aborted or just aren't making progress and restarting those tasks elsewhere, and by "predictive scheduling", which uses selective redundancy (starting some tasks on multiple machines) to help make sure the job completes in a relatively predictable time. Even so, it's a statistical inevitability that your tasks will complete according to a curve rather than all at once.
You're going to see a large fraction of your tasks complete quickly — for the sake of argument, say half your tasks are done within 100 seconds. But those tasks which landed on the slower little machines that could might not end up finishing for another 50 seconds, and other tasks may need to be restarted at some point — they might average, say, 200 seconds total. Statistically, some small fraction of tasks will need to be restartedtwice, and hence take longer; and so on. The result is that the curve has a tail. Frontier's scheduler will help mitigate this dramatically, taking what might be a huge tail and reducing it significantly, but it's a fact of life that there'll still be a tail. All this means that if your app — like most apps — can't complete until every last task returns its piece of the puzzle, it could sit around waiting (again, just for example) 300 or 500 seconds, even though the majority of the tasks have long since completed.
But some applications which use randomness — for instance, those based on Monte Carlo techniques (though the same idea applies to similar randomized sampling methods which aren't technically Monte Carlo) — can apply a clever approach to defeat this statistical Murphy's Law and "cut off the tail" entirely. So let's talk about the Monte Carlo method (herein "MC"); don't worry, we'll come back and tie this all together shortly. Skip ahead if you're familiar with MC. MC is an analysis technique to find some unknown quantity by evaluating a bunch of random points and combining their results into a single coherent picture. For instance, if we didn't have a convenient little equation for the area of a circle, we could determine the area by throwing a bunch of darts at random into a square containing the circle and counting how many hit inside the circle vs. how many we threw — that times the area of the square equals the area of the circle. Go ahead, try it, it's fun. Or you could, say, analyze how quickly traffic tends to move through a set of intersections by creating a bunch of cars at random and watching their speeds as you simulate them driving through the intersections; repeat this many many times, then combine the results. As it turns out, there are a surprisingly large number of uses for MC, in fields from physics to finance.
So back to the tail. If you're computing 10,000 tasks' worth of different random scenarios and combining the results, you could wait for that ten thousandth task to complete — but as I said above, that last task will inevitably take longer than the average task by some nontrivial factor, and in the meantime the grid is able to do less and less work on your behalf. Half your tasks (say) complete within 100 seconds, and maybe within 300 seconds 90% have completed, but you're still sitting around for that last task to slink in 500 seconds later.
Instead, why not launch 20,000 tasks? It'll take twice as long to get to the halfway point, but once you do, you have the results of 10,000 random tasks — who really careswhich 10,000 they are? They're random anyhow*!= Cancel the rest: you're done! You've done roughly the same amount of computation as the first case, but suddenly you're done in 200 seconds instead of 500. That's a speedup I'll be happy to live with. Even better, if you tell Frontier not to use predictive scheduling — which decreases the tail at the price of average efficiency — you'll see even more of a speedup, because you're "cutting off the tail" yourself, so you might as well take advantage of as much raw power as you can, even if each individual task finishes in a less predictable time.
(One tradeoff is that because Frontier is delivering more power over a shorter time with this approach, your cost will be marginally higher in some cases for the increased capacity, but this will often be offset by the fact that by turning redundancy off, you're able to complete the job with less overall computation. But pricing models and cost are a topic for another post.)
Parabon tried this with a 3D renderer using Monte Carlo techniques we wrote for research purposes, and saw a genuinely impressive speedup. In fact, we took it a step further and displayed the results continuously, as each task returned, gradually enhancing the image being rendered until it looked "good enough". The results were beautiful: an image forming before your eyes, white noise at first until shapes began to emerge from the haze of mosquitoes until one could see the final image taking shape — a small fraction of the time before it would have completed. Saved a lot of time on "false starts", too — quite often we'd decide, based on this "rough draft," to tweak a light or move the camera. Each time, we'd just cancel the job and start a new one, all before the original image was even 10% complete!
*For those mathematical purists out there: yes, I neglected to mention bias. By picking the half of the tasks which return first, you're technically introducing a bias towards those samples which can be evaluated more quickly. In practice, though, this bias rarely matters, as long as you have either many more tasks than engines or many samples per task; in either case, the computation required of any given sample tends to average out. Put another way, as long as your sample time is << the time until job cutoff, the correlation between any sample's runtime and the likelihood that it will contribute to the final result is negligible. But if bias matters to you, then you should make sure to chart out your results as a function how long a job has been running and make sure there's no correlation, or at least run a few jobs through to completion and compare their results against similar jobs run with this tweak and make sure they match within uncertainty bounds.