The Frontier API > Foundation

Documentation Home 

Foundation

Frontier® is a massively scalable, grid computing platform that allows applications to easily utilize network-aggregated computational power atop many different operating systems. Specifically, Frontier draws on the otherwise unused computational capacity of relatively low-power, unreliable, high-latency nodes in the form of Internet-connected desktop computers running in parallel to create a coherent, high-power, reliable whole.

Behind this platform is the Frontier Application Programming Interface (API), the framework that makes creating, launching, monitoring, and controlling enormous compute-intensive jobs from an average desktop computer possible. Consisting of two primary components—the Task Runtime API and the Client API, the Frontier API enables developers to quickly build new applications or port existing applications to run on Frontier.

This document describes the Frontier API and presents how it can be used to develop applications to utilize the unprecedented power of Frontier. Specifically, we'll first take a look at the architecture of the Frontier platform, discuss the components of the Frontier API, examine the basic elements and design considerations that go into writing a Frontier application, and provide a technical overview of the Frontier API itself.

 

The Frontier Platform

The Frontier platform is made up of three main components, each of which plays a significant role in performing computational work on Frontier:

 

Frontier Basics

Jobs and Tasks

Computational work to be performed on Frontier is grouped into a single, relatively isolated unit called a job. Within a job, work is divided into an arbitrary set of individual tasks, with each task being executed independently on a single node. Each job is defined by a set of elements (which we'll describe in the next section) and a set of tasks, while each task is defined by a set of elements it contains, a set of elements it requires, a list of parameters, and an entry point in the form of a Java™ class name.

Tasks running on Frontier have the following characteristics, each of which will be discussed in more detail later within this document:

Although tasks within a job tend to be homogeneous and launched simultaneously, this behavior is not mandated. Tasks can be completely unrelated and launched at different times over the lifetime of a job. Once a job or task is submitted to the Frontier server, it remains resident until explicitly removed. For a task, this includes all results, regardless of whether the task has yet to run, is running, or is completed. For a job, this includes all job elements and tasks.

Data and Executable Elements

An element is the mechanism used to efficiently transport relatively large chunks of binary data required to execute a task. Data and executable elements may be associated with a single job or task. They are sent from a client application to the server and directed to computational nodes as required before the execution of a task is initiated. A task may refer to elements that exist either as part of that task or as part of the job that contains it. Any other reference will not resolve correctly and will result in a malformed task.

Executable elements provide the Java bytecode instructions necessary to run a task. Except for the runtime environment, all classes required to run a given task must exist in one of the executable elements explicitly referenced in the task specification. Each executable element is formatted as a Java jar file and is made available by the engine to the Java Virtual Machine (JVM) in which the task is executed.

Data elements are black-box chunks of data that are referred to in a task's parameter list, and hence are implicitly required for running a task. All data elements thus referenced are sent to a compute engine before a task's execution begins and are made available as streams to the running task via requests to the task context. A task can handle these streams of data however it deems appropriate.

Task Definition

A task is defined by the job of which it is a part, the elements it contains, a list of executable elements it requires, a list of parameters, and an 'entry-point' class name. The latter two are described in more detail below.

The parameter list is the most important component of a task definition. It is possible—though not necessary—for a job to contain a set of tasks that differ only by the contents of their parameter lists. Put simply, a parameter list is a set of name <-> value pairs. Each name is a short, case-sensitive string composed of alphanumeric characters and underscores, and each value is one of a set of the following predefined types of parameter values: boolean, integer, scalar, string, date, data element reference, binary data, array of parameter values, or structure. As a structure is defined as a set of name <-> value pairs with the same form as the parameter list itself, parameters can actually contain entire hierarchies of information.

The entry-point class name specifies the name of a Java class that is included in the executable elements referenced by a task. This class must be public and must implement the Task interface as described in The Task Runtime API section.

Task Status

Task status reports are the primary method tasks use to communicate with the client application. These reports include run mode, results or exceptions, and progress, as well as other pieces of information that the Frontier compute engine and server may include such as computational work performed, etc. Run mode can include unstarted, running, and complete, as well as the more exceptional modes of paused and aborted. Results can be arbitrary and are generated by the task using the same name <-> value mapping mechanism as employed for task parameters. Exceptions, composed of codes and a textual description, are generated either for tasks that complete by throwing exceptions or that are 'malformed' and cannot be started, or for tasks that cannot be run on available engines due to resource limitations. Since the former type of exception would occur no matter where a task is run, it is considered the natural result of running a task, and the task generating the exception is marked complete. The latter set of exceptions cause the task to be marked aborted.

Tasks announce results and progress to the engine at arbitrary intervals—generally, whenever it is convenient for the task to do so. After each report, the Frontier compute engine may save and/or report results to the Frontier server to be persisted and routed to applicable client sessions. The last such status report before normal termination of a task is considered the final result and is consistently sent back to the server and maintained as such. Each new report of results replaces any and all previous reports.

Progress information may also optionally be attached to results, reported alone, or attached to checkpoints (described in the next section). Progress is a strictly monotonically nondecreasing, positive scalar over the lifetime of a task (including across checkpoint restarts). If reported, at any given stage of the execution of a task, a task's 'progress' is considered the last progress reported. Any new progress reported must be greater than or equal to the last.

Checkpoints

The tasks comprising a job may not always run from start to finish in a single session. The most common reason for this is temporary interruption on the provider node, such as when a provider moves his mouse when a task is only halfway complete. If such an event occurred, all the work that had been done prior to the mouse move would simply be discarded, and the task would have to start over from the beginning when the machine was once again idle. In fact, long running tasks would possibly never complete—for example, if a task ran for more than a day and a provider checked her mail every morning, the task would never have a chance to run from start to finish. A better approach would be to start the task from where it left off. Another, albeit less frequent, situation in which a task had to be stopped in the middle would occur if the Frontier server reassigned it to another node, for instance, from a slow machine to a faster one.

In these situations, before a task is stopped, a snapshot of its state must be saved, from which it can be restarted. Frontier uses a mechanism called checkpoints to achieve this capture of a task's state. Basically, a task takes a snapshot of itself whenever convenient, packages it up, and ships it off to the engine to be saved to disk. A task can then be stopped quickly and easily at a moment's notice, and only the work done since the last time it logged a checkpoint will be lost. The engine can use the most recent checkpoint to restart a task when it's allowed to once again start doing work, or send it to the server when requested. This mechanism is easy for tasks to use, since tasks themselves decide when they can most easily package up their state. Its real beauty, though, is that it even works when a task is shut down abruptly: the engine can stop computation fast when interrupted by the provider, and the mechanism even works when the computer is unexpectedly rebooted.

Checkpoints themselves are very similar in format to the set of name <-> value pairs used for task parameters. In fact, what a task actually reports as a checkpoint is simply a set of parameters that should be used in place of the original parameters to start a new task from something close to the task's current state. So, when a task is started, the parameters it is given may be either the initial parameters specified by the client application or some set of parameters logged as a checkpoint by an earlier execution of the same task. The task should continue executing based on these parameters, behaving as closely as possible to how it would have behaved had no interruption ever occurred.

For a trivial example of checkpoints, let's consider a toy task that sums all the integers from N1 to N2. The two most obvious parameters would be N1 and N2; however, we'll introduce a third parameter, CurrentTotal. If we wanted to sum 1 through 100, we'd set the following parameters: N1 = 1, N2 = 100, CurrentTotal = 0. The task would start looping from N1 to N2, logging a checkpoint after each iteration. After the first iteration, the checkpoint might look like N1 = 2, N2 = 100, CurrentTotal = 1. After the second iteration, N1 = 3, N2 = 100, CurrentTotal = 3. After the third, N1 = 4, N2 = 100, CurrentTotal = 6. Let's say the power goes off during the fourth iteration. When the machine comes back up, the task will be restarted using this last checkpoint instead of the initial parameters. Conceptually, it will now compute the sum of the integers from 4 to 100, adding 6 to the result. Hence, the answer sent back will be the same as would have been reported had the task never been interrupted, without having to re-compute the sum from 1 through 3.

 

Design Considerations

When developing applications to take advantage of the Frontier platform, one must take into account several considerations that would not necessarily be encountered when dealing with more traditional, localized platforms. This section explains several of these issues and discusses how they affect application design and implementation.

Use of Java for Task Executable Code

The security model for an Internet-based grid computing platform requires that neither users nor providers be considered trusted entities. Unlike most local systems, users do not own or directly control the computational nodes that comprise a Frontier grid. Because Frontier allows arbitrary tasks to be executed using resources owned and operated by independent providers, engines must be able to execute unreviewed, third-party user code on demand, without risk of violating the security and integrity of a provider's computer, data, and privacy. Writing task executable code in the Java programming language ensures that this objective is achieved.

Frontier supports native tasks (written, for example, in C/C++ or Fortran), however, Frontier's security policy limits their execution to nodes explicitly designated for such. Users can configure nodes under their authority into a so-called virtual private grid for exclusive execution of their native tasks. Alternatively, special arrangements can be made with a provider of native-enabled engines. A tutorial for the creation of native tasks is provided in a separate document.

Java was specifically chosen over other programming languages for its inherent security features, namely, its JVM technology, which provides a 'sandbox' inside which an engine can securely process tasks on a provider's computer. The JVM is arguably the most flexible, robust, and well-established modern sandbox technology. Its sandbox is given a restrictive security policy that prohibits, among other things, direct network access, disk access, and access to native methods. The only route through which tasks can access the outside world is through interfaces specifically designed for this purpose. Tasks can communicate only directly with the Frontier compute engine and back to the Frontier server, employing disk storage within carefully established limits. This ensures that arbitrary code can be used to execute tasks on providers' machines with no manual intervention on behalf of either providers or Parabon. Thus, providers can rest assured that tasks cannot harm their machines or violate their privacy.

As tasks are run within the JVM, the code sent to engines and used to run tasks must be in the form of valid Java bytecode. Generally, Java bytecode is produced by compiling source written in the Java programming language; however, Frontier does not mandate the use of the Java language. Any language that can be compiled to Java bytecode can be used, as long as it can interface with the Frontier Runtime API. Beyond this, the client application need not be written in Java. In fact, aside from task code itself, the only portion of an application that must use Java is the piece that communicates with the Frontier server via the Client API. For example, a client application could be written almost entirely in C++, with some Java Native Interface (JNI) glue to talk to the client library, plus tasks written in Java.

Division of Jobs into Tasks

A job to be processed on Frontier must be explicitly broken up into a set of relatively small tasks. Each of these tasks is given some piece of data and is sent off to an engine to run independently and report back results. The results of these tasks are then gathered by the client application and assembled like pieces of a puzzle to form a coherent image. This requires that the amount of work a task performs be small enough both to be processed effectively given the resources (i.e., memory, disk storage, etc.) available on a provider node and to return a final result within a relatively short time—generally, a few minutes or a few hours.

Further, as nodes are inherently high-latency, the time required for a round trip between tasks running on two different nodes can be quite long—often on the order of several seconds and possibly as long as minutes or even hours. Thus, frequent communication between tasks is not feasible. Though inter-task communication may be more feasible in the future, this functionality is not supported in the current version of Frontier.

High Compute-to-Data Ratio

The individual machines that provide Frontier's computational power often have relatively low bandwidth connections to the Frontier server. Further, the central server must communicate with many of these nodes simultaneously, meaning that communications bandwidth on the server side is at a premium as well. This means that sending large amounts of data to nodes and returning large results can take significant amounts of time.

However, after a task's data has been sent to a node and before its results are sent back, the node can efficiently crunch away on a task for minutes or hours. Adding all this together, we see that tasks run most efficiently if they have large amounts of computation to perform and relatively small pieces of data required and results to report. This is known as a high 'compute-to-data ratio'.

For comparison, let's consider two extreme cases. In a best-case scenario, a task with a very high inherent compute-to-data ratio would be compute-limited and would tend to scale and run as well on Frontier as it would on a traditional cluster of machines. Thus, it would use Frontier's power at an efficiency approaching 100 percent. On the other hand, in a worst-case scenario, a task with a very low compute-to-data ratio would be bandwidth-limited and would spend most of its time transferring data back and forth to and from the server. Such a task could, in fact, take longer to complete on Frontier than on a single machine.

The Launch-and-Listen Paradigm

For most applications, the lifetime of the actual computation maps directly to a subset of the lifetime of an application. That is, computation begins sometime after the application starts, ends sometime before the application exits, and runs only while the application is active. Frontier follows a significantly different paradigm. Applications utilizing Frontier tend to use a two-stage methodology: launching and listening.

Both stages can occur within a single instance of an application, two or more separate invocations of the application, or even via two or more completely separate applications. When launching a job, tasks are created and sent to the server for processing. Listening involves gathering results and status updates or removing tasks from Frontier. Launching and listening can intermingle in a single session or occur over the course of several sessions with the server. The precise paradigms for establishing, observing, and terminating jobs are described in The Client API.

Unpredictable Task Execution

Nodes in Frontier are by nature unreliable, in terms of whether or not they can successfully complete a given task within available resources, what their effective power over the course of a given task is, and whether they still exist and can communicate with the server by the time they have completed a task. While the Frontier server can help mitigate the first and completely hide the last by reassigning tasks, task execution time may become even more unpredictable. As a result, the expected time to completion for a given task follows a complex probabilistic distribution determined by a number of factors.

The most significant of these factors is the actual amount of computational work required to complete a given task. Although Frontier can even help mitigate this through techniques such as redundancy and selective node assignment, applications that take this distribution into account implicitly tend to behave most satisfactorily. For instance, applications that compute intermediate results can give hints as to how well a computation is performing and quickly provide feedback as to the expected final results. Some algorithms—for instance, Monte Carlo analysis—even behave quite well in the presence of only a random subset of task results, with the only penalty being additional variance in the 'partial' result.

Erroneous Results

The Frontier server can employ a number of safeguards to protect against compute engines returning invalid results, either accidentally or maliciously, including full task redundancy and result comparison. However, it is possible that an invalid result could go undetected. Hence, it is considered good practice for client applications to validate results whenever possible via internal consistency checks, discarding of outliers, etc.

Intellectual Property Protection

Often, the parameters, data, or executable elements of a task may represent a user's intellectual property. To protect this sensitive information, Frontier employs a number of mechanisms, including:

Moreover, a given task—even if cracked—represents only a tiny fraction of an entire job: one piece of a large puzzle, as it were. To provide even more protection, users can implement additional safeguards within their applications beyond those Frontier provides, such as obfuscating executable bytecode, excluding intellectual property and identifying marks from task parameters and interface, and obfuscating data and results.