The FTAG Model for
Creating Fault-tolerant Software
This month, the focus is on an ongoing collaborative research effort
between prominent computer scientists from the Tokyo Institute of Technology
and the University of Arizona to implement a model for designing more robust
software.
by Steven Myers
One of the chief goals of software engineers who work on mission-critical
systems is to write code that is as robust as possible. Certain types of
software systems simply must work properly at all times, no matter what
happens. In order to aid software designers in the creation of fault-tolerant
software, Professors Takuya Katayama of the Tokyo Institute of Technology
(TITech) and Rick Schlichting of the University of Arizona have undertaken
a joint research project aimed at implementing a new attribute-based programming
model, called FTAG.
The collaboration started in 1990, when Prof. Schlichting began a sabbatical
at TITech, funded by a grant from the US National Science Foundation. A
follow-on grant from the NSF supported exchanges over the next two years,
and more recent support for the project has come from the US Office of Naval
Research. Currently, most of the actual technical work is going on at TITech
and the Japan Advanced Institute of Science and Technology (JAIST, where
Prof. Katayama is also an active faculty member).
The FTAG model
In the FTAG model, a program is written as a series of module decompositions,
with provisions for redoing and replicating modules in order to meet
fault-tolerance requirements. In simple terms, the
program starts with one main "top" module, which gets broken down
into smaller modules in a recursive, tree-like fashion. The model consists
primarily of two parts: type definitions and module definitions. FTAG has
basically the same set of primitive types found in traditional programming
languages, such as C and Pascal, and it supports type constructors that
can be used to make more complex types, such as arrays and records.
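The article does not show FTAG syntax, so the following Python sketch only
illustrates the idea of recursive decomposition. The names Node and sum_module,
and the choice of summing a list as the task, are assumptions made for
illustration and are not part of FTAG.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """One module instance in the computation tree (hypothetical structure)."""
    name: str
    inputs: Dict[str, object]
    children: List["Node"] = field(default_factory=list)
    outputs: Dict[str, object] = field(default_factory=dict)


def sum_module(values, name="top"):
    """Decompose 'sum a list' into two smaller sum modules, recursively."""
    node = Node(name=name, inputs={"values": values})
    if len(values) <= 1:
        # Leaf module: small enough to compute directly.
        node.outputs["total"] = values[0] if values else 0
        return node
    mid = len(values) // 2
    left = sum_module(values[:mid], name + ".L")
    right = sum_module(values[mid:], name + ".R")
    node.children = [left, right]
    # The parent reads only the output attributes of its own children.
    node.outputs["total"] = left.outputs["total"] + right.outputs["total"]
    return node


tree = sum_module([3, 1, 4, 1, 5])
```

The top module ends up with two children, "top.L" and "top.R", and each level
of the tree combines only the attribute values handed up by the level below it.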
The fault-tolerance features of FTAG include built-in functions for redoing,
replication, and stable-object access. The redoing function replaces a portion
of the computation tree with a new computation, which provides a mechanism
for recovering when part of a computation has failed.
Each computation stores a set of attribute values that can be tested
to determine the validity of the computation. If a failure is detected in
a certain module, then the entire execution starting at that module is discarded
and recomputed. (This action does not affect the execution of other program
modules.)
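As a rough illustration of redoing, the Python sketch below runs a module,
tests its output attributes, and discards and repeats the whole execution if
the test fails. The function names, the retry limit, and the simulated fault
are all assumptions made for this example; FTAG's actual redo operation is
not shown in the article.

```python
def run_with_redo(module, is_valid, inputs, max_attempts=3):
    """Run `module`; if its output attributes fail the validity test,
    discard the result and execute the module again from the start."""
    for attempt in range(1, max_attempts + 1):
        outputs = module(inputs, attempt)
        if is_valid(inputs, outputs):
            return outputs, attempt
        # Invalid result: nothing from this attempt is kept.
    raise RuntimeError(f"module still invalid after {max_attempts} attempts")


def flaky_sort(inputs, attempt):
    """Stand-in module that simulates a fault on its first execution."""
    data = inputs["values"]
    if attempt == 1:
        return {"sorted": list(data)}  # simulated failure: left unsorted
    return {"sorted": sorted(data)}


def sorted_ok(inputs, outputs):
    """Validity test applied to the attribute values of the computation."""
    out = outputs["sorted"]
    in_order = all(a <= b for a, b in zip(out, out[1:]))
    return in_order and len(out) == len(inputs["values"])


outputs, attempts = run_with_redo(flaky_sort, sorted_ok, {"values": [3, 1, 2]})
```

Only the failing module is repeated here; a caller running other modules
alongside it would not need to restart them.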
Replication enables copies of an FTAG module to be created and executed
in parallel, providing backups in the event of a failure in the execution
of one of the modules. The stable-object access feature, meanwhile, provides
a means for determining which attribute values are important enough to be
stored somewhere other than in main memory, so that they can be retrieved
if the reconstruction of a computation becomes necessary.
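The two ideas can be sketched together in Python: copies of a module run in
parallel, the first copy to succeed supplies the result, and that result is
written to a file so it survives the loss of in-memory state. The names
run_replicated and StableStore, the use of threads, and the JSON file are
assumptions for illustration only, not FTAG's own mechanisms.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_replicated(module, inputs, replicas=3):
    """Run copies of `module` in parallel; return the first successful result."""
    errors = []
    with ThreadPoolExecutor(max_workers=replicas) as pool:
        futures = [pool.submit(module, inputs, i) for i in range(replicas)]
        for future in as_completed(futures):
            try:
                return future.result()
            except Exception as exc:  # this replica failed; try the next one
                errors.append(exc)
    raise RuntimeError(f"all {replicas} replicas failed: {errors}")


class StableStore:
    """Keeps selected attribute values in a file rather than in main memory."""

    def __init__(self, path):
        self.path = path

    def load_all(self):
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)

    def save(self, name, value):
        data = self.load_all()
        data[name] = value
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)  # swap the new file in as a single step


def square(inputs, replica_id):
    """Stand-in module; replica 0 simulates a failure."""
    if replica_id == 0:
        raise RuntimeError("simulated replica failure")
    return inputs["x"] ** 2


path = os.path.join(tempfile.mkdtemp(), "stable.json")
result = run_replicated(square, {"x": 7})
StableStore(path).save("square", result)
recovered = StableStore(path).load_all()  # a fresh object reads the saved value
```

Deciding which attributes go through the store is left to the programmer in
this sketch, which mirrors the selective nature of stable-object access
described above.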
The advantages of FTAG
FTAG offers a number of advantages for writing fault-tolerant software.
Programs are static and declarative in nature, making it easier to understand
and incrementally create this type of software. Also, syntactic and semantic
definitions are kept completely separate, contributing to program readability.
Finally, programs in FTAG exhibit a high degree of locality: information
is passed between functions only through attributes, and only between
functions that have a parent/child relationship in the execution tree.
FTAG is well suited to implementation on a loosely coupled multiprocessor
system, such as a cluster of workstations. These systems are of special
interest to designers of fault-tolerant software because they consist of
multiple processors with independent failure modes, and are thus more prone
to partial failures than are traditional systems.
Execution in FTAG depends only on the presence of certain attribute values,
so a simple scheme can be used for allocating module decompositions to the
processors. A node in the computation tree is assigned to a processor upon
creation, with that processor being responsible for all communication between
the node and its children. All nodes can be executed in parallel.
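The article does not say which allocation policy is used, so the Python sketch
below assumes the simplest one: processors are handed out round-robin in the
order nodes are created. The Allocator class, the dictionary layout of a node,
and the binary tree shape are hypothetical, and the sketch records placement
only; it does not execute the nodes.

```python
from itertools import count


class Allocator:
    """Assigns each new node to a processor, round-robin (assumed policy)."""

    def __init__(self, n_processors):
        self.n_processors = n_processors
        self._created = count()

    def assign(self):
        return next(self._created) % self.n_processors


def build(depth, allocator, name="top"):
    """Create a node, give it a processor, then create its children."""
    node = {"name": name, "processor": allocator.assign(), "children": []}
    if depth > 0:
        node["children"] = [
            build(depth - 1, allocator, f"{name}.{i}") for i in range(2)
        ]
    return node


def placement(node, table=None):
    """Flatten the tree into a {node name: processor} table."""
    table = {} if table is None else table
    table[node["name"]] = node["processor"]
    for child in node["children"]:
        placement(child, table)
    return table


tree = build(2, Allocator(3))
table = placement(tree)
```

With three processors and a tree of depth two, the seven nodes are spread
across all three processors, and each node's processor is the one that would
handle traffic to and from its children.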
Future work on the project will focus on implementing various fault-tolerant
paradigms using the FTAG framework, and investigating the features needed
to realize each paradigm. According to Prof. Katayama, the next step in
the project will involve the programming of practical applications using
the model to test the true benefits of the FTAG approach.
For more information about the FTAG project, send e-mail to Prof. Schlichting
(rick@cs.arizona.edu) or Prof. Katayama (katayama@cs.titech.ac.jp).
Many readers are no doubt familiar with Prof. Schlichting through his
JapanCS project, which is a completely separate activity from that described
here. The goal of the JapanCS project is to help make research results in
Japanese computing and computer science more accessible to people outside
Japan. Schlichting and his students operate a Usenet newsgroup called comp.research.japan
and maintain an electronic archive at cs.arizona.edu (accessible via anonymous
ftp and the Web).