The Brazos Parallel Programming Environment

In addition to serving traditional "grand challenge" applications, such as global weather prediction and oil reservoir simulation, high performance parallel computing offers the potential to both simplify and improve the basic corporate computing environment. Today, this environment consists of workstations or personal computers connected by a network. Although individuals in this environment can communicate, and can even explicitly share data, they are limited to problems whose nature and size fit within a single machine's memory and processing resources.

Larger and more complex problems require more processors and larger memories. All parallel machines use multiple processors to achieve the processing power needed to run large-scale programs; however, parallel machines differ in the structure of the memory subsystem. In tightly-coupled shared memory systems, processors are connected to global memory via a system bus, and all processors have equal access to this memory. This approach offers high performance for a small number of processors, but is relatively expensive.

Recent improvements in commodity general-purpose networks and processors, and in particular the development of low-order multiprocessor workstations, have made feasible an inexpensive alternative to large bus-based multiprocessor systems. Networks of stand-alone workstations can be used to implement distributed memory multiprocessors capable of providing the computational power needed to solve large-scale scientific problems. In such a system, each processing node (consisting of one to four processors) directly accesses only local memory, and access to memory on remote nodes requires that data be explicitly passed in messages. Parallel programs are typically developed for these systems using message passing libraries such as PVM or MPI.
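The explicit data movement that message passing requires can be illustrated with a minimal sketch. It does not use PVM or MPI; instead it assumes Python threads standing in for nodes and queues standing in for the network, and the names worker and message_passing_sum are hypothetical. Each "node" sees only the data that arrives in a message and must send its partial result back explicitly.

```python
import queue
import threading


def worker(inbox, outbox):
    """A stand-in for one node: it owns no shared data, only what it receives."""
    chunk = inbox.get()        # explicit receive of this node's share of the data
    outbox.put(sum(chunk))     # explicit send of the partial result


def message_passing_sum(values, num_nodes=4):
    """Sum values by scattering chunks to nodes and gathering partial sums."""
    inboxes = [queue.Queue() for _ in range(num_nodes)]
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(inbox, results))
        for inbox in inboxes
    ]
    for t in threads:
        t.start()
    # Scatter: the programmer decides which data goes to which node.
    for rank, inbox in enumerate(inboxes):
        inbox.put(values[rank::num_nodes])
    # Gather: one result message per node.
    total = sum(results.get() for _ in range(num_nodes))
    for t in threads:
        t.join()
    return total
```

Note that the scatter and gather steps are part of the application code; this is the bookkeeping that a shared-memory model removes.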

In shared memory systems, underlying hardware support provides the mechanisms for ensuring a consistent view among all processors of shared data. The resulting programming paradigm more closely resembles that of sequential programming because the programmer is not required to explicitly move data between processing nodes. Shared-memory parallel programming is therefore generally considered more intuitive than its distributed-memory programming counterpart.
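As a rough illustration of why the shared-memory style reads more like sequential code, the sketch below sums a list using Python threads that all read the same list directly and update one shared accumulator. The function name shared_memory_sum is hypothetical, and a lock stands in for whatever synchronization a real system would provide.

```python
import threading


def shared_memory_sum(values, num_threads=4):
    """Sum values with threads that access the shared list in place."""
    total = [0]                 # shared accumulator visible to every thread
    lock = threading.Lock()     # synchronization is still the programmer's job

    def work(rank):
        # No data is copied or sent: each thread simply indexes shared data.
        partial = sum(values[rank::num_threads])
        with lock:
            total[0] += partial

    threads = [threading.Thread(target=work, args=(r,)) for r in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

The only coordination visible in the program is the lock around the shared update; there is no code that moves data between workers.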

It is possible to combine the programming advantages of shared memory and the cost advantages of distributed memory by providing hardware or software support for an abstraction of shared memory on top of a distributed memory system. These distributed shared memory (DSM) systems allow programmers to use the shared-memory programming paradigm, even though the underlying network of workstations is actually a distributed memory system. The DSM system support, whether hardware or software, transparently intercepts user accesses to remote memory and translates these accesses into messages appropriate to the underlying communication medium. The programmer is thus given the illusion of a large global address space encompassing the memory on all available workstations. The Stanford DASH multiprocessor is an example of a hardware-supported DSM system; the Rice Munin and TreadMarks DSM systems are examples of software-supported DSM.
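The idea of intercepting an access and turning it into a message can be sketched with a toy, read-only model. This is not how Brazos, Munin, or TreadMarks is implemented; the class names Node and ToyDSM, the round-robin page placement, and the tiny PAGE_SIZE are all assumptions made for illustration, and write coherence is omitted entirely.

```python
PAGE_SIZE = 4  # words per page; deliberately tiny for the example


class Node:
    """One workstation: it physically holds only the pages it is home for."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.pages = {}  # page number -> list of words

    def handle_fetch(self, page_no):
        """Stand-in for servicing a page-request message from another node."""
        return list(self.pages[page_no])


class ToyDSM:
    """Presents one global address space over several nodes (reads only)."""

    def __init__(self, data, num_nodes):
        self.nodes = [Node(i) for i in range(num_nodes)]
        self.cache = {i: {} for i in range(num_nodes)}  # per-node page copies
        self.messages = 0                               # remote fetches issued
        for start in range(0, len(data), PAGE_SIZE):
            page_no = start // PAGE_SIZE
            self.home(page_no).pages[page_no] = list(data[start:start + PAGE_SIZE])

    def home(self, page_no):
        # Assumed placement policy: pages are spread round-robin over nodes.
        return self.nodes[page_no % len(self.nodes)]

    def read(self, node_id, address):
        page_no, offset = divmod(address, PAGE_SIZE)
        home = self.home(page_no)
        if home.node_id == node_id:
            return home.pages[page_no][offset]        # local access, no message
        cached = self.cache[node_id]
        if page_no not in cached:
            self.messages += 1                        # access becomes a message
            cached[page_no] = home.handle_fetch(page_no)
        return cached[page_no][offset]
```

The caller only ever supplies a global address; whether the read is satisfied locally, from a cached copy, or by a message to the home node is hidden behind read().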

DSM systems implemented on networks of workstations represent a low-cost entry point into the parallel computing arena, in terms of both startup cost and maintenance. However, the development of parallel programs that achieve substantial performance improvements over their sequential counterparts is often difficult. Many factors, including the degree of data sharing between processors, initial data placement, and the methods used to communicate data between processors, can have a substantial negative impact on the overall performance of parallel programs. Efforts to improve parallel program performance are based upon software support, hardware support, or a combination of both, and can be undertaken at design-time, compile-time, or runtime.

Design-time improvements involve changing the program itself. This technique typically involves iteration: performance analysis through the use of performance tuning tools, such as profilers (e.g., gprof) or performance monitors; changes to the program to address areas of poor performance; and re-execution. This cycle is repeated until the desired (or maximum attainable) level of performance is achieved. Although this is a necessary step in the development of any parallel or sequential program, the lack of effective performance tuning tools and the complex interaction between multiple computing nodes make the process more time-consuming and difficult in parallel computing environments than in sequential ones.

Compiler support can also be used to improve the execution behavior of parallel programs. This support includes compile-time data transformations, intended to reduce traffic associated with maintaining consistent views of data; compiler-directed prefetching, which seeks to bring data closer to the requesting processor before the actual request is issued; and data restructuring and initial placement, designed to reduce the effects of non-uniform data access latencies. Compiler-based approaches to parallel program performance tuning are attractive because they require little user involvement. The disadvantage of compiler-based techniques is that a non-standard compiler must be developed with specific performance improvements in mind. Such compilers are difficult to develop and are not likely to be widely available. Furthermore, it is difficult for compile-time analysis to predict some runtime-dependent data access behavior, particularly across procedure boundaries.

Finally, there are two types of parallel performance support that can be employed at runtime: architectural improvements and adaptive runtime support. Architectural improvements typically involve the addition of specialized hardware that addresses a specific performance problem exhibited by a large class of parallel programs. Examples of such hardware support include on-chip simultaneous multithreading and lookahead buffers that implement prefetching. Hardware support has the advantage of being fast, but it is expensive and usually very platform-specific.

Adaptive runtime support dynamically changes the behavior of the program during execution based upon performance-related observations made during both the current and previous executions. Prefetching, changes in the way data is moved, migration of computational threads between processing nodes, and initial data placement are examples of the kinds of performance tuning mechanisms that can be added to a basic DSM runtime system. Adaptive performance tuning at runtime requires no user involvement, no costly architectural changes, and no specialized compiler support, and it can be implemented on commodity processing nodes and networks.
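One of these mechanisms, prefetching driven by observed behavior, can be sketched as follows. This is a simplified illustration rather than the mechanism Brazos uses: the class name AdaptivePrefetcher, the threshold parameter, and the "sequential miss streak" heuristic are assumptions, the sketch uses only observations from the current run, and a real runtime would issue the prefetch asynchronously instead of inline.

```python
class AdaptivePrefetcher:
    """Starts prefetching the next page once misses look sequential."""

    def __init__(self, fetch_page, threshold=2):
        self.fetch_page = fetch_page  # callable: page number -> page data
        self.threshold = threshold    # sequential misses needed before adapting
        self.cache = {}
        self.last_miss = None
        self.streak = 0               # consecutive misses on adjacent pages
        self.demand_misses = 0        # accesses that had to wait for a fetch

    def access(self, page_no):
        if page_no not in self.cache:
            # Observation step: record the miss and whether it was sequential.
            self.demand_misses += 1
            self.cache[page_no] = self.fetch_page(page_no)
            if self.last_miss is not None and page_no == self.last_miss + 1:
                self.streak += 1
            else:
                self.streak = 0
            self.last_miss = page_no
        # Adaptation step: once a sequential pattern is seen, fetch ahead.
        if self.streak >= self.threshold and (page_no + 1) not in self.cache:
            self.cache[page_no + 1] = self.fetch_page(page_no + 1)
        return self.cache[page_no]
```

With a sequential scan, only the first few accesses pay for a demand fetch; with a scattered access pattern the streak never builds, so no extra pages are fetched. The application code calling access() is the same in both cases.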

We have developed Brazos, a software DSM system and parallel programming environment for x86 machines running Windows NT that incorporates adaptive runtime performance tuning mechanisms. In addition, Brazos employs an adaptation of "scope consistency" that is appropriate for software-only implementations, and that makes selective use of multicast to minimize the number of required consistency messages. Brazos is fully multithreaded, so that it can take advantage of local multiprocessing when this is available. The Brazos design also includes runtime optimizations that recognize locality and assign processor affinity appropriately at runtime. This allows programs to take advantage of the local tightly-coupled shared memory, while transparently interacting with remote "virtual" shared memory physically resident on other clusters.

 

Brazos Publications

(These papers are PostScript documents. Ghostview and Ghostscript tools for viewing and printing PostScript can be obtained from http://www.cs.wisc.edu/~ghost/aladdin/.)

- E. Speight, H. Abdel-Shafi, and J.K. Bennett. An Integrated Shared-Memory/Message Passing API for Cluster-Based Multicomputing. In Proceedings of the Second International Conference on Parallel and Distributed Computing and Networks (PDCN), December 1998.
- E. Speight and J.K. Bennett. Using Multicast and Multithreading to Reduce Communication in Software DSM Systems. In Proceedings of the Fourth Symposium on High Performance Computer Architecture (HPCA), 312-323, February 1998.
- E. Speight and J.K. Bennett. Brazos: A Third Generation DSM System. In Proceedings of the 1997 USENIX Windows/NT Workshop, August 1997.
- E. Speight. Efficient Runtime Support for Cluster-Based Distributed Shared Memory Multiprocessors. Ph.D. Thesis, Rice University, August 1997.
- E. Speight and J.K. Bennett. Reducing Coherence-Related Communication in Software Distributed Shared Memory Systems. ECE TR-98-03. Rice University, Houston, Texas.