The goal of this research is to develop techniques for building highly scalable shared memory multiprocessors. Factors that have traditionally limited the scalability of shared memory systems include:
| enforcing sequential consistency, | |
| inefficient synchronization, | |
| memory latency and bandwidth limitations, | |
| bus and memory contention, | |
| the necessity to enforce inclusion on lower-level caches, and | |
| limited I/O bandwidth. |
We began this effort with a paper design of Willow, an architecture that attempted to address each of these issues directly. Willow is a shared memory multiprocessor designed to provide the system capacity and performance needed to support hundreds of microprocessors. We developed a detailed design of a sixty-four processor prototype that tested most of our ideas about scalability. Our results from studying the Willow design, together with the research of others investigating these issues, have led us to a cluster-based approach.
Parallel computing using networks of workstations has increasingly become a viable alternative to large, tightly-coupled parallel machines. Recently, it has become feasible for a single workstation to contain two to four processors. These multiprocessor workstations, combined with a high-speed network, comprise a cluster-based multiprocessor. Our recent research has focused on the means by which these individual clusters are connected to the network. In particular, we are investigating shared network caches and multi-channel networks.
A network cache augments the caching already provided by private, per-processor caches, and is particularly appropriate when large, second-level processor caches are not present. The network cache stores local copies of remote data and provides three potential benefits: increased intra-cluster sharing, reduced network traffic, and prefetching. Intra-cluster sharing occurs when processors on the same cluster actively share data without a corresponding network data transfer. There are two ways in which this can occur: direct processor-to-processor transfers, and sharing of network cache lines. Only direct transfer is provided in systems without a network cache. By providing a network cache, we increase the amount of intra-cluster sharing, since the network cache holds copies of remote data that processors can access locally. Many requests for remote data that would have otherwise incurred long network latencies are satisfied in the network cache, thereby reducing the demand placed on the network. Finally, network cache lines that are longer than processor cache lines may result in improved performance by prefetching needed data. We are currently examining the structure, feasibility, and performance of cluster multiprocessors employing network caches.
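The traffic-reduction argument above can be illustrated with a toy model. The sketch below is not the simulator used in this research; it assumes read-only accesses to remote data, fully associative LRU caches, and no direct processor-to-processor transfers, and all names (`LRUCache`, `count_network_fetches`, the trace) are hypothetical. It counts how many requests must cross the network with and without a shared network cache.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative LRU cache of line addresses (toy model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def lookup(self, line):
        """Return True on a hit and mark the line most recently used."""
        if line in self.lines:
            self.lines.move_to_end(line)
            return True
        return False

    def insert(self, line):
        """Insert a line, evicting the least recently used one if full."""
        self.lines[line] = True
        self.lines.move_to_end(line)
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)


def count_network_fetches(trace, n_procs, proc_lines, net_lines):
    """Count requests for remote lines that must go out on the network.

    trace is a list of (processor id, remote line address) reads.
    net_lines == 0 models a cluster without a network cache.
    """
    proc_caches = [LRUCache(proc_lines) for _ in range(n_procs)]
    net_cache = LRUCache(net_lines) if net_lines > 0 else None
    fetches = 0
    for proc, line in trace:
        if proc_caches[proc].lookup(line):
            continue  # hit in the private processor cache
        if net_cache is not None and net_cache.lookup(line):
            proc_caches[proc].insert(line)  # satisfied inside the cluster
            continue
        fetches += 1  # remote access over the network
        if net_cache is not None:
            net_cache.insert(line)
        proc_caches[proc].insert(line)
    return fetches


# Two processors in one cluster read the same eight remote lines in turn.
trace = [(0, line) for line in range(8)] + [(1, line) for line in range(8)]
without_nc = count_network_fetches(trace, n_procs=2, proc_lines=4, net_lines=0)
with_nc = count_network_fetches(trace, n_procs=2, proc_lines=4, net_lines=16)
print(without_nc, with_nc)  # 16 8
```

In this trace the second processor finds every line in the network cache, so the number of network fetches is halved. Real workloads also involve writes and coherence traffic, which this sketch deliberately leaves out.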
We have observed that the adverse impact of cache inclusion-related evictions must be mitigated for the performance benefits of network caches to be fully realized. We have evaluated three network cache architectural alternatives designed to address this issue: increasing network cache associativity, adding a network victim cache, and adding a tag cache to relax inclusion requirements for clean network cache lines. We found that a four-way set associative network cache, or a four-entry victim cache, dramatically reduced execution time for all applications examined. Employing a tag cache for replaced clean lines improved performance for some applications, but this benefit is highly dependent upon processor cache associativity.
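To make the first two alternatives concrete, here is a minimal sketch of a set-associative network cache with an optional victim cache. It is an illustration under stated assumptions, not the evaluated design: the class and its parameters are hypothetical, replacement is LRU, and a line is assumed to force inclusion-related invalidations of processor cache copies only when it leaves both the main array and the victim cache.

```python
from collections import OrderedDict


class NetworkCache:
    """Set-associative cache with an optional LRU victim cache (toy model)."""

    def __init__(self, n_sets, ways, victim_entries=0):
        self.n_sets = n_sets
        self.ways = ways
        self.sets = [OrderedDict() for _ in range(n_sets)]
        self.victims = OrderedDict()
        self.victim_entries = victim_entries
        self.misses = 0
        self.evictions = 0  # lines that leave the network cache entirely

    def access(self, line):
        """Access a line address; return True on a hit."""
        s = self.sets[line % self.n_sets]
        if line in s:
            s.move_to_end(line)
            return True
        hit = line in self.victims
        if hit:
            del self.victims[line]  # promote back into the main array
        else:
            self.misses += 1
        s[line] = True
        if len(s) > self.ways:
            old, _ = s.popitem(last=False)  # displace the LRU line of the set
            self._displace(old)
        return hit

    def _displace(self, line):
        """Move a displaced line to the victim cache, or evict it."""
        if self.victim_entries == 0:
            self.evictions += 1
            return
        self.victims[line] = True
        if len(self.victims) > self.victim_entries:
            self.victims.popitem(last=False)
            self.evictions += 1


def run(cache, trace):
    """Replay a trace and return (misses, evictions)."""
    for line in trace:
        cache.access(line)
    return cache.misses, cache.evictions


# Lines 0 and 8 conflict in both the 8-set and the 2-set configuration.
trace = [0, 8] * 10
print(run(NetworkCache(n_sets=8, ways=1), trace))                    # (20, 19)
print(run(NetworkCache(n_sets=8, ways=1, victim_entries=4), trace))  # (2, 0)
print(run(NetworkCache(n_sets=2, ways=4), trace))                    # (2, 0)
```

On this deliberately adversarial trace, the direct-mapped cache evicts a line on almost every access, while either a four-entry victim cache or four-way associativity (with the same eight lines of main capacity) removes the conflict evictions. The tag cache alternative is not modeled here.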
| H. Abdel-Shafi, W.E. Speight and J.K. Bennett. Efficient user-level thread migration and checkpointing on Windows NT clusters. In Proceedings of the 1999 USENIX Windows/NT Research Symposium, 1-10, July, 1999. | |
| W.E. Speight, H. Abdel-Shafi, and J.K. Bennett. Realizing the performance potential of the Virtual Interface Architecture. In Proceedings of the 13th ACM International Conference on Supercomputing (ICS), June 1999. | |
| E. Speight, H. Abdel-Shafi, and J.K. Bennett. An Integrated Shared-Memory/Message Passing API for Cluster-Based Multicomputing. In Proceedings of the Second International Conference on Parallel and Distributed Computing and Networks (PDCN), 146-153, December, 1998. | |
| E. Speight and J.K. Bennett. Using Multicast and Multithreading to Reduce Communication in Software DSM Systems. In Proceedings of the Fourth International Symposium on High-Performance Computer Architecture (HPCA), 312-323, February, 1998. | |
| W.E. Speight and J.K. Bennett. Brazos: A third generation DSM system. In Proceedings of the 1997 USENIX Windows/NT Workshop, 95-106, 1997. | |
| J.K. Bennett, K.E. Fletcher, and W.E. Speight. The Performance Value of Shared Network Caches in Clustered Multiprocessor Workstations. In Proceedings of the 16th International Conference on Distributed Computing Systems (ICDCS-16), May 1996. | |
| Yanyang Xiao and J.K. Bennett. Memory organization in multi-channel optical networks: NUMA and COMA revisited. In Proceedings of the 10th ACM International Conference on Supercomputing (ICS), May 1996. |
| J.B. Carter, J.K. Bennett, and W. Zwaenepoel. Techniques for reducing consistency-related communications in distributed shared memory systems. ACM Transactions on Computer Systems, 13(3), 205-243, Aug. 1995. | |
| J.K. Bennett, J.B. Carter, and W. Zwaenepoel. Adaptive software cache management for distributed shared memory architectures. In The cache coherence problem in shared memory multiprocessors: software solutions, Igor Tartalja and Veljko Milutinovic, editors, IEEE Computer Society Press, 1995. | |
| R. Mukherjee and J.K. Bennett. Operating system design principles for scalable shared memory multiprocessors. In Proceedings of the Symposium on Parallel and Distributed Computing Systems (PDCS-95), 1995. | |
| R. Mukherjee, J.K. Bennett, and J.A. Greenwood. The effects of architecture on the performance of latency hiding via rapid context switching. In Proceedings of the 7th IASTED International Conference on Parallel and Distributed Computing and Systems, 1995. | |
| W.E. Speight, K.E. Fletcher, and J.K. Bennett. Working set requirements and performance of network caches in cluster-based multiprocessors. ECE TR 9414. | |
| K.E. Fletcher, W.E. Speight, and J.K. Bennett. Techniques for reducing the impact of cache inclusion in shared network cache multiprocessors. ECE TR 9413. | |
| J.K. Bennett, S. Dwarkadas, J.A. Greenwood, and Evan Speight. Willow: A scalable shared memory multiprocessor. In Proceedings of SuperComputing '92, IEEE Computer Society Press, pp. 336-345, Nov. 1992. | |
| J.K. Bennett, J.B. Carter, and W. Zwaenepoel. Toward large scale shared memory multiprocessing. In Scalable Shared Memory Multiprocessors, M. Dubois and Shreekant Thakkar, editors, Kluwer Academic Publishers, Nov. 1991. | |
| J.B. Carter, J.K. Bennett, and W. Zwaenepoel. Implementation and performance of Munin. In Proceedings of the 13th Symposium on Operating System Principles, pp. 152-164, Oct. 1991. | |
| J.K. Bennett, J.B. Carter, and W. Zwaenepoel. Adaptive software cache management for distributed shared memory architectures. In Proceedings of the 17th International Symposium on Computer Architecture, pp. 125-134, May 1990. | |
| J.K. Bennett, J.B. Carter, and W. Zwaenepoel. Munin: Distributed shared memory based on type-specific memory coherence. In Proceedings of the Second ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pp. 168-176, Mar. 1990. | |
| R. Mukherjee and J.K. Bennett. Simulation of parallel programs on a shared memory multiprocessor. In Proceedings of the Hawaii International Conference on System Sciences (HICSS23), pp. 242-251, Dec. 1989. |