Performance Debugging

Performance debugging is the process of understanding and correcting performance problems in parallel programs that produce correct results but fail to make full use of the available computational resources. It is made difficult by the complex runtime interaction among the parallel program, the underlying operating system, and the hardware architecture. Performance degradation rooted in this interaction is often not readily apparent to the application programmer. Thus, the challenge in performance debugging is to present application performance problems in the context of the program's logical behavior, without requiring the application programmer to have detailed knowledge of the underlying system components.

We have developed ParaView, a prototype performance debugging system for parallel programs written for shared-memory multiprocessors. The ParaView system includes trace generation tools, an on-the-fly trace analysis preprocessor, and an X Window System-based graphical user interface. ParaView is designed to assist the programmer in identifying problems such as poor data partitioning, false sharing, contention for shared resources, and unnecessary or inappropriate synchronization.

Publications Related to Performance Debugging 

• W.E. Speight, H. Abdel-Shafi, and J.K. Bennett. Realizing the performance potential of the Virtual Interface Architecture. In Proceedings of the 13th ACM International Conference on Supercomputing (ICS), June 1999.
• E. Speight and J.K. Bennett. Using Multicast and Multithreading to Reduce Communication in Software DSM Systems. In Proceedings of the Fourth Symposium on High Performance Computer Architecture (HPCA), pp. 312-323, February 1998.
• J.K. Bennett, K.E. Fletcher, and W.E. Speight. The Performance Value of Shared Network Caches in Clustered Multiprocessor Workstations. In Proceedings of the 16th International Conference on Distributed Computing Systems (ICDCS-16), May 1996.
• Yanyang Xiao and J.K. Bennett. Memory organization in multi-channel optical networks: NUMA and COMA revisited. In Proceedings of the 10th ACM International Conference on Supercomputing (ICS), May 1996.
• R. Mukherjee and J.K. Bennett. Operating system design principles for scalable shared memory multiprocessors. In Proceedings of the Symposium on Parallel and Distributed Computing Systems (PDCS-95), 1995.
• R. Mukherjee, J.K. Bennett, and J.A. Greenwood. The effects of architecture on the performance of latency hiding via rapid context switching. In Proceedings of the 7th IASTED International Conference on Parallel and Distributed Computing and Systems, 1995.
• W.E. Speight, K.E. Fletcher, and J.K. Bennett. Working set requirements and performance of network caches in cluster-based multiprocessors. ECE TR 9414.
• K.E. Fletcher, W.E. Speight, and J.K. Bennett. Techniques for reducing the impact of cache inclusion in shared network cache multiprocessors. ECE TR 9413.
• J.K. Bennett, J.B. Carter, and W. Zwaenepoel. Adaptive software cache management for distributed shared memory architectures. In Proceedings of the 17th International Symposium on Computer Architecture, pp. 125-134, May 1990.