Co-Mentors: William Gropp (Computer Science) and Marc Snir (Computer Science)
Social Impact: Improve performance of critical scientific codes running on the nation’s largest HPC systems.
Project description: Some of the most challenging problems facing society today require the fastest available computers. Such problems range from understanding climate change to modeling the actions of diseases. A critical barrier to successful simulations is the scalability of massively parallel simulation codes, and one element of scalability is the effective placement of computing processes on a massively parallel machine. As several studies have shown, the mapping of processes to cores and nodes can have a significant impact on the performance of tightly coupled applications. In addition, MPI, the dominant programming model for these applications, provides an interface through which applications can describe how their processes communicate; this interface was intended to let MPI implementations choose an appropriate mapping of processes to nodes and cores. However, no MPI implementation provides a useful realization of this feature, in part because finding an optimal mapping is NP-complete. The challenge is to develop practical approximate solutions that match common application needs.
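To make the effect of process placement concrete, the following minimal sketch counts how many nearest-neighbor communication links cross node boundaries under two different mappings. It is an illustration only: the 4 x 4 process grid, the four processes per node, and all function names are assumptions chosen for the example, and the count of inter-node links is a crude stand-in for real communication cost.

```python
def stencil_edges(rows, cols):
    """Undirected nearest-neighbor links of a non-periodic rows x cols process grid."""
    edges = []
    for r in range(rows):
        for c in range(cols):
            rank = r * cols + c
            if c + 1 < cols:
                edges.append((rank, rank + 1))      # east neighbor
            if r + 1 < rows:
                edges.append((rank, rank + cols))   # south neighbor
    return edges


def block_mapping(nranks, per_node):
    """Consecutive ranks fill one node after another (hypothetical default placement)."""
    return [rank // per_node for rank in range(nranks)]


def tiled_mapping(rows, cols, tile_rows, tile_cols):
    """Each node holds a tile_rows x tile_cols tile of the process grid."""
    tiles_per_row = cols // tile_cols
    return [(r // tile_rows) * tiles_per_row + (c // tile_cols)
            for r in range(rows) for c in range(cols)]


def inter_node_edges(edges, node_of):
    """Number of communication links whose endpoints sit on different nodes."""
    return sum(1 for a, b in edges if node_of[a] != node_of[b])


edges = stencil_edges(4, 4)
block_cost = inter_node_edges(edges, block_mapping(16, 4))
tiled_cost = inter_node_edges(edges, tiled_mapping(4, 4, 2, 2))
print(block_cost, tiled_cost)  # 12 8
```

In this toy case the tiled placement sends 8 of the 24 links across nodes instead of 12, without changing the application's communication pattern at all.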
In this project, students will develop both benchmarks that measure the impact of different process assignments and tools that produce good (if not optimal) process assignments for common communication patterns. Several approaches will be considered, including bandwidth-reducing permutations of the matrix that represents the communication graph, as proposed by Snir and Hoefler, and hierarchical approaches that consider the assignment to nodes, chips, and cores separately. The project will result in better implementations of the MPI process topology routines that can be integrated into existing open source implementations such as MPICH and Open MPI, as well as research papers detailing the approach and results. The work will use the Blue Waters system to measure the impact of process topology assignment on scalability, and will involve collaboration with the Center for the Extreme Scale Simulation of Plasma Coupled Combustion to apply these methods in a large-scale, complex application.
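As a rough illustration of a bandwidth-reducing permutation, the sketch below applies the classic reverse Cuthill-McKee heuristic to a small communication graph. This is a generic textbook heuristic used here as a stand-in, not necessarily the method of the work cited above, and the function names and the six-process example graph are assumptions made for the example.

```python
from collections import deque


def bandwidth(edges, order):
    """Largest distance between communicating processes in the given ordering."""
    pos = {v: i for i, v in enumerate(order)}
    return max(abs(pos[a] - pos[b]) for a, b in edges)


def reverse_cuthill_mckee(n, edges):
    """Return an ordering of vertices 0..n-1 that tends to reduce bandwidth."""
    adj = {v: set() for v in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    def key(v):
        # Prefer low-degree vertices; break ties by label for determinism.
        return (len(adj[v]), v)

    visited, order = set(), []
    for start in sorted(adj, key=key):      # handles disconnected graphs
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:                        # breadth-first traversal
            v = queue.popleft()
            order.append(v)
            for w in sorted(adj[v] - visited, key=key):
                visited.add(w)
                queue.append(w)
    return order[::-1]                      # the "reverse" in the name


# Hypothetical pattern: six processes communicating along a chain,
# but with ranks numbered in a scrambled order.
chain = [(0, 5), (5, 2), (2, 4), (4, 1), (1, 3)]
original = bandwidth(chain, list(range(6)))
reordered = bandwidth(chain, reverse_cuthill_mckee(6, chain))
print(original, reordered)  # 5 1
```

An ordering with small bandwidth keeps communicating processes close together in rank order, so filling nodes with consecutive positions of that ordering tends to keep more traffic within a node.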