Thursday, July 18, 2019, 3:00 pm
Speaker: TUM-IAS Visiting Fellow Professor Abhinav Bhatele
Title: On Mitigating Congestion in High Performance Networks
High performance networks enable fast communication between compute nodes on large clusters and supercomputers. Even so, many parallel programs spend a significant fraction of their execution time performing communication (process-to-process messages, filesystem reads/writes, etc.) on these networks. This is because network resources are shared among different traffic classes and among concurrently running programs (jobs), which leads to network congestion and, as a result, to run-to-run performance variability and performance degradation of individual jobs. No satisfactory solutions yet exist for mitigating such performance degradation on systems that allow jobs to share the network.
In this talk, Dr. Bhatele will present a novel algorithm to mitigate congestion on high performance networks by minimizing sharing of network links among jobs. The algorithm is a new resource allocation policy used by the job scheduler on fat-tree network based systems to assign "isolated" node partitions to individual jobs. These isolated partitions prevent multiple jobs from sharing the same network links, and as a result, completely eliminate inter-job network interference.
He will also present his work on investigating performance variability arising from network effects on supercomputers that use a dragonfly topology, specifically a Cray XC40 equipped with the Aries interconnect.
Venue: Leibniz Supercomputing Centre, Seminar Room 2