March 13, 2020
Simulating the stars at exascale requires HIP solutions
As GPU architectures have become the standard for scientific computing, application teams have had to retrofit their scientific codes to run on new systems. Even teams with codes that have been re-engineered for GPUs must continually adapt them for new architectures.
Evan Schneider of Princeton University, though, began developing her code for GPUs at the outset. In 2012, Schneider faced the challenge of figuring out how to solve huge astrophysics problems using GPU clusters. What began on small clusters at the University of Arizona with her PhD advisor, Brant Robertson—currently an associate professor at the University of California, Santa Cruz—eventually was run on the now-decommissioned Cray XK7 Titan supercomputer at the Oak Ridge Leadership Computing Facility (OLCF), a US Department of Energy (DOE) Office of Science User Facility located at DOE's Oak Ridge National Laboratory. The code—called Cholla, or Computational hydrodynamics on ∥ (parallel) architectures—is now one of the first codes being rewritten for Frontier, an exascale system to be deployed at the OLCF in 2021.
"With Frontier, there's going to be so much more power available on the GPUs," Schneider said. "It really doesn't make sense to do almost anything on the CPUs anymore, so a lot of what we're working out is getting some of our additional physics modules running on the GPUs."
The code is one of eight in the Center for Accelerated Application Readiness (CAAR), an effort to prepare scientific applications for Frontier. Cholla is used to simulate the physical systems involved in galaxy evolution, the process by which galaxies change over time. Galaxies are made of not only stars but also dust and gas, which interact to influence this evolution. The team's goal is to run a simulation of the Milky Way that incorporates all of the gas physics at work in the galaxy, in addition to all of its stars.
"We need high-resolution models because we really want to track the gas in all its different phases—warm, cold, hot, high-velocity, and so on," Schneider said. "We want to understand the gas physics driving star formation and why galaxies stop forming stars. To leverage the observational data we already have, we need to do an extremely large simulation."
Cholla currently uses NVIDIA's CUDA programming platform to run on the OLCF's IBM AC922 Summit system, which features NVIDIA Tesla V100 GPUs. Now, Schneider and her team, with CAAR liaison Reuben Budiardja in the OLCF's Scientific Computing Group and representatives from AMD and Cray, are using the Heterogeneous-Compute Interface for Portability (HIP) to do just what its name suggests: translate certain pieces of the code so that they are portable to Frontier, which will feature Cray's Shasta architecture and Slingshot interconnect as well as AMD EPYC CPUs and AMD Radeon Instinct GPUs. This translation process lets users such as Schneider adapt their codes to new GPU-based systems like Frontier.
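Part of what makes this kind of port tractable is that many CUDA runtime calls have HIP counterparts that differ only in their prefix (for example, cudaMalloc and hipMalloc). The toy C++ sketch below illustrates that naming correspondence by renaming a handful of identifiers in a source string. It is only an illustration: the function name cuda_to_hip and the short lookup table are hypothetical, it is not the Cholla team's method or AMD's conversion tooling, and a real port also requires manual review and performance tuning.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy illustration only (hypothetical helper, not part of HIP or Cholla):
// rename a few CUDA runtime identifiers to their HIP counterparts.
std::string cuda_to_hip(std::string src) {
    // Longer names come first so that "cudaMemcpy" cannot partially
    // rewrite "cudaMemcpyHostToDevice" before the full name is handled.
    static const std::vector<std::pair<std::string, std::string>> table = {
        {"cuda_runtime.h",         "hip/hip_runtime.h"},
        {"cudaMemcpyHostToDevice", "hipMemcpyHostToDevice"},
        {"cudaMemcpyDeviceToHost", "hipMemcpyDeviceToHost"},
        {"cudaDeviceSynchronize",  "hipDeviceSynchronize"},
        {"cudaMemcpy",             "hipMemcpy"},
        {"cudaMalloc",             "hipMalloc"},
        {"cudaFree",               "hipFree"},
    };

    for (const auto& entry : table) {
        const std::string& from = entry.first;
        const std::string& to = entry.second;
        std::size_t pos = 0;
        // Replace every occurrence of `from`, resuming the search just
        // past the text that was inserted.
        while ((pos = src.find(from, pos)) != std::string::npos) {
            src.replace(pos, from.size(), to);
            pos += to.size();
        }
    }
    return src;
}
```

With this sketch, passing the line "cudaMemcpy(d, h, n, cudaMemcpyHostToDevice);" returns "hipMemcpy(d, h, n, hipMemcpyHostToDevice);", while text containing none of the listed names is returned unchanged.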
Schneider's graduate student, Orlando Warren at the University of Pittsburgh—where Schneider recently accepted a position as an assistant professor—has already rewritten much of the GPU portion of the code to be compatible with HIP. Next, the team will rewrite the pieces of Cholla currently running on CPUs, so that these can run on GPUs as well.
Robertson is working with his graduate student, Bruno Villasenor, who is adding substantial pieces to Cholla, including the calculations needed to solve for gravity in the team's giant Milky Way simulation. Schneider is coordinating the effort to re-engineer the code as well as adding what she calls "bells and whistles" to further refine the simulations necessary to understand star formation.
With Frontier, the team believes it will be able to simulate star formation at high resolution.
"Right now, we'd like to identify how gas leaves the galaxy and returns to it and how that affects the process of star formation in the Milky Way. The higher resolution we can get, the better we can understand the physical processes of the gas, and that ends up affecting many different problems in astrophysics."
The last step, Schneider said, is ensuring that the new code works when transferred to thousands of GPUs rather than running on just a few, a task that requires a large-scale high-performance computing system like Summit. The team will run large-scale tests on Summit before running on the Frontier system when it's deployed next year.