
Enhancing HPC at TACC with Exostellar’s Hybrid Cloud Solution

INTRODUCTION

Enhancing High-Performance Computing at TACC with Exostellar’s Hybrid Cloud Solution

The Texas Advanced Computing Center (TACC), renowned for its cutting-edge research in HPC, deployed the petascale Frontera system in 2019, providing new capabilities for advanced research. However, increasing demand led to longer queue wait times for users. To address this, TACC used Exostellar’s Infrastructure Optimizer to harness the hybrid cloud, enabling users to burst workloads to AWS when on-premises resources are constrained. Infrastructure Optimizer automatically provisions optimal cloud resources based on demand predictions, avoiding queue backlogs. This hybrid approach gives users the best of both worlds: the scale and flexibility of the cloud alongside efficient use of on-premises HPC assets.

“Cloud resources offer enormous potential for the users of TACC, particularly for throughput users. But only if they can manage and control the costs. Exostellar’s innovative hybrid cloud solution can reduce wait times for our users but also optimize our resources and costs. This pilot shows the potential to seamlessly integrate our on-premises resources with AWS through Exostellar’s product.”

Dan Stanzione, Ph.D.
Associate Vice President for Research at UT Austin and Executive Director, TACC

THE CHALLENGE

As a leading high-performance computing center, TACC enables researchers to push boundaries across science and engineering domains by providing access to state-of-the-art computational capabilities. However, because of its popularity, Frontera is always operating at full capacity and is therefore susceptible to extended queue times when user demand is high. Researchers who depend on TACC to run complex simulations, data analysis, and modeling can wait anywhere from hours to days before their jobs start executing. These waits reduce research productivity and can impede urgent computing operations, highlighting the need for a more flexible and scalable way to provide users with high-powered computing in a timely manner.

While many of the jobs that run on Frontera require the scale of the supercomputer, a number of them simply use Frontera for “throughput computing”: typically single-server jobs where the end user needs to run hundreds or thousands of them. This type of job seemed a good candidate for a first exploration of the cloud.
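The “throughput computing” profile described above (single-server jobs submitted in large numbers) can be expressed as a simple selection filter. The sketch below is a hypothetical heuristic, not TACC’s or Exostellar’s actual selection logic: the `QueuedJob` type, the `throughput_candidates` function, and the `min_jobs_per_user` threshold are all illustrative assumptions.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class QueuedJob:
    job_id: int
    user: str
    num_nodes: int  # nodes requested in the (hypothetical) Slurm submission


def throughput_candidates(queue: List[QueuedJob], min_jobs_per_user: int = 100) -> List[QueuedJob]:
    """Return single-node jobs from users who have many of them queued.

    Hypothetical heuristic: a user with at least `min_jobs_per_user`
    single-node jobs pending is treated as a throughput user.
    """
    single_node = [job for job in queue if job.num_nodes == 1]
    per_user = Counter(job.user for job in single_node)
    return [job for job in single_node if per_user[job.user] >= min_jobs_per_user]


if __name__ == "__main__":
    queue = [QueuedJob(i, "alice", 1) for i in range(3)]
    queue += [QueuedJob(10, "bob", 1), QueuedJob(11, "carol", 4)]
    picked = throughput_candidates(queue, min_jobs_per_user=3)
    print([job.job_id for job in picked])  # [0, 1, 2]
```

Multi-node jobs are excluded outright, since those are the ones that genuinely need the supercomputer’s interconnect and scale.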

THE PILOT SOLUTION

Exostellar and TACC adopted a hybrid cloud model that extended TACC’s capabilities to AWS, providing users with an alternative to waiting in the queue. This approach enables TACC to scale resources flexibly and respond quickly to user demands.

  1. TACC users submit jobs to Frontera’s queue using Slurm as usual. For eligible jobs, Exostellar software transparently handles bursting to AWS for timely execution, and results are transferred back and synced to the user’s storage on Frontera.
  2. Exostellar sets up a compatible AWS ParallelCluster to burst to, with matching server types that auto-scale as needed.
  3. When Frontera’s queue is congested, Exostellar migrates queued jobs to the AWS cluster, where they run on provisioned compute resources managed by Exostellar’s Infrastructure Optimizer controller and workers.
  4. Workloads are automatically migrated among Infrastructure Optimizer workers to optimize for cost and performance: they run on low-cost spot servers when available and are reliably migrated to on-demand servers when spot capacity is reclaimed.
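The placement policy in the steps above (burst only when the on-premises queue is congested, prefer spot capacity, fall back to on-demand when spot is reclaimed, and return to spot when it reappears) can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not Exostellar’s API: `BurstJob`, `should_burst`, `place`, `on_spot_reclaimed`, `rebalance`, and the 4-hour wait threshold are all hypothetical, and the actual checkpoint-and-migrate step is reduced to recording the decision.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Capacity(Enum):
    SPOT = "spot"
    ON_DEMAND = "on-demand"


@dataclass
class BurstJob:
    job_id: int
    capacity: Optional[Capacity] = None
    # Every capacity type the job has run on, in order.
    history: List[Capacity] = field(default_factory=list)

    def move_to(self, capacity: Capacity) -> None:
        # A real system would checkpoint and migrate the running workload here;
        # this sketch only records the placement decision.
        self.capacity = capacity
        self.history.append(capacity)


def should_burst(est_wait_hours: float, max_wait_hours: float = 4.0) -> bool:
    """Burst only when the on-prem queue is congested (hypothetical threshold)."""
    return est_wait_hours > max_wait_hours


def place(job: BurstJob, spot_available: bool) -> None:
    """Prefer low-cost spot capacity; fall back to on-demand."""
    job.move_to(Capacity.SPOT if spot_available else Capacity.ON_DEMAND)


def on_spot_reclaimed(job: BurstJob) -> None:
    """The spot server is being taken back: keep the job running on on-demand."""
    if job.capacity is Capacity.SPOT:
        job.move_to(Capacity.ON_DEMAND)


def rebalance(job: BurstJob, spot_available: bool) -> None:
    """Move back to spot once cheaper capacity reappears."""
    if job.capacity is Capacity.ON_DEMAND and spot_available:
        job.move_to(Capacity.SPOT)


if __name__ == "__main__":
    job = BurstJob(job_id=42)
    if should_burst(est_wait_hours=12.0):
        place(job, spot_available=True)
    on_spot_reclaimed(job)
    rebalance(job, spot_available=True)
    print([c.value for c in job.history])  # ['spot', 'on-demand', 'spot']
```

Keeping the congestion check separate from the placement functions mirrors the split in the steps above between deciding *whether* to burst and deciding *where* a burst job runs.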

In summary, Exostellar’s hybrid cloud solution allows TACC and its users to reduce costs by over $1M per year compared with a typical cloud-bursting solution, by making optimized use of spot instances and the hybrid cloud. These savings are realized for single-node workloads burst to AWS, with additional savings for larger-scale workloads.

SUMMARY

The adoption of Exostellar’s Infrastructure Optimizer solution by TACC demonstrates the remarkable potential of adapting to the ever-evolving landscape of high-performance computing. It not only addresses immediate challenges but also sets a precedent for the future, where the combination of on-premises and cloud resources can empower researchers to push the boundaries of scientific discovery.

1. Cost Savings: Infrastructure Optimizer allows TACC to optimize costs by using AWS resources only when the on-premises cluster is near capacity, thereby enhancing the overall value of TACC’s HPC services.

2. Reduced Time to Market: Exostellar enabled Frontera users to run HPC applications on AWS, reducing wait times. This not only reduced time-to-market for researchers but also significantly improved overall system efficiency.

3. Scalability & Accessibility: Users gain access to the immense computational power of AWS as and when needed, with minimal disruption. This democratizes access to HPC resources, fostering greater innovation across scientific domains.

4. Global Collaboration: As researchers harness its capabilities, the hybrid cloud will enable global collaboration by connecting researchers worldwide to high-performance computing resources, with profound implications for scientific discovery.

Seeing is Believing

Schedule a demo to take a closer look at what Exostellar can do for your cloud.
