By Vikram Bhatia, Originally posted on Synopsys.com
As more designers deploy production EDA workloads on Synopsys Cloud, cloud infrastructure cost optimization has consistently been one of the most common asks. Within the Synopsys Cloud product team, we shortlisted a few areas of innovation that would help our customers achieve this goal. Among these, leveraging Spot virtual machines clearly stood out as the most effective means to drive costs down. However, as most cloud infrastructure experts would agree, running high-performance, high-memory tools on Spot is not trivial. In this article, we define the problem and discuss in detail how we arrived at a solution that addresses the constraints public cloud infrastructure imposes.
What is Spot?
Cloud providers plan for capacity of each type of compute resource several quarters in advance to ensure availability of elastic cloud scale. In practical terms, these capacity projections don’t play out perfectly and there’s always a struggle between demand for specific compute virtual machines versus actual supply. When there’s excess capacity of certain compute VMs, cloud providers put these on what is called the “Spot market” and make them available at heavily discounted prices with the caveat that these VMs may be removed on short notice. Users have leveraged discounts of up to 80% off on-demand prices in the Spot market.
Leveraging Spot for EDA Workloads: What’s the problem?
High performance workloads like EDA that can scale on cloud infrastructure need the ability to recover from a Spot VM termination signal in order to ensure that there’s no processing time lost when a job has been running for a while. The most common solution to this problem is to build checkpoint-restore functionality in the tools. Several Synopsys tools offer this capability and users have learnt to use it well for their needs over the years.
However, just having checkpoint-restore at your disposal does not enable Spot. Spot is a unique beast that adds a more stringent constraint on the deployment architecture by providing a very limited window to take a snapshot of the runtime memory state of the tool. AWS offers a two-minute warning, which in reality may be much shorter, and Microsoft Azure currently offers only a 30-second notice.
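To make the warning window concrete, here is a minimal Python sketch of how a job wrapper might work out how much time it has left once a notice arrives. It assumes the notice is a small JSON document with an "action" field and a UTC "time" field, which is the shape AWS documents for its Spot instance-action metadata. The function name and the window table are illustrative and not part of any Synopsys or cloud provider API.

```python
import json
from datetime import datetime, timezone

# Nominal notice windows in seconds, taken from the figures quoted above.
NOTICE_WINDOW_SECONDS = {"aws": 120, "azure": 30}


def seconds_until_termination(notice_json, now):
    """Return the seconds remaining before the VM is reclaimed.

    notice_json: assumed payload such as
        '{"action": "terminate", "time": "2023-06-01T12:02:00Z"}'
    now: a timezone-aware datetime in UTC.
    """
    notice = json.loads(notice_json)
    deadline = datetime.strptime(
        notice["time"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    # Clamp at zero: the notice may already be stale when it is read.
    return max(0.0, (deadline - now).total_seconds())
```

A wrapper would compare the returned value with the time the tool needs to save its state. Because the notice can be observed late, the remaining time can be well below the nominal window.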
As we all know, not all EDA tools are created equal, and tools that have a smaller memory footprint can successfully checkpoint their state within this warning window. Users who run verification jobs on VCS have successfully leveraged the tool’s inherent checkpoint-restore capabilities to run on Spot and reduced costs significantly. Similarly, for library characterization on PrimeLib, since typical distributed jobs run for only a few minutes and the runtime state has a very small footprint, customers have successfully enabled Spot instances by simply ignoring the failures and restarting those jobs.
The challenge is more pronounced when we start exploring high-memory workloads such as timing analysis, physical verification, physical design, or RTL-to-Gates implementation. The runtime state for these workloads may run into several hundred gigabytes, and the time needed to checkpoint it is much longer than the Spot warning window that cloud providers offer. As a result, jobs that are terminated while running on Spot cannot be restored, because no state was saved. This means several hours of runtime and compute usage costs can go to waste. For these workloads, just having checkpoint-restore capability is not enough to use Spot effectively.
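A back-of-the-envelope calculation shows why the window is too short for these workloads. The sketch below estimates checkpoint time as state size divided by sustained write throughput. The 1 GB/s throughput and the specific state sizes are assumptions chosen for illustration, not measurements of any Synopsys tool.

```python
def checkpoint_seconds(state_gb, write_gb_per_s):
    """Estimate the time to write the runtime state to storage."""
    if write_gb_per_s <= 0:
        raise ValueError("write throughput must be positive")
    return state_gb / write_gb_per_s


def fits_in_window(state_gb, write_gb_per_s, window_s):
    """True if the checkpoint is expected to finish before termination."""
    return checkpoint_seconds(state_gb, write_gb_per_s) <= window_s


# Assumed figures: 1 GB/s sustained write throughput.
small_job = fits_in_window(state_gb=20, write_gb_per_s=1.0, window_s=30)
large_job = fits_in_window(state_gb=300, write_gb_per_s=1.0, window_s=120)
print(small_job, large_job)
```

Under these assumptions, a 20 GB state needs about 20 seconds and fits even a 30-second notice, while a 300 GB state needs about 300 seconds, which is 2.5 times a two-minute window.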
Synopsys Solution: Leverage AI to Optimize Tool and Infrastructure Together for Spot
As each EDA tool behaves differently, we started analyzing runtime memory usage patterns for each tool independently to assess the amount of time needed to successfully save the state. We also explored multiple technical solutions that can either provide alternatives to the standard checkpoint-restore functionality or complement that capability. One unique solution was presented by Exostellar, which leverages machine learning to predict the onset of a Spot termination signal in advance. Earlier this year, Synopsys entered into a technical partnership with Exostellar to jointly develop and market an intelligent solution to this problem.
This solution is built on the concept of creating a “Virtual Machine Array” optimized for each EDA tool, which contains a mix of Spot and on-demand VMs. Based on termination signal predictions from its AI-driven algorithm, the solution migrates the running EDA workload, live, to an on-demand VM in the VM Array, thus reducing the chances of the workload being terminated. Once Spot availability eases, the running state is migrated back to a Spot VM in the array. We tested this architecture extensively on some of our most compute-intensive, high-memory workloads and are now pleased to announce the Spot Optimized Synopsys Cloud solution for EDA. Powered by Exostellar Compute Optimizer technology, this solution enables customers to save up to 75% off on-demand compute prices. This capability will be released in Q3 2023, starting with availability on AWS, followed by Azure. The first release of this solution will enable Fusion Compiler, PrimeTime, and IC Validator tools to run on Spot instances.
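The migrate-out and migrate-back behavior described above can be illustrated with a simple placement rule. This is only a conceptual sketch and is not Exostellar's algorithm: the risk score, both thresholds, and the function name are hypothetical. It assumes a predictor that yields a termination risk between 0 and 1, and it uses two thresholds so the workload does not bounce between VM types when the risk hovers near a single cutoff.

```python
def next_placement(current, predicted_risk,
                   migrate_threshold=0.5, return_threshold=0.1):
    """Choose where the workload should run next.

    current: "spot" or "on_demand"
    predicted_risk: hypothetical termination risk in [0, 1]
    The thresholds are illustrative values, not tuned parameters.
    """
    if current == "spot" and predicted_risk >= migrate_threshold:
        return "on_demand"  # move off Spot before a notice arrives
    if current == "on_demand" and predicted_risk <= return_threshold:
        return "spot"       # Spot availability has eased, move back
    return current          # otherwise stay where we are


def simulate(risks, start="spot"):
    """Apply the rule to a sequence of risk readings."""
    placements = []
    current = start
    for risk in risks:
        current = next_placement(current, risk)
        placements.append(current)
    return placements
```

For example, `simulate([0.05, 0.6, 0.3, 0.08])` keeps the workload on Spot, moves it to on-demand when the risk reaches 0.6, holds it there at 0.3, and returns it to Spot at 0.08.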
When we first launched Synopsys Cloud one year ago, one of our primary goals was to enable industry transforming technology for chip designers who want to leverage cloud. With FlexEDA, the industry’s first true pay-per-use business model, and a completely browser-based high performance computing experience, Synopsys Cloud is driving cutting edge innovation to enable our customers to focus on what they do best – design chips, faster. You can try a full-featured free evaluation of Synopsys Cloud for 30 days by signing up at https://www.synopsys.com/cloud.