A Case Study in Leveraging Azure Analytics to Optimize HPC/EDA Cloud Spending
Published Feb 17 2022

by Andy Chan, Director, Azure Global Solutions, Semiconductor/EDA/CAE

Rocky Kamen-Rubio – Data Engineer, Six Nines

Katie Singh – Director of Growth, Six Nines

Jason Cutrer – Chief Executive Officer, Six Nines

 

Abstract

 

In 2021, an MSFT partner presented SixNines (https://sixninesit.com) with an opportunity to recommend how to optimize their Azure usage by leveraging Azure analytics. Given log data from the first six months of usage, and working closely with Microsoft’s dedicated EDA team, SixNines IT ran extensive analysis to produce specific cost-saving recommendations for this client’s use case. This post outlines the methodologies, tools, analysis, and recommendations, which leverage the breadth of tools available in the Azure ecosystem to combine spot and reserved instances and match changes in demand in real time.

 

Challenges

 

Despite the limited timeframe of the initial dataset, the logs had already grown to more than 300 GB of storage, most of which was not relevant to the research questions at hand. Thus, the first step was to pare the dataset down to under 20 GB by leveraging VMs and Blob Storage with PowerBI and Python to identify redundant or non-useful rows and columns to eliminate. The remaining data could then be queried relationally. From this sampling procedure, the following plots and breakdowns were generated.
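For illustration, the paring step can be sketched in Python with chunked reads so the full 300 GB never has to sit in memory at once. The column names below are hypothetical stand-ins for the scheduler log fields:

```python
# Hypothetical sketch: stream the raw scheduler logs in manageable chunks,
# keep only the columns the analysis needs, and drop unusable/duplicate rows.
import pandas as pd

KEEP_COLS = ["job_id", "submit_time", "start_time", "end_time", "num_processors", "status"]

def pare_logs(src_path: str, dst_path: str, chunk_rows: int = 1_000_000) -> None:
    first = True
    for chunk in pd.read_csv(src_path, usecols=KEEP_COLS, chunksize=chunk_rows):
        chunk = chunk.dropna(subset=["submit_time", "start_time"])  # rows with no timing info are not useful
        chunk = chunk.drop_duplicates(subset="job_id")              # drop duplicate job records within the chunk
        chunk.to_csv(dst_path, mode="w" if first else "a", header=first, index=False)
        first = False

# pare_logs("raw_scheduler_logs.csv", "pared_logs.csv")
```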

 

Note on times: the timestamps are UNIX timestamps, which correspond to UTC (UTC+00).

Note on sampling: the resulting data was still too large to fit practically in RAM, so the plots shown are random subsamples and do not reflect the total size of the dataset.
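Both notes can be handled in a couple of lines; the dataset name and sampling fraction below are illustrative:

```python
# Convert UNIX timestamps to UTC datetimes and take a random subsample for plotting.
import pandas as pd

jobs = pd.read_csv("pared_logs.csv")                      # hypothetical pared dataset
for col in ("submit_time", "start_time", "end_time"):
    jobs[col] = pd.to_datetime(jobs[col], unit="s", utc=True)

plot_sample = jobs.sample(frac=0.05, random_state=42)     # 5% random subsample that fits in RAM
```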

 

Analysis Duration

The jobs have a median duration of 55 minutes and a mean of 2 hours 45 minutes, and 61% of jobs took under an hour. Note the long tail of high-duration jobs, which drives the mean to roughly three times the median.
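A minimal sketch of how these summary statistics can be reproduced from the pared logs (column names hypothetical):

```python
# Median/mean job duration and the share of jobs finishing within an hour.
import pandas as pd

jobs = pd.read_csv("pared_logs.csv")                      # hypothetical pared dataset
duration = pd.to_timedelta(jobs["end_time"] - jobs["start_time"], unit="s")

print("median duration:", duration.median())              # ~55 minutes in this dataset
print("mean duration:  ", duration.mean())                # ~2h45m in this dataset
print("share under 1h: ", (duration < pd.Timedelta(hours=1)).mean())  # ~0.61
```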

 

andychaneda_0-1645124438014.jpeg

 

andychaneda_1-1645124438016.jpeg

 

 

 

 

andychaneda_2-1645124438019.jpeg

 

 

Machine Time

The machine-time breakdown looks very different from the duration breakdown, and the “long tail” becomes more significant. For example, 6+ hour jobs are ~8% of submitted jobs but take over 50% of our machine-hours! Jobs taking less than 1 hour to run are about half of submissions but take under 20% of our resources. These are important considerations when allocating resources based on which jobs are running at what times, since costs are generally associated more closely with machine time than with raw job count.
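A sketch of the bucketed comparison between job counts and machine-hours (the duration buckets and column names are illustrative):

```python
# Compare each duration bucket's share of submissions with its share of machine-hours.
import pandas as pd

jobs = pd.read_csv("pared_logs.csv")                      # hypothetical pared dataset
hours = (jobs["end_time"] - jobs["start_time"]) / 3600.0  # duration in hours
machine_hours = hours * jobs["num_processors"]            # assumes machine-hours scale with processors requested

buckets = pd.cut(hours, bins=[0, 1, 3, 6, float("inf")], labels=["<1h", "1-3h", "3-6h", "6h+"])
summary = pd.DataFrame({
    "job_share": buckets.value_counts(normalize=True).sort_index(),
    "machine_hour_share": machine_hours.groupby(buckets).sum() / machine_hours.sum(),
})
print(summary)  # long jobs dominate machine-hours even though they are a small share of submissions
```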

 

andychaneda_3-1645124438023.jpeg

 

Submission Times

The data provided was limited, so it is difficult to draw meaningful conclusions. There is very high variance in submissions by day, along with some seasonal trends that could be exploited for cost savings by spinning up reserved instances before periods of high demand. Multi-year data will be necessary to confirm whether this is practical.
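A sketch of how the daily series was summarized (the smoothing window is illustrative):

```python
# Count submissions per day and smooth with a 7-day rolling mean to expose weekly/seasonal trends.
import pandas as pd

jobs = pd.read_csv("pared_logs.csv")                      # hypothetical pared dataset
submit = pd.to_datetime(jobs["submit_time"], unit="s", utc=True)

daily = submit.dt.floor("D").value_counts().sort_index()
weekly_trend = daily.rolling(window=7, min_periods=1).mean()
print(daily.describe())                                   # the std/mean ratio shows the high day-to-day variance
```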

 

andychaneda_4-1645124438026.jpeg

 

andychaneda_5-1645124438028.jpeg

 

Longer jobs tend to be submitted over the weekend and shorter ones during the week. This evens out demand, but not entirely. Encouraging more of this behavior, where possible and practical, could help, depending on the infrastructure.

 

andychaneda_6-1645124438030.jpeg

 

andychaneda_7-1645124438032.jpeg

 

andychaneda_8-1645124438034.jpeg

 

On a daily timescale, there are two main peaks around the beginning and end of the workday. These could be covered by fleet instances for additional cost savings.
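A sketch of the intraday profile used to locate those peaks:

```python
# Count submissions by hour of day (UTC) to find the start-of-day and end-of-day peaks.
import pandas as pd

jobs = pd.read_csv("pared_logs.csv")                      # hypothetical pared dataset
submit = pd.to_datetime(jobs["submit_time"], unit="s", utc=True)

by_hour = submit.dt.hour.value_counts().sort_index()
print(by_hour)                                            # the peak hours are candidates for pre-provisioned capacity
```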

 

andychaneda_9-1645124438036.jpeg

 

andychaneda_10-1645124438037.jpeg

 

andychaneda_11-1645124438040.jpeg

 

From an hourly view, submissions stay relatively constant, but large jobs tend to be submitted on the hour, which creates additional potential for savings.

 

andychaneda_12-1645124438041.jpeg

 

andychaneda_13-1645124438043.jpeg

 

andychaneda_14-1645124438046.jpeg

 

 

Job Wait Times

andychaneda_15-1645124438047.png

 

 

The wait times fit an exponential distribution (a Poisson arrival process) fairly nicely, with a median of ~30 seconds. This is consistent with new resources becoming available at random intervals, with the median interval being ~30 seconds.
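A sketch of that fit, assuming wait times are derived from the submit and start timestamps:

```python
# Fit an exponential distribution to the job wait times; for an exponential,
# the median equals scale * ln(2), so a ~30 s median implies a ~43 s mean.
import numpy as np
import pandas as pd
from scipy import stats

jobs = pd.read_csv("pared_logs.csv")                      # hypothetical pared dataset
waits = (jobs["start_time"] - jobs["submit_time"]).to_numpy(dtype=float)
waits = waits[waits >= 0]

_, scale = stats.expon.fit(waits, floc=0)                 # location fixed at zero
print("fitted mean wait (s):   ", scale)
print("implied median wait (s):", scale * np.log(2))
```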

 

Number of Processors Requested

Users overwhelmingly request one processor, with no significant correlation between job duration and the number of processors requested (r ≈ 0.03). It is unclear whether this indicates that users are not requesting the appropriate number of processors for their jobs, that more processors would not help for longer jobs, or that users are constrained by cost.
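The correlation check itself is a one-liner once durations are computed (column names hypothetical):

```python
# Pearson correlation between job duration and the number of processors requested.
import pandas as pd

jobs = pd.read_csv("pared_logs.csv")                      # hypothetical pared dataset
duration_h = (jobs["end_time"] - jobs["start_time"]) / 3600.0

r = duration_h.corr(jobs["num_processors"])               # Pearson by default
print(f"r = {r:.2f}")                                     # ~0.03 for this dataset
```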

andychaneda_16-1645124438049.jpeg

 




Solution

This client was primarily using pay-as-you-go instances as overflow beyond their reserved capacity. Spot instances provide a similar resource at up to 90% savings over pay-as-you-go (80% savings are typical). However, there are important caveats to this cost-saving change: price fluctuations are common, and jobs can fail! These are important considerations when deciding whether to adopt this solution, though their consequences can be mitigated. For example, reverting to pay-as-you-go when the price of spot instances is too high can be automated, but this requires additional overhead and engineering. Based on the above analysis, a base level of compute on reserved instances, with spot instances for additional, non-critical jobs, will yield significant savings.
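As a rough illustration of the automated reversion described above (the prices and threshold are hypothetical, and a real deployment would query current spot pricing and handle evictions):

```python
# Route non-critical jobs to spot only when the current spot price is attractive;
# otherwise fall back to pay-as-you-go. Eviction handling and retries are omitted.
PAYG_RATE = 1.00            # $/hour, hypothetical on-demand rate
SPOT_DISCOUNT_FLOOR = 0.50  # require spot to be at least 50% cheaper before using it

def choose_instance(current_spot_rate: float, job_is_critical: bool) -> str:
    if job_is_critical:
        return "reserved_or_payg"   # critical jobs should not risk eviction
    if current_spot_rate <= PAYG_RATE * (1 - SPOT_DISCOUNT_FLOOR):
        return "spot"
    return "payg"

print(choose_instance(current_spot_rate=0.20, job_is_critical=False))  # -> "spot"
```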

Below is the proposed architecture:

  • Stage 1: Migrate existing tooling into VMs and Blob Storage, using Stream Analytics and Data Explorer to automate some analysis processes, then leverage PowerBI to build an interactive user-friendly interface.

andychaneda_17-1645124438053.png

 

 

  • Stage 2: Tune/train a demand prediction ML (Machine Learning) model using HDInsight, PowerBI, and Azure Machine Learning, like this example (a minimal forecasting sketch follows this list).

andychaneda_18-1645124438057.png

 

  • Stage 3: Build a data pipeline with Data Factory, like this example, that can read from a data stream and use the above model to make predictions in real time.

andychaneda_19-1645124438062.png
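To make Stage 2 concrete, here is a minimal forecasting sketch (not the actual Azure Machine Learning pipeline): it predicts the next day's submission count from lagged daily counts and the day of week, the kind of signal a production model would refine.

```python
# Illustrative demand-prediction sketch: forecast tomorrow's submission count
# from recent history so capacity can be pre-provisioned ahead of demand.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

jobs = pd.read_csv("pared_logs.csv")                      # hypothetical pared dataset
submit = pd.to_datetime(jobs["submit_time"], unit="s", utc=True)
daily = submit.dt.floor("D").value_counts().sort_index().asfreq("D", fill_value=0)

features = pd.DataFrame({
    "lag_1": daily.shift(1),                              # yesterday's count
    "lag_7": daily.shift(7),                              # same weekday last week
    "dow": daily.index.dayofweek,                         # day of week
}).dropna()
target = daily.loc[features.index]

split = int(len(features) * 0.8)                          # simple time-based train/test split
model = GradientBoostingRegressor().fit(features[:split], target[:split])

# Build the feature row for the day after the last observed day and forecast it.
next_day = daily.index[-1] + pd.Timedelta(days=1)
next_features = pd.DataFrame({
    "lag_1": [daily.iloc[-1]],
    "lag_7": [daily.iloc[-7]],
    "dow": [next_day.dayofweek],
}, index=[next_day])
print("forecast for", next_day.date(), ":", model.predict(next_features)[0])
```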

 

Conclusion

The aforementioned metrics and analysis could be automated and integrated into a real-time optimization dashboard that provides predictive analytics on emerging trends by setting up an ETL pipeline. This could also give users visibility into the status of nodes before submitting jobs and allow them to choose the most efficient times for their jobs. Allowing users to assign priority levels to their jobs, and looking for consistent timeouts in jobs, also leaves room for cost savings. Given the significant price differential between instance types, this could theoretically lead to savings of up to 80%. In practice, 30-40% is likely a more reasonable estimate, given that there will be irregular behavior and not all jobs are good candidates for spot instances. However, the more data is made available, and the more finely tuned the predictive modeling and real-time analytics architectures become, the more savings become available.
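As a back-of-the-envelope check on that range, assume (hypothetically) that roughly half of machine-hours are non-critical enough to move to spot at the typical 80% discount:

```python
# Rough savings estimate under assumed workload shares.
spot_share = 0.5          # assumed fraction of machine-hours that can safely run on spot
spot_discount = 0.8       # typical spot discount cited in the Solution section

overall_savings = spot_share * spot_discount
print(f"estimated overall savings: {overall_savings:.0%}")   # -> 40%
```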

 
