by Andy Chan, Director, Azure Global Solutions, Semiconductor/EDA/CAE
Rocky Kamen-Rubio – Data Engineer, Six Nines
Katie Singh – Director of Growth, Six Nines
Jason Cutrer – Chief Executive Officer, Six Nines
In 2021, a Microsoft partner presented SixNines (https://sixninesit.com) with an opportunity to recommend how to optimize their Azure usage by leveraging Azure analytics. Given log data from the first six months of usage, and working closely with Microsoft’s dedicated EDA team, SixNines IT ran extensive analysis to give specific cost-saving recommendations for this client’s use case. This post outlines the methodologies, tools, analysis, and recommendations, which use the tools available in the Azure ecosystem to combine spot and reserved instances and match changes in demand dynamically in real time.
Despite the limited timeframe of the initial dataset, the logs had already amassed >300GB of data, most of which was not relevant to the research questions at hand. Thus, the first step was to pare the dataset down to <20GB by leveraging VMs and Blob Storage with PowerBI in Python to identify and eliminate redundant or non-useful rows and columns. The remaining data could then be queried relationally. From this sampling procedure, the following plots and breakdowns were generated.
Note on times: the timestamps are UNIX timestamps, so all times shown are in UTC (UTC+00:00).
Note on sampling: the resulting data was still too large to practically fit in RAM, so the plots seen are random subsamples of the data and don’t reflect the total size of the dataset.
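One way to draw a uniform random subsample from data that does not fit in RAM is reservoir sampling, which reads the rows once and keeps only a fixed number of them in memory. The source does not say which sampling method was actually used, so the sketch below is an assumption for illustration, and the name `reservoir_sample` is hypothetical.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k rows chosen uniformly at random from a stream.

    Only k rows are held in memory at any time, so the stream can be
    far larger than RAM (for example, rows read lazily from a log file).
    """
    rng = random.Random(seed)
    sample: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample
```

Because `rows` can be any iterable, the same function works on a lazily read CSV reader or a database cursor; fixing `seed` makes a plot reproducible.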
Jobs have a median duration of 55 minutes and a mean of 2 hr 45 min, and 61% of jobs took under an hour. Note the long tail of high-duration jobs, which pulls the mean up to roughly 3 times the median.
The machine-time breakdown looks very different from the duration breakdown, and the “long tail” becomes more significant. For example, 6+ hour jobs are ~8% of submitted jobs but take over 50% of our machine-hours! Jobs taking less than 1 hr to run are about half of submissions but take under 20% of our resources. These are important considerations when allocating resources based on which jobs we are processing at what times, since costs are generally tied more closely to machine time than to raw job count.
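The gap between job counts and machine time can be reproduced with a small calculation. The sketch below is illustrative only: the bucket boundaries, the function names, and the sample durations are assumptions rather than the client's data, and it treats machine time as equal to job duration (one processor per job).

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def duration_bucket(duration_hours: float) -> str:
    """Assign a job to a coarse duration bucket (boundaries are illustrative)."""
    if duration_hours < 1:
        return "<1h"
    if duration_hours < 6:
        return "1-6h"
    return "6h+"


def job_and_machine_time_shares(
    durations_hours: Iterable[float],
) -> Dict[str, Tuple[float, float]]:
    """Return {bucket: (share of jobs, share of machine-hours)}."""
    counts: Dict[str, int] = defaultdict(int)
    hours: Dict[str, float] = defaultdict(float)
    for d in durations_hours:
        b = duration_bucket(d)
        counts[b] += 1
        hours[b] += d
    total_jobs = sum(counts.values())
    total_hours = sum(hours.values())
    if total_jobs and total_hours == 0:
        raise ValueError("total machine time is zero; shares are undefined")
    return {b: (counts[b] / total_jobs, hours[b] / total_hours) for b in counts}


# Synthetic durations, chosen only to show the effect of a long tail.
synthetic = [0.5] * 50 + [2.0] * 42 + [20.0] * 8
for name, (job_share, hour_share) in job_and_machine_time_shares(synthetic).items():
    print(f"{name}: {job_share:.0%} of jobs, {hour_share:.0%} of machine-hours")
```

With multi-processor jobs, each duration would be multiplied by the number of processors before summing.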
The data provided was limited, so it’s difficult to draw meaningful conclusions. There is a very high variance in submissions by day, and some seasonal trends that could be exploited for cost savings by spinning up reserved instances before periods of high demand. Multi-year data will be necessary to confirm whether this is practical.
Longer jobs are being sent over the weekend, and shorter ones during the week. This evens out demand, but not entirely. Encouraging more of this behavior, when possible and practical, could help depending on infrastructure.
On a daily timescale, there are two main peaks around the beginning and end of the workday. These could be covered by fleet instances for additional cost savings.
From an hourly view, submissions stay relatively constant, but big jobs tend to get sent on the hour, which offers more potential for savings.
Job Wait Times
Wait times are well described by a Poisson process, with a median wait of ~30 seconds. This is consistent with new resources becoming available at random intervals, with a median interval of ~30 seconds.
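If resources free up at random, independent moments (a Poisson process with rate λ), the gap until the next one becomes available follows an exponential distribution, whose median is ln(2)/λ and whose mean is 1/λ. The sketch below simulates such waits with a 30-second median; the function name `simulate_waits` and the sample size are assumptions for illustration, not part of the original analysis.

```python
import math
import random
import statistics
from typing import List


def simulate_waits(median_seconds: float, n: int, seed: int = 0) -> List[float]:
    """Draw n exponential wait times whose theoretical median is median_seconds."""
    rate = math.log(2) / median_seconds  # events per second, from median = ln(2) / rate
    rng = random.Random(seed)
    return [rng.expovariate(rate) for _ in range(n)]


waits = simulate_waits(median_seconds=30.0, n=100_000)
print("sample median (s):", round(statistics.median(waits), 1))
print("sample mean (s):  ", round(statistics.fmean(waits), 1))
print("theoretical mean: ", round(30.0 / math.log(2), 1))
```

The mean sits noticeably above the median, which is the same long-tail signature seen in the job durations.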
Number of Processors Requested
Users overwhelmingly request 1 processor, with no significant correlation between job duration and number of processors requested (r ≈ 0.03). It is unclear whether this indicates that users aren’t good at requesting the appropriate number of processors for their jobs, that more processors wouldn’t help for longer jobs, or that users are constrained by cost.
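For reference, a correlation coefficient like the one quoted above can be computed as a Pearson r between the two columns. This is a minimal sketch, assuming Pearson correlation is the measure intended (the source only reports "r"); the function name `pearson_r` is hypothetical.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and the two root sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)
```

A value near 0 only rules out a linear relationship. When nearly every job requests 1 processor, there is also very little variation in that column to correlate against, which is worth keeping in mind when reading the result.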
This client was using primarily pay-as-you-go instances as their reserved overflow. Spot instances provide a similar resource at savings of up to 90% over pay-as-you-go (80% savings are typical). However, there are some important caveats when making this overall cost-saving change: price fluctuations are common, and jobs can fail! These are important considerations when deciding whether to leverage this solution, though their consequences can be mitigated. For example, reversion to pay-as-you-go when the price of spot instances is too high can be automated, but this requires additional overhead and engineering. Based on the above analysis, a base level of compute on reserved instances, with spot instances for additional, non-critical jobs, will yield significant savings.
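The placement rule described above (reserved capacity first, spot for non-critical overflow, and reversion to pay-as-you-go when spot is too expensive) can be sketched as a small decision function. Everything here is hypothetical: the `Job` class, the `choose_instance_type` function, and the `max_spot_ratio` threshold are illustrative names and values, and the prices would have to come from whatever pricing source the deployment uses. No Azure API is called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    critical: bool  # critical jobs should not risk spot eviction


def choose_instance_type(
    job: Job,
    spot_price: float,
    payg_price: float,
    reserved_free: int,
    max_spot_ratio: float = 0.5,  # assumed threshold, tune per workload
) -> str:
    """Pick where to run a job: 'reserved', 'spot', or 'pay-as-you-go'."""
    if reserved_free > 0:
        # Reserved capacity is already paid for, so use it first.
        return "reserved"
    if job.critical:
        # Spot instances can be evicted, so keep critical work off them.
        return "pay-as-you-go"
    if spot_price <= max_spot_ratio * payg_price:
        return "spot"
    # Spot is currently too expensive: revert to pay-as-you-go.
    return "pay-as-you-go"


print(choose_instance_type(Job("regression", critical=False), 0.20, 1.00, reserved_free=0))
```

A real scheduler would also need to handle evictions, for example by checkpointing or resubmitting interrupted jobs, which is part of the extra engineering overhead mentioned above.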
Below is a proposed architecture
The aforementioned metrics and analysis could be automated and integrated into a real-time optimization dashboard that provides predictive analytics on emerging trends by setting up an ETL pipeline. This could also give users visibility into the status of nodes before submitting jobs and allow them to choose the most efficient times for their jobs. Allowing users to assign priority levels to their jobs and looking for consistent timeouts in jobs also leaves room for cost savings. Given the significant price differential between instance types, this could theoretically lead to savings of up to 80%. In practice, 30-40% is likely a more reasonable estimate, given that there will be irregular behavior and not all jobs are good candidates for spot instances. However, the more data is made available and the more fine-tuned the predictive modeling and real-time analytics architectures, the more savings become available.
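One way to see how an 80% price differential turns into a 30-40% overall saving is a simple blended-cost calculation. This is a back-of-the-envelope sketch, not the model used in the analysis: the function name `blended_savings`, the `rerun_overhead` parameter, and the idea of expressing the estimate as a fraction of machine-hours moved to spot are all assumptions.

```python
def blended_savings(
    spot_fraction: float,
    spot_discount: float = 0.8,   # typical spot discount quoted in this post
    rerun_overhead: float = 0.0,  # assumed extra spot hours from evicted jobs
) -> float:
    """Fraction of cost saved relative to running everything pay-as-you-go."""
    if not 0.0 <= spot_fraction <= 1.0:
        raise ValueError("spot_fraction must be between 0 and 1")
    if not 0.0 <= spot_discount <= 1.0:
        raise ValueError("spot_discount must be between 0 and 1")
    if rerun_overhead < 0.0:
        raise ValueError("rerun_overhead must be non-negative")
    # Costs are normalized so that all-pay-as-you-go costs 1.0.
    spot_cost = spot_fraction * (1.0 - spot_discount) * (1.0 + rerun_overhead)
    payg_cost = 1.0 - spot_fraction
    return 1.0 - (spot_cost + payg_cost)


for fraction in (0.375, 0.5, 1.0):
    print(f"{fraction:.1%} of machine-hours on spot -> {blended_savings(fraction):.0%} saved")
```

Under these assumptions, moving between roughly three-eighths and half of machine-hours to spot gives savings in the 30-40% range, and only moving everything reaches the theoretical 80%. Any reruns after evictions reduce the figure further.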