How GitHub Actions handles CI/CD scale on short-running jobs with Ephemeral OS disk Reimage
Published May 11 2022 10:07 AM
Microsoft

Background

GitHub Actions makes it easy for customers to automate all their software workflows, with world-class CI/CD. Users can build, test, and deploy their code right from GitHub. Whether they want to build a container, deploy a web service, or automate welcoming new users to their open-source projects, there's an action for that.

 

The Challenge

GitHub Actions must always provide a clean VM for each job. To support this at scale, as of January 2022 GitHub Actions would have had to delete and recreate more than 7 million VMs per day. Many GitHub customers run very short jobs, such as updating issues or simple cron jobs, which take no more than 5 minutes to complete. It takes about 2 minutes to delete a VM, recreate it, and run custom extension scripts to get it ready for use. For a job that runs for no more than 5 minutes, spending 2 minutes or more on setup is expensive. Recreating a VM also deletes the corresponding OS caches, which hurts read IOPS. GitHub needed a reliable and efficient way to get clean VMs faster.
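To put these figures in perspective, the short Python sketch below turns them into a per-second recycle rate and a setup share of machine time. It uses only the numbers quoted above; the function names are illustrative, and the 5-minute job length is the upper bound from the text, so shorter jobs would see a larger setup share.

```python
# Back-of-the-envelope arithmetic using only figures quoted in the text.
VMS_PER_DAY = 7_000_000   # VMs to delete and recreate per day (January 2022)
SETUP_MINUTES = 2.0       # approx. time to delete, recreate, and run extension scripts
MAX_JOB_MINUTES = 5.0     # upper bound on the length of a "short" job

SECONDS_PER_DAY = 24 * 60 * 60


def recycle_rate_per_second(vms_per_day: int) -> float:
    """Average number of VMs that must be recycled every second."""
    return vms_per_day / SECONDS_PER_DAY


def setup_share(setup_minutes: float, job_minutes: float) -> float:
    """Fraction of total machine time (setup + job) spent on setup."""
    return setup_minutes / (setup_minutes + job_minutes)


if __name__ == "__main__":
    rate = recycle_rate_per_second(VMS_PER_DAY)
    share = setup_share(SETUP_MINUTES, MAX_JOB_MINUTES)
    print(f"VMs recycled per second: {rate:.1f}")
    print(f"Setup share for a 5-minute job: {share:.0%}")
```

Under these inputs, roughly 81 VMs are recycled every second, and setup accounts for about 29% of the machine time for a job that runs the full 5 minutes.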

 

The Solution: Ephemeral OS disk Reimage

Reimage can be performed on a single-instance VM or on a virtual machine scale set (VMSS) that uses Ephemeral OS disks. It is highly efficient and reliable (99.99%). Reimaging a VM or VMSS instance that uses an Ephemeral OS disk is equivalent to deleting and re-creating a VM with the same configuration. The reimage replaces the old OS disk with a new OS disk and optionally resets the temp disk contents (if this is included as a parameter), while retaining:

  1. Data disks,
  2. The current configuration of the VM, and
  3. The public IP associated with the VM
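As a rough illustration of the reimage call described above, the sketch below builds (but does not send) an Azure Resource Manager request for the Virtual Machines "Reimage" operation, with the optional temp disk reset expressed as the `tempDisk` body field. The API version, the helper name `build_vm_reimage_request`, and the sample identifiers are assumptions for illustration only; this is not GitHub's implementation, and a real call would also need an authenticated HTTP client.

```python
import json
from dataclasses import dataclass

ARM_ENDPOINT = "https://management.azure.com"
API_VERSION = "2022-03-01"  # assumption: substitute the API version you target


@dataclass(frozen=True)
class ReimageRequest:
    """A description of the HTTP request; nothing is sent over the network."""
    method: str
    url: str
    body: str


def build_vm_reimage_request(
    subscription_id: str,
    resource_group: str,
    vm_name: str,
    reset_temp_disk: bool = False,
) -> ReimageRequest:
    """Build the reimage request for a single VM (hypothetical helper)."""
    url = (
        f"{ARM_ENDPOINT}/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Compute/virtualMachines/{vm_name}"
        f"/reimage?api-version={API_VERSION}"
    )
    # tempDisk controls whether the temp disk contents are reset as well.
    body = json.dumps({"tempDisk": reset_temp_disk})
    return ReimageRequest(method="POST", url=url, body=body)


if __name__ == "__main__":
    # Placeholder identifiers, not real resources.
    request = build_vm_reimage_request(
        "00000000-0000-0000-0000-000000000000", "example-rg", "runner-vm-01",
        reset_temp_disk=True,
    )
    print(request.method, request.url)
    print(request.body)
```

Keeping request construction separate from sending makes the parameters easy to inspect and test before wiring in credentials.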

The average end-to-end time was reduced by 50% after replacing “deleting a VM, recreating it, and then running custom extension scripts” with “Ephemeral OS disk Reimage”. This also helped reduce the VM pool size.

 

“We saw our target machine count reduce by 15-20%.”

– Chad Kimes, Staff Software Engineer, GitHub
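One way to see why faster turnaround can shrink a pool is a steady-state estimate: the number of busy machines is roughly the job arrival rate multiplied by the time each machine is occupied (job time plus setup time). The sketch below applies that estimate. The arrival rate and the assumption that setup time halves from 2 minutes to 1 minute are hypothetical inputs chosen for illustration; they are not GitHub's measurements and are not meant to reproduce the figure quoted above.

```python
def machines_needed(jobs_per_minute: float, job_minutes: float, setup_minutes: float) -> float:
    """Steady-state busy machines: arrival rate x time each machine is occupied."""
    return jobs_per_minute * (job_minutes + setup_minutes)


def pool_reduction(job_minutes: float, old_setup_minutes: float, new_setup_minutes: float) -> float:
    """Fractional drop in machines needed when only the setup time changes."""
    old_occupancy = job_minutes + old_setup_minutes
    new_occupancy = job_minutes + new_setup_minutes
    return 1.0 - new_occupancy / old_occupancy


if __name__ == "__main__":
    # Hypothetical inputs: 1,000 jobs/minute, 5-minute jobs, setup 2 min -> 1 min.
    before = machines_needed(1000, job_minutes=5, setup_minutes=2)
    after = machines_needed(1000, job_minutes=5, setup_minutes=1)
    print(f"Machines before: {before:.0f}, after: {after:.0f}")
    print(f"Reduction: {pool_reduction(5, 2, 1):.1%}")
```

The reduction depends only on the ratio of occupancy times, so shorter jobs benefit proportionally more from the same setup saving.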

 

The chart below compares reimage performance (in seconds) for jobs on Ephemeral and non-Ephemeral GitHub Actions VM pools.

 

[Chart: Reimage_perf_comparison_secs.jpg]

 

“The Ephemeral pool ran more customer requests than the non-Ephemeral because of its throughput efficiency.”

– Jiange Sun, Senior Software Engineer, GitHub

 

Along with reimage being faster than deleting and re-creating a VM, GitHub has also seen the following advantages with Ephemeral VMs:

  • Faster reads, because cached parts of the OS disk should persist after reimage (if they are not dirty)
  • No leaking disks

“Apart from faster reimages, Ephemeral also saves us time setting up machines after each reimage because we no longer need to run "warm-up" scripts, where the sole intention was to forcibly populate more of the disk image on the host.”

– Chad Kimes, Staff Software Engineer, GitHub

 

Related links

 

  • Learn more about Ephemeral OS disks
  • If you are interested in trying GitHub Actions, start with the quick start guide to improve your CI/CD experience today

Please share your feedback or questions in the comments section below.

Last update: May 11 2022 10:06 AM