Bioconductor on Microsoft Azure
Co-authored by:

Nitesh Turaga - Scientist at Dana Farber/Harvard, Bioconductor Core Team

Erdal Cosgun - Sr. Data Scientist at Microsoft Biomedical Platforms and Genomics team

Vincent Carey - Professor at Harvard Medical School, Bioconductor Core Team




The Bioconductor project promotes the statistical analysis and comprehension of current and emerging high-throughput biological assays. Bioconductor is a strong proponent of open-source software, open development, and collaborative, literate, and reproducible research. As genomic data grows exponentially, researchers increasingly turn to cloud services to handle its size. Cloud computing suits genomic analyses because the volume of data varies widely with the analysis setting, and compute resources can be sized to match. The elasticity and scalability of cloud services make it easy for a small lab or a large company to take advantage of Bioconductor's open-source software and data resources.




Bioconductor Docker Images


Bioconductor produces Docker images so that users can run the latest stable versions of R and Bioconductor from either the command line or an RStudio UI. These images are built with the system libraries needed to install (and compile) over 2000 Bioconductor packages. The Docker images hosted by Bioconductor are available on the Microsoft Container Registry (MCR) and are freely available to the public under the open-source Artistic-2.0 license.



docker pull
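To illustrate the two access modes mentioned above (command line and RStudio), here is a minimal sketch. It assumes the Docker Hub image name `bioconductor/bioconductor_docker:devel` as a stand-in; substitute the MCR image reference and the tag you actually pulled. The `run`, `start_r_console`, and `start_rstudio` helpers and the `DRY_RUN` variable are hypothetical names introduced for illustration only. The RStudio port (8787), the `PASSWORD` variable, and the `rstudio` user follow the convention of rocker-based RStudio images, which these images are assumed to share.

```shell
#!/bin/sh
# Assumption: Docker Hub name used as a placeholder for the MCR reference.
IMAGE="${IMAGE:-bioconductor/bioconductor_docker:devel}"
# With DRY_RUN=1 (the default here) commands are only printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

# Print a command, then execute it unless DRY_RUN is 1.
run() {
  echo "$*"
  [ "$DRY_RUN" = "1" ] || "$@"
}

# Interactive R session on the command line, as the non-root rstudio user.
start_r_console() {
  run docker run -it --rm --user rstudio "$IMAGE" R
}

# RStudio Server in the browser at http://localhost:8787
# (log in as user "rstudio" with the password passed as $1).
start_rstudio() {
  run docker run --rm -e PASSWORD="$1" -p 8787:8787 "$IMAGE"
}

start_r_console
start_rstudio "choose-a-password"
```

Running the script as written only prints the two commands; setting `DRY_RUN=0` executes them, which requires a local Docker installation and network access to the registry.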



These images can be used with Azure Container Instances (ACI) by following the available launch instructions. An added benefit of these Docker images is the availability of pre-compiled Bioconductor package binaries. These binaries speed up package installation on the Docker image and give users an efficient research computing environment in which exploratory data analysis is faster. Because this feature is still in development, it is currently available on a branch of the CRAN package BiocManager that will be merged soon. An example of installing binary packages is given below:




pkgs <- c('BiocParallel', 'rsbml', 'rhdf5')
BiocManager::install(pkgs)  # assumes a BiocManager build with binary support, as described above




Bioconductor Hubs - AnnotationHub and ExperimentHub data


Bioconductor distributes its annotation and experiment hub data through Azure Storage containers.


The Bioconductor AnnotationHub resource provides a central location where genomic files (e.g., VCF, bed, wig) and other resources from standard locations (e.g., UCSC, Ensembl) can be discovered. The resource includes metadata about each resource, e.g., a textual description, tags, and date of modification.


The Bioconductor ExperimentHub provides a central location where curated data from experiments, publications, or training courses can be accessed. Each resource has associated metadata, tags, and a date of modification. As of this post (1/27/2022), about 2.5 TB of data has been distributed to scientists around the world in support of genomics research. Usage stats of Bioconductor Hubs between Dec 27th, 2021 and Jan 27th, 2022:



Jupyter Notebooks and Virtual Machines for Bioconductor on Microsoft Azure

The Genomics Data Lake provides various public datasets that you can access for free and integrate into your genomics analysis workflows and applications. The datasets include genome sequences, variant information, and subject/sample metadata in BAM, FASTA, VCF, and CSV file formats. The Genomics Data Lake is hosted in the West US 2 and West Central US Azure regions, so allocating compute resources in one of those regions is recommended for affinity. The Bioconductor Annotation and Experiment Hub data will be available on the Microsoft Genomics Data Lake in mid-February 2022.


Researchers can use the sample Jupyter notebooks from this repo to download data from the Genomics Data Lake into their Bioconductor projects. Another way to use Bioconductor packages on Azure is through the pre-built Genomics Data Science VMs. Windows or Linux VMs can be deployed quickly from this link.


