Healthcare and Life Sciences Blog

Introducing the Open Targets Dataset: Now Available on Genomics Data Lake on Azure

Mamta-Giri

Microsoft

Oct 28, 2024

Introduction

Biomedical research is accelerating at an unprecedented pace, driven by the vast amounts of data generated from genetic studies, drug development, and disease research. Today, we would like to announce that critical datasets from Open Targets are now available on Azure Genomics Open Data Lake. This data can be seamlessly integrated into your research workflows, providing a rich resource for exploring gene-disease associations, drug targets, and biomedical mechanisms.

With Azure’s cloud-based solutions, these datasets are not only easier to access, but they can also be combined with machine learning, analytics, and AI-powered tools to drive deeper insights and foster innovation in areas like drug discovery, personalized medicine, and genomics.

Dataset Overview

The Open Targets consortium is a collaborative public-private research partnership that aims to systematically identify and prioritize drug targets. Its flagship informatics platform integrates genetic and molecular evidence associating targets with diseases, and includes extensive data on the genetic basis of diseases, drugs, and the identification of potential therapeutic targets.

Open Targets provides data that integrates genetics, genomics, and drug information, enabling researchers to identify and prioritize drug targets for complex diseases. By offering insights into gene-disease associations, molecular interactions, and drug mechanisms, it supports drug discovery and development. This dataset enhances the understanding of the genetic basis of diseases and the effects of drugs, fostering better therapeutic strategies and precision medicine. It is widely used for target validation, drug repositioning, and understanding adverse drug events.

Key Datasets

This dataset offers access to 25 different data files in JSON and other file formats, which can be integrated into your analysis workflows. These datasets fall into the following categories:

Drug data: Mechanism of action, indications, pharmacovigilance, and pharmacogenetics.
Target-disease associations: Curated data linking specific genes to diseases, helping researchers better understand disease pathways and mechanisms.
Target, disease, and drug annotations: Core annotations for molecular targets, diseases, and drugs.
Molecular interactions: Target interactions and supporting evidence.
Expression and phenotypes: Baseline expression, animal model phenotypes, and Gene Ontology.
Pathways and essentiality: Reactome pathways and DepMap essentiality for targets.

Significance in research and drug development:

This rich dataset opens numerous opportunities for researchers in a variety of fields. Below are a few use cases:

Identification of Drug Targets for Alzheimer's Disease

Publication: "Genome-wide association study identifies new loci and functional pathways influencing Alzheimer's disease risk" by Kunkle, B.W. et al. (2019) in Nature Genetics.

Summary: The researchers used the Open Targets dataset to integrate genetic association data with functional genomics, which helped them prioritize genes and pathways linked to Alzheimer's disease. This approach led to the identification of potential therapeutic targets that could be further investigated for developing treatments for Alzheimer's disease.

Understanding Genetic Basis of Inflammatory Bowel Disease (IBD)

Publication: "Genetic risk factors for inflammatory bowel disease" by de Lange, K.M. et al. (2017) in Nature Genetics.

Summary: This study leveraged the Open Targets dataset to identify and prioritize genetic variants associated with IBD. By linking genetic associations to specific genes and pathways, the researchers gained valuable insights into the mechanisms underlying the disease, which could inform the development of new therapeutic strategies.

Drug Repurposing for COVID-19

Publication: "Drug repurposing for COVID-19: a systematic review" by Zhou, Y. et al. (2020) in Nature Reviews Drug Discovery.

Summary: The researchers used the Open Targets dataset to analyze and prioritize drug targets for COVID-19. This analysis helped them identify existing drugs that could be repurposed for treating COVID-19, providing a list of potential candidates for clinical trials.

Availability

How to Access the Dataset on Azure

**Please note**

We are enabling public access to all Genomics Data Lake containers. The existing “signed URLs” (shared access signatures) will be retired on 2024-11-04T00:00:00Z. After this time, URLs without a query string will continue to work, but the “signed URLs” will stop working and will return a 403 HTTP status code. Please plan to use the public URLs without a query string after this date (remove the ‘?’ and all characters that follow it).
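The conversion described in the note can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `to_public_url` is hypothetical, and the token values in the example are placeholders rather than a real signature.

```python
from urllib.parse import urlsplit, urlunsplit


def to_public_url(signed_url: str) -> str:
    """Return the URL with its query string (the SAS token) and fragment removed."""
    parts = urlsplit(signed_url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


# Placeholder token values, not a real shared access signature.
signed = (
    "https://datasetopentargets.blob.core.windows.net/dataset/"
    "17.02/17.02_association_data.json.gz?sv=PLACEHOLDER&sig=PLACEHOLDER"
)
public = to_public_url(signed)
print(public)
```

Passing a URL that has no query string returns it unchanged, so the helper can be applied to a mixed list of old and new links.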

Accessing this dataset on Azure is straightforward and can be integrated into a variety of Azure services for analysis and visualization. Here’s how you can get started:

Using AzCopy

Prerequisites:

AzCopy must be installed on your machine. Download AzCopy here.

Steps:

Get the SAS URL of the blob container or file you want to download. The URL can be found here.
Open your command line (e.g., Command Prompt, Terminal, or PowerShell).
Run the following command to download data from the Azure Blob storage:

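azcopy copy "https://datasetopentargets.blob.core.windows.net/dataset/17.02/17.02_association_data.json.gz" "C:\Users\YourUser\Downloads"

This will copy the blob “17.02_association_data.json.gz” to your Downloads directory. The destination is written without a trailing backslash because some Windows shells treat a backslash directly before the closing quote as an escape character.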
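Once the copy completes, it can be useful to look at a few records before loading the whole file into an analysis workflow. The sketch below assumes the file is gzip-compressed with one JSON object per line; that layout is an assumption and should be checked against the actual file. The function name `peek_records` is hypothetical.

```python
import gzip
import json
from itertools import islice


def peek_records(path, limit=3):
    """Return the first `limit` JSON records from a gzip-compressed, line-delimited file.

    Assumes one JSON object per line; blank lines are skipped.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        non_blank = (line for line in handle if line.strip())
        return [json.loads(line) for line in islice(non_blank, limit)]


# Example usage (the path is illustrative):
# for record in peek_records(r"C:\Users\YourUser\Downloads\17.02_association_data.json.gz"):
#     print(sorted(record))
```

Reading line by line keeps memory use small, since only the requested records are decompressed and parsed.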

Using Python SDK

# Install the Azure Storage Blob library for Python (run this in a shell, not inside the script):
# pip install azure-storage-blob

# Import the necessary libraries
from azure.storage.blob import BlobClient

# Download the blob by specifying the SAS URL and the local file path.
# Per the note above, after the signed URLs are retired, remove the "?" and everything after it from this URL.
sas_url = "https://datasetopentargets.blob.core.windows.net/dataset/17.02/17.02_association_data.json.gz?sv=2023-01-03&st=2024-10-24T21%3A20%3A22Z&se=2026-10-25T21%3A20%3A00Z&sr=c&sp=rl&sig=9EI4PbUvTkT%2F0jUCg5aNLP5CBlu1bUDsyK6TDFzZacw%3D"

local_path = "path/to/save/file"

# Create BlobClient
blob_client = BlobClient.from_blob_url(sas_url)

# Download the blob content to a local file
with open(local_path, "wb") as download_file:
    download_stream = blob_client.download_blob()
    download_file.write(download_stream.readall())

Run the script. The blob will be downloaded and saved to the specified location.
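The script above reads the entire blob into memory before writing it. For larger files, and for the public URLs that need no credentials, a streaming download with only the Python standard library may be enough. The following is a sketch under that assumption; the function name `download_public_file` and the chunk size are illustrative choices, not part of any Azure API.

```python
import shutil
import urllib.request


def download_public_file(url, local_path, chunk_size=1024 * 1024):
    """Stream `url` to `local_path` in chunks instead of holding it all in memory."""
    with urllib.request.urlopen(url) as response, open(local_path, "wb") as out_file:
        shutil.copyfileobj(response, out_file, chunk_size)
    return local_path


# Example usage (requires network access; the URL is the public form without a query string):
# download_public_file(
#     "https://datasetopentargets.blob.core.windows.net/dataset/17.02/17.02_association_data.json.gz",
#     "17.02_association_data.json.gz",
# )
```

This approach only works for anonymously readable URLs; anything that needs authentication is better handled through the SDK.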

Using Azure Storage Explorer

Prerequisites:

Download and install Azure Storage Explorer.

Steps:

Open Azure Storage Explorer.
Connect to your Azure account by clicking "Add an Account", or use the "Connect to Azure Storage Container" option:
- Choose "Use a shared access signature (SAS) URI" and paste the SAS URL for your blob container, OR
- Choose “Anonymously (my blob container allows public access)” (after 11/19/2024, since public access will be enabled on all datasets).
Navigate to the Blob Container in the left-hand panel where your data is stored.
Right-click on the blob or folder you want to download and select "Download".
Select the destination folder on your local machine.
The blob data will be downloaded to the specified location.
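Browsing the container can also be scripted. The sketch below builds a request for the Blob service "List Blobs" REST operation and parses the XML it returns, using only the standard library. It assumes the container permits anonymous listing, which may not hold for every container; the function names are hypothetical, and pagination through `NextMarker` is deliberately left out.

```python
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

CONTAINER_URL = "https://datasetopentargets.blob.core.windows.net/dataset"


def build_list_url(container_url, prefix=""):
    """Build a List Blobs request URL, optionally restricted to a name prefix."""
    params = {"restype": "container", "comp": "list"}
    if prefix:
        params["prefix"] = prefix
    return f"{container_url}?{urlencode(params)}"


def parse_blob_names(xml_body):
    """Extract blob names from a List Blobs XML response body (bytes or str)."""
    root = ET.fromstring(xml_body)
    return [name.text for name in root.findall("./Blobs/Blob/Name")]


def list_blob_names(container_url=CONTAINER_URL, prefix=""):
    """Fetch the first page of blob names; later pages (NextMarker) are not followed."""
    with urllib.request.urlopen(build_list_url(container_url, prefix)) as response:
        return parse_blob_names(response.read())


# Example usage (requires network access and anonymous list permission):
# for name in list_blob_names(prefix="17.02/"):
#     print(name)
```

If the request returns an authorization error, the container likely allows anonymous reads of individual blobs but not listing, and Storage Explorer or the SDK with credentials is the safer route.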

We encourage researchers to explore the Open Targets dataset to accelerate breakthroughs in target prioritisation!

Acknowledgements:

We would like to thank Annalisa Buniello, Manuel Bernal Llinares, Roberto LLeras and Matt Mcloughlin for helping us make the data available on Azure, and Helena Cornu for help with the blog.

Updated Oct 28, 2024
