Genomics
Enhancing Genomics Annotation with GraphRAG
Introduction

The intersection of generative AI and genomics is rapidly reshaping how researchers understand and annotate complex biological data. Among the emerging techniques, GraphRAG stands out by integrating structured knowledge graphs with large language models to enhance contextual reasoning and data retrieval. In genomics annotation, where relationships between genes, proteins, and phenotypes are intricate and deeply interlinked, GraphRAG offers a novel approach to navigating this complexity with greater precision and interpretability. This blog is a follow-up to our previous paper and explores how GraphRAG can be leveraged to accelerate and improve the annotation of genomic sequences. The GraphRAG quick-start Jupyter notebook is available at this link, and a sample Jupyter notebook for reproducing the content of this blog is available at this link.

Sample ClinVar variant record:

chr1:1523548,na,ATAD3A,not_provided, "GRCh38_chr:chr1 GRCh38_pos:1523548 reference_allele:T alternative_allele:C dbSNP_ID:na Variation_ID:1704755 Allele_ID:1699287 canonical_SPDI:NC_000001.11:g.1523548T>C molecular_consequence:SO:0001583|missense_variant germline_review:Uncertain_significance germline_status:criteria_provided,_single_submitter Gene:ATAD3A Condition:not_provided source:clinvar clinvar_URL:https://www.ncbi.nlm.nih.gov/clinvar/variation/1704755/"

Compute environment:
Azure ML studio VM: Standard_DS15_v2 (20 cores, 140 GB RAM, 280 GB disk)
GraphRAG indexing time: 72 milliseconds per variant record
Query method: local (GraphRAG supports four different query methods; for more information, see Overview - GraphRAG)

Model information:
IMPORTANT: Update the 'settings.yaml' file in your GraphRAG project with your Azure OpenAI Service REST API information.

default_embedding_model:
  type: azure_openai_embedding
  api_base: https://XXX.openai.azure.com
  api_version: 2025-01-01-preview
  auth_type: azure_managed_identity
  model: text-embedding-3-small
  deployment_name: text-embedding-3-small

Indexing command:
!graphrag index --root ./genomicsragtest

Sample indexing process:

Results

In this blog, we indexed all variants from the ClinVar VCF file with GraphRAG. Here are sample query results from 'Baseline RAG (GPT-4o, from our previous study)' vs 'GraphRAG'.

Sample query:
!graphrag query --root ./genomicsragtest --method local --query "Annotate chr1:5863337"

A comparison between baseline RAG and GraphRAG highlights that GraphRAG produces more structured outputs and is highly sensitive to query phrasing, enabling more precise and context-aware responses (Table 1).

Visualizing and Debugging Your Knowledge Graph

The GraphRAG developer team recommends using Gephi for intuitive and scalable visualization of the resulting knowledge graphs. Please review the step-by-step guide, which walks through the process of visualizing a knowledge graph after it has been constructed by GraphRAG.

Conclusion

As the volume and complexity of genomic data continue to grow, traditional annotation pipelines face limitations in scalability and contextual understanding. GraphRAG presents a compelling solution, bridging the structured world of biological ontologies with the flexible reasoning capabilities of AI models. By harnessing graph-based retrieval, it enhances the relevance and accuracy of annotations, opening doors to deeper insights and faster discoveries. The future of genomics may well lie in this symbiotic relationship between knowledge graphs and AI models. Researchers can transform bioinformatics tools from data-heavy to insight-rich applications.
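For readers who want to reproduce the indexing step, here is a minimal sketch, using only the Python standard library, of one way to turn raw ClinVar VCF lines into text records like the sample shown above and place them in the GraphRAG input folder. The file names, output layout, and choice of INFO fields (GENEINFO, CLNDN, CLNSIG) are illustrative assumptions rather than the exact preprocessing used in this study.

```python
from pathlib import Path

# Hypothetical file locations; adjust to your own GraphRAG project layout.
CLINVAR_VCF = Path("clinvar_GRCh38.vcf")      # assumed name of the downloaded ClinVar VCF
INPUT_DIR = Path("genomicsragtest/input")     # GraphRAG indexes plain-text files from ./input
INPUT_DIR.mkdir(parents=True, exist_ok=True)

def vcf_line_to_record(line: str) -> str:
    """Turn one ClinVar VCF data line into a text record resembling the sample above."""
    chrom, pos, var_id, ref, alt, _qual, _filt, info = line.rstrip("\n").split("\t")[:8]
    info_map = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
    gene = info_map.get("GENEINFO", "na").split(":")[0]          # e.g. "ATAD3A:55210" -> "ATAD3A"
    condition = info_map.get("CLNDN", "not_provided")
    significance = info_map.get("CLNSIG", "Uncertain_significance")
    return (
        f"GRCh38_chr:chr{chrom} GRCh38_pos:{pos} "
        f"reference_allele:{ref} alternative_allele:{alt} "
        f"Variation_ID:{var_id} germline_review:{significance} "
        f"Gene:{gene} Condition:{condition} source:clinvar "
        f"clinvar_URL:https://www.ncbi.nlm.nih.gov/clinvar/variation/{var_id}/"
    )

with CLINVAR_VCF.open() as vcf, (INPUT_DIR / "clinvar_records.txt").open("w") as out:
    for line in vcf:
        if line.startswith("#"):   # skip VCF header lines
            continue
        out.write(vcf_line_to_record(line) + "\n")

# Afterwards, index the prepared records: graphrag index --root ./genomicsragtest
```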
Acknowledgments

Special thanks to Jesus Aguilar for initiating this work and setting the foundation. I also want to thank Jonathan Larson, who not only provided valuable feedback but also served as the GraphRAG project lead, guiding the direction of this effort.

Notices

This blog is for research and informational purposes only. It is not intended for clinical use. Please note that AI-generated outputs may contain inaccuracies or misleading information.
Introducing the Open Targets Dataset: Now Available on Genomics Data Lake on Azure

Access the Open Targets dataset on Azure to accelerate biomedical research. This open-access resource connects genetic, biological, and clinical data, helping researchers identify and prioritize drug targets efficiently.
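As a hedged illustration of how such a dataset might be pulled into an analysis session, the snippet below downloads one Parquet partition and loads it with pandas. The storage account, container, and file path are placeholders rather than the dataset's actual URLs; consult the Genomics Data Lake documentation for the real paths and the current Open Targets release layout.

```python
import urllib.request

import pandas as pd  # reading Parquet also requires pyarrow (or fastparquet)

# Placeholder URL: substitute the real storage account, container, release, and
# file path from the Open Targets / Genomics Data Lake documentation.
blob_url = (
    "https://<genomics-data-lake-account>.blob.core.windows.net/"
    "<open-targets-container>/<release>/targets/part-00000.parquet"
)

local_path, _ = urllib.request.urlretrieve(blob_url, "targets_part.parquet")
targets = pd.read_parquet(local_path)

# Quick look at the drug-target records that were retrieved.
print(targets.columns.tolist())
print(targets.head())
```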
Introducing Scalable and Enterprise-Grade Genomics Workflows in Azure ML

Genomics workflows are essential in bioinformatics, as they help researchers analyse and interpret vast amounts of genomic data. However, creating a consistent and repeatable environment with specialized software and complex dependencies can be challenging, and this also makes integration with CI/CD tools difficult. Azure Machine Learning (Azure ML) is a cloud-based platform that provides a comprehensive set of tools and services for developing, deploying, and managing machine learning models. Azure ML natively offers repeatability and auditability features that few workflow solutions provide. It gives workflows a highly integrated and standardised execution environment, ensuring that each step runs in a consistent and reproducible manner. This is particularly useful for genomics workflows that require multiple tools and software packages at specific versions with specific dependencies. In this blog post, we show how Azure ML can run genomics workflows efficiently and effectively, in addition to being an end-to-end platform for machine learning model training and deployment. Figure 1 illustrates an example of such a workflow.

Figure 1: A sample genomics workflow running in Azure ML, consisting of 3 steps. A reference genome input dataset flows into the indexer step, while the sequence quality step gets its data from a folder of DNA sequences (".fastq" files).

Azure ML has comprehensive audit and logging capabilities that track and record every step of the workflow, ensuring traceability and repeatability. One of the critical features behind these capabilities is its support for user-defined environments (https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-environments-v2?tabs=cli#create-an-environment) for each workflow step, which guarantees consistent execution. These environments can be versioned and centrally shared, and workflow steps within pipelines can then refer to a particular environment. Figure 2 shows one such environment, bwa, version "5". If we modify the environment definition, the new version will be registered as "6", but older versions remain available for use.

Figure 2: An example Azure ML environment, defining a Docker image containing the BWA bioinformatics software package. This is the 5th version of this environment registered under the name "bwa".

Like environments, Azure ML supports user-created pipeline components (https://learn.microsoft.com/en-us/azure/machine-learning/concept-component) that can be centrally registered for reuse in other pipelines; they are also versioned and carry an audit log of their usage. Runs are logged automatically, together with the standard output and error streams generated by the underlying processes. MLflow tracking (https://learn.microsoft.com/en-gb/azure/machine-learning/how-to-use-mlflow-configure-tracking?tabs=cli%2Cmlflow) and custom tags on all assets and runs are supported, too. This ensures that results are consistent and reproducible, saving users' time. An example versioned component is shown in Figure 3.

Figure 3: An Azure ML component named "BWA Indexer". It is a self-contained, re-usable, versioned piece of code that performs one step in a machine learning pipeline: running the bwa indexer command, in this instance.
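To make the environment and component concepts above concrete, here is a minimal sketch using the Azure ML Python SDK v2 (azure-ai-ml) of how an environment like the "bwa" one in Figure 2 and a component like the "BWA Indexer" in Figure 3 might be registered. The workspace identifiers, the Docker image name, and the exact command string are illustrative assumptions; the blog's actual definitions are in the GitHub repository linked at the end of this post.

```python
from azure.ai.ml import MLClient, Input, Output, command
from azure.ai.ml.entities import Environment
from azure.identity import DefaultAzureCredential

# Placeholder workspace coordinates; fill in your own subscription, resource group, and workspace.
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# A named, versioned environment wrapping a BWA container image (image name is assumed).
bwa_env = Environment(
    name="bwa",
    description="Docker image containing the BWA bioinformatics software package",
    image="myregistry.azurecr.io/bioinformatics/bwa:0.7.17",
)
registered_env = ml_client.environments.create_or_update(bwa_env)  # a new version is assigned automatically

# A reusable command component that indexes a reference genome with bwa index.
bwa_indexer = command(
    name="bwa_indexer",
    display_name="BWA Indexer",
    environment=f"bwa:{registered_env.version}",
    inputs={"reference_genome": Input(type="uri_file")},
    outputs={"index_dir": Output(type="uri_folder")},
    command=(
        "cp ${{inputs.reference_genome}} ${{outputs.index_dir}}/ref.fa "
        "&& bwa index ${{outputs.index_dir}}/ref.fa"
    ),
)
ml_client.components.create_or_update(bwa_indexer.component)  # registered and versioned for reuse
```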
Versioning is not limited to environments and pipeline components. Another essential feature of Azure ML is its support for versioning all input and genomic data assets (https://learn.microsoft.com/en-gb/azure/machine-learning/concept-data), including overall pipeline inputs and, if needed, intermediate step and final outputs. This enables users to keep track of dataset changes and ensure that the same version is used consistently across different runs of a workflow, or in other workflows.

Many genomics workflow engines are very good at processing files in parallel. Azure ML parallel jobs (https://learn.microsoft.com/en-us/azure/machine-learning/reference-yaml-job-parallel) support parallel execution both at the file level (one by one, or three files at a time, etc.) and at the file-chunk level (50 MB of data per process, or 20 KB of text per node, etc.) where the consuming application supports it, enabling large genomic datasets to be processed efficiently across elastic compute clusters that can auto-scale. Pipelines can also run on a local computer (https://learn.microsoft.com/en-us/azure/machine-learning/v1/how-to-attach-compute-targets#local-computer) during test and development, and of course support powerful CPU- and GPU-based VMs, low-priority (https://learn.microsoft.com/en-gb/azure/machine-learning/how-to-use-low-priority-batch?tabs=cli) or on-demand compute clusters, Spark jobs (https://learn.microsoft.com/en-gb/azure/machine-learning/quickstart-spark-jobs?tabs=cli), and other compute targets such as Kubernetes clusters (https://learn.microsoft.com/en-gb/azure/machine-learning/how-to-attach-kubernetes-anywhere), making the platform flexible for different use cases.

Azure ML integrates with Azure DevOps (https://learn.microsoft.com/en-us/azure/machine-learning/how-to-devops-machine-learning) and GitHub Actions (https://learn.microsoft.com/en-us/azure/machine-learning/how-to-github-actions-machine-learning?tabs=userlevel) for CI/CD, making it easy to deploy and manage genomics workflows in a production environment, which in turn makes GenomicsOps possible. Well-established pipelines that are ready for production use can be published and called on demand or from other Azure services, including Azure Data Factory (https://learn.microsoft.com/en-us/azure/data-factory/transform-data-machine-learning-service). This means pipelines can be scheduled to run automatically, or triggered whenever data becomes available. Thanks to its Python SDK, command-line utility (https://learn.microsoft.com/en-us/azure/machine-learning/how-to-configure-cli?tabs=public), REST API, and user-friendly UI, pipelines can be developed, launched, monitored, and managed from whichever interface is preferred. Event-based triggers and notifications are also supported; for instance, an email alert can be set up to fire whenever a genomics pipeline finishes execution. As compute and storage are decoupled, any pipeline input or output stored in an Azure ML datastore or blob storage can also be accessed from Azure ML notebooks (https://learn.microsoft.com/en-gb/azure/machine-learning/quickstart-run-notebooks) for any upstream or downstream analysis.
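Continuing the same hedged SDK sketch, the fragment below shows how registered components could be wired into a pipeline resembling Figure 1, fed with versioned data assets, and submitted to an auto-scaling compute cluster. The data-asset names, the compute cluster name, and the "sequence_quality" component (with its "fastq_dir" input and "report" output) are assumptions for illustration.

```python
from azure.ai.ml import MLClient, Input, dsl
from azure.identity import DefaultAzureCredential

# Placeholder workspace coordinates (same assumptions as the previous sketch).
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# Fetch previously registered, versioned components; names and versions are assumed.
bwa_indexer = ml_client.components.get(name="bwa_indexer", version="1")
sequence_quality = ml_client.components.get(name="sequence_quality", version="1")  # hypothetical QC component

@dsl.pipeline(compute="cpu-cluster", description="Sketch of the Figure 1 genomics workflow")
def genomics_pipeline(reference_genome, fastq_folder):
    index_job = bwa_indexer(reference_genome=reference_genome)
    qc_job = sequence_quality(fastq_dir=fastq_folder)
    return {"bwa_index": index_job.outputs.index_dir, "qc_report": qc_job.outputs.report}

# Versioned data assets are referenced as name:version, so every run records exactly
# which reference genome and which FASTQ folder it consumed.
pipeline_job = genomics_pipeline(
    reference_genome=Input(type="uri_file", path="azureml:reference_genome:1"),
    fastq_folder=Input(type="uri_folder", path="azureml:fastq_samples:1"),
)
submitted = ml_client.jobs.create_or_update(pipeline_job, experiment_name="genomics-workflow")
print(submitted.studio_url)  # follow the run in Azure ML studio
```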
Azure ML is a managed PaaS service, making it an accessible and easy-to-set-up platform for genomics researchers and bioinformaticians. Additionally, it has Visual Studio Code integration (https://learn.microsoft.com/en-gb/azure/machine-learning/how-to-setup-vs-code) for local development and a workspace concept (https://learn.microsoft.com/en-gb/azure/machine-learning/concept-workspace) for managing pipeline projects, enabling collaboration and Azure role-based access control (RBAC). In conclusion, Azure ML comes with advanced security features, including Azure AD authentication, public and private endpoints, subscription-based event triggers, storage backed by the Azure Storage service with encryption at rest and in transit, and Application Insights, making it a reliable and proven enterprise platform that can also be used natively for genomics research. For a more detailed tutorial that shows how to set up and run the example workflow shown in Figure 1, as well as the full source code for creating the sample environments and components described above, please check out this GitHub repository: https://github.com/truehand/azureml-genomics