genomics

27 Topics

Enhancing Genomics Annotation with GraphRAG
Introduction

The intersection of generative AI and genomics is rapidly reshaping how researchers understand and annotate complex biological data. Among the emerging techniques, GraphRAG stands out by integrating structured knowledge graphs with large language models to enhance contextual reasoning and data retrieval. In genomics annotation, where relationships between genes, proteins, and phenotypes are intricate and deeply interlinked, GraphRAG offers a novel approach to navigate this complexity with greater precision and interpretability.

This blog is a follow-up study to our previous paper and explores how GraphRAG can be leveraged to accelerate and improve the annotation of genomic sequences. Please find the GraphRAG quick start Jupyter notebook at this link. You can also find the sample Jupyter notebook for reproducing the content of this blog at this link.

Sample ClinVar variant record:

    chr1:1523548,na,ATAD3A,not_provided, "GRCh38_chr:chr1 GRCh38_pos:1523548 reference_allele:T alternative_allele:C dbSNP_ID:na Variation_ID:1704755 Allele_ID:1699287 canonical_SPDI:NC_000001.11:g.1523548T>C molecular_consequence:SO:0001583|missense_variant germline_review:Uncertain_significance germline_status:criteria_provided,_single_submitter Gene:ATAD3A Condition:not_provided source:clinvar clinvar_URL:https://www.ncbi.nlm.nih.gov/clinvar/variation/1704755/"

Compute environment: Azure ML Studio
VM: Standard_DS15_v2 (20 cores, 140 GB RAM, 280 GB disk)
GraphRAG indexing time: 72 milliseconds per variant record
Query method: local. GraphRAG supports four query methods; for more information, please visit: Overview - GraphRAG

Model Information

IMPORTANT: Please update the 'settings.yaml' file in your GraphRAG project with your Azure OpenAI Service REST API information.

    default_embedding_model:
      type: azure_openai_embedding
      api_base: https://XXX.openai.azure.com
      api_version: 2025-01-01-preview
      auth_type: azure_managed_identity
      model: text-embedding-3-small
      deployment_name: text-embedding-3-small

Indexing command:

    !graphrag index --root ./genomicsragtest

Sample indexing process: [figure]

Results

In this blog, we indexed all variants from the ClinVar VCF file with GraphRAG. Here are sample query results from 'Baseline RAG (GPT-4o, from our previous study)' vs 'GraphRAG'.

Sample query:

    !graphrag query --root ./genomicsragtest --method local --query "Annotate chr1:5863337"

The comparison between baseline RAG and GraphRAG (Table 1) shows that GraphRAG produces more structured outputs and is highly sensitive to query phrasing, which enables more precise and context-aware responses.

Visualizing and Debugging Your Knowledge Graph

The GraphRAG developer team recommends using Gephi for intuitive and scalable visualization of the resulting knowledge graphs. Please review the step-by-step guide, which walks through how to visualize a knowledge graph after GraphRAG has constructed it.

Conclusion

As the volume and complexity of genomic data continue to grow, traditional annotation pipelines face limitations in scalability and contextual understanding. GraphRAG presents a compelling solution, bridging the structured world of biological ontologies with the flexible reasoning capabilities of AI models. By harnessing graph-based retrieval, it enhances the relevance and accuracy of annotations, opening doors to deeper insights and faster discoveries. The future of genomics may well lie in this symbiotic relationship between knowledge graphs and AI models. Researchers can transform bioinformatics tools from data-heavy to insight-rich applications.

Acknowledgments

Special thanks to Jesus Aguilar for initiating this work and setting the foundation. I also want to thank Jonathan Larson, who not only provided valuable feedback but also served as a GraphRAG project lead, guiding the direction of this effort.

Notices

This blog is for research and informational purposes only. It is not intended for clinical use. Please note that AI-generated outputs may contain inaccuracies or misleading information.
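To make the record format and query command from the post a little more concrete, here is a minimal Python sketch. It is not taken from the author's notebook. The helper names (`parse_variant_record`, `build_query_command`, `estimate_indexing_seconds`) are hypothetical, and the sketch assumes the free-text part of each record is a space-separated list of `key:value` tokens, as in the sample above. The time estimate also assumes that the reported 72 ms per record scales linearly, which may not hold in practice.

```python
import shlex

# Free-text field of the sample ClinVar record quoted in the post.
SAMPLE_RECORD = (
    "GRCh38_chr:chr1 GRCh38_pos:1523548 reference_allele:T alternative_allele:C "
    "dbSNP_ID:na Variation_ID:1704755 Allele_ID:1699287 "
    "canonical_SPDI:NC_000001.11:g.1523548T>C "
    "molecular_consequence:SO:0001583|missense_variant "
    "germline_review:Uncertain_significance "
    "germline_status:criteria_provided,_single_submitter "
    "Gene:ATAD3A Condition:not_provided source:clinvar "
    "clinvar_URL:https://www.ncbi.nlm.nih.gov/clinvar/variation/1704755/"
)


def parse_variant_record(text: str) -> dict[str, str]:
    """Split a space-separated 'key:value' record into a dict.

    Only the first colon separates key from value, because values such as
    the SPDI string or the URL contain colons themselves.
    """
    fields: dict[str, str] = {}
    for token in text.split():
        key, sep, value = token.partition(":")
        if sep:  # ignore tokens that contain no colon at all
            fields[key] = value
    return fields


def build_query_command(root: str, chrom: str, pos: str,
                        method: str = "local") -> list[str]:
    """Assemble the graphrag query call shown in the post as an argument list."""
    return ["graphrag", "query", "--root", root, "--method", method,
            "--query", f"Annotate {chrom}:{pos}"]


def estimate_indexing_seconds(n_records: int,
                              ms_per_record: float = 72.0) -> float:
    """Rough indexing time, assuming the per-record cost scales linearly."""
    return n_records * ms_per_record / 1000.0


if __name__ == "__main__":
    record = parse_variant_record(SAMPLE_RECORD)
    cmd = build_query_command("./genomicsragtest",
                              record["GRCh38_chr"], record["GRCh38_pos"])
    print(shlex.join(cmd))                       # shell-safe command string
    print(estimate_indexing_seconds(1_000_000))  # seconds for 1M records
```

The argument list returned by `build_query_command` could be passed to `subprocess.run` in an environment where the `graphrag` CLI is installed. Under the linear-scaling assumption, one million records at 72 ms each would take about 72,000 seconds, or roughly 20 hours.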
Erdal_Cosgun
May 02, 2025 in Healthcare and Life Sciences Blog
1.6K Views
0 likes
0 Comments
Simplifying Genomic Task Execution with GA4GH TES: A Guide for Bioinformatics Workflows
Learn how to use the GA4GH TES API with Nextflow.
Venkat_Malladi
Feb 21, 2025 in Healthcare and Life Sciences Blog
5.3K Views
1 like
0 Comments
Simplifying Genomic Task Execution with TES Rust: A Guide for Bioinformatics Workflows
Learn how to use the GA4GH TES API with Rust.
Venkat_Malladi
Feb 14, 2025 in Healthcare and Life Sciences Blog
747 Views
3 likes
0 Comments
Simplifying Workflow Management with Snakemake and Task Execution Service (TES)

Venkat_Malladi
Feb 14, 2025 in Healthcare and Life Sciences Blog
843 Views
1 like
0 Comments
Scalable Genomics Annotation Analysis with OpenCRAVAT in Microsoft Azure
How to annotate genomics data in Microsoft Azure?
Erdal_Cosgun
Nov 22, 2024 in Healthcare and Life Sciences Blog
4.2K Views
0 likes
0 Comments
Accelerating pathogen identification by using Snakemake on Azure
By deploying Snakemake on Azure, researchers can create reproducible and scalable data analysis workflows, streamlining the pathogen management process.
Venkat_Malladi
Nov 10, 2024 in Healthcare and Life Sciences Blog
607 Views
2 likes
0 Comments
Introducing the Open Targets Dataset: Now Available on Genomics Data Lake on Azure
Access the Open Targets dataset on Azure to accelerate biomedical research. This open-access resource connects genetic, biological, and clinical data, helping researchers identify and prioritize drug targets efficiently.
Mamta-Giri
Oct 28, 2024 in Healthcare and Life Sciences Blog
1.3K Views
1 like
0 Comments
Genomics + LLMs: A Case Study on adding variant annotations to LLMs through RAG and Fine-tuning
Learn how to add genomics domain knowledge to Large Language Models (LLMs)
Erdal_Cosgun
Sep 30, 2024 in Healthcare and Life Sciences Blog
4.7K Views
2 likes
0 Comments
Introducing Nextflow with GA4GH TES: A New Era of Scalable Data Processing on Azure
Learn how to use the GA4GH TES API with Nextflow.
Venkat_Malladi
Sep 23, 2024 in Healthcare and Life Sciences Blog
1.6K Views
0 likes
0 Comments
Update: Cost-effective genomics analysis with Sentieon on Azure
Sentieon pipelines allow researchers and clinicians to process and analyze genomic data quickly, accurately, and efficiently with a low total cost of ownership. Here we provide an update to the previous results for the new version of the software.
Venkat_Malladi
Sep 16, 2024 in Healthcare and Life Sciences Blog
6.4K Views
0 likes
0 Comments