Convert Synthetic FHIR and PacBio VCF Data to Parquet and Explore with Azure Synapse Analytics

Published Jul 21 2022 02:18 PM
Microsoft

Note: a Jupyter Notebook accompanies this blog post and contains all of the information necessary for generating synthetic clinical data in FHIR format using Synthea and converting FHIR and genomic VCF data into tabular Parquet format for further analysis. Please review the notebook here.

 

[Figure: fhir_long_read_1.JPG]

Below is a brief overview of the data formats used in the sample notebook:

 

1. Clinical Data

 

1.1. FHIR: Fast Healthcare Interoperability Resources

 

Clinical data can come in the form of numbers, raw text, images, or even 3D scans. Despite this inherent diversity, the data must be stored in a consistent digital format that allows easy and efficient exchange between hospitals, laboratories, and data centers. The “Fast Healthcare Interoperability Resources” (FHIR) standard is currently the leading standard for health care data exchange [5]. Each unit of FHIR data is an instance of one of roughly 140 pre-defined resource types, represented in XML, JSON, or RDF format. The framework was designed to be broad and extensible, covering clinical healthcare, clinical trials, organization administration, and finances. Data is commonly accessed through a FHIR RESTful API, which supports secure and efficient querying of patient data. In Section 2 of our Jupyter notebook, we configure a FHIR server for hosting synthetic patient data.
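
As a rough illustration of what a FHIR resource looks like in JSON and how a RESTful search URL is formed, here is a minimal sketch. The base URL `https://example-fhir-server/fhir` and the sample patient values are hypothetical and do not come from the notebook; only the field names (`resourceType`, `name`, `gender`, `birthDate`) and the `[base]/[type]?[parameters]` search pattern follow the FHIR specification.

```python
import json
from urllib.parse import urlencode

# Hypothetical base URL; replace with the endpoint of your own FHIR server.
FHIR_BASE = "https://example-fhir-server/fhir"

# A minimal Patient resource (illustrative values, not real data).
patient_json = """
{
  "resourceType": "Patient",
  "id": "example-1",
  "name": [{"family": "Doe", "given": ["Jane"]}],
  "gender": "female",
  "birthDate": "1980-04-12"
}
"""


def display_name(patient: dict) -> str:
    """Join the given and family names of the first name entry."""
    name = patient["name"][0]
    return " ".join(name.get("given", []) + [name.get("family", "")]).strip()


def search_url(resource_type: str, **params: str) -> str:
    """Build a FHIR search URL of the form [base]/[type]?[parameters]."""
    url = f"{FHIR_BASE}/{resource_type}"
    query = urlencode(params)
    return f"{url}?{query}" if query else url


patient = json.loads(patient_json)
print(patient["resourceType"], display_name(patient))
print(search_url("Patient", gender="female", birthdate="ge1980-01-01"))
```

The sketch only builds the URL; actually sending the request would also require authentication against your server, which is outside the scope of this example.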

 

1.2. Synthea


Synthea is a widely used open-source tool for generating realistic (but synthetic) patient data in FHIR format [6]. This enables researchers to work with realistic clinical datasets without the legal, ethical, or security concerns that accompany real patient data. In Section 1 of our Jupyter notebook, we demonstrate how to generate synthetic patient data with Synthea on Azure. In Section 3, we then upload the data to our FHIR server.
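
To get a feel for what a generated patient record contains, the sketch below counts the resource types inside one FHIR Bundle. It assumes the bundle follows the standard FHIR layout (`entry[].resource.resourceType`); the tiny in-memory bundle and the commented-out file path are stand-ins for a real Synthea output file and are not taken from the notebook.

```python
import json
from collections import Counter


def count_resource_types(bundle: dict) -> Counter:
    """Count how many resources of each type a FHIR Bundle contains."""
    return Counter(
        entry["resource"]["resourceType"]
        for entry in bundle.get("entry", [])
        if "resource" in entry
    )


# Stand-in for a Synthea-generated bundle (illustrative, heavily shortened).
sample_bundle = {
    "resourceType": "Bundle",
    "type": "transaction",
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "p1"}},
        {"resource": {"resourceType": "Encounter", "id": "e1"}},
        {"resource": {"resourceType": "Condition", "id": "c1"}},
        {"resource": {"resourceType": "Encounter", "id": "e2"}},
    ],
}

counts = count_resource_types(sample_bundle)
for resource_type, n in counts.most_common():
    print(f"{resource_type}: {n}")

# With a real file (hypothetical path):
# with open("output/fhir/some_patient.json") as f:
#     print(count_resource_types(json.load(f)))
```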

 

2. Genomics Data

 

2.1. Long Read Sequencing

 

Short-read sequencing remains the most widely used technology for genome sequencing today. For years it has provided relatively low-cost, massively parallel sequencing, and Illumina short reads of several hundred bases typically achieve around 99.9% accuracy [7]. More recently, newer long-read technologies have gained prominence. These technologies can easily achieve average read lengths of over 10,000 bases [9]. Long reads greatly aid in assembling complex or repetitive regions of the human genome, and PacBio reads were instrumental in the “Telomere-to-Telomere” consortium completing the first truly complete human genome in 2021 [10].

 

2.2. VCF: Variant Call Format


Since any two human genomes are about 99.9% identical [1], the end goal of most DNA sequencing efforts is to identify unexpected differences between a person’s DNA and an expected reference sequence. This problem is known as “variant calling”. These small changes in DNA can take the form of mutations (e.g. from an ‘A’ to a ‘G’), insertions (of unexpected bases), deletions (of expected bases), or structural variants (in which large segments of DNA move to a different location in the genome). These variations are stored in “Variant Call Format” (VCF), which records, for each variant position, the expected (“reference”) and actually observed (“alternate”) DNA sequence. Databases of known mutations and their effects (if any) on patient health are used to identify important mutations [11].
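
The structure of a VCF record can be illustrated with a small, self-contained parser. This is a simplified sketch, not the notebook's code: the sample records are made up, only the first five fixed columns (CHROM, POS, ID, REF, ALT) are read, and variants are classified purely by comparing allele lengths, so symbolic alleles used for structural variants (such as `<DEL>`) are not handled.

```python
from dataclasses import dataclass

# Fixed VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO (tab-separated).
# The records below are invented for illustration.
SAMPLE_VCF = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
chr1\t10177\t.\tA\tG\t50\tPASS\t.
chr1\t20345\t.\tT\tTCA\t43\tPASS\t.
chr2\t5012\t.\tGAT\tG\t38\tPASS\t.
"""


@dataclass
class Variant:
    chrom: str
    pos: int
    ref: str  # expected ("reference") sequence
    alt: str  # observed ("alternate") sequence

    @property
    def kind(self) -> str:
        """Classify by comparing REF and ALT allele lengths."""
        if len(self.ref) == len(self.alt):
            return "substitution"
        return "insertion" if len(self.alt) > len(self.ref) else "deletion"


def parse_vcf(text: str) -> list[Variant]:
    """Parse data lines, skipping '#' header lines; one Variant per ALT allele."""
    variants = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        for allele in alt.split(","):  # ALT may list several alleles
            variants.append(Variant(chrom, int(pos), ref, allele))
    return variants


variants = parse_vcf(SAMPLE_VCF)
for v in variants:
    print(v.chrom, v.pos, f"{v.ref}>{v.alt}", v.kind)
```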

 

[Figure: figure3.png]

3. Data Management

 

3.1. Parquet

 

Parquet files store the same tabular information as ordinary CSV or TSV files, but in a more compact, compressed form. Data is stored in column-major order, which allows encodings such as run-length encoding, dictionary encoding, or delta encoding to be applied per column, depending on each column’s data type and values [12].
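
To show why column-major storage helps, the sketch below transposes a few row-major records into columns and applies toy versions of the three encodings named above. These functions only illustrate the ideas; Parquet's real encodings are more elaborate, and the sample values are invented.

```python
from itertools import groupby

# Row-major records, as they would appear in a CSV/TSV file (illustrative values).
rows = [
    ("chr1", 101, "PASS"),
    ("chr1", 202, "PASS"),
    ("chr1", 303, "PASS"),
    ("chr2", 404, "LowQual"),
    ("chr2", 505, "PASS"),
]

# Column-major view: values of the same column become contiguous.
chrom_col, pos_col, filter_col = (list(col) for col in zip(*rows))


def run_length_encode(values):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def dictionary_encode(values):
    """Replace each value with an integer index into a list of distinct values."""
    dictionary = list(dict.fromkeys(values))  # distinct values, first-seen order
    index = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [index[value] for value in values]


def delta_encode(values):
    """Store the first value followed by differences between neighbours."""
    return values[:1] + [b - a for a, b in zip(values, values[1:])]


print(run_length_encode(chrom_col))
print(dictionary_encode(filter_col))
print(delta_encode(pos_col))
```

Each encoding suits a different kind of column: long runs of a repeated value favor run-length encoding, a small set of distinct strings favors dictionary encoding, and steadily increasing integers favor delta encoding.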

 

In Section 3 of our Jupyter notebook, we set up “FHIR to Synapse Sync Agent” [13], an open-source tool that downloads FHIR data from a server and converts it to Parquet format. In the sample notebook, we use `bcftools` to convert PacBio VCF files to TSV [14], and then use `pandas` to save the data in Parquet format [15].
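
The VCF-to-Parquet step might look roughly like the following. This is a sketch under assumptions rather than the notebook's exact code: the selected columns, the helper function names, and the file names `sample.vcf.gz` and `sample.parquet` are all hypothetical. It relies on `bcftools query -f` for the TSV export and on `pandas.DataFrame.to_parquet`, which needs `pyarrow` or `fastparquet` installed.

```python
import io
import subprocess

import pandas as pd

# Columns chosen for illustration; the notebook may export a different set.
COLUMNS = ["CHROM", "POS", "REF", "ALT", "QUAL"]


def bcftools_query_command(vcf_path: str) -> list[str]:
    """Build a `bcftools query` command printing one tab-separated line per record."""
    # bcftools expands the literal escapes \t and \n in the format string itself.
    fmt = "\\t".join(f"%{c}" for c in COLUMNS) + "\\n"
    return ["bcftools", "query", "-f", fmt, vcf_path]


def vcf_to_tsv(vcf_path: str) -> str:
    """Run bcftools (must be installed and on PATH) and return its TSV output."""
    result = subprocess.run(
        bcftools_query_command(vcf_path), check=True, capture_output=True, text=True
    )
    return result.stdout


def tsv_to_dataframe(tsv_text: str) -> pd.DataFrame:
    """Load headerless TSV text into a DataFrame with named columns."""
    return pd.read_csv(
        io.StringIO(tsv_text),
        sep="\t",
        names=COLUMNS,
        dtype={"CHROM": str},
        na_values={"QUAL": ["."]},  # VCF uses "." for a missing value
    )


# Example with inline TSV text, so the sketch runs without bcftools installed.
example_tsv = "chr1\t10177\tA\tG\t50\nchr1\t20345\tT\tTCA\t43\n"
df = tsv_to_dataframe(example_tsv)
print(df)

# With real data (hypothetical file names):
# df = tsv_to_dataframe(vcf_to_tsv("sample.vcf.gz"))
# df.to_parquet("sample.parquet", index=False)
```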

 

4. Next Steps

 

After preparing the Parquet data using the accompanying Jupyter notebook, there are several options for consuming that data:

 

  1. Azure Machine Learning [16]: Machine learning libraries can be used to develop machine learning and deep learning models on FHIR or VCF data.
  2. Azure Synapse Analytics [17]: A subgroup of patients’ genomic VCF data can be selected for further analysis based on clinical data present in their FHIR records.
  3. Terra [18]: The Microsoft Biomedical Platforms and Genomics team announced a partnership with the Broad Institute and Verily to accelerate genomics analysis on the next generation of the Terra platform. The notebooks and data that you prepared in this blog post will be a useful resource for working with Terra on Azure notebooks in the future.
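
The cohort selection described in option 2 can be sketched in plain Python. This is only a conceptual illustration: in Azure Synapse Analytics the same logic would typically be expressed as a query over the Parquet tables. The column names and values below are hypothetical, and the sketch assumes that each variant row carries a patient identifier matching the FHIR records, which the notebook may handle differently.

```python
# Flattened FHIR Condition rows (hypothetical columns and values).
conditions = [
    {"patient_id": "p1", "condition": "Diabetes mellitus type 2"},
    {"patient_id": "p2", "condition": "Hypertension"},
    {"patient_id": "p3", "condition": "Diabetes mellitus type 2"},
]

# Variant rows, assumed to carry the same patient identifier.
variants = [
    {"patient_id": "p1", "chrom": "chr1", "pos": 10177, "ref": "A", "alt": "G"},
    {"patient_id": "p2", "chrom": "chr1", "pos": 20345, "ref": "T", "alt": "TCA"},
    {"patient_id": "p3", "chrom": "chr2", "pos": 5012, "ref": "GAT", "alt": "G"},
]


def variants_for_condition(condition_name, conditions, variants):
    """Return variant rows for patients whose records include the given condition."""
    cohort = {
        row["patient_id"] for row in conditions if row["condition"] == condition_name
    }
    return [row for row in variants if row["patient_id"] in cohort]


selected = variants_for_condition("Diabetes mellitus type 2", conditions, variants)
for row in selected:
    print(row["patient_id"], row["chrom"], row["pos"], f'{row["ref"]}>{row["alt"]}')
```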

5. References

 

Jupyter Notebook

 

[1] NIH: National Human Genome Research Institute. “Genetics vs. Genomics Fact Sheet.” https://www.genome.gov/about-genomics/fact-sheets/Genetics-vs-Genomics

[2] Pater, Adrian A., et al. "High throughput nanopore sequencing of SARS-CoV-2 viral genomes from patient samples." Journal of biological methods 8.COVID 19 Special Issue (2021).

[3] Ji, Boyang, and Jens Nielsen. "From next-generation sequencing to systematic modeling of the gut microbiome." Frontiers in genetics 6 (2015): 219.

[4] NIH: National Human Genome Research Institute. “The Cost of Sequencing a Human Genome.” https://www.genome.gov/about-genomics/fact-sheets/Sequencing-Human-Genome-cost

[5] HL7 International. “Welcome to FHIR: HL7 FHIR Release 4B Documentation” https://hl7.org/fhir

[6] Walonoski, Jason, et al. "Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record." Journal of the American Medical Informatics Association 25.3 (2018): 230-238.

[7] Fox, Edward J., et al. "Accuracy of next generation sequencing platforms." Next generation, sequencing & applications 1 (2014).

[8] Rhoads, Anthony, and Kin Fai Au. "PacBio sequencing and its applications." Genomics, proteomics & bioinformatics 13.5 (2015): 278-289.

[9] Jain, Miten, et al. "The Oxford Nanopore MinION: delivery of nanopore sequencing to the genomics community." Genome biology 17.1 (2016): 1-11.

[10] Nurk, Sergey, et al. "The complete sequence of a human genome." Science 376.6588 (2022): 44-53.

[11] Hutter, Carolyn, and Jean Claude Zenklusen. "The cancer genome atlas: creating lasting value beyond its data." Cell 173.2 (2018): 283-285.

[12] Ivanov, Todor, and Matteo Pergolesi. "The impact of columnar file formats on SQL‐on‐hadoop engine performance: A study on ORC and Parquet." Concurrency and Computation: Practice and Experience 32.5 (2020): e5523.

[13] Microsoft. “FHIR to Synapse Sync Agent.” https://github.com/microsoft/FHIR-Analytics-Pipelines/blob/main/FhirToDataLake/docs/Deployment.md

[14] Danecek, Petr, et al. "Twelve years of SAMtools and BCFtools." Gigascience 10.2 (2021): giab008.

[15] Reback, Jeff, et al. "pandas-dev/pandas: Pandas 1.0.5." Zenodo (2020).

[16] Microsoft. “Docs: What is Azure Machine Learning?” https://docs.microsoft.com/EN-US/azure/machine-learning/overview-what-is-azure-machine-learning

[17] Microsoft. “Docs: Azure Synapse Analytics.” https://docs.microsoft.com/en-us/azure/synapse-analytics/

[18] Broad Institute of MIT and Harvard. “Terra: A scalable platform for biomedical research.” https://terra.bio/

Last update: Jul 21 2022 02:25 PM