Data validation, data transformation and de-identification can be complex and time-consuming. As data volumes grow, new downstream use cases and applications emerge, and expectations of timely delivery of high-quality data increase the importance of fast and reliable data transformation, validation, de-duplication and error correction.
How the City of Spokane improved data quality while lowering costs
To abstract their entire ETL process and achieve consistent data through data quality and master data management services, the City of Spokane leveraged DQLabs and Azure Databricks. They merged a variety of data sources, removed duplicate data and curated the data in Azure Data Lake Storage (ADLS).
“Transparency and accountability are high priorities for the City of Spokane,” said Eric Finch, Chief Innovation and Technology Officer, City of Spokane. “DQLabs and Azure Databricks enable us to deliver a consistent source of cleansed data to address concerns for high-risk populations and to improve public safety and community planning.”
City of Spokane ETL/ELT process with DQLabs and Azure Databricks
How DQLabs leverages Azure Databricks to improve data quality
“DQLabs is an augmented data quality platform, helping organizations manage data smarter,” said Raj Joseph, CEO, DQLabs. “With over two decades of experience in data and data science solutions and products, what I find is that organizations struggle a lot in terms of consolidating data from different locations. Data is commonly stored in different forms and locations, such as PDFs, databases, and other file types scattered across a variety of locations such as on-premises systems, cloud APIs, and third-party systems.”
To help customers make sense of their data and answer even simple questions such as, “is it good?” or “is it bad?” are far more complicated than organizations ever anticipated. To solve these challenges, DQLabs built an augmented data quality platform. DQLabs helped the City of Spokane to create an automated cloud data architecture using Azure Databricks to process a wide variety of data formats, including JSON and relational databases. They first leveraged Azure Data Factory (ADF) with DQLabs’ built-in data integration tools to connect the various data sources and orchestrate the data ingestion at different velocities, for both full and incremental updates.
DQLabs uses Azure Databricks to process and de-identify both streaming and batch data in real time for data quality profiling. This data is then staged and curated for machine learning models PySpark MLlib.