One of the challenges of large-scale data analysis is getting value from data with the least effort. Doing that often involves multiple stages: provisioning infrastructure, accessing or moving data, transforming or filtering data, analyzing and learning from data, automating the data pipelines, connecting with other services that provide input or consume the output data, and more. There are quite a few tools available for each of these stages, but it's usually difficult to have them all in one place and easily connected.
If this article was helpful or interesting to you, follow @lenadroid on Twitter.
This is the first article in the series; it covers what Azure Synapse is and how to start using it with Azure CLI. Make sure your Azure CLI is installed and up to date, and add the synapse extension if necessary:
$ az extension add --name synapse
What is Azure Synapse?
In Azure, the Synapse Analytics service aims to provide managed support for distributed data analysis workloads with less friction. If you're coming from a GCP or AWS background, the comparable products in those clouds are BigQuery and Redshift. Azure Synapse is currently in public preview.
Serverless and provisioned capacity
In the world of large-scale data processing and analytics, autoscaling clusters and pay-for-what-you-use pricing have become must-haves. In Azure Synapse, you can choose between serverless and provisioned capacity, depending on whether you need the flexibility to adjust to bursts or have a predictable resource load.
Native Apache Spark support
Apache Spark has demonstrated its power in data processing for both batch and real-time streaming models. It offers great Python and Scala/Java support for data operations at large scale. Azure Synapse provides built-in support for data analytics using Apache Spark. It's possible to create an Apache Spark pool, upload Spark jobs, or create Spark notebooks for experimenting with the data.
SQL support
In addition to Apache Spark support, Azure Synapse has excellent support for data analytics with SQL.
Other features
Azure Synapse provides smooth integration with Azure Machine Learning and Spark ML. It enables convenient data ingestion and export using Azure Data Factory, which connects with many Azure and independent data input and output sources. Data can be effectively visualized with Power BI.
At Microsoft Build 2020, Satya Nadella announced Synapse Link functionality that will help get insights from real-time transactional data stored in operational databases (e.g. Cosmos DB) with a single click, without the need to manage data movement.
Get started with Azure Synapse Workspaces using Azure CLI
Prepare the necessary environment variables:
$ StorageAccountName='<come up with a name for your storage account>'
$ ResourceGroup='<come up with a name for your resource group>'
$ Region='<pick a region, e.g. eastus>'
$ FileShareName='<come up with a name of the storage file share>'
$ SynapseWorkspaceName='<come up with a name for Synapse Workspace>'
$ SqlUser='<come up with a username>'
$ SqlPassword='<come up with a secure password>'
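Storage account names are more restricted than most Azure resource names: they must be 3 to 24 characters long and contain only lowercase letters and digits. Here is a minimal sketch of a local check you could run before creating anything. The helper name `is_valid_storage_name` is made up for this example and is not part of Azure CLI, and the check does not cover global uniqueness, which only Azure can verify.

```shell
# Hypothetical helper (not an Azure CLI command): checks the storage account
# naming rule of 3-24 characters, lowercase letters and digits only.
is_valid_storage_name() {
  name=$1
  len=${#name}
  # Length must be between 3 and 24 characters.
  [ "$len" -ge 3 ] && [ "$len" -le 24 ] || return 1
  # Strip every allowed character; anything left over is invalid.
  leftover=$(printf '%s' "$name" | LC_ALL=C tr -d 'a-z0-9')
  [ -z "$leftover" ]
}
```

For example, `is_valid_storage_name "$StorageAccountName" || echo "Invalid storage account name"` would flag a name with uppercase letters or dashes before any Azure call is made.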
Create a resource group as a container for your resources:
$ az group create --name $ResourceGroup --location $Region
Create a Data Lake storage account:
$ az storage account create \
  --name $StorageAccountName \
  --resource-group $ResourceGroup \
  --location $Region \
  --sku Standard_GRS \
  --kind StorageV2
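The commands so far create a resource group and a storage account, but not the Synapse workspace itself. Here is a hedged sketch of what that step might look like with the `synapse` extension. It assumes that `$FileShareName` is meant to be the workspace's Data Lake filesystem and that the storage account is usable as Data Lake Storage Gen2 (which may require `--enable-hierarchical-namespace true` when the account is created). The `run` and `create_synapse_workspace` helpers are made up for this example; by default they only print the command so you can review it first.

```shell
# Illustrative helpers, not Azure CLI commands.
# Prints the command by default; set DRY_RUN=0 to actually execute it.
# Note: the dry run prints the SQL password in clear text.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

# Assumption: $FileShareName is used as the workspace's default filesystem.
create_synapse_workspace() {
  run az synapse workspace create \
    --name "$SynapseWorkspaceName" \
    --resource-group "$ResourceGroup" \
    --storage-account "$StorageAccountName" \
    --file-system "$FileShareName" \
    --sql-admin-login-user "$SqlUser" \
    --sql-admin-login-password "$SqlPassword" \
    --location "$Region"
}
```

Calling `create_synapse_workspace` prints the full command for review; exporting `DRY_RUN=0` first makes the same call run it against your subscription.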
After you have successfully created these resources, you should be able to go to the Azure Portal and navigate to the resource called $SynapseWorkspaceName within the $ResourceGroup resource group. You should see the workspace's overview page.
You can now load data and experiment with it in Synapse Studio, create Spark or SQL pools and run analytics queries, connect to Power BI and visualize your data, and much more.
Stay tuned for the next articles to learn more! Thanks for reading!