Educator Developer Blog

3 MIN READ

Big Data on Azure with No Limits Data, Analytics and Managed Clusters

Lee_Stott

Microsoft

Mar 21, 2019

First published on MSDN on Feb 24, 2017

HDInsight

Reliable with an industry leading SLA

Enterprise-grade security and monitoring

Productive platform for developers and scientists

Cost effective cloud scale

Integration with leading ISV applications

Easy for administrators to manage

Resources & Hands on Labs for teaching

https://github.com/MSFTImagine/computerscience/tree/master/Workshop/7.%20HDInsight

Cluster Types available on Azure

Hadoop

Governed by the Apache Software Foundation

Core services: MapReduce, HDFS and YARN

Other functions across: data lifecycle and governance, data workflow, security, tools, data access and operations

MapReduce

Programming framework for analysing datasets stored on HDFS

Typically MapReduce + HDFS run on the same set of nodes

Distributes computation locally to where the data is

Scales linearly as you add nodes

Composed of used-supplied Map and Reduce functions:

Map(): subdivide and conquer

Reduce(): combine and reduce cardinality

Hive

SQL-like queries on data in HDFS

HiveQL = SQL-like languages

Hive structures include well-understood database concepts such as tables, rows, columns, partitions

Compiled into MapReduce jobs that are executed on Hadoop

Supports custom serializer/deserializers for complex or irregularly structured data

ODBC drivers to integrate with Power BI, Tableau, Qlik, etc.

HBase

NoSQL database on data in HDFS

random access and strong consistency for large amounts of unstructured and semistructured data in a schemaless database organized by column families

Can be queried using Hive

HDInsight implementation: automatic sharding of tables, strong consistency for reads and writes, and automatic failover

Storm

Stream Analytics for Near-Real Time processing

Consumes millions of real-time events from a scalable event broker: Apache Kafka, Azure Event Hub

Guarantees processing of data

Output to persistent stores, dashboards or devices

HDInsight implementation: Event Hub, Azure Virtual Network, SQL Database, Blob storage, and DocumentDB

Kafka

High-throughput, low-latency for real-time data

Stream millions of events per second

Enterprise-grade management and control

Resources & Hands on Labs for courses

https://github.com/MSFTImagine/computerscience/tree/master/Workshop/6.%20HPC%20and%20Containers

https://github.com/Microsoft/TechnicalCommunityContent/tree/master/Big%20Data%20and%20Analytics

Azure Data Lake

Petabyte size files and Trillions of objects

Scalable throughput for massively parallel analytics

HDFS for the cloud

Always encrypted, role-based security & auditing

Enterprise-grade support

Data Ingress & Egress –> To & From Azure Data Lake

Data can be ingested into Azure Data Lake Store from a variety of sources

Data can be exported from Azure Data Lake Store into numerous targets/sinks

Resources and Hands on Labs

https://github.com/Microsoft/TechnicalCommunityContent/tree/master/Big%20Data%20and%20Analytics

https://azure.microsoft.com/en-us/solutions/data-lake/

Azure Data Lake Analytics

Built on Apache YARN

Scales dynamically with the turn of a dial

Pay by the query

Supports Azure AD for access control, roles, and integration with on-premise identity systems

Processes data across Azure

Built with U-SQL to unify the benefits of SQL with the power of C#

U-SQL: a simple
and powerful language that’s familiar and easily extensible

Unifies the declarative
nature of SQL with expressive
power of C#

Leverage existing libraries in .NET languages, R and Python

Massively parallelize code on diverse workloads (ETL, ML, image tagging, facial detection)

Resources

https://azure.microsoft.com/en-us/services/data-lake-analytics/

Architecture

Nathan Marz came up with the term Lambda Architecture (LA) for a generic, scalable and fault-tolerant data processing architecture, based on his experience working on distributed data processing systems at Backtype and Twitter.

The LA aims to satisfy the needs for a robust system that is fault-tolerant, both against hardware failures and human mistakes, being able to serve a wide range of workloads and use cases, and in which low-latency reads and updates are required. The resulting system should be linearly scalable, and it should scale out rather than up.

Here’s how it looks like, from a high-level perspective: