Real-time machine learning on globally-distributed data with Apache Spark and DocumentDB

Community Manager

At the Strata + Hadoop World 2017 Conference in San Jose, we announced the Spark to DocumentDB Connector. It enables real-time data science, machine learning, and exploration over globally distributed data in Azure DocumentDB. Connecting Apache Spark to Azure DocumentDB accelerates our customers' ability to solve fast-moving data science problems, where data can be quickly persisted and queried using DocumentDB. The Spark to DocumentDB connector efficiently exploits the native DocumentDB managed indexes and enables updateable columns when performing analytics, as well as push-down predicate filtering against fast-changing, globally distributed data, in scenarios ranging from IoT to data science and analytics. The Spark to DocumentDB connector uses the Azure DocumentDB Java SDK. You can get started today and download the Spark connector from GitHub!
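To make the idea of push-down predicate filtering concrete, here is a minimal, self-contained Python sketch. It does not use the connector or any DocumentDB API: `ToyStore`, its `scan` and `query` methods, and the sample flight records are all hypothetical stand-ins, meant only to illustrate why evaluating a filter inside an indexed store can read fewer documents than filtering on the client.

```python
from collections import defaultdict


class ToyStore:
    """Hypothetical in-memory stand-in for an indexed document store."""

    def __init__(self, docs):
        self.docs = list(docs)
        self.docs_read = 0  # counts documents handed back to the caller
        # Index every field: field name -> value -> positions of matching docs.
        self.index = defaultdict(lambda: defaultdict(list))
        for pos, doc in enumerate(self.docs):
            for field, value in doc.items():
                self.index[field][value].append(pos)

    def scan(self):
        """Return every document; the caller has to filter afterwards."""
        self.docs_read += len(self.docs)
        return list(self.docs)

    def query(self, field, value):
        """Evaluate an equality predicate inside the store via the index."""
        positions = self.index[field].get(value, [])
        self.docs_read += len(positions)
        return [self.docs[p] for p in positions]


# Made-up sample data, for illustration only.
flights = [
    {"id": 1, "origin": "SEA", "delay": 12},
    {"id": 2, "origin": "SFO", "delay": 0},
    {"id": 3, "origin": "SEA", "delay": 45},
    {"id": 4, "origin": "JFK", "delay": 7},
]

store = ToyStore(flights)

# Without push-down: fetch everything, then filter on the client.
client_side = [d for d in store.scan() if d["origin"] == "SEA"]
read_without_pushdown = store.docs_read

# With push-down: the store applies the predicate using its index.
store.docs_read = 0
pushed_down = store.query("origin", "SEA")
read_with_pushdown = store.docs_read

print(read_without_pushdown, read_with_pushdown)
```

Both approaches return the same two flights, but the pushed-down query hands back only the matching documents instead of the whole collection. On a real, globally distributed dataset the same principle reduces the data moved between the database and Spark.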


Azure DocumentDB is our globally distributed database service designed to enable developers to build planet-scale applications. DocumentDB allows you to elastically scale both throughput and storage across any number of geographical regions. The service offers guaranteed low latency at P99, 99.99% high availability, predictable throughput, and multiple well-defined consistency models, all backed by comprehensive SLAs. By virtue of its schema-agnostic, write-optimized database engine, DocumentDB by default automatically indexes all the data it ingests and serves SQL, MongoDB, and JavaScript language-integrated queries in a scale-independent manner. As a cloud service, DocumentDB is carefully engineered with multi-tenancy and global distribution from the ground up. These unique benefits make DocumentDB a great fit for both operational and analytical workloads for applications including web, mobile, personalization, gaming, IoT, and many others that need seamless scale and global replication.




Read about it on the Azure blog.
