At Microsoft, we store datasets (both from internal teams and external customers) ranging from a few GBs to hundreds of PBs in our data lake. The scope of analytics on these datasets ranges from traditional batch-style queries (e.g., OLAP) to exploratory, ‘needle in a haystack’ queries (e.g., point lookups, summarization). Resorting to linear scans of these large datasets with huge clusters for every simple query is prohibitively expensive and not the top choice for many of our customers, who are constantly exploring ways to reduce their operational costs; incurring unchecked expenses is their worst nightmare. Over the years, we have seen a huge demand for bringing the ‘indexing’ capabilities that come standard in traditional database systems into Apache Spark™.
Today, we are making that possible by open-sourcing Hyperspace v0.1, an indexing subsystem for Apache Spark™. Hyperspace is the same technology that powers indexing within Azure Synapse Analytics!
At a high level, Hyperspace offers users the ability to:
Build indexes on your data (e.g., CSV, JSON, Parquet)
Maintain the indexes through a multi-user concurrency model
Leverage these indexes automatically within your Spark workloads, without any changes to your application code, for query/workload acceleration
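To make the idea behind these capabilities concrete, here is a toy sketch in plain Python of why an index helps a point lookup. It is not Hyperspace's API or implementation: the function names (`build_covering_index`, `lookup_by_scan`, `lookup_by_index`) and the sample rows are hypothetical, chosen only for illustration. The sketch builds a small "covering" index that maps each value of one column to the other columns a query needs, so a lookup no longer has to touch every row.

```python
from collections import defaultdict


def build_covering_index(rows, indexed_column, included_columns):
    """Map each value of indexed_column to the included columns of matching rows."""
    index = defaultdict(list)
    for row in rows:
        index[row[indexed_column]].append({c: row[c] for c in included_columns})
    return dict(index)


def lookup_by_scan(rows, column, value, wanted_columns):
    """Linear scan: inspects every row, even for a single-key lookup."""
    return [{c: row[c] for c in wanted_columns} for row in rows if row[column] == value]


def lookup_by_index(index, value):
    """Index lookup: one dictionary access instead of a pass over all rows."""
    return index.get(value, [])


# Hypothetical sample data standing in for a much larger dataset.
rows = [
    {"order_id": 1, "customer": "alice", "amount": 30},
    {"order_id": 2, "customer": "bob", "amount": 12},
    {"order_id": 3, "customer": "alice", "amount": 7},
]

index = build_covering_index(rows, "customer", ["order_id", "amount"])
scan_result = lookup_by_scan(rows, "customer", "alice", ["order_id", "amount"])
index_result = lookup_by_index(index, "alice")
```

Both lookups return the same rows; the difference is that the scan cost grows with the size of the dataset, while the index pays that cost once at build time. The same trade-off is what makes maintaining an index worthwhile when a dataset is queried repeatedly.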
When running test queries derived from industry-standard TPC benchmarks (Test-H and Test-DS) over 1 TB of Parquet data, we have seen Hyperspace deliver up to 11x acceleration in query performance for individual queries. We ran all benchmark-derived queries using open-source Apache Spark™ 2.4 on a 7-node Azure E8 V3 cluster (7 executors, each with 8 cores and 47 GB of memory) at a scale factor of 1000 (i.e., 1 TB of data).
Overall, we have seen ~2x and ~1.8x acceleration in query performance on the Test-H and Test-DS derived workloads, respectively, all using commodity hardware.
To learn more about Hyperspace, check out our presentation at Spark+AI Summit 2020. Stay tuned for more articles in the coming weeks!
Feel like contributing? Start with the current outstanding Issues.
For those of you who may not want to immediately dive into the GitHub repository, we recently announced the availability of public preview features in Azure Synapse Analytics, a limitless analytics service that brings together enterprise data warehousing and big data analytics. As mentioned earlier, Hyperspace is the same technology that powers indexing within Azure Synapse.
Get started today with Hyperspace on Azure Synapse Analytics: