Data Warehousing using Apache Spark on Azure HDInsight

Question

Hi Team,

Hope all are safe!

This is my first project in Azure, and we are looking at developing a DW using Apache Spark on Azure HDInsight.

In simple terms, we are currently trying to pick up files from SharePoint, do transformations using PySpark, and then load the data into an Azure SQL database.

Can someone help me with the queries below?

1) Can we connect Apache Spark or PySpark on Azure HDInsight to SharePoint to pick up files?
2) Can we implement the usual SCD1 or SCD2 logic using PySpark?

Thanks in advance!

buckwoodymsft · Answer

Hello! I suggest you request support here - https://social.msdn.microsoft.com/Forums/sqlserver/en-US/home

ronen_ariely · Answer

Hi Aishwar04
This question is very old, but maybe this will help other people who come here in the future as well.

Note! The link you were given to the MSDN forums is no longer active; it redirects to a general MSDN forum named "Forums Issues (not product support)".
The English MSDN forums were migrated a long time ago to a new system named Microsoft Q&A, which is part of the Microsoft Docs system. You can find it at this link: https://docs.microsoft.com/answers

The new system is based on tags rather than separate forums. When you ask a question, make sure you use a relevant tag. In your case, you can probably use the tag "azure-databricks".

> This is my first project in Azure

Congrats 🙂

> 1) Can we connect Apache Spark or Pyspark on Azure HDinsight to Share Point to pick files?

You can manage jobs in Azure Databricks using PySpark (the Python API for Apache Spark).
I have not tried it, but with the help of Google I found the following article, which seems to provide a solution. Please check it:
https://www.cdata.com/kb/tech/sharepoint-jdbc-apache-spark.rst

I recommend checking the Q&A forum mentioned above to reach experts on this topic.

xbanalytics · Answer

Yes, you can connect Apache Spark or PySpark on Azure HDInsight to SharePoint by using APIs or data connectors to access files, which can then be integrated into your data warehouse for further processing and analysis.
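As one possible illustration of the "APIs" route, here is a minimal sketch that fetches a single file through the SharePoint REST endpoint `GetFileByServerRelativeUrl(...)/$value`. The helper names `build_file_url` and `download_file`, the `contoso` site URL, and the file path are hypothetical, and the sketch assumes you already hold a valid OAuth bearer token for the site (how you obtain it depends on your tenant setup and is not covered here).

```python
import urllib.parse
import urllib.request


def build_file_url(site_url: str, server_relative_path: str) -> str:
    """Return the SharePoint REST URL that streams a file's raw bytes.

    Hypothetical helper: site_url and server_relative_path are placeholders
    you would replace with your own site and document library path.
    """
    # Single quotes inside an OData string literal are escaped by doubling.
    escaped = server_relative_path.replace("'", "''")
    # Percent-encode spaces and other unsafe characters, keeping "/".
    encoded = urllib.parse.quote(escaped, safe="/")
    return (
        f"{site_url.rstrip('/')}/_api/web/"
        f"GetFileByServerRelativeUrl('{encoded}')/$value"
    )


def download_file(site_url, server_relative_path, access_token, dest_path):
    """Download one file to dest_path.

    Assumption: access_token is a valid OAuth bearer token for the site.
    """
    request = urllib.request.Request(
        build_file_url(site_url, server_relative_path),
        headers={"Authorization": f"Bearer {access_token}"},
    )
    with urllib.request.urlopen(request) as response, open(dest_path, "wb") as out:
        out.write(response.read())
    return dest_path


# Example call (not executed here; requires network access and a real token):
# download_file(
#     "https://contoso.sharepoint.com/sites/dw",
#     "/sites/dw/Shared Documents/sales.csv",
#     access_token="<token>",
#     dest_path="/tmp/sales.csv",
# )
```

Once the file is downloaded, it could be copied to storage that the cluster can reach and read with `spark.read.csv(...)` for the transformation step. Whether that fits depends on file sizes and how your cluster's storage is configured.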
