Jul 22 2019 05:05 AM
Hi All
Hope you are doing well.
We have a use case that uses ADLS for the different tiers of an ingestion process, and we would value your opinions on its feasibility.
INFRASTRUCTURE: There will be two instances of ADLS, named LAND and RAW. The LAND instance will receive files directly from the source, while the RAW instance will receive a file only after its validations have passed in the LAND instance. We also have a Cloudera cluster hosted on Azure, which will have connectivity to both ADLS instances.
PROCESS: A set of data and control files will land in one of the ADLS instances (the LAND instance). We need to run Spark code on the Cloudera cluster to perform a count validation between the data file and the control file in the LAND instance. Once the validation succeeds, we want to use the distcp command to copy the data from the LAND instance to the RAW instance. We are assuming that the DistCp utility will already be installed on the Cloudera cluster.
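To make the process concrete, here is a rough sketch of the validate-then-copy step in Python. It is only an illustration under several assumptions that are not settled above: the control file is assumed to contain a single integer (the expected record count), the data file is assumed to have one record per line with no header, and the function names are hypothetical. The Spark count is imported lazily so that the gating logic can be tried without a cluster.

```python
import subprocess


def expected_count(control_text):
    """Parse the expected record count from the control file contents.

    Assumption: the control file holds a single integer and nothing else.
    """
    return int(control_text.strip())


def count_records_spark(path):
    """Count the lines of the data file with Spark.

    Assumption: one record per line, no header row. Needs pyspark, so the
    import is done here rather than at module level.
    """
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("land-count-validation").getOrCreate()
    return spark.read.text(path).count()


def build_distcp_command(src, dst):
    """Build the argument list for `hadoop distcp <src> <dst>`."""
    return ["hadoop", "distcp", src, dst]


def validate_and_copy(actual_count, control_text, src, dst, run=False):
    """Return the distcp command if the counts match, otherwise None.

    The command is only executed when run=True, so the check can be
    exercised as a dry run first.
    """
    if actual_count != expected_count(control_text):
        return None
    cmd = build_distcp_command(src, dst)
    if run:
        subprocess.run(cmd, check=True)
    return cmd
```

On the cluster, the call might look like `validate_and_copy(count_records_spark(src), control_text, src, dst, run=True)`, where `src` and `dst` are the full URIs of the file in the LAND instance and the target directory in the RAW instance. The exact URI scheme and the credentials for each instance depend on the ADLS generation and cluster configuration, which this sketch does not cover.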
Could you let us know whether the above approach looks fine?
Primarily, our question is whether the DistCp utility supports data movement between two different ADLS instances.
We also considered other options such as AdlCopy, but DistCp appeared to be the better fit.
NOTE: We haven't considered using Azure Data Factory, since it may pose certain security challenges for us, though we know Data Factory is well suited to the above use case.