May 28 2021 05:46 AM
The project I am working on as an architect has an IoT setup where many sensors send data such as water pressure and temperature to an FTP server (which I cannot change, as I have no control over it for security reasons). From there, a few Windows services on Azure pull the data and store it in an Azure SQL Database.
Here are my observations on this architecture:
Problem 1: the 1 TB limit in Azure SQL Database. With a higher tier it can go to 4 TB, but that is the maximum, so it does not appear to be infinitely scalable, and as the size grows, query performance could become an issue. Columnstore indexes and partitioning seem to be options, but the size limitation and DTU caps are a deal breaker.
Problem 2: the IoT data and the SQL Database (downstream storage) seem to be tightly coupled. If a customer wants to extract a few months of data, or even more, with millions of rows, the database will get busy and could throttle other customers through DTU exhaustion.
I would like some ideas on scaling this further. Azure SQL DB is great, and with JSON support it is awesome, but it is not a horizontally scalable solution.
Here is what I am thinking:
1) All messages should be consumed from the FTP server by Azure IoT Hub by some means.
2) From the central hub, I want to push all messages to Azure Blob Storage as ~128 MB files for later analysis at low cost.
3) At the same time, I would like all messages to go from IoT Hub to Azure Cosmos DB (for long-term storage) and/or Azure SQL DB (long term, but I am not sure because of the size restriction).
I am keeping the data in Blob Storage because if the client hires a machine learning team to build models, I would prefer them to pull data from Blob Storage rather than hitting my database.
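To make step 2 concrete, here is a minimal sketch of rolling individual sensor messages into ~128 MB newline-delimited JSON batches before upload. The 128 MB target comes from the plan above; everything else (JSON-lines layout, the function name) is an illustrative assumption, not part of the existing setup:

```python
import json

# Target size for each rolled-up blob (~128 MB, as proposed above).
MAX_BLOB_BYTES = 128 * 1024 * 1024

def batch_messages(messages, max_bytes=MAX_BLOB_BYTES):
    """Group JSON-serializable messages into newline-delimited byte
    chunks, each at most max_bytes when UTF-8 encoded."""
    batches, current, current_size = [], [], 0
    for msg in messages:
        line = json.dumps(msg).encode("utf-8") + b"\n"
        # Flush the current chunk if adding this line would overflow it.
        if current and current_size + len(line) > max_bytes:
            batches.append(b"".join(current))
            current, current_size = [], 0
        current.append(line)
        current_size += len(line)
    if current:
        batches.append(b"".join(current))
    return batches
```

Each resulting chunk could then be uploaded as a single blob, e.g. via `BlobClient.upload_blob` from the `azure-storage-blob` package.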
Kindly suggest a few ideas on this. Thanks in advance!!
Jun 01 2021 03:54 PM
There are a number of choices you can make to give this architecture more scalability and resilience. Without fully understanding the consumption use case: IoT data generally does not lend itself well to a relational data store as the raw storage layer. That is why many of the reference architectures use a data lake, or a data lakehouse (a data lake combined with ACID capabilities, such as Databricks Delta), as the central repository, with other data stores added based on usage patterns (Azure Synapse Analytics for aggregate reporting, for example).
There are some very good examples of reference architectures for IoT analysis available here: Azure IoT Reference Architecture (microsoft.com)
Depending on what you want to do with the collected data, you can also look to incorporate tools such as Azure Time Series Insights (Gen2 use cases - Azure Time Series Insights Gen2 | Microsoft Docs) or even Azure Metrics Advisor (Azure Metrics Advisor – Root-Cause Analysis | Microsoft Azure) to help add value to the overall solution.
I hope this helps.
Microsoft - West Enterprise Commercial
Jun 01 2021 05:17 PM
@ChandanJha, what is the usage pattern for the data? Will you be performing real-time analytics on it, or batch analytics? From the description, since FTP is in the picture, I am guessing batch analytics only. Also, it is not clear from the discussion whether you really need to put the data into a relational data store at all. Right?
If you need batch analytics and there is no schema-on-write requirement, then I would say you can bypass IoT Hub and upload the data directly from the FTP locations to Azure Data Lake Storage Gen2, instead of Blob Storage, unless you want to make IoT Hub the central messaging hub. I see IoT Hub more in real-time ingestion and processing scenarios.
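That FTP-to-ADLS path could look roughly like the sketch below, using Python's standard `ftplib` and the `azure-storage-filedatalake` package. The host, credentials, and the date-partitioned `raw/YYYY/MM/DD` layout are all illustrative assumptions:

```python
from ftplib import FTP
from io import BytesIO
from datetime import datetime, timezone

def lake_path(filename, now=None):
    """Build a date-partitioned path so downstream batch jobs can prune by day.
    (The raw/YYYY/MM/DD layout is an assumed convention, not a requirement.)"""
    now = now or datetime.now(timezone.utc)
    return f"raw/{now:%Y/%m/%d}/{filename}"

def copy_ftp_to_adls(ftp_host, ftp_user, ftp_pass, filesystem_client):
    """Download each file in the FTP root and land it in ADLS Gen2.

    filesystem_client is a FileSystemClient from azure-storage-filedatalake,
    e.g. DataLakeServiceClient(url, credential).get_file_system_client("iot").
    """
    with FTP(ftp_host) as ftp:
        ftp.login(ftp_user, ftp_pass)
        for name in ftp.nlst():
            buf = BytesIO()
            ftp.retrbinary(f"RETR {name}", buf.write)
            file_client = filesystem_client.get_file_client(lake_path(name))
            file_client.upload_data(buf.getvalue(), overwrite=True)
```

In practice this would run on a schedule (Azure Functions timer trigger or Azure Data Factory are common choices) rather than as a one-off script.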
Please take a look at the architectural pattern here:
The pattern will depend entirely on what you want to do with the data downstream. In the above architecture, you can easily use ADLS Gen2 instead of Blob Storage in column 3, for the cold path.
It sounds like you do not need a hot path here. Granted, IoT Hub gives you options to keep messages until they are consumed in different ways (hot path, cold path, etc.), but I am not sure how complex it would be to get IoT Hub to pull data from FTP, or whether that is even considered a hot path.
Once your data is in ADLS Gen2, you can then leverage Cosmos DB to take advantage of unlimited scale, millisecond latencies, multi-region writes, etc.
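For that Cosmos DB leg, a minimal sketch assuming the `azure-cosmos` package; the document shape, the `deviceId` partition key, and the database/container names mentioned in the docstring are all hypothetical choices for illustration:

```python
def to_cosmos_doc(reading):
    """Shape a raw sensor reading into a Cosmos DB item.

    Combines device id and timestamp into a unique id, and carries
    deviceId as the (assumed) partition key. Missing metrics become None.
    """
    return {
        "id": f'{reading["deviceId"]}-{reading["ts"]}',
        "deviceId": reading["deviceId"],
        "ts": reading["ts"],
        "pressure": reading.get("pressure"),
        "temperature": reading.get("temperature"),
    }

def write_readings(container_client, readings):
    """Upsert readings into Cosmos DB.

    container_client is a ContainerProxy from azure-cosmos, e.g.
    CosmosClient(url, key).get_database_client("iot")
                          .get_container_client("readings").
    """
    for r in readings:
        # upsert_item makes the load idempotent if a batch is replayed.
        container_client.upsert_item(to_cosmos_doc(r))
```

Using upserts keyed on device id plus timestamp keeps the ingestion idempotent, which matters when batches from the lake get reprocessed.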
A few other references:
I hope this helps.