Given the volume of data and the 10 GB partition limit, what is the approach for keeping data sizes in check? In your example, a city with multiple oil wells would probably hit that limit rather quickly (days/weeks). Is the approach just to introduce look-back reporting constraints for hot data and leverage TTL? In very high-volume IoT scenarios with business requirements around look-back, this could be an issue.
You're absolutely correct that devices generally produce data at a fast rate - which is why Cosmos DB is usually used as a landing zone for ingest.
If you were to assume a 1 KB record per device per second * 60 seconds per minute * 60 minutes per hour * 24 hours in a day, that's roughly 86 MB per device per day => an estimated ~2.5 GB per device per month. Generally speaking, people handle this in two different ways (which, of course, aren't mutually exclusive):
1st method is to set the partition key to an artificial field called /partitionKey and assign it a composite value (e.g. Device ID + current month and year). This enables the workload to write across an extremely high cardinality of values (at least 1 discrete value per device per month) and thus load-balance throughput across the system's underlying partitions. At the same time, this preserves the efficiency of common query patterns over hot telemetry data without having to blindly fan out (e.g. SELECT * FROM c WHERE c.partitionKey IN ("device123-Jan-2019", "device123-Feb-2019", "device123-March-2019")). The rest of the queries can fan out, but that's okay - telemetry generally follows the Pareto principle and is heavily skewed toward a write-heavy scenario with relatively low QPS (>80% of operations/sec are writes, not reads... optimize for the 80% instead of the 20%).
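To make the 1st method concrete, here's a minimal sketch using the azure-cosmos Python SDK. The account endpoint/key, database/container names, and telemetry fields are placeholders, and the Device ID + month-year format is just one way to build the composite value:

```python
import uuid
from datetime import datetime, timezone

from azure.cosmos import CosmosClient, PartitionKey

# Placeholder connection details - substitute your own account.
client = CosmosClient("https://<your-account>.documents.azure.com:443/", credential="<your-key>")
database = client.create_database_if_not_exists("iot")

# Partition the container on the synthetic /partitionKey field.
container = database.create_container_if_not_exists(
    id="telemetry",
    partition_key=PartitionKey(path="/partitionKey"),
)

def write_reading(device_id, temperature):
    """Write one telemetry record whose partition key combines device ID with month-year."""
    now = datetime.now(timezone.utc)
    container.upsert_item({
        "id": str(uuid.uuid4()),
        "deviceId": device_id,
        "partitionKey": f"{device_id}-{now:%b-%Y}",  # e.g. "device123-Jan-2019"
        "temperature": temperature,
        "ts": now.isoformat(),
    })

def recent_readings(device_id, months):
    """Hot-path query scoped to a handful of composite partition key values."""
    keys = [f"{device_id}-{m}" for m in months]  # months e.g. ["Jan-2019", "Feb-2019"]
    params = [{"name": f"@k{i}", "value": k} for i, k in enumerate(keys)]
    in_clause = ", ".join(p["name"] for p in params)
    return list(container.query_items(
        query=f"SELECT * FROM c WHERE c.partitionKey IN ({in_clause})",
        parameters=params,
        enable_cross_partition_query=True,
    ))
```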
2nd method is to tier data into a hot store (e.g. Cosmos DB) + a cold store (e.g. Blob Storage) by using a combination of TTL to automatically prune data from the hot store and change feed to replicate data from hot store => cold store. The hot store enables low-latency indexed queries over recent telemetry; meanwhile, the cold store enables low-cost commodity storage for batch processing and/or retaining data long term for regulatory/compliance reasons. The Cosmos DB team is actually working on operationalizing this pattern (since it comes up so often as a common scenario) in the form of its analytical storage feature. Analytical storage is still in a gated preview (at the time of this post), so I'd recommend using TTL + change feed for production use cases until it graduates to GA.
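Here's a comparable sketch of the 2nd method with the azure-cosmos and azure-storage-blob Python SDKs. The TTL value, account details, and blob naming are assumptions, and in production you'd more likely drive the hot => cold copy continuously (e.g. with an Azure Functions Cosmos DB trigger or the change feed processor) and persist a continuation token rather than re-reading the feed from the beginning:

```python
import json

from azure.cosmos import CosmosClient, PartitionKey
from azure.storage.blob import BlobServiceClient

# Placeholder connection details - substitute your own accounts.
cosmos = CosmosClient("https://<your-account>.documents.azure.com:443/", credential="<your-key>")
database = cosmos.create_database_if_not_exists("iot")

# Hot store: TTL prunes documents 90 days (an arbitrary example) after their last write.
hot = database.create_container_if_not_exists(
    id="telemetry",
    partition_key=PartitionKey(path="/partitionKey"),
    default_ttl=90 * 24 * 60 * 60,  # seconds
)

# Cold store: commodity blob storage for long-term retention / batch processing.
blobs = BlobServiceClient.from_connection_string("<your-storage-connection-string>")
archive = blobs.get_container_client("telemetry-archive")

def drain_change_feed_to_blob():
    """One pull-model pass over the change feed, copying each changed document to blob storage."""
    for item in hot.query_items_change_feed(is_start_from_beginning=True):
        blob_name = f"{item['partitionKey']}/{item['id']}.json"
        archive.upload_blob(name=blob_name, data=json.dumps(item), overwrite=True)
```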