This is a step-by-step, no-fluff guide to building and architecting cloud-native applications.
Cloud application development introduces unique challenges: applications are distributed, scale horizontally, communicate asynchronously, have automated deployments, and are built to handle failures resiliently. This demands a shift in both technical approach and mindset.
So Cloud Native is not about specific technologies: if you can do all of the above, then you are Cloud Native. You don’t have to use Kubernetes, for instance, to be Cloud Native.
Say you want to architect an application in the cloud, using an e-commerce website as an example. How do you go about it, and what decisions do you need to make?
Regardless of the type of application, we have to go through six distinct steps:
Before starting any development, it’s crucial to align with business objectives, ensuring every feature and decision supports these goals. While microservices are popular, their complexity isn’t always discussed.
It’s essential to evaluate if they align with your business needs, as there’s no universal solution. For external applications, the focus is often on enhancing customer experience or generating new revenue. Internal applications, however, aim to reduce operational costs and support queries.
For instance, in E-commerce, improving customer experience may involve enhancing user interfaces or ensuring uptime, while creating new revenue streams could require capabilities in data analytics and product recommendations.
By first establishing and then rigorously adhering to the business objectives, we ensure that every functionality in the application — whether serving external customers or internal users — delivers real value.
Now that we understand and are aligned with the business objectives, we go one step further by getting a bit more specific: defining functional and non-functional requirements.
Functional requirements define the functionalities you will have in your application: for our e-commerce example, browsing products, placing orders, and processing payments.
Non-functional requirements define the “quality” of the application: attributes such as scalability, availability, performance, and security.
We understand the business objectives, we are clear on what functionalities our application will support, and we understand the constraints our applications need to adhere to.
Now, it is time to make a decision on the Application Architecture. For simplicity, you either choose a Monolith or a Microservices architecture.
Monolith: A big misconception is that all monoliths are big balls of mud — that is not necessarily the case. A monolith simply means that the entire application, with all of its functionalities, is hosted in a single deployment instance. The internal architecture, though, can be designed to keep the components of different subsystems isolated from each other; calls between them are in-process function calls as opposed to network calls.
Microservices: Modularity has always been a best practice in software development, whether you use a monolith (logical modularity) or microservices (physical modularity). Microservices means the functionalities of your application are physically separated into different services communicating over network calls.
What to pick: Choose microservices if technical debt is high and your codebase has become so large that feature releases are slow, affecting market responsiveness, or if your application components need different scaling approaches. Conversely, if your app updates infrequently and doesn’t require detailed scalability, the investment in microservices, which introduces its own challenges, might not be justified.
Again, this is why we start with business objectives and requirements.
It is time to make some technology choices. We won’t be architecting yet, but we need to get clear on what technologies we will choose, which is now much easier because we understand the application architecture and requirements.
Encompassing all technology choices, you should be thinking about containers if you are architecting cloud-native applications. Away from all the fluff, this is how containers work: the code, dependencies, and runtime are all packaged into an executable package known as an image. That image is stored in a container registry, and you can run it on any computer that has a supported container runtime (such as Docker). With that, you have a portable, lightweight deployable package.
Most applications model data. We represent different entities in our applications (users, products) using objects with certain properties (name, title, description).
When we pick a database, we need to take into consideration how we will represent our data model (JSON documents, tables, key-value pairs). In addition, we need to understand the relationships between different models, scalability, multi-regional replication, schema flexibility, and of course, team skillsets.
One of the advantages of microservices is that each service can pick its own database. A product service with reviews follows a tree-like structure (a one-to-many relationship) and would fit a document database. An inventory service, on the other hand, where each product can belong to one or more categories and each order can contain multiple products (many-to-many relationships), might use a relational database.
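As a rough illustration (the field names here are made up for the example), the product-with-reviews shape nests naturally into a single document, while the order/product relationship fits separate rows joined through an association table:

```python
# Document model: a product and its reviews live in one document (one-to-many).
product_document = {
    "id": "p-123",
    "name": "Espresso Machine",
    "price": 249.0,
    "reviews": [
        {"user": "alice", "rating": 5, "comment": "Great crema"},
        {"user": "bob", "rating": 4, "comment": "A bit loud"},
    ],
}

# Relational model: orders and products are linked through an association table (many-to-many).
orders = [{"order_id": "o-1", "customer": "alice"}]
products = [{"product_id": "p-123", "name": "Espresso Machine"}]
order_items = [
    {"order_id": "o-1", "product_id": "p-123", "quantity": 2},  # join row
]
```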
The difference between an event and a message is that with a message (apart from it typically having a larger payload), the producer expects the message to be processed, while an event is just a broadcast with no expectation of anyone acting on it.
An event is either discrete (such as a notification) or part of a sequence, such as IoT streams. Streams are good for statistical evaluation. For instance, if you are tracking inventory through a stream of events, you can analyze how much stock has moved in the last two hours.
Pick based on the pattern: a queue (such as a Service Bus queue) when the producer expects the message to be processed, an event broker (such as Event Grid) for discrete events, and a streaming service for event sequences.
Many times, a combination of these services is used. Say you have an e-commerce application: when new orders come in, you send a message to a queue that the inventory system listens to in order to confirm and update stock. Since orders are infrequent, the inventory system would constantly poll for no reason. To mitigate that, we can have Event Grid listen to the Service Bus queue and trigger the inventory system whenever a new message comes in. Event Grid helps convert this inefficient polling model into an event-driven model.
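As a minimal sketch of the queue side of this flow, assuming an Azure Service Bus queue and using placeholder names for the connection string, queue, and payload, the azure-servicebus Python package can put the order message on the queue:

```python
# pip install azure-servicebus
import json
from azure.servicebus import ServiceBusClient, ServiceBusMessage

CONNECTION_STRING = "<service-bus-connection-string>"  # placeholder
QUEUE_NAME = "orders"                                  # placeholder queue name

order = {"order_id": "o-1", "items": [{"product_id": "p-123", "quantity": 2}]}

# The producer expects this message to be processed by the inventory system downstream.
with ServiceBusClient.from_connection_string(CONNECTION_STRING) as client:
    with client.get_queue_sender(queue_name=QUEUE_NAME) as sender:
        sender.send_messages(ServiceBusMessage(json.dumps(order)))
```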
Identity is one of the most critical elements in your application. How you authenticate and authorize users is a considerable design decision. Oftentimes, identity is the first thing considered before apps are moved to the cloud. Users need to authenticate to the application, and the application itself needs to authenticate to other services such as Azure.
The recommendation is to outsource identity to an identity provider. Storing credentials in your own databases makes you a target for attacks. By using an Identity as a Service (IDaaS) solution, you outsource the problem of credential storage to experts who can invest the time and resources in securely managing credentials. Whichever service you choose, make sure it supports passwordless authentication and MFA, which are more secure. OpenID Connect and OAuth are the standard protocols for authentication and authorization.
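As a sketch of the application-to-service side, assuming an app registration in Microsoft Entra ID (Azure AD) and placeholder credentials, the MSAL library can acquire a token using the OAuth client credentials flow:

```python
# pip install msal
import msal

TENANT_ID = "<tenant-id>"          # placeholder
CLIENT_ID = "<client-id>"          # placeholder
CLIENT_SECRET = "<client-secret>"  # placeholder; prefer certificates or managed identities

app = msal.ConfidentialClientApplication(
    CLIENT_ID,
    authority=f"https://login.microsoftonline.com/{TENANT_ID}",
    client_credential=CLIENT_SECRET,
)

# Acquire an access token for a downstream API (Microsoft Graph used here as an example).
result = app.acquire_token_for_client(scopes=["https://graph.microsoft.com/.default"])
if "access_token" in result:
    access_token = result["access_token"]  # attach as a Bearer token on outgoing calls
```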
We defined requirements, picked the application architecture, made some technology choices, and now it is time to think about how to design the actual application.
No matter how good the infrastructure is and how good a technology choice you’ve made, if the application is poorly designed or is not designed to operate in the cloud, you will not meet your requirements.
In the cloud, failures are inevitable. That is the nature of distributed systems: we rarely run everything on one node. Here is a checklist of things you need to account for:
Asynchronous communication is when services talk to each other using events or messages. The web was built on synchronous (direct) request-response communication, but it becomes problematic when you start chaining multiple requests, such as in a distributed transaction, because if one call fails, the entire chain fails.
Say we have a distributed transaction in a microservices application. A customer places an order, which is processed by the order service, which in turn makes a direct call to the payment service to process the payment. If the payment service is down, the entire chain fails and the order will not go through. With asynchronous communication, the customer places an order, the order service puts an order request in a queue, and when the payment service is up again, the payment is processed and an order confirmation is sent to the user. Small design decisions like this can have a massive impact on revenue.
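A minimal sketch of the consuming side of such a queue, again with placeholder names: the worker drains messages and completes each one only after it has been processed, so an outage delays processing instead of failing the whole chain:

```python
# pip install azure-servicebus
import json
from azure.servicebus import ServiceBusClient

CONNECTION_STRING = "<service-bus-connection-string>"  # placeholder
QUEUE_NAME = "orders"                                  # placeholder queue name

def process_payment(order: dict) -> None:
    # Stand-in for the real payment logic.
    print(f"processing payment for order {order['order_id']}")

with ServiceBusClient.from_connection_string(CONNECTION_STRING) as client:
    with client.get_queue_receiver(queue_name=QUEUE_NAME, max_wait_time=30) as receiver:
        for message in receiver:
            order = json.loads(str(message))
            process_payment(order)
            receiver.complete_message(message)  # only remove from the queue after success
```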
You might think scalability is as simple as defining a set of auto-scaling rules. However, your application should first be designed to become scalable.
The most important design principle is having a stateless application, which means logic should not depend on a value stored on a specific instance such as a user session (which should be stored in a shared cache). The simple rule here is: don’t assume your request will always hit the same instance.
Consider using connection pooling as well: re-using established database connections instead of creating a new connection for every request. It also helps avoid SNAT port exhaustion, especially when you’re operating at scale.
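As a brief sketch of what connection pooling looks like in practice, assuming SQLAlchemy and a placeholder PostgreSQL connection string:

```python
# pip install sqlalchemy psycopg2-binary
from sqlalchemy import create_engine, text

# Placeholder connection string; the pool settings are the point of the example.
engine = create_engine(
    "postgresql+psycopg2://user:password@db-host:5432/shop",
    pool_size=10,        # keep up to 10 connections open and re-use them
    max_overflow=5,      # allow 5 extra connections under burst load
    pool_pre_ping=True,  # validate a connection before handing it out
)

# Each request borrows a pooled connection instead of opening a new one.
with engine.connect() as conn:
    product_count = conn.execute(text("SELECT count(*) FROM products")).scalar()
```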
An application consists of various pieces of logic and components working together. If these are tightly coupled, it becomes hard to change the system: changing one component means coordinating with multiple other systems.
Loosely coupled services means you can change one service without changing the other. Asynchronous communication, as explained previously, helps with designing loosely coupled components.
Also, using a separate database per microservice is an example of loose coupling (changing the schema of one DB will not affect any other service, for instance).
Lastly, ensure services can be deployed individually without having to re-deploy the entire system.
If performance is crucial, then caching is a no-brainer; cache management is the responsibility of the client application or an intermediate server.
Caching, essential for performance, involves storing frequently accessed data in fast storage to reduce latency and backend load. It’s suitable for data that’s often read but seldom changed, like product prices, minimizing database queries. Managing cache expiration is critical to ensure data remains current. Techniques include setting max-age headers for client-side caching and using ETags to identify when cached data should be refreshed.
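As a minimal sketch of max-age and ETag handling, using Flask purely for illustration and made-up data, the handler returns 304 Not Modified when the client’s cached copy still matches:

```python
# pip install flask
import hashlib
from flask import Flask, Response, jsonify, request

app = Flask(__name__)

PRODUCT = {"id": "p-123", "name": "Espresso Machine", "price": 249.0}  # made-up data

@app.get("/products/p-123")
def get_product():
    body = jsonify(PRODUCT).get_data()
    etag = hashlib.md5(body).hexdigest()

    # If the client's cached copy is still current, let it keep using it.
    if request.headers.get("If-None-Match") == etag:
        return Response(status=304)

    resp = Response(body, mimetype="application/json")
    resp.headers["ETag"] = etag
    resp.headers["Cache-Control"] = "max-age=300"  # clients may cache for 5 minutes
    return resp
```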
Server-Side Caching: Server-side caching often employs Redis, an in-memory database ideal for fast data retrieval due to its RAM-based storage, though not suitable for long-term persistence. Redis enhances performance by caching data that doesn’t frequently change, using key-value pairs and supporting expiration times (TTLs) for data.
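A brief cache-aside sketch with the redis-py client, assuming a local Redis instance and a hypothetical database lookup:

```python
# pip install redis
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def load_product_from_db(product_id: str) -> dict:
    # Stand-in for a real database query.
    return {"id": product_id, "name": "Espresso Machine", "price": 249.0}

def get_product(product_id: str) -> dict:
    """Cache-aside: check Redis first, fall back to the database, then populate the cache."""
    cache_key = f"product:{product_id}"
    cached = r.get(cache_key)
    if cached is not None:
        return json.loads(cached)

    product = load_product_from_db(product_id)
    r.set(cache_key, json.dumps(product), ex=600)  # TTL: expire after 10 minutes
    return product
```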
Client-Side Caching: Content Delivery Networks (CDNs) are often used to cache data closer to frontend clients, and most often it is static data. In web lingo, static data refers to files such as CSS, images, and HTML. Those are often stored in CDNs, which are nothing more than a set of reverse proxies (spread around the world) between the client and the backend. The client will usually request these static resources from CDN URLs (sometimes called edge servers), and the CDN will return the static content along with TTL headers; if the TTL expires, the CDN will request the static resources again from the origin.
You’ve defined business objectives & requirements, made technology choices, and coded your application using cloud native design patterns.
This section is about how you architect your solution in the cloud, framed through a cloud-agnostic framework known as the Well-Architected Framework, encompassing reliability, performance, security, operational excellence, and cost optimization.
Note that many of these are related. Having scalability will help with reliability, performance, and cost optimization. Loose coupling between services will help with both reliability and performance.
A reliable architecture must be fault tolerant and must survive outages automatically. We discussed how to increase reliability in our application code through retry mechanisms, circuit-breaking patterns, and so on. This is about how we ensure reliability is at the forefront of our architecture.
Here is a summary of things to think about when it comes to reliability:
The design patterns you implement in your applications will influence performance. This section focuses more on the architecture.
Security is about embracing two mindsets: zero trust and defense in depth.
While the application, infrastructure, and architecture layers are important, we need to ensure we have the operations in place to support our workloads. I boiled them down to four main practices.
If costs are not optimized, your solution is not well architected; it is as simple as that. Here are things you should be thinking about:
I hope this overview has captured the essentials of building cloud-native applications, highlighting the shift required from traditional development practices. While not exhaustive, I’m open to exploring any specific areas in more detail based on your interest. Just let me know in the comments, and I’ll gladly delve deeper.