Book title: Solutions Architect's Handbook
Authors: Saurabh Shrivastava, Neelanjali Srivastav, Kamal Arora
Updated: 2025-03-30 21:13:01

High availability and resiliency

The one thing an organization doesn't want to see is downtime. Application downtime can cause a loss of business and user trust, which makes high availability one of the primary factors when designing the solution architecture. Uptime requirements vary from application to application.

If you have an external-facing application with a large user base, such as an e-commerce website or a social media platform, then uptime as close to 100% as possible becomes critical. An internal application accessed by employees, such as an HR system or an internal company blog, can tolerate some downtime. Achieving high availability is directly associated with cost, so a solution architect always needs to plan for high availability as per the application requirements, to avoid over-architecting.
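To make the trade-off between uptime targets and tolerable downtime concrete, the following minimal sketch converts an availability percentage into a yearly downtime budget. The function name and the sample targets are illustrative assumptions, not figures from this chapter; the arithmetic simply multiplies the hours in a non-leap year by the allowed unavailability.

```python
# Illustrative sketch: turn an availability target into a yearly downtime budget.
# The function name and the sample targets below are assumptions for this example.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the maximum downtime (hours per year) an availability target allows."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        hours = downtime_hours_per_year(target)
        print(f"{target}% uptime allows about {hours:.2f} hours of downtime per year")
```

Under these assumptions, a 99% target leaves roughly 87.6 hours of downtime per year, 99.9% leaves about 8.8 hours, and 99.99% leaves under one hour. Each additional "nine" shrinks the budget tenfold, which is one way to see why higher availability tends to cost more.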

To achieve a high availability (HA) architecture, it's better to spread workloads across isolated physical data center locations so that if an outage happens in one place, your application replica can operate from another location.
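The benefit of a second location can be estimated with simple probability. The sketch below is a simplification that assumes locations fail independently of each other and that traffic switches over instantly; neither assumption comes from this chapter, and real outages are often correlated. The function name is hypothetical.

```python
# Illustrative sketch: availability of a workload replicated across locations.
# Assumes independent failures and instant failover, which real systems only approximate.
def combined_availability(single_location: float, locations: int) -> float:
    """Probability that at least one of `locations` replicas is up.

    `single_location` is the availability of one location as a fraction (0.0 to 1.0).
    """
    if not 0.0 <= single_location <= 1.0:
        raise ValueError("single_location must be a fraction between 0 and 1")
    if locations < 1:
        raise ValueError("at least one location is required")
    # The workload is down only when every location is down at the same time.
    return 1 - (1 - single_location) ** locations


if __name__ == "__main__":
    print(f"{combined_availability(0.99, 2):.4f}")
```

With these assumptions, two locations that are each available 99% of the time give a combined availability of about 99.99%, because both would have to fail at once for the application to go down.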

As shown in the following architecture diagram, you have a web and application server fleet available in two separate availability zones (physically separate data center locations). The load balancer distributes the workload between the two availability zones. If Availability Zone 1 goes down due to a power or network outage, Availability Zone 2 can handle user traffic, and your application will stay up and running.
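The routing behavior described above can be sketched in a few lines. This is a toy model, not a real load balancer API: the `Server` and `ZoneAwareLoadBalancer` classes, the zone names, and the round-robin policy are all assumptions chosen for illustration.

```python
# Toy model of a load balancer spreading requests across two availability zones.
# All class names, zone names, and the round-robin policy are illustrative assumptions.
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Server:
    name: str
    zone: str
    healthy: bool = True


class ZoneAwareLoadBalancer:
    def __init__(self, servers: Iterable[Server]) -> None:
        self.servers: List[Server] = list(servers)
        self._next = 0  # round-robin counter

    def mark_zone_down(self, zone: str) -> None:
        """Simulate a power or network outage that takes out a whole zone."""
        for server in self.servers:
            if server.zone == zone:
                server.healthy = False

    def route(self) -> Server:
        """Pick the next healthy server, skipping any zone that is down."""
        healthy = [s for s in self.servers if s.healthy]
        if not healthy:
            raise RuntimeError("no healthy servers in any availability zone")
        server = healthy[self._next % len(healthy)]
        self._next += 1
        return server


if __name__ == "__main__":
    lb = ZoneAwareLoadBalancer([
        Server("web-1", "az-1"), Server("web-2", "az-1"),
        Server("web-3", "az-2"), Server("web-4", "az-2"),
    ])
    lb.mark_zone_down("az-1")
    print([lb.route().name for _ in range(4)])  # only az-2 servers are chosen
```

Once a zone is marked down, every request is routed to the surviving zone, so that zone needs enough capacity to carry the full load on its own.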

In the case of the database, you have a standby instance in Availability Zone 2, which will take over and become the primary instance in the event of an issue in Availability Zone 1. The primary and standby instances continuously sync data:

High availability and resilience architecture
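The database failover described above can be illustrated with a small in-memory model. It is a sketch under stated assumptions, not how any particular database engine works: the class names are hypothetical, replication is modeled as a synchronous copy on every write, and resynchronizing a recovered instance is left out.

```python
# In-memory sketch of a primary/standby database pair with failover.
# Class names are hypothetical; replication is modeled as a synchronous copy per write.
class DatabaseInstance:
    def __init__(self, name: str, zone: str) -> None:
        self.name = name
        self.zone = zone
        self.data: dict = {}
        self.available = True


class ReplicatedDatabase:
    def __init__(self, primary: DatabaseInstance, standby: DatabaseInstance) -> None:
        self.primary = primary
        self.standby = standby

    def fail_over(self) -> None:
        """Promote the standby to primary when the current primary is unavailable."""
        if not self.standby.available:
            raise RuntimeError("no available instance to promote")
        self.primary, self.standby = self.standby, self.primary

    def write(self, key: str, value: str) -> None:
        if not self.primary.available:
            self.fail_over()
        self.primary.data[key] = value
        # Keep the standby in sync so it can take over without losing data.
        # Resyncing an instance that was down during writes is out of scope here.
        if self.standby.available:
            self.standby.data[key] = value

    def read(self, key: str) -> str:
        if not self.primary.available:
            self.fail_over()
        return self.primary.data[key]


if __name__ == "__main__":
    db = ReplicatedDatabase(DatabaseInstance("db-1", "az-1"),
                            DatabaseInstance("db-2", "az-2"))
    db.write("order-42", "paid")
    db.primary.available = False      # simulate an outage in Availability Zone 1
    print(db.read("order-42"), "served from", db.primary.zone)
```

Because every write was copied to the standby before the outage, the promoted instance can serve the same data from Availability Zone 2.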

The other factor is the architecture's resiliency. When your application is in trouble and you are facing an intermittent issue, apply the principle of self-healing: your application should be able to recover by itself without human intervention.

For your architecture, resiliency can be achieved by monitoring the workload and taking proactive action. As shown in the preceding architecture, the load balancer monitors the health of instances. If any instance stops responding to requests, the load balancer can take the bad instance out of the server fleet and tell autoscaling to spin up a new server as a replacement. The other proactive approach is to monitor the health metrics of all instances, such as CPU and memory utilization, and spin up new instances as soon as a working instance starts to reach a threshold limit, for example, when CPU utilization is higher than 70% or memory utilization is more than 80%.
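The two monitoring rules above can be expressed as a small decision function. This is a minimal sketch: the 70% CPU and 80% memory limits come from the text, while the `Instance` type, the `plan_actions` function, and the choice to launch exactly one extra instance when any working instance crosses a threshold are assumptions made for illustration.

```python
# Sketch of the self-healing and proactive scaling rules described above.
# The thresholds come from the text; the types, names, and "launch one extra
# instance" policy are illustrative assumptions.
from dataclasses import dataclass
from typing import List, Tuple

CPU_THRESHOLD = 70.0      # percent
MEMORY_THRESHOLD = 80.0   # percent


@dataclass
class Instance:
    name: str
    responding: bool       # result of the load balancer health check
    cpu_percent: float
    memory_percent: float


def plan_actions(fleet: List[Instance]) -> Tuple[List[str], int]:
    """Return (names of instances to replace, number of extra instances to launch)."""
    # Rule 1: instances that fail the health check are removed and replaced.
    to_replace = [i.name for i in fleet if not i.responding]
    # Rule 2: if any working instance crosses a threshold, add capacity.
    overloaded = [
        i for i in fleet
        if i.responding
        and (i.cpu_percent > CPU_THRESHOLD or i.memory_percent > MEMORY_THRESHOLD)
    ]
    extra = 1 if overloaded else 0
    return to_replace, extra


if __name__ == "__main__":
    fleet = [
        Instance("web-1", True, 35.0, 40.0),
        Instance("web-2", False, 0.0, 0.0),    # failed health check
        Instance("web-3", True, 82.0, 55.0),   # CPU above 70%
    ]
    print(plan_actions(fleet))
```

Keeping the decision separate from the action makes the rules easy to test; in practice, a managed autoscaling service would evaluate comparable rules and perform the replacement and launch steps.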

The attributes of high availability and resiliency can also help in terms of cost by achieving elasticity. For example, if server utilization is low, you can take out some servers and save costs. The HA architecture goes hand in hand with self-healing: you can make sure that your application is up and running, but you also need quick recovery to maintain the desired user experience.