High Availability (HA) refers to a site being up for as long as possible. This means there is enough infrastructure in the right locations to ensure there is no single point of failure that could take the site down.
Both Failover and Load Balancing are means to achieve High Availability:
- Load Balancing spreads the load of the application across multiple application servers or multiple web servers to help smooth out the peaks if there is a lot of traffic all at once. Load Balancing is one piece of the puzzle when implementing high availability.
- Failover protects web applications or websites from downtime by having redundant equipment that can take over if one part fails.
Scaling Up with Load Balancing
At Six Feet Up, we commonly use the open source load balancer HAProxy because it is simple, easy to set up, and reliable. It also has some nice features for layer 7 inspection of requests. Because HAProxy is aware of HTTP headers, it can make routing decisions on its own: send requests with specific User-Agent headers to different backends, reject requests from specific spiders, or send specific application requests to alternate backends for special handling.
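To make the idea of header-aware routing concrete, here is a minimal Python sketch of the kind of decision a layer 7 load balancer makes. It is not HAProxy configuration and does not reflect how HAProxy is implemented; the backend names, the blocked spider name, and the "/reports/" path rule are all hypothetical values chosen for illustration.

```python
from typing import Mapping, Optional

# Hypothetical rules, for illustration only.
BLOCKED_AGENTS = ("badspider",)           # spiders we want to reject
MOBILE_MARKERS = ("iphone", "android")    # crude User-Agent markers


def choose_backend(path: str, headers: Mapping[str, str]) -> Optional[str]:
    """Return the name of the backend for a request, or None to reject it."""
    agent = headers.get("User-Agent", "").lower()

    # Reject requests from specific spiders.
    if any(bot in agent for bot in BLOCKED_AGENTS):
        return None

    # Send specific application requests to an alternate backend.
    if path.startswith("/reports/"):
        return "reports_backend"

    # Send specific user agents to a different backend.
    if any(marker in agent for marker in MOBILE_MARKERS):
        return "mobile_backend"

    return "default_backend"
```

A real load balancer evaluates rules like these for every request before any application server sees it, which is why the rules need to stay cheap and simple.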
Many cloud providers offer some kind of equivalent. For example, AWS offers Elastic Load Balancers (ELB). Cloud infrastructure offers the ability to run applications in multiple regions or zones and designate load balancers for failover across them. If one region is unresponsive, the system will switch to a different load balancer in a different region.
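The failover behavior described above can be sketched in a few lines of Python. This is only an illustration of the selection logic, not how any cloud provider implements it: the function name, the endpoint names, and the health check callback are assumptions made for the example.

```python
from typing import Callable, Optional, Sequence


def pick_load_balancer(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first healthy endpoint in priority order, or None if all are down."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None


# Hypothetical endpoints, listed in order of preference.
REGIONAL_ENDPOINTS = ["lb.region-a.example.com", "lb.region-b.example.com"]
```

In practice the health check would be a periodic probe with a timeout, and the switch would usually happen at the DNS or routing layer instead of in application code.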
Load Balancing by itself typically isn't enough to achieve full high availability, and it can make troubleshooting tricky. When debugging an application behind a load balancer, it can be difficult to know whether successive requests are reaching the same server. It is therefore best to debug the application outside the load balancer, against a single server.
Preventing Single Points of Failure with Failover Setup
Relying on multiple elements in the same location poses a risk if something happens to that location. A single database server poses a risk too: any component along the request chain that has no backup is a single point of failure (SPoF), because the site goes down if it breaks. One way to combat SPoFs is to implement as much network redundancy (multiple pieces of infrastructure doing the same job) as possible.
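A rough way to see why redundancy helps is to compute the availability of several redundant copies of one component. The sketch below assumes that the copies fail independently of one another and that a single working copy is enough to keep the service up. Both are simplifying assumptions: copies that share a location, power feed, or cooling system do not fail independently. The function name is ours.

```python
def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant components, assuming independent failures.

    `single` is the availability of one component as a fraction, e.g. 0.99.
    The group is down only when every copy is down at the same time.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - single) ** copies
```

Under those assumptions, two copies of a component that is 99% available give roughly 99.99% availability for the pair.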
Many cloud providers provide much of the infrastructure, such as switches and routers, under the covers, but it is still important to consider eliminating SPoF in the application. It is critical to ensure that, if a region were to have some kind of issue, the application could switch to run in another region or zone without much effort.
If you are running your own infrastructure, you will need to consider many areas of the infrastructure to ensure that they have the proper amount of redundancy. It may not be obvious, but this goes all the way down to the cooling systems in the datacenter. If the cooling fails, it will be tough to keep your application running when it is too hot to run the servers. As with most things in life, you are going to have to decide which compromises you are willing to make to keep the costs reasonable.
Exponential Uptime Costs
The biggest consideration when assessing your needs for High Availability is to identify the risks and determine which ones you can live with. Can the site be down for a few minutes? Can it be down for a few hours? Is anyone going to care? Will money be lost over this?
The more nines of uptime you seek (the percentage of time your site is available during any given period), the more expensive it is. For example, achieving five nines of uptime (i.e., 99.999%, or about 26 seconds of interruption per 30-day month) is really expensive, and striving for six nines (99.9999%, or about 2.6 seconds of downtime per month) costs exponentially more.
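The downtime figures above follow from simple arithmetic, sketched here in Python. The function name is ours, and the calculation assumes a 30-day month.

```python
SECONDS_PER_30_DAY_MONTH = 30 * 24 * 60 * 60  # 2,592,000 seconds


def allowed_downtime_seconds(
    nines: int,
    period_seconds: float = SECONDS_PER_30_DAY_MONTH,
) -> float:
    """Downtime budget in seconds for an uptime target with the given number of nines.

    Three nines means 99.9% uptime, five nines means 99.999%, and so on.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailable_fraction = 10.0 ** (-nines)
    return period_seconds * unavailable_fraction
```

Calling allowed_downtime_seconds(5) gives about 25.9 seconds and allowed_downtime_seconds(6) gives about 2.6 seconds. Each extra nine cuts the downtime budget by a factor of ten, which is part of why each one is so much harder to reach than the last.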
As a matter of fact, achieving six nines is very difficult. It requires a lot of planning and lots of additional moving parts. And all those moving parts require additional maintenance. The real kicker here is that the complexity of the setup itself can cause unintended outages.
Keeping web assets up and running can be critical. It is best to look at all options very closely before choosing one. There are good arguments for each method, but you have to choose what works best for you. Even then, you will have to test and change your process many times before finding what works for you.
Do you have questions about failover, load balancing or High Availability? What are you using to achieve HA? Let's talk!