Wednesday, May 24, 2023

High Availability: Concept and Strategies.

To be always up and running is the goal of all IT services. 
Business resilience depends on how quickly mission-critical services are restored once there is a disruption. 

High Availability System (HA): 

HA is the ability of the overall system to stay resilient to failure by keeping services running through outages and failures, or at least minimizing the impact of those outages.

Only a very small amount of downtime is acceptable, e.g., 0.05 % (= 99.95 % uptime), for a system to be considered highly available.
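To make the uptime figure concrete, the short sketch below converts an uptime percentage into a yearly downtime budget. The function name and the 365-day year are assumptions made for illustration; the arithmetic is simply period × (100 − uptime) / 100.

```python
# Convert an availability target into a downtime budget.
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def allowed_downtime_hours(uptime_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the maximum downtime (in hours) that an uptime target allows."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_hours * (100 - uptime_percent) / 100


for target in (99.0, 99.9, 99.95, 99.99):
    hours = allowed_downtime_hours(target)
    print(f"{target}% uptime allows about {hours:.2f} hours of downtime per year")
```

With these assumptions, the 99.95 % target mentioned above leaves roughly 4.38 hours of downtime per year.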



Single Point Of Failure (SPOF): 

A SPOF is a component whose failure will cause the entire system to fail.

A single disk, whose failure would bring down a system, is a SPOF.

Computer systems without clustering represent a SPOF.

A single data center without a DR site is a SPOF.

Ultimately, of course, the Earth is a SPOF.


Fault Tolerance System (FT): 

FT is the ability of the system to continue operating even in the event of a failure.

This is achieved by making the individual components of a system resilient to failure.

No downtime is acceptable for a system component to be considered fault tolerant.


Disaster Recovery System (DR):
DR is the ability of the system to restore operations after a loss of service.

DR is the highest level and the most expensive component of an HA system, because DR implies protection against large-scale disasters, such as floods, fires, hurricanes, and explosions, that can take out an entire data center.

Cloud technologies have made DR much easier and less expensive.

Techniques for Building Fault-Tolerant Systems: 

1-    Redundancy: 
Redundancy means duplicating critical components or systems so that backups can seamlessly take over in case of a failure. 
A redundant system therefore provides failover or load balancing when a component fails.

Redundancy applies at both the hardware and the software level.
    Hardware: 
    Onboard HW components {Power supplies, Network interfaces, Storage controllers & Hard disks}, Servers, Storage, UPS, Switches, Routers.
    Software: 
    Backup data {Files & folders, Configuration files, Database files, Software files}
 
    For example:
    Server cluster: multiple servers are configured to handle the workload, and if one server fails, the others can immediately step in to ensure uninterrupted service.

2-    Failover: 
Failover is the process of automatically switching to a redundant system or component when a failure is detected.

The failover technique is used for stateful services or applications (whose data changes frequently), such as file share services.

Quorum: a membership algorithm that determines which nodes stay online (active) and which remain passive if they cannot communicate with each other. 

Heartbeat: a periodic signal (generated by hardware or software) exchanged between cluster nodes to indicate normal operation and availability.

Witness (disk or share): an important addition to the quorum that provides a tie-breaking vote to avoid a “split-brain” scenario.

Split-brain: the cluster divides into smaller clusters with equal numbers of nodes, each of which believes it is the only active cluster.
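The following is a minimal sketch of how heartbeats, quorum, and a witness fit together, assuming a simple "one node, one vote" majority rule in which the witness contributes one extra vote. The timeout value, node names, and function names are hypothetical and do not reflect any specific cluster product.

```python
HEARTBEAT_TIMEOUT = 5.0  # seconds of silence before a node is presumed down (assumed value)


def reachable_nodes(last_heartbeat: dict, now: float,
                    timeout: float = HEARTBEAT_TIMEOUT) -> set:
    """Return the nodes whose most recent heartbeat arrived within the timeout."""
    return {node for node, seen in last_heartbeat.items() if now - seen <= timeout}


def has_quorum(votes_held: int, total_votes: int) -> bool:
    """A partition may stay active only if it holds a strict majority of all votes."""
    return votes_held > total_votes / 2


# Heartbeat timestamps (in seconds) as seen from node "a" at time 100.
seen = {"a": 100.0, "b": 99.0, "c": 92.0, "d": 91.0}
partition = reachable_nodes(seen, now=100.0)  # only "a" and "b" are still reachable

# Four nodes split 2/2 without a witness: neither half holds a majority.
print(has_quorum(len(partition), total_votes=4))

# With a witness the total is 5 votes; the half that reaches the witness wins.
print(has_quorum(len(partition) + 1, total_votes=5))
print(has_quorum(2, total_votes=5))
```

The odd number of total votes is the point of the witness: an even split of nodes can no longer produce two partitions that both believe they should be active.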

3-    Load Balancing: 
Load balancing distributes the workload across multiple systems or components to prevent any single component from becoming overloaded and then failing. 

The load balancing technique is used for stateless services or applications (whose data rarely changes), such as web services.


Load balancers can intelligently route incoming requests to available resources, optimizing performance and minimizing the impact of failures.
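As an illustration of routing requests only to available resources, here is a small round-robin balancer that skips back ends marked as unhealthy. It is a sketch, not a production design: the class name, the back-end names, and the manual mark_down/mark_up calls (standing in for real health checks) are all assumptions.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out back ends in rotation, skipping any that are marked down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._ring = cycle(self.backends)

    def mark_down(self, backend):
        # In a real deployment a failed health check would trigger this.
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def next_backend(self):
        # Try each back end at most once per call before giving up.
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy back ends available")


lb = RoundRobinBalancer(["web1", "web2", "web3"])
lb.mark_down("web2")  # simulate a failed server
picks = [lb.next_backend() for _ in range(4)]
print(picks)  # ['web1', 'web3', 'web1', 'web3']
```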

4-    Error Detection and Recovery: 
This means continuously monitoring the system for errors or abnormal behavior, using mechanisms such as health checks, automated alerts, and system logs, so that any faults that arise are detected.
Once a fault is detected, the system should be capable of recovering from it automatically or with minimal manual intervention. 
    For example:
    Redundant storage systems with automatic error correction can detect and repair data errors without interrupting the system’s operation.
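The detect-and-repair idea can be sketched with checksums over mirrored copies. This is a simplified illustration, not how any particular storage system works: the replicas are in-memory byte arrays and the function names are hypothetical.

```python
import hashlib


def checksum(data: bytes) -> str:
    """SHA-256 digest used to detect silent corruption."""
    return hashlib.sha256(data).hexdigest()


def read_with_repair(replicas: list, expected: str) -> bytes:
    """Return verified data and overwrite any corrupted replica with a good copy."""
    good = next((bytes(r) for r in replicas if checksum(bytes(r)) == expected), None)
    if good is None:
        raise RuntimeError("all replicas are corrupted; restore from backup")
    for replica in replicas:
        if checksum(bytes(replica)) != expected:
            replica[:] = good  # repair in place while the read still succeeds
    return good


primary = bytearray(b"payroll-records")
mirror = bytearray(b"payroll-records")
expected = checksum(b"payroll-records")

primary[0] ^= 0xFF  # simulate a flipped byte on the primary copy
data = read_with_repair([primary, mirror], expected)
print(data == b"payroll-records", primary == mirror)
```

The read returns correct data from the healthy mirror, and the damaged copy is repaired as a side effect, so the caller never sees the fault.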

5-    Parallel Processing: 
Parallel processing breaks tasks down into smaller subtasks that can be processed simultaneously, so the system can continue functioning even if one or more components encounter failures.

Parallel processing distributes the workload across multiple resources, allowing the system to maintain its performance and availability.
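A minimal sketch of this idea, using Python's standard thread pool: each subtask runs independently, and a failure in one subtask is recorded without stopping the others. The worker function and the sample chunks are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_all(chunks, worker, max_workers=4):
    """Run worker on every chunk in parallel; collect results and failed indexes."""
    results, failed = {}, []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, chunk): i for i, chunk in enumerate(chunks)}
        for future in as_completed(futures):
            index = futures[future]
            try:
                results[index] = future.result()
            except Exception:
                failed.append(index)  # one failed subtask does not stop the rest
    return results, sorted(failed)


def flaky_sum(chunk):
    """Hypothetical worker that fails on an empty chunk."""
    if not chunk:
        raise ValueError("empty chunk")
    return sum(chunk)


results, failed = process_all([[1, 2], [], [3, 4]], flaky_sum)
print(results, failed)
```

The failed indexes can then be retried or reported, while the completed subtasks keep their results.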

What are the types of On-Premises Disaster Recovery Sites?


1-    Cold Site: 
The simplest type of disaster recovery site. 
It consists of elements providing power, networking capability, and cooling. It does not include other hardware elements such as switches, servers, and storage. Backup data and some additional hardware must be sent to the site and installed in the event of a disaster.

2-    Warm Site: 
Contains all the elements of a cold site and adds further elements, including storage, servers, and switches. 
A warm site cannot perform at the same level as the production center because it is not equipped in the same way.
Data synchronization between the primary and the secondary sites is performed daily or weekly, which can result in minor data loss.

3-    Hot Site: 
It is equipped with all the necessary hardware, software, and network connectivity.
It maintains up-to-date copies of data at all times (real-time synchronization). 
Hot sites are time-consuming to set up and more expensive than cold sites, but they dramatically reduce downtime.

What are the types of Cloud-Based Disaster Recovery Sites?

1-    Backup as a Service (BaaS): 
Similar to backing up data at a remote location: with Backup as a Service, a third-party provider backs up an organization’s data, but not its IT infrastructure.

2-    Disaster Recovery as a Service (DRaaS): 
An organization's on-premises infrastructure is backed up or replicated in a third-party, cloud-based environment. 
In the event of a disaster or ransomware attack, a DRaaS provider moves an organization’s computer processing to its own cloud infrastructure, allowing a business to continue operations seamlessly from the vendor’s location.

Disaster Recovery Goals: 
Disaster recovery has two primary goals:
1-    Return affected systems to an operational state as fast as possible.
2-    Return with as little data loss as possible.

The metrics for these two key goals are universally known as the recovery time objective (RTO) and the recovery point objective (RPO), respectively.

1-    Recovery time objective (RTO): Downtime
The planned length of time it takes to restore a business system to a fully operational state after a disaster occurs.
(the target time between the disaster and the moment service is restored).

2-    Recovery point objective (RPO): Data Loss
The maximum amount of data (in terms of the most recent changes) the company is willing to lose after a disaster occurs. 
(the data changes made between the last backup/replication and the time the disaster occurred).

3-    Recovery time actual (RTA): Downtime
The actual length of time it takes to restore a business system to a fully operational state after a disaster occurs.
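The relationship between these metrics can be worked through with timestamps. The sketch below is illustrative only: the function name, the sample times, and the 4-hour RTO and 24-hour RPO targets are assumptions.

```python
from datetime import datetime, timedelta


def recovery_report(last_backup, disaster, restored, rto, rpo):
    """Compare what actually happened in an incident against the RTO and RPO targets."""
    data_loss_window = disaster - last_backup  # changes made in this window are lost
    rta = restored - disaster                  # actual downtime (RTA)
    return {
        "data_loss_window": data_loss_window,
        "rta": rta,
        "rpo_met": data_loss_window <= rpo,
        "rto_met": rta <= rto,
    }


report = recovery_report(
    last_backup=datetime(2023, 5, 24, 2, 0),   # nightly backup at 02:00
    disaster=datetime(2023, 5, 24, 9, 30),     # outage begins at 09:30
    restored=datetime(2023, 5, 24, 12, 30),    # service restored at 12:30
    rto=timedelta(hours=4),
    rpo=timedelta(hours=24),
)
print(report)
```

In this example 7.5 hours of changes are lost and the service is down for 3 hours, so both assumed targets are met. A shorter backup interval narrows the data loss window, while a faster restore process reduces the RTA.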

Backup Strategies


What is backup?

Backup is the process of creating a copy of the data on your system that you use for recovery in case your original data is lost or corrupted.

1-    The 3-2-1 backup strategy

Store 3X copies of your data (One copy is the production data and two backup copies).

2X local copies (on-site) but on different storage types.

1X copy off-site.
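A backup inventory can be checked against this rule mechanically. The sketch below assumes the common reading of 3-2-1 (at least three copies, at least two storage types, at least one copy off-site); the DataCopy class and the sample inventory are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    name: str
    media: str      # storage type, e.g. "disk", "tape", "object storage"
    offsite: bool   # True if the copy is kept away from the primary site


def satisfies_3_2_1(copies) -> bool:
    """Check for at least 3 copies, 2 storage types, and 1 off-site copy."""
    return (
        len(copies) >= 3
        and len({c.media for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )


inventory = [
    DataCopy("production", "disk", offsite=False),
    DataCopy("local backup", "tape", offsite=False),
    DataCopy("cloud backup", "object storage", offsite=True),
]
print(satisfies_3_2_1(inventory))      # all three conditions hold
print(satisfies_3_2_1(inventory[:2]))  # only two copies, none off-site
```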

2-    The 3-2-1-1-0 backup strategy

Store 3X copies of your data (One copy is the production data and two backup copies).

2X local copies (on-site) but on different storage types.

1X copy off-site.

1X copy offline, air-gapped, or immutable (no access).

0X errors after SureBackup recovery verification.

3-    The 4-3-2 backup strategy

Store 4X copies of your data (One copy is the production data and three backup copies).

3X locations for your data (one on-prem with you, one with an MSP like Continuity Centers, and one stored with a cloud provider).

2X locations off-site, granting higher data protection against disasters and targeted attacks.

Which Backup Strategy Is Right for You?

First, any backup strategy is better than no backup strategy. As long as it meets the core principles of 3-2-1 backup, you can still get your data back in the event of a natural disaster, a lost laptop, or an accidental deletion.

For stronger protection, choose 3-2-1-1-0 or 4-3-2, which give your data an additional layer of protection by virtually isolating it so it can’t be deleted or encrypted.


