ChillyBytes · The semantic way to achieve resiliency

This is going to be more of a theoretical article about most of the terms that exist within the domain of IT resiliency. They are “defined” already in countless blogs of IT consultancies - let’s try to paint a bigger picture of how they connect and how they all pay into the one common goal: Digital Resiliency.

For that, we will try to find analogies from outside of IT. This post is part of a three-post series and will be updated accordingly.

Redundancy

Redundancy refers to the practice of duplicating components or systems within an IT infrastructure to enhance fault tolerance and ensure continuous availability, so that a failure in one component does not result in system downtime. This is an essential aspect of IT resiliency, as it reduces the risk of disruptions and provides backup options in case of hardware or software failure.

Firstly, there is hardware redundancy, typically implemented in data centers: physical servers, networks, hard drives, etc. Secondly, there is software redundancy: virtual machines, DNS records, block storage, etc.

Hardware redundancy depends on our chosen cloud providers and their SLAs as well as on our own on-prem servers, if any. At ChillyBytes we know some of the complexity of hardware redundancy, but our consulting service stays in the domain of software redundancy - as does this post.

Ultimately, redundancy improves system reliability and performance under adverse conditions.

Think: Broken screen
If your TV fails at home because you smashed your PS5 controller into the screen during a round of Helldivers 2, then you might have another screen at home. If it is sitting in the basement, then you have an active-passive setup: The active screen is now broken, and it takes a few minutes until you have fetched the screen from the basement and connected it to make it your new active screen.
If your second screen was in the bedroom, already connected to the PS5 and powered on, this would be an active-active setup. The second screen might not be as good as the first one, e.g. it does not support 4K, but it will still be good enough until you have repaired the first screen.
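
The difference between the two setups can be sketched in a few lines of Python. This is a minimal illustration, not a real failover mechanism: the `Replica` class, the `failover` function and the 300-second warm-up time are all hypothetical names and values chosen to mirror the screen analogy.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    healthy: bool = True
    warm: bool = False  # True: already running (active-active); False: cold standby


def failover(replicas, warmup_seconds=300):
    """Return (replica, delay_seconds) for the first healthy replica.

    A warm replica takes over immediately; a cold standby first needs
    `warmup_seconds` (an assumed value) before it can serve traffic.
    """
    for replica in replicas:
        if replica.healthy:
            delay = 0 if replica.warm else warmup_seconds
            return replica, delay
    raise RuntimeError("no healthy replica available")


# The living-room screen is broken in both scenarios.
living_room = Replica("living-room", healthy=False, warm=True)
basement = Replica("basement")             # active-passive: has to be fetched first
bedroom = Replica("bedroom", warm=True)    # active-active: already connected

passive_choice, passive_delay = failover([living_room, basement])
active_choice, active_delay = failover([living_room, bedroom])
```

In the active-passive case the delay equals the warm-up time, while in the active-active case it is zero. That delay is the price you pay for not keeping the second screen running all the time.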

Recovery

With recovery come RTO (Recovery Time Objective) and RPO (Recovery Point Objective) - key metrics used to define recovery strategies in IT systems, particularly for disaster recovery and business continuity (covered in another part).

RTO refers to the maximum allowable downtime after an outage before it disrupts business operations, while RPO defines the maximum acceptable data loss in terms of time. Setting an appropriate RTO and RPO is essential for maintaining resiliency, as they guide the design of backup systems, data replication, and failover procedures.
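
To make the two metrics concrete, here is a small Python sketch that checks a single incident against its objectives. The function name `check_recovery` and all timestamps and thresholds are made up for illustration; the data-loss window is assumed to be the time between the last backup and the start of the outage.

```python
from datetime import datetime, timedelta


def check_recovery(last_backup, outage_start, restored, rto, rpo):
    """Compare one incident against its recovery objectives.

    Returns a tuple (rto_met, rpo_met).
    """
    downtime = restored - outage_start      # how long the service was down
    data_loss = outage_start - last_backup  # changes made since the last safe copy
    return downtime <= rto, data_loss <= rpo


# Hypothetical incident: backup at 12:00, outage at 12:50, back online at 12:53.
rto_met, rpo_met = check_recovery(
    last_backup=datetime(2024, 1, 1, 12, 0),
    outage_start=datetime(2024, 1, 1, 12, 50),
    restored=datetime(2024, 1, 1, 12, 53),
    rto=timedelta(minutes=5),
    rpo=timedelta(minutes=15),
)
```

In this example the service is back within the 5-minute RTO, but 50 minutes of data are lost, which violates the 15-minute RPO. The two objectives are independent: meeting one says nothing about the other.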

Think: Connection loss during an online multiplayer session
In an online game, you are defending democracy with your friends on a planet against bugs. Everything works well, you have almost won, and you and your friends have about 5 minutes left to reach the extraction point back to your home ship - plenty of time, or so you thought.
Of course, your Wi-Fi fails exactly now, and you lose the connection to your game and friends. Now the 5 minutes to reconnect and extract together with your friends to get all bonus points don’t seem like a lot of time anymore. Hence, your RTO just became anything less than 5 minutes, better 4. You don’t care much about your RPO since your game data and status were constantly streamed to the game server. Your local game was pretty much stateless: once you are connected again, the latest data is synced back from the online multiplayer server.

This architecture is not too far off from typical cloud-native architectures in enterprises: The app is stateless and continuously streams all data to the cloud, where it is stored redundantly. Once the app fails, you care mostly about how fast it is online again: your RTO.

Reliability

Reliability in IT refers to the consistent performance of systems and applications over time. It’s a fundamental aspect of IT resiliency, as it ensures that services are stable and perform their intended function for a given period. Reliable systems are designed with fault tolerance, robust monitoring, and regular maintenance to avoid issues that could compromise operations. High reliability increases confidence in system performance, ensuring that businesses can meet customer expectations and service level agreements (SLAs).

Think: Ungrateful companions
Continuing the example above, you managed to get online again within 3 minutes, yeah, but your friends filled the empty seat with some random guy who automatically connected once you disconnected. The multiplayer session was set to public since your “friends” knew about your sketchy Wi-Fi connection and just wanted some reliability to fulfill your function and finish the game with all bonus points.

Security

Security is a critical component of IT resiliency, ensuring that systems are protected from external threats and internal vulnerabilities. With increasing cyber-attacks and data breaches, securing IT infrastructures against threats like malware, ransomware, and DDoS (Distributed Denial of Service) attacks is essential to maintain service availability. Security also ensures the integrity of systems by protecting against unauthorized access, tampering, or data loss. When focusing on resiliency, organizations must integrate security at every level, including data, applications, networks, and devices.

DDoS

DDoS attacks are a very common threat to IT systems. They are cyber-attacks designed to overwhelm online services by flooding them with excessive traffic, making them unavailable to legitimate users. These attacks are a significant threat to IT resiliency, as they can cause service downtime, loss of revenue, and damage to brand reputation.

To protect against DDoS, organizations implement measures such as rate limiting, traffic filtering, and using DDoS protection services. Additionally, cloud-based systems can offer elastic scaling to absorb large volumes of attack traffic, mitigating the impact.
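
Rate limiting, the first measure mentioned above, is often implemented as a token bucket. The following is a minimal Python sketch of that idea; the class name `TokenBucket` and the chosen rate and capacity are illustrative assumptions, and the current time is passed in explicitly to keep the example deterministic.

```python
class TokenBucket:
    """Allow short bursts but cap the sustained request rate."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start with a full bucket
        self.last = 0.0           # timestamp of the previous request

    def allow(self, now):
        """Return True if a request arriving at time `now` (seconds) may pass."""
        elapsed = now - self.last
        # Refill according to the elapsed time, never beyond the capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Hypothetical limit: 1 request per second, bursts of up to 2.
bucket = TokenBucket(rate=1, capacity=2)
burst = [bucket.allow(0.0) for _ in range(3)]  # three requests at the same instant
later = bucket.allow(1.0)                      # one more request a second later
```

The first two requests of the burst pass, the third is rejected, and a second later one token has been refilled. In practice you would keep one bucket per client or IP address. Since a distributed attack comes from many senders at once, per-client limits are usually combined with the traffic filtering and protection services mentioned above.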

By preparing for DDoS attacks through resilient infrastructure and incident response plans, organizations can reduce downtime and maintain service availability. Proactively addressing DDoS risks strengthens overall IT security and contributes to a more resilient system.

Think: Full mailbox
You await a very important letter from your bank. Yes, in Germany banks still use plain old letters for important account-related business. Anyway, one morning after your coffee, you go outside to open your mailbox and find it completely full - not a single additional piece of paper would fit inside. 99% of those letters are spam and most of them even come from different senders.
The letter from the bank you waited for was not among them, and you have no idea if the mailman wanted to deliver it today and just couldn’t because of your full mailbox. If so, he might try again the next day, or not - you just don’t know. Over the next few days it is the same picture: a full mailbox every morning. You probably have to move.

Elasticity and Scalability

Elasticity and scalability are both essential for ensuring that IT systems can adapt to fluctuating workloads and remain resilient under varying demands. Elasticity refers to the ability of a system to automatically scale resources up or down based on real-time demand, ensuring optimal resource utilization.

Scalability, on the other hand, refers to the ability of a system to handle increased loads by adding more resources (scale out) or by increasing the capacity of the current resources (scale up) without compromising performance.

Usually scalability is the right term when discussing scheduled or predictable workload increases, whereas elasticity is used for automatic scaling (up or down) based on real-time workloads.

Cloud environments are particularly well-suited for elasticity and scalability, as they enable rapid provisioning of resources. These features contribute to IT resiliency by ensuring systems can withstand and recover from unexpected increases in demand.
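
An elastic system needs some rule that turns the current load into a resource count. A common approach is a proportional rule, similar in spirit to the one documented for the Kubernetes Horizontal Pod Autoscaler. The Python sketch below assumes a hypothetical metric (requests per second per replica) and made-up limits; it is not the implementation of any particular cloud provider.

```python
import math


def desired_replicas(current, load_per_replica, target_per_replica,
                     min_replicas=1, max_replicas=20):
    """Scale the replica count so each replica moves back towards its target load.

    `min_replicas` and `max_replicas` are assumed guard rails that keep the
    system from scaling to zero or growing without bound.
    """
    wanted = math.ceil(current * load_per_replica / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


# Four replicas, each targeted at 60 requests per second (hypothetical numbers).
scale_out = desired_replicas(current=4, load_per_replica=90, target_per_replica=60)
scale_in = desired_replicas(current=4, load_per_replica=15, target_per_replica=60)
capped = desired_replicas(current=10, load_per_replica=300, target_per_replica=60)
```

With these numbers the rule grows the system to 6 replicas under high load, shrinks it to 1 when load drops, and stops at the cap of 20 when demand exceeds what the limits allow. The upper limit is where elasticity ends and a scalability discussion starts: can the architecture handle more replicas at all?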

Think: Netflix series
Finally, it’s Tuesday, 7pm, and the next episode of your new favorite series on Netflix is released. It’s a bit annoying because it feels like we went back from streaming to a weekly release cycle like during the times of linear TV but hey - it is what it is.
The problem this week, though, is that the cliffhanger of the last episode was quite intense. Netflix apparently didn’t account for the increased number of concurrent viewers this week, as they predicted viewer counts would be around the same each week. Therefore, during the first 10 minutes you have to suffer the good old buffering screen several times, and you know it’s not because of your internet connection.
Netflix is still available but just cannot scale up as quickly as the demand caused by this cliffhanger requires. Encoding and streaming files takes computational resources, and Netflix’s architecture took a few minutes to automatically scale up those resources to match viewership.

High Availability

High availability (HA), as the name suggests, tries to maximize the time a system remains operational and accessible. This is achieved by designing systems to automatically recover from failures, either through hardware redundancy, software failover, or geographically distributed systems.

In an HA environment, components are often duplicated to ensure that if one part fails, another takes over seamlessly without impacting user experience. Ultimately, high availability is a key aspect of IT resiliency, as it minimizes the impact of disruptions on business operations (covered in another part).
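
How much duplication helps can be estimated with simple arithmetic. The Python sketch below assumes that components fail independently of each other, which is an idealization: shared power, network or software bugs often break that assumption. The function names and the 99% figure are illustrative.

```python
HOURS_PER_YEAR = 365 * 24


def redundant_availability(single, copies):
    """Availability of `copies` redundant components where one working copy suffices.

    Assumes independent failures: the system is down only if all copies are down.
    """
    return 1 - (1 - single) ** copies


def yearly_downtime_hours(availability):
    """Expected hours of downtime per year for a given availability."""
    return (1 - availability) * HOURS_PER_YEAR


one_copy = yearly_downtime_hours(0.99)
two_copies = yearly_downtime_hours(redundant_availability(0.99, 2))
```

Under these assumptions, a single component with 99% availability is down for roughly 87.6 hours per year, while two redundant copies reach about 99.99% and bring that down to under one hour.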

Think: Vacation time
There are quite a few systems in the real world where HA really matters. One of those is the airplane that takes you to your next vacation destination. You would really like 100% uptime (quite literally) when it comes to airplanes, but statistically speaking we are not there yet.
They are a good example, though, when it comes to HA of one (isolated) system that also includes lots of redundancy and automatic failovers. You are not a statistical outlier and will arrive at your destination safely.

Stay tuned for the next part of this series.