High Availability in the Public Cloud: Designing a Resilient Architecture (Azure / AWS / GCP / Infomaniak)

By Martin Moraz

Summary – Faced with financial and reputational risks from outages, cloud resilience relies on a multi-AZ and multi-region architecture, a segmented network, and redundant managed databases, complemented by SLAs/SLOs/SLIs with error budgets and RTO/RPO targets. IaC, recovery testing, and chaos engineering, backed by centralized monitoring, validate operational robustness. Solution: deploy a modular multi-zone infrastructure aligned with Swiss requirements, manage risk with error budgets, and automate processes to ensure optimal availability.

In an environment where downtime can lead to financial losses and reputational damage, achieving high availability in the public cloud demands a proactive strategy. It’s not just about choosing a provider—it involves thoughtfully architecting a multi-Availability Zone (AZ) and multi-region setup, segmenting networks, deploying redundant databases, and validating recovery scenarios.

Analyzing SLAs, defining SLOs/SLIs, and controlling error budgets allow for continuous risk management. Infrastructure as Code (IaC) automation, regular testing, and chaos engineering further strengthen resilience. Finally, balancing cost against over-provisioning and addressing Swiss regulatory requirements ensures a robust and compliant solution.

Designing a Multi-AZ and Multi-Region Architecture

A highly available infrastructure is built first on a multi-AZ model, then extended to multiple regions. Adhering to network best practices and leveraging managed services enhances resilience.

Active-Passive vs. Active-Active Multi-AZ Model

Deploying across multiple Availability Zones isolates localized failures. In an active-passive setup, one zone handles primary traffic while the other remains on standby, ready to take over.

In an active-active configuration, each AZ carries part of the workload, delivering seamless fault tolerance and failover without noticeable interruption. This setup requires continuous data synchronization and session balancing.

Example: A financial services company implemented an active-active cluster across two Azure regions. During a critical AZ failure, this architecture maintained transaction continuity and kept the actual recovery time to a few seconds.

Networks: Front-End and Back-End Subnets

Segmenting networks into front-end and back-end subnets improves both security and reliability. The front-end hosts public entry points, while the back-end contains business services and databases.

Each subnet can be replicated across multiple AZs so that the loss of one segment doesn’t compromise the entire platform. Access control lists (ACLs) and security groups further segregate traffic.
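
As a rough illustration of this layout, the sketch below carves one front-end and one back-end subnet per AZ out of a single address range using Python's standard ipaddress module. The 10.0.0.0/16 range, the /24 subnet size, and the zone names are assumptions chosen for the example, not values prescribed by any provider.

```python
import ipaddress

def plan_subnets(parent_cidr, zones, tiers=("front-end", "back-end"), new_prefix=24):
    """Assign one child subnet to every (tier, zone) pair of a parent network."""
    network = ipaddress.ip_network(parent_cidr)
    blocks = network.subnets(new_prefix=new_prefix)  # generator of child networks
    plan = {}
    for tier in tiers:
        for zone in zones:
            plan[(tier, zone)] = next(blocks)
    return plan

# Hypothetical address range and zone names, for illustration only.
plan = plan_subnets("10.0.0.0/16", ["zone-a", "zone-b", "zone-c"])
for (tier, zone), cidr in plan.items():
    print(f"{tier:9} {zone}: {cidr}")
```

Keeping the plan in one function makes it easy to add a zone later without renumbering the existing subnets by hand.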

Load Balancers and Zone-Redundant Managed Databases

Native cloud load balancers distribute traffic across instances and AZs. They continuously monitor service health and automatically reroute traffic upon detecting failures.
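
The behavior can be pictured with a small simulation. This is only a sketch of health-check-based routing, not the algorithm of any particular cloud load balancer; the Backend class, the zone names, and the round-robin policy are all assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import cycle

@dataclass
class Backend:
    name: str
    zone: str
    healthy: bool = True  # result of the most recent health check

def route(requests, backends):
    """Round-robin requests over healthy backends; fail if none remain."""
    pool = [b for b in backends if b.healthy]
    if not pool:
        raise RuntimeError("no healthy backend available")
    rotation = cycle(pool)
    return {request: next(rotation).name for request in requests}

# Hypothetical fleet with one instance per zone.
backends = [
    Backend("web-1", "zone-a"),
    Backend("web-2", "zone-b"),
    Backend("web-3", "zone-c"),
]

# Simulate zone-b failing its health checks.
for backend in backends:
    if backend.zone == "zone-b":
        backend.healthy = False

assignments = route(["r1", "r2", "r3", "r4"], backends)
print(assignments)
```

After the simulated failure, traffic is spread over the two remaining zones only, which is why each zone needs enough spare capacity to absorb its share of a failed neighbor's load.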

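Zone-redundant managed databases (Azure SQL, AWS RDS Multi-AZ, Google Cloud SQL) offer synchronous or asynchronous replication. Synchronous replication, typical within a region, preserves committed writes through a failover, whereas asynchronous replication, common across regions, can lose the most recent transactions. Failover is largely transparent to applications, although clients usually need to reconnect.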

Ensuring Reliability with SLAs, SLOs, SLIs, RTO and RPO

SLAs represent contractual commitments, but only SLOs/SLIs and error budgets drive ongoing risk management. RTO and RPO objectives structure recovery planning.

Decoding SLAs and Service Credits

The Service Level Agreement (SLA) specifies uptime targets (e.g., 99.99%) and offers service credits when commitments aren’t met. However, credits seldom offset real business impact.

A 99.99% uptime target allows roughly 52 minutes of downtime per year. Understanding failure granularity (duration, frequency) and credit eligibility criteria avoids unexpected outcomes.
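
The arithmetic behind that figure is simple enough to script. The snippet below is a minimal sketch that converts an availability percentage into a yearly downtime allowance; it assumes a 365-day year, and the list of targets is only an example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years

def allowed_downtime_minutes(availability_pct, period_minutes=MINUTES_PER_YEAR):
    """Downtime permitted over a period for a given availability target."""
    return period_minutes * (1 - availability_pct / 100)

for target in (99.0, 99.9, 99.95, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.1f} minutes per year")
```

Note that providers often measure availability per month rather than per year, so the same percentage can translate into a different practical tolerance depending on the SLA's measurement window.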

SLOs, SLIs and Error Budgets for Risk Management

The Service Level Objectives (SLOs) set periodic operational thresholds, while Service Level Indicators (SLIs) measure service quality (latency, error rate).

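The error budget is the amount of unreliability an SLO permits: for a 99.9% objective, it is 0.1% of requests or time in the measurement window. Each outage consumes part of this budget, and the remaining margin guides the balance between innovation and stability.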
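
A minimal sketch of the calculation for a request-based SLO follows. The traffic volume and failure count are made-up numbers, and the function name is hypothetical.

```python
def error_budget(slo, total_events, bad_events):
    """Return (allowed_bad, remaining, consumed_fraction) for an event-based SLO."""
    allowed_bad = (1 - slo) * total_events
    remaining = allowed_bad - bad_events
    consumed = bad_events / allowed_bad if allowed_bad else float("inf")
    return allowed_bad, remaining, consumed

# Hypothetical month: 1,000,000 requests, 400 of them failed or too slow.
allowed, remaining, consumed = error_budget(
    slo=0.999, total_events=1_000_000, bad_events=400
)
print(f"allowed bad events: {allowed:.0f}")
print(f"remaining budget:   {remaining:.0f}")
print(f"budget consumed:    {consumed:.0%}")
```

A team might decide, for instance, to pause risky releases once most of the budget is consumed; the exact threshold is a policy choice rather than something the formula dictates.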

Example: A small e-commerce business established a 200 ms latency SLO for its APIs. By monitoring its error budget, it spotted a gradual latency increase from a software update and rolled back before customers were impacted, avoiding significant degradation.

RTO, RPO and Risk Prioritization

The Recovery Time Objective (RTO) defines the maximum acceptable downtime, while the Recovery Point Objective (RPO) specifies the maximum tolerable data loss. They inform recovery strategy design.

Organizations prioritize workflows based on criticality, selecting backup frequency, synchronous or asynchronous replication, and automatic or manual failover accordingly.

Example: A healthcare provider set a 30-minute RTO and a 5-minute RPO for its patient database. Combining frequent snapshots with asynchronous replication ensured continuity of records without notable data loss during a regional failover.
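
To compare candidate strategies against such targets, a simple check like the one below can help. It is a sketch under stated assumptions: the three plans and their timings are invented for illustration, with worst-case data loss approximated by the snapshot interval or the replication lag.

```python
from dataclasses import dataclass

@dataclass
class RecoveryPlan:
    name: str
    worst_case_data_loss_min: float  # snapshot interval or replication lag
    estimated_recovery_min: float    # detection + failover + validation

def meets_targets(plan, rto_min, rpo_min):
    """True when the plan stays within both the RTO and the RPO."""
    return (plan.estimated_recovery_min <= rto_min
            and plan.worst_case_data_loss_min <= rpo_min)

# Hypothetical plans; the timings are illustrative, not measured.
plans = [
    RecoveryPlan("hourly snapshots, manual restore", 60, 120),
    RecoveryPlan("5-minute snapshots, scripted restore", 5, 25),
    RecoveryPlan("async replication, automatic failover", 1, 5),
]

results = {p.name: meets_targets(p, rto_min=30, rpo_min=5) for p in plans}
for name, ok in results.items():
    print(f"{name}: {'meets targets' if ok else 'misses targets'}")
```

The estimated timings only become trustworthy once they have been measured during an actual recovery test.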

Automation, Testing and Chaos Engineering

Infrastructure as Code and recovery testing ensure reliable scaling. Game days and comprehensive monitoring bolster operational resilience.

Infrastructure as Code with Terraform

IaC enables versioning and repeatable deployments. Terraform’s multi-provider support ensures consistency across Azure, AWS, GCP, or Infomaniak.

Reusable modules standardize network, compute, and storage configurations. CI/CD pipelines apply changes automatically once they have passed code review.

Recovery Testing, Game Days and Chaos Engineering

Recovery tests simulate planned failures (AZ outages, instance shutdowns) to validate runbooks. They ensure teams can execute procedures under real-world conditions.

Game days, inspired by chaos engineering, introduce controlled disruptions in production or staging (network outages, CPU saturation). They uncover weaknesses and enhance overall robustness.
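
Before injecting a fault, it helps to write down the expected outcome. The sketch below expresses one possible steady-state hypothesis, "losing any single zone still leaves enough instances", as a check that can be run ahead of a game day. The fleet sizes and the minimum instance count are hypothetical.

```python
def zone_loss_report(instances_by_zone, min_healthy):
    """For each zone, report whether losing it still leaves min_healthy instances."""
    total = sum(instances_by_zone.values())
    return {
        zone: (total - count) >= min_healthy
        for zone, count in instances_by_zone.items()
    }

# Hypothetical fleet: instance counts per zone and the minimum needed at peak.
fleet = {"zone-a": 3, "zone-b": 2, "zone-c": 1}
report = zone_loss_report(fleet, min_healthy=4)

for zone, ok in report.items():
    status = "hypothesis holds" if ok else "expect degradation"
    print(f"lose {zone}: {status}")
```

In this made-up fleet, the uneven distribution means the loss of the largest zone would breach the minimum, which is the kind of weakness a game day is meant to surface before a real outage does.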

Monitoring and Alerting

A centralized monitoring solution (Prometheus, CloudWatch, Azure Monitor) collects metrics and logs. Dashboards provide a unified view of service health.

Alerts based on critical SLIs trigger automatic notifications (Slack, email) and escalation playbooks. Incidents are documented for thorough post-mortem analysis.
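
One common way to turn an SLI into an alert condition is the burn rate: the observed error rate divided by the error rate the SLO allows. The sketch below assumes a 99.9% SLO, and its paging threshold of 10 is an illustrative value rather than a recommendation.

```python
def burn_rate(error_rate, slo):
    """How many times faster than sustainable the error budget is being spent."""
    return error_rate / (1 - slo)

def should_page(error_rate, slo, threshold=10.0):
    """Page when the budget burns at least `threshold` times too fast.

    The default threshold is an assumption; tune it to your alerting policy.
    """
    return burn_rate(error_rate, slo) >= threshold

# Hypothetical readings from the last hour of traffic.
for observed in (0.0005, 0.002, 0.02):
    rate = burn_rate(observed, slo=0.999)
    action = "page on-call" if should_page(observed, slo=0.999) else "no page"
    print(f"error rate {observed:.2%}: burn rate {rate:.1f} -> {action}")
```

Alerting on burn rate rather than on raw error counts keeps notifications proportional to the risk of actually missing the objective.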

Costs, Trade-Offs and Swiss-Specific Considerations

Evaluating downtime costs versus over-provisioning helps optimize cloud budgets. Swiss latency, residency, and sovereignty requirements shape region choices.

Assessing Downtime Costs vs. Over-Provisioning

Downtime costs include lost revenue, contractual penalties, and reputational harm. They often far exceed the expense of enhanced redundancy.

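A return-on-investment calculation compares the expected annual cost of downtime (the hourly cost multiplied by the expected hours of outage, with the RTO serving as a per-incident ceiling) with the expenditures for multi-region replication or auto-scaling.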

Example: A manufacturing company estimated the cost of a production line halt at CHF 20,000 per hour. Implementing a multi-AZ cluster cost CHF 15,000 per year, less than the cost of a single hour of unplanned stoppage.
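
The comparison in this example can be written as a short calculation. The two cost figures come from the example itself; the four hours of avoided downtime per year is an assumption added purely to show the net-benefit formula.

```python
def break_even_hours(annual_redundancy_cost, hourly_downtime_cost):
    """Hours of avoided downtime per year needed to pay for the redundancy."""
    return annual_redundancy_cost / hourly_downtime_cost

def net_annual_benefit(hourly_downtime_cost, avoided_hours, annual_redundancy_cost):
    """Downtime cost avoided minus what the redundancy costs."""
    return hourly_downtime_cost * avoided_hours - annual_redundancy_cost

HOURLY_DOWNTIME_COST = 20_000  # CHF per hour, from the example
REDUNDANCY_COST = 15_000       # CHF per year, from the example
AVOIDED_HOURS = 4              # assumption for illustration

hours = break_even_hours(REDUNDANCY_COST, HOURLY_DOWNTIME_COST)
benefit = net_annual_benefit(HOURLY_DOWNTIME_COST, AVOIDED_HOURS, REDUNDANCY_COST)
print(f"break-even: {hours * 60:.0f} minutes of avoided downtime per year")
print(f"net benefit with {AVOIDED_HOURS} avoided hours: CHF {benefit:,.0f}")
```

With these figures, the redundancy pays for itself once it prevents 45 minutes of stoppage in a year.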

Swiss Considerations: Latency, Data Residency and Compliance

Hosting data in Switzerland or the EU helps meet sovereignty and compliance requirements (the Swiss Federal Act on Data Protection, GDPR, FINMA) while minimizing latency for local users.

Choosing Infomaniak or the Swiss regions of Azure, AWS, or Google Cloud limits cross-border data transit and simplifies audits. Availability commitments vary by provider and service, so compare each SLA against your own targets.

Guarantee Optimal Availability for Your Cloud Services

Building a multi-AZ, multi-region architecture combined with network segmentation and zone-redundant managed databases is the foundation of resilience. Differentiating SLAs from SLOs and leveraging error budgets enables proactive risk control. RTO and RPO targets guide recovery choices, while automation, testing, and chaos engineering validate processes. Finally, balancing cost against resilience and addressing Swiss requirements delivers a solution that is both robust and compliant.

Our experts can help you turn these best practices into a tailored action plan and design a scalable, secure architecture aligned with your business and regulatory needs.

Martin is a senior enterprise architect. He designs robust and scalable technology architectures for your business software, SaaS products, mobile applications, websites, and digital ecosystems. With expertise in IT strategy and system integration, he ensures technical coherence aligned with your business goals.

Frequently Asked Questions about Public Cloud High Availability

What is the difference between an active-passive and an active-active multi-AZ architecture?

Active-passive mode involves a primary site and a secondary standby site ready to fail over in case of an outage. Active-active mode distributes the load across all sites, providing continuity without noticeable interruption. Active-active requires continuous synchronization and load balancing but significantly reduces the RTO in case of failure.

How do you define SLOs and SLIs to manage cloud resiliency?

SLIs measure concrete indicators (latency, error rate) while SLOs set periodic targets to meet, for example 99.9% of responses under 200 ms. The associated error budget quantifies incident tolerance and guides trade-offs between innovation and stability.

How do you choose between Azure, AWS, GCP and Infomaniak for high availability?

The choice depends on data location, available managed services and sovereignty constraints. Infomaniak and the Swiss regions of Azure, AWS and GCP all allow data to stay in Switzerland. Beyond location, the decision should weigh each provider's specific features, internal expertise and overall cloud strategy.

What recovery scenarios should you implement to meet RTO and RPO?

You need to define backup and replication strategies based on RTO (acceptable downtime) and RPO (tolerable data loss). Synchronous replication brings RPO close to zero, asynchronous replication typically keeps it to seconds or minutes, and snapshots bound it by their interval. RTO depends mainly on how quickly failover is detected and executed, which can be automatic or manual depending on the use case.

How do you automate the deployment of a resilient architecture with Terraform?

Terraform allows you to version and standardize multi-cloud configurations using reusable modules. By declaring network, compute and storage resources, you generate a repeatable execution plan. CI/CD pipelines incorporate reviews and Infrastructure as Code testing, ensuring consistency and traceability before production deployment.

Why integrate chaos engineering into your reliability strategy?

Chaos engineering, through game days or random tests (network cuts, CPU overload), uncovers potential breaking points in an infrastructure. By addressing identified failures, you improve overall resilience and validate runbooks, while training teams for crisis situations.

How do you assess the error budget to balance innovation and stability?

The error budget represents the amount of unreliability (downtime or failed requests) the service may accumulate in a period while still meeting its SLO. By measuring each incident, you know the remaining margin to deploy new features without risking SLO breaches. This approach enables pragmatic operational and update management.

What are the best network practices for segmenting front-end and back-end in multi-AZ?

Create separate subnets for front-end and back-end in each AZ, with appropriate ACLs and security groups. This segmentation ensures that a failure or attack on the front end does not impact business services. Replicating subnets across multiple zones strengthens fault tolerance.
