Configure per-partition automatic failover for Azure Cosmos DB

This article explains how to configure per-partition automatic failover (PPAF) on your Azure Cosmos DB account.

Per-partition automatic failover (PPAF) is an Azure Cosmos DB feature that improves availability for single-write region accounts. Instead of failing over an entire database account during a regional outage, Azure Cosmos DB can automatically fail over at the partition level, which minimizes downtime and accelerates recovery.

Prerequisites

Before enabling PPAF, ensure your environment meets the following prerequisites:

  • Multi-region account: Single-write region account with at least one other read region configured.
  • Consistency model: The account must use Strong, Session, Consistent prefix, or Eventual consistency. Bounded staleness isn't currently supported; support is planned for a future release.
  • API type: The account must use the API for NoSQL (formerly the Core (SQL) API).
  • Azure region: The account must be in a global Azure region.
  • SDK version: Your application must use a supported Azure Cosmos DB SDK that implements PPAF logic. The following SDK versions are supported:
    • .NET SDK v3: v3.60.0 or later
    • Java SDK: v4.79.0 or later
    • Python SDK: v4.16.0 or later
    • Node.js SDK: v4.7.0 or later
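The prerequisites above can be turned into a quick pre-flight check. The following is a minimal sketch, not an official tool: the `AccountSummary` type, the `ppaf_blockers` function, and the consistency and SDK key strings are hypothetical names chosen for illustration, and the sketch assumes plain numeric version strings such as `4.16.0`. It covers only the region, consistency, and SDK version checks; the API type and Azure region requirements still need to be verified separately.

```python
from dataclasses import dataclass

# Values copied from the prerequisites list; the key spellings are illustrative.
SUPPORTED_CONSISTENCY = {"Strong", "Session", "ConsistentPrefix", "Eventual"}
MIN_SDK_VERSION = {
    "dotnet": (3, 60, 0),
    "java": (4, 79, 0),
    "python": (4, 16, 0),
    "nodejs": (4, 7, 0),
}


@dataclass
class AccountSummary:
    """Hypothetical summary of an account and the SDK an application uses."""
    write_regions: list
    read_regions: list
    consistency: str
    sdk: str
    sdk_version: str


def parse_version(text):
    """Turn '4.16.0' or 'v4.7' into a three-part tuple of integers."""
    parts = [int(p) for p in text.lstrip("v").split(".")]
    parts += [0] * (3 - len(parts))  # pad short versions such as '4.7'
    return tuple(parts[:3])


def ppaf_blockers(summary):
    """Return a list of unmet prerequisites; an empty list means none found."""
    problems = []
    if len(summary.write_regions) != 1:
        problems.append("account must have exactly one write region")
    if len(summary.read_regions) < 1:
        problems.append("account needs at least one other read region")
    if summary.consistency not in SUPPORTED_CONSISTENCY:
        problems.append(f"consistency '{summary.consistency}' is not supported")
    minimum = MIN_SDK_VERSION.get(summary.sdk)
    if minimum is None:
        problems.append(f"no PPAF-capable version listed for SDK '{summary.sdk}'")
    elif parse_version(summary.sdk_version) < minimum:
        problems.append(f"{summary.sdk} SDK {summary.sdk_version} is too old")
    return problems


if __name__ == "__main__":
    candidate = AccountSummary(
        write_regions=["West US 2"],
        read_regions=["East US"],
        consistency="Session",
        sdk="python",
        sdk_version="4.16.0",
    )
    print(ppaf_blockers(candidate) or "No blockers found")
```

Running a check like this against every application instance before enabling the feature helps catch a single outdated client early.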

How to enable PPAF on your Azure Cosmos DB account

You can enable PPAF by using the Azure portal, Azure CLI, or Azure PowerShell.

Important

Before you enable per-partition automatic failover, confirm that your account meets every requirement in the Prerequisites section and that all application instances are upgraded to a supported SDK version. Enabling PPAF with an unsupported SDK or a misconfigured account can cause availability issues, including failed writes during a partition-level failover.

  1. Sign in to the Azure portal.

  2. Navigate to your Azure Cosmos DB account.

  3. In the left menu, select Features under the Settings section.

  4. Select Per-partition automatic failover.

  5. Review the information and prerequisites, and then switch to Enable PPAF.

    [Screenshot: the per-partition automatic failover feature in the Azure portal, with the Enable toggle highlighted.]

PPAF pricing

PPAF is part of the Business Critical service tier and is charged accordingly. For more information, see Azure Cosmos DB pricing.

Configure the application for PPAF

Your application's Azure Cosmos DB SDK must be configured correctly so that it can handle partition-level failovers.

  • Upgrade SDK: Make sure every application instance runs an SDK version that supports PPAF, as listed in Prerequisites.
  • Configure secondary region: Make sure your Azure Cosmos DB account has at least one secondary region.
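On the client side, the SDK needs to know which regions it may use. The following is a minimal sketch, assuming the Python SDK (`azure-cosmos`) and its `preferred_locations` keyword argument on `CosmosClient`. The helper name `build_client_kwargs` and the region names are hypothetical, and this article doesn't state whether a given SDK version needs further PPAF-specific client options, so treat this only as a starting point.

```python
def build_client_kwargs(write_region, secondary_regions):
    """Build keyword arguments for azure.cosmos.CosmosClient.

    The write region is listed first, followed by the secondary regions
    in the order given. Raises ValueError if no secondary region exists,
    because the account would then have nowhere to fail over to.
    """
    others = [r for r in secondary_regions if r != write_region]
    if not others:
        raise ValueError("at least one secondary region is required")
    return {"preferred_locations": [write_region] + others}


if __name__ == "__main__":
    kwargs = build_client_kwargs("West US 2", ["East US", "North Europe"])
    print(kwargs)
    # With the azure-cosmos package installed, the result could be used as:
    #   from azure.cosmos import CosmosClient
    #   client = CosmosClient(account_url, credential=account_key, **kwargs)
    # where account_url and account_key are placeholders for your own values.
```

Keeping the region list in one helper makes it easier to keep every application instance consistent with the regions configured on the account.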

Test the PPAF setup (simulate a fault)

After you configure the account and client, validate that everything works as expected before a real outage occurs. Azure Cosmos DB provides a partition failure simulation capability for PPAF-enabled accounts:

  • Partition failure simulation: The partition failure simulation capability for PPAF is available through the REST API. For ease of use, a PowerShell script is provided to manage the simulation.

    • Download the script EnableDisableChaosFault.ps1 from azurecosmosdb/ppaf-samples.

    • Start PowerShell and sign in to your subscription by running az login.

    • Navigate to the folder that contains the PowerShell script and invoke it with the required parameters to inject the fault:

      .\EnableDisableChaosFault.ps1 -FaultType "PerPartitionAutomaticFailover" -ResourceGroup "{ResourceGroupName}" -AccountName "{DatabaseAccountName}" -DatabaseName "{DatabaseName}" -ContainerName "{CollectionName}" -SubscriptionId "{SubscriptionId}" -Region "{PreferredWriteRegion}" -Enable

      • It might take up to 15 minutes for the simulation to take effect.
      • The simulation is applied to 10% of the partitions in the specified collection, with a maximum of 10 partitions and a minimum of 1 partition.
      
  • Application testing: Test critical transactions of your application during the failover.

  • Metrics:

    • Verify the traffic in the Azure portal Metrics blade for your account. Look at metrics like Total Requests broken down by region. You should see write operations occurring in a secondary region during the simulation, confirming the failover worked.
    • A new metric named PartitionWriteGlobalStatus reports the count of write partitions for a region at any given time. Use this metric to track how many partitions failed over during the simulation.
  • Stop the simulation: Invoke the same script with the -Disable switch to stop the partition failure simulation. It might take up to 15 minutes for the simulation to stop.

    .\EnableDisableChaosFault.ps1 -FaultType "PerPartitionAutomaticFailover" -ResourceGroup "{ResourceGroupName}" -AccountName "{DatabaseAccountName}" -DatabaseName "{DatabaseName}" -ContainerName "{CollectionName}" -SubscriptionId "{SubscriptionId}" -Region "{PreferredWriteRegion}" -Disable
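
When you read the PartitionWriteGlobalStatus metric during a simulation, it helps to know roughly how many partitions to expect in the secondary region. The sketch below applies the scope rule stated earlier (10% of partitions, at least 1, at most 10). The function name is hypothetical, and rounding 10% down is an assumption, because this article doesn't say how fractional results are handled.

```python
def simulated_partition_count(total_partitions):
    """Estimate how many partitions the fault simulation affects.

    Rule from the article: 10% of the partitions, minimum 1, maximum 10.
    Assumption: a fractional 10% is rounded down before the limits apply.
    """
    if total_partitions < 1:
        raise ValueError("a container has at least one partition")
    ten_percent = total_partitions // 10  # assumed rounding: floor
    return max(1, min(10, ten_percent))


if __name__ == "__main__":
    for total in (5, 40, 500):
        print(total, "partitions ->", simulated_partition_count(total))
```

For example, a container with 5 partitions yields 1 affected partition, 40 partitions yield 4, and 500 partitions are capped at 10. If the metric shows a noticeably different count once the simulation has taken effect, recheck the parameters passed to the script.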