Geo-redundancy behaviors and troubleshooting

Recovering your Blue Planet software platform after a failure requires an understanding of how disaster recovery is supported. Use these topics to help you prepare for and troubleshoot a system failure.

This section includes:

Geo-redundancy behaviors
Tips for troubleshooting geo-redundancy
Geo-redundancy event details

Geo-redundancy behaviors

The following geo-redundancy behaviors are supported:

The warm standby mechanism replicates persisted state to another site over a link with significant latency (up to 1 second) and limited bandwidth (10-100 Mbps). The redundancy covers the major databases, including Cassandra, Datomic, and Galera. Temporary caches used by other subsystems are not supported.
Existing data on the active site is replicated to the standby site.
Monitor replication status to ensure databases are replicating even as changes are made to the active site.
Failing over to make the standby site the active site results in no data loss, except for transactions in progress at the time of failover.
Reclaim previously active geo-redundant sites after failover.
Stop and resume replication processes, for example during a site software upgrade.
Replication resumes automatically after a geo-redundant link failure (of up to 16 hours).
Components on the standby site are not active except for the databases.

Synchronization data flows from the active to the standby site only.
Failover is triggered manually. Failover time is measured in minutes because the applications need to start.
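
As a rough planning aid for the bandwidth range given above, the following Python sketch estimates how long it might take to replicate a given amount of existing data. It is an illustration only and not part of Blue Planet: the function name transfer_time_hours and the 45 GB data size are hypothetical, and the estimate assumes that the full link rate is available to replication, that there is no protocol overhead, and that sizes use decimal units (1 GB = 8,000 megabits).

```python
def transfer_time_hours(data_gigabytes: float, link_mbps: float) -> float:
    """Estimate the hours needed to move data_gigabytes over a link_mbps link.

    Assumptions (not from the product documentation): the whole link is
    available for replication, overhead is ignored, and 1 GB = 8,000 megabits.
    """
    if link_mbps <= 0:
        raise ValueError("link_mbps must be positive")
    megabits = data_gigabytes * 8000      # gigabytes -> megabits
    seconds = megabits / link_mbps        # megabits / (megabits per second)
    return seconds / 3600                 # seconds -> hours


# Hypothetical 45 GB data set at both ends of the 10-100 Mbps range.
slow = transfer_time_hours(45, 10)    # 10.0 hours
fast = transfer_time_hours(45, 100)   # 1.0 hour
print(f"10 Mbps: {slow:.1f} h, 100 Mbps: {fast:.1f} h")
```

Under these assumptions, the same data set takes ten times longer at the low end of the range than at the high end, so the time needed for the initial replication depends heavily on the link between the sites.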

Tips for troubleshooting geo-redundancy

Once you complete the restore to the backup server and the backup server is the only server where activity is performed, you must restart all external resource adapter (RA) solutions on the backup server (using solman). If your solution does not contain RAs, you do not need to perform a restart.
Do not exit during a backup or restore for cold geo-redundancy. Because geored runs in the background, exiting the command window causes all the data being backed up to be lost. For example, pressing Ctrl+C exits geored and displays the following status message: Deleted geored server configuration database.

If you accidentally exit geored during backup, the geored application:
- Deletes the commands you completed in the basic geored configuration steps.
- Fails to complete any future scheduled backups until you perform the basic geored configuration steps again.

If you install the geo-redundancy server on another server (not the backup server) and it has not yet started, run /geored/src/runServer.sh from the Docker folder. You can also use ps fax to confirm that the process is running. The output may look similar to: /usr/bin/python geored/src/georedServer.py -b -p 2000 -l debug.
To leave your geored console temporarily and return later without affecting any process, press Ctrl+A, then d.

Use list_servers <hostname> to view current tasks, active/backup server associations, and restore snapshot queues.
For command-line syntax and descriptions, enter ? at the command line to list the commands, and enter help <command> to view the syntax description for a specific command.
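
The ps fax check described in the tips above can also be scripted. The following Python sketch is illustrative only and is not shipped with geored: the function names list_processes and find_geored_server are hypothetical, and the sketch assumes that the server process shows georedServer.py in its command line, as in the sample output above.

```python
import subprocess
from typing import List


def list_processes() -> str:
    """Return the raw text output of `ps fax` (the command named in the tip)."""
    result = subprocess.run(
        ["ps", "fax"], capture_output=True, text=True, check=True
    )
    return result.stdout


def find_geored_server(ps_output: str) -> List[str]:
    """Return the lines of ps output that mention georedServer.py.

    Assumption: the geo-redundancy server appears with georedServer.py in
    its command line. An empty list means no such process was listed.
    """
    return [
        line.strip()
        for line in ps_output.splitlines()
        if "georedServer.py" in line
    ]
```

For example, find_geored_server(list_processes()) returns an empty list when no matching process is listed, which suggests that the server still needs to be started with runServer.sh.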

Geo-redundancy event details

Several fault situations are communicated using events and alarms. The following list contains some of the behaviors supported by Blue Planet geo-redundancy disaster recovery.

Handling incorrect, invalid, or out-of-bounds input
Event module restarts
Hardware issues (card pulls)
Loss of association
Event or alarm bursts
Busy or slow networks

Blue Planet supports all interfaces between micro-services, resource adapters, and any other component in a solution. Details on support for individual interfaces, whether through the user interface or the REST API, are documented in the user documentation or the Swagger API.

To view geo-redundancy event information, use the alarm viewer. For details on how to use the alarm viewer, see Viewing and managing alarms in the alarm viewer.
