Managing HA

This topic describes how Blue Planet HA works.

The following behaviors apply when a failure occurs in a multi-node HA cluster:

  • The leaders of the individual apps may be hosted on different nodes in your cluster.

  • If a host fails and that host holds the leader for an app, the leader for that app moves to another node.

  • You must have at least three hosts in a cluster, with at least two hosts alive at all times. If you have more than three hosts, the total number of hosts must be odd.

  • If a node cannot recover (for example, due to hardware failure), you must create a new multi-node HA cluster to ensure continuity of operation. For details on creating a new multi-node HA cluster, see your specific Blue Planet installation guide.

Review the following troubleshooting details if you have issues with HA.

Checking the integrity of your HA cluster

To determine whether there are any issues in your HA cluster, run the following commands in your root directory as the bpadmin user.

bp2-site check-ilan
bp2-site check-platform
sudo bp2-site diff-site-config
get-hosts-config
bpssh hostname

You may be required to reenter your bpadmin password. The server should respond 'OK' or provide the requested details.
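
If you run these checks often, a small wrapper can save typing. The following is a minimal sketch, assuming it is run as the bpadmin user on a cluster host; the command list is taken from the checks above, and the script simply reports any check that exits with a non-zero status.

#!/bin/bash
# ha_check.sh (hypothetical name): run the standard HA checks in sequence
# and report any that exit with a non-zero status.
failed=0
checks=(
  "bp2-site check-ilan"
  "bp2-site check-platform"
  "sudo bp2-site diff-site-config"
  "get-hosts-config"
  "bpssh hostname"
)
for cmd in "${checks[@]}"; do
  echo "== ${cmd}"
  if ! ${cmd}; then
    echo "** FAILED: ${cmd}"
    failed=1
  fi
done
exit ${failed}

A non-zero exit status from the script means at least one check needs attention.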

To clear sync or swarm errors, use the following commands in sequence.

# sudo bp2-site sync-site-config
# bp2-site restart sw

Finding your leader node in a cluster

To view the leader node for an app, use the following command example to display details about the hosts where the app (bpocore in this example) is installed. The example output shows instance 0 as the bpocore leader ('leader': 0); it is running in container c4ff5ade0178 on host ip-10-206-31-212. Substitute the appropriate app name for the app in question.

You may need to substitute the repository listed from bpdr.io.
# sudo solman
# api clusters bpocore get

{'group': 'bpocore',
'kind': '#cluster',
'leader': 0,
'name': 'bpocore',
'peers': {'0': {'container': 'c4ff5ade0178',
                 'host': 'ip-10-206-31-212',
                 'init': True,
                 'ip': '172.16.2.24',
                 'phase': 'running',
                 'restore': False},
           '1': {'container': '31f2dd0ee7c1',
                 'host': 'ip-10-206-31-134',
                 'init': True,
                 'ip': '172.16.3.24',
                 'phase': 'running',
                 'restore': False},
           '2': {'container': '5d7eff554c62',
                 'host': 'ip-10-206-31-3',
                 'init': True,
                 'ip': '172.16.1.24',
                 'phase': 'running',
                 'restore': False}},
 'rank': [2, 1, 0]}
#
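
If you have captured the output above to a file (bpocore_cluster.txt is a hypothetical name), a short shell sketch such as the following can pull out the leader instance and its peer entry. It assumes the field layout shown in the example output.

# Minimal sketch: find the leader instance in a saved copy of the
# 'api clusters bpocore get' output and print its peer details.
leader=$(grep -o "'leader': [0-9]*" bpocore_cluster.txt | awk '{print $2}')
echo "bpocore leader is instance ${leader}"

# Print the leader's peer block (container, host, init, ip, phase, restore).
grep -A 5 "'${leader}': {'container'" bpocore_cluster.txt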

Listing all apps in a cluster

To see a list of all apps in a cluster:

# sudo solman
# api clusters get

The output of this command includes all of the clusters.
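
You can capture this listing to a file in the same way. The following minimal sketch pulls out the cluster names, assuming each cluster in the listing carries a 'name' field like the bpocore example above; the file name clusters.txt is hypothetical.

# Minimal sketch: list the cluster names found in a saved copy of the
# 'api clusters get' output.
grep -o "'name': '[^']*'" clusters.txt | awk -F"'" '{print $4}' | sort -u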

Backing up a failing HA cluster node

The following behaviors apply when a failure occurs in a multi-node HA cluster:

  • The leaders of the individual apps may be hosted on different nodes in your cluster.

  • If a host fails and that host holds the leader for an app, the leader for that app moves to another node.

  • You must have at least three hosts in a cluster, with at least two hosts alive at all times. If you have more than three hosts, the total number of hosts must be odd.

  • If a node cannot recover (for example, due to hardware failure), you must create a new multi-node HA cluster to ensure continuity of operation. For details on creating a new multi-node HA cluster, see your specific Blue Planet installation guide.

If you have a failing node in an HA cluster, you must back up the solution by running the backup command on a remaining healthy node.

To recover, you copy a snapshot from one of the working nodes of the failing cluster to a new cluster. First create the new cluster by performing the steps in your specific Blue Planet installation guide, then run the following on one of the working nodes:

# sudo solman
# solution_backup [--label=your_label] <solution_name>

The snapshot name displays after the backup is complete.
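
As a usage example, the optional --label flag shown above can tag the snapshot so it is easy to identify later. In this sketch the solution name bp2 and the label value are hypothetical; substitute your own.

# sudo solman
# solution_backup --label=pre_rebuild_2024_01_15 bp2

Record the snapshot name that displays; it is required for the restore step in the next section.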

Restoring an existing snapshot to a new cluster

Once your new cluster is running, restore the snapshot to this cluster.

# sudo solman
# solution_restore <solution_name> <snapshot_name>

Wait several minutes to ensure all processes have time to restart. Remove all remnants of the previous cluster promptly to avoid any conflicts.
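
One way to confirm that the restore has settled is to re-check the app clusters, as in the leader example above, and verify that every peer reports the running phase. The sketch below assumes you have again captured the 'api clusters bpocore get' output to a file with a hypothetical name; the 'phase' field comes from the example output shown earlier.

# Minimal sketch: confirm that every bpocore peer in a saved
# 'api clusters bpocore get' dump reports the 'running' phase.
total=$(grep -c "'phase':" bpocore_cluster.txt)
running=$(grep -c "'phase': 'running'" bpocore_cluster.txt)
echo "${running}/${total} peers running"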

Restarting an RA solution after snapshot restore

After restoring a snapshot from a single-host system into a multi-host system, Blue Planet cannot terminate an existing Firefly vFW service and create a new service. Creation and termination of services fails on the virtual customer premises equipment (vCPE) sub-resource of the Firefly vFW service. To work around this issue, restart the RA solution after you restore the snapshot.

Leaving or rejoining the HA cluster

A server can leave a cluster in three ways:

  • Through a system failure

  • Through a link or communication failure

  • Through a planned request by an administrator, such as service maintenance

In all of these scenarios, users switch over to the software components on another server in the cluster, and the cluster continues to provide service to the network. Failover takes a few seconds, and users experience only a brief service disruption.

After a server that has left the cluster for any reason is ready to rejoin the cluster, the server automatically rejoins the cluster as long as the connectivity to the other servers is available. The software components on the returning server reestablish communications and synchronize with the software components on the other servers. No manual restore from a backup is necessary.

For a catastrophic failure, however, backup and restore is still necessary for recovery.
