Manual Chapter: Troubleshooting Minimal Downtime Upgrade Issues
Applies To: BIG-IQ Centralized Management
- 6.1.0
What do I do if there is not enough disk space?
If there is not enough disk space to install the new software, you need to extend the /var partition. The default size of the /var file system in a newly installed node is 30 GB. This volume size might be insufficient to store your data. To learn how to extend this file system to a larger size, refer to K16103: Extending disk space on BIG-IQ Virtual Edition at support.f5.com/csp/article/K16103.
Because upgrading a node requires at least two volumes, you must ensure that both volumes can have their /var file system extended to the same size, or upgrades might fail.
Symptom
If the message UCS restore failure displays during software installation, it might be due to insufficient disk space. To determine if this is the case:
- Log in to the BIG-IQ system or DCD on which the software is failing to install.
- Run the command tmsh show sys software. The system displays failed (UCS application failed; unknown cause) in response.
- Navigate to the liveinstall.log file in the /var/log/ folder. If the issue triggering the error message is insufficient disk space, the file contains the following error message:
info: capture: status 256 returned by command: F5_INSTALL_MODE=install F5_INSTALL_SESSION_TYPE=hotfix chroot /mnt/tm_install/9934.NdHXAL /usr/local/bin/im -force /var/local/ucs/config.ucs
info: >++++ result:
info: Extracting manifest: /var/local/ucs/config.ucs
info: /var: Not enough free space
info: 3404179456 bytes required
info: 2206740480 bytes available
info: /var/local/ucs/config.ucs: Not enough free disk space to install!
info: Operation aborted.
info: >----
info: Removing boot loader reference
Terminal error: UCS application failed; unknown cause.
*** Live install end at 2017/03/21 17:00:54: failed (return code 2) ***
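To confirm the shortfall quickly, you can pull the required/available byte counts out of the log excerpt and compute how much /var must grow. This is a minimal sketch: the check_var_space function name and the /tmp excerpt path are illustrative, not F5 tools, and the sample numbers are the ones from the log above.

```shell
# Sketch: extract the "bytes required" / "bytes available" lines from a
# liveinstall.log excerpt and report the shortfall on /var.
check_var_space() {
  # $1: path to a liveinstall.log (or an excerpt of one)
  awk '
    /bytes required/  { req   = $2 }
    /bytes available/ { avail = $2 }
    END {
      if (req > avail)
        printf "short by %d bytes\n", req - avail
      else
        print "enough space"
    }' "$1"
}

# Sample excerpt taken from the error message in this chapter.
cat > /tmp/liveinstall_excerpt.log <<'EOF'
info: /var: Not enough free space
info: 3404179456 bytes required
info: 2206740480 bytes available
EOF

check_var_space /tmp/liveinstall_excerpt.log
```

The reported shortfall tells you the minimum additional space the /var file system needs before retrying the installation.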
Recommended Actions
- Switch back to the older (pre-upgrade) volume.
- Log in to the BIG-IQ system or DCD on which the software is failing to install.
- Change to the pre-upgrade volume by running the command switchboot -b <old-volume-name>.
- Restart the device by running the command reboot.
- Clean up the audit log entries and snapshot objects.
- Retry the installation.
Data collection device cluster status is yellow
After upgrading a data collection device (DCD), if the cluster status is yellow (unhealthy) instead of green (healthy), there are a number of potential causes and corresponding corrective actions you can take to attempt to resolve the issue.
Connectivity issues in an upgraded data collection device
There could be unassigned replica shards in the cluster. Network connectivity issues can cause relocation of shards to a newly upgraded DCD to fail.
Symptoms
- Log in to the primary BIG-IQ DCD for the cluster.
- Navigate to the /var/log/elasticsearch/ directory.
- Examine the eslognode.log file to see if there is an add/remove/add cycle due to temporary network connectivity issues. If there are unassigned replica shards, you will find a pattern similar to this example:
[2017-05-03 10:49:20,885][INFO ][cluster.service ] [f992a8aa-49c8-47ba-a59a-2be863f3a042] added {{7c49c2ef-ad2f-41f5-ab15-e01084b20364}{_xdtaTl5RGSn-pIC192ORA}{10.10.10.2}{10.10.10.2:9300}{zone=Seattle, master=true},}, reason: zen-disco-join(join from node[{7c49c2ef-ad2f-41f5-ab15-e01084b20364}{_xdtaTl5RGSn-pIC192ORA}{10.10.10.2}{10.10.10.2:9300}{zone=Seattle, master=true}])
[2017-05-03 10:49:32,172][INFO ][cluster.service ] [f992a8aa-49c8-47ba-a59a-2be863f3a042] removed {{7c49c2ef-ad2f-41f5-ab15-e01084b20364}{_xdtaTl5RGSn-pIC192ORA}{10.10.10.2}{10.10.10.2:9300}{zone=Seattle, master=true},}, reason: zen-disco-node_failed({7c49c2ef-ad2f-41f5-ab15-e01084b20364}{_xdtaTl5RGSn-pIC192ORA}{10.10.10.2}{10.10.10.2:9300}{zone=Seattle, master=true}), reason transport disconnected
[2017-05-03 10:49:36,210][INFO ][cluster.service ] [f992a8aa-49c8-47ba-a59a-2be863f3a042] added {{7c49c2ef-ad2f-41f5-ab15-e01084b20364}{_xdtaTl5RGSn-pIC192ORA}{10.10.10.2}{10.10.10.2:9300}{zone=Seattle, master=true},}, reason: zen-disco-join(join from node[{7c49c2ef-ad2f-41f5-ab15-e01084b20364}{_xdtaTl5RGSn-pIC192ORA}{10.10.10.2}{10.10.10.2:9300}{zone=Seattle, master=true}])
The sample log file above illustrates the added/removed/added cycle: the node is added in the first entry, removed in the second (reason transport disconnected), and then added again in the third.
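Spotting this cycle by eye in a busy log can be tedious. The following sketch counts added/removed cluster.service events for one node ID; the flap_check function name is illustrative, the node ID and (abbreviated) log lines are from the sample above, and on a real system you would point it at /var/log/elasticsearch/eslognode.log.

```shell
# Sketch: count "added"/"removed" events for a node ID to detect the
# add/remove/add cycle caused by transient connectivity problems.
flap_check() {
  node="$1"; log="$2"
  adds=$(grep -c "added {{$node" "$log")
  removes=$(grep -c "removed {{$node" "$log")
  if [ "$adds" -ge 2 ] && [ "$removes" -ge 1 ]; then
    echo "node $node flapped: $adds adds, $removes removes"
  else
    echo "no flapping detected"
  fi
}

# Abbreviated excerpt of the sample eslognode.log entries above.
cat > /tmp/eslognode_excerpt.log <<'EOF'
[2017-05-03 10:49:20,885][INFO ][cluster.service ] [f992a8aa] added {{7c49c2ef-ad2f-41f5-ab15-e01084b20364}...
[2017-05-03 10:49:32,172][INFO ][cluster.service ] [f992a8aa] removed {{7c49c2ef-ad2f-41f5-ab15-e01084b20364}...
[2017-05-03 10:49:36,210][INFO ][cluster.service ] [f992a8aa] added {{7c49c2ef-ad2f-41f5-ab15-e01084b20364}...
EOF

flap_check 7c49c2ef-ad2f-41f5-ab15-e01084b20364 /tmp/eslognode_excerpt.log
```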
Recommended Actions
- Log in to the DCD that you just upgraded.
- Restart the ElasticSearch service by running the command bigstart restart elasticsearch.
Shard assignment can now succeed, so the cluster status should
change to healthy (green).
Index created in a later version
When a DCD is upgraded, ElasticSearch begins creating indexes on that DCD.
Because the upgraded DCD is running a different software version than the other DCDs in the
cluster, shards cannot be replicated between the DCDs. This can result in unassigned replica
shards in the cluster.
Symptoms
- Log in to the upgraded DCD.
- Check for unassigned replica shards by running this command: curl -s localhost:9200/_cat/shards?v | grep UNASSIGNED
- If there are no unassigned shards, the command produces no output. If there are unassigned shards, try the recommended actions.
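The grep pipeline above can be wrapped so it reports a count instead of raw lines. This is a sketch only: in production you would pipe the live output of curl -s localhost:9200/_cat/shards?v into it, while here a hypothetical sample listing (based on the index names used later in this chapter) stands in.

```shell
# Sketch: count UNASSIGNED shards in a _cat/shards listing.
count_unassigned() {
  # grep -c prints 0 and exits nonzero when nothing matches,
  # so "|| true" keeps the function's exit status clean.
  grep -c 'UNASSIGNED' "$1" || true
}

# Hypothetical sample listing with one unassigned replica shard.
cat > /tmp/shards_sample.txt <<'EOF'
index shard prirep state docs store ip node
websafe_2017-07-21t00-00-00-0700 0 p STARTED 0 159b 10.145.192.202 687a7b3b
websafe_2017-07-21t00-00-00-0700 0 r UNASSIGNED
EOF

count_unassigned /tmp/shards_sample.txt
```

A count of 0 corresponds to the healthy case (no output from the original grep); any nonzero count means replicas are waiting to be assigned.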
Recommended Actions
- For a single zone DCD cluster, continue the upgrade process by upgrading another DCD in the cluster. When another DCD in the cluster is running the same version, shards can begin replicating again and the cluster status should become healthy.
- For a multiple zone DCD cluster, continue the upgrade process but upgrade the next DCD in a different zone than the first DCD. When there are DCDs running the same version in different zones, shards can begin replicating again and the cluster status should become healthy.
Data collection device cluster status is red
After upgrading a data collection device (DCD), if the cluster status is red (unhealthy) instead of green (healthy), there are a number of potential causes and corresponding corrective actions you can take to attempt to resolve the issue.
Statistics replicas are not enabled
If statistics replicas were not enabled before you upgraded, the cluster will not create replicas of your data, and the cluster status will be unhealthy.
Recommended Actions
- Log in to the primary BIG-IQ for the DCD cluster.
- Navigate to the Statistics Retention Policy screen, expand the Advanced Settings, then select Enable Replicas.
- Log in to the most recently upgraded DCD.
- Change to the pre-upgrade volume by running the command switchboot -b <old-volume-name>.
- Reboot the DCD by running the command reboot.
- Wait for the rebooted DCD to join the cluster.
- Wait for the cluster status to return to green (indicating that the cluster has successfully replicated your data shards).
- Repeat the upgrade for the DCDs in the cluster.
After the upgrade, the cluster status should change to healthy
(green).
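You can verify from the command line that replicas are actually enabled by reading the number_of_replicas index setting. This is a sketch under stated assumptions: against a live cluster you would use the output of curl -s localhost:9200/_settings, whereas here a stand-in JSON sample is used, and the index name in it is illustrative.

```shell
# Sketch: report whether any index setting shows a nonzero
# number_of_replicas value.
replicas_enabled() {
  grep -o '"number_of_replicas":"[0-9]*"' "$1" | grep -vq ':"0"' \
    && echo "replicas enabled" || echo "replicas disabled"
}

# Stand-in sample of a _settings response (index name illustrative).
cat > /tmp/settings_sample.json <<'EOF'
{"websafe_2017-07-21t00-00-00-0700":{"settings":{"index":{"number_of_replicas":"1","number_of_shards":"5"}}}}
EOF

replicas_enabled /tmp/settings_sample.json
```

If this reports disabled after you selected Enable Replicas, re-check the Statistics Retention Policy screen before repeating the upgrade.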
Generic failure
Sometimes, for no discernible reason, the DCD cluster fails to assign the
primary data shards.
Symptoms
There are no symptoms to confirm this, other than the cluster status is
unhealthy (red). However, if you have tried other corrective actions and the problem
persists, you can try this remedy to see if it solves the problem.
Recommended Actions
- Log in to the primary BIG-IQ for the DCD cluster.
- Change to the pre-upgrade volume by running the command switchboot -b <old-volume-name>.
- Wait for the cluster status to return to green (indicating that the cluster has successfully replicated your data shards).
- Check to see that replicas exist for each primary shard.
- Log in to the BIG-IQ system or DCD on which the software is failing to install.
- Check for replicas of each primary by running the command curl -s localhost:9200/_cat/shards?v. In the following sample response, note that each primary shard (designated with a p in the prirep column) has a corresponding replica (designated with an r in the prirep column), and that the replica exists on a node with an IP address that is different from the node the primary shard is on.
index shard prirep state docs store ip node
websafe_2017-07-21t00-00-00-0700 2 r STARTED 0 159b 10.145.193.11 f7b33853-da66-4587-bb70-5d2dbc254a05
websafe_2017-07-21t00-00-00-0700 2 p STARTED 0 159b 10.145.192.202 687a7b3b-7dc3-4074-9b22-1e26c6092877
websafe_2017-07-21t00-00-00-0700 1 r STARTED 0 159b 10.145.193.11 f7b33853-da66-4587-bb70-5d2dbc254a05
websafe_2017-07-21t00-00-00-0700 1 p STARTED 0 159b 10.145.192.202 687a7b3b-7dc3-4074-9b22-1e26c6092877
websafe_2017-07-21t00-00-00-0700 3 r STARTED 0 159b 10.145.193.11 f7b33853-da66-4587-bb70-5d2dbc254a05
websafe_2017-07-21t00-00-00-0700 3 p STARTED 0 159b 10.145.192.202 687a7b3b-7dc3-4074-9b22-1e26c6092877
websafe_2017-07-21t00-00-00-0700 4 r STARTED 0 159b 10.145.193.11 f7b33853-da66-4587-bb70-5d2dbc254a05
websafe_2017-07-21t00-00-00-0700 4 p STARTED 0 159b 10.145.192.202 687a7b3b-7dc3-4074-9b22-1e26c6092877
websafe_2017-07-21t00-00-00-0700 0 r STARTED 0 159b 10.145.193.11 f7b33853-da66-4587-bb70-5d2dbc254a05
websafe_2017-07-21t00-00-00-0700 0 p STARTED 0 159b 10.145.192.202 687a7b3b-7dc3-4074-9b22-1e26c6092877
- If there are replicas for each primary shard, repeat the upgrade for the DCDs in the cluster.
After you upgrade the DCDs in the cluster, the cluster status should change to healthy
(green).
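Checking ten shard rows by eye is error-prone, so the placement check can be sketched as an awk pass over the same _cat/shards listing: it flags any shard whose primary and replica share a node IP, or whose replica is missing. The check_replica_placement function name is illustrative, and the sample below is an abbreviated form of the listing above.

```shell
# Sketch: verify every primary (p) shard has a replica (r) on a
# different node, keyed by index/shard and compared on the ip column.
check_replica_placement() {
  awk 'NR > 1 {
         key = $1 "/" $2                 # index/shard
         if ($3 == "p") pip[key] = $7    # primary node IP
         if ($3 == "r") rip[key] = $7    # replica node IP
       }
       END {
         bad = 0
         for (k in pip) {
           if (!(k in rip))          { print k ": no replica"; bad = 1 }
           else if (pip[k] == rip[k]) { print k ": replica on same node"; bad = 1 }
         }
         if (!bad) print "all primaries have replicas on other nodes"
       }' "$1"
}

# Abbreviated form of the sample listing from this chapter.
cat > /tmp/shards_full.txt <<'EOF'
index shard prirep state docs store ip node
websafe_2017-07-21t00-00-00-0700 2 r STARTED 0 159b 10.145.193.11 f7b33853-da66-4587-bb70-5d2dbc254a05
websafe_2017-07-21t00-00-00-0700 2 p STARTED 0 159b 10.145.192.202 687a7b3b-7dc3-4074-9b22-1e26c6092877
websafe_2017-07-21t00-00-00-0700 0 r STARTED 0 159b 10.145.193.11 f7b33853-da66-4587-bb70-5d2dbc254a05
websafe_2017-07-21t00-00-00-0700 0 p STARTED 0 159b 10.145.192.202 687a7b3b-7dc3-4074-9b22-1e26c6092877
EOF

check_replica_placement /tmp/shards_full.txt
```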
Data collection device cluster is offline
After upgrading a data collection device (DCD), if the cluster status is
completely offline, there is one primary potential cause and corresponding corrective action you
can take to attempt to resolve the issue.
Election of new master node failed
When the DCD cluster's master node is rebooted, a new master must be elected. Sometimes that
election can fail, which causes the cluster to be offline.
Symptoms
There are a number of different symptoms that can indicate that the master election has failed. Master election failure can occur only when three conditions are met:
- You are upgrading from version 6.0.0.
- The DCD that was upgraded and rebooted was the master node before the upgrade.
- Statistics replicas were enabled shortly before the upgrade and reboot.
One symptom of an election failure is when an API call to the cluster responds with a master not discovered message.
- Use SSH to log in to the primary BIG-IQ for the DCD cluster.
- Check the cluster status by submitting the following API call: curl http://localhost:9200/_cat/nodes?v.
- If the response to the API call is similar to this example, the master election failed:
{"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":null}],"type":"master_not_discovered_exception","reason":null},"status":503}
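This check is easy to script: look for the master_not_discovered_exception string in the response body. The sketch below uses the sample response shown above as a literal; against a live system you would capture the response with curl first (for example resp=$(curl -s http://localhost:9200/_cat/nodes?v)), and the master_missing function name is illustrative.

```shell
# Sketch: classify a _cat/nodes response as a failed or successful
# master election based on the error type in the body.
master_missing() {
  case "$1" in
    *master_not_discovered_exception*) echo "master election failed" ;;
    *) echo "master present" ;;
  esac
}

# The sample error response from this chapter, used as a literal.
resp='{"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":null}],"type":"master_not_discovered_exception","reason":null},"status":503}'
master_missing "$resp"
```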
Another symptom of an election failure is when the ElasticSearch log
file contains error messages.
- Use SSH to log in to the primary BIG-IQ for the DCD cluster.
- Navigate to the /var/log/elasticsearch/ folder and open the log file eslognode.log.
- Examine the log file. The following text snippets are examples of the two error messages that signal an election failure. If either of these messages is in the log file, the master election failed.
[2017-06-01 16:13:42,792][ERROR][discovery.zen ] [eb1d873a-ffdb-4ae8-ae22-946554970c54] unexpected failure during [zen-disco-join(elected_as_master, [4] joins received)] RoutingValidationException[[Index [statistics_tl0_device_2017-152-21]: Shard [2] routing table has wrong number of replicas, expected [0], got [1], Index [statistics_tl0_device_2017-152-21]: Shard [1] routing table has wrong number of replicas, expected [0], got [1], Index [statistics_tl0_device_2017-152-21]: Shard [4] routing table has wrong number of replicas, expected [0], got [1], Index [statistics_tl0_device_2017-152-21]: Shard [3] routing table has wrong number of replicas, expected [0], got [1], Index [statistics_tl0_device_2017-152-21]: Shard [0] routing table has wrong number of replicas, expected [0], got [1]]]
[2017-06-01 16:14:48,992][INFO ][discovery.zen ] [163e016c-3827-496c-b306-b2972d60c8df] failed to send join request to master [{eb1d873a-ffdb-4ae8-ae22-946554970c54}{IyzDAx59Swy6pNqg2YU0-Q}{10.145.192.147}{10.145.192.147:9300}{data=false, zone=L, master=true}], reason [RemoteTransportException[[eb1d873a-ffdb-4ae8-ae22-946554970c54][10.145.192.147:9300][internal:discovery/zen/join]]; nested: IllegalStateException[Node [{eb1d873a-ffdb-4ae8-ae22-946554970c54}{IyzDAx59Swy6pNqg2YU0-Q}{10.145.192.147}{10.145.192.147:9300}{data=false, zone=L, master=true}] not master for join request]; ]
[2017-06-01 16:14:50,124][WARN ][rest.suppressed ] path: /_bulk, params: {} ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/2/no master];]
Recommended Actions
Repeat these steps for each DCD in the cluster.
- Use SSH to log in to the first DCD in the cluster.
- Restart the ElasticSearch service by running the command bigstart restart elasticsearch.
Restarting the service for each DCD in the cluster triggers a new master node
election. After the last DCD in the cluster is restarted, the cluster status should change
to healthy (green).
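The per-DCD restart can be sketched as a loop. This is a dry run under stated assumptions: the DCD management addresses are placeholders you must replace with your own, and the loop only echoes each command so you can review the plan; remove the leading echo to actually run the restarts over SSH.

```shell
# Sketch (dry run): restart the ElasticSearch service on each DCD in
# turn to trigger a new master election. Addresses are placeholders.
DCDS="10.10.10.2 10.10.10.3"   # hypothetical DCD management addresses

for dcd in $DCDS; do
  # Dry run: print the command instead of executing it.
  echo "ssh root@$dcd bigstart restart elasticsearch"
done
```

Restart the DCDs one at a time and confirm each rejoins the cluster before moving to the next, rather than restarting them all at once.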