Manual Chapter: Troubleshooting Minimal Downtime Upgrade Issues

Applies To:

BIG-IQ Centralized Management

  • 6.0.1

What do I do if there is not enough disk space?

If there is not enough disk space to install the new software, you need to extend the /var partition. The default size of the /var file system on a newly installed node is 30 GB, which might be insufficient to store your data. For instructions on extending this file system to a larger size, refer to K16103: Extending disk space on BIG-IQ Virtual Edition at support.f5.com/csp/article/K16103. Because upgrading a node requires at least two volumes, you must ensure that the /var file system on both volumes can be extended to the same size, or upgrades might fail.
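Before you upgrade, you can check how much space /var currently has and which boot volumes are installed. This is only a quick sanity check; the procedure for actually extending the file system is described in K16103.

    df -h /var                # current size and free space of the /var file system
    tmsh show sys software    # lists the boot volumes present on this device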

Symptom

If the message UCS restore failure displays during software installation, the failure might be due to insufficient disk space.

To determine if this is the case:
  1. Log in to the BIG-IQ system or DCD on which the software is failing to install.
  2. Run the command: tmsh show sys software.

    The system displays failed (UCS application failed; unknown cause) in the response.

  3. Navigate to the liveinstall.log file in the /var/log/ folder. If the issue triggering the error message is insufficient disk space, then the file contains the following error message:
    info: capture: status 256 returned by command: F5_INSTALL_MODE=install F5_INSTALL_SESSION_TYPE=hotfix chroot 
    /mnt/tm_install/9934.NdHXAL /usr/local/bin/im -force  /var/local/ucs/config.ucs
    info: >++++ result: 
    info: Extracting manifest: /var/local/ucs/config.ucs 
    info: /var: Not enough free space 
    info: 3404179456 bytes required 
    info: 2206740480 bytes available 
    info: /var/local/ucs/config.ucs: Not enough free disk space to install! 
    info: Operation aborted. 
    info: >---- 
    info: Removing boot loader reference
    
    Terminal error: UCS application failed; unknown cause.
    *** Live install end at 2017/03/21 17:00:54: failed (return code 2) ***
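If you want to confirm that disk space was the cause without reading the whole log, you can check the figures and messages shown above directly on the affected device. A minimal check:

    df -h /var                                           # compare the available space to the "bytes required" figure
    grep -i "not enough free" /var/log/liveinstall.log   # present only when insufficient disk space caused the failure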

Recommended Actions

  1. Switch back to the older (pre-upgrade) volume.
    1. Log in to the BIG-IQ system or DCD on which the software is failing to install.
    2. Change to the pre-upgrade volume by running the command switchboot -b <old-volume-name>.
    3. Restart the device by running the command reboot.
  2. Clean up the audit log entries and snapshot objects.
  3. Retry the installation.
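For example, the commands for step 1 might look like the following. The volume name HD1.1 is only an illustration; substitute the name of your pre-upgrade volume as reported by tmsh.

    tmsh show sys software    # identify the older (pre-upgrade) volume
    switchboot -b HD1.1       # example volume name; use your pre-upgrade volume
    reboot                    # restart into the pre-upgrade volume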

Data collection device cluster status is yellow

After upgrading a data collection device (DCD), if the cluster status is yellow (unhealthy) instead of green (healthy), there are a number of potential causes and corresponding corrective actions you can take to attempt to resolve the issue.

Connectivity issues in an upgraded data collection device

There could be unassigned replica shards in the cluster. Network connectivity issues can cause relocation of shards to a newly upgraded DCD to fail.

Find Symptoms

  1. Log in to the primary BIG-IQ DCD for the cluster.
  2. Navigate to the /var/log/elasticsearch/ directory.
  3. Examine the eslognode.log file to see whether there is an added/removed/added cycle caused by temporary network connectivity issues.

    If there are unassigned replica shards, you will find a pattern similar to this example:

    [2017-05-03 10:49:20,885][INFO ][cluster.service ] 
    [f992a8aa-49c8-47ba-a59a-2be863f3a042] added {{7c49c2ef-ad2f-41f5-ab15-e01084b20364}
    {_xdtaTl5RGSn-pIC192ORA}{10.10.10.2}{10.10.10.2:9300}{zone=Seattle, master=true},}, 
    reason: zen-disco-join(join from node[{7c49c2ef-ad2f-41f5-ab15-e01084b20364}
    {_xdtaTl5RGSn-pIC192ORA}{10.10.10.2}{10.10.10.2:9300}{zone=Seattle, master=true}])
    [2017-05-03 10:49:32,172][INFO ][cluster.service ] [f992a8aa-49c8-47ba-a59a-2be863f3a042] 
    removed {{7c49c2ef-ad2f-41f5-ab15-e01084b20364}{_xdtaTl5RGSn-pIC192ORA}{10.10.10.2}
    {10.10.10.2:9300}{zone=Seattle, master=true},}, reason: zen-disco-node_failed(
    {7c49c2ef-ad2f-41f5-ab15-e01084b20364}{_xdtaTl5RGSn-pIC192ORA}{10.10.10.2}{10.10.10.2:9300}
    {zone=Seattle, master=true}), reason transport disconnected
    [2017-05-03 10:49:36,210][INFO ][cluster.service ] [f992a8aa-49c8-47ba-a59a-2be863f3a042] added 
    {{7c49c2ef-ad2f-41f5-ab15-e01084b20364}{_xdtaTl5RGSn-pIC192ORA}{10.10.10.2}{10.10.10.2:9300}
    {zone=Seattle, master=true},}, reason: zen-disco-join(join from node[{7c49c2ef-ad2f-41f5-ab15-e01084b20364}
    {_xdtaTl5RGSn-pIC192ORA}{10.10.10.2}{10.10.10.2:9300}{zone=Seattle, master=true}])
Note: The sample log file above illustrates the added/removed/added cycle: the node is added at 10:49:20, removed at 10:49:32, and then added again at 10:49:36.
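To scan for this pattern without paging through the whole file, you can filter the log for the add and remove events. A minimal sketch using the same log file:

    grep -E "added|removed" /var/log/elasticsearch/eslognode.log    # shows the added/removed/added sequence with timestamps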

Recommended Actions

  1. Log in to the DCD that you just upgraded.
  2. Restart the Elasticsearch service by running the command: bigstart restart elasticsearch.

Shard assignment can now succeed, so the cluster status should change to healthy (green).
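To confirm the recovery from the command line, you can query the Elasticsearch health endpoint on the DCD (a standard Elasticsearch API, shown here only as a convenience check):

    curl -s localhost:9200/_cat/health?v    # the status column should report green once the replica shards are assigned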

Index created in later version

When a DCD is upgraded, Elasticsearch begins creating indexes on that DCD. Because the upgraded DCD is running a different software version than the other DCDs in the cluster, shards cannot be replicated between the DCDs. This can result in unassigned replica shards in the cluster.

Find Symptoms

  1. Log in to the upgraded DCD.
  2. Check for unassigned replica shards by running this command:

    curl -s localhost:9200/_cat/shards?v | grep UNASSIGNED

  3. If there are no unassigned shards, the command returns no output. If there are unassigned shards, try the recommended actions.
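A convenient variation of the command in step 2 is to count the unassigned shards, so you can watch the number fall to zero as the upgrade proceeds:

    curl -s localhost:9200/_cat/shards | grep -c UNASSIGNED    # 0 means no shards remain unassigned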

Recommended Actions

  • For a single zone DCD cluster, continue the upgrade process by upgrading another DCD in the cluster. When another DCD in the cluster is running the same version, shards can begin replicating again and the cluster status should become healthy.
  • For a multiple zone DCD cluster, continue the upgrade process but upgrade the next DCD in a different zone than the first DCD. When there are DCDs running the same version in different zones, shards can begin replicating again and the cluster status should become healthy.

Data collection device cluster status is red

After upgrading a data collection device (DCD), if the cluster status is red (unhealthy) instead of green (healthy), there are a number of potential causes and corresponding corrective actions you can take to attempt to resolve the issue.

Statistics replicas are not enabled

If statistics replicas were not enabled before you upgraded, the cluster will not create replicas of your data, and the cluster health will be unhealthy.

Recommended Actions

  1. Log in to the primary BIG-IQ for the DCD cluster.
  2. Navigate to the Statistics Retention Policy screen and expand the Advanced Settings, then select Enable Replicas.
  3. Log in to the most recently upgraded DCD.
  4. Change to the pre-upgrade volume by running the command switchboot -b <old-volume-name>.
  5. Reboot the DCD by running the command reboot.
  6. Wait for the rebooted DCD to join the cluster.
  7. Wait for the cluster status to return to green (indicating that the cluster has successfully replicated your data shards).
  8. Repeat the upgrade for the DCDs in the cluster.

After the upgrade, the cluster status should change to healthy (green).
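If you prefer to wait from the command line rather than polling the status screen (steps 6 and 7), the Elasticsearch cluster health API can block until the cluster reaches green. This is a standard Elasticsearch API, shown here only as an optional convenience:

    curl -s 'localhost:9200/_cluster/health?wait_for_status=green&timeout=300s'    # returns when the cluster is green, or after the timeout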

Generic failure

Sometimes, for no discernible reason, the DCD cluster fails to assign the primary data shards.

Symptoms

There are no specific symptoms that confirm this cause, other than the unhealthy (red) cluster status. However, if you have tried other corrective actions and the problem persists, you can try this remedy to see if it resolves the problem.

Recommended Actions

  1. Log in to the primary BIG-IQ for the DCD cluster.
  2. Change to the pre-upgrade volume by running the command switchboot -b <old-volume-name>.
  3. Restart the device by running the command reboot.
  4. Wait for the cluster status to return to green (indicating that the cluster has successfully replicated your data shards).
  5. Check to see that replicas exist for each primary shard.
    1. Log in to the primary BIG-IQ or a DCD in the cluster.
    2. Check for replicas of each primary by running the command curl -s localhost:9200/_cat/shards?v. A typical response would look like this:
      Note: In the following sample response, each primary shard (designated with a p in the prirep column) has a corresponding replica (designated with an r in the prirep column), and the replica resides on a node with a different IP address than the node hosting the primary shard.
      index   shard prirep            state  docs    store    ip node                                 
      websafe_2017-07-21t00-00-00-0700  2     r      STARTED  0  159b 10.145.193.11  f7b33853-da66-4587-bb70-5d2dbc254a05 
      websafe_2017-07-21t00-00-00-0700  2     p      STARTED  0  159b 10.145.192.202 687a7b3b-7dc3-4074-9b22-1e26c6092877 
      websafe_2017-07-21t00-00-00-0700  1     r      STARTED  0  159b 10.145.193.11  f7b33853-da66-4587-bb70-5d2dbc254a05 
      websafe_2017-07-21t00-00-00-0700  1     p      STARTED  0  159b 10.145.192.202 687a7b3b-7dc3-4074-9b22-1e26c6092877 
      websafe_2017-07-21t00-00-00-0700  3     r      STARTED  0  159b 10.145.193.11  f7b33853-da66-4587-bb70-5d2dbc254a05 
      websafe_2017-07-21t00-00-00-0700  3     p      STARTED  0  159b 10.145.192.202 687a7b3b-7dc3-4074-9b22-1e26c6092877 
      websafe_2017-07-21t00-00-00-0700  4     r      STARTED  0  159b 10.145.193.11  f7b33853-da66-4587-bb70-5d2dbc254a05 
      websafe_2017-07-21t00-00-00-0700  4     p      STARTED  0  159b 10.145.192.202 687a7b3b-7dc3-4074-9b22-1e26c6092877 
      websafe_2017-07-21t00-00-00-0700  0     r      STARTED  0  159b 10.145.193.11  f7b33853-da66-4587-bb70-5d2dbc254a05 
      websafe_2017-07-21t00-00-00-0700  0     p      STARTED  0  159b 10.145.192.202 687a7b3b-7dc3-4074-9b22-1e26c6092877 
    3. If there are replicas for each primary shard, repeat the upgrade for the DCDs in the cluster.
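As a quick cross-check for step 2, you can count the primary and replica shards; in a healthy cluster the two totals match. This sketch assumes the column layout shown in the sample output above (prirep is the third column):

    curl -s localhost:9200/_cat/shards | awk '$3 == "p"' | wc -l    # number of primary shards
    curl -s localhost:9200/_cat/shards | awk '$3 == "r"' | wc -l    # number of replica shards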

After you upgrade the DCDs in the cluster, the cluster status should change to healthy (green).

Data collection device cluster is offline

After upgrading a data collection device (DCD), if the cluster status is completely offline, there is one primary potential cause and corresponding corrective action you can take to attempt to resolve the issue.

Election of new master node failed

When the DCD cluster's master node is rebooted, a new master must be elected. Sometimes that election can fail, which causes the cluster to be offline.

Symptoms

There are a number of different symptoms that can indicate that the master election has failed. Master election failure can only occur when three conditions are met:
  • You are upgrading from version 6.0.0.
  • The DCD that was upgraded and rebooted was the master node before the upgrade.
  • Statistics replicas were enabled shortly before the upgrade and reboot.
One symptom of an election failure is that an API call to the cluster responds with a master not discovered message.
  1. Use SSH to log in to the primary BIG-IQ for the DCD cluster.
  2. Check the cluster status by submitting the following API call: curl http://localhost:9200/_cat/nodes?v.
  3. If the response to the API call is similar to this example, the master election failed.
     {"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":null}],"type":"master_not_discovered_exception","reason":null},"status":503}

Another symptom of an election failure is that the Elasticsearch log file contains error messages.

  1. Use SSH to log in to the primary BIG-IQ for the DCD cluster.
  2. Navigate to the /var/log/elasticsearch/ folder and open the log file eslognode.log.
  3. Examine the log file. The following text snippets are examples of the two error messages that signal an election failure. If either of these messages is in the log file, the master election failed.
    [2017-06-01 16:13:42,792][ERROR][discovery.zen ] [eb1d873a-ffdb-4ae8-ae22-946554970c54] unexpected failure during [zen-disco-join(elected_as_master, 
    [4] joins received)] RoutingValidationException[[Index [statistics_tl0_device_2017-152-21]: Shard [2] routing table has wrong number of replicas, 
    expected [0], got [1], Index [statistics_tl0_device_2017-152-21]: Shard [1] routing table has wrong number of replicas, expected [0], got [1], Index
    [statistics_tl0_device_2017-152-21]: Shard [4] routing table has wrong number of replicas, expected [0], got [1], Index [statistics_tl0_device_2017-152-21]: 
    Shard [3] routing table has wrong number of replicas, expected [0], got [1], Index [statistics_tl0_device_2017-152-21]: Shard [0] routing table has
    wrong number of replicas, expected [0], got [1]]]
    
    [2017-06-01 16:14:48,992][INFO ][discovery.zen ] [163e016c-3827-496c-b306-b2972d60c8df] failed to send join request to master [{eb1d873a-ffdb-4ae8-ae22-946554970c54}
    {IyzDAx59Swy6pNqg2YU0-Q}{10.145.192.147}{10.145.192.147:9300}{data=false, zone=L, master=true}], reason [RemoteTransportException[[eb1d873a-ffdb-4ae8-ae22-946554970c54]
    [10.145.192.147:9300][internal:discovery/zen/join]]; nested: IllegalStateException[Node [{eb1d873a-ffdb-4ae8-ae22-946554970c54}{IyzDAx59Swy6pNqg2YU0-Q}{10.145.192.147}
    {10.145.192.147:9300}{data=false, zone=L, master=true}] not master for join request]; ]
    [2017-06-01 16:14:50,124][WARN ][rest.suppressed ] path: /_bulk, params: {}ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/2/no master];]
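To search for these messages without paging through the log, you can filter on their distinctive phrases. A minimal sketch using the same log file:

    grep -E "wrong number of replicas|not master for join request|no master" /var/log/elasticsearch/eslognode.log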
    

Recommended Actions

Repeat these steps for each DCD in the cluster.
  1. Use SSH to log in to the first DCD in the cluster.
  2. Restart the Elasticsearch service by running the command: bigstart restart elasticsearch.

Restarting the service for each DCD in the cluster triggers a new master node election. After the last DCD in the cluster is restarted, the cluster status should change to healthy (green).
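After the last restart, you can verify from any node that a master was elected before you continue. In a typical Elasticsearch _cat/nodes listing, the elected master is marked with an asterisk in the master column:

    curl -s localhost:9200/_cat/nodes?v    # a node marked * in the master column confirms the election succeeded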