Manual Chapter: Troubleshooting Minimal Downtime Upgrade Issues
Applies To:
BIG-IQ Centralized Management
- 8.1.0
What do I do if there is not enough disk space?
If there is not enough disk space to install the new software, you need to extend the /var
partition. The default size of the /var file system in a newly installed node is 30 GB, which
might be insufficient to store your data. To extend this file system to a larger size, refer to
K16103: Extending disk space on BIG-IQ Virtual Edition at support.f5.com/csp/article/K16103.
Because upgrading a node requires at least two volumes, you must ensure that the /var file system
on both volumes can be extended to the same size, or upgrades might fail.
Symptom
If the message UCS restore failure displays during software installation, the cause might be
insufficient disk space. To determine if this is the case:
- Log in to the BIG-IQ system or DCD on which the software is failing to install.
- Run the command: tmsh show sys software. The system displays failed (UCS application failed; unknown cause) in response.
- Navigate to the liveinstall.log file in the /var/log/ folder. If the issue triggering the error message is insufficient disk space, the file contains the following error message:
    info: capture: status 256 returned by command: F5_INSTALL_MODE=install F5_INSTALL_SESSION_TYPE=hotfix chroot /mnt/tm_install/9934.NdHXAL /usr/local/bin/im -force /var/local/ucs/config.ucs
    info: >++++ result:
    info: Extracting manifest: /var/local/ucs/config.ucs
    info: /var: Not enough free space
    info: 3404179456 bytes required
    info: 2206740480 bytes available
    info: /var/local/ucs/config.ucs: Not enough free disk space to install!
    info: Operation aborted.
    info: >----
    info: Removing boot loader reference
    Terminal error: UCS application failed; unknown cause.
    *** Live install end at 2017/03/21 17:00:54: failed (return code 2) ***
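The two byte counts in that log message show how much more space /var needs. The following is a minimal sketch, assuming the log lines have the same shape as the sample above. The function name space_shortfall is a hypothetical helper, not an F5 tool.

```shell
# Hypothetical helper: reads liveinstall.log text on stdin and prints how many
# more bytes /var needs, using the "bytes required" and "bytes available" lines.
# If the log holds several failed attempts, the last pair wins.
# Prints nothing if either line is missing.
space_shortfall() {
  awk '
    /bytes required/  { for (i = 2; i <= NF; i++) if ($i == "bytes") req = $(i - 1) }
    /bytes available/ { for (i = 2; i <= NF; i++) if ($i == "bytes") avail = $(i - 1) }
    END { if (req != "" && avail != "") printf "%.0f\n", req - avail }
  '
}

# Usage on the system where the installation failed:
#   space_shortfall < /var/log/liveinstall.log
```

For the sample message, 3404179456 - 2206740480 gives a shortfall of 1197438976 bytes, or roughly 1.1 GiB.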
Recommended Actions
- Switch back to the older (pre-upgrade) volume.
  - Log in to the BIG-IQ system or DCD on which the software is failing to install.
  - Change to the pre-upgrade volume by running the command switchboot -b <old-volume-name>.
  - Restart the device by running the command reboot.
- Clean up the audit log entries and snapshot objects.
- Retry the installation.
Data collection device cluster status is yellow
After upgrading a data collection device (DCD), if the cluster status is yellow (unhealthy)
instead of green (healthy), there are a number of potential causes, each with a corresponding
corrective action you can take to attempt to resolve the issue.
Connectivity issues in an upgraded data collection device
There could be unassigned replica shards in the cluster. Network
connectivity issues can cause relocation of shards to a newly upgraded DCD to fail.
Find Symptoms
- Log in to the primary BIG-IQ DCD for the cluster.
- Navigate to the /var/log/elasticsearch/ directory.
- Examine the eslognode.log file to see if there is an add/remove/add cycle due to temporary network connectivity issues. If there are unassigned replica shards, you will find a pattern similar to this example:
    [2017-05-03 10:49:20,885][INFO ][cluster.service ] [f992a8aa-49c8-47ba-a59a-2be863f3a042] added {{7c49c2ef-ad2f-41f5-ab15-e01084b20364}{_xdtaTl5RGSn-pIC192ORA}{10.10.10.2}{10.10.10.2:9300}{zone=Seattle, master=true},}, reason: zen-disco-join(join from node[{7c49c2ef-ad2f-41f5-ab15-e01084b20364}{_xdtaTl5RGSn-pIC192ORA}{10.10.10.2}{10.10.10.2:9300}{zone=Seattle, master=true}])
    [2017-05-03 10:49:32,172][INFO ][cluster.service ] [f992a8aa-49c8-47ba-a59a-2be863f3a042] removed {{7c49c2ef-ad2f-41f5-ab15-e01084b20364}{_xdtaTl5RGSn-pIC192ORA}{10.10.10.2}{10.10.10.2:9300}{zone=Seattle, master=true},}, reason: zen-disco-node_failed({7c49c2ef-ad2f-41f5-ab15-e01084b20364}{_xdtaTl5RGSn-pIC192ORA}{10.10.10.2}{10.10.10.2:9300}{zone=Seattle, master=true}), reason transport disconnected
    [2017-05-03 10:49:36,210][INFO ][cluster.service ] [f992a8aa-49c8-47ba-a59a-2be863f3a042] added {{7c49c2ef-ad2f-41f5-ab15-e01084b20364}{_xdtaTl5RGSn-pIC192ORA}{10.10.10.2}{10.10.10.2:9300}{zone=Seattle, master=true},}, reason: zen-disco-join(join from node[{7c49c2ef-ad2f-41f5-ab15-e01084b20364}{_xdtaTl5RGSn-pIC192ORA}{10.10.10.2}{10.10.10.2:9300}{zone=Seattle, master=true}])
The sample log above illustrates the added/removed/added cycle. Note the word added in the first
entry, followed by removed in the second entry, and then added again in the third entry.
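On a busy log, the cycle can be hard to spot by eye. The following is a minimal sketch, assuming each cluster.service entry occupies one line and has the same shape as the sample log above. The function name flapping_nodes is a hypothetical helper, not part of BIG-IQ.

```shell
# Hypothetical helper: reads eslognode.log text on stdin and prints the ID of
# every node that was added, then removed, then added again.
flapping_nodes() {
  awk '
    index($0, "cluster.service") {
      if (index($0, " added {{"))        ev = "a"
      else if (index($0, " removed {{")) ev = "r"
      else next
      # The node ID is the text between the first "{{" and the next "}".
      rest = substr($0, index($0, "{{") + 2)
      id = substr(rest, 1, index(rest, "}") - 1)
      seq[id] = seq[id] ev
    }
    END { for (id in seq) if (seq[id] ~ /ara/) print id }
  '
}

# Usage on the primary BIG-IQ DCD:
#   flapping_nodes < /var/log/elasticsearch/eslognode.log
```

The sketch records one letter per event for each node and reports a node only when the sequence added, removed, added appears in that order.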
Recommended Actions
- Log in to the DCD that you just upgraded.
- Restart the ElasticSearch service by running the command: bigstart restart elasticsearch.
Shard assignment can now succeed, so the cluster status should change to healthy (green).
Index created in later version
When a DCD is upgraded, ElasticSearch begins creating indexes on that DCD.
Because the upgraded DCD is running a different software version than the other DCDs in the
cluster, shards cannot be replicated between the DCDs. This can result in unassigned replica
shards in the cluster.
Find Symptoms
- Log in to the upgraded DCD.
- Check for unassigned replica shards by running this command: curl -s localhost:9200/_cat/shards?v | grep UNASSIGNED
- If there are no unassigned shards, there should be no response to the command. If there are unassigned shards, try the recommended actions.
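It can also help to know whether the unassigned shards are replicas or primaries, because unassigned replicas correspond to a yellow status and unassigned primaries to a red one. The following is a minimal sketch, assuming the column order shown by the ?v header (index, shard, prirep, state, and so on). The function name unassigned_summary is a hypothetical helper.

```shell
# Hypothetical helper: reads _cat/shards output on stdin and counts the
# unassigned primary (p) and replica (r) shards.
unassigned_summary() {
  awk '
    $4 == "UNASSIGNED" && $3 == "p" { p++ }
    $4 == "UNASSIGNED" && $3 == "r" { r++ }
    END { printf "primary=%d replica=%d\n", p, r }
  '
}

# Usage on the upgraded DCD:
#   curl -s 'localhost:9200/_cat/shards?v' | unassigned_summary
```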
Recommended Actions
- For a single zone DCD cluster, continue the upgrade process by upgrading another DCD in the cluster. When another DCD in the cluster is running the same version, shards can begin replicating again and the cluster status should become healthy.
- For a multiple zone DCD cluster, continue the upgrade process but upgrade the next DCD in a different zone than the first DCD. When there are DCDs running the same version in different zones, shards can begin replicating again and the cluster status should become healthy.
Data collection device cluster status is red
After upgrading a data collection device (DCD), if the cluster status is red (unhealthy)
instead of green (healthy), there are a number of potential causes, each with a corresponding
corrective action you can take to attempt to resolve the issue.
Statistics replicas are not enabled
If statistics replicas were not enabled before you upgraded, the cluster
will not create replicas of your data, and the cluster status will be unhealthy (red).
Recommended Actions
- Log in to the primary BIG-IQ for the DCD cluster.
- Navigate to the Statistics Retention Policy screen, expand Advanced Settings, and then select Enable Replicas.
- Log in to the most recently upgraded DCD.
- Change to the pre-upgrade volume by running the command switchboot -b <old-volume-name>.
- Reboot the DCD by running the command reboot.
- Wait for the rebooted DCD to join the cluster.
- Wait for the cluster status to return to green (indicating that the cluster has successfully replicated your data shards).
- Repeat the upgrade for the DCDs in the cluster.
After the upgrade, the cluster status should change to healthy (green).
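The wait for a green status can be checked from the command line. The following is a minimal sketch, assuming the standard Elasticsearch _cluster/health endpoint is reachable on localhost:9200 and returns compact JSON. That endpoint is not mentioned in this chapter, and the function name health_status is a hypothetical helper.

```shell
# Hypothetical helper: reads a _cluster/health JSON response on stdin and
# prints the value of its "status" field (green, yellow, or red).
health_status() {
  sed -n 's/.*"status":"\([a-z]*\)".*/\1/p'
}

# Usage on a DCD:
#   curl -s localhost:9200/_cluster/health | health_status
# Poll every 30 seconds until the cluster is green:
#   until [ "$(curl -s localhost:9200/_cluster/health | health_status)" = "green" ]; do sleep 30; done
```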
Generic failure
Sometimes, for no discernible reason, the DCD cluster fails to assign the
primary data shards.
Symptoms
There are no symptoms that confirm this cause, other than an unhealthy (red) cluster status.
However, if you have tried other corrective actions and the problem persists, you can try this
remedy to see if it solves the problem.
Recommended Actions
- Log in to the primary BIG-IQ for the DCD cluster.
- Change to the pre-upgrade volume by running the command switchboot -b <old-volume-name>.
- Wait for the cluster status to return to green (indicating that the cluster has successfully replicated your data shards).
- Check to see that replicas exist for each primary shard.
  - Log in to the BIG-IQ system or DCD on which the software is failing to install.
  - Check for replicas of each primary by running the command curl -s localhost:9200/_cat/shards?v. In the following sample response, note that each primary shard (designated with a p in the prirep column) has a corresponding replica (designated with an r in the prirep column), and that the replica exists on a node with a different IP address than the node that holds the primary.
      index                            shard prirep state   docs store ip             node
      websafe_2017-07-21t00-00-00-0700 2     r      STARTED 0    159b  10.145.193.11  f7b33853-da66-4587-bb70-5d2dbc254a05
      websafe_2017-07-21t00-00-00-0700 2     p      STARTED 0    159b  10.145.192.202 687a7b3b-7dc3-4074-9b22-1e26c6092877
      websafe_2017-07-21t00-00-00-0700 1     r      STARTED 0    159b  10.145.193.11  f7b33853-da66-4587-bb70-5d2dbc254a05
      websafe_2017-07-21t00-00-00-0700 1     p      STARTED 0    159b  10.145.192.202 687a7b3b-7dc3-4074-9b22-1e26c6092877
      websafe_2017-07-21t00-00-00-0700 3     r      STARTED 0    159b  10.145.193.11  f7b33853-da66-4587-bb70-5d2dbc254a05
      websafe_2017-07-21t00-00-00-0700 3     p      STARTED 0    159b  10.145.192.202 687a7b3b-7dc3-4074-9b22-1e26c6092877
      websafe_2017-07-21t00-00-00-0700 4     r      STARTED 0    159b  10.145.193.11  f7b33853-da66-4587-bb70-5d2dbc254a05
      websafe_2017-07-21t00-00-00-0700 4     p      STARTED 0    159b  10.145.192.202 687a7b3b-7dc3-4074-9b22-1e26c6092877
      websafe_2017-07-21t00-00-00-0700 0     r      STARTED 0    159b  10.145.193.11  f7b33853-da66-4587-bb70-5d2dbc254a05
      websafe_2017-07-21t00-00-00-0700 0     p      STARTED 0    159b  10.145.192.202 687a7b3b-7dc3-4074-9b22-1e26c6092877
- If there are replicas for each primary shard, repeat the upgrade for the DCDs in the cluster.
After you upgrade the DCDs in the cluster, the cluster status should change to healthy (green).
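On a cluster with many indexes, the replica check is easier to run as a script than by eye. The following is a minimal sketch, assuming the eight-column _cat/shards layout shown in the sample response. The function name shards_missing_replica is a hypothetical helper.

```shell
# Hypothetical helper: reads _cat/shards output on stdin and prints
# "<index> <shard>" for every shard that has no STARTED replica on an IP
# address different from its primary's. No output means every primary is covered.
shards_missing_replica() {
  awk '
    NR == 1 && $1 == "index" { next }   # skip the header row printed by ?v
    { key = $1 " " $2; seen[key] = 1 }
    $3 == "p" { primary_ip[key] = $7 }
    $3 == "r" && $4 == "STARTED" { replica_ips[key] = replica_ips[key] " " $7 }
    END {
      for (key in seen) {
        ok = 0
        n = split(replica_ips[key], ips, " ")
        for (i = 1; i <= n; i++)
          if (ips[i] != primary_ip[key]) ok = 1
        if (!ok) print key
      }
    }
  ' | sort
}

# Usage:
#   curl -s 'localhost:9200/_cat/shards?v' | shards_missing_replica
```

Run against the sample response, this sketch prints nothing, because every primary on 10.145.192.202 has a replica on 10.145.193.11.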
Data collection device cluster is offline
After upgrading a data collection device (DCD), if the cluster is completely offline, there is
one primary potential cause and a corresponding corrective action you can take to attempt to
resolve the issue.
Election of new master node failed
When the DCD cluster's master node is rebooted, a new master must be
elected. Sometimes that election can fail, which causes the cluster to be offline.
Symptoms
There are a number of different symptoms that can indicate that the master election has failed.
Master election failure can only occur when three conditions are met:
- You are upgrading from version 6.x or 7.0.
- The DCD that was upgraded and rebooted was the master node before the upgrade.
- Statistics replicas were enabled shortly before the upgrade and reboot.
One symptom of an election failure is when an API call to the cluster responds with a
master not discovered message.
- Use SSH to log in to the primary BIG-IQ for the DCD cluster.
- Check the cluster status by submitting the following API call: curl http://localhost:9200/_cat/nodes?v.
- If the response to the API call is similar to this example, the master election failed.
    {"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":null}],"type":"master_not_discovered_exception","reason":null},"status":503}
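The API check can be scripted so that it returns a simple yes or no. The following is a minimal sketch that looks for the exception name from the sample response. The function name master_not_discovered is a hypothetical helper.

```shell
# Hypothetical helper: reads an API response on stdin and succeeds (exit 0)
# only if it reports master_not_discovered_exception.
master_not_discovered() {
  grep -q 'master_not_discovered_exception'
}

# Usage on the primary BIG-IQ:
#   if curl -s 'http://localhost:9200/_cat/nodes?v' | master_not_discovered; then
#     echo "master election failed"
#   fi
```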
Another symptom of an election failure is when the ElasticSearch log
file contains error messages.
- Use SSH to log in to the primary BIG-IQ for the DCD cluster.
- Navigate to the /var/log/elasticsearch/ folder and open the log file eslognode.log.
- Examine the log file. The following text snippets are examples of the two error messages that signal an election failure. If either of these messages is in the log file, the master election failed.
    [2017-06-01 16:13:42,792][ERROR][discovery.zen ] [eb1d873a-ffdb-4ae8-ae22-946554970c54] unexpected failure during [zen-disco-join(elected_as_master, [4] joins received)] RoutingValidationException[[Index [statistics_tl0_device_2017-152-21]: Shard [2] routing table has wrong number of replicas, expected [0], got [1], Index [statistics_tl0_device_2017-152-21]: Shard [1] routing table has wrong number of replicas, expected [0], got [1], Index [statistics_tl0_device_2017-152-21]: Shard [4] routing table has wrong number of replicas, expected [0], got [1], Index [statistics_tl0_device_2017-152-21]: Shard [3] routing table has wrong number of replicas, expected [0], got [1], Index [statistics_tl0_device_2017-152-21]: Shard [0] routing table has wrong number of replicas, expected [0], got [1]]]

    [2017-06-01 16:14:48,992][INFO ][discovery.zen ] [163e016c-3827-496c-b306-b2972d60c8df] failed to send join request to master [{eb1d873a-ffdb-4ae8-ae22-946554970c54}{IyzDAx59Swy6pNqg2YU0-Q}{10.145.192.147}{10.145.192.147:9300}{data=false, zone=L, master=true}], reason [RemoteTransportException[[eb1d873a-ffdb-4ae8-ae22-946554970c54][10.145.192.147:9300][internal:discovery/zen/join]]; nested: IllegalStateException[Node [{eb1d873a-ffdb-4ae8-ae22-946554970c54}{IyzDAx59Swy6pNqg2YU0-Q}{10.145.192.147}{10.145.192.147:9300}{data=false, zone=L, master=true}] not master for join request]; ]
    [2017-06-01 16:14:50,124][WARN ][rest.suppressed ] path: /_bulk, params: {}ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/2/no master];]
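Searching the log for these messages is quicker than reading it through. The following is a minimal sketch that matches three phrases taken from the sample entries above. Whether other releases word these messages identically is an assumption, and the function name election_failure_entries is a hypothetical helper.

```shell
# Hypothetical helper: reads eslognode.log text on stdin and prints every
# entry that contains one of the election-failure phrases from the samples.
# Exits non-zero when no entry matches.
election_failure_entries() {
  grep -E 'routing table has wrong number of replicas|not master for join request|SERVICE_UNAVAILABLE/2/no master'
}

# Usage on the primary BIG-IQ:
#   election_failure_entries < /var/log/elasticsearch/eslognode.log
```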
Recommended Actions
Repeat these steps for each DCD in the cluster.
- Use SSH to log in to the first DCD in the cluster.
- Restart the ElasticSearch service by running the command: bigstart restart elasticsearch.
Restarting the service for each DCD in the cluster triggers a new
master node election. After the last DCD in the cluster is restarted, the cluster status
should change to healthy (green).
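When the cluster has several DCDs, it can help to write out the restart sequence before running it. The following is a minimal sketch that only prints the commands and does not run them. The root login and the addresses in the usage line are assumptions, and the function name restart_plan is a hypothetical helper.

```shell
# Hypothetical helper: prints, without executing, one SSH command per DCD
# address given as an argument. The root account is an assumption.
restart_plan() {
  for host in "$@"; do
    printf "ssh root@%s 'bigstart restart elasticsearch'\n" "$host"
  done
}

# Usage (example addresses only):
#   restart_plan 10.145.192.147 10.145.193.11
```

Printing the plan first lets you confirm the list of DCDs and their order before you restart any service.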