Monday, 20 November 2023

Adding/changing uplinks in the vs0 virtual switch configuration fails on a Nutanix single-node cluster

Updating the virtual switch vs0 fails on Nutanix single-node clusters because the cluster cannot perform a rolling reboot. You have to disable the virtual switch, change the uplinks via the old manage_ovs commands, and then convert it back:

  1. Disable the virtual switch on the cluster (no downtime):

     acli net.disable_virtual_switch

  2. Then update the bridge via the CVM with the correct details.

     Uplink with LACP:

     manage_ovs --bridge_name br0 --interfaces <interface names> --bond_name br0-up --bond_mode balance-tcp --lacp_mode fast --lacp_fallback true update_uplinks

     Uplink without LACP:

     manage_ovs --bridge_name br0 --interfaces <interface names> --bond_name br0-up --bond_mode <active-backup/balanced-slb> update_uplinks

  3. Then verify connectivity to the CVM and the AHV host.

  4. Migrate the bridge back to a virtual switch:

     acli net.migrate_br_to_virtual_switch br0 vs_name=vs0
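Before and after the migration you can optionally verify the bond configuration and basic connectivity from the CVM. This is only a minimal sketch; the show_uplinks subcommand and the target addresses are assumptions and not part of the procedure above:

manage_ovs show_uplinks

ping -c 3 <AHV host IP>

ping -c 3 <gateway IP>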

Friday, 10 November 2023

Expanding a cluster with nodes from a different vendor / license class

If you run into problems expanding your cluster with nodes from a different vendor, be careful: this is not supported. Sometimes you have to do it anyway to replace all nodes of a cluster with new ones from a different vendor, so do this only for migration purposes and never run a mixed-vendor configuration in production environments!

(In our case, we replaced Nutanix NX nodes with HPE DX nodes.)

The expand may fail because the new vendor has a different license class. In that case you can edit the file /etc/nutanix/hardware_config.json on the nodes you want to add.

In the section

"hardware_attributes":

you will find the entry

"license_class": "software_only",

Remove this entry on all nodes you want to add and perform a genesis restart on those nodes. (You have to use sudo to edit the file!)
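A minimal sketch of that edit on one of the new nodes (the backup copy is only a suggestion, not a required step):

sudo cp /etc/nutanix/hardware_config.json /home/nutanix/hardware_config.json.bak

Then remove the "license_class" line under "hardware_attributes" with an editor of your choice and restart genesis:

sudo vi /etc/nutanix/hardware_config.json

genesis restart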

After expanding the cluster and removing the old nodes, put this entry back in the hardware_config.json and perform an allssh genesis restart on the cluster.

Then everything should be fine.


You may also get an error in the prechecks from test_cassandra_ssd_size_check, even if your SSD sizes are fine according to KB-8842.

In that case the /etc/nutanix/hcl.json file on the existing cluster probably doesn't contain the SSDs used in the new nodes. Check whether the hcl.json on the new nodes contains them; if it does, copy the hcl.json from one of the new nodes to all CVMs in the running cluster, perform allssh genesis restart on the cluster and try again.

(Copy the file to /tmp first and then use sudo mv to move it to /etc/nutanix/.)
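A minimal sketch of the copy, run from one of the new nodes (the CVM IPs are placeholders):

for cvm in <cvm-ip-1> <cvm-ip-2> <cvm-ip-3>; do scp /etc/nutanix/hcl.json nutanix@${cvm}:/tmp/hcl.json; done

Then move the file into place on each existing CVM and restart genesis on the whole cluster from any CVM:

sudo mv /tmp/hcl.json /etc/nutanix/hcl.json

allssh genesis restart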

Licensing of the "converted" cluster may also not work as expected, because the cluster still thinks it is on "license_class": "appliance".

You can check your downloaded .csf file for the license class of the nodes. If it is wrong, engage Nutanix Support; they will provide a script to check and set it:

python /home/nutanix/ncc/bin/license_config_zk_util.py --show=config

python /home/nutanix/ncc/bin/license_config_zk_util.py --convert_license_class --license_class=<software_only/appliance>

This is how you fix it in Prism Central versions prior to pc.2023.1.0.2.

Later versions can use ncli license update-license license-class=<software_only/appliance> instead.

After updating, wait about an hour, then try licensing from PC again (check the .csf file to see whether the nodes now show the correct license_class).

Sometimes after removing a node (or all nodes, when renewing the hardware) you can't reach the virtual cluster IP (the CVMs are reachable) or you can't resolve alerts. In this case the Prism leader did not move correctly to a new node. You can fix this by restarting Prism:

allssh genesis stop prism;cluster start
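If you want to see which CVM currently holds the Prism leader role before and after the restart, the following check can help (the port 2019 leader endpoint is an assumption based on common Nutanix troubleshooting steps, not part of the original note):

curl -s localhost:2019/prism/leader && echo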

Wednesday, 1 February 2023

Error installing SSL certificate on vCenter 7.0.3

Error occurred while fetching tls: String index out of range: -1 



If you get this error while replacing the certificate through the UI, try not to use the "Browse file" dialog. Just open the certificate in a text editor and copy/paste it into the required field.
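A quick way to sanity-check the certificate file before pasting it (the file name is just an example): openssl prints the subject and the validity dates, cat prints the PEM block you can copy into the field.

openssl x509 -in vcenter-cert.pem -noout -subject -dates

cat vcenter-cert.pem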



Wednesday, 18 January 2023

HPE SPP Installation not possible via LCM on HPE DX Hosts

If somebody accidentally updated firmware on an HPE DX node manually, it won't be possible to install firmware (SPP) on the node via LCM, because Nutanix expects specific SPP versions on the node. (In LCM, hover over the question mark and it shows you which versions are supported for update.)

You can fix this and make LCM believe you have such a version (but be careful: it should be an SPP version that reflects the firmware versions that are actually installed):

SSH to the affected host and then:

[root@host ~]# export TMP=/usr/local/tmpdir

[root@host ~]# ilorest --nologo login

[root@host ~]# ilorest --nologo select Bios.

(The trailing dot after "Bios." is important!)

[root@host ~]# ilorest --nologo get ServerOtherInfo

Now you can see which SPP version the node reports.

If something wrong is in there, clear the value first:

[root@host ~]# ilorest --nologo set ServerOtherInfo='' --commit

Then set it to the correct expected version, for example:

[root@host ~]# ilorest --nologo set ServerOtherInfo='2021.04.0.02' --commit

Check that everything is fine:

[root@host ~]# ilorest --nologo get ServerOtherInfo

[root@host ~]# ilorest --nologo logout

Now you can perform an LCM inventory and the SPP update should be possible.